· 8 min read
Ganesh Chand

Scala is a scalable, general-purpose programming language with a huge ecosystem of Scala and JVM libraries.

It lets you learn interactively and build prototypes with the Scala REPL, then evolve a prototype into Scala scripts that automate your tasks and can use the full compute power of a machine. As you gain more skills, you can write complex programs to solve real-world problems using powerful language-level features, and scale your application as demand grows.

In my previous blog, I gave a beginner-level introduction to scala-cli.

In this blog, let's write a Scala script to solve a real-world problem.

Problem Statement

Given a directory, find the largest Scala files.

Requirements

  • It must look for Scala files (both .sc and .scala) in all of its sub-folders.
  • It must count the number of lines in each file.
  • Code comments are not treated any differently.
  • It must display the top x files in descending order of line count, showing the line count and the relative path of each file, where x is 5 by default.

Solution

First, let's create a working directory

mkdir ~/my_automation
cd ~/my_automation
touch find_largest_files.sc

Next, open find_largest_files.sc in any of the supported code editors. I am using VS Code in this example, and here we open the file in VS Code from the terminal. You can also use the VS Code menu to open a file. See here if you need help getting started with VS Code.

code find_largest_files.sc 

Now, we are ready to code!

We are going to use Scala 3 and the com.lihaoyi/os-lib library to accomplish file system tasks such as listing files in a directory recursively and getting the current working directory.

Here's the first version of the script that aims to solve the problem satisfying all the requirements.

//> using scala 3
//> using lib "com.lihaoyi::os-lib::0.9.0"

import scala.util.Try
// args(0) fails with java.lang.ArrayIndexOutOfBoundsException if no argument was provided by the user
val inputDir: Try[String] = Try(args(0))
// Default input directory is current working directory where the script is being run at.
val directory: os.Path = inputDir.map(os.Path(_)).getOrElse(os.pwd)
// Default number of largest files to show is 5.
val topN: Int = Try(args(1).toInt).getOrElse(5)

println(s"Finding top $topN largest Scala files in $directory")
os
  .walk(directory)
  .filter(path => os.isFile(path) && (path.ext == "scala" || path.ext == "sc"))
  .map(path => (path, os.read.lines(path).size))
  .sortBy((path, lineCount) => lineCount) // short form: sortBy(_._2)
  .reverse
  .take(topN)
  .foreach { (path, lineCount) =>
    val relativeFilePath: String =
      directory.toNIO.toUri().relativize(path.toNIO.toUri()).getPath
    println(s"$lineCount $relativeFilePath")
  }

Let me explain the important parts of the above script.

  • Line 1 and Line 2 are scala-cli directives that allow us to declare dependencies, among many other things. With Scala CLI, you can provide configuration information using directives, a dedicated syntax that can be embedded in any .scala or .sc file.

    • Line 1: Use Scala 3 to compile and run the script.
    • Line 2: Use the com-lihaoyi/os-lib library's API to recursively list files in a given directory.
  • Lines 4 - 10: We accept user input with error handling and fall back to default values.

  • Lines 12 - 24: We list the files in the given directory and all of its sub-directories and keep those that have a .scala or .sc extension. We sort the files by line count and then call reverse to get descending order. Finally, we display the line count and relative file path of the first topN files.
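
The sort-and-take step described above can be tried on its own, without touching the file system. The sketch below is only an illustration and not part of the original script: the topFiles helper and the sample file names and line counts are made up for the example. It sorts by the negated count to get descending order directly, which is one possible alternative to sortBy followed by reverse.

```scala
// Hypothetical helper (not in the original script): keep the n entries
// with the highest line counts, largest first.
def topFiles(files: Seq[(String, Int)], n: Int): Seq[(String, Int)] =
  files.sortBy(-_._2).take(n)

// Made-up sample data standing in for (relative path, line count) pairs.
val sample = Seq("Small.sc" -> 15, "Large.scala" -> 300, "Medium.scala" -> 120)

// Print in the same "lineCount path" layout the script uses.
topFiles(sample, 2).foreach((path, lineCount) => println(s"$lineCount $path"))
```

With this sample data, the two entries printed are Large.scala (300 lines) followed by Medium.scala (120 lines).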

That's it. Just like that, your Scala script is ready!

One of the most popular projects written in Scala is Delta Lake. I forked the Delta Lake GitHub repo and cloned it to my local machine.

Let's find the top 5 largest Scala files in the Delta Lake source code.

scala-cli find_largest_files.sc -- /Users/Shared/repos/opensource/delta

Finding top 5 largest Scala files in /Users/Shared/repos/opensource/delta
38153 benchmarks/src/main/scala/benchmark/TPCDSBenchmarkQueries.scala
5305 core/src/test/scala/org/apache/spark/sql/delta/MergeIntoSuiteBase.scala
3060 core/src/test/scala/org/apache/spark/sql/delta/DeltaSuite.scala
2996 core/src/main/scala/org/apache/spark/sql/delta/DeltaErrors.scala
2785 core/src/test/scala/org/apache/spark/sql/delta/DeltaErrorsSuite.scala

Now let's find the top 10 largest Scala files by passing 10 as the second argument.

$ scala-cli find_largest_files.sc -- /Users/Shared/repos/opensource/delta 10
Finding top 10 largest Scala files in /Users/Shared/repos/opensource/delta
38153 benchmarks/src/main/scala/benchmark/TPCDSBenchmarkQueries.scala
5305 core/src/test/scala/org/apache/spark/sql/delta/MergeIntoSuiteBase.scala
3060 core/src/test/scala/org/apache/spark/sql/delta/DeltaSuite.scala
2996 core/src/main/scala/org/apache/spark/sql/delta/DeltaErrors.scala
2785 core/src/test/scala/org/apache/spark/sql/delta/DeltaErrorsSuite.scala
2335 core/src/test/scala/org/apache/spark/sql/delta/DeltaTableCreationTests.scala
2218 core/src/test/scala/org/apache/spark/sql/delta/DeltaSourceSuite.scala
1789 core/src/test/scala/org/apache/spark/sql/delta/stats/DataSkippingDeltaTests.scala
1764 core/src/test/scala/org/apache/spark/sql/delta/GeneratedColumnSuite.scala
1738 core/src/test/scala/org/apache/spark/sql/delta/schema/SchemaUtilsSuite.scala

You are happy with your script and you'd like to share it with the rest of the Scala community and beyond. Wait, not quite yet! The tool you built only works for Scala files. To make it more useful, you want to support other file extensions too.

Let's work on the requirements again.

Revised Requirements

  • It must look for Scala files (both .sc and .scala) in all of its sub-folders.
  • It must allow users to provide any valid file extension and look for matching files in all sub-folders. If a user doesn't provide one, it defaults to Scala files (both .sc and .scala).
  • It must count the number of lines in each file.
  • Code comments are not treated any differently.
  • It must display the top x files in descending order of line count, showing the line count and the relative path of each file, where x is 5 by default.

Revised Solution

//> using scala 3
//> using lib "com.lihaoyi::os-lib::0.9.0"

import scala.util.Try
// args(0) fails with java.lang.ArrayIndexOutOfBoundsException if no argument was provided by the user
val inputDir: Try[String] = Try(args(0))
// Default input directory is current working directory where the script is being run at.
val directory: os.Path = inputDir.map(os.Path(_)).getOrElse(os.pwd)
// Default number of largest files to show is 5.
val topN: Int = Try(args(1).toInt).getOrElse(5)
// Default file type is scala
val fileType: String = Try(args(2)).getOrElse("scala").toLowerCase()

def isFileOfType(path: os.Path, fileType: String): Boolean = {
  val ext: String = path.ext.toLowerCase()
  os.isFile(path) && (fileType match {
    case "scala" => ext == "scala" || ext == "sc"
    case "java" => ext == "java"
    case "python" => ext == "py"
    case "sql" => ext == "sql"
    case "text" => ext == "txt"
    case "json" => ext == "json"
    case "xml" => ext == "xml"
    case "yaml" => ext == "yaml" || ext == "yml"
    case "markdown" => ext == "md"
    case "html" => ext == "html"
    case "css" => ext == "css"
    case "javascript" => ext == "js"
    case "typescript" => ext == "ts"
    case "shell" => ext == "sh"
    case _ => ext == fileType.stripPrefix(".") // path.ext has no leading dot, so accept both "rs" and ".rs"
  })

}

println(s"Finding top $topN largest ${fileType.toUpperCase()} files in $directory")

os
  .walk(directory)
  .filter(path => isFileOfType(path, fileType))
  .map(path => (path, os.read.lines(path).size))
  .sortBy((path, lineCount) => lineCount) // short form: sortBy(_._2)
  .reverse
  .take(topN)
  .foreach { (path, lineCount) =>
    val relativeFilePath: String =
      directory.toNIO.toUri().relativize(path.toNIO.toUri()).getPath
    println(s"$lineCount $relativeFilePath")
  }

Let's review the changes.

  • As you can see above, all we had to do was generalize the .filter() on line 40 to search for files with the extension type provided by the user.
  • We wrote a new function isFileOfType() which does the actual work.
  • We also parameterized fileType in the print statement on line 36.
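
As the list of supported types grows, the mapping inside isFileOfType() could also be kept as data. The sketch below is one possible variation, not the author's implementation: extensionsByType and matchesType are hypothetical names, only three file types are listed, and it works on plain extension strings so it can be tried without os-lib.

```scala
// Hypothetical alternative to the match expression: a lookup table from a
// file-type name to the extensions it covers (only a few entries shown).
val extensionsByType: Map[String, Set[String]] = Map(
  "scala"  -> Set("scala", "sc"),
  "python" -> Set("py"),
  "yaml"   -> Set("yaml", "yml")
)

// Unknown types fall back to treating the input itself as the extension,
// with any leading dot removed.
def matchesType(ext: String, fileType: String): Boolean = {
  val wanted = fileType.toLowerCase
  extensionsByType
    .getOrElse(wanted, Set(wanted.stripPrefix(".")))
    .contains(ext.toLowerCase)
}

println(matchesType("sc", "scala")) // a known type with two extensions
println(matchesType("rs", ".rs"))   // an unlisted type given with a leading dot
println(matchesType("py", "java"))  // an extension that does not match the type
```

Adding a new file type then means adding one entry to the map rather than another case clause.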

The rest of the code is the same as before. Let's run the script.

Find Top 5 Java files in Delta Lake project source code.

scala-cli find_largest_files.sc -- /Users/Shared/repos/opensource/delta 5 java

Finding top 5 largest JAVA files in /Users/Shared/repos/opensource/delta
413 storage-s3-dynamodb/src/main/java/io/delta/storage/BaseExternalLogStore.java
364 storage/src/main/java/io/delta/storage/S3SingleDriverLogStore.java
345 storage-s3-dynamodb/src/main/java/io/delta/storage/S3DynamoDBLogStore.java
215 core/src/test/java/io/delta/tables/JavaDeltaTableBuilderSuite.java
194 storage/src/main/java/io/delta/storage/HDFSLogStore.java

Find Top 5 Python files in Delta Lake project source code.

$ scala-cli find_largest_files.sc -- /Users/Shared/repos/opensource/delta 5 python
Finding top 5 largest PYTHON files in /Users/Shared/repos/opensource/delta
1369 python/delta/tables.py
1220 python/delta/tests/test_deltatable.py
806 examples/tutorials/saiseu19/SAISEu19 - Delta Lake Python Tutorial.py
454 benchmarks/scripts/benchmarks.py
411 run-integration-tests.py
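
Because the script reads its arguments by position, the file type is only picked up when it is the third argument. The sketch below isolates that behaviour so it can be checked without running the full script. parseOptions is a hypothetical helper that mirrors the two Try expressions from the script, and /some/dir is a placeholder path.

```scala
import scala.util.Try

// Hypothetical helper mirroring how the script reads its optional
// second (count) and third (file type) arguments by position.
def parseOptions(args: Array[String]): (Int, String) = {
  val topN = Try(args(1).toInt).getOrElse(5)
  val fileType = Try(args(2)).getOrElse("scala").toLowerCase
  (topN, fileType)
}

println(parseOptions(Array("/some/dir", "5", "python"))) // (5,python)
println(parseOptions(Array("/some/dir", "python")))      // (5,scala)
println(parseOptions(Array("/some/dir")))                // (5,scala)
```

In the second call, "python" sits in the count position, fails the toInt conversion and is silently replaced by the default, so the count has to be passed explicitly whenever a file type is given.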

Publish

Now you are pretty happy with the tool you have built, and you want to share it with the rest of the world. You can, of course, commit your code to a GitHub repo and announce it publicly. However, creating a GitHub repo for one script is probably not a good idea. 😃

The best way to share scala-cli scripts is via a GitHub gist.

  • Step 1 - Create GitHub gist

      gh gist create find_largest_files.sc
  • Step 2 - Share the gist URL. Anyone with scala-cli installed on their machine can now run your script using the GitHub gist URL as shown below:

      scala-cli <YOUR_GIST_URL> -- <SCRIPT PARAMETERS>

I have already published this script as a GitHub gist.

Closing Thoughts

Scala is a truly general-purpose and scalable programming language. You can experiment quickly using the Scala REPL, then build a prototype as a scala-cli script and share it with your team members. As you add more features to your prototype, you may need to organize your scripts into modules and write unit tests. scala-cli is ideal for single-module projects, where you don't need a build tool at all.

However, as your code base gets more complex, you might want to customize the build and tests. At that point, you would want to use a full build tool such as sbt or Mill.

If you have any feedback or comments for me on this blog, please feel free to reach out to me on LinkedIn or Twitter. The script is published here; please feel free to clone it or leave comments or questions.

· 10 min read
Ganesh Chand

For a programming language to be widely adopted, it must be beginner friendly.

Every master was once a beginner. Every pro was once an amateur. -- Robin Sharma

As a beginner to the programming world, I am looking for a general-purpose programming language that:

  • Requires minimal setup and gets me going
  • Is intuitive and easy to learn
  • Helps me avoid making obvious mistakes
  • Guides me with helpful and easy to understand error messages
  • Helps me build simple apps for fun without having to learn any build tools
  • Has welcoming and helpful user community
  • Has tons of free and paid resources
  • Provides me job opportunity

I think Scala checks most of those boxes, but is it really beginner friendly? The language itself is easy to learn, but the prerequisites for setting up a learning and prototyping environment can frustrate beginners.

Common challenges for the beginners

Scala is primarily a JVM language. This means that, as a beginner, I also need to learn how to install a Java Development Kit on my machine and, oh, by the way, it should be compatible with the Scala version I am using. OMG, there are so many JDKs available! Which flavor of JDK should I use?

The official Scala getting started guide recommends installing Scala using the installer tool coursier and using the sbt build tool to compile, run and test your program. For a beginner, that can mean spending an hour or more setting up a learning environment. You are also forced to learn a build tool, which beginners don't really need.

The language should make it easy for beginners to get started with minimal steps, and help them stay engaged and motivated to learn more and grow their skills by applying what they learn to interesting problems.

Enter scala-cli

Thanks to the Scala Center and VirtusLab, we finally have the scala-cli command-line tool, which makes it easy to start programming in Scala in minutes. As a beginner, all you need to do is install scala-cli and you are ready to go.

  • It automatically installs Java and Scala and additional tools so you can just focus on learning the core language features.
  • It provides you both the Scala and Ammonite REPLs for interactive learning and rapid prototyping.
  • It provides first-class support for scripting.
  • It provides many other commands, but you don't have to learn any of them to get started.

I would strongly recommend reading the scala-cli documentation if you would like to explore all commands and features in detail. My top 5 reasons why I love scala-cli and use it every day are:

  1. I am more productive: I don't have to worry about managing multiple Java and Scala versions on my computer. I can specify which Scala version and Java version to use either in my script or as command argument and let scala-cli figure out how to run it with the correct version. This also allows me to focus on the task at hand and quickly test my Scala code against different Scala and Java versions.

  2. I can easily share my Scala code/scripts with the community: GitHub Gists provide a simple way to share code snippets with others. Every gist is a Git repository, which means that it can be forked and cloned. scala-cli allows you to run Scala code directly using the gist URL. This greatly simplifies how we report bugs and share code that reproduces them, especially in the open-source community.

  3. I can debug and build prototypes interactively and very quickly: I can quickly switch back and forth between the REPL and the code editor. I use REPL to explore the APIs and prototype my functions and expressions. Once I am happy with the prototype, I copy the code, get out of the REPL and add them to my scripts in the editor.

  4. I can use my favorite IDEs: IDEs help with auto-completion, code navigation and code refactoring. I use both IntelliJ and VSCode.

  5. I can create JARs and publish them locally and/or to an artifact repository: Without any knowledge of a build tool, I can convert my Scala script into an open-source Scala library by publishing it to Maven.
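
The first reason, pinning versions from inside a script, can be sketched with two directives. This is a minimal illustration that assumes scala-cli's using scala and using jvm directives. The version numbers are only examples, and the file name versions.sc in the note below is hypothetical.

```scala
//> using scala 3.2.2
//> using jvm 17

// Print the Java version the script is actually running on, to confirm
// that the requested JVM was picked up.
val javaVersion: String = System.getProperty("java.version")
println(s"Running on Java $javaVersion")
```

Saving this as versions.sc and running scala-cli versions.sc should let scala-cli fetch the requested versions if they are not already available locally. The same choices can also be made on the command line with the --scala and --jvm options instead of directives.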

Getting Started with scala-cli

Now, let us see how the Scala programming language, the scala-cli tool, the REPL and scripting all come together.

First, install scala-cli on your machine.

# one command and that's all!
brew install Virtuslab/scala-cli/scala-cli

Next, verify that the installation was successful.

echo 'println("Hello")' | scala-cli -

Next, let's use the Scala REPL to quickly experiment with our hello world program.

$ scala-cli console
Welcome to Scala 3.2.2 (11.0.18, Java OpenJDK 64-Bit Server VM).
Type in expressions for evaluation. Or try :help.


scala> println("Hello World, let us live in peace and harmony")
Hello World, let us live in peace and harmony

scala> :quit

Next, let's see how we can easily turn our above experiment into a Scala script that we can run from the command line.

Scala as a scripting language

Scala has a set of convenient constructs that help you get started quickly and let you program in a pleasantly concise style. Its expressiveness and conciseness, combined with built-in type safety, make it an ideal scripting language.

The name Scala stands for "scalable language." The language is so named because it was designed to grow with the demands of its users. You can apply Scala to a wide range of programming tasks, from writing small scripts to building large systems.

-Martin Odersky; Lex Spoon; Bill Venners; and Frank Sommers. Programming in Scala, Fifth Edition

Let's create a directory where we will create and edit our scripts. This is not strictly required, but it's always better to stay organized.

mkdir ~/learn_scala
cd ~/learn_scala

Now, we are in our ~/learn_scala directory. Let's create a hello world script.

# create a script that prints my message to the world
echo 'println("Hello World, let us live live in peace and harmony")' >> greetings.sc

# display the content of greetings.sc file
cat greetings.sc

println("Hello World, let us live live in peace and harmony")

Notice the above script looks exactly like a Python script. No ceremony is required.

Next, let's run this script.

scala-cli greetings.sc 

Tip: When you run a scala-cli script for the first time, it checks your local environment and downloads and installs the required libraries for you, so it might take a couple of seconds. Subsequent executions should be almost instantaneous!

scala-cli provides a run command to execute your program. But it's optional.

scala-cli run greetings.sc is the same as scala-cli greetings.sc.

Next, let's make this script executable, i.e. I should be able to run the script simply by running the shell command ./greetings.sc

First, add #! /usr/bin/env scala-cli to the top of the script. If you have scripted in Python, you have probably seen #!/usr/bin/env python in your scripts.

The script should look as shown below:

cat greetings.sc

#! /usr/bin/env scala-cli
println("Hello World, let us live live in peace and harmony")

Next, let's make this script executable by changing the file permission.

chmod +x greetings.sc

Now, you can run the script as ./greetings.sc. You might ask: why do I need ./ at the beginning, and why can't I simply run it as greetings.sc? As answered here, dot-slash is a safety mechanism to indicate that the program being executed is a user-created command located in the current directory.

However, if you really want to execute the script by simply typing greetings.sc, you can do so by adding your script directory to the $PATH environment variable as shown below:

export PATH="$HOME/learn_scala:$PATH"
greetings.sc

Hello World, let us live live in peace and harmony

You can take it even one step further by creating a command alias so you don't have to type the file extension. You can name it whatever you want as long as it doesn't conflict with existing commands.

alias hello="greetings.sc"
hello

Hello World, let us live live in peace and harmony

As I said earlier, you can share your script as a GitHub gist, and anyone with scala-cli installed on their local machine can run it using the gist URL. Try running the command below:

scala-cli https://gist.github.com/ganeshchand/d2fbb4c03238329e0cd4b0019f373d25

Bonus content

The above script always prints "Hello World, let us live live in peace and harmony" no matter who runs it or when it runs. Real world scripts are parameterized to make them dynamic.

In our example, we should allow the end-users of our program to print their own message to the world.

With scala-cli scripts, command line arguments are accessed through the special args variable which is an array of strings. Users can provide space delimited strings as program arguments.
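
How the shell splits the input matters here: a quoted message arrives as a single element of the array, while unquoted words arrive as separate elements. The sketch below illustrates this with plain arrays standing in for args. messageFrom is a hypothetical helper and "hi" is a placeholder default.

```scala
// Hypothetical helper: pick the first argument as the message, or fall
// back to a default when no argument was given.
def messageFrom(arguments: Array[String], default: String): String =
  arguments.headOption.getOrElse(default)

// Quoted input arrives as one element; unquoted words arrive separately.
println(messageFrom(Array("Scala is fun"), "hi"))       // Scala is fun
println(messageFrom(Array("Scala", "is", "fun"), "hi")) // Scala
println(messageFrom(Array.empty[String], "hi"))         // hi
```

This is why a multi-word message should be wrapped in quotes on the command line.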

Let's create a new script, custom_greetings.sc, that is parameterized on the message to be printed. Also, let's customize the greeting based on the time of day the script is run.

// content of the script custom_greetings.sc

// greet with good morning or good afternoon or good evening based on the time of the day
val hour = java.time.LocalTime.now.getHour
val greeting = hour match {
  case h if h < 12 => "Good Morning"
  case h if h < 16 => "Good Afternoon"
  case h if h < 20 => "Good Evening"
  case _ => "Good Night"
}

val defaultMessage = "let us live live in peace and harmony"
val message = if (args.isEmpty) defaultMessage else args(0)
println(s"$greeting World, $message!")

Run this script from the command line:

scala-cli custom_greetings.sc -- "Scala is fun"    
Good Afternoon World, Scala is fun!
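
The greeting printed depends on the hour at which the script runs, which makes the thresholds hard to check by hand. One way to explore them, sketched here as an illustration rather than as part of the original script, is to move the match into a function that takes the hour as a parameter. greetingFor is a hypothetical name.

```scala
// Hypothetical refactoring: the same thresholds as the script, with the
// hour passed in so the result does not depend on when the code runs.
def greetingFor(hour: Int): String = hour match {
  case h if h < 12 => "Good Morning"
  case h if h < 16 => "Good Afternoon"
  case h if h < 20 => "Good Evening"
  case _           => "Good Night"
}

// Try the hours around each boundary.
Seq(11, 12, 16, 20).foreach(h => println(s"$h -> ${greetingFor(h)}"))
```

Hour 11 gives "Good Morning", while 12, 16 and 20 are each the first hour of the next greeting. The script itself could then call greetingFor(java.time.LocalTime.now.getHour).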

Next, I am super excited about this cool command-line tool I built, and I would love to share it with the larger community.

There are a few ways to do it:

  1. Publish and share your script as a GitHub gist. Now, anyone can run your script by using the gist URL as shown below:
    scala-cli https://gist.github.com/ganeshchand/fa703cc5459aa92dd07210ea6d549765 -- "YOUR_MESSAGE"
  2. Publish your script in a public Git repo, so others can clone or download the script to their local machine and run it as shown below:
    scala-cli <LOCAL_PATH_TO_THE_SCRIPT> -- "YOUR_MESSAGE"
  3. Package your script as an executable JAR or a Scala library. You'd typically do this when you are building a real-world application or a utility library that requires you to organize your scripts as modules and provide a single entry point. As a beginner, you are not required to know how to package and distribute your code. If you are comfortable using this feature, read here.

Closing Thoughts

scala-cli genuinely aims to simplify the developer experience for Scala users. As a beginner, you can now focus on learning the core language constructs and programming concepts and on solving real-world problems, without the distraction of build tools and environment setup. I am sure the tool will mature even more over time. In the next blog in the #scala-cli series, I will talk about using external libraries in your scripts and write a script to solve a real-world problem. Stay tuned!

If you have any feedback or comments for me on this blog, please feel free to reach out to me on LinkedIn or Twitter.

· 2 min read
Ganesh Chand

Organizations across the globe are going through digital transformations, migrating to the cloud, modernizing their stacks, embracing a data-driven culture, and automating left, right, and centre.

This definitely sounds exciting and clearly presents opportunities for all of us. It also means we must embrace change, quickly! A big part of embracing change is learning and acquiring new skills as the market demands, and being able to let go of past knowledge and skills that are no longer relevant.

The Only Constant is Change - The Greek philosopher Heraclitus

This continuous cycle of learning and unlearning requires discipline, motivation and a growth mindset. But having the right skill set is not enough in today's collaborative work environment. Communication, no matter your job function, is the foundational skill and a secret sauce to your success.

I will be honest. I struggle at writing and I am not very good at speaking. This is precisely why I decided to start blogging. ganeshchand.com is going to be my space to document my learnings and lessons and share them with the world. Admittedly, I bought this domain 6 years ago, and after years of procrastination I am so glad I am finally going to make use of it. You'll find me writing mostly on the following topics:

  • Functional Programming - Scala
  • Lakehouse architecture
  • Apache Spark and Delta Lake
  • Data & ML Engineering
  • Technical White Paper summary
  • Summary of books I've read
  • Travel and life lessons

I really hope you all will find the content useful. If you wish to connect with me, please reach out to me on LinkedIn or Twitter. Wish me good luck!

Cheers to the growth mindset and willingness to start all over again!