· 8 min read
Ganesh Chand

Scala is a scalable, general-purpose programming language with a huge ecosystem of Scala and JVM libraries.

It is designed to let you learn interactively and build a prototype with the Scala REPL, then evolve that prototype into Scala scripts that automate your tasks and can use the full compute power of a machine. As you gain more skills, you can write complex programs to solve real-world problems using powerful language-level features, and scale your application as demand grows.

In my previous blog post, I gave a beginner-level introduction to scala-cli.

In this post, let's write a Scala script to solve a real-world problem.

Problem Statement

Given a directory, find the largest Scala files.

Requirements

  • It must look for Scala files (both .sc and .scala) in the directory and all of its sub-folders.
  • It must count the number of lines in each file.
  • Code comments are not treated any differently; they count as lines.
  • It must display the top x files in descending order of line count, showing the line count and the relative path of each file, where x is 5 by default.

Solution

First, let's create a working directory:

mkdir ~/my_automation
cd ~/my_automation
touch find_largest_files.sc

Next, open find_largest_files.sc in any of the supported code editors. I am using VS Code in this example and opening the file from the terminal. You can also use the VS Code menu to open a file. See here if you need help getting started with VS Code.

code find_largest_files.sc 

Now, we are ready to code!

We are going to use Scala 3 and the com.lihaoyi/os-lib library for file system tasks such as listing the files in a directory recursively and getting the current working directory.

Here's the first version of the script, which aims to satisfy all the requirements.

//> using scala 3
//> using lib "com.lihaoyi::os-lib::0.9.0"

import scala.util.Try
// args(0) fails with java.lang.ArrayIndexOutOfBoundsException if no argument was provided by the user
val inputDir: Try[String] = Try(args(0))
// Default input directory is current working directory where the script is being run at.
val directory: os.Path = inputDir.map(os.Path(_)).getOrElse(os.pwd)
// Default number of largest files to show is 5.
val topN: Int = Try(args(1).toInt).getOrElse(5)

println(s"Finding top $topN largest Scala files in $directory")
os
  .walk(directory)
  .filter(path => os.isFile(path) && (path.ext == "scala" || path.ext == "sc"))
  .map(path => (path, os.read.lines(path).size))
  .sortBy((path, lineCount) => lineCount) // short form: sortBy(_._2)
  .reverse
  .take(topN)
  .foreach { (path, lineCount) =>
    val relativeFilePath: String =
      directory.toNIO.toUri().relativize(path.toNIO.toUri()).getPath
    println(s"$lineCount $relativeFilePath")
  }

Let me explain the important parts of the above script.

  • The first two lines are scala-cli directives, which let us declare dependencies, among many other things. With Scala CLI, you can provide configuration information using directives: a dedicated syntax that can be embedded in any .scala or .sc file.

    • The //> using scala directive: use Scala 3 to compile and run the script.
    • The //> using lib directive: use the com-lihaoyi/os-lib library's API to recursively list files in a given directory.
  • The three val definitions (inputDir, directory and topN) accept user input with error handling and fall back to default values.

  • The os.walk pipeline lists the files in the given directory and all of its sub-directories and keeps only those with a .scala or .sc extension. We sort the files by line count with sortBy, which gives ascending order, and then flip them into descending order with reverse. Finally, we print the line count and relative file path of the first topN files only.
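
The argument defaults and the sort-and-take step described above can also be sketched as small pure functions, which makes them easy to try in the REPL without touching the file system. This is a minimal sketch, not part of the script itself: ScriptSketch, parseArgs and topFiles are hypothetical names, lift stands in for the Try(args(i)) pattern, and the sort negates the line count instead of calling reverse.

```scala
import scala.util.Try

// Hypothetical helpers for illustration; they are not part of the script.
object ScriptSketch {

  // Mirrors the script's defaults: an optional directory and a count that falls back to 5.
  def parseArgs(args: Array[String]): (Option[String], Int) = {
    // lift returns None instead of throwing ArrayIndexOutOfBoundsException
    val inputDir = args.lift(0)
    // A missing or non-numeric second argument falls back to the default of 5
    val topN = args.lift(1).flatMap(raw => Try(raw.toInt).toOption).getOrElse(5)
    (inputDir, topN)
  }

  // Ranks (relative path, line count) pairs, largest first, and keeps the first topN.
  def topFiles(lineCounts: Seq[(String, Int)], topN: Int): Seq[(String, Int)] =
    lineCounts.sortBy { case (_, lineCount) => -lineCount }.take(topN)
}
```

Calling ScriptSketch.parseArgs(Array("/tmp/repo")) yields a count of 5, and so does a non-numeric second argument. One small difference from sortBy followed by reverse: because sortBy is stable, negating the count keeps files with equal line counts in their original order, whereas reverse flips them.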

That's it. Just like that, your Scala script is ready!

One of the most popular projects written in Scala is Delta Lake. I forked the Delta Lake GitHub repo and cloned it onto my local machine.

Let's find the top 5 largest Scala files in the Delta Lake source code.

scala-cli find_largest_files.sc -- /Users/Shared/repos/opensource/delta

Finding top 5 largest Scala files in /Users/Shared/repos/opensource/delta
38153 benchmarks/src/main/scala/benchmark/TPCDSBenchmarkQueries.scala
5305 core/src/test/scala/org/apache/spark/sql/delta/MergeIntoSuiteBase.scala
3060 core/src/test/scala/org/apache/spark/sql/delta/DeltaSuite.scala
2996 core/src/main/scala/org/apache/spark/sql/delta/DeltaErrors.scala
2785 core/src/test/scala/org/apache/spark/sql/delta/DeltaErrorsSuite.scala

Now let's find the top 10 largest Scala files in the Delta Lake source code.

$ scala-cli find_largest_files.sc -- /Users/Shared/repos/opensource/delta 10
Finding top 10 largest Scala files in /Users/Shared/repos/opensource/delta
38153 benchmarks/src/main/scala/benchmark/TPCDSBenchmarkQueries.scala
5305 core/src/test/scala/org/apache/spark/sql/delta/MergeIntoSuiteBase.scala
3060 core/src/test/scala/org/apache/spark/sql/delta/DeltaSuite.scala
2996 core/src/main/scala/org/apache/spark/sql/delta/DeltaErrors.scala
2785 core/src/test/scala/org/apache/spark/sql/delta/DeltaErrorsSuite.scala
2335 core/src/test/scala/org/apache/spark/sql/delta/DeltaTableCreationTests.scala
2218 core/src/test/scala/org/apache/spark/sql/delta/DeltaSourceSuite.scala
1789 core/src/test/scala/org/apache/spark/sql/delta/stats/DataSkippingDeltaTests.scala
1764 core/src/test/scala/org/apache/spark/sql/delta/GeneratedColumnSuite.scala
1738 core/src/test/scala/org/apache/spark/sql/delta/schema/SchemaUtilsSuite.scala

You are happy with your script and you'd like to share it with the rest of the Scala community and beyond. Wait, not quite yet! The tool you built only works for Scala files. To make it more useful, you want to support other file extensions too.

Let's work on the requirements again.

Revised Requirements

  • It must look for Scala files (both .sc and .scala) in all of its sub-folders.
  • It must allow users to provide any valid file extension and look for matching files in all sub-folders. If a user doesn't provide one, it defaults to Scala files (both .sc and .scala).
  • It must count the number of lines in each file.
  • Code comments are not treated any differently; they count as lines.
  • It must display the top x files in descending order of line count, showing the line count and the relative path of each file, where x is 5 by default.

Revised Solution

//> using scala 3
//> using lib "com.lihaoyi::os-lib::0.9.0"

import scala.util.Try
// args(0) fails with java.lang.ArrayIndexOutOfBoundsException if no argument was provided by the user
val inputDir: Try[String] = Try(args(0))
// Default input directory is current working directory where the script is being run at.
val directory: os.Path = inputDir.map(os.Path(_)).getOrElse(os.pwd)
// Default number of largest files to show is 5.
val topN: Int = Try(args(1).toInt).getOrElse(5)
// Default file type is scala
val fileType: String = Try(args(2)).getOrElse("scala").toLowerCase()

def isFileOfType(path: os.Path, fileType: String): Boolean = {
  // path.ext never includes the leading dot, e.g. "scala" for Foo.scala
  val ext: String = path.ext.toLowerCase()
  os.isFile(path) && (fileType match {
    case "scala" => ext == "scala" || ext == "sc"
    case "java" => ext == "java"
    case "python" => ext == "py"
    case "sql" => ext == "sql"
    case "text" => ext == "txt"
    case "json" => ext == "json"
    case "xml" => ext == "xml"
    case "yaml" => ext == "yaml" || ext == "yml"
    case "markdown" => ext == "md"
    case "html" => ext == "html"
    case "css" => ext == "css"
    case "javascript" => ext == "js"
    case "typescript" => ext == "ts"
    case "shell" => ext == "sh"
    // Any other value is treated as a literal extension; accept both "rs" and ".rs"
    case other => ext == other.stripPrefix(".")
  })
}

println(s"Finding top $topN largest ${fileType.toUpperCase()} files in $directory")

os
  .walk(directory)
  .filter(path => isFileOfType(path, fileType))
  .map(path => (path, os.read.lines(path).size))
  .sortBy((path, lineCount) => lineCount) // short form: sortBy(_._2)
  .reverse
  .take(topN)
  .foreach { (path, lineCount) =>
    val relativeFilePath: String =
      directory.toNIO.toUri().relativize(path.toNIO.toUri()).getPath
    println(s"$lineCount $relativeFilePath")
  }

Let's review the changes.

  • As you can see above, all we had to do was generalize the .filter() call in the os.walk pipeline so that it searches for files with the extension type provided by the user.
  • We wrote a new function, isFileOfType(), which does the actual work.
  • We also parameterized fileType in the println statement that announces the search.
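
As the list of supported types grows, the match expression could be replaced by a lookup table. The following is a minimal sketch of that idea, not the published script: FileTypes, knownExtensions, extensionsFor and matchesType are hypothetical names, and only a few types are listed. It works on the bare extension string that os-lib's path.ext returns (no leading dot), so it needs no dependency to try out.

```scala
// Hypothetical table-driven alternative to the match expression.
object FileTypes {

  // Friendly file-type names mapped to the extensions they cover (partial list).
  val knownExtensions: Map[String, Set[String]] = Map(
    "scala" -> Set("scala", "sc"),
    "python" -> Set("py"),
    "yaml" -> Set("yaml", "yml"),
    "markdown" -> Set("md")
  )

  // Resolves user input to the set of extensions to search for.
  // Unknown names are treated as a literal extension, with any leading dot removed.
  def extensionsFor(fileType: String): Set[String] = {
    val normalized = fileType.trim.toLowerCase.stripPrefix(".")
    knownExtensions.getOrElse(normalized, Set(normalized))
  }

  // Checks a bare extension (no dot) against the requested file type, ignoring case.
  def matchesType(ext: String, fileType: String): Boolean =
    extensionsFor(fileType).contains(ext.toLowerCase)
}
```

With this shape, supporting a new type is a one-line change to the map, and the filter in the script would only need to call something like FileTypes.matchesType(path.ext, fileType) alongside os.isFile(path).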

The rest of the code is the same as before. Let's run the script.

Find the top 5 Java files in the Delta Lake project source code.

scala-cli find_largest_files.sc -- /Users/Shared/repos/opensource/delta 5 java

Finding top 5 largest JAVA files in /Users/Shared/repos/opensource/delta
413 storage-s3-dynamodb/src/main/java/io/delta/storage/BaseExternalLogStore.java
364 storage/src/main/java/io/delta/storage/S3SingleDriverLogStore.java
345 storage-s3-dynamodb/src/main/java/io/delta/storage/S3DynamoDBLogStore.java
215 core/src/test/java/io/delta/tables/JavaDeltaTableBuilderSuite.java
194 storage/src/main/java/io/delta/storage/HDFSLogStore.java

Find the top 5 Python files in the Delta Lake project source code. The arguments are positional, so the count has to be passed before the file type.

$ scala-cli find_largest_files.sc -- /Users/Shared/repos/opensource/delta 5 python
Finding top 5 largest PYTHON files in /Users/Shared/repos/opensource/delta
1369 python/delta/tables.py
1220 python/delta/tests/test_deltatable.py
806 examples/tutorials/saiseu19/SAISEu19 - Delta Lake Python Tutorial.py
454 benchmarks/scripts/benchmarks.py
411 run-integration-tests.py

Publish

Now you are pretty happy with the tool you have built and you want to share it with the rest of the world. You can of course commit your code to a GitHub repo and announce it publicly. However, creating a GitHub repo for one script is probably not a good idea. 😃

The best way to share scala-cli scripts is via a GitHub gist.

  • Step 1 - Create GitHub gist

      gh gist create find_largest_files.sc
  • Step 2 - Share the gist URL. Anyone with scala-cli installed on their machine can now run your script using the GitHub gist URL, as shown below:

      scala-cli <YOUR_GIST_URL> -- <SCRIPT PARAMETERS>

I have already published this script as a GitHub gist.

Closing Thoughts

Scala is a truly general-purpose and scalable programming language. You can experiment quickly using the Scala REPL, then build a prototype as a scala-cli script and share it with your team members. As you add more features to your prototype, you may need to organize your scripts into modules and write unit tests. scala-cli is ideal for a single-module project, and you don't need a build tool at all.

However, as your code base gets more complex, you might want to customize the build and tests. At that point, you will want a proper build tool such as sbt or Mill.

If you have any feedback or comments on this post, please feel free to reach out to me on LinkedIn or Twitter. The script is published here; please feel free to clone it or leave comments or questions.