Introduction

I wanted to write a brief tutorial on how to get set up building Spark apps with Scala in a local environment, because I have a couple of posts I want to write that depend on this setup.  My hope is that this will be straightforward, and that by the end anyone will be able to build and run all of the Spark examples at https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples

Outline

The process that we are going to follow is below:

  1. Install Scala Build Tool (SBT)
  2. Install Scala IDE (IntelliJ)
    1. Install Scala Plugin for IntelliJ
  3. Build and Run the Pi Example
  4. Build and Run the Spark Examples

Warning:  This tutorial was originally written in November 2016.  The rate of change of both Spark and IntelliJ may make these instructions obsolete sooner rather than later.  I provide links to sources wherever possible.  Post a comment if you do find something obsolete and I will attempt to fix it.

Install Scala Build Tool

The instructions for installing SBT 0.13.5 are found here.  Unfortunately the link uses an explicit version rather than 'latest', so you may want to click around until you find the most recent instructions.

The Mac and Ubuntu instructions are straightforward.

# Mac install
# If you do not already have Homebrew, install it first:
# /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew install sbt

# Ubuntu install
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
sudo apt-get update
sudo apt-get install sbt

Install Scala IDE

Strictly speaking, you do not need an IDE for Scala, but I highly recommend one.  If you are new to Scala and have built Spark applications in Python or R, you will likely get caught out by Scala being statically typed.  An IDE like IntelliJ will make the types explicit for you and catch mistakes as you make them.

Scala IDE Options

There are two options worth considering, per Quora and Reddit:

  1. Eclipse with Scala Plugin: http://scala-ide.org/
  2. IntelliJ (Community Edition): https://www.jetbrains.com/idea/download/ 

I use IntelliJ, and if you choose this route you will need to install the SBT plugin.

  1. Open Preferences
  2. Click on the 'Plugins' tab
  3. Search for 'SBT' or 'Scala' in the search panel on the right-hand side
  4. Click ‘Install’

[Screenshot: installing the SBT/Scala plugin in IntelliJ]

Full instructions for installing plugins are found here: https://www.jetbrains.com/help/idea/2016.2/installing-updating-and-uninstalling-repository-plugins.html

Build and Run Pi Example

Setup

The first thing we need to do is set up a project.  I have made a template, and it is hosted on GitHub.

git clone https://github.com/bryantravissmith/Spark-Scala-Template.git

Using IntelliJ, you can import the existing project after you have cloned it.  Open IntelliJ and create a project from existing sources.

[Screenshot: IntelliJ "Create project from existing sources" option]

Select the cloned location and import it.

[Screenshot: selecting the cloned directory]

Choose to import the existing project from the SBT model.

[Screenshot: choosing SBT as the external model to import from]

Click Next, choose your preferences, and click Finish.  The project will take a little time to import and download supporting files.  Once it is done you should see an open project with the following structure.

[Screenshot: imported project structure in IntelliJ]
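For reference, the template follows the standard sbt layout, so the structure should look roughly like this (the exact contents of the template may differ slightly):

Spark-Scala-Template/
  build.sbt
  project/
  src/
    main/
      scala/
        org/apache/spark/examples/
          SparkPi.scala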

Review build.sbt Structure

If you are new to Scala, you might be curious about this build.sbt file.  It sets up our project and describes the libraries we want to include.  Here is the file with some additional comments:


//Defines the version of our application and the scala version
//Change scala version for your project
// Spark 2+ Scala 2.11, Spark 1.6 Scala 2.10
val globalSettings = Seq(
  version := "1.0",
  scalaVersion := "2.11.8"
)

//Defines locations to look for dependencies
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
resolvers += "Repo at github.com/ankurdave/maven-repo" at "https://github.com/ankurdave/maven-repo/raw/master"

//Defines project structure.
lazy val sparkScalaTemplate = (project in file("."))
 .settings(name := "SparkAppName-ToBeReplaced")
 .settings(globalSettings:_*)
 .settings(libraryDependencies ++= dependencies)

//Defines the version of spark we are going to use
val sparkVersion = "2.0.1"

//Lists the dependencies
lazy val dependencies = Seq(
 "org.apache.spark" %% "spark-core" % sparkVersion,
  // %% automatically appends the Scala version to the artifact name
  // "org.apache.spark" % "spark-mllib_2.11" % sparkVersion,
 "org.apache.spark" %% "spark-mllib" % sparkVersion,
 "org.apache.spark" %% "spark-graphx" % sparkVersion,
 "org.apache.spark" %% "spark-sql" % sparkVersion
)

//Defines how duplicate files are handled when building the uber JAR (sbt-assembly)
assemblyMergeStrategy in assembly := {
 case PathList("META-INF", xs @ _*) => MergeStrategy.discard
 case x => MergeStrategy.first
}
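
The assemblyMergeStrategy setting comes from the sbt-assembly plugin, which means the template also has to declare that plugin under the project/ directory.  If you build a project from scratch, the declaration looks something like the line below; the file name and plugin version here are assumptions, so check the template and the sbt-assembly documentation for what is current.

// project/assembly.sbt (any .sbt file under project/ works; version is an assumption)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")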

The first thing you are likely to want to do is add additional libraries, either to integrate with your project or with your production system.  Maven is going to be the first and best source for these libraries.  You can see the different Spark artifacts here:  https://mvnrepository.com/artifact/org.apache.spark

[Screenshot: mvnrepository.com listing of org.apache.spark artifacts]

You can see how these listings map to the structure in the build.sbt file.  If we click on the first link, https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.10, we see the following.

[Screenshot: mvnrepository.com page for spark-core_2.10 version 2.0.1]

You can read the build.sbt format straight off this page:

Group:  org.apache.spark
Artifact: spark-core_2.10
Version: 2.0.1

"org.apache.spark" %% "spark-core" % "2.0.1"      // %% appends the project's Scala version for you
"org.apache.spark" % "spark-core_2.11" % "2.0.1"  // equivalent, with the Scala version written out (2.11 here)

Aside – Including a new dependency

Let's say I want my Spark app to write to a PostgreSQL 9.4 database.  I would start off by searching mvnrepository.com for 'postgresql'.  It quickly takes me here: https://mvnrepository.com/artifact/org.postgresql/postgresql

[Screenshot: mvnrepository.com page for the PostgreSQL JDBC driver]

I can now add the dependency to the build.sbt:

Group: org.postgresql
Artifact: postgresql // Notice: no Scala version in the artifact name, so use % instead of %%
Version: 9.4.1211.jre7

lazy val dependencies = Seq(
 "org.apache.spark" %% "spark-core" % sparkVersion,
 "org.apache.spark" %% "spark-mllib" % sparkVersion,
 "org.apache.spark" %% "spark-graphx" % sparkVersion,
 "org.apache.spark" %% "spark-sql" % sparkVersion,
 "org.postgresql" % "postgresql" % "9.4.1211.jre7"
)

To get the documentation on how to use this library I would do a Google search for 'org.postgresql javadocs'.  The first result is https://jdbc.postgresql.org/documentation/publicapi/.  I will let you enjoy the thrill of discovering the driver's API yourself, but a minimal Spark-side sketch is below.
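
In many cases you do not need to call the JDBC driver directly at all, because Spark's DataFrame API can write through JDBC on its own.  Here is a minimal sketch, assuming a hypothetical local database at jdbc:postgresql://localhost:5432/mydb with placeholder credentials and table name:

import java.util.Properties

import org.apache.spark.sql.SparkSession

object PostgresWriteExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("Postgres Write Example")
      .getOrCreate()

    import spark.implicits._

    // A tiny DataFrame to write out
    val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

    // Placeholder connection details -- replace with your own
    val url = "jdbc:postgresql://localhost:5432/mydb"
    val props = new Properties()
    props.setProperty("user", "myuser")
    props.setProperty("password", "mypassword")
    props.setProperty("driver", "org.postgresql.Driver")

    // Appends the rows to a 'people' table via JDBC
    df.write.mode("append").jdbc(url, "people", props)

    spark.stop()
  }
}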

Spark Pi

Now that we have a project, let's quickly get to implementing and running the Pi example.  I have included it in the GitHub template with a minor addition.  The idea is a simple Monte Carlo estimate: the fraction of random points in the unit square that land inside the unit circle approaches π/4, so multiplying the in-circle fraction by 4 approximates π.

package org.apache.spark.examples

import scala.math.random

import org.apache.spark.sql.SparkSession

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .master("local[*]") //Added from originals for local run in IntelliJ
      .appName("Spark Pi")
      .getOrCreate()
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(1000000L * slices, Int.MaxValue).toInt // avoid overflow
    val count = spark.sparkContext.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y <= 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / (n - 1))
    spark.stop()
  }
}

Run

With SparkPi.scala open in the IntelliJ IDE you can ‘Run’ the app.

[Screenshot: running SparkPi from IntelliJ]

The output is below:

Pi is roughly 3.1416575708287855

Build and Run any Spark Example

So now I'm going to work through pulling down all of the Spark examples so you can build and play with them yourself.

First we are going to clone just a portion of the repo and restructure the files to fit the IntelliJ project format.  (The streaming examples are removed at the end because they depend on extra connector libraries that the build file below does not include.)

mkdir SparkExamples
cd SparkExamples
git init
git remote add -f origin https://github.com/apache/spark.git
git config core.sparseCheckout true
echo "data/*" >> .git/info/sparse-checkout
echo "examples/src/main/scala/*" >> .git/info/sparse-checkout
git pull origin master
mv examples/* ./
rm -r examples
rm -r src/main/scala/org/apache/spark/examples/streaming/

Now we want to create a new Scala SBT project in this directory.

[Screenshot: New Project dialog with Scala > SBT selected]

[Screenshot: naming the SparkExamples project]

Now you can replace the generated build.sbt with the code below.


val globalSettings = Seq(
  version := "1.0",
  scalaVersion := "2.11.8"
)

resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
resolvers += "Repo at github.com/ankurdave/maven-repo" at "https://github.com/ankurdave/maven-repo/raw/master"

lazy val sparkScalaTemplate = (project in file("."))
  .settings(name := "SparkExamples")
  .settings(globalSettings:_*)
  .settings(libraryDependencies ++= dependencies)

val sparkVersion = "2.0.1"

lazy val dependencies = Seq(
  "org.apache.spark" %% "spark-core"   % sparkVersion,
  "org.apache.spark" %% "spark-sql"    % sparkVersion,
  "org.apache.spark" %% "spark-mllib"  % sparkVersion,
  "org.apache.spark" %% "spark-graphx" % sparkVersion,
  "com.github.scopt" %% "scopt"        % "3.5.0"  // command-line argument parsing used by many of the examples
)

When you save this, IntelliJ will ask if you want to auto-import and/or refresh the project.  I set it to do this automatically.  There may be some red warnings; ignore them for a moment.

Go ahead and open SparkKMeans.scala and run it.

I get a few errors because some of the examples do not match the Spark 2.0.1 API we are building against.  Only three files are affected; a sketch of the edited lines follows the list.

LogisticRegressionWithElasticNetExample.scala

comment out .setFamily("multinomial") on line 54
change coefficientMatrix to coefficients on line 59
change interceptMatrix to intercept on line 60

MulticlassLogisticRegressionWithElasticNetExample.scala

change coefficientMatrix to coefficients on line 50
change interceptMatrix to intercept on line 51

SparkSQLExample.scala

change df.createGlobalTempView("people") to df.createTempView("people")
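
For reference, here is roughly what the edited lines end up looking like.  This is a sketch against the Spark 2.0.1 API, not the exact file contents, and the line numbers above refer to the example sources as of this writing.

// LogisticRegressionWithElasticNetExample.scala -- sketch of the affected section
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
  // .setFamily("multinomial")  // not available in the 2.0.1 API, so comment it out

val lrModel = lr.fit(training)

// the matrix-valued accessors become coefficients / intercept in 2.0.1
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")

// SparkSQLExample.scala
df.createTempView("people")  // instead of df.createGlobalTempView("people")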

Now when you try to run SparkKMeans.scala you will get the following error because we have not supplied arguments.

Usage: SparkKMeans <file> <k> <convergeDist>

Process finished with exit code 1

In the top right of the IDE we can edit the run configuration.

[Screenshot: Edit Configurations menu in IntelliJ]

And add the following program arguments:

data/mllib/kmeans_data.txt 2 1.0

[Screenshot: program arguments in the run configuration]

The last thing we need to do is add .master("local[*]") to the SparkSession builder.

val spark = SparkSession
  .builder
  .master("local[*]")
  .appName("SparkKMeans")
  .getOrCreate()

Now we can run the example and get the following output.

Final centers:
DenseVector(0.2, 0.2, 0.2)
DenseVector(0.1, 0.1, 0.1)

Conclusion

At this point you should be able to run all of the Spark examples with some minor edits.  All of the examples will need .master("local[*]") added when building the SparkSession, and some of them will require that you provide program arguments.
