Thursday, February 26, 2015

Installing Spark on Ubuntu

Why Spark?


Spark comes with tools for interactive query analysis (Spark SQL, the successor to Shark), large-scale graph processing and analysis (GraphX, which superseded Bagel), and real-time analysis (Spark Streaming). Rather than having to mix and match a set of tools (e.g., Hive, Hadoop, Mahout, S4/Storm), you only have to learn one programming paradigm. For SQL enthusiasts, the added bonus is that Spark SQL (like Shark before it) tends to run faster than Hive.


Environment

  1. Apache Spark 1.2.1
  2. Ubuntu 14.10
  3. Git
  4. JDK 1.8.0_31
Make sure you have Java and Git installed. Double check that your $JAVA_HOME environment variable points to a valid JDK and not a JRE.
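
A quick way to sanity-check all three from a terminal (the versions and JDK path printed on your machine will differ; the key point is that javac exists under $JAVA_HOME, which a JRE would lack):
java -version
git --version
echo $JAVA_HOME
ls $JAVA_HOME/bin/javac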


1. Installing Spark


The following commands have been tested and work under Ubuntu 14.10.
mkdir ~/spark
cd ~/spark
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.2.1.tgz
tar -zxvf spark-1.2.1.tgz

I don't recommend installing Spark into a directory path that requires sudo privileges. I attempted this initially and ran into an AccessControlException when trying to connect Spark to my existing HDFS cluster.
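
That said, the rest of this post refers to Spark living under /usr/lib/apache/spark/1.2.1. One way to reconcile the two (a sketch of one approach, not the only option) is to move the extracted directory there and hand ownership to your own user, so sudo is never needed afterwards:
sudo mkdir -p /usr/lib/apache/spark
sudo mv ~/spark/spark-1.2.1 /usr/lib/apache/spark/1.2.1
sudo chown -R $USER:$USER /usr/lib/apache/spark/1.2.1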

2. Modifying the Path


Modify the PATH to include the bin directory of the new Spark installation, and set SPARK_HOME.

Use the text editor of your choice to edit your environment file:
sudo gedit /etc/environment

The Spark bin directory was appended to PATH, and the SPARK_HOME line was added:
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/apache/spark/1.2.1/bin"
 ...   
export SPARK_HOME=/usr/lib/apache/spark/1.2.1

Once the environment file is saved, reload it:
source /etc/environment
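
To confirm the new variables are visible in your current shell:
echo $SPARK_HOME
echo $PATH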



3. Verifying the Installation


We're going to run a sample program that is bundled with Spark to verify the installation was performed correctly.

The program is called SparkPi and will output an approximate value for pi. Before it can run, Spark itself needs to be built with the Simple Build Tool (SBT) for Scala. SBT is to Scala what Maven is to Java.

Build Spark from the installation directory:
sbt/sbt -Dhadoop.version=2.6.0 assembly
Notice the Hadoop version being passed in. You'll want to change this to match the version of Hadoop running on your existing cluster.
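
If you're not sure what that version is, running the following on one of the cluster nodes will report it (assuming the hadoop binary is on the PATH there):
hadoop version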

The script took nearly 20 minutes to run and completed with this line:
[success] Total time: 1039 s, completed Feb 26, 2015 4:42:04 PM

SparkPi is a sample application that computes an approximate value for pi:
./bin/run-example SparkPi 10

The end of a successful run looks like this:
15/02/26 16:59:29 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool   
15/02/26 16:59:29 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:35, took 6.157368 s  
Pi is roughly 3.14192  
15/02/26 16:59:29 INFO SparkUI: Stopped Spark web UI at http://192.168.1.10:4040  
15/02/26 16:59:29 INFO DAGScheduler: Stopping DAGScheduler  
15/02/26 16:59:30 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!  
15/02/26 16:59:30 INFO MemoryStore: MemoryStore cleared  
15/02/26 16:59:30 INFO BlockManager: BlockManager stopped  
15/02/26 16:59:30 INFO BlockManagerMaster: BlockManagerMaster stopped  
15/02/26 16:59:30 INFO SparkContext: Successfully stopped SparkContext  
15/02/26 16:59:30 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.  
15/02/26 16:59:30 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.  
15/02/26 16:59:30 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.  

I'm using the following code, entered in the Spark shell, to run a word count over an existing dataset on my HDFS cluster and write the results back to HDFS:
// load the input, split each line on commas, and count occurrences of each word
val file = sc.textFile("hdfs://192.168.1.70:9000/nyt/singles")
val count = file.flatMap(line => line.split(",")).map(word => (word, 1)).reduceByKey(_ + _)
count.saveAsTextFile("hdfs://192.168.1.70:9000/nyt/out/02")
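
Those lines are entered at the Spark shell prompt. If you don't have a shell open yet, launch one from the Spark directory (this assumes SPARK_HOME was set as described above):
cd $SPARK_HOME
./bin/spark-shell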

You can determine the port number of your NameNode by typing:
sudo netstat -tulpn
and looking through the output for the port the NameNode process is listening on, typically either 8020 or 9000.
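
Alternatively, if the Hadoop client tools are installed on the machine, you can ask HDFS for the NameNode URI directly (this assumes the hdfs command is available):
hdfs getconf -confKey fs.defaultFS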


Troubleshooting

  1. A Java execution error can occur if the sample application is executed without sudo privileges; the second message below shows that the spark-class script cannot find the java binary at the path derived from JAVA_HOME (a possible fix follows the log):
    craig@U14SPARK01:/usr/lib/apache/spark/1.2.1$ ./bin/run-example SparkPi 10
    /usr/lib/apache/spark/1.2.1/bin/spark-class: line 114: [: : integer expression expected
    /usr/lib/apache/spark/1.2.1/bin/spark-class: line 188: /usr/lib/jvm/jdk/1.0.8_31/bin/java: No such file or directory
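    If you hit this, double check that JAVA_HOME points at a directory that actually exists for the user running the example. One possible fix along these lines (the JDK path below is only an example; substitute the location of your own JDK):
    export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_31
    export PATH=$JAVA_HOME/bin:$PATH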
    



References

  1. Installing Apache Spark 1.1.0 on Ubuntu 14.04
  2. Troubleshooting
    1. [StackOverflow] Can't execute javac when running sbt
