Why Spark?
Spark comes with tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming). Rather than having to mix and match a set of tools (e.g., Hive, Hadoop, Mahout, S4/Storm), you only have to learn one programming paradigm. For SQL enthusiasts, the added bonus is that Shark tends to run faster than Hive.
Environment
- Apache Spark 1.2.1
- Ubuntu 14.10
- Git
- JDK 1.8.0_31
1. Installing Spark
The following commands have been tested, and are operational, under Ubuntu 14.10.
mkdir ~/spark
cd ~/spark
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.2.1.tgz
tar -zxvf spark-1.2.1.tgz
I don't recommend choosing an installation path that requires sudo privileges to install Spark. I attempted this initially and ran into an Access Control Exception when trying to connect Spark to my existing HDFS cluster.
2. Modifying the Path
Modify the PATH and set SPARK_HOME so they point to the new Spark installation.
Use the text editor of your choice to edit your environment file:
sudo gedit /etc/environment
The Spark entries (the Spark bin directory appended to PATH, plus the SPARK_HOME export) were added:
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/apache/spark/1.2.1/bin" ... export SPARK_HOME=/usr/lib/apache/spark/1.2.1
Once the environment file is saved, reload it:
source /etc/environment
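If you want to double-check later that the variable is visible to the JVM (for example from the Spark shell once it has been built), a quick one-liner is all it takes. This is just a sanity check, not a required step:

// Print SPARK_HOME as seen by the running JVM; prints a fallback message if it is unset.
println(sys.env.getOrElse("SPARK_HOME", "SPARK_HOME is not set"))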
3. Verifying the Installation
We're going to run a sample program that is bundled with Spark to verify the installation was performed correctly.
The program is called SparkPi and will output an approximate value for Pi. We're going to build it using the Simple Build Tool (SBT) for Scala; SBT is to Scala what Maven is to Java.
First, build Spark itself:
sbt/sbt -Dhadoop.version=2.6.0 assembly
The script took nearly 20 minutes to run and completed with this line:
[success] Total time: 1039 s, completed Feb 26, 2015 4:42:04 PM
SparkPi is a sample application that computes an approximate value for Pi; the argument passed below is the number of slices to split the work into:
./bin/run-example SparkPi 10
The end of a successful run looks like this:
15/02/26 16:59:29 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/02/26 16:59:29 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:35, took 6.157368 s
Pi is roughly 3.14192
15/02/26 16:59:29 INFO SparkUI: Stopped Spark web UI at http://192.168.1.10:4040
15/02/26 16:59:29 INFO DAGScheduler: Stopping DAGScheduler
15/02/26 16:59:30 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
15/02/26 16:59:30 INFO MemoryStore: MemoryStore cleared
15/02/26 16:59:30 INFO BlockManager: BlockManager stopped
15/02/26 16:59:30 INFO BlockManagerMaster: BlockManagerMaster stopped
15/02/26 16:59:30 INFO SparkContext: Successfully stopped SparkContext
15/02/26 16:59:30 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/02/26 16:59:30 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/02/26 16:59:30 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
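For context, SparkPi estimates Pi with a Monte Carlo approach: it scatters random points in the unit square and counts how many fall inside the unit circle. A minimal sketch of the same idea, runnable in the Spark shell (the variable names are illustrative, not the bundled example's exact source), looks roughly like this:

// Monte Carlo estimate of Pi: sample random points and count the
// fraction that land inside the unit circle.
val slices = 10                       // same role as the "10" argument above
val n = 100000 * slices               // total number of random samples
val inside = sc.parallelize(1 to n, slices).map { _ =>
  val x = math.random * 2 - 1         // x in [-1, 1)
  val y = math.random * 2 - 1         // y in [-1, 1)
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * inside / n)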
I'm using the following code to run a word count over an existing dataset in my HDFS cluster and write the results back to HDFS:
// Read the input from HDFS, split each line on commas, and count occurrences of each word.
val file = sc.textFile("hdfs://192.168.1.70:9000/nyt/singles")
val count = file.flatMap(line => line.split(",")).map(word => (word, 1)).reduceByKey(_ + _)
count.saveAsTextFile("hdfs://192.168.1.70:9000/nyt/out/02")
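Before writing the output, it can be handy to eyeball a few of the counts from the Spark shell. A quick check (purely illustrative) might be:

// Print the ten most frequent words to the console as a sanity check.
count.map { case (word, n) => (n, word) }
     .sortByKey(ascending = false)
     .take(10)
     .foreach(println)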
You can determine the port number of your namenode by typing:
sudo netstat -tulpn
Troubleshooting
- A Java execution error can occur if the sample application is executed without sudo privileges:
craig@U14SPARK01:/usr/lib/apache/spark/1.2.1$ ./bin/run-example SparkPi 10
/usr/lib/apache/spark/1.2.1/bin/spark-class: line 114: [: : integer expression expected
/usr/lib/apache/spark/1.2.1/bin/spark-class: line 188: /usr/lib/jvm/jdk/1.0.8_31/bin/java: No such file or directory
References
- Installing Apache Spark 1.1.0 on Ubuntu 14.04
- Troubleshooting
- [StackOverflow] Can't execute javac when running sbt