Monday, March 23, 2015

Spark Architecture and Design

Cluster Mode Overview

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (also known as the driver program).
Fig 1: Cluster Mode Overview



What is an RDD?


Write programs in terms of transformations on distributed datasets.
    Resilient Distributed Datasets:
  1. Collections of objects spread across a cluster, stored in RAM or on disk
  2. Built through parallel transformations
  3. Automatically rebuilt on failure
    Operations:
  1. Transformations (e.g. map, filter, groupBy): lazily define a new RDD from an existing one
  2. Actions (e.g. count, collect, save): trigger the computation and return a result
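The important point is that transformations are lazy: they only describe a new dataset, and nothing runs until an action asks for a result. A minimal plain-Python sketch of that laziness, using generators in place of RDDs (this is an analogy for the evaluation model, not the actual Spark API):

```python
# Transformations: each step builds a lazy pipeline; nothing is computed yet.
data = range(1, 11)                      # stand-in for an input dataset
mapped = (x * x for x in data)           # like rdd.map(lambda x: x * x)
filtered = (x for x in mapped if x % 2)  # like rdd.filter(lambda x: x % 2)

# Action: forces evaluation of the whole pipeline in one pass.
result = list(filtered)                  # like rdd.collect()
print(result)                            # [1, 9, 25, 49, 81]
```

In Spark, the same idea lets the scheduler see the whole chain of transformations before running anything, and the recorded lineage is what allows a lost partition to be rebuilt on failure.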



References

  1. Cluster Design
    1. [Spark] Cluster Mode Overview
  2. RDD
    1. The RDD API by Example
      1. Zhen He's page at La Trobe University.
      2. Current with Spark 1.1.0
      3. A helpful introduction to the RDD API.
    2. [DataBricks, PDF] Spark Tutorial Summit 2013
      1. Introductory level talk
