DevOps: Zeppelin and Spark: Merge Multiple CSVs into Parquet

Tuesday, September 20, 2016

Introduction

This article demonstrates how to load multiple CSV files from HDFS into a single DataFrame and write that DataFrame to Parquet.

Two approaches are demonstrated. The first approach is not recommended, but is shown for completeness.

First Approach

One approach might be to define each path:

%pyspark

import locale
locale.setlocale(locale.LC_ALL, 'en_US')

p1 = "/data/output/followers/mitshu/ec2-52-39-251-219.us-west-2.compute.amazonaws.com/0-ec2-52-39-251-219.us-west-2.compute.amazonaws.com/twitterFollowers.csv"
p2 = "/data/output/followers/mitshu/ec2-52-42-100-207.us-west-2.compute.amazonaws.com/0-ec2-52-42-100-207.us-west-2.compute.amazonaws.com/twitterFollowers.csv"
p3 = "/data/output/followers/mitshu/ec2-52-42-198-4.us-west-2.compute.amazonaws.com/0-ec2-52-42-198-4.us-west-2.compute.amazonaws.com/twitterFollowers.csv"
p4 = "/data/output/followers/mitshu/ec2-54-70-37-224.us-west-2.compute.amazonaws.com/0-ec2-54-70-37-224.us-west-2.compute.amazonaws.com/twitterFollowers.csv"

Then open the CSV at each path as an RDD and transform it into a DataFrame:

%pyspark

rdd_m1 = sc.textFile(p1)
print(rdd_m1.take(5))

# Split each tab-separated line, keep only well-formed rows (6 fields),
# and map the fields to named columns.
df_m1 = rdd_m1.\
    map(lambda x: x.split("\t")).\
    filter(lambda x: len(x) == 6).\
    map(lambda x: {
        'id': x[0],
        'profile_id': x[1],
        'profile_name': x[2],
        'follower_id': x[3],
        'follower_name': x[4],
        'unknown': x[5]}).\
    toDF()
df_m1.limit(5).show()
df_m1.registerTempTable("df_m1")

These steps would need to be repeated for each of the remaining paths (p2 to p4) to produce df_m2, df_m3, and df_m4.
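Rather than copying the block above once per file, the repeated steps could be wrapped in a function. The following is a minimal sketch, not the original author's code: `COLUMNS`, `parse_line`, and `load_followers` are hypothetical names, and it assumes the `sc` object provided by Zeppelin's `%pyspark` interpreter, where an SQLContext is already active so that `toDF` is available on RDDs.

```python
# Column names follow the first approach; adjust them to the real data.
COLUMNS = ["id", "profile_id", "profile_name",
           "follower_id", "follower_name", "unknown"]


def parse_line(line, expected_fields=len(COLUMNS)):
    """Split one tab-separated line into a tuple.

    Returns None for malformed lines (wrong number of fields) so the
    caller can filter them out.
    """
    fields = line.split("\t")
    if len(fields) != expected_fields:
        return None
    return tuple(fields)


def load_followers(sc, path):
    """Read one file into a DataFrame with named string columns.

    Assumption: `sc` is the SparkContext supplied by Zeppelin.
    """
    return (sc.textFile(path)
              .map(parse_line)
              .filter(lambda row: row is not None)
              .toDF(COLUMNS))
```

With this in place, the four DataFrames could be built as `df_m1, df_m2, df_m3, df_m4 = [load_followers(sc, p) for p in (p1, p2, p3, p4)]`.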

The DataFrames can then be merged with the unionAll method:

%pyspark
df = df_m1.unionAll(df_m2).unionAll(df_m3).unionAll(df_m4)

# unionAll keeps duplicates, so the merged count should equal
# the sum of the individual counts.
print("DF 1: {0}".format(df_m1.count()))
print("DF 2: {0}".format(df_m2.count()))
print("DF 3: {0}".format(df_m3.count()))
print("DF 4: {0}".format(df_m4.count()))
print("Merged Dataframe: {0}".format(df.count()))

Finally, the merged DataFrame is written to Parquet:

%pyspark

df.write.parquet("/data/output/followers/mitshu/joined.prq")
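The merge and write steps above could also be folded into a helper that accepts any number of DataFrames. This is only a sketch under assumptions: `union_all` and `merge_and_write` are hypothetical names, and the inputs are assumed to share the same column order, because `unionAll` combines columns by position rather than by name. The `mode` argument is passed to `DataFrameWriter.mode`; the default, `"error"`, fails if the output path already exists, while `"overwrite"` replaces it.

```python
from functools import reduce


def union_all(frames):
    """Chain unionAll over a sequence of DataFrames."""
    frames = list(frames)
    if not frames:
        raise ValueError("at least one DataFrame is required")
    return reduce(lambda left, right: left.unionAll(right), frames)


def merge_and_write(frames, out_path, mode="error"):
    """Union the DataFrames and write the result to Parquet.

    Returns the merged DataFrame so it can be inspected afterwards.
    """
    merged = union_all(frames)
    merged.write.mode(mode).parquet(out_path)
    return merged
```

For the four DataFrames above, a call might look like `merge_and_write([df_m1, df_m2, df_m3, df_m4], "/data/output/followers/mitshu/joined.prq")`. On Spark 2.0 and later, `union` is the equivalent method name.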

Easier Approach

A glob pattern offers a convenient way to read multiple CSV files in nested directories into a single RDD:

%pyspark

path = "/data/output/followers/mitshu/*/*/*.csv"
rdd = sc.textFile(path)
print("count = {}".format(rdd.count()))

This is clearly better than defining each path individually.
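In the pattern, each `*` stands for a single directory or file name and does not cross a `/`, so `*/*/*.csv` reaches exactly two directory levels below `mitshu`. The sketch below approximates that rule in plain Python to preview which paths a pattern would select. `matches_glob` is a hypothetical helper; it is not how Spark or Hadoop resolve globs internally, and it is only meant for simple `*` and `?` wildcards.

```python
from fnmatch import fnmatchcase


def matches_glob(path, pattern):
    """Approximate a per-component glob match.

    The path and the pattern are split on "/" and compared component by
    component, so a wildcard never matches across a directory separator.
    """
    path_parts = path.split("/")
    pattern_parts = pattern.split("/")
    if len(path_parts) != len(pattern_parts):
        return False
    return all(fnmatchcase(part, pat)
               for part, pat in zip(path_parts, pattern_parts))


pattern = "/data/output/followers/mitshu/*/*/*.csv"
candidates = [
    # Two directory levels below mitshu, as in the first approach.
    "/data/output/followers/mitshu/ec2-52-39-251-219.us-west-2.compute.amazonaws.com/"
    "0-ec2-52-39-251-219.us-west-2.compute.amazonaws.com/twitterFollowers.csv",
    # Hypothetical file directly under mitshu: too shallow to match.
    "/data/output/followers/mitshu/twitterFollowers.csv",
]
for candidate in candidates:
    print("{0} -> {1}".format(candidate, matches_glob(candidate, pattern)))
```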

There are multiple ways to transform RDDs into DataFrames (DFs). One is to map each line through a named function:

%pyspark

def to_json(r):
    # Despite the name, this returns a dict (one per row), not a JSON string.
    # Only the first five tab-separated fields are used.
    j = {}
    t = r.split("\t")
    j['num_followers'] = t[0]
    j['followed_userid'] = t[1]
    j['followed_handle'] = t[2]
    j['follower_userid'] = t[3]
    j['follower_handle'] = t[4]
    return j

df = rdd.map(to_json).toDF()
print("count = {}".format(df.count()))
df.show()

Defining a named function is not necessarily superior to the inline lambda used in the first approach, but it is an alternative to consider.
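As another way to convert the RDD, an explicit schema can be passed to `sqlContext.createDataFrame` instead of letting `toDF` infer one from dictionaries. The following is a minimal sketch under stated assumptions: `FIELDS`, `to_row`, and `to_dataframe` are hypothetical names, every column is kept as a string, and lines with fewer than five fields are dropped instead of raising an `IndexError`. It assumes the `sqlContext` object that Zeppelin provides.

```python
# Field names taken from the conversion function above.
FIELDS = ["num_followers", "followed_userid", "followed_handle",
          "follower_userid", "follower_handle"]


def to_row(line):
    """Return the first five tab-separated fields as a tuple, or None."""
    parts = line.split("\t")
    if len(parts) < len(FIELDS):
        return None
    return tuple(parts[:len(FIELDS)])


def to_dataframe(rdd, sql_context):
    """Build a DataFrame from an RDD of lines using an explicit schema."""
    # Imported here so the parsing helper stays usable without Spark.
    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType(
        [StructField(name, StringType(), True) for name in FIELDS])
    rows = rdd.map(to_row).filter(lambda row: row is not None)
    return sql_context.createDataFrame(rows, schema)
```

In a `%pyspark` paragraph this might be called as `df = to_dataframe(rdd, sqlContext)`. Numeric columns could be cast afterwards if later analysis needs them as numbers.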

Load from Parquet

For subsequent analysis, load from Parquet using this code:

%pyspark

df = sqlContext.read.parquet("/data/output/followers/mitshu/joined.prq")
df.limit(5).show()
