Introduction
The purpose of this article is to demonstrate how to load multiple CSV files on an HDFS filesystem into a single DataFrame and write the result to Parquet. Two approaches are demonstrated. The first approach, which requires hard-coding a path for every input file and repeating the same transformation for each, is not recommended, but is shown for completeness.
First Approach
One approach might be to define each path:

%pyspark
import locale
locale.setlocale(locale.LC_ALL, 'en_US')

p1 = "/data/output/followers/mitshu/ec2-52-39-251-219.us-west-2.compute.amazonaws.com/0-ec2-52-39-251-219.us-west-2.compute.amazonaws.com/twitterFollowers.csv"
p2 = "/data/output/followers/mitshu/ec2-52-42-100-207.us-west-2.compute.amazonaws.com/0-ec2-52-42-100-207.us-west-2.compute.amazonaws.com/twitterFollowers.csv"
p3 = "/data/output/followers/mitshu/ec2-52-42-198-4.us-west-2.compute.amazonaws.com/0-ec2-52-42-198-4.us-west-2.compute.amazonaws.com/twitterFollowers.csv"
p4 = "/data/output/followers/mitshu/ec2-54-70-37-224.us-west-2.compute.amazonaws.com/0-ec2-54-70-37-224.us-west-2.compute.amazonaws.com/twitterFollowers.csv"
and then open the CSV at each path as an RDD and transform it to a dataframe:
%pyspark
rdd_m1 = sc.textFile(p1)
print rdd_m1.take(5)

df_m1 = rdd_m1.\
    map(lambda x: x.split("\t")).\
    filter(lambda x: len(x) == 6).\
    map(lambda x: {
        'id': x[0],
        'profile_id': x[1],
        'profile_name': x[2],
        'follower_id': x[3],
        'follower_name': x[4],
        'unknown': x[5]}).\
    toDF()

df_m1.limit(5).show()
df_m1.registerTempTable("df_m1")
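The remaining dataframes would be built the same way. The sketch below factors the row parsing into a hypothetical to_row helper that mirrors the lambda chain above:

%pyspark
# Sketch: build df_m2 through df_m4 exactly as df_m1 was built.
# to_row is a hypothetical helper mirroring the lambda chain above.
def to_row(t):
    return {'id': t[0], 'profile_id': t[1], 'profile_name': t[2],
            'follower_id': t[3], 'follower_name': t[4], 'unknown': t[5]}

df_m2 = sc.textFile(p2).map(lambda x: x.split("\t")) \
    .filter(lambda x: len(x) == 6).map(to_row).toDF()
df_m3 = sc.textFile(p3).map(lambda x: x.split("\t")) \
    .filter(lambda x: len(x) == 6).map(to_row).toDF()
df_m4 = sc.textFile(p4).map(lambda x: x.split("\t")) \
    .filter(lambda x: len(x) == 6).map(to_row).toDF()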
The dataframes could then be merged using the unionAll operator, which requires that all four share the same schema:
%pyspark
df = df_m1.unionAll(df_m2).unionAll(df_m3).unionAll(df_m4)

print "DF 1: {0}".format(df_m1.count())
print "DF 2: {0}".format(df_m2.count())
print "DF 3: {0}".format(df_m3.count())
print "DF 4: {0}".format(df_m4.count())
print "Merged Dataframe: {0}".format(df.count())
and finally written to Parquet:
%pyspark
df.write.parquet("/data/output/followers/mitshu/joined.prq")
Easier Approach
Notice the convenient way of reading multiple CSV files in nested directories into a single RDD: a glob pattern in the path matches every file at once.

%pyspark
path = "/data/output/followers/mitshu/*/*/*.csv"
rdd = sc.textFile(path)
print "count = {}".format(rdd.count())
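An alternative worth noting (an assumption on my part; the original notebook does not use it) is the external spark-csv package, which can read the same glob straight into a DataFrame and skip the RDD step entirely:

%pyspark
# A sketch, assuming the com.databricks.spark.csv package is on the classpath.
# Without a header row, columns are auto-named C0, C1, and so on.
path = "/data/output/followers/mitshu/*/*/*.csv"
df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("delimiter", "\t") \
    .option("header", "false") \
    .load(path)
print "count = {}".format(df.count())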
There are multiple ways to transform RDDs into Dataframes (DFs):
%pyspark
def to_json(r):
    j = {}
    t = r.split("\t")
    j['num_followers'] = t[0]
    j['followed_userid'] = t[1]
    j['followed_handle'] = t[2]
    j['follower_userid'] = t[3]
    j['follower_handle'] = t[4]
    return j

df = rdd.map(to_json).toDF()
print "count = {}".format(df.count())
df.show()
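Another option is to pair an explicit schema with createDataFrame, which avoids the schema inference that toDF() performs. This is a sketch assuming the same five tab-separated columns as the to_json example above:

%pyspark
from pyspark.sql.types import StructType, StructField, StringType

# Sketch: an explicit schema for the five tab-separated columns used above.
schema = StructType([
    StructField("num_followers", StringType(), True),
    StructField("followed_userid", StringType(), True),
    StructField("followed_handle", StringType(), True),
    StructField("follower_userid", StringType(), True),
    StructField("follower_handle", StringType(), True)])

# Map each line to a tuple and let createDataFrame apply the schema.
tuples = rdd.map(lambda r: tuple(r.split("\t")[:5]))
df = sqlContext.createDataFrame(tuples, schema)
df.show()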
Load from Parquet
For subsequent analysis, load from Parquet using this code:

%pyspark
df = sqlContext.read.parquet("/data/output/followers/mitshu/joined.prq")
df.limit(5).show()
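From there the dataframe can be registered as a temporary table and queried with SQL, just as in the first approach (a sketch; the table name followers below is an arbitrary choice):

%pyspark
# Sketch: expose the loaded dataframe to SQL; "followers" is an arbitrary name.
df.registerTempTable("followers")
sqlContext.sql("SELECT COUNT(*) FROM followers").show()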