Thursday, May 12, 2016

Zeppelin and Spark: Writing to Parquet

A colleague gave me a CSV that was nearly a TB.

The format was simple:
TeamMarijo,34231221
TeVesoli,34279539
Keardly_,34262814
xElyFlds,34963944
Rrelaived,34289263
ayedema28,39008303
Each line is a user name followed by a user id.
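
With a file this large, it can be worth checking a sample of lines against the expected format before starting a long job. The following is a minimal sketch in plain Python; `split_record` is a hypothetical helper name, and the rule that a valid line has exactly one comma and an integer id is an assumption based on the sample above.

```python
def split_record(line):
    """Return (handle, userid) for a well-formed line, or None otherwise.

    Assumes exactly one comma per line and an integer user id.
    """
    parts = line.rstrip("\n").split(",")
    if len(parts) != 2:
        return None
    handle, userid = parts[0].strip(), parts[1].strip()
    if not handle:
        return None
    try:
        return handle, int(userid)
    except ValueError:
        # The id field was empty or not a number.
        return None


def count_malformed(lines):
    """Count the lines in an iterable that do not match the expected format."""
    return sum(1 for line in lines if split_record(line) is None)
```

For example, `count_malformed(open("users.csv"))` would scan the local file lazily, one line at a time, though for a file near a TB it is probably more practical to run it on the first few thousand lines only.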

The first step was to compress the file, scp it to the cluster, and log in to the remote server:
$ tar -czf users.tar.gz users.csv
$ scp -P 22 users.tar.gz dsuser@<IP>:~
$ ssh dsuser@<IP>

Once on the remote server, I unpacked the archive and copied the CSV to HDFS:
$ tar -xvf users.tar.gz
$ rm users.tar.gz
$ sudo su hdfs
$ hadoop fs -put users.csv /team/dev/craig/pi20160208.csv
$ hadoop fs -ls -h /team/dev/craig
Found 1 items
-rw-r--r--   2 dsuser hdfs      925.9 G 2016-05-12 15:59 /team/dev/craig/pi20160208.csv


Next I needed to read the CSV, transform it into a DataFrame, and save it as Parquet:
%pyspark

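from pyspark.sql import Row

# Read the raw CSV from HDFS; each element is one "handle,userid" line.
rdd = sc.textFile("/team/dev/craig/pi20160208.csv")
# Split each line and build Rows so toDF() can infer the column names.
df = rdd.map(lambda x: x.split(',')).map(lambda x: Row(handle=x[0].strip(), userid=x[1].strip())).toDF()

print(df.describe())  # prints the schema of the summary DataFrame
df.limit(5).show()
print(df.count())

# Write the DataFrame out as Parquet.
df.write.parquet("/team/dev/craig/pi20160208/")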

The output looks like this:
DataFrame[summary: string]
+-------------+------+
|       handle|userid|
+-------------+------+
|     Huoiwnic| 13853|
|      hoviabo| 14864|
|       trisly| 55173|
|   PixlRkxoot| 55293|
+-------------+------+

29008260086
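
The snippet above keeps userid as a string, because everything read by sc.textFile is text. If a numeric column is wanted, one option is to pass an explicit schema to sqlContext.createDataFrame. This is a sketch, not the code used for the run above: `parse_record` and `csv_to_parquet` are hypothetical helper names, and `sc` and `sqlContext` are assumed to be the objects Zeppelin provides in a %pyspark paragraph.

```python
def parse_record(line):
    """Split 'handle,userid' into a (str, int) tuple matching the schema below."""
    handle, userid = line.rsplit(",", 1)
    return (handle.strip(), int(userid.strip()))


def csv_to_parquet(sc, sqlContext, src, dst):
    """Read a two-column CSV from src and write it to dst as Parquet."""
    # Imported here so parse_record can be tried out without a Spark install.
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    schema = StructType([
        StructField("handle", StringType(), False),
        StructField("userid", LongType(), False),
    ])
    rows = sc.textFile(src).map(parse_record)
    df = sqlContext.createDataFrame(rows, schema)
    df.write.parquet(dst)
    return df
```

In a Zeppelin paragraph this could be called as `csv_to_parquet(sc, sqlContext, "/team/dev/craig/pi20160208.csv", "/team/dev/craig/pi20160208/")`. Note that `parse_record` raises on a malformed line, which would fail the job, so it assumes the input is clean.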

Now that the file is saved in Apache Parquet format, I can load it like this:
%pyspark

df = sqlContext.read.parquet("/team/dev/craig/pi20160208")
