The format was simple: a user name followed by a user id.
TeamMarijo,34231221
TeVesoli,34279539
Keardly_,34262814
xElyFlds,34963944
Rrelaived,34289263
ayedema28,39008303
The first step was to compress the file, scp it to the cluster, and ssh into the remote server:
$ tar -czf users.tar.gz users.csv
$ scp -P 22 users.tar.gz dsuser@<IP>:~
$ ssh dsuser@<IP>
Once on the remote server, extract the archive and copy the CSV to HDFS:
$ tar -xvf users.tar.gz
$ rm users.tar.gz
$ sudo su hdfs
$ hadoop fs -put users.csv /team/dev/craig/
$ hadoop fs -ls -h /team/dev/craig
Found 1 items
-rw-r--r--   2 dsuser hdfs      925.9 G 2016-05-12 15:59 /team/dev/craig/pi20160208.csv
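Before building on top of it, a quick sanity check that the file landed intact doesn't hurt. This step isn't in the original workflow, but the standard hadoop fs -du and -tail commands report the size on HDFS and print the last kilobyte of the file:

# Optional sanity check (not part of the original post):
$ hadoop fs -du -h /team/dev/craig/
$ hadoop fs -tail /team/dev/craig/pi20160208.csv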
Next, I need to read the CSV, transform it into a data frame, and save it as Parquet:
%pyspark
rdd = sc.textFile("/team/dev/craig/pi20160208.csv")
df = rdd.map(lambda x: x.split(',')) \
        .map(lambda x: {'handle': x[0].strip(), 'userid': x[1].strip()}) \
        .toDF()
print df.describe()
df.limit(5).show()
print df.count()
df.write.parquet("/team/dev/craig/pi20160208/")
The output looks like this:
DataFrame[summary: string]
+-------------+------+
|       handle|userid|
+-------------+------+
|     Huoiwnic| 13853|
|      hoviabo| 14864|
|       trisly| 55173|
|   PixlRkxoot| 55293|
+-------------+------+

29008260086
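One caveat with the raw split above (not called out in the original post): any line that doesn't contain exactly two comma-separated fields will either break the job or produce garbage rows. A defensive variant, sketched here against the same path and column names, filters malformed lines before building the data frame; the two-field rule is an assumption based on the sample format at the top:

%pyspark
# Sketch only: drop lines that don't split into exactly two fields
# before converting to a DataFrame.
rdd = sc.textFile("/team/dev/craig/pi20160208.csv")
parsed = rdd.map(lambda line: line.split(',')) \
            .filter(lambda fields: len(fields) == 2)
df = parsed.map(lambda f: {'handle': f[0].strip(), 'userid': f[1].strip()}).toDF()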
Now that the file is saved in Apache Parquet format, I can load it like this:
%pyspark
df = sqlContext.read.parquet("/team/dev/craig/pi20160208")
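From there the data frame can be queried directly, or registered as a table for SQL. Here is a small usage sketch with the same Spark 1.x sqlContext API; the table name users is my own, and the handle comes from the sample data at the top of the post:

%pyspark
# Sketch: register the Parquet-backed DataFrame as a temp table
# ("users" is an assumed name) and look up one handle from the sample.
df = sqlContext.read.parquet("/team/dev/craig/pi20160208")
df.registerTempTable("users")
sqlContext.sql("SELECT userid FROM users WHERE handle = 'TeamMarijo'").show()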