## Introduction

The purpose of this article is to demonstrate how to load multiple CSV files on an HDFS filesystem into a single Dataframe and write to Parquet.
Two approaches are demonstrated. The first approach is not recommended, but is shown for completeness.
## First Approach

One approach might be to define each path:
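A sketch of what that might look like, with placeholder HDFS paths standing in for the article's original directories:

```scala
// Placeholder paths -- one per CSV file to be loaded
val path1 = "hdfs:///data/2015/01/records.csv"
val path2 = "hdfs:///data/2015/02/records.csv"
val path3 = "hdfs:///data/2015/03/records.csv"
```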
and then open each CSV at that path as an RDD and transform to a dataframe:
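For example, assuming a hypothetical three-column layout (the `Record` case class below is an illustration, not the article's original schema), each file could be parsed and converted with `toDF`:

```scala
import sqlContext.implicits._

// Illustrative schema; adjust the fields to match the actual CSV columns
case class Record(id: Int, name: String, value: Double)

// Read the raw lines, split on commas, map to the case class, and convert to a DataFrame
val df1 = sc.textFile(path1)
  .map(_.split(","))
  .map(f => Record(f(0).toInt, f(1), f(2).toDouble))
  .toDF()
```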
This would need to be repeated for each CSV file, producing one dataframe per path.
The dataframes could then be merged using the unionAll operator.
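Assuming `df1`, `df2`, and `df3` were built as above, the merge might look like this:

```scala
// unionAll (Spark 1.x) requires that all DataFrames share the same schema
val merged = df1.unionAll(df2).unionAll(df3)
```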
and finally written to Parquet.
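A minimal sketch using the Spark 1.4+ DataFrameWriter API (the output path is a placeholder):

```scala
// Write the merged DataFrame out as Parquet on HDFS
merged.write.parquet("hdfs:///data/merged.parquet")
```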
## Easier Approach

Notice the convenient way of reading multiple CSV files in nested directories into a single RDD:
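Since `textFile` accepts glob patterns, one call can pick up every CSV under the nested directories. The path below is a placeholder; adjust the wildcards to match the actual directory depth:

```scala
// One RDD of lines drawn from all matching CSV files
val rdd = sc.textFile("hdfs:///data/*/*/*.csv")
```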
This is clearly better than defining each path individually.
There are multiple ways to transform RDDs into Dataframes (DFs):
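Two common options are sketched below, reusing the hypothetical `Record` schema from earlier: inferring the schema by reflection from a case class, or building it programmatically with `StructType`:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, IntegerType, StringType, DoubleType}

// Option 1: reflection -- map each line to a case class and call toDF
val dfByReflection = rdd
  .map(_.split(","))
  .map(f => Record(f(0).toInt, f(1), f(2).toDouble))
  .toDF()

// Option 2: programmatic schema -- build Rows and apply an explicit StructType
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType),
  StructField("value", DoubleType)))

val rowRdd = rdd.map(_.split(",")).map(f => Row(f(0).toInt, f(1), f(2).toDouble))
val dfBySchema = sqlContext.createDataFrame(rowRdd, schema)
```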
This is not necessarily superior to the first approach, but it is an alternative to consider.
## Load from Parquet

For subsequent analysis, load from Parquet using this code:
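A minimal sketch, assuming the same placeholder output path used when writing:

```scala
// Read the Parquet data back into a DataFrame for further analysis
val df = sqlContext.read.parquet("hdfs:///data/merged.parquet")
df.printSchema()
```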