Introduction
Use of Python pandas library to read and process Geonames open data.Required
- Download allCountries.zip.
- The schema and data set acquisition is discussed in more depth in this blog post.
- Python and the pandas library installed.
The following code will read
#!/usr/bin/env python # -*- coding: UTF-8 -*- import time import pandas as pd import logging logging.basicConfig(level=logging.DEBUG) logger = logging.getLogger(__name__) class GeonamesCountriesTxtFileReader(object): """ A GeonamesCountriesTxtFileReader """ def __init__(self, input_file): """Return a GeonamesCountriesTxtFileReader object""" self.input_file = input_file def get_header_row(self): return [ 'geonameid', 'name', 'asciiname', 'alternatenames', 'latitute', 'longitude', 'feature-class', 'feature-code', 'country-code', 'alt-country-codes', 'admin-code-1', 'admin-code-2', 'admin-code-3', 'admin-code-4', 'population', 'elevation', 'dem', 'timezone', 'modification-date' ] def get_data_types(self): return { 'geonameid' : int, 'name' : str, 'asciiname' : str, 'alternatenames' : str, 'latitute' : float, 'longitude' : float, 'feature-class' : str, 'feature-code' : str, 'country-code' : str, 'alt-country-codes' : str, 'admin-code-1' : str, 'admin-code-2' : str, 'admin-code-3' : str, 'admin-code-4' : str, 'population' : int, 'elevation' : str, 'dem' : str, 'timezone' : str, 'modification-date' : str } def read_csv(self): start = time.time() df = pd.read_csv( self.input_file, delim_whitespace=False, sep='\t', error_bad_lines=False, skiprows=0, encoding='utf-8', names=self.get_header_row(), dtype=self.get_data_types(), na_values=['none'], usecols=self.get_header_row()) end = time.time() logger.info('Read CSV File (path = {}, elapsed-time = {})'.format(self.input_file, (end - start))) return df
This takes about 30 seconds to read on my Macbook Pro.
Working with the Reader
The following code shows how to access the reader and acquire a pandas Dataframe instance:#!/usr/bin/env python # -*- coding: UTF-8 -*- import codecs import pandas as pd from geonames_reader import GeonamesCountriesTxtFileReader import logging logging.basicConfig(level=logging.DEBUG) logger = logging.getLogger(__name__) in_file = '/Users/craigtrim/data/Data/geonames-2/allCountries.txt' reader = GeonamesCountriesTxtFileReader(in_file) df = reader.read_csv()
All manipulation of the data will occur after this point.
Filtering, Sorting and Writing to Disk
The following code will filter by populations greater than a million and sort the result:df_pop = df[df.population > 1000000].sort(['population', 'name'], ascending=[0, 0]) df_pop.to_csv( out_file, sep='\t', header=[ 'name', 'population', 'alternatenames' ], columns=[ 'name', 'population', 'alternatenames' ], mode='w', encoding='utf-8')
The result is another data frame that is written to disk.
Filtering by Partial Strings
Select code containing a string value:df[df['alternatenames'].str.contains("United States of America|USA")==True]
Note that if you use this variation:
df[df['alternatenames'].str.contains("United States of America")]
Writing to CSV
This function preserves the formatting of the original CSV file:
def to_csv(self, df, out_file): df.to_csv( out_file, delim_whitespace=False, sep='\t', error_bad_lines=False, index=False, # Write row names (index), default=True header=False, # Write out column names, default=True mode='w', # Python write mode, default ‘w’ encoding='utf-8', # defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3. names=self.get_header_row(), dtype=self.get_data_types(), na_values=['none'], usecols=self.get_header_row())
self.get_header_row()
and self.get_data_types()
are defined in GeonamesCountriesTxtFileReader
.Issues
- The "United States" and "United Kingdom" do not appear in the output file.
- On closer inspect, "United Kingdom" is listed as having a population of 0.
- Note that it is possible to iterate through each row in a dataframe like this:
for index, row in df.iterrows(): name = row['name'].lower().strip() if 'united' in name: print name, ", ", row['population']
- I don't know if this is because the actual data is bad, or if the delimiters were not specified correctly, and the "0" is another column value mistakenly assigned to population.
References
- Acquiring Geonames Data
- [StackOverflow] Dataframe Selection by Partial String
No comments:
Post a Comment