DevOps: Python: Pandas and Geonames

Introduction

Use of Python pandas library to read and process Geonames open data.

Required

Download allCountries.zip.

The schema and data set acquisition is discussed in more depth in this blog post.

Python and the pandas library installed.

The data format is tab-delimited text in utf8 encoding.

The following code will read

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import time
import pandas as pd 

import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

class GeonamesCountriesTxtFileReader(object):
 """
  A GeonamesCountriesTxtFileReader 
 """
 def __init__(self, input_file):
  """Return a GeonamesCountriesTxtFileReader object"""   
  self.input_file = input_file

 def get_header_row(self):
  return [ 
   'geonameid',
   'name',
   'asciiname',
   'alternatenames',
   'latitute',
   'longitude',
   'feature-class',
   'feature-code',
   'country-code',
   'alt-country-codes',
   'admin-code-1',
   'admin-code-2',
   'admin-code-3',
   'admin-code-4',
   'population',
   'elevation',
   'dem',
   'timezone',
   'modification-date' ]

 def get_data_types(self):
  return { 
   'geonameid' : int,
   'name' : str,
   'asciiname' : str,
   'alternatenames' : str,
   'latitute' : float,
   'longitude' : float,
   'feature-class' : str,
   'feature-code' : str,
   'country-code' : str,
   'alt-country-codes' : str,
   'admin-code-1' : str,
   'admin-code-2' : str,
   'admin-code-3' : str,
   'admin-code-4' : str,
   'population' : int,
   'elevation' : str,
   'dem' : str,
   'timezone' : str,
   'modification-date' : str }

 def read_csv(self):
  start = time.time()
  df = pd.read_csv(
   self.input_file,
   delim_whitespace=False,
   sep='\t',
   error_bad_lines=False,
   skiprows=0,
   encoding='utf-8',
   names=self.get_header_row(),
   dtype=self.get_data_types(),
   na_values=['none'],
   usecols=self.get_header_row())
  
  end = time.time()
  logger.info('Read CSV File (path = {}, elapsed-time = {})'.format(self.input_file, (end - start)))
  
  return df

This takes about 30 seconds to read on my Macbook Pro.

Working with the Reader

The following code shows how to access the reader and acquire a pandas Dataframe instance:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import codecs
import pandas as pd 

from geonames_reader import GeonamesCountriesTxtFileReader

import logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger(__name__)

in_file = '/Users/craigtrim/data/Data/geonames-2/allCountries.txt'

reader = GeonamesCountriesTxtFileReader(in_file)
df = reader.read_csv()

All manipulation of the data will occur after this point.

Filtering, Sorting and Writing to Disk

The following code will filter by populations greater than a million and sort the result:

df_pop = df[df.population > 1000000].sort(['population', 'name'], ascending=[0, 0])

df_pop.to_csv(
 out_file, 
 sep='\t', 
 header=[ 'name', 'population', 'alternatenames' ],
 columns=[ 'name', 'population', 'alternatenames' ],
 mode='w',
 encoding='utf-8')

The result is another data frame that is written to disk.

Filtering by Partial Strings

Select code containing a string value:

df[df['alternatenames'].str.contains("United States of America|USA")==True]

Note that if you use this variation:

df[df['alternatenames'].str.contains("United States of America")]

You will likely end up with this error 'cannot index with vector containing NA / NaN values'.

Writing to CSV

This function preserves the formatting of the original CSV file:

 def to_csv(self, df, out_file):
  df.to_csv(
   out_file, 
   delim_whitespace=False,
   sep='\t', 
   error_bad_lines=False,
   index=False,     # Write row names (index), default=True
   header=False,     # Write out column names, default=True
   mode='w',      # Python write mode, default ‘w’
   encoding='utf-8',    # defaults to ‘ascii’ on Python 2 and ‘utf-8’ on Python 3.
   names=self.get_header_row(),
   dtype=self.get_data_types(),
   na_values=['none'],
   usecols=self.get_header_row())

Note: The function calls to self.get_header_row() and self.get_data_types() are defined in GeonamesCountriesTxtFileReader.

Issues

The "United States" and "United Kingdom" do not appear in the output file.

On closer inspect, "United Kingdom" is listed as having a population of 0.

Note that it is possible to iterate through each row in a dataframe like this:

for index, row in df.iterrows():
 name = row['name'].lower().strip()
 if 'united' in name:
  print name, ", ", row['population']

I don't know if this is because the actual data is bad, or if the delimiters were not specified correctly, and the "0" is another column value mistakenly assigned to population.

References

Acquiring Geonames Data
[StackOverflow] Dataframe Selection by Partial String

DevOps

Thursday, March 17, 2016

Python: Pandas and Geonames