Tuesday, April 7, 2009

Loading data from the NHTS

The NHTS data is available in a couple of formats; my preference is to work with CSV. In Python, it is easy to load a CSV table into an object in memory. In CSV form, the NHTS data is organized into 4 files, each containing the data for one table. Columns that are common to several files allow information to be correlated between the tables. A CSV file is typically organized as a header row followed by data rows, and the NHTS files follow this layout:
HOUSEID,VHCASEID,VEHID, ... ,DRVRCNT,MSAPOP
010000018,01000001801,01, ... ,2,7608070
. . .
915637259,91563725904,04, ... ,2,-1
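To make the header-row/data-row layout concrete, here is a minimal sketch that pairs one header line with one data line. It uses only the columns visible in the sample (the elided ones are simply left out), and the variable names are my own for illustration. It also shows one reason to keep the values as strings: converting an ID such as HOUSEID to a number would drop its leading zero.

```python
# Only the columns visible in the sample are used; the elided ones are left out.
header_line = "HOUSEID,VHCASEID,VEHID,DRVRCNT,MSAPOP\n"
data_line = "010000018,01000001801,01,2,7608070\n"

# Remove the trailing newline, then split each line on commas.
headers = header_line.strip().split(",")
values = data_line.strip().split(",")

# Pair each column name with the value in the same position.
record = dict(zip(headers, values))

# IDs are best kept as strings so the leading zero survives;
# counts can be converted to numbers when they are needed.
house_id = record["HOUSEID"]
driver_count = int(record["DRVRCNT"])

print(record)
print(house_id, driver_count)
```

Whether a given column should be treated as text or as a number is a per-column decision; the sketch only assumes that DRVRCNT holds an integer, as it does in the sample row.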
One way to get this data into Python is with the built-in file object: open the file for read access, then iterate over it to load one line at a time. There are several options for organizing the data. For this data I used a Python dictionary that maps each header to a list holding the values of that column. The simplest implementation of this function is only a few lines:
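def loadTable(filename,maxRows=1e9,keepList=[],ignoreList=[]):
    # maxRows, keepList and ignoreList are not used in this minimal version
    table = {}
    count = 0
    with open(filename,'r') as f:
        for line in f:
            # remove surrounding whitespace, including the trailing newline
            line = line.strip()
            if count == 0:
                # first row: the column names become the dictionary keys
                headers = line.split(',')
                for header in headers:
                    table[header]=[]
            else:
                # data row: append each value to the list for its column
                row = line.split(',')
                for idx in range(0,len(row)):
                    table[headers[idx]].append(row[idx])
            count += 1
    return table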
While this code will work, it is not robust to a host of problems. The wrong file name can be supplied, the system may not have enough memory, or a row may be ill-formed. Additionally, the function has no docstring, so help() has nothing to say about it. You also might not want all of the data in the file to be loaded, since certain columns can safely be omitted. To correct these issues, the following function will be used to load the data tables:
def loadTable(filename,maxRows=1e9,keepList=[],ignoreList=[]):
    '''
    This function will load up to maxRows from the CSV file, with the
    first row specifying the names of the columns, into a dictionary
    where each key specifies a column.

    To minimize memory use, there are two optional lists of strings
    which limit the columns loaded from the file.

    If keepList==[] and ignoreList==[] then all columns from the file
    will be loaded.
    If keepList!=[], then only those columns in keepList and not in
    the ignoreList are loaded.
    If keepList==[] then all columns which match names in ignoreList
    are omitted from the returned table.
    '''
    table = {}
    try:
        with open(filename,'r') as f:
            count = 0
            for line in f:
                # remove surrounding whitespace, including the trailing newline
                line = line.strip()
                if count == 0:
                    # header row: create a key only for the columns to be kept
                    headers = line.split(',')
                    for header in headers:
                        if ((header in keepList or len(keepList)==0) and
                                not (header in ignoreList)):
                            table[header]=[]
                else:
                    row = line.split(',')
                    for idx in range(0,len(row)):
                        if headers[idx] in table:
                            table[headers[idx]].append(row[idx])
                count += 1
                # count includes the header row
                if count==maxRows:
                    print('Terminated load - maximum number of rows exceeded')
                    return table
        return table
    except IOError:
        print('IOError: File can not be opened...')
        return {}
    except MemoryError:
        # return whatever was loaded before memory ran out
        print('MemoryError: ran out of working memory...')
        return table
    except Exception:
        # for example, a row with more fields than the header row
        print('Unknown error when loading table ....')
        return {}
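As an alternative, the same dictionary-of-columns idea can be sketched on top of the standard csv module, which does the line splitting and also copes with quoted fields. This is a Python 3 sketch under my own assumptions, not the function used in this post: the name load_columns and its parameters are hypothetical, and skipping ill-formed rows (rather than aborting the load) is a design choice made here only for illustration. The small file is built from the sample rows above, with the elided columns left out and one deliberately broken row added.

```python
import csv
import os
import tempfile


def load_columns(filename, max_rows=None, keep=(), ignore=()):
    """Load a CSV file into a dict mapping column name -> list of strings.

    Hypothetical helper: it follows the keep/ignore rules described in the
    post, but lets csv.reader parse each line.
    """
    table = {}
    with open(filename, newline="") as f:
        reader = csv.reader(f)
        headers = next(reader, [])  # empty file -> no columns
        # Decide once which column positions are wanted.
        wanted = [
            (idx, name)
            for idx, name in enumerate(headers)
            if (not keep or name in keep) and name not in ignore
        ]
        for _, name in wanted:
            table[name] = []
        for row_number, row in enumerate(reader):
            if max_rows is not None and row_number >= max_rows:
                break
            if len(row) != len(headers):
                # Assumption: skip ill-formed rows instead of failing.
                continue
            for idx, name in wanted:
                table[name].append(row[idx])
    return table


# Write a tiny file to a temporary location, load it, then clean up.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as tmp:
    tmp.write("HOUSEID,VEHID,DRVRCNT\n")
    tmp.write("010000018,01,2\n")
    tmp.write("bad,row\n")  # ill-formed: too few fields
    tmp.write("915637259,04,2\n")
    sample_path = tmp.name

vehicles = load_columns(sample_path, ignore=("DRVRCNT",))
os.remove(sample_path)

# Group vehicle IDs by the shared HOUSEID column, the kind of common
# column that links one table to another.
by_house = {}
for house, vehicle in zip(vehicles["HOUSEID"], vehicles["VEHID"]):
    by_house.setdefault(house, []).append(vehicle)

print(vehicles)
print(by_house)
```

Here max_rows counts data rows only, which differs from the count in loadTable, where the header row is included.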
