TIGER, OSM US & MapRoulette

TIGER #

TIGER is the US Census Bureau's Topologically Integrated Geographic Encoding and Referencing system. The Census Bureau maintains the TIGER database to assist in its various mandated programs, including the decennial US Census. Because the TIGER database is built with public funding, it is by law in the public domain. TIGER contains the locations of nearly every street, highway, railroad, body of water and legal boundary in the US. It is built from a combination of original US Geological Survey and Census Bureau maps, updated with data collected by Census Bureau staff while in the field.

Bulk Import #

"I noticed that someone is importing TIGER line data. Should I keep making changes to the map or should I wait for the TIGER data to show up?"

"The reality is that people have been told for years not to map too much in the US because the TIGER upload will obviate the need for your work. That has kept mappers away."

The TIGER data has long been a tempting target for OSM. Being in the public domain, its use is not restricted by licensing. And it is available electronically, in vector format, so if a suitable conversion utility were available, it could be converted automatically to OSM's format. A bulk import of TIGER data was attempted in 2005, but the initial trials failed to produce quality results and the work was abandoned.

In the spring of 2007, Brandon Martin-Anderson and Dave Hansen undertook a brand new effort, hunting down bugs in the previous conversion and import code, and starting fresh. Martin-Anderson had written TIGER parsing code before, and with some help from other OSM developers worked it into a TIGER-to-OSM conversion script.

Attribute Mapping #

The time-consuming part, he says, was mapping attributes from one form to the other. "People have a lot to say about how various TIGER tags are converted to OSM tags, whether an A-class TIGER road is residential-class OSM road, et cetera. I spent a great deal more time working out the tag conversion with other members of the community than writing software."
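The kind of conversion being debated can be pictured as a lookup table from TIGER road class codes to OSM `highway` values. The following is only a sketch: the pairings in the table are illustrative assumptions, not the mapping the community settled on, and the function name `tiger_to_osm_tags` is hypothetical.

```python
# Illustrative pairings only: the agreed-upon mapping is not given here.
# Keys are the first two characters of a TIGER feature class code (e.g. "A41").
CFCC_PREFIX_TO_HIGHWAY = {
    "A1": "motorway",
    "A2": "primary",
    "A3": "secondary",
    "A4": "residential",
}


def tiger_to_osm_tags(cfcc: str, name: str) -> dict:
    """Build OSM-style tags for one TIGER road record (hypothetical helper)."""
    # Fall back to the generic "road" value when the class is not in the table.
    highway = CFCC_PREFIX_TO_HIGHWAY.get(cfcc[:2], "road")
    # Keep the original class code so the decision can be reviewed later.
    return {"highway": highway, "name": name, "tiger:cfcc": cfcc}
```

Keeping the original TIGER code alongside the converted tag is a design choice that lets mappers revisit a disputed classification without going back to the source files.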

Conversion #

Once all involved were happy with the attribute mapping, Hansen downloaded the entire TIGER data set from the Census Bureau Web site in county-sized chunks, and ran the conversion script on his home computer. The resulting OSM-compatible data set consisted of 379,836,373 objects.

Upload #

Converting the data took several days of constant work, but it still needed to be uploaded to the live OSM server. In a postmortem of the attempted 2005 import, Hansen discovered that some of that effort's problems were the result of trying to import the converted data directly into the database. For reliability, Hansen initially began uploading the newly converted data through JOSM, the client-side application typically used for annotating and uploading GPS traces. This ensured that the TIGER data went through the same API as any other OSM input, averting the breakage associated with the 2005 attempt. It was safer than attempting to bypass the API and alter the database directly.

But this import method was agonizingly slow.

"I've been very painfully uploading the TIGER-generated data through JOSM. At the rate I'm going, it will probably take 5 or 10 years to upload the entire US. I'm uploading one or two counties a day and there are 3,234 counties in the country." - Dave Hansen, OSM-dev mailing list, August 28, 2007
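Hansen's estimate is consistent with simple arithmetic on the figures he quotes. A quick check, assuming a constant upload rate and no downtime (the function name is hypothetical):

```python
COUNTIES = 3234        # county count quoted by Hansen
DAYS_PER_YEAR = 365


def years_to_upload(counties_per_day: float) -> float:
    """Years needed to upload every county at a constant daily rate."""
    return COUNTIES / counties_per_day / DAYS_PER_YEAR


slow = years_to_upload(1)  # one county a day
fast = years_to_upload(2)  # two counties a day
print(f"{fast:.1f} to {slow:.1f} years")  # prints "4.4 to 8.9 years"
```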

Eventually, the OSM team devised a better plan. Hansen transferred the already converted data files to a development machine on the same rack as the OSM map server, and admin Tom Hughes dedicated three of the map server's 12 import daemons solely to the bulk upload.

This improved data import started in early September. Running night and day, seven days a week, the TIGER import should be completed in May or June of 2008. A public web page keeps track of the stats, including the current throughput, percentage of the TIGER data imported, and a list of completed counties.

Although the improved TIGER import is far faster than the old one, there is still a long time to wait before it finishes. Hansen came up with a way to make the wait less painful. He began by sorting the counties in the upload queue by population, so the most populated areas go first. Plus, he takes requests. If you want your county imported next, email Hansen and he will bump it up in the queue.
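The queue ordering described above can be sketched as a single sort. This is a minimal illustration, not Hansen's actual script: the function name `order_queue`, the tuple layout, and the placeholder county names and populations are all assumptions.

```python
def order_queue(counties, requested=()):
    """Order (name, population) pairs for upload.

    Requested counties go first; within each group, larger
    populations are uploaded before smaller ones.
    """
    wanted = set(requested)
    # False sorts before True, so requested counties lead the queue.
    return sorted(counties, key=lambda c: (c[0] not in wanted, -c[1]))


# Placeholder data for illustration only.
queue = order_queue(
    [("County A", 100_000), ("County B", 500_000), ("County C", 50_000)],
    requested=["County C"],
)
print([name for name, _ in queue])
```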

But time is not the only important factor. Shortly after the TIGER import started, it became clear that the database machine itself would run out of disk space within a matter of weeks. Hughes added storage capacity on September 27 and says he is prepared for more complications to crop up along the way.

Quadtiles #

One such complication was the size of the database index. The team had known for a long time that the existing index was inefficient. The database indexed all of its entries by their latitude and longitude, requiring the lookup of thousands of double-precision floating point values for any given geographic area. Once the TIGER import began, the number of index entries shot up dramatically. Luckily, a solution was already in the works. The database switched to a new index based on quadtiles, which divide the globe into discrete tiles. This put far less strain on the server and greatly shortened database lookup times.

Quadtiles recursively split each quadrant of the map into four subquadrants, allowing for better space efficiency by only subdividing those quadrants that require more detail. A quadrant containing only ocean and therefore no roads, for example, would not require subdivision, whereas a metropolitan city center would. The quadtile keys are also shorter: 32 bits (4 bytes), as opposed to 16 bytes for the old lat/long indices (two 8-byte double-precision values). Because of quadtiles' hierarchical nature, geographically close nodes tend to be adjacent in the database index, which improves cache performance.
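One way to build such a 32-bit key is to scale longitude and latitude onto a 65536 x 65536 grid and interleave the bits of the two grid coordinates. The sketch below assumes this fixed-depth, bit-interleaved scheme for simplicity; it is not taken from the OSM server code, and the function name `quadtile_key` is hypothetical.

```python
def quadtile_key(lat: float, lon: float) -> int:
    """Return a 32-bit key: 16 bits of x interleaved with 16 bits of y."""
    # Scale each coordinate onto a 65536 x 65536 grid covering the globe,
    # clamping the upper edge (lat = 90, lon = 180) into the last cell.
    x = min(int((lon + 180.0) * 65536.0 / 360.0), 65535)
    y = min(int((lat + 90.0) * 65536.0 / 180.0), 65535)

    key = 0
    for bit in range(15, -1, -1):
        # Each pair of bits selects one of four subquadrants at the next
        # level down, from the whole globe to the smallest tile.
        key = (key << 1) | ((x >> bit) & 1)
        key = (key << 1) | ((y >> bit) & 1)
    return key


# Two points a few tens of metres apart share most or all of their key bits.
a = quadtile_key(40.7128, -74.0060)
b = quadtile_key(40.7130, -74.0058)
print(f"{a:032b}")
print(f"{b:032b}")
```

Because the high-order bits name the coarse quadrant and the low-order bits refine it, sorting by key keeps most nearby nodes close together in the index, and a lookup for an area becomes a scan over a small number of key ranges.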

MapRoulette #

The TIGER data set's successful import does not mean that the work is finished. Users who have collected their own GPS logs in areas covered by the TIGER maps and uploaded the resulting data report sporadic problems with TIGER's information. Problems include misalignment of roads, missing features, and occasional confusion on features such as cul-de-sacs. Since the TIGER map data was produced from aerial photography, such problems are bound to occur.

An enormous cleanup and improvement effort was needed, yet the US OSM community was quite small in 2012 (fewer than 100 daily active mappers). Something was needed to organize the work. So the idea of MapRoulette was born: a tool to work on small, randomly assigned tasks. The aims were to make a huge mapping effort feel more doable by breaking it up into small tasks, and to make repetitive mapping more fun.

MapRoulette, the open source web-based micro-tasking platform for OSM, was first announced at State of the Map US in 2012 as a tool to fix the many errors introduced by the import of TIGER road data in the United States. After a successful and quick cleanup of over 60,000 common problems found in the TIGER data, it was clear that the idea of a micro-tasking tool was worth developing further.

MapRoulette proved to be a great way to focus the community on a mapping goal. Now anyone can create challenges on MapRoulette. It's useful for mapping parties where you want a smaller goal.

References #

Nathan Willis (October 11, 2007) "OpenStreetMap project imports US government maps" linux.com
Nathan Willis (January 23, 2008) "OpenStreetMap project completes import of United States TIGER data" linux.com
Martijn van Exel (August 19, 2022) "10 years of MapRoulette" State of The Map 2022 - Firenze

last updated: 2022-09-06