Fire Engine Callouts - Part 1: from CSV to GeoJSON

22 Oct 2013 Edward Mac Gillavry

We recently created a visualisation of fire engine callouts in the Amsterdam-Amstelland area between January 2006 and September 2010. The data behind the visualisation was first released as a CSV file. GeoJSON is gaining traction as a format to incorporate geographic data into web applications. This lab experiment shows how we managed to convert the data.

Previous work

Since the fire brigade of Amsterdam-Amstelland first released the data for the Apps for Amsterdam 2011 contest, we are certainly not the first to crunch that data set. At the time, the people at Hack de Overheid already took a first stab at making sense of the original CSV file and now host a modified version.

Incident Id,Nr Incident,Dtg Start Incident,Dtg Einde Incident,Brw Melding Cl,Brw Melding Cl1,Brw Melding Cl2,Naam Locatie1,Postcode4pp,Plaats Naam,Gemeente Naam,T X Coord Loc,T Y Coord Loc
622.659.209,1,01-01-2006 00:00,01-01-2006 00:25,Brand,Buitenbrand              brand,Overig                  buitbr,SINT WILLIBRORDUSDWARSSTRAAT,,AMSTERDAM,AMSTERDAM,122.094,485.317
622.660.251,2,01-01-2006 00:03,01-01-2006 00:25,Brand,Binnenbrand              brand,.Wonen                   binbr,BERNARD LODERSTRAAT,1063,AMSTERDAM,AMSTERDAM,116.315,488.262
...
641.256.091,11.519,15-09-2010 09:54,15-09-2010 11:14,Dienstverlening,Algemeen                dienst,Ass. Politie          Algemeen,BOOMSTRAAT,1015,AMSTERDAM,AMSTERDAM,120.725,488.028
641.255.950,11.521,15-09-2010 10:39,15-09-2010 11:13,Dienstverlening,Burger                  Dienst,Buitensluiting          Burger,VAN DER HOOPLAAN,1185,AMSTELVEEN,AMSTELVEEN,118.681,478.956

Having uncompressed that ZIP file and opening the CSV file in a text editor, we reckoned we could use the following columns in the data set for our visualisation:

  • Incident Id (unique value to identify an incident)
  • Brw Melding Cl (type of incident)
  • Dtg Start Incident (start time)
  • T X Coord Loc (X coordinate)
  • T Y Coord Loc (Y coordinate)

So, we are all set? Not quite! We noticed not only that most column headers contain spaces, but also that some of the column values are formatted in the Dutch locale:

  1. All numeric values contain a thousands separator (in the Dutch locale a full stop)
  2. The timestamp information is formatted dd-mm-yyyy hh:mm
  3. The coordinates are in the Dutch coordinate reference system: Rijksdriehoekstelsel (RD)

That's all good and well for human beings, but most software won't be able to understand this formatting straight away. So, let's get to work!

Reformatting using sed and OGR

Fortunately, the full stops in the data sets are only there to act as a thousands separator. As they don't play another meaningful role, let's simply strip them from the CSV file. Instead of going through the data set line by line, we used the Unix text processing utility sed to get rid of the full stops in one go.

sed 's/\.//g' BrwAAincidenten_reshaped.csv > calloutscleaned.csv

Now that's our first issue related to the Dutch locale tackled. Look at that: no full stops anymore!

Incident Id,Nr Incident,Dtg Start Incident,Dtg Einde Incident,Brw Melding Cl,Brw Melding Cl1,Brw Melding Cl2,Naam Locatie1,Postcode4pp,Plaats Naam,Gemeente Naam,T X Coord Loc,T Y Coord Loc
622659.209,1,01-01-2006 00:00,01-01-2006 00:25,Brand,Buitenbrand              brand,Overig                  buitbr,SINT WILLIBRORDUSDWARSSTRAAT,,AMSTERDAM,AMSTERDAM,122094,485317
622660.251,2,01-01-2006 00:03,01-01-2006 00:25,Brand,Binnenbrand              brand,.Wonen                   binbr,BERNARD LODERSTRAAT,1063,AMSTERDAM,AMSTERDAM,116.315,488.262
...
641256091,11519,15-09-2010 09:54,15-09-2010 11:14,Dienstverlening,Algemeen                dienst,Ass. Politie          Algemeen,BOOMSTRAAT,1015,AMSTERDAM,AMSTERDAM,120725,488028
641255950,11521,15-09-2010 10:39,15-09-2010 11:13,Dienstverlening,Burger                  Dienst,Buitensluiting          Burger,VAN DER HOOPLAAN,1185,AMSTELVEEN,AMSTELVEEN,118681,478956

Reformatting the timestamps doesn't require a line by line scan, but actually has to address specific columns in the data. Also, we would like to reproject the coordinates from the Dutch coordinate reference system to WGS 84, the default coordinate reference system for GeoJSON. This is a specific scenario, where we can use the full might of the OGR Simple Feature Library that's part of GDAL, the Geospatial Data Abstraction Library:

  1. OGR reads and writes primarily non-spatial tabular data stored in CSV files
  2. OGR extracts spatial information (points) from a CSV file which has columns for the X and Y coordinates, through the use of the VRT driver
  3. OGR executes queries provided in the -sql flag
  4. OGR transforms between coordinate reference systems provided in the -s_srs and -t_srs flags

We first create calloutscleaned.vrt, a VRT file to derive a spatial layer from a flat table with coordinates in attribute columns. We also use the VRT file to associate the Dutch coordinate system information with the coordinates in the CSV file:

<OGRVRTDataSource>
    <OGRVRTLayer name="calloutscleaned">
        <SrcDataSource>calloutscleaned.csv</SrcDataSource>
        <GeometryType>wkbPoint</GeometryType>
        <GeometryField encoding="PointFromColumns" x="T X Coord Loc" y="T Y Coord Loc"/>
        <LayerSRS>+proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +towgs84=565.2369,50.0087,465.658,-0.406857330322398,0.350732676542563,-1.8703473836068,4.0812 +no_defs</LayerSRS>
    </OGRVRTLayer>
</OGRVRTDataSource>

The VRT file now contains information about the layer name (<OGRVRTLayer>) and the name of the CSV file (<SrcDataSource>). The <GeometryType> tag indicates that the data is related to point locations. The <GeometryField> tags tells OGR which columns in the CSV file contain the coordinate information. Finally, the <LayerSRS> tag contains a Well-Known Text (WKT) string with all the parameters that describe the Dutch coordinate reference system.

With the VRT file calloutscleaned.vrt in place we can now address the other issues related to the Dutch locale in one go. Here's the one-liner we used that takes the VRT file and outputs a GeoJSON file. Don't worry, we'll dissect this one-liner in detail next!

ogr2ogr -f GeoJSON -t_srs EPSG:4326 -sql "SELECT 'Incident Id' AS cid, 'Brw Melding Cl' AS class, CONCAT(SUBSTR('Dtg Start Incident',7,4), '-', SUBSTR('Dtg Start Incident',4,2),'-',SUBSTR('Dtg Start Incident',1,2),SUBSTR('Dtg Start Incident',11,6)) AS starttime FROM calloutscleaned WHERE ('T X Coord Loc' != '') AND (CAST('T X Coord Loc' AS integer) > 0) AND (CAST('T X Coord Loc' AS integer) < 300000) AND (CAST('T Y Coord Loc' AS integer) > 289000) AND (CAST('T Y Coord Loc' AS integer) < 629000)" callouts.geojson calloutscleaned.vrt -lco COORDINATE_PRECISION=6 -skipfailures
-f
This tells OGR to output the data as GeoJSON.
-t_srs
This tells OGR to transform the coordinates to WGS 84. EPSG codes are short codes to identify coordinate reference systems. The code 4326 refers to WGS 84.
-sql
That's the full SQL statement to tell OGR how to format the various columns. We rename the columns that contain spaces using AS and reformat the timestamp information to the yyyy-mm-dd hh:mm format (SQL 92 standard for TIMESTAMP data type). Finally, we check whether the coordinates actually are within the extent of the Dutch coordinate reference system. As a result, we only kept 62415 of the total 63016 incidents.
-lco
This is a layer creation option, specific to the GeoJSON file format that tells OGR how many decimals of the coordinate value have to be output. This option reduced the size of the output file by an extra megabyte.

GeoJSON on GitHub

The GeoJSON file callouts.geojson now contains the columns we identified previously in a more software-friendly format:

{
    "type": "FeatureCollection",
    "crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
    "features": [
        { "type": "Feature", "properties": { "cid": "622659209", "class": "Brand", "starttime": "2006-01-01 00:00" }, "geometry": { "type": "Point", "coordinates": [ 4.904198, 52.354767 ] } },
        { "type": "Feature", "properties": { "cid": "622660251", "class": "Brand", "starttime": "2006-01-01 00:03" }, "geometry": { "type": "Point", "coordinates": [ 4.819037, 52.380858 ] } },
        ...
        { "type": "Feature", "properties": { "cid": "641256091", "class": "Dienstverlening", "starttime": "2010-09-15 09:54" }, "geometry": { "type": "Point", "coordinates": [ 4.883828, 52.379048 ] } }
        { "type": "Feature", "properties": { "cid": "641255950", "class": "Dienstverlening", "starttime": "2010-09-15 10:39" }, "geometry": { "type": "Point", "coordinates": [ 4.854789, 52.297384 ] } },
    ]
}

The GeoJSON file is also available on GitHub. Actually, GitHub supports rendering GeoJSON files within repositories. Also, GitHub automatically clusters nearby markers at higher zoom levels.

Since our file exceeds the 10Mb file size limit for this feature, we created a few extra GeoJSON files that contain all incidents for a given year. The -sql flag in the ogr2ogr command was extended to include an additional WHERE clause that specifies the year. We now have 5 GeoJSON files that are all well below the 10Mb file size limit and are automatically mapped within GitHub!

Next time you publish your data under an open licence, give it some extra thought. Use well-known units (e.g. longitude and latitude for locations) and use standard formatting for the different data types (e.g. timestamps)! This gives developers some data they can start with right away, so they have more time to create some amazing applications.