Mapped on a five-metre grid, a single raster image of Great Britain contains 24.2 billion pixels. Raster datasets make big geographic information easy to share and quick to visualise. But what if you only need one pixel?
In this post I introduce Addresscloud's new back-end raster service, built to improve the speed of location intelligence queries using a new data format, "Cloud Optimized GeoTIFFs". The new service allows us to consume big data from our partners faster, whilst maintaining Addresscloud's sub-second performance.
Using open source GIS, we implemented a new, web-first data format for our rasters and built a serverless hosting application using Amazon Web Services. Our client-facing intelligence applications are now able to query multi-gigabyte data-sets for values at a single location with an average response time of 100 milliseconds or better.
A New Raster Service
Building Scale, Nationwide
Addresscloud provides building-level intelligence across multiple countries, encompassing 17 different data-sets, each containing up to 50 attributes per location. Data-sets with discrete geometries (for example buildings and trees) are provided in vector formats, whereas continuous data such as flood models are raster. With sub-building resolutions and country-scale coverage, the raster data are often significantly larger than the vector data, with individual files measuring up to 200 gigabytes when uncompressed.
Previously, raster data-sets were loaded into our PostgreSQL database using PostGIS. This database, including additional vector layers, formed our back-end intelligence data-store. Whilst this approach worked well for the most part, we observed two bottlenecks in the system:
- The speed of querying large raster data was pushing response times above our required latency
- Processes to load the rasters into the database often took more than 24 hours
The challenge therefore was how to add increasing numbers of large rasters to Addresscloud without sacrificing application speed or increasing data-refresh times.
Additionally, our search for an alternative solution was swayed by the PostGIS maintainers' decision to remove raster support from the core PostGIS library. Whilst the raster functionality still exists, this decision created a breaking change to the core of PostGIS. As a result, if we kept our rasters in PostGIS we would need to design and implement a new upgrade pathway for future releases of our production database.
I first heard about Cloud Optimized GeoTIFFs (COGs) at the 2018 Free and Open Source Software for Geospatial (FOSS4G) conference in Tanzania, during Alex Leith's presentation on OpenDataCube and in follow-up discussions with the OpenDroneMap team.
The COG specification is a backwards-compatible implementation of the existing GeoTIFF format. COGs are cloud ready in that they enable large rasters to be stored in a single file and support access via the web. Critically, COGs support querying areas of interest, so that users don't have to download the whole image (potentially gigabytes of data) to get values for one location.
Thanks to first-class support for COGs in the rasterio library, COGs can be accessed and queried remotely using a simple Python script. One of the most impressive features is rasterio's native support for the AWS S3 protocol, meaning that you can open and manipulate rasters directly from the cloud:
import rasterio
src = rasterio.open('s3://bucket/cog.tif')
Check out Sean Gillies' IPython Notebook, which dives into the technical details of rasterio's support for the format.
Due to their size many of our rasters cannot fit into an array in memory and so can't be converted to COGs using rasterio. As an alternative the free and open source Geospatial Data Abstraction Library (GDAL) supports more memory-efficient COG creation:
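As an illustration only, one widely used GDAL recipe first builds overviews with gdaladdo and then rewrites the file with gdal_translate as a tiled GeoTIFF that carries those overviews along. The sketch below just assembles and runs those two commands from Python. The function names, file names, resampling method, overview levels and compression settings are all assumptions made for the example, not the settings used for our production rasters.

```python
import subprocess


def cog_commands(src, dst, levels=(2, 4, 8, 16)):
    """Return the GDAL command lines for a two-step COG conversion.

    Every default here is an illustrative assumption.
    """
    # Step 1: add internal overviews (reduced-resolution copies) to the
    # source GeoTIFF. Note that gdaladdo modifies the source file in place.
    add_overviews = ["gdaladdo", "-r", "nearest", src] + [str(n) for n in levels]
    # Step 2: rewrite the file as a tiled, compressed GeoTIFF, copying the
    # overviews across. BIGTIFF=YES allows outputs larger than 4 GB.
    translate = [
        "gdal_translate", src, dst,
        "-co", "TILED=YES",
        "-co", "COPY_SRC_OVERVIEWS=YES",
        "-co", "COMPRESS=DEFLATE",
        "-co", "BIGTIFF=YES",
    ]
    return [add_overviews, translate]


def make_cog(src, dst):
    """Run the conversion; needs the GDAL command-line tools on the PATH."""
    for command in cog_commands(src, dst):
        subprocess.run(command, check=True)
```

Calling make_cog('flood.tif', 'flood_cog.tif') (hypothetical file names) would run both steps in order. GDAL 3.1 and later also include a dedicated COG output driver (gdal_translate -of COG), which may be a simpler route where it is available.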
To build our new data-store we ran this script on a large AWS EC2 instance against our raster data-sets and uploaded the resulting COG files to an AWS S3 bucket. Depending on the input file size the COG creation completed within a number of hours - a big improvement on our previous PostGIS raster loading process!
With COGs available in an S3 bucket, we were able to provision a service to query the rasters using an AWS Lambda function. In this use-case a Lambda-based architecture provides two distinct advantages:
- Lambda functions are "serverless" - there are no servers for us to build or run, and the service can scale seamlessly with demand
- Our code can be co-located next to our data-store, making use of AWS' private fibre that connects their data-centres in each availability zone (i.e. rasterio can get the data fast)
Below is an example Lambda function to query pixel values for a given latitude and longitude. The function opens the raster file in the S3 bucket, reads the pixel values for the specified window and returns the result as JSON.
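A minimal sketch of such a function might look like the following. It is not the production code: the bucket path, the event keys ("lat" and "lon"), the response shape and the helper name read_pixel are assumptions for illustration, and the rasterio imports are deferred into the handler so that the pixel lookup can be exercised without rasterio installed.

```python
import json

# Hypothetical location; substitute the bucket and key of your own COG.
COG_URL = "s3://bucket/cog.tif"


def read_pixel(dataset, x, y, band=1):
    """Return the value of the pixel containing (x, y), given in the
    dataset's own coordinate system, or None if the point is off the raster."""
    row, col = dataset.index(x, y)
    if not (0 <= row < dataset.height and 0 <= col < dataset.width):
        return None
    # A 1x1 window, expressed as ((row_start, row_stop), (col_start, col_stop)),
    # so only the data around this pixel has to be fetched.
    window = ((row, row + 1), (col, col + 1))
    values = dataset.read(band, window=window)
    return float(values[0][0])


def handler(event, context):
    """Lambda entry point; assumes the event carries 'lat' and 'lon'."""
    import rasterio
    from rasterio.crs import CRS
    from rasterio.warp import transform

    lon, lat = float(event["lon"]), float(event["lat"])
    with rasterio.open(COG_URL) as src:
        # Reproject the WGS84 longitude/latitude into the raster's CRS.
        xs, ys = transform(CRS.from_epsg(4326), src.crs, [lon], [lat])
        value = read_pixel(src, xs[0], ys[0])
    body = json.dumps({"lat": lat, "lon": lon, "value": value})
    return {"statusCode": 200, "body": body}
```

This sketch does not handle nodata values or multiple bands, both of which a real service would need to consider.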
Speeding Things Up
In the above example, rasterio opens the COG file on every invocation of the Lambda function. Over a sample of 100 requests, the function's average response time was 868 milliseconds when reading pixels from a raster covering Great Britain. We can speed this up by taking advantage of the way Lambda functions operate. The container used to run the function is kept online for 5 to 15 minutes after processing is complete (this makes subsequent requests quicker, as a new container doesn't need to be provisioned). If we move the rasterio open statement outside the handler function, the file object it creates survives while the container is in this "warm" state and can be reused for subsequent requests too.
Using this strategy it is possible to effectively cache open file objects, meaning that the latency for subsequent requests to the function reduces by ~90%, to an average of 74 ms. By including our new Lambda in our internal monitoring service to keep it in a "warm" state, we can ensure that the majority of calls will return with similarly low latency. The chart below shows the difference in response times for 100 calls using the two strategies, including a number of "cold starts" when more than one function is initialized to cope with the surge in demand.
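The idea can be sketched as a small module-level cache. This is a lazy variant of "moving the open statement outside the function" rather than the exact code we run: the names get_dataset and _OPEN_DATASETS, the bucket path and the event keys ("x" and "y", assumed to be in the raster's own coordinate system) are all hypothetical.

```python
# Module-level state is created once per container and survives between
# invocations for as long as the container stays warm.
_OPEN_DATASETS = {}


def get_dataset(url, opener=None):
    """Return an open dataset for url, opening it only on first use."""
    if url not in _OPEN_DATASETS:
        if opener is None:
            import rasterio  # deferred so the cache logic can be tested alone
            opener = rasterio.open
        _OPEN_DATASETS[url] = opener(url)
    return _OPEN_DATASETS[url]


def handler(event, context):
    # Hypothetical location; only the first call in a container pays the
    # cost of opening the file.
    src = get_dataset("s3://bucket/cog.tif")
    row, col = src.index(float(event["x"]), float(event["y"]))
    values = src.read(1, window=((row, row + 1), (col, col + 1)))
    return {"value": float(values[0][0])}
```

The dataset is deliberately never closed here; it is simply discarded when the container is recycled.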
The final step to provision access to the query functions was to create an API Gateway instance in the AWS environment. This internal API is then used by Addresscloud services to perform lookups on raster intelligence data. As the load on the API increases the Lambda service scales automatically to handle incoming requests, helping address our challenge of keeping latency low throughout our infrastructure.
The raster service is currently in testing and will be rolled out internally to Addresscloud services in the coming weeks. COGs represent an exciting contribution to open geospatial data standards, with lots of room for new services and features to be built on top of scalable cloud architectures.
Using COGs and a cloud application we've met our challenges of reducing system latency and improving data-refresh time. The read capacity of an S3 bucket object is extremely high and we'll be load testing this as part of our roll-out to see how fast the system is in production. I look forward to posting updates and reporting on how the new raster service is performing in the near future!