You may not have even noticed, but geospatial data has become an indispensable part of our lives. We use maps and GPS trackers almost every day, generating or consuming lots of data with coordinates in one way or another. Leveraging data science to analyze this data is therefore of interest to many individuals and organizations. Is this the case for you?
As Dataiku software engineers, we have recently developed new features that provide our users with a simple and smooth way to get the most out of their geospatial data and we want you to benefit from this experience. In an upcoming blog post (stay tuned!), we will explain how we implemented a geo join as a visual Dataiku recipe. But first things first — in this blog post, we’d like to tell you about everything that we would have liked to know about geospatial analysis and the tools that it requires just before starting our project. Take it as an introduction to this quite complex but interesting world.
Geodata Basics
What do we understand when looking at values like 48.8584° N, 2.2945° E?
We can tell that these are the coordinates of a point and that this point is somewhere in the northern hemisphere, a little east of the Greenwich meridian. With the appropriate map, we could be even more precise and associate those coordinates with the position of the Eiffel Tower in Paris.
But are there any assumptions that we make when we try to find a place represented by these coordinates?
Yes. First of all, to calculate the position of an object on the Earth, we need some approximation of our globe. Generally speaking, the Earth has a pretty complex shape, which can be represented by a geoid. However, since it’s hard to apply math to such a complex object, there are a number of simpler approximations called datums. Some of them are local and better adapted to specific places on Earth, while others work pretty well all around the globe. For example, WGS84 is widely used and maps the Eiffel Tower location to 48.8584° N, 2.2945° E.
Coordinate Systems
The coordinates are obtained using an ellipsoid approximation, so how do we use them on flat maps? This is where projections come into play. Projections turn geographic (spherical or ellipsoidal) coordinate systems into projected coordinate systems (PCS) using mathematical transformations. There are different kinds of PCS depending on how the projection is done, for example:
- Azimuthal
- Conic
- Cylindrical
Depending on the projection, an object’s size or shape may be distorted on the map.
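To make the distortion concrete, here is a minimal sketch that assumes a spherical Mercator projection, which belongs to the cylindrical family. The choice of projection is ours for illustration, and the function name mercator_scale_factor is hypothetical. In that projection, lengths on the map are stretched by a factor of 1 / cos(latitude), so objects look larger the farther they are from the equator.

```python
import math


def mercator_scale_factor(lat_deg: float) -> float:
    """Local length stretch of a spherical Mercator map at a given latitude."""
    return 1.0 / math.cos(math.radians(lat_deg))


# Compare the equator, the latitude of the Eiffel Tower, and 60 degrees north.
for lat in (0.0, 48.8584, 60.0):
    factor = mercator_scale_factor(lat)
    print(f"latitude {lat:7.4f}: lengths stretched by a factor of {factor:.3f}")
```

At 60 degrees of latitude the factor is 2, which means that areas appear roughly four times larger than they would at the equator.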
Spatial Reference Systems
So with datums and projections, there’s a lot of information to keep in mind in order to understand how to use geo data. Luckily, there’s the broader concept of a Spatial Reference System (SRS), which combines information like:
- Coordinate system
- Datum
- Prime meridian
- Projection
- Unit of measurement
An SRS can be tailored to a specific region, like EPSG:27572 for France, or it can cover the whole world. Two widely used examples of global SRS are:
- EPSG:3857
  - Unit of measurement: meters
  - Example: POINT(264041.37 6248507.56)
- EPSG:4326
  - Unit of measurement: degrees
  - Example: POINT(2.3719238 48.844442)
It’s worth saying that both points in the examples above represent the same place on Earth, but since their reference systems are different, the coordinates are also different.
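As a rough check of that claim, the sketch below converts the EPSG:4326 example into EPSG:3857 using the standard spherical Mercator formulas. It assumes the point is given as (longitude, latitude) in degrees and uses the WGS84 semi-major axis as the sphere radius. The function name lonlat_to_web_mercator is ours, not part of Dataiku or any library.

```python
import math

# WGS84 semi-major axis in meters, used by EPSG:3857 as the sphere radius.
EARTH_RADIUS_M = 6378137.0


def lonlat_to_web_mercator(lon_deg: float, lat_deg: float) -> tuple:
    """Convert EPSG:4326 degrees to EPSG:3857 meters (spherical Mercator)."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


# The EPSG:4326 example point: longitude first, then latitude.
x, y = lonlat_to_web_mercator(2.3719238, 48.844442)
print(f"POINT({x:.2f} {y:.2f})")
```

The result should land within about a meter of the EPSG:3857 example point. In practice, a dedicated library would handle such conversions, including datum shifts that this sketch ignores.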
EPSG:4326 is a special case because its coordinates aren’t actually projected: they are raw latitude and longitude values.
Internally, Dataiku uses EPSG:4326 for geospatial operations. However, if your incoming data has a different coordinate system, you can convert it with a prepare recipe in Dataiku.
Storing Spatial Data
Now that it’s clearer what we need when extracting and plotting geo data, the last question is how to store it. Dataiku supports two commonly used formats:
- WKT (Well-Known Text)
- GeoJSON
As explained above, when conducting a geospatial analysis, the metadata is just as important as the coordinates themselves. Without knowing the correct SRS, the same coordinates may point to completely different places on Earth.
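The WKT format is more concise, but it only stores the object type and coordinates. GeoJSON, on the other hand, is more explicit and, in addition to geometry data, can embed a coordinate reference system code (through the "crs" member of the original 2008 specification), meaning that you don’t have to store it elsewhere. Note that the current standard, RFC 7946, removes that member and assumes WGS84 longitude and latitude.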
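To compare the two formats side by side, here is a minimal sketch that turns a 2D WKT point into its GeoJSON equivalent. The helper wkt_point_to_geojson is hypothetical and only handles the simple POINT(x y) form. It leaves out any reference system information, so the SRS still has to be known from elsewhere.

```python
import json
import re

# Matches "POINT(x y)" with optional whitespace; other geometry types are rejected.
_WKT_POINT = re.compile(
    r"\s*POINT\s*\(\s*([^\s()]+)\s+([^\s()]+)\s*\)\s*", flags=re.IGNORECASE
)


def wkt_point_to_geojson(wkt: str) -> dict:
    """Convert a 2D WKT point into a GeoJSON geometry dictionary."""
    match = _WKT_POINT.fullmatch(wkt)
    if match is None:
        raise ValueError(f"not a 2D WKT point: {wkt!r}")
    x, y = float(match.group(1)), float(match.group(2))
    return {"type": "Point", "coordinates": [x, y]}


wkt = "POINT(2.3719238 48.844442)"
geojson = wkt_point_to_geojson(wkt)
print(wkt)                  # the concise WKT form
print(json.dumps(geojson))  # the more explicit GeoJSON form
```

Both outputs describe the same geometry. WKT packs it into a single token-like string, while GeoJSON names the type and the coordinate array explicitly.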
Geospatial Index
Geo joins depend heavily on geospatial indexes. How do they work and why do they make joining operations faster?
There are a number of different types of spatial indexes. One of the most widely used is the R-tree index, and this is the one Dataiku uses when running a geo join recipe with an embedded engine.
The “R” in the index name stands for rectangle, which is what the index is built on. To add a geometry object to the index, we first calculate its bounding box, also called an envelope. Next, these bounding boxes are grouped and larger bounding rectangles are created around each group. Repeating the grouping operation produces a tree structure, which is called an R-tree.
Using bounding boxes instead of actual objects is good for two reasons:
- No matter how complex the geometry is, its bounding box can be represented by only two coordinate pairs.
- Spatial operations are a lot easier on rectangles than on complex shapes.
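Both points can be illustrated with a short sketch. The helpers envelope, union, and boxes_intersect are hypothetical names of ours, and the code shows only the building blocks of an R-tree (leaf boxes and one parent box), not a full index implementation.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def envelope(points: Iterable[Point]) -> Box:
    """Bounding box of a geometry given as a sequence of vertices."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def union(boxes: Iterable[Box]) -> Box:
    """Smallest box covering a group of boxes (a parent node in the tree)."""
    min_xs, min_ys, max_xs, max_ys = zip(*boxes)
    return (min(min_xs), min(min_ys), max(max_xs), max(max_ys))


def boxes_intersect(a: Box, b: Box) -> bool:
    """Rectangle intersection needs only four comparisons."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
pentagon = [(5.0, 5.0), (7.0, 5.0), (8.0, 7.0), (6.0, 9.0), (4.0, 7.0)]

# Each geometry collapses to two coordinate pairs, whatever its vertex count.
leaf_boxes = [envelope(triangle), envelope(pentagon)]
parent_box = union(leaf_boxes)

print(leaf_boxes)
print(parent_box)
print(boxes_intersect(leaf_boxes[0], leaf_boxes[1]))
```

A real R-tree repeats the grouping step over many levels and chooses the groups so that the parent boxes overlap as little as possible.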
Of course, there is a downside: if a spatial condition holds for two bounding boxes, it doesn’t necessarily hold for the two underlying geometries. For example, two geometries whose bounding boxes intersect may not intersect themselves. The R-tree allows us to eliminate a significant number of candidates, but the candidates returned by the index query must still be verified against the original objects.
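This two-step approach is often described as filter and refine. The sketch below illustrates it for a "point within polygon" condition, under simplifying assumptions: the filter step scans a single bounding box instead of querying an actual R-tree, and the refine step uses a basic ray casting test. All function names are ours.

```python
def envelope(polygon):
    """Bounding box (min_x, min_y, max_x, max_y) of a list of vertices."""
    xs, ys = zip(*polygon)
    return (min(xs), min(ys), max(xs), max(ys))


def box_contains(box, point):
    """Cheap filter: is the point inside the bounding box?"""
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


def polygon_contains(polygon, point):
    """Exact refine step: ray casting against every polygon edge."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y):  # the edge crosses the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
points = {"a": (1.0, 1.0), "b": (3.0, 3.0), "c": (6.0, 6.0)}

box = envelope(triangle)
# Filter: the bounding box discards "c" but keeps the false positive "b".
candidates = [name for name, p in points.items() if box_contains(box, p)]
# Refine: the exact test on the original geometry removes "b".
matches = [name for name in candidates if polygon_contains(triangle, points[name])]

print(candidates)
print(matches)
```

Point "b" lies inside the bounding box but outside the triangle, which is exactly the kind of false positive the verification step is there to catch.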
What You Should Have Learned So Far
In this article, we summarized what we learned while preparing to implement a new Dataiku geospatial feature. The goal of this feature is to apply a join operation between geospatial datasets during data preparation. We started by improving our understanding of how a point on Earth can be fully located using standard coordinate systems. We then introduced the formats available to store such objects and, finally, shared a useful data structure to query those objects efficiently. We hope that by reading this article, you gained a better understanding of the building blocks necessary for geospatial data handling. A second article will dive into the concrete implementation of this feature in Dataiku.