Plotting maps with European data in Python: Part I
This multi-part tutorial is aimed at data scientists or programmers skilled with Python but lacking experience with visualizing geographic data. It is an introduction to the topic, with examples shown on European geography. It uses primarily data sources providing data for all European countries. The maps will be shown for Germany, but the code can be easily adapted for any other European country. And I am going to use JupyterLab 3.0.0.
1) Setup the environment
Before we start, we need to prepare our virtual environment. In the first part of the tutorial, we will need these packages:
- matplotlib
- pandas
- geopandas: an extension of pandas allowing operations on geometries
- contextily: a package for loading background maps (map tiles) for our pictures
I use Anaconda for maintaining the virtual environments, we set up one:
conda create -n geo_tutorial
conda activate geo_tutorial# make the kernel available in JupyterLab
conda install ipykernel
ipython kernel install user-name=geo_tutorial_kernelconda install geopandas -c conda-forge
conda install -c conda-forge matplotlib
conda install -c conda-forge descartes
Restart JupyterLab server, create a new notebook and pick the geo_tutorial_kernel.
2) Download the data
Go to the Eurostat website and from there download one file with the geospatial vector data. Pick scale — the lower number, the more detailed map. You can choose from multiple formats. Shapefile (SHP), TopoJSON and GeoJSON are all common geospatial data formats. I have picked TopoJSON since it is a concise format and 1:1 million scale (with maximum details). I will use this file throughout this tutorial.
Unpack the zip into a data folder. In my case, it resides in the data/ref-nuts-2021-01m
folder.
3) Import packages
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
import contextily as ctx
4) Load data
You can see that the data folder contains many files. The naming format is:
NUTS_AA_BBB_YYYY_CCCC_LEVL_D.json where
- NUTS is an abbreviation of Nomenclature of Territorial Units for Statistics
- AA is either BN (boundary), LB (label) or RG (region)
- BBB denotes the scale (in our case 01M)
- YYYY is year when the map was produced
- CCCC describes the Coordinate Reference System
- D is a digit standing for the NUTS level (0 for countries, 1 for regions etc.)
Now, let’s see what the differences between boundaries, labels and regions.
data_dir = “data/ref-nuts-2021–01m/”
path_rg = data_dir + "NUTS_RG_01M_2021_3035_LEVL_0.json"
gdf_rg = gpd.read_file(path_rg)path_bn = data_dir + "NUTS_BN_01M_2021_3035_LEVL_0.json"
gdf_bn = gpd.read_file(path_bn)path_lb = data_dir + "NUTS_LB_2021_3035_LEVL_0.json"
gdf_lb = gpd.read_file(path_lb)
As we can see, the TopoJSON is loaded with the read_file() function from the GeoPandas package. The object returned is a GeoDataFrame, very similar to a well known Pandas DataFrame.
5) Plot a map of Europe
ax = gdf_rg.plot(figsize=(20,15), color="gray")
gdf_bn.plot(figsize=(20,15), ax=ax, color="red")
gdf_lb.plot(figsize=(20,15), ax=ax, color="yellow")
We can see that the boundary file (red) contains border coordinates, in this case borders between countries because we have loaded level 0. The region file (gray) describes areas (countries) and the label file represents single point per area (useful for placing a label there).
We are going to use only the regions file, we can easily draw also borders with it and find centroids of areas. Now we only need to understand the projections.
6) Understand the projections
“In cartography, a map projection is a way to flatten a globe’s surface into a plane in order to make a map.” Wikipedia
For locating a certain point on the Earth’s surface, we need a Coordinate Reference System (CRS). A commonly used CRS, e.g. in GPS, is EPSG 4326 and the coordinates are given in degrees.
Many times, we need to work with geography visualized in two dimensions (a map). In such a case, we have to project the globe’s surface into a plane. Such an operation produces distortions one way or another. Either the angles in the plane do not correspond to those on the globe or the area differs.
In this tutorial, we will work with two projections:
- EPSG 3857 — a Spherical Mercator projection CRS used by web services such as Google or OpenStreetMap
- EPSG 3035 — a Lambert Azimuthal Equal Area (LAEA) projection, is commonly used for Pan-European data analysis because it accurately represents area.
Let’ see the difference between those three CRS’. TopoJSON does not contain information about the coordinate reference system. We have to set this information manually. And if needed, reproject to another CRS:
# remove France from the dataset to make the picture more concise
# (due to the France's overseas territories) - sorry, France :-/
gdf0 = gdf_rg[gdf_rg.CNTR_CODE != "FR"]# the TopoJSON file does not contain information about the projection,
# but we know from the file name that this is 3035
gdf0.crs = "EPSG:3035"# reproject the GeoDataFrame into 4326 and save it as a copy
gdf1 = gdf0.to_crs("EPSG:4326")# reproject the GeoDataFrame into 4326 and save it as a copy
gdf2 = gdf0.to_crs("EPSG:3857")# plot both projections side by side
fig, axes = plt.subplots(1, 3, figsize=(20,8), dpi=150)gdf0.plot(ax=axes[0])
axes[0].set_title(gdf0.crs, fontsize="20")gdf1.plot(ax=axes[1])
axes[1].set_title(gdf1.crs, fontsize="20")gdf2.plot(ax=axes[2])
axes[2].set_title(gdf2.crs, fontsize="20")# hide ticks
for i in range(0,3):
axes[i].get_xaxis().set_ticks([])
axes[i].get_yaxis().set_ticks([])
We can immediately notice that the EPSG 3857 projection:
- depicts faithfully angles (see the orientation of the individual countries)
- it distorts the distances on the y axis
On the other hand, the EPSG 3035 projection:
- distorts the angles
- represents area accurately
For more information about projections, see:
7) Let’s focus on one country (Germany)
We need to load another file that will provide us with a more detailed geography than the country level. And we will need only the region (RG) file.
path = data_dir + "NUTS_RG_01M_2021_3035_LEVL_1.json"
gdf = gpd.read_file(path)# specify CRS (see the filename)
gdf.crs = "EPSG:3035"
Slice the table to keep just data for Germany:
gdf_de = gdf[gdf.CNTR_CODE == "DE"]
Let’s plot the states (Bundesländer) of Germany. We can use the very same GeoDataFrame for visualizing the borders as well:
# we need to transform the data into the 3857 CRS because of the Contextily base map
gdf_de = gdf_de.to_crs("EPSG:3857")# plot the German federated states (Bundesländer)
ax = gdf_de.plot(figsize=(20,15), color="lightgray")# plot borders between the states green
gdf_de.boundary.plot(color="darkgreen", ax=ax)# add background map by OpenStreetMap
ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.HOT)
Plotting with GeoDataFrame is simple. The contextily package allows us to add another layer (base map) to the picture that sets our data visualisation into a well known geographical context.
8) Conclusion
In the first part of the tutorial, we have learnt:
- where to download geographical vector data for Europe
- how to use this data to plot European countries
- the basics of geographical reference systems
- how to add base maps to our plots
The Jupyter notebook to this part can be found on GitHub. If you use the vector data by Eurostat, make sure you comply with the license.