Analyzing geospatial environmental open data

This tutorial is part of the 2021 Call for Code Global Challenge.

In this tutorial, learn how to pull 40 years of global, satellite-based soil moisture data from the European Commission, then train a model that computes moisture trends to identify regions with a high probability of drying out and experiencing droughts.

The data set

Copernicus is the European Union’s Earth observation program, looking at our planet and its environment. It offers free and openly accessible information services that draw on satellite Earth observation and in-situ (non-space) data: vast amounts of global measurements from satellites and from ground-based, airborne, and seaborne systems.

From this database, I show how to extract 40 years of historic (spatiotemporal) soil moisture data for the entire planet in tiles of 20 by 20 kilometers (400 square kilometers each), with a moisture-level time resolution of one day.

Data set
Figure 1. Global soil moisture for a particular day plotted directly from the original data set

The tools

Elyra is a set of open source JupyterLab extensions for artificial intelligence and machine learning. In this tutorial, I use the Elyra pipeline editor to orchestrate a set of Jupyter Notebooks that pull, transform, analyze, and visualize data. This pipeline serves as a template for your future analyses. Elyra can run pipelines locally, but it can also push the work to Kubeflow Pipelines running on Kubernetes or to Apache Airflow. Here, I use local execution mode only. I provide a Docker image that you can start with a single command so that there are no issues during the installation process.

The Elyra pipeline editor
Figure 2. The Elyra pipeline editor

Prerequisites

To follow this tutorial, you need:

  1. Docker
  2. Elyra

Estimated time

It should take you approximately 30 minutes to complete the tutorial.

Steps

Step 1. Install Docker

I’ve provided a ready-made Docker image for you, so you only need to install Docker on your local machine. For Windows and macOS, I recommend Docker Desktop. For Linux, find the appropriate installation procedure for your distribution.

Step 2. Run Elyra

After you have Docker installed, you can run Elyra with one command. Open a terminal window, and enter the following command.

docker run -it -p 8888:8888  --shm-size=10G -v elyra_work:/home/jovyan/work -v elyra_runtimes:/home/jovyan/.local/share/jupyter/metadata romeokienzler/elyra-ai-cfc21:latest  jupyter-lab --no-browser

Note that the command creates two volumes, elyra_work and elyra_runtimes. Therefore, if you restart Elyra, all of your work is still saved within those volumes.

Step 3. Access Elyra

Look at the output in the terminal window, and wait for a line similar to the following code.

[I 2021-03-18 13:14:21.406 ServerApp] Jupyter Server 1.4.1 is running at:
[I 2021-03-18 13:14:21.406 ServerApp] http://7e80d979bd07:8888/lab?token=138a179b82f613242c2a21d4abebed64d8a66bb322aa070a
[I 2021-03-18 13:14:21.406 ServerApp]  or http://127.0.0.1:8888/lab?token=138a179b82f613242c2a21d4abebed64d8a66bb322aa070a

This indicates that Elyra is available at http://127.0.0.1:8888/lab?token=138a179b82f613242c2a21d4abebed64d8a66bb322aa070a (your token will differ), so you can just click the link.

Step 4. Have fun

After you’ve opened the URL in a browser, you are presented with Elyra.

Before you click the play button to execute the entire pipeline at once, I highly recommend running each notebook manually, one by one; this makes it easier to spot errors. (You also have to inject your API key manually.) Double-click on a pipeline component and it takes you to the underlying notebook. The notebooks support check-pointing, which means that already-completed tasks are skipped (see the sketch after the steps below). Therefore, if you want to run from scratch, you must delete the contents of the data folder.

  1. Double-click the climate-copernicus.pipeline.
  2. Add the API key of the Copernicus data provider (see #input-climate-copernicus).
  3. Run the pipeline by clicking the play icon.

    Note: The pipeline caches data in the data directory. If you make significant changes to the pipeline, for example, changing the sampling rate, delete the contents of the data directory, and all pipeline steps will run from scratch.
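The check-pointing mentioned above boils down to skipping a step whose output already exists. A minimal sketch of that pattern (the file path is hypothetical, not the notebooks’ actual layout):

import os

OUTPUT_FILE = "data/soil_moisture.csv"  # hypothetical output of one pipeline step

if os.path.exists(OUTPUT_FILE):
    print("Checkpoint found, skipping this step")
else:
    # expensive work that produces OUTPUT_FILE goes here
    ...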

Running the pipeline
Figure 3. Running the pipeline

Understanding the pipeline

In addition to querying the data set in real time to find regions in your area with higher moisture levels (for identifying possible water sources), time-series forecasting is also possible. For each grid tile, I fit a low-order polynomial (a regression line) to obtain the future trend and answer the question “Will this local grid tile be drier in the near future?” The result is a worldwide grid of trend values that I use to plot an interactive heat map of the planet.
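As a sketch of that per-tile computation, you can fit a first-degree polynomial with NumPy and take its slope as the trend. The data below is synthetic; the real pipeline reads the moisture time series for each tile:

import numpy as np

# Synthetic stand-in for one tile's daily moisture time series (40 years)
days = np.arange(365 * 40)
moisture = 0.3 - 1e-6 * days + np.random.normal(0, 0.05, days.size)

# Fit a degree-1 polynomial; np.polyfit returns coefficients highest power first
slope, intercept = np.polyfit(days, moisture, deg=1)
print(f"Trend: {slope:.2e} moisture units per day")  # negative means drying out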

Trend of soil moisture over the past 40 years visualized as heat map
Figure 4. Trend of soil moisture over the past 40 years visualized as a heat map

Map at higher zoom level. Note that not all tiles have been computed
Figure 5. Map at higher zoom level. Note that not all tiles have been computed

Description of individual pipeline steps

The pipeline consists of seven steps. All of the processing steps are available from a library of predefined pipeline components, the Elyra CLAIMED component library. CLAIMED stands for Component Library for AI, machine learning, ETL, and data science. You can find the latest version of the library at https://github.com/elyra-ai/component-library. Each component is implemented as a Jupyter Notebook for easy creation, modification, and learning.

input-climate-copernicus

This component pulls data from the Copernicus Data Provider (European Commission), which is available at cds.climate.copernicus.eu. The only parameter the component needs is an API key that lets you submit data pull jobs and download the data, so you must create an account. After you’ve registered, open the following link and click Submit to accept the Copernicus terms and conditions. Otherwise, you can’t download data.

https://cds.climate.copernicus.eu/cdsapp/#!/terms/licence-to-use-copernicus-products

When your account is ready, open your user profile, scroll down, and take note of your UID and API key. The API key that you must provide to the component within the Elyra pipeline editor has the form UID:APIKey.

To provide the key:

  1. Hover over the first pipeline component.
  2. Click the three vertical dots.
  3. Click Properties. Note that you can always double-click a component and the underlying Jupyter Notebook will open.

    Properties
    Figure 6. Properties

  4. Scroll down to the Environment Variables section and insert your API key.

    input-climate-copernicus.ipynb
    Figure 7. input-climate-copernicus component

Note that this first stage runs for several hours depending on your available computing resources.
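Under the hood, pulls from the Climate Data Store are typically issued through the cdsapi Python client, which reads your UID:APIKey credentials from a configuration file. Here is a minimal sketch of such a request; the request fields and target path are illustrative, not necessarily exactly what the component submits:

import cdsapi

# The client reads the API key (UID:APIKey) from ~/.cdsapirc by default
c = cdsapi.Client()

c.retrieve(
    "satellite-soil-moisture",  # CDS dataset name
    {
        # Illustrative request fields; check the CDS web form for the exact ones
        "variable": "volumetric_surface_soil_moisture",
        "year": "2020",
        "month": "01",
        "day": "01",
        "format": "zip",
    },
    "data/download.zip",  # placeholder target path
)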

spark-csv-to-parquet

This component takes the CSV file created by the input stage and converts it to the Parquet file format using Apache Spark, because analytics run much faster on Parquet than on CSV.
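In PySpark, such a conversion boils down to a read and a write; the file paths here are placeholders, not the component’s actual ones:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the CSV produced by the input stage (placeholder path)
df = spark.read.option("header", True).option("inferSchema", True).csv("data/soil_moisture.csv")

# Write as Parquet, a columnar format that analytics engines scan much faster
df.write.mode("overwrite").parquet("data/soil_moisture.parquet")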

spark-sample

This component takes a random subset of the original data set (1 percent by default), as shown in the sketch below. I recommend leaving it at that value until everything works correctly. When you are ready, you can run on the entire data set and leave the process running overnight.
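In PySpark, such sampling is essentially a one-liner; the fraction mirrors the 1 percent default, while the paths and seed are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sample").getOrCreate()

df = spark.read.parquet("data/soil_moisture.parquet")  # placeholder path

# Keep a random 1 percent of the rows; a fixed seed makes runs reproducible
sample = df.sample(fraction=0.01, seed=42)
sample.write.mode("overwrite").parquet("data/soil_moisture_sample.parquet")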

The next component fits a low-order polynomial (regression, rank two) to the time series of every grid tile. This is a computationally intensive step, which is why you should first try it out with a small sample.

Low-order regression line
Figure 8. Example of fitting a low-order regression line to a time series to obtain a trend (Note: The slope of the regression line is used as the trend.)

spark-parquet-to-csv

Because most visualization libraries can’t read Parquet, this component converts the data back to CSV.
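The reverse conversion is equally short in PySpark (again with placeholder paths):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-csv").getOrCreate()

df = spark.read.parquet("data/trends.parquet")  # placeholder path
# coalesce(1) writes a single CSV file instead of one file per partition
df.coalesce(1).write.mode("overwrite").option("header", True).csv("data/trends_csv")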

map-from-coordinates

This component takes a list of coordinates and an associated value as input and creates an interactive heat map using OpenStreetMap data and the Leaflet library.
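A comparable heat map can be produced in Python with folium, which wraps Leaflet and uses OpenStreetMap tiles by default. The column names here are assumptions about the CSV produced upstream, not the component’s actual schema:

import folium
import pandas as pd
from folium.plugins import HeatMap

# Assumed columns: lat, lon, trend (slope of the per-tile regression line)
df = pd.read_csv("data/trends.csv")

m = folium.Map(location=[0, 0], zoom_start=2)  # world view, OpenStreetMap tiles
# HeatMap treats the third column as a weight; negative trends may need
# rescaling to a positive range first
HeatMap(df[["lat", "lon", "trend"]].values.tolist()).add_to(m)
m.save("heatmap.html")  # open in a browser for the interactive map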

The finalized pipeline
Figure 9. The finalized pipeline

To run the pipeline, click the play button at the upper left and wait. Check the stdout of the terminal window from which you started the Elyra Docker image; errors during the local pipeline run are displayed there.

Summary

With this pipeline, it is easy to add data sources or processing steps to create further insights. For example, you could implement a search application to identify regions with high moisture levels around you, or you could identify regions at risk of drying out and experiencing droughts in the near future.