Oil Reservoir Simulations

Overview

The Oil Reservoir Simulation Dataset (ORSD) collection contains tens of thousands of sequences generated by a physics-based simulator of an oil-reservoir field. Each simulation comprises an input action sequence and an output prediction sequence. The goal of releasing this dataset is to aid the development of machine-learning models that can accurately predict the output sequence, given an input sequence, thus offering a large data corpus for evaluating sequence-to-sequence models.

The ORSD collection consists of two separate databases, each targeting a slightly different scenario (one with and one without “drift”) and each containing about 30,000 simulations.
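As a rough illustration of the prediction task described above, the sketch below scores a sequence predictor by root-mean-square error over a set of simulations. The `Simulation` container, its field names, and the assumption that a model returns one output vector per input step are hypothetical and only serve the example; the actual record layout is described in the accompanying notebook.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, List, Sequence


@dataclass
class Simulation:
    # Hypothetical container; the real field names may differ.
    actions: List[List[float]]  # one feature vector per input step
    outputs: List[List[float]]  # one target vector per output step


Predictor = Callable[[List[List[float]]], List[List[float]]]


def rmse(predicted: List[List[float]], target: List[List[float]]) -> float:
    """Root-mean-square error over all steps and channels of one sequence."""
    squared = [
        (p - t) ** 2
        for p_step, t_step in zip(predicted, target)
        for p, t in zip(p_step, t_step)
    ]
    return sqrt(sum(squared) / len(squared))


def evaluate(model: Predictor, simulations: Sequence[Simulation]) -> float:
    """Average per-sequence RMSE of a model over a set of simulations."""
    scores = [rmse(model(s.actions), s.outputs) for s in simulations]
    return sum(scores) / len(scores)


def zero_baseline(actions: List[List[float]]) -> List[List[float]]:
    # Trivial baseline: predict a single zero-valued channel for every step.
    return [[0.0] for _ in actions]


# Toy data standing in for real simulations.
toy = [Simulation(actions=[[1.0], [2.0]], outputs=[[3.0], [4.0]])]
print(evaluate(zero_baseline, toy))
```

Any callable that maps an input sequence to an output sequence can be passed as `model`, so the same loop could compare a learned model against a simple baseline on held-out simulations.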

The ORSD is intended for researchers in several fields: primarily the broader machine learning (ML) community, i.e., researchers and data scientists who want to validate new sequential algorithms on the data, but also oil-reservoir experts who want to examine the simulations and build on them.

A detailed introduction and a full data description are provided in the notebook accompanying this dataset.

Dataset Metadata

Field: Value
Format: JSON, HDF5
License: CDLA-Sharing
Domain: Sequence Modelling, Time Series Analysis
Number of Records: About 30,000 simulations per database
Data Split: Split by data type, including SPE9-TRIANGLE, SPE9-MAX, and other supporting data files
Size: 4.7 GB
Authors: Jiri Navratil, Georgios Kollias, Andres Codas
Dataset Origin: IBM Research
Dataset Version Update: Version 1, April 10, 2020
Data Coverage: The ORSD data collection contains tens of thousands of sequences (simulations) generated by a physics-based simulator of an oil-reservoir field.
Business Use Case:
  ML Prediction: Build and train machine-learning models that accurately predict the output sequence given an input sequence.
  Visualization: Build three-dimensional visualizations of an oil-reservoir field.
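
The metadata above lists JSON as one of the two file formats, but this page does not describe the layout of the files. The following minimal sketch therefore assumes nothing about that layout: it walks any parsed JSON value and prints a short structural summary. The keys `actions` and `outputs` in the stand-in sample are hypothetical and are not taken from the dataset.

```python
import json


def describe(node, path="$"):
    """Yield (path, summary) pairs for the containers and leaves of parsed JSON."""
    if isinstance(node, dict):
        yield path, f"object with {len(node)} keys"
        for key, value in node.items():
            yield from describe(value, f"{path}.{key}")
    elif isinstance(node, list):
        yield path, f"array of length {len(node)}"
        if node:
            # Only the first element is inspected, to keep the summary short.
            yield from describe(node[0], f"{path}[0]")
    else:
        yield path, type(node).__name__


# Stand-in text with hypothetical keys; for a real file, parse it with json.load.
sample = json.loads('{"actions": [[1, 2], [3, 4]], "outputs": [0.5, 0.7]}')
for path, summary in describe(sample):
    print(f"{path}: {summary}")
```

Running a summary like this on one file before loading the full 4.7 GB archive is a cheap way to learn the field names and nesting depth.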

Dataset Archive Contents

File or Folder: Description
SPE9-MAX data files: Simulations generated by random drilling sequences distributed uniformly over the entire surface grid of the SPE9 RM (24×25). SPE9-MAX is meant to evaluate an ML model as an interpolator, i.e., a predictor of previously unseen sequences drawn from the same distribution and the same geological region. It contains 30,000 such simulations partitioned into training, development, and test sets.
SPE9-TRIANGLE data files: Simulations generated by random drilling sequences distributed uniformly over a constrained triangular portion of the reservoir surface. SPE9-TRIANGLE is meant to evaluate an ML model as an extrapolator, i.e., a predictor of sequences drawn from a different geological region. It also contains 30,000 such simulations partitioned into training, development, and test sets.
SPE9_simulation_illustration.png: Picture used in the notebooks to illustrate the model.
SPE9_triangle.png: Picture used in the notebooks to illustrate the train and test split of the oil reservoir dataset.
LICENSE.txt: Terms of use.
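
The difference between the two sampling regimes described above can be sketched as follows. This is only an illustration of uniform sampling over the full 24×25 grid versus a triangular sub-region; it is not the generator used to build the dataset. The axis order, the exact shape of the triangle, and whether a location may repeat within a sequence are all assumptions made for the example.

```python
import random

NX, NY = 24, 25  # surface grid size from the description; axis order is assumed


def sample_full(n, rng):
    """Draw n drilling locations uniformly over the whole surface grid."""
    return [(rng.randrange(NX), rng.randrange(NY)) for _ in range(n)]


def in_triangle(x, y):
    # Hypothetical region: cells on or below the grid's anti-diagonal,
    # i.e. x / (NX - 1) + y / (NY - 1) <= 1, written without division.
    return x * (NY - 1) + y * (NX - 1) <= (NX - 1) * (NY - 1)


def sample_triangle(n, rng):
    """Draw n locations uniformly over the triangle by rejection sampling."""
    picks = []
    while len(picks) < n:
        x, y = rng.randrange(NX), rng.randrange(NY)
        if in_triangle(x, y):
            picks.append((x, y))
    return picks


rng = random.Random(0)  # fixed seed so the sketch is reproducible
full = sample_full(5, rng)
tri = sample_triangle(5, rng)
print(full)
print(tri)
```

Rejection sampling keeps the draw uniform over the cells inside the region, which matches the "distributed uniformly" wording for both data types while changing only the region being sampled.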

Data Glossary and Preview

A data glossary, sample records, and additional dataset metadata are available alongside this dataset.

Use the Dataset

Python notebooks for data exploration, data analysis, and modeling complement this dataset to help you get started.
