Overview
The Oil Reservoir Simulation Dataset (ORSD) collection contains tens of thousands of sequences generated by a physics-based simulator of an oil-reservoir field. Each simulation comprises an input action sequence and an output prediction sequence. The goal of releasing this dataset is to aid the development of machine-learning models that can accurately predict the output sequence, given an input sequence, thus offering a large data corpus for evaluating sequence-to-sequence models.
The ORSD collection consists of two separate databases, each targeting a slightly different scenario (one with and one without “drift”) and each containing about 30,000 simulations.
The ORSD caters to researchers in several fields. It is aimed primarily at the broader machine-learning (ML) community, i.e., researchers and data scientists who want to use the data to validate new sequential algorithms. It is also meant for oil-reservoir experts who want to examine the simulations and build their own work on them.
A detailed introduction and a full data description are provided in the notebook accompanying this dataset.
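As a rough illustration of what "accurately predict the output sequence" could mean in practice, the sketch below scores a predicted sequence against a reference sequence with root-mean-square error. The dataset description does not prescribe a metric, so the choice of RMSE, the function name `sequence_rmse`, and the numbers used are all assumptions made for illustration.

```python
import math
from typing import Sequence


def sequence_rmse(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Root-mean-square error between two equal-length numeric sequences."""
    if len(predicted) != len(target):
        raise ValueError("sequences must have the same length")
    if len(target) == 0:
        raise ValueError("sequences must be non-empty")
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for illustration only; they are not taken from ORSD.
example_target = [1.0, 2.0, 3.0]
example_prediction = [1.0, 2.0, 5.0]
example_error = sequence_rmse(example_prediction, example_target)
```

Averaging such a score over a held-out set of simulations gives one number per model, which is enough for a first comparison between sequence-to-sequence models.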
Dataset Metadata
Field | Value |
---|---|
Format | JSON, HDF5 |
License | CDLA-Sharing |
Domain | [Sequence Modelling, Time Series Analysis] |
Number of Records | 30,000 simulations in each of the two databases (SPE9-MAX and SPE9-TRIANGLE) |
Data Split | Split by data type, including SPE9-TRIANGLE, SPE9-MAX and other supporting data files. |
Size | 4.7 GB |
Author | Jiri Navratil, Georgios Kollias, Andres Codas |
Dataset Origin | IBM Research |
Dataset Version Update | Version 1 – April 10, 2020 |
Data Coverage | The ORSD data collection contains tens of thousands of sequences (simulations) generated by a physics-based simulator of an oil-reservoir field. |
Business Use Case | ML Prediction: Build and train machine-learning models that accurately predict the output sequence, given an input sequence. Visualization: Build a three-dimensional visualization of an oil-reservoir field. |
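The record layout of the JSON files is not described in this section, so a reasonable first step is to inspect a file's structure without assuming a schema. The following is a minimal sketch using only the Python standard library. The helper `summarize` and the field names in `sample` (`actions`, `outputs`) are hypothetical and do not reflect the actual ORSD schema.

```python
import json
from typing import Any, List


def summarize(node: Any, depth: int = 0, max_depth: int = 3) -> List[str]:
    """Return indented lines describing the keys and value types of parsed JSON."""
    indent = "  " * depth
    lines: List[str] = []
    if isinstance(node, dict):
        for key, value in node.items():
            lines.append(f"{indent}{key}: {type(value).__name__}")
            if depth < max_depth:
                lines.extend(summarize(value, depth + 1, max_depth))
    elif isinstance(node, list) and node:
        # Describe only the first element; lists are assumed to be homogeneous.
        lines.append(f"{indent}[{len(node)} items, first is {type(node[0]).__name__}]")
        if depth < max_depth:
            lines.extend(summarize(node[0], depth + 1, max_depth))
    return lines


# Hypothetical record: the field names are invented for this example.
sample = json.loads('{"actions": [[1, 2], [3, 4]], "outputs": [0.5, 0.7]}')
summary = summarize(sample)
print("\n".join(summary))
```

To apply this to a real file, parse it with `json.load` on an open file handle and pass the result to `summarize`. The HDF5 files would need a separate reader and are not covered by this sketch.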
Dataset Archive Contents
File or Folder | Description |
---|---|
SPE9-MAX data files | This dataset contains simulations generated by random drilling sequences distributed uniformly over the entire surface grid of the SPE9 RM (24×25). SPE9-MAX is intended to evaluate an ML model as an interpolator, i.e., a predictor of previously unseen sequences drawn from the same distributions and the same geological region. It contains 30,000 such simulations partitioned into training, development, and test sets. |
SPE9-TRIANGLE data files | This dataset contains simulations generated by random drilling sequences distributed uniformly over a constrained triangular portion of the reservoir surface. SPE9-TRIANGLE is intended to evaluate an ML model as an extrapolator, i.e., a predictor of sequences drawn from a different geological region. It also contains 30,000 such simulations partitioned into training, development, and test sets. |
SPE9_simulation_illustration.png | Picture used in the notebooks to illustrate the model. |
SPE9_triangle.png | Picture used in the notebooks to illustrate the train and test split of the oil reservoir dataset. |
LICENSE.txt | Terms of Use |
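The difference between the two databases can be pictured as sampling drilling locations either from the whole 24×25 surface grid or from a restricted triangular part of it. The sketch below illustrates that idea only. The triangle used here is a placeholder (the actual region is the one shown in SPE9_triangle.png and is not specified in this text), and the number of wells per sequence and the choice of distinct locations are likewise assumptions.

```python
import random
from typing import Callable, List, Tuple

GRID_X, GRID_Y = 24, 25  # surface grid of the SPE9 RM, as stated in the table above

Cell = Tuple[int, int]


def all_cells() -> List[Cell]:
    """Every (x, y) cell on the surface grid."""
    return [(x, y) for x in range(GRID_X) for y in range(GRID_Y)]


def in_triangle(cell: Cell) -> bool:
    """ASSUMPTION: placeholder triangle (the half of the grid nearest the origin)."""
    x, y = cell
    return x * GRID_Y + y * GRID_X < GRID_X * GRID_Y


def sample_locations(n: int, allowed: Callable[[Cell], bool],
                     rng: random.Random) -> List[Cell]:
    """Draw n distinct cells uniformly at random from the allowed region."""
    candidates = [cell for cell in all_cells() if allowed(cell)]
    return rng.sample(candidates, n)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
max_style = sample_locations(5, lambda cell: True, rng)  # whole grid
triangle_style = sample_locations(5, in_triangle, rng)   # restricted region
```

A model trained only on sequences like `triangle_style` and tested on locations outside the triangle is being asked to extrapolate, whereas training and testing on sequences like `max_style` tests interpolation.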
Data Glossary and Preview
The data glossary, sample records, and additional dataset metadata are available for this dataset.
Use the Dataset
This dataset is complemented by data exploration, data analysis, and modeling Python notebooks to help you get started: