Managing Data Traceability in the Data Lifecycle for Deep Learning Applied to Seismic Data

Renan Souza, Emilio Vital Brazil, Leonardo Azevedo, Rodrigo Ferreira, Daniel Salles Chevitarese, Elton Soares, Raphael Thiago, Marcelo Nery, Viviane Torres, Renato Cerqueira

IBM Research

May 19-22 2019 – 2019 AAPG Annual Convention and Exhibition, San Antonio, Texas

Posted: June 30, 2019

Abstract

Identifying geological features in seismic images is a routine activity in the discovery of oil reservoirs. Geoscientists spend hours recognizing subsurface structures such as salt bodies and faults. Automating this activity is of high interest to both academia and the oil and gas (O&G) industry, and Deep Learning (DL) has become a powerful AI technique for accelerating geoscientists' tasks. However, training DL classifiers to identify textures in seismic images requires a non-trivial data lifecycle that comprises preprocessing, filtering, and analysis of large volumes of geological data, all the way from raw files (e.g., SEG-Y files) to trained DL models for geological texture classification.

Because of this complexity, the data lifecycle must be decomposed into smaller parts, each potentially handled by a different team in a collaboration of geoscientists, computer scientists, statisticians, and others. Each team has its own workflow to automate data-intensive tasks and to store data. Although this increases the productivity of each team, it introduces a significant problem when the data must be analyzed in an integrated way across the distributed data stores. This paper applies scientific workflow provenance techniques to the data lifecycle of a DL-based classifier of geological textures in seismic images. The applied techniques are based on (i) dataflow modeling that considers the data dependencies across dataflows using different data stores; (ii) workflow code instrumentation to capture data provenance; and (iii) creation of a provenance database enriched with application-specific data to provide data traceability throughout the entire data lifecycle.
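Techniques (ii) and (iii) can be illustrated with a minimal Python sketch. It is not the authors' tooling: the `capture_provenance` decorator, the `task_run` table, the `filter_traces` task, and the file names are all hypothetical, and an in-memory SQLite database stands in for whatever provenance store the teams actually used. The sketch only shows the general idea of instrumenting a workflow task so that each run records its inputs and application-specific outputs, which can then be queried to trace a file back to the task that produced it.

```python
import functools
import json
import sqlite3
import time

# Hypothetical in-memory provenance store (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_run ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " workflow TEXT, task TEXT, inputs TEXT, outputs TEXT,"
    " started REAL, ended REAL)"
)


def capture_provenance(workflow):
    """Decorator that records the inputs and outputs of every task call."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(**inputs):
            started = time.time()
            outputs = func(**inputs)
            conn.execute(
                "INSERT INTO task_run"
                " (workflow, task, inputs, outputs, started, ended)"
                " VALUES (?, ?, ?, ?, ?, ?)",
                (
                    workflow,
                    func.__name__,
                    json.dumps(inputs, sort_keys=True),
                    json.dumps(outputs, sort_keys=True),
                    started,
                    time.time(),
                ),
            )
            conn.commit()
            return outputs
        return wrapper
    return decorator


@capture_provenance(workflow="clean_and_filter")
def filter_traces(segy_path, min_amplitude):
    # Placeholder for real seismic processing; returns domain metadata only.
    return {
        "clean_path": segy_path.replace(".sgy", "_clean.sgy"),
        "min_amplitude": min_amplitude,
    }


result = filter_traces(segy_path="survey_a.sgy", min_amplitude=0.1)

# Traceability query: which task run produced the cleaned file, and from what?
rows = conn.execute(
    "SELECT workflow, task, inputs FROM task_run WHERE outputs LIKE ?",
    ("%survey_a_clean.sgy%",),
).fetchall()
```

Storing inputs and outputs as JSON keeps the schema generic across workflows; a production provenance database would more likely use typed columns or a standard provenance model.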

We validated our methodology on the data lifecycle used to train the DL classifier. The lifecycle is composed of five workflows that process over 5 TB of geological data stored in five distributed data stores. The workflows (i) clean and filter raw seismic and horizon data files; (ii) create geospatial indexes and add geoscientists' annotations for seismic files; (iii) do the same for horizon files; (iv) generate the input training and validation datasets for the DL classifier; and (v) train the classifier on a High-Performance Computing (HPC) machine. We show that the applied techniques helped the multidisciplinary teams to find, analyze, and understand the data processed by the workflows in an integrated way.
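The dataflow modeling across the five workflows can be sketched as a small dependency graph. The abstract lists the workflows but does not spell out their dependencies, so the edges below are an assumption inferred from the order of the description, and the workflow names and the `upstream` helper are hypothetical. The sketch shows how, given such a model, one can ask which workflows a trained classifier ultimately depends on.

```python
from collections import deque

# Assumed dependency edges: each workflow maps to the workflows whose
# outputs it consumes. The abstract does not state these explicitly.
DEPENDS_ON = {
    "clean_filter": [],
    "index_seismic": ["clean_filter"],
    "index_horizons": ["clean_filter"],
    "generate_datasets": ["index_seismic", "index_horizons"],
    "train_classifier": ["generate_datasets"],
}


def upstream(workflow, depends_on=DEPENDS_ON):
    """Return every workflow the given one transitively depends on,
    in breadth-first order, without duplicates."""
    seen = set()
    order = []
    queue = deque(depends_on[workflow])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        queue.extend(depends_on[current])
    return order


lineage = upstream("train_classifier")
```

In a full provenance database the same traversal would run over recorded data items rather than whole workflows, so that a specific trained model could be traced back to the specific raw files it was derived from.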
