Usage#

Ingestion and Export#

The following example shows how to minimally ingest a 3D seismic stack into a local MDIO file. Only one lossless copy will be made.

There are many more options, please see the CLI Reference.

$ mdio segy import \
    path_to_segy_file.segy \
    path_to_mdio_file.mdio \
    -loc 181,185 \
    -names inline,crossline

To export the same file back to SEG-Y format, the following command should be executed.

$ mdio segy export \
    path_to_mdio_file.mdio \
    path_to_segy_file.segy

Cloud Connection Strings#

MDIO supports I/O on major cloud service providers. The cloud I/O capabilities are supported using the fsspec and its specialized version for:

Amazon Web Services (AWS S3) - s3fs
Google Cloud Provider (GCP GCS) - gcsfs
Microsoft Azure (Datalake Gen2) - adlfs

Any other file-system supported by fsspec will also be supported by MDIO. However, we will focus on the major providers here.

The protocols that help choose a backend (i.e. s3://, gs://, or az://) can be passed prepended to the MDIO path.

The connection string can be passed to the command-line-interface (CLI) using the -storage, --storage-options flag as a JSON string or the Python API with the storage_options keyword argument as a Python dictionary.

Warning

On Windows clients, JSON strings are passed to the CLI with a special escape character.

For instance a JSON string:

{"key": "my_super_private_key", "secret": "my_super_private_secret"}

must be passed with an escape character \ for inner quotes as:

"{\"key\": \"my_super_private_key\", \"secret\": \"my_super_private_secret\"}"

whereas, on Linux bash this works just fine:

'{"key": "my_super_private_key", "secret": "my_super_private_secret"}'

If this done incorrectly, you will get an invalid JSON string error from the CLI.

Amazon Web Services#

Credentials can be automatically fetched from pre-authenticated AWS CLI. See here for the order s3fs checks them. If it is not pre-authenticated, you need to pass --storage-options.

Prefix:
s3://

Storage Options:
key: The auth key from AWS
secret: The auth secret from AWS

Using UNIX:

mdio segy import \
  path/to/my.segy \
  s3://bucket/prefix/my.mdio \
  --header-locations 189,193 \
  --storage-options '{"key": "my_super_private_key", "secret": "my_super_private_secret"}'

Using Windows (note the extra escape characters \):

mdio segy import \
  path/to/my.segy \
  s3://bucket/prefix/my.mdio \
  --header-locations 189,193 \
  --storage-options "{\"key\": \"my_super_private_key\", \"secret\": \"my_super_private_secret\"}"

Google Cloud Provider#

Credentials can be automatically fetched from pre-authenticated gcloud CLI. See here for the order gcsfs checks them. If it is not pre-authenticated, you need to pass --storage-options.

GCP uses service accounts to pass authentication information to APIs.

Prefix:
gs:// or gcs://

Storage Options:
token: The service account JSON value as string, or local path to JSON

Using a service account:

mdio segy import \
  path/to/my.segy \
  gs://bucket/prefix/my.mdio \
  --header-locations 189,193 \
  --storage-options '{"token": "~/.config/gcloud/application_default_credentials.json"}'

Using browser to populate authentication:

mdio segy import \
  path/to/my.segy \
  gs://bucket/prefix/my.mdio \
  --header-locations 189,193 \
  --storage-options '{"token": "browser"}'

Microsoft Azure#

There are various ways to authenticate with Azure Data Lake (ADL). See here for some details. If ADL is not pre-authenticated, you need to pass --storage-options.

Prefix:
az:// or abfs://

Storage Options:
account_name: Azure Data Lake storage account name
account_key: Azure Data Lake storage account access key

mdio segy import \
  path/to/my.segy \
  az://bucket/prefix/my.mdio \
  --header-locations 189,193 \
  --storage-options '{"account_name": "myaccount", "account_key": "my_super_private_key"}'

Advanced Cloud Features#

There are additional functions provided by fsspec. These are advanced features and we refer the user to read fsspec documentation. Some useful examples are:

Caching Files Locally
Remote Write Caching
File Buffering and random access
Mount anything with FUSE

Note

When combining advanced protocols like simplecache and using a remote store like s3 the URL can be chained like simplecache::s3://bucket/prefix/file.mdio. When doing this the --storage-options argument must explicitly state parameters for the cloud backend and the extra protocol. For the above example it would look like this:

{
  "s3": {
    "key": "my_super_private_key",
    "secret": "my_super_private_secret"
  },
  "simplecache": {
    "cache_storage": "/custom/temp/storage/path"
  }
}

In one line:

{"s3": {"key": "my_super_private_key", "secret": "my_super_private_secret"}, "simplecache": {"cache_storage": "/custom/temp/storage/path"}

CLI Reference#

MDIO provides a convenient command-line-interface (CLI) to do various tasks.

For each command / subcommand you can provide --help argument to get information about usage.

mdio#

Welcome to MDIO!

MDIO is an open source, cloud-native, and scalable storage engine for various types of energy data.

MDIO supports importing or exporting various data containers, hence we allow plugins as subcommands.

From this main command, we can see the MDIO version.

mdio [OPTIONS] COMMAND [ARGS]...

Options

--version#: Show the version and exit.

copy#

Copy a MDIO dataset to anpther MDIO dataset.

Can also copy with empty data to be filled later. See excludes and includes parameters.

More documentation about excludes and includes can be found in Zarr’s documentation in zarr.convenience.copy_store.

mdio copy [OPTIONS] SOURCE_MDIO_PATH TARGET_MDIO_PATH

Options

-access, --access-pattern <access_pattern>#

Access pattern of the file

Default:: 012

-exc, --excludes <excludes>#: Data to exclude during copy, like chunked_012. The data values won’t be copied but an empty array will be created. If blank, it copies everything.

-inc, --includes <includes>#: Data to include during copy, like trace_headers. If not specified, and certain data is excluded, it will not copy headers. To preserve headers, specify trace_headers. If left blank, it will copy everything except what is specified in the ‘excludes’ parameter.

-storage, --storage-options <storage_options>#: Custom storage options for cloud backends

-overwrite, --overwrite#

Flag to overwrite if mdio file if it exists

Default:: False

Arguments

SOURCE_MDIO_PATH#: Required argument

TARGET_MDIO_PATH#: Required argument

info#

Provide information on a MDIO dataset.

By default, this returns human-readable information about the grid and stats for the dataset. If output-format is set to json then a json is returned to facilitate parsing.

mdio info [OPTIONS] MDIO_PATH

Options

-access, --access-pattern <access_pattern>#

Access pattern of the file

Default:: 012

-format, --output-format <output_format>#

Output format. Pretty console or JSON.

Default:: pretty
Options:: pretty | json

Arguments

MDIO_PATH#: Required argument

segy#

MDIO and SEG-Y conversion utilities. Below is general information about the SEG-Y format and MDIO features. For import or export specific functionality check the import or export modules:

mdio segy import –help

mdio segy export –help

MDIO can import SEG-Y files to a modern, chunked format.

The SEG-Y format is defined by the Society of Exploration Geophysicists as a data transmission format and has its roots back to 1970s. There are currently multiple revisions of the SEG-Y format.

MDIO can unravel and index any SEG-Y file that is on a regular grid. There is no limitation to dimensionality of the data, as long as it can be represented on a regular grid. Most seismic surveys are on a regular grid of unique shot/receiver IDs or are imaged on regular CDP or INLINE/CROSSLINE grids.

The SEG-Y headers are used as identifiers to take the flattened SEG-Y data and convert it to the multi-dimensional tensor representation. An example of ingesting a 3-D Post-Stack seismic data can be though as the following, per the SEG-Y Rev1 standard:

–header-names inline,crossline
–header-locations 189,193
–header-types int32,int32

Our recommended chunk sizes are:
(Based on GCS benchmarks)
3D: 64 x 64 x 64
2D: 512 x 512

The 4D+ datasets chunking recommendation depends on the type of 4D+ dataset (i.e. SHOT vs CDP data will have different chunking).

MDIO also import or export big and little endian coded IBM or IEEE floating point formatted SEG-Y files. MDIO can also build a grid from arbitrary header locations for indexing. However, the headers are stored as the SEG-Y Rev 1 after ingestion.

mdio segy [OPTIONS] COMMAND [ARGS]...

export#

Export MDIO file to SEG-Y.

SEG-Y format is explained in the “segy” group of the command line interface. To see additional information run:

mdio segy –help

MDIO allows exporting multidimensional seismic data back to the flattened seismic format SEG-Y, to be used in data transmission.

The input headers are preserved as is, and will be transferred to the output file.

The user has control over the endianness, and the floating point data type. However, by default we export as Big-Endian IBM float, per the SEG-Y format’s default.

The input MDIO can be local or cloud based. However, the output SEG-Y will be generated locally.

mdio segy export [OPTIONS] MDIO_FILE SEGY_PATH

Options

-access, --access-pattern <access_pattern>#

Access pattern of the file

Default:: 012

-format, --segy-format <segy_format>#

SEG-Y sample format

Default:: ibm32
Options:: ibm32 | ieee32

-storage, --storage-options <storage_options>#: Custom storage options for cloud backends.

-endian, --endian <endian>#

Endianness of the SEG-Y file

Default:: big
Options:: little | big

Arguments

MDIO_FILE#: Required argument

SEGY_PATH#: Required argument

import#

Ingest SEG-Y file to MDIO.

SEG-Y format is explained in the “segy” group of the command line interface. To see additional information run:

mdio segy –help

MDIO allows ingesting flattened seismic surveys in SEG-Y format into a multidimensional tensor that represents the correct geometry of the seismic dataset.

The SEG-Y file must be on disk, MDIO currently does not support reading SEG-Y directly from the cloud object store.

The output MDIO file can be local or on the cloud. For local files, a UNIX or Windows path is sufficient. However, for cloud stores, an appropriate protocol must be provided. Some examples:

File Path Patterns:

If we are working locally: –input_segy_path local_seismic.segy –output-mdio-path local_seismic.mdio

If we are working on the cloud on Amazon Web Services: –input_segy_path local_seismic.segy –output-mdio-path s3://bucket/local_seismic.mdio

If we are working on the cloud on Google Cloud: –input_segy_path local_seismic.segy –output-mdio-path gs://bucket/local_seismic.mdio

If we are working on the cloud on Microsoft Azure: –input_segy_path local_seismic.segy –output-mdio-path abfs://bucket/local_seismic.mdio

The SEG-Y headers for indexing must also be specified. The index byte locations (starts from 1) are the minimum amount of information needed to index the file. However, we suggest giving names to the index dimensions, and if needed providing the header types if they are not standard. By default, all header entries are assumed to be 4-byte long (int32).

The chunk size depends on the data type, however, it can be chosen to accommodate any workflow’s access patterns. See examples below for some common use cases.

By default, the data is ingested with LOSSLESS compression. This saves disk space in the range of 20% to 40%. MDIO also allows data to be compressed using the ZFP compressor’s fixed accuracy lossy compression. If lossless parameter is set to False and MDIO was installed using the lossy extra; then the data will be compressed to approximately 30% of its original size and will be perceptually lossless. The compression amount can be adjusted using the option compression_tolerance (float). Values less than 1 gives good results. The higher the value, the more compression, but will introduce artifacts. The default value is 0.01 tolerance, however we get good results up to 0.5; where data is almost compressed to 10% of its original size. NOTE: This assumes data has amplitudes normalized to have approximately standard deviation of 1. If dataset has values smaller than this tolerance, a lot of loss may occur.

Usage:

Below are some examples of ingesting standard SEG-Y files per the SEG-Y Revision 1 and 2 formats.

3D Seismic Post-Stack: Chunks: 128 inlines x 128 crosslines x 128 samples –header-locations 189,193 –header-names inline,crossline

3D Seismic Imaged Pre-Stack Gathers: Chunks: 16 inlines x 16 crosslines x 16 offsets x 512 samples –header-locations 189,193,37 –header-names inline,crossline,offset –chunk-size 16,16,16,512

2D Seismic Shot Data (Byte Locations Vary): Chunks: 16 shots x 256 channels x 512 samples –header-locations 9,13 –header-names shot,chan –chunk-size 16,256,512

3D Seismic Shot Data (Byte Locations Vary): Let’s assume streamer number is at byte 213 as a 2-byte integer field. Chunks: 8 shots x 2 cables x 256 channels x 512 samples –header-locations 9,213,13 –header-names shot,cable,chan –header-types int32,int16,int32 –chunk-size 8,2,256,512

We can override the dataset grid by the grid_overrides parameter. This allows us to ingest files that don’t conform to the true geometry of the seismic acquisition.

For example if we are ingesting 3D seismic shots that don’t have a cable number and channel numbers are sequential (i.e. each cable doesn’t start with channel number 1; we can tell MDIO to ingest this with the correct geometry by calculating cable numbers and wrapped channel numbers. Note the missing byte location and type for the “cable” index.

Usage:

3D Seismic Shot Data (Byte Locations Vary): Let’s assume streamer number does not exist but there are 800 channels per cable. Chunks: 8 shots x 2 cables x 256 channels x 512 samples –header-locations 9,None,13 –header-names shot,cable,chan –header-types int32,None,int32 –chunk-size 8,2,256,512 –grid-overrides ‘{“ChannelWrap”: True, “ChannelsPerCable”: 800,

“CalculateCable”: True}’

If we do have cable numbers in the headers, but channels are still sequential (aka. unwrapped), we can still ingest it like this. –header-locations 9,213,13 –header-names shot,cable,chan –header-types int32,int16,int32 –chunk-size 8,2,256,512 –grid-overrides ‘{“ChannelWrap”:True, “ChannelsPerCable”: 800}’ For shot gathers with channel numbers and wrapped channels, no grid overrides are necessary.

In cases where the user does not know if the input has unwrapped channels but desires to store with wrapped channel index use: –grid-overrides ‘{“AutoChannelWrap”: True}’

For cases with no well-defined trace header for indexing a NonBinned grid override is provided.This creates the index and attributes an incrementing integer to the trace for the index based on first in first out. For example a CDP and Offset keyed file might have a header for offset as real world offset which would result in a very sparse populated index. Instead, the following override will create a new index from 1 to N, where N is the number of offsets within a CDP ensemble. The index to be auto generated is called “trace”. Note the required “chunksize” parameter in the grid override. This is due to the non-binned ensemble chunksize is irrelevant to the index dimension chunksizes and has to be specified in the grid override itself. Note the lack of offset, only indexing CDP, providing CDP header type, and chunksize for only CDP and Sample dimension. The chunksize for non-binned dimension is in the grid overrides as described above. The below configuration will yield 1MB chunks. –header-locations 21 –header-names cdp –header-types int32 –chunk-size 4,1024 –grid-overrides ‘{“NonBinned”: True, “chunksize”: 64}’

A more complicated case where you may have a 5D dataset that is not binned in Offset and Azimuth directions can be ingested like below. However, the Offset and Azimuth dimensions will be combined to “trace” dimension. The below configuration will yield 1MB chunks. –header-locations 189,193 –header-names inline,crossline –header-types int32,int32 –chunk-size 4,4,1024 –grid-overrides ‘{“NonBinned”: True, “chunksize”: 16}’

For dataset with expected duplicate traces we have the following parameterization. This will use the same logic as NonBinned with a fixed chunksize of 1. The other keys are still important. The below example allows multiple traces per receiver (i.e. reshoot). –header-locations 9,213,13 –header-names shot,cable,chan –header-types int32,int16,int32 –chunk-size 8,2,256,512 –grid-overrides ‘{“HasDuplicates”: True}’

mdio segy import [OPTIONS] SEGY_PATH MDIO_PATH

Options

-loc, --header-locations <header_locations>#: Required Byte locations of the index attributes in SEG-Y trace header.

-types, --header-types <header_types>#: Data types of the index attributes in SEG-Y trace header.

-names, --header-names <header_names>#: Names of the index attributes

-chunks, --chunk-size <chunk_size>#: Custom chunk size for bricked storage

-endian, --endian <endian>#

Endianness of the SEG-Y file

Default:: big
Options:: little | big

-lossless, --lossless <lossless>#

Toggle lossless, and perceptually lossless compression

Default:: True

-tolerance, --compression-tolerance <compression_tolerance>#

Lossy compression tolerance in ZFP.

Default:: 0.01

-storage, --storage-options <storage_options>#: Custom storage options for cloud backends

-overwrite, --overwrite#

Flag to overwrite if mdio file if it exists

Default:: False

-grid-overrides, --grid-overrides <grid_overrides>#: Option to add grid overrides.

Arguments

SEGY_PATH#: Required argument

MDIO_PATH#: Required argument