Reference#

Readers / Writers#

MDIO accessor APIs.

class mdio.api.accessor.MDIOAccessor(mdio_path_or_buffer, mode, access_pattern, storage_options, return_metadata, new_chunks, backend, memory_cache_size, disk_cache)#

Accessor class for MDIO files.

The accessor can be used to read and write MDIO files. It allows you to open an MDIO file in several mode and access_pattern combinations.

Access pattern defines the dimensions that are chunked. For instance if you have a 3-D array that is chunked in every direction (i.e. a 3-D seismic stack consisting of inline, crossline, and sample dimensions) its access pattern would be “012”. If it was only chunked in the first two dimensions (i.e. seismic inline and crossline), it would be “01”.

By default, MDIO will try to open with “012” access pattern, and will raise an error if that pattern doesn’t exist.
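
For example, if the file also contains a chunked "01" pattern, it can be requested explicitly (a minimal sketch; the path is illustrative and the pattern must exist in the file):

>>> from mdio import MDIOReader
>>>
>>>
>>> mdio = MDIOReader("path_to.mdio", access_pattern="01")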

After the dataset is opened, slicing the accessor will return either just the seismic trace data as a Numpy array, or a tuple of live mask, headers, and traces as Numpy arrays, depending on the return_metadata parameter.

Regarding object store access: if user credentials have been set system-wide on the local machine or VM, there is no need to specify credentials. However, the storage_options option allows users to specify credentials for the store being accessed. Please see the fsspec documentation for configuring storage options.
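
As a sketch, explicit credentials could be passed like this (the key names follow fsspec's s3fs convention; the bucket and values are hypothetical):

>>> mdio = MDIOReader(
...     "s3://bucket/file.mdio",
...     storage_options={"key": "my_key", "secret": "my_secret"},
... )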

MDIO currently supports Zarr and Dask backends. The Zarr backend is useful for reading small amounts of data with minimal overhead. However, by utilizing the Dask backend with a larger chunk size using the new_chunks argument, the data can be read in parallel using a Dask LocalCluster or a distributed Cluster.
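
For instance, a hedged sketch of opening with the Dask backend and larger chunks (the chunk sizes are illustrative):

>>> mdio = MDIOReader(
...     "path_to.mdio",
...     backend="dask",
...     new_chunks=(128, 128, 128),
... )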

The accessor also allows users to enable fsspec caching. This is particularly useful when accessing data from a high-latency store, such as an object store or a mounted network drive. We can use the disk_cache option to cache fetched chunks in the local temporary directory for faster repeated access. We can also turn on an in-memory Least Recently Used (LRU) cache via the memory_cache_size option, which is specified in bytes.
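
For example, both caches could be enabled together (a sketch; the 1 GiB size is an arbitrary choice):

>>> mdio = MDIOReader(
...     "s3://bucket/file.mdio",
...     disk_cache=True,
...     memory_cache_size=2**30,  # 1 GiB LRU cache
... )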

Parameters
  • mdio_path_or_buffer (str) – Store URL for MDIO file. This can be either on a local disk, or a cloud object store.

  • mode (str) – Read or read/write mode. The file must exist. Options are in {‘r’, ‘r+’}.

  • access_pattern (str) – Chunk access pattern, optional. Default is “012”. Examples: ‘012’, ‘01’, ‘01234’.

  • storage_options (dict) – Options for the storage backend. By default, system-wide credentials will be used. If system-wide credentials are not set and the source is not public, an authentication error will be raised by the backend.

  • return_metadata (bool) – Flag for returning live mask, headers, and traces or just the trace data. Default is False, which means just trace data will be returned.

  • new_chunks (tuple[int, ...]) – Chunk sizes used in Dask backend. Ignored for Zarr backend. By default, the disk-chunks will be used. However, if we want to stream groups of chunks to a Dask worker, we can rechunk here. Then each Dask worker can asynchronously fetch multiple chunks before working.

  • backend (str) – Backend selection, optional. Default is “zarr”. Must be in {‘zarr’, ‘dask’}.

  • memory_cache_size (int) – Maximum size of the in-memory least recently used (LRU) cache, in bytes.

  • disk_cache (bool) – Disk cache implemented by fsspec, optional. Default is False, which turns off disk caching. See simplecache from fsspec documentation for more details.

Raises

MDIONotFoundError – If the MDIO file cannot be opened.

Notes

The combination of the Dask backend and caching schemes are experimental. This configuration may cause unexpected memory usage and duplicate data fetching.

Examples

Assuming we ingested my_3d_seismic.segy as my_3d_seismic.mdio, we can open the file in read-only mode like this.

>>> from mdio import MDIOReader
>>>
>>>
>>> mdio = MDIOReader("my_3d_seismic.mdio")

This will open the file with the lazy Zarr backend. To access a specific inline, crossline, or sample index we can do:

>>> inline = mdio[15]  # get the 15th inline
>>> crossline = mdio[:, 15]  # get the 15th crossline
>>> samples = mdio[..., 250]  # get the 250th sample slice

The above variables will be Numpy arrays of the relevant trace data. If we want to retrieve the live mask and trace headers for our slicing, we need to open the file with the return_metadata option.

>>> mdio = MDIOReader("my_3d_seismic.mdio", return_metadata=True)

Then we can fetch the data like this (for inline):

>>> il_live, il_headers, il_traces = mdio[15]

Since MDIOAccessor returns a tuple with these three Numpy arrays, we can directly unpack it and use it further down our code.

Accessor initialization function.

coord_to_index(*args, dimensions=None)#

Convert dimension coordinate to zero-based index.

The coordinate labels of the array dimensions are converted to zero-based indices. For instance if we have an inline dimension like this:

[10, 20, 30, 40, 50]

then the indices would be:

[0, 1, 2, 3, 4]

This method converts from coordinate labels of a dimension to equivalent indices.

Multiple dimensions can be queried at the same time, see the examples.

Parameters
  • *args – Variable length argument queries.

  • dimensions (Optional[Union[str, list[str]]]) – Name of the dimensions to query. If not provided, it will query all dimensions in the grid and will require len(args) == grid.ndim.

Returns

Zero-based indices of coordinates. Each item in the result corresponds to the indices of the queried dimension.

Raises
  • ShapeError – If the number of queries doesn't match the requested dimensions.

  • ValueError – if requested coordinates don’t exist.

Return type

tuple[numpy.ndarray[Any, numpy.dtype[int]], …]

Examples

Opening an MDIO file.

>>> from mdio import MDIOReader
>>>
>>>
>>> mdio = MDIOReader("path_to.mdio")
>>> mdio.coord_to_index([10, 7, 15], dimensions='inline')
array([ 8,  5, 13], dtype=uint16)
>>> ils, xls = [10, 7, 15], [5, 10]
>>> mdio.coord_to_index(ils, xls, dimensions=['inline', 'crossline'])
(array([ 8,  5, 13], dtype=uint16), array([3, 8], dtype=uint16))

With the above indices, we can slice the data:

>>> mdio[ils]  # only inlines
>>> mdio[:, xls]  # only crosslines
>>> mdio[ils, xls]  # intersection of the lines

Note that some fancy-indexing may not work with the Zarr backend. The Dask backend is more flexible when it comes to indexing.

If we are querying all dimensions of a 3D array, we can omit the dimensions argument.

>>> mdio.coord_to_index(10, 5, [50, 100])
(array([8], dtype=uint16),
 array([3], dtype=uint16),
 array([25, 50], dtype=uint16))

copy(dest_path_or_buffer, excludes='', includes='', storage_options=None, overwrite=False)#

Makes a copy of an MDIO file with or without all arrays.

Refer to mdio.api.convenience.copy for full documentation.

Parameters
  • dest_path_or_buffer (str) – Destination path. Could be any FSSpec mapping.

  • excludes (str) – Data to exclude during copy, e.g. chunked_012. The raw data won't be copied, but an empty array will be created to be filled later. If left blank, everything will be copied.

  • includes (str) – Data to include during copy, e.g. trace_headers. If this is not specified and certain data is excluded, headers will not be copied. If you want to preserve headers, specify trace_headers. If left blank, everything except what is specified in the excludes parameter will be copied.

  • storage_options (Optional[dict]) – Storage options for the cloud storage backend. Default is None (will assume anonymous).

  • overwrite (bool) – Overwrite destination or not.
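
As a sketch based on the parameters above (the paths and array names are hypothetical), a copy that skips the bulk data but preserves headers might look like:

>>> mdio = MDIOReader("prefix/source.mdio")
>>> mdio.copy(
...     "prefix/destination.mdio",
...     excludes="chunked_012",
...     includes="trace_headers",
... )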

property binary_header: dict#

Get seismic binary header metadata.

property chunks: tuple[int, ...]#

Get dataset chunk sizes.

property live_mask: Union[_SupportsArray[dtype], _NestedSequence[_SupportsArray[dtype]], bool, int, float, complex, str, bytes, _NestedSequence[Union[bool, int, float, complex, str, bytes]], Array]#

Get live mask (i.e. not-null value mask).

property n_dim: int#

Get number of dimensions for dataset.

property shape: tuple[int, ...]#

Get shape of dataset.

property stats: dict#

Get global statistics like min/max/rms/std.

property text_header: list#

Get seismic text header.

property trace_count: int#

Get trace count from seismic MDIO.
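
For example, the properties above can be inspected directly on an open accessor (the commented values are illustrative):

>>> mdio = MDIOReader("path_to.mdio")
>>> mdio.shape        # e.g. (345, 188, 1501)
>>> mdio.chunks       # on-disk chunk sizes
>>> mdio.trace_count  # number of live traces
>>> mdio.stats        # global statistics (min/max/rms/std)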

class mdio.api.accessor.MDIOReader(mdio_path_or_buffer, access_pattern='012', storage_options=None, return_metadata=False, new_chunks=None, backend='zarr', memory_cache_size=0, disk_cache=False)#

Read-only accessor for MDIO files.

For detailed documentation see MDIOAccessor.

Parameters
  • mdio_path_or_buffer (str) – Store URL for MDIO file. This can be either on a local disk, or a cloud object store.

  • access_pattern (str) – Chunk access pattern, optional. Default is “012”. Examples: ‘012’, ‘01’, ‘01234’.

  • storage_options (dict) – Options for the storage backend. By default, system-wide credentials will be used. If system-wide credentials are not set and the source is not public, an authentication error will be raised by the backend.

  • return_metadata (bool) – Flag for returning live mask, headers, and traces or just the trace data. Default is False, which means just trace data will be returned.

  • new_chunks (tuple[int, ...]) – Chunk sizes used in Dask backend. Ignored for Zarr backend. By default, the disk-chunks will be used. However, if we want to stream groups of chunks to a Dask worker, we can rechunk here. Then each Dask worker can asynchronously fetch multiple chunks before working.

  • backend (str) – Backend selection, optional. Default is “zarr”. Must be in {‘zarr’, ‘dask’}.

  • memory_cache_size – Maximum size of the in-memory least recently used (LRU) cache, in bytes.

  • disk_cache – Disk cache implemented by fsspec, optional. Default is False, which turns off disk caching. See simplecache from fsspec documentation for more details.

Initialize super class with r permission.

class mdio.api.accessor.MDIOWriter(mdio_path_or_buffer, access_pattern='012', storage_options=None, return_metadata=False, new_chunks=None, backend='zarr', memory_cache_size=0, disk_cache=False)#

Writable accessor for MDIO files.

For detailed documentation see MDIOAccessor.

Parameters
  • mdio_path_or_buffer (str) – Store URL for MDIO file. This can be either on a local disk, or a cloud object store.

  • access_pattern (str) – Chunk access pattern, optional. Default is “012”. Examples: ‘012’, ‘01’, ‘01234’.

  • storage_options (dict) – Options for the storage backend. By default, system-wide credentials will be used. If system-wide credentials are not set and the source is not public, an authentication error will be raised by the backend.

  • return_metadata (bool) – Flag for returning live mask, headers, and traces or just the trace data. Default is False, which means just trace data will be returned.

  • new_chunks (tuple[int, ...]) – Chunk sizes used in Dask backend. Ignored for Zarr backend. By default, the disk-chunks will be used. However, if we want to stream groups of chunks to a Dask worker, we can rechunk here. Then each Dask worker can asynchronously fetch multiple chunks before working.

  • backend (str) – Backend selection, optional. Default is “zarr”. Must be in {‘zarr’, ‘dask’}.

  • memory_cache_size – Maximum size of the in-memory least recently used (LRU) cache, in bytes.

  • disk_cache – Disk cache implemented by fsspec, optional. Default is False, which turns off disk caching. See simplecache from fsspec documentation for more details.

Initialize super class with r+ permission.
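
As a hedged sketch of write access (assuming the accessor supports NumPy-style assignment in "r+" mode; the index and data are illustrative):

>>> import numpy as np
>>> from mdio import MDIOWriter
>>>
>>>
>>> writer = MDIOWriter("my_3d_seismic.mdio")
>>> il_traces = writer[15]                 # read the 15th inline
>>> writer[15] = np.zeros_like(il_traces)  # overwrite it in place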

Data Converters#

Seismic Data#

Note

By default, the SEG-Y ingestion tool uses Python’s multiprocessing to speed up parsing the data. This almost always requires a __main__ guard in any Python script that is executed directly, e.g. python file.py. When running inside Jupyter, this is NOT needed.

if __name__ == "__main__":
    segy_to_mdio(...)

When the CLI is invoked, this is already handled.

See the official multiprocessing documentation for more details.

Conversion from SEG-Y to MDIO.

mdio.converters.segy.segy_to_mdio(segy_path, mdio_path_or_buffer, index_bytes, index_names=None, index_lengths=None, chunksize=None, endian='big', lossless=True, compression_tolerance=0.01, storage_options=None, overwrite=False)#

Convert SEG-Y file to MDIO format.

MDIO allows ingesting flattened seismic surveys in SEG-Y format into a multidimensional tensor that represents the correct geometry of the seismic dataset.

The SEG-Y file must be on disk; MDIO currently does not support reading SEG-Y directly from cloud object stores.

The output MDIO file can be local or on the cloud. For local files, a UNIX or Windows path is sufficient. However, for cloud stores, an appropriate protocol must be provided. See examples for more details.

The SEG-Y headers used for indexing must also be specified. The index byte locations (starting from 1) are the minimum amount of information needed to index the file. However, we suggest giving names to the index dimensions and, if they are not standard, providing the header lengths. By default, all header entries are assumed to be 4 bytes long.

The chunk size depends on the data type, however, it can be chosen to accommodate any workflow’s access patterns. See examples below for some common use cases.

By default, the data is ingested with LOSSLESS compression. This saves disk space in the range of 20% to 40%. MDIO also allows data to be compressed using the ZFP compressor’s fixed-accuracy lossy compression. If the lossless parameter is set to False and MDIO was installed with the lossy extra, the data will be compressed to approximately 30% of its original size and will be perceptually lossless. The compression can be adjusted using the compression_tolerance option (float); higher values compress more, but introduce larger errors.

Parameters
  • segy_path (str) – Path to the input SEG-Y file

  • mdio_path_or_buffer (str) – Output path for MDIO file

  • index_bytes (Sequence[int]) – Tuple of the byte location for the index attributes

  • index_names (Optional[Sequence[str]]) – Tuple of the index names for the index attributes

  • index_lengths (Optional[Sequence[int]]) – Tuple of the byte lengths for the index attributes. Default is 4 bytes for each index key.

  • chunksize (Optional[Sequence[int]]) – Override the default chunk size, which is (64, 64, 64) for 3D and (512, 512) for 2D.

  • endian (str) – Endianness of the input SEG-Y. Rev.2 allows little endian. Default is ‘big’. Must be in {“big”, “little”}

  • lossless (bool) – Lossless Blosc with zstandard, or ZFP with fixed precision.

  • compression_tolerance (float) – Tolerance for ZFP lossy compression, optional. The fixed-accuracy mode in ZFP guarantees there won’t be any errors larger than this value. The default is 0.01, which gives about 70% reduction in size. Ignored if lossless=True.

  • storage_options (Optional[dict[str, Any]]) – Storage options for the cloud storage backend. Default is None (will assume anonymous)

  • overwrite (bool) – Toggle for overwriting existing store

Raises
  • GridTraceCountError – Raised if grid won’t hold all traces in the SEG-Y file.

  • ValueError – If the length of the chunk sizes doesn’t match the number of dimensions.

  • NotImplementedError – If chunking cannot be determined automatically for 4D+ data.

Return type

None

Examples

If we are working locally and ingesting a 3D post-stack seismic file, we can use the following example. This will ingest the data with the default chunk size.

>>> from mdio import segy_to_mdio
>>>
>>>
>>> segy_to_mdio(
...     segy_path="prefix1/file.segy",
...     mdio_path_or_buffer="prefix2/file.mdio",
...     index_bytes=(189, 193),
...     index_names=("inline", "crossline")
... )

If we are on Amazon Web Services, we can do it like below. The protocol before the URL can be s3 for AWS, gcs for Google Cloud, and abfs for Microsoft Azure. In this example we also change the chunk size as a demonstration.

>>> segy_to_mdio(
...     segy_path="prefix/file.segy",
...     mdio_path_or_buffer="s3://bucket/file.mdio",
...     index_bytes=(189, 193),
...     index_names=("inline", "crossline"),
...     chunksize=(64, 64, 512),
... )

Another example, loading a 4D seismic dataset such as 3D pre-stack gathers, is below. This will allow us to extract offset planes efficiently or to work on local neighborhoods of the data.

>>> segy_to_mdio(
...     segy_path="prefix/file.segy",
...     mdio_path_or_buffer="s3://bucket/file.mdio",
...     index_bytes=(189, 193, 37),
...     index_names=("inline", "crossline", "offset"),
...     chunksize=(16, 16, 16, 512),
... )
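
If lossy compression is acceptable, the ZFP path described above can be requested at ingestion time (a sketch; it assumes MDIO was installed with the lossy extra, and the tolerance value is illustrative):

>>> segy_to_mdio(
...     segy_path="prefix/file.segy",
...     mdio_path_or_buffer="prefix/file.mdio",
...     index_bytes=(189, 193),
...     index_names=("inline", "crossline"),
...     lossless=False,
...     compression_tolerance=0.01,
... )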

Conversion from MDIO to various other formats.

mdio.converters.mdio.mdio_to_segy(mdio_path_or_buffer, output_segy_path, endian='big', access_pattern='012', out_sample_format='ibm32', storage_options=None, new_chunks=None, selection_mask=None, client=None)#

Convert MDIO file to SEG-Y format.

MDIO allows exporting multidimensional seismic data back to the flattened seismic format SEG-Y, to be used in data transmission.

The input headers are preserved as is, and will be transferred to the output file.

The user has control over the endianness and the floating point data type. However, by default we export as big-endian IBM floats, per the SEG-Y standard.

The input MDIO can be local or cloud based. However, the output SEG-Y will be generated locally.

A selection_mask can be provided (in the shape of the spatial grid) to export a subset of the seismic data.

Parameters
  • mdio_path_or_buffer (str) – Input path where the MDIO is located

  • output_segy_path (str) – Path to the output SEG-Y file

  • endian (str) – Endianness of the output SEG-Y. Rev.2 allows little endian. Default is ‘big’.

  • access_pattern (str) – This specifies the chunk access pattern. The underlying zarr.Array must exist. Examples: ‘012’, ‘01’

  • out_sample_format (str) – Output sample format. Currently supported: {‘ibm32’, ‘float32’}. Default is ‘ibm32’.

  • storage_options (dict) – Storage options for the cloud storage backend. Default: None (will assume anonymous access)

  • new_chunks (tuple[int, ...]) – Set manual chunksize. For development purposes only.

  • selection_mask (np.ndarray) – Boolean array, in the shape of the spatial grid, that selects the subset of traces to export.

  • client (distributed.Client) – Dask client. If None, the local threaded scheduler will be used. If ‘auto’, multiple processes will be created (with 8 threads each).

Raises
  • ImportError – If the distributed package is requested but not installed.

  • ValueError – if cut mask is empty, i.e. no traces will be written.

Return type

None

Examples

To export an existing local MDIO file to SEG-Y we use the code snippet below. This will export the full MDIO (without padding) to SEG-Y format using IBM floats and big-endian byte order.

>>> from mdio import mdio_to_segy
>>>
>>>
>>> mdio_to_segy(
...     mdio_path_or_buffer="prefix2/file.mdio",
...     output_segy_path="prefix/file.segy",
... )

If we want to export this as big-endian IEEE floats, using a selection mask, we would run:

>>> mdio_to_segy(
...     mdio_path_or_buffer="prefix2/file.mdio",
...     output_segy_path="prefix/file.segy",
...     selection_mask=boolean_mask,
...     out_sample_format="float32",
... )
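
The boolean_mask above is not defined in this snippet. As a hedged sketch, it could be built to match the spatial grid of a 3D survey, for example selecting a rectangular sub-area (the index ranges are illustrative):

>>> import numpy as np
>>> from mdio import MDIOReader
>>>
>>>
>>> mdio = MDIOReader("prefix2/file.mdio")
>>> boolean_mask = np.zeros(mdio.shape[:-1], dtype=bool)  # spatial grid only
>>> boolean_mask[50:100, 20:80] = True  # keep a block of inlines/crosslines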

Core Functionality#

Dimensions#

Dimension (grid) abstraction and serializers.

class mdio.core.dimension.Dimension(coords, name)#

Dimension class.

Dimension has a name and coordinates associated with it. The Dimension coordinates can only be a vector.

Parameters
  • coords (list | tuple | numpy.ndarray[Any, numpy.dtype[numpy._typing._generic_alias.ScalarType]] | range) – Vector of coordinates.

  • name (str) – Name of the dimension.
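
A minimal usage sketch (the coordinate values are illustrative):

>>> from mdio.core.dimension import Dimension
>>>
>>>
>>> dim = Dimension(coords=range(10, 60, 10), name="inline")
>>> dim.size       # 5
>>> dim.min()      # smallest coordinate (10)
>>> dim.to_dict()  # dictionary representation of the dimension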

classmethod deserialize(stream, stream_format)#

Deserialize buffer into Dimension.

Parameters
  • stream (str) –

  • stream_format (str) –

Return type

Dimension

classmethod from_dict(other)#

Make dimension from dictionary.

Parameters

other (dict[str, Any]) –

Return type

Dimension

max()#

Get maximum value of dimension.

Return type

ndarray[Any, dtype[float]]

min()#

Get minimum value of dimension.

Return type

ndarray[Any, dtype[float]]

serialize(stream_format)#

Serialize the dimension into buffer.

Parameters

stream_format (str) –

Return type

str

to_dict()#

Convert dimension to dictionary.

Return type

dict[str, Any]

property size: int#

Size of the dimension.

class mdio.core.dimension.DimensionSerializer(stream_format)#

Serializer implementation for Dimension.

Initialize serializer.

Parameters

stream_format (str) – Stream format. Must be in {“JSON”, “YAML”}.

deserialize(stream)#

Deserialize buffer into Dimension.

Parameters

stream (str) –

Return type

Dimension

serialize(dimension)#

Serialize Dimension into buffer.

Parameters

dimension (Dimension) –

Return type

str
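
Based on the signatures above, a serialization round-trip might look like this (a sketch; the exact layout of the JSON stream is not shown):

>>> from mdio.core.dimension import Dimension, DimensionSerializer
>>>
>>>
>>> serializer = DimensionSerializer(stream_format="JSON")
>>> stream = serializer.serialize(Dimension(coords=range(10, 60, 10), name="inline"))
>>> roundtrip = serializer.deserialize(stream)  # back to a Dimension object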

Data I/O#

(De)serialization factory design pattern.

Current support for JSON and YAML.

class mdio.core.serialization.Serializer(stream_format)#

Serializer base class.

Here we define the interface for any serializer implementation.

Parameters

stream_format (str) – Format of the stream for serialization.

Initialize serializer.

Parameters

stream_format (str) – Stream format. Must be in {“JSON”, “YAML”}.

abstract deserialize(stream)#

Abstract method for deserialize.

Parameters

stream (str) –

Return type

dict

abstract serialize(payload)#

Abstract method for serialize.

Parameters

payload (dict) –

Return type

str

static validate_payload(payload, signature)#

Validate if required keys exist in the payload for a function signature.

Parameters
  • payload (dict) –

  • signature (Signature) –

Return type

dict

mdio.core.serialization.get_deserializer(stream_format)#

Get deserializer based on format.

Parameters

stream_format (str) –

Return type

Callable

mdio.core.serialization.get_serializer(stream_format)#

Get serializer based on format.

Parameters

stream_format (str) –

Return type

Callable