API Reference

Readers / Writers

MDIO accessor APIs.

class mdio.api.accessor.MDIOAccessor(mdio_path_or_buffer, mode, access_pattern, storage_options, return_metadata, new_chunks, backend, disk_cache)

Accessor class for MDIO files.

The accessor can be used to read and write MDIO files. It allows you to open an MDIO file in several mode and access_pattern combinations.

Access pattern defines the dimensions that are chunked. For instance if you have a 3-D array that is chunked in every direction (i.e. a 3-D seismic stack consisting of inline, crossline, and sample dimensions) its access pattern would be “012”. If it was only chunked in the first two dimensions (i.e. seismic inline and crossline), it would be “01”.

By default, MDIO will try to open with “012” access pattern, and will raise an error if that pattern doesn’t exist.

After the dataset is opened, slicing the accessor returns either just the seismic trace data as a NumPy array, or a tuple of live mask, headers, and traces as NumPy arrays, depending on the return_metadata parameter.

Regarding object store access: if user credentials have been set system-wide on the local machine or VM, there is no need to specify credentials. However, the storage_options option allows users to specify credentials for the store being accessed. Please see the fsspec documentation for configuring storage options.

MDIO currently supports Zarr and Dask backends. The Zarr backend is useful for reading small amounts of data with minimal overhead. However, the Dask backend, combined with larger chunk sizes via the new_chunks argument, can read the data in parallel using a Dask LocalCluster or a distributed cluster.
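For example, reading with the Dask backend and coarser chunks might look like this (a minimal sketch; the file name and chunk sizes are illustrative):

>>> from mdio import MDIOReader
>>>
>>> mdio = MDIOReader(
...     "my_3d_seismic.mdio",
...     backend="dask",
...     new_chunks=(128, 128, 128),  # stream larger blocks to each Dask worker
... )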

The accessor also allows users to enable fsspec caching. This is particularly useful when accessing data from a high-latency store, such as an object store or a mounted network drive. The disk_cache option fetches chunks to the local temporary directory for faster repeated access.
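As a sketch, combining explicit credentials with local disk caching for an object store might look like this (the bucket path and credential keys are illustrative fsspec/s3fs options):

>>> mdio = MDIOReader(
...     "s3://bucket/file.mdio",
...     storage_options={"key": "my_key", "secret": "my_secret"},
...     disk_cache=True,  # cache fetched chunks in the local temporary directory
... )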

Parameters:
  • mdio_path_or_buffer (str) – Store or URL for MDIO file.

  • mode (str) – Read or read/write mode. The file must exist. Options are in {‘r’, ‘r+’, ‘w’}. ‘r’ is read only, ‘r+’ is append mode where only existing arrays can be modified, ‘w’ is similar to ‘r+’ but rechunking or other file-wide operations are allowed.

  • access_pattern (str) – Chunk access pattern, optional. Default is “012”. Examples: ‘012’, ‘01’.

  • storage_options (dict | None) – Options for the storage backend. By default, system-wide credentials will be used.

  • return_metadata (bool) – Flag for returning live mask, headers, and traces or just the trace data. Default is False, which means just trace data will be returned.

  • new_chunks (tuple[int, ...] | None) – Chunk sizes used in Dask backend. Ignored for Zarr backend. By default, the disk-chunks will be used. However, if we want to stream groups of chunks to a Dask worker, we can rechunk here. Then each Dask worker can asynchronously fetch multiple chunks before working.

  • backend (str) – Backend selection, optional. Default is “zarr”. Must be in {‘zarr’, ‘dask’}.

  • disk_cache (bool) – Disk cache implemented by fsspec, optional. Default is False, which turns off disk caching. See simplecache from fsspec documentation for more details.

Raises:

MDIONotFoundError – If the MDIO file can not be opened.

Examples

Assuming we ingested my_3d_seismic.segy as my_3d_seismic.mdio we can open the file in read-only mode like this.

>>> from mdio import MDIOReader
>>>
>>>
>>> mdio = MDIOReader("my_3d_seismic.mdio")

This will open the file with the lazy Zarr backend. To access a specific inline, crossline, or sample index we can do:

>>> inline = mdio[15]  # get the 15th inline
>>> crossline = mdio[:, 15]  # get the 15th crossline
>>> samples = mdio[..., 250]  # get the 250th sample slice

The above variables will be NumPy arrays of the relevant trace data. If we want to retrieve the live mask and trace headers for our slicing, we need to open the file with the return_metadata option.

>>> mdio = MDIOReader("my_3d_seismic.mdio", return_metadata=True)

Then we can fetch the data like this (for inline):

>>> il_live, il_headers, il_traces = mdio[15]

Since MDIOAccessor returns a tuple with these three Numpy arrays, we can directly unpack it and use it further down our code.

coord_to_index(*args, dimensions=None)

Convert dimension coordinate to zero-based index.

The coordinate labels of the array dimensions are converted to zero-based indices. For instance if we have an inline dimension like this:

[10, 20, 30, 40, 50]

then the indices would be:

[0, 1, 2, 3, 4]

This method converts from coordinate labels of a dimension to equivalent indices.

Multiple dimensions can be queried at the same time, see the examples.

Parameters:
  • *args (list[int] | int) – Variable length argument queries.

  • dimensions (str | list[str] | None) – Name of the dimensions to query. If not provided, it will query all dimensions in the grid and will require len(args) == grid.ndim

Returns:

Zero-based indices of coordinates. Each item in the result corresponds to the indices of that dimension.

Raises:
  • ShapeError – If the number of queries doesn't match the requested dimensions.

  • ValueError – If the requested coordinates don't exist.

Return type:

tuple[NDArray[int], …]

Examples

Opening an MDIO file.

>>> from mdio import MDIOReader
>>>
>>>
>>> mdio = MDIOReader("path_to.mdio")
>>> mdio.coord_to_index([10, 7, 15], dimensions='inline')
array([ 8, 5, 13], dtype=uint16)
>>> ils, xls = [10, 7, 15], [5, 10]
>>> mdio.coord_to_index(ils, xls, dimensions=['inline', 'crossline'])
(array([ 8, 5, 13], dtype=uint16), array([3, 8], dtype=uint16))

With the above indices, we can slice the data:

>>> mdio[ils]  # only inlines
>>> mdio[:, xls]  # only crosslines
>>> mdio[ils, xls]  # intersection of the lines

Note that some fancy-indexing may not work with Zarr backend. The Dask backend is more flexible when it comes to indexing.

If we are querying all dimensions of a 3D array, we can omit the dimensions argument.

>>> mdio.coord_to_index(10, 5, [50, 100])
(array([8], dtype=uint16),
 array([3], dtype=uint16),
 array([25, 50], dtype=uint16))
property binary_header: dict

Get seismic binary header metadata.

property chunks: tuple[int, ...]

Get dataset chunk sizes.

property live_mask: npt.ArrayLike | DaskArray

Get live mask (i.e. not-null value mask).

property n_dim: int

Get number of dimensions for dataset.

property shape: tuple[int, ...]

Get shape of dataset.

property stats: dict

Get global statistics like min/max/rms/std.

property text_header: list

Get seismic text header.

property trace_count: int

Get trace count from seismic MDIO.
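As a quick sketch, the metadata properties can be read directly from an open accessor (the "min" key follows the statistics description above):

>>> mdio = MDIOReader("my_3d_seismic.mdio")
>>> n_traces = mdio.trace_count
>>> live = mdio.live_mask[:]        # fetch the live mask
>>> survey_min = mdio.stats["min"]  # global statistics dictionary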

class mdio.api.accessor.MDIOReader(mdio_path_or_buffer, access_pattern='012', storage_options=None, return_metadata=False, new_chunks=None, backend='zarr', disk_cache=False)

Read-only accessor for MDIO files.

For detailed documentation see MDIOAccessor.

Parameters:
  • mdio_path_or_buffer (str) – Store or URL for MDIO file.

  • access_pattern (str) – Chunk access pattern, optional. Default is “012”. Examples: ‘012’, ‘01’.

  • storage_options (dict) – Options for the storage backend. By default, system-wide credentials will be used.

  • return_metadata (bool) – Flag for returning live mask, headers, and traces or just the trace data. Default is False, which means just trace data will be returned.

  • new_chunks (tuple[int, ...]) – Chunk sizes used in Dask backend. Ignored for Zarr backend. By default, the disk-chunks will be used. However, if we want to stream groups of chunks to a Dask worker, we can rechunk here. Then each Dask worker can asynchronously fetch multiple chunks before working.

  • backend (str) – Backend selection, optional. Default is “zarr”. Must be in {‘zarr’, ‘dask’}.

  • disk_cache (bool) – Disk cache implemented by fsspec, optional. Default is False, which turns off disk caching. See simplecache from fsspec documentation for more details.

class mdio.api.accessor.MDIOWriter(mdio_path_or_buffer, access_pattern='012', storage_options=None, return_metadata=False, new_chunks=None, backend='zarr', disk_cache=False)

Writable accessor for MDIO files.

For detailed documentation see MDIOAccessor.

Parameters:
  • mdio_path_or_buffer (str) – Store or URL for MDIO file.

  • access_pattern (str) – Chunk access pattern, optional. Default is “012”. Examples: ‘012’, ‘01’.

  • storage_options (dict) – Options for the storage backend. By default, system-wide credentials will be used.

  • return_metadata (bool) – Flag for returning live mask, headers, and traces or just the trace data. Default is False, which means just trace data will be returned.

  • new_chunks (tuple[int, ...]) – Chunk sizes used in Dask backend. Ignored for Zarr backend. By default, the disk-chunks will be used. However, if we want to stream groups of chunks to a Dask worker, we can rechunk here. Then each Dask worker can asynchronously fetch multiple chunks before working.

  • backend (str) – Backend selection, optional. Default is “zarr”. Must be in {‘zarr’, ‘dask’}.

  • disk_cache (bool) – Disk cache implemented by fsspec, optional. Default is False, which turns off disk caching. See simplecache from fsspec documentation for more details.
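A minimal write sketch, assuming slice assignment mirrors the read-side indexing shown for MDIOAccessor; new_inline is a hypothetical NumPy array whose shape matches the sliced region:

>>> from mdio import MDIOWriter
>>>
>>> mdio = MDIOWriter("my_3d_seismic.mdio")
>>> mdio[15] = new_inline  # overwrite the 15th inline with new trace data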

Data Converters

Seismic Data

Note

By default, the SEG-Y ingestion tool uses Python’s multiprocessing to speed up parsing the data. This almost always requires a __main__ guard in any Python script that is executed directly (e.g. python file.py). When running inside Jupyter, this is NOT needed.

if __name__ == "__main__":
    segy_to_mdio(...)

When the CLI is invoked, this is already handled.

See the official Python multiprocessing documentation for details.

Conversion from SEG-Y to MDIO.

mdio.converters.segy.segy_to_mdio(segy_path, mdio_path_or_buffer, index_bytes, index_names=None, index_types=None, chunksize=None, lossless=True, compression_tolerance=0.01, storage_options_input=None, storage_options_output=None, overwrite=False, grid_overrides=None)

Convert SEG-Y file to MDIO format.

MDIO allows ingesting flattened seismic surveys in SEG-Y format into a multidimensional tensor that represents the correct geometry of the seismic dataset.

The SEG-Y file must be on disk; MDIO currently does not support reading SEG-Y directly from cloud object stores.

The output MDIO file can be local or on the cloud. For local files, a UNIX or Windows path is sufficient. However, for cloud stores, an appropriate protocol must be provided. See examples for more details.

The SEG-Y headers used for indexing must also be specified. The index byte locations (starting from 1) are the minimum information needed to index the file. However, we suggest naming the index dimensions and, if needed, providing the header lengths when they are not standard. By default, all header entries are assumed to be 4 bytes long.

The chunk size depends on the data type; however, it can be chosen to accommodate any workflow’s access patterns. See the examples below for some common use cases.

By default, the data is ingested with LOSSLESS compression. This saves disk space in the range of 20% to 40%. MDIO also allows data to be compressed using the ZFP compressor’s fixed-rate lossy compression. If the lossless parameter is set to False and MDIO was installed with the lossy extra, the data will be compressed to approximately 30% of its original size and will be perceptually lossless. The amount of compression can be adjusted with the compression_tolerance option; higher values compress more but introduce artifacts.

Parameters:
  • segy_path (str | Path) – Path to the input SEG-Y file

  • mdio_path_or_buffer (str | Path) – Output path for the MDIO file, either local or cloud-based (e.g., with s3://, gcs://, or abfs:// protocols).

  • index_bytes (Sequence[int]) – Tuple of the byte location for the index attributes

  • index_names (Sequence[str] | None) – List of names for the index dimensions. If not provided, defaults to dim_0, dim_1, …, with the last dimension named sample.

  • index_types (Sequence[str] | None) – Tuple of the data types for the index attributes. Must be in {“int16”, “int32”, “float16”, “float32”, “ibm32”}. Default is 4-byte integers for each index key.

  • chunksize (tuple[int, ...] | None) – Tuple specifying the chunk sizes for each dimension of the array. It must match the number of dimensions in the input array.

  • lossless (bool) – If True, uses lossless Blosc compression with zstandard. If False, uses ZFP lossy compression (requires zfpy library).

  • compression_tolerance (float) – Tolerance for ZFP compression in lossy mode. Ignored if lossless=True. Default is 0.01, providing ~70% size reduction.

  • storage_options_input (dict[str, Any] | None) – Dictionary of storage options for the SEG-Y input file (e.g., cloud credentials). Defaults to None.

  • storage_options_output (dict[str, Any] | None) – Dictionary of storage options for the MDIO output file (e.g., cloud credentials). Defaults to None.

  • overwrite (bool) – If True, overwrites existing MDIO file at the specified path.

  • grid_overrides (dict | None) – Option to add grid overrides. See examples.

Raises:
  • GridTraceCountError – Raised if grid won’t hold all traces in the SEG-Y file.

  • ValueError – If the length of the chunk sizes doesn’t match the number of dimensions.

  • NotImplementedError – If chunking cannot be determined automatically for 4D+ data.

Return type:

None

Examples

If we are working locally and ingesting a 3D post-stack seismic file, we can use the following example. This will ingest with default chunks of 128 x 128 x 128.

>>> from mdio import segy_to_mdio
>>>
>>>
>>> segy_to_mdio(
...     segy_path="prefix1/file.segy",
...     mdio_path_or_buffer="prefix2/file.mdio",
...     index_bytes=(189, 193),
...     index_names=("inline", "crossline")
... )

If we are on Amazon Web Services, we can do it like below. The protocol before the URL can be s3 for AWS, gcs for Google Cloud, and abfs for Microsoft Azure. In this example we also change the chunk size as a demonstration.

>>> segy_to_mdio(
...     segy_path="prefix/file.segy",
...     mdio_path_or_buffer="s3://bucket/file.mdio",
...     index_bytes=(189, 193),
...     index_names=("inline", "crossline"),
...     chunksize=(64, 64, 512),
... )

Another example, loading 4D seismic data such as 3D pre-stack gathers, is shown below. This chunking lets us extract offset planes efficiently or work on local neighborhoods of the data.

>>> segy_to_mdio(
...     segy_path="prefix/file.segy",
...     mdio_path_or_buffer="s3://bucket/file.mdio",
...     index_bytes=(189, 193, 37),
...     index_names=("inline", "crossline", "offset"),
...     chunksize=(16, 16, 16, 512),
... )

We can override the dataset grid by the grid_overrides parameter. This allows us to ingest files that don’t conform to the true geometry of the seismic acquisition.

For example, if we are ingesting 3D seismic shots that don’t have a cable number and the channel numbers are sequential (i.e. each cable doesn’t start with channel number 1), we can tell MDIO to ingest this with the correct geometry by calculating cable numbers and wrapped channel numbers. Note the missing byte location and word length for the “cable” index.

>>> segy_to_mdio(
...     segy_path="prefix/shot_file.segy",
...     mdio_path_or_buffer="s3://bucket/shot_file.mdio",
...     index_bytes=(17, None, 13),
...     index_lengths=(4, None, 4),
...     index_names=("shot", "cable", "channel"),
...     chunksize=(8, 2, 128, 1024),
...     grid_overrides={
...         "ChannelWrap": True, "ChannelsPerCable": 800,
...         "CalculateCable": True
...     },
... )

If we do have cable numbers in the headers, but the channels are still sequential (i.e., unwrapped), we can still ingest it like this.

>>> segy_to_mdio(
...     segy_path="prefix/shot_file.segy",
...     mdio_path_or_buffer="s3://bucket/shot_file.mdio",
...     index_bytes=(17, 137, 13),
...     index_lengths=(4, 2, 4),
...     index_names=("shot_point", "cable", "channel"),
...     chunksize=(8, 2, 128, 1024),
...     grid_overrides={"ChannelWrap": True, "ChannelsPerCable": 800},
... )

For shot gathers with channel numbers and wrapped channels, no grid overrides are necessary.

In cases where the user does not know whether the input has unwrapped channels, but wants to store it with a wrapped channel index, use:

>>> grid_overrides = {
...     "AutoChannelWrap": True,
...     "AutoChannelTraceQC": 1000000
... }

For ingestion of pre-stack streamer data, where the user needs to access/index common-channel gathers (single gun), the following strategy can be used to ingest densely while indexing on gun number:

>>> segy_to_mdio(
...     segy_path="prefix/shot_file.segy",
...     mdio_path_or_buffer="s3://bucket/shot_file.mdio",
...     index_bytes=(133, 171, 17, 137, 13),
...     index_lengths=(2, 2, 4, 2, 4),
...     index_names=("shot_line", "gun", "shot_point", "cable", "channel"),
...     chunksize=(1, 1, 8, 1, 128, 1024),
...     grid_overrides={
...         "AutoShotWrap": True,
...         "AutoChannelWrap": True,
...         "AutoChannelTraceQC":  1000000
...     },
... )

For AutoShotWrap and AutoChannelWrap to work, the user must provide the “shot_line”, “gun”, “shot_point”, “cable”, and “channel” indices. To tune performance, consider changing the chunksize to (1, 1, 32, 1, 32, 2048) for balanced common-shot and common-channel access, or (1, 1, 128, 1, 1, 2048) for common-channel access only.

For cases with no well-defined trace header for indexing, a NonBinned grid override is provided. This creates the index and assigns an incrementing integer to each trace within an ensemble on a first-in, first-out basis. For example, a CDP- and offset-keyed file might carry offset as a real-world offset header, which would result in a very sparsely populated index. Instead, the following override creates a new index from 1 to N, where N is the number of offsets within a CDP ensemble. The auto-generated index is called “trace”. Note the required “chunksize” parameter in the grid override: the chunk size of the non-binned dimension is independent of the other index dimensions’ chunk sizes, so it must be specified in the grid override itself. Also note the absence of an offset index; we index only CDP, provide the CDP header type, and give chunk sizes only for the CDP and sample dimensions. The configuration below yields 1 MB chunks:

>>> segy_to_mdio(
...     segy_path="prefix/cdp_offset_file.segy",
...     mdio_path_or_buffer="s3://bucket/cdp_offset_file.mdio",
...     index_bytes=(21,),
...     index_types=("int32",),
...     index_names=("cdp",),
...     chunksize=(4, 1024),
...     grid_overrides={"NonBinned": True, "chunksize": 64},
... )

A more complicated case, where a 5D dataset is not binned in the offset and azimuth directions, can be ingested as shown below. However, the offset and azimuth dimensions will be combined into a single “trace” dimension. The configuration below yields 1 MB chunks:

>>> segy_to_mdio(
...     segy_path="prefix/cdp_offset_file.segy",
...     mdio_path_or_buffer="s3://bucket/cdp_offset_file.mdio",
...     index_bytes=(189, 193),
...     index_types=("int32", "int32"),
...     index_names=("inline", "crossline"),
...     chunksize=(4, 4, 1024),
...     grid_overrides={"NonBinned": True, "chunksize": 64},
... )

For datasets with expected duplicate traces, we have the following parameterization. This uses the same logic as NonBinned with a fixed chunksize of 1. The other keys are still important. The example below allows multiple traces per receiver (i.e. a reshoot).

>>> segy_to_mdio(
...     segy_path="prefix/cdp_offset_file.segy",
...     mdio_path_or_buffer="s3://bucket/cdp_offset_file.mdio",
...     index_bytes=(9, 213, 13),
...     index_types=("int32", "int16", "int32"),
...     index_names=("shot", "cable", "chan"),
...     chunksize=(8, 2, 256, 512),
...     grid_overrides={"HasDuplicates": True},
... )

Conversion from MDIO to various other formats.

mdio.converters.mdio.mdio_to_segy(mdio_path_or_buffer, output_segy_path, endian='big', access_pattern='012', storage_options=None, new_chunks=None, selection_mask=None, client=None)

Convert MDIO file to SEG-Y format.

We export N-D seismic data to the flattened SEG-Y format used in data transmission.

The input headers are preserved as is, and will be transferred to the output file.

Input MDIO can be local or cloud based. However, the output SEG-Y will be generated locally.

A selection_mask can be provided (same shape as spatial grid) to export a subset.

Parameters:
  • mdio_path_or_buffer (str) – Input path where the MDIO is located.

  • output_segy_path (str) – Path to the output SEG-Y file.

  • endian (str) – Endianness of the output SEG-Y file. Rev. 2 allows little endian. Default is ‘big’.

  • access_pattern (str) – This specifies the chunk access pattern. The underlying zarr.Array must exist. Examples: ‘012’, ‘01’.

  • storage_options (dict) – Storage options for the cloud storage backend. Default: None (anonymous)

  • new_chunks (tuple[int, ...]) – Set manual chunksize. For development purposes only.

  • selection_mask (np.ndarray) – Array, with the same shape as the spatial grid, that marks the subset of traces to export.

  • client (distributed.Client) – Dask client. If None, the local threaded scheduler is used. If ‘auto’ is used, multiple processes are created (with 8 threads each).

Raises:
  • ImportError – If a distributed client was requested but the distributed package isn’t installed.

  • ValueError – If the selection mask is empty, i.e. no traces would be written.

Return type:

None

Examples

To export an existing local MDIO file to SEG-Y we use the code snippet below. This will export the full MDIO (without padding) to SEG-Y format using IBM floats and big-endian byte order.

>>> from mdio import mdio_to_segy
>>>
>>>
>>> mdio_to_segy(
...     mdio_path_or_buffer="prefix2/file.mdio",
...     output_segy_path="prefix/file.segy",
... )

If we want to export a subset of the traces using a selection mask (here boolean_mask is a pre-computed boolean array over the spatial grid), we would run:

>>> mdio_to_segy(
...     mdio_path_or_buffer="prefix2/file.mdio",
...     output_segy_path="prefix/file.segy",
...     selection_mask=boolean_mask,
... )
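If the optional distributed package is installed, a Dask client can be passed to parallelize the export. A sketch, assuming a default local cluster:

>>> from distributed import Client
>>>
>>> client = Client()  # local Dask cluster with default settings
>>> mdio_to_segy(
...     mdio_path_or_buffer="prefix2/file.mdio",
...     output_segy_path="prefix/file.segy",
...     client=client,
... )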

Conversion from Numpy to MDIO.

mdio.converters.numpy.numpy_to_mdio(array, mdio_path_or_buffer, chunksize, index_names=None, index_coords=None, header_dtype=None, lossless=True, compression_tolerance=0.01, storage_options=None, overwrite=False)

Conversion from NumPy array to MDIO format.

This function converts a NumPy array into the MDIO format. The conversion organizes the input array into a multidimensional tensor with the specified indexing and compression options.

Parameters:
  • array (NDArray) – Input NumPy array to be converted to MDIO format.

  • mdio_path_or_buffer (str) – Output path for the MDIO file, either local or cloud-based (e.g., with s3://, gcs://, or abfs:// protocols).

  • chunksize (tuple[int, ...]) – Tuple specifying the chunk sizes for each dimension of the array. It must match the number of dimensions in the input array.

  • index_names (list[str] | None) – List of names for the index dimensions. If not provided, defaults to dim_0, dim_1, …, with the last dimension named sample.

  • index_coords (dict[str, NDArray] | None) – Dictionary mapping dimension names to their coordinate arrays. If not provided, defaults to sequential integers (0 to size-1) for each dimension.

  • header_dtype (DTypeLike | None) – Data type for trace headers, if applicable. Defaults to None.

  • lossless (bool) – If True, uses lossless Blosc compression with zstandard. If False, uses ZFP lossy compression (requires zfpy library).

  • compression_tolerance (float) – Tolerance for ZFP compression in lossy mode. Ignored if lossless=True. Default is 0.01, providing ~70% size reduction.

  • storage_options (dict[str, Any] | None) – Dictionary of storage options for the MDIO output file (e.g., cloud credentials). Defaults to None (anonymous access).

  • overwrite (bool) – If True, overwrites existing MDIO file at the specified path.

Raises:

ValueError – When the length of chunksize does not match the number of dimensions in the input array, or if an element of index_names is not included in the index_coords dictionary. Also raised when the size of a coordinate array does not match the corresponding dimension.

Return type:

None

Examples

To convert a 3D NumPy array to MDIO format locally with 64 x 64 x 64 chunks and named dimensions:

>>> import numpy as np
>>> from mdio.converters import numpy_to_mdio
>>>
>>> array = np.random.rand(100, 200, 300)
>>> numpy_to_mdio(
...     array=array,
...     mdio_path_or_buffer="output/file.mdio",
...     chunksize=(64, 64, 64),
...     index_names=["inline", "crossline", "sample"],
... )

For a cloud-based output on AWS S3 with custom coordinates:

>>> coords = {
...     "inline": np.arange(0, 100, 2),
...     "crossline": np.arange(0, 200, 4),
...     "sample": np.linspace(0, 0.3, 300),
... }
>>> numpy_to_mdio(
...     array=array,
...     mdio_path_or_buffer="s3://bucket/file.mdio",
...     chunksize=(32, 32, 128),
...     index_names=["inline", "crossline", "sample"],
...     index_coords=coords,
...     lossless=False,
...     compression_tolerance=0.01,
... )

To convert a 2D array with default indexing and lossless compression:

>>> array_2d = np.random.rand(500, 1000)
>>> numpy_to_mdio(
...     array=array_2d,
...     mdio_path_or_buffer="output/file_2d.mdio",
...     chunksize=(512, 512),
... )

Convenience Functions

Convenience APIs for working with MDIO files.

mdio.api.convenience.copy_mdio(source_path, target_path, overwrite=False, copy_traces=False, copy_headers=False, storage_options_input=None, storage_options_output=None)

Copy MDIO file.

This function copies an MDIO file from a source path to a target path, optionally including trace data, headers, or both, for all access patterns. It creates a new MDIO file at the target path with the same structure as the source, and selectively copies data based on the provided flags. The function supports custom storage options for both input and output, enabling compatibility with various filesystems via FSSpec.

Parameters:
  • source_path (str) – Source MDIO path. Data will be copied from here.

  • target_path (str) – Destination path. Could be any FSSpec mapping.

  • overwrite (bool) – Overwrite destination or not.

  • copy_traces (bool) – Flag to enable copying trace data for all access patterns.

  • copy_headers (bool) – Flag to enable copying headers for all access patterns.

  • storage_options_input (dict[str, Any] | None) – Storage options for input MDIO.

  • storage_options_output (dict[str, Any] | None) – Storage options for output MDIO.

Return type:

None
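Examples

To copy an MDIO file including its trace data and headers (a sketch; paths are illustrative):

>>> from mdio.api.convenience import copy_mdio
>>>
>>> copy_mdio(
...     source_path="prefix/file.mdio",
...     target_path="s3://bucket/file_copy.mdio",
...     copy_traces=True,
...     copy_headers=True,
... )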

mdio.api.convenience.rechunk(source, chunks, suffix, compressor=None, overwrite=False)

Rechunk MDIO file adding a new variable.

Parameters:
  • source (MDIOAccessor) – MDIO accessor instance. Data will be copied from here.

  • chunks (tuple[int, ...]) – Tuple containing chunk sizes for new rechunked array.

  • suffix (str) – Suffix to append to new rechunked array.

  • compressor (Codec | None) – Data compressor to use, optional. Default is Blosc(‘zstd’).

  • overwrite (bool) – Overwrite destination or not.

Return type:

None

Examples

To rechunk a single variable, we can do this:

>>> accessor = MDIOAccessor(...)
>>> rechunk(accessor, (1, 1024, 1024), suffix="fast_il")

mdio.api.convenience.rechunk_batch(source, chunks_list, suffix_list, compressor=None, overwrite=False)

Rechunk MDIO file to multiple variables, reading it once.

Parameters:
  • source (MDIOAccessor) – MDIO accessor instance. Data will be copied from here.

  • chunks_list (list[tuple[int, ...]]) – List of tuples containing new chunk sizes.

  • suffix_list (list[str]) – List of suffixes to append to the new rechunked arrays.

  • compressor (Codec | None) – Data compressor to use, optional. Default is Blosc(‘zstd’).

  • overwrite (bool) – Overwrite destination or not.

Return type:

None

Examples

To rechunk multiple variables, we can do something like this:

>>> accessor = MDIOAccessor(...)
>>> rechunk_batch(
...     accessor,
...     chunks_list=[(1, 1024, 1024), (1024, 1, 1024), (1024, 1024, 1)],
...     suffix_list=["fast_il", "fast_xl", "fast_z"],
... )

Core Functionality

Dimensions

Dimension (grid) abstraction and serializers.

class mdio.core.dimension.Dimension(coords, name)

Dimension class.

Dimension has a name and coordinates associated with it. The Dimension coordinates can only be a vector.

Parameters:
  • coords (list | tuple | NDArray | range) – Vector of coordinates.

  • name (str) – Name of the dimension.

coords

Vector of coordinates.

Type:

list | tuple | NDArray | range

name

Name of the dimension.

Type:

str

classmethod deserialize(stream, stream_format)

Deserialize buffer into Dimension.

Parameters:
  • stream (str)

  • stream_format (str)

Return type:

Dimension

classmethod from_dict(other)

Make dimension from dictionary.

Parameters:

other (dict[str, Any])

Return type:

Dimension

max()

Get maximum value of dimension.

Return type:

NDArray[float]

min()

Get minimum value of dimension.

Return type:

NDArray[float]

serialize(stream_format)

Serialize the dimension into buffer.

Parameters:

stream_format (str)

Return type:

str

to_dict()

Convert dimension to dictionary.

Return type:

dict[str, Any]

property size: int

Size of the dimension.
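Examples

A short usage sketch with illustrative inline coordinates:

>>> from mdio.core.dimension import Dimension
>>>
>>> inline_dim = Dimension(coords=range(10, 60, 10), name="inline")
>>> n_inlines = inline_dim.size  # 5 coordinate values
>>> buffer = inline_dim.serialize(stream_format="JSON")
>>> roundtrip = Dimension.deserialize(buffer, stream_format="JSON")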

class mdio.core.dimension.DimensionSerializer(stream_format)

Serializer implementation for Dimension.

Parameters:

stream_format (str)

deserialize(stream)

Deserialize buffer into Dimension.

Parameters:

stream (str)

Return type:

Dimension

serialize(dimension)

Serialize Dimension into buffer.

Parameters:

dimension (Dimension)

Return type:

str

Creation

Module for creating empty MDIO datasets.

This module provides tools to configure and initialize empty MDIO datasets, which are used for storing multidimensional data with associated metadata. It includes:

  • MDIOVariableConfig: Config for individual variables in the dataset.

  • MDIOCreateConfig: Config for the dataset, including path, grid, and variables.

  • create_empty: Function to create the empty dataset based on provided configuration.

  • create_empty_like: Create an empty dataset with same structure as an existing one.

The create_empty function sets up the Zarr hierarchy with metadata and data groups, creates datasets for each variable and their trace headers, and initializes attributes such as creation time, API version, grid dimensions, and basic statistics.

The create_empty_like function creates a new empty dataset by replicating the structure of an existing MDIO dataset, including its grid, variables, and headers.

For detailed usage and parameters, see the docstring of the create_empty function.

class mdio.core.factory.MDIOCreateConfig(path, grid, variables)

Configuration for creating an MDIO dataset.

This dataclass encapsulates the parameters needed to create an MDIO dataset, including the storage path, grid specification, and a list of variable configurations.

Parameters:
path

The file path or URI where the MDIO dataset will be created.

Type:

str

grid

The grid specification defining the dataset’s spatial structure.

Type:

mdio.core.grid.Grid

variables

A list of configurations for variables to be included in dataset.

Type:

list[mdio.core.factory.MDIOVariableConfig]

class mdio.core.factory.MDIOVariableConfig(name, dtype, chunks=None, compressors=None, header_dtype=None)

Configuration for creating an MDIO variable.

This dataclass defines the parameters required to configure a variable in an MDIO dataset, including its name, data type, chunking strategy, compression method, and optional header data type.

Parameters:
name

The name of the variable.

Type:

str

dtype

The data type of the variable (e.g., ‘float32’, ‘int16’).

Type:

str

chunks

The chunk size for the variable along each dimension.

Type:

tuple[int, …] | None

compressors

The compression algorithm(s) to use.

Type:

collections.abc.Iterable[dict[str, str | int | float | collections.abc.Mapping[str, JSON] | collections.abc.Sequence[JSON] | None] | zarr.abc.codec.BytesBytesCodec | numcodecs.abc.Codec] | dict[str, str | int | float | collections.abc.Mapping[str, JSON] | collections.abc.Sequence[JSON] | None] | zarr.abc.codec.BytesBytesCodec | numcodecs.abc.Codec | Literal[‘auto’] | None

header_dtype

The data type for the variable’s header.

Type:

numpy.dtype[Any] | None | type[Any] | numpy._typing._dtype_like._SupportsDType[numpy.dtype[Any]] | str | tuple[Any, int] | tuple[Any, SupportsIndex | collections.abc.Sequence[SupportsIndex]] | list[Any] | numpy._typing._dtype_like._DTypeDict | tuple[Any, Any]

mdio.core.factory.create_empty(config, overwrite=False, storage_options=None, consolidate_meta=True)

Create an empty MDIO dataset.

This function initializes a new MDIO dataset at the specified path based on the provided configuration. It constructs a Zarr hierarchy with groups for metadata and data, creates datasets for each variable and its associated trace headers, and sets initial attributes such as creation time, API version, grid dimensions, and basic statistics (all initialized to zero). An empty ‘live_mask’ dataset is also created to track valid traces.

Important: It is up to the user to update live_mask and other attributes.

Parameters:
  • config (MDIOCreateConfig) – Configuration object specifying the dataset’s path, grid structure, and a list of variable configurations (e.g., name, dtype, chunks).

  • overwrite (bool) – If True, overwrites any existing dataset at the specified path. If False, an error is raised if the dataset exists. Defaults to False.

  • storage_options (dict[str, Any] | None) – Options for the storage backend, such as credentials or settings for cloud storage (e.g., S3, GCS). Defaults to None.

  • consolidate_meta (bool) – If True, consolidates metadata into a single file after creation, improving performance for large metadata. Defaults to True.

Returns:

The root Zarr group representing the newly created MDIO dataset.

Return type:

Group
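Examples

A hedged sketch of creating a small empty dataset. It assumes the Grid constructor accepts a list of Dimension objects (see mdio.core.grid for the exact signature); the variable name, sizes, and chunks are illustrative:

>>> from mdio.core.dimension import Dimension
>>> from mdio.core.grid import Grid
>>> from mdio.core.factory import MDIOCreateConfig, MDIOVariableConfig, create_empty
>>>
>>> grid = Grid(
...     dims=[
...         Dimension(coords=range(100), name="inline"),
...         Dimension(coords=range(200), name="crossline"),
...         Dimension(coords=range(1500), name="sample"),
...     ]
... )
>>> variable = MDIOVariableConfig(name="chunked_012", dtype="float32", chunks=(64, 64, 64))
>>> config = MDIOCreateConfig(path="output/empty.mdio", grid=grid, variables=[variable])
>>> root = create_empty(config, overwrite=True)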

mdio.core.factory.create_empty_like(source_path, dest_path, overwrite=False, storage_options_input=None, storage_options_output=None)

Create an empty MDIO dataset with the same structure as an existing one.

This function initializes a new empty MDIO dataset at the specified destination path, replicating the structure of an existing dataset, including its grid, variables, chunking strategy, compression, and headers. It copies metadata such as text and binary headers from the source dataset and sets initial attributes like creation time, API version, and zeroed statistics.

Important: It is up to the user to update headers, live_mask and stats.

Parameters:
  • source_path (str) – The path or URI of the existing MDIO dataset to replicate.

  • dest_path (str) – The path or URI where the new MDIO dataset will be created.

  • overwrite (bool) – If True, overwrites any existing dataset at the destination.

  • storage_options_input (dict[str, Any] | None) – Options for storage backend of the source dataset.

  • storage_options_output (dict[str, Any] | None) – Options for storage backend of the destination dataset.

Return type:

None
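Examples

To replicate the structure of an existing file into a new empty one (a sketch; paths are illustrative):

>>> from mdio.core.factory import create_empty_like
>>>
>>> create_empty_like(
...     source_path="prefix/existing.mdio",
...     dest_path="prefix/empty_copy.mdio",
...     overwrite=True,
... )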

Data I/O

(De)serialization factory design pattern.

Current support for JSON and YAML.

class mdio.core.serialization.Serializer(stream_format)

Serializer base class.

Here we define the interface for any serializer implementation.

Parameters:

stream_format (str) – Stream format. Must be in {“JSON”, “YAML”}.

static validate_payload(payload, signature)

Validate if required keys exist in the payload for a function signature.

Parameters:
  • payload (dict) – Payload to check for required keys.

  • signature – Function signature that the payload must satisfy.

Return type:

dict

abstractmethod deserialize(stream)

Abstract method for deserialize.

Parameters:

stream (str)

Return type:

dict

abstractmethod serialize(payload)

Abstract method for serialize.

Parameters:

payload (dict)

Return type:

str

mdio.core.serialization.get_deserializer(stream_format)

Get deserializer based on format.

Parameters:

stream_format (str)

Return type:

Callable

mdio.core.serialization.get_serializer(stream_format)

Get serializer based on format.

Parameters:

stream_format (str)

Return type:

Callable
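Examples

A round-trip sketch, assuming the returned callables mirror the serialize(payload) -> str and deserialize(stream) -> dict interface defined by Serializer above:

>>> from mdio.core.serialization import get_deserializer, get_serializer
>>>
>>> serialize = get_serializer("JSON")
>>> deserialize = get_deserializer("JSON")
>>> buffer = serialize({"name": "inline", "coords": [10, 20, 30]})
>>> payload = deserialize(buffer)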