Reference#

Readers / Writers#

MDIO accessor APIs.

class mdio.api.accessor.MDIOAccessor(mdio_path_or_buffer, mode, access_pattern, storage_options, return_metadata, new_chunks, backend, memory_cache_size, disk_cache)#

Accessor class for MDIO files.

The accessor can be used to read and write MDIO files. It allows you to open an MDIO file in several mode and access_pattern combinations.

Access pattern defines the dimensions that are chunked. For instance, if you have a 3-D array that is chunked in every direction (e.g. a 3-D seismic stack consisting of inline, crossline, and sample dimensions), its access pattern would be “012”. If it was only chunked in the first two dimensions (i.e. seismic inline and crossline), it would be “01”.

By default, MDIO will try to open with “012” access pattern, and will raise an error if that pattern doesn’t exist.
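As a rough illustration of how an access pattern string could map to a stored array, the sketch below assumes arrays are named `chunked_<pattern>` (the name `chunked_012` appears in the copy documentation further down). The helper `resolve_access_pattern` and the `arrays_in_file` set are hypothetical and are not part of MDIO.

```python
def resolve_access_pattern(pattern: str, available: set[str]) -> str:
    """Return the assumed array name for an access pattern, or raise if missing."""
    name = f"chunked_{pattern}"  # assumed naming convention, not confirmed API
    if name not in available:
        raise KeyError(
            f"Access pattern {pattern!r} not found; available: {sorted(available)}"
        )
    return name


# Hypothetical contents of an MDIO file with two access patterns.
arrays_in_file = {"chunked_012", "chunked_01"}

print(resolve_access_pattern("012", arrays_in_file))
```

Requesting a pattern that is not in the set raises an error, mirroring the behavior described above for a missing “012” pattern.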

After the dataset is opened, slicing the accessor returns either just the seismic trace data as a Numpy array, or a tuple of live mask, headers, and seismic traces as Numpy arrays, depending on the return_metadata parameter.

Regarding object store access, if the user credentials have been set system-wide on the local machine or VM, there is no need to specify credentials. However, the storage_options option allows users to specify credentials for the store that is being accessed. Please see the fsspec documentation for configuring storage options.

MDIO currently supports Zarr and Dask backends. The Zarr backend is useful for reading small amounts of data with minimal overhead. However, by utilizing the Dask backend with a larger chunk size using the new_chunks argument, the data can be read in parallel using a Dask LocalCluster or a distributed Cluster.
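One possible way to pick new_chunks for the Dask backend is to make each Dask chunk an integer multiple of the on-disk chunk, so that every Dask task fetches whole disk chunks. The sketch below is plain Python; the helper `scale_chunks`, the multipliers, and the array shape are illustrative assumptions rather than MDIO functionality.

```python
def scale_chunks(disk_chunks, factors, shape):
    """Multiply on-disk chunk sizes by integer factors, clipped to the array shape."""
    if not (len(disk_chunks) == len(factors) == len(shape)):
        raise ValueError("disk_chunks, factors, and shape must have the same length")
    return tuple(min(c * f, n) for c, f, n in zip(disk_chunks, factors, shape))


disk_chunks = (64, 64, 64)   # chunk sizes as stored on disk
shape = (345, 188, 1501)     # hypothetical dataset shape
new_chunks = scale_chunks(disk_chunks, (2, 2, 4), shape)

print(new_chunks)  # (128, 128, 256)
```

The resulting tuple could then be passed as the new_chunks argument together with backend="dask".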

The accessor also allows users to enable fsspec caching. Caching is particularly useful when we are accessing the data from a high-latency store such as an object store or a mounted network drive. We can use the disk_cache option to fetch chunks to the local temporary directory for faster repetitive access. We can also turn on the Least Recently Used (LRU) cache by using the memory_cache_size option. It has to be specified in bytes.

Parameters:
  • mdio_path_or_buffer (str) – Store URL for MDIO file. This can be either on a local disk, or a cloud object store.

  • mode (str) – Read or read/write mode. The file must exist. Options are in {‘r’, ‘r+’, ‘w’}. ‘r’ is read only, ‘r+’ is append mode where only existing arrays can be modified, ‘w’ is similar to ‘r+’ but rechunking or other file-wide operations are allowed.

  • access_pattern (str) – Chunk access pattern, optional. Default is “012”. Examples: ‘012’, ‘01’, ‘01234’.

  • storage_options (dict | None) – Options for the storage backend. By default, system-wide credentials will be used. If system-wide credentials are not set and the source is not public, an authentication error will be raised by the backend.

  • return_metadata (bool) – Flag for returning live mask, headers, and traces or just the trace data. Default is False, which means just trace data will be returned.

  • new_chunks (tuple[int, ...] | None) – Chunk sizes used in Dask backend. Ignored for Zarr backend. By default, the disk-chunks will be used. However, if we want to stream groups of chunks to a Dask worker, we can rechunk here. Then each Dask worker can asynchronously fetch multiple chunks before working.

  • backend (str) – Backend selection, optional. Default is “zarr”. Must be in {‘zarr’, ‘dask’}.

  • memory_cache_size (int) – Maximum size of the in-memory least recently used (LRU) cache, in bytes.

  • disk_cache (bool) – Disk cache implemented by fsspec, optional. Default is False, which turns off disk caching. See simplecache from fsspec documentation for more details.

Raises:

MDIONotFoundError – If the MDIO file can not be opened.

Notes

The combination of the Dask backend and caching schemes is experimental. This configuration may cause unexpected memory usage and duplicate data fetching.

Examples

Assuming we ingested my_3d_seismic.segy as my_3d_seismic.mdio we can open the file in read-only mode like this.

>>> from mdio import MDIOReader
>>>
>>>
>>> mdio = MDIOReader("my_3d_seismic.mdio")

This will open the file with the lazy Zarr backend. To access a specific inline, crossline, or sample index we can do:

>>> inline = mdio[15]  # get the 15th inline
>>> crossline = mdio[:, 15]  # get the 15th crossline
>>> samples = mdio[..., 250]  # get the 250th sample slice

The above variables will be Numpy arrays of the relevant trace data. If we want to retrieve the live mask and trace headers for our slicing, we need to open the file with the return_metadata option.

>>> mdio = MDIOReader("my_3d_seismic.mdio", return_metadata=True)

Then we can fetch the data like this (for inline):

>>> il_live, il_headers, il_traces = mdio[15]

Since MDIOAccessor returns a tuple with these three Numpy arrays, we can directly unpack it and use it further down our code.

Accessor initialization function.

coord_to_index(*args, dimensions=None)#

Convert dimension coordinate to zero-based index.

The coordinate labels of the array dimensions are converted to zero-based indices. For instance if we have an inline dimension like this:

[10, 20, 30, 40, 50]

then the indices would be:

[0, 1, 2, 3, 4]

This method converts from coordinate labels of a dimension to equivalent indices.

Multiple dimensions can be queried at the same time, see the examples.
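The idea of the conversion can be sketched in plain Python for a single dimension. This is only an illustration of the mapping described above, not MDIO's implementation; the helper name `coords_to_indices` is hypothetical.

```python
def coords_to_indices(dim_coords, queries):
    """Map coordinate labels to zero-based positions within one dimension."""
    lookup = {coord: idx for idx, coord in enumerate(dim_coords)}
    missing = [q for q in queries if q not in lookup]
    if missing:
        # Mirrors the documented ValueError for coordinates that don't exist.
        raise ValueError(f"Coordinates not found in dimension: {missing}")
    return [lookup[q] for q in queries]


inline_coords = [10, 20, 30, 40, 50]

print(coords_to_indices(inline_coords, [30, 10, 50]))  # [2, 0, 4]
```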

Parameters:
  • *args – Variable length argument queries.

  • dimensions (str | list[str] | None) – Name of the dimensions to query. If not provided, it will query all dimensions in the grid and will require len(args) == grid.ndim

Returns:

Zero-based indices of coordinates. Each item in the result corresponds to the indices of that dimension.

Raises:
  • ShapeError – if the number of queries doesn’t match the requested dimensions.

  • ValueError – if requested coordinates don’t exist.

Return type:

tuple[ndarray[Any, dtype[int]], …]

Examples

Opening an MDIO file.

>>> from mdio import MDIOReader
>>>
>>>
>>> mdio = MDIOReader("path_to.mdio")
>>> mdio.coord_to_index([10, 7, 15], dimensions='inline')
array([ 8,  5, 13], dtype=uint16)
>>> ils, xls = [10, 7, 15], [5, 10]
>>> mdio.coord_to_index(ils, xls, dimensions=['inline', 'crossline'])
(array([ 8,  5, 13], dtype=uint16), array([3, 8], dtype=uint16))

With the above indices, we can slice the data:

>>> il_idx, xl_idx = mdio.coord_to_index(ils, xls, dimensions=['inline', 'crossline'])
>>> mdio[il_idx]  # only inlines
>>> mdio[:, xl_idx]  # only crosslines
>>> mdio[il_idx, xl_idx]  # intersection of the lines

Note that some fancy-indexing may not work with Zarr backend. The Dask backend is more flexible when it comes to indexing.

If we are querying all dimensions of a 3D array, we can omit the dimensions argument.

>>> mdio.coord_to_index(10, 5, [50, 100])
(array([8], dtype=uint16),
 array([3], dtype=uint16),
 array([25, 50], dtype=uint16))

copy(dest_path_or_buffer, excludes='', includes='', storage_options=None, overwrite=False)#

Makes a copy of an MDIO file with or without all arrays.

Refer to mdio.api.convenience.copy for full documentation.

Parameters:
  • dest_path_or_buffer (str) – Destination path. Could be any FSSpec mapping.

  • excludes (str) – Data to exclude during copy, e.g. chunked_012. The raw data won’t be copied, but it will create an empty array to be filled. If left blank, it will copy everything.

  • includes (str) – Data to include during copy, e.g. trace_headers. If this is not specified, and certain data is excluded, it will not copy headers. If you want to preserve headers, specify trace_headers. If left blank, it will copy everything except what is specified in the excludes parameter.

  • storage_options (dict | None) – Storage options for the cloud storage backend. Default is None (will assume anonymous).

  • overwrite (bool) – Overwrite destination or not.

property binary_header: dict#

Get seismic binary header metadata.

property chunks: tuple[int, ...]#

Get dataset chunk sizes.

property live_mask: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | Array#

Get live mask (i.e. not-null value mask).

property n_dim: int#

Get number of dimensions for dataset.

property shape: tuple[int, ...]#

Get shape of dataset.

property stats: dict#

Get global statistics like min/max/rms/std.

property text_header: list#

Get seismic text header.

property trace_count: int#

Get trace count from seismic MDIO.

class mdio.api.accessor.MDIOReader(mdio_path_or_buffer, access_pattern='012', storage_options=None, return_metadata=False, new_chunks=None, backend='zarr', memory_cache_size=0, disk_cache=False)#

Read-only accessor for MDIO files.

For detailed documentation see MDIOAccessor.

Parameters:
  • mdio_path_or_buffer (str) – Store URL for MDIO file. This can be either on a local disk, or a cloud object store.

  • access_pattern (str) – Chunk access pattern, optional. Default is “012”. Examples: ‘012’, ‘01’, ‘01234’.

  • storage_options (dict) – Options for the storage backend. By default, system-wide credentials will be used. If system-wide credentials are not set and the source is not public, an authentication error will be raised by the backend.

  • return_metadata (bool) – Flag for returning live mask, headers, and traces or just the trace data. Default is False, which means just trace data will be returned.

  • new_chunks (tuple[int, ...]) – Chunk sizes used in Dask backend. Ignored for Zarr backend. By default, the disk-chunks will be used. However, if we want to stream groups of chunks to a Dask worker, we can rechunk here. Then each Dask worker can asynchronously fetch multiple chunks before working.

  • backend (str) – Backend selection, optional. Default is “zarr”. Must be in {‘zarr’, ‘dask’}.

  • memory_cache_size (int) – Maximum size of the in-memory least recently used (LRU) cache, in bytes.

  • disk_cache (bool) – Disk cache implemented by fsspec, optional. Default is False, which turns off disk caching. See simplecache from fsspec documentation for more details.

Initialize super class with r permission.

class mdio.api.accessor.MDIOWriter(mdio_path_or_buffer, access_pattern='012', storage_options=None, return_metadata=False, new_chunks=None, backend='zarr', memory_cache_size=0, disk_cache=False)#

Writable accessor for MDIO files.

For detailed documentation see MDIOAccessor.

Parameters:
  • mdio_path_or_buffer (str) – Store URL for MDIO file. This can be either on a local disk, or a cloud object store.

  • access_pattern (str) – Chunk access pattern, optional. Default is “012”. Examples: ‘012’, ‘01’, ‘01234’.

  • storage_options (dict) – Options for the storage backend. By default, system-wide credentials will be used. If system-wide credentials are not set and the source is not public, an authentication error will be raised by the backend.

  • return_metadata (bool) – Flag for returning live mask, headers, and traces or just the trace data. Default is False, which means just trace data will be returned.

  • new_chunks (tuple[int, ...]) – Chunk sizes used in Dask backend. Ignored for Zarr backend. By default, the disk-chunks will be used. However, if we want to stream groups of chunks to a Dask worker, we can rechunk here. Then each Dask worker can asynchronously fetch multiple chunks before working.

  • backend (str) – Backend selection, optional. Default is “zarr”. Must be in {‘zarr’, ‘dask’}.

  • memory_cache_size (int) – Maximum size of the in-memory least recently used (LRU) cache, in bytes.

  • disk_cache (bool) – Disk cache implemented by fsspec, optional. Default is False, which turns off disk caching. See simplecache from fsspec documentation for more details.

Initialize super class with r+ permission.

Data Converters#

Seismic Data#

Note

By default, the SEG-Y ingestion tool uses Python’s multiprocessing to speed up parsing the data. This almost always requires a __main__ guard on any other Python code that is executed directly like python file.py. When running inside Jupyter, this is NOT needed.

if __name__ == "__main__":
    segy_to_mdio(...)

When the CLI is invoked, this is already handled.

See the official Python multiprocessing documentation for details.

Conversion from SEG-Y to MDIO.

mdio.converters.segy.segy_to_mdio(segy_path, mdio_path_or_buffer, index_bytes, index_names=None, index_types=None, chunksize=None, endian='big', lossless=True, compression_tolerance=0.01, storage_options=None, overwrite=False, grid_overrides=None)#

Convert SEG-Y file to MDIO format.

MDIO allows ingesting flattened seismic surveys in SEG-Y format into a multidimensional tensor that represents the correct geometry of the seismic dataset.

The SEG-Y file must be on disk. MDIO currently does not support reading SEG-Y directly from a cloud object store.

The output MDIO file can be local or on the cloud. For local files, a UNIX or Windows path is sufficient. However, for cloud stores, an appropriate protocol must be provided. See examples for more details.

The SEG-Y headers for indexing must also be specified. The index byte locations (starting from 1) are the minimum amount of information needed to index the file. However, we suggest giving names to the index dimensions and, if needed, providing the header data types via index_types if they are not standard. By default, all header entries are assumed to be 4-byte integers.

The chunk size depends on the data type; however, it can be chosen to accommodate any workflow’s access patterns. See examples below for some common use cases.
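When choosing a chunk size, it can help to estimate the uncompressed size of a single chunk. The sketch below assumes 4-byte samples (e.g. 32-bit floats); the helper `chunk_nbytes` is hypothetical and only does the arithmetic.

```python
from math import prod


def chunk_nbytes(chunksize, itemsize=4):
    """Uncompressed size in bytes of one chunk, assuming `itemsize` bytes per sample."""
    return prod(chunksize) * itemsize


# The documented 3D default of (64, 64, 64) gives 1 MiB per chunk.
print(chunk_nbytes((64, 64, 64)))             # 1048576
# A chunk that is longer along the sample axis is proportionally larger.
print(chunk_nbytes((64, 64, 512)) / 2**20)    # 8.0
```

Smaller chunks reduce the amount of data read for a small slice, while larger chunks reduce the number of requests for bulk reads.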

By default, the data is ingested with LOSSLESS compression. This saves disk space in the range of 20% to 40%. MDIO also allows data to be compressed using the ZFP compressor’s fixed-accuracy lossy compression. If the lossless parameter is set to False and MDIO was installed using the lossy extra, then the data will be compressed to approximately 30% of its original size and will be perceptually lossless. The compression level can be adjusted using the compression_tolerance option (float). Higher values will compress more, but will introduce artifacts.

Parameters:
  • segy_path (str) – Path to the input SEG-Y file

  • mdio_path_or_buffer (str) – Output path for MDIO file

  • index_bytes (Sequence[int]) – Tuple of the byte location for the index attributes

  • index_names (Sequence[str] | None) – Tuple of the index names for the index attributes

  • index_types (Sequence[str] | None) – Tuple of the data-types for the index attributes. Must be in {“int16”, “int32”, “float16”, “float32”, “ibm32”}. Default is 4-byte integers for each index key.

  • chunksize (Sequence[int] | None) – Override default chunk size, which is (64, 64, 64) if 3D, and (512, 512) for 2D.

  • endian (str) – Endianness of the input SEG-Y. Rev.2 allows little endian. Default is ‘big’. Must be in {“big”, “little”}

  • lossless (bool) – Lossless Blosc with zstandard, or ZFP with fixed accuracy.

  • compression_tolerance (float) – Tolerance ZFP compression, optional. The fixed accuracy mode in ZFP guarantees there won’t be any errors larger than this value. The default is 0.01, which gives about 70% reduction in size. Will be ignored if lossless=True.

  • storage_options (dict[str, Any] | None) – Storage options for the cloud storage backend. Default is None (will assume anonymous)

  • overwrite (bool) – Toggle for overwriting existing store

  • grid_overrides (dict | None) – Option to add grid overrides. See examples.

Raises:
  • GridTraceCountError – Raised if grid won’t hold all traces in the SEG-Y file.

  • ValueError – If the number of chunk sizes doesn’t match the number of dimensions.

  • NotImplementedError – If chunking can’t be determined automatically for 4D+.

Return type:

None

Examples

If we are working locally and ingesting a 3D post-stack seismic file, we can use the following example. This will ingest with default chunks of 64 x 64 x 64.

>>> from mdio import segy_to_mdio
>>>
>>>
>>> segy_to_mdio(
...     segy_path="prefix1/file.segy",
...     mdio_path_or_buffer="prefix2/file.mdio",
...     index_bytes=(189, 193),
...     index_names=("inline", "crossline")
... )

If we are on Amazon Web Services, we can do it like below. The protocol before the URL can be s3 for AWS, gcs for Google Cloud, and abfs for Microsoft Azure. In this example we also change the chunk size as a demonstration.

>>> segy_to_mdio(
...     segy_path="prefix/file.segy",
...     mdio_path_or_buffer="s3://bucket/file.mdio",
...     index_bytes=(189, 193),
...     index_names=("inline", "crossline"),
...     chunksize=(64, 64, 512),
... )

Another example of loading a 4D seismic such as 3D seismic pre-stack gathers is below. This will allow us to extract offset planes efficiently or run things in a local neighborhood very efficiently.

>>> segy_to_mdio(
...     segy_path="prefix/file.segy",
...     mdio_path_or_buffer="s3://bucket/file.mdio",
...     index_bytes=(189, 193, 37),
...     index_names=("inline", "crossline", "offset"),
...     chunksize=(16, 16, 16, 512),
... )

We can override the dataset grid by the grid_overrides parameter. This allows us to ingest files that don’t conform to the true geometry of the seismic acquisition.

For example, if we are ingesting 3D seismic shots that don’t have a cable number and channel numbers are sequential (i.e. each cable doesn’t start with channel number 1), we can tell MDIO to ingest this with the correct geometry by calculating cable numbers and wrapped channel numbers. Note the missing (None) entries for the “cable” index.

>>> segy_to_mdio(
...     segy_path="prefix/shot_file.segy",
...     mdio_path_or_buffer="s3://bucket/shot_file.mdio",
...     index_bytes=(17, None, 13),
...     index_types=("int32", None, "int32"),
...     index_names=("shot", "cable", "channel"),
...     chunksize=(8, 2, 128, 1024),
...     grid_overrides={
...         "ChannelWrap": True, "ChannelsPerCable": 800,
...         "CalculateCable": True
...     },
... )
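To make the idea of wrapped channels concrete, the sketch below shows one plausible way to derive a cable number and a per-cable channel number from a sequential channel number. It assumes one-based numbering and a fixed number of channels per cable. The helper `wrap_channel` is hypothetical, and MDIO's actual calculation may differ.

```python
def wrap_channel(sequential_channel: int, channels_per_cable: int) -> tuple[int, int]:
    """Return (cable, wrapped_channel), both one-based."""
    if sequential_channel < 1 or channels_per_cable < 1:
        raise ValueError("channel numbers and channels_per_cable must be positive")
    cable, offset = divmod(sequential_channel - 1, channels_per_cable)
    return cable + 1, offset + 1


# With 800 channels per cable, channel 801 becomes channel 1 on cable 2.
for channel in (1, 800, 801, 1600, 1601):
    print(channel, wrap_channel(channel, 800))
```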

If we do have cable numbers in the headers, but channels are still sequential (aka. unwrapped), we can still ingest it like this.

>>> segy_to_mdio(
...     segy_path="prefix/shot_file.segy",
...     mdio_path_or_buffer="s3://bucket/shot_file.mdio",
...     index_bytes=(17, 137, 13),
...     index_types=("int32", "int16", "int32"),
...     index_names=("shot_point", "cable", "channel"),
...     chunksize=(8, 2, 128, 1024),
...     grid_overrides={"ChannelWrap": True, "ChannelsPerCable": 800},
... )

For shot gathers with channel numbers and wrapped channels, no grid overrides are necessary.

In cases where the user does not know if the input has unwrapped channels but desires to store with a wrapped channel index, use:

>>> grid_overrides = {"AutoChannelWrap": True, "AutoChannelTraceQC": 1000000}

For ingestion of pre-stack streamer data where the user needs to access/index common-channel gathers (single gun) then the following strategy can be used to densely ingest while indexing on gun number:

>>> segy_to_mdio(
...     segy_path="prefix/shot_file.segy",
...     mdio_path_or_buffer="s3://bucket/shot_file.mdio",
...     index_bytes=(133, 171, 17, 137, 13),
...     index_types=("int16", "int16", "int32", "int16", "int32"),
...     index_names=("shot_line", "gun", "shot_point", "cable", "channel"),
...     chunksize=(1, 1, 8, 1, 128, 1024),
...     grid_overrides={
...         "AutoShotWrap": True,
...         "AutoChannelWrap": True,
...         "AutoChannelTraceQC":  1000000
...     },
... )

For AutoShotWrap and AutoChannelWrap to work, the user must provide “shot_line”, “gun”, “shot_point”, “cable”, and “channel”. Consider changing the chunksize to (1, 1, 32, 1, 32, 2048) for good common-shot and common-channel performance, or to (1, 1, 128, 1, 1, 2048) to favor common-channel performance.

For cases with no well-defined trace header for indexing, a NonBinned grid override is provided. This creates the index and assigns an incrementing integer to each trace, in first-in, first-out order. For example, a file keyed by CDP and offset might store real-world offsets in the offset header, which would result in a very sparsely populated index. Instead, the following override will create a new index from 1 to N, where N is the number of offsets within a CDP ensemble. The auto-generated index is called “trace”.

Note the required “chunksize” parameter in the grid override. The chunk size of the non-binned dimension is independent of the chunk sizes of the index dimensions, so it has to be specified in the grid override itself. Also note that the offset is not indexed: we only index CDP, provide the CDP header type, and give chunk sizes for the CDP and sample dimensions only. The below configuration will yield 1MB chunks:

>>> segy_to_mdio(
...     segy_path="prefix/cdp_offset_file.segy",
...     mdio_path_or_buffer="s3://bucket/cdp_offset_file.mdio",
...     index_bytes=(21,),
...     index_types=("int32",),
...     index_names=("cdp",),
...     chunksize=(4, 1024),
...     grid_overrides={"NonBinned": True, "chunksize": 64},
... )
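The first-in, first-out numbering described for NonBinned can be pictured with a small plain-Python sketch. It assigns each trace an incrementing counter within its CDP ensemble, in the order traces are encountered. The helper `assign_trace_index` is hypothetical and only illustrates the concept, not MDIO's internals.

```python
from collections import defaultdict


def assign_trace_index(cdp_headers):
    """Give each trace a one-based counter within its CDP, in order of appearance."""
    counters = defaultdict(int)
    trace_index = []
    for cdp in cdp_headers:
        counters[cdp] += 1
        trace_index.append(counters[cdp])
    return trace_index


# Hypothetical CDP header values for six consecutive traces.
cdps = [100, 100, 100, 101, 101, 102]

print(assign_trace_index(cdps))  # [1, 2, 3, 1, 2, 1]
```

The resulting “trace” dimension is dense, with a length equal to the largest number of traces found in any single CDP.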

A more complicated case where you may have a 5D dataset that is not binned in the Offset and Azimuth directions can be ingested like below. However, the Offset and Azimuth dimensions will be combined into the “trace” dimension. The below configuration will yield 4MB chunks (4 x 4 x 64 x 1024 samples of 4 bytes each).

>>> segy_to_mdio(
...     segy_path="prefix/cdp_offset_file.segy",
...     mdio_path_or_buffer="s3://bucket/cdp_offset_file.mdio",
...     index_bytes=(189, 193),
...     index_types=("int32", "int32"),
...     index_names=("inline", "crossline"),
...     chunksize=(4, 4, 1024),
...     grid_overrides={"NonBinned": True, "chunksize": 64},
... )

For datasets with expected duplicate traces we have the following parameterization. This will use the same logic as NonBinned with a fixed chunksize of 1. The other keys are still important. The below example allows multiple traces per receiver (i.e. reshoot).

>>> segy_to_mdio(
...     segy_path="prefix/cdp_offset_file.segy",
...     mdio_path_or_buffer="s3://bucket/cdp_offset_file.mdio",
...     index_bytes=(9, 213, 13),
...     index_types=("int32", "int16", "int32"),
...     index_names=("shot", "cable", "chan"),
...     chunksize=(8, 2, 256, 512),
...     grid_overrides={"HasDuplicates": True},
... )

Conversion from MDIO to various other formats.

mdio.converters.mdio.mdio_to_segy(mdio_path_or_buffer, output_segy_path, endian='big', access_pattern='012', out_sample_format='ibm32', storage_options=None, new_chunks=None, selection_mask=None, client=None)#

Convert MDIO file to SEG-Y format.

MDIO allows exporting multidimensional seismic data back to the flattened seismic format SEG-Y, to be used in data transmission.

The input headers are preserved as is, and will be transferred to the output file.

The user has control over the endianness, and the floating point data type. However, by default we export as Big-Endian IBM float, per the SEG-Y format’s default.

The input MDIO can be local or cloud based. However, the output SEG-Y will be generated locally.

A selection_mask can be provided (in the shape of the spatial grid) to export a subset of the seismic data.

Parameters:
  • mdio_path_or_buffer (str) – Input path where the MDIO is located

  • output_segy_path (str) – Path to the output SEG-Y file

  • endian (str) – Endianness of the output SEG-Y. Rev.2 allows little endian. Default is ‘big’.

  • access_pattern (str) – This specifies the chunk access pattern. Underlying zarr.Array must exist. Examples: ‘012’, ‘01’

  • out_sample_format (str) – Output sample format. Currently supported: {‘ibm32’, ‘float32’}. Default is ‘ibm32’.

  • storage_options (dict) – Storage options for the cloud storage backend. Default: None (will assume anonymous access)

  • new_chunks (tuple[int, ...]) – Set manual chunksize. For development purposes only.

  • selection_mask (np.ndarray) – Array that lists the subset of traces

  • client (distributed.Client) – Dask client. If None we will use local threaded scheduler. If auto is used we will create multiple processes (with 8 threads each).

Raises:
  • ImportError – if distributed package isn’t installed but requested.

  • ValueError – if cut mask is empty, i.e. no traces will be written.

Return type:

None

Examples

To export an existing local MDIO file to SEG-Y we use the code snippet below. This will export the full MDIO (without padding) to SEG-Y format using IBM floats and big-endian byte order.

>>> from mdio import mdio_to_segy
>>>
>>>
>>> mdio_to_segy(
...     mdio_path_or_buffer="prefix2/file.mdio",
...     output_segy_path="prefix/file.segy",
... )

If we want to export this as an IEEE big-endian, using a selection mask, we would run:

>>> mdio_to_segy(
...     mdio_path_or_buffer="prefix2/file.mdio",
...     output_segy_path="prefix/file.segy",
...     selection_mask=boolean_mask,
...     out_sample_format="float32",
... )

Core Functionality#

Dimensions#

Dimension (grid) abstraction and serializers.

class mdio.core.dimension.Dimension(coords, name)#

Dimension class.

Dimension has a name and coordinates associated with it. The Dimension coordinates can only be a vector.

Parameters:
  • coords (list | tuple | ndarray[Any, dtype[_ScalarType_co]] | range) – Vector of coordinates.

  • name (str) – Name of the dimension.

classmethod deserialize(stream, stream_format)#

Deserialize buffer into Dimension.

Parameters:
  • stream (str) –

  • stream_format (str) –

Return type:

Dimension

classmethod from_dict(other)#

Make dimension from dictionary.

Parameters:

other (dict[str, Any]) –

Return type:

Dimension

max()#

Get maximum value of dimension.

Return type:

NDArray[np.float]

min()#

Get minimum value of dimension.

Return type:

NDArray[np.float]

serialize(stream_format)#

Serialize the dimension into buffer.

Parameters:

stream_format (str) –

Return type:

str

to_dict()#

Convert dimension to dictionary.

Return type:

dict[str, Any]

property size: int#

Size of the dimension.
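The interplay of to_dict, from_dict, serialize, and deserialize can be sketched with a small stand-in class. `SimpleDimension` below is hypothetical and only mirrors the documented method names. The dictionary keys and the JSON-only support are assumptions for illustration, not MDIO's actual layout.

```python
import json
from dataclasses import dataclass


@dataclass
class SimpleDimension:
    """Stand-in for a named coordinate vector (not the MDIO class)."""

    coords: list
    name: str

    @property
    def size(self) -> int:
        return len(self.coords)

    def to_dict(self) -> dict:
        return {"name": self.name, "coords": list(self.coords)}

    @classmethod
    def from_dict(cls, other: dict) -> "SimpleDimension":
        return cls(coords=other["coords"], name=other["name"])

    def serialize(self, stream_format: str) -> str:
        if stream_format.upper() != "JSON":
            raise ValueError("This sketch only supports JSON")
        return json.dumps(self.to_dict())

    @classmethod
    def deserialize(cls, stream: str, stream_format: str) -> "SimpleDimension":
        if stream_format.upper() != "JSON":
            raise ValueError("This sketch only supports JSON")
        return cls.from_dict(json.loads(stream))


inline = SimpleDimension(coords=list(range(10, 60, 10)), name="inline")
stream = inline.serialize("JSON")
restored = SimpleDimension.deserialize(stream, "JSON")

print(restored == inline, restored.size)  # True 5
```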

class mdio.core.dimension.DimensionSerializer(stream_format)#

Serializer implementation for Dimension.

Initialize serializer.

Parameters:

stream_format (str) – Stream format. Must be in {“JSON”, “YAML”}.

deserialize(stream)#

Deserialize buffer into Dimension.

Parameters:

stream (str) –

Return type:

Dimension

serialize(dimension)#

Serialize Dimension into buffer.

Parameters:

dimension (Dimension) –

Return type:

str

Data I/O#

(De)serialization factory design pattern.

Current support for JSON and YAML.

class mdio.core.serialization.Serializer(stream_format)#

Serializer base class.

Here we define the interface for any serializer implementation.

Parameters:

stream_format (str) – Format of the stream for serialization.

Initialize serializer.

Parameters:

stream_format (str) – Stream format. Must be in {“JSON”, “YAML”}.

abstract deserialize(stream)#

Abstract method for deserialize.

Parameters:

stream (str) –

Return type:

dict

abstract serialize(payload)#

Abstract method for serialize.

Parameters:

payload (dict) –

Return type:

str

static validate_payload(payload, signature)#

Validate if required keys exist in the payload for a function signature.

Parameters:
  • payload (dict) –

  • signature (Signature) –

Return type:

dict
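A check of this kind can be written with the standard library's inspect module. The sketch below is one possible approach under stated assumptions: it raises if a required parameter is missing and drops keys that the signature does not accept. Whether the real method filters extra keys or handles them differently is not stated here, and `make_dimension` is a hypothetical target function.

```python
import inspect


def validate_payload(payload: dict, signature: inspect.Signature) -> dict:
    """Ensure required parameters are present; keep only keys the signature accepts."""
    variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    required = [
        name
        for name, param in signature.parameters.items()
        if param.default is inspect.Parameter.empty and param.kind not in variadic
    ]
    missing = [name for name in required if name not in payload]
    if missing:
        raise KeyError(f"Payload is missing required keys: {missing}")
    return {key: value for key, value in payload.items() if key in signature.parameters}


def make_dimension(coords, name):
    """Hypothetical function whose signature the payload is validated against."""
    return {"coords": coords, "name": name}


sig = inspect.signature(make_dimension)
clean = validate_payload({"coords": [1, 2, 3], "name": "inline", "extra": True}, sig)

print(clean)  # {'coords': [1, 2, 3], 'name': 'inline'}
```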

mdio.core.serialization.get_deserializer(stream_format)#

Get deserializer based on format.

Parameters:

stream_format (str) –

Return type:

Callable

mdio.core.serialization.get_serializer(stream_format)#

Get serializer based on format.

Parameters:

stream_format (str) –

Return type:

Callable
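The factory idea behind these two functions can be sketched as follows. The names `pick_serializer` and `pick_deserializer` are hypothetical, chosen to avoid implying this is MDIO's code. The YAML branch assumes the third-party PyYAML package and is only imported when YAML is requested.

```python
import json
from typing import Callable


def pick_serializer(stream_format: str) -> Callable:
    """Return a function that turns a dict into a string for the given format."""
    fmt = stream_format.upper()
    if fmt == "JSON":
        return json.dumps
    if fmt == "YAML":
        import yaml  # third-party PyYAML, assumed available only if YAML is used

        return yaml.safe_dump
    raise ValueError(f"Unsupported stream format: {stream_format!r}")


def pick_deserializer(stream_format: str) -> Callable:
    """Return a function that turns a string back into a dict for the given format."""
    fmt = stream_format.upper()
    if fmt == "JSON":
        return json.loads
    if fmt == "YAML":
        import yaml  # third-party PyYAML, assumed available only if YAML is used

        return yaml.safe_load
    raise ValueError(f"Unsupported stream format: {stream_format!r}")


payload = {"name": "inline", "coords": [10, 20, 30]}
stream = pick_serializer("JSON")(payload)

print(pick_deserializer("json")(stream) == payload)  # True
```

Keeping the format dispatch in one place means callers only pass a format string and never import a specific parser themselves.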

Convenience Functions#

Convenience APIs for working with MDIO files.

mdio.api.convenience.copy_mdio(source, dest_path_or_buffer, excludes='', includes='', storage_options=None, overwrite=False)#

Copy MDIO file.

Can also copy with empty data to be filled later. See excludes and includes parameters.

More documentation about excludes and includes can be found in Zarr’s documentation in zarr.convenience.copy_store.

Parameters:
  • source (MDIOReader) – MDIO reader or accessor instance. Data will be copied from here

  • dest_path_or_buffer (str) – Destination path. Could be any FSSpec mapping.

  • excludes (str) – Data to exclude during copy, e.g. chunked_012. The raw data won’t be copied, but it will create an empty array to be filled. If left blank, it will copy everything.

  • includes (str) – Data to include during copy, e.g. trace_headers. If this is not specified, and certain data is excluded, it will not copy headers. If you want to preserve headers, specify trace_headers. If left blank, it will copy everything except what is specified in the excludes parameter.

  • storage_options (dict | None) – Storage options for the cloud storage backend. Default is None (will assume anonymous).

  • overwrite (bool) – Overwrite destination or not.

Return type:

None

mdio.api.convenience.rechunk(source, chunks, suffix, compressor=None, overwrite=False)#

Rechunk MDIO file adding a new variable.

Parameters:
  • source (MDIOAccessor) – MDIO accessor instance. Data will be copied from here.

  • chunks (tuple[int, ...]) – Tuple containing chunk sizes for new rechunked array.

  • suffix (str) – Suffix to append to new rechunked array.

  • compressor (Codec | None) – Data compressor to use, optional. Default is Blosc(‘zstd’).

  • overwrite (bool) – Overwrite destination or not.

Return type:

None

Examples

To rechunk a single variable we can do this:

>>> accessor = MDIOAccessor(...)
>>> rechunk(accessor, (1, 1024, 1024), suffix="fast_il")

mdio.api.convenience.rechunk_batch(source, chunks_list, suffix_list, compressor=None, overwrite=False)#

Rechunk MDIO file to multiple variables, reading it once.

Parameters:
  • source (MDIOAccessor) – MDIO accessor instance. Data will be copied from here.

  • chunks_list (list[tuple[int, ...]]) – List of tuples containing new chunk sizes.

  • suffix_list (list[str]) – List of suffixes to append to the new rechunked arrays.

  • compressor (Codec | None) – Data compressor to use, optional. Default is Blosc(‘zstd’).

  • overwrite (bool) – Overwrite destination or not.

Return type:

None

Examples

To rechunk multiple variables we can do things like:

>>> accessor = MDIOAccessor(...)
>>> rechunk_batch(
...     accessor,
...     chunks_list=[(1, 1024, 1024), (1024, 1, 1024), (1024, 1024, 1)],
...     suffix_list=["fast_il", "fast_xl", "fast_z"],
... )