API Reference¶
Readers / Writers¶
MDIO accessor APIs.
- class mdio.api.accessor.MDIOAccessor(mdio_path_or_buffer, mode, access_pattern, storage_options, return_metadata, new_chunks, backend, disk_cache)¶
Accessor class for MDIO files.
The accessor can be used to read and write MDIO files. It allows you to open an MDIO file in several mode and access_pattern combinations.
Access pattern defines the dimensions that are chunked. For instance if you have a 3-D array that is chunked in every direction (i.e. a 3-D seismic stack consisting of inline, crossline, and sample dimensions) its access pattern would be “012”. If it was only chunked in the first two dimensions (i.e. seismic inline and crossline), it would be “01”.
By default, MDIO will try to open with “012” access pattern, and will raise an error if that pattern doesn’t exist.
After the dataset is opened, slicing the accessor returns either just the seismic trace data as a Numpy array or a tuple of live mask, headers, and seismic traces as Numpy arrays, depending on the return_metadata parameter.
Regarding object store access: if the user credentials have been set system-wide on the local machine or VM, there is no need to specify credentials. However, the storage_options option allows users to specify credentials for the store that is being accessed. Please see the fsspec documentation for configuring storage options.
MDIO currently supports Zarr and Dask backends. The Zarr backend is useful for reading small amounts of data with minimal overhead. However, by utilizing the Dask backend with a larger chunk size using the new_chunks argument, the data can be read in parallel using a Dask LocalCluster or a distributed Cluster.
The accessor also allows users to enable fsspec caching. This is particularly useful when we are accessing the data from a high-latency store such as an object store or a mounted network drive. We can use the disk_cache option to fetch chunks to a local temporary directory for faster repetitive access.
- Parameters:
mdio_path_or_buffer (str) – Store or URL for MDIO file.
mode (str) – Read or read/write mode. The file must exist. Options are in {‘r’, ‘r+’, ‘w’}. ‘r’ is read only, ‘r+’ is append mode where only existing arrays can be modified, ‘w’ is similar to ‘r+’ but rechunking or other file-wide operations are allowed.
access_pattern (str) – Chunk access pattern, optional. Default is “012”. Examples: ‘012’, ‘01’.
storage_options (dict | None) – Options for the storage backend. By default, system-wide credentials will be used.
return_metadata (bool) – Flag for returning live mask, headers, and traces or just the trace data. Default is False, which means just trace data will be returned.
new_chunks (tuple[int, ...] | None) – Chunk sizes used in Dask backend. Ignored for Zarr backend. By default, the disk-chunks will be used. However, if we want to stream groups of chunks to a Dask worker, we can rechunk here. Then each Dask worker can asynchronously fetch multiple chunks before working.
backend (str) – Backend selection, optional. Default is “zarr”. Must be in {‘zarr’, ‘dask’}.
disk_cache (bool) – Disk cache implemented by fsspec, optional. Default is False, which turns off disk caching. See simplecache from fsspec documentation for more details.
- Raises:
MDIONotFoundError – If the MDIO file cannot be opened.
Examples
Assuming we ingested my_3d_seismic.segy as my_3d_seismic.mdio we can open the file in read-only mode like this.
>>> from mdio import MDIOReader
>>>
>>> mdio = MDIOReader("my_3d_seismic.mdio")
This will open the file with the lazy Zarr backend. To access a specific inline, crossline, or sample index we can do:
>>> inline = mdio[15] # get the 15th inline
>>> crossline = mdio[:, 15] # get the 15th crossline
>>> samples = mdio[..., 250] # get the 250th sample slice
The above variables will be Numpy arrays of the relevant trace data. If we want to retrieve the live mask and trace headers for our slicing, we need to open the file with the return_metadata option.
>>> mdio = MDIOReader("my_3d_seismic.mdio", return_metadata=True)
Then we can fetch the data like this (for inline):
>>> il_live, il_headers, il_traces = mdio[15]
Since MDIOAccessor returns a tuple with these three Numpy arrays, we can directly unpack it and use it further down our code.
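For high-latency stores, the Dask backend and disk cache described above can be combined. The following is a minimal sketch that only assembles the keyword arguments: dask_reader_kwargs and open_remote_reader are hypothetical helper names, the chunk sizes are illustrative, and the argument names are taken from the MDIOReader signature in this reference.

```python
def dask_reader_kwargs(new_chunks, disk_cache=True):
    """Collect MDIOReader keyword arguments for parallel, cached reads."""
    return {
        "backend": "dask",  # read through Dask instead of the default Zarr backend
        "new_chunks": tuple(new_chunks),  # larger in-memory chunks handed to each worker
        "disk_cache": disk_cache,  # enable fsspec disk caching
    }


def open_remote_reader(path, new_chunks):
    """Open an MDIO file with the Dask backend (requires mdio to be installed)."""
    from mdio import MDIOReader  # imported lazily so the helper above has no dependencies

    return MDIOReader(path, **dask_reader_kwargs(new_chunks))


kwargs = dask_reader_kwargs([128, 128, 1024])
print(kwargs)
```

Keeping the options in a plain dictionary makes it easy to reuse the same settings across several readers.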
- coord_to_index(*args, dimensions=None)¶
Convert dimension coordinate to zero-based index.
The coordinate labels of the array dimensions are converted to zero-based indices. For instance if we have an inline dimension like this:
[10, 20, 30, 40, 50]
then the indices would be:
[0, 1, 2, 3, 4]
This method converts from coordinate labels of a dimension to equivalent indices.
Multiple dimensions can be queried at the same time, see the examples.
- Parameters:
- Returns:
Zero-based indices of coordinates. Each item in the result corresponds to the indices of that dimension.
- Raises:
ShapeError – if the number of queries doesn’t match the requested dimensions.
ValueError – if requested coordinates don’t exist.
- Return type:
Examples
Opening an MDIO file.
>>> from mdio import MDIOReader
>>>
>>> mdio = MDIOReader("path_to.mdio")
>>> mdio.coord_to_index([10, 7, 15], dimensions='inline')
array([ 8, 5, 13], dtype=uint16)
>>> ils, xls = [10, 7, 15], [5, 10]
>>> il_idx, xl_idx = mdio.coord_to_index(ils, xls, dimensions=['inline', 'crossline'])
>>> il_idx, xl_idx
(array([ 8, 5, 13], dtype=uint16), array([3, 8], dtype=uint16))
With the above indices, we can slice the data:
>>> mdio[il_idx] # only inlines
>>> mdio[:, xl_idx] # only crosslines
>>> mdio[il_idx, xl_idx] # intersection of the lines
Note that some fancy-indexing may not work with Zarr backend. The Dask backend is more flexible when it comes to indexing.
If we are querying all dimensions of a 3D array, we can omit the dimensions argument.
>>> mdio.coord_to_index(10, 5, [50, 100])
(array([8], dtype=uint16), array([3], dtype=uint16), array([25, 50], dtype=uint16))
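The label-to-index conversion described for coord_to_index can be illustrated without MDIO. This is a sketch of the idea only, not the library's implementation; labels_to_indices is a hypothetical name, and the coordinates are the ones from the description above.

```python
def labels_to_indices(coords, queries):
    """Map coordinate labels to zero-based positions along one dimension."""
    lookup = {label: index for index, label in enumerate(coords)}
    missing = [q for q in queries if q not in lookup]
    if missing:
        # Mirrors the documented behavior of raising ValueError for unknown coordinates.
        raise ValueError(f"Coordinates not found: {missing}")
    return [lookup[q] for q in queries]


inline_coords = [10, 20, 30, 40, 50]
indices = labels_to_indices(inline_coords, [30, 10, 50])
print(indices)  # [2, 0, 4]
```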
- property live_mask: npt.ArrayLike | DaskArray¶
Get live mask (i.e. not-null value mask).
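As an illustration of what a live mask represents, the sketch below marks which grid positions are occupied by a trace. It uses plain Python lists and a hypothetical build_live_mask helper; MDIO itself returns an array, and its internal construction may differ.

```python
def build_live_mask(inlines, crosslines, trace_keys):
    """Return a 2-D boolean grid that is True where a trace exists."""
    il_index = {value: i for i, value in enumerate(inlines)}
    xl_index = {value: i for i, value in enumerate(crosslines)}
    mask = [[False] * len(crosslines) for _ in inlines]
    for inline, crossline in trace_keys:
        mask[il_index[inline]][xl_index[crossline]] = True
    return mask


# Three traces on a 2 x 3 grid; the remaining positions are empty (not live).
mask = build_live_mask([10, 20], [1, 2, 3], [(10, 1), (10, 3), (20, 2)])
print(mask)  # [[True, False, True], [False, True, False]]
```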
- class mdio.api.accessor.MDIOReader(mdio_path_or_buffer, access_pattern='012', storage_options=None, return_metadata=False, new_chunks=None, backend='zarr', disk_cache=False)¶
Read-only accessor for MDIO files.
For detailed documentation see MDIOAccessor.
- Parameters:
mdio_path_or_buffer (str) – Store or URL for MDIO file.
access_pattern (str) – Chunk access pattern, optional. Default is “012”. Examples: ‘012’, ‘01’.
storage_options (dict) – Options for the storage backend. By default, system-wide credentials will be used.
return_metadata (bool) – Flag for returning live mask, headers, and traces or just the trace data. Default is False, which means just trace data will be returned.
new_chunks (tuple[int, ...]) – Chunk sizes used in Dask backend. Ignored for Zarr backend. By default, the disk-chunks will be used. However, if we want to stream groups of chunks to a Dask worker, we can rechunk here. Then each Dask worker can asynchronously fetch multiple chunks before working.
backend (str) – Backend selection, optional. Default is “zarr”. Must be in {‘zarr’, ‘dask’}.
disk_cache (bool) – Disk cache implemented by fsspec, optional. Default is False, which turns off disk caching. See simplecache from fsspec documentation for more details.
- class mdio.api.accessor.MDIOWriter(mdio_path_or_buffer, access_pattern='012', storage_options=None, return_metadata=False, new_chunks=None, backend='zarr', disk_cache=False)¶
Writable accessor for MDIO files.
For detailed documentation see MDIOAccessor.
- Parameters:
mdio_path_or_buffer (str) – Store or URL for MDIO file.
access_pattern (str) – Chunk access pattern, optional. Default is “012”. Examples: ‘012’, ‘01’.
storage_options (dict) – Options for the storage backend. By default, system-wide credentials will be used.
return_metadata (bool) – Flag for returning live mask, headers, and traces or just the trace data. Default is False, which means just trace data will be returned.
new_chunks (tuple[int, ...]) – Chunk sizes used in Dask backend. Ignored for Zarr backend. By default, the disk-chunks will be used. However, if we want to stream groups of chunks to a Dask worker, we can rechunk here. Then each Dask worker can asynchronously fetch multiple chunks before working.
backend (str) – Backend selection, optional. Default is “zarr”. Must be in {‘zarr’, ‘dask’}.
disk_cache (bool) – Disk cache implemented by fsspec, optional. Default is False, which turns off disk caching. See simplecache from fsspec documentation for more details.
Data Converters¶
Seismic Data¶
Note
By default, the SEG-Y ingestion tool uses Python’s multiprocessing to speed up parsing the data. This almost always requires a __main__ guard on any other Python code that is executed directly like python file.py. When running inside Jupyter, this is NOT needed.
if __name__ == "__main__":
    segy_to_mdio(...)
When the CLI is invoked, this is already handled.
See the official Python multiprocessing documentation for details.
Conversion from SEG-Y to MDIO.
- mdio.converters.segy.segy_to_mdio(segy_path, mdio_path_or_buffer, index_bytes, index_names=None, index_types=None, chunksize=None, lossless=True, compression_tolerance=0.01, storage_options_input=None, storage_options_output=None, overwrite=False, grid_overrides=None)¶
Convert SEG-Y file to MDIO format.
MDIO allows ingesting flattened seismic surveys in SEG-Y format into a multidimensional tensor that represents the correct geometry of the seismic dataset.
The SEG-Y file must be on disk; MDIO currently does not support reading SEG-Y directly from the cloud object store.
The output MDIO file can be local or on the cloud. For local files, a UNIX or Windows path is sufficient. However, for cloud stores, an appropriate protocol must be provided. See examples for more details.
The SEG-Y headers for indexing must also be specified. The index byte locations (starting from 1) are the minimum amount of information needed to index the file. However, we suggest giving names to the index dimensions and, if needed, providing the header data types (index_types) if they are not standard. By default, all header entries are assumed to be 4-byte integers.
The chunk size depends on the data type; however, it can be chosen to accommodate any workflow’s access patterns. See examples below for some common use cases.
By default, the data is ingested with LOSSLESS compression. This saves disk space in the range of 20% to 40%. MDIO also allows data to be compressed using the ZFP compressor’s fixed rate lossy compression. If the lossless parameter is set to False and MDIO was installed using the lossy extra, then the data will be compressed to approximately 30% of its original size and will be perceptually lossless. The compression can be adjusted using the compression_tolerance option (float). Higher values will compress more, but will introduce artifacts.
- Parameters:
segy_path (str | Path) – Path to the input SEG-Y file
mdio_path_or_buffer (str | Path) – Output path for the MDIO file, either local or cloud-based (e.g., with s3://, gcs://, or abfs:// protocols).
index_bytes (Sequence[int]) – Tuple of the byte location for the index attributes
index_names (Sequence[str] | None) – List of names for the index dimensions. If not provided, defaults to dim_0, dim_1, …, with the last dimension named sample.
index_types (Sequence[str] | None) – Tuple of the data-types for the index attributes. Must be in {“int16”, “int32”, “float16”, “float32”, “ibm32”}. Default is 4-byte integers for each index key.
chunksize (tuple[int, ...] | None) – Tuple specifying the chunk sizes for each dimension of the array. It must match the number of dimensions in the input array.
lossless (bool) – If True, uses lossless Blosc compression with zstandard. If False, uses ZFP lossy compression (requires zfpy library).
compression_tolerance (float) – Tolerance for ZFP compression in lossy mode. Ignored if lossless=True. Default is 0.01, providing ~70% size reduction.
storage_options_input (dict[str, Any] | None) – Dictionary of storage options for the SEG-Y input file (e.g., cloud credentials). Defaults to None.
storage_options_output (dict[str, Any] | None) – Dictionary of storage options for the MDIO output file (e.g., cloud credentials). Defaults to None.
overwrite (bool) – If True, overwrites existing MDIO file at the specified path.
grid_overrides (dict | None) – Option to add grid overrides. See examples.
- Raises:
GridTraceCountError – Raised if grid won’t hold all traces in the SEG-Y file.
ValueError – If the length of chunk sizes doesn’t match the number of dimensions.
NotImplementedError – If can’t determine chunking automatically for 4D+.
- Return type:
None
Examples
If we are working locally and ingesting a 3D post-stack seismic file, we can use the following example. This will ingest with default chunks of 128 x 128 x 128.
>>> from mdio import segy_to_mdio
>>>
>>> segy_to_mdio(
...     segy_path="prefix1/file.segy",
...     mdio_path_or_buffer="prefix2/file.mdio",
...     index_bytes=(189, 193),
...     index_names=("inline", "crossline")
... )
If we are on Amazon Web Services, we can do it like below. The protocol before the URL can be s3 for AWS, gcs for Google Cloud, and abfs for Microsoft Azure. In this example we also change the chunk size as a demonstration.
>>> segy_to_mdio(
...     segy_path="prefix/file.segy",
...     mdio_path_or_buffer="s3://bucket/file.mdio",
...     index_bytes=(189, 193),
...     index_names=("inline", "crossline"),
...     chunksize=(64, 64, 512),
... )
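When choosing a chunk size, it can help to compare candidates by their uncompressed size in bytes. The sketch below assumes 4-byte samples, which is an assumption rather than something this reference states; chunk_nbytes is a hypothetical helper.

```python
from math import prod


def chunk_nbytes(chunksize, bytes_per_sample=4):
    """Uncompressed size of one chunk in bytes (assumes fixed-width samples)."""
    return prod(chunksize) * bytes_per_sample


default_chunk = chunk_nbytes((128, 128, 128))  # default 3D chunking
custom_chunk = chunk_nbytes((64, 64, 512))  # chunking from the AWS example
print(default_chunk, custom_chunk)  # 8388608 8388608
```

Under this assumption both layouts hold the same number of samples per chunk (8 MiB uncompressed); they differ only in how those samples are distributed across the dimensions.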
Another example of loading a 4D seismic such as 3D seismic pre-stack gathers is below. This will allow us to extract offset planes efficiently or run things in a local neighborhood very efficiently.
>>> segy_to_mdio(
...     segy_path="prefix/file.segy",
...     mdio_path_or_buffer="s3://bucket/file.mdio",
...     index_bytes=(189, 193, 37),
...     index_names=("inline", "crossline", "offset"),
...     chunksize=(16, 16, 16, 512),
... )
We can override the dataset grid by the grid_overrides parameter. This allows us to ingest files that don’t conform to the true geometry of the seismic acquisition.
For example, if we are ingesting 3D seismic shots that don’t have a cable number and channel numbers are sequential (i.e. each cable doesn’t start with channel number 1), we can tell MDIO to ingest this with the correct geometry by calculating cable numbers and wrapped channel numbers. Note the missing byte location and data type for the “cable” index.
>>> segy_to_mdio(
...     segy_path="prefix/shot_file.segy",
...     mdio_path_or_buffer="s3://bucket/shot_file.mdio",
...     index_bytes=(17, None, 13),
...     index_types=("int32", None, "int32"),
...     index_names=("shot", "cable", "channel"),
...     chunksize=(8, 2, 128, 1024),
...     grid_overrides={
...         "ChannelWrap": True, "ChannelsPerCable": 800,
...         "CalculateCable": True
...     },
... )
If we do have cable numbers in the headers, but channels are still sequential (aka. unwrapped), we can still ingest it like this.
>>> segy_to_mdio(
...     segy_path="prefix/shot_file.segy",
...     mdio_path_or_buffer="s3://bucket/shot_file.mdio",
...     index_bytes=(17, 137, 13),
...     index_types=("int32", "int16", "int32"),
...     index_names=("shot_point", "cable", "channel"),
...     chunksize=(8, 2, 128, 1024),
...     grid_overrides={"ChannelWrap": True, "ChannelsPerCable": 800},
... )
For shot gathers with channel numbers and wrapped channels, no grid overrides necessary.
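The arithmetic behind channel wrapping can be sketched as follows. This is an illustration under stated assumptions, not MDIO's implementation: channel and cable numbers are taken to be 1-based, and wrap_channel is a hypothetical name.

```python
def wrap_channel(channel, channels_per_cable):
    """Convert a sequential channel number into (cable, wrapped channel), both 1-based."""
    cable = (channel - 1) // channels_per_cable + 1
    wrapped = (channel - 1) % channels_per_cable + 1
    return cable, wrapped


# With 800 channels per cable, channel 801 is the first channel of the second cable.
print(wrap_channel(1, 800))    # (1, 1)
print(wrap_channel(800, 800))  # (1, 800)
print(wrap_channel(801, 800))  # (2, 1)
```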
In cases where the user does not know if the input has unwrapped channels but desires to store with wrapped channel index use:
>>> grid_overrides = {
...     "AutoChannelWrap": True,
...     "AutoChannelTraceQC": 1000000
... }
For ingestion of pre-stack streamer data where the user needs to access/index common-channel gathers (single gun) then the following strategy can be used to densely ingest while indexing on gun number:
>>> segy_to_mdio(
...     segy_path="prefix/shot_file.segy",
...     mdio_path_or_buffer="s3://bucket/shot_file.mdio",
...     index_bytes=(133, 171, 17, 137, 13),
...     index_types=("int16", "int16", "int32", "int16", "int32"),
...     index_names=("shot_line", "gun", "shot_point", "cable", "channel"),
...     chunksize=(1, 1, 8, 1, 128, 1024),
...     grid_overrides={
...         "AutoShotWrap": True,
...         "AutoChannelWrap": True,
...         "AutoChannelTraceQC": 1000000
...     },
... )
For AutoShotWrap and AutoChannelWrap to work, the user must provide “shot_line”, “gun”, “shot_point”, “cable”, “channel”. Consider changing the chunksize to (1, 1, 32, 1, 32, 2048) for good common-shot and common-channel performance, or to (1, 1, 128, 1, 1, 2048) to favor common-channel performance.
For cases with no well-defined trace header for indexing, a NonBinned grid override is provided. This creates the index and assigns an incrementing integer to each trace in first-in, first-out order. For example, a CDP and Offset keyed file might store the real-world offset in its offset header, which would result in a very sparsely populated index. Instead, the following override will create a new index from 1 to N, where N is the number of offsets within a CDP ensemble. The auto-generated index is called “trace”. Note the required “chunksize” parameter in the grid override: the chunk size of the non-binned ensemble is independent of the index dimension chunk sizes, so it has to be specified in the grid override itself. Note also the lack of offset: we only index CDP, provide the CDP header type, and give chunk sizes for only the CDP and sample dimensions. The below configuration will yield 1MB chunks:
>>> segy_to_mdio(
...     segy_path="prefix/cdp_offset_file.segy",
...     mdio_path_or_buffer="s3://bucket/cdp_offset_file.mdio",
...     index_bytes=(21,),
...     index_types=("int32",),
...     index_names=("cdp",),
...     chunksize=(4, 1024),
...     grid_overrides={"NonBinned": True, "chunksize": 64},
... )
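The idea of a non-binned “trace” index can be sketched in a few lines: every trace receives the next integer within its ensemble, in the order the traces arrive. This is an illustration of the concept rather than MDIO's implementation, and assign_trace_index is a hypothetical name.

```python
from collections import defaultdict


def assign_trace_index(cdp_numbers):
    """Number traces 1..N within each CDP ensemble, in order of appearance."""
    counters = defaultdict(int)
    trace_index = []
    for cdp in cdp_numbers:
        counters[cdp] += 1
        trace_index.append(counters[cdp])
    return trace_index


# Three traces in CDP 100 followed by two traces in CDP 101.
result = assign_trace_index([100, 100, 100, 101, 101])
print(result)  # [1, 2, 3, 1, 2]
```

The resulting index is dense regardless of how sparse the original offset values are.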
A more complicated case where you may have a 5D dataset that is not binned in Offset and Azimuth directions can be ingested like below. However, the Offset and Azimuth dimensions will be combined into a single “trace” dimension. The below configuration will yield chunks of 4 x 4 x 64 x 1024 samples (4 MiB for 4-byte samples).
>>> segy_to_mdio(
...     segy_path="prefix/cdp_offset_file.segy",
...     mdio_path_or_buffer="s3://bucket/cdp_offset_file.mdio",
...     index_bytes=(189, 193),
...     index_types=("int32", "int32"),
...     index_names=("inline", "crossline"),
...     chunksize=(4, 4, 1024),
...     grid_overrides={"NonBinned": True, "chunksize": 64},
... )
For datasets with expected duplicate traces, we have the following parameterization. This will use the same logic as NonBinned with a fixed chunksize of 1. The other keys are still important. The below example allows multiple traces per receiver (i.e. reshoot).
>>> segy_to_mdio(
...     segy_path="prefix/cdp_offset_file.segy",
...     mdio_path_or_buffer="s3://bucket/cdp_offset_file.mdio",
...     index_bytes=(9, 213, 13),
...     index_types=("int32", "int16", "int32"),
...     index_names=("shot", "cable", "chan"),
...     chunksize=(8, 2, 256, 512),
...     grid_overrides={"HasDuplicates": True},
... )
Conversion from MDIO to various other formats.
- mdio.converters.mdio.mdio_to_segy(mdio_path_or_buffer, output_segy_path, endian='big', access_pattern='012', storage_options=None, new_chunks=None, selection_mask=None, client=None)¶
Convert MDIO file to SEG-Y format.
We export N-D seismic data to the flattened SEG-Y format used in data transmission.
The input headers are preserved as is, and will be transferred to the output file.
Input MDIO can be local or cloud based. However, the output SEG-Y will be generated locally.
A selection_mask can be provided (same shape as spatial grid) to export a subset.
- Parameters:
mdio_path_or_buffer (str) – Input path where the MDIO is located.
output_segy_path (str) – Path to the output SEG-Y file.
endian (str) – Endianness of the output SEG-Y. Rev.2 allows little endian. Default is ‘big’.
access_pattern (str) – This specifies the chunk access pattern. Underlying zarr.Array must exist. Examples: ‘012’, ‘01’
storage_options (dict) – Storage options for the cloud storage backend. Default: None (anonymous)
new_chunks (tuple[int, ...]) – Set manual chunksize. For development purposes only.
selection_mask (np.ndarray) – Array that lists the subset of traces
client (distributed.Client) – Dask client. If None we will use local threaded scheduler. If auto is used we will create multiple processes (with 8 threads each).
- Raises:
ImportError – if distributed package isn’t installed but requested.
ValueError – if cut mask is empty, i.e. no traces will be written.
- Return type:
None
Examples
To export an existing local MDIO file to SEG-Y we use the code snippet below. This will export the full MDIO (without padding) to SEG-Y format using IBM floats and big-endian byte order.
>>> from mdio import mdio_to_segy
>>>
>>> mdio_to_segy(
...     mdio_path_or_buffer="prefix2/file.mdio",
...     output_segy_path="prefix/file.segy",
... )
If we want to export only a subset of the traces using a selection mask, we would run:
>>> mdio_to_segy(
...     mdio_path_or_buffer="prefix2/file.mdio",
...     output_segy_path="prefix/file.segy",
...     selection_mask=boolean_mask,
... )
Conversion from Numpy to MDIO.
- mdio.converters.numpy.numpy_to_mdio(array, mdio_path_or_buffer, chunksize, index_names=None, index_coords=None, header_dtype=None, lossless=True, compression_tolerance=0.01, storage_options=None, overwrite=False)¶
Conversion from NumPy array to MDIO format.
This module provides functionality to convert a NumPy array into the MDIO format. The conversion process organizes the input array into a multidimensional tensor with specified indexing and compression options.
- Parameters:
array (NDArray) – Input NumPy array to be converted to MDIO format.
mdio_path_or_buffer (str) – Output path for the MDIO file, either local or cloud-based (e.g., with s3://, gcs://, or abfs:// protocols).
chunksize (tuple[int, ...]) – Tuple specifying the chunk sizes for each dimension of the array. It must match the number of dimensions in the input array.
index_names (list[str] | None) – List of names for the index dimensions. If not provided, defaults to dim_0, dim_1, …, with the last dimension named sample.
index_coords (dict[str, NDArray] | None) – Dictionary mapping dimension names to their coordinate arrays. If not provided, defaults to sequential integers (0 to size-1) for each dimension.
header_dtype (DTypeLike | None) – Data type for trace headers, if applicable. Defaults to None.
lossless (bool) – If True, uses lossless Blosc compression with zstandard. If False, uses ZFP lossy compression (requires zfpy library).
compression_tolerance (float) – Tolerance for ZFP compression in lossy mode. Ignored if lossless=True. Default is 0.01, providing ~70% size reduction.
storage_options (dict[str, Any] | None) – Dictionary of storage options for the MDIO output file (e.g., cloud credentials). Defaults to None (anonymous access).
overwrite (bool) – If True, overwrites existing MDIO file at the specified path.
- Raises:
ValueError – When the length of chunksize does not match the number of dims in the input array, or if an element of index_names is not included in the index_coords dictionary. Also raised when the size of a coordinate array in index_coords does not match the corresponding dimension.
- Return type:
None
Examples
To convert a 3D NumPy array to MDIO format locally:
>>> import numpy as np
>>> from mdio.converters import numpy_to_mdio
>>>
>>> array = np.random.rand(100, 200, 300)
>>> numpy_to_mdio(
...     array=array,
...     mdio_path_or_buffer="output/file.mdio",
...     chunksize=(64, 64, 64),
...     index_names=["inline", "crossline", "sample"],
... )
For a cloud-based output on AWS S3 with custom coordinates:
>>> # Each coordinate array must have the same length as its array dimension.
>>> coords = {
...     "inline": np.arange(0, 200, 2),
...     "crossline": np.arange(0, 800, 4),
...     "sample": np.linspace(0, 0.3, 300),
... }
>>> numpy_to_mdio(
...     array=array,
...     mdio_path_or_buffer="s3://bucket/file.mdio",
...     chunksize=(32, 32, 128),
...     index_names=["inline", "crossline", "sample"],
...     index_coords=coords,
...     lossless=False,
...     compression_tolerance=0.01,
... )
To convert a 2D array with default indexing and lossless compression:
>>> array_2d = np.random.rand(500, 1000)
>>> numpy_to_mdio(
...     array=array_2d,
...     mdio_path_or_buffer="output/file_2d.mdio",
...     chunksize=(512, 512),
... )
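The ValueError conditions listed for numpy_to_mdio can be checked before calling it. The sketch below works on the array shape only and follows the documented rules (chunksize length, default dimension names, coordinate sizes); check_inputs and default_index_names are hypothetical helpers, not part of MDIO.

```python
def default_index_names(ndim):
    """Documented defaults: dim_0, dim_1, ..., with the last dimension named sample."""
    return [f"dim_{i}" for i in range(ndim - 1)] + ["sample"]


def check_inputs(shape, chunksize, index_names=None, index_coords=None):
    """Raise ValueError for inconsistent inputs; return the resolved dimension names."""
    if len(chunksize) != len(shape):
        raise ValueError("chunksize must have one entry per array dimension")
    names = list(index_names) if index_names else default_index_names(len(shape))
    if index_coords is not None:
        for name, size in zip(names, shape):
            if name not in index_coords:
                raise ValueError(f"no coordinates given for dimension {name!r}")
            if len(index_coords[name]) != size:
                raise ValueError(f"coordinates for {name!r} must have length {size}")
    return names


names = check_inputs((100, 200, 300), (64, 64, 64))
print(names)  # ['dim_0', 'dim_1', 'sample']
```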
Convenience Functions¶
Convenience APIs for working with MDIO files.
- mdio.api.convenience.copy_mdio(source_path, target_path, overwrite=False, copy_traces=False, copy_headers=False, storage_options_input=None, storage_options_output=None)¶
Copy MDIO file.
This function copies an MDIO file from a source path to a target path, optionally including trace data, headers, or both, for all access patterns. It creates a new MDIO file at the target path with the same structure as the source, and selectively copies data based on the provided flags. The function supports custom storage options for both input and output, enabling compatibility with various filesystems via FSSpec.
- Parameters:
source_path (str) – Source MDIO path. Data will be copied from here
target_path (str) – Destination path. Could be any FSSpec mapping.
overwrite (bool) – Overwrite destination or not.
copy_traces (bool) – Flag to enable copying trace data for all access patterns.
copy_headers (bool) – Flag to enable copying headers for all access patterns.
storage_options_input (dict[str, Any] | None) – Storage options for input MDIO.
storage_options_output (dict[str, Any] | None) – Storage options for output MDIO.
- Return type:
None
- mdio.api.convenience.rechunk(source, chunks, suffix, compressor=None, overwrite=False)¶
Rechunk MDIO file adding a new variable.
- Parameters:
source (MDIOAccessor) – MDIO accessor instance. Data will be copied from here.
chunks (tuple[int, ...]) – Tuple containing chunk sizes for new rechunked array.
suffix (str) – Suffix to append to new rechunked array.
compressor (Codec | None) – Data compressor to use, optional. Default is Blosc(‘zstd’).
overwrite (bool) – Overwrite destination or not.
- Return type:
None
Examples
To rechunk a single variable we can do this:
>>> from mdio.api.accessor import MDIOAccessor
>>> from mdio.api.convenience import rechunk
>>> accessor = MDIOAccessor(...)
>>> rechunk(accessor, (1, 1024, 1024), suffix="fast_il")
- mdio.api.convenience.rechunk_batch(source, chunks_list, suffix_list, compressor=None, overwrite=False)¶
Rechunk MDIO file to multiple variables, reading it once.
- Parameters:
source (MDIOAccessor) – MDIO accessor instance. Data will be copied from here.
chunks_list (list[tuple[int, ...]]) – List of tuples containing new chunk sizes.
suffix_list (list[str]) – List of suffixes to append to the new rechunked arrays.
compressor (Codec | None) – Data compressor to use, optional. Default is Blosc(‘zstd’).
overwrite (bool) – Overwrite destination or not.
- Return type:
None
Examples
To rechunk multiple variables we can do things like:
>>> from mdio.api.accessor import MDIOAccessor
>>> from mdio.api.convenience import rechunk_batch
>>> accessor = MDIOAccessor(...)
>>> rechunk_batch(
...     accessor,
...     chunks_list=[(1, 1024, 1024), (1024, 1, 1024), (1024, 1024, 1)],
...     suffix_list=["fast_il", "fast_xl", "fast_z"],
... )
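Because chunks_list and suffix_list are parallel lists, it can be less error-prone to keep each suffix next to its chunk shape and split them just before the call. This is one possible arrangement, not a requirement of the API; split_layouts and rechunk_all are hypothetical helpers.

```python
def split_layouts(layouts):
    """Turn a {suffix: chunks} mapping into the two parallel lists rechunk_batch expects."""
    suffix_list = list(layouts)
    chunks_list = [tuple(layouts[suffix]) for suffix in suffix_list]
    return chunks_list, suffix_list


def rechunk_all(accessor, layouts):
    """Rechunk to every layout in one pass (requires mdio to be installed)."""
    from mdio.api.convenience import rechunk_batch  # lazy import; path as documented above

    chunks_list, suffix_list = split_layouts(layouts)
    rechunk_batch(accessor, chunks_list=chunks_list, suffix_list=suffix_list)


layouts = {
    "fast_il": (1, 1024, 1024),
    "fast_xl": (1024, 1, 1024),
    "fast_z": (1024, 1024, 1),
}
chunks_list, suffix_list = split_layouts(layouts)
print(suffix_list)  # ['fast_il', 'fast_xl', 'fast_z']
```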
Core Functionality¶
Dimensions¶
Dimension (grid) abstraction and serializers.
- class mdio.core.dimension.Dimension(coords, name)¶
Dimension class.
Dimension has a name and coordinates associated with it. The Dimension coordinates can only be a vector.
- Parameters:
- classmethod deserialize(stream, stream_format)¶
Deserialize buffer into Dimension.
- classmethod from_dict(other)¶
Make dimension from dictionary.
- serialize(stream_format)¶
Serialize the dimension into buffer.
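To make the serialize and deserialize round trip concrete, here is a self-contained sketch of a dimension-like object that is written to and restored from JSON. SketchDimension and its payload layout are assumptions for illustration; the real Dimension class may store its data differently.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class SketchDimension:
    """A named vector of coordinates (stand-in for illustration only)."""

    name: str
    coords: list

    def serialize(self) -> str:
        """Write the name and coordinates to a JSON string."""
        return json.dumps(asdict(self))

    @classmethod
    def deserialize(cls, stream: str) -> "SketchDimension":
        """Rebuild the dimension from a JSON string."""
        return cls(**json.loads(stream))


inline = SketchDimension(name="inline", coords=[10, 20, 30])
restored = SketchDimension.deserialize(inline.serialize())
print(restored == inline)  # True
```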
Creation¶
Module for creating empty MDIO datasets.
This module provides tools to configure and initialize empty MDIO datasets, which are used for storing multidimensional data with associated metadata. It includes:
MDIOVariableConfig: Config for individual variables in the dataset.
MDIOCreateConfig: Config for the dataset, including path, grid, and variables.
create_empty: Function to create the empty dataset based on provided configuration.
create_empty_like: Create an empty dataset with same structure as an existing one.
The create_empty function sets up the Zarr hierarchy with metadata and data groups, creates datasets for each variable and their trace headers, and initializes attributes such as creation time, API version, grid dimensions, and basic statistics.
The create_empty_like function creates a new empty dataset by replicating the structure of an existing MDIO dataset, including its grid, variables, and headers.
For detailed usage and parameters, see the docstring of the create_empty function.
- class mdio.core.factory.MDIOCreateConfig(path, grid, variables)¶
Configuration for creating an MDIO dataset.
This dataclass encapsulates the parameters needed to create an MDIO dataset, including the storage path, grid specification, and a list of variable configurations.
- Parameters:
path (str)
grid (Grid)
variables (list[MDIOVariableConfig])
- grid¶
The grid specification defining the dataset’s spatial structure.
- Type:
mdio.core.grid.Grid
- variables¶
A list of configurations for variables to be included in dataset.
- class mdio.core.factory.MDIOVariableConfig(name, dtype, chunks=None, compressors=None, header_dtype=None)¶
Configuration for creating an MDIO variable.
This dataclass defines the parameters required to configure a variable in an MDIO dataset, including its name, data type, chunking strategy, compression method, and optional header data type.
- Parameters:
name (str)
dtype (str)
compressors (Iterable[dict[str, str | int | float | Mapping[str, JSON] | Sequence[JSON] | None] | BytesBytesCodec | Codec] | dict[str, str | int | float | Mapping[str, JSON] | Sequence[JSON] | None] | BytesBytesCodec | Codec | Literal['auto'] | None)
header_dtype (dtype[Any] | None | type[Any] | _SupportsDType[dtype[Any]] | str | tuple[Any, int] | tuple[Any, SupportsIndex | Sequence[SupportsIndex]] | list[Any] | _DTypeDict | tuple[Any, Any])
- compressors¶
The compression algorithm(s) to use.
- Type:
collections.abc.Iterable[dict[str, str | int | float | collections.abc.Mapping[str, JSON] | collections.abc.Sequence[JSON] | None] | zarr.abc.codec.BytesBytesCodec | numcodecs.abc.Codec] | dict[str, str | int | float | collections.abc.Mapping[str, JSON] | collections.abc.Sequence[JSON] | None] | zarr.abc.codec.BytesBytesCodec | numcodecs.abc.Codec | Literal[‘auto’] | None
- header_dtype¶
The data type for the variable’s header.
- Type:
numpy.dtype[Any] | None | type[Any] | numpy._typing._dtype_like._SupportsDType[numpy.dtype[Any]] | str | tuple[Any, int] | tuple[Any, SupportsIndex | collections.abc.Sequence[SupportsIndex]] | list[Any] | numpy._typing._dtype_like._DTypeDict | tuple[Any, Any]
- mdio.core.factory.create_empty(config, overwrite=False, storage_options=None, consolidate_meta=True)¶
Create an empty MDIO dataset.
This function initializes a new MDIO dataset at the specified path based on the provided configuration. It constructs a Zarr hierarchy with groups for metadata and data, creates datasets for each variable and its associated trace headers, and sets initial attributes such as creation time, API version, grid dimensions, and basic statistics (all initialized to zero). An empty ‘live_mask’ dataset is also created to track valid traces.
Important: It is up to the user to update live_mask and other attributes.
- Parameters:
config (MDIOCreateConfig) – Configuration object specifying the dataset’s path, grid structure, and a list of variable configurations (e.g., name, dtype, chunks).
overwrite (bool) – If True, overwrites any existing dataset at the specified path. If False, an error is raised if the dataset exists. Defaults to False.
storage_options (dict[str, Any] | None) – Options for the storage backend, such as credentials or settings for cloud storage (e.g., S3, GCS). Defaults to None.
consolidate_meta (bool) – If True, consolidates metadata into a single file after creation, improving performance for large metadata. Defaults to True.
- Returns:
The root Zarr group representing the newly created MDIO dataset.
- Return type:
Group
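Putting the pieces together, creating an empty dataset might look like the sketch below. Several details are assumptions not confirmed by this reference: that Grid accepts a dims keyword, that Dimension accepts a plain list of coordinates, and the variable name "seismic". grid_shape and create_dataset are hypothetical helpers.

```python
def grid_shape(dim_specs):
    """Shape of the grid implied by a list of (name, coordinates) pairs."""
    return tuple(len(coords) for _, coords in dim_specs)


def create_dataset(path, dim_specs, chunks):
    """Create an empty MDIO dataset (requires mdio to be installed)."""
    from mdio.core.dimension import Dimension
    from mdio.core.factory import MDIOCreateConfig, MDIOVariableConfig, create_empty
    from mdio.core.grid import Grid

    dims = [Dimension(coords=list(coords), name=name) for name, coords in dim_specs]
    grid = Grid(dims=dims)  # assumption: Grid is built from a list of Dimension objects
    variable = MDIOVariableConfig(name="seismic", dtype="float32", chunks=chunks)
    config = MDIOCreateConfig(path=path, grid=grid, variables=[variable])
    return create_empty(config, overwrite=False)


dim_specs = [
    ("inline", range(100, 150)),
    ("crossline", range(200, 300)),
    ("sample", range(0, 3000, 4)),
]
shape = grid_shape(dim_specs)
print(shape)  # (50, 100, 750)
```

As noted above, the live mask and statistics of the new dataset still have to be filled in by the caller.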
- mdio.core.factory.create_empty_like(source_path, dest_path, overwrite=False, storage_options_input=None, storage_options_output=None)¶
Create an empty MDIO dataset with the same structure as an existing one.
This function initializes a new empty MDIO dataset at the specified destination path, replicating the structure of an existing dataset, including its grid, variables, chunking strategy, compression, and headers. It copies metadata such as text and binary headers from the source dataset and sets initial attributes like creation time, API version, and zeroed statistics.
Important: It is up to the user to update headers, live_mask and stats.
- Parameters:
source_path (str) – The path or URI of the existing MDIO dataset to replicate.
dest_path (str) – The path or URI where the new MDIO dataset will be created.
overwrite (bool) – If True, overwrites any existing dataset at the destination.
storage_options_input (dict[str, Any] | None) – Options for storage backend of the source dataset.
storage_options_output (dict[str, Any] | None) – Options for storage backend of the destination dataset.
- Return type:
None
Data I/O¶
(De)serialization factory design pattern.
Current support for JSON and YAML.
- class mdio.core.serialization.Serializer(stream_format)¶
Serializer base class.
Here we define the interface for any serializer implementation.
- Parameters:
stream_format (str) – Stream format. Must be in {“JSON”, “YAML”}.
- static validate_payload(payload, signature)¶
Validate if required keys exist in the payload for a function signature.
- abstractmethod deserialize(stream)¶
Abstract method for deserialize.
- mdio.core.serialization.get_deserializer(stream_format)¶
Get deserializer based on format.
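The factory pattern mentioned for this module can be sketched with the standard library alone. The example registers JSON only, since YAML needs a third-party package; pick_serializer and pick_deserializer are hypothetical names and do not reproduce MDIO's functions.

```python
import json

# Registry of (serialize, deserialize) callables keyed by format name.
_FORMATS = {
    "JSON": (json.dumps, json.loads),
}


def _lookup(stream_format):
    try:
        return _FORMATS[stream_format.upper()]
    except KeyError:
        raise ValueError(f"Unsupported stream format: {stream_format!r}") from None


def pick_serializer(stream_format):
    """Return a callable that converts a payload into a string."""
    return _lookup(stream_format)[0]


def pick_deserializer(stream_format):
    """Return a callable that converts a string back into a payload."""
    return _lookup(stream_format)[1]


payload = {"name": "inline", "coords": [10, 20, 30]}
stream = pick_serializer("JSON")(payload)
print(pick_deserializer("JSON")(stream) == payload)  # True
```

Supporting another format in this sketch only requires adding an entry to the registry; the callers stay unchanged.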