Test Dataset

This page describes how the test datasets were generated and how they are used in the tests.

Scaling workflow - MTZ files

The MTZ test datasets are created with gemmi and a NumPy random number generator.

Several test MTZ files are provided because the workflow is usually given more than one file.

These files have no physical meaning; they exist only to test the workflow.

The following code cell creates the test files.

[1]:

import gemmi
import pandas as pd
import numpy as np
from ess.nmx.mtz_io import (
    DEFAULT_INTENSITY_COLUMN_NAME,
    DEFAULT_WAVELENGTH_COLUMN_NAME,
    DEFAULT_STD_DEV_COLUMN_NAME,
    DEFAULT_SPACE_GROUP_DESC,
)

# Negative intensities can occur after corrections,
# and high intensities are also expected in some cases
INTENSITY_RANGE = (-20.0, 200.0)
HKL_RANGE = (-100, 100)
MANDATORY_FIELDS = (
    "H",
    "K",
    "L",
    DEFAULT_WAVELENGTH_COLUMN_NAME,  # LAMBDA
    DEFAULT_INTENSITY_COLUMN_NAME,  # I
    DEFAULT_STD_DEV_COLUMN_NAME,  # SIGI
)
global_rng = np.random.default_rng(0)
HKL_CANDIDATES = tuple(
    zip(*[global_rng.integers(*HKL_RANGE, size=100) for _ in range(3)], strict=False)
)

def create_mtz_data_frame(random_seed: int) -> pd.DataFrame:
    rng = np.random.default_rng(random_seed)
    intensities = np.sort(rng.normal(50, 20, size=10_000))[::-1] + (
        rng.uniform(*INTENSITY_RANGE, size=10_000)
        * rng.choice([0] * 99 + [1], size=10_000)
    )
    std_devs = np.multiply(intensities, rng.uniform(0.1, 0.15, size=10_000))
    wavelengths = np.sort(rng.uniform(2.8, 3.2, size=10_000))[::-1]

    df = pd.DataFrame(
        {
            DEFAULT_INTENSITY_COLUMN_NAME: intensities,
            DEFAULT_STD_DEV_COLUMN_NAME: std_devs,
            DEFAULT_WAVELENGTH_COLUMN_NAME: wavelengths,
        }
    )

    df[["H", "K", "L"]] = pd.Series(
        rng.choice(HKL_CANDIDATES, size=10_000).tolist()
    ).to_list()

    return df


def dataframe_to_mtz(df: pd.DataFrame) -> gemmi.Mtz:
    """Convert a data frame into an in-memory MTZ object with a single dataset.

    Columns:
    - H, K, L: Miller indices
    - LAMBDA: Wavelength
    - I: Intensity
    - SIGI: Standard deviation of intensity

    """
    assert set(df.columns) == set(MANDATORY_FIELDS)

    mtz = gemmi.Mtz()
    mtz.add_dataset("HKL")
    column_type_map = {  # Column types: https://www.ccp4.ac.uk/html/mtzformat.html#coltypes
        "H": "H",
        "K": "H",
        "L": "H",
        DEFAULT_WAVELENGTH_COLUMN_NAME: "R",
        DEFAULT_INTENSITY_COLUMN_NAME: "J",
        DEFAULT_STD_DEV_COLUMN_NAME: "Q",
    }

    for col_name in df.columns:
        mtz.add_column(col_name, type=column_type_map[col_name], expand_data=True)

    mtz.spacegroup = gemmi.SpaceGroup(DEFAULT_SPACE_GROUP_DESC)
    mtz.set_data(df.values)
    return mtz


for seed in range(1, 6):
    sample_df = create_mtz_data_frame(seed)
    sample_mtz = dataframe_to_mtz(sample_df)
    # sample_mtz.write_to_file(f"sample_{seed}.mtz")  # Uncomment to save the MTZ file

sample_df.head()

[1]:

	I	SIGI	LAMBDA	H	K	L
0	121.346225	16.221380	3.199910	68	66	60
1	120.356421	14.619821	3.199888	5	-18	-43
2	118.749968	17.432007	3.199803	-24	-15	-52
3	118.462882	12.729899	3.199750	86	-78	91
4	116.012108	14.932206	3.199700	42	3	18
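The intensity column in the cell above is built from a sorted normal baseline plus sparse random offsets. The sketch below isolates that step using only NumPy, so the effect of the `rng.choice([0] * 99 + [1], ...)` mask is easier to see. `N_REFLECTIONS` is a name introduced here for the hard-coded size of 10,000, and the seed is arbitrary; this is an illustration, not part of the test-data pipeline.

```python
import numpy as np

INTENSITY_RANGE = (-20.0, 200.0)  # same range as in the generation cell
N_REFLECTIONS = 10_000  # hypothetical name for the hard-coded size

rng = np.random.default_rng(1)

# Smooth baseline: normally distributed values sorted in descending order.
baseline = np.sort(rng.normal(50, 20, size=N_REFLECTIONS))[::-1]

# Candidate offsets drawn uniformly from the full intensity range.
raw_offsets = rng.uniform(*INTENSITY_RANGE, size=N_REFLECTIONS)

# Sparse mask: each entry is 1 with probability 1/100, otherwise 0,
# so only about 1% of the baseline values receive an offset.
outlier_mask = rng.choice([0] * 99 + [1], size=N_REFLECTIONS)

intensities = baseline + raw_offsets * outlier_mask

print(f"fraction of perturbed entries: {outlier_mask.mean():.3f}")
print(f"min / max intensity: {intensities.min():.1f} / {intensities.max():.1f}")
```

Entries where the mask is 0 keep their baseline value unchanged, which is why the head of the table decreases smoothly while occasional rows further down can jump well above or below their neighbours.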

Once the files were created, they were compressed into a single archive and uploaded to the server from which pooch can download them.

The following command compresses the files.

tar -czvf mtz_random_samples.tar.gz sample_*.mtz
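Where the `tar` command is not available, roughly the same archive can be produced with Python's standard `tarfile` module. The sketch below is an assumed equivalent, not the script that was actually used; the function names `compress_mtz_samples` and `list_archive_members` are hypothetical helpers introduced for illustration.

```python
import tarfile
from pathlib import Path


def compress_mtz_samples(
    directory: str = ".", archive_name: str = "mtz_random_samples.tar.gz"
) -> Path:
    """Bundle every sample_*.mtz file in ``directory`` into a gzip-compressed tar."""
    source_dir = Path(directory)
    archive_path = source_dir / archive_name
    with tarfile.open(archive_path, "w:gz") as archive:
        for mtz_path in sorted(source_dir.glob("sample_*.mtz")):
            # Store only the file name so members sit at the archive root,
            # as they do when the shell command is run inside the directory.
            archive.add(mtz_path, arcname=mtz_path.name)
    return archive_path


def list_archive_members(archive_path: Path) -> list[str]:
    """Return the member names of a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()
```

Calling `compress_mtz_samples()` in the directory that holds the generated files writes `mtz_random_samples.tar.gz` there, and `list_archive_members` offers a quick check that all sample files were included before uploading.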