Test Dataset

This page describes how the test datasets were generated and how they are used in the tests.

Scaling workflow - MTZ files

The MTZ test datasets are created with gemmi and a random number generator.

Several test MTZ files are provided because the workflow is normally expected to receive multiple files.

These files do not have any physical meaning; they are only meant to be useful for testing the workflow.

Here is the code cell to create the test files.

[1]:
import gemmi
import pandas as pd
import numpy as np
from ess.nmx.mtz_io import (
    DEFAULT_INTENSITY_COLUMN_NAME,
    DEFAULT_WAVELENGTH_COLUMN_NAME,
    DEFAULT_STD_DEV_COLUMN_NAME,
    DEFAULT_SPACE_GROUP_DESC,
)

# Negative intensities will happen due to corrections
# and high intensities are also expected in some cases
INTENSITY_RANGE = (-20.0, 200.0)
HKL_RANGE = (-100, 100)
MANDATORY_FIELDS = (
    "H",
    "K",
    "L",
    DEFAULT_WAVELENGTH_COLUMN_NAME,  # LAMBDA
    DEFAULT_INTENSITY_COLUMN_NAME,  # I
    DEFAULT_STD_DEV_COLUMN_NAME,  # SIGI
)
global_rng = np.random.default_rng(0)
HKL_CANDIDATES = tuple(
    zip(*[global_rng.integers(*HKL_RANGE, size=100) for _ in range(3)], strict=False)
)

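def create_mtz_data_frame(random_seed: int) -> pd.DataFrame:
    """Create a data frame with 10,000 random reflections for one test file."""
    rng = np.random.default_rng(random_seed)
    # Normally distributed baseline sorted in descending order,
    # plus a uniform offset applied to roughly 1 in 100 entries
    intensities = np.sort(rng.normal(50, 20, size=10_000))[::-1] + (
        rng.uniform(*INTENSITY_RANGE, size=10_000)
        * rng.choice([0] * 99 + [1], size=10_000)
    )
    # Standard deviations are 10-15 % of the corresponding intensity
    std_devs = np.multiply(intensities, rng.uniform(0.1, 0.15, size=10_000))
    # Wavelengths sorted in descending order
    wavelengths = np.sort(rng.uniform(2.8, 3.2, size=10_000))[::-1]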

    df = pd.DataFrame(
        {
            DEFAULT_INTENSITY_COLUMN_NAME: intensities,
            DEFAULT_STD_DEV_COLUMN_NAME: std_devs,
            DEFAULT_WAVELENGTH_COLUMN_NAME: wavelengths,
        }
    )

    df[["H", "K", "L"]] = pd.Series(
        rng.choice(HKL_CANDIDATES, size=10_000).tolist()
    ).to_list()

    return df


def dataframe_to_mtz(df: pd.DataFrame) -> gemmi.Mtz:
    """Convert a data frame of random reflections into an MTZ object with a single dataset.

    Columns:
    - H, K, L: Miller indices
    - LAMBDA: Wavelength
    - I: Intensity
    - SIGI: Standard deviation of intensity

    """
    assert set(df.columns) == set(MANDATORY_FIELDS)

    mtz = gemmi.Mtz()
    mtz.add_dataset("HKL")
    column_type_map = {  # Column types: https://www.ccp4.ac.uk/html/mtzformat.html#coltypes
        "H": "H",
        "K": "H",
        "L": "H",
        DEFAULT_WAVELENGTH_COLUMN_NAME: "R",
        DEFAULT_INTENSITY_COLUMN_NAME: "J",
        DEFAULT_STD_DEV_COLUMN_NAME: "Q",
    }

    for col_name in df.columns:
        mtz.add_column(col_name, type=column_type_map[col_name], expand_data=True)

    mtz.spacegroup = gemmi.SpaceGroup(DEFAULT_SPACE_GROUP_DESC)
    mtz.set_data(df.values)
    return mtz


for seed in range(1, 6):
    sample_df = create_mtz_data_frame(seed)
    sample_mtz = dataframe_to_mtz(sample_df)
    # sample_mtz.write_to_file(f"sample_{seed}.mtz")  # Uncomment to save the MTZ file

sample_df.head()
[1]:
            I       SIGI    LAMBDA    H    K    L
0  121.346225  16.221380  3.199910   68   66   60
1  120.356421  14.619821  3.199888    5  -18  -43
2  118.749968  17.432007  3.199803  -24  -15  -52
3  118.462882  12.729899  3.199750   86  -78   91
4  116.012108  14.932206  3.199700   42    3   18
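
The intensity model in create_mtz_data_frame adds a uniform offset to only a small share of the entries. The following is a minimal sketch that isolates this masking step so its effect can be inspected on its own. The seed and the variable names (mask, offsets, outlier_fraction) are chosen for illustration and are not part of the original code cell.

```python
import numpy as np

# Seed chosen arbitrarily for this illustration (assumption, not from the test data)
rng = np.random.default_rng(0)
n = 10_000

# Same masking trick as in create_mtz_data_frame:
# 99 zeros and a single one, so each entry is selected with probability 1/100.
mask = rng.choice([0] * 99 + [1], size=n)

# Offsets drawn from INTENSITY_RANGE = (-20.0, 200.0), zeroed where the mask is 0
offsets = rng.uniform(-20.0, 200.0, size=n) * mask

outlier_fraction = mask.mean()
print(f"fraction of perturbed intensities: {outlier_fraction:.3f}")
```

Because the mask is mostly zero, most intensities keep their sorted, normally distributed baseline value, and only the selected entries are shifted, which is how the occasional very high or negative intensity arises.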

Once the files were created, they were compressed into a single archive and uploaded to a server that pooch can access.

Here is the command used to compress the files.

tar -czvf mtz_random_samples.tar.gz sample_*.mtz
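
The same archive could also be produced from Python. The sketch below uses the standard-library tarfile module to pack the files and then lists the archive members as a quick check. The helper name compress_samples and the placeholder file contents are assumptions made for illustration; they are not part of the original workflow.

```python
import tarfile
import tempfile
from pathlib import Path


def compress_samples(source_dir: Path, archive_path: Path) -> list[str]:
    """Pack every sample_*.mtz file in source_dir into a gzip-compressed tarball.

    Hypothetical helper mirroring ``tar -czvf mtz_random_samples.tar.gz sample_*.mtz``.
    """
    files = sorted(source_dir.glob("sample_*.mtz"))
    with tarfile.open(archive_path, "w:gz") as tar:
        for file in files:
            # arcname stores only the file name, as when tar runs inside the directory
            tar.add(file, arcname=file.name)
    return [file.name for file in files]


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    # Placeholder files standing in for the real MTZ files (illustrative contents only)
    for seed in range(1, 6):
        (tmp_dir / f"sample_{seed}.mtz").write_bytes(b"placeholder")

    archive = tmp_dir / "mtz_random_samples.tar.gz"
    packed = compress_samples(tmp_dir, archive)

    # Read the archive back to confirm which files it contains
    with tarfile.open(archive, "r:gz") as tar:
        members = sorted(tar.getnames())

print(members)
```

Listing the members after writing is a cheap way to confirm that all five sample files ended up in the archive before it is uploaded.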