Test Dataset#
This page describes how the test datasets were generated and how they are used in the tests.
Scaling workflow - MTZ files#
MTZ test datasets are created with gemmi and a random number generator.
There are several test MTZ files because the workflow is normally expected to process multiple files.
These files have no physical meaning; they exist only to test the workflow.
Here is the code cell to create the test files.
[1]:
import gemmi
import pandas as pd
import numpy as np
from ess.nmx.mtz_io import (
    DEFAULT_INTENSITY_COLUMN_NAME,
    DEFAULT_WAVELENGTH_COLUMN_NAME,
    DEFAULT_STD_DEV_COLUMN_NAME,
    DEFAULT_SPACE_GROUP_DESC,
)

# Negative intensities will happen due to corrections
# and high intensities are also expected in some cases
INTENSITY_RANGE = (-20.0, 200.0)
HKL_RANGE = (-100, 100)
MANDATORY_FIELDS = (
    "H",
    "K",
    "L",
    DEFAULT_WAVELENGTH_COLUMN_NAME,  # LAMBDA
    DEFAULT_INTENSITY_COLUMN_NAME,  # I
    DEFAULT_STD_DEV_COLUMN_NAME,  # SIGI
)
global_rng = np.random.default_rng(0)
HKL_CANDIDATES = tuple(
    zip(*[global_rng.integers(*HKL_RANGE, size=100) for _ in range(3)], strict=False)
)


def create_mtz_data_frame(random_seed: int) -> pd.DataFrame:
    rng = np.random.default_rng(random_seed)
    intensities = np.sort(rng.normal(50, 20, size=10_000))[::-1] + (
        rng.uniform(*INTENSITY_RANGE, size=10_000)
        * rng.choice([0] * 99 + [1], size=10_000)
    )
    std_devs = np.multiply(intensities, rng.uniform(0.1, 0.15, size=10_000))
    wavelengths = np.sort(rng.uniform(2.8, 3.2, size=10_000))[::-1]
    df = pd.DataFrame(
        {
            DEFAULT_INTENSITY_COLUMN_NAME: intensities,
            DEFAULT_STD_DEV_COLUMN_NAME: std_devs,
            DEFAULT_WAVELENGTH_COLUMN_NAME: wavelengths,
        }
    )
    df[["H", "K", "L"]] = pd.Series(
        rng.choice(HKL_CANDIDATES, size=10_000).tolist()
    ).to_list()
    return df


def dataframe_to_mtz(df: pd.DataFrame) -> gemmi.Mtz:
    """Create a random MTZ file with a single dataset.

    Columns:
    - H, K, L: Miller indices
    - LAMBDA: Wavelength
    - I: Intensity
    - SIGI: Standard deviation of intensity
    """
    assert set(df.columns) == set(MANDATORY_FIELDS)
    mtz = gemmi.Mtz()
    mtz.add_dataset("HKL")
    column_type_map = {  # Column types: https://www.ccp4.ac.uk/html/mtzformat.html#coltypes
        "H": "H",
        "K": "H",
        "L": "H",
        DEFAULT_WAVELENGTH_COLUMN_NAME: "R",
        DEFAULT_INTENSITY_COLUMN_NAME: "J",
        DEFAULT_STD_DEV_COLUMN_NAME: "Q",
    }
    for col_name in df.columns:
        mtz.add_column(col_name, type=column_type_map[col_name], expand_data=True)
    mtz.spacegroup = gemmi.SpaceGroup(DEFAULT_SPACE_GROUP_DESC)
    mtz.set_data(df.values)
    return mtz


for seed in range(1, 6):
    sample_df = create_mtz_data_frame(seed)
    sample_mtz = dataframe_to_mtz(sample_df)
    # sample_mtz.write_to_file(f"sample_{seed}.mtz")  # Uncomment to save the MTZ file

sample_df.head()
[1]:
 | I | SIGI | LAMBDA | H | K | L |
---|---|---|---|---|---|---|
0 | 121.346225 | 16.221380 | 3.199910 | 68 | 66 | 60 |
1 | 120.356421 | 14.619821 | 3.199888 | 5 | -18 | -43 |
2 | 118.749968 | 17.432007 | 3.199803 | -24 | -15 | -52 |
3 | 118.462882 | 12.729899 | 3.199750 | 86 | -78 | 91 |
4 | 116.012108 | 14.932206 | 3.199700 | 42 | 3 | 18 |
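The intensity construction in the cell above can be looked at in isolation. The following is a minimal sketch that repeats only the intensity and standard-deviation steps, using NumPy alone. The names `N`, `base`, `spikes`, and `mask` are introduced here for illustration and do not appear in the original cell. It shows that the base signal is a descending sorted normal sample, that roughly 1% of rows receive an extra uniform term, and that `SIGI` takes the sign of `I` because it is the intensity multiplied by a positive factor.

```python
import numpy as np

INTENSITY_RANGE = (-20.0, 200.0)
N = 10_000  # hypothetical name for the row count used in the cell above

rng = np.random.default_rng(1)

# Same call order as in create_mtz_data_frame: normal, uniform, choice, uniform.
base = np.sort(rng.normal(50, 20, size=N))[::-1]  # descending base intensities
spikes = rng.uniform(*INTENSITY_RANGE, size=N)  # candidate extra terms
mask = rng.choice([0] * 99 + [1], size=N)  # 1 with probability 1/100
intensities = base + spikes * mask

# The standard deviation is 10-15% of the intensity, so it is negative
# wherever the intensity is negative.
std_devs = intensities * rng.uniform(0.1, 0.15, size=N)

print(f"rows with an extra uniform term: {int(mask.sum())} of {N}")
print(f"negative intensities: {int((intensities < 0).sum())}")
print(f"negative standard deviations: {int((std_devs < 0).sum())}")
```

A test that consumes these files may therefore need to handle negative `SIGI` values. Whether the tests rely on this is not stated on this page.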
Once the files were created, they were compressed into a single archive and uploaded to a server that pooch can access.
Here is the command used to compress the files.
tar -czvf mtz_random_samples.tar.gz sample_*.mtz
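The same archive can be built from Python with the standard library. The sketch below is an illustration rather than the project's actual tooling: the helper names `make_archive` and `sha256_of` are hypothetical, and the demonstration writes placeholder bytes instead of real MTZ data. It also computes a SHA-256 checksum of the archive, since pooch can verify a download against a known hash. Whether this project records such a checksum is an assumption.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def make_archive(
    src_dir: Path, archive_path: Path, pattern: str = "sample_*.mtz"
) -> list[str]:
    """Pack every file in ``src_dir`` matching ``pattern`` into a gzipped tarball."""
    files = sorted(src_dir.glob(pattern))
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in files:
            # Store only the file name so the archive unpacks flat.
            tar.add(path, arcname=path.name)
    return [path.name for path in files]


def sha256_of(path: Path) -> str:
    """Return the hexadecimal SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration with placeholder files standing in for the real MTZ files.
with tempfile.TemporaryDirectory() as tmp:
    work_dir = Path(tmp)
    for seed in range(1, 6):
        (work_dir / f"sample_{seed}.mtz").write_bytes(b"placeholder")

    archive = work_dir / "mtz_random_samples.tar.gz"
    packed = make_archive(work_dir, archive)

    # Read the archive back to confirm what it contains.
    with tarfile.open(archive, "r:gz") as tar:
        members = sorted(tar.getnames())
    checksum = sha256_of(archive)

print(packed)
print(f"sha256:{checksum}")
```

With real data, `make_archive` would be pointed at the directory where the uncommented `write_to_file` call saved the `sample_*.mtz` files.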