Example data: provenance

How the tutorial's example.npz was made. A one-time, offline step — single-cell tools live here only, never in the package or the tutorial.

Source. PBMC68k (10x Genomics) — ~68,000 human blood cells.

Preprocessing (scanpy; mirrors the upstream TopOMetry step-by-step):

| # | Step | Call |
|---|------|------|
| 1 | QC filter: drop cells with <200 genes, genes in <3 cells, cells with high mitochondrial % | sc.pp.filter_cells / sc.pp.filter_genes, pct_counts_mt |
| 2 | Normalize counts per cell → 10⁴ | sc.pp.normalize_total(target_sum=1e4) |
| 3 | Log-transform | sc.pp.log1p |
| 4 | Keep top 2000 highly-variable genes | sc.pp.highly_variable_genes |
| 5 | Z-score per gene, clip at 10 | sc.pp.scale(max_value=10) |
| 6 | Cell-type labels | marker-based classifier → obs['predicted_celltype'] |
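To make the numeric steps concrete, here is a minimal NumPy sketch of what steps 2, 3 and 5 compute on a small dense count matrix. It is an illustration under stated assumptions, not the scanpy implementation: normalize_log_scale is a hypothetical helper, the sample standard deviation (ddof=1) is an assumption, clipping is applied to the upper tail only, and the highly-variable-gene selection of step 4 is left out.

```python
import numpy as np


def normalize_log_scale(counts, target_sum=1e4, max_value=10.0):
    """Toy dense version of steps 2, 3 and 5 (hypothetical helper, not scanpy code)."""
    counts = np.asarray(counts, dtype=np.float64)

    # Step 2: rescale every cell (row) so that its counts sum to target_sum.
    per_cell = counts.sum(axis=1, keepdims=True)
    per_cell[per_cell == 0] = 1.0  # leave empty cells at zero instead of dividing by 0
    normalized = counts / per_cell * target_sum

    # Step 3: log(1 + x).
    logged = np.log1p(normalized)

    # Step 5: z-score every gene (column), then cap values above max_value.
    # ddof=1 and upper-only clipping are assumptions of this sketch.
    mean = logged.mean(axis=0)
    std = logged.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant genes stay at zero
    z = (logged - mean) / std
    return np.minimum(z, max_value)


# Synthetic counts standing in for a real cells x genes matrix.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(2.0, size=(50, 8))
toy_scaled = normalize_log_scale(toy_counts)
print(toy_scaled.shape, toy_scaled.dtype)
```

After scaling, each gene column has mean close to zero, which is why the exported matrix is dense rather than sparse.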

Result in memory: AnnData 68265 × 2000 (scaled), with predicted_celltype.

Export to the package's (X, labels) contract, sub-sampled for a fast download:

import numpy as np
from scipy.sparse import issparse

# scaled matrix -> dense float32
X = adata.X
X = X.toarray() if issparse(X) else np.asarray(X)
X = X.astype(np.float32)

# cell type -> integer ids (+ readable names)
cats = adata.obs["predicted_celltype"].astype("category")
labels = cats.cat.codes.to_numpy().astype(int)
label_names = np.asarray(cats.cat.categories, dtype=str)  # id -> name; plain str array, loads without allow_pickle

# stratified sub-sample to ~10k cells (keeps every cell type, in proportion)
rng = np.random.default_rng(0)
frac = 10_000 / X.shape[0]
idx = np.concatenate([
    rng.choice(
        np.flatnonzero(labels == c),
        max(1, round((labels == c).sum() * frac)),
        replace=False,
    )
    for c in np.unique(labels)
])
rng.shuffle(idx)
X, labels = X[idx], labels[idx]

np.savez_compressed("example.npz", X=X, labels=labels, label_names=label_names)
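The per-class quota used above can be checked on toy labels. The sketch below wraps the same logic in a hypothetical helper, stratified_indices, and runs it on synthetic labels with one very rare class. It illustrates why the result is "about" the target size: rounding and the floor of one cell per class can move the total by a few cells.

```python
import numpy as np


def stratified_indices(labels, n_target, seed=0):
    """Pick roughly n_target indices, keeping every class in proportion (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    frac = n_target / labels.shape[0]
    picks = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        # At least one cell per class, otherwise the class's share of the target.
        k = max(1, round(members.size * frac))
        picks.append(rng.choice(members, k, replace=False))
    idx = np.concatenate(picks)
    rng.shuffle(idx)  # avoid leaving the output grouped by class
    return idx


# Synthetic labels: 950 / 48 / 2 cells in three classes, target of 100.
toy_labels = np.repeat([0, 1, 2], [950, 48, 2])
toy_idx = stratified_indices(toy_labels, n_target=100)
kept_per_class = np.bincount(toy_labels[toy_idx], minlength=3)
print(kept_per_class)  # -> [95  5  1]
```

The rarest class would round to zero cells and is kept at one, so the sample has 101 cells rather than exactly 100.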

Output. example.npz contains X (cells × 2000, float32), labels (int cell-type id), and label_names (id → cell type). Upload it as a GitHub Release asset; the tutorial's data.py downloads it. Nothing single-cell is needed downstream.

gh release create data-v1 example.npz --title "Tutorial data" \
  --notes "Preprocessed, sub-sampled PBMC68k for the step-by-step tutorial"
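Once the file is downloaded, the (X, labels) contract can be checked with NumPy alone. The following is a minimal sketch, not the tutorial's data.py: load_example is a hypothetical helper, and a tiny synthetic archive stands in for example.npz so the snippet runs without the real download. np.load is called with its default allow_pickle=False, so label_names has to be stored as a plain string array.

```python
import os
import tempfile

import numpy as np


def load_example(path):
    """Load X, labels and label_names and check that they line up (hypothetical helper)."""
    with np.load(path) as data:
        X = data["X"]
        labels = data["labels"]
        label_names = data["label_names"]
    if X.ndim != 2 or labels.shape != (X.shape[0],):
        raise ValueError("X must be 2-D with exactly one label per row")
    if labels.min() < 0 or labels.max() >= label_names.shape[0]:
        raise ValueError("label ids must index into label_names")
    return X, labels, label_names


# Tiny synthetic archive with the same three keys as example.npz.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.npz")
    np.savez_compressed(
        demo_path,
        X=np.zeros((4, 3), dtype=np.float32),
        labels=np.array([0, 1, 1, 0]),
        label_names=np.array(["B cell", "T cell"]),
    )
    X_demo, labels_demo, names_demo = load_example(demo_path)

# Integer ids index straight into label_names to recover readable names.
readable = names_demo[labels_demo]
print(X_demo.shape, readable)
```

Pointing load_example at the real example.npz should return X with 2000 columns and one integer label per row.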