Skip to content

Advanced Estimators (Custom Pipelines)

These scikit-learn compatible transformers can be used individually to build custom manifold learning pipelines.

The Kernel(..., fuzzy=True) path delegates UMAP-specific fuzzy simplicial-set construction to umap-learn; TopoMetry uses the resulting graph inside its own spectral-scaffold and graph-refinement pipeline.

The Kernel(..., cknn=True) path builds the paper-defined binary CkNN graph: samples are adjacent when d(i, j) < delta * sqrt(rho_i * rho_j). This is an unweighted graph construction, not a weighted adaptive kernel; use cknn_ratio_matrix separately when normalized CkNN distance ratios are needed.

topo.base.ann.kNN

kNN(
    X: ndarray | csr_matrix,
    Y: ndarray | csr_matrix | None = None,
    n_neighbors: int | float | str = 5,
    metric: str = "euclidean",
    n_jobs: int = -1,
    backend: str = "sklearn",
    M: int = 60,
    efC: int = 200,
    efS: int = 200,
    verbose: bool = False,
    **kwargs: Any,
) -> csr_matrix

Compute a k-nearest-neighbor distance graph.

Parameters:

Name Type Description Default
X array-like or scipy.sparse matrix, shape (n_samples, n_features)

Reference data used to fit the neighbor index. If metric='precomputed', X must be a square distance matrix.

required
Y array-like or scipy.sparse matrix

Query data. If provided, the output graph has shape (Y.shape[0], X.shape[0]).

None
n_neighbors int

Number of non-self neighbors to return for self-query graphs.

5
metric str

Distance metric.

'euclidean'
n_jobs int

Number of threads. -1 uses all available CPUs.

-1
backend (sklearn, hnswlib)

Neighbor-search backend.

'sklearn'
verbose bool

Emit backend/timing diagnostics through logging.

False

Returns:

Type Description
csr_matrix

kNN distance graph.

topo.tpgraph.cknn.cknn_graph

cknn_graph(
    X,
    *,
    scale_k: int = 10,
    delta: float = 1.0,
    metric: str = "euclidean",
    candidate_k: int | None = None,
    exact: bool = False,
    include_self: bool = False,
    symmetrize: SymmetrizeMode = "or",
    backend: str = "sklearn",
    n_jobs: int = -1,
    verbose: bool = False,
    **kwargs,
) -> csr_matrix

Build the unweighted Continuous k-Nearest Neighbors graph.

Two samples i and j are connected when d(i, j) < delta * sqrt(rho_i * rho_j), where rho_i is the distance from sample i to its scale_k-th nearest neighbor. Exact mode thresholds all pairs. Candidate-neighbor mode is scalable but may miss edges if candidate_k is too small.

Introduced by Berry & Sauer, "Consistent manifold representation for topological data analysis" (https://arxiv.org/abs/1606.02353). The binary CkNN graph with the unnormalized Laplacian gives a consistent estimate of the Laplace-Beltrami operator regardless of sampling density.

Parameters:

Name Type Description Default
X array-like or sparse matrix of shape (n_samples, n_features)

Input data, or a square distance matrix when metric='precomputed'.

required
scale_k int

Neighbor rank used for the local scale rho_i.

10
delta float

Connection threshold; larger values produce denser graphs.

1.0
metric str

Distance metric, or 'precomputed'.

'euclidean'
candidate_k int

Number of candidate neighbors searched per sample. Defaults to max(3 * scale_k, scale_k + 15).

None
exact bool

Threshold all pairwise distances instead of candidate neighbors only. Quadratic in memory; use for small datasets or validation.

False
include_self bool

Whether to keep self-loops.

False
symmetrize ('or', 'and', none)

How to symmetrize the directed candidate relation.

'or'
backend (sklearn, hnswlib)

Neighbor-search backend for candidate retrieval.

'sklearn'
n_jobs int

Number of threads. -1 uses all available CPUs.

-1
verbose bool

Emit search diagnostics through logging.

False
**kwargs

Forwarded to :func:topo.base.ann.kNN.

{}

Returns:

Type Description
csr_matrix

Binary (0/1) adjacency matrix of shape (n_samples, n_samples).

topo.tpgraph.cknn.cknn_ratio_matrix

cknn_ratio_matrix(
    X,
    *,
    scale_k: int = 10,
    metric: str = "euclidean",
    candidate_k: int | None = None,
    exact: bool = False,
    symmetrize: SymmetrizeMode = "or",
    backend: str = "sklearn",
    n_jobs: int = -1,
    verbose: bool = False,
    **kwargs,
) -> csr_matrix

Return sparse CkNN normalized distance ratios.

Entry (i, j) is d(i, j) / sqrt(rho_i * rho_j). This matrix is useful for persistence/order diagnostics; it is not the binary CkNN adjacency used for the unnormalized graph Laplacian. Ratios are always computed without self-loops.

topo.tpgraph.kernels.compute_kernel

compute_kernel(
    X,
    metric: str = ...,
    n_neighbors: int = ...,
    fuzzy: bool = ...,
    cknn: bool = ...,
    cknn_delta: float = ...,
    cknn_candidate_neighbors: int | None = ...,
    cknn_exact: bool = ...,
    pairwise: bool = ...,
    sigma: float | None = ...,
    adaptive_bw: bool = ...,
    expand_nbr_search: bool = ...,
    alpha_decaying: bool = ...,
    *,
    return_densities: Literal[False] = ...,
    symmetrize: bool = ...,
    backend: str = ...,
    n_jobs: int = ...,
    verbose: bool = ...,
    use_angular: bool = ...,
    square_distances: bool = ...,
    random_state: int | RandomState | None = ...,
    **kwargs,
) -> csr_matrix
compute_kernel(
    X,
    metric: str = ...,
    n_neighbors: int = ...,
    fuzzy: bool = ...,
    cknn: bool = ...,
    cknn_delta: float = ...,
    cknn_candidate_neighbors: int | None = ...,
    cknn_exact: bool = ...,
    pairwise: bool = ...,
    sigma: float | None = ...,
    adaptive_bw: bool = ...,
    expand_nbr_search: bool = ...,
    alpha_decaying: bool = ...,
    *,
    return_densities: Literal[True],
    symmetrize: bool = ...,
    backend: str = ...,
    n_jobs: int = ...,
    verbose: bool = ...,
    use_angular: bool = ...,
    square_distances: bool = ...,
    random_state: int | RandomState | None = ...,
    **kwargs,
) -> tuple[
    csr_matrix, dict[str, csr_matrix | np.ndarray | int]
]
compute_kernel(
    X,
    metric: str = "cosine",
    n_neighbors: int = 10,
    fuzzy: bool = False,
    cknn: bool = False,
    cknn_delta: float = 1.0,
    cknn_candidate_neighbors: int | None = None,
    cknn_exact: bool = False,
    pairwise: bool = False,
    sigma: float | None = None,
    adaptive_bw: bool = True,
    expand_nbr_search: bool = False,
    alpha_decaying: bool = False,
    *,
    return_densities: bool = False,
    symmetrize: bool = True,
    backend: str = "hnswlib",
    n_jobs: int = -1,
    verbose: bool = False,
    use_angular: bool = True,
    square_distances: bool = True,
    random_state: int | RandomState | None = None,
    **kwargs,
) -> (
    csr_matrix
    | tuple[
        csr_matrix, dict[str, csr_matrix | np.ndarray | int]
    ]
)

Compute a kernel matrix from a set of points.

Parameters:

Name Type Description Default
X array-like, shape (n_samples, n_features).

The set of points to compute the kernel matrix for. Accepts np.ndarrays and scipy.sparse matrices. If precomputed, assumed to be a square symmetric semidefinite matrix.

required
metric string

The metric to use when computing the kernel matrix. Possible values are: 'cosine', 'euclidean' and others. Accepts precomputed distances.

'cosine'
n_neighbors int

The number of neighbors to use when computing the kernel matrix. Ignored if pairwise set to True.

10
fuzzy bool

Whether to build the kernel matrix using fuzzy simplicial sets, similarly to UMAP. If set to True, the pairwise, sigma, adaptive_bw, expand_nbr_search and alpha_decaying parameters are ignored. If set to True at the same time that cknn is set to True, the cknn parameter is ignored.

False
cknn bool

Whether to build the binary, unweighted continuous k-nearest-neighbors graph. If set to True, the pairwise, sigma, adaptive_bw, expand_nbr_search and alpha_decaying parameters are ignored.

False
cknn_delta float

Unitless CkNN edge threshold. Ignored if cknn is False.

1.0
cknn_candidate_neighbors int or None

Number of candidate neighbors tested in approximate CkNN mode. Ignored when cknn_exact=True.

None
cknn_exact bool

If True, threshold all pairwise distances for CkNN construction.

False
pairwise bool

Whether to compute the kernel using dense pairwise distances. If set to True, the n_neighbors and backend parameters are ignored. Uses numba for computations if available. If not, uses sklearn.

False
sigma float

Scaling factor if using fixed bandwidth kernel (only used if adaptive_bw is set to False).

None
adaptive_bw bool

Whether to use an adaptive bandwidth based on the distance to median k-nearest-neighbor.

True
expand_nbr_search bool

Whether to expand the neighborhood search (mitigates a choice of too small a number of k-neighbors).

False
alpha_decaying bool

Whether to use an adaptively decaying kernel.

False
return_densities bool

Whether to return the bandwidth metrics as a dictinary. If set to True, the function returns a tuple containing the kernel matrix and a dictionary containing the bandwidth metric.

False
symmetrize bool

Whether to symmetrize the kernel matrix after normalizations.

True
backend str

Which backend to use for neighborhood computations. Defaults to 'hnswlib'. Options are 'hnswlib' and 'sklearn'.

'hnswlib'
n_jobs int

The number of jobs to use for parallel computations. If set to -1, all available cores are used.

1
verbose bool

Whether to print progress messages.

False
**kwargs dict

Additional arguments to be passed to the nearest-neighbors backend.

{}

Returns:

Name Type Description
K (array - like, shape(n_samples, n_samples))

The kernel matrix.

densities dict, optional (if `return_densities` is set to `True`)

If fuzzy and cknn are False, is a dictionary containing the bandwidth metrics. If fuzzy is set to True, the dictionary contains sigma and rho estimates. If cknn is set to True, the dictionary contains the binary adjacency.

topo.tpgraph.kernels.Kernel

Kernel(
    metric="cosine",
    n_neighbors=10,
    fuzzy=False,
    cknn=False,
    cknn_delta=1.0,
    cknn_candidate_neighbors=None,
    cknn_exact=False,
    pairwise=False,
    sigma=None,
    adaptive_bw=True,
    expand_nbr_search=False,
    alpha_decaying=False,
    symmetrize=True,
    backend="hnswlib",
    n_jobs=1,
    laplacian_type="normalized",
    anisotropy=1.0,
    semi_aniso=False,
    n_landmarks=None,
    cache_input=False,
    verbose=False,
    random_state=None,
    use_angular=True,
)

Bases: BaseEstimator, TransformerMixin

Scikit-learn flavored estimator for computing a kernel matrix from points.

Includes functions for computing the kernel matrix with a variety of methods (adaptive bandwidth, fuzzy simplicial sets, continuous k-nearest-neighbors, etc) and performing operations on the resulting graph, such as obtaining its Laplacian, sparsifying it, filtering and interpolating signals, obtaining diffusion operators, imputing missing values and computing shortest paths.

Use this only when you want to manually construct a graph or diffusion operator.

Example

from topo.tpgraph import Kernel
from topo.spectral import EigenDecomposition
from topo.layouts import Projector
kernel = Kernel(n_neighbors=30, metric="cosine").fit(X)
Z = EigenDecomposition(n_components=64, method="msDM").fit_transform(kernel)
Y = Projector(projection_method="MAP").fit_transform(Z)

Parameters:

Name Type Description Default
metric str

The metric to use when computing the kernel matrix. Possible values are: 'cosine', 'euclidean' and others, depending on the chosen nearest-neighbors backend. Accepts precomputed distances as 'precomputed'.

'cosine'
n_neighbors int

The number of neighbors to use when computing the kernel matrix. Ignored if pairwise set to True.

10
fuzzy bool

Whether to build the kernel matrix using fuzzy simplicial sets, similarly to UMAP. If set to True, the pairwise, sigma, adaptive_bw, expand_nbr_search and alpha_decaying parameters are ignored. If set to True at the same time that cknn is set to True, the cknn parameter is ignored.

False
cknn bool

Whether to build the binary, unweighted continuous k-nearest-neighbors graph. If set to True, the pairwise, sigma, adaptive_bw, expand_nbr_search and alpha_decaying parameters are ignored.

False
pairwise bool

Whether to compute the kernel using dense pairwise distances. If set to True, the n_neighbors and backend parameters are ignored. Uses numba for computations if available. If not, uses sklearn.

False
sigma float or None

Scaling factor if using fixed bandwidth kernel (only used if adaptive_bw is set to False).

None
adaptive_bw bool

Whether to use an adaptive bandwidth based on the distance to median k-nearest-neighbor.

True
expand_nbr_search bool

Whether to expand the neighborhood search (mitigates a choice of too small a number of k-neighbors).

False
alpha_decaying bool

Whether to use an adaptively decaying kernel.

False
anisotropy float

The anisotropy (alpha) parameter in the diffusion maps literature for kernel reweighting.

1.0
semi_aniso bool

Whether to use semi-anisotropic diffusion. This reweights the original kernel (not the renormalized kernel) by the renormalized degree.

False
symmetrize bool

Whether to symmetrize the kernel matrix after normalizations.

True
backend (hnswlib, sklearn)

Which backend to use for k-nearest-neighbor computations. Defaults to 'hnswlib'. Options are 'hnswlib' and 'sklearn'.

"hnswlib"
n_jobs int

The number of jobs to use for parallel computations. If -1, all CPUs are used. Parallellization (multiprocessing) is highly recommended whenever possible.

-1
laplacian_type (normalized, unnormalized, random_walk)

The type of laplacian to use.

"normalized"
Fitted Attributes

knn_ : scipy.sparse.csr_matrix, shape (n_samples, n_samples) The computed k-nearest-neighbors graph.

_K : scipy.sparse.csr_matrix, shape (n_samples, n_samples) The kernel matrix.

dens_dict : dict Dictionary containing the density information for each point in the dataset for the computed kernel.

L : scipy.sparse.csr_matrix, shape (n_samples, n_samples) The laplacian matrix of the kernel graph.

P : scipy.sparse.csr_matrix, shape (n_samples, n_samples) The diffusion operator of the kernel graph.

SP : scipy.sparse.csr_matrix, shape (n_samples, n_samples) The shortest-path matrix.

degree : np.ndarray, shape (n_samples,) The degree of each point in the adjacency graph.

weighted_degree : np.ndarray, shape (n_samples,) The weighted degree of each point in the kernel graph.

knn property

knn

The k-nearest-neighbors graph (sparse distance matrix).

Contains the raw nearest-neighbor distances used to compute the bandwidths and affinities. Represented as a CSR sparse matrix.

K property

K

The kernel (affinity) matrix.

Represents the pairwise similarities between samples on the manifold, typically decaying from 1 (identical) to 0 (disconnected). Used as the weighted adjacency matrix for graph operators.

A property

A

The binary graph adjacency matrix.

Unweighted representation of the graph connectivity, generated from the non-zero entries of the kernel affinity matrix (K).

degree property

degree

The node degrees of the binary adjacency matrix.

A 1-D array containing the number of connected neighbors for each sample, ignoring the affinity weights.

weighted_degree property

weighted_degree

The weighted node degrees of the kernel affinity matrix.

A 1-D array containing the sum of affinity weights for each sample. Used to normalize graph Laplacians and diffusion operators.

L property

L

The graph Laplacian matrix.

Use this property for the default cached result. Use the laplacian method when you need to pass options or recompute with non-default settings.

Evaluates and caches the Laplacian on first access using the laplacian_type specified during initialization.

Use this property for the default cached result. Use the laplacian() method when you need to pass options or recompute with non-default settings.

P property

P

The graph diffusion operator (Markov transition matrix).

Use this property for the default cached result. Use the diff_op method when you need to pass options or recompute with non-default settings.

Evaluates and caches the operator on first access. Describes the probabilities of random walks across the graph.

Use this property for the default cached result. Use the diff_op() method when you need to pass options or recompute with non-default settings.

SP property

SP

The shortest-paths (geodesic distance) matrix.

Use this property for the default cached result. Use the shortest_paths method when you need to pass options or recompute with non-default settings.

Evaluates and caches the all-pairs shortest paths on the graph on first access.

Use this property for the default cached result. Use the shortest_paths() method when you need to pass options or recompute with non-default settings.

fit

fit(X, recompute=False, **kwargs)

Fit the kernel matrix to the data.

Parameters:

Name Type Description Default
X array-like or sparse matrix, shape (n_samples, n_features)

Input data. Accepts NumPy arrays and SciPy CSR sparse matrices. If metric="precomputed", this should be a square distance matrix.

required
recompute bool

Whether to recompute the kernel if it has already been computed.

False
**kwargs

Additional arguments to be passed to the k-nearest-neighbor backend.

{}

Returns:

Name Type Description
self Kernel

The fitted Kernel instance, populating properties like K and knn.

transform

transform(X=None)

Return the fitted kernel (affinity) matrix.

Provided for scikit-learn compatibility in pipelines. X is ignored because out-of-sample extension for the kernel is not supported.

fit_transform

fit_transform(X, y=None, **fit_params)

Fit the kernel to the data and return the affinity matrix.

Parameters:

Name Type Description Default
X array-like or sparse matrix, shape (n_samples, n_features)

Input data.

required
y None

Ignored.

None
**fit_params

Additional parameters passed to fit.

{}

Returns:

Name Type Description
K (csr_matrix, shape(n_samples, n_samples))

The computed kernel (affinity) matrix.

adjacency

adjacency()

Compute the binary graph adjacency matrix.

The adjacency matrix defines the unweighted connectivity of the graph. It is represented as an N-by-N sparse matrix where entry A_{i,j} is 1 if the kernel affinity K_{i,j} > 0, and 0 otherwise.

laplacian

laplacian(laplacian_type=None, recompute=False)

Compute the graph Laplacian of this kernel's affinity matrix.

For a friendly reference, see this material from James Melville: https://jlmelville.github.io/smallvis/spectral.html

Parameters:

Name Type Description Default
laplacian_type str

The type of laplacian to use. Can be 'unnormalized', 'normalized', or 'random_walk'. If not provided, uses the default laplacian_type specified in the constructor.

None

Returns:

Name Type Description
L (csr_matrix, shape(n_samples, n_samples))

The computed graph Laplacian matrix.

diff_op

diff_op(anisotropy=1.0, symmetric=True, recompute=False)

Compute the diffusion operator.

Parameters:

Name Type Description Default
anisotropy float

Anisotropy to apply. 'Alpha' in the diffusion maps literature. Defaults to the anisotropy defined in the constructor.

1.0
symmetric bool

Whether to use a symmetric version of the diffusion operator. This is particularly useful to yield a symmetric operator when using anisotropy (alpha > 0), as the diffusion operator P would be asymmetric otherwise, which can be problematic during matrix decomposition. Eigenvalues are the same as the asymmetric version, and the right eigenvectors of the original asymmetric operator can be recovered by left multiplying by Kernel.D_inv_sqrt_.

True

Returns:

Name Type Description
P (csr_matrix, shape(n_samples, n_samples))

The graph diffusion operator (Markov transition matrix).

Notes

Populates the Kernel.D_inv_sqrt_ attribute when symmetric=True.

shortest_paths

shortest_paths(
    landmark=False, indices=None, recompute=False
)

Compute the shortest paths (geodesic distances) on the graph.

Notes

This method may require O(n_samples^2) memory and heavy computation if all-pairs distances are materialized. For large datasets, use landmark mode.

Parameters:

Name Type Description Default
landmark bool

If True, the shortest paths are computed between all pairs of landmarks, rather than all sample nodes.

False
indices list of int

If None, the shortest paths are computed between all pairs of nodes. Else, the shortest paths are computed between all pairs of nodes and nodes with specified indices.

None

Returns:

Name Type Description
D (ndarray, shape(n_samples, n_samples))

The shortest paths matrix. Unreachable nodes evaluate to infinity.

get_indices_distances

get_indices_distances(n_neighbors=None, kernel=True)

Return per-row neighbor indices and distances from the kernel or kNN graph.

Returns:

Type Description
indices, distances : tuple of ndarray

The extracted arrays.

impute

impute(Y=None, t=None, threshold=0.01, tmax=10)

Impute data Y using the diffusion operator built from X.

Although the idea behind this is far older, it was first reported in single-cell genomics by the Krishnaswamy lab in the MAGIC (Markov Affinity-based Graph Imputation of Cells) manuscript

Parameters:

Name Type Description Default
Y ndarray

The input data to impute. If None, the input data X is imputed (if it was cached by setting the cache_input parameter to True). Otherwise, you'll have to specify it as Y.

None
t int

The number of steps to perform during diffusion. The default None iterates until the Procrustes disparity value is below the threshold parameter.

None
threshold float

The threshold value for the Procrustes disparity when finding an optimal t.

0.01
tmax int

The maximum number of steps to perform during diffusion when estimating an optimal t.

10

Returns:

Name Type Description
Y_imp ndarray

The imputed data.

filter

filter(
    signal,
    replicates=None,
    beta=50,
    target=None,
    filterfunc=None,
    offset=0,
    order=1,
    solver="chebyshev",
    chebyshev_order=100,
)

Estimate per-sample density over the graph by filtering a signal.

Inspired by MELD for sample-associated density estimation. However, you can naturally use this for any signal in your data, not just samples of specific conditions. In practice, this is just a simple PyGSP filter on a graph. Indeed, it calls PyGSP, so you'll need it installed to use this function.

Parameters:

Name Type Description Default
signal

Signal(s) to filter - usually sample labels.

required
replicates

Replicate labels for each sample. If None, no replicates are assumed.

None
beta int

Amount of smoothing to apply. Vary this parameter to get good estimates - this can vary widely from dataset to dataset.

50
target array - like

Similarity matrix to use for filtering. If None, uses the kernel matrix.

None
filterfunc function

Function to use for filtering. If None, the default is to use a Laplacian filter.

None
offset

Amount to shift the filter in the eigenvalue spectrum. Recommended to use an eigenvalue from the graph based on the spectral distribution. Should be in interval [0,1]

0
order

Falloff and smoothness of the filter. High order leads to square-like filters.

1
solver string

Method to solve convex problem. If 'chebyshev', uses a chebyshev polynomial approximation of the corresponding filter. Else, if 'exact', uses the eigenvalue solution to the problem

'chebyshev'
chebyshev_order int

Order of chebyshev approximation to use.

100

Returns:

Name Type Description
densities DataFrame

The filtered sample densities.

is_connected

is_connected()

Check if the graph is connected (cached).

A graph is connected if and only if there exists a (directed) path between any two vertices.

Returns:

Name Type Description
connected bool

True if the graph is connected, False otherwise.

topo.spectral.eigen.EigenDecomposition

EigenDecomposition(
    n_components=10,
    method="DM",
    eigensolver="auto",
    eigen_tol=0.0001,
    drop_first=True,
    laplacian_type="random_walk",
    anisotropy=1,
    t=1,
    random_state=None,
    return_evals=False,
    estimate_eigengap=True,
    verbose=False,
)

Transformer for eigendecomposing kernels and spectral operators.

Use this when you already have a kernel, Laplacian, adjacency/affinity matrix, or diffusion operator and want spectral coordinates. The numerical eigendecomposition is delegated to SciPy:

  • dense inputs use scipy.linalg.eigh when eigensolver="dense" or when eigensolver="auto" receives a dense matrix;
  • sparse inputs use scipy.sparse.linalg.eigsh when eigensolver="arpack" or when eigensolver="auto" receives a sparse matrix.

The main input can be either a topo.tpgraph.Kernel object or a square matrix. For Kernel inputs, the transformer reuses the fitted kernel's stored operators when possible.

Parameters:

Name Type Description Default
n_components int

Number of non-trivial components to return. One extra eigenpair is computed internally so the trivial first component can be dropped when drop_first=True.

10
method (top, bottom, msDM, DM, LE)

Method for organizing the eigendecomposition.

  • 'top': compute the largest eigenpairs of the input matrix/operator.
  • 'bottom': compute the smallest eigenpairs of the input matrix/operator.
  • 'DM': compute Diffusion Maps coordinates from a diffusion operator. If a Kernel object is provided, its fitted diffusion operator is reused when available.
  • 'msDM': compute multiscale Diffusion Maps coordinates by weighting diffusion components by lambda / (1 - lambda).
  • 'LE': compute Laplacian Eigenmaps coordinates from a graph Laplacian. If a Kernel object is provided, its fitted Laplacian is reused when available.
'top'
eigensolver (auto, dense, arpack)

Solver policy.

  • 'auto': use dense eigendecomposition for dense inputs and ARPACK for sparse inputs.
  • 'dense': use scipy.linalg.eigh on a dense array. Avoid for large graphs.
  • 'arpack': use scipy.sparse.linalg.eigsh for partial sparse eigendecomposition.
'auto'
eigen_tol float

Tolerance passed to scipy.sparse.linalg.eigsh. Ignored by the dense path.

1e-4
drop_first bool

Whether to drop the first eigenpair, typically the trivial component.

True
laplacian_type str

Laplacian type used when method='LE' and the input is a matrix. Common values are 'normalized', 'unnormalized', and 'random_walk'.

'random_walk'
anisotropy float

Diffusion-maps anisotropy parameter, usually denoted alpha.

1
t int

Diffusion time used when method='DM'. Ignored by other methods.

1
random_state optional

Accepted for API compatibility. The simplified eigensolver path does not use randomness.

None
return_evals bool

Whether results() should return eigenvalues along with the representation.

False
estimate_eigengap bool

Whether to store a simple eigengap estimate after fitting.

True
verbose bool

Accepted for API compatibility. The simplified eigensolver path does not emit verbose solver diagnostics.

False

fit

fit(X)

Compute the eigendecomposition of kernel matrix X per method.

Parameters:

Name Type Description Default
X (array - like, shape(n_samples, n_samples))

Matrix to be decomposed. Should generally be an adjacency, affinity/kernel/similarity, Laplacian matrix or a diffusion-type operator.

required

Returns:

Name Type Description
self EigenDecomposition

The fitted instance itself. Eigenvectors and eigenvalues are stored as attributes. If method is 'DM' or 'msDM', the diffusion operator is cached as well.

rescale

rescale(use_eigs=50)

Re-compute the msDM embedding using a different number of eigenvectors.

Parameters:

Name Type Description Default
use_eigs int

Number of eigenvectors to include in the embedding (must be ≤ the number retained during fit).

50

Returns:

Name Type Description
self EigenDecomposition

The modified instance with the scaled embedding updated.

results

results(return_evals=None)

Return the fitted spectral representation.

For DM and msDM, this computes the embedding from stored eigenpairs if needed. For LE, top, and bottom, it returns the fitted eigenvectors.

transform

transform(X=None)

Return the current representation.

X is ignored; the representation is computed from the eigenpairs stored during fit. Present for scikit-learn compatibility.

Parameters:

Name Type Description Default
X None

Ignored.

None

Returns:

Name Type Description
embedding ndarray of shape (n_samples, n_components)

The computed spectral representation.

fit_transform

fit_transform(X)

Fit the model on X and return the resulting representation.

topo.spectral.eigen.eigendecompose

eigendecompose(
    G: ArrayLike | spmatrix,
    n_components: int = 8,
    eigensolver: str = "auto",
    largest: bool = True,
    eigen_tol: float = 0.0001,
    random_state=None,
    verbose: bool = False,
) -> tuple[NDArray[np.float64], NDArray[np.float64]]

Return sorted eigenpairs of a square symmetric matrix/operator.

Dense inputs use scipy.linalg.eigh. Sparse inputs use scipy.sparse.linalg.eigsh unless eigensolver="dense" is requested.

random_state and verbose are accepted for compatibility and ignored.

topo.spectral.LE

LE(
    W: ArrayLike | spmatrix,
    n_eigs: int = 10,
    laplacian_type: str = "random_walk",
    drop_first: bool = True,
    return_evals: bool = False,
    eigen_tol: float = 0.0,
    random_state=None,
)

Compute a Laplacian Eigenmaps embedding from an affinity graph.

Laplacian Eigenmaps were introduced by Belkin and Niyogi as a spectral embedding method based on eigenvectors of a graph Laplacian: https://www2.imm.dtu.dk/projects/manifold/Papers/Laplacian.pdf

Parameters:

Name Type Description Default
W ArrayLike | spmatrix

Dense or sparse graph adjacency/affinity matrix. W must be square. No symmetrization is performed here; callers should pass a symmetric affinity matrix when using a symmetric Laplacian.

required
n_eigs int

Number of non-trivial eigenvectors to return.

10
laplacian_type str

Laplacian type understood by topo._compat.scipy_graph.graph_laplacian. Common values are "unnormalized", "normalized", and "random_walk".

'random_walk'
drop_first bool

Whether to drop the first eigenvector, which is typically the trivial constant mode for connected graphs.

True
return_evals bool

If True, return (eigenvectors, eigenvalues).

False
eigen_tol float

Tolerance passed to scipy.sparse.linalg.eigsh.

0.0
random_state

Accepted for API compatibility. The current implementation uses ARPACK through eigsh and does not use randomness.

None

Returns:

Type Description
ndarray or tuple[ndarray, ndarray]

Eigenvectors sorted by ascending eigenvalue. If return_evals=True, also returns the corresponding eigenvalues.

Raises:

Type Description
ValueError

If W is not square or if the requested number of eigenvectors is invalid.

RuntimeError

If ARPACK fails to compute the requested eigendecomposition.

topo.spectral.graph_laplacian

graph_laplacian(
    graph,
    *,
    laplacian_type: str = "normalized",
    return_D: bool = False,
)

Return graph Laplacian with package-standard laplacian_type names.

Supported types:

  • "unnormalized": D - W, delegated to scipy.sparse.csgraph.laplacian.
  • "normalized" / "symmetric": I - D^-1/2 W D^-1/2, delegated to SciPy.
  • "random_walk" / "rw": I - D^-1 W, implemented here because SciPy's normalized Laplacian is the symmetric normalized Laplacian.

Isolated nodes are handled without inf/nan values.

topo.spectral.diffusion_operator

diffusion_operator(
    W: ArrayLike | spmatrix,
    alpha: float = 1.0,
    symmetric: bool = False,
    semi_aniso: bool = False,
    *,
    return_D_inv_sqrt: bool = False,
) -> csr_matrix | tuple[csr_matrix, csr_matrix]

Compute a sparse diffusion-map operator from an affinity graph.

This implements the anisotropic normalization used in diffusion maps, following Coifman and Lafon: https://doi.org/10.1016/j.acha.2006.04.006

Parameters:

Name Type Description Default
W ArrayLike | spmatrix

Dense or sparse graph adjacency/affinity matrix. W must be square. No symmetrization is performed here.

required
alpha float

Diffusion-maps anisotropy parameter. For alpha > 0, the affinity is reweighted by D^-alpha W D^-alpha before row normalization. Negative values are treated as 0.

1.0
symmetric bool

If True, return the symmetric conjugate form of the diffusion operator. This is useful when downstream eigensolvers require a symmetric operator. The symmetric and row-stochastic forms are related by a diagonal similarity transform under the usual diffusion-map construction.

False
semi_aniso bool

If True, compute the density correction from the anisotropically reweighted affinity but apply the resulting row normalization to the original affinity matrix.

False
return_D_inv_sqrt bool

If True, also return the inverse square-root degree matrix used to construct the symmetric operator. This option requires symmetric=True.

False

Returns:

Type Description
csr_matrix or tuple[csr_matrix, csr_matrix]

The diffusion operator. If return_D_inv_sqrt=True, returns (P_symmetric, D_inv_sqrt).

Notes

The return type is always sparse CSR, regardless of the input format.

topo.layouts.projector.Projector

Projector(
    n_components=2,
    projection_method="MAP",
    metric="euclidean",
    n_neighbors=10,
    n_jobs=1,
    landmarks=None,
    landmark_method="kmeans",
    num_iters=800,
    init: str | ndarray = "spectral",
    nbrs_backend="hnswlib",
    keep_estimator=False,
    random_state=None,
    verbose=False,
    save_every=None,
    save_limit=None,
    save_callback=None,
    include_init_snapshot=True,
)

Bases: BaseEstimator, TransformerMixin

Scikit-learn compatible class that handles all projection methods.

Use this when you already have a spectral representation or graph and only want the final 2-D/3-D layout. Ideally, it takes in either a orthonormal eigenbasis or a graph kernel learned from such an eigenbasis. It is included in TopoMetry to allow custom TopOGraph-like pipelines (projection is the final step).

Parameters:

Name Type Description Default
n_components int

Number of dimensions to optimize the layout to. Usually 2 or 3 if you're into visualizing data.

2
projection_method str

Which projection method to use. UMAP delegates to umap-learn; MAP is TopoMetry's local checkpoint-aware graph-layout optimizer. Current options are: * 'Isomap' - one of the first manifold learning methods * 't-SNE' - a classic manifold learning method * 'MAP' - local checkpoint-aware graph-layout optimization * 'UMAP' - upstream umap-learn estimator * 'PaCMAP' (Pairwise-controlled Manifold Approximation and Projection) - for balanced visualizations * 'TriMAP' - dimensionality reduction using triplets * 'IsomorphicMDE' - MDE with preservation of nearest neighbors * 'IsometricMDE' - MDE with preservation of pairwise distances * 'NCVis' (Noise Contrastive Visualization) - a UMAP-like method with blazing fast performance These are frankly quite direct to add, so feel free to make a feature request if your favorite method is not listed here.

'Isomap'
metric str

The metric to use when computing distances. Possible values are: 'cosine', 'euclidean' and others. Accepts precomputed distances ('precomputed').

'euclidean'
n_neighbors int

The number of neighbors to use when computing the kernel matrix. Ignored if pairwise set to True.

10
landmarks int or ndarray

If passed as int, will obtain the number of landmarks. If passed as np.ndarray, will use the specified indexes in the array. Any value other than None will result in only the specified landmarks being used in the layout optimization, and will populate the Projector.landmarks_ slot.

None
landmark_method str

The method to use for selecting landmarks. If landmarks is passed as an int, this will be used to select the landmarks. Can be either 'kmeans' or 'random'.

'kmeans'
num_iters int

Most (if not all) methods optimize the layout up to a limit number of iterations. Use this parameter to set this number.

1000
keep_estimator bool

Whether to keep the used estimator as Projector.estimator_ after fitting. Useful if you want to use it later (e.g. UMAP allows inverse transforms and out-of-sample mapping).

False

fit

fit(
    X: ndarray | csr_matrix | Kernel, **kwargs: Any
) -> Projector

Run the desired projection method on the specified data.

Parameters:

Name Type Description Default
X array-like, shape (n_samples, n_features) or topo.Kernel() class.

The set of points to compute the kernel matrix for. Accepts np.ndarrays and scipy.sparse matrices or a topo.Kernel() object. If precomputed, assumed to be a square symmetric semidefinite matrix.

required
**kwargs Any

Additional keyword arguments for the desired projection method.

{}

Returns:

Name Type Description
self Projector

The fitted Projector instance with populated Y_ attribute.

transform

transform(
    X: ndarray | csr_matrix | None = None,
) -> np.ndarray

Return the projection, using the backend's transform when available.

If the desired method does not have a transform method, returns the results from the fit method.

Returns:

Name Type Description
Y ndarray of shape (n_samples, n_components)

Projection results.

fit_transform

fit_transform(X, y=None, **kwargs) -> np.ndarray

Fit the projection and return the embedding.

If the desired method does not have a fit_transform method, returns the results from the fit method.

Parameters:

Name Type Description Default
X (array - like or Kernel, shape(n_samples, n_features))

The input data to fit and transform.

required
y None

Ignored.

None
**kwargs

Additional arguments passed to the underlying backend.

{}

Returns:

Name Type Description
Y ndarray of shape (n_samples, n_components)

Projection results.

topo.tpgraph.intrinsic_dim.IntrinsicDim

IntrinsicDim(
    methods=["fsa", "mle"],
    k=[10, 20, 50, 75, 100],
    backend="hnswlib",
    metric="euclidean",
    n_jobs=-1,
    plot=True,
    random_state=None,
    **kwargs,
)

Bases: BaseEstimator, TransformerMixin

Scikit-learn flavored estimator of intrinsic dimensionality.

Use this to estimate manifold dimensionality or inspect local geometric complexity. This class iterates over a range of possible values of k-nearest-neighbors to consider in calculations using two different methods: the Farahmand-Szepesvári-Audibert (FSA) dimension estimator and the Maximum Likelihood Estimator (MLE).

Parameters:

Name Type Description Default
methods list of str, (default ['fsa'])

The dimensionality estimation methods to use. Current options are 'fsa' () and 'mle'().

['fsa', 'mle']
k int, range or list of ints, (default [10, 20, 50, 75, 100])

The number of nearest neighbors to use for the dimensionality estimation methods. If a single value of k is provided, then the result dictionary will have keys corresponding to the methods, and values corresponding to the dimensionality estimates. If multiple values of k are provided, then the result dictionary will have keys corresponding to the number of k, and values corresponding to other dictionaries, which have keys corresponding to the methods, and values corresponding to the dimensionality estimates.

[10, 20, 50, 75, 100]
metric str (default 'euclidean')

The metric to use when calculating distance between instances in a feature array.

'euclidean'
backend str

Which backend to use for k-nearest-neighbor computations. Defaults to 'hnswlib'. Options are 'hnswlib' and 'sklearn'.

'hnswlib'
n_jobs int

The number of jobs to use for parallel computations. If -1, all CPUs are used. Parallellization (multiprocessing) is highly recommended whenever possible.

1
plot bool

Whether to plot the results when using the fit() method.

True
random_state int or numpy.random.RandomState() (optional

A pseudo random number generator. Used for generating colors for plotting.

None).
**kwargs keyword arguments

Additional keyword arguments to pass to the backend kNN estimator.

{}
Properties

local_id, global_id : dictionaries containing local and global dimensionality estimates, respectivelly.

Their structure depends on the value of the `k` parameter:

* If a single value of `k` is provided, then the dictionaries will have
keys corresponding to the methods, and values corresponding to the
dimensionality estimates.

* If multiple values of `k` are provided, then the dictionaries will have
keys corresponding to the number of k, and values corresponding to other dictionaries,
which have keys corresponding to the methods, and values corresponding to the
dimensionality estimates.

plot_id

plot_id(
    bins=30,
    figsize=(6, 8),
    titlesize=22,
    labelsize=16,
    legendsize=10,
)

Plot histograms of the estimated local intrinsic dimensions.

fit

fit(X, **kwargs)

Estimates the intrinsic dimensionalities of the data.

Parameters:

Name Type Description Default
X (array - like, shape(n_samples, n_features))

The set of points to compute the kernel matrix for. Accepts np.ndarrays and scipy.sparse matrices. If precomputed, assumed to be a square symmetric semidefinite matrix of k-nearest-neighbors, with its k higher or equal the highest value of k used to estimate the intrinsic dimensionalities.

required
**kwargs

Additional keyword arguments to pass to the plotting function IntrinsicDim.plot_id().

{}

Returns:

Type Description
Populates the `local` and `global` properties of the class.
Shows a plot of the results if `plot=True`.

transform

transform(X=None)

No-op transform kept for scikit-learn API compatibility.

IntrinsicDim exposes its estimates via attributes (local_id, global_id); this method returns self unchanged.

Returns:

Name Type Description
self IntrinsicDim

The estimator itself.

topo.tpgraph.intrinsic_dim.automated_scaffold_sizing

automated_scaffold_sizing(
    X,
    method: str = "fsa",
    ks=(15, 30, 60),
    backend="hnswlib",
    metric="euclidean",
    n_jobs: int = -1,
    quantile: float = 0.99,
    min_components: int = 16,
    max_components: int = 512,
    headroom: float = 0.15,
    random_state=None,
    use_median: bool = False,
    return_details: bool = False,
    **knn_kwargs,
)

Unified automated scaffold sizing.

method='fsa': - Compute local FSA i.d. for each k in ks, take per-cell median across ks, then take the upper quantile across cells, add headroom, clamp to bounds.

method='mle': - Use a single neighborhood size k (if ks is an int; if iterable, use max(ks)). - Compute local MLE i.d. at k, then global i.d. via: * median of locals if use_median=True * Levina–Bickel harmonic-mean estimator (mle_global) otherwise. - Add headroom, clamp to bounds.

Returns:

Name Type Description
n_components int
details dict (if return_details=True)

topo.layouts.diagnostics.find_ideal_projection

find_ideal_projection(
    tg: Any,
    min_dist_grid: Iterable[float] | None = None,
    spread_grid: Iterable[float] | None = None,
    initial_alpha_grid: Iterable[float] | None = None,
    *,
    multiscale: bool = True,
    num_iters: int = 600,
    save_every: int = 10,
    metric: str = "euclidean",
    n_neighbors: int = 30,
    times: Sequence[int] = (1, 2, 4),
    r: int = 32,
    k_for_pf1: int | None = None,
    symmetric_hint: bool = True,
) -> dict[str, Any]

Grid-search MAP hyperparameters for a fitted TopOGraph-like object.

The object must expose: - project(...) - P_X_ - _backend_resolved - _n_jobs_effective - verbosity - msTopoMAP_snapshots / TopoMAP_snapshots

This function does not rerun the best projection automatically. It returns the best parameters and score records; the caller can decide whether to run the final projection.

topo.layouts.diagnostics.run_best_projection

run_best_projection(
    tg: Any,
    params: dict[str, float],
    *,
    multiscale: bool = True,
    num_iters: int = 600,
    save_every: int = 10,
) -> np.ndarray

Run MAP once using selected hyperparameters.