Advanced Estimators (Custom Pipelines)¶
These scikit-learn compatible transformers can be used individually to build custom manifold learning pipelines.
The Kernel(..., fuzzy=True) path delegates UMAP-specific fuzzy simplicial-set
construction to umap-learn; TopoMetry uses the resulting graph inside its own
spectral-scaffold and graph-refinement pipeline.
The Kernel(..., cknn=True) path builds the paper-defined binary CkNN graph:
samples are adjacent when d(i, j) < delta * sqrt(rho_i * rho_j). This is an
unweighted graph construction, not a weighted adaptive kernel; use
cknn_ratio_matrix separately when normalized CkNN distance ratios are needed.
topo.base.ann.kNN ¶
kNN(
X: ndarray | csr_matrix,
Y: ndarray | csr_matrix | None = None,
n_neighbors: int | float | str = 5,
metric: str = "euclidean",
n_jobs: int = -1,
backend: str = "sklearn",
M: int = 60,
efC: int = 200,
efS: int = 200,
verbose: bool = False,
**kwargs: Any,
) -> csr_matrix
Compute a k-nearest-neighbor distance graph.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X
|
array-like or scipy.sparse matrix, shape (n_samples, n_features)
|
Reference data used to fit the neighbor index. If |
required |
Y
|
array-like or scipy.sparse matrix
|
Query data. If provided, the output graph has shape
|
None
|
n_neighbors
|
int
|
Number of non-self neighbors to return for self-query graphs. |
5
|
metric
|
str
|
Distance metric. |
'euclidean'
|
n_jobs
|
int
|
Number of threads. |
-1
|
backend
|
(sklearn, hnswlib)
|
Neighbor-search backend. |
'sklearn'
|
verbose
|
bool
|
Emit backend/timing diagnostics through logging. |
False
|
Returns:
| Type | Description |
|---|---|
csr_matrix
|
kNN distance graph. |
topo.tpgraph.cknn.cknn_graph ¶
cknn_graph(
X,
*,
scale_k: int = 10,
delta: float = 1.0,
metric: str = "euclidean",
candidate_k: int | None = None,
exact: bool = False,
include_self: bool = False,
symmetrize: SymmetrizeMode = "or",
backend: str = "sklearn",
n_jobs: int = -1,
verbose: bool = False,
**kwargs,
) -> csr_matrix
Build the unweighted Continuous k-Nearest Neighbors graph.
Two samples i and j are connected when
d(i, j) < delta * sqrt(rho_i * rho_j), where rho_i is the distance
from sample i to its scale_k-th nearest neighbor. Exact mode
thresholds all pairs. Candidate-neighbor mode is scalable but may miss edges
if candidate_k is too small.
Introduced by Berry & Sauer, "Consistent manifold representation for topological data analysis" (https://arxiv.org/abs/1606.02353). The binary CkNN graph with the unnormalized Laplacian gives a consistent estimate of the Laplace-Beltrami operator regardless of sampling density.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X
|
array-like or sparse matrix of shape (n_samples, n_features)
|
Input data, or a square distance matrix when |
required |
scale_k
|
int
|
Neighbor rank used for the local scale |
10
|
delta
|
float
|
Connection threshold; larger values produce denser graphs. |
1.0
|
metric
|
str
|
Distance metric, or |
'euclidean'
|
candidate_k
|
int
|
Number of candidate neighbors searched per sample. Defaults to
|
None
|
exact
|
bool
|
Threshold all pairwise distances instead of candidate neighbors only. Quadratic in memory; use for small datasets or validation. |
False
|
include_self
|
bool
|
Whether to keep self-loops. |
False
|
symmetrize
|
('or', 'and', none)
|
How to symmetrize the directed candidate relation. |
'or'
|
backend
|
(sklearn, hnswlib)
|
Neighbor-search backend for candidate retrieval. |
'sklearn'
|
n_jobs
|
int
|
Number of threads. |
-1
|
verbose
|
bool
|
Emit search diagnostics through logging. |
False
|
**kwargs
|
Forwarded to :func: |
{}
|
Returns:
| Type | Description |
|---|---|
csr_matrix
|
Binary (0/1) adjacency matrix of shape |
topo.tpgraph.cknn.cknn_ratio_matrix ¶
cknn_ratio_matrix(
X,
*,
scale_k: int = 10,
metric: str = "euclidean",
candidate_k: int | None = None,
exact: bool = False,
symmetrize: SymmetrizeMode = "or",
backend: str = "sklearn",
n_jobs: int = -1,
verbose: bool = False,
**kwargs,
) -> csr_matrix
Return sparse CkNN normalized distance ratios.
Entry (i, j) is d(i, j) / sqrt(rho_i * rho_j). This matrix is useful
for persistence/order diagnostics; it is not the binary CkNN adjacency used
for the unnormalized graph Laplacian. Ratios are always computed without
self-loops.
topo.tpgraph.kernels.compute_kernel ¶
compute_kernel(
X,
metric: str = ...,
n_neighbors: int = ...,
fuzzy: bool = ...,
cknn: bool = ...,
cknn_delta: float = ...,
cknn_candidate_neighbors: int | None = ...,
cknn_exact: bool = ...,
pairwise: bool = ...,
sigma: float | None = ...,
adaptive_bw: bool = ...,
expand_nbr_search: bool = ...,
alpha_decaying: bool = ...,
*,
return_densities: Literal[False] = ...,
symmetrize: bool = ...,
backend: str = ...,
n_jobs: int = ...,
verbose: bool = ...,
use_angular: bool = ...,
square_distances: bool = ...,
random_state: int | RandomState | None = ...,
**kwargs,
) -> csr_matrix
compute_kernel(
X,
metric: str = ...,
n_neighbors: int = ...,
fuzzy: bool = ...,
cknn: bool = ...,
cknn_delta: float = ...,
cknn_candidate_neighbors: int | None = ...,
cknn_exact: bool = ...,
pairwise: bool = ...,
sigma: float | None = ...,
adaptive_bw: bool = ...,
expand_nbr_search: bool = ...,
alpha_decaying: bool = ...,
*,
return_densities: Literal[True],
symmetrize: bool = ...,
backend: str = ...,
n_jobs: int = ...,
verbose: bool = ...,
use_angular: bool = ...,
square_distances: bool = ...,
random_state: int | RandomState | None = ...,
**kwargs,
) -> tuple[
csr_matrix, dict[str, csr_matrix | np.ndarray | int]
]
compute_kernel(
X,
metric: str = "cosine",
n_neighbors: int = 10,
fuzzy: bool = False,
cknn: bool = False,
cknn_delta: float = 1.0,
cknn_candidate_neighbors: int | None = None,
cknn_exact: bool = False,
pairwise: bool = False,
sigma: float | None = None,
adaptive_bw: bool = True,
expand_nbr_search: bool = False,
alpha_decaying: bool = False,
*,
return_densities: bool = False,
symmetrize: bool = True,
backend: str = "hnswlib",
n_jobs: int = -1,
verbose: bool = False,
use_angular: bool = True,
square_distances: bool = True,
random_state: int | RandomState | None = None,
**kwargs,
) -> (
csr_matrix
| tuple[
csr_matrix, dict[str, csr_matrix | np.ndarray | int]
]
)
Compute a kernel matrix from a set of points.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X
|
array-like, shape (n_samples, n_features).
|
The set of points to compute the kernel matrix for. Accepts np.ndarrays and scipy.sparse matrices. If precomputed, assumed to be a square symmetric semidefinite matrix. |
required |
metric
|
string
|
The metric to use when computing the kernel matrix. Possible values are: 'cosine', 'euclidean' and others. Accepts precomputed distances. |
'cosine'
|
n_neighbors
|
int
|
The number of neighbors to use when computing the kernel matrix. Ignored if |
10
|
fuzzy
|
bool
|
Whether to build the kernel matrix using fuzzy simplicial sets, similarly to UMAP.
If set to |
False
|
cknn
|
bool
|
Whether to build the binary, unweighted continuous k-nearest-neighbors graph.
If set to |
False
|
cknn_delta
|
float
|
Unitless CkNN edge threshold. Ignored if |
1.0
|
cknn_candidate_neighbors
|
int or None
|
Number of candidate neighbors tested in approximate CkNN mode. Ignored
when |
None
|
cknn_exact
|
bool
|
If True, threshold all pairwise distances for CkNN construction. |
False
|
pairwise
|
bool
|
Whether to compute the kernel using dense pairwise distances.
If set to |
False
|
sigma
|
float
|
Scaling factor if using fixed bandwidth kernel (only used if |
None
|
adaptive_bw
|
bool
|
Whether to use an adaptive bandwidth based on the distance to median k-nearest-neighbor. |
True
|
expand_nbr_search
|
bool
|
Whether to expand the neighborhood search (mitigates a choice of too small a number of k-neighbors). |
False
|
alpha_decaying
|
bool
|
Whether to use an adaptively decaying kernel. |
False
|
return_densities
|
bool
|
Whether to return the bandwidth metrics as a dictinary. If set to |
False
|
symmetrize
|
bool
|
Whether to symmetrize the kernel matrix after normalizations. |
True
|
backend
|
str
|
Which backend to use for neighborhood computations. Defaults to 'hnswlib'. Options are 'hnswlib' and 'sklearn'. |
'hnswlib'
|
n_jobs
|
int
|
The number of jobs to use for parallel computations. If set to -1, all available cores are used. |
1
|
verbose
|
bool
|
Whether to print progress messages. |
False
|
**kwargs
|
dict
|
Additional arguments to be passed to the nearest-neighbors backend. |
{}
|
Returns:
| Name | Type | Description |
|---|---|---|
K |
(array - like, shape(n_samples, n_samples))
|
The kernel matrix. |
densities |
dict, optional (if `return_densities` is set to `True`)
|
If |
topo.tpgraph.kernels.Kernel ¶
Kernel(
metric="cosine",
n_neighbors=10,
fuzzy=False,
cknn=False,
cknn_delta=1.0,
cknn_candidate_neighbors=None,
cknn_exact=False,
pairwise=False,
sigma=None,
adaptive_bw=True,
expand_nbr_search=False,
alpha_decaying=False,
symmetrize=True,
backend="hnswlib",
n_jobs=1,
laplacian_type="normalized",
anisotropy=1.0,
semi_aniso=False,
n_landmarks=None,
cache_input=False,
verbose=False,
random_state=None,
use_angular=True,
)
Bases: BaseEstimator, TransformerMixin
Scikit-learn flavored estimator for computing a kernel matrix from points.
Includes functions for computing the kernel matrix with a variety of methods (adaptive bandwidth, fuzzy simplicial sets, continuous k-nearest-neighbors, etc) and performing operations on the resulting graph, such as obtaining its Laplacian, sparsifying it, filtering and interpolating signals, obtaining diffusion operators, imputing missing values and computing shortest paths.
Use this only when you want to manually construct a graph or diffusion operator.
Example
from topo.tpgraph import Kernel
from topo.spectral import EigenDecomposition
from topo.layouts import Projector
kernel = Kernel(n_neighbors=30, metric="cosine").fit(X)
Z = EigenDecomposition(n_components=64, method="msDM").fit_transform(kernel)
Y = Projector(projection_method="MAP").fit_transform(Z)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
metric
|
str
|
The metric to use when computing the kernel matrix. Possible values are: 'cosine', 'euclidean' and others, depending on the chosen nearest-neighbors backend. Accepts precomputed distances as 'precomputed'. |
'cosine'
|
n_neighbors
|
int
|
The number of neighbors to use when computing the kernel matrix. Ignored if |
10
|
fuzzy
|
bool
|
Whether to build the kernel matrix using fuzzy simplicial sets, similarly to UMAP.
If set to |
False
|
cknn
|
bool
|
Whether to build the binary, unweighted continuous k-nearest-neighbors graph.
If set to |
False
|
pairwise
|
bool
|
Whether to compute the kernel using dense pairwise distances.
If set to |
False
|
sigma
|
float or None
|
Scaling factor if using fixed bandwidth kernel (only used if |
None
|
adaptive_bw
|
bool
|
Whether to use an adaptive bandwidth based on the distance to median k-nearest-neighbor. |
True
|
expand_nbr_search
|
bool
|
Whether to expand the neighborhood search (mitigates a choice of too small a number of k-neighbors). |
False
|
alpha_decaying
|
bool
|
Whether to use an adaptively decaying kernel. |
False
|
anisotropy
|
float
|
The anisotropy (alpha) parameter in the diffusion maps literature for kernel reweighting. |
1.0
|
semi_aniso
|
bool
|
Whether to use semi-anisotropic diffusion. This reweights the original kernel (not the renormalized kernel) by the renormalized degree. |
False
|
symmetrize
|
bool
|
Whether to symmetrize the kernel matrix after normalizations. |
True
|
backend
|
(hnswlib, sklearn)
|
Which backend to use for k-nearest-neighbor computations. Defaults to 'hnswlib'. Options are 'hnswlib' and 'sklearn'. |
"hnswlib"
|
n_jobs
|
int
|
The number of jobs to use for parallel computations. If -1, all CPUs are used. Parallellization (multiprocessing) is highly recommended whenever possible. |
-1
|
laplacian_type
|
(normalized, unnormalized, random_walk)
|
The type of laplacian to use. |
"normalized"
|
Fitted Attributes
knn_ : scipy.sparse.csr_matrix, shape (n_samples, n_samples) The computed k-nearest-neighbors graph.
_K : scipy.sparse.csr_matrix, shape (n_samples, n_samples) The kernel matrix.
dens_dict : dict Dictionary containing the density information for each point in the dataset for the computed kernel.
L : scipy.sparse.csr_matrix, shape (n_samples, n_samples) The laplacian matrix of the kernel graph.
P : scipy.sparse.csr_matrix, shape (n_samples, n_samples) The diffusion operator of the kernel graph.
SP : scipy.sparse.csr_matrix, shape (n_samples, n_samples) The shortest-path matrix.
degree : np.ndarray, shape (n_samples,) The degree of each point in the adjacency graph.
weighted_degree : np.ndarray, shape (n_samples,) The weighted degree of each point in the kernel graph.
knn
property
¶
knn
The k-nearest-neighbors graph (sparse distance matrix).
Contains the raw nearest-neighbor distances used to compute the bandwidths and affinities. Represented as a CSR sparse matrix.
K
property
¶
K
The kernel (affinity) matrix.
Represents the pairwise similarities between samples on the manifold, typically decaying from 1 (identical) to 0 (disconnected). Used as the weighted adjacency matrix for graph operators.
A
property
¶
A
The binary graph adjacency matrix.
Unweighted representation of the graph connectivity, generated from the non-zero entries of the kernel affinity matrix (K).
degree
property
¶
degree
The node degrees of the binary adjacency matrix.
A 1-D array containing the number of connected neighbors for each sample, ignoring the affinity weights.
weighted_degree
property
¶
weighted_degree
The weighted node degrees of the kernel affinity matrix.
A 1-D array containing the sum of affinity weights for each sample. Used to normalize graph Laplacians and diffusion operators.
L
property
¶
L
The graph Laplacian matrix.
Use this property for the default cached result. Use the laplacian method
when you need to pass options or recompute with non-default settings.
Evaluates and caches the
Laplacian on first access using the laplacian_type specified during
initialization.
Use this property for the default cached result. Use the laplacian()
method when you need to pass options or recompute with non-default settings.
P
property
¶
P
The graph diffusion operator (Markov transition matrix).
Use this property for the default cached result. Use the diff_op method
when you need to pass options or recompute with non-default settings.
Evaluates and caches the operator on first access. Describes the probabilities of random walks across the graph.
Use this property for the default cached result. Use the diff_op()
method when you need to pass options or recompute with non-default settings.
SP
property
¶
SP
The shortest-paths (geodesic distance) matrix.
Use this property for the default cached result. Use the shortest_paths method
when you need to pass options or recompute with non-default settings.
Evaluates and caches the all-pairs shortest paths on the graph on first access.
Use this property for the default cached result. Use the shortest_paths()
method when you need to pass options or recompute with non-default settings.
fit ¶
fit(X, recompute=False, **kwargs)
Fit the kernel matrix to the data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X
|
array-like or sparse matrix, shape (n_samples, n_features)
|
Input data. Accepts NumPy arrays and SciPy CSR sparse matrices.
If |
required |
recompute
|
bool
|
Whether to recompute the kernel if it has already been computed. |
False
|
**kwargs
|
Additional arguments to be passed to the k-nearest-neighbor backend. |
{}
|
Returns:
| Name | Type | Description |
|---|---|---|
self |
Kernel
|
The fitted Kernel instance, populating properties like |
transform ¶
transform(X=None)
Return the fitted kernel (affinity) matrix.
Provided for scikit-learn compatibility in pipelines. X is ignored
because out-of-sample extension for the kernel is not supported.
fit_transform ¶
fit_transform(X, y=None, **fit_params)
Fit the kernel to the data and return the affinity matrix.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X
|
array-like or sparse matrix, shape (n_samples, n_features)
|
Input data. |
required |
y
|
None
|
Ignored. |
None
|
**fit_params
|
Additional parameters passed to |
{}
|
Returns:
| Name | Type | Description |
|---|---|---|
K |
(csr_matrix, shape(n_samples, n_samples))
|
The computed kernel (affinity) matrix. |
adjacency ¶
adjacency()
Compute the binary graph adjacency matrix.
The adjacency matrix defines the unweighted connectivity of the graph. It is represented as an N-by-N sparse matrix where entry A_{i,j} is 1 if the kernel affinity K_{i,j} > 0, and 0 otherwise.
laplacian ¶
laplacian(laplacian_type=None, recompute=False)
Compute the graph Laplacian of this kernel's affinity matrix.
For a friendly reference, see this material from James Melville: https://jlmelville.github.io/smallvis/spectral.html
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
laplacian_type
|
str
|
The type of laplacian to use. Can be 'unnormalized', 'normalized', or 'random_walk'.
If not provided, uses the default |
None
|
Returns:
| Name | Type | Description |
|---|---|---|
L |
(csr_matrix, shape(n_samples, n_samples))
|
The computed graph Laplacian matrix. |
diff_op ¶
diff_op(anisotropy=1.0, symmetric=True, recompute=False)
Compute the diffusion operator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
anisotropy
|
float
|
Anisotropy to apply. 'Alpha' in the diffusion maps literature. Defaults to the anisotropy defined in the constructor. |
1.0
|
symmetric
|
bool
|
Whether to use a symmetric version of the diffusion operator. This is particularly useful to yield a symmetric operator
when using anisotropy (alpha > 0), as the diffusion operator P would be asymmetric otherwise, which can be problematic
during matrix decomposition. Eigenvalues are the same as the asymmetric version, and the right eigenvectors of the original asymmetric
operator can be recovered by left multiplying by |
True
|
Returns:
| Name | Type | Description |
|---|---|---|
P |
(csr_matrix, shape(n_samples, n_samples))
|
The graph diffusion operator (Markov transition matrix). |
Notes
Populates the Kernel.D_inv_sqrt_ attribute when symmetric=True.
shortest_paths ¶
shortest_paths(
landmark=False, indices=None, recompute=False
)
Compute the shortest paths (geodesic distances) on the graph.
Notes
This method may require O(n_samples^2) memory and heavy computation if all-pairs distances are materialized. For large datasets, use landmark mode.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
landmark
|
bool
|
If True, the shortest paths are computed between all pairs of landmarks, rather than all sample nodes. |
False
|
indices
|
list of int
|
If None, the shortest paths are computed between all pairs of nodes. Else, the shortest paths are computed between all pairs of nodes and nodes with specified indices. |
None
|
Returns:
| Name | Type | Description |
|---|---|---|
D |
(ndarray, shape(n_samples, n_samples))
|
The shortest paths matrix. Unreachable nodes evaluate to infinity. |
get_indices_distances ¶
get_indices_distances(n_neighbors=None, kernel=True)
Return per-row neighbor indices and distances from the kernel or kNN graph.
Returns:
| Type | Description |
|---|---|
indices, distances : tuple of ndarray
|
The extracted arrays. |
impute ¶
impute(Y=None, t=None, threshold=0.01, tmax=10)
Impute data Y using the diffusion operator built from X.
Although the idea behind this is far older, it was first reported in single-cell genomics by the Krishnaswamy lab in the MAGIC (Markov Affinity-based Graph Imputation of Cells) manuscript
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
Y
|
ndarray
|
The input data to impute. If None, the input data X is imputed (if it was cached by setting
the |
None
|
t
|
int
|
The number of steps to perform during diffusion. The default |
None
|
threshold
|
float
|
The threshold value for the Procrustes disparity when finding an optimal |
0.01
|
tmax
|
int
|
The maximum number of steps to perform during diffusion when estimating an optimal |
10
|
Returns:
| Name | Type | Description |
|---|---|---|
Y_imp |
ndarray
|
The imputed data. |
filter ¶
filter(
signal,
replicates=None,
beta=50,
target=None,
filterfunc=None,
offset=0,
order=1,
solver="chebyshev",
chebyshev_order=100,
)
Estimate per-sample density over the graph by filtering a signal.
Inspired by MELD for sample-associated density estimation. However, you can naturally use this for any signal in your data, not just samples of specific conditions. In practice, this is just a simple PyGSP filter on a graph. Indeed, it calls PyGSP, so you'll need it installed to use this function.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
signal
|
Signal(s) to filter - usually sample labels. |
required | |
replicates
|
Replicate labels for each sample. If None, no replicates are assumed. |
None
|
|
beta
|
int
|
Amount of smoothing to apply. Vary this parameter to get good estimates - this can vary widely from dataset to dataset. |
50
|
target
|
array - like
|
Similarity matrix to use for filtering. If None, uses the kernel matrix. |
None
|
filterfunc
|
function
|
Function to use for filtering. If None, the default is to use a Laplacian filter. |
None
|
offset
|
Amount to shift the filter in the eigenvalue spectrum. Recommended to use an eigenvalue from the graph based on the spectral distribution. Should be in interval [0,1] |
0
|
|
order
|
Falloff and smoothness of the filter. High order leads to square-like filters. |
1
|
|
solver
|
string
|
Method to solve convex problem. If 'chebyshev', uses a chebyshev polynomial approximation of the corresponding filter. Else, if 'exact', uses the eigenvalue solution to the problem |
'chebyshev'
|
chebyshev_order
|
int
|
Order of chebyshev approximation to use. |
100
|
Returns:
| Name | Type | Description |
|---|---|---|
densities |
DataFrame
|
The filtered sample densities. |
is_connected ¶
is_connected()
Check if the graph is connected (cached).
A graph is connected if and only if there exists a (directed) path between any two vertices.
Returns:
| Name | Type | Description |
|---|---|---|
connected |
bool
|
True if the graph is connected, False otherwise. |
topo.spectral.eigen.EigenDecomposition ¶
EigenDecomposition(
n_components=10,
method="DM",
eigensolver="auto",
eigen_tol=0.0001,
drop_first=True,
laplacian_type="random_walk",
anisotropy=1,
t=1,
random_state=None,
return_evals=False,
estimate_eigengap=True,
verbose=False,
)
Transformer for eigendecomposing kernels and spectral operators.
Use this when you already have a kernel, Laplacian, adjacency/affinity matrix, or diffusion operator and want spectral coordinates. The numerical eigendecomposition is delegated to SciPy:
- dense inputs use
scipy.linalg.eighwheneigensolver="dense"or wheneigensolver="auto"receives a dense matrix; - sparse inputs use
scipy.sparse.linalg.eigshwheneigensolver="arpack"or wheneigensolver="auto"receives a sparse matrix.
The main input can be either a topo.tpgraph.Kernel object or a square matrix.
For Kernel inputs, the transformer reuses the fitted kernel's stored
operators when possible.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
n_components
|
int
|
Number of non-trivial components to return. One extra eigenpair is computed
internally so the trivial first component can be dropped when
|
10
|
method
|
(top, bottom, msDM, DM, LE)
|
Method for organizing the eigendecomposition.
|
'top'
|
eigensolver
|
(auto, dense, arpack)
|
Solver policy.
|
'auto'
|
eigen_tol
|
float
|
Tolerance passed to |
1e-4
|
drop_first
|
bool
|
Whether to drop the first eigenpair, typically the trivial component. |
True
|
laplacian_type
|
str
|
Laplacian type used when |
'random_walk'
|
anisotropy
|
float
|
Diffusion-maps anisotropy parameter, usually denoted |
1
|
t
|
int
|
Diffusion time used when |
1
|
random_state
|
optional
|
Accepted for API compatibility. The simplified eigensolver path does not use randomness. |
None
|
return_evals
|
bool
|
Whether |
False
|
estimate_eigengap
|
bool
|
Whether to store a simple eigengap estimate after fitting. |
True
|
verbose
|
bool
|
Accepted for API compatibility. The simplified eigensolver path does not emit verbose solver diagnostics. |
False
|
fit ¶
fit(X)
Compute the eigendecomposition of kernel matrix X per method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X
|
(array - like, shape(n_samples, n_samples))
|
Matrix to be decomposed. Should generally be an adjacency, affinity/kernel/similarity, Laplacian matrix or a diffusion-type operator. |
required |
Returns:
| Name | Type | Description |
|---|---|---|
self |
EigenDecomposition
|
The fitted instance itself. Eigenvectors and eigenvalues are stored as attributes. If method is 'DM' or 'msDM', the diffusion operator is cached as well. |
rescale ¶
rescale(use_eigs=50)
Re-compute the msDM embedding using a different number of eigenvectors.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
use_eigs
|
int
|
Number of eigenvectors to include in the embedding
(must be ≤ the number retained during |
50
|
Returns:
| Name | Type | Description |
|---|---|---|
self |
EigenDecomposition
|
The modified instance with the scaled embedding updated. |
results ¶
results(return_evals=None)
Return the fitted spectral representation.
For DM and msDM, this computes the embedding from stored eigenpairs
if needed. For LE, top, and bottom, it returns the fitted
eigenvectors.
transform ¶
transform(X=None)
Return the current representation.
X is ignored; the representation is computed from the eigenpairs
stored during fit. Present for scikit-learn compatibility.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X
|
None
|
Ignored. |
None
|
Returns:
| Name | Type | Description |
|---|---|---|
embedding |
ndarray of shape (n_samples, n_components)
|
The computed spectral representation. |
topo.spectral.eigen.eigendecompose ¶
eigendecompose(
G: ArrayLike | spmatrix,
n_components: int = 8,
eigensolver: str = "auto",
largest: bool = True,
eigen_tol: float = 0.0001,
random_state=None,
verbose: bool = False,
) -> tuple[NDArray[np.float64], NDArray[np.float64]]
Return sorted eigenpairs of a square symmetric matrix/operator.
Dense inputs use scipy.linalg.eigh. Sparse inputs use
scipy.sparse.linalg.eigsh unless eigensolver="dense" is requested.
random_state and verbose are accepted for compatibility and ignored.
topo.spectral.LE ¶
LE(
W: ArrayLike | spmatrix,
n_eigs: int = 10,
laplacian_type: str = "random_walk",
drop_first: bool = True,
return_evals: bool = False,
eigen_tol: float = 0.0,
random_state=None,
)
Compute a Laplacian Eigenmaps embedding from an affinity graph.
Laplacian Eigenmaps were introduced by Belkin and Niyogi as a spectral embedding method based on eigenvectors of a graph Laplacian: https://www2.imm.dtu.dk/projects/manifold/Papers/Laplacian.pdf
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
W
|
ArrayLike | spmatrix
|
Dense or sparse graph adjacency/affinity matrix. |
required |
n_eigs
|
int
|
Number of non-trivial eigenvectors to return. |
10
|
laplacian_type
|
str
|
Laplacian type understood by |
'random_walk'
|
drop_first
|
bool
|
Whether to drop the first eigenvector, which is typically the trivial constant mode for connected graphs. |
True
|
return_evals
|
bool
|
If |
False
|
eigen_tol
|
float
|
Tolerance passed to |
0.0
|
random_state
|
Accepted for API compatibility. The current implementation uses ARPACK
through |
None
|
Returns:
| Type | Description |
|---|---|
ndarray or tuple[ndarray, ndarray]
|
Eigenvectors sorted by ascending eigenvalue. If |
Raises:
| Type | Description |
|---|---|
ValueError
|
If |
RuntimeError
|
If ARPACK fails to compute the requested eigendecomposition. |
topo.spectral.graph_laplacian ¶
graph_laplacian(
graph,
*,
laplacian_type: str = "normalized",
return_D: bool = False,
)
Return graph Laplacian with package-standard laplacian_type names.
Supported types:
- "unnormalized": D - W, delegated to scipy.sparse.csgraph.laplacian.
- "normalized" / "symmetric": I - D^-1/2 W D^-1/2, delegated to SciPy.
- "random_walk" / "rw": I - D^-1 W, implemented here because SciPy's normalized Laplacian is the symmetric normalized Laplacian.
Isolated nodes are handled without inf/nan values.
topo.spectral.diffusion_operator ¶
diffusion_operator(
W: ArrayLike | spmatrix,
alpha: float = 1.0,
symmetric: bool = False,
semi_aniso: bool = False,
*,
return_D_inv_sqrt: bool = False,
) -> csr_matrix | tuple[csr_matrix, csr_matrix]
Compute a sparse diffusion-map operator from an affinity graph.
This implements the anisotropic normalization used in diffusion maps, following Coifman and Lafon: https://doi.org/10.1016/j.acha.2006.04.006
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
W
|
ArrayLike | spmatrix
|
Dense or sparse graph adjacency/affinity matrix. |
required |
alpha
|
float
|
Diffusion-maps anisotropy parameter. For |
1.0
|
symmetric
|
bool
|
If |
False
|
semi_aniso
|
bool
|
If |
False
|
return_D_inv_sqrt
|
bool
|
If |
False
|
Returns:
| Type | Description |
|---|---|
csr_matrix or tuple[csr_matrix, csr_matrix]
|
The diffusion operator. If |
Notes
The return type is always sparse CSR, regardless of the input format.
topo.layouts.projector.Projector ¶
Projector(
n_components=2,
projection_method="MAP",
metric="euclidean",
n_neighbors=10,
n_jobs=1,
landmarks=None,
landmark_method="kmeans",
num_iters=800,
init: str | ndarray = "spectral",
nbrs_backend="hnswlib",
keep_estimator=False,
random_state=None,
verbose=False,
save_every=None,
save_limit=None,
save_callback=None,
include_init_snapshot=True,
)
Bases: BaseEstimator, TransformerMixin
Scikit-learn compatible class that handles all projection methods.
Use this when you already have a spectral representation or graph and only want the final 2-D/3-D layout.
Ideally, it takes in either a orthonormal eigenbasis or a graph kernel learned from such an eigenbasis.
It is included in TopoMetry to allow custom TopOGraph-like pipelines (projection is the final step).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
n_components
|
int
|
Number of dimensions to optimize the layout to. Usually 2 or 3 if you're into visualizing data. |
2
|
projection_method
|
str
|
Which projection method to use. |
'Isomap'
|
metric
|
str
|
The metric to use when computing distances. Possible values are: 'cosine', 'euclidean' and others. Accepts precomputed distances ('precomputed'). |
'euclidean'
|
n_neighbors
|
int
|
The number of neighbors to use when computing the kernel matrix. Ignored if |
10
|
landmarks
|
int or ndarray
|
If passed as |
None
|
landmark_method
|
str
|
The method to use for selecting landmarks. If |
'kmeans'
|
num_iters
|
int
|
Most (if not all) methods optimize the layout up to a limit number of iterations. Use this parameter to set this number. |
1000
|
keep_estimator
|
bool
|
Whether to keep the used estimator as Projector.estimator_ after fitting. Useful if you want to use it later (e.g. UMAP allows inverse transforms and out-of-sample mapping). |
False
|
fit ¶
fit(
X: ndarray | csr_matrix | Kernel, **kwargs: Any
) -> Projector
Run the desired projection method on the specified data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X
|
array-like, shape (n_samples, n_features) or topo.Kernel() class.
|
The set of points to compute the kernel matrix for. Accepts np.ndarrays and scipy.sparse matrices or a |
required |
**kwargs
|
Any
|
Additional keyword arguments for the desired projection method. |
{}
|
Returns:
| Name | Type | Description |
|---|---|---|
self |
Projector
|
The fitted Projector instance with populated |
transform ¶
transform(
X: ndarray | csr_matrix | None = None,
) -> np.ndarray
Return the projection, using the backend's transform when available.
If the desired method does not have a transform method, returns the results from the fit method.
Returns:
| Name | Type | Description |
|---|---|---|
Y |
ndarray of shape (n_samples, n_components)
|
Projection results. |
fit_transform ¶
fit_transform(X, y=None, **kwargs) -> np.ndarray
Fit the projection and return the embedding.
If the desired method does not have a fit_transform method, returns the results from the fit method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X
|
(array - like or Kernel, shape(n_samples, n_features))
|
The input data to fit and transform. |
required |
y
|
None
|
Ignored. |
None
|
**kwargs
|
Additional arguments passed to the underlying backend. |
{}
|
Returns:
| Name | Type | Description |
|---|---|---|
Y |
ndarray of shape (n_samples, n_components)
|
Projection results. |
topo.tpgraph.intrinsic_dim.IntrinsicDim ¶
IntrinsicDim(
methods=["fsa", "mle"],
k=[10, 20, 50, 75, 100],
backend="hnswlib",
metric="euclidean",
n_jobs=-1,
plot=True,
random_state=None,
**kwargs,
)
Bases: BaseEstimator, TransformerMixin
Scikit-learn flavored estimator of intrinsic dimensionality.
Use this to estimate manifold dimensionality or inspect local geometric complexity. This class iterates over a range of possible values of k-nearest-neighbors to consider in calculations using two different methods: the Farahmand-Szepesvári-Audibert (FSA) dimension estimator and the Maximum Likelihood Estimator (MLE).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
methods
|
list of str, (default ['fsa'])
|
The dimensionality estimation methods to use. Current options are 'fsa' () and 'mle'(). |
['fsa', 'mle']
|
k
|
int, range or list of ints, (default [10, 20, 50, 75, 100])
|
The number of nearest neighbors to use for the dimensionality estimation methods.
If a single value of |
[10, 20, 50, 75, 100]
|
metric
|
str (default 'euclidean')
|
The metric to use when calculating distance between instances in a feature array. |
'euclidean'
|
backend
|
str
|
Which backend to use for k-nearest-neighbor computations. Defaults to 'hnswlib'. Options are 'hnswlib' and 'sklearn'. |
'hnswlib'
|
n_jobs
|
int
|
The number of jobs to use for parallel computations. If -1, all CPUs are used. Parallellization (multiprocessing) is highly recommended whenever possible. |
1
|
plot
|
bool
|
Whether to plot the results when using the |
True
|
random_state
|
int or numpy.random.RandomState() (optional
|
A pseudo random number generator. Used for generating colors for plotting. |
None).
|
**kwargs
|
keyword arguments
|
Additional keyword arguments to pass to the backend kNN estimator. |
{}
|
Properties
local_id, global_id : dictionaries containing local and global dimensionality estimates, respectivelly.
Their structure depends on the value of the `k` parameter:
* If a single value of `k` is provided, then the dictionaries will have
keys corresponding to the methods, and values corresponding to the
dimensionality estimates.
* If multiple values of `k` are provided, then the dictionaries will have
keys corresponding to the number of k, and values corresponding to other dictionaries,
which have keys corresponding to the methods, and values corresponding to the
dimensionality estimates.
plot_id ¶
plot_id(
bins=30,
figsize=(6, 8),
titlesize=22,
labelsize=16,
legendsize=10,
)
Plot histograms of the estimated local intrinsic dimensions.
fit ¶
fit(X, **kwargs)
Estimates the intrinsic dimensionalities of the data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
X
|
(array - like, shape(n_samples, n_features))
|
The set of points to compute the kernel matrix for. Accepts np.ndarrays and scipy.sparse matrices. If precomputed, assumed to be a square symmetric semidefinite matrix of k-nearest-neighbors, with its k higher or equal the highest value of k used to estimate the intrinsic dimensionalities. |
required |
**kwargs
|
Additional keyword arguments to pass to the plotting function |
{}
|
Returns:
| Type | Description |
|---|---|
Populates the `local` and `global` properties of the class.
|
|
Shows a plot of the results if `plot=True`.
|
|
transform ¶
transform(X=None)
No-op transform kept for scikit-learn API compatibility.
IntrinsicDim exposes its estimates via attributes (local_id,
global_id); this method returns self unchanged.
Returns:
| Name | Type | Description |
|---|---|---|
self |
IntrinsicDim
|
The estimator itself. |
topo.tpgraph.intrinsic_dim.automated_scaffold_sizing ¶
automated_scaffold_sizing(
X,
method: str = "fsa",
ks=(15, 30, 60),
backend="hnswlib",
metric="euclidean",
n_jobs: int = -1,
quantile: float = 0.99,
min_components: int = 16,
max_components: int = 512,
headroom: float = 0.15,
random_state=None,
use_median: bool = False,
return_details: bool = False,
**knn_kwargs,
)
Unified automated scaffold sizing.
method='fsa':
- Compute local FSA i.d. for each k in ks, take per-cell median across ks,
then take the upper quantile across cells, add headroom, clamp to bounds.
method='mle':
- Use a single neighborhood size k (if ks is an int; if iterable, use max(ks)).
- Compute local MLE i.d. at k, then global i.d. via:
* median of locals if use_median=True
* Levina–Bickel harmonic-mean estimator (mle_global) otherwise.
- Add headroom, clamp to bounds.
Returns:
| Name | Type | Description |
|---|---|---|
n_components |
int
|
|
details |
dict (if return_details=True)
|
|
topo.layouts.diagnostics.find_ideal_projection ¶
find_ideal_projection(
tg: Any,
min_dist_grid: Iterable[float] | None = None,
spread_grid: Iterable[float] | None = None,
initial_alpha_grid: Iterable[float] | None = None,
*,
multiscale: bool = True,
num_iters: int = 600,
save_every: int = 10,
metric: str = "euclidean",
n_neighbors: int = 30,
times: Sequence[int] = (1, 2, 4),
r: int = 32,
k_for_pf1: int | None = None,
symmetric_hint: bool = True,
) -> dict[str, Any]
Grid-search MAP hyperparameters for a fitted TopOGraph-like object.
The object must expose:
- project(...)
- P_X_
- _backend_resolved
- _n_jobs_effective
- verbosity
- msTopoMAP_snapshots / TopoMAP_snapshots
This function does not rerun the best projection automatically. It returns the best parameters and score records; the caller can decide whether to run the final projection.
topo.layouts.diagnostics.run_best_projection ¶
run_best_projection(
tg: Any,
params: dict[str, float],
*,
multiscale: bool = True,
num_iters: int = 600,
save_every: int = 10,
) -> np.ndarray
Run MAP once using selected hyperparameters.