Background: Diffusion Maps, CkNN, and Manifold Learning¶
A light, readable tour of the ideas behind topometry-nosc, with pointers to the
papers and package docs that introduce each one. You do not need any of this
to use the package — it is here to explain why each piece exists.
The sections follow the pipeline the package runs:
neighbor graph → graph operator (Laplace–Beltrami) → spectral scaffold → refined graph → 2-D layout → evaluation
Start here
If you read only three things, read the UMAP documentation (intuition for neighbor graphs), von Luxburg's spectral-clustering tutorial (graph Laplacians), and Coifman & Lafon on diffusion maps (the spectral backbone). Everything else specializes from there.
1. Nearest-neighbor graphs¶
The first step turns a table of points into a graph that connects each point to
its closest neighbors. (Package: topo.base.ann.)
- Approximate nearest neighbors (HNSW). Malkov & Yashunin's Hierarchical Navigable Small World graphs give fast, scalable neighbor search — the default backend for large data. IEEE TPAMI 2020 / arXiv:1603.09320
- Continuous k-nearest neighbors (CkNN). Berry & Sauer show that a
continuous kNN graph yields an unweighted, undirected graph whose
unnormalised Laplacian converges to the Laplace-Beltrami operator. Given
local scale radii
rho_iandrho_j, CkNN connects samples whend(i, j) < delta * sqrt(rho_i * rho_j). Foundations of Data Science 2019 / arXiv:1606.02353
2. Kernels and the Laplace–Beltrami operator¶
Edge weights (a kernel) turn the neighbor graph into a discrete approximation
of a continuous operator on the data's underlying shape. (Package:
topo.tpgraph.kernels.)
- Diffusion maps. Coifman & Lafon's foundational paper: a diffusion operator built from a Gaussian kernel whose eigenfunctions parameterize the manifold and approximate the Laplace–Beltrami operator. Appl. Comput. Harmon. Anal. 2006
- Variable-bandwidth kernels. Berry & Harlim generalize diffusion maps: scaling bandwidth by local density improves convergence on unevenly sampled data — the basis for the package's adaptive-bandwidth kernels. Appl. Comput. Harmon. Anal. 2016 / arXiv:1406.5064
- Laplacian eigenmaps. Belkin & Niyogi: embed data using eigenvectors of the graph Laplacian — the simplest member of this family and a useful baseline. Neural Computation 2003
- Fuzzy simplicial sets (UMAP's graph). UMAP builds its graph from fuzzy
neighborhoods combined by a probabilistic union. For
fuzzykernels,topometry-noscdelegates this UMAP-specific graph construction toumap-learn; TopoMetry then continues with its own spectral scaffolds, diffusion operators and graph-refinement pipeline. The docs are the gentlest introduction. UMAP docs · McInnes, Healy & Melville, arXiv:1802.03426
3. Spectral scaffolds (the eigenbasis)¶
Eigenvectors of the graph operator form an orthonormal basis — the "spectral
scaffold" — that captures structure across scales. (Package:
topo.spectral.eigen.)
- Spectral-clustering tutorial. Von Luxburg's self-contained tutorial: similarity graphs, the different graph Laplacians and their normalizations, and why their eigenvectors reveal structure. The best single reference for this layer. Statistics and Computing 2007 / arXiv:0711.0189
- Multiscale diffusion (msDM). Diffusion at multiple time scales (the package's
msDMscaffold) follows directly from the diffusion-map eigenstructure in Coifman & Lafon (above); the diffusion time controls the scale of structure emphasized.
4. Union of Manifolds (UoM)¶
When data splits into disconnected or weakly connected pieces, the package builds
per-component scaffolds and assembles them block-diagonally. (Package:
topo.uom.)
- Louvain community detection. Blondel et al.'s greedy modularity method — used to partition the graph into coherent communities before per-component scaffolding. J. Stat. Mech. 2008 / arXiv:0803.0476
- Leiden (modern refinement). Traag et al. fix Louvain's badly-connected-community issue; useful background even though the package's own splitter is Louvain-style. Scientific Reports 2019
- Cuts and conductance. Shi & Malik's normalized cuts motivate the conductance-based merging of fragile components; von Luxburg's tutorial (§3) also covers cut measures. IEEE TPAMI 2000
5. Intrinsic dimensionality¶
How many underlying factors really shape the data — used to size the scaffold.
(Package: topo.tpgraph.intrinsic_dim, methods fsa and mle.)
- FSA estimator. Farahmand, Szepesvári & Audibert's manifold-adaptive
dimension estimator from nearest-neighbor distance ratios (the
fsamethod). ICML 2007 - MLE estimator. Levina & Bickel's maximum-likelihood intrinsic-dimension
estimator (the
mlemethod). NeurIPS 2004 - Practical overview.
scikit-dimensioncollects and benchmarks many such estimators — handy for context. Bac et al., Entropy 2021
6. Layouts and projections¶
Turning the scaffold and refined graph into a 2-D picture. (Package:
topo.layouts.) Different methods trade off local detail against the global
arrangement; trying a few is normal.
- Isomap. Tenenbaum, de Silva & Langford: geodesic distances + classical MDS; the foundational geodesic-preserving embedding. Science 2000
- t-SNE. Van der Maaten & Hinton: heavy-tailed neighbor embedding, excellent for local cluster structure. JMLR 2008
- UMAP / MAP.
projection_method="UMAP"usesumap-learn's UMAP estimator. The package'sMAPremains a local checkpoint-aware graph-layout optimizer for TopoMetry refined graphs. McInnes et al., arXiv:1802.03426 - PaCMAP. Wang, Huang, Rudin & Shaposhnik: balances local and global structure by design, with a clear analysis of what makes a good DR loss. JMLR 2021
- TriMap. Amid & Warmuth: triplet-constraint embedding that preserves global layout and scales to large data. arXiv:1910.00204
- Minimum-Distortion Embedding (MDE). Agrawal, Ali & Boyd: a general framework (generalizing PCA, MDS, spectral embedding, UMAP…) implemented in PyMDE. Found. Trends ML 2021 / arXiv:2103.02559 · PyMDE docs
7. Evaluation and geometry diagnostics¶
How faithful is the embedding? The package emphasizes geometry and operator
preservation rather than trusting any single layout. (Packages: topo.eval,
topo.eval.topo_metrics, topo.eval.rmetric.)
- Geodesic rank correlation. The package compares geodesic distances in the original vs. embedded space with Spearman/Kendall correlation (built on Isomap's geodesic idea, §6). Tenenbaum et al., Science 2000
- Operator-native topology metrics. The package also compares transition neighborhoods, transition probabilities, and spectral coordinates between graph operators. These scores are meant to compare representations and parameter choices in the same graph/diffusion language as the pipeline.
- Trustworthiness & continuity. These are standard DR diagnostics in the broader literature: do embedded neighbors reflect real neighbors, and vice versa? They are useful context, but they are not the main built-in evaluation API in this fork. Venna & Kaski, Neural Networks 2006
- Co-ranking framework. Lee & Verleysen unify these rank-based metrics in one co-ranking matrix — the cleanest way to think about DR quality. Neurocomputing 2009
- Riemannian metric estimation. Perrault-Joncas & Meila estimate the distortion
an embedding introduces from its Laplacian — the basis for the package's
rmetricdiagnostics (after the megaman library). arXiv:1305.7255 · megaman, JMLR 2016 - Target-aware evaluation. Predicting or smoothing a per-sample target answers a different question from geometry preservation. See the Practical FAQ for recommended checks.
8. Surveys and starting points¶
- Comparative review of nonlinear DR. Van der Maaten, Postma & van den Herik compare many techniques on common datasets — a map of the whole field. Report, 2009 (PDF)
- UMAP documentation. Still the friendliest on-ramp to neighbor graphs and fuzzy topology. umap-learn docs