The Compromise of UMAP for Single Cell Data
Read about the compromise of UMAP for single cell data. UMAP is a local distance preserving layout with some attempt at a global space, but forcing that global space to be 2D can be a big compromise.
Ewan Birney
Deputy Director General of EMBL, Director of EMBL-EBI. I have an insatiable love of biology. @ewanbirney@genomic.social. I also work with ONT, Dovetail + GeL.
Idly reading another single cell paper (it is ... impressive how the technology) and the authors are using phrases like "neighbourhood" and "close to", clearly referring the reader to the UMAP display...
— Ewan Birney (@ewanbirney) June 4, 2023 -
Now, this slightly puts my teeth on edge, because UMAP is at its heart a local distance preserving layout with some attempt at a global space, but forcing that global space to be 2D (so we can print it/visualise it) seems like ... a big compromise.
A reminder - single cell data is sparse, with ~8,000 expressed gene dimensions - in theory 20,000, but in practice many genes are not expressed at all, not least the many olfactory receptors - and when you visualise you need to decide how to get those ~8,000 dimensions into 2.
One way, from the 1900s, is principal components: project the 8,000 dimensions onto the 1 dimension which has the most variance, and then carry on from there to the next orthogonal dimension which has the next most. Clearly a compromise for things with >2 "real" dimensions.
This won't work well for things with >2 dimensions in the real world, and it also gives unaesthetic plots for things with high dimensions, often a big L-shaped space with weird scatter.
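To make the principal components point concrete, here is a minimal sketch on synthetic data (not real single cell data). The sizes, the noise level and the names `latent`, `loadings` and `expression` are all assumptions made up for illustration. It builds a matrix with five equally important "real" dimensions and checks how much of the variance the first two principal components can hold.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration, not taken from the thread.
n_cells, n_genes, n_latent = 500, 200, 5

# Five equally important latent factors, embedded into "gene" space
# through an orthonormal set of loadings (via a QR decomposition).
latent = rng.normal(size=(n_cells, n_latent))
q, _ = np.linalg.qr(rng.normal(size=(n_genes, n_latent)))
loadings = q.T
expression = latent @ loadings + 0.05 * rng.normal(size=(n_cells, n_genes))

# PCA by SVD of the centred matrix: the squared singular values are
# proportional to the variance captured by each principal component.
centred = expression - expression.mean(axis=0)
singular_values = np.linalg.svd(centred, compute_uv=False)
explained = singular_values**2 / np.sum(singular_values**2)

top2 = explained[:2].sum()
top5 = explained[:5].sum()
print(f"variance captured by PC1+PC2:  {top2:.2f}")
print(f"variance captured by PC1..PC5: {top5:.2f}")
```

In this toy setup the first two components hold well under half of the variance, while the first five hold most of it, which is the sense in which a 2D principal components plot is a compromise once there are more than 2 "real" dimensions.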
Other many -> 2 dimension projections are available, and one, t-SNE, is more about laying out the points and making sure the local contacts are right (two points touching/very close to each other means that they are similar). This handles scenarios with >2 dimensions more gracefully
and is particularly good if there are clusters in the data. However, the 2D layout is literally arbitrary, and although the pictures look nice, they obviously don't look like a sensible "space" (because, well, it really doesn't care as long as the points fit onto the page).
UMAP is conceptually similar in that it is interested in the behaviour of local points (cells), but has a stronger view of what the final dimensional layout space looks like from the overall concept (I get lost in the maths, but it does sound glorious - Riemannian manifold metrics etc).
The UMAP maths is based on the idea that there is a "true" dimensionality for this data, and one is approximating it (using clever maths ideas). Literally nothing, and certainly no bit of biology, says this dimensionality is 2. We... must be making compromises when we plot 2D UMAPs.
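One hedged way to put a number on that compromise is to ask how many of each point's nearest neighbours in the full space are still its nearest neighbours in the 2D picture. The sketch below is an illustration only: it uses a plain principal components projection as a stand-in for the 2D layout (it does not run UMAP), the data are synthetic Gaussian points with five real dimensions, and the helper names `knn_indices` and `neighbour_overlap` are hypothetical.

```python
import numpy as np

def knn_indices(points, k):
    """Indices of the k nearest neighbours of each row, excluding the row itself."""
    sq = np.sum(points**2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(dist2, np.inf)  # a point is not its own neighbour
    return np.argsort(dist2, axis=1)[:, :k]

def neighbour_overlap(high, low, k=10):
    """Mean fraction of each point's k high-dimensional neighbours kept in the layout."""
    nn_high = knn_indices(high, k)
    nn_low = knn_indices(low, k)
    shared = [len(set(a) & set(b)) for a, b in zip(nn_high, nn_low)]
    return float(np.mean(shared)) / k

rng = np.random.default_rng(1)
cells = rng.normal(size=(300, 5))  # hypothetical: 300 "cells", 5 real dimensions

centred = cells - cells.mean(axis=0)
_, _, vt = np.linalg.svd(centred, full_matrices=False)
layout_2d = centred @ vt[:2].T  # PCA stand-in for a 2D layout, NOT UMAP
layout_5d = centred @ vt[:5].T  # a rotation that keeps every real dimension

overlap_2d = neighbour_overlap(cells, layout_2d)
overlap_5d = neighbour_overlap(cells, layout_5d)
print(f"neighbours kept in 2D: {overlap_2d:.2f}")
print(f"neighbours kept in 5D: {overlap_5d:.2f}")
```

The same overlap score could be computed for an actual UMAP or t-SNE layout by passing its coordinates as `low`. Those methods optimise for local neighbours, so they would be expected to score better than this linear stand-in, but the score still shows how much neighbourhood structure a given 2D plot has given up.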
...and yet... it seems to work... well enough to colour them in and talk about the "right hand blob of the UMAP" or to visualise development "along" a spine of UMAP.
So - what is going on? First off, I think a hidden gem here is that lots and lots of data in these single cell experiments, with variation between samples, means the datasets do fit into local connectivity
i.e., in the data there are many cells which are, well, somewhere on a closely related journey (notice, one has to batch correct well, and interestingly batch correction often involves finding "near neighbours" in different samples to anchor the batch correction).
Somehow cellular biology is a good fit to this sort of "densely occupied manifold" (if one wants to get pseudo-maths-ey about it).
But as well as being well sampled (when done at scale), I suspect developmental cell biology, down to its ultimate set of cell types, sits in some relatively small (sub 20? 50?) dimensional space in the main, with fractal leaves, which can perhaps be squished further for viewing.
This goes to the fact that cells must be linked by biochemical processes to their originators, and thus everything must be connected and, I suspect, not that mind-blowingly different.
This "inherent cellular dimensionality" would be an interesting thing to discover and aim to quantify/estimate. It is related to cellular ontogeny, but it is not the same (commenting on the ideas from @JShendure).
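As one possible starting point for estimating such a dimensionality, the sketch below computes the participation ratio of the covariance eigenvalues, (sum of eigenvalues)^2 / (sum of squared eigenvalues). This is an assumption about how one might quantify it, not a method proposed in the thread. It is a linear measure, so it would overestimate the dimensionality of a curved manifold, and the synthetic data, sizes and the function name `participation_ratio` are hypothetical.

```python
import numpy as np

def participation_ratio(data):
    """Effective linear dimensionality from the spectrum of the covariance.

    Equals d when variance is spread evenly over d directions and
    shrinks towards 1 as a single direction dominates.
    """
    centred = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centred, compute_uv=False)
    eig = singular_values**2  # proportional to the covariance eigenvalues
    return float(eig.sum() ** 2 / np.sum(eig**2))

rng = np.random.default_rng(2)

# Hypothetical toy data: 8 equally weighted latent dimensions embedded
# in 100 "genes" with a little measurement noise.
n_cells, n_genes, n_latent = 1000, 100, 8
latent = rng.normal(size=(n_cells, n_latent))
basis, _ = np.linalg.qr(rng.normal(size=(n_genes, n_latent)))
expression = latent @ basis.T + 0.01 * rng.normal(size=(n_cells, n_genes))

estimate = participation_ratio(expression)
print(f"effective dimensionality: {estimate:.1f} (8 latent dimensions were planted)")
```

On real expression data the answer would depend heavily on normalisation, gene selection and batch correction, so a figure from this kind of estimator is better read as a rough scale (2? 20? 50?) than as a measurement.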
This goes to finding "appropriate" or "pragmatic" dimensionality (aka embeddings, aka latent spaces) for biology, and working in those. I think it is very unlikely the dimensionality is 2 - or that it is easily plottable, but... it is not infinite. Biology is, fundamentally, bounded.