Byte Sized Machine Learning

Byte Sized Machine Learning is a forum for quick articles on complex topics. Help others learn about core concepts quickly with concise posts.

Selection of Highly Variable Genes (HVGs) in scRNA-seq datasets

Faraz Ahmed
3 min read · Mar 3, 2024

A pivotal step in single-cell RNA sequencing (scRNA-seq) analysis is identifying the highly variable genes (HVGs). The choice of HVGs matters because it directly influences downstream analysis steps such as clustering.

Over the years, many methods have been developed to select HVGs. However, these data have an intrinsic property that must be corrected for before HVGs can be properly selected.

Why do we need to select HVGs in the first place?

Tens of thousands of genes are measured in each cell, but the underlying challenge is the sparse nature of the data generated by single-cell experiments: for any given gene, most cells have zero counts. These zeros are primarily derived from drop-out events alongside other technical limitations of the technology.
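To make the idea of sparsity concrete, here is a minimal sketch that measures the fraction of zeros in a count matrix. The matrix layout (cells as rows, genes as columns), the function name `sparsity_summary`, and the toy values are assumptions made for illustration; they are not taken from a real dataset.

```python
import numpy as np

def sparsity_summary(counts):
    """Return the overall zero fraction and the per-gene zero fraction.

    Assumes `counts` is a cells x genes matrix of raw counts.
    """
    counts = np.asarray(counts)
    zero_mask = counts == 0
    overall = zero_mask.mean()          # fraction of all entries that are zero
    per_gene = zero_mask.mean(axis=0)   # fraction of cells with zero counts, per gene
    return overall, per_gene

# Hypothetical toy matrix: 4 cells x 3 genes
toy = np.array([[0, 3, 0],
                [0, 0, 1],
                [2, 0, 0],
                [0, 0, 0]])

overall, per_gene = sparsity_summary(toy)
print(overall)   # 0.75
print(per_gene)  # [0.75 0.75 0.75]
```

On a real dataset the count matrix is usually stored in a sparse format, so the same quantities would be computed from the number of stored non-zero entries rather than from a dense array.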

In addition, a majority of the genes across all cells are highly correlated. Therefore, it makes sense to focus on genes that are highly variable across cells (typically the top 2,000 to 5,000 genes). These are the genes that drive the main signal in the dataset. Selecting HVGs not only makes the data less sparse than the original count matrix, but also makes downstream computational steps more efficient.
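As a rough illustration of what "keeping the top N most variable genes" means, the sketch below ranks genes by their raw variance across cells and keeps the top few. This is deliberately naive: it applies none of the correction this article says is needed before HVGs can be properly selected. The function name `top_variable_genes`, the `n_top` parameter, and the toy matrix are hypothetical.

```python
import numpy as np

def top_variable_genes(counts, n_top):
    """Return column indices of the n_top genes with the largest raw variance.

    Assumes `counts` is a cells x genes matrix. Ranking by raw variance is
    a naive baseline, not a recommended HVG selection method.
    """
    counts = np.asarray(counts, dtype=float)
    gene_var = counts.var(axis=0, ddof=1)    # sample variance of each gene across cells
    order = np.argsort(gene_var)[::-1]       # gene indices, most variable first
    return np.sort(order[:n_top])            # keep original column order

# Hypothetical toy matrix: 4 cells x 3 genes
toy = np.array([[0, 5, 1],
                [0, 0, 1],
                [1, 9, 1],
                [0, 2, 3]])

hvg_idx = top_variable_genes(toy, n_top=2)
hvg_matrix = toy[:, hvg_idx]                 # reduced matrix with only the selected genes
print(hvg_idx)  # [1 2]
```

In practice `n_top` would be in the range mentioned above (about 2,000 to 5,000), and the reduced matrix is what gets passed to downstream steps such as clustering.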

The problem:

Figure. Left panel: the relationship between a gene's average expression and its variance. Right panel: the same relationship after correction using VST.
Image source: https://ouyanglab.com/singlecell/basic.html

It may be easier to understand the problem through visualization. Let's look at the left panel of the figure above, which plots each gene’s average expression against its observed variance. Each dot is a gene; the x-axis is that gene's average expression and the y-axis is its observed variance.
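The two quantities on the axes of that panel can be computed directly from a count matrix. The sketch below does this on simulated data, because no real dataset accompanies this article. The Poisson model, the range of expression rates, and the matrix size are all assumptions chosen for illustration; real scRNA-seq counts typically show more variance than a Poisson model would predict.

```python
import numpy as np

rng = np.random.default_rng(0)

n_cells, n_genes = 500, 200
# Hypothetical per-gene expression rates spanning three orders of magnitude
rates = np.logspace(-1, 2, n_genes)

# Simulated cells x genes count matrix (Poisson is an assumption, see above)
counts = rng.poisson(rates, size=(n_cells, n_genes))

gene_mean = counts.mean(axis=0)          # x-axis: average expression per gene
gene_var = counts.var(axis=0, ddof=1)    # y-axis: observed variance per gene

# Correlation between mean and variance on a log scale
r = np.corrcoef(np.log1p(gene_mean), np.log1p(gene_var))[0, 1]
print(f"correlation between log mean and log variance: {r:.3f}")
```

Plotting `gene_mean` against `gene_var` on log-scaled axes gives one dot per gene, in the same spirit as the left panel.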

In this panel, we observe a very strong positive relationship between a gene’s average expression and its observed variance. In other words, highly expressed genes have high variances associated with them and vice…

Written by Faraz Ahmed

Bioinformatics Programmer III @Cornell
