Intrinsic entropy model for feature selection of scRNA-seq data
Lin Li1,2,†, Hui Tang3,†, Rui Xia1,2, Hao Dai1, Rui Liu3, Luonan Chen1,4,5,6,*
1State Key Laboratory of Cell Biology, Shanghai Institute of Biochemistry and Cell Biology, CAS Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, Shanghai 200031, China
2University of Chinese Academy of Sciences, Beijing 100049, China
3School of Mathematics, South China University of Technology, Guangzhou 510640, China
4Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China
5Key Laboratory of Systems Health Science of Zhejiang Province, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou 310024, China
6Guangdong Institute of Intelligence Science and Technology, Zhuhai 519031, China
†These authors contributed equally to this work
*Correspondence to: Luonan Chen
J Mol Cell Biol, Volume 14, Issue 2, February 2022, mjac008
Keywords: scRNA-seq, feature selection, intrinsic entropy, extrinsic entropy, entropy decomposition, informative genes

Recent advances in single-cell RNA sequencing (scRNA-seq) technologies have led to extensive study of cellular heterogeneity and cell-to-cell variation. However, the high frequency of dropout events and noise in scRNA-seq data compromises downstream analyses such as clustering, whose accuracy depends heavily on the selected feature genes. Here, by deriving an entropy decomposition formula, we propose a feature selection method, the intrinsic entropy (IE) model, to identify informative genes for accurate clustering analysis. Specifically, by eliminating the ‘noisy’ fluctuation, or extrinsic entropy (EE), we extract the IE of each gene from its total entropy (TE), where TE = IE + EE. We show that the IE of each gene reflects the regulatory fluctuation of that gene in a cellular process, and thus high-IE genes provide rich information for cell type or state analysis. To validate the performance of the high-IE genes, we conduct computational analyses on both simulated and real single-cell datasets and compare the IE model with other representative methods. The results show that the IE model is not only broadly applicable and robust across different clustering and classification methods, but also sensitive to novel cell types. Our results also demonstrate that, in contrast to its total entropy/fluctuation, the intrinsic entropy/fluctuation of a gene serves as information rather than noise.
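The additive form TE = IE + EE can be illustrated with a generic information-theoretic identity. The sketch below is not the decomposition derived in this paper, whose formula is not given in the abstract and which does not require cell labels. It assumes a stand-in: the chain rule H(X) = I(X; Y) + H(X | Y), where X is a gene's binned expression and Y is a hypothetical cell-state label. The function names, the equal-width binning, and the toy data are all illustrative assumptions.

```python
import numpy as np


def shannon_entropy(counts):
    """Shannon entropy (in bits) of the distribution given by bin counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0]
    p = p / p.sum()
    return float(-(p * np.log2(p)).sum())


def decompose_entropy(expr, labels, n_bins=4):
    """Split total entropy as H(X) = I(X; Y) + H(X | Y).

    Illustrative stand-in only: the paper's IE/EE decomposition is
    different and unsupervised; `labels` is a hypothetical cell-state
    annotation used here just to show an additive entropy split.
    Returns (total, structured, residual) in bits.
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)

    # Discretize expression into equal-width bins (an arbitrary choice).
    edges = np.linspace(expr.min(), expr.max(), n_bins + 1)
    binned = np.digitize(expr, edges[1:-1])  # bin indices 0..n_bins-1

    total = shannon_entropy(np.bincount(binned, minlength=n_bins))

    # Conditional entropy H(X | Y): within-label entropy weighted by
    # the fraction of cells carrying each label.
    residual = 0.0
    for lab in np.unique(labels):
        mask = labels == lab
        within = shannon_entropy(np.bincount(binned[mask], minlength=n_bins))
        residual += mask.mean() * within

    structured = total - residual  # mutual information I(X; Y)
    return total, structured, residual


labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
marker_gene = np.array([0, 0, 0, 0, 10, 10, 10, 10])  # tracks the label
noisy_gene = np.array([0, 10, 0, 10, 0, 10, 0, 10])   # ignores the label

marker_parts = decompose_entropy(marker_gene, labels)
noisy_parts = decompose_entropy(noisy_gene, labels)
```

In this toy setting both genes have the same total entropy, yet only the marker gene's entropy is structured with respect to cell state. This mirrors the abstract's point that total fluctuation alone does not distinguish informative genes from noisy ones.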