Math Bio Seminar

Time: Friday, December 19, 2025 at 11:00am
Location: Carver 401

Xinglin Jia, Iowa State University (Bioinformatics & Computational Biology)
CLEAR: Concise List Enrichment Analysis using R
Many modern high-throughput methods provide genome-wide data for all genes, SNPs, or other molecular features. Since biological functions are carried out by interacting proteins rather than individual genes, gene set analysis is crucial for interpreting these large-scale datasets. Model-based gene set analysis methods such as GenGO and MGSA use probabilistic approaches to infer which biological categories are activated. GenGO identifies active Gene Ontology (GO) categories using a generative probabilistic model that accounts for noise and overlapping GO terms to reduce redundancy in the result. MGSA extends this framework by introducing a Bayesian network, simultaneously inferring all categories, and improving robustness against noise. These methods have the advantage of returning a concise, non-redundant group of gene sets, which traditional methods such as Over-representation Analysis (ORA) and Gene Set Enrichment Analysis (GSEA) lack, since they test each gene set individually. However, GenGO and MGSA rely on binary gene activation states, which are determined by an arbitrary, user-defined threshold, rather than utilizing the underlying continuous test statistics such as effect sizes or p-values. Some extensions of MGSA incorporate the topological structure of the Gene Ontology or additional constraints to improve model performance, but the statistical information associated with the genes is still disregarded. We propose a novel Bayesian model-based method, Concise List Enrichment Analysis using R (CLEAR), which directly models the gene-level statistics rather than the binary activation states. CLEAR assumes that the gene statistics follow distinct distributions under the alternative and null hypotheses, enabling a more sensitive and nuanced interpretation of gene-level variation within gene sets. This probabilistic, continuous framework improves the robustness and interpretability of gene set analysis.
We compared the performance of CLEAR against established methods using both in silico and real datasets, assessing its sensitivity and ability to return gene sets with established phenotype relevance. CLEAR achieves higher sensitivity and improves output interpretability by reducing redundancy and preserving more meaningful information. In conclusion, CLEAR is a powerful gene set enrichment analysis method that leverages all the information available in gene-level statistics and identifies relevant gene sets with greater precision.
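The contrast drawn in the abstract between thresholded binary activation states and continuous gene-level statistics can be illustrated with a small toy sketch. This is not CLEAR's model or code: the Gaussian null and alternative distributions, the 1.96 cutoff, the z-scores, the gene and set names, and the function names below are all assumptions chosen for illustration. The sketch also scores each set on its own, so it does not show the joint, redundancy-reducing inference described above.

```python
import math

def ora_pvalue(n_genes, n_active, set_size, n_overlap):
    """Upper-tail hypergeometric p-value P(X >= n_overlap) for one gene set."""
    total = math.comb(n_genes, set_size)
    upper = min(set_size, n_active)
    tail = sum(
        math.comb(n_active, k) * math.comb(n_genes - n_active, set_size - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

def normal_logpdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def set_log_likelihood_ratio(z_scores, alt_mean=2.0, alt_sd=1.0):
    """Sum of log p(z | alternative) - log p(z | null) over the genes in a set.

    Assumed toy distributions: null N(0, 1), alternative N(alt_mean, alt_sd).
    """
    return sum(
        normal_logpdf(z, alt_mean, alt_sd) - normal_logpdf(z, 0.0, 1.0)
        for z in z_scores
    )

# Hypothetical gene-level z-scores.
z = {"g1": 1.8, "g2": 1.7, "g3": 1.9, "g4": 0.1, "g5": -0.2,
     "g6": 2.5, "g7": 0.3, "g8": -0.1, "g9": 0.0, "g10": 0.2}
set_a = ["g1", "g2", "g3"]   # consistently elevated, but just under the cutoff
set_b = ["g4", "g5", "g7"]   # no signal

# Binary view: threshold the statistics, then test each set by ORA.
threshold = 1.96
active = {g for g, v in z.items() if v > threshold}
p_a = ora_pvalue(len(z), len(active), len(set_a), len(active & set(set_a)))
p_b = ora_pvalue(len(z), len(active), len(set_b), len(active & set(set_b)))

# Continuous view: use the statistics themselves.
llr_a = set_log_likelihood_ratio([z[g] for g in set_a])
llr_b = set_log_likelihood_ratio([z[g] for g in set_b])

print(f"ORA p-values: set_a={p_a:.3f}, set_b={p_b:.3f}")
print(f"Log-likelihood ratios: set_a={llr_a:.2f}, set_b={llr_b:.2f}")
```

After thresholding, both sets contain no active genes and receive the same ORA p-value, while the continuous score separates the consistently elevated set from the one with no signal.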