Seminar announcement: EM meets Boosting in big genomic data analysis

Title: EM meets Boosting in big genomic data analysis

Speaker: Prof. Can Yang (杨灿), Department of Statistics, Hong Kong Baptist University





Recent international projects, such as the Encyclopedia of DNA Elements (ENCODE) project, the Roadmap project and the Genotype-Tissue Expression (GTEx) project, have generated vast amounts of genomic annotation data, e.g., epigenome and transcriptome data. There is great demand for effective statistical approaches to integrate genomic annotations with the results from genome-wide association studies. In this talk, we introduce a statistical framework, named IMAC, for integrating multiple annotations to characterize functional roles of genetic variants that underlie human complex phenotypes. For a given phenotype, IMAC can adaptively incorporate relevant annotations for prioritization of genetic risk variants, allowing nonlinear effects among these annotations, such as interaction effects between genomic features. Specifically, we assume that the prior probability of a variant being associated with the phenotype is a function F(X) of its annotations, where X is the collection of the annotation status and F(X) is an ensemble of decision trees, i.e., F(X) = \sum_k f_k(X), where each f_k(X) is a shallow decision tree. We have developed an efficient EM-Boosting algorithm for model fitting, where a shallow decision tree grows in a gradient-boosting manner (Friedman J. 2001) at each EM iteration. Our framework inherits the nice properties of gradient boosted trees: (1) The gradient ascent property of the Boosting algorithm naturally guarantees the convergence of our EM-Boosting algorithm. (2) Based on the fitted ensemble \hat{F}(X), we are able to rank the importance of annotations, measure the interaction among annotations and visualize the model via partial dependence plots (Friedman J. 2005). Using IMAC, we performed integrative analysis of genome-wide association studies on human complex phenotypes and genome-wide annotation resources, e.g., Roadmap epigenome. The analysis results revealed interesting regulatory patterns of risk variants. These findings deepen our understanding of genetic architectures of complex phenotypes.
The statistical framework developed here is also broadly applicable to many other areas for integrative analysis of rich data sets.
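
To make the idea of growing one shallow tree per EM iteration concrete, here is a minimal NumPy sketch. It is not the IMAC implementation: the abstract does not specify the likelihood, so the sketch assumes a two-groups model in which null p-values are Uniform(0, 1), associated p-values are Beta(alpha, 1), and the prior probability of association is sigmoid(F(X)). The function names (`fit_stump`, `em_boosting`, `simulate`), the depth-1 trees, the learning rate and the simulated data are all illustrative assumptions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_stump(X, r):
    """Least-squares depth-1 tree on binary annotations X (n x d) for target r."""
    best = (np.inf, 0, r.mean(), r.mean())  # fallback: constant fit
    for j in range(X.shape[1]):
        on = X[:, j] == 1
        if on.all() or not on.any():
            continue  # annotation is constant, cannot split on it
        v0, v1 = r[~on].mean(), r[on].mean()
        sse = ((r[~on] - v0) ** 2).sum() + ((r[on] - v1) ** 2).sum()
        if sse < best[0]:
            best = (sse, j, v0, v1)
    return best[1:]  # (annotation index, value if X=0, value if X=1)

def em_boosting(p, X, n_iter=500, lr=1.0):
    """Generalized EM under the ASSUMED two-groups model described above."""
    p = np.clip(p, 1e-300, 1.0)
    log_p = np.log(p)
    F = np.full(len(p), -2.0)  # initial log-odds of association
    alpha = 0.5                # initial Beta(alpha, 1) shape
    for _ in range(n_iter):
        # E-step: posterior probability that each variant is associated
        pi = sigmoid(F)
        lik1 = alpha * p ** (alpha - 1.0)  # Beta(alpha, 1) density; null density is 1
        gamma = pi * lik1 / (pi * lik1 + 1.0 - pi)
        # M-step for alpha: closed-form weighted maximum likelihood
        alpha = -gamma.sum() / (gamma * log_p).sum()
        # Boosting step: gamma - pi is the gradient of the expected
        # complete-data log-likelihood with respect to F; fit one stump to it
        j, v0, v1 = fit_stump(X, gamma - pi)
        F = F + lr * np.where(X[:, j] == 1, v1, v0)
    return F, alpha, gamma

def simulate(n=20000, d=3, alpha=0.2, seed=0):
    """Toy data: only annotation 0 changes the prior probability of association."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n, d))
    prior = sigmoid(-3.0 + 2.5 * X[:, 0])
    z = rng.random(n) < prior               # latent association status
    u = rng.random(n)
    p = np.where(z, u ** (1.0 / alpha), u)  # Beta(alpha, 1) via inverse CDF
    return p, X, z

p, X, z = simulate()
F, alpha_hat, gamma = em_boosting(p, X)
print("estimated alpha:", round(float(alpha_hat), 3))
print("mean prior, annotation 0 present:", sigmoid(F[X[:, 0] == 1]).mean())
print("mean prior, annotation 0 absent:", sigmoid(F[X[:, 0] == 0]).mean())
```

In this sketch the tree update only needs to improve the expected log-likelihood rather than maximize it, which is what makes it a generalized EM step. Deeper trees would be needed to capture interaction effects between annotations; depth-1 stumps are used here only to keep the example short.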


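Prof. Can Yang received his PhD from the Department of Electronic and Information Engineering at the Hong Kong University of Science and Technology in 2011. He was a postdoctoral researcher at Yale University from 2011 to 2012 and an associate research scientist at Yale University from 2012 to 2014. Since 2014, he has been an assistant professor in the Department of Mathematics at Hong Kong Baptist University. In 2012, he was named the winner of the 2012 Hong Kong Young Scientist Award. His research interests focus on statistical genomics, bioinformatics, pattern recognition and machine learning.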