Deep Learning-based Ab Initio Protein Structure Prediction and Structure-based Protein Function Annotation

Zhang, Chengxin

dc.contributor.author	Zhang, Chengxin
dc.date.accessioned	2021-02-04T16:38:02Z
dc.date.available	2023-01-01
dc.date.available	2021-02-04T16:38:02Z
dc.date.issued	2020
dc.date.submitted	2020
dc.identifier.uri	https://hdl.handle.net/2027.42/166121
dc.description.abstract	Predicting a protein's structure from its sequence (especially in the absence of structure templates) and deducing its biological function from that structure remain significant unsolved problems. Much of the recent progress in ab initio (i.e., template-free) modeling of protein structure is due to the introduction of deep learning-predicted inter-residue contacts and, even more recently, inter-residue distances. We present D-QUARK, an ab initio protein folding algorithm guided by residue-residue distances and orientations predicted by deep learning. The D-QUARK pipeline is distinct from existing protein folding programs in the following aspects. Firstly, for a target sequence, it generates a high-quality multiple sequence alignment (MSA) of deep and diverse sequence homologs using the in-house DeepMSA algorithm. Secondly, to generate input features for deep learning prediction of distances and orientations from the MSA, raw coevolution features are extracted in the form of a covariance matrix and pseudo-likelihood maximization parameters, rather than traditional post-processed coevolutionary features. Thirdly, the distance and orientation potentials are incorporated into a comprehensive replica-exchange Monte Carlo (REMC) simulation with a uniquely designed flat-well potential for ab initio protein folding. The high-quality MSA, accurate deep learning prediction, and REMC simulation with carefully designed energy terms all contribute to the high performance of D-QUARK. In terms of first-model TM-score, D-QUARK outperforms our previous ab initio protein folding algorithm, QUARK, by 108.8%, and two state-of-the-art distance-based structure prediction programs, DMPfold and trRosetta, by 22.9% and 11.4%, respectively. In a post-CASP experiment, D-QUARK achieves an 8.1% higher first-model TM-score on CASP13 FM target proteins than AlphaFold.
To annotate protein functions, including Gene Ontology (GO) terms, Enzyme Commission (EC) numbers, and ligand binding sites, from a predicted structure model, we developed COFACTOR. COFACTOR combines functional templates identified by structure alignment against the target structure model with sequence homologs and protein-protein interaction partners to derive consensus function annotations. COFACTOR was blindly tested in the community-wide CAFA3 function annotation challenge and was ranked among the top groups. The structure and function prediction pipeline developed in this thesis was applied to proteome-wide annotation projects for several model organisms, including human and the JCVI-syn3.0 minimal bacterial genome, where our pipeline reveals previously uncharacterized proteins with important functions. Overall, we showed the impact of deep learning on protein structure and function prediction, and demonstrated its utility for reliable and scalable modeling.
dc.language.iso	en_US
dc.subject	Protein Structure Prediction
dc.subject	Protein Function Annotation
dc.subject	Multiple Sequence Alignment (MSA)
dc.subject	Human Proteome
dc.subject	Deep Learning
dc.subject	JCVI-syn3.0 minimal bacterial genome
dc.title	Deep Learning-based Ab Initio Protein Structure Prediction and Structure-based Protein Function Annotation
dc.type	Thesis
dc.description.thesisdegreename	PhD	en_US
dc.description.thesisdegreediscipline	Bioinformatics
dc.description.thesisdegreegrantor	University of Michigan, Horace H. Rackham School of Graduate Studies
dc.contributor.committeemember	Omenn, Gilbert S
dc.contributor.committeemember	Ohi, Melanie D
dc.contributor.committeemember	Carlson, Heather A
dc.contributor.committeemember	Freddolino, Peter Louis
dc.contributor.committeemember	Guan, Yuanfang
dc.contributor.committeemember	Richardson, Rudy J
dc.subject.hlbsecondlevel	Molecular, Cellular and Developmental Biology
dc.subject.hlbtoplevel	Science
dc.description.bitstreamurl	http://deepblue.lib.umich.edu/bitstream/2027.42/166121/1/zcx_1.pdf
dc.identifier.doi	https://dx.doi.org/10.7302/44
dc.identifier.orcid	0000-0001-7290-1324
dc.identifier.name-orcid	Zhang, Chengxin; 0000-0001-7290-1324	en_US
dc.working.doi	10.7302/44	en
dc.owningcollname	Dissertations and Theses (Ph.D. and Master's)

Files in this item

Name: zcx_1.pdf
Size: 26.70 MB
Format: PDF