IIS-2128307 EAGER: Integration and analysis of high-dimensional dataset

National Science Foundation Award Number: NSF IIS-2128307 (October 1, 2021 - NCE September 30, 2024)

Contact Information

Lanjing Zhang, MD

Department of Chemical Biology
Ernest Mario School of Pharmacy

Rutgers University
Office Room #: 107, 164 Frelinghuysen Rd.

Piscataway, NJ 08854
E-mail: Lanjing.Zhang #at# rutgers.edu, URL: https://thezhanglab.github.io/

List of Contributors

§ Wei Vivian Li, PhD, co-Principal Investigator, School of Public Health, Rutgers University (now at UC Riverside)

§ Jinchuan Xing, PhD, co-Principal Investigator, School of Arts and Sciences, Rutgers University

§ Nan Gao, PhD, co-Principal Investigator, School of Arts and Sciences, Rutgers University Newark

§ Fei Deng, Postdoctoral research associate, School of Pharmacy, Rutgers University

§ Catherine Feng, Summer undergraduate student (REU student), Harvard College, Cambridge, MA; previously a high school student at Montgomery High School, NJ

Project Award Information

· Award Number: IIS-2128307

· Duration: 10/01/2021-9/30/2024 (no cost extension)

· Title: EAGER: Integration and analysis of high-dimensional dataset

· Keywords: Big data, normalization, genomics, external validation

Project Summary

In recent years, massive and complex datasets, such as those from facial recognition systems, autonomous cars, medical imaging, and single-cell biology, have grown dramatically in number and size. Machine learning, as part of artificial intelligence, has been used to combine and understand these datasets. Although the current mainstream machine learning algorithms have performed well, they are primarily mathematics-based and abstracted from their sources. As a result, these algorithms neither consider nor incorporate the rich knowledge of the domains that produced these datasets. This project therefore aims to examine whether and how domain knowledge influences the outcomes of machine learning algorithms when combining and analyzing massive and complex datasets. If successful, this project will develop and substantially validate a domain-knowledge-driven computing framework. This project will enable scientists and engineers in various fields to apply their domain knowledge to better combine and analyze massive and complex datasets. Additional insights will also be generated to understand and improve the machine learning algorithms themselves. Therefore, the findings of this project will promote the progress of science and can directly advance biomedical fields and human health.

Technically, this project aims to address the knowledge gap in mathematics-driven integration and analysis of high-dimensional datasets. This gap has limited the full and robust integration of large, high-dimensional datasets. Moreover, external validation is required for rigorous examination of tuned machine learning algorithms. However, a majority of the studies on high-dimensional biomedical datasets did not use external validation, largely due to missing data. Therefore, this project will improve the integration and analysis of high-dimensional datasets using domain-knowledge-based data normalization, missing-data imputation and dimensionality reduction. As a proof of principle, the project also aims to develop and validate an adaptive multimetric pipeline to integrate various types of multiomic data using novel feature-selection and dimensionality-reduction algorithms. The resulting pipeline and package will enable researchers to better understand and classify high-dimensional datasets in biomedical and other fields. The project will result in a paradigm shift because domain-knowledge-driven data normalization, data imputation and dimensionality reduction are radically different from the mainstream mathematics-driven approaches. Finally, this project also aims to expose undergraduate and high-school students who are interested in Computer Science to experiences in machine learning and data science.
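For readers unfamiliar with the three steps named above, the following is a minimal sketch of a generic, purely mathematics-driven baseline: median imputation, z-score normalization, and projection onto the first principal component. It is not the project's pipeline or the URFE package. The function names, the toy data, and the choice of median, z-score and power iteration are all assumptions made for illustration. A domain-knowledge-driven approach would presumably replace each of these statistical defaults with a rule informed by how the data were generated.

```python
from statistics import fmean, median, pstdev

def impute_median(rows):
    """Replace missing values (None) with the median of the observed values in each column."""
    medians = [median(v for v in col if v is not None) for col in zip(*rows)]
    return [[medians[j] if v is None else v for j, v in enumerate(row)] for row in rows]

def zscore(rows):
    """Center each column to mean 0 and scale to unit (population) standard deviation."""
    cols = list(zip(*rows))
    means = [fmean(col) for col in cols]
    stds = [pstdev(col) for col in cols]
    return [
        [(v - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j, v in enumerate(row)]
        for row in rows
    ]

def first_component(rows, iters=200):
    """Approximate the leading principal axis by power iteration on the covariance matrix.

    Assumes the columns are already centered. The starting vector is an arbitrary
    choice; in degenerate cases it could be orthogonal to the leading axis.
    """
    n, p = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / n for j in range(p)] for i in range(p)]
    v = [1.0 / (j + 1) for j in range(p)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v

def project(rows, axis):
    """Reduce each sample to a single score along the given axis."""
    return [sum(a * b for a, b in zip(row, axis)) for row in rows]

# Hypothetical toy data: 4 samples x 3 features, with None marking missing values.
samples = [
    [1.0, 2.0, None],
    [2.0, None, 6.5],
    [3.0, 6.0, 9.0],
    [4.0, 8.5, 12.0],
]

imputed = impute_median(samples)
scaled = zscore(imputed)
component = first_component(scaled)
scores = project(scaled, component)
```

Each step here is chosen without reference to what the features mean, which is the limitation the project describes. For example, a column median treats a value missing because of a detection limit the same way as a value missing at random.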

Publications and Products:

Note: Full-text papers can be searched for and downloaded as PDFs, where legally available, at the PI's ResearchGate page.

Journal articles

· Ryu E, Xia HH, Guo GL, Zhang L. "Multivariable-adjusted trends in mortality due to alcoholic liver disease among adults in the United States, from 1999-2017." American journal of translational research, 2022, 14(2): 1092-1099 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8902556/

· Cui M, Cheng C, Zhang L. "High-throughput proteomics: a methodological mini-review" Laboratory Investigation, 2022, 102(11): 1170-1181 https://doi.org/10.1038/s41374-022-00830-7

· Zhang L. "The Challenges and Opportunities of Translational Pathology" Journal of Clinical and Translational Pathology, 2022, 2(2): 63-66 https://doi.org/10.14218/jctp.2022.00001

· Shrestha D, Bag A, Wu R, Zhang Y, Tang X, Qi Q, Xing J, Cheng Y. Genomics and epigenetics guided identification of tissue-specific genomic safe harbors. Genome Biology 2022, 23: 199 https://doi.org/10.1186/s13059-022-02770-3

· Cheng A, Hu G, Li WV. Benchmarking cell-type clustering methods for spatially resolved transcriptomics data. Brief Bioinform 2023, 24(1): bbac475 https://doi.org/10.1093/bib/bbac475

· Balasubramanian I, Bandyopadhyay S, Flores J, Smak JB, Lin X, Liu H, Sun S, Golovchenko NB, Liu Y, Wang D, Patel R, Joseph II, Suntornsaratoon P, Vargas J, Green PHR, Bhagat Govind, Lagana SM, Ying W, Zhang Y, Wang Z, Li WV, Singh S, Zhou Z, Kollias G, Farr LA, Moonah SN, Yu S, Wei Z, Ferraris R, Bonder EM, Zhang L, Kiela PR, Edelblum KL, Liu TL, Gao N. Infection and inflammation stimulate expansion of a CD74+ Paneth cell subset to regulate disease progression. EMBO J. 2023 Nov 2;42(21):e113975 DOI: 10.15252/embj.2023113975 PMID: 37718683

· Hu K, Zhang L. Challenges and Opportunities Associated with Lifting the Zero COVID-19 Policy in China. Explor Res Hypothesis Med. 2024 Jan-Mar;9(1):71-75. doi: 10.14218/erhm.2023.00002. Epub 2023 Mar 8. PMID: 38572142; PMCID:PMC10989839.

· Deng F, Zhao L, Yu N, Lin Y, Zhang L. Union With Recursive Feature Elimination: A Feature Selection Framework to Improve the Classification Performance of Multicategory Causes of Death in Colorectal Cancer. Lab Invest. 2024 Mar;104(3):100320. doi: 10.1016/j.labinv.2023.100320. Epub 2023 Dec 28. PMID: 38158124.

· Liang Y, Guo GL, Zhang L. Current and Emerging Molecular Markers of Liver Diseases: A Pathogenic Perspective. Gene Expression 2022; 21(1): 9-19. doi: 10.14218/GEJLR.2022.00010 PMCID: PMC11192043

· Suntornsaratoon P, Antonio JM, Flores J, Upadhyay R, Veltri J, Bandyopadhyay S, Dadala R, Kim M, Liu Y, Balasubramanian I, Turner JR, Su X, Li WV, Gao N, Ferraris RP. (2024) Lactobacillus rhamnosus GG Stimulates Dietary Tryptophan-Dependent Production of Barrier-Protecting Methylnicotinamide. Cell Mol Gastroenterol Hepatol. 18(2):101346. doi: 10.1016/j.jcmgh.2024.04.003. Online ahead of print. PMID: 38641207

· Suntornsaratoon P, Ferraris RP, Ambat J, Antonio JM, Flores J, Jones A, Su X, Gao N, Li WV. (2024) Metabolomic and Transcriptomic Correlative Analyses in Germ-Free Mice Link Lacticaseibacillus rhamnosus GG-Associated Metabolites to Host Intestinal Fatty Acid Metabolism and β-Oxidation. Lab Invest. 104(4):100330. doi: 10.1016/j.labinv.2024.100330. Epub 2024 Jan 18. PMID: 38242234

· Cui M, Deng F, Disis ML, Cheng C, Zhang L. Advances in the Clinical Application of High-throughput Proteomics. Explor Res Hypothesis Med (in press).

Project Impact

§ Education: Parts of the project results are used in data science education for high school students. We hosted a series of lectures at Montgomery High School, New Jersey. Catherine Feng also founded a data science club at that high school (url: https://montydsc.wordpress.com/ ). We also involved several high school and undergraduate students in the project, and they became very interested in machine learning and its applications. Most of the software and code developed in this project have been made publicly available (see below). All new progress will be added to the other research collections upon completion.

§ Collaborations: For this project we have established collaborations with several schools of Rutgers University and with Montgomery High School, New Jersey. Through these collaborations we expect to explore many real-world applications and achieve greater research impact.

Current and Future Activities

The following are some of the highlights of our ongoing work.

1. Develop highly sensitive and specific machine learning algorithms to classify non-cancer causes of death in cancer patients.

2. Study effective and scalable methods for improving machine learning fairness.

Potential Related Project(s)

· R37CA277812: SCH: Screening and confirmatory machine learning for explainable modeling of non-cancer deaths in cancer patients

Project Web site URL: https://thezhanglab.github.io/EAGER.html

Project Abstract and Outcome Report:

Online software: Available for download at https://github.com/FeiDeng-RUTGERS/URFE.