NSF-HDR: Biology-guided Neural Networks for Discovering Phenotypic Traits

Abstract

Unlike genetic data, the traits of organisms, such as their visible features, are not available in databases for analysis. The lack of machine-readable trait data has slowed progress on four grand challenge problems in biology: predicting the genes that generate traits, understanding the patterns of evolution, predicting the effects of ecological change, and species identification. This project will use advances in machine learning and machine-readable biological knowledge to create a new method to automatically identify traits from images of organisms. Images of organisms are widely available, and this new method could be used to rapidly harvest traits that could be used to solve the grand challenges in biology. Large image collections and corresponding digital data from fishes will be used in this study because of the extensive resources available for these organisms. The new machine learning model can be generalized to other disciplines that have similar machine-readable knowledge, and it will help in explaining the results of artificial intelligence, thus advancing the field of computer science. The new method stands to benefit society in application to areas such as agriculture or medicine, where trait discovery from images is critical in disease diagnosis. The project will support the education of students and postdocs in biology, computer science, and information science. It will disseminate its findings through workshops, presentations, publications, and open access to data and code that it produces.

This project will leverage advances in state-of-the-art machine learning to develop a novel class of artificial neural networks that can exploit the machine-readable and predictive knowledge about biology that is available in the form of phylogenies and anatomy ontologies. These biology-guided neural networks are expected to automatically detect and predict traits from specimen images, with little training data. Image-based trait data derived from this work will enable progress in gene-phenotype mapping to novel traits and understanding patterns of evolution. The resulting machine learning model can be generalized to other disciplines that have formally structured knowledge, and will contribute to advances in computer science by going beyond black-box learning and making important advances toward Explainable Artificial Intelligence. It may be extended to applied areas, such as agriculture or the biomedical domain. The research will be piloted using teleost fishes because of many high-quality data resources (digital images, evolutionary trees, anatomy ontology). Methods for automated metadata quality assessment and provenance tracking will be developed in the course of this project to ensure the results and processes are verifiable, replicable, and reusable. These will broadly impact the many domains that will adopt machine learning as a way to make discoveries from images. This convergent research will accelerate scientific discovery across the biological sciences and computer science by harnessing the data revolution in conjunction with biological knowledge.

This project is part of the National Science Foundation’s Harnessing the Data Revolution (HDR) Big Idea activity, and is jointly supported by the HDR and the Division of Biological Infrastructure within the NSF Directorate for Biological Sciences.

This award reflects NSF’s statutory mission and has been deemed worthy of support through evaluation using the Foundation’s intellectual merit and broader impacts review criteria.

Project PIs and Drexel team members

  • Jane Greenberg, Alice B. Kroger Professor, Information Science Department, CCI/Drexel
  • David Breen, Professor and Associate Department Head, Computer Science Department, CCI/Drexel
  • Joel Pepper, Doctoral student, Computer Science Department, CCI/Drexel (2020-)
  • Kevin Karnani, CS Co-op, CCI/Drexel (Fall ’21-Winter ’22)
  • Jeremy Leipzig, Bioinformatics Developer/Engineer (2019-2021)

Biology-guided Neural Networks for Discovering Phenotypic Traits is supported by NSF grant #1940233.

Collaborating PIs

  • Hank Bart, Professor and Director, Tulane University Biodiversity Research Institute
  • Anuj Karpatne, Assistant Professor, Department of Computer Science, Virginia Tech
  • Paula Mabee, Chief Scientist and Observatory Director, National Ecological Observatory Network (NEON)
  • Murat Maga, Assistant Professor, Seattle Children’s Hospital