News & Events

AI-Ready Data: Navigating the Dynamic Frontier of Metadata and Ontologies

ID4: Institute for Data Driven Dynamical Design

Hosted by the Metadata Research Center, College of Computing & Informatics, Drexel University

AI-ready data refers to high-quality, well-prepared data that is optimized for use in artificial intelligence (AI) applications. AI-ready data increasingly includes metadata and ontologies that enhance the value and usability of data. Metadata provides essential context and information about the data, and ontologies offer a structured semantic representation of a particular domain. These additional layers of information help data scientists, researchers, and AI systems understand and interpret the data and apply appropriate algorithms and models for analysis. Metadata and ontologies enable consistent data integration, interoperability, and knowledge sharing across systems, while facilitating more knowledgeable AI applications. Additionally, these systems are proving vital for supporting the FAIR (Findable, Accessible, Interoperable, and Reusable) principles and reproducible computational research (RCR).

Despite these capacities, approaches for developing, implementing, and sustaining metadata and ontologies within AI-ready data pipelines remain inconsistent, cumbersome, and insufficiently supported. Challenges span the full data lifecycle, from data creation, collection, and research to the longer-term aims of data preservation, archiving, reuse, and support for research reproducibility. Collective, community-driven efforts are needed to address current obstacles and maximize the value and reliability of data. The AI-Ready Data: Navigating the Dynamic Frontier of Metadata and Ontologies workshop is a step toward addressing this challenge. This workshop will bring together a community of individuals with expertise across the data lifecycle to discuss issues, share solutions, and chart a path forward for addressing key challenges in preparing AI-ready data for scientific research.

Specific workshop goals are to:

  1. Collectively define the state of AI-ready data challenges in the metadata and ontology space.
  2. Share current successes and solutions leveraging metadata standards and ontologies.
  3. Contribute to a road map to accelerate the preparation of data for artificial intelligence (AI) applications.

Current topics

  • What is AI-ready data?
  • Research Bottlenecks: Data Life Cycle Challenges and Solutions with Scientific Data
  • Metadata and Ontologies: Human in the Loop in the Era of LLMs
  • Annotation: Large-scale Data and Balancing Human- and Machine-Driven Approaches
  • Standards Development, Adoption, and Implementation: Realities and Fictions
  • Knowledge Graphs
  • Ontology Guided Knowledge Extraction: Leveraging Scholarly Big Data for Scientific Discovery
  • Future Directions with Metadata and Knowledge Organization Systems

2024 Alice B. Kroeger Talk sponsored by the Metadata Research Center, College of Computing & Informatics, Drexel University

Navigating the Data Deluge: AI, Infrastructure, and Decision-Making in the Era of Big Data

Joshua C. Agar, Assistant Professor, Department of Mechanical Engineering and Mechanics, Drexel University

  • Date/time: Wednesday, February 28, 2024, @ 12:00 PM ET
  • Location/in person: Room 912 (9th floor), College of Computing & Informatics (CCI), Drexel University, 3675 Market Street (please send your name to mrc.metadata@drexel.edu for CCI access).
  • Virtual attendees: email mrc.metadata@drexel.edu for the Zoom link.

Science has traditionally harnessed data to inform decisions. Historically, data was sufficiently low-dimensional and manageable for human processing. However, the rapid expansion of sensing technologies across disciplines has overwhelmed traditional human-centric methods with vast, high-velocity data streams from diverse and often unreliable sources. Despite the remarkable advances in computers and large language models like ChatGPT, their capabilities remain limited. Current AI algorithms predominantly excel in interpolation, not extrapolation, leading to unrealistic and nonsensical outputs when stretched beyond their training data.

This talk explores the intersection of massive data influx and AI, focusing on their limitations and potential in enhancing decision-making, particularly in data-driven infrastructure. We propose a “humanistic carrot” approach, rather than a “stick,” to address pressing challenges in scientific data management, spotlighting DataFed, a comprehensive data management system. This platform facilitates autonomous pipelines for the curation, sharing, searching, and fine-grained access control of data and metadata. We demonstrate how DataFed can streamline data management for experimentalists, enhancing data stewardship while reducing their workload.

We also delve into the intricacies of handling high-velocity data streams, where gigabits per second of data necessitate immediate processing for critical decision-making or autonomous control. This section covers deploying high-availability inference servers for on-demand data analysis and reduction. Additionally, we explore the concept of AI co-design, where algorithms are optimized to fit on programmable logic, enabling rapid, intelligent analysis, decision-making, and control on ultra-low cost, low-power devices at unprecedented speeds. Finally, we discuss the broad applicability of these methodologies across various fields, from particle physics to astronomy, highlighting their potential to revolutionize our approach to data and AI integration.

Dr. Joshua C. Agar is an Assistant Professor in the Department of Mechanical Engineering and Mechanics at Drexel University. With a foundational background in experimental materials science, Dr. Agar is best known for his pioneering contributions to AI algorithms, computing infrastructure, and the development of cyber-physical systems in the fields of materials synthesis and microscopy. His expertise has been applied across a wide array of disciplines, including particle and plasma physics, materials science, and fluid dynamics. An active member of various AI communities, particularly the FastML community, which emphasizes ultra-low latency ML co-design, Dr. Agar has earned recognition as a leader in AI innovation. His work has garnered attention from prestigious institutions such as the National Academy of Engineering and the National Science Foundation.

Ajani Levere Presents STAR Scholar Project

Ajani Levere, a Drexel University STAR Scholar working with Drs. Jane Greenberg and David Breen, presented their imageomics research at the STAR Scholars showcase on August 31, 2023. Their presentation was titled “Computational Fish Specimen Classification: Advancing Machine Learning Model Accuracy” and was part of the NSF-HDR: Biology-guided Neural Networks for Discovering Phenotypic Traits project. Ajani’s research is continuing under the guidance of Dr. Greenberg. They describe their project as follows:

Digital specimen metadata is valuable for scientific research and discovery, yet sparse specimen metadata availability restricts its potential. In addition to computational efforts made to remedy this issue, Machine Learning (ML) classification was performed on a computed metadata component, the outline extracted from fish specimen images. An ML model (MLM) approach provided a computational genus classification for a given fish outline. This research improves the MLM’s ability to accurately classify fish from their 2D outlines and demonstrates the expressiveness of this computed metadata item.  

In our analysis, we inspected the outlines of the error cases, followed by a statistical review of their numerical data. We discovered our dataset limited higher MLM accuracy potential. Refactoring the dataset with a reduced feature length thus enhanced our dataset for MLM interpretability. Experimental results indicate a 96% accuracy, a 5% improvement over previous results. These results confirm the outline as a unique and highly distinguishable metadata component. Computing metadata components of this nature aids the development of a more robust metadata catalog for ML researchers.

Jane Greenberg Receives ASIS&T Research in Information Science Award

The Metadata Research Center congratulates Jane Greenberg, its Director and Founder, for receiving the Association for Information Science & Technology’s (ASIS&T) 2023 Research in Information Science Award. The award “recognizes an individual or team who has made an outstanding contribution to information science research. The award is for a systematic ‘program of research’ in a single area at a level beyond the single study.” ASIS&T recognized Dr. Greenberg’s wide-ranging contributions, including her current positions as principal investigator on the Metadata Capital Initiative (MetaDataCAPT’L), the NSF-funded Institute for Data Driven Dynamical Design (ID4), and the IMLS-funded project LEADING (LIS Education and Data Science Integrated Network Group). The award committee also singled out her work with the Biology-guided Neural Network (BGNN) project and the Helping Interdisciplinary Vocabulary Engineering (HIVE) tool. Please click here for the full press release from ASIS&T, “Jane Greenberg Receives Association for Information Science and Technology (ASIS&T) Research in Information Science Award.”

Summer 2023 NSF Research Experiences for Undergraduates (REU) Opportunities at the MRC

Two (2) virtual National Science Foundation Research Experiences for Undergraduates (REU) opportunities are available at the Metadata Research Center, Drexel University, as part of the Harnessing the Data Revolution (HDR) Institute for Data Driven Dynamical Design (ID4).

Dates: Mid-July through Mid-September

REU stipend: $5,500

Deadline: Rolling basis (Friday, July 7th, for first consideration)

Contacts:

Interested applicants, please send a resume and a brief statement of interest (1 paragraph) indicating: 1) which REU option you would like to apply for, and 2) why you would like to participate in the REU program.

Please send your application to:

REU Option 1: Materials Science Repository Semantics

Standards are an integral component of data repository infrastructure and support FAIR (findable, accessible, interoperable, and reusable) data. Terminology, specifically the language (vocabulary) used to represent data, is standardized through metadata and semantic ontologies. This REU will focus on investigating metadata infrastructures across a subset of materials science repositories, looking specifically at the terminological representation used and its alignment with semantic ontologies.

REU applicants for this project should have:

  • Some disciplinary exposure to chemistry, engineering, physics, and/or materials science.
  • Interest in semantic systems (terminology/vocabulary) and their value for representation, machine learning, and AI
  • Appreciation of standards for communication: human to human, human to machine, and machine to machine
  • Knowledge of Excel, Tableau, Orange, or other data science software that allows analysis and visualization, or interest in learning
  • Python, R, or other coding experience helpful, but not necessary

Research Goals

  • Explore similarities and differences of standards and data representation practices across a subset of materials science data representations.
  • Analyze and visualize data representation, specifically metadata and semantic systems.
  • Assess the effectiveness of standards and identify areas needing more attention.

Learning Goals

  • Gain knowledge of how metadata standards and semantic ontologies are key to the FAIR data principles
  • Advance analytical and visualization research skills
  • Obtain a better understanding of the relationship of standards to ML/AI

REU Option 2: Metal-Organic Frameworks (MOFs) Synthesis Extraction from Scholarly Big Data

Metal-Organic Frameworks (MOFs) are a class of crystals (natural or synthetic) that has advanced the field of materials and solid-state sciences over the last quarter century. The synthesis procedures often reported in the literature can play a critical role in data-driven discovery of MOF materials. Unfortunately, this valuable knowledge is significantly underutilized because it remains buried in text, which is unstructured and not machine understandable. This challenge is exacerbated because it is simply not feasible for human researchers to read every article in their fields, given that there are thousands of publications and the number is still growing exponentially. In this project, students will work with researchers at Drexel University’s Metadata Research Center, the University of Central Florida, and the Colorado School of Mines, connected with the NSF/ID4 (Institute for Data Driven Dynamical Design) project. The focus will be on investigating the use of natural language processing techniques to extract key synthesis knowledge from unstructured text data. We seek to develop robust deep learning models that enable automatic knowledge extraction and ultimately construct knowledge graphs from the scholarly corpus. REU summer students will gain a deeper understanding of natural language processing and the use of large pre-trained language models through the text annotation process.

Research Goals

  • Pre-train language models for downstream NLP tasks in materials science
  • Develop different deep learning models to improve extraction performance
  • Construct solid external knowledge sources (e.g., taxonomy, ontology) for future research

Learning Goals

  • Gain knowledge of deep learning frameworks such as PyTorch
  • Learn how to generate language representations as features for deep learning models
  • Obtain a better understanding of the complete workflow of information extraction (named entity recognition/relation extraction)

LEADING Moves Forward: LEADING Forum and Welcome 2023 Fellows

On Friday, May 19th, the 2023 LEADING Forum took place in the Science Center/Quorum at Drexel University. Highlights from the Forum included a panel on ChatGPT, a fellows panel, a fellow poster session, and keynote presentations from Florence Hudson (Executive Director, Northeast Big Data Innovation Hub at Columbia University) and Laurie Allen (Chief, Digital Innovation Lab (LC Labs), Library of Congress). The 2023 OCLC/LEADING Data Challenge preceded the forum on Thursday, May 18th.

In moving forward, we welcome the incoming LEADING 2023 fellows. This year’s cohort includes 18 fellows from 14 iSchools and LIS institutions from across the country. June kicks off the 2023 LEADING boot camp, which precedes the 6-month fellowship period. Read more about the 2023 fellows here.

Scott McClellan presents at 20th RDA Plenary’s Session on Materials Science Ontologies

Scott McClellan, a second-year doctoral student, presented research results to the “Data representation in materials and chemicals based on harmonised domain ontologies” birds of a feather group at the Research Data Alliance’s 20th Plenary meeting in Gothenburg, Sweden, on March 21-23, 2023. His presentation, titled “Along the Border: Term Overlap Among 5 Matportal Ontologies,” focused on term overlap among a subset of ontologies maintained at the Matportal repository. It looked at how term matching algorithms for materials science semantic artifacts differed when locating terminological or URI results. His presentation stemmed from prior research done with Drs. Yuan An and Jane Greenberg and fellow graduate student Xintong Zhao. [Slides]
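
To make the distinction between terminological (label) matches and URI matches concrete, here is a minimal Python sketch. It is not the method used in the presentation: the three toy ontologies, their example.org URIs, and the helper names (normalize, uri_overlap, label_overlap, overlap_table) are all hypothetical, chosen only to show how the two matching strategies can give different overlap counts for the same pair of vocabularies.

```python
"""Toy comparison of label-based and URI-based term overlap between ontologies."""
from itertools import combinations

# Hypothetical miniature ontologies: each maps a term URI to its label.
# These are illustrative placeholders, not data from Matportal.
ONTOLOGIES = {
    "onto_a": {
        "http://example.org/a/Material": "Material",
        "http://example.org/shared/Crystal": "Crystal",
        "http://example.org/a/GrainSize": "grain size",
    },
    "onto_b": {
        "http://example.org/b/Material": "material",
        "http://example.org/shared/Crystal": "Crystal",
        "http://example.org/b/Hardness": "Hardness",
    },
    "onto_c": {
        "http://example.org/c/Grain_Size": "Grain_Size",
        "http://example.org/c/Hardness": "hardness",
    },
}


def normalize(label: str) -> str:
    """Lowercase a label and treat underscores, hyphens, and extra spaces alike."""
    return " ".join(label.lower().replace("_", " ").replace("-", " ").split())


def uri_overlap(left: dict, right: dict) -> set:
    """Terms shared because both ontologies use the identical URI."""
    return set(left) & set(right)


def label_overlap(left: dict, right: dict) -> set:
    """Terms shared because their normalized labels match."""
    left_labels = {normalize(label) for label in left.values()}
    right_labels = {normalize(label) for label in right.values()}
    return left_labels & right_labels


def overlap_table(ontologies: dict) -> dict:
    """Count URI matches and label matches for every pair of ontologies."""
    table = {}
    for a, b in combinations(sorted(ontologies), 2):
        table[(a, b)] = {
            "uri": len(uri_overlap(ontologies[a], ontologies[b])),
            "label": len(label_overlap(ontologies[a], ontologies[b])),
        }
    return table


if __name__ == "__main__":
    for pair, counts in overlap_table(ONTOLOGIES).items():
        print(pair, counts)
```

In this toy data, onto_a and onto_b share one term by URI but two by label, because "Material" and "material" live under different namespaces. A real comparison would likely need richer normalization (synonyms, stemming) and explicit mapping properties, which this sketch leaves out.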

Jane Greenberg and Richard Marciano Present at DLF 2022

MRC’s Jane Greenberg and Richard Marciano, Advanced Information Collaboratory (AI Collaboratory), University of Maryland, presented at the 2022 DLF Forum on Wednesday, October 12th.

Jane Greenberg presenting at DLF 2022

Their panel, titled “Innovating Data Science Education and Computational Thinking: Connecting iSchools and LAMs,” covered two national Institute of Museum and Library Services (IMLS) initiatives that connect leading GLAMs (galleries, libraries, archives, and museums) with educators and innovative data science education. Jane presented the LEADING (LIS Education and Data Science Integrated Network Group) fellowship project, and Richard presented the TALENT (Training of Archival & Library Educators with iNnovative Technologies) Network.

Richard Marciano presenting at DLF 2022

The presentation slides are available here: [LINK].

MRC Publication Updates

Sharing news on recent and forthcoming MRC publications!

For more information on MRC student and faculty outputs, see the publications page.