At the annual DCMI (Dublin Core Metadata Initiative) 2024 conference this month, Senior Research Associate John Kunze gave a tutorial on ARKs (Archival Resource Keys). It included an in-depth case study of YAMZ (Yet Another Metadata Zoo), which helps build consensus on terminology using ARKs as persistent identifiers for vocabulary terms and Linked Data concepts.
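For readers new to ARKs, here is a minimal sketch of how an ARK string breaks into parts, assuming the general shape of an optional resolver, the `ark:` label, a Name Assigning Authority Number (NAAN), an assigned name, and an optional qualifier. The `parse_ark` helper and the sample identifiers are made up for illustration and do not come from YAMZ or the tutorial. The regular expression is a simplification, not a full validator.

```python
import re
from typing import NamedTuple, Optional


class Ark(NamedTuple):
    resolver: Optional[str]  # e.g. "https://n2t.net", or None when absent
    naan: str                # Name Assigning Authority Number
    name: str                # identifier assigned by that authority
    qualifier: str           # optional sub-part or variant suffix, "" if none


# Simplified pattern (an assumption, not the full ARK grammar).
# Accepts both the "ark:/" and "ark:" label forms.
_ARK_RE = re.compile(
    r"^(?:(?P<resolver>https?://[^/]+)/)?"
    r"ark:/?(?P<naan>[0-9a-z]+)/"
    r"(?P<name>[^/.?#]+)"
    r"(?P<qualifier>[/.][^?#]*)?$",
    re.IGNORECASE,
)


def parse_ark(text: str) -> Ark:
    """Split an ARK string into resolver, NAAN, name, and qualifier."""
    m = _ARK_RE.match(text.strip())
    if m is None:
        raise ValueError(f"not an ARK: {text!r}")
    return Ark(m["resolver"], m["naan"], m["name"], m["qualifier"] or "")


# Hypothetical identifiers, used only to exercise the parser.
term = parse_ark("https://n2t.net/ark:/99999/fk4term1")
part = parse_ark("ark:12345/x6np1wh8k/c3.pdf")
```

Because the resolver is optional, the same identifier can be recognized whether it appears as a full URL or as a bare `ark:` string. That property is part of what makes ARKs usable as persistent references to vocabulary terms.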
MRC member Tim Gorichanaz is organizing this year’s annual meeting of the Document Academy (Docam ’24), taking place September 18-20. This year’s theme is “Documents from the Future.”
Documents involve technology, meaning they change with the times. The ancient Latin root of “document” referred to an oral teaching, and centuries later the most common form of document was a piece of paper. Now, with computing widespread, perhaps most documents are digital. What will tomorrow’s documents bring, particularly in light of generative AI? What might change, and what might stay the same?
http://documentacademy.org/?2024
Docam ’24 will feature an array of presentations exploring the social, cultural, and technological conceptualizations and effects of documents and their possible future iterations. Among the presenters at this year’s meeting will be MRC members Chris Rauch, Mat Kelly, and Hyung Wook Choi. The conference agenda, including a full list of presenters and presentations, can be found on the conference website.
The Document Academy was founded in 2001 by Maribeth Back and Niels Windfeld Lund. The organization is dedicated to exploring documents and documentation through a variety of media and means. For more information about the Document Academy, please visit their website: https://documentacademy.org/
Elizabeth (Lizzie) Jones, Northeastern University (Project: AI-ready data: Knowledge Extraction from Laboratory Notebooks). Lizzie pursued document segmentation, optical character recognition, and text tokenization to extract research protocols and results from digitized lab notebooks produced by members of the Reticular Synthesis Laboratory led by Fernando Uribe-Romo, University of Central Florida. Project aim: To make archival laboratory notebook data AI-ready. (Mentors include Drexel’s Joel Pepper, David Breen, and Jane Greenberg.)
Robert Sammarco, Drexel University (Project: Developing YAMZ (Yet Another Metadata Zoo) for Materials Science Terminology). Robert extended the YAMZ foundation and developed a materials science terminology portal. Project aim: To achieve better data interoperability, support the FAIR data principles, and help materials scientists better communicate within and across subdomains. (Mentors: Christopher Rauch, John Kunze, and Mat Kelly)
Rob Fleur, University of Michigan (Project: Knowledge Graph Implementation for Materials Science). Rob worked on automatic knowledge graph generation, drawing from an extensive collection of materials science research literature. Project aim: To help researchers more expediently extract knowledge from research literature. Rob will continue his work over September 2024. (Mentor: Alex Kalinowski)
Lizzie and Robert S. also participated in the ID4 REU end-of-summer event at Northwestern University in August 2024, and they each presented posters on their work. Kudos to all our REUs and their awesome accomplishments!
On June 11, 2024, Dave Breen gave a presentation titled “Image Informatics for Metadata Extraction and Verification of Museum Specimen Images” at the Advances in Digital Media Workshop Series at the Yale Peabody Museum. The series is part of the Integrated Digital Biocollections (iDigBio) project and sought to answer the question, “How can we use media technologies to position biodiversity collections for even greater relevance to science, society, and Earth’s biota in the future?”
Dates: Mid-June through Mid-September (Flexibility with start date, and opportunity to continue work over Fall ’24 term.)
REU stipend: $5,500
Deadline: Rolling basis (Friday, June 1st for first consideration)
Contacts: Interested applicants, please send a resume and brief statement of interest (1 paragraph) indicating why you would like to participate in the REU program. Please send your application to:
Project overview and description: Agreement on terminology is critical for human and machine communication supporting scientific research. Additionally, shared vocabulary provides a necessary foundation for data and metadata standards, as well as the basis for labels in machine learning pipelines. This REU project will develop and enhance YAMZ.net by creating a domain-specific portal for materials science and exploring AI integration. YAMZ is a general-purpose, crowdsourced online dictionary that uses reputation-based voting to support community discussion and consensus. Project REUs will:
Develop and test the domain-specific portal in the materials science subdomain
Explore and pilot integrating ChatGPT for drawing in definitions
Document project procedures to enable a generalizable model that can, on demand, present users with a constrained view (or portal) restricted just to terms from the materials science subdomain
Collaborate with project mentors and project staff on a scholarly output (e.g., conference poster, presentation, research paper)
REU applicants for this project should have
Exposure and instruction in at least one of the following disciplines: computer science, data science, chemistry, engineering, physics, and/or materials science
Interest in semantic systems (terminology/vocabulary) and their value for representation, machine learning, and AI
Knowledge of the value of data standards for communicating human to human, human to machine, and machine to machine
Knowledge of database and data science software (SQL, Tableau, Orange, etc.)
Python, Flask or similar web framework, or other coding experience
Applicant restrictions
Must be a non-Drexel undergraduate (not graduated)
May work remotely or onsite
Must be a U.S. citizen or permanent resident of the United States or its possessions
Research Goals
Advance YAMZ.net features supporting domain-specific portals (e.g., tagging, group ownership of terms and portals).
Explore and pilot AI integration into YAMZ.net.
Develop ways for domain-specific communities to be mostly self-sufficient in creating and managing portals.
Learning Goals
Gain R&D experience with a working online dictionary, and understand tradeoffs between domain-agnostic and domain-specific portals
Advance semantic research and data science/computer science skills
Obtain a better understanding of the complexity of questions surrounding terminology agreement and its importance for scientific communication and research
Please join the MRC on April 17 from 11AM-12PM in Room 928 of the College of Computing and Informatics for “Accelerating Artificial Intelligence for Data-Driven Discovery,” a talk delivered by Shih-Chieh Hsu, PhD, University of Washington.
Abstract
As scientific datasets become progressively larger, the algorithms needed to process this data quickly become more complex. In response, Artificial Intelligence (AI) has emerged as a solution to efficiently analyze these massive datasets. Emerging processor technologies such as graphics processing units (GPUs) and field-programmable gate arrays (FPGAs) allow AI algorithms to be greatly accelerated. The Accelerated AI Algorithms for Data-Driven Discovery (A3D3) Institute, sponsored by the National Science Foundation under the Harnessing the Data Revolution program, was established to enable real-time AI at scale for broad applications. In this talk, Hsu will give an overview of the challenges of high energy physics, multi-messenger astrophysics, and neuroscience regarding AI across latency and throughput regimes. He will introduce state-of-the-art model compression techniques, such as pruning and quantization, for edge computing. He will demonstrate that acceleration of AI inference as a web service represents a heterogeneous computing solution. Finally, Hsu will discuss how A3D3 can bring together disparate communities that are threaded by common data-intensive grand challenges to accelerate discovery in science and engineering.
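To make the two compression techniques named in the abstract concrete, here is a minimal pure-Python sketch of magnitude pruning and symmetric 8-bit quantization on a flat list of weights. It is an illustration under simplifying assumptions, not the method used by A3D3 or presented in the talk. The function names and sample weights are hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)  # number of weights to drop
    if k == 0:
        return list(weights)
    # Indices of the k smallest-magnitude weights.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric uniform quantization to integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map quantized integers back to approximate floating-point weights."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only to show the effect of each step.
weights = [0.5, -0.02, 0.1, -1.27]
pruned = magnitude_prune(weights, sparsity=0.5)
q, scale = quantize_int8(pruned)
restored = dequantize(q, scale)
```

Pruning reduces how many weights must be stored and multiplied. Quantization reduces how many bits each remaining weight needs. Both trade a small amount of accuracy for lower latency and memory use, which is the concern on edge devices and FPGAs.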
Biography
Shih-Chieh Hsu, PhD, is a professor in physics and adjunct professor in electrical and computer engineering at the University of Washington (UW), and director of the NSF HDR Institute: Accelerated Artificial Intelligence Algorithms for Data-Driven Discovery. He earned the BS/MS in physics from National Taiwan University and the PhD in Physics from University of California San Diego. He is working on experimental particle physics using proton-proton collision data from the Large Hadron Collider. His research interests range from dark matter searches with the ATLAS experiment and neutrino cross-section measurements with the FASER experiment to innovative artificial intelligence algorithms for data-intensive discovery and accelerated machine learning with heterogeneous computing.
AI-ready data refers to high-quality, well-prepared data that is optimized for use in artificial intelligence (AI) applications. AI-ready data increasingly includes metadata and ontologies that enhance the value and usability of data. Metadata provides essential context and information about the data, and ontologies offer structured semantic representation of a particular domain. These additional layers of information help data scientists, researchers, and AI systems understand, interpret, and apply appropriate algorithms and models for analysis. Metadata and ontologies enable consistent data integration, interoperability, and knowledge sharing across systems, while facilitating more knowledgeable AI applications. Additionally, these systems are proving vital for supporting the FAIR (Findable, Accessible, Interoperable, and Reusable) principles and reproducible computational research (RCR).
Despite these capacities, approaches for developing, implementing, and sustaining metadata and ontologies within AI-ready data pipelines remain inconsistent, cumbersome, and insufficiently supported. Challenges span the full data lifecycle, from data creation, collection, and research to the longer-term aims of data preservation, archiving, reuse, and support for research reproducibility. Collective, community-driven efforts are needed to address current obstacles and maximize the value and reliability of data. The AI-Ready Data: Navigating the Dynamic Frontier of Metadata and Ontologies workshop is a step toward addressing this challenge. This workshop will bring together a community of individuals with expertise across the data lifecycle to discuss issues, share solutions, and chart a path forward for addressing key challenges in preparing AI-ready data for scientific research.
Specific workshop goals are to:
Collectively define the state of AI-ready data challenges in the metadata and ontology space.
Share current successes and solutions leveraging metadata standards and ontologies.
Contribute to a road map to accelerate the preparation of data for artificial intelligence (AI) applications.
Current topics
What is AI-ready data?
Research Bottlenecks: Data Life Cycle Challenges and Solutions with Scientific Data
Metadata and Ontologies: Human in the Loop in the Era of LLMs
Annotation: Large-scale Data and Balancing Human and Machine Driven Approaches
Standards Development, Adoption, and Implementation: Realities and Fictions
Knowledge Graphs
Ontology Guided Knowledge Extraction: Leveraging Scholarly Big Data for Scientific Discovery
Future Directions with Metadata and Knowledge Organization Systems
Navigating the Data Deluge: AI, Infrastructure, and Decision-Making in the Era of Big Data
Joshua C. Agar, Assistant Professor Department of Mechanical Engineering and Mechanics, Drexel University
Date/time: Wednesday, February 28, 2024, @ 12:00 PM ET
Location/in person: Room 912 (9th floor), College of Computing & Informatics (CCI), Drexel University, 3675 Market Street (please send your name to: mrc.metadata@drexel.edu, for CCI access).
Science has traditionally harnessed data to inform decisions. Historically, data was sufficiently low-dimensional and manageable for human processing. However, the rapid expansion of sensing technologies across disciplines has overwhelmed traditional human-centric methods with vast, high-velocity data streams from diverse and often unreliable sources. Despite the remarkable advances in computers and large language models like ChatGPT, their capabilities remain limited. Current AI algorithms predominantly excel in interpolation, not extrapolation, leading to unrealistic and nonsensical outputs when stretched beyond their training data.
This talk explores the intersection of massive data influx and AI, focusing on their limitations and potential in enhancing decision-making, particularly in data-driven infrastructure. We propose a “humanistic carrot,” rather than a “stick,” approach to pressing challenges in scientific data management, spotlighting DataFed, a comprehensive data management system. This platform facilitates autonomous pipelines for the curation, sharing, searching, and fine-grained access control of data and metadata. We demonstrate how DataFed can streamline data management for experimentalists, enhancing data stewardship while reducing their workload.
We also delve into the intricacies of handling high-velocity data streams, where gigabits per second of data necessitate immediate processing for critical decision-making or autonomous control. This section covers deploying high-availability inference servers for on-demand data analysis and reduction. Additionally, we explore the concept of AI co-design, where algorithms are optimized to fit on programmable logic, enabling rapid, intelligent analysis, decision-making, and control on ultra-low cost, low-power devices at unprecedented speeds. Finally, we discuss the broad applicability of these methodologies across various fields, from particle physics to astronomy, highlighting their potential to revolutionize our approach to data and AI integration.
Dr. Joshua C. Agar is an Assistant Professor in the Department of Mechanical Engineering and Mechanics at Drexel University. With a foundational background in experimental materials science, Dr. Agar is predominantly renowned for his pioneering contributions to AI algorithms, computing infrastructure, and the development of cyber-physical systems in the fields of materials synthesis and microscopy. His expertise has been applied across a wide array of disciplines, including particle and plasma physics, materials science, and fluid dynamics. An active member of various AI communities, particularly the FastML community, which emphasizes ultra-low latency ML co-design, Dr. Agar has earned recognition as a leader in AI innovation. His work has garnered attention from prestigious institutions such as the National Academy of Engineering and the National Science Foundation.
Ajani Levere, a Drexel University STAR Scholar working with Drs. Jane Greenberg and David Breen, presented their imageomics research at the STAR Scholars showcase on August 31, 2023. Their presentation was titled “Computational Fish Specimen Classification: Advancing Machine Learning Model Accuracy” and was part of the NSF-HDR project Biology-guided Neural Networks for Discovering Phenotypic Traits. Ajani’s research is continuing under the guidance of Dr. Greenberg. They describe their project as follows:
Digital specimen metadata is valuable for scientific research and discovery, yet sparse specimen metadata availability restricts its potential. In addition to computational efforts made to remedy this issue, Machine Learning (ML) classification was performed on a computed metadata component, the outline extracted from fish specimen images. An ML model (MLM) approach provided a computational genus classification for a given fish outline. This research improves the MLM’s ability to accurately classify fish from their 2D outlines and demonstrates the expressiveness of this computed metadata item.
In our analysis, we inspected the outlines of the error cases, followed by a statistical review of their numerical data. We discovered our dataset limited higher MLM accuracy potential. Refactoring the dataset with a reduced feature length thus enhanced our dataset for MLM interpretability. Experimental results indicate a 96% accuracy, a 5% improvement over previous results. These results confirm the outline as a unique and highly distinguishable metadata component. Computing metadata components of this nature aids the development of a more robust metadata catalog for ML researchers.