Jessica Cheng, Week 1: Explore the direction of our project — Multi-layer taxonomy alignment problems

Week 1 of the LEADS fellowship project starts with exploring the actual problems and directions we would like to work on over the course of the 10 weeks.
To identify the problems, I reviewed some literature that I hope will guide me toward the intersection of my research interests and the goals of the Academy of Natural Sciences (ANS). The topics include, but are not limited to, taxonomies, knowledge graphs, biodiversity, geopolitics, and knowledge organization. We also want to link this project to the Biodiversity Heritage Library (BHL) and ANS collections.
Based on the conversations I had with my mentor Steve Dilliplane at both the LEADS boot camp and the NASKO 2019 conference, I came up with the idea of a ‘multi-layer’ taxonomy alignment problem/framework, which may ultimately guide us toward constructing a data-driven biodiversity knowledge graph/ontology for a specific species we wish to examine.
Species co-occurrence datasets often contain records based on the Darwin Core metadata standard. ‘Multi-layer’ in this case refers to the different fields in a co-occurrence dataset. These can include (again, not limited to): species names, characters/phenotypes, habitat information, geolocations (country, city, latitude, longitude), IUCN Red List or other endangered species classifications, etc. Depending on what type of metadata the dataset actually uses, my thought is that each of these fields can itself be a taxonomy.
1st taxonomy: species names
2nd taxonomy: geographic regions (geopolitical realities may exist) 
3rd taxonomy: phenotypes
…and more
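To make the idea of layers concrete, here is a minimal sketch of how one occurrence record could be split into per-layer views. The field names are Darwin Core terms, but the grouping into layers, the `LAYERS` mapping, the `split_into_layers` helper, and the sample record are all assumptions made for illustration; they do not come from an actual BHL/ANS dataset.

```python
# Hypothetical grouping of Darwin Core terms into "layers".
# Which fields belong to which layer is an assumption for illustration.
LAYERS = {
    "species_names": ["scientificName", "scientificNameAuthorship"],
    "geographic_regions": [
        "country",
        "stateProvince",
        "decimalLatitude",
        "decimalLongitude",
    ],
}


def split_into_layers(record, layers=LAYERS):
    """Return {layer_name: {field: value}}, keeping only fields present in the record."""
    return {
        layer: {field: record[field] for field in fields if field in record}
        for layer, fields in layers.items()
    }


# Made-up record, for illustration only.
record = {
    "scientificName": "Ursus arctos horribilis",
    "country": "United States",
    "stateProvince": "Montana",
}

views = split_into_layers(record)
```

Each entry in `views` could then be treated as the input to its own taxonomy. Other layers (phenotypes, conservation status) would need their own field lists, which depend on what the dataset actually records.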
How do we proceed? BHL/ANS or co-occurrence datasets & TAP:
Say we have two different datasets from BHL/ANS about Grizzly Bears. 
Each of these fields can itself have a taxonomy alignment problem.
One dataset may locate the grizzly bear (species name identified by author X) only in the lower 48 states, and list the bears as endangered.
The other dataset may be an occurrence dataset of grizzly bears (species name identified by author Y) in Alaska, where the bears are abundant.
1st taxonomy alignment problem (TAP): align the species names given by Author X vs. Author Y
2nd TAP: geographic regions – lower 48 states vs. Alaska
3rd TAP: endangered species list – one classification vs. another classification
In this case we can align the layers across the different datasets, and each layer will yield multiple possible worlds (merged solutions).
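A rough sketch of how the per-layer possible worlds might combine is shown below. It assumes each layer has already been aligned separately and has produced a small list of candidate solutions; the `candidates` dictionary, the relation strings in it, and the `combine_layers` helper are hypothetical placeholders, not real alignment results. It also assumes the layers are independent, which may not hold if there are relations across layers.

```python
from itertools import product

# Hypothetical candidate solutions per layer (placeholders, not real results).
candidates = {
    "species_names": [
        "X.grizzly equals Y.grizzly",
        "X.grizzly includes Y.grizzly",
    ],
    "geographic_regions": [
        "lower 48 states disjoint from Alaska",
    ],
    "endangered_status": [
        "listA.endangered overlaps listB.status",
        "listA.endangered disjoint from listB.status",
    ],
}


def combine_layers(candidates):
    """Pair every candidate of one layer with every candidate of the others."""
    layers = sorted(candidates)
    combos = product(*(candidates[layer] for layer in layers))
    return [dict(zip(layers, combo)) for combo in combos]


worlds = combine_layers(candidates)
```

With two, one, and two candidates in the three layers, `worlds` holds 2 × 1 × 2 = 4 combined possible worlds. If some layers constrain each other, the inconsistent combinations would have to be filtered out afterwards.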
Ontologies/knowledge graph/linked data:
If the approach above is feasible, my guess is that we can then patch together the possible worlds we come up with to form our own ‘grizzly bear’ ontologies/knowledge graphs. This would let us visualize and query the results for future use.
– Work with a particular species the Academy of Natural Sciences is most proud of?
– What does the actual dataset look like? 
– Are there any relations across different layers?
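The knowledge graph idea can be sketched with plain subject-predicate-object triples. This is only an illustration: the `world_to_triples` and `query` helpers and the predicate names `found_in` and `listed_as` are hypothetical, and the values simply restate the first example dataset described above.

```python
def world_to_triples(subject, world):
    """Turn one merged solution into (subject, predicate, object) triples."""
    return [(subject, predicate, obj) for predicate, obj in sorted(world.items())]


def query(triples, subject=None, predicate=None):
    """Return triples matching the given subject and/or predicate."""
    return [
        t
        for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
    ]


# Hypothetical world based on the first example dataset.
world = {"found_in": "lower 48 states", "listed_as": "endangered"}

triples = world_to_triples("grizzly bear", world)
status = query(triples, predicate="listed_as")
```

Here `status` contains the single triple about the bear's listing. A real implementation would more likely use an RDF store or graph database, but the triple shape stays the same.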
10-week rough timeline:
Week 1-2: identify the problems and research questions & come up with a 3-4 page proposal draft
Week 3-8: execute, implement the proposal 
Week 9-10: wrap up and draft deliverables 


Yi-Yun Cheng
PhD student, Research Assistant
School of Information Sciences, University of Illinois at Urbana-Champaign
Twitter: @yiyunjessica


1 thought on “Jessica Cheng, Week 1: Explore the direction of our project — Multi-layer taxonomy alignment problems”

  1. Hi Jessica,

    Would it be useful for you if I connected you to the 2018 LEADS fellow who worked at the Biodiversity Heritage Library? She did preliminary geoparsing of the BHL data.

    I love the outline that you developed for yourself. Great idea! Excited to see where this research goes!
