Week 4-5: Implementing the experiments & blockers

Over these two weeks I have been implementing the experiments we proposed: pairwise alignments of the ‘historical sovereignty’ of Taiwan.
Beyond the Darwin Core-based occurrence dataset, we believe that adding an extra field called ‘historical sovereignty’ would help scientists study the historical distribution of a species. In the case of the land snail Pupinella swinhoei, we found that most of the occurrences are located in Taiwan. As the last blog post said, the years in which this species was recorded span a broad range: from 1700 to now.
However, I ran into a few blockers while looking through the actual dataset:
1. Country Code: The dataset usually gives the country code as TW (Taiwan), but sometimes as JP (Japan). Did the recorders really mean that the species occurred in that location? When we cross-referenced the ‘country code’ field with the ‘locality’ field, we also found discrepancies, such as a ‘country code’ of Japan paired with a locality of Formosa (an alias for Taiwan). Stranger still, the year given for these records is 1700, and at that time Taiwan was not part of Japan. In this sense the country code, locality, and year fields are all problematic.
2. Year: We have 50 records in total for Pupinella swinhoei. Almost all of them have country codes, but more than two thirds are missing the year. Knowing the year in which the species was observed or collected is crucial, since it is one of the factors we use to determine the historical sovereignty of Taiwan.
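The two blockers above can be checked mechanically. Here is a minimal sketch in Python, not the code I actually ran: the sample rows are invented for illustration, the field names follow the Darwin Core terms `countryCode`, `locality`, and `year`, and the alias list and both helper functions are my own assumptions.

```python
# Hypothetical sample rows; the values are invented for illustration only.
records = [
    {"countryCode": "TW", "locality": "Formosa", "year": 1700},
    {"countryCode": "JP", "locality": "Formosa", "year": 1700},
    {"countryCode": "TW", "locality": "Taiwan", "year": None},
    {"countryCode": "TW", "locality": "", "year": None},
]

# Assumed keywords that mark a locality as being in Taiwan.
TAIWAN_ALIASES = ("taiwan", "formosa")


def missing_year_fraction(rows):
    """Return the share of rows that have no usable year."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get("year") in (None, ""))
    return missing / len(rows)


def country_locality_conflicts(rows):
    """Return rows whose locality mentions Taiwan but whose countryCode is not TW."""
    conflicts = []
    for row in rows:
        locality = (row.get("locality") or "").lower()
        mentions_taiwan = any(alias in locality for alias in TAIWAN_ALIASES)
        if mentions_taiwan and row.get("countryCode") != "TW":
            conflicts.append(row)
    return conflicts


print(missing_year_fraction(records))
print(country_locality_conflicts(records))
```

A keyword match like this only flags candidates for manual review; it cannot tell us which of the two fields is the wrong one.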
I suppose we could go from another direction and look at Taiwan’s historical sovereignty based on Taiwan’s timeline. But if we disregard the years in the occurrence data and rely solely on outside information, we would lose our original goal of proposing a ‘more precise’ way of merging taxonomically organized datasets. We also probably could not call the result a data-driven knowledge graph (our endgame).
Another workaround is to add dummy records alongside the real records and fill in the years we want to examine.
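As a rough sketch of the dummy-record workaround (an assumption about how it could look, not something we have settled on), the dummy rows can carry an explicit flag so they are never mistaken for real occurrences. The `isDummy` field and the `make_dummy_records` helper are hypothetical and not part of Darwin Core.

```python
def make_dummy_records(scientific_name, years, country_code="TW"):
    """Build placeholder rows for the years we want to examine."""
    return [
        {
            "scientificName": scientific_name,
            "countryCode": country_code,
            "year": year,
            "isDummy": True,  # hypothetical flag, not a Darwin Core term
        }
        for year in years
    ]


# One invented 'real' row with a missing year, plus two dummy rows.
real_records = [
    {
        "scientificName": "Pupinella swinhoei",
        "countryCode": "TW",
        "year": None,
        "isDummy": False,
    }
]
combined = real_records + make_dummy_records("Pupinella swinhoei", [1700, 1935])
```

Keeping the flag on every row makes it easy to drop the dummy records again before reporting any counts.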
More to be discussed. Until next week!
Yi-Yun Cheng
PhD student, Research Assistant
School of Information Sciences, University of Illinois at Urbana-Champaign
Twitter: @yiyunjessica



Week 3: Bridging NHM collection to Biodiversity Occurrence dataset – example of land snails

To recap what I was trying to do: I wanted to find a species in Taiwan that also happened to be mentioned in the Proceedings of the Academy of Natural Sciences.
We chose Taiwan as our geographic point of interest because its sovereignty has been historically complex, which should make it an interesting example of shifting geopolitical realities.
This whole week I have been refining the use cases around the example we gathered from the Biodiversity Heritage Library: a land snail species, “Pupinella swinhoei sec. H. Adams 1866”.
The idea is to bring the natural history museum (NHM) literature closer to real-life biodiversity occurrence datasets. I then gathered a dataset from GBIF by searching on the scientific name “Pupinella swinhoei”. The aggregated GBIF dataset contains 50 occurrence records across 18 institutions (18 datasets), ranging from the year 1700 to now.

(Figure: different colors indicate different data sources.)
Though the ‘countryCode’ field mostly indicates that the records are from TW (Taiwan), that may not reflect the sovereignty at the time each record was made. To merge these datasets using the sovereignty of the time, I first examined two of the 18 data sources: the MCZ dataset and the NSSM dataset.
In 1700, Taiwan was a prefecture within Qing Dynasty China.
In the 1930s, Taiwan was a colony of Japan.
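To make the two cases above concrete, here is a minimal sketch of a year-to-sovereignty lookup. It is not the logic-based alignment approach itself, just a plain lookup table under a deliberately simplified timeline (Qing rule 1683 to 1895, Japanese rule 1895 to 1945, Republic of China afterwards). The period boundaries are treated as whole years, and the function name and labels are my own assumptions.

```python
# Simplified, assumed timeline: (start year, end year exclusive, label).
SOVEREIGNTY_PERIODS = [
    (1683, 1895, "Qing Dynasty China"),
    (1895, 1945, "Japan"),
    (1945, None, "Republic of China"),
]


def historical_sovereignty(year):
    """Return the label for the period containing `year`, or None if unknown."""
    if year is None:
        return None  # records without a year cannot be assigned
    for start, end, label in SOVEREIGNTY_PERIODS:
        if year >= start and (end is None or year < end):
            return label
    return None  # the year falls before the periods listed above


print(historical_sovereignty(1700))
print(historical_sovereignty(1935))
```

Records with a missing year fall through to `None`, which is exactly the blocker that makes the year field so important for this use case.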
I have some preliminary results from merging the sovereignty fields of these two datasets using the logic-based taxonomy alignment approach. However, since I am preparing a conference submission based on this use case, I don’t want to jinx anything! (Fingers crossed.)
If I am allowed to share more about the paper, I promise to discuss more in the next blog post!



Week 2: Elaborating on our multi-level alignment idea and an initial exploration on the BHL collection

This week I explored the multi-level alignment idea further, and I am almost convinced that we can apply it to a ‘dataset merging’ problem.
The dataset merging idea is not new. For example, one paper from my PhD advisor briefly discusses how to merge taxonomic data: Towards Best-effort merge on taxonomically organized data
But our group at UIUC (in collaboration with systematics experts from ASU) has mainly been working on aligning the actual taxonomic names rather than on ‘dataset merging’.
For the dataset merging idea, our proposal is pretty simple.
If we can align taxonomic names, we should also be able to align other things in the dataset such as spatial information (in our case, countries/areas).
Naturally, finding the intersection between my project site, the Academy of Natural Sciences, and my interest in taxonomy became the priority for this week. The task I set for myself was to find a species that is endemic to or widespread across Taiwan (my geographical point of interest), and that also happens to appear somewhere in the text of either the proceedings or the journals of the Academy of Natural Sciences.
The quest went on with me fascinated (and slightly sidetracked) by the orchid populations and varieties that Taiwan has. To my surprise, one news article (in Chinese) mentioned that Taiwan has more than 0.9 billion moth orchids!
(Image source: britannica.com)
I then sketched out our dataset merging idea around the orchids first:
Basically, the idea is that if we have two occurrence datasets on orchids, we can merge them as in the figure above, with each column being one ‘taxonomy alignment problem’.
Just as I was almost set on going with the beautiful orchid flowers, I finally turned back to BHL, searched for the keyword “Taiwan”, and restricted the Titles to “Academy of Natural Sciences”. This is when I found a whole new world of Mollusca (snails)!
The entry returned for the intersection of “Taiwan” and “ANS” is from the Proceedings of the Academy of Natural Sciences, v.57, 1905, and the title of the page/chapter is: “Catalogue of the Land and Fresh-water Mollusca of Taiwan (Formosa) with descriptions of new species”.
As the BHL search interface above shows, the scientific names on this page were also extracted and are shown in the bottom left corner. With this breakthrough on the Mollusca (possibly endemic to Taiwan), I will begin working with this species on the dataset merging idea next week!



Jessica Cheng, Week 1: Explore the direction of our project — Multi-layer taxonomy alignment problems

Week 1 of the LEADS fellowship project started with exploring the actual problems and directions we would like to work on over the course of 10 weeks.
To identify the problems, I reviewed some literature that may hopefully guide me towards the intersection of my research interests and the Academy of Natural Sciences’ (ANS) goals. These topics include, but are not limited to, taxonomies, knowledge graphs, biodiversity, geopolitics, and knowledge organization. We also want to link this project to the Biodiversity Heritage Library (BHL) and ANS collections.
Given the conversations I had with my mentor Steve Dilliplane at both the LEADS boot camp and the NASKO 2019 conference, I came up with the idea of a ‘multi-layer’ taxonomy alignment problem/framework, which may ultimately guide us toward constructing a data-driven biodiversity knowledge graph/ontology for a specific species we wish to examine.
Species co-occurrence datasets often contain records based on the Darwin Core metadata standard. ‘Multi-layer’ here refers to the different fields in a co-occurrence dataset. These can include (again, not exclusively): species names, characters/phenotypes, habitat information, geolocations (country, city, latitude, longitude), IUCN Red List or other endangered-species classifications, etc. Depending on what type of metadata the dataset actually uses, my thinking is that each of these fields can itself be a taxonomy.
1st taxonomy: species names
2nd taxonomy: geographic regions (geopolitical realities may exist) 
3rd taxonomy: phenotypes
…and more
How do we proceed? BHL/ANS or co-occurrence datasets & TAP:
Say we have two different datasets from BHL/ANS about grizzly bears.
Each of these fields can itself pose a taxonomy alignment problem.
One dataset may only locate the grizzly bear (species name identified by author X) in the lower 48 states, and list the bears as endangered.
The other dataset may be an occurrence dataset of grizzly bears (species name identified by author Y) in Alaska, where the bears are more than abundant.
1st taxonomy alignment problem (TAP): align the species names given by Author X vs. Author Y
2nd TAP: geographic regions – lower 48 states vs. Alaska
3rd TAP: endangered species list – one classification vs. another classification
In this case we can align the multiple layers across the different datasets, and each layer will yield multiple possible worlds (merged solutions).
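One way to picture how the per-layer possible worlds could be combined is a Cartesian product over the layers. This is only a sketch under a strong assumption: that the layers are independent of one another, which is itself one of the open questions for the project. The solution labels are invented placeholders, not output from any alignment tool.

```python
from itertools import product

# Hypothetical possible worlds per layer; the labels are placeholders only.
layer_solutions = {
    "species name": ["X's name equals Y's name", "X's name includes Y's name"],
    "region": ["lower 48 states is disjoint from Alaska"],
    "conservation status": ["classifications overlap", "classifications are disjoint"],
}


def combined_worlds(solutions_by_layer):
    """Pair every possible world of one layer with every world of the others."""
    layers = list(solutions_by_layer)
    combos = product(*(solutions_by_layer[layer] for layer in layers))
    return [dict(zip(layers, combo)) for combo in combos]


worlds = combined_worlds(layer_solutions)
print(len(worlds))  # 2 * 1 * 2 combinations
```

If the layers turn out to be related (for example, if a region constrains which conservation list applies), some of these combinations would have to be pruned.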
Ontologies/knowledge graph/linked data:
If the abovementioned approach is feasible, my guess is that we can then patch together the possible worlds we came up with to form our own ‘grizzly bear’ ontologies/knowledge graphs. This would let us visualize and query the results for future use.
– Work with a particular species Academy of Natural Sciences is most proud of?
– What does the actual dataset look like? 
– Are there any relations across different layers?
10-week rough timeline:
Week 1-2: identify the problems and research questions & come up with a 3-4 page proposal draft
Week 3-8: execute, implement the proposal 
Week 9-10: wrap up and draft deliverables 

