After my last post I spent some time, along with my mentors figuring out how to isolate the subject headings and ids from the dataset. We decided that since the dataset was so large and my machine did not have the power to handle it all. We would do all our test with a sample subset. Using some python code with Apache Spark we managed to isolate the subject terms from these records terms and output them as a csv file. The sample we yielded over 700,000 subject terms.
One of the goals of this project was to map these term against LCSH. At first my idea was to download the LCSH dataset in xml and see what kind of scripting I could do with it. However, I discovered that there is a Python script which extends OpenRefine and which will perform reconciliation against the LOC API which we decided to test. This allows you to load a csv file and run the reconciliation script against it. We found that this is an effective method to find close matches where the confidence level is over 85% for a match. The reconciliation process returns the term as listed in LCSH along with a URI which can be saved with the original data. The biggest concern with this method is the time that it takes to run within OpenRefine, however my mentors feel that this process can be captured and run in a similar way outside the tool using other programming methods.
Later we manually checked the items that were returned to see if they in fact were matching and happily everything has checked out. There still remains a question as to whether or not there are subjects that are not close/exact matches but rather fuzzy matches and how to identify and get uri results for those. Also, the dataset seemed to have a number of duplicates and data that may need some kind of cleaning preparation, so that is another thing that may need to be examined.