One of the things we’ve noticed about the dataset is that, beyond exact duplicates, there are subject terms that refer to the same thing but are spelled or entered differently by the contributing institutions. We’ve been thinking about using clustering applications to look at the data and see what kinds of matches are returned.
We first needed to do some reading on what the different clustering methods do and how they might work with the data we have. We did end up trying some clustering using key collision methods (fingerprint, n-gram fingerprint) and nearest-neighbour (kNN) methods with Levenshtein distance. They return different results, and we are still reviewing what they return before performing any merges. Terms can look the same or seem similar but in fact be different, so it is not as simple as merging everything that matches.
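To give a sense of what the key collision approach does, here is a minimal Python sketch loosely modelled on the fingerprint method: normalize each term, reduce it to a sorted set of tokens, and group terms whose keys collide. The sample subject terms are made up for illustration and are not drawn from the dataset.

```python
import re
import unicodedata
from collections import defaultdict

def fingerprint(term: str) -> str:
    """Rough fingerprint key: strip accents, lowercase, replace punctuation
    with spaces, then sort and de-duplicate the remaining tokens."""
    norm = unicodedata.normalize("NFKD", term).encode("ascii", "ignore").decode()
    norm = re.sub(r"[^\w\s]", " ", norm.lower())
    return " ".join(sorted(set(norm.split())))

def cluster_by_fingerprint(terms):
    """Group subject terms whose fingerprint keys collide."""
    clusters = defaultdict(set)
    for term in terms:
        clusters[fingerprint(term)].add(term)
    # only keys with more than one distinct spelling are candidates for review
    return {key: variants for key, variants in clusters.items() if len(variants) > 1}

# hypothetical variant spellings, for illustration only
terms = ["Quilts--Kansas.", "Quilts -- Kansas", "Kansas quilts", "Quilting"]
print(cluster_by_fingerprint(terms))
# {'kansas quilts': {'Quilts--Kansas.', 'Quilts -- Kansas', 'Kansas quilts'}}
```

The n-gram fingerprint and nearest-neighbour methods work differently (character n-grams as keys, or pairwise distance comparisons), which is part of why the results they return differ.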
One important question to answer is how accurate the clusters are and whether we can trust the results enough to merge automatically. My feeling is that a lot of human oversight is needed to evaluate the clusters.
Another thing we want to test is how much faster the reconciliation process would be if we accepted and merged the cluster results, and whether clustering is worth the time, i.e. if we cluster first and then do string matching, is there an improvement in the results, or are they basically the same?
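As a rough way to frame that test, the sketch below matches terms against an authority list using Levenshtein distance, once on the raw values and once on values collapsed after accepting cluster merges. The term lists, the authority list, and the edit-distance cutoff are all made up for illustration; this is not our actual workflow, just the shape of the comparison.

```python
from time import perf_counter

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def match_terms(terms, authority, max_dist=2):
    """Pair each term with the closest authority term within max_dist edits."""
    matches = {}
    for term in terms:
        best = min(authority, key=lambda a: levenshtein(term.lower(), a.lower()))
        if levenshtein(term.lower(), best.lower()) <= max_dist:
            matches[term] = best
    return matches

# hypothetical before/after comparison
raw_terms = ["Quilts--Kansas.", "Quilts -- Kansas", "Kansas quilts", "Quilting"]
clustered_terms = ["Quilts--Kansas", "Quilting"]   # one value kept per accepted cluster
authority = ["Quilts--Kansas", "Quilting", "Quiltmakers"]

for label, terms in [("raw", raw_terms), ("clustered", clustered_terms)]:
    start = perf_counter()
    found = match_terms(terms, authority)
    print(f"{label}: {len(found)}/{len(terms)} matched in {perf_counter() - start:.4f}s")
```

In this toy case the reordered variant ("Kansas quilts") never falls within the edit-distance cutoff on its own, which is the kind of match the clustering step would need to catch for the extra effort to pay off.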