Category: LEADS Blog
Rongqian Ma; Week 4-5: Visualizing Decorations Information
Decoration information is one of the most complex categories of information in the manuscript dataset, and visualizing it requires a substantial amount of data pre-processing. The dataset contains two layers of information: a) what decorations the manuscripts include, and b) how those decorations are arranged across the manuscripts. Delivering both layers may communicate the decorative characteristics of the book of hours. For the "what" part, I identified several major decorative elements of the manuscripts from the dataset and color-coded each element in the Excel sheet, such as the illuminated initial, miniature (large and small), foliate, border (border decorations), bookplate (usually indicating the ownership of the book), catalog, notation, and multiple pictorial themes and imageries (e.g., Annunciation, Crucifixion, Pentecost, Betrayal, Lamentation, Mary, Christ).

Figure 1 shows my preliminary attempt to visualize the decorative information of the manuscripts: I coded the major decorative patterns in the left half of the coding graph and the major pictorial themes (e.g., Virgin, Christ, Annunciation) in the right half. From this preliminary coding graph, two general decorative styles emerge for the book of hours. One type of decoration focuses on making the manuscripts beautiful, while the other focuses on displaying stories and the meaning behind them through pictorial representations of the texts. Going back to the original digitized images of the manuscript collection, I found that the patterns of the first style were mostly used to decorate texts (appearing around the texts), while the second style appears mostly as full-leaf miniatures supplementing the texts. A preliminary analysis of the two styles' relationship with the geographic information also suggests that the first decorative style is mostly associated with France, while the style that emphasizes miniature storytelling is more associated with production locations such as Bruges.
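For readers who want to reproduce this coding step programmatically rather than by hand in Excel, here is a minimal sketch. The column name "decoration", the keyword list, and the sample rows are my own assumptions for illustration, not the project's actual data.

```python
# Sketch: flag major decorative elements mentioned in a free-text description column.
# The "decoration" column name and the sample rows are hypothetical illustrations.
import pandas as pd

elements = ["illuminated initial", "miniature", "foliate", "border",
            "bookplate", "annunciation", "crucifixion"]

df = pd.DataFrame({"decoration": [
    "Illuminated initials, foliate border decoration",
    "Full-leaf miniature of the Annunciation",
]})

for element in elements:
    # case-insensitive substring match; one binary indicator column per element
    df[element] = df["decoration"].str.contains(element, case=False).astype(int)

print(df)
```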
For the second step, I explored the transitions and relationships among different decorative elements using Tableau, Voyant, and Wordle. Figure 2 is a word cloud showing the frequency of the major decorative elements across the whole manuscript collection. Voyant Tools, in comparison, provides a way to further examine the strength of the relationships among different decorative elements across the dataset. Here is an example: treating all the decoration information as texts, the "links" feature in Voyant shows the relationships among different elements. For instance, the link between "illuminated" and "initial" is the strongest, and there are also associations among other elements of decoration, such as "decorated," "line," "miniature," "border," "bookplate," and "vignette." The dataset also confirms that elements such as illuminated initials, miniatures, and bookplates (which indicate the ownership of the book) are the most common. The links, however, do not reveal any relationships among the pictorial themes.
Figure 1.
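The frequency counts behind a word cloud, and a rough analogue of Voyant's "links," can also be computed directly. The sketch below is a minimal illustration, assuming the decoration descriptions are available as comma-separated strings; the sample records are invented.

```python
# Sketch: term frequencies (word-cloud input) and pairwise co-occurrence
# (an approximation of Voyant-style "links"). The sample records are illustrative.
from collections import Counter
from itertools import combinations

records = [
    "illuminated initial, miniature, border",
    "illuminated initial, bookplate",
    "miniature, border, vignette",
]

term_freq = Counter()
pair_freq = Counter()
for rec in records:
    terms = sorted({t.strip().lower() for t in rec.split(",")})
    term_freq.update(terms)
    pair_freq.update(combinations(terms, 2))  # co-occurrence within one record

print(term_freq.most_common(5))
print(pair_freq.most_common(5))
```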
Week 6 – Sonia Pascua – Parser, Python & Mapping
I finally met my mentor, Peter Logan, last Monday, and it was great to see him in person. In this meeting I presented the progress of the project, and we figured out that TEI would probably be a good data format for me to move forward with. As a pending action item, the TEI version will be generated and provided by Peter.
- I was able to write a parser in Python to extract the elements from the SKOS RDF/XML format of the 1910 LCSH (a minimal sketch of this kind of extraction appears after this list).
- A joint assessment by Jane, Peter, and me resulted in the following:
- There is an entry in LCSH with multiple SEE terms; when it is converted to SKOS RDF/XML using MultiTes, only the first term is recorded as prefLabel and the rest fall into altLabel. How SEE should be represented is a challenge. Based on LCSH, a concept with a SEE tag should use the SEE term as the subject heading. That is the case for the first term in the SEE tag, which became the prefLabel. However, altLabel is used for the succeeding SEE terms, which appears to be an incorrect representation. Multiple prefLabels will be explored: can it be done, and wouldn't it violate the LCSH or SKOS rules? I need to investigate this further.
- For now, it is decided that USE terms will be transferred to altLabel. We will set up a meeting with Joan, the developer of HIVE, to discuss how USE and "Use for" will be represented in HIVE.
- I brought up some alphanumeric strings in the 1910 LCSH that are recognized Library of Congress Classification numbers. Do they still need to be represented? Per Jane, they can be kept as notes.
- I also need to investigate how BT and NT will be represented, both in SKOS and in the HIVE DB.
- The current SKOS RDF/XML at hand shows the different SKOS elements, some of which have no representation in HIVE. To address this, we will bring the concern to Joan and consult with her on how these can be added or mapped to the existing HIVE DB fields.
- Since the parser script I wrote takes a text file as input, it is recommended to work on a text file of the 1910 LCSH. Peter will provide the TEI format.
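To make the parsing step concrete, here is a minimal sketch of the kind of extraction described above. It is my own reconstruction rather than the project's actual script, and the file name lcsh1910_skos.rdf is hypothetical.

```python
# Sketch: extract prefLabel/altLabel values per concept from a SKOS RDF/XML file.
# Illustrative reconstruction only; "lcsh1910_skos.rdf" is a hypothetical file name.
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}

tree = ET.parse("lcsh1910_skos.rdf")
for concept in tree.getroot().findall("skos:Concept", NS):
    pref = [e.text for e in concept.findall("skos:prefLabel", NS)]
    alt = [e.text for e in concept.findall("skos:altLabel", NS)]
    # A concept converted from multiple SEE terms will show one prefLabel
    # and the remaining SEE terms as altLabels, the issue discussed above.
    print(pref, alt)
```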
Additionally, earlier today, the LEADS-4-NDP 1-minute madness was held. I presented the progress of the project to my co-fellows and the LEADS-4-NDP advisory board.
Jamillah Gabriel: Working Around the Unique Identifier
In recent weeks, my project has taken an unexpected turn from data storytelling and visualization towards data processing. As it turns out, our partner organization (Densho.org) has already done some data cleaning in OpenRefine, created a database, and begun preliminary data processing. I'll be using Python and Jupyter Notebook to continue the work they've started, first by testing previous processes and then by creating new ones. I also found out that the data doesn't have unique identifiers, so I'll be using the following workaround to isolate pockets of data.
In this partial example (there’s more to it than what’s seen in this screenshot), I’ll need to query the data using a for loop that searches for a combination of first name, last name, family number, and year of birth in order to precisely locate data in a way that potentially replicates the use of a unique identifier. I’m finding that not having a unique identifier makes it much more difficult to access data quickly and accurately, but hopefully this for loop will do the trick. I’m looking forward to playing with the code more and seeing what can be discovered.
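Since the screenshot itself is not reproduced here, the following sketch shows one way such a composite lookup might look in pandas; the file name, column names, and target values are hypothetical, not the partner organization's actual schema.

```python
# Sketch: look up records by a composite of fields in place of a unique ID.
# File name, column names, and target values are hypothetical.
import pandas as pd

df = pd.read_csv("records.csv")  # hypothetical export of the partner data

target = {"first_name": "Jane", "last_name": "Doe",
          "family_number": 12345, "birth_year": 1920}

matches = []
for _, row in df.iterrows():  # a simple for loop over every record
    # keep the row only if all four fields match the target combination
    if all(row[col] == val for col, val in target.items()):
        matches.append(row)

print(pd.DataFrame(matches))
```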
Alyson Gamble, Week 4: Historical Society of Pennsylvania
—
Week 2: Kai Li: What I talk about when I talk about publishers
As I mentioned in my previous posts, the entitization of publishers was only recently problematized when the new BIBFRAME model was proposed, which treats the publisher as a separate entity in the overall bibliographic universe rather than as a text string in the MARC record. However, from the perspective of cataloging, we still do not seem to know much about what a publisher is.
In the MARC record, two fields, 260 and 264, are used to describe information about the publication, printing, distribution, issue, release, or production of a resource. The use of these two fields differs between the two cataloging rules, AACR2 (the Anglo-American Cataloguing Rules, 2nd edition) and RDA (Resource Description and Access), which replaces AACR2. Under AACR2, all publisher and distributor information is described in 260 subfield b, where multiple subfields can be used when there is more than one publisher or distributor. Under RDA, however, the 264 field should be used, and the different functions (primarily publication and distribution in this context) are distinguished by the field's second indicator. One issue with the AACR2 rules is that they do not require publisher names to be transcribed exactly as displayed in the resource: catalogers have the freedom to omit or abbreviate some name components, such as "publishers" and "limited." In certain ways, the RDA rules are more consistent with how publishers are supposed to be handled in a more modern information infrastructure: publishers should be recorded in a more consistent and structured manner and not mixed with other types of entities (especially distributors, but also printers and issuers). In practice, however, the majority of library bibliographic records were produced under the AACR2 rules, and they are almost impossible to transform into RDA-style records because we do not know which name components were omitted or abbreviated.
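As a concrete illustration of how these fields can be read in practice, here is a minimal sketch using pymarc; the file name is hypothetical, and the handling of the 264 second indicator follows the RDA convention described above.

```python
# Sketch: pull publisher-name strings from MARC 260/264 subfield b using pymarc.
# "records.mrc" is a hypothetical file of binary MARC records.
from pymarc import MARCReader

publishers = []
with open("records.mrc", "rb") as fh:
    for record in MARCReader(fh):
        if record is None:  # skip records the reader could not parse
            continue
        for field in record.get_fields("260", "264"):
            # Under RDA, 264 with second indicator 1 marks publication proper;
            # other indicator values cover production, distribution, etc.
            if field.tag == "264" and field.indicator2 != "1":
                continue
            # Subfield b holds the publisher (or distributor) name; there may
            # be more than one when several publishers are named.
            publishers.extend(name.strip(" ,;:") for name in field.get_subfields("b"))

print(publishers)
```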
How publisher names are described (inconsistently) in the MARC format is just one barrier to the identification of publishers, and a relatively easy one to solve; a real challenge in the present project is the history of the publishing industry. In the real-world context, what is described in 260/264 subfield b is just an imprint, which, by definition, is the unit that publishes, whatever that unit is (it could be a publisher, a brand or branch owned by a publisher, or an individual person who publishes the resource). For example, in this link, you can see all the imprints owned by Penguin Random House, which, incidentally, was formed in 2013 by the merger of Penguin Group and Random House, two of the largest publishers in the American publishing market.
Throughout the history of the publishing industry, publishers have been merging and splitting, just as in the example of Penguin Random House. They might acquire another publisher outright, or just some brands (imprints) owned by another publisher. In some rare cases, an imprint was sold to a different publisher and later sold back to its original owner. Shown below is a slice of the data manually collected by Cecilia about the history of Wiley, another major American publisher.
[A slice of the history of Wiley]
From this imprint-centered view, a publisher is a higher-level entity than an imprint: it includes all of its child entities at a given time. In other words, quite unlike other bibliographic concepts, such as works ("great works are timeless"), publishers and imprints exist in a temporal framework. This is a huge challenge for the project, partly because temporality is extremely difficult to combine with network analysis methods. While I cannot offer a solution at this time, this will be an interesting topic to address further in my future work.
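One simple way to make the temporal aspect concrete is to model imprint ownership as dated intervals and resolve the owning publisher for a given year. The sketch below is only an illustration; the imprint and publisher names and the dates are invented.

```python
# Sketch: model the temporal imprint-publisher relationship as dated intervals.
# All names and years are hypothetical illustrations.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Ownership:
    imprint: str
    publisher: str
    start: int            # year ownership began
    end: Optional[int]    # year ownership ended (None = still owned)

history = [
    Ownership("Imprint A", "Publisher X", 1985, 2001),
    Ownership("Imprint A", "Publisher Y", 2001, None),  # sold to Y in 2001
]

def owner_in(imprint: str, year: int) -> Optional[str]:
    """Resolve which publisher owned an imprint in a given year."""
    for o in history:
        if o.imprint == imprint and o.start <= year and (o.end is None or year < o.end):
            return o.publisher
    return None

print(owner_in("Imprint A", 1995))  # Publisher X
print(owner_in("Imprint A", 2010))  # Publisher Y
```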
Week 4-5: Implementing the experiments & blockers