LEADS Blog

Nikolaus Parulian, Week 1: Exploratory Data Analysis – What we can do to understand the data?

LEADS site: Repository Analytics & Metrics Portal

 
 
After getting some ideas about data science, data analytics, and data visualization in the boot camp (Sonia already posted an excellent review of what we learned at the boot camp), I started working on the Repository Analytics and Metrics Portal (RAMP) dataset provided by my mentors.
The Repository Analytics & Metrics Portal (RAMP) is a web service that improves the accuracy of institutional repository (IR) analytics.
RAMP provides a persistent and accurate count of file downloads from IRs and offers considerable potential for aggregating and comparing IR metrics across the organizations that join the project.
 
The first thing I did with the dataset was to understand the data through exploratory data analysis. The RAMP dataset I am working on is derived from the Google Analytics Console, which contains page_clicks, URL, average_positions, and impressions, merged with additional data that RAMP provided. I visualized and aggregated most of the categorical columns in the dataset and computed the correlation between each pair of numerical columns. I also computed summary statistics to see whether there are outliers in the dataset.
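The three steps described above (aggregating a categorical column, correlating numerical columns, and checking for outliers) can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the rows are invented, the `device` column is a hypothetical categorical field, and the 1.5 × IQR rule is one common outlier heuristic rather than the method actually used on the RAMP data.

```python
import statistics
from collections import Counter

# Invented rows for illustration only. The numeric column names follow the
# post; the "device" column and all values are assumptions, not RAMP data.
rows = [
    {"device": "DESKTOP", "page_clicks": 12, "impressions": 140},
    {"device": "MOBILE", "page_clicks": 3, "impressions": 40},
    {"device": "DESKTOP", "page_clicks": 7, "impressions": 90},
    {"device": "MOBILE", "page_clicks": 5, "impressions": 60},
    {"device": "DESKTOP", "page_clicks": 9, "impressions": 110},
    {"device": "MOBILE", "page_clicks": 4, "impressions": 55},
    {"device": "DESKTOP", "page_clicks": 8, "impressions": 95},
    {"device": "DESKTOP", "page_clicks": 250, "impressions": 3000},
]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def iqr_outliers(values):
    """Return values lying more than 1.5 * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


# 1. Aggregate a categorical column.
device_counts = Counter(row["device"] for row in rows)

# 2. Correlate two numerical columns.
clicks = [row["page_clicks"] for row in rows]
impressions = [row["impressions"] for row in rows]
r = pearson(clicks, impressions)

# 3. Flag possible outliers in one numerical column.
outliers = iqr_outliers(clicks)

print(device_counts, round(r, 3), outliers)
```

On a real dataset the same steps would more likely be done with a data frame library; the standard library is used here only to keep the sketch self-contained.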
 
In the end, I found some interesting results through the visualization and correlation analysis, and we will discuss the findings in the meeting in the second week.
 
Overall, this RAMP project is pretty exciting and has so much potential. I am excited to continue working on it.
 
 
Nikolaus Parulian
LEADS Blog

Julaine Clunis, Week 1: Getting Started

Hi everyone!

This is Julaine, and my assignment is with the Digital Public Library of America (DPLA). The DPLA has more than 3 million unique subject headings, only a portion of which come from controlled vocabularies. This can lead to various issues when records use slight term variations or synonyms for the same concept.
The aim of my project is to continue developing and testing an effective method for analyzing record content and matching that content, including keywords, with relevant controlled terms from a defined list. The goal is a consistent vocabulary that aids users, can be reliably re-ingested, and consistently supports analytics.
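To make the matching problem concrete, here is a minimal sketch of mapping free-text subject headings onto a controlled list. It is not the project's method: the controlled terms are a made-up sample, the `normalize` and `match_term` helpers are hypothetical names, and fuzzy string similarity from Python's `difflib` only catches spelling and punctuation variants, not true synonyms.

```python
import difflib

# Hypothetical controlled terms; a real list would come from a chosen
# controlled vocabulary.
CONTROLLED_TERMS = [
    "Photographs",
    "World War, 1939-1945",
    "Railroads",
    "Agriculture",
]


def normalize(term):
    """Lowercase, trim surrounding punctuation, and collapse whitespace."""
    return " ".join(term.lower().strip(" .;,").split())


# Map each normalized form back to its preferred controlled term.
_INDEX = {normalize(term): term for term in CONTROLLED_TERMS}


def match_term(raw, cutoff=0.8):
    """Return the closest controlled term, or None if nothing is close enough."""
    key = normalize(raw)
    if key in _INDEX:
        return _INDEX[key]
    close = difflib.get_close_matches(key, list(_INDEX), n=1, cutoff=cutoff)
    return _INDEX[close[0]] if close else None


for heading in ["photographs.", "Photograph", "Rail roads", "Quilts"]:
    print(heading, "->", match_term(heading))
```

The `cutoff` value is an arbitrary starting point; a lower value matches more variants at the cost of more false matches, and headings with no close controlled term are left unmatched for review.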
I have spent the last couple of days reading through a ton of documentation about the work that has already been completed on this project, familiarizing myself with the DPLA Metadata Application Profile, and getting set up with the software and data that have been recommended for use. I have been exploring Apache Spark for the first time, and I am slowly finding my way around it (downloading, installing, and setting up the environment on my machine and reviewing tutorials), so I haven’t really done much in terms of coming up with solutions to this problem, as I am just getting to know the tools and the data.
My mentors have been incredibly supportive and helpful and make themselves available to me in several ways. I expect I will learn a lot from working with them and am feeling really thankful for that. We use various tools such as Slack, Zoom and email to stay in touch so I am feeling positive about having access to direction or support if and when I need it.
Well, that is about all I have to report at this time.
I wish everyone the best of luck going forward with their projects.

Julaine Clunis

News & Events

CCI Presents at ICHI 2019 in Xi’an, China

Drexel CCI participated in the 7th IEEE International Conference on Healthcare Informatics (ICHI 2019) in Xi’an, China, from June 10-13th. CCI professor Chris Yang served as the general co-chair and panelist for the conference.

PhD students Ou Stella Liang and Michal Monselise presented their full paper, “Identifying Important Risk Factors Associated with Vehicle Injuries using Driving Behavior Data and Predictive Analytics.” The paper was co-authored with Chris Yang. Ou Stella also presented a data analytics challenges paper co-authored with Ali Jazayeri and Chris Yang, entitled “Interpatient Similarity-based Imputation of Missing Data in Electronic Health Records.”

Ou Stella participated in the doctoral consortium with her presentation, “Determining Safe Prescription Practices for Pregnant Women.”

 

LEADS Blog

Week 1: Bridget Disney blog entry

LEADS: Getting Started
Bridget Disney, California Digital Library
My LEADS project is at the California Digital Library (CDL), working with mentor John Kunze and fellow participant Hanlin Zhang. On June 8th, the LEADS fellows attended a three-day data science boot camp in Philadelphia. It was a great opportunity to meet the LEADS staff and the other students. What an amazing group! I’m sure that we will learn from each other and collaborate on projects in the future. We learned a lot from the professors, who introduced us to the basic concepts of data science in some depth. It was helpful to have a complete overview of everything from metadata to text processing to visualization.

 

LEADS-4-NDP Data Science Boot Camp
At the CDL, I’ll be working on YAMZ (http://yamz.net), which stands for Yet Another Metadata Zoo. The web site bills itself as “A crowdsourced metadata dictionary. Search for terms, upvote useful ones.” This platform is used by those developing and sharing controlled vocabularies. The software is written in Python and uses a PostgreSQL database.
I spent the first week hopelessly trying to feel my way around and setting up the environment for YAMZ. I have never used Python and am excited to get the chance to learn it. It looks like there are two choices of operating system for this project: Mac and Ubuntu, a Unix-like operating system that can run on a desktop. I elected to give the Mac a try. I started using a Macintosh two years ago, just to see how it worked, and now I love it so much that there’s no turning back! However, while installing the components, I have run into a few obstacles. Hopefully, I’ll be able to work through those.
Perusing the documentation, I see there is an article about scoring of metadictionary terms (Patton, 2014, Community-based scoring of metadictionary terms) that might be helpful. Also, Hanlin sent me a link to get me started with GitHub (https://help.github.com/en/articles/connecting-to-github-with-ssh). So now I have some reading to do!
LEADS Blog

Week 1: Sonia Pascua, I am a LEADS-4-NDP 2019 Fellow

LEADS site: Digital Scholarship Center
Project title: SKOS of the 1910 Library of Congress Subject Heading
I am privileged to be one of the LEADS-4-NDP fellows for this year’s grant. My placement is with the Digital Scholarship Center of Temple University, and my mentor is Peter Logan. Currently, we are at the project proposal stage and establishing a proof of concept. We are also planning a paper as one of our outputs, which we aim to submit to a conference such as NKOS or Dublin Core.
As a fellow, I took part in the recent 3-day data science boot camp held at our university, Drexel University. As I posted on LinkedIn, I was really excited to learn and to meet my co-fellows at this boot camp. The days went by quickly for such a great endeavor. Nonetheless, I can give a good account of my experience at the boot camp.
Day 1 was packed with lectures and with getting to know our co-fellows and our respective projects. Our ice breaker was fantastic. It gave us the opportunity to get to know the participants in a fun way: each of us asked a partner a couple of questions and then presented what we had found to everyone in the room. It revealed exciting facts about the co-fellows and broke down the stiffness among us. From that moment on I felt comfortable with everyone.
Lectures on Intro to Data Science by Prof. Erjia and Big Data Management by Prof. Il-Yeol, both from CCI, were inspiring, especially when they shared their own understanding of the concepts. I liked how Prof. Erjia started with “A hundred people will have a hundred definitions of data science (DS)…”, which explained why the DS field is treated so differently by different people. I also liked how he drilled into the multidisciplinary skills needed by a modern data scientist and coached us to pick one skill and be good at it. It would be hard to work on all four skill sets (Mathematics and Statistics, Programming and Databases, Domain Knowledge and Soft Skills, Communications and Visualizations) and be a jack of all trades; that may leave you a master of none, which is not fruitful for a career. As an academic researcher, it’s advisable to excel at one skill and be a good member of a team in a DS endeavor. I appreciated Prof. Erjia’s list of biases, which I believe, if understood, could be key to overcoming challenges encountered in DS.
On another note, Prof. Il-Yeol gave a comprehensive account of what has happened over time in the database field. His story of “Old SQL to NoSQL to New SQL” was awesome. It provided an understanding of what we have now. It was also great to have what I have been teaching validated, hearing about databases from an “antiqua” person. Don’t get me wrong: for me, the term “antiqua” is full of respect and admiration. In my 10 years of teaching databases, there are only a handful of people whom I regard as knowing the heart and soul of databases, and Il-Yeol is one of them.
The data science talk by one of the mentors, Dr. Jean Godby, a senior research scientist at OCLC, was precious. She offered a good perspective for understanding the challenges and promises of data science.
That day ended with our group dinner at Han Dynasty. We were joined by the Department Head of CCI at Drexel University, Dr. Xia, as well as Dr. Michelle Rogers and Dr. Peter Logan, one of the mentors of the LEADS-4-NDP project and the director of the Digital Scholarship Center, which is my placement.
Days 2 and 3 were further stretches of lectures, together with a workshop in R. We got our hands dirty with coding and built our technical skills in the basics of R. Topics ranged from data pre-processing, data visualization and visual analytics, and data mining and machine learning II to text processing and a mini-workshop on BigML, a code-free tool for automated data analytics. Dr. Richard Marciano gave a short data science talk and presented the projects that he and the Digital Curation Innovation Center (DCIC) are working on. Additionally, Dr. Jane Greenberg delivered her presentation on metadata, data quality, and metadata integration.
I will miss the fellows. We did not have much time to really get to know each other, but at heart they are colleagues and cohort members whom I can work with on this research journey of my life. I wish us all success in our projects. I am looking forward to our virtual meetings, because we’re all working over the summer but from different states. How I wish we had had time for bonding and trips.
News & Events

MRC Hosts NASKO 2019

The Metadata Research Center hosted the North American Symposium on Knowledge Organization (NASKO 2019) from June 13-14.

NASKO 2019 Participants: MRC’s Sam Grabus, Jane Greenberg, Sonia Pascua, and Deborah Garwood.

MRC PhD student Sam Grabus presented her paper, “Representing Aboutness: Automatically Indexing 19th-Century Encyclopedia Britannica Entries.” The presentation discussed topic relevance revaluation for automatic indexing results, evaluating which of three keyword extraction algorithms produces the most relevant results for the digital collection.

Sam Grabus presenting at NASKO 2019
LEADS Blog

Week 1: Kai Li: How did I get here?

I would like to imagine that I’ve had quite a “weird” career path. After getting an undergraduate degree in history, I became a library cataloger in a public library in China. Then, because of my love for librarianship, I came to the US to get a Master’s degree in Library and Information Science and then this PhD degree in Information Science. After starting the PhD, I gradually became aware of the dichotomy between being a professional librarian and being a researcher. I think a major difference is one’s epistemological stance: being a PhD means that you should be critical of all ideologies, including those embedded in your own business.

Long story short, all these seemingly not-so-related experiences converged in my LEADS-4-NDP project, “Automatic Identification of Publisher Entities to Support Discovery and Navigation,” which is sponsored by OCLC and uses data science methods to disambiguate publisher entities recorded in the publication statements of library bibliographic metadata.

Interestingly enough, this project is not a totally new idea for me either. When I was still working at Ingram Content Group in 2014 (also as a cataloger) and was about to start my PhD program, Mrs. Cecilia Preston talked to me about this idea. That was a time when VIAF.org and ISNI were still relatively new projects and “entitization” (or name disambiguation) was a major interest in the library cataloging communities. In general terms, this has been a problem for library cataloging for many years because publisher names are only transcribed into unstandardized text strings, thus preventing the library data from being used in other meaningful ways. This argument, of course, was made in Mr. Roy Tennant’s very famous article, “MARC Must Die.”

I am very glad to get some updated knowledge about this movement from Dr. Jean Godby, my supervisor in this summer project. The entitization of publishers is still a major task faced by library cataloging communities because in the BIBFRAME (Bibliographic Framework) model (one that is to replace the MARC format), the publisher is treated as an entity. To be an entity, all publishers must be freed from the text strings, disambiguated, and assigned their own identifiers.
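As a rough illustration of what freeing publishers from text strings involves, the sketch below normalizes transcribed publisher statements and groups the variants that collapse to the same key. It is a minimal sketch under assumptions, not the project's approach: the sample strings are invented, the `NOISE_WORDS` list and the helper names are hypothetical, and simple normalization is only a first step before real disambiguation.

```python
import re
from collections import defaultdict

# Words that often differ between transcriptions of the same publisher name.
# This list is an illustrative assumption, not a curated authority file.
NOISE_WORDS = {"inc", "ltd", "co", "company", "the", "and",
               "publishers", "publishing"}


def publisher_key(statement):
    """Reduce a transcribed publisher string to a crude comparison key."""
    text = statement.lower().replace("&", " and ")
    words = re.findall(r"[a-z0-9]+", text)
    return " ".join(word for word in words if word not in NOISE_WORDS)


def group_publishers(statements):
    """Group transcribed strings that share the same comparison key."""
    groups = defaultdict(list)
    for statement in statements:
        groups[publisher_key(statement)].append(statement)
    return dict(groups)


# Invented sample strings, written to resemble publication statements.
samples = [
    "Harper & Row, Publishers",
    "Harper and Row",
    "HARPER & ROW, INC.",
    "Oxford University Press",
    "Oxford Univ. Press",
]

groups = group_publishers(samples)
for key, variants in groups.items():
    print(key, "->", variants)
```

The three Harper variants fall under one key, while the abbreviated "Univ." form stays separate from "University". That gap is the kind of case where string cleanup alone is not enough and further matching against known entities is needed before assigning identifiers.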


 

So this is why I am here. I was super excited to read the project’s description when I decided to apply for the LEADS grant. And I am still super excited to spend the summer immersing myself in library bibliographic data to figure out how to extract and disambiguate publishers in the most effective way. This, I hope, will play a small role in making library data more useful to all its “users.”

News & Events

MRC Co-Sponsors NASKO this week: NASKO Highlights

The Metadata Research Center is co-hosting the North American Symposium on Knowledge Organization (NASKO 2019) from June 13-14th, at the College of Computing and Informatics.

Howard White: “On Patrick Wilson”

Professor Emeritus and Visiting Research Professor Howard White will deliver a special presentation at NASKO, titled “On Patrick Wilson.” Read more about Howard here.


News & Events

Metadata Mixer: “Metadata Madness”

TOPIC: Metadata Madness – accomplishments for the year, and/or goals for the summer.
Presenters: CCI PhD students, Cecilia Preston
Date: Wednesday, June 12th
Time: 12:30-1:30 PM
Location: 3675 Market Street, University City Science Center, CCI’s new location
Room: Dean’s conference room, #1039 (10th floor)

ADDED FUN: A visit to the Metadata Research Center, now residing on the 11th floor of 3675 Market Street, joining AI (artificial intelligence) and data science [This is for guests outside CCI who may attend].

News & Events

LEADS-4-NDP 2019 Data Science Boot Camp

The LEADS-4-NDP 2019 fellowship program kicked off this week with a 3-day data science boot camp at Drexel University’s College of Computing and Informatics. Eleven fellows from iSchools across the U.S. are paired with nine National Digital Platform partner sites for 10-week remote internships to address data science challenges.

LEADS-4-NDP 2019 cohort
The 2019 LEADS cohort, joined by CCI’s Dr. Il-Yeol Song, Dr. Jane Greenberg, OCLC’s Jean Godby, and Project Manager Sam Grabus

Boot camp sessions included big data management; metadata; data pre-processing; data visualization; data mining and machine learning; large-scale and parallel computing; and automated data analytics tools. As part of the boot camp, LEADS mentors Jean Godby of OCLC and Richard Marciano of DCIC spoke about data science opportunities at their institutions, and LEADS mentors Steven Dilliplane, Academy of Natural Sciences, and Peter Logan, Temple University’s Digital Scholarship Center, participated in boot camp activities.

Read more about the LEADS program HERE.