LEADS Blog

Week 3: Metadata – data about data

 

LEADS site: Repository Analytics & Metrics Portal

 

In the third week, I worked on downloading metadata from the institutional repositories. We had already prepared a script to download metadata for the RAMP dataset we want to analyze. However, because each new period brings requests for different documents, we must download metadata for every unique URL requested in order to have complete coverage.
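A minimal sketch of that step (the file names, the "url" column, and the use of a plain HTTP GET per item are all my assumptions here; the real script may work differently):

    library(readr)
    library(httr)

    ramp      <- read_csv("ramp_page_clicks.csv")     # new RAMP period (hypothetical file name)
    harvested <- read_csv("harvested_metadata.csv")   # metadata downloaded so far (hypothetical)

    # only the URLs requested in this period that we have not harvested yet
    todo <- setdiff(unique(ramp$url), unique(harvested$url))

    fetch_one <- function(u) {
      resp <- GET(u)
      if (status_code(resp) == 200) content(resp, as = "text", encoding = "UTF-8") else NA_character_
    }

    new_metadata <- vapply(todo, fetch_one, character(1))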
 
Besides gathering the metadata, I also did some analysis of the metadata across the Institutional Repositories. From my observation, certain metadata terms are used by almost every Institutional Repository, while other terms are unique to only a few IRs.
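As an illustration of that comparison (a sketch; the metadata_terms table with one row per IR/term pair is a made-up stand-in for the harvested metadata):

    library(dplyr)

    metadata_terms <- tibble::tribble(
      ~ir,    ~term,
      "IR_A", "dc.title",
      "IR_A", "dc.creator",
      "IR_B", "dc.title",
      "IR_B", "thesis.degree.name"
    )

    term_usage <- metadata_terms %>%
      distinct(ir, term) %>%
      count(term, name = "n_irs")                  # in how many IRs does each term appear?

    n_irs_total  <- n_distinct(metadata_terms$ir)
    common_terms <- filter(term_usage, n_irs == n_irs_total)   # shared by every IR
    rare_terms   <- filter(term_usage, n_irs == 1)             # used by a single IR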
 
At the weekly meeting, we gathered some ideas that we want to focus on, and we will tackle these research questions under the supervision of Prof. Arlitsch and Jonathan in the upcoming weeks.
 
Nikolaus Parulian
LEADS Blog

Minh Pham, Week 2: Mapping data out with aesthetics and readability

 

In week 2, I focused on refining the visualizations I made in week 1 to better visualize and understand one dataset among the three large datasets (so far) we have in the project. Thanks to the visualizations, I have some sense of the information-seeking behaviors of users who use institutional repositories (IRs) to search for and download information, including the devices used, device differences by geolocation, time of search, and factors affecting their clicks and clickthroughs.

 

To improve the aesthetics of the visualizations, I paid attention to color contrast, graphic resolution, the color ramp, color transparency, shapes, and the scales of the x and y axes. To enhance readability, I tried not to present too much information in one visual, following Miller's law of "The Magical Number Seven, Plus or Minus Two," to make sure that people will not feel overwhelmed when looking at the visual and processing the information.
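A typical tweak looked roughly like the following sketch (the ramp data frame and the position, clicks, and device columns are placeholders for the actual fields):

    library(ggplot2)

    ramp <- readr::read_csv("ramp.csv")         # hypothetical path to the RAMP export

    ggplot(ramp, aes(x = position, y = clicks, colour = device)) +
      geom_point(alpha = 0.3, size = 1) +       # transparency reduces overplotting
      scale_colour_brewer(palette = "Dark2") +  # higher-contrast qualitative color ramp
      scale_x_log10() +                         # rescale the skewed axes
      scale_y_log10() +
      theme_minimal(base_size = 12)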

 

Besides visualizing the information that struck me as interesting in the first dataset, I also tried to wrangle the other datasets. Nikolaus managed to harvest the metadata relevant to each URL, which means we can look into the metadata content related to each search. However, it also creates a challenge for me: how to turn unstructured string data into structured data. This is not something I often do, but I am excited to brush up my skills in working with text data in the coming weeks.
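A first, hypothetical idea of how that restructuring could look (the "key: value" layout of the metadata string below is an assumption, not the actual format):

    library(dplyr)
    library(tidyr)
    library(stringr)

    raw <- tibble(
      url      = "https://ir.example.edu/item/1",  # made-up record
      metadata = "dc.title: Sample thesis\ndc.creator: Doe, Jane\ndc.type: Thesis"
    )

    structured <- raw %>%
      separate_rows(metadata, sep = "\n") %>%
      separate(metadata, into = c("term", "value"), sep = ":", extra = "merge") %>%
      mutate(across(c(term, value), str_trim))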

 

Minh Pham



LEADS Blog

Week 2: Understanding the limitations of the data – What we can't do

LEADS site: Repository Analytics & Metrics Portal

 

 

After developing some visualizations to understand the relationships between columns in the RAMP dataset, we had a follow-up meeting to discuss the results.
The visualizations I discussed at the meeting focus on aggregations of the categorical values in the RAMP dataset, including the number of visits for each index and each domain name (URL), the number of visits for citable versus non-citable content, and the number of visits by user device, along with histograms for position, clicks, and clickThrough.
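In code, these aggregations boil down to something like the following sketch (the device, citableContent, and clickThrough column names are my guesses at the RAMP field names):

    library(dplyr)
    library(ggplot2)

    ramp <- readr::read_csv("ramp.csv")        # hypothetical path to the RAMP export

    visits_by_device  <- ramp %>% count(device, name = "visits", sort = TRUE)
    visits_by_citable <- ramp %>% count(citableContent, name = "visits")

    ggplot(ramp, aes(x = clickThrough)) +
      geom_histogram(bins = 50)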
In the meeting, we also discussed the possibility of incorporating external data such as metadata for each index. One of our mentors, Jonathan, has been trying to merge metadata into the older RAMP dataset period (2018), and we can also extract metadata from the new dataset that we want to focus on analyzing.
What I will do next for this dataset is extract the metadata to make the data richer, so we can understand more about the behavior of the users through the metadata and form a research question that we want to focus on for the RAMP dataset.
Nikolaus Parulian

 

LEADS Blog

Minh Pham, Week 1 – Exploring the data

 

Week 1: Exploring the data

My placement is with the Repository Analytics & Metrics Portal (RAMP) project at Montana State University. Nikolaus – another LEADS fellow on the same project as me – provided a nice overview of the project. Thanks, Nikolaus!

 

Before the bootcamp, Nikolaus and I had an online meeting with our mentor – Dr. Kenning Arlitsch – and other members of the project. Dr. Arlitsch and the other project members helped us understand more about the project and familiarized us with the data collected from the RAMP service. Thanks to the bootcamp, I came home filled with new knowledge about library science in general and metadata in particular, as well as new techniques in database management, visualization, and analysis with text mining and machine learning methods.

 

For week 1, I focused on exploring the data by doing descriptive analysis and creating crude visualizations from the data. RAMP data comes from over 50 IRs and consists of over 400 million rows. Due to the amount of data and the memory constraints of my laptop, it takes R anywhere from a couple of minutes to hours to run a command or knit the document. I looked into the option of working with RStudio Cloud, but the current version of RStudio Cloud does not let us upload and work with data as big as RAMP's. For now, I have to handle the results generated from R the old-school way: copying and pasting them one by one into a Word document rather than knitting all the results into a single document with an R notebook or R Markdown.
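One possible way to ease the memory pressure, sketched here rather than something I have actually run, is to aggregate the file chunk by chunk instead of loading it whole (the file name and the device column are assumptions):

    library(readr)
    library(dplyr)

    chunk_counts <- read_csv_chunked(
      "ramp.csv",
      callback   = DataFrameCallback$new(function(chunk, pos) count(chunk, device)),
      chunk_size = 1e6,
      col_types  = cols_only(device = col_character())
    )

    # each chunk contributes partial counts; sum them for the full-file totals
    device_totals <- chunk_counts %>%
      group_by(device) %>%
      summarise(visits = sum(n), .groups = "drop")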

 

My plan for the 2nd week is to refine the visualizations for aesthetics and readability and to merge the RAMP data with other data to explore research possibilities from the RAMP data.

 

Minh Pham



LEADS Blog

Nikolaus Parulian, Week 1: Exploratory Data Analysis – What can we do to understand the data?

LEADS site: Repository Analytics & Metrics Portal

 
 
After getting some ideas about data science, data analytics, and data visualization at the boot camp (Sonia already posted an excellent review of what we learned there), I started working on the Repository Analytics and Metrics Portal (RAMP) dataset provided by my mentors.
The Repository Analytics & Metrics Portal (RAMP) is a web service that improves the accuracy of institutional repository (IR) analytics.
RAMP provides a persistent and accurate count of file downloads from IRs, and it has great potential for aggregating and comparing IR metrics across the organizations that join the project.
 
The first thing I did with the dataset was to understand the data through exploratory data analysis. The RAMP dataset I am working on is derived from the Google Search Console and contains page_clicks, URL, average_positions, and impressions, merged with additional data that RAMP provides. I visualized and aggregated most of the categorical columns in the dataset and computed the correlations between the numerical columns. Besides that, I also computed summary statistics to see whether there are outliers in the dataset.
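The core of that first pass is only a few lines (a sketch; I use the placeholder column names clicks, impressions, position, and clickThrough, which may not match the actual RAMP fields exactly):

    library(dplyr)

    ramp <- readr::read_csv("ramp.csv")        # hypothetical path to the RAMP export

    numeric_cols <- ramp %>% select(clicks, impressions, position, clickThrough)

    cor(numeric_cols, use = "pairwise.complete.obs")   # correlations between numerical columns
    summary(numeric_cols)                              # quartiles and extremes hint at outliers

    ramp %>% count(device, sort = TRUE)                # aggregate one categorical column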
 
In the end, I found some interesting results through the visualization and correlation analysis, and we will discuss the findings at the meeting in the second week.
 
Overall, this RAMP project is pretty exciting and has a lot of potential. I am excited to continue working on this project further.
 
 
Nikolaus Parulian