In week 2, I focused on refining the visualizations I made in week 1 to better understand one of the three large datasets we have in the project so far. Thanks to the visualizations, I now have some sense of the information-seeking behaviors of users who search and download information through institutional repositories (IRs), including the devices they use, how device use differs by geolocation, when they search, and the factors affecting their clicks and clickthroughs.
To improve the aesthetics of the visualizations, I paid attention to color contrast, graphic resolution, color ramps, color transparency, shapes, and the scales of the x and y axes. To enhance readability, I tried not to present too much information in a single visual, following Miller’s law (“The Magical Number Seven, Plus or Minus Two”), so that people will not feel overwhelmed when looking at the visual and processing the information.
Besides visualizing the information that struck me as interesting in the first dataset, I also tried to wrangle the other datasets. Nikolaus managed to harvest the metadata relevant to each URL, which means we can look into the metadata content related to each search. However, it also creates a challenge for me: how to turn unstructured string data into structured data. This is not something I do often, but I am excited to brush up my skills in working with text data in the coming weeks.
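To make the string-to-structure challenge a little more concrete, here is a minimal sketch, written in Python for illustration rather than R. It assumes a purely hypothetical layout in which each harvested record is one string of "name: value" pairs separated by semicolons; the real harvested metadata may look quite different, and the function name parse_metadata and the sample field names are made up for this example.

```python
from typing import Dict, List


def parse_metadata(raw: str) -> Dict[str, List[str]]:
    """Turn one 'name: value; name: value' string into a dict of lists.

    The 'name: value;' layout is an assumption made for this sketch,
    not the actual format of the harvested metadata.
    """
    fields: Dict[str, List[str]] = {}
    for part in raw.split(";"):
        # partition splits on the first colon only, so colons inside
        # a value (for example in a subtitle) are preserved.
        name, sep, value = part.partition(":")
        if not sep:
            continue  # skip fragments that have no "name:" prefix
        name, value = name.strip().lower(), value.strip()
        if name and value:
            # Repeated fields (e.g. several creators) are collected in a list.
            fields.setdefault(name, []).append(value)
    return fields


if __name__ == "__main__":
    record = "Title: Sample thesis; Creator: A. Author; Creator: B. Author"
    print(parse_metadata(record))
```

Keeping every field as a list means repeated fields do not overwrite each other, which seems a safer default until the actual structure of the metadata is better understood.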
Minh Pham
LEADS site: Repository Analytics & Metrics Portal
Week 1: Exploring the data
My placement is with the Repository Analytics & Metrics Portal (RAMP) project at Montana State University. Nikolaus, another LEADS fellow on the same project, provided a nice overview of the project. Thanks, Nikolaus!
Before the bootcamp, Nikolaus and I had an online meeting with our mentor, Dr. Kenning Arlitsch, and other members of the project. They helped us understand more about the project and familiarized us with the data collected from the RAMP service. Thanks to the bootcamp, I came home filled with new knowledge about library science in general and metadata in particular, as well as new techniques in database management, visualization, and analysis with text mining and machine learning methods.
For week 1, I focused on exploring the data by doing descriptive analysis and creating crude visualizations. RAMP data comes from over 50 IRs and consists of over 400 million rows. Due to the amount of data and the memory constraints of my laptop, R takes anywhere from a couple of minutes to hours to run a command or knit the document. I looked into the option of working with RStudio Cloud, but the current version does not let us upload and work with data as big as RAMP. For now, I have to handle the results generated in R the old-school way: copying and pasting them one by one into a Word document rather than knitting all the results into a single document with an R notebook or R Markdown.
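One general way around this kind of memory limit is to stream the data row by row and keep only running totals, instead of loading everything at once. The sketch below shows the idea in Python for illustration rather than R. The column names device and clicks and the function name clicks_by_device are assumptions made for the example, not the actual RAMP schema.

```python
import csv
from collections import Counter
from typing import Iterable


def clicks_by_device(lines: Iterable[str]) -> Counter:
    """Sum clicks per device while holding only one row in memory at a time.

    `lines` can be any iterable of CSV lines, including an open file,
    so the full dataset never has to fit in memory.
    The 'device' and 'clicks' columns are hypothetical.
    """
    totals: Counter = Counter()
    for record in csv.DictReader(lines):
        totals[record["device"]] += int(record["clicks"])
    return totals


if __name__ == "__main__":
    # A tiny in-memory sample standing in for a very large file.
    sample = [
        "device,clicks",
        "mobile,3",
        "desktop,5",
        "mobile,2",
    ]
    print(clicks_by_device(sample))
```

With a real file, the same function could be called as clicks_by_device(open(path, newline="")), where path is a placeholder for wherever the data lives. The trade-off is that only summaries that can be built up incrementally (counts, sums, minima and maxima) fit this pattern.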
My plan for the 2nd week is to refine the visualizations for aesthetics and readability, and to merge RAMP data with other data to explore the research possibilities it offers.
Minh Pham
LEADS site: Repository Analytics & Metrics Portal