LEADS Blog

Bridget Disney, California Digital Library – YAMZ

We have been duplicating our local YAMZ setup on the Amazon AWS server. The process is similar (kind of), and we’ve come across and worked through some major glitches along the way.
One challenge has been setting up the database. First we had to figure out where PostgreSQL was installed. The address is specified in the code, but the database had moved to a different location on the new server. The code goes through several steps to determine which database to use (local or remote), and the rules have changed on the new system. Because of that, we have had to figure out our new environments and our permissions, documenting the process as we go along. We’ve set up a markdown file in GitHub which will be the final destination for our process documentation, but in the meantime, we have been making entries in a Google Docs file as we work through the AWS installation.
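To make the "local or remote" decision concrete, here is a minimal sketch of how such a choice is often written in Python. It is not taken from the YAMZ code: the `DATABASE_URL` variable name, the fallback address, and the `choose_database_url` helper are all assumptions made for illustration.

```python
import os

# Hypothetical fallback for a local development database; not from the YAMZ code.
LOCAL_DATABASE_URL = "postgresql://localhost:5432/yamz"


def choose_database_url(environ=None):
    """Return the remote database URL if one is configured, else the local one."""
    env = os.environ if environ is None else environ
    # DATABASE_URL is an assumed variable name, used here only for illustration.
    remote = env.get("DATABASE_URL", "").strip()
    return remote or LOCAL_DATABASE_URL


if __name__ == "__main__":
    print(choose_database_url())
```

Keeping the rule in one small function like this makes it easier to document, which matters when the rules differ between the old and the new server.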
Finally, we used pg_dump/pg_restore to move the data from the old PostgreSQL database to the new one, so now we have over 2500 records and a functioning website on Amazon AWS! This has been a long time coming, but it has helped me see the purpose of the whole project, which is to allow people to enter terms and then collaborate to determine which of those terms will become standard in different environments. For this to happen, the system will have to be used frequently and consistently over time.
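For the documentation, the dump-and-restore step could be captured in a short script along these lines. This is a sketch under assumptions, not the exact commands we ran: the database names and the `yamz.dump` file name are placeholders, and connection options such as host and user are left out.

```python
import subprocess


def dump_command(source_db, dump_file):
    # -Fc writes pg_dump's custom archive format, which pg_restore can read.
    return ["pg_dump", "-Fc", "-f", dump_file, source_db]


def restore_command(target_db, dump_file):
    # --no-owner skips restoring object ownership, which helps when the
    # database roles differ between the old and the new server.
    return ["pg_restore", "--no-owner", "-d", target_db, dump_file]


def migrate(source_db, target_db, dump_file="yamz.dump"):
    """Dump source_db to a file, then load that file into target_db."""
    subprocess.run(dump_command(source_db, dump_file), check=True)
    subprocess.run(restore_command(target_db, dump_file), check=True)
```

Calling `migrate("yamz_old", "yamz_new")` would run both steps and stop with an error if either command fails; both database names here are made up.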
I still have some concerns. Did we document the process correctly? It does not seem feasible to wipe everything out and reinstall it to make sure. Also, we still haven’t worked out the process that should be used for checking out code to make changes. 
It’s been a productive summer and we’ve learned a lot, but I feel we are running out of time before completing our mission. Starting and stopping, summer to summer, without continuous focus can be detrimental to projects. This is not the first time I’ve encountered this as it seems to be prevalent in academic life.
So, in summary, I see two challenges to library/data science projects:
  1. Bridging the gap between librarians and computer science knowledge
  2. Maintaining the continuity of ongoing projects

California Digital Library

California Digital Library – YAMZ
Bridget Disney
We are making slow and steady progress on YAMZ (pronounced yams). My task this week has been to import data into my local instance. I began by trying to import the data manually into PostgreSQL but got stuck even though I tried a few different methods I had found using Google.
This is where the advice of someone experienced comes in handy. In our Zoom meeting last week with John (mentor), Dillon (previous intern), and Hanlin, it became evident that I should have been using the import function that was available in YAMZ. Finally, progress could be made. I hammered out some fixes that allowed the data to be imported, but it wasn’t elegant. Another meeting with John shed light on the correct way to do it.
YAMZ uses four PostgreSQL tables: users, terms, comments, and tracking. We had errors during the import because the ‘terms’ data references a foreign key from the ‘users’ table, so the ‘users’ table must be imported first. There were still other errors, and we ended up importing only 43 records into the ‘terms’ table. There should have been about 2700! John will be providing us with another set of exported JSON files. The first one only had 252 records. He also showed us some nifty Unix tricks for finding and replacing data.
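The "import the referenced table first" rule can be worked out mechanically. The sketch below orders tables so that each one comes after the tables it references. Only the terms-to-users reference is something we actually hit; the other dependencies in `YAMZ_TABLES` are assumptions for illustration, and `import_order` is a hypothetical helper, not part of YAMZ.

```python
def import_order(depends_on):
    """Order tables so each one comes after the tables it references."""
    ordered = []
    visiting = set()

    def visit(table):
        if table in ordered:
            return
        if table in visiting:
            raise ValueError("circular foreign-key dependency at " + table)
        visiting.add(table)
        # Import every referenced (parent) table before this one.
        for parent in depends_on.get(table, []):
            visit(parent)
        visiting.discard(table)
        ordered.append(table)

    for table in depends_on:
        visit(table)
    return ordered


# Only the terms -> users reference is stated in this post; the other
# entries are assumptions made for illustration.
YAMZ_TABLES = {
    "users": [],
    "terms": ["users"],
    "comments": ["users", "terms"],
    "tracking": ["users", "terms"],
}

if __name__ == "__main__":
    print(import_order(YAMZ_TABLES))
```

Importing the JSON files in the order this returns avoids foreign-key errors caused purely by ordering, though it would not fix records that reference users missing from the export.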
 

 

On the server side, both Hanlin and I have been able to access the production site on AWS. We are going to try to figure out how to get that running this week.
 

LEADS Blog #3 Setup a `virtualenv` for yamz!

 


Hanlin Zhang

July 9th, 2019

 

This week I solved a Google OAuth login problem caused by incompatible Python environments. Multiple versions of Python are often installed on the same machine; for example, I have Python 2.7.10 (which comes with macOS), Python 2.7.16 (Anaconda), and Python 3.7.1 (Anaconda) on my laptop, which may create compatibility issues. In our case, we know yamz requires Python 2, but the real problem is that there are different versions of Python 2, and unexpected errors may occur if the program is installed on a “wrong” Python setup. The good news is that Bridget is able to run yamz successfully with the following configuration:

 

Python 2.7.10 on macOS Mojave 10.14.5

 

However, I was unable to reproduce the same result at first, since the program kept throwing an error message. I did the initial debugging with help from Bridget, but I was still unable to solve the problem until John Kunze, our LEADS mentor, suggested isolating the Python environment with `virtualenv`. John suspects the error was caused by running yamz on an Anaconda distribution of Python:

 

Python 2.7.16 (Anaconda) on macOS Mojave 10.14.5

 

which keeps fighting against the system’s default one. However, this can be solved by using a Python package called `virtualenv`. According to its documentation (see https://virtualenv.pypa.io/en/latest/), `virtualenv` is able to “create isolated Python environments”: it takes a specified version of Python from my laptop and builds a virtual environment to run the program in, which is much like running a virtual machine for Python on my laptop.

Luckily, `virtualenv` has solved the problem and now I’m able to log in! Further, I’m now able to isolate the Python environment, which allows me to investigate the impact of Python versions on installing yamz. I’m going to explore installing yamz on several different Python versions. Since Anaconda distributions are so common right now, I think it might be worth testing Anaconda Python and putting the result in the new readme file. I’m curious whether the login problem was caused by Anaconda Python itself or by a conflict between the default version of Python on my laptop and the Anaconda distribution I installed later.
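When several Pythons are installed, it helps to confirm which interpreter a program is actually running under before blaming the code. Here is a small sketch that should work on both Python 2 and Python 3; the `in_virtual_environment` helper is my own name, not something from yamz or `virtualenv`.

```python
from __future__ import print_function

import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases (the kind typically used with Python 2)
    # set sys.real_prefix inside an environment.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases set sys.base_prefix instead.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("version:", sys.version.split()[0])
    print("virtual environment:", in_virtual_environment())
```

Running this once with the system Python, once with Anaconda, and once inside the activated environment shows at a glance which setup each test used.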

 

To learn more about `virtualenv`:

  • Virtualenv and why you should use virtual environments

https://www.youtube.com/watch?v=N5vscPTWKOk&t=139s

  • Working Effectively with Python Virtual Environments (Virtualenv)

https://www.youtube.com/watch?v=8KWVEc6vFgA&t=53s

 


LEADS Blog #2 Deploying yamz on my machine!

 


Hanlin Zhang

July 3rd, 2019

 

Last week was a tough week for me. I had been working closely with Bridget and John to set up a local yamz environment on my machine. Both John and Bridget are super helpful and very experienced in software development and problem-solving. Ever since I started reading the readme document, I had a question for John: what does the ‘xxx’ mark in the readme file stand for? I noticed a lot of ‘xxx’ marks in the readme of yamz.net (https://github.com/vphill/yamz); for instance, a couple of blocks start with ‘xxx’, such as:

 

xxx do this in a separate “local_deploy” dir?

xxx user = reader?

 

I was really interested in what those lines mean. Based on my experience with yamz, most of the lines starting with ‘xxx’ are pretty useful and definitely worth reading first. John said that in the world of software development, an ‘xxx’ mark stands for a problem waiting to be solved or a comment so critical that it should be paid attention to immediately. It seems my intuition was right, but the marks are also confusing to people without development experience. We are going to rewrite the readme file this summer to make it more reader-friendly. Meanwhile, I’m still debugging an error I’ve encountered while developing:

 

flask_oauth.OAuthException

 

OAuth

 

According to Margaret Rouse (see the link below), OAuth “allows an end user’s account information to be used by third-party services, such as Facebook, without exposing the user’s password”. The central idea of OAuth is to reduce the number of times a password is required to establish an identity, and instead to ask trusted parties to issue access tokens, for both security and convenience. But it also raises the question of how far we trust Google, Facebook, Twitter, and others as gatekeepers for our personal identity. What is the price we are paying to use their service in lieu of money? Will it stop at ‘we run ads’?

 

To read more:

  • What does XXX mean in a comment?

https://softwareengineering.stackexchange.com/questions/65467/what-does-xxx-mean-in-a-comment

  • OAuth

https://searchmicroservices.techtarget.com/definition/OAuth

 


California Digital Library

California Digital Library – YAMZ (Week 2)
Bridget Disney
This week, I’ve been learning more about YAMZ. Going through the install process has been tedious but I have (barely) achieved a working instance. I was able to start the web server and display YAMZ on my localhost, and learned a bit in the process, so that was exciting!    
My local instance looks different from the live site because I don’t have any data in my PostgreSQL database. Here’s where things get a little bit murky. To add a term, I have to log in to the system via Google. The login didn’t seem to be working, so I changed some code to make it work on my local installation. However, it could be that the login was only intended for use with the Heroku (not local) system, so what I really need to do is somehow bypass the login when it runs on my computer. So it’s back to the drawing board.
Even when I do login successfully, I am getting error messages – still working on those! These messages look like they might have something to do with one of the subsystems that YAMZ uses.    
After going through all that, Hanlin and I had a very useful Zoom session with John Kunze, our mentor, and the plans have been adjusted slightly. The directions for using YAMZ are out of date because it’s been a few years and the versions of the software used have changed. Also, the free hosting server has limitations, so YAMZ needs to be moved from Heroku to Amazon’s AWS. As such, Hanlin and I are revising the directions in a Google doc to document the new process.
John is working to get us direct access to the CDL server which requires us to VPN into our respective universities and then connect to the YAMZ servers. When that is all set up, we will work through the challenge of figuring out how to proceed to move code from development to production environments.
In the meantime, looking through the code, I see there are also two Python components I need to get up to speed on: Flask (a micro web framework) and Django (a full-featured web framework).

Hanlin Zhang, LEADS Blog #1 Yamz Kickoff

 


June 23rd, 2019

 

This summer, I’m going to work with my mentor John Kunze from the California Digital Library (CDL) and another LEADS-4-NDP fellow, Bridget Disney (University of Missouri), to do some awesome metadata research! What Jane Greenberg, John Kunze, and other researchers in the area of metadata standards found problematic is that when a metadata standard is being discussed and created, people (mostly domain experts) spend a relatively large amount of time discussing and setting the standards, controlled vocabularies, etc., but have little or no time to test the actual performance of such a standard and then revise it.

 

YAMZ (Yet Another Metadata Zoo) creates a unique experience that is similar to Wikipedia and Stack Overflow in the sense that the community can co-edit and vote on a standard. Our first kickoff meeting with the LEADS-4-NDP site supervisor John was on Friday. We’ve learned that yamz.net is currently deployed on the free version of Heroku and is going to be transferred to the Amazon cloud services (AWS) this summer, and Bridget and I are going to be part of it. I’m very excited that we are going to be involved in this process, and I expect to learn a lot of cool stuff.

 

To read more about Yamz:

http://www.yamz.net/about

 

The goals for next week:

  • Rewrite the new readme and improve the readability

  • Figure out how to remotely connect to CDL, preferably through a Drexel University Network.

 

 
Hanlin Zhang

Week 1: Bridget Disney blog entry

LEADS: Getting Started
Bridget Disney, California Digital Library
My LEADS project is at the California Digital Library (CDL), working with mentor John Kunze and fellow participant Hanlin Zhang. On June 8th, the LEADS fellows attended a three-day data science bootcamp in Philadelphia. It was a great opportunity to meet the LEADS staff and the other students. What an amazing group! I’m sure that we will learn from each other and collaborate on projects in the future. We learned a lot from the professors who introduced us to the basic concepts (in some depth) of data science. It was helpful to have a complete overview of everything from metadata to text processing to visualization.

 

LEADS-4-NDP Data Science Boot Camp
At the CDL, I’ll be working on YAMZ (http://yamz.net), which stands for Yet Another Metadata Zoo. The web site’s tagline bills it as “A crowdsourced metadata dictionary. Search for terms, upvote useful ones.” This platform is used by those developing and sharing controlled vocabularies. The software is written in Python using a PostgreSQL database.
I spent the first week hopelessly trying to feel my way around and setting up the environment for YAMZ. I have never used Python and am excited to get the chance to learn it. It looks like there are two choices of operating system for this project: Mac and Ubuntu, a Unix-like operating system that can run on a desktop. I elected to give the Mac a try. I started using a Macintosh two years ago, just to see how it worked, and now I love it so much, there’s no turning back! However, while installing the components, I have run into a few obstacles. Hopefully, I’ll be able to work through those.
Perusing the documentation, I see there is an article about scoring of metadictionary terms (Patton, 2014, Community-based scoring of metadictionary terms) that might be helpful. Also, Hanlin sent me a link to get me started with GitHub (https://help.github.com/en/articles/connecting-to-github-with-ssh). So now I have some reading to do!