1 FAST Times in Digital Repository Metadata Remediation. Rachel Jaffe, Metadata Librarian, University of California, Santa Cruz. ALCTS CaMMS Faceted Subject Access Interest Group Meeting, 24 June 2017. Hi, my name is Rachel Jaffe and I am the metadata librarian at UC Santa Cruz. Today I'm going to walk you through our library's experimentation with FAST as a solution for streamlining and simplifying our use of controlled subject terms in our digital objects metadata.
2 Outline Background A whole new (brave?, scary?, contentious?, exciting?) world Entering life in the FAST lane (aka Workflow) Evaluation Conclusions Resources Image Credits Here's a quick outline; before we get into the nitty-gritty, I'd like to give you a bit more background.
3 Background: UCSC's Digital Collections 135,400+ Objects in CONTENTdm 21 collections Dublin Core Digital Objects Metadata Transition Project DAMS Assessment Project ILS migration … ? UCSC is currently curating over 135,000 digital objects in CONTENTdm. Over the last year we've implemented an instance of Sufia to handle our streaming audio-visual content. All of our digital objects metadata is in Dublin Core. Presently the Metadata Services department is engaged in a digital repository metadata remediation project in anticipation of a DAMS migration and the implementation of a shared discovery layer. This project aims to bring our legacy metadata into alignment with established standards and current best practices.
4 Background: What Our Data Looked Like. Here's a truncated snapshot of our legacy metadata. This spreadsheet contains data for a series of digitized images from the Lick Observatory Archive. Focusing on our subject data, you can see that there are a number of issues: foremost, there are multiple elements containing data that could be considered subject data; secondly, for some elements, there are multiple vocabularies in use; and lastly, if we were to scroll further down the spreadsheet, we would see that there are duplicative headings as well as correctly and incorrectly formulated pre-coordinated subject strings. It's important to note that much of this metadata was created between five and fifteen years ago by a library assistant who, despite her deep knowledge of the materials, had no formal training in subject analysis. As we embarked on our remediation project, we knew that our data needed tidying up, but we also knew that we didn't want it to lose its descriptive richness. Going forward, we would need to make subject analysis easier and more intuitive, especially as we're going to continue to have non-experts creating metadata, whether they're subject specialists, archival assistants or special collections librarians.
5 Controlled Vocabulary Reconciliation Mini-Project Goals Streamline metadata profiles Streamline use of controlled terms Get our metadata Linked Data ready Scope Not a recataloging project: i.e., no further subject analysis or reassessment of terms assigned Batch level changes In response to the issues observed, as part of the larger remediation project, we engaged in a controlled vocabulary reconciliation mini-project. We were very clear with ourselves that this was not a project to redo subject analysis; rather, we would work with the subject data we had, at the batch level.
6 A Whole New World, Some Old Data & Some Brave Librarians Getting from A to Streamlined: Relocate geographic terms recorded in Geographic Location (Coverage Spatial) to Subject Reconcile all assigned subject terms (AAT, Local, TGM, other) to LCSH Establish best practices to guide the creation and use of local subject terms Thinking big and thinking small: assigning subject terms to serve both a local and an international audience Linked Data, a discovery layer … whaaaat? In practical terms, how were we going to get from A to Streamlined? We had a few ideas: Relocate geographic terms recorded in Geographic Location (aka Dublin Core's Coverage Spatial element) to Subject. This decision is in keeping with a broader discussion we had concerning our use of the Geographic Location element and the lack of clarity around what kind of data should be recorded there. Reconcile all assigned subject terms to LCSH. Establish best practices to guide the creation and use of local subject terms. Instead of supplying non-standard or local terms, save for where absolutely necessary, we wanted to encourage and increase use of established controlled vocabularies. Thinking big and thinking small: we wanted to assign subject terms to serve both local and international audiences. For example, this led to the batch addition of the constant value "University of California. Santa Cruz," to nearly all our digital objects metadata. And our last task: How do we get our subject data Linked Data and discovery layer ready? Given that a system migration is imminent, we know that we're going to be deep-diving into new waters. The new DAMS and the new discovery layer will have different requirements of our data than our legacy systems did; these new systems will also enable our end users to interact with our data in new ways, which will require that we shift not only our data but our thinking about our data. The new discovery system will not only pool together our data from different systems or silos, but it will offer users a search experience that favors post-filtering or faceting, rather than relying on pre-filtering or the construction of complex search queries. For better or worse, we're moving away from classic systems that look like this:
7 Example 1: This is a browse view in our local Cruzcat catalog with all of our nicely pre-coordinated subject headings.
8 Example 2: To systems that look and function more like this. (This is a screen capture of Binghamton University’s implementation of Primo.)
9 Example 3: Or like this. (Screen capture from Stanford University's SearchWorks.) We can see that these discovery systems are automatically doing some of the work for us in making our LCSH more facet-friendly. Whereas previously you'd see facets that looked like paragraphs containing full pre-coordinated strings of terms, these systems de-coordinate subject strings and present the constituent terms as facets. Still, this new discovery environment raises the question: what can we do as metadata stewards to best prepare our data, and how can we start creating data that is better suited to this reimagined search experience?
10 Entering the FAST Lane What is FAST? Over 1.7 million terms Linked Data ready: Things vs. strings Simpler than LCSH; better than keywords Alternative or complement to traditional LC vocabularies What are other libraries doing? Is anyone else using FAST in the space of their digital collections? So this is where we began to think about moving into the FAST lane. In FAST, we saw an elegant and potentially easy solution to the issues we observed both in our legacy metadata and in the work of subject analysis. This led us to wonder: What are other libraries doing? Have other libraries adopted FAST for use with digital collections? An informal survey of our UC colleagues revealed that some had experimented with mapping subject terms to FAST and that one campus was considering entirely replacing the LCSH headings in their digital objects metadata with FAST.
11 Entering the FAST Lane: The Mechanics of Switching Lanes OCLC's FAST tools Requirements: It has to be fast, easy & automated OpenRefine So now that we and others are indicating that we want to get in the FAST lane, in practical terms, how do we do it? And how do we ensure that we're actually going to go FASTer instead of slower? As part of their research project, OCLC developed a suite of tools that many of us are familiar with: searchFAST, assignFAST, etc. In considering how best to convert our legacy subject headings to FAST, we spent a lot of time experimenting with all of these tools, in particular the FASTconverter. Despite the tool's facility in converting subject terms found in our MARC records, in experimenting with our digital objects metadata we quickly found that we could only go so far with this tool set. We would either have to 1) manually search for and replace FAST headings, or 2) if we were to use the FASTconverter, encode our digital objects in MARC, convert our headings, and then transform the data back into Dublin Core. We felt that neither of these strategies was an acceptable solution, especially when processing larger datasets. We also felt that to make this transition, and the case for it, viable, the process would need to be fast, easy and automated. Enter OpenRefine.
12 Entering the FAST Lane: OpenRefine What is OpenRefine? What is OpenRefine's Reconciliation Service? OpenRefine is an open-source tool for cleaning, transforming and reconciling messy data. Even the non-programmer, novice user can do a lot with OpenRefine. There is a great introductory text, Using OpenRefine; beyond that, however, there is no comprehensive user's guide to OpenRefine. There's a wiki and some other assorted documentation, much of which was written by programmers in some language other than plain English. Locally, we found that most of our learning came through trial and error and keeping good notes. OpenRefine's reconciliation service was developed as an automated means of converting controlled strings to URIs; however, we can also use the tool to match labels, provided that the vocabulary you want to match against has been published as Linked Data. While we had located a couple of SPARQL endpoints for querying LCSH and other vocabularies, in order to speed up our processing time and not continually tax someone else's server, we decided to host a data dump of FAST locally so that we could reconcile and re-reconcile to our metadata hearts' content. As we had already been using OpenRefine to reconcile terms in our metadata (creator and contributor names, subject, genre and format terms) to controlled vocabularies, it was not a giant leap to employ OpenRefine as a FAST conversion tool. Instead of matching LCSH terms to LCSH, we'd be matching LCSH terms to FAST. While I'm not going to walk us through how to set up local reconciliation services, I am going to walk us through how to run the reconciliation process once the service is up and running.
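(For the curious, here is a minimal sketch of what a single query against an OpenRefine-style reconciliation service looks like under the hood. The localhost URL is hypothetical, standing in for a locally hosted FAST service; the request and response shapes follow the general OpenRefine Reconciliation Service API, not any UCSC-specific setup.)

# Sketch of one query against an OpenRefine-style reconciliation service,
# e.g. a locally hosted FAST endpoint. The URL is hypothetical; the request
# and response shapes follow the OpenRefine Reconciliation Service API.
import json
import requests

RECON_ENDPOINT = "http://localhost:8000/reconcile"  # hypothetical local FAST service

def reconcile_term(term):
    """Send one label to the service and return its candidate matches."""
    queries = {"q0": {"query": term}}
    response = requests.post(RECON_ENDPOINT, data={"queries": json.dumps(queries)})
    response.raise_for_status()
    candidates = response.json()["q0"]["result"]
    # Each candidate carries an identifier (here, a FAST ID/URI), a preferred
    # label, a relevance score, and a flag indicating an exact match.
    return [(c["id"], c["name"], c["score"], c["match"]) for c in candidates]

# Example: reconcile_term("Lick Observatory")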
13 Entering the FAST Lane: Using OpenRefine's Reconciliation Service Already de-coordinated subject strings, reconciled to LCSH, and sifted out truly local terms Before we get started, I want to point out that I've already done some work on this data: I've de-coordinated the existing pre-coordinated strings; reconciled the pre-existing terms to LCSH; and deduplicated terms. (For example, let's say that in one record I had two pre-coordinated strings, "Mountains—Pictorial works" and "Landscapes—Pictorial works." After decomposing those strings, I'd have "Mountains," "Landscapes," "Pictorial works," and "Pictorial works," meaning that I would need to remove one of those two instances of "Pictorial works.") Lastly, I sifted out our truly local terms from our reconciled LCSH into a separate column, so that I would only be matching our LCSH terms to FAST. (It's worth noting that while I did this prep work in Excel, I could have done it in OpenRefine. My choice to use Excel was a matter of personal preference.) The snapshot we see here is how things look after the initial uncontrolled/TGN/LCSH/maybe-LCSH terms to LCSH reconciliation.
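(A rough illustration of that prep work in plain Python, assuming the standard LCSH double-hyphen subdivision delimiter; the function name and example strings are just for demonstration and are not part of our actual workflow.)

# Illustrative prep: split pre-coordinated LCSH strings on the standard "--"
# subdivision delimiter and drop duplicate terms within a record.
def decoordinate(headings):
    """Return the unique constituent terms, preserving first-seen order."""
    terms = []
    for heading in headings:
        for term in heading.split("--"):
            term = term.strip()
            if term and term not in terms:  # e.g. keep only one "Pictorial works"
                terms.append(term)
    return terms

print(decoordinate(["Mountains--Pictorial works", "Landscapes--Pictorial works"]))
# ['Mountains', 'Pictorial works', 'Landscapes']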
14 Entering the FAST Lane: Using OpenRefine's Reconciliation Service And this is how our same data looks after having been brought into OpenRefine. For display purposes, I abbreviated our spreadsheet to contain only the Title and our two subject columns. (Notice that I also added a list of numbers (1, 2, 3, 4) as the first column of my spreadsheet. OpenRefine expects there to be a value in every cell of the first column. So rather than inserting a placeholder into any blank cells that appear in the first column of my actual data, my workaround was to prefix my spreadsheet with a dummy column of numbers that I filled down in Excel.) My first step in OpenRefine is to split my multi-valued cells, like those in the Subject.LCSH column, onto separate rows. I do this by hitting the little arrow in the header of the column I want to split. From the dropdown menu, I select Edit cells and Split multi-valued cells. I then enter my desired separator, which in this case is a semicolon and a space, in the dialog box, et voilà.
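(Outside of OpenRefine, the same split-into-rows step can be sketched with pandas; the column names mirror the spreadsheet shown on the slide, but the code is illustrative rather than part of our workflow.)

# Equivalent of OpenRefine's "Split multi-valued cells" using pandas: each
# "; "-delimited subject lands on its own row, keeping its record's title.
import pandas as pd

df = pd.DataFrame({
    "Title": ["Lick Observatory"],
    "Subject.LCSH": ["Mountains; Landscapes; Pictorial works"],
})

df["Subject.LCSH"] = df["Subject.LCSH"].str.split("; ")
rows = df.explode("Subject.LCSH", ignore_index=True)
print(rows)  # three rows, one subject term per row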
15 Entering the FAST Lane: Using OpenRefine's Reconciliation Service Now that each subject term has been split onto its own row, we can match each term individually to FAST. This is where it is helpful to be able to toggle between the rows and records views. We can see here that from the 4 rows we had at the outset, we now have 24. And when we click on the records view, we can see and confirm that we are still working with 4 metadata records.
16 Entering the FAST Lane: Using OpenRefine's Reconciliation Service Next, to begin our reconciliation process: open the dropdown menu next to the column header, click Reconcile, and select Start reconciling… This opens a dialog box in which we can select the vocabulary we want to reconcile our terms against.
17 Entering the FAST Lane: Using OpenRefine's Reconciliation Service Here is how the data looks post-process. My first two records are on the left and the second two on the right. All the terms that matched FAST terms are hyperlinked in blue. Unmatched terms with near matches appear in black, with the corresponding suggested terms in blue; for each, I have the option either to select one of those suggested terms or to create a new, unmatched, unlinked topic. Looking at our reconciled data in the full screen, we can see under the Facet/Filter tab on the left a judgment facet showing that we have 21 matched terms and 3 unmatched terms. Our next step is to manually review our matched terms to verify that they are in fact true matches. This is easy enough to do when you are working with four records; but when working with larger sets of data, there are other tools within OpenRefine that allow you to facet your data so that you can review it by term instead of paging through all of your records. At a glance, all of my matched terms look good. Now to address my unmatched terms; luckily for us, each of these is easily resolved to one of the suggested matches.
18 Entering the FAST Lane: Using OpenRefine's Reconciliation Service Here's an OpenRefine tip: if I were to just click on the suggested match to accept it, OpenRefine would match my term to that URI and that label, but it would not update the actual data value I originally entered. My advice is to do that yourself using the Edit option hidden within each of the cells. If you hover over the upper right corner of a given cell, a blue Edit button will appear. Click that button, update the cell value, and hit Apply to all identical cells, after which you can click the double checkmark icon to accept your preferred near match.
19 Entering the FAST Lane: Using OpenRefine's Reconciliation Service After we've resolved our other near matches, all of our LCSH terms have been successfully mapped to FAST. Our next step is to rejoin our multi-valued cells, which is easily done by selecting Edit cells, Join multi-valued cells, and reinserting semicolons and spaces as our separators. Our last task is to manually reformat our local, LCSH-like subject terms so that they appear more FAST-like. This example, having been simplified for the sake of presentation, obviously doesn't represent or reflect all of the complexities we may encounter in our data. In some of the Lick Archive metadata you didn't see, there are many more pre-coordinated strings and geographic terms that required much more manual review and intervention. Additionally, in this example we didn't encounter any terms that didn't reconcile to FAST. And as I mentioned earlier, I did a fair amount of prep work on this data before reconciling it to FAST. OpenRefine's matching is far less successful, if successful at all, when I attempt to reconcile uncontrolled vocabulary terms or pre-coordinated strings. In my work, I've also encountered false matches and wrongfully unmatched terms. The process is not flawless. In short, OpenRefine is a great tool that can be used to automate part of the work of authority control, but it isn't an authority control solution, nor is it a replacement for human know-how. One still has to do some manual verification and updating -- just a lot less of it, especially when working with large spreadsheets or datasets. OpenRefine is not the only tool I would want in my metadata toolkit, but it's a powerful one to have.
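(And the mirror image of the earlier split: a pandas sketch of rejoining the reconciled rows back into one "; "-delimited cell per record. Again, the column names are illustrative only.)

# Equivalent of rejoining multi-valued cells: collapse the reconciled rows
# back into one "; "-delimited cell per record.
import pandas as pd

rows = pd.DataFrame({
    "Title": ["Lick Observatory"] * 3,
    "Subject.FAST": ["Mountains", "Landscapes", "Pictorial works"],
})

rejoined = rows.groupby("Title", sort=False)["Subject.FAST"].agg("; ".join).reset_index()
print(rejoined)  # one row per record, subjects rejoined into a single cell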
20 Entering the FAST Lane: Re-learning How to Drive Lack of training materials Now that we've tackled our legacy data, the remaining and perhaps greater challenge is to change how we think about and do subject analysis. If we adopt FAST, how are we going to assign FAST? How are we going to teach our expert and non-expert metadata creators to assign FAST headings? One of the stumbling blocks we've encountered with introducing FAST locally is the lack of training resources. The bulk of the FAST headings appearing in our catalog records are automatically derived by OCLC from the assigned LC terms; and while the conversion of LCSH to FAST has been expedited by OpenRefine, it has not made subject analysis any more efficient: these models still require us to assign LCSH in more or less the same old way that we always have. In searching beyond OCLC Research's project webpages for additional resources, we purchased a copy of Chan and O'Neill's FAST: Faceted Application of Subject Terminology: Principles and Applications. While this book is a great resource that discusses the thinking and principles behind FAST and the structure of FAST, like the other resources we found, it is not a training manual. As we began to discuss the idea of drafting our own FAST training materials, in a moment of supreme coincidence, we happened to attend an ALCTS webinar, Using FAST for faster workflows and discovery, presented last September by Joelen Pastva, then of UIC, and Alison Jai O'Dell of the University of Florida. In her portion of the presentation, Joelen discussed and walked us through how UIC implemented FAST in the space of their digital collections work, which included her creation of an in-house training manual for assigning FAST headings, which she later graciously shared with the attendees. Score!
21 Conclusion: License to Drive FAST? Is FAST an acceptable replacement for LC? Is it worth the work? Are we adding value? In conclusion: Is FAST an acceptable replacement for LC? What is gained, what is lost? In terms of evaluating our FAST reconciliation test project and weighing the use of FAST in our digital objects metadata, what we did was to have a follow-up conversation. This being an informal experiment, we didn't do a bunch of data collection or a ton of analysis. Hopefully, that's work we can undertake in the space of a fuller project down the road. Within the space of anecdote and conversation, we determined that the pros would be: simpler, more interoperable subject headings that are 1) more amenable to post-filtering and 2) more easily reconciled using automated tools. The cons include: the loss of specificity and of the art of pre-coordination; that it would be hard or impossible to reconstruct pre-coordinated LCSH strings from FAST; and lastly, that persons used to searching and browsing our catalog in a certain way might have their worlds shaken up. So, for now, we've decided to take our FAST slow. Given that we have two system implementations on the horizon, we don't have the time and resources to invest in pursuing an additional FAST implementation project. We've decided that it's best for us to retain our LC subject terms but to take a more "faceted" approach to assigning them, i.e., creating fewer and less complex pre-coordinated subject strings. As part of our pre-ingest processing of new metadata and in our remediation work, we are deriving FAST headings and including them in our unpublished metadata, but we will defer any decision-making about what to do with these headings until we've selected our new DAMS.
22 X So we’re not here.
23 X And we’re not here.
24 But we’re here taking the middle of the road, Toyota Camry approach. We recognize the value in going FAST, but we also realize that at this time, there are organizational and technological speed limits. We’re still in our old systems and our staff and users are used to interacting with our data in a certain way. Change doesn’t have to be fast or furious: while we haven’t chucked our LCSH for FAST, our experimentations with FAST have changed our approach to subject analysis, and to our data remediation and metadata creation processes. Thank you.
25 Resources FAST (OCLC Research). Chan, L. M., & O'Neill, E. T. (2010). FAST: Faceted Application of Subject Terminology: Principles and Applications. Santa Barbara, Calif.: Libraries Unlimited. Pastva, J. (2016, September 28). Simplified cataloging for non-catalogers through FAST. ALCTS webinar: Using FAST for faster workflows and discovery. OpenRefine. Verborgh, R., & De Wilde, M. (2013). Using OpenRefine: The essential OpenRefine guide that takes you from data analysis and error fixing to linking your dataset to the Web. Birmingham: Packt Publishing. Free Your Metadata. Here are some resources that I've found helpful. (For those who are interested, a more detailed version of this process has been included in a librarian-friendly OpenRefine recipe book/blog created by my UCSC colleague Lisa Wong: https://liwong.blogspot.com/)
26 Image Credits McClanathan Meriwether, L. (2012). McHenry Library [Photograph]. American Colony Photo Dept., photographer. (1934, Sept.). Cars in desert [Photograph]. Retrieved from the Library of Congress, https://www.loc.gov/item/mpc /PP/. https://www.themoviedb.org/collection/9485-the-fast-and-the-furious-collection camry-le-rebel-without-a-clause/ https://news.ucsc.edu/2016/06/meditation-sessions.html And image credits. Thank you!