Libraries and Technology
The Art of Cataloging
It is not the first thing people think of when they think of librarians, but metadata and cataloging are in many ways the core of librarianship. A library without some manner of organization to its contents would be prone to chaos, unable to keep track of its materials, and would ultimately fail in its missions of access and education. It may not seem glamorous, but metadata is crucial, doubly so in the digital age.
As it stands, this underworld of librarianship feels as though it is in flux, divided between the more traditional MARC and the newer RDA on one side and the growth of XML-based metadata on the other. During my time at UNCG I had the benefit of working with all of these systems, learning their purposes, their functions, and their strengths and weaknesses. With XML I delved even further, dabbling in reverse-engineering existing metadata schemas and creating my own, as well as working with automatic harvesting to understand its benefits and drawbacks.
In the case of MARC and RDA, I slowly dissected my way through them, learning the component fields, rules, indicators, and relationships. I practiced with CDs, DVDs, a variety of books, and digital content, realizing along the way just how tricky it can be to match newer expressions of content to these slowly evolving frameworks. Nonetheless, by the end I found myself quite fond of cataloging and of attempting to unravel the mystery of each item, learning a bit about it along the way.
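To make those components concrete, the sketch below builds a tiny bibliographic record in Python with the pymarc library (assuming pymarc version 5 or later, where subfields are explicit objects). The title, author, and publisher are invented placeholders, not a real record.

    # A minimal MARC record built with pymarc (assumes pymarc >= 5).
    from pymarc import Record, Field, Subfield

    record = Record()
    # 245: title statement. First indicator "1" generates a title added
    # entry; second indicator "0" means no leading characters (such as
    # "The ") are skipped when filing.
    record.add_field(Field(
        tag="245",
        indicators=["1", "0"],
        subfields=[
            Subfield(code="a", value="Metadata basics :"),
            Subfield(code="b", value="a practical introduction /"),
            Subfield(code="c", value="Jane Librarian."),
        ],
    ))
    # 264: production/publication statement. Second indicator "1" marks
    # this as publication information under RDA.
    record.add_field(Field(
        tag="264",
        indicators=[" ", "1"],
        subfields=[
            Subfield(code="a", value="Greensboro, N.C. :"),
            Subfield(code="b", value="Example Press,"),
            Subfield(code="c", value="2015."),
        ],
    ))
    print(record)  # human-readable rendering of the record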
Related: MARC and RDA Record Practice
XML and Metadata Harvesting
XML records, by comparison, are far easier to create, but the tradeoff is that they are often far lighter on what they actually record. Dublin Core, for example, is extremely flexible, which has aided its rise as the de facto standard for so many online archival projects, but it also demands very little in the way of actual metadata. Granted, one can flesh out and modify a DC record, but in practice this rarely seems to happen, with the end result that many online archives present a wide array of items with only minimal metadata by which to identify and organize them.
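As an illustration of just how little a bare-bones record demands, here is a minimal, invented Dublin Core record built with Python's standard library; the namespaces are the standard oai_dc and dc ones, while the item itself is a placeholder. Note that DC elements are repeatable, so enriching a record often just means adding more elements.

    # A minimal simple Dublin Core (oai_dc) record; the item is invented.
    import xml.etree.ElementTree as ET

    OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
    DC = "http://purl.org/dc/elements/1.1/"
    ET.register_namespace("oai_dc", OAI_DC)
    ET.register_namespace("dc", DC)

    record = ET.Element(f"{{{OAI_DC}}}dc")
    for element, value in [
        ("title", "Photograph of downtown Greensboro"),
        ("creator", "Unknown"),
        ("date", "1926"),
        ("type", "Image"),
        ("subject", "Streets"),       # DC elements are repeatable,
        ("subject", "Storefronts"),   # so subjects simply repeat
    ]:
        ET.SubElement(record, f"{{{DC}}}{element}").text = value

    print(ET.tostring(record, encoding="unicode"))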
This is partly a matter of time: with hundreds and sometimes thousands of items to digitize, organize, and upload, creating robust metadata simply is not feasible. Several workarounds have been devised, such as automatic metadata harvesting, in which a program examines a digitized item, typically a book or an existing metadata record, and extracts whatever metadata it can identify from the places it expects to find that information. Of course, as the Google Books fiasco of some years back showed, automatically harvested metadata can result in some horrible mistakes. I noticed this myself when I was tasked with looking through the Digital Public Library of America and analyzing the quality of its metadata against the sources from which it was harvested. This led to some bizarre finds, such as a picture in the DPLA which, according to the metadata, had been taken in 1026 CE. That alone is unusual, but stranger still was the fact that the picture was of apartments in California. I could only conclude that these were part of the fabled lost apartments of California, the legends of which spurred Spanish colonists to explore the region.
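Mistakes like that impossible date are exactly what a simple sanity check on harvested records can catch. The following is a hypothetical sketch, not the DPLA's actual pipeline: it parses a harvested oai_dc record and flags dc:date values that fall outside a plausible range.

    # Hypothetical sanity check for implausible dc:date values in a
    # harvested oai_dc record; the date range is an assumption.
    import re
    import xml.etree.ElementTree as ET

    DC = "{http://purl.org/dc/elements/1.1/}"

    def flag_suspect_dates(record_xml, earliest=1800, latest=2025):
        """Yield dc:date values outside a plausible range for photographs."""
        root = ET.fromstring(record_xml)
        for date_el in root.iter(DC + "date"):
            match = re.search(r"\d{3,4}", date_el.text or "")
            if match and not earliest <= int(match.group()) <= latest:
                yield date_el.text

    harvested = """
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>Apartment buildings, California</dc:title>
      <dc:date>1026</dc:date>
    </oai_dc:dc>
    """
    print(list(flag_suspect_dates(harvested)))  # ['1026']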
Related: Metadata Harvesting Assessment of DPLA
The Role of OCR
As it stands, the world of digital archives sometimes feels like the Wild West, but as technology advances and archives move from amassing collections to tending existing ones, this will steadily change. One example of this advance is OCR (optical character recognition), the ability of a program to recognize the characters in an ingested item such as a scanned book, even when those characters are handwritten letters rather than standardized fonts. While OCR has uses in facilitating better metadata harvesting, it is truly exciting for the archival and digitization fields, where its advance holds the promise of producing not only an image of an item but a full transcript of it. A transcript vastly increases an item's usefulness: it makes the item more easily readable, and it allows for better full-text searching, more thorough tagging, richer metadata, and greater usability, since a transcript can be read and interacted with far more easily than an image.
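Getting a first transcript out of a scanned page can be as simple as the sketch below, which uses the pytesseract wrapper around the Tesseract OCR engine; both must be installed separately, and the filename is a placeholder. Handwriting remains a much harder problem than print, so treat this as the easy case.

    # A minimal OCR pass over a scanned page using pytesseract.
    # Requires the Tesseract engine to be installed on the system;
    # "page_scan.png" is a placeholder filename.
    from PIL import Image
    import pytesseract

    image = Image.open("page_scan.png")
    # Run Tesseract over the image and return plain text, which could
    # then be stored alongside the item as a searchable transcript.
    text = pytesseract.image_to_string(image, lang="eng")
    print(text)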
Related: Advances in Multilingual OCR
Crosswalking
Of course, most of the harvesting conducted by organizations such as the DPLA is harvesting from existing metadata records. In these instances the crux of the problem, aside from the quality of the source metadata, often comes down to the crosswalk itself. On the surface it might seem easy to take information from one metadata schema and transfer it to another. In truth, this can be difficult to do manually, let alone to program a machine to do automatically. The difficulty is that every schema handles its metadata differently: some have robust fields with numerous subentries, while others, such as Dublin Core, remain as simple and flat as possible. Moving metadata from a complex schema to a simple one, or vice versa, is liable to lead to all manner of headaches which then have to be tidied up. I learned this myself when attempting to crosswalk two Library of Congress records into my own metadata schema, only to then have to crosswalk them into Dublin Core.
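The sketch below shows the kind of flattening involved. The source field names are invented stand-ins for a CDWA-like schema, not real CDWA elements; the point is how three distinct physical-description fields all collapse into repeated, less specific dc:format values, with the distinctions surviving only in the text itself.

    # Hypothetical crosswalk from a rich, CDWA-like record into flat
    # Dublin Core. Source field names are invented for illustration.
    source_record = {
        "title_text": "Study of a Seated Figure",
        "creator_name": "Anonymous",
        "creation_date": "circa 1890",
        "medium": "charcoal",
        "support": "laid paper",
        "dimensions": "30 x 22 cm",
    }

    CROSSWALK = {
        "title_text": "title",
        "creator_name": "creator",
        "creation_date": "date",
        # Three distinct fields flatten into repeated dc:format values.
        "medium": "format",
        "support": "format",
        "dimensions": "format",
    }

    dc_record = {}
    for source_field, value in source_record.items():
        dc_record.setdefault(CROSSWALK[source_field], []).append(value)

    print(dc_record["format"])  # ['charcoal', 'laid paper', '30 x 22 cm']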
By the end it was not hard to see not only what a challenge crosswalking and harvesting can be, but also how mindful one must be in schema creation to truly tailor a schema's structure to the items being ingested. Going into my metadata class I considered myself a full proponent of truly robust schemas such as CDWA, yet by the end I could see the benefits of something as simple and flexible as Dublin Core. After all, while DC may not have all the fields and subfields of something like CDWA, it nonetheless gives the librarian the freedom to work that information into appropriate places within the schema through duplicate fields, a bit of repurposing, and even the allowance for custom fields when needed.
Related: Metadata Schema Crosswalk
Planning Ahead
Of course, with all this dabbling in technology, it is crucial to have some manner of plan to bring it all together and provide guidance along the way. After all, libraries must be proactive in their implementation of new technologies and methods. It is for precisely that reason that I formulated my own technology plan, an attempt to gather and synthesize what I had learned thus far and to chart a way forward. It was a great way to cap off my class on emerging technologies, and it is the sort of document which, with a bit of tweaking and updating, I can continue to apply throughout my library career.
Related: Technology Plan
Conclusion
In the end, being a librarian requires one to keep abreast of technologies new and old, from RDA to XML, from survey software to makerspaces, from OPACs to digital archives. Standing here at the end of my time at UNCG, I feel confident in what I have learned and ready to explore and learn even further as I transition into the active library environment.