Metadata Harvesting Assessment of the DPLA
(Note: project originally completed in early 2015. Linked records may no longer reflect the issues cited)
1. La Redowa. Nouvelle valse bohemienne. Avec la Theorie de cette Danse d'apres la Methode Varin. Musique de Fred. Burgmuller. No. 8564.
Glancing at the brief entry, there doesn’t seem to be anything awry. There’s certainly a loss of information between the original and the DPLA entry, but this is fairly standard as the DPLA displays fewer fields. Probably the most glaring issue is the description, which is curiously cobbled together from the “illustration,” “previous owner,” and “price” fields, giving the strange description of, “Dancing couple. Eliza C. Anderson. 18 kr.” Without any proper context, this would seem to imply that the dancing couple includes Eliza C. Anderson, and that the 18 kr must be some esoteric reference to an element of dance rather than to a price in krone/krona. (The price is itself a bit of a mystery. The Austro-Hungarian krone was not introduced until 1892, just under 20 years after the death of the piece’s creator, yet it seems highly unlikely that the piece would be priced in the Danish krone or Swedish krona given that Mainz is in western Germany, far from Scandinavia. So then, precisely what is the price referring to? Is it in fact a reprint of the original dating to after 1892? Is it a copy issued in Denmark? This is somewhat important for the original piece of course, but right now our focus is on how it transfers to the DPLA, so I’ll have to let the issue go unanswered.)
What really led me here though was a hidden element. Namely, in the DPLA, the language of this item is listed as “the.” That’s it, just “the.” This shows up in the DPLA’s advanced search, where you can select from such languages as Italian, Arabic, and, of course, “the” and “as.” It has all the hallmarks of data mixed up in the transfer due to a non-standard use of fields. What’s especially baffling about this case is that while the original item does misuse the language field, it doesn’t do so in a way that would clearly lead to the DPLA recording it as “the.” The language is listed in the original as, “Instructions for the dance (in French).” It’s entirely possible that there’s a hidden field that’s leading to the confusion. Regardless, you can see a distinct lack of human oversight in the harvesting, since it wouldn’t be too difficult for a person to catch that the language is French.
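The kind of check that would catch values like “the” and “as” can be sketched in a few lines. To be clear, this is a minimal sketch and not how the DPLA actually works: the function name `flag_suspect_languages` and the tiny `KNOWN_LANGUAGES` set are my own assumptions for illustration, and a real check would use a complete language list.

```python
# Hypothetical sanity check for harvested language values.
# KNOWN_LANGUAGES is a deliberately tiny illustrative set, not a standard list.
KNOWN_LANGUAGES = {"english", "french", "german", "italian", "arabic", "spanish", "latin"}


def flag_suspect_languages(values):
    """Return the values that do not match any known language name."""
    suspects = []
    for value in values:
        if value.strip().lower() not in KNOWN_LANGUAGES:
            suspects.append(value)
    return suspects


# Values as they appear in the advanced search, plus the original record's text.
harvested = ["Italian", "Arabic", "the", "as", "Instructions for the dance (in French)"]
for value in flag_suspect_languages(harvested):
    print("Needs human review:", value)
```

A check like this doesn’t fix anything on its own. It only produces a list for a person to look over, which is exactly the oversight that seems to be missing here.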
2. The Guitar of Spain. Sung at the Theatre Royal Drury Lane, by Miss Shirreff, composed & dedicated to Madame Malibran, by S. Nelson.
This item was found under similar circumstances to the above, and has an even more glaringly bad language entry in the original. Unless there is some magic hidden in the abbreviations, none of the information entered into the original language field is relevant to that field, which reads, “L. r. of p. 4-7: The Spanish Guitar. 6.” You might think “Spanish” holds some relevance, but that’s merely part of a title in this case; the actual language of the piece is English.
As before, the description field is atrocious, even more so this time since it gives no information relating to the piece itself save for its price. It simply lists the former owner, his place of residence, and the price, none of which are going to be things the average browser will be looking for.
In part, this is really the weakness of the original team handling the piece as no attempt was made to include even a brief description of this and other pieces in the collection, and it hardly falls to the DPLA to do the work of these teams for them. Instead it speaks to a certain slipshod approach to digital metadata which is focused wholly on the mass digitization of items over making these items accessible to those not already familiar with the items in question. But, that’s a discussion for another time.
3. Vatican City, BAV Lat. 4804, folio 27r
This item is actually in fairly good condition, complete with a functional description, topics, and some very basic information about its origins. Perhaps the only shortfall is the title itself. In this case the item archived is specifically an illustration taken from a manuscript held in the Vatican archives, and the original item says as much, listing the above title as the source of the image. In other words, were you in the Vatican archives, that’s where you would look to find the item, but it has nothing to do with the title of the piece, either the picture or its original text. Granted, the original title may have been lost (unless the “text illustrated” field is a reference to the original title), or it may simply have not been available. The item comes from a previously existing collection of medieval medical images, and this collection may have never recorded the original title of the piece since they cared only for documenting the images. The location of the picture was enough for them, but once again this is winding off into a discussion of metadata practices rather than metadata harvesting.
This is once again a difficult point for the DPLA team. The original uses this source as the title and they’re in no position to do any different. So it’s not so much an error in transfer as a question to be raised in digital metadata, which as always falls to the practical vs. perfectionist debate.
Aside from all of that, the only other oddity is the duplication of the creator’s name, this due to DPLA merging the author and common names fields. At least for this item, the only difference between the fields is that the latter lacks a date and lists the name in forename-surname rather than surname-forename as the author field does. Thus it becomes redundant in the DPLA when both are merged into the creator field.
One could also argue that the DPLA entry does not make it clear enough that the item being presented is just the illustration and not the entire manuscript. While this can be presumed by the image provided and the format, these only hint at the fact rather than stating it as the original does.
4. Florence, Biblioteca Nazionale, Palatina 586, folio 19r
As can be presumed from the title, this piece shares a lot of the strengths and weaknesses of the previous item. But, in this case, the creator field is particularly egregious since it reads, “Anonymous.; Anonymous”. In the previous item one might justify the repetition of the name, but here it serves no purpose to have it twice. Yet, the DPLA harvester is powerless to do anything but what it is told, and so we end up with the dynamic team of Anonymous and Anonymous for this piece.
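For what it’s worth, collapsing this sort of duplicate during a merge is not hard. Here is a rough sketch, assuming the two source values arrive as plain strings; `merge_creators` is a name I made up, and nothing here reflects the DPLA’s real code.

```python
def merge_creators(*fields):
    """Merge creator-like fields, dropping values that differ only by
    case, surrounding spaces, or trailing punctuation."""
    seen = set()
    merged = []
    for value in fields:
        # Strip whitespace and trailing punctuation (a simplification).
        cleaned = value.strip().rstrip(".,;").strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            merged.append(cleaned)
    return "; ".join(merged)


print(merge_creators("Anonymous.", "Anonymous"))
```

Note that this only handles the trivial case. It would not recognize the previous item’s surname-forename and forename-surname entries as the same person, and stripping trailing periods is a simplification that would also clip a name ending in an abbreviation.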
One thing I didn’t mention on the previous piece is that there is actually an alternate title given in the form of “text illustrated.” It can be presumed that this refers to the name of the original text from which the illustrations are being derived, though there is no certainty since “text illustrated” is a non-standard custom field. DPLA appears to make use of this as the title for the illustration itself, which is functional to some degree, but I have to wonder if it couldn’t have been incorporated into the title proper if indeed it relates to that.
5. Explicit
Well, this item is special. For starters, the title appears to have nothing whatsoever to do with the piece. For those not following the link, the “explicit” image is in fact a lectionary, which is, to quote the Oxford English Dictionary, “A book containing ‘lessons’ or portions of Scripture appointed to be read at divine service; also, the list of passages appointed to be so read.” I’m not sure how on earth one gets “explicit” from any of that. Still, this is a quirk present in the NYPL’s original entry, and no real explanation is given for how it could have possibly come about.
There are two errors gained in the transfer to the DPLA. The first is a minor one: the original date is listed as “ca. 950 – 1000,” while the DPLA entry simply reads, “1000 – 1000,” with no marker for circa, ce, bce, ad, bc, or any other clue to give context to the numbers. Beyond that, it’s just plain redundant. The error seems to arise due to the NYPL’s use of the “date created” field twice, once for each part of the range, while the DPLA likely only picked it up once since “date created” is not typically a repeatable field.
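To illustrate the difference, here is a small sketch of building a range from a repeated date field rather than reading a single value. I’m assuming the two values arrive as something like “ca. 950” and “1000”, which is a guess based on how the original displays them; `date_range` is a hypothetical helper, not anything from the NYPL or DPLA.

```python
import re


def date_range(values):
    """Build a display range from repeated date values such as ["ca. 950", "1000"]."""
    years = []
    approximate = False
    for value in values:
        # Remember whether any value was marked as approximate.
        if re.search(r"\b(ca\.?|circa)\s", value, flags=re.IGNORECASE):
            approximate = True
        match = re.search(r"\d{3,4}", value)
        if match:
            years.append(int(match.group()))
    if not years:
        return ""
    start, end = min(years), max(years)
    prefix = "ca. " if approximate else ""
    if start == end:
        return f"{prefix}{start}"
    return f"{prefix}{start} - {end}"


print(date_range(["ca. 950", "1000"]))  # both occurrences of the field
print(date_range(["1000"]))             # only one occurrence picked up
```

Reading both occurrences keeps the start of the range and the circa marker, while reading only one of them leaves a lone 1000 with nothing to pair it with.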
The second and much more glaring error is the location. There is not, nor was there ever, a place in Germany called “Ottonian.” Ottonian refers instead to the Germanic Ottonian dynasty which ruled over a new Holy Roman Empire from 919 to 1024. In other words, one can presume that “Ottonian” in this case is instead referring to an era or culture, or is at best referring to the Ottonian imperial territories since Germany itself did not come into existence until 1871 (prior to the Ottonian dynasty, which began to bandy about the name “Holy Roman Empire” to claim Charlemagne’s Roman legacy, the region was simply known as “East Francia,” the eastern Frankish kingdom). It’s a bit of a mystery since the original record makes no mention of the Ottonian dynasty, meaning it must have been a hidden field and, as a result, it’s not clear whether the fault lies with the NYPL or with DPLA’s metadata harvesting.
6. Rabbit
For the most part, this entry actually seems to have transferred just fine. What information there is has made the leap intact. The closest thing to an error I could cite is the bundling of information in the format field. Dimensions and medium are all well and good, but it becomes a bit blurrier when talking about “costume accessories” and “decorative arts.” This information might have fit better either in a description or as topics, but Dublin Core does make allowances for the use of controlled vocabulary terms, so it all really depends on whether these terms were drawn from such a vocabulary commonly usable for format info. Since the original entry does not display its field names, it’s not entirely clear what information was meant for what field. The original formatting makes it appear as though “costume accessories” is a description, while “decorative arts” doesn’t show up at all. So this may not be an error. It’s hard to tell given the information at hand.
7. Master-of-animals standard finial
This item is relatively intact, but does bear several wounds of transfer. For starters, the type field is incorrectly displayed. If memory serves from my digital library class, when making a record for an object you either describe the object itself, or you describe the digital facsimile thereof, but you do not mix and match the two. Yet that’s precisely what is occurring here, with the format listed as “copper alloy cast” and the type as “image.” I don’t believe too many images are made from copper alloy casts. Oddly enough, the Smithsonian entry handles this properly, listing the type as “metalwork.” Precisely why the DPLA changed “metalwork” to “image” is a curious question.
Aside from this, there’s an odd duplication of “lion” in the DPLA’s subject field that isn’t present in the original, and the “rights” section has nothing to do with copyright and usage, instead just stating that it’s a gift from someone. This in and of itself implies nothing about how it can be used, reproduced, etc.
8. Clark Construction Co., Southern California, 1926
Fun fact: did you know California has apparently contained modern apartments since 1024 CE? At least according to the DPLA, this is indeed the case. It’s a truly bizarre error as the original metadata (which is wonderfully robust and a refreshing change from the barebones approach of most archives) has no ambiguities whatsoever about the date, which is 1926. The only place 1024 shows up is under the field “Geographic Subject (Roadways).” In other words, DPLA somehow managed to confuse roadways with dates. This does indicate something interesting about their metadata harvesting though, namely that it seems to have some degree of action independent of what the original record advises. It likely looks for date-like information in the metadata and, if it finds any, places it in the date field, selecting the earliest and latest dates to create a range. I can’t imagine the original archive had the geographic subject field set to transfer to date. It’s not impossible someone made that mistake, but if they had, then all of the data in that field should have ended up under date instead of only the first value. Nor do I imagine an actual individual would have made such an absurd error. It raises a lot of questions about just how the DPLA harvesting operates…
At the same time, given some of the other entries I’ve looked at, this degree of autonomy seems unlikely since in the “Explicit” item it failed to capture both ends of the date despite both being present in the original record. So in short, I’m not sure how on earth it happened.
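To make that guess concrete, here is a sketch of how a harvester that scans every field for four-digit numbers could arrive at 1024, and how limiting the scan to date fields avoids it. This is purely my speculation about the mechanism: the function `guess_date_range` is hypothetical, and the roadway value in the sample record is an invented placeholder, since all I know from the original is that 1024 appears in that field.

```python
import re

YEAR = re.compile(r"\b\d{4}\b")


def guess_date_range(record, fields=None):
    """Collect four-digit numbers from the chosen fields (all fields if
    fields is None) and return (earliest, latest), or None if none found."""
    years = []
    for name, value in record.items():
        if fields is not None and name not in fields:
            continue
        years.extend(int(y) for y in YEAR.findall(value))
    if not years:
        return None
    return min(years), max(years)


# Placeholder record: the roadway value is made up for illustration.
record = {
    "Title": "Clark Construction Co., Southern California, 1926",
    "Date": "1926",
    "Geographic Subject (Roadways)": "1024 Example St.",
}

print(guess_date_range(record))                   # scans every field
print(guess_date_range(record, fields={"Date"}))  # scans only the date field
```

With this sample record the first call returns (1024, 1926) and the second returns (1926, 1926). As noted above, though, the “Explicit” item doesn’t behave as if the harvester were this eager, so take the sketch as one possible explanation and nothing more.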
That aside, the rest of the metadata seems intact given the fields the DPLA uses.
9. Graphic Schema of Purgatorio
And here we have one of the rare examples where DPLA actually creates a better record than the original presented. The original record is just downright dismal, filled with “unsures,” listing “Brigham Young University” as the creator, listing “literature” twice in the topics field, duplicating the date unnecessarily, and just plain being bad. The DPLA carries over some of these errors by necessity, namely the creator field, but it actually gets rid of the duplicate dates and topics and removes all the unnecessary “unsures.”
10. Illustration: Poetical Composition in the Form of a Circular Diagram; Leaf from Amplified Poem in Honor of the Prophet Muhammad; Text Title: Takhmis al-burdah
Curiously, this record lists its dates both in Latin AD and Arabic Hijri year, which is something of a nice touch. The title itself is something of a mess but it at least gets all the information across. DPLA seems to have literally cobbled it together from the original record’s illustration and text fields.
There is an odd error in the creator field, as four names are listed, one of which is simply “Islamic.” It’s not hard to see where this error came from though, as the original record has the term “Islamic” simply floating between the scribe and illustration fields, its purpose entirely unclear. DPLA’s harvester seems to have simply assumed it went with the scribe field and appended it as a fourth name.
There is some loss of information, namely the place of origin, but otherwise the remainder of the metadata seems intact.