Posting-date: Tue, 9 May 1995 00:00:00 EST
A1-type: DOCUMENT

VENDOR PROCESSING FOR AUTHORITY CONTROL: THE UC EXPERIENCE WITH BNA

Paul Cauthen

Presented 31 October 1992 at a meeting of the Midwest Chapter.

My presentation addressed the topic of vendor processing of catalog data records for authority control by describing the experience of the University of Cincinnati with the service offered by Blackwell North America. The presentation had three parts:

1. A brief explanation of how our authority control project came about;
2. A description of how the project itself was executed by BNA;
3. A general evaluation of the results of the authority work done by BNA.

When the plan for a statewide library network for Ohio, now known as OhioLINK, was originally developed, the idea was to load the online catalogs of each of the 17 original member libraries into a central database one at a time. The original plan called for the University of Cincinnati to be the first institution to contribute its catalog records. Starting the central database in effect from scratch provided an opportunity for a fresh start.

I am sure we are all familiar with the process that has taken place as libraries converted from card catalogs to electronic databases. It simply was not possible for the same level of loving care and attention to detail that went into the original cataloging to go into the conversion of that cataloging into machine-readable format. Nor was it usually possible to duplicate all of the discrete interfiling of conflicting headings we used to use in our card catalogs to accommodate the never-ending parade of name, uniform title and subject heading changes that resulted from changes in cataloging rules, the publication of new thematic catalogs, and policy decisions of the Library of Congress. The outcome was that with every massive conversion project, our electronic catalogs became increasingly tarnished by the scourge of "dirty data."

At OhioLINK central, the idea was: let's take this opportunity and begin with a catalog database that is as clean as we can reasonably make it. Thus, before Cincinnati's catalog records could be loaded into the central database, they had to be subjected to a massive cleanup. This was clearly too large a project to be done in-house. A search was launched to select a vendor. Blackwell North America got the contract.

BNA offers several options for authority control processing. The option chosen for Cincinnati is its most thorough. It has two phases: machine processing and manual review. In machine processing, each name, series and subject heading in the catalog records is compared against the BNA authority file. Any heading that is an exact match with an authority record cross reference is replaced with the valid heading. All partial or non-matches are copied to an exceptions file. In manual review, each entry in the exceptions file is examined individually by a human reviewer, who attempts to verify the heading and, if necessary, make corrections. All unverified headings are sent to a final exceptions file.

Two points are key to a complete understanding of the BNA process. First, the BNA authority file includes all of the authority records supplied by the Library of Congress, plus records created by BNA itself as a by-product of its own authority processing, which I believe includes feedback from its library customers. Second, and more significantly, the BNA reviewers do not see the unverified headings in the context of their full bibliographic records, only in a presumably long list of otherwise undifferentiated headings.

Let me describe now, briefly, how Cincinnati went about having its entire database scrubbed and polished by BNA.

As our participation in OhioLINK involved switching from WLN software to Innovative Interfaces software, we already had a test database in hand that would allow us to verify that the records went out and came back in a usable format and that the BNA authority work met the terms of the contract. The test database included dummy data records in every MARC format with every valid subfield. The remaining records were selected by a computer program that copied off every 50th record from the real database, some 20,000 records. After the machine processing of the test database was completed, we received printouts of all the records that had had certain types of headings changed, principally subjects. All other evaluation had to be done online by searching the test database for uncorrected errors. After we validated BNA's handling of the test database, our entire database was copied onto tape and sent away. BNA's machine processing and manual review of Cincinnati's entire catalog took about 5 months.

Before summarizing the results of my evaluation of BNA's work, I would like to mention a couple of points to keep the evaluation of BNA's performance in proper perspective. First, it is important to remember that this project covered the entire UC database, nearly 1 million records, from not only the university's main library and branch libraries, but also those of the Law School Library, Medical Center Library and Environmental Health Library; in other words, a considerable variety of specialized cataloging. Unfortunately, I am not able to tell you whether a cleanup project involving only music records would have produced different results. I do think it appropriate, however, to mention the possibility. Second, the staff of the Music Library decided after reviewing the BNA proposal that a more thorough cleanup of music cataloging was needed than BNA appeared likely to provide.
The Music Library undertook its own project, hiring and training a full-time, temporary employee to do nothing but clean up music records. That project lasted approximately six months, right up to the day before the copying off of the database for BNA began. I mention this local cleanup project not to suggest that we were right to distrust BNA to get the job done, but to make clear that there was much less work for BNA to do to begin with: our data was not as dirty as it might have been.

I presented my evaluation of the results of BNA processing through a series of "before and after" searches. As space limitations and other practical considerations prohibit reproduction of my overheads here, I must forgo evidence and examples and limit myself to discussion and conclusions.

A preliminary note: I did not know I would be making a formal evaluation of BNA's work until a few weeks before the reload of our processed records was scheduled to begin. Consequently, there are some categories of errors for which I had no test cases or only a few random examples.

My evaluation focused on four categories of access points: subject headings, series, name headings, and uniform titles. In each category I tried to find examples of two kinds of problems: mechanical errors (otherwise known as typos) and form-of-entry errors (or at least form-of-entry inconsistencies).

I did not have any test cases of mechanical errors in topical subject headings in my problems folder and so cannot comment on BNA's performance in this area. I checked a number of recent changes in form of entry for topical subject headings and found BNA to have been quite thorough. Such efficiency is to be expected, as changes from one accepted form to another are usually reflected in the see-reference structure of authority records, allowing the changes to be made during machine processing.

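As a rough illustration of why see references make such changes easy for machine processing, the exact-match step described earlier can be sketched in Python. This is a minimal sketch under stated assumptions, not BNA's software: the function names, the dictionary layout of the authority file, and the sample headings are all invented for illustration.

```python
# Hypothetical sketch of exact-match machine processing.
# Nothing here reflects BNA's actual programs or data formats.

def build_lookup(authority_file):
    """Map every valid heading and every see reference to its valid heading."""
    lookup = {}
    for valid_heading, see_references in authority_file.items():
        lookup[valid_heading] = valid_heading
        for reference in see_references:
            lookup[reference] = valid_heading
    return lookup


def machine_process(headings, lookup):
    """Replace exact matches; copy everything else to an exceptions list."""
    processed = []
    exceptions = []
    for heading in headings:
        valid_heading = lookup.get(heading)
        if valid_heading is None:
            # Partial match or non-match: leave unchanged for manual review.
            processed.append(heading)
            exceptions.append(heading)
        else:
            processed.append(valid_heading)
    return processed, exceptions


# Illustrative data only: one authority record with one see reference.
authority_file = {"Motion pictures": ["Moving-pictures"]}
catalog_headings = ["Moving-pictures", "Motion pictures", "Moving pictures"]

processed, exceptions = machine_process(catalog_headings, build_lookup(authority_file))
print(processed)   # ['Motion pictures', 'Motion pictures', 'Moving pictures']
print(exceptions)  # ['Moving pictures']
```

In this sketch the third heading differs from the see reference only by a hyphen, so it is not an exact match and falls through to the exceptions list. A heading is converted automatically only when the authority record carries a cross reference in exactly the form found in the catalog.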
One simple mechanical correction to subject headings I was hoping BNA would help us with is the elimination of the woefully out-of-date chronological subdivision --To 1800. Alas, all 1,865 instances remain in the catalog [Ex. 1].

Turning briefly to names-as-subjects, I do have a problem file left over from our local cleanup project. Many of you will recognize WLN software at work in the Elgar example. The top 4 entries contain the current form of Elgar's name, the bottom 5, earlier forms, including one apparently dating from a time when Elgar was a contemporary composer (item 4). After BNA processing and a change of software, the form of entry of Elgar's name is consistent throughout. The only uncorrected error occurs in line 8, where the uniform title "works, orchestra" did not get converted to its current form, "orchestra music." I shall return to the sticky business of uniform titles momentarily.

As with subject headings, I was unable to find any test cases for mechanical errors in series entries. Correcting discrepancies in form of entry for series, however, turned out to be a much tougher challenge for BNA. Here is a printout I made to draw my attention to a split file for the series Corona [Ex. 2]: 25 entries incorporating the subtitle Werkreihe fur Kammerorchester; 4 entries with the title qualified by place of publication. Notice how each set of entries includes one instance where the volume designation is lacking. Now here are the same searches after BNA processing. The file remains split, the only change being the addition of an authority record and the restoration of one missing volume designation. Why was this split file missed? If we look at the authority record we see that there is no cross reference for the entry incorporating the subtitle Werkreihe fur Kammerorchester, and therefore no basis for machine conversion. Now recall that in manual review the reviewers do not see the headings in context.
Let's look at a few sample screens from a simple title search on corona [Ex. 3]. Even if a reviewer were suspicious that lines 27-28 (that is, plain Corona), Corona Werkreihe and Corona Wolfenbuttel were all the same series, there is nothing in the authority record to confirm the suspicion without also looking at the original bibliographic records [Ex. 4]. A quick check of two bibs provides the key information denied to the BNA reviewers, that both plain Corona and Corona Werkreihe are the same series as Corona Wolfenbuttel.

Let's proceed to names and uniform titles. In general I would say that BNA had a high degree of success in correcting mechanical errors and form-of-entry discrepancies in name headings [Ex. 5]. In the Brumel example, Antoine got his dates adjusted and Giaches had both his name and dates fixed. These corrections had to be made as a part of manual review, as neither discrepant heading appears as a cross reference in the authority records. If you count carefully you will see that before BNA processing we had more incorrect entries than correct ones for the Brumels. Below the Brumels is another example of name and date correction in a single entry.

Unfortunately, I cannot report a 100% success rate [Ex. 6]. In the case of Gaspard Le Roux, BNA missed the floating approximatelies that they had successfully negotiated in the previous example. In the case of Manuel Infante and Hunter Johnson they were again thwarted by not being able to see the headings in context; that is, they were unable to confirm that the name entries without dates were in fact the musicians represented by headings with dates. In the case of Balakirev, they got it half right: they corrected the name error in line 3 (Alakseevich) and the incorrect birthdate in line 5 (1836), but missed the incorrect birthdate in line 4 (1835) and the name error in line 8 (Millii).

One type of error I have yet to cover is incorrect or missing subfield delimiters [Ex. 7].
Rachmaninoff is a good example of the value of BNA processing to our catalog, as it is a file that we were not able to cover in our local cleanup project. All 12 incorrect headings were corrected: date errors in lines 2-4, 10-11, misspelled name in line 13, and the 3 dates incorrectly tagged with delimiter d instead of f in lines 6-8. Let me just say for right now that the entry Rachmaninoff Sergei Cat (line 4) came into the catalog after BNA processing.

I turn now to Mozart to illustrate 3 points [Ex. 8]:

1. That there were still errors in the file even after our local cleanup project (lines 2-3, 7);
2. That BNA corrected the name errors, but missed the delimiter error before K. 371;
3. That BNA managed to introduce an error of its own, at least as far as I can determine (Cosi fan tutte).

Still, to have only 2 errors in nearly 3,900 name headings is not too bad, I suppose.

Well, where would we be without our beloved uniform titles? I will restrict my discussion to generic uniform titles, in part because I did not have any convenient test cases of typographical errors and did not have much luck tracking down Rites of spring, Swan Lakes and the like that had not been translated into Russian, but also because it seemed to me after my own work on the local authority project that generic uniform titles were the single most common cause of split files in the catalog. The biggest culprit is the rule change from always using the singular form of the genre to using the plural if a composer wrote more than one composition in the genre, which, as luck would have it, most did.

Let's begin with John Stanley. Here is a typical pre-AACR2 uniform title: genre in the singular, ampersand in the instrumentation, opus number and key. AACR2 scholars ready... begin. Check the authority file to make sure all of op. 10 is concertos for organ and string orchestra and that no one has come up with thematic catalog numbers for Stanley. OK.
Change Concerto to concertos, convert ampersand to comma, change comma after 10 to period, insert delimiter n, capitalize the n in number, lop off C major. Very good. Now let's see how BNA did. Genre pluralized, ampersand converted, but that's all. That's only half the job, but it's not all bad, because now the entry files in the correct place in the title index.

Let's stay with Stanley and see if there's a pattern. Here we have another split file caused by the difference between singular and plural in the genre "voluntary," forcing the user to look in two different places in the index to find all the entries under a single opus number. As in the previous example, BNA pluralized the genre, but left the rest of the entries unchanged. This was enough, however, to achieve the same discrete interfiling we would have used in the card catalog. I have examined many other split files of this type and have found that BNA routinely converted genre designations from singular to plural, but did not attempt to change the form of entry for the numbering of items within an opus.

Let me now quickly pass over several other categories of changes in generic uniform titles and their treatment by BNA. AACR2 requires that uniform titles for concertos specify the instrumentation of the accompaniment. Here are the entries in our catalog for the Elgar violin concerto before BNA processing. The first two entries are standard pre-AACR2 (singular genre designation, no accompaniment specified); the remainder follow AACR2. After BNA processing the genre designation is plural, but the accompaniment has not been specified. Parallel examples treated similarly by BNA convinced me that manual review did not correct this category of split file. OK, but why did none of my samples get converted during machine processing?
After all, that was the standard form of entry for concertos before AACR2; surely those entries must appear as cross references in the authority file, at least in the case of standard repertoire. Let's have a look. In the case of the Elgar violin concerto, the pre-AACR2 form of entry happens not to appear as a cross reference. Bum luck. Well, what about the Elgar cello concerto? Bingo, there is pre-AACR2 in all its glory. But all of the entries in our title index still are not up to the current standard. What went wrong? That's easy. For a machine conversion of this type to take place, both parts must match exactly, which was precisely not the case in our catalog. Each of the 3 entries that begin the cello concerto list differed from the cross reference in one or more details.

For my final example I present what proved to be a hopeless assignment for BNA processing, bringing together all the entries for everyone's favorite (except the composer's): the Rachmaninoff prelude in C# minor. Let me run through the search that is required to locate all the entries in our catalog even after BNA processing, reminding you that I determined that the Rachmaninoff file was not covered in our own, pre-BNA authority cleanup project. First we search Rachmaninoff, then jump down to the preludes with the jump command (or, as I have observed some of our patrons do, by machine-gunning the forward screen command, in this case, 37 times). Entry 298 is a prelude in C# minor. Surely we must have more than one, and besides, that opus number looks a little fishy to me. So we press on until we come to entry 325, another prelude in C# minor that proves to be a see reference. We are adventurous and choose the see reference, cleverly noting that entry 324 just above has the same opus number, but is apparently not in C# minor. Now we read the see reference: Yes please. Now that's more like it, 7 entries. Are there more? Yes, one.
No, there's more opus 3 with preludes, only this time definitely for piano. Another 7 with piano. And for good measure, one that is not necessarily for piano or opus 3. The result: 5 different forms of entry, with only 7 of 17 entries actually using the form of entry the computer so confidently claims is the one used by the library. Looks like I've got a lot of work to do when I get back to Cincinnati!

Let me briefly summarize what I found BNA processing was and was not able to accomplish for music records at the University of Cincinnati. I will mention a few things that I did not document in my presentation.

Machine processing successfully converted exact-match see-reference headings to the current form. This was very effective for subject headings and personal names, much less so for series and uniform titles.

Manual review corrected most mechanical errors and form-of-entry discrepancies in name headings and eliminated many split files in uniform titles by pluralizing singular forms of genre designations.

Neither process was able to eliminate the chronological subdivision --To 1800, change piano-vocal score to vocal score in uniform titles, fix the designation of numbers within an opus, add the accompanying instrumentation to concerto headings, or delete instrumentation, opus number, key, or language from headings where they are not required.

We at Cincinnati are certainly pleased with the improvements BNA processing made in our catalog, but recognize that local authority control begins again on Monday.

Copies of the examples [overheads] are available from the author: paul.cauthen@uc.edu