Wednesday, October 24, 2007

Implementing Author Affiliation in Fez

I wrote here before about author affiliation. How can you keep track of not only the name of an author but their institutional affiliation for a particular item in a repository?

Because it uses MODS metadata by default, the Fez repository can deal with this, but the default configuration for generic items does not include affiliation. So this is a good opportunity to see how Fez configuration works.

In this post I'll look at some of the main issues for AANRO:

  1. Mapping the existing data to MODS using the tools we developed for the RUBRIC project.

  2. Configuring Fez to recognize the affiliation and display it.

  3. Configuring an OAI-PMH feed so that the repository can be part of a federation and be linked into services like Google Scholar.

1 Data mapping

In the AANRO data we're working with, the author field contains the affiliation in brackets, and our first mapping to MODS left this as-is. To fix it I needed to change the stylesheets we use to map the original data, which is in an ad hoc format, into a standard format.
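To make the shape of the problem concrete, here is a minimal Python sketch of pulling the names and affiliations apart. It is not the stylesheet we actually use (that is XSLT); the function name `split_authors` and the regular expression are my own illustration, and it assumes authors are separated by `|` with the affiliation in a trailing pair of brackets.

```python
import re

# Matches "Name (Affiliation)"; the bracketed affiliation is optional.
AUTHOR_PATTERN = re.compile(r"^\s*(?P<name>[^()]+?)\s*(?:\((?P<affiliation>.*)\))?\s*$")


def split_authors(field):
    """Split a pipe-separated author field into (name, affiliation) pairs."""
    pairs = []
    for entry in field.split("|"):
        # Collapse line breaks and runs of spaces left over from the export.
        entry = " ".join(entry.split())
        if not entry:
            continue
        match = AUTHOR_PATTERN.match(entry)
        if match:
            pairs.append((match.group("name"), match.group("affiliation")))
        else:
            # Unexpected shape: keep the raw text rather than lose data.
            pairs.append((entry, None))
    return pairs
```

An author with no brackets comes back with `None` as the affiliation, so records that never had an affiliation are not given an empty one.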

Here's the process in outline. This is what our tech officers did to get the data into Fedora:

  1. Transform the raw data to XML using Excel (it might be better to do this with Python in future).

  2. Split the resulting file into multiple parts, by using an XPath to select an element on which to break:


    python split_xml_into_archive.py   ~/aanro/Data/Xml/AANROPubArchive0_30k.xml  ~/datamigration/aanro1 "//record" DC.xml False

    The result is a series of directories (30,000 of them in this case), each containing a file called rec.xml in a format like this:

    <record>
    <id>38161</id>
    <doctype>Book or Report</doctype>
    <pubyear>1996</pubyear>
    <title>Guidelines on the quality of stormwater and treated wastewater for injection into
    aquifers for storage and reuse</title>
    <author>Dillon P (Australian Centre for Groundwater Studies)|Pavelic P (Australian Centre
    for Groundwater Studies)</author>
    ...
    </record>

  3. Transform each metadata record into MARC XML. (We could change this process and go straight to MODS, but our staff are used to working with MARC from RUBRIC, and we have some VITAL configuration that uses MARC.)


    python xsl_transform.py  temp.xml AANRO_xml_to_marc.xsl MARC.xml aanro1/ False

  4. Transform the MARC XML to MODS using a variant of a stylesheet we downloaded from the Library of Congress. This uses the same script as above, with a different XSL stylesheet.

  5. Transform the MARC XML to Dublin Core.

  6. Convert the Dublin Core and MODS metadata into FOXML, which can be ingested directly into Fedora.

  7. Ingest into Fedora using the Fedora client.

  8. Index the items into Fez:

    • Click on administration.

    • Click on (Maintenance) Index Objects.

    • Click Discover new Fedora objects.

    • Select the Generic Document Version MODS 1.0 under Document Type.

    • Click Index All Items into Fez.

    While the indexing is happening you can monitor it from the My Fez area: the Background Processes tab will tell you how long it expects the indexing to take.

    (Currently this is very slow, as I noted before, but Christiaan Kortekaass has another trick up his sleeve, which should improve indexing by at least an order of magnitude by using the same technique for accessing Fedora records as the Fedora GSearch indexer.)
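Step 2 above, breaking the export into one directory per record, can be sketched in a few lines of Python. This is not the `split_xml_into_archive.py` script from the RUBRIC toolkit; the function name `split_records`, the numbered directory names and the default file name are assumptions made for illustration, and it matches on a plain element name rather than a full XPath.

```python
import os
import xml.etree.ElementTree as ET


def split_records(source_path, output_dir, record_tag="record", file_name="rec.xml"):
    """Write each matching element to its own numbered directory; return the count."""
    count = 0
    # iterparse walks the file incrementally instead of loading it in one go.
    for _event, element in ET.iterparse(source_path, events=("end",)):
        if element.tag != record_tag:
            continue
        count += 1
        # Hypothetical naming scheme: 000001, 000002, ...
        record_dir = os.path.join(output_dir, f"{count:06d}")
        os.makedirs(record_dir, exist_ok=True)
        element.tail = None  # drop whitespace that followed the record
        ET.ElementTree(element).write(
            os.path.join(record_dir, file_name),
            encoding="utf-8",
            xml_declaration=True,
        )
        element.clear()  # free the record's children once it is on disk
    return count
```

Keeping one record per directory means each later transformation can add its output (MARC, MODS, Dublin Core) alongside the original without the files getting mixed up.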

I had to make changes to the stylesheet used for step 3. Here's the process I followed:

  1. Check out the RUBRIC toolkit code from Subversion.

  2. Locate the AANRO specific stylesheets.

  3. Find the UTF-X unit test for the AANRO to MARCXML XSLT transformation.

    A test for a simple transformation from <author> in the input data to MARCXML looks like this:


    <utfx:test>
        <utfx:name>Author</utfx:name>
        <utfx:assert-equal>
            <utfx:source validate="no">
                <author>Leng RA</author>
            </utfx:source>
            <utfx:expected validate="no">
                <datafield tag="100" ind1=" " ind2=" ">
                    <subfield code="a">Leng RA</subfield>
                </datafield>
            </utfx:expected>
        </utfx:assert-equal>
    </utfx:test>

  4. Add new tests to deal with affiliations.

    I added another test that deals with multiple authors with affiliation.


    <utfx:assert-equal>
        <utfx:source validate="no">
            <author>Dillon P (Australian Centre for Groundwater Studies)|Pavelic P (Australian
                Centre for Groundwater Studies X)</author>
        </utfx:source>
        <utfx:expected validate="no">
            <datafield tag="100" ind1=" " ind2=" ">
                <subfield code="a">Dillon P</subfield>
                <subfield code="u">Australian Centre for Groundwater Studies</subfield>
            </datafield>
            <datafield tag="700" ind1=" " ind2=" ">
                <subfield code="a">Pavelic P</subfield>
                <subfield code="u">Australian Centre for Groundwater Studies X</subfield>
            </datafield>
        </utfx:expected>
    </utfx:assert-equal>

    Since the authors had the same affiliation I added an 'X' to the second one to make sure that the right author gets the right affiliation.

  5. Write the XSLT to pass the test.

    This took a bit longer than expected, and I ended up having to talk to our metadata specialist about MARC indicators and the like, but I managed to get it working without breaking the other parts of the stylesheet, and simplified the code in the process.

  6. Check the code back in to Subversion.

  7. Update the code on our test machine and use it on the AANRO data.
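The mapping that the new test in step 4 pins down is simple to state: the first author goes into a 100 field, later authors into 700 fields, with the name in subfield a and the affiliation in subfield u. Here is a rough Python sketch of that rule, offered only as an illustration. The real implementation is the XSLT stylesheet in the RUBRIC toolkit, and the function name `authors_to_marc` is made up for this example.

```python
import xml.etree.ElementTree as ET


def authors_to_marc(author_field):
    """Build MARC XML datafields: first author in 100, later authors in 700."""
    datafields = []
    # Normalise whitespace in each pipe-separated entry and skip empty ones.
    entries = [" ".join(entry.split()) for entry in author_field.split("|")]
    for position, entry in enumerate(e for e in entries if e):
        name, _, rest = entry.partition("(")
        # Everything up to the last closing bracket is the affiliation.
        affiliation = rest.rsplit(")", 1)[0].strip() if rest else ""
        tag = "100" if position == 0 else "700"
        field = ET.Element("datafield", {"tag": tag, "ind1": " ", "ind2": " "})
        ET.SubElement(field, "subfield", {"code": "a"}).text = name.strip()
        if affiliation:
            ET.SubElement(field, "subfield", {"code": "u"}).text = affiliation
        datafields.append(field)
    return datafields
```

Calling `ET.tostring(field, encoding="unicode")` on each returned element gives markup that can be compared by eye with the expected output in the UTF-X test.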

We will include the AANRO code in a release of the RUBRIC toolkit which will be undergoing a tidy-up in early 2008.

2 Configuring Fez metadata entry and display

Once there are records in Fedora and a Fez index to make them show up in Fez, we have to configure the display and the HTML form used to edit metadata.

The key to this is a system which maps XML Schemas to HTML forms. An XML schema is a description of the structural potential of a type of document, such as a MODS metadata document.

This is a complex and confusing area of the system, which generates a large percentage of the traffic on the Fez users mailing list.

I'm stuck on this at the moment. Even with the very helpful Fez developers online in their Campfire chat, I can't figure out how to add an affiliation to an author.

One thing I have tried is to export the mapping as XML, but it's really not a human-editable format. Still, it's good to see that the configuration can be exported, kept under version control and re-imported to a server. (And Christiaan Kortekaass tells me that there will soon be an option to lock down Fez so that even administrative users can't mess around with the mappings on a live system. Good news.)

So we have to leave this configuration to a future AANRO implementation team, if Fez is selected. One big issue outstanding is that quite a bit of the AANRO data consists of project descriptions, not documents. This will require a fair bit of configuration.

3 Configuring OAI-PMH feeds

The web interface on a repository is not the only way that people find things. In fact most traffic is likely to come from elsewhere, particularly large search services like Google Scholar.

To expose metadata to other services you need to configure an OAI-PMH feed.
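From the harvester's side, a feed is just an HTTP request with a `verb` parameter and an XML response. The sketch below shows roughly what a client does with a ListRecords response: build the request URL, then pull out the `dc:creator` values for each record. The base URL in the usage note is a placeholder, not a real Fez endpoint, and the function names are my own.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NAMESPACES = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def creators_by_record(response_xml):
    """Map each record identifier to its list of dc:creator values."""
    root = ET.fromstring(response_xml)
    result = {}
    for record in root.iterfind(".//oai:record", NAMESPACES):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NAMESPACES)
        creators = [el.text for el in record.iterfind(".//dc:creator", NAMESPACES)]
        result[identifier] = creators
    return result
```

For example, `list_records_url("http://example.org/oai.php")` builds the request for a hypothetical repository; fetching it and passing the body to `creators_by_record` is a quick way to check what a federation partner would actually see.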

Fez has highly configurable OAI-PMH templates. Here's a snippet from the ListRecords template:

{assign var="loop_authors" value=$list[i].rek_author}
{section name="a" loop=$loop_authors}
<dc:creator>{$list[i].rek_author[a]|escape:"html"}</dc:creator>
{/section}

Now I don't speak Smarty Templates, but obviously this is a loop (over loop_authors) that iterates over the authors recorded for an item and spits out a Dublin Core creator element for each one. For AANRO it is possible that the implementer would want to change this to add an affiliation in brackets, as there's no simple way in Dublin Core to associate an author with an affiliation. Yes, this is effectively undoing the work I described above to split the author and their affiliation, but it may turn out to be a useful mechanism in an AANRO federation, to overcome some of the limitations of Dublin Core metadata.

It might look something like this, assuming that there's a key 'f' for affiliation in the author array:

<dc:creator>{$list[i].rek_author[a]|escape:"html"}
({$list[i].rek_author[f]|escape:"html"})
</dc:creator>
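To show the output such a template is aiming for, here is the same idea as a small Python function. It is only a sketch of the intended result, not Fez code, and `dc_creator` is a name invented for the example.

```python
from xml.sax.saxutils import escape


def dc_creator(name, affiliation=None):
    """Render one dc:creator element, with the affiliation in brackets if present."""
    text = f"{name} ({affiliation})" if affiliation else name
    # escape() handles &, < and > so the result stays well-formed XML.
    return f"<dc:creator>{escape(text)}</dc:creator>"
```

Two details are worth carrying over to a real template: keep the element on one line so harvesters don't pick up stray whitespace, and leave the brackets out altogether when an author has no affiliation.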

4 Conclusion

This brief blog post has covered a few of the major areas of concern for AANRO. The data migration stuff will be expanded in the forthcoming data migration guide for AANRO, and pointers to potential Fez implementation will be in the final report.

I mentioned the Subversion revision control system and UTF-X unit tests for XSLT because they're really helpful in this kind of work, but not everyone we come across in the library systems world uses them.

Copyright 2007 The University of Southern Queensland

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.
