Wednesday, September 12, 2007

Fez - first round of testing

(This post is by Peter, Tim and Bron)

The second repository solution we tried out was Fez. Again, we focused on the simple question of whether it can handle 200,000-plus items. If it survives this round, we'll then consider ingest processes and workflow.

We tested Fez version 1.3, taken from the Subversion trunk as of July 16th.

What we did

As with the VITAL test, Tim and Bron used a Fedora database with 30,000 or so items in it, each with MARCXML and MODS datastreams and no full text. (We're being kind to the software by starting small.)

Set up

Machine used: 8GB CentOS Linux with 1.5GB RAM

Fez was installed successfully, but the installation was made more difficult because we chose CentOS, while the Fez Digital Repository Wiki recommends installing on the following platforms: Windows 2003, Windows XP, Mac OS and Kubuntu.

To prepare the data for ingestion, marc.xml, mods.xml and dublin_core.xml were created from the basic Excel file using XSL stylesheets. A Python script was then used to construct a FOXML object for each item.
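The script itself is not reproduced in this post, so the following is only a minimal sketch of what wrapping a metadata record in FOXML can look like. The function name build_foxml and the PID "aanro:1" are made up for illustration, and a real FOXML object needs more than this (object properties, for example) before Fedora will accept it.

```python
import xml.etree.ElementTree as ET

# The FOXML namespace used by Fedora.
FOXML_NS = "info:fedora/fedora-system:def/foxml#"
ET.register_namespace("foxml", FOXML_NS)


def build_foxml(pid, datastreams):
    """Wrap already-parsed metadata records in a minimal FOXML object.

    pid         -- object identifier (the value used below is hypothetical)
    datastreams -- dict mapping a datastream ID ("DC", "MODS", ...) to the
                   root Element of that metadata record
    """
    obj = ET.Element(f"{{{FOXML_NS}}}digitalObject", {"PID": pid})
    for ds_id, record in datastreams.items():
        ds = ET.SubElement(
            obj,
            f"{{{FOXML_NS}}}datastream",
            {"ID": ds_id, "STATE": "A", "CONTROL_GROUP": "X"},
        )
        # Version ID follows the "<ID>.0" pattern listed under
        # "Fez Data Requirements" in this post.
        version = ET.SubElement(
            ds,
            f"{{{FOXML_NS}}}datastreamVersion",
            {"ID": f"{ds_id}.0", "MIMETYPE": "text/xml"},
        )
        content = ET.SubElement(version, f"{{{FOXML_NS}}}xmlContent")
        content.append(record)
    return obj


# Illustrative MODS record; real records came from the XSL output.
mods = ET.fromstring(
    '<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">'
    "<mods:titleInfo><mods:title>Example item</mods:title></mods:titleInfo>"
    "</mods:mods>"
)
foxml = build_foxml("aanro:1", {"MODS": mods})
print(ET.tostring(foxml, encoding="unicode"))
```

In a batch run, a function like this would be called once per row of the source spreadsheet and the result written to one file per item.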

(NOTE: To perform a Fedora ingest using the fedora-ingest command, the server running the Fez/Fedora repository needed to have the Fedora Client software installed.)

Fez Data Requirements

Fez requires each datastream in the Fedora XML (FOXML) to carry the following values.

Datastream ID:

example for Dublin Core

<foxml:datastream ID="DC">

example for MODS

<foxml:datastream ID="MODS">

Datastream Version ID:

example for Dublin Core

<foxml:datastreamVersion ID="DC.0">

example for MODS

<foxml:datastreamVersion ID="MODS.0">
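A quick way to catch objects that break these conventions before ingest is to scan each FOXML file. The sketch below assumes the rule generalises to "version ID equals datastream ID plus .0", which is our reading of the two examples above rather than a documented Fez rule. The function name check_datastream_ids and the sample document are illustrative only.

```python
import xml.etree.ElementTree as ET

# Clark-notation prefix for elements in the FOXML namespace.
FOXML = "{info:fedora/fedora-system:def/foxml#}"


def check_datastream_ids(foxml_text):
    """Return a list of problems found with datastream and version IDs."""
    problems = []
    root = ET.fromstring(foxml_text)
    for ds in root.iter(FOXML + "datastream"):
        ds_id = ds.get("ID")
        if not ds_id:
            problems.append("datastream without an ID")
            continue
        version_ids = [
            v.get("ID") for v in ds.findall(FOXML + "datastreamVersion")
        ]
        expected = ds_id + ".0"  # assumed convention, e.g. DC -> DC.0
        if expected not in version_ids:
            problems.append(
                f"{ds_id}: expected version ID {expected}, found {version_ids}"
            )
    return problems


# Illustrative object: DC follows the convention, MODS does not.
sample = (
    '<foxml:digitalObject xmlns:foxml="info:fedora/fedora-system:def/foxml#">'
    '<foxml:datastream ID="DC"><foxml:datastreamVersion ID="DC.0"/></foxml:datastream>'
    '<foxml:datastream ID="MODS"><foxml:datastreamVersion ID="MODS1.0"/></foxml:datastream>'
    "</foxml:digitalObject>"
)
for problem in check_datastream_ids(sample):
    print(problem)
```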

MODS Subject:

The Library of Congress creates MODS subjects with several topics inside one subject element:

<mods:subject><topic>QLD</topic><topic>NSW</topic></mods:subject>

Fez expects a separate MODS subject element for each topic:

<mods:subject><topic>QLD</topic></mods:subject><mods:subject><topic>NSW</topic></mods:subject>
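The conversion between the two forms is mechanical, so it can be scripted. Here is a rough sketch, assuming every element is in the MODS namespace and that only subjects made up entirely of topic elements should be split. The function name split_subjects is ours, not something from Fez or the MODS tooling.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def split_subjects(mods_root):
    """Rewrite multi-topic subjects as one subject per topic, in place."""
    subject_tag = f"{{{MODS_NS}}}subject"
    topic_tag = f"{{{MODS_NS}}}topic"
    # Snapshot the tree first so we can safely edit children while looping.
    for parent in list(mods_root.iter()):
        rebuilt = []
        for child in list(parent):
            topics = child.findall(topic_tag) if child.tag == subject_tag else []
            # Leave single-topic subjects and mixed-content subjects alone.
            if len(topics) < 2 or len(topics) != len(child):
                rebuilt.append(child)
                continue
            for topic in topics:
                single = ET.Element(subject_tag, dict(child.attrib))
                single.append(topic)
                single.tail = child.tail
                rebuilt.append(single)
        parent[:] = rebuilt


record = ET.fromstring(
    '<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">'
    "<mods:subject><mods:topic>QLD</mods:topic><mods:topic>NSW</mods:topic></mods:subject>"
    "</mods:mods>"
)
split_subjects(record)
print(ET.tostring(record, encoding="unicode"))
```

The same transformation could equally be done in the XSL stylesheet that produces mods.xml; doing it in Python just keeps it next to the FOXML construction step.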

MODS Namespace:

The following namespace was suggested by Fez developers.

<mods:mods xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema">

We found that using <mods:mods> caused a conflict in the XML editor, so we are using <mods> instead, for example:

<mods xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema">

Data Ingestion

Items were ingested into Fedora easily. The whole ingest would have completed within an hour.

Fez then has to index the items residing in Fedora. This process is initiated from a browser and took an extremely long time (i.e. at least overnight). The browser hung repeatedly. The indexing has been completely rewritten for Fez version 2.

Performance

30999 items ingested

Show All items took more than 3 minutes to load

Search for "pig" took 5 seconds to load

Performance should improve with the new release, but how much remains to be seen.

Current concerns

There are some concerns with Fez as the basis for an AANRO repository.

  1. Indexing looks like it will be too slow in the current version, and the server is unusable while indexing takes place. If this were to remain the case then Fez would be out of the running, but we're told that a new Fez release, due within a few days, will improve performance dramatically and will be usable while indexing takes place.

  2. Fez has no support for handles as persistent identifiers, a topic we have yet to discuss with the AANRO team. There may be software coming out of PILIN that can help, and this may not be an issue if AANRO can live without handles.

  3. The configuration for adding items is very, very complex. If we stick to using the same metadata schema as the University of Queensland, then this may not be too much of a problem. We do need to check that configuration changes can be exported from the database in which they are stored and kept under version control.

Copyright 2007 The University of Southern Queensland

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.
