Thursday, September 6, 2007

VITAL - first round of testing

[Update: Heather Myers of VTLS has pointed out that we started out testing on a machine that had too little RAM. She also expressed concerns about using virtual machines, but there's nothing we can do about that in the short term; we will await advice from VTLS about why virtual machines are a problem. This was a stupid mistake on our part and I apologize to VTLS for publishing misleading results. I (Peter) am going to remove the errant results, even though that's against my normal blogging ethics and even though they are a useful data point that emphasises that you should not try to under-spec a machine to run VITAL. We were going to retest soon anyway with an updated version of the software, so we'll retry with a better-specified machine.]

(This post is by Peter Sefton and Bron Dye)

The first round of repository software testing is focusing mainly on scalability. AANRO has 200,000 records and counting, so there's no point in looking at software that can't handle that. Once we have an idea of which software handles 200,000 records, we will begin to look at how AANRO might ingest new records and maintain old ones.

We're all confident that Fedora itself can handle lots of records; it's whether they can be used that's in question.

First up was VTLS VITAL, version 3.1.

What we did

Tim and Bron, the technical officers who are working 50/50 on this project, took the data from the AANRO database and went through a simple-but-tedious process to get it into Fedora in the MODS metadata standard. There's a post coming about how they did this, which we'll try to make less tedious than the process it describes.

The result was a Fedora database with 30,000 or so items in it with MARCXML and MODS datastreams, and no full text. We're being kind to the software by starting small.

We took a plain-vanilla installation of VITAL with no configuration changes at all (i.e. all default indexes and settings) and pointed it at a repository containing first 10K, then 20K, then 30K records.

Set up

Machine used: 8GB CentOS Linux with 512MB RAM

VITAL installation was completed successfully, without any problems.

As part of preparing the data for ingestion, Tim and Bron created marc.xml, mods.xml and dublin_core.xml from the basic Excel file using XSL stylesheets. They then used a Python script to construct a FOXML (Fedora) object for each item.
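To give a feel for that last step, here is a minimal sketch of wrapping a metadata record in a FOXML object using only the Python standard library. It is not the script Tim and Bron used. The function names, the "demo:1" PID and the one-field MODS record are made up for illustration, and the FOXML is simplified, so check it against the FOXML schema for your Fedora version before ingesting anything.

```python
import xml.etree.ElementTree as ET

FOXML_NS = "info:fedora/fedora-system:def/foxml#"
MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("foxml", FOXML_NS)
ET.register_namespace("mods", MODS_NS)


def build_mods(title):
    """Return a minimal MODS record holding only a title (illustrative)."""
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = title
    return mods


def build_foxml(pid, label, datastreams):
    """Wrap XML records as inline datastreams of one FOXML object.

    datastreams maps a datastream ID (e.g. "MODS") to an ElementTree element.
    """
    obj = ET.Element(f"{{{FOXML_NS}}}digitalObject", {"PID": pid})

    # Object-level properties: state and label only (simplified).
    props = ET.SubElement(obj, f"{{{FOXML_NS}}}objectProperties")
    for name, value in (("state", "Active"), ("label", label)):
        ET.SubElement(props, f"{{{FOXML_NS}}}property", {
            "NAME": f"info:fedora/fedora-system:def/model#{name}",
            "VALUE": value,
        })

    # One inline-XML datastream (CONTROL_GROUP "X") per metadata record.
    for ds_id, content in datastreams.items():
        ds = ET.SubElement(obj, f"{{{FOXML_NS}}}datastream", {
            "ID": ds_id, "STATE": "A", "CONTROL_GROUP": "X",
        })
        version = ET.SubElement(ds, f"{{{FOXML_NS}}}datastreamVersion", {
            "ID": f"{ds_id}.0", "MIMETYPE": "text/xml", "LABEL": ds_id,
        })
        ET.SubElement(version, f"{{{FOXML_NS}}}xmlContent").append(content)
    return obj


if __name__ == "__main__":
    # "demo:1" is a hypothetical PID, not one from the AANRO data.
    record = build_foxml("demo:1", "Example record",
                         {"MODS": build_mods("Example record")})
    print(ET.tostring(record, encoding="unicode"))
```

In a real run the MODS, MARCXML and Dublin Core elements would come from parsing the files produced by the stylesheets rather than being built by hand, with one FOXML file written out per item.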

Ingestion

[This part has been removed as we tested with too little RAM.]

Performance

[two lines removed]

Following the increase of RAM to 1.5GB:

Show All page took 45 seconds to load

Search for "pig" took 35 seconds to load

Display a page from page range took 37 seconds to load.

Summary

At present, the AANRO data does not include full-text versions of the articles, which is lucky, because the current release is unable to take a Fedora repository and create a full-text index or add preservation-related metadata streams automatically.

(This limitation includes existing VITAL version 2 repositories, meaning that there is no automated upgrade path for existing customers. An upgrade script has been promised and a USQ staff member has seen screenshots today, but so far we can't confirm that it works.

[Update: we can now confirm that there is an upgrade path, but at the time of writing the above was correct.])

After this experience, the testing with Fez that followed was conducted on a virtual machine with 1.5GB of RAM from the start. We'll re-run the VITAL tests with the same configuration and the latest available software and post a comparison.

Current concerns

There are some concerns with VITAL as the basis for an AANRO repository.

  1. Indexing looks like it will be too slow. If you needed to recover a repository then, on current indications, it could take days to re-index. To be fair, we'll re-run our indexing test using more RAM on the server, and remove unwanted indexing code. The vendor has just notified us that there is a version 3.1.1 that will fix some of the indexing / searching problems.

  2. At present, our understanding is that VITAL can only be used for OAI harvesting if handles are being used for identifiers. We have not discussed using handles with AANRO, but if they wanted to go live without them then this would be a problem.

  3. Getting items into VITAL remains a major problem. There are two options for user input in the current version, both of which are less than ideal:

    1. VALET is a simple web application for entering metadata and uploading data which may suit AANRO's needs, except for one very major flaw: it is not connected to Fedora, so it can be used for creating new records but not for editing existing ones.

    2. VITAL has a limited management interface via the web and a Microsoft Windows client, both of which require users to edit XML using an external editor. This is not going to suit AANRO's data entry staff.

Copyright 2007 The University of Southern Queensland

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.
