Thursday, September 13, 2007

AANRO repository architecture

There's an assumption in the repositories that we have used and evaluated at USQ that the repository is a fairly self-contained, monolithic application. Eprints, DSpace, VITAL and Fez are all pretty much like this; they provide end-to-end services for putting stuff into a repository, managing it and disseminating it to the web, for humans to use. (I don't know enough about Mura to comment on its architecture at this stage.)

But while they all provide end-to-end services, all of them are designed to disseminate their contents in a machine-readable way too, making it straightforward to aggregate and disaggregate data and have it flow around a network of repositories, via the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting). The OAI website says:

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository interoperability. Data Providers are repositories that expose structured metadata via OAI-PMH. Service Providers then make OAI-PMH service requests to harvest that metadata. OAI-PMH is a set of six verbs or services that are invoked within HTTP.1

If you have a repository then OAI-PMH means it can be harvested by things like the ARROW discovery service2, or the OAIster3 service.
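To make the harvesting idea concrete, here is a minimal Python sketch of what a Service Provider does: build an OAI-PMH request URL for one of the six verbs, then pull identifiers and titles out of a ListRecords response. The base URL, the sample response and the function names are all made up for illustration; a real harvester would also need to fetch the URL over HTTP and follow resumption tokens.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# The six verbs defined by OAI-PMH.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}

# XML namespaces used by OAI-PMH 2.0 responses and by Dublin Core metadata.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def build_request(base_url, verb, **args):
    """Return the GET URL for an OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError("not an OAI-PMH verb: " + verb)
    return base_url + "?" + urlencode({"verb": verb, **args})


def extract_records(response_xml):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(response_xml)
    records = []
    for record in root.iter(OAI_NS + "record"):
        identifier = record.findtext(OAI_NS + "header/" + OAI_NS + "identifier")
        title = record.findtext(".//" + DC_NS + "title")
        records.append((identifier, title))
    return records


# A cut-down, invented response, standing in for what a repository would return.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample record</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

# Hypothetical repository endpoint; oai_dc is the Dublin Core metadata prefix.
print(build_request("http://repository.example.org/oai", "ListRecords",
                    metadataPrefix="oai_dc"))
print(extract_records(SAMPLE_RESPONSE))
```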

As I wrote for the RUBRIC website, you should consider how many repositories you might want:

It is by no means a given that a single repository needs to be used for everything. While there are benefits to minimizing the number of different packages required, different types of repository may be best served by different software.

Consider whether different software solutions might be appropriate for:

  • An Open Access research repository.

  • A thesis repository, which may require embargo features and authorization, for example if theses contain third party copyright material, confidential material or information that could compromise patents.

  • Image collections.

  • Work in progress.

  • A preservation repository, containing records from all of the above but without a public portal.4

One possible architecture for an AANRO repository would be something like the ARROW discovery service, an aggregated view of data that may reside in multiple repositories.

Why would AANRO consider an architecture like this?

  1. Performance! So far we have found that managing 200K records is looking a bit daunting for the two packages we have tried (VITAL and Fez). Yes, they both have new versions and yes, we will re-test, but it makes sense to consider a model where there may be a number of regional or subject-based repositories that are of a more manageable size for repository maintainers, with a high-performance discovery portal aggregating them together.

  2. Re-using existing infrastructure. Nerida Hart from the AANRO told me on Tuesday (we were at the Long Lived Collections meeting in Canberra) that some of the agencies involved are using a project management system which already holds some of the material that needs to go into the repository. It may be more appropriate for groups that have such infrastructure to stick with it, and build connectors that can send it off to a central repository.

  3. Meeting the needs of different agencies with different software. The players in AANRO vary in size: some may have the resources to manage their own system and be able to justify having a repository locally, which may extend beyond items that are in scope for AANRO, while others are very small and may be better off with a simple, minimalist web application where they can deposit items and have them reviewed by AANRO-central. (I'm speculating a bit here; I'm sure AANRO staff can fill us in with specific examples. Do any of the partners already have a repository that talks OAI-PMH?)
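As a rough illustration of the aggregated model, the sketch below merges records harvested from several repositories into one discovery index, keyed by identifier, and remembers which repositories hold each item. The repository names, record fields and the aggregate function are invented for the example; it says nothing about how ARROW or any particular product actually does this.

```python
def aggregate(harvests):
    """Merge per-repository harvests into a single index.

    harvests maps a repository name to a list of record dicts, each with
    "identifier" and "title" keys (an assumed, simplified record shape).
    """
    index = {}
    for source, records in harvests.items():
        for record in records:
            # Keep the first title seen for an identifier; track every source.
            entry = index.setdefault(
                record["identifier"],
                {"title": record["title"], "sources": []},
            )
            entry["sources"].append(source)
    return index


# Invented sample data: two regional repositories sharing one record.
harvests = {
    "regional-a": [
        {"identifier": "oai:example.org:1", "title": "Soil salinity survey"},
        {"identifier": "oai:example.org:2", "title": "Irrigation trial report"},
    ],
    "regional-b": [
        {"identifier": "oai:example.org:2", "title": "Irrigation trial report"},
    ],
}

for identifier, entry in aggregate(harvests).items():
    print(identifier, entry["title"], entry["sources"])
```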

As part of the discussion about a potential federated architecture, I will write soon about our experiments with Apache Solr, which is a search-portal service that makes it easy to index large amounts of data and build interfaces.

Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It runs in a Java servlet container such as Tomcat.5
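To give a flavour of the XML/HTTP interface, this Python sketch builds the two things a harvester-plus-portal would send to Solr: an XML add message for indexing records, and a select URL for a faceted query returning JSON. It only constructs the message and the URL, it does not contact a server. The Solr address and the field names (id, title, repository) are assumptions; real field names depend on the Solr schema in use.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


def build_add_message(records):
    """Build a Solr XML <add> message, which would be POSTed to /update."""
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")


def build_facet_query(solr_url, text, facet_field):
    """Build a /select URL asking for JSON results faceted on one field."""
    params = {
        "q": text,
        "wt": "json",
        "facet": "true",
        "facet.field": facet_field,
    }
    return solr_url + "/select?" + urlencode(params)


# Hypothetical record and server address, for illustration only.
records = [
    {"id": "oai:example.org:1", "title": "Sample record", "repository": "regional-a"},
]
print(build_add_message(records))
print(build_facet_query("http://localhost:8983/solr", "soil salinity", "repository"))
```

Faceting on a field such as the source repository is one way a discovery portal could let users narrow an aggregated result set back down to a single region or agency.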

I was talking to Heather Myers from VTLS (the company behind VITAL) today and I mentioned Solr and guess what? Their forthcoming product called Visualizer is built on Solr.

I've heard of Solr being used elsewhere in the Australian library scene but need to confirm details.

Copyright 2007 The University of Southern Queensland

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.


1 Open Archives Initiative Protocol for Metadata Harvesting, http://www.openarchives.org/pmh/ (accessed September 12, 2007)

2 Australian Research Repositories Online to the World, http://search.arrow.edu.au/apps/ArrowUI/ (accessed September 12, 2007)

3 OAIster | Home, http://www.oaister.org/ (accessed September 12, 2007)

4 Repository software, http://rubric.edu.au/repositories/choosingarepository.htm (accessed September 12, 2007)

5 Welcome to Solr, http://lucene.apache.org/solr/ (accessed September 12, 2007)
