Wednesday, September 19, 2007

AANRO Requirements summary

About this document
Search Facility
Search results
Content harvesting and publishing
Indexing
Content Management
- Support for web-based document management, auditing, simple workflow, including research status, publishing rights and ability to edit incorrect content.
- Archive facility to flag data that exceeds a particular age.
Content reporting
- Analysis and reporting of traffic using an industry standard tool.
- Activity reports summarising research documents, projects and research status.
Application maintenance and support
Interoperability with other repositories
- Web Single Sign On
Data migration

About this document

[UPDATE: Changed the format of the document so it renders better in the blog, and fixed a few typos.]

This document contains a summary of requirements for the AANRO repository, which originally appeared in a Request For Quote from 2006. One of the deliverables on this project is an annotated version of the RFQ so I'm starting on that now and I'm doing it in public so others can contribute and correct me. (I may update this post with change-tracking turned on)

Mathew Silver of Land and Water prepared the original summary of requirements document for me early this year [update: Mat only passed it on] and we used it to compare the requirements with the then-current VITAL 2.1. At that stage VITAL definitely would not have met the requirements: it did not have a web interface to edit Fedora objects, and index and search performance was too slow. We have seen a marked improvement in performance with a pre-release version of 3.1.1, but it still lacks an easy-to-use web interface for modifying objects stored in Fedora.

I'm putting up this list with some comments and questions, inviting responses from the developers involved. If you can clarify anything or I've made an error, use the comments below or drop me an email.

Search Facility

Rapid and comprehensive access to records of research in-progress, finalised research reports, formal publications, journal/conference proceedings and website content

This is what we're testing now. Jury is still out on how rapid the various solutions will be – VITAL 3.1.1 is a definite improvement over 3.1 and Fez 2 should be better than Fez 1.3. Mura is proving difficult for us to install.

Basic and advanced search operations including thesaurus keywords, metadata, organisation, geographic locality and other supported metadata.

I think that all the products will be flexible enough to handle this requirement, but all of them have different configuration systems.

Thesaurus support is an issue, see comments below on adding items to the repository.

Capability to externally link to, and store in a central database, a variety of common file formats including HTML, PDF, word, excel, delimited text, tagged and XML content.

All the repositories can in some sense store all these things.

HTML is not well supported by VITAL or Fez – how about Mura?

(2007-09-19: I tried adding an HTML object to the version of Mura hosted by the developers. The metadata input screens are nice and clean, but you really need to switch to advanced mode: in the simple input, for example, there is a field for 'author' but no way to separate given and family name. When I submitted the item the application threw an exception.)

(USQ is running a project called ICE-RS which is looking at how to get research publications into repositories in HTML and PDF format; by the end of 2007 we should have proof of concept implementations – it's not a given that this would be any use to AANRO though.)

Support for the development of applications to search special data-collections.

In one sense this is supported by default because all the solutions use Fedora, but it's not really a feature of any of the software yet. Mura is the most standards compliant of the software in the sense that it uses native Fedora services for as much as possible. So if you configure access control or indexing then you can use other software.

I know that Fez plans to add support for special-purpose portals. This will be at the level of the Fez application; you will need to use a Fez API to leverage any Fez access controls.

VTLS: would you like to share your plans in this area? (I have talked to VTLS and the ARROW development group about this issue but not sure if they want that made public yet)

Support for the existing custom developed applications and sites using subsets of AANRO data.

This raises questions for AANRO – what sites are now consuming the data? Are there any sites that duplicate the data? Are there opportunities for federation – i.e. participating organizations who have or would like to have their own repository that feeds into the central AANRO registry? (It sounds like there are from the requirements, but we need more detail about the participating organizations.)

Search results

A consistent, standardised and user-friendly website and search tool.

We will consider friendliness after we look at performance but I think all the packages will be OK here, subject to meeting the next few requirements.

Customisable, sorted by relevance, easy-to-read search results that include links to other relevant research.

All the packages are customizable – we'll share our impressions as we go.

VITAL

Uses XSLT – easy to use, but I suspect it may be one cause of the slowish performance we have seen in the past.

Relevance ranking: Not sure about this one. VTLS?

VITAL allows 'faceted search': for any given result set, it shows how many matches the set contains against each of a number of indexes, and you can then narrow or 'limit' your search. Say I search the whole repository for the word goat and it tells me there are 400 items of type Conference paper; I can click on Conference paper to limit the search to the 400 conference paper items that contain goat.

I mentioned in a previous post that you can also do this with Apache Solr, with only a very small amount of software on top of it; we will try to demonstrate this with AANRO data in the next couple of weeks.
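As a rough illustration of the facet-and-limit behaviour described above, here is a minimal sketch in plain Python. It is not how VITAL or Solr implement faceting; the records, the field names and the three helper functions are all made up for the example.

```python
from collections import Counter

# Hypothetical records: the titles, the "type" field and its values are invented.
RECORDS = [
    {"title": "Goat nutrition in arid zones", "type": "Conference paper"},
    {"title": "Managing goat herds", "type": "Report"},
    {"title": "Feral goat control", "type": "Conference paper"},
    {"title": "Pig housing design", "type": "Report"},
]


def search(records, word):
    """Return the records whose title contains the word (case-insensitive)."""
    word = word.lower()
    return [r for r in records if word in r["title"].lower()]


def facet_counts(results, field):
    """Count how many results fall under each value of one index (facet)."""
    return Counter(r[field] for r in results)


def limit(results, field, value):
    """Narrow a result set to the items matching one facet value."""
    return [r for r in results if r[field] == value]


hits = search(RECORDS, "goat")
counts = facet_counts(hits, "type")
papers = limit(hits, "type", "Conference paper")
```

The counts are what a portal would show beside each facet value, and `limit` is what happens when the user clicks one of them.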

Fez

Uses the full text engine that comes with the database. Not sure about how relevance is calculated but you can sort results by relevance so it's there.

Fez doesn't seem to support faceted searching or limits (based on my experiments with the University of Queensland eSpace server) but when you look at an item it will let you search for related items by clicking on the subject. Faceted search would be a nice-to-have.

Fez has search-completion, which suggests search terms as you type; users are coming to expect this because Google does it.

Mura

Search is quite rudimentary at present. There are no facets, and no hyperlinks to search for related items.

Feedback mechanism on unsuccessful searches.

We'll check.

Content harvesting and publishing

Use of high levels of automation to identify, capture, harvest, reformat and index new content from a variety of sources and agencies.

I think it's safe to say that currently none of our candidate systems do this. They all publish content via OAI-PMH but none of them harvest content automatically. This is one area where there might need to be custom development, after careful requirements gathering; hard to comment at this time without a lot more detail about the sources and agencies.

An alternative architecture where the AANRO portal used something like Apache Solr with an OAI-PMH harvester bolted on might meet this requirement. Content from agencies could either be indexed directly by a portal (via a feed of some kind) or stored in a central Fedora repository and then indexed. We'll have much more detail on alternative architectures by the end of the project.
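To give a feel for what a bolted-on OAI-PMH harvester involves, here is a minimal sketch of the paging loop. It is an assumption-laden illustration, not part of any of the candidate systems: the `fetch` function is supplied by the caller (in real use it would make an HTTP request to an agency's OAI-PMH endpoint), and the two canned responses and their identifiers are invented so the example runs without a network.

```python
import xml.etree.ElementTree as ET

# The OAI-PMH 2.0 response namespace.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) for one response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An absent or empty resumptionToken means the list is complete.
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


def harvest(fetch):
    """Collect identifiers page by page until no resumption token comes back.

    `fetch` is a hypothetical caller-supplied function that takes a
    resumption token (or None for the first request) and returns XML text.
    """
    identifiers, token = [], None
    while True:
        page_ids, token = parse_list_records(fetch(token))
        identifiers.extend(page_ids)
        if token is None:
            return identifiers


# Canned responses standing in for a real repository (invented data).
PAGE_1 = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

PAGE_2 = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:3</identifier></header></record>
    <resumptionToken/>
  </ListRecords>
</OAI-PMH>"""


def canned_fetch(token):
    return PAGE_1 if token is None else PAGE_2


harvested = harvest(canned_fetch)
```

A real harvester would also need to pull out the metadata itself, handle errors and deleted records, and hand each record to the indexer; this only shows the resumption-token loop.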

Individual web-based publishing i.e. input of research information by participants.

VITAL

As noted previously, VITAL does not have a fully-integrated web input system. It is not suitable for AANRO's requirement that non-specialist staff from small agencies (at least) can use a web form to add and update repository content.

So, while the VITAL portal has some nice features, it is severely lacking on the input workflow side. It seems likely that an AANRO solution would require additional development for web-based input.

Fez

Fez has a customisable web-based input system with workflow that is highly configurable. We are confident that this would be usable by agencies contributing to a central database.

I have previously expressed concern with some aspects of the way the web-input system in Fez works. It uses a very complicated web interface to map elements in an XML schema to web forms and stores the results in a database. Quite a lot of the traffic on the Fez mailing list is about people having problems configuring this interface.

Another concern is with the way that configuration may be versioned and moved between test and production instances. Is it possible to export configuration, keep it in Subversion or the like and then deploy it to other instances? Fez team?

Mura

Mura has a configurable web-portal and uses the XForms standard for its web input.

As noted above, Mura is not running properly for us yet.

Automated and manual web-based batch publishing of documents in a variety of standard formats.

Fez is the only package that has a web-triggered batch ingest system, which can import directories full of content. This is a very simple arrangement requiring users to put data in a directory and then trigger the ingest via the web.

Indexing

Support for records indexed using metadata that meets relevant standards such as AGLS.

VITAL and Fez both definitely support multiple metadata standards; I'm not sure how hard it would be, in either, to expose this in a machine-readable way in the web pages they serve.

Support for open archives initiative protocol for metadata harvesting (OAI-PMH).

Yes for all three. VITAL and Mura can be harvested using standard Fedora plugins, including specifying sets of data.

Fez

Do you support sets?

Support for indexing and search of records using the CABI thesaurus as used in the current system.

Indexing and search are no problem even without licensing the thesaurus – if the terms are attached to items in the repository.

VITAL

No support for controlled vocabularies on ingest.

Fez

Supports controlled vocabularies – configuration might be difficult, but there is no reason to think it would not work.

Mura

I am assuming that Mura would be OK here as it uses XForms for ingest and XForms is very configurable, but we are not able to test this yet (and may not get time). There is no example of thesaurus support in the Mura demonstration site.

Content Management

Support for web-based document management, auditing, simple workflow, including research status, publishing rights and ability to edit incorrect content.

See the comments above on web input.

Archive facility to flag data that exceeds a particular age.

Not as far as I can tell for any of them.

Content reporting

Analysis and reporting of traffic using an industry standard tool.

We'll look into it.

Activity reports summarising research documents, projects and research status.

There are stats in all three – they don't make the distinction between generic items and items that might have a particular type (project) or status, so customization may be required.

Application maintenance and support

Application development using open standards and industry supported software.

Yes for Fez and Mura.

VITAL is closed-source, but developers are free to work with Fedora directly in any language using any of the standards it supports.

A sufficiently scalable application able to cope with projected demand.

We can comment on scalability but does AANRO have projected demand?

Meet government standards for web delivered services, i.e. The Guide to Minimum Website Standards www.agimo.gov.au

Any comments from developers would be welcome here – I'll read the guidelines and flag any issues.

Capacity to deliver content for low bandwidth connections to ensure customers can access the system when using low bandwidth.

All the repositories work in pretty low bandwidth already. None of them

Interoperability with other repositories

Interoperability is required to provide the ability to transfer and use information in a uniform and efficient manner across multiple organizations and information technology systems. The primary interoperability is to transfer information from the 30 participating agencies and secondly the future interoperability with other web based systems.

Web Single Sign On

(WSSO) is required to simplify the administrative process of authentication and authorization and in the future to gain access to the resources of other web based systems.

All of the systems under consideration can use the Shibboleth single sign on system. Mura should be particularly good in this regard as it comes from a group who have worked extensively with this technology. (We have not seen single sign on demonstrated with Fez or VITAL.)

Data migration

(Not much to say about this, as it is a matter for an implementer, not part of an evaluation. I will add some comments about helpdesks in the final report.)

Plan and undertake the migration of data from the current AANRO platform and demonstrate a strategy to enable any future migration. Within the context just discussed, it will be the ongoing responsibility of the Successful Respondent to assist in the migration of research content from AANRO associated agency systems to the new database. This will include the integration of exported database content from agencies, migration of the existing InMagic AANRO database, and ongoing harvesting of content from agency websites.

Database management

This is the requirement as written:

Extensive, ongoing content sourcing, harvesting, indexing, publishing and content management by the respondent/vendor.

Application/web development, maintenance, hosting and support

Ongoing development, maintenance, web and database hosting and support.

Software support services

Help desk.

Defect correction.

Muradora out of contention for now

We've been looking at the Mura repository, which now seems to be called Muradora. We're going to stop evaluating it. Here's part of the email I sent to Chi Nguyen, the project lead:

Hi Chi,
I'm still keen to support and encourage Mura development, but we're going to have to concede defeat with our testing of Mura - it's still too raw and too hard to install. We got it running, but have been unable to test the GUI for adding items. (And the demo at your site throws errors when I try to use it to add an item). We just don't have the time to keep playing with it for AANRO. I'll be posting something to the blog soon about this.
What we will do, though, if you can help us out, is report on benchmarks with large volumes of objects, if you can give us a utility to import AANRO's data from raw Fedora objects with MODS datastreams. This is very useful data for the whole community as well as AANRO.
We are also happy to have a quick look at any new version you bring out, as I still think you may have something by the time AANRO are starting implementation, but you'll have to move quickly.
...
Peter

Chi replied with an offer to test AANRO data. We'll pack it up and ship it ASAP and let you know how it all goes. Here's his reply, with permission (minus a comment about another project):

I understand your concerns. I agree our stuff is still too raw esp. the installation and general reliability. We are trying to address that. Also it's a bit unfortunate that for the last month our focus was on the access control model which is the major deliverable for us, and so we didn't have time to focus on your trial. We will get a version out by IDEA which will also serve as our launch. But I'm not sure if it fits in with your timeline. As to the benchmarks, it would be great if we can get a hold of the AANRO data to try to import it into the repository. It will be kept strictly internal to us for testing. We can then report back on how we went.

Chi mentions the IDEA conference in Brisbane, Australia, October 8-11 where they're going to launch Muradora. I'll be there.

And we will be sending through the AANRO data. We'll check back with Muradora as it matures.

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.

Friday, September 14, 2007

VITAL - second go at the first round of testing

(This post is by Peter and Tim)

The first round of repository software testing is focusing mainly on scalability. AANRO has 200,000 records and counting – so there's no point in looking at software that can't handle that. Once we have an idea of which software handles 200,000 records we will begin to look at how AANRO might ingest new records and maintain old ones.

We're all confident that Fedora itself can handle lots of records, it's whether they can be used that's up in the air.

First up was VTLS Vital, version 3.1. We reported on that last week.

Next up was an upgrade to VITAL 3.1.1.

What we did

We took a plain-vanilla installation of VITAL 3.1.1 with no configuration changes at all (i.e. all default indexes and settings) and pointed it at a repository containing first 12979, then a further 17020 records. This is a total of 29999 records/items.

Set up

Machine used: 8GB Centos LINUX with 2000MB RAM

VITAL installation was completed successfully, without any problems.

As we noted in the last post, VITAL 3.1.1 requires that Fedora objects have a certain structure; it uses a thing called an alternate ID to tie together a datastream (such as a PDF file) and metadata about that stream, or its full text or a thumbnail image.

We will report on the process of adding ALT-IDs soon, but first we let VITAL go ahead and index the items. VITAL is the least fussy of the software we're looking at. It will happily work with any-old Fedora repository, while Fez and Mura both require some things to be set up in a particular way before they can work with items.

Ingestion

Ingest of 12979 items commenced at approximately 9:50 am

All 12979 items successfully ingested into Fedora at 10:40 am

All 12979 items successfully indexed into VITAL at 11:20 am

Ingest of a further 17020 items commenced at approximately 11:25 am

All 17020 items successfully ingested into Fedora at 12:25 pm

All 17020 items successfully indexed into VITAL at 1:25 pm
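The timings logged above can be turned into rough throughput figures. The sketch below does that arithmetic and then extrapolates to 200,000 records. Two assumptions to flag loudly: it treats each indexing run as starting the moment the matching Fedora ingest finished (the log does not say when indexing started), and the extrapolation is naively linear, which may well not hold at larger sizes. The start times are also only approximate.

```python
from datetime import datetime


def minutes_between(start, end):
    """Elapsed minutes between two same-day clock times given as HH:MM."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60


# (label, items, start, end) in 24-hour time, taken from the log above.
# Assumption: indexing began when the matching Fedora ingest completed.
STAGES = [
    ("batch 1 Fedora ingest", 12979, "09:50", "10:40"),
    ("batch 1 VITAL index", 12979, "10:40", "11:20"),
    ("batch 2 Fedora ingest", 17020, "11:25", "12:25"),
    ("batch 2 VITAL index", 17020, "12:25", "13:25"),
]

# Items processed per minute for each stage.
rates = {
    label: items / minutes_between(start, end)
    for label, items, start, end in STAGES
}

# Naive linear extrapolation to 200,000 records at the slower indexing rate.
slowest_index_rate = min(rates["batch 1 VITAL index"], rates["batch 2 VITAL index"])
hours_to_index_200k = 200_000 / slowest_index_rate / 60

for label, rate in rates.items():
    print("%s: %.0f items/minute" % (label, rate))
print("Indexing 200,000 records: about %.1f hours" % hours_to_index_200k)
```

On these figures both ingest and indexing run at somewhere between about 260 and 325 items per minute, and a full index of 200,000 records would take a little under 12 hours if nothing slowed down as the repository grew.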

Performance

During 12979 items being ingested – Takes 1 minute to load a Show All page.

With 12979 items ingested – Takes 10 seconds to load Show All page.

During further 17020 items being ingested – Takes 1 minute to load Show All page.

With 29999 items ingested – Takes 10 seconds to load Show All page.

With 29999 items ingested – Search for “pig” took 2.5 seconds to load

With 29999 items ingested – Displaying a page from page range took 15 seconds to load.


Thursday, September 13, 2007

AANRO repository architecture

There's an assumption in the repositories that we have used and evaluated at USQ that the repository is a fairly self contained, monolithic application. Eprints, DSpace, VITAL and Fez are all pretty much like this; they provide end-to-end services for putting stuff in to a repository, managing it and disseminating it to the web, for humans to use. (Don't know enough about Mura to comment at this stage about its architecture)

But while they all provide end-to-end services, all of them are designed to disseminate their contents in a machine-readable way too, making it straightforward to aggregate and dis-aggregate data and have it flow around a network of repositories, via the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting). The website says:

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository interoperability. Data Providers are repositories that expose structured metadata via OAI-PMH. Service Providers then make OAI-PMH service requests to harvest that metadata. OAI-PMH is a set of six verbs or services that are invoked within HTTP.1

If you have a repository then OAI-PMH means it can be harvested by things like the ARROW discovery service2, or the OAIster3 service.
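For readers who have not seen OAI-PMH on the wire, here is a small sketch of how a service provider might form a request. The six verb names are the ones defined by the protocol; the base URL, the set name and the `oai_request` helper itself are made up for the example.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH 2.0.
OAI_VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def oai_request(base_url, verb, **params):
    """Build the URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError("not an OAI-PMH verb: %r" % verb)
    query = {"verb": verb}
    query.update(params)
    return base_url + "?" + urlencode(query)


# Hypothetical endpoint and set name, for illustration only.
BASE = "http://repository.example.org/oai"
url = oai_request(BASE, "ListRecords", metadataPrefix="oai_dc", set="reports")
print(url)
```

Every request is an ordinary HTTP GET (or POST) with a `verb` argument plus whatever arguments that verb takes, which is a large part of why the barrier to entry is so low.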

As I wrote for the RUBRIC website, you should consider how many repositories you might want:

It is by no means a given that a single repository needs to be used for everything. While there are benefits to minimizing the number of different packages required, different types of repository may be best served by different software.
Consider whether different software solutions might be appropriate for:
- An Open Access research repository.
- A thesis repository, which may require embargo features and authorization, for example if theses contain third party copyright material, confidential material or information that could compromise patents.
- Image collections.
- Work in progress.
- A preservation repository, containing records from all of the above but without a public portal.4

One possible architecture for an AANRO repository would be something like the ARROW discovery service, an aggregated view of data that may reside in multiple repositories.

Why would AANRO consider an architecture like this?

Performance! So far we have found that managing 200K records is looking a bit daunting for the two (VITAL and Fez) packages we have tried. Yes, they both have new versions and yes we will re-test but it makes sense to consider a model where there may be a number of regional or subject-based repositories that are of a more manageable size for repository maintainers, with a high-performance discovery portal aggregating them together.
Re-using existing infrastructure. Nerida Hart from the AANRO told me on Tuesday (we were at the Long Lived Collections meeting in Canberra) that some of the agencies involved are using a project management system which already holds some of the material that needs to go into the repository. It may be more appropriate for groups that have such infrastructure to stick with it, and build connectors that can send it off to a central repository.
Meeting the needs of different agencies with different software. The players in AANRO vary in size – some may have the resources to manage their own system and be able to justify having a repository locally, which may extend beyond items that are in-scope for AANRO, while others are very small and may be better off with a simple, minimalist web application where they can deposit items and have them reviewed by AANRO-central. (I'm speculating a bit here – I'm sure AANRO staff can fill us in with specific examples, do any of the partners already have a repository that talks OAI-PMH?)

As part of the discussion about a potential federated architecture, I will write soon about our experiments with Apache Solr, a search-portal service that makes it easy to index large amounts of data and build interfaces.

Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs, hit highlighting, faceted search, caching, replication, and a web administration interface. It runs in a Java servlet container such as Tomcat.5
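To make the "XML/HTTP and JSON APIs" and "faceted search" parts of that description a little more concrete, here is a sketch of what a faceted query to Solr might look like from Python. The request parameters (`q`, `rows`, `wt`, `facet`, `facet.field`) are standard Solr parameters; the host, port, path and the `type` field are assumptions about a hypothetical installation, and the two helper functions are mine.

```python
from urllib.parse import urlencode


def solr_facet_query(select_url, text, facet_field, rows=10):
    """Build a Solr search URL asking for JSON results plus facet counts."""
    params = [
        ("q", text),
        ("rows", rows),
        ("wt", "json"),              # ask for a JSON response
        ("facet", "true"),           # switch faceting on
        ("facet.field", facet_field),
    ]
    return select_url + "?" + urlencode(params)


def facet_pairs(flat_list):
    """Turn Solr's default flat [value, count, value, count, ...] list into a dict."""
    return dict(zip(flat_list[0::2], flat_list[1::2]))


# Assumed location of a local Solr instance and an assumed "type" field.
SOLR_SELECT = "http://localhost:8983/solr/select"
url = solr_facet_query(SOLR_SELECT, "goat", "type")

# An invented fragment shaped like the facet section of a Solr JSON response.
sample_facets = ["Conference paper", 400, "Report", 12]
counts = facet_pairs(sample_facets)
```

Limiting the search to one facet value is then a matter of repeating the query with a filter on that field.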

I was talking to Heather Myers from VTLS (the company behind VITAL) today and I mentioned Solr – and guess what? Their forthcoming product called Visualizer is built on Solr.

I've heard of Solr being used elsewhere in the Australian library scene but need to confirm details.


1 “Open Archives Initiative Protocol for Metadata Harvesting,” http://www.openarchives.org/pmh/ (accessed September 12, 2007)

2 “Australian Research Repositories Online to the World,” http://search.arrow.edu.au/apps/ArrowUI/ (accessed September 12, 2007)

3 “OAIster | Home,” http://www.oaister.org/ (accessed September 12, 2007)

4 “Repository software,” http://rubric.edu.au/repositories/choosingarepository.htm (accessed September 12, 2007)

5 “Welcome to Solr,” http://lucene.apache.org/solr/ (accessed September 12, 2007)

Wednesday, September 12, 2007

Fez - first round of testing

What we did
- Set up
- Fez Data Requirements
Data Ingestion
Performance
Current concerns

(This post is by Peter, Tim and Bron)

The second repository solution we tried out was Fez. Again we focused on the simple issue of “can it handle 200,000 plus items”. If it survives this round then we'll consider ingest processes and workflow.

We tested Fez version 1.3, from the Subversion trunk as of July 16th.

What we did

As with the VITAL test Tim and Bron used a Fedora database with 30,000 or so items in it with MARCXML and MODS datastreams, and no full text. (We're being kind to the software by starting small.)

Set up

Machine used: 8GB Centos LINUX with 1.5GB RAM

Fez installation was completed successfully, but it was made more difficult because we chose to install the software on CentOS, while the Fez Digital Repository Wiki recommends installation on the following platforms: Windows 2003, Windows XP, MacOS and Kubuntu.

As part of data preparation for ingestion of items, marc.xml, mods.xml and dublin_core.xml were created from the basic Excel file using XSL stylesheets. A Python script was then used to construct a FOXML object for each item.

(NOTE: To perform a Fedora ingest using the fedora-ingest command, the server running the Fez/Fedora repository needed to have the Fedora Client software installed.)
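The script we used is not reproduced here, but the general shape of wrapping metadata files in a FOXML object looks something like the sketch below. Treat it as an illustration only: the element and attribute names follow FOXML as we understand it, the PID, label and sample metadata are invented, and the exact set of required properties should be checked against the FOXML schema for the Fedora version in use.

```python
import xml.etree.ElementTree as ET

FOXML_NS = "info:fedora/fedora-system:def/foxml#"
ET.register_namespace("foxml", FOXML_NS)


def _q(tag):
    """Qualify a tag name with the FOXML namespace."""
    return "{%s}%s" % (FOXML_NS, tag)


def build_foxml(pid, label, datastreams):
    """Wrap XML metadata strings in a FOXML digital object (sketch only).

    `datastreams` maps a datastream ID such as "DC" or "MODS" to XML text.
    """
    obj = ET.Element(_q("digitalObject"), {"PID": pid})
    props = ET.SubElement(obj, _q("objectProperties"))
    ET.SubElement(props, _q("property"), {
        "NAME": "info:fedora/fedora-system:def/model#state",
        "VALUE": "Active",
    })
    ET.SubElement(props, _q("property"), {
        "NAME": "info:fedora/fedora-system:def/model#label",
        "VALUE": label,
    })
    for ds_id, xml_text in datastreams.items():
        ds = ET.SubElement(obj, _q("datastream"), {
            "ID": ds_id,
            "STATE": "A",
            "CONTROL_GROUP": "X",   # inline XML content
            "VERSIONABLE": "true",
        })
        version = ET.SubElement(ds, _q("datastreamVersion"), {
            "ID": ds_id + ".0",     # first version of this datastream
            "MIMETYPE": "text/xml",
            "LABEL": ds_id,
        })
        content = ET.SubElement(version, _q("xmlContent"))
        content.append(ET.fromstring(xml_text))
    return ET.tostring(obj, encoding="unicode")


# Invented sample metadata and a hypothetical PID.
SAMPLE_DC = (
    '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Goat nutrition</dc:title></oai_dc:dc>"
)
SAMPLE_MODS = (
    '<mods xmlns="http://www.loc.gov/mods/v3">'
    "<titleInfo><title>Goat nutrition</title></titleInfo></mods>"
)

foxml_text = build_foxml("aanro:1", "Goat nutrition",
                         {"DC": SAMPLE_DC, "MODS": SAMPLE_MODS})
```

In a batch run, something like this would be called once per row of the source data, with each result written to a file for fedora-ingest to pick up.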

Fez Data Requirements

Fez requires the Fedora XML (FOXML) to contain the following values for each datastream within the FOXML.

Datastream ID:

example for Dublin Core

<foxml:datastream ID="DC">

example for MODS

<foxml:datastream ID="MODS">

Datastream Version ID:

example for Dublin Core

<foxml:datastreamVersion ID="DC.0">

example for MODS

<foxml:datastreamVersion ID="MODS.0">
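A check along the following lines could catch objects that don't meet these requirements before they are ingested. This is a hypothetical helper, not something that ships with Fez, and the assumption that every object needs both a DC and a MODS datastream is ours, based on the examples above.

```python
import xml.etree.ElementTree as ET

FOXML_NS = {"foxml": "info:fedora/fedora-system:def/foxml#"}


def datastream_problems(foxml_text, required=("DC", "MODS")):
    """List the ways a FOXML object misses the ID conventions shown above.

    `required` is an assumption: the datastream IDs we expect Fez to want.
    """
    root = ET.fromstring(foxml_text)
    problems = []
    for ds_id in required:
        ds = root.find("foxml:datastream[@ID='%s']" % ds_id, FOXML_NS)
        if ds is None:
            problems.append("no datastream with ID %s" % ds_id)
            continue
        version_ids = [
            v.get("ID") for v in ds.findall("foxml:datastreamVersion", FOXML_NS)
        ]
        if ds_id + ".0" not in version_ids:
            problems.append("datastream %s has no version with ID %s.0" % (ds_id, ds_id))
    return problems


# Invented object: DC is fine, MODS has an unexpected version ID.
SAMPLE = """<foxml:digitalObject PID="aanro:1"
    xmlns:foxml="info:fedora/fedora-system:def/foxml#">
  <foxml:datastream ID="DC">
    <foxml:datastreamVersion ID="DC.0"/>
  </foxml:datastream>
  <foxml:datastream ID="MODS">
    <foxml:datastreamVersion ID="MODS1"/>
  </foxml:datastream>
</foxml:digitalObject>"""

problems = datastream_problems(SAMPLE)
```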

MODS Subject:

Library of Congress creates MODS subject in the following way

<mods:subject><topic>QLD</topic><topic>NSW</topic></mods:subject>

Fez expects MODS subject in the following format

<mods:subject><topic>QLD</topic></mods:subject><mods:subject><topic>NSW</topic></mods:subject>
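Converting from the first form to the second is easy to script. The sketch below is one way to do it, not the script we actually used. It assumes the topic elements are in the MODS namespace along with the subject, and it deliberately leaves alone any subject that contains anything other than topics rather than guess how to split it.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

SUBJECT = "{%s}subject" % MODS_NS
TOPIC = "{%s}topic" % MODS_NS


def split_subjects(mods_root):
    """Rewrite multi-topic subjects as one subject element per topic, in place."""
    new_children = []
    for child in list(mods_root):
        topics = child.findall(TOPIC) if child.tag == SUBJECT else []
        # Leave single-topic subjects, mixed-content subjects and
        # non-subject elements exactly as they are.
        if len(topics) <= 1 or len(topics) != len(child):
            new_children.append(child)
            continue
        for topic in topics:
            subject = ET.Element(SUBJECT, dict(child.attrib))
            subject.append(topic)
            new_children.append(subject)
    for child in list(mods_root):
        mods_root.remove(child)
    mods_root.extend(new_children)
    return mods_root


doc = ET.fromstring(
    '<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">'
    "<mods:subject><mods:topic>QLD</mods:topic><mods:topic>NSW</mods:topic></mods:subject>"
    "</mods:mods>"
)
split_subjects(doc)
print(ET.tostring(doc, encoding="unicode"))
```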

MODS Namespace:

The following namespace was suggested by Fez developers.

<mods:mods xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema">

We found that putting <mods:mods> caused conflict in the xml editor so we are using <mods> for eg

Data Ingestion

Items were ingested into Fedora easily. All would have been completed within an hour.

Fez is then required to index the items residing in Fedora. This process is initiated using a browser and took an extremely long time (ie at least overnight). The browser hung repeatedly. The indexing has been completely rewritten for Fez version 2.

Performance

30999 items ingested

Show All items took more than 3 minutes to load

Search for “pig” took 5 seconds to load

Performance should improve with the new release, but how much remains to be seen.

Current concerns

There are some concerns with Fez as the basis for an AANRO repository.

Indexing looks like it will be too slow using the current version, added to which the server is unusable while indexing takes place. If this were to remain the case then Fez would be out of the running, but we're told that a new Fez release, due within a few days, will improve performance dramatically, and will be usable while indexing takes place.
Fez has no support for handles as persistent identifiers, a topic we have yet to discuss with the AANRO team. There may be software coming out of PILIN that can help, and this may not be an issue if AANRO can live without handles.
The configuration for adding items is very, very complex. If we stick to using the same metadata schema as the University of Queensland, then this may not be too much of a problem. We do need to check that configuration changes can be exported from the database in which they are stored and kept under version control.


Thursday, September 6, 2007

VITAL - first round of testing

[Update – Heather Myers of VTLS has pointed out that we started out testing on a machine that had too little RAM. She also expressed concerns about using virtual machines, but there's nothing we can do about that in the short term; we will await advice from VTLS about why virtual machines are a problem. This was a stupid mistake on our part and I apologize to VTLS for publishing misleading results. I (Peter) am going to remove the errant results, even though it's against my normal blogging ethics, and it's a useful data point that emphasises that you should not try to under-spec a machine to run VITAL. We were going to retest soon anyway with an updated version of the software so we'll retry with a better specified machine.]

(This post is by Peter Sefton and Bron Dye)

We're all confident that Fedora itself can handle lots of records, it's whether they can be used that's in question.

First up was VTLS Vital, version 3.1.

What we did

Tim and Bron, the technical officers who are working 50/50 on this project, took the data from the AANRO database and went through a simple-but-tedious process to get it into Fedora in the MODS metadata standard. There's a post coming about how they did this which we'll try to make not as tedious as the process that it describes.

The result was a Fedora database with 30,000 or so items in it with MARCXML and MODS datastreams, and no full text. We're being kind to the software by starting small.

We took a plain-vanilla installation of VITAL with no configuration changes at all (ie all default indexes and settings) and pointed it at a repository containing first 10K, then 20K then 30K records.

Set up

Machine used: 8GB Centos LINUX with 512MB RAM

VITAL installation was completed successfully, without any problems.

As part of data preparation for ingestion of items, Tim and Bron created marc.xml, mods.xml and dublin_core.xml from the basic Excel file using XSL stylesheets. They then used a Python script to construct a FOXML (Fedora) object for each item.

Ingestion

[This part has been removed as we tested with too little RAM.]

Performance

[two lines removed]

Following the increase of RAM to 1.5GB:

Show All page took 45 seconds to load

Search for “pig” took 35 seconds to load

Display a page from page range took 37 seconds to load.

Summary

At present, the AANRO data does not have full-text versions of articles with it, which is lucky, because the current release is unable to take a Fedora repository and create a full-text index or add preservation related metadata streams automatically.

(This limitation includes existing VITAL version 2 repositories, meaning that there is no automated upgrade path for existing customers. An upgrade script has been promised and a USQ staff member has seen screenshots today, but so far we can't confirm that it worked.

[Update: we can now confirm that there is an upgrade path, but at time of writing this was correct])

After this experience, the testing with Fez that followed was conducted on a virtual machine with 1.5GB of RAM from the start. We'll re-run the VITAL tests with the same configuration and the latest available software and post a comparison.

Current concerns

There are some concerns with VITAL as the basis for an AANRO repository.

- Indexing looks like it will be too slow – if you needed to recover a repository then on current indications it could take days to re-index. To be fair, we'll re-run our indexing test using more RAM on the server, and remove unwanted indexing code. The vendor has just notified us that there is a version 3.1.1 that will fix some of the indexing / searching problems.
- At present our understanding is that VITAL can only be used for OAI harvesting if handles are being used for identifiers. We have not discussed using handles for AANRO but if they wanted to go live without them, then this would be a problem.
- Getting items into VITAL remains a major problem. There are two options for user input in the current version, both of which are less than ideal:
  1. VALET is a simple web application for entering metadata and uploading data which may suit AANRO's needs, except for one very major flaw – it is not connected to Fedora, so it can be used for creating new records, but not editing existing ones.
  2. VITAL has a limited management interface via the web and a Microsoft Windows client, both of which require users to edit XML using an external editor. This is not going to suit AANRO's data entry staff.
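
On the OAI harvesting concern above: from the harvester's side, OAI-PMH is just a set of HTTP requests carrying a `verb` parameter, and every record that comes back carries an identifier, which is where the handle question arises. The sketch below builds a standard `ListRecords` request URL. The function name and the base URL in the usage note are placeholders, and nothing here is specific to VITAL.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository.

    base_url is the repository's OAI endpoint (hypothetical in this sketch);
    set_spec optionally restricts the harvest to one set.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"
```

For example, `list_records_url("http://example.org/oai")` asks for all records in unqualified Dublin Core, the one metadata format every OAI-PMH repository is required to support.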

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.

Wednesday, September 5, 2007

Which repositories?

So far all I've let on in this blog is that we're evaluating Fedora-based repository software applications, but I didn't say which ones.

We're testing three.

- VTLS VITAL and VALET, known in Australia as the ARROW solution.
  (VALET is free software, or at least some versions of it are, while VITAL is commercial software.)
- Fez, the free Fedora-based IR solution from the University of Queensland.
  (Fez was evaluated by RUBRIC in 2005 and 2006, at which time it was not deemed suitable for production unless there was a developer available during the deployment. It is now more mature.)
- Mura, a new Fedora front-end written by a group at Macquarie University.

So far we have looked at the portal side of VITAL and Fez – and we'll be starting on Mura today.

Plus, any software that becomes available from NSDL or the Colorado Alliance of Research Libraries. (More on what I learned from talking to technical staff at both of those institutions later).


Monday, September 3, 2007

Welcome

Welcome to the AANRO repository evaluation blog.

AANRO stands for Australian Agriculture and Natural Resources Online.

The University of Southern Queensland has been contracted by Land and Water Australia to undertake an evaluation of front-end software for the Fedora repository back-end. This project blog will be written by some of the project staff, who also work on the RUBRIC project. This project is not part of RUBRIC.

I'm Peter Sefton, and I'll be taking a high-level strategic view of the process. Tim McCallum, Bron Dye and Caroline Drury will also contribute posts about the evaluation process.

Here's a cut-n-paste from a recent job ad that explains some of the context:

Land & Water Australia is a Statutory Corporation established under the Primary Industries and Energy Research and Development Act of 1989. Its primary focus is to organise and fund research and development (R&D) activities with a view to improving the long term productive capacity, sustainable use, management and conservation of Australia's land, water and vegetation resources.
While its key role is to organise and fund research, the Corporation also has a major objective to ensure that the knowledge base is applied to improve management practices and natural resource management policy. Land & Water Australia objectives, in keeping with its mission and charter, are outlined in its five year Research and Development Plan.
The AANRO research database of information for agriculture and natural resource management is jointly funded by some 30 participating research organisations through the Primary Industries Standing Committee (PISC), the Natural Resource Management Standing Committee (NRMSC) and the Rural Research and Development Corporations (RDC's).
AANRO is currently being redeveloped from a bibliographic database running on an Inmagic IT platform into a full text subject digital repository of research material in the agriculture and natural resource management area. The new AANRO will be running on open source software similar to that used in the ARROW and RUBRIC projects. For further information on the existing AANRO database see http://www.aanro.net (please note the database is currently in a state of suspension until the end of July 2007).

To keep track of these posts you can subscribe to this blog using a feed reader, but you may also like to check out the del.icio.us bookmarking system where we'll use the tag aanrorepos to tag items of interest to the project.

This project is unusually open. The agreement is that we will blog as we go, and the final report will be open access, available under a Creative Commons license. We've been going for a couple of weeks now, so there's a backlog of stuff to report.


