Wednesday, September 19, 2007

AANRO Requirements summary



About this document

[UPDATE: Changed the format of the document so it renders better in the blog and fixed a few typos.]

This document contains a summary of requirements for the AANRO repository, which originally appeared in a Request For Quote from 2006. One of the deliverables on this project is an annotated version of the RFQ, so I'm starting on that now, and I'm doing it in public so others can contribute and correct me. (I may update this post with change-tracking turned on.)

Mathew Silver of Land and Water prepared the original summary of requirements document for me early this year [update: Mat only passed it on] and we used it to compare the requirements with the then-current VITAL 2.1. At that stage VITAL definitely would not have met requirements: it did not have a web interface to edit Fedora objects, and the index and search performance was too slow. We have seen a marked improvement in performance with a pre-release version of 3.1.1, but it still lacks an easy-to-use web interface for modifying objects stored in Fedora.

I'm putting up this list with some comments and questions, inviting responses from the developers involved. If you can clarify anything or I've made an error, use the comments below or drop me an email.

Search Facility

Rapid and comprehensive access to records of research in-progress, finalised research reports, formal publications, journal/conference proceedings and website content

This is what we're testing now. The jury is still out on how rapid the various solutions will be. VITAL 3.1.1 is a definite improvement over 3.1, and Fez 2 should be better than Fez 1.3. Mura is proving difficult for us to install.

Basic and advanced search operations including thesaurus keywords, metadata, organisation, geographic locality and other supported metadata.

I think that all the products will be flexible enough to handle this requirement, but all of them have different configuration systems.

Thesaurus support is an issue, see comments below on adding items to the repository.

Capability to externally link to, and store in a central database, a variety of common file formats including HTML, PDF, word, excel, delimited text, tagged and XML content.

All the repositories can in some sense store all these things.

HTML is not well supported by VITAL or Fez. How about Mura?

(2007-09-19: I tried adding an HTML object to the version of Mura hosted by the developers. The metadata input screens are nice and clean, but you really need to switch to advanced mode. For example, in the simple input there is a field for 'author' but no way to separate given and family name. When I submitted the item the application threw an exception.)

(USQ is running a project called ICE-RS which is looking at how to get research publications into repositories in HTML and PDF format. By the end of 2007 we should have proof-of-concept implementations, but it's not a given that this would be any use to AANRO.)

Support for the development of applications to search special data-collections.

In one sense this is supported by default because all the solutions use Fedora, but it's not really a feature of any of the software yet. Mura is the most standards compliant of the software in the sense that it uses native Fedora services for as much as possible. So if you configure access control or indexing then you can use other software.

I know that Fez plans to add support for special-purpose portals. This will be at the level of the Fez application, so you will need to use a Fez API to leverage any Fez access controls.

VTLS: would you like to share your plans in this area? (I have talked to VTLS and the ARROW development group about this issue but I'm not sure if they want that made public yet.)

Support for the existing custom developed applications and sites using subsets of AANRO data.

This raises questions for AANRO. What sites are now consuming the data? Are there any sites that duplicate the data? Are there opportunities for federation, i.e. participating organizations who have or would like to have their own repository that feeds into the central AANRO registry? (It sounds like there are from the requirements, but we need more detail about the participating organizations.)

Search results

A consistent, standardised and user-friendly website and search tool.

We will consider friendliness after we look at performance but I think all the packages will be OK here, subject to meeting the next few requirements.

Customisable, sorted by relevance, easy-to-read search results that include links to other relevant research.

All the packages are customizable. We'll share our impressions as we go.

VITAL

Uses XSLT, which is easy to use, but I suspect it may be one cause of the slowish performance we have seen in the past.

Relevance ranking: Not sure about this one. VTLS?

VITAL allows 'faceted search': for any given result set from a search, it shows how many matches the set contains against a number of indexes, and you can then narrow or 'limit' your search. Say I search the whole repository for the word goat and it tells me there are 400 items of type Conference paper. I can click on Conference paper to limit the search to the 400 conference paper items that contain goat.
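The counting and limiting steps in the goat example can be sketched in a few lines of Python. This is only an illustration of the idea, not how VITAL implements it; the records, the field name `type` and the helper functions are all invented for the purpose.

```python
from collections import Counter

# Hypothetical records; titles, field names and values are invented.
RECORDS = [
    {"title": "Goat nutrition in arid zones", "type": "Conference paper"},
    {"title": "Sheep and goat parasite control", "type": "Journal article"},
    {"title": "Goat fibre quality", "type": "Conference paper"},
    {"title": "Wheat rust survey", "type": "Report"},
]


def search(records, term):
    """Return records whose title contains the term (case-insensitive)."""
    term = term.lower()
    return [r for r in records if term in r["title"].lower()]


def facet_counts(results, field):
    """Count how many results fall under each value of the given field."""
    return Counter(r[field] for r in results)


def limit(results, field, value):
    """Narrow a result set to the records matching one facet value."""
    return [r for r in results if r[field] == value]


hits = search(RECORDS, "goat")          # the initial search
counts = facet_counts(hits, "type")     # what the facet sidebar would show
narrowed = limit(hits, "type", "Conference paper")  # after clicking a facet
```

The point of the design is that the facet counts are computed over the current result set, so the number shown beside a facet is exactly the number of items you get after clicking it.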

I mentioned in a previous post that you can also do this with Apache Solr, with only a very small amount of software on top of it; we will try to demonstrate this with AANRO data in the next couple of weeks.
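As a rough sketch of what a faceted query against Solr looks like, the request is just a URL carrying the standard `facet`, `facet.field` and `fq` (filter query) parameters. The host, path and the field names `type` and `organisation` below are assumptions for illustration; they would depend on how the AANRO index was actually set up.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical Solr location; adjust to wherever the index actually lives.
SOLR_SELECT = "http://localhost:8983/solr/select"


def facet_query_url(term, facet_fields, limits=None):
    """Build a Solr select URL that asks for facet counts.

    facet_fields: index fields to count matches against (assumed names).
    limits: optional {field: value} pairs applied as filter queries,
            which is how a clicked facet narrows the search.
    """
    params = [("q", term), ("facet", "true")]
    params += [("facet.field", f) for f in facet_fields]
    for field, value in (limits or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return SOLR_SELECT + "?" + urlencode(params)


url = facet_query_url(
    "goat", ["type", "organisation"], {"type": "Conference paper"}
)
# Decode the query string again to check what was sent.
decoded = parse_qs(urlsplit(url).query)
```

The 'small amount of software on top' is then mostly a matter of rendering the facet counts in the response as links that add another `fq` parameter.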

Fez

Uses the full text engine that comes with the database. Not sure about how relevance is calculated but you can sort results by relevance so it's there.

Fez doesn't seem to support faceted searching or limits (based on my experiments with the University of Queensland eSpace server) but when you look at an item it will let you search for related items by clicking on the subject. Faceted search would be a nice-to-have.

Fez has search-completion, which will suggest search terms as you type. Users are coming to expect this because Google does it.

Mura

Search is quite rudimentary at present. There are no facets, and no hyperlinks to search for related items.

Feedback mechanism on unsuccessful searches.

We'll check.

Content harvesting and publishing

Use of high levels of automation to identify, capture, harvest, reformat and index new content from a variety of sources and agencies.

I think it's safe to say that currently none of our candidate systems do this. They all publish content via OAI-PMH but none of them harvest content automatically. This is one area where there might need to be custom development, after careful requirements gathering. It's hard to comment at this time without a lot more detail about the sources and agencies.

An alternative architecture where the AANRO portal used something like Apache Solr with an OAI-PMH harvester bolted on might meet this requirement. Content from agencies could either be indexed directly by a portal (via a feed of some kind) or stored in a central Fedora repository and then indexed. We'll have much more detail on alternative architectures by the end of the project.
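To give a feel for how small the harvesting side could be, here is a minimal sketch of the two pieces an OAI-PMH harvester needs: building a `ListRecords` request and pulling identifiers and the resumption token out of a response page. The verb, argument names and XML namespace come from the OAI-PMH 2.0 protocol; the base URL, set name and sample response are made up, and a real harvester would also need to fetch the pages, handle errors and pass the metadata on to the indexer.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if token:
        # In the protocol, resumptionToken is an exclusive argument.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec  # selective harvesting by set
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    ids = [
        e.text
        for e in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Invented sample response, trimmed to the elements the parser reads.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:agency.example.org:1</identifier></header></record>
    <record><header><identifier>oai:agency.example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url("http://agency.example.org/oai", set_spec="reports")
ids, token = parse_page(SAMPLE)
next_url = list_records_url("http://agency.example.org/oai", token=token)
```

A harvester would repeat the request with each new token until a page comes back without one, then hand the collected records to the index.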

Individual web-based publishing i.e. input of research information by participants.

VITAL

As noted previously, VITAL does not have a fully-integrated web input system. It is not suitable for AANRO's requirement that non-specialist staff from small agencies (at least) can use a web form to add and update repository content.

So, while the VITAL portal has some nice features, it is severely lacking on the input workflow side. It seems likely that an AANRO solution would require additional development for web-based input.

Fez

Fez has a customisable web-based input system with workflow that is highly configurable. We are confident that this would be usable by agencies contributing to a central database.

I have previously expressed concern with some aspects of the way the web-input system in Fez works. It uses a very complicated web interface to map elements in an XML schema to web forms and stores the results in a database. Quite a lot of the traffic on the Fez mailing list is about people having problems configuring this interface.

Another concern is with the way that configuration may be versioned and moved between test and production instances. Is it possible to export configuration, keep it in Subversion or the like and then deploy it to other instances? Fez team?

Mura

Mura has a configurable web-portal and uses the XForms standard for its web input.

As noted above, Mura is not running properly for us yet.

Automated and manual web-based batch publishing of documents in a variety of standard formats.

Fez is the only package that has a web-triggered batch ingest system, which can import directories full of content. This is a very simple arrangement requiring users to put data in a directory and then trigger the ingest via the web.

Indexing

Support for records indexed using metadata that meets relevant standards such as AGLS.

VITAL and Fez both definitely support multiple metadata standards. I'm not sure how hard it would be in either to expose this in a machine-readable way in the web pages it serves.

Support for open archives initiative protocol for metadata harvesting (OAI-PMH).

Yes for all three. VITAL and Mura can be harvested using standard Fedora plugins, including specifying sets of data.

Fez

Do you support sets?

Support for indexing and search of records using the CABI thesaurus as used in the current system.

Indexing and search are no problem, even without licensing the thesaurus, if the terms are attached to items in the repository.

VITAL

No support for controlled vocabularies on ingest.

Fez

Supports controlled vocabularies. Configuration might be difficult, but there is no reason to think it would not work.

Mura

I am assuming that Mura would be OK here as it uses XForms for ingest and XForms is very configurable, but we are not able to test this yet (and may not get time). There is no example of thesaurus support in the Mura demonstration site.

Content Management

Support for web-based document management, auditing, simple workflow, including research status, publishing rights and ability to edit incorrect content.

See the comments above on web input.

Archive facility to flag data that exceeds a particular age.

Not as far as I can tell for any of them.

Content reporting

Analysis and reporting of traffic using an industry standard tool.

We'll look into it.

Activity reports summarising research documents, projects and research status.

There are stats in all three, but they don't make the distinction between generic items and items that might have a particular type (project) or status, so customization may be required.

Application maintenance and support

Application development using open standards and industry supported software.

Yes for Fez and Mura.

VITAL is closed-source but developers are free to work with Fedora directly in any language using any of the standards it supports.

A sufficiently scalable application able to cope with projected demand.

We can comment on scalability but does AANRO have projected demand?

Meet government standards for web delivered services, i.e. The Guide to Minimum Website Standards www.agimo.gov.au

Any comments from developers would be welcome here. I'll read the guidelines and flag any issues.

Capacity to deliver content for low bandwidth connections to ensure customers can access the system when using low bandwidth.

All the repositories work in pretty low bandwidth already. None of them

Interoperability with other repositories

Interoperability is required to provide the ability to transfer and use information in a uniform and efficient manner across multiple organizations and information technology systems. The primary interoperability is to transfer information from the 30 participating agencies and secondly the future interoperability with other web based systems.

Web Single Sign On

(WSSO) is required to simplify the administrative process of authentication and authorization and in the future to gain access to the resources of other web based systems.

All of the systems under consideration can use the Shibboleth single sign on system. Mura should be particularly good in this regard as it comes from a group who have worked extensively with this technology. (We have not seen single sign on demonstrated with Fez or VITAL.)

Data migration

(Not much to say about this, as it is a matter for an implementer, not part of an evaluation. I will add some comments about helpdesks in the final report.)

Plan and undertake the migration of data from the current AANRO platform and demonstrate a strategy to enable any future migration. Within the context just discussed, it will be the ongoing responsibility of the Successful Respondent to assist in the migration of research content from AANRO associated agency systems to the new database. This will include the integration of exported database content from agencies, migration of the existing InMagic AANRO database, and ongoing harvesting of content from agency websites.

Database management

This is the requirement as written:

Extensive, ongoing content sourcing, harvesting, indexing, publishing and content management by the respondent/vendor.

Application/web development, maintenance, hosting and support

Ongoing development, maintenance, web and database hosting and support.

Software support services

Help desk.

Defect correction.


