Tuesday, October 30, 2007

AANRO Requirements summary - Fez

About this document

This document contains a summary of requirements for the AANRO repository with notes on how Fez 2.0 will address them. As usual, comments are welcome.

At the time of writing, the AANRO data is in the Fez demo site (it may not remain there).

For a more stable Fez-driven site which will persist, see UQ's eSpace repository.

USQ has a Fez demonstration site into which we will endeavour to put a complete copy of the latest AANRO data.

Search Facility

Rapid and comprehensive access to records of research in-progress, finalised research reports, formal publications, journal/conference proceedings and website content

In general Fez deals with this really nicely. By simply transforming AANRO data to MODS and importing it into Fez, the AANRO data were immediately browsable. Further configuration is required, but 'out of the box' works.
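To give a feel for what 'transforming to MODS' involves, here is a minimal sketch in Python. The input field names (`title`, `authors`, `keywords`) and the sample record are hypothetical and are not taken from the AANRO export; only the MODS namespace and the element names (`titleInfo`, `name`, `subject`) come from the MODS schema. A real conversion would need to cover many more fields.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def record_to_mods(record):
    """Build a minimal MODS element from a dict (hypothetical field names)."""
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo")
    ET.SubElement(title_info, f"{{{MODS_NS}}}title").text = record["title"]
    # One <name> element per author
    for author in record.get("authors", []):
        name = ET.SubElement(mods, f"{{{MODS_NS}}}name")
        ET.SubElement(name, f"{{{MODS_NS}}}namePart").text = author
    # One <subject>/<topic> per thesaurus keyword
    for keyword in record.get("keywords", []):
        subject = ET.SubElement(mods, f"{{{MODS_NS}}}subject")
        ET.SubElement(subject, f"{{{MODS_NS}}}topic").text = keyword
    return mods


# Made-up sample record, for illustration only
sample = {
    "title": "Soil salinity trial",
    "authors": ["Smith, J."],
    "keywords": ["salinity", "soil"],
}
mods_xml = ET.tostring(record_to_mods(sample), encoding="unicode")
print(mods_xml)
```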

Depending on what is meant by 'website content' this could be more of an issue. Some minor changes to Fez would need to be made for it to be able to serve HTML with stylesheets and images.

Basic and advanced search operations including thesaurus keywords, metadata, organisation, geographic locality and other supported metadata.

Fez can be configured to use multiple metadata schemas so locality could be added.

Capability to externally link to, and store in a central database, a variety of common file formats including HTML, PDF, word, excel, delimited text, tagged and XML content.

Yes to all of these, although HTML upload is sub-optimal: Fez can display an HTML page but not its images, unless the user goes to a lot of trouble to change the HTML page so it can reference Fez datastreams.

A change would need to be made to Fez so that it can serve HTML with images, ideally by changing the URLs used by Fez to be path-like, so that relative URLs would work.

If AANRO requires HTML items in the repository then the budget will need to include making these changes.
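The point about path-like URLs can be illustrated with Python's standard URL resolution, which follows the same rules a browser uses. Both URLs below are hypothetical and are not the actual Fez URL scheme; the sketch only shows why a relative image reference breaks when the page is served from a query-string URL and works when it is served from a path-like one.

```python
from urllib.parse import urljoin

# Hypothetical URLs for the same HTML datastream (not real Fez URLs)
query_style = "http://repo.example.org/view.php?pid=item-123&ds=index.html"
path_style = "http://repo.example.org/view/item-123/index.html"

# A relative reference as it might appear in an uploaded HTML page
relative_image = "images/figure1.png"

query_resolved = urljoin(query_style, relative_image)
path_resolved = urljoin(path_style, relative_image)

print(query_resolved)  # http://repo.example.org/images/figure1.png
print(path_resolved)   # http://repo.example.org/view/item-123/images/figure1.png
```

With the query-string form the item identifier is lost when the browser resolves the image, so the server cannot tell which item the image belongs to. With the path-like form the identifier survives in the resolved URL.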

Support for the development of applications to search special data-collections.

In one sense this is supported by default because Fez uses Fedora, but Fez itself does not support development of special-purpose portals, except in as much as it is open source and extensible. Fez does not use standard Fedora mechanisms for indexing or security, so any work done in these areas needs to be duplicated in a special-purpose application. I know that the Fez team plans to add support for special-purpose portals. This will be at the level of the Fez application, so you will need to use a Fez API to leverage any Fez access controls.

In the final report we discuss the option of building a federated AANRO service which would harvest items from a number of repositories via OAI-PMH. Such a harvesting service could provide 'filtered' views of data from a number of repositories either through a single or multiple portals.

Support for the existing custom developed applications and sites using subsets of AANRO data.

This raises questions for AANRO: what sites are now consuming the data? Are there any sites that duplicate the data? Are there opportunities for federation, i.e. participating organizations who have, or would like to have, their own repository that feeds into the central AANRO registry? This needs to be covered in implementation.

Search results

A consistent, standardised and user-friendly website and search tool.

Yes. Fez has a lot of nice touches including suggestions when searching (as you type it offers to complete what you're typing). More detail on functionality is included in the discussion below.

Customisable, sorted by relevance, easy-to-read search results that include links to other relevant research.

Fez uses the full-text engine that comes with the MySQL database, including its relevance ranking.

Fez doesn't yet support faceted searching, but when you look at an item it will let you search for related items by clicking on the subject or authors. Faceted search would be a nice-to-have.

Feedback mechanism on unsuccessful searches.

Fez returns a simple message:

Search Results (All Fields:"xsds", Status:"Published"): (0 results found)

Christiaan Kortekaas tells me that there is a new feature coming which will use a thesaurus to retry unsuccessful searches:

If you search for 'happy' and get no results it will say: similar terms "cheerful (3 matches), delirious (2 matches)"

Or if it's misspelled, like 'hapyy', it will say: did you mean 'happy' (5 matches)?
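The behaviour described above could be sketched roughly as follows. This is not how Fez implements it (the feature was not yet released at the time of writing); it is an illustration that uses Python's standard `difflib` module for the spelling suggestion. The `suggest` function, the thesaurus dictionary and the match counts are all hypothetical.

```python
import difflib


def suggest(term, index_counts, thesaurus):
    """Return a feedback message for a search term.

    index_counts maps indexed terms to their number of matching records;
    thesaurus maps a term to a list of related terms. Both are hypothetical.
    """
    hits = index_counts.get(term, 0)
    if hits:
        return f"{hits} results found"

    # First fallback: related thesaurus terms that do have matches
    related = [(t, index_counts[t]) for t in thesaurus.get(term, [])
               if index_counts.get(t, 0) > 0]
    if related:
        return "similar terms: " + ", ".join(
            f"{t} ({n} matches)" for t, n in related)

    # Second fallback: closest spelling among indexed terms with matches
    candidates = [t for t, n in index_counts.items() if n > 0]
    close = difflib.get_close_matches(term, candidates, n=1)
    if close:
        return f"did you mean '{close[0]}' ({index_counts[close[0]]} matches)?"

    return "0 results found"


print(suggest("happy", {"cheerful": 3, "delirious": 2},
              {"happy": ["cheerful", "delirious"]}))
print(suggest("hapyy", {"happy": 5}, {}))
```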

Content harvesting and publishing

Use of high levels of automation to identify, capture, harvest, reformat and index new content from a variety of sources and agencies.

Fez can publish content via OAI-PMH but currently does not harvest. This is one area where there might need to be custom development, after careful requirements gathering; it is hard to comment at this time without a lot more detail about the sources and agencies.

An alternative architecture where the AANRO portal used something like Apache Solr with an OAI-PMH harvester bolted on might meet this requirement. Content from agencies could either be indexed directly by a portal (via a feed of some kind) or stored in a central Fedora repository and then indexed. We'll have much more detail on alternative architectures by the end of the project.

(See a demonstration site that may not persist for long after November 2007. This site has a harvesting component that fetches data from a number of sources via OAI-PMH).
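For readers unfamiliar with OAI-PMH, here is a minimal sketch of the harvesting side in Python. The `ListRecords` verb, the `oai_dc` metadata prefix and the XML namespaces come from the OAI-PMH 2.0 and Dublin Core specifications; the base URL, the function names and the sample response are made up for illustration. The sketch parses an embedded sample rather than fetching over the network, and it ignores resumption tokens and error responses, which a real harvester would need to handle.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def parse_records(xml_text):
    """Extract identifier, titles and subjects from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier",
                                       namespaces=NS),
            "titles": [e.text for e in rec.iterfind(".//dc:title", NS)],
            "subjects": [e.text for e in rec.iterfind(".//dc:subject", NS)],
        })
    return records


# Made-up sample response, trimmed to the parts the parser uses
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.org:item-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Soil salinity trial</dc:title>
          <dc:subject>salinity</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

url = list_records_url("http://repo.example.org/oai")
harvested = parse_records(SAMPLE)
# A 'filtered' portal view is then just a selection over harvested records
salinity_view = [r for r in harvested if "salinity" in r["subjects"]]
print(url)
print(harvested)
```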

Individual web-based publishing i.e. input of research information by participants.

Fez has a customisable web-based input system with workflow that is highly configurable. We are confident that this would be usable by agencies contributing to a central database.

I have previously expressed concern with some aspects of the way the web-input system in Fez works. It uses a very complicated web interface to map elements in an XML schema to web forms and stores the results in a database. Quite a lot of the traffic on the Fez mailing list is about people having problems configuring this interface. We recommend budgeting at least two weeks of developer time to ensure that mappings can be constructed for AANRO data.

Another concern is with the way that configuration may be versioned and moved between test and production instances. Fez does have a method for exporting schema mappings from one server and importing them into another. Our recommendation is to set up procedures where mappings are developed in a test environment and deployed to production via a version control system. The production system should be locked down so only developers can change mappings and other configuration (this feature is coming).

Automated and manual web-based batch publishing of documents in a variety of standard formats.

Fez has a web-triggered batch ingest system, which can import directories full of content, in any format. This is a very simple arrangement requiring users to put data in a directory and then trigger the ingest via the web.

Indexing

Support for records indexed using metadata that meets relevant standards such as AGLS.

Fez can support multiple metadata standards and indexing is highly configurable. See the AANRO metadata guidelines document (forthcoming) for a discussion of a number of metadata schemas and thesauri.

Support for open archives initiative protocol for metadata harvesting (OAI-PMH).

An OAI-PMH feed (but not harvesting) is supported by Fez, and custom feeds can be set up by changing PHP code and Smarty templates.

Support for indexing and search of records using the CABI thesaurus as used in the current system.

Indexing and search are no problem, even without licensing the thesaurus, if the terms are attached to items in the repository. The Fez ingest system can be configured to use different thesauri; this would need to be done by developers.

Content Management

Support for web-based document management, auditing, simple workflow, including research status, publishing rights and ability to edit incorrect content.

See the comments above on web input.

Archive facility to flag data that exceeds a particular age.

This could be done in Fez using a customized workflow that performs a search and then provides an interface to act on the results.
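The search step of such a workflow amounts to selecting items older than a threshold. A minimal sketch of that selection in Python follows; it is not Fez code (Fez is written in PHP), and the record structure, field names and function names are hypothetical.

```python
from datetime import date


def age_in_years(published, today):
    """Whole years elapsed between two dates."""
    years = today.year - published.year
    # Subtract one if this year's anniversary has not been reached yet
    if (today.month, today.day) < (published.month, published.day):
        years -= 1
    return years


def flag_for_archive(records, max_age_years, today):
    """Return the records that are at least max_age_years old."""
    return [r for r in records
            if age_in_years(r["date"], today) >= max_age_years]


# Made-up sample records; the field names are hypothetical
records = [
    {"pid": "item-1", "date": date(1995, 5, 1)},
    {"pid": "item-2", "date": date(2005, 3, 1)},
]
flagged = flag_for_archive(records, max_age_years=10,
                           today=date(2007, 10, 30))
print([r["pid"] for r in flagged])  # ['item-1']
```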

Content reporting

Analysis and reporting of traffic using an industry standard tool.

Fez is part of a collaborative, standards-based effort to build a benchmark statistics service (BEST). From the wiki:

The focus of this initiative is on designing a Repository Statistics Service that will enhance the type and quality of statistical information about collections (and items) and usage statistics in repositories. The problem to be solved here relates to the strategic need for better, standardised, statistical information about the repository holdings and usage in order to inform a wide range of policy and funding decisions within the overall scholarly communications cycle. More specifically, BEST will design a federated repository statistics service that will facilitate the automated collection and standard analysis of statistical information about the collections and usage of APSR Partner repositories.

Methods and Approaches

  • Adopt and adapt the findings of relevant international projects

  • Scope open-source statistics and web-metric solutions

  • With cooperation of APSR partners and other interested groups propose and trial a 'benchmark set' of statistical measures and filtering approaches appropriate to research repositories.

  • Using a service-oriented approach, design a service for collecting, aggregating, analysing and presenting this statistical information related to repository collections and usage; and

  • Assess the development needs of partner repositories to implement the pilot service.

Activity reports summarising research documents, projects and research status.

Fez reports on generic repository statistics to do with items, authors, etc. For special fields such as research status, development would be required. The reports don't make the distinction between generic items and items that might have a particular type (such as project) or status, so customization may be required.

Application maintenance and support

Application development using open standards and industry supported software.

Fez uses standard open source components (PHP and MySQL plus standard libraries) and is itself open source.

A sufficiently scalable application able to cope with projected demand.

Fez can be scaled up using load balancing servers (according to the developers). It is difficult to comment further on this issue without access to some projections from AANRO, but the Fez demo server running on shared infrastructure is displaying acceptable speed for a test server (about 2 seconds to generate each page).

Meet government standards for web delivered services, i.e. The Guide to Minimum Website Standards (www.agimo.gov.au)

The minimum standards are available from the AGIMO web site. Not all of the standards are relevant, as a lot of them are about content, which is not up to the repository software.

As noted above, Fez will be able to meet the requirements for AGLS metadata.

Regarding accessibility, there was not time in this consultancy to undertake a full accessibility audit of Fez. A lot of the basics, such as alternate text on images in the application, appear to be well thought out. We note, though, that the application relies heavily on JavaScript to work. Browsing with JavaScript disabled makes the interface less usable, and the administrative interface (for example, to add an item) does not function at all without JavaScript. This is not unusual for a modern web application. AANRO should seek advice from an accessibility expert as part of any redevelopment.

Capacity to deliver content for low bandwidth connections to ensure customers can access the system when using low bandwidth.

Fez is an inherently low-bandwidth application, although AANRO content may not be once it has been sourced and added to the repository.

Interoperability with other repositories

Fez will interoperate with other repositories mainly via the use of OAI-PMH. This is a well-established protocol that can be used to move metadata around a network of repositories and portals. See the discussion on architecture in the final report.

Web Single Sign On

(WSSO) is required to simplify the administrative process of authentication and authorization and in the future to gain access to the resources of other web based systems.

Fez can use the Shibboleth single sign-on system, but note that participating users will need to belong to the same federation. This is increasingly true for university staff but not for the broader community.

Data migration

We are providing a data migration guide which covers initial conversion for the AANRO metadata guide. We have demonstrated that this approach is Fez-compatible. Note that

Plan and undertake the migration of data from the current AANRO platform and demonstrate a strategy to enable any future migration. Within the context just discussed, it will be the ongoing responsibility of the Successful Respondent to assist in the migration of research content from AANRO associated agency systems to the new database. This will include the integration of exported database content from agencies, migration of the existing InMagic AANRO database, and ongoing harvesting of content from agency websites.

Database management

This is the requirement as written:

Extensive, ongoing content sourcing, harvesting, indexing, publishing and content management by the respondent/vendor.

Application/web development, maintenance, hosting and support

Ongoing development, maintenance, web and database hosting and support.

Software support services

If Fez is selected for use by AANRO, we recommend that the developer and repository maintainer set up a formal relationship with the University of Queensland, or otherwise procure the services of a Fez expert, to supply help desk and defect correction services. The Fez team are very responsive to outside critiques and questions but they are under no obligation to assist outside developers.
