Thursday, November 1, 2007

AANRO data in Muradora

Last week Namchi Nguyen (Chi) got in contact with us to let us know that the Muradora team had added an import utility, so Muradora can now index items sitting in an existing Fedora repository (Fez can do this, and after some prompting from the ARROW and RUBRIC communities VTLS added this feature to VITAL as well).

The Muradora team have put up a demo server containing AANRO data and some other stuff. Remember this is someone else's server and the data may not be in there for long.

I had a conversation with Chi this week about this new version. It looks promising, and it seems that it would meet most of the AANRO requirements, although with its focus very much on access control it may not be a perfect match for AANRO's open access data.

There are a couple of issues that need to be resolved (I've already mentioned this stuff to Chi):

  1. At present when you do a search the default behaviour is to report that there were a certain number of results, and only filter out results that you are not meant to be able to see when they are displayed. From what I know of institutional repositories this is not acceptable, as even knowing that someone is working on a particular topic may be a problem for some intellectual property.

    I'm not sure that this would be a huge problem for AANRO, but in the current version it certainly compromises usability in the general case.

    Chi tells me they will fix this soon by having a 'guest' mode that only searches open access objects (a sketch contrasting the two behaviours follows this list).

  2. The current interface is not very hypertextual: once you get to a metadata page there are no links to see other things about the same subject or by the same author. I'm sure this would be simple to add.

  3. There is nothing in the demo to show a subject hierarchy or ontology at work unless you count collections (we don't have that on our Solr demo either, come to think of it).

  4. There are a number of bugs and little tidy-ups that are required, and anyone taking on this software would have to be prepared to work with something where they would be amongst the first to use it.

  5. The current demo does not have its metadata editing configured to work with author affiliation in MODS. I've confirmed with Chi that this is just a matter of writing some more XForms code, which can be tricky, but there are others working on the same thing (also without affiliation, though).
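To illustrate point 1, here is a minimal sketch in Python (emphatically not Muradora code) contrasting the two behaviours: filtering at display time leaks the true hit count, while a 'guest' mode that filters at query time does not.

    items = [
        {"title": "Open report on wheat rust", "public": True},
        {"title": "Embargoed wheat rust trial", "public": False},
    ]

    def search(query, filter_at_query_time):
        hits = [i for i in items if query in i["title"].lower()]
        if filter_at_query_time:
            # guest mode: restricted items are never counted at all
            hits = [i for i in hits if i["public"]]
            return hits, len(hits)
        # current behaviour: restricted items are hidden from display,
        # but the reported count still reveals that they exist
        return [i for i in hits if i["public"]], len(hits)

    print(search("wheat rust", False))  # shows 1 item but reports 2 hits
    print(search("wheat rust", True))   # shows 1 item and reports 1 hit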

Unfortunately we've run out of time on this evaluation (the draft final report went off this week), but I'll keep an eye on Muradora.

(Regarding performance, from Chi's email I gather that index time for 140,000 records, including adding all records to an AANRO collection, was 16-18 hours on a machine with 4GB RAM, 2x 1.8GHz Opteron CPUs (each with 2 cores) and a 320GB SATA hard disk. This time could be improved by doing a bit more work on the data, and re-indexing will not take that long. Chi, if you want to comment here, please do so.)

Copyright 2007 The University of Southern Queensland

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.

Tuesday, October 30, 2007

AANRO Requirements summary - Fez

About this document

This document contains a summary of requirements for the AANRO repository with notes on how Fez 2.0 will address them. As usual, comments are welcome.

At time of writing the AANRO data is in the Fez demo site (it may not remain there).

For a more stable Fez-driven site which will persist see UQ's eSpace repository.

USQ has a Fez demonstration site into which we will endeavour to put a complete copy of the latest AANRO data.

Search Facility

Rapid and comprehensive access to records of research in-progress, finalised research reports, formal publications, journal/conference proceedings and website content

In general Fez deals with this really nicely. By simply transforming AANRO data to MODS and importing into Fez, the AANRO data were immediately browsable; further configuration is required, but 'out of the box' it works.

Depending on what is meant by 'website content' this could be more of an issue. Some minor changes to Fez would need to be made for it to be able to serve HTML with stylesheets and images.

Basic and advanced search operations including thesaurus keywords, metadata, organisation, geographic locality and other supported metadata.

Fez can be configured to use multiple metadata schemas so locality could be added.

Capability to externally link to, and store in a central database, a variety of common file formats including HTML, PDF, word, excel, delimited text, tagged and XML content.

Yes to all of these, although HTML upload is sub-optimal: Fez can display an HTML page but not its images, unless the user goes to a lot of trouble to change the HTML page so it can reference Fez datastreams.

A change would need to be made to Fez so that it can serve HTML with images, ideally by changing the URLs used by Fez to be path-like, so that relative URLs would work.
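To make the point concrete, here is a small illustration (the URLs are made up for this example, not real Fez routes) of why path-like URLs fix relative references:

    # With a query-style URL the browser resolves a relative image path
    # against the server root; with a path-like URL it resolves within
    # the item, where the datastream could be served from.
    from urllib.parse import urljoin

    query_style = "http://repo.example.edu/view.php?pid=aanro:38161"
    path_style = "http://repo.example.edu/view/aanro:38161/index.html"

    print(urljoin(query_style, "images/fig1.png"))
    # http://repo.example.edu/images/fig1.png -- broken
    print(urljoin(path_style, "images/fig1.png"))
    # http://repo.example.edu/view/aanro:38161/images/fig1.png -- works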

If AANRO requires HTML items in the repository then the budget will need to include making these changes.

Support for the development of applications to search special data-collections.

In one sense this is supported by default because Fez uses Fedora, but Fez itself does not support development of special purpose portals, except in as much as it is open source and extensible. Fez does not use standard Fedora mechanisms for indexing or security, so any work done in these areas needs to be duplicated in a special purpose application. I know that the Fez team plans to add support for special-purpose portals. This will be at the level of the Fez application, so you will need to use a Fez API to leverage any Fez access controls.

In the final report we discuss the option of building a federated AANRO service which would harvest items from a number of repositories via OAI-PMH. Such a harvesting service could provide 'filtered' views of data from a number of repositories either through a single or multiple portals.
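As a sketch of the harvesting side of such a federation, the following uses only the Python standard library to issue an OAI-PMH ListRecords request and print titles; the endpoint URL is a placeholder, not a real AANRO service.

    from urllib.request import urlopen
    from xml.etree import ElementTree as ET

    # Hypothetical OAI-PMH endpoint of one participating repository
    url = ("http://repository.example.edu/oai"
           "?verb=ListRecords&metadataPrefix=oai_dc")

    tree = ET.parse(urlopen(url))
    ns = {"oai": "http://www.openarchives.org/OAI/2.0/",
          "dc": "http://purl.org/dc/elements/1.1/"}
    for record in tree.iterfind(".//oai:record", ns):
        for title in record.iterfind(".//dc:title", ns):
            print(title.text)
    # A real harvester would also follow resumptionToken elements to page
    # through the full result set, and track datestamps for incremental harvests.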

Support for the existing custom developed applications and sites using subsets of AANRO data.

This raises questions for AANRO: what sites are now consuming the data? Are there any sites that duplicate the data? Are there opportunities for federation, i.e. participating organizations who have, or would like to have, their own repository that feeds into the central AANRO registry? This needs to be covered in implementation.

Search results

A consistent, standardised and user-friendly website and search tool.

Yes. Fez has a lot of nice touches including suggestions when searching (as you type it offers to complete what you're typing). More detail on functionality is included in the discussion below.

Customisable, sorted by relevance, easy-to-read search results that include links to other relevant research.

Fez uses the full text engine that comes with the MySQL database, along with its relevance ranking engine.

Fez doesn't yet support faceted searching but when you look at an item it will let you search for related items by clicking on the subject or authors. Faceted search would be a nice-to-have.

Feedback mechanism on unsuccessful searches.

Fez returns a simple message:

Search Results (All Fields:"xsds", Status:"Published"): (0 results found)

Christiaan Kortekaas tells me that there is a new feature coming which will use a thesaurus to retry unsuccessful searches:

If you search for happy and get no results it will say: similar terms "cheerful (3 matches), delirious (2 matches)".

Or if it's misspelled, like hapyy, it will say: did you mean 'happy' (5 matches)?
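The misspelling half of this is easy to sketch with Python's standard difflib (this is my illustration of the idea, not the actual Fez code); the 'similar terms' half would draw on a thesaurus lookup rather than edit distance.

    import difflib

    index_terms = {"happy": 5, "cheerful": 3, "delirious": 2}  # term -> hit count

    def did_you_mean(query):
        close = difflib.get_close_matches(query, list(index_terms), n=3, cutoff=0.6)
        return [(term, index_terms[term]) for term in close]

    print(did_you_mean("hapyy"))  # [('happy', 5)]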

Content harvesting and publishing

Use of high levels of automation to identify, capture, harvest, reformat and index new content from a variety of sources and agencies.

Fez can publish content via OAI-PMH but currently does not harvest. This is one area where there might need to be custom development, after careful requirements gathering; it is hard to comment at this time without a lot more detail about the sources and agencies.

An alternative architecture where the AANRO portal used something like Apache Solr with an OAI-PMH harvester bolted on might meet this requirement. Content from agencies could either be indexed directly by a portal (via a feed of some kind) or stored in a central Fedora repository and then indexed. We'll have much more detail on alternative architectures by the end of the project.
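To give a feel for the Solr side of that architecture, here is a minimal sketch of pushing one harvested record into a Solr index over HTTP; the field names are illustrative, since they would be defined by the portal's Solr schema.

    from urllib.request import Request, urlopen

    solr_update = "http://localhost:8983/solr/update"  # Solr's default example URL

    doc = """<add><doc>
      <field name="id">aanro:38161</field>
      <field name="title">Guidelines on the quality of stormwater...</field>
      <field name="author">Dillon P</field>
    </doc></add>"""

    # POST the document, then a commit so it becomes searchable
    for body in (doc, "<commit/>"):
        req = Request(solr_update, body.encode("utf-8"),
                      {"Content-Type": "text/xml; charset=utf-8"})
        urlopen(req).read()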

(See a demonstration site that may not persist for long after November 2007. This site has a harvesting component that fetches data from a number of sources via OAI-PMH).

Individual web-based publishing i.e. input of research information by participants.

Fez has a customisable web-based input system with workflow that is highly configurable. We are confident that this would be usable by agencies contributing to a central database.

I have previously expressed concern with some aspects of the way the web-input system in Fez works. It uses a very complicated web interface to map elements in an XML schema to web forms and stores the results in a database. Quite a lot of the traffic on the Fez mailing list is about people having problems configuring this interface. We recommend budgeting at least two weeks of developer time to ensure that mappings can be constructed for AANRO data.

Another concern is with the way that configuration may be versioned and moved between test and production instances. Fez does have a method for exporting schema mappings from one server and importing them into another. Our recommendation is to set up procedures where mappings are developed in a test environment and deployed to production via a version control system. The production system should be locked down so only developers can change mappings and other configuration (this feature is coming).

Automated and manual web-based batch publishing of documents in a variety of standard formats.

Fez has a web-triggered batch ingest system, which can import directories full of content, in any format. This is a very simple arrangement requiring users to put data in a directory and then trigger the ingest via the web.

Indexing

Support for records indexed using metadata that meets relevant standards such as AGLS.

Fez can support multiple metadata standards and indexing is highly configurable. See the AANRO metadata guidelines document (forthcoming) for a discussion of a number of metadata schemas and thesauri.

Support for open archives initiative protocol for metadata harvesting (OAI-PMH).

An OAI-PMH feed (but not harvesting) is supported by Fez, and custom feeds can be set up by changing PHP code and Smarty templates.

Support for indexing and search of records using the CABI thesaurus as used in the current system.

Indexing and search are no problem, even without licensing the thesaurus, if the terms are attached to items in the repository. The Fez ingest system can be configured to use different thesauri; this would need to be done by developers.

Content Management

Support for web-based document management, auditing, simple workflow, including research status, publishing rights and ability to edit incorrect content.

See the comments above on web input.

Archive facility to flag data that exceeds a particular age.

This could be done in Fez using a customized workflow to perform a search and then provide an interface to action the flagged items.
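A sketch of the idea (with hypothetical field names, not actual Fez workflow code): the workflow step runs a date query and collects anything past the cutoff for an administrator to action.

    from datetime import date

    # Made-up sample records standing in for search results
    records = [{"pid": "aanro:38161", "pubyear": 1996},
               {"pid": "aanro:40210", "pubyear": 2005}]

    cutoff = date.today().year - 10  # 'exceeds a particular age' = over 10 years old
    flagged = [r["pid"] for r in records if r["pubyear"] < cutoff]
    print(flagged)  # items the workflow would present for archiving or review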

Content reporting

Analysis and reporting of traffic using an industry standard tool.

Fez is part of a collaborative standards-based effort to build a benchmark statistics service (BEST). From the wiki:

The focus of this initiative is on designing a Repository Statistics Service that will enhance the type and quality of statistical information about collections (and items) and usage statistics in repositories. The problem to be solved here relates to the strategic need for better, standardised, statistical information about the repository holdings and usage in order to inform a wide range of policy and funding decisions within the overall scholarly communications cycle. More specifically, BEST will design a federated repository statistics service that will facilitate the automated collection and standard analysis of statistical information about the collections and usage of APSR Partner repositories.

Methods and Approaches

  • Adopt and adapt the findings of relevant international projects

  • Scope open-source statistics and web-metric solutions

  • With cooperation of APSR partners and other interested groups propose and trial a 'benchmark set' of statistical measures and filtering approaches appropriate to research repositories.

  • Using a service-oriented approach, design a service for collecting, aggregating, analysing and presenting this statistical information related to repository collections and usage; and

  • Assess the development needs of partner repositories to implement the pilot service.

Activity reports summarising research documents, projects and research status.

Fez reports on generic repository statistics to do with items, authors etc. The reports don't distinguish between generic items and items that might have a particular type (such as project) or status, so for special fields such as research status some customization and development would be required.

Application maintenance and support

Application development using open standards and industry supported software.

Fez uses standard open source components (PHP and MySQL plus standard libraries) and is itself open source.

A sufficiently scalable application able to cope with projected demand.

Fez can be scaled up using load balancing servers (according to the developers). It is difficult to comment further on this issue without access to some projections from AANRO, but the Fez demo server, running on shared infrastructure, is displaying acceptable speed for a test server (about 2 seconds to generate each page).

Meet government standards for web delivered services, i.e. The Guide to Minimum Website Standards (www.agimo.gov.au)

The minimum standards are available from the AGIMO web site. Not all of the standards are relevant, as a lot of them are about content, which is not up to the repository software.

As noted above Fez will be able to meet Metadata requirements for AGLS metadata.

Regarding accessibility, there was not time in this consultancy to undertake a full accessibility audit of Fez. A lot of the basics, such as alternate text on images in the application, appear to be well thought out. We note, though, that the application relies heavily on Javascript to work. Browsing with Javascript disabled makes the interface less usable, and the administrative interface (for example to add an item) does not function at all without Javascript. This is not unusual for a modern web application; AANRO should seek advice from an accessibility expert as part of any redevelopment.

Capacity to deliver content for low bandwidth connections to ensure customers can access the system when using low bandwidth.

Fez is an inherently low-bandwidth application, although AANRO content may not be once it has been sourced and added to the repository.

Interoperability with other repositories

Fez will interoperate with other repositories mainly via the use of OAI-PMH. This is a well established protocol that can be used to move metadata around a network of repositories and portals. See the discussion on architecture in the final report.

Web Single Sign On

(WSSO) is required to simplify the administrative process of authentication and authorization and in the future to gain access to the resources of other web based systems.

Fez can use the Shibboleth single sign on system, but note that participating users will need to belong to the same federation; this is increasingly true for university staff but not for the broader community.

Data migration

We are providing a data migration guide which covers initial conversion for the AANRO metadata guide. We have demonstrated that this approach is Fez-compatible. The requirement as written:

Plan and undertake the migration of data from the current AANRO platform and demonstrate a strategy to enable any future migration. Within the context just discussed, it will be the ongoing responsibility of the Successful Respondent to assist in the migration of research content from AANRO associated agency systems to the new database. This will include the integration of exported database content from agencies, migration of the existing InMagic AANRO database, and ongoing harvesting of content from agency websites.

Database management

This is the requirement as written:

Extensive, ongoing content sourcing, harvesting, indexing, publishing and content management by the respondent/vendor.

Application/web development, maintenance, hosting and support

Ongoing development, maintenance, web and database hosting and support.

Software support services

If Fez is selected for use by AANRO, we recommend that the developer and repository maintainer set up a formal relationship with the University of Queensland, or otherwise procure the services of a Fez expert, to supply help desk and defect correction services. The Fez team are very responsive to outside critiques and questions, but they are under no obligation to assist outside developers.

Wednesday, October 24, 2007

Solr demo is up

I have speculated a bit about repository architecture here. As part of the thinking we have put up a proof-of-concept-only portal based on Apache Solr. This uses harvesting and indexing code from the FRED project. As I write this it contains metadata for 130K AANRO records, two copies of the USQ ePrints data, UQ's eSpace and Monash's repository, which shows up as ARROW (hi Neil D, I know you're following the feed here).

For example, if you click on the facet for one of the repositories then all further searches are just on that repository. Hit [clear all] to go back to showing everything.

It took 188 seconds to slurp up the 6278 UQ metadata records via the internet.

Thanks to Oliver Lucido and David Levy for this one.

This is not a real service. It might disappear at any moment. The interface might change. It might be down for maintenance without warning and if you're reading this after the end of November 2007 it will almost certainly be gone.

Copyright 2007 The University of Southern Queensland

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.

Implementing Author Affiliation in Fez

I wrote here before about author affiliation. How can you keep track of not only the name of an author but their institutional affiliation for a particular item in a repository?

Because it uses MODS metadata by default the Fez repository can deal with this, but the default configuration for generic items does not include affiliation. So this is a good opportunity to see how Fez configuration works.

In this post I'll look at some of the main issues for AANRO:

  1. Mapping the existing data to MODS using the tools we developed for the RUBRIC project.

  2. Configuring Fez to recognize the affiliation and display it.

  3. Configuring an OAI-PMH feed so that the repository can be part of a federation and be linked into services like Google Scholar.

1 Data mapping

In the AANRO data we're working with, the author field contains the affiliation in brackets. Our first mapping to MODS left this as-is. To fix this I needed to change the stylesheets we use to map the original data, which is in an ad hoc format, into a standard format.
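The split the new stylesheet has to perform is easier to see in Python than in XSLT (the real transformation is XSLT; this is just an illustration of the pattern):

    import re

    raw = ("Dillon P (Australian Centre for Groundwater Studies)|"
           "Pavelic P (Australian Centre for Groundwater Studies)")

    for part in raw.split("|"):  # authors are pipe-separated
        m = re.match(r"\s*([^(]+?)\s*\(([^)]*)\)\s*$", part)
        if m:
            name, affiliation = m.groups()
        else:
            name, affiliation = part.strip(), None  # author with no affiliation
        print(name, "/", affiliation)
    # Dillon P / Australian Centre for Groundwater Studies
    # Pavelic P / Australian Centre for Groundwater Studies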

Here's the process in outline. This is what our tech officers did to get the data into Fedora:

  1. Transform the raw data to XML using Excel (it might be better to do this with Python in future; see the sketch after this list).

  2. Split the resulting file into multiple parts, by using an XPath to select an element on which to break:


    python split_xml_into_archive.py   ~/aanro/Data/Xml/AANROPubArchive0_30k.xml  ~/datamigration/aanro1 "//record" DC.xml False

    The result is a series of directories (30,000 of them in this case), each containing a file called rec.xml in a format like this:

    <record>
      <id>38161</id>
      <doctype>Book or Report</doctype>
      <pubyear>1996</pubyear>
      <title>Guidelines on the quality of stormwater and treated wastewater
        for injection into aquifers for storage and reuse</title>
      <author>Dillon P (Australian Centre for Groundwater Studies)|Pavelic P
        (Australian Centre for Groundwater Studies)</author>
      ...
    </record>

  3. Transform each metadata record into MARC XML (we could change this process and go straight to MODS, but we are used to working with MARC because that's what our staff were used to with RUBRIC, and we have some VITAL configuration that uses MARC).


    python xsl_transform.py  temp.xml AANRO_xml_to_marc.xsl MARC.xml aanro1/ False
  4. Transform the MARC XML to MODS using a variant of a stylesheet we downloaded from the Library of Congress. This uses the same script as above, with a different XSL stylesheet.

  5. Transform the MARC XML to Dublin Core.

  6. Convert the Dublin Core and MODS metadata into FOXML, which can be ingested directly into Fedora.

  7. Ingest into Fedora using the Fedora client.

  8. Index the items into Fez

    • Click on administration.

    • Click on (Maintenance) Index Objects.

    • Click Discover new Fedora objects.

    • Select the Generic Document Version MODS 1.0 under Document Type.

    • Click Index All Items into Fez.

    While the indexing is happening you can monitor it from the My Fez area; the Background Processes tab will tell you how long it expects the indexing to take.

    (Currently this is very slow, as I noted before, but Christiaan Kortekaas has another trick up his sleeve which should improve indexing by at least an order of magnitude, using the same technique for accessing Fedora records as the Fedora GSearch indexer.)
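As promised in step 1, here is what a minimal Python version of that first transformation might look like, assuming a delimited export with a header row whose column names match the sample record above (the input filename is hypothetical):

    import csv
    from xml.sax.saxutils import escape

    with open("aanro_export.csv", newline="") as src, \
         open("AANROPubArchive.xml", "w") as dst:
        dst.write("<records>\n")
        for row in csv.DictReader(src):
            dst.write("<record>\n")
            for field in ("id", "doctype", "pubyear", "title", "author"):
                dst.write("  <%s>%s</%s>\n" % (field, escape(row[field]), field))
            dst.write("</record>\n")
        dst.write("</records>\n")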

I had to make changes to the stylesheet used for step 3. Here's the process I followed:

  1. Check out the RUBRIC toolkit code from Subversion.

  2. Locate the AANRO specific stylesheets.

  3. Find the UTF-X unit test for the AANRO to MARCXML XSLT transformation.

    A test for a simple transformation from <author> in the input data to MARCXML looks like this:


    <utfx:test>
        <utfx:name>Author</utfx:name>
        <utfx:assert-equal>
            <utfx:source validate="no">
                <author>Leng RA</author>
            </utfx:source>
            <utfx:expected validate="no">
                <datafield tag="100" ind1=" " ind2=" ">
                    <subfield code="a">Leng RA</subfield>
                </datafield>
            </utfx:expected>
        </utfx:assert-equal>
    </utfx:test>
  4. Add new tests to deal with affiliations.

    I added another test that deals with multiple authors with affiliation.


    <utfx:assert-equal>
        <utfx:source validate="no">
            <author>Dillon P (Australian Centre for Groundwater Studies)|Pavelic P
                (Australian Centre for Groundwater Studies)</author>
        </utfx:source>
        <utfx:expected validate="no">
            <datafield tag="100" ind1=" " ind2=" ">
                <subfield code="a">Dillon P</subfield>
                <subfield code="u">Australian Centre for Groundwater Studies</subfield>
            </datafield>
            <datafield tag="700" ind1=" " ind2=" ">
                <subfield code="a">Pavelic P</subfield>
                <subfield code="u">Australian Centre for Groundwater Studies X</subfield>
            </datafield>
        </utfx:expected>
    </utfx:assert-equal>

    Since the authors had the same affiliation I added an 'X' to the second one to make sure that the right author gets the right affiliation.

  5. Write the XSLT to pass the test.

    This bit took a bit longer than expected and I ended up having to talk to our metadata specialist about MARC indicators and the like but I managed to get it working without breaking the other parts of the stylesheet, and simplified the code in the process.

  6. Check the code back in to Subversion.

  7. Update the code on our test machine and use it on the AANRO data.

We will include the AANRO code in a release of the RUBRIC toolkit which will be undergoing a tidy-up in early 2008.

2 Configuring Fez metadata entry and display

Once the records are in Fedora and indexed so that they show up in Fez, we have to configure the display and the HTML form used to edit metadata.

The key to this is a system which maps XML Schemas to HTML forms. An XML schema is a description of the structural potential of a type of document, such as a MODS metadata document.

This is a complex and confusing area of the system, which generates a large percentage of the traffic on the Fez users mailing list.

I'm stuck on this at the moment: even with the very helpful Fez developers online in their Campfire chat, I can't figure out how to add an affiliation to an author.

One thing I have tried is to export the mapping as XML, but it's really not a human-editable format; still, it's good to see that the configuration can be exported, kept under version control, and re-imported to a server. (And Christiaan Kortekaas tells me that there will soon be an option to lock down Fez so that even administrative users can't mess around with the mappings on a live system. Good news.)

So we have to leave this configuration to a future AANRO implementation team, if Fez is selected. One big issue outstanding is that quite a bit of the AANRO data consists of project descriptions, not documents. This will require a fair bit of configuration.

3 Configuring OAI-PMH feeds

The web interface on a repository is not the only way that people find things. In fact most traffic is likely to come from elsewhere, particularly large search services like Google Scholar.

To expose metadata to other services you need to configure an OAI-PMH feed.

Fez has highly configurable OAI-PMH templates. Here's a snippet from the ListRecords template:

{assign var="loop_authors" value=$list[i].rek_author}
{section name="a" loop=$loop_authors}
<dc:creator>{$list[i].rek_author[a]|escape:"html"}</dc:creator>
{/section}

Now, I don't speak Smarty Templates, but obviously this is a loop (loop_authors) that iterates over the record for an item and spits out Dublin Core creator elements. For AANRO it is possible that the implementer would want to change this to add an affiliation in brackets, as there's no simple way in Dublin Core to associate an author with an affiliation. Yes, this effectively undoes the work I described above to split the author from their affiliation, but it may turn out to be a useful mechanism in an AANRO federation, to overcome some of the limitations of Dublin Core metadata.

It might look something like this, assuming that there's a key 'f' for affiliation in the author array:

<dc:creator>{$list[i].rek_author[a]|escape:"html"}
({$list[i].rek_author[f]|escape:"html"})
</dc:creator>

4 Conclusion

This brief blog post has covered a few of the major areas of concern for AANRO. The data migration stuff will be expanded in the forthcoming data migration guide for AANRO, and pointers to potential Fez implementation will be in the final report.

I mentioned the Subversion revision control system and UTF-X unit tests for XSLT because they're really helpful in this kind of work, but not everyone we come across in the library systems world uses them.

Copyright 2007 The University of Southern Queensland

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.

Thursday, October 18, 2007

Performance anxiety

We started the AANRO evaluation with a look at performance, because 200,000-odd items is quite a lot to manage. The repository manager will need to think about how long it will take to re-index after a disaster, an upgrade, or a configuration change that requires indexes to be rebuilt.

VITAL is now no longer being considered, but there is one final test to report on that front and we have some new data about Fez 2.

VITAL with 130,000 or so records
Before we decided to stop evaluating VITAL, Bron Dye ran a test on a virtual machine with 3GB of memory (more than we had for previous tests).
  • Bron used the Fedora batch tool to ingest 96,000 records into Fedora at a rate of about 40,000 per hour.

  • On Saturday Morning, she told VITAL to index the records.

  • On Monday, the VITAL portal would not respond to HTTP requests.

  • The admin page for VITAL did work, so Bron told it to stop indexing, and proceeded to ingest some more Fedora records to take the total up to 130,592.

  • The VITAL server now returns a server error, while the underlying Fedora works fine.

(We have to leave the experiment there, but we still have the data so if VTLS would like to have a look we'd be happy to help with testing. Could it be that stopping the indexer caused the problem?).
Fez 2
We had some issues installing Fez 2 at USQ, and it was taking some time to sort them out, so the Fez team very kindly offered to let us use their virtual demo server.
The bottom line is that we're seeing about 40,000 records a day indexing into Fez; at this rate there will be 150,000-ish by Saturday morning (it's Thursday afternoon now).
You can see the Fez demo server, but be aware that the AANRO data may disappear at any time, and performance will be slow until it finishes indexing on Saturday morning. As it is the site is usable, with most pages taking 4 or 5 seconds to build, and I'm impressed that the records show up pretty well considering all we did was put them in MODS format. Shows the benefits of standardization.

So on a demo machine, indexing AANRO metadata is a days-long proposition with Fez 2, just as it was with VITAL 3.1.1. This would add an enormous overhead to building and testing a new repository, not to mention disaster recovery, and things would only get slower once we start sourcing full text for items. On the positive side, the Fez team are actively improving the software as we speak, but I'd be looking for an order of magnitude improvement before I wanted to work with all the data in one instance; this might be achievable with some more optimization and some beefier hardware, I'm not sure.

The current performance does not rule out using Fez but it does point in the direction of a federated architecture, with smaller repositories feeding a central portal, running something like Apache Solr. For comparison, experiments I did on my MacBook laptop had Solr indexing AANRO metadata at something like 3000 items per minute, but that was without the overhead of having to fetch all the items from Fedora and generate preservation metadata like Fez is doing, so it's not sensible to compare it with a repository. Any federated solution would have to look at using caching if there are performance bottlenecks.
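Some rough arithmetic on the rates quoted above (the two setups are not directly comparable, as noted, so treat this as indicative only):

    fez_rate = 40000 / (24 * 3600)  # ~0.46 records/second on the Fez demo server
    solr_rate = 3000 / 60           # 50 records/second into Solr on a laptop
    print(round(solr_rate / fez_rate))  # ~108: roughly two orders of magnitude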

Christiaan Kortekaas points out that UQ's Fez repository has 66,487 items, with 5,261 currently publicly available.

Copyright 2007 The University of Southern Queensland

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.

Wednesday, October 17, 2007

Collections considered ...

(I was going to call this post Collections considered harmful, but then I thought about "Considered Harmful" Essays Considered Harmful (Meyer 2002). So let's just consider collections.)

One of the important decisions in setting up a repository is choosing a collection architecture.

The word collection is used in a variety of ways, so I will define my terms here and try to stick to them.

Hard collection
I'll define a hard collection loosely as something that requires a repository manager to set up container objects into which items, including other collections, need to be added manually. Implementation details vary, but the effect is that somebody has to decide where in the hierarchy of collections to place each item, usually at ingest time. You can generally add things to more than one collection, too. Note that some software has a notion of communities as well as collections, but I think of communities as being a type of hard collection.
Virtual collection
A virtual collection, to my way of thinking, would be a view of the repository that is constructed automatically by the repository software based on some kind of rule. A rule might be: the Arts Faculty thesis collection contains all the items of type 'thesis' where the author is affiliated with the faculty. Machines are pretty good at executing queries of this type.
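The thesis example, expressed as the kind of query a machine can run (a sketch with made-up records and field names):

    items = [
        {"pid": "x:1", "type": "thesis", "faculty": "Arts"},
        {"pid": "x:2", "type": "thesis", "faculty": "Sciences"},
        {"pid": "x:3", "type": "article", "faculty": "Arts"},
    ]

    def virtual_collection(rule):
        # the 'collection' is just a saved query, evaluated on demand
        return [i["pid"] for i in items if rule(i)]

    arts_theses = virtual_collection(
        lambda i: i["type"] == "thesis" and i["faculty"] == "Arts")
    print(arts_theses)  # ['x:1'] -- no manual placement of items required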

Some options

Of course, one option is not to use hard collections at all. This is the Eprints way. Items are described purely by their metadata. Take a look at USQ's Eprints. The only collection-like things in the software here are the browse views, which are generated dynamically from the metadata and hence qualify in my terms as virtual collections. Eprints is fairly limited in its browsing but it could be extended, either by adding to the software, or by using a more capable portal instead of the built-in one. (I can see Eprints fitting quite well into a federated AANRO repository).

Another option is a repository which forces you to use hard collections. Fez (currently), Muradora and DSpace require items to be in hard collections.

In DSpace, collections and communities get used for all sorts of organizing; Flinders (a RUBRIC partner) provides a few examples.

Fez and Muradora also require you to set up collections, and add objects into them.

  • In Muradora's case this is to make its access control more efficient, I believe. (I have tried to influence the Muradora developers to allow virtual collections built by the computer, not by people, but so far I have not been that persuasive.)

  • For Fez, access control is the main reason the developers added hard collections: you can set up user roles per collection.

    The current version of Fez requires hard collections, but I have been chatting to Christiaan Kortekaas from the Fez team this morning and he says they are going to relax this restriction. Good news, I think.

If you want to have a choice between using collections or not, the other software which was (but is no longer) under consideration for AANRO was VITAL, which allows you to work without collections but enables them if you really want them.

One prominent Eprints advocate, Arthur Sale put it like this in response to a question on the Fez users mailing list about collections within collections:

What an awful idea. Just like collections themselves. What is needed is customizable views of data.

Arthur told a group of us from the RUBRIC project, when we visited Hobart, that collections (ie hard collections) are an unnecessary throwback to the days of physical library collections, where you needed to put each item in one and only one place. I tend to agree with Arthur, but I will try here to summarize some of the pros and cons of hard collections.

What does this mean for AANRO?

I'll look briefly at the reasons AANRO might or might not want to use hard collections, and I'll take it as given that virtual collections are A Good Thing, as this is implicit in the original AANRO requirements.

Why hard collections?
  1. Collections are one way for repository managers to build navigation, creating a hierarchy that can be navigated top-down.

  2. Collections make it easy to express permissions, eg. User X has permission to add items to collection Y.

  3. Collections are a way to express the kinds of relationships you get in complex objects such as a journal, which has volumes, which have issues, which have articles.

I don't think any of these are compelling for AANRO, except maybe the second: some kind of agency-level collection may make administration easier, depending on the software selected.
Why not hard collections?
  1. Depending on implementation, using collections may tie you to a particular repository solution. (It is important to check if the hard work you put into creating and populating collections can be exported in a standard format for re-use).

  2. Large-scale collections can become tedious and error-prone to manage.

    For example, how can you be sure that there is not a stray thesis somewhere that has not been added to the thesis collection?

    If you answered 'do a search on the metadata' then why don't you just use the metadata to define a virtual collection?

  3. To avoid the lock-in problem (1) you should probably make sure that all the important information needed for a collection is stored redundantly in the item metadata, which brings us back to the question: why not use virtual, machine generated collections?

  4. Collections create management issues that require repository software designers and managers to really think hard: what if you want to delete a collection that has thousands of member items? Presumably you have to re-house them first into different collections, which could be quite a job. All these complexities make the software complex, and increase the likelihood of bugs in the code, and errors in its application to real world problems.

For AANRO these are all reasons to seek software that does not require hard collections to work. I'm not saying that further analysis won't show that some level of collection is worth having, just that there is a good chance that it may not be needed. If AANRO chose to go with a distributed network of repositories then access control might be made much simpler, with only one workflow per repository, whereas in a more centralized system collections may be required to configure a variety of access controls, because that's the way the software works.

Copyright 2007 The University of Southern Queensland

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.

Monday, October 15, 2007

VITAL out of contention

We've been looking at the VITAL repository software and reported on some testing we did to see how many objects it can handle, but we are going to stop evaluating it for AANRO. (One more report to come; I am happy to report that v3.1.1 is MUCH better than previous versions of VITAL.)

There is one key reason for this decision: There is no web interface that allows non-technical users to edit the metadata for an item in Fedora.

Yes, there is an ingest tool for VITAL, known as VALET, which offers simple workflow for new items. But once an item is in Fedora it cannot be edited with VALET; a technically proficient user must use an XML editor to change metadata, a far from easy or intuitive process.

Obviously we knew this going in to the evaluation, but we looked at VITAL in case there was no other Fedora solution that would scale, and we reported our concerns to VTLS to see if they would respond with changes to VALET or VITAL. I am a member of the ARROW Developers Group and we have made the point before that a full web interface for editing metadata is a must. I even mentioned this to Vinod Chachra (VTLS President and CEO) at the ARROW community day, who told me that they built VITAL for environments where using an XML editor for metadata maintenance was appropriate.

(Note also that there has been no response from VTLS to the questions I asked ten days ago about their product strategy.)

Given that Fez has a web interface, and scalability has improved dramatically with version 2, we now turn our attention to Fez. If VTLS respond with a web interface for VITAL in the next couple of weeks I'll report it here, and I'd be happy to post a feature list for the forthcoming version. Even if Fez cannot handle hundreds of thousands of items comfortably, there is still scope to use a number of smaller regional or subject-based Fez repositories, with an aggregated search.

To finish on a more positive note, it's appropriate to talk about the good bits of VITAL. In my opinion the best feature in terms of usability is the indexing: it is possible to run VITAL without having to pre-organize items into collections (unlike Fez); instead you can define virtual collections by defining indexes based on metadata. VITAL also promises to use standard Fedora access controls via work done by the Muradora team, but I have not seen this in action yet.

Copyright 2007 The University of Southern Queensland

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.