Tuesday, October 30, 2007

AANRO Requirements summary - Fez

About this document

This document contains a summary of requirements for the AANRO repository with notes on how Fez 2.0 will address them. As usual, comments are welcome.

At time of writing the AANRO data is in the Fez demo site (it may not remain there).

For a more stable Fez-driven site, one which will persist, see UQ's eSpace repository.

USQ has a Fez demonstration site into which we will endeavour to put a complete copy of the latest AANRO data.

Search Facility

Rapid and comprehensive access to records of research in-progress, finalised research reports, formal publications, journal/conference proceedings and website content

In general Fez deals with this really nicely. Simply by transforming the AANRO data to MODS and importing it into Fez, the data were immediately browsable; further configuration is required, but 'out of the box' it works.

Depending on what is meant by 'website content' this could be more of an issue. Some minor changes to Fez would need to be made for it to be able to serve HTML with stylesheets and images.

Basic and advanced search operations including thesaurus keywords, metadata, organisation, geographic locality and other supported metadata.

Fez can be configured to use multiple metadata schemas so locality could be added.

Capability to externally link to, and store in a central database, a variety of common file formats including HTML, PDF, word, excel, delimited text, tagged and XML content.

Yes to all of these, although HTML upload is sub-optimal: Fez can display an HTML page but not its images, unless the user goes to a lot of trouble to change the HTML page so it can reference Fez datastreams.

A change would need to be made to Fez so that it can serve HTML with images, ideally by changing the URLs used by Fez to be path-like, so that relative URLs would work.
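To illustrate why path-like URLs matter, here is a small sketch with hypothetical URLs (not Fez's actual URL scheme): browsers resolve relative references against the path of the page's URL, so a query-style URL sends an image request to the wrong place, while a path-like URL keeps it with the object.

# Illustration only -- hypothetical URLs, not Fez's real URL scheme.
from urllib.parse import urljoin

# Query-style URL: the relative image reference escapes the object.
page = "http://repository.example.org/view.php?pid=AANRO:38161"
print(urljoin(page, "images/figure1.png"))
# http://repository.example.org/images/figure1.png

# Path-like URL: the same relative reference stays with the object.
page = "http://repository.example.org/objects/AANRO:38161/index.html"
print(urljoin(page, "images/figure1.png"))
# http://repository.example.org/objects/AANRO:38161/images/figure1.png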

If AANRO requires HTML items in the repository then the budget will need to include making these changes.

Support for the development of applications to search special data-collections.

In one sense this is supported by default because Fez uses Fedora, but Fez itself does not support development of special-purpose portals, except in as much as it is open source and extensible. Fez does not use standard Fedora mechanisms for indexing or security, so any work done in these areas needs to be duplicated in a special-purpose application. I know that the Fez team plans to add support for special-purpose portals. This will be at the level of the Fez application; you will need to use a Fez API to leverage any Fez access controls.

In the final report we discuss the option of building a federated AANRO service which would harvest items from a number of repositories via OAI-PMH. Such a harvesting service could provide 'filtered' views of data from a number of repositories either through a single or multiple portals.

Support for the existing custom developed applications and sites using subsets of AANRO data.

This raises questions for AANRO: what sites are now consuming the data? Are there any sites that duplicate the data? Are there opportunities for federation, i.e. participating organizations who have, or would like to have, their own repository that feeds into the central AANRO registry? This needs to be covered in implementation.

Search results

A consistent, standardised and user-friendly website and search tool.

Yes. Fez has a lot of nice touches including suggestions when searching (as you type it offers to complete what you're typing). More detail on functionality is included in the discussion below.

Customisable, sorted by relevance, easy-to-read search results that include links to other relevant research.

Fez uses the full-text engine that comes with the MySQL database, along with its relevance ranking.
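For the curious, the underlying mechanism is MySQL's MATCH ... AGAINST, which returns a relevance score per row. A minimal sketch of the kind of query involved; the table and column names here are made up, not Fez's actual schema:

# Sketch only: hypothetical table and columns, not the real Fez schema.
import MySQLdb  # the MySQL-python driver

conn = MySQLdb.connect(host="localhost", user="fez", passwd="secret", db="fez")
cur = conn.cursor()
query = "grazing weaner goats"
# MATCH ... AGAINST scores each row; ordering by the score gives
# relevance-ranked results. Requires a FULLTEXT index on (title, abstract).
cur.execute(
    """SELECT pid, title, MATCH(title, abstract) AGAINST (%s) AS score
         FROM record_search
        WHERE MATCH(title, abstract) AGAINST (%s)
     ORDER BY score DESC LIMIT 20""",
    (query, query))
for pid, title, score in cur.fetchall():
    print(pid, title, score)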

Fez doesn't yet support faceted searching but when you look at an item it will let you search for related items by clicking on the subject or authors. Faceted search would be a nice-to-have.

Feedback mechanism on unsuccessful searches.

Fez returns a simple message:

Search Results (All Fields:"xsds", Status:"Published"): (0 results found)

Christiaan Kortekaas tells me that there is a new feature coming which will use a thesaurus to retry unsuccessful searches:

If you search for 'happy' and get no results it will say: similar terms "cheerful (3 matches), delirious (2 matches)".

Or if it's misspelled, like 'hapyy', it will say: did you mean 'happy' (5 matches)?
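That feature isn't released yet, but the idea is simple enough to sketch. Here is a minimal 'did you mean' using fuzzy matching from the Python standard library; the planned Fez feature will work from a thesaurus and real term counts, whereas this toy index is made up:

# Toy "did you mean" -- the planned Fez feature is thesaurus-driven;
# this just shows the general idea with fuzzy string matching.
import difflib

term_counts = {"happy": 5, "cheerful": 3, "delirious": 2}

def suggest(query):
    close = difflib.get_close_matches(query, list(term_counts), n=3, cutoff=0.6)
    return ["%s (%d matches)" % (term, term_counts[term]) for term in close]

print(suggest("hapyy"))  # ['happy (5 matches)']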

Content harvesting and publishing

Use of high levels of automation to identify, capture, harvest, reformat and index new content from a variety of sources and agencies.

Fez can publish content via OAI-PMH but currently does not harvest. This is one area where there might need to be custom development after careful requirements gathering; it is hard to comment at this time without a lot more detail about the sources and agencies.
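For a sense of scale, a basic OAI-PMH harvesting loop is not a large piece of work in itself; the cost is in mapping and cleaning what each source returns. A minimal sketch, with a hypothetical endpoint URL:

# Minimal OAI-PMH ListRecords loop -- the endpoint is hypothetical.
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import quote

OAI = "{http://www.openarchives.org/OAI/2.0/}"
base = "http://agency.example.org/oai?verb=ListRecords"
url = base + "&metadataPrefix=oai_dc"

while url:
    tree = ET.parse(urllib.request.urlopen(url))
    for record in tree.iter(OAI + "record"):
        pass  # transform and index each record here
    # OAI-PMH pages its results; follow the resumptionToken until done.
    token = tree.find(".//" + OAI + "resumptionToken")
    url = base + "&resumptionToken=" + quote(token.text) if token is not None and token.text else None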

An alternative architecture where the AANRO portal used something like Apache Solr with an OAI-PMH harvester bolted on might meet this requirement. Content from agencies could either be indexed directly by a portal (via a feed of some kind) or stored in a central Fedora repository and then indexed. We'll have much more detail on alternative architectures by the end of the project.

(See a demonstration site that may not persist for long after November 2007. This site has a harvesting component that fetches data from a number of sources via OAI-PMH).

Individual web-based publishing i.e. input of research information by participants.

Fez has a customisable web-based input system with workflow that is highly configurable. We are confident that this would be usable by agencies contributing to a central database.

I have previously expressed concern with some aspects of the way the web-input system in Fez works. It uses a very complicated web interface to map elements in an XML schema to web forms and stores the results in a database. Quite a lot of the traffic on the Fez mailing list is about people having problems configuring this interface. We recommend budgeting at least two weeks of developer time to ensure that mappings can be constructed for AANRO data.

Another concern is with the way that configuration may be versioned and moved between test and production instances. Fez does have a method for exporting schema mappings from one server and importing them into another. Our recommendation is to set up procedures where mappings are developed in a test environment and deployed to production via a version control system. The production system should be locked down so only developers can change mappings and other configuration (this feature is coming).

Automated and manual web-based batch publishing of documents in a variety of standard formats.

Fez has a web-triggered batch ingest system which can import directories full of content in any format. This is a very simple arrangement: users put data in a directory and then trigger the ingest via the web.

Indexing

Support for records indexed using metadata that meets relevant standards such as AGLS.

Fez can support multiple metadata standards and indexing is highly configurable. See the AANRO metadata guidelines document (forthcoming) for a discussion of a number of metadata schemas and thesauri.

Support for open archives initiative protocol for metadata harvesting (OAI-PMH).

An OAI-PMH feed (but not harvesting) is supported by Fez, and custom feeds can be set up by changing PHP code and Smarty templates.

Support for indexing and search of records using the CABI thesaurus as used in the current system.

Indexing and search are no problem, even without licensing the thesaurus, if the terms are attached to items in the repository. The Fez ingest system can be configured to use different thesauri; this would need to be done by developers.

Content Management

Support for web-based document management, auditing, simple workflow, including research status, publishing rights and ability to edit incorrect content.

See the comments above on web input.

Archive facility to flag data that exceeds a particular age.

This could be done in Fez using a customised workflow to perform a search and then provide an interface to action the results.
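In outline the selection logic is trivial; the work is in wiring it into a Fez workflow with an interface for the repository manager. A sketch of the selection step, with field names invented for illustration:

# Sketch: select records older than a cutoff for archiving.
# In Fez this would be a search step inside a customised workflow.
import datetime

MAX_AGE_YEARS = 10

def flag_for_archive(records):
    cutoff = datetime.date.today().year - MAX_AGE_YEARS
    return [pid for pid, year in records if year < cutoff]

print(flag_for_archive([("AG199600004", 1996), ("AG200700123", 2007)]))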

Content reporting

Analysis and reporting of traffic using an industry standard tool.

Fez is part of a collaborative, standards-based effort to build a benchmark statistics service (BEST). From the wiki:

The focus of this initiative is on designing a Repository Statistics Service that will enhance the type and quality of statistical information about collections (and items) and usage statistics in repositories. The problem to be solved here relates to the strategic need for better, standardised, statistical information about the repository holdings and usage in order to inform a wide range of policy and funding decisions within the overall scholarly communications cycle. More specifically, BEST will design a federated repository statistics service that will facilitate the automated collection and standard analysis of statistical information about the collections and usage of APSR Partner repositories.

Methods and Approaches

  • Adopt and adapt the findings of relevant international projects

  • Scope open-source statistics and web-metric solutions

  • With cooperation of APSR partners and other interested groups propose and trial a 'benchmark set' of statistical measures and filtering approaches appropriate to research repositories.

  • Using a service-oriented approach, design a service for collecting, aggregating, analysing and presenting this statistical information related to repository collections and usage; and

  • Assess the development needs of partner repositories to implement the pilot service.

Activity reports summarising research documents, projects and research status.

Fez reports on generic repository statistics to do with items, authors etc. For special fields such as research status, development would be required: the reports don't make the distinction between generic items and items that might have a particular type (such as project) or status, so customization may be required.

Application maintenance and support

Application development using open standards and industry supported software.

Fez uses standard open source components (PHP and MySQL plus standard libraries) and is itself open source.

A sufficiently scalable application able to cope with projected demand.

Fez can be scaled up using load-balancing servers (according to the developers). It is difficult to comment further on this issue without access to some projections from AANRO, but the Fez demo server running on shared infrastructure is displaying acceptable speed for a test server (about 2 seconds to generate each page).

Meet government standards for web delivered services, i.e. The Guide to Minimum Website Standards (www.agimo.gov.au)

The minimum standards are available from the AGIMO web site. Not all of the standards are relevant, as a lot of them are about content, which is not up to the repository software.

As noted above Fez will be able to meet Metadata requirements for AGLS metadata.

Regarding accessibility, there was not time in this consultancy to undertake a full accessibility audit of Fez. A lot of the basics, such as alternate text on images in the application, appear to be well thought out. We note, though, that the application relies heavily on JavaScript. Browsing with JavaScript disabled makes the interface less usable, and the administrative interface (for example, to add an item) does not function at all without JavaScript. This is not unusual for a modern web application; AANRO should seek advice from an accessibility expert as part of any redevelopment.

Capacity to deliver content for low bandwidth connections to ensure customers can access the system when using low bandwidth.

Fez is an inherently low-bandwidth application, although AANRO content may not be once it has been sourced and added to the repository.

Interoperability with other repositories

Fez will interoperate with other repositories mainly via the use of OAI-PMH. This is a well established protocol that can be used to move metadata around a network of repositories and portals. See the discussion on architecture in the final report.

Web Single Sign On

Web single sign on (WSSO) is required to simplify the administrative process of authentication and authorization and, in the future, to gain access to the resources of other web-based systems.

Fez can use the Shibboleth single sign-on system, but note that participating users will need to belong to the same federation; this is increasingly true for university staff but not for the broader community.

Data migration

We are providing a data migration guide which covers initial conversion for the AANRO metadata guide. We have demonstrated that this approach is Fez-compatible. Note the requirement as written:

Plan and undertake the migration of data from the current AANRO platform and demonstrate a strategy to enable any future migration. Within the context just discussed, it will be the ongoing responsibility of the Successful Respondent to assist in the migration of research content from AANRO associated agency systems to the new database. This will include the integration of exported database content from agencies, migration of the existing InMagic AANRO database, and ongoing harvesting of content from agency websites.

Database management

This is the requirement as written:

Extensive, ongoing content sourcing, harvesting, indexing, publishing and content management by the respondent/vendor.

Application/web development, maintenance, hosting and support

Ongoing development, maintenance, web and database hosting and support.

Software support services

If Fez is selected for use by AANRO, we recommend that the developer and repository maintainer set up a formal relationship with the University of Queensland, or otherwise procure the services of a Fez expert, to supply help desk and defect correction services. The Fez team are very responsive to outside critiques and questions, but they are under no obligation to assist outside developers.

Wednesday, October 24, 2007

Solr demo is up

I have speculated a bit about repository architecture here. As part of that thinking we have put up a proof-of-concept-only portal based on Apache Solr. This uses harvesting and indexing code from the FRED project. As I write, it contains metadata for 130K AANRO records, two copies of the USQ ePrints data, UQ's eSpace, and Monash's repository, which shows up as ARROW (hi Neil D, I know you're following the feed here).

For example, if you click on the facet for one of the repositories then all further searches are just on that repository. Hit [clear all] to go back to showing everything.

It took 188 seconds to slurp up the 6,278 UQ metadata records via the internet (about 33 records a second).

Thanks to Oliver Lucido and David Levy for this one.

This is not a real service. It might disappear at any moment. The interface might change. It might be down for maintenance without warning and if you're reading this after the end of November 2007 it will almost certainly be gone.

Copyright 2007 The University of Southern Queensland

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.

Implementing Author Affiliation in Fez

I wrote here before about author affiliation. How can you keep track of not only the name of an author but their institutional affiliation for a particular item in a repository?

Because it uses MODS metadata by default, the Fez repository can deal with this, but the default configuration for generic items does not include affiliation. So this is a good opportunity to see how Fez configuration works.

In this post I'll look at some of the main issues for AANRO:

  1. Mapping the existing data to MODS using the tools we developed for the RUBRIC project.

  2. Configuring Fez to recognize the affiliation and display it.

  3. Configuring an OAI-PMH feed so that the repository can be part of a federation and be linked into services like Google Scholar.

1 Data mapping

In the AANRO data we're working with, the author field contains the affiliation in brackets. Our first mapping to MODS left this as-is. To fix this I needed to change the stylesheets we use to map the original data, which is in an ad hoc format, into a standard format.

Here's the process in outline. This is what our tech officers did to get the data into Fedora:

  1. Transform the raw data to XML using Excel (it might be better to do this with Python in future; see the sketch after this list).

  2. Split the resulting file into multiple parts, by using an XPath to select an element on which to break:


    python split_xml_into_archive.py   ~/aanro/Data/Xml/AANROPubArchive0_30k.xml  ~/datamigration/aanro1 "//record" DC.xml False

    The result is a series of directories (30,000 of them in this case), each containing a file called rec.xml in a format like this:

    <record>
      <id>38161</id>
      <doctype>Book or Report</doctype>
      <pubyear>1996</pubyear>
      <title>Guidelines on the quality of stormwater and treated wastewater for injection into aquifers for storage and reuse</title>
      <author>Dillon P (Australian Centre for Groundwater Studies)|Pavelic P (Australian Centre for Groundwater Studies)</author>
      ...
    </record>

  3. Transform each metadata record into MARC XML (we could change this process and go straight to MODS, but we are used to working with MARC because that's what our staff used with RUBRIC, and we have some VITAL configuration that uses MARC).


    python xsl_transform.py  temp.xml AANRO_xml_to_marc.xsl MARC.xml aanro1/ False
  4. Transform the MARC XML to MODS using a variant of a stylesheet we downloaded from the Library of Congress. This uses the same script as above, with a different XSL stylesheet:
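    (The stylesheet filename below is illustrative; ours is a local variant of the Library of Congress MARC-to-MODS transform.)

    python xsl_transform.py  temp.xml MARC21slim2MODS.xsl MODS.xml aanro1/ False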

  5. Transform the MARC XML to Dublin Core.

  6. Convert the Dublin Core and MODS metadata into FOXML, which can be ingested directly into Fedora.

  7. Ingest into Fedora using the Fedora client.

  8. Index the items into Fez

    • Click on administration.

    • Click on (Maintenance) Index Objects.

    • Click Discover new Fedora objects.

    • Select the Generic Document Version MODS 1.0 under Document Type.

    • Click Index All Items into Fez.

    While the indexing is happening you can monitor it from the My Fez area; the Background Processes tab will tell you how long it expects the indexing to take.

    (Currently this is very slow, as I noted before, but Christiaan Kortekaas has another trick up his sleeve which should improve indexing by at least an order of magnitude, using the same technique for accessing Fedora records as the Fedora GSearch indexer.)
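As promised in step 1, here is a sketch of what replacing the Excel step with Python might look like: turning delimited rows into simple <record> XML. It assumes tab-delimited input with quoted fields and only covers the first few columns, so the real AANRO layout would need a fuller field list.

# Sketch of step 1 in Python instead of Excel. Assumes tab-delimited,
# quoted input; the column list is abbreviated for illustration.
import csv
from xml.sax.saxutils import escape

FIELDS = ["id", "doctype", "pubyear", "title", "author"]

with open("aanro_raw.txt", newline="") as src, open("records.xml", "w") as out:
    out.write("<records>\n")
    for row in csv.reader(src, delimiter="\t"):
        out.write("<record>\n")
        for name, value in zip(FIELDS, row):
            out.write("  <%s>%s</%s>\n" % (name, escape(value), name))
        out.write("</record>\n")
    out.write("</records>\n")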

I had to make changes to the stylesheet used for step 3. Here's the process I followed:

  1. Check out the RUBRIC toolkit code from Subversion.

  2. Locate the AANRO specific stylesheets.

  3. Find the UTF-X unit test for the AANRO to MARCXML XSLT transformation.

    A test for a simple transformation from <author> in the input data to MARCXML looks like this:


    <utfx:test>
        <utfx:name>Author</utfx:name>
        <utfx:assert-equal>
            <utfx:source validate="no">
                <author>Leng RA</author>
            </utfx:source>
            <utfx:expected validate="no">
                <datafield tag="100" ind1=" " ind2=" ">
                    <subfield code="a">Leng RA</subfield>
                </datafield>
            </utfx:expected>
        </utfx:assert-equal>
    </utfx:test>
  4. Add new tests to deal with affiliations

    I added another test that deals with multiple authors with affiliation.


    <utfx:assert-equal>
        <utfx:source validate="no">
            <author>Dillon P (Australian Centre for Groundwater Studies)|Pavelic P (Australian Centre for Groundwater Studies)</author>
        </utfx:source>
        <utfx:expected validate="no">
            <datafield tag="100" ind1=" " ind2=" ">
                <subfield code="a">Dillon P</subfield>
                <subfield code="u">Australian Centre for Groundwater Studies</subfield>
            </datafield>
            <datafield tag="700" ind1=" " ind2=" ">
                <subfield code="a">Pavelic P</subfield>
                <subfield code="u">Australian Centre for Groundwater Studies X</subfield>
            </datafield>
        </utfx:expected>
    </utfx:assert-equal>

    Since the authors had the same affiliation I added an 'X' to the second one to make sure that the right author gets the right affiliation.

  5. Write the XSLT to pass the test.

    This step took a bit longer than expected and I ended up having to talk to our metadata specialist about MARC indicators and the like, but I managed to get it working without breaking the other parts of the stylesheet, and simplified the code in the process.

  6. Check the code back in to Subversion.

  7. Update the code on our test machine and use it on the AANRO data.

We will include the AANRO code in a release of the RUBRIC toolkit which will be undergoing a tidy-up in early 2008.

2 Configuring Fez metadata entry and display

Once there are records in Fedora, and a Fez index to make them show up in Fez, we have to configure the display and the HTML form used to edit metadata.

The key to this is a system which maps XML Schemas to HTML forms. An XML schema is a description of the structural potential of a type of document, such as a MODS metadata document.

This is a complex and confusing area of the system, which generates a large percentage of the traffic on the Fez users mailing list.

I'm stuck on this at the moment: even with the very helpful Fez developers online in their Campfire chat, I can't figure out how to add an affiliation to an author.

One thing I have tried is to export the mapping as XML, but it's really not a human-editable format; still, it's good to see that the configuration can be exported, kept under version control, and re-imported to a server. (And Christiaan Kortekaas tells me that there will soon be an option to lock down Fez so that even administrative users can't mess around with the mappings on a live system. Good news.)

So we have to leave this configuration to a future AANRO implementation team, if Fez is selected. One big issue outstanding is that quite a bit of the AANRO data consists of project descriptions, not documents. This will require a fair bit of configuration.

3 Configuring OAI-PMH feeds

The web interface on a repository is not the only way that people find things. In fact most traffic is likely to come from elsewhere, particularly large search services like Google Scholar.

To expose metadata to other services you need to configure an OAI-PMH feed.

Fez has highly configurable OAI-PMH templates. Here's a snippet from the ListRecords template:

{assign var="loop_authors" value=$list[i].rek_author}
{section name="a" loop=$loop_authors}
<dc:creator>{$list[i].rek_author[a]|escape:"html"}</dc:creator>
{/section}

Now, I don't speak Smarty templates, but this is obviously a loop (loop_authors) that iterates over the record for an item and spits out the Dublin Core creator element. For AANRO it is possible that the implementer would want to change this to add an affiliation in brackets, as there's no simple way in Dublin Core to associate an author with an affiliation. Yes, this is effectively undoing the work I described above to split the author and their affiliation, but it may turn out to be a useful mechanism in an AANRO federation, to overcome some of the limitations of Dublin Core metadata.

It might look something like this, assuming that there's a key 'f' for affiliation in the author array:

<dc:creator>{$list[i].rek_author[a]|escape:"html"}
({$list[i].rek_author[f]|escape:"html"})
</dc:creator>

4 Conclusion

This brief blog post has covered a few of the major areas of concern for AANRO. The data migration stuff will be expanded in the forthcoming data migration guide for AANRO, and pointers to potential Fez implementation will be in the final report.

I mentioned the Subversion revision control system and UTF-X unit tests for XSLT because they're really helpful in this kind of work, but not everyone we come across in the library systems world uses them.

Copyright 2007 The University of Southern Queensland

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.

Thursday, October 18, 2007

Performance anxiety

We started the AANRO evaluation with a look at performance, because 200,000-odd items is quite a lot to manage. The repository manager will need to think about how long it will take to re-index after a disaster, an upgrade or a configuration change that requires indexes to be rebuilt.

VITAL is now no longer being considered, but there is one final test to report on that front and we have some new data about Fez 2.

VITAL with 130,000 or so records
Before we decided to stop evaluating VITAL, Bron Dye ran a test on a virtual machine with 3GB of memory (more than we had for previous tests).
  • Bron used the Fedora batch tool to ingest 96,000 records into Fedora at a rate of about 40,000 per hour.

  • On Saturday morning, she told VITAL to index the records.

  • On Monday, the VITAL portal would not respond to HTTP requests.

  • The admin page for VITAL did work, so Bron told it to stop indexing, and proceeded to ingest some more Fedora records to take the total up to 130,592.

  • The VITAL server now returns a server error, while the underlying Fedora works fine.

(We have to leave the experiment there, but we still have the data, so if VTLS would like to have a look we'd be happy to help with testing. Could it be that stopping the indexer caused the problem?)
Fez 2
We had some issues installing Fez 2 at USQ, and it was taking some time to sort them out, so the Fez team very kindly offered to let us use their virtual demo server.
The bottom line is that we're seeing about 40,000 records a day indexing into Fez; at this rate there will be 150,000-ish by Saturday morning (it's Thursday afternoon now).
You can see the Fez demo server, but be aware that the AANRO data may disappear at any time, and performance will be slow until it finishes indexing on Saturday morning. As it is the site is usable, with most pages taking 4 or 5 seconds to build, and I'm impressed that the records show up pretty well considering all we did was put them in MODS format. It shows the benefits of standardization.

So on a demo machine, indexing AANRO metadata is a days-long proposition with Fez 2, just as it was with VITAL 3.1.1. This would add an enormous overhead to building and testing a new repository, not to mention disaster recovery, and things would only get slower once we start sourcing full text for items. On the positive side, the Fez team are actively improving the software as we speak, but I'd be looking for an order-of-magnitude improvement before I wanted to work with all the data in one instance; this might be achievable with some more optimization and some beefier hardware, I'm not sure.

The current performance does not rule out using Fez but it does point in the direction of a federated architecture, with smaller repositories feeding a central portal, running something like Apache Solr. For comparison, experiments I did on my MacBook laptop had Solr indexing AANRO metadata at something like 3000 items per minute, but that was without the overhead of having to fetch all the items from Fedora and generate preservation metadata like Fez is doing, so it's not sensible to compare it with a repository. Any federated solution would have to look at using caching if there are performance bottlenecks.

Christiaan Kortekaas points out that UQ's Fez repository has 66,487 items, with 5,261 currently publicly available.

Copyright 2007 The University of Southern Queensland

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.

Wednesday, October 17, 2007

Collections considered ...

(I was going to call this post Collections considered harmful, but then I thought about "Considered Harmful" Essays Considered Harmful (Meyer 2002). So let's just consider collections.)

One of the important decisions in setting up a repository is choosing a collection architecture.

The word collection is used in a variety of ways, so I will define my terms here and try to stick to them.

Hard collection
I'll define a hard collection loosely as something that requires a repository manager to set up container objects into which items, including other collections, need to be added manually. Implementation details vary, but the effect is that somebody has to decide where in the hierarchy of collections to place each item, usually at ingest time. You can generally add things to more than one collection, too. Note that some software has a notion of communities as well as collections, but I think of communities as being a type of hard collection.
Virtual collection
A virtual collection, to my way of thinking, would be a view of the repository that is constructed automatically by the repository software based on some kind of rule. A rule might be: the Arts Faculty thesis collection contains all the items of type 'thesis' where the author is affiliated with the faculty. Machines are pretty good at executing queries of this type.
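Another way to say this: a virtual collection is just a stored query. A small sketch, with field names invented for illustration:

# A virtual collection as a stored query over item metadata.
items = [
    {"pid": "t1", "type": "thesis", "faculty": "Arts"},
    {"pid": "t2", "type": "thesis", "faculty": "Sciences"},
    {"pid": "a1", "type": "article", "faculty": "Arts"},
]

def virtual_collection(items, **rule):
    return [i for i in items if all(i.get(k) == v for k, v in rule.items())]

# The "Arts Faculty thesis collection" is a rule, not a container:
print(virtual_collection(items, type="thesis", faculty="Arts"))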

Some options

Of course, one option is not to use hard collections at all. This is the Eprints way. Items are described purely by their metadata. Take a look at USQ's Eprints. The only collection-like things in the software here are the browse views, which are generated dynamically from the metadata and hence qualify in my terms as virtual collections. Eprints is fairly limited in its browsing but it could be extended, either by adding to the software, or by using a more capable portal instead of the built-in one. (I can see Eprints fitting quite well into a federated AANRO repository).

Another option is a repository which forces you to use hard collections. Fez (currently), Muradora and DSpace require items to be in hard collections.

In DSpace, collections and communities get used for all sorts of organizing; there are examples in the repository at Flinders (a RUBRIC partner).

Fez and Muradora also require you to set up collections, and add objects into them.

  • In Muradora's case this is to make its access control more efficient, I believe. (I have tried to influence the Muradora developers to allow virtual collections built by the computer, not by people, but so far I have not been that persuasive.)

  • For Fez, access control is the main reason the developers added hard collections: you can set up user roles per collection.

    The current version of Fez requires hard collections, but I have been chatting to Christiaan Kortekaas from the Fez team this morning and he says they are going to relax this restriction. Good news, I think.

If you want to have a choice between using collections or not, the other software which was (but is no longer) under consideration for AANRO was VITAL, which allows you to work without collections but enables them if you really want them.

One prominent Eprints advocate, Arthur Sale, put it like this in response to a question on the Fez users mailing list about collections within collections:

What an awful idea. Just like collections themselves. What is needed is customizable views of data.

Arthur told a group of us from the RUBRIC project, when we visited Hobart, that collections (i.e. hard collections) are an unnecessary throwback to the days of physical library collections, where you needed to put each item in one and only one place. I tend to agree with Arthur, but I will try here to summarize some of the pros and cons of hard collections.

What does this mean for AANRO?

I'll look briefly at the reasons AANRO might or might not want to use hard collections, and I'll take it as given that virtual collections are A Good Thing, as this is implicit in the original AANRO requirements.

Why hard collections?
  1. Collections are one way for repository managers to build navigation, creating a hierarchy that can be navigated top-down.

  2. Collections make it easy to express permissions, eg. User X has permission to add items to collection Y.

  3. Collections are a way to express the kinds of relationships you get in complex objects such as a journal, which has volumes, which have issues, which have articles.

I don't think any of these are compelling for AANRO, except maybe the second: some kind of agency-level collection may make administration easier, depending on the software selected.
Why not hard collections?
  1. Depending on implementation, using collections may tie you to a particular repository solution. (It is important to check if the hard work you put into creating and populating collections can be exported in a standard format for re-use).

  2. Large-scale collections can become tedious and error-prone to manage.

    For example how can you be sure that there is not a stray thesis somewhere that has not been added to the thesis collection?

    If you answered "do a search on the metadata", then why don't you just use the metadata to define a virtual collection?

  3. To avoid the lock-in problem (1), you should probably make sure that all the important information needed for a collection is stored redundantly in the item metadata, which brings us back to the question: why not use virtual, machine-generated collections?

  4. Collections create management issues that require repository software designers and managers to really think hard: what if you want to delete a collection that has thousands of member items? Presumably you have to re-house them first into different collections, which could be quite a job. All this complexity makes the software complex, and increases the likelihood of bugs in the code and errors in its application to real-world problems.

For AANRO these are all reasons to seek software that does not require hard collections to work. I'm not saying that further analysis won't show that some level of collection is worth having, just that there is a good chance it may not be needed. If AANRO chose to go with a distributed network of repositories then access control may be made much simpler, with only one workflow per repository, whereas in a more centralized system collections may be required to configure a variety of access controls, because that's the way the software works.

Copyright 2007 The University of Southern Queensland

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.

Monday, October 15, 2007

VITAL out of contention

We've been looking at the VITAL repository software and reported on some testing we did to see how many objects it can handle, but we are going to stop evaluating it for AANRO. (One more report is still to come; I am happy to say that v3.1.1 is MUCH better than previous versions of VITAL.)

There is one key reason for this decision: There is no web interface that allows non-technical users to edit the metadata for an item in Fedora.

Yes, there is an ingest tool for VITAL, known as VALET, which offers simple workflow for new items. But once an item is in Fedora it cannot be edited with VALET; a technically proficient user must use an XML editor to change metadata, a far from easy or intuitive process.

Obviously we knew this going in to the evaluation, but we looked at VITAL in case there was no other Fedora solution that would scale, and we reported our concerns to VTLS to see if they would respond with changes to VALET or VITAL. I am a member of the ARROW Developers Group, and we have made the point before that a full web interface for editing metadata is a must. I even mentioned this to Vinod Chachra (VTLS President and CEO) at the ARROW community day, who told me that they built VITAL for environments where using an XML editor for metadata maintenance was appropriate.

(Note also that there has also been no response from VTLS to the questions I asked ten days ago about their product strategy.)

Given that Fez has a web interface and its scalability has improved dramatically with version 2, we now turn our attention to Fez. If VTLS responds with a web interface for VITAL in the next couple of weeks I'll report it here, and I'd be happy to post a feature list for the forthcoming version. Even if Fez cannot handle hundreds of thousands of items comfortably, there is still scope to use a number of smaller regional or subject-based Fez repositories, with an aggregated search.

To finish on a more positive note, it's appropriate to talk about the good bits of VITAL. In my opinion the best feature in terms of usability is the indexing; it is possible to run VITAL without having to pre-organize items into collections (unlike Fez), instead you can define virtual collections by defining indexes based on metadata. VITAL also promises to use standard Fedora access controls via work done by the Muradora team, but I have not seen this in action yet.

Copyright 2007 The University of Southern Queensland

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.

Thursday, October 4, 2007

VTLS Visualizer issues

One of the things to consider when picking a repository is the roadmap for the application. Here's a post I sent to the ARROW group (Australian Research Repositories Online to the World); on one level it's a VITAL users group, but it's also a lot more than that.

In summary, I'm saying that the new VTLS Visualizer product looks like a really good idea, but I am concerned about the upgrade path between current VITAL repositories and the new approach used by Visualizer.

(From an AANRO perspective it's worth noting that Visualizer does nothing to address the lack of a web-based interface where non-technical users can input and edit Fedora objects.)

At the end are some as-yet unanswered questions for VTLS, the company behind VITAL.

At the ARROW community day last week Vinod Chachra demonstrated Visualizer - a new portal product which I believe is based on Apache Solr, which in turn is based on the Apache Lucene search library. With more than a million records it performed very well, at least in the demo.

Like VITAL, Visualizer offers faceted searching. I was intrigued to see that it also has some kind of ontology feature: it was able to work out that Scotland is part of the UK (or was it Europe?). I forget what Vinod called this, but it looks like you can set up hierarchies for various fields in the index and have the portal show the relationships. We have discussed this very feature at USQ as part of our work on the FRED project and a proof-of-concept Solr portal that David Levy and Oliver Lucido are writing, and we think that it will perform better than the RDF collections-based approach used in the current VITAL. The other advantage of this technology is that setting up a new subject-based portal, or restricting access, is as simple as appending a query to every search.

Vinod confirmed at the meeting that Visualizer will eventually become the portal application for VITAL so it will be important to start planning for this now. I also got the impression from talking to him that some of the same technologies will be used in VITAL, starting with the forthcoming version 4, which is quite different from what Heather told us at the last ARROW development group.

Here are some questions for VTLS:

  1. If we use Communities and Collections in VITAL will these transfer to VITAL 4 & Visualizer? (RUBRIC central recommends NOT using collections, by the way)

  2. Will the new Consortium (portal) support in VITAL 4 transfer to Visualizer?

  3. Can the ontologies used by Visualizer be used as authority control for VITAL/VALET?

  4. What kinds of indexing technologies will be used by VITAL 4 (RDF? Lucene? A proprietary database?)

It is important for the user community to ask these questions now. Remember that when VITAL 3 was released it was several weeks before VTLS released software to migrate VITAL 2 repositories to VITAL 3. A well tested migration path is an essential part of any product roadmap.

(In my judgment Visualizer will provide a much more flexible and scalable platform for repository portals than the current VITAL architecture, so I support this development if the transition can be managed effectively.)

Copyright 2007 The University of Southern Queensland

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.

The Affiliation Issue

Let's talk about my favourite repository issue, as seen on my blog: author affiliation. How do you describe the institutional affiliation of each author on an item in your repository? In some repository software it's basically impossible (I'm looking at you, Eprints and DSpace). In Fedora it's easy if you use a metadata schema that supports it, but once you start talking about harvesting via OAI-PMH (which by default uses flat Dublin Core metadata) it becomes a problem again.

Here's another quick trip through some of the issues that will face AANRO.

Here's a bit of raw AANRO data, a record for a single article in the format we received from AANRO-central.

"AG199600004" "Journal article" "1996" "Grazing and anthelmintic treatments to increase growth of Cashmere and Angora weaner goats" "Robertson JA (Victoria University of Technology, Centre for Bioprocessing and Food Technology, Melbourne)|Ritar AJ (Tasmanian Department of Primary Industry and Fisheries, Marine Research Laboratories, Taroona)|Evans G (University of Sydney, Department of Animal Science)" "Rural Industries Research and Development Corporation" "Australian Veterinary Journal, 1996-09, 74 (3), p246-248, 1 table, 7 refs, ISSN 0005-0423." "0005-0423" "A study was undertaken to examine whether growth of kids after weaning and internal parasitism were affected by grazing on pasture or fodder crop and by an initial treatment of ivermectin at weaning. Experiments were conducted at Cressy Tas on 21 Angora and 94 Cashmere kids, grazing forage rape, uncontaminated pastures, or pastures contaminated with nematodes, and treated with ivermectin anthelmintic at weaning or not treated. The results demonstrated the benefit to weaner goats of grazing a fodder crop of rape compared with pasture treatments. Grazing a crop of high nutritional value maintained a weight advantage and there were no production losses due to nematode infection. It was shown that careful grazing management and strategic use of anthelmintics reduces the reliance on and the cost of anthelmintics and probably decreases selection pressure on parasites for anthelmintic resistance while increasing rates of growth for weaner goats." "Goats|Kids|Cashmere|Angora|Animal parasitic nematodes|Anthelmintics|Ivermectin|Pest control|Grazing|Pastures|Rape|Fodder crops|Growth|Liveweight" "Cressy Tas|Tamar River (III18)|AER (1)"

Look at the info we have about one of the authors:

Robertson JA (Victoria University of Technology, Centre for Bioprocessing and Food Technology, Melbourne)

The author's name is followed by their affiliation. The first step in our transformation process is to put this into simple XML:

<author>Robertson JA (Victoria University of Technology, Centre for Bioprocessing and Food Technology, Melbourne)</author>
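The splitting itself is mechanical. Here is a sketch of the pattern, inferred from the sample record above (names separated by '|', affiliation in trailing brackets); real records may vary:

# Split the AANRO author field into name and affiliation.
# Pattern inferred from the sample data; real records may vary.
import re

AUTHOR = re.compile(r"^(?P<name>[^(]+?)\s*(?:\((?P<affiliation>.*)\))?$")

field = ("Robertson JA (Victoria University of Technology, Centre for "
         "Bioprocessing and Food Technology, Melbourne)|Evans G "
         "(University of Sydney, Department of Animal Science)")

for part in field.split("|"):
    m = AUTHOR.match(part.strip())
    print(m.group("name"), "--", m.group("affiliation"))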

Then we transform it via a few more steps to this MODS (confession: we actually still have work to do on the process):

<name>
  <namePart type="personal">Robertson JA</namePart>
  <role>
    <roleTerm type="text">author</roleTerm>
  </role>
  <affiliation>Victoria University of Technology, Centre for Bioprocessing and Food Technology, Melbourne</affiliation>
</name>

This says that J A Robertson authored this item while affiliated with the Victoria University of Technology, Centre for Bioprocessing and Food Technology, Melbourne. The same person might have multiple affiliations over time, though, and the MODS schema allows us to keep track of this.

(I picked this item at random but it serves to illustrate another point: that uni has changed its name to Victoria University. It would be nice to preserve the name of the institution as it used to be, but also to be able to show that it's the same place. More on this currently impossible dream soon.)

As noted, it's hard to deal with author affiliation nicely in Eprints and DSpace repositories. Our USQ Eprints team is working on the problem, though.

For non-federated repositories, VITAL and Fez can both handle metadata like this, although in VITAL there is no web-based editing of Fedora items: you have to load the MODS metadata into an XML editor, which makes it a task that's not for general users. We'll be exploring how this works in Fez real soon now, as soon as we can get version 2 to run.

There are two other issues, though.

  1. Harvesting via OAI-PMH.

    Now, it is possible to harvest any metadata stream, so in a potential federated architecture some AANRO repositories might be able to supply MODS (or similar) metadata to harvesters.

    But there are other discovery services that AANRO will want to work with, and by default they just use Dublin Core metadata. To play with those you need to walk a delicate line between jamming metadata into the available transport mechanism to get it out there, and the user experience when people try to use the harvesting service to find things and discover that different services have used or misused different metadata fields in different ways.

    And given that some otherwise very good software can't handle nested metadata like MODS we need to think about how you might support AANRO contributing organizations who might be running such software.

    One way to get the information out there would be to do something like this:

    <dc:creator>Robertson JA (Victoria University of Technology, Centre for Bioprocessing and Food Technology, Melbourne)</dc:creator>

    I'm going to see what Neil Godfrey, USQ's metadata maven, has to say about this approach. Comment away, Neil!

  2. Indexing in a portal

    VITAL and Fez both have configuration that will let you search by affiliation, no problem, but what about in a federated architecture?

    Provided the affiliation information is in the OAI-PMH feed then it can be used to build an interface where you can search or browse for all items associated with a particular institution; an indexer could easily find the affiliation information that's in the brackets in the example above.

And finally, relevant to both of the above issues: there was an interesting presentation from Alison Dellit at the recent ARROW community day, where she talked about how the National Library of Australia is building harvesting and search services that can adapt to the different ways that repositories serve up data. It looks like a promising approach to me.

Copyright 2007 The University of Southern Queensland

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.