Thursday, November 1, 2007

AANRO data in Muradora

Last week Namchi Nguyen (Chi) got in contact with us to show that the Muradora team had added an import utility so that Muradora can index items sitting in an existing Fedora repository. (Fez can do this, and after some prompting from the ARROW and RUBRIC communities VTLS added this feature to Vital as well.)

The Muradora team have put up a demo server containing AANRO data and some other stuff. Remember this is someone else's server and the data may not be in there for long.

I had a conversation with Chi this week about this new version. It looks promising and it seems that it would meet most of the AANRO requirements, although with its focus very much on access control it may not be a perfect match for AANRO's open access data.

There are a couple of issues that need to be resolved (I've already mentioned this stuff to Chi):

  1. At present, when you do a search, the default behaviour is to report the total number of results and only filter out the results that you are not meant to be able to see when they are displayed. From what I know of institutional repositories this is not acceptable, as even knowing that someone is working on a particular topic may be a problem where intellectual property is involved.

    I'm not sure that this would be a huge problem for AANRO, but in the current version it certainly compromises usability in the general case.

    Chi tells me they will fix this soon by having a 'guest' mode that only searches open access objects.

  2. The current interface is not very hypertextual: once you get to a metadata page there are no links to see other things about the same subject or by the same author. I'm sure this would be simple to add.

  3. There is nothing in the demo to show a subject hierarchy or ontology at work, unless you count collections (we don't have that on our Solr demo either, come to think of it).

  4. There are a number of bugs and little tidy-ups that are required, and anyone taking on this software would have to be prepared to work with something where they would be amongst the first to use it.

  5. The current demo does not have its metadata editing configured to work with author affiliation in MODS. I've confirmed with Chi that this is just a matter of writing some more XForms code, which can be tricky but there are others working on the same thing (also without affiliation, though).
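To make the first issue concrete, here is a minimal sketch of the difference between filtering at display time and a 'guest' mode that only searches open access objects. This is not Muradora's code: the `Record` class, the `is_permitted` check (a stand-in for a per-object policy evaluation) and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    title: str
    open_access: bool


def is_permitted(user: str, record: Record) -> bool:
    # Hypothetical stand-in for a per-object access policy check.
    return record.open_access or user == "staff"


def search_post_filter(index, term, user):
    """Count every hit first, then hide restricted ones at display time."""
    hits = [r for r in index if term in r.title.lower()]
    reported_total = len(hits)  # counted before any access check
    visible = [r for r in hits if is_permitted(user, r)]
    return reported_total, visible


def search_guest_mode(index, term):
    """Only search open access objects, so the count reveals nothing extra."""
    hits = [r for r in index if r.open_access and term in r.title.lower()]
    return len(hits), hits


# Invented sample data, for illustration only.
INDEX = [
    Record("Soil salinity survey", True),
    Record("Salinity trial, unpublished", False),
    Record("Wheat rust review", True),
]

total, visible = search_post_filter(INDEX, "salinity", "guest")
guest_total, guest_hits = search_guest_mode(INDEX, "salinity")
print(total, len(visible))          # the reported count exceeds what is shown
print(guest_total, len(guest_hits))  # the count matches what is shown
```

In the first function the guest is told there are more results than they can see, which is exactly the leak described in item 1. The second never touches restricted objects, at the cost of only being useful for anonymous users.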

Unfortunately we've run out of time on this evaluation, as the draft final report went off this week, but I'll keep an eye on Muradora.

(Regarding performance, from Chi's email I gather that the index time for 140,000 records, including adding all records to an AANRO collection, was 16-18 hours on a machine with 4GB RAM, 2x 1.8GHz Opteron CPUs (each with 2 cores) and a 320GB SATA hard disk. This time could be improved by doing a bit more work on the data, and re-indexing will not take that long. Chi, if you want to comment here, please do so.)
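As a rough back-of-envelope check on those figures, the snippet below converts the reported 16-18 hours for 140,000 records into a per-record rate. The function name `indexing_rate` is my own; the only inputs are the numbers quoted above.

```python
def indexing_rate(records: int, hours: float):
    """Return (records per second, seconds per record) for an indexing run."""
    seconds = hours * 3600
    return records / seconds, seconds / records


RECORDS = 140_000  # record count quoted above

for hours in (16, 18):
    per_second, per_record = indexing_rate(RECORDS, hours)
    print(f"{hours} h: {per_second:.2f} records/s, {per_record:.2f} s/record")
```

That works out to a little over two records a second, or a bit under half a second per record.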

Copyright 2007 The University of Southern Queensland

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.

2 comments:

Chi Nguyen said...

Thanks Peter for having a look over Muradora. Our focus is indeed on more complex access control requirements, otherwise there are already a whole bunch of repository products out there which do pretty much the same thing.

Fedora themselves recognized the need for this flexibility, and that's why they have XACML support in their latest version. However, like us they will have the same problem with the search function. That is, if a search query returns 50000 results, then they will need to execute 50000 XACML evaluations to know which results are to be displayed. This is clearly not possible for timely display and I guess that's why Fedora have not supported search filtering. Ultimately it's a tradeoff between flexibility and speed.

However, as you said, for most people out there who just want to use a repository, they probably won't understand any of this (or care). That's why we are working on a solution that can get the best of both worlds.

Your other points (2-4) are quite valid and these problems should be ironed out soon.

Finally, regarding your point about the author affiliation: it's a tricky one for us, because as you know there are lots of ways, according to the MODS schema, to do the "author" element; way too many to design a form with sufficient script logic for all of them. That's why, if there's an agreed way for a given repository to handle the author element, then one can design the form for that.

Chi Nguyen said...

I forgot to mention a few things about the performance:

* Even though the system was running on a 2x dual-core CPU (i.e. like a 4xCPU system), it only used 1 core since everything (fedora, muradora, fedoragsearch etc.) was running inside the one tomcat container. It would be much faster if we used a separate container for each, since muradora, gsearch indexing, and fedora could then all be running on different CPUs.

* We didn't try to optimize the system very much. Debug logging was enabled for everything, and we kept things like the authorization filter as well as the gsearch callback by Fedora enabled. If we disabled all of these, the total time would be reduced by a fair bit.