Wednesday, October 17, 2007

Collections considered ...

(I was going to call this post Collections considered harmful, but then I thought about "Considered Harmful" Essays Considered Harmful (Meyer 2002). So let's just consider collections.)

One of the important decisions in setting up a repository is choosing a collection architecture.

The word collection is used in a variety of ways, so I will define my terms here and try to stick to them.

Hard collection
I'll define a hard collection loosely as something that requires a repository manager to set up container objects into which items, including other collections, need to be added manually. Implementation details vary, but the effect is that somebody has to decide where in the hierarchy of collections to place each item, usually at ingest time. You can generally add things to more than one collection, too. Note that some software has a notion of communities as well as collections, but I think of communities as being a type of hard collection.
Virtual collection
A virtual collection, to my way of thinking, would be a view of the repository that is constructed automatically by the repository software based on some kind of rule. A rule might be: the Arts Faculty thesis collection contains all the items of type 'thesis' where the author is affiliated with the faculty. Machines are pretty good at executing queries of this type.
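
A rule like the Arts Faculty one can be sketched in a few lines of Python. This is only an illustration, assuming items are plain dictionaries of metadata; the field names (`type`, `affiliation`) and the sample records are hypothetical and not taken from any particular repository software.

```python
# Hypothetical item records: each item is just a dictionary of metadata.
items = [
    {"id": "item:1", "type": "thesis", "affiliation": "Arts"},
    {"id": "item:2", "type": "article", "affiliation": "Arts"},
    {"id": "item:3", "type": "thesis", "affiliation": "Sciences"},
    {"id": "item:4", "type": "thesis", "affiliation": "Arts"},
]


def virtual_collection(items, **criteria):
    """Return every item whose metadata matches all of the given criteria."""
    return [
        item for item in items
        if all(item.get(field) == value for field, value in criteria.items())
    ]


# The "Arts Faculty thesis collection" is a stored query, not a container.
arts_theses = virtual_collection(items, type="thesis", affiliation="Arts")
arts_thesis_ids = [item["id"] for item in arts_theses]
```

Nobody has to file an item anywhere: it shows up in the collection as soon as its metadata matches the rule.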

Some options

Of course, one option is not to use hard collections at all. This is the Eprints way. Items are described purely by their metadata. Take a look at USQ's Eprints. The only collection-like things in the software here are the browse views, which are generated dynamically from the metadata and hence qualify in my terms as virtual collections. Eprints is fairly limited in its browsing but it could be extended, either by adding to the software, or by using a more capable portal instead of the built-in one. (I can see Eprints fitting quite well into a federated AANRO repository).
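
To make the idea of a metadata-driven browse view concrete, here is a rough sketch in Python. It is not how Eprints implements its views; the function name `browse_view`, the field names and the records are all assumptions made for illustration.

```python
from collections import defaultdict


def browse_view(items, field):
    """Group item ids by the value of one metadata field."""
    view = defaultdict(list)
    for item in items:
        # Items missing the field are grouped under a placeholder heading.
        view[item.get(field, "(unknown)")].append(item["id"])
    return dict(view)


# Hypothetical records; the field names are illustrative only.
items = [
    {"id": "item:1", "year": 2006, "subject": "History"},
    {"id": "item:2", "year": 2007, "subject": "History"},
    {"id": "item:3", "year": 2007},
]

by_year = browse_view(items, "year")
by_subject = browse_view(items, "subject")
```

The same items appear in as many views as there are metadata fields worth browsing by, without anyone placing them there.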

Another option is a repository which forces you to use hard collections. Fez (currently), Muradora and DSpace require items to be in hard collections.

In DSpace collections and communities get used for all sorts of organizing. Here are a few examples from Flinders (a RUBRIC partner):

Fez and Muradora also require you to set up collections, and add objects into them.

  • In Muradora's case this is to make its access control more efficient, I believe. (I have tried to influence the Muradora developers to allow virtual collections built by the computer, not by people, but so far I have not been that persuasive.)

  • For Fez, access control is the main reason the developers added hard collections: you can set up user roles per collection.

    The current version of Fez requires hard collections, but I have been chatting to Christiaan Kortekaas from the Fez team this morning and he says that they are going to relax this restriction. Good news, I think.

If you want to have a choice between using collections or not, the other software which was (but is no longer) under consideration for AANRO was VITAL, which allows you to work without collections but enables them if you really want them.

One prominent Eprints advocate, Arthur Sale, put it like this in response to a question on the Fez users mailing list about collections within collections:

What an awful idea. Just like collections themselves. What is needed is customizable views of data.

Arthur told a group of us from the RUBRIC project, when we visited Hobart, that collections (i.e. hard collections) are an unnecessary throwback to the days of physical library collections where you needed to put each item in one and only one place. I tend to agree with Arthur, but I will try here to summarize some of the pros and cons of hard collections.

What does this mean for AANRO?

I'll look briefly at the reasons AANRO might or might not want to use hard collections, and I'll take it as given that virtual collections are A Good Thing, as this is implicit in the original AANRO requirements.

Why hard collections?
  1. Collections are one way for repository managers to build navigation, creating a hierarchy that can be navigated top-down.

  2. Collections make it easy to express permissions, e.g. User X has permission to add items to collection Y.

  3. Collections are a way to express the kinds of relationships you get in complex objects such as a journal, which has volumes, which have issues, which have articles.

I don't think any of these are compelling for AANRO except maybe the second: some kind of agency-level collection may make administration easier, depending on the software selected.
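
The permissions case (reason 2) is the easiest to picture in code. The following Python sketch shows one way a grant such as "User X has permission to add items to collection Y" could be checked. The names, the grant table and the idea that grants are inherited down the collection hierarchy are all assumptions for illustration, not a description of how Fez, Muradora or DSpace work.

```python
# Hypothetical grants: (user, action) -> collections the grant applies to.
grants = {
    ("user_x", "add"): {"collection_y"},
}

# Hypothetical hard-collection hierarchy: child -> parent.
parents = {
    "collection_y": "agency_a",
    "theses": "collection_y",
}


def can(user, action, collection):
    """Check a grant on the collection or on any collection that contains it."""
    allowed = grants.get((user, action), set())
    while collection is not None:
        if collection in allowed:
            return True
        # Walk up to the containing collection, if there is one.
        collection = parents.get(collection)
    return False
```

One grant at the agency level would cover everything underneath it, which is the administrative convenience an agency-level collection might offer.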
Why not hard collections?
  1. Depending on implementation, using collections may tie you to a particular repository solution. (It is important to check if the hard work you put into creating and populating collections can be exported in a standard format for re-use).

  2. Large-scale collections can become tedious and error-prone to manage.

    For example how can you be sure that there is not a stray thesis somewhere that has not been added to the thesis collection?

    If you answered "do a search on the metadata", then why don't you just use the metadata to define a virtual collection?

  3. To avoid the lock-in problem (1) you should probably make sure that all the important information needed for a collection is stored redundantly in each item's metadata, which brings us back to the question: why not use virtual, machine-generated collections?

  4. Collections create management issues that require repository software designers and managers to think really hard. What if you want to delete a collection but it has thousands of member items? Presumably you have to re-house them first into different collections, which could be quite a job. All these complexities make the software complex, and increase the likelihood of bugs in the code and errors in its application to real-world problems.
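
The stray-thesis question in point 2 can be made concrete with a small Python sketch. It assumes a hand-maintained membership list for a hard thesis collection alongside per-item metadata; the identifiers and field names are hypothetical.

```python
# Hypothetical data: metadata per item, and the hand-maintained membership
# of a hard "theses" collection.
metadata = {
    "item:1": {"type": "thesis"},
    "item:2": {"type": "article"},
    "item:3": {"type": "thesis"},
    "item:4": {"type": "article"},
}
hard_theses = {"item:1", "item:4"}  # item:3 was never filed; item:4 was misfiled

# The metadata query that would define the equivalent virtual collection.
virtual_theses = {
    item_id for item_id, record in metadata.items()
    if record.get("type") == "thesis"
}

# Theses that are missing from the hard collection.
stray = sorted(virtual_theses - hard_theses)
# Members of the hard collection that are not theses.
misfiled = sorted(hard_theses - virtual_theses)
```

The audit needs the metadata query anyway, and that query on its own already is the collection.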

For AANRO these are all reasons to seek software that does not require hard collections to work. I'm not saying that further analysis won't show that some level of collection is worth having, just that there is a good chance that it may not be needed. If AANRO chose to go with a distributed network of repositories, then access control could be much simpler, with only one workflow per repository, whereas in a more centralized system collections may be required to configure a variety of access controls because that's the way the software works.

Copyright 2007 The University of Southern Queensland

Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.

4 comments:

matt said...
This comment has been removed by the author.
matt said...

I've replied on my blog

Christiaan Kortekaas said...

Hi Peter,

Collections are now optional in the latest Fez SVN trunk commit.

http://dev-repo.library.uq.edu.au/websvn/listing.php?repname=fez&path=%2F&rev=1098&sc=1

An example of one of the AANRO objects I tested this with is here: https://vmdev-repo.library.uq.edu.au/view.php?pid=UQ:46192

I'll wait until Saturday to update the Fez demo site with this functionality.

Matt's blog entry is worthy of some thought and I tend to agree with his idea.

Cheers,

Christiaan Kortekaas
Lead Fez Developer

Chi Nguyen said...

Some comments about your dot points for reasons "against hard collections"

1) The use of collections is never tied to any particular repository (at least not in our way). As far as Fedora is concerned, the collection information is just metadata for that object. If a different GUI wants to make use of that metadata for tree-view display or access control etc., then that's great. If it wants to ignore it, then that is also ok. It doesn't tie you into any particular repository system. I would imagine a Fedora object created in Muradora with a RELS-EXT datastream can still be used by something like Fez. If Fez needs that information to be transformed to its "collection" metadata information, then that should be ok too since the metadata schema is freely available. The tie-in problem only occurs if the metadata is proprietary.

2) Again this depends on your requirements. If your system needs better access control (and I don't just mean public access), then not having hard collections means managing access control would be a nightmare. If you don't need such extensive access control, then you really don't need the hard collection and can get away with using smart-folders (which Muradora has).

4) In my opinion, the main reason why you would use hard collections is to have extensive access control mechanisms for your system, i.e. the ability to say that certain people or groups/roles can do certain things to collections, objects and datastreams. Without hard collections, managing these access control requirements would be very difficult. So if you don't need extensive access control, then you can just get away with "smart folders".