(I was going to call this post Collections considered harmful, but then I thought about "Considered Harmful" Essays Considered Harmful (Meyer 2002). So let's just consider collections.)
One of the important decisions in setting up a repository is choosing a collection architecture.
The word collection is used in a variety of ways, so I will define my terms here and try to stick to them.
- Hard collection
- I'll define a hard collection loosely as something that requires a repository manager to set up container objects into which items, including other collections, need to be added manually. Implementation details vary, but the effect is that somebody has to decide where in the hierarchy of collections to place each item, usually at ingest time. You can generally add things to more than one collection, too. Note that some software has a notion of communities as well as collections, but I think of communities as being a type of hard collection.
- Virtual collection
- A virtual collection, to my way of thinking, would be a view of the repository that is constructed automatically by the repository software based on some kind of rule. A rule might be “the Arts Faculty thesis collection contains all the items of type 'thesis' where the author is affiliated with the faculty”. Machines are pretty good at executing queries of this type.
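To make the idea concrete, here is a minimal Python sketch of a virtual collection as a rule over metadata. The `Item` fields, the `arts_faculty_theses` rule and the sample records are all made up for illustration; none of the repository packages discussed here necessarily works this way.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    """A repository item described purely by its metadata (hypothetical fields)."""
    title: str
    item_type: str
    faculty: str


# A rule is simply a yes/no test applied to an item's metadata.
Rule = Callable[[Item], bool]


def arts_faculty_theses(item: Item) -> bool:
    """Rule: items of type 'thesis' whose author is affiliated with the Arts Faculty."""
    return item.item_type == "thesis" and item.faculty == "Arts"


def virtual_collection(items: List[Item], rule: Rule) -> List[Item]:
    """Build the collection on demand; nobody files items into it by hand."""
    return [item for item in items if rule(item)]


# Illustrative records only.
repository = [
    Item("Thesis A", "thesis", "Arts"),
    Item("Thesis B", "thesis", "Sciences"),
    Item("Article C", "article", "Arts"),
]

for item in virtual_collection(repository, arts_faculty_theses):
    print(item.title)
```

The point of the sketch is that a new thesis with the right metadata appears in the collection automatically, with no ingest-time decision about where to put it.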
Of course, one option is not to use hard collections at all. This is the Eprints way. Items are described purely by their metadata. Take a look at USQ's Eprints. The only collection-like things in the software here are the browse views, which are generated dynamically from the metadata and hence qualify in my terms as virtual collections. Eprints is fairly limited in its browsing but it could be extended, either by adding to the software, or by using a more capable portal instead of the built-in one. (I can see Eprints fitting quite well into a federated AANRO repository).
Another option is a repository which forces you to use hard collections. Fez (currently), Muradora and DSpace require items to be in hard collections.
In DSpace collections and communities get used for all sorts of organizing. Here are a few examples from Flinders (a RUBRIC partner):
Collections each containing an issue of a journal, which is itself a community.
A community containing University of Adelaide Theatre programs.
Fez and Muradora also require you to set up collections, and add objects into them.
In Muradora's case this is to make its access control more efficient, I believe. (I have tried to influence the Muradora developers to allow virtual collections built by the computer, not by people, but so far I have not been that persuasive.)
For Fez, access control is the main reason the developers added hard collections – you can set up user roles per collection.
The current version of Fez requires hard collections, but I have been chatting to Christiaan Kortekaass from the Fez team this morning and he says that they are going to relax this restriction. Good news, I think.
If you want to have a choice between using collections or not, the other software which was (but is no longer) under consideration for AANRO was VITAL, which allows you to work without collections but enables them if you really want them.
One prominent Eprints advocate, Arthur Sale, put it like this in response to a question on the Fez users mailing list about collections within collections:
What an awful idea. Just like collections themselves. What is needed is customizable “views” of data.
Arthur told a group of us from the RUBRIC project, when we visited Hobart, that collections (i.e. hard collections) are an unnecessary throwback to the days of physical library collections where you needed to put each item in one and only one place. I tend to agree with Arthur, but I will try here to summarize some of the pros and cons of hard collections.
I'll look briefly at the reasons AANRO might or might not want to use hard collections, and I'll take it as given that virtual collections are A Good Thing, as this is implicit in the original AANRO requirements.
- Why hard collections?
Collections are one way for repository managers to build navigation, creating a hierarchy that can be navigated top-down.
Collections make it easy to express permissions, eg. “User X has permission to add items to collection Y”.
Collections are a way to express the kinds of relationships you get in complex objects such as a journal, which has volumes, which have issues, which have articles.
- I don't think any of these are compelling for AANRO except maybe the second – some kind of agency-level collection may make administration easier, depending on the software selected.
- Why not hard collections?
Depending on implementation, using collections may tie you to a particular repository solution. (It is important to check if the hard work you put into creating and populating collections can be exported in a standard format for re-use).
Large-scale collections can become tedious and error-prone to manage.
For example, how can you be sure that there is not a stray thesis somewhere that has not been added to the thesis collection?
If you answered “do a search on the metadata” then why don't you just use the metadata to define a virtual collection?
To avoid the lock-in problem (the first point above) you should probably make sure that all the important information needed for a collection is also stored redundantly in each item's metadata, which brings us back to the question: why not use virtual, machine-generated collections?
Collections create management issues that require repository software designers and managers to really think hard – what if you want to delete a collection but it has thousands of member items? Presumably you have to re-house them first into different collections, which could be quite a job. All of this makes the software more complex, and increases the likelihood of bugs in the code and of errors in its application to real-world problems.
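The stray-thesis question can be sketched in a few lines of Python. This is only an illustration under assumed data: the dictionary keys (`id`, `type`, `collections`) and the sample items are hypothetical, not any real repository's data model.

```python
# Each item carries descriptive metadata plus the hard collections
# somebody filed it into by hand (illustrative records only).
items = [
    {"id": "item-1", "type": "thesis", "collections": {"theses"}},
    {"id": "item-2", "type": "thesis", "collections": {"arts"}},  # never filed under "theses"
    {"id": "item-3", "type": "article", "collections": {"arts"}},
]


def stray_items(items, item_type, collection):
    """Return ids of items whose metadata says they belong but which were never filed."""
    return [
        item["id"]
        for item in items
        if item["type"] == item_type and collection not in item["collections"]
    ]


print(stray_items(items, "thesis", "theses"))  # ['item-2']
```

Notice that the audit already needs the metadata test `type == "thesis"` to find the strays; that same test is all a virtual thesis collection would need in the first place.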
- For AANRO these are all reasons to seek software that does not require hard collections to work. I'm not saying that further analysis won't show that some level of collection is worth having, just that there is a good chance that it may not be needed. If AANRO chose to go with a distributed network of repositories then access control could be made much simpler, with only one workflow per repository, whereas in a more centralized system collections may be required to configure a variety of access controls because that's the way the software works.
Copyright 2007 The University of Southern Queensland
Content license: Creative Commons Attribution-ShareAlike 2.5 Australia.