One facility that the ZOOM specification could choose to mandate is the ability for a Connection object to behave as a front-end for a bundle of multiple physical connections to different servers. With such a setup, a search submitted to the Connection would be broadcast to all the servers, and the Connection object would seamlessly integrate the results returned by those servers into a single unified result set.
This is very appealing functionality, but it would place an unacceptably high burden of complexity on ZOOM implementors. In this email, Sebastian Hammer <quinn@indexdata.dk> explains eloquently why ZOOM should not make this requirement.
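To make the proposal concrete, here is a minimal sketch, in Python, of roughly what such a facade might look like to an application. All of the names (MultiConnection, MergedResultSet, and the search(), size() and record() methods) are hypothetical illustrations rather than anything mandated by ZOOM or provided by any particular binding; the sketch only shows where the complexity would have to live.

    # Hypothetical sketch only: the "Connection" objects and their search()
    # and size() methods stand in for whatever a concrete ZOOM binding
    # provides; ZOOM itself defines no MultiConnection or MergedResultSet.

    class MergedResultSet:
        def __init__(self, subsets):
            self.subsets = subsets          # one per-server result-set handle

        def size(self):
            # Correct only if the servers hold disjoint records; once the
            # same item can appear on several servers, the true merged count
            # cannot be known until the records are fetched and compared.
            return sum(s.size() for s in self.subsets)

        def record(self, i):
            # The i-th record of the notional sorted, deduplicated sequence.
            # This is where all of the difficulty discussed below is hiding.
            raise NotImplementedError

    class MultiConnection:
        """Facade presenting several physical connections as one."""

        def __init__(self, connections):
            self.connections = connections

        def search(self, query):
            # A real implementation would submit these searches in parallel.
            return MergedResultSet([c.search(query) for c in self.connections])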
Date: Wed, 07 Nov 2001 22:06:02 +0100
From: Sebastian Hammer <quinn@indexdata.dk>
To: Mike Taylor <mike@tecc.co.uk>, zoom@indexdata.dk
Subject: Re: [ZOOM] Re: Scope of ZOOM (was Hello etc)

> > [ZOOM] shouldn't try to seamlessly hide multi-target searching,
> > because that is a job for smarter folks -- specifically folks closer
> > to the application and the data being exchanged.
>
> Interesting.  If we end up agreeing with you here (which we might)
> then that militates against trying for something like Ian's
> BadgeringGreatAggregationOfLotsOfDifferentConnection type.  Not sure
> yet how I feel about that.

I have been happy with the notion of your "manager" in the Perl layer
(except I prefer for it to be implicit and invisible and unmentioned to
the greatest degree possible).  As much as I think a great big
transparent merging thing would be cool, I also shudder at the
complexity... but hey, maybe there's a way to cut down on that.  Ian,
how much does your API layer do in terms of merging results from
different sources behind the scenes?

Here are a few of the problems associated with result set merging.

First of all, the result of a search is a result set handle, or X
handles in the case of a multi-target search.  You don't actually have
any records in hand until you start retrieving them.  What's your
result set count?  If your databases hold totally distinct sets of
records, then the merging process need not worry about duplicates and
you can simply add the result set counts from the different servers.
That's not the general case, however... a decent merging of result sets
should look to remove duplicates -- this absolutely holds in
bibliographic types of systems, where you'll frequently find entries
for the same book/journal/whatever in different places.  In some
applications, you don't care about the result set count anyway, or
you're happy updating it progressively while you're getting more
information.

Once you start getting records down from the server, you can start to
merge them.  If they arrive nicely sorted by whatever field you use to
merge them by, then this is a nice, smooth process.  If this sort field
happens to be the one your user wants results sorted by, then that's
peachy: you'll have a screenful of beautifully sorted and merged
records ready in no time, right?  But this is not the general case, or
even a common case, or even one that I have ever encountered.

The normal case in a mixed group of servers is that the default sort
order is completely random from your point of view.  Less than half of
the servers support the SORT facility, and those that do probably don't
support the same parameters, because the use of sorting is completely
unprofiled.  Even if it were widely supported, you would have issues
like whether you exclude leading articles ("the", "a", etc.) from, say,
title fields, whether you put surnames before or after first names in
author fields, etc.  Except in very controlled environments (basically
those where you have written all of the clients and servers yourself),
it's very hard to get everybody to sort the same way.

Add to this that in libraries, you're typically merging by a host of
different fields, and not necessarily those you want to sort by anyway.
If the data model is complex and you have a lot of fallible humans
doing the data entry, your merging routine needs heuristics to deal
with variations in people's use of the data entry guidelines.
Consider again bibliographic systems, where the same title may exist in
hardcover, paperback, as an audio tape, or as a DVD of the Hollywood
production.  Add to that different editions, revisions, etc., and it
becomes clear that more than a call to strcmp() is involved.
Bibliographic merging at its best (or worst) is a Zen art.

So... assuming someone wants to build a user interface that shows a
brief list of records on the screen, it's not going to be of much use
to retrieve just half a dozen or so from each server.  You can trick
people with this for a demonstration, but it's not going to work for
something people will actually use.  In principle, you're going to need
to get *all* of the records that match your search from each server,
sort them, deduplicate them, count 'em, and show them to the user.
This works if you get a few tens of hits per server, but otherwise,
it's a job that never finishes.

BookWhere, which is probably the only commercial stand-alone Windows
client that has hit it off, essentially doesn't stop fetching records.
It retrieves and merges, retrieves and merges, all the time updating
the nicely sorted display of records on the screen.  But this has
drawbacks too.  Performance tests I have done have suggested that for
many servers, retrieving records is at least as expensive as executing
a search, if not more so.  Database servers that construct retrieval
records by pulling together data from an RDBMS have to do a lot of
legwork to retrieve a record.  Also, a continuously updating display
can be done relatively prettily in a Windows interface, but it requires
trickery to get it to work well in a web application.

*Maybe* we can get around some of these issues by adding hooks where
you can inherit/override classes to put in your own heuristics, but
it's more than just sorting and matching -- the whole strategy for how
you do the retrieval efficiently should be adapted to the kind of
merging/matching you need to do.

In Z39.50 implementors' terms, I'm middle-aged if not quite a senior
citizen, and the last thing I want to do is discourage innovation...  I
just want to get across that it is more intricate than it appears the
first time you consider it.  In our projects, we mostly just ignore it
and display results separately (although we obviously do the Z39.50
network stuff in parallel), but we're beginning to have a hard time
convincing customers that this is a good thing.  Many client systems
cheat and just display results at random or round-robin from different
servers... but I would claim that those systems won't be popular with
users in the long run.

I really think we're biting off more than we can chew if we try to put
result set merging into the guts of ZOOM -- it'll bring more confusion
than help.  You can have a
HeapingGreatSackofHeterogeneousConnectionsAndStuff, but it should
provide access to individual result sets.

--Sebastian
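Two of Hammer's points are easy to make concrete. The first is the "easy" case he describes, in which every server happens to return records already sorted by the field being merged on: the streams can then be interleaved lazily and suspected duplicates dropped as they appear. Here is a minimal sketch in Python, assuming records are plain dictionaries with a "title" field; the key normalisation (case-folding, stripping leading articles) is a deliberately crude stand-in for the heuristics real bibliographic deduplication needs.

    import heapq
    import re

    def merge_key(record):
        """Crude normalised key: lower-cased title, leading article removed."""
        title = record["title"].strip().lower()
        return re.sub(r"^(the|a|an)\s+", "", title)

    def merge_sorted_streams(streams):
        """Interleave per-server record iterators, each already sorted by
        merge_key, yielding one record per distinct key."""
        previous_key = None
        for record in heapq.merge(*streams, key=merge_key):
            key = merge_key(record)
            if key != previous_key:         # treat equal keys as duplicates
                yield record
                previous_key = key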
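The second is the behaviour he attributes to BookWhere, which is also the general case: nothing arrives pre-sorted, so the client keeps fetching small batches from every server, folds them into a locally sorted and deduplicated view, and redraws the display until every result set is exhausted. The sketch below reuses merge_key from the previous sketch; fetch_next_batch() and redraw_display() are hypothetical application-side helpers rather than ZOOM calls, and the loop runs for as long as any server still has records, which is exactly the cost Hammer is pointing at.

    def incremental_merge(result_sets, fetch_next_batch, redraw_display,
                          batch_size=10):
        """Keep fetching, merging and re-sorting until every server's result
        set has been drained, updating the display after every pass."""
        seen = set()            # normalised keys of records already accepted
        merged = []             # locally sorted, deduplicated records
        exhausted = set()
        while len(exhausted) < len(result_sets):
            for index, rs in enumerate(result_sets):
                if index in exhausted:
                    continue
                batch = fetch_next_batch(rs, batch_size)
                if not batch:
                    exhausted.add(index)
                    continue
                for record in batch:
                    key = merge_key(record)
                    if key not in seen:
                        seen.add(key)
                        merged.append(record)
            merged.sort(key=merge_key)
            redraw_display(merged)
        return merged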