No transparent fan-out/result-merging

9th November 2001

One facility that the ZOOM specification could choose to mandate is the ability for a Connection object to behave as a front-end for a bundle of multiple physical connections to different servers. With such a setup, a search submitted to the Connection would be broadcast to all the servers, and the Connection object would seamlessly integrate the results returned by those servers into a single unified result set.
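
By way of illustration, here is a minimal sketch in Python of what such a
facade might look like. The class names and methods are hypothetical, not
drawn from any actual ZOOM binding:

    class MultiConnection:
        """Broadcasts each search to several underlying connections."""

        def __init__(self, connections):
            self.connections = connections  # one real connection per server

        def search(self, query):
            # Fan the query out and collect one result-set handle per
            # server; unifying them is the hard part, waved away here.
            handles = [conn.search(query) for conn in self.connections]
            return MergedResultSet(handles)

    class MergedResultSet:
        """Pretends a list of per-server result sets is one result set."""

        def __init__(self, handles):
            self.handles = handles

        def __len__(self):
            # Only correct if no record appears on more than one server.
            return sum(len(h) for h in self.handles)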

This is very appealing functionality, but it would place an unacceptably high burden of complexity on ZOOM implementors. In the email below, Sebastian Hammer <quinn@indexdata.dk> explains eloquently why ZOOM should not make this a requirement.


Date: Wed, 07 Nov 2001 22:06:02 +0100
From: Sebastian Hammer <quinn@indexdata.dk>
To: Mike Taylor <mike@tecc.co.uk>, zoom@indexdata.dk
Subject: Re: [ZOOM] Re: Scope of ZOOM (was Hello etc)

> > [ZOOM] shouldn't try to seamlessly hide multi-target searching,
> > because that is a job for smarter folks -- specifically folks closer
> > to the application and the data being exchanged.
>
>Interesting.  If we end up agreeing with you here (which we might)
>then that militates against trying for something like Ian's
>BadgeringGreatAggregationOfLotsOfDifferentConnection type.  Not sure
>yet how I feel about that.

I have been happy with the notion of your "manager" in the Perl layer 
(except I prefer for it to be implicit and invisible and unmentioned to the 
greatest degree possible). As much as I think a great big transparent 
merging thing would be cool, I also shudder at the complexity... but hey, 
maybe there's a way to cut down on that.

Ian, how much does your API layer do in terms of merging results from 
different sources behind the scenes?

Here are a few of the problems associated with result set merging.

First of all, the result of a search is a result set handle, or X handles, 
in the case of a multi-target search. You don't actually have any records 
in hand until you start retrieving them. What's your result set count? If 
your databases hold totally distinct sets of records, then the merging 
process need not worry about duplicates and you can simply add the result 
set counts from the different servers. That's not the general case, 
however... a decent merging of result sets should look to remove duplicates 
-- this absolutely holds in bibliographic types of systems, where you'll 
frequently find entries for the same book/journal/whatever in different 
places. In some applications, you don't care about the result set count 
anyway, or you're happy updating it progressively while you're getting more 
information.
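
A toy Python illustration of why the counts don't simply add up (the keys
are invented; real deduplication keys are derived from much messier
bibliographic fields):

    server_a = {"isbn:0-201-89683-4", "isbn:0-13-110362-8"}
    server_b = {"isbn:0-13-110362-8", "isbn:0-596-00027-8"}

    naive_count = len(server_a) + len(server_b)   # 4: sum of the counts
    true_count = len(server_a | server_b)         # 3: after deduplication

    print(naive_count, true_count)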

Once you start getting records down from the server, you can start to merge 
them. If they arrive nicely sorted by whatever field you use to merge them 
by, then this is a nice, smooth process. If this sort field happens to be 
the one your user wants results sorted by, then that's peachy, you'll have 
a screenful of beautifully sorted and merged records ready in no time, right?
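
In code terms, that smooth process is just a k-way merge of sorted streams.
A sketch in Python, with invented records:

    import heapq

    # Each list is already sorted on the merge key, so combining them
    # is a cheap, incremental k-way merge.
    server_a = [("aardvark", "rec-a1"), ("zebra", "rec-a2")]
    server_b = [("badger", "rec-b1"), ("yak", "rec-b2")]

    for key, record in heapq.merge(server_a, server_b):
        print(key, record)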

But this is not the general case, or even a common case, or even one that I 
have ever encountered. The normal case in a mixed group of servers is that 
the default sort order is completely random from your point of view. Less 
than half of the servers support the SORT facility, and those that do 
probably don't support the same parameters, because the use of sorting is 
completely unprofiled. Even if it were widely supported, you would have 
issues like whether you exclude leading articles ("the", "a", etc.) from, 
say, title fields, whether you put surnames before or after first names in 
author fields, etc. Except in very controlled environments (basically those 
where you have written all of the clients and servers yourself), it's very 
hard to get everybody to sort the same way. Add to this that in libraries, 
you're typically merging by a host of different fields, and not necessarily 
those you want to sort by anyway. If the data model is complex and you have 
a lot of fallible humans doing the data entry, your merging routine needs 
heuristics to deal with variations in people's use of the data entry 
guidelines. Consider again bibliographic systems where the same title may 
exist in hardcover, paperback, as an audio tape, or a DVD of the Hollywood 
production. Add to that different editions, revisions, etc., and it becomes 
clear that more than a call to strcmp() is involved. Bibliographic merging 
at its best (or worst) is a Zen art.
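
To make that concrete, here is a sketch of two such normalization
heuristics. Both are deliberately simplistic, and the function names are
invented:

    import re

    LEADING_ARTICLES = re.compile(r"^(the|a|an)\s+", re.IGNORECASE)

    def title_key(title):
        # "The Art of Computer Programming" -> "art of computer programming"
        return LEADING_ARTICLES.sub("", title.strip()).lower()

    def author_key(author):
        # Fold "First Last" and "Last, First" onto one form; real data
        # also has initials, suffixes, corporate authors, co-authors...
        if "," in author:
            return author.strip().lower()
        parts = author.strip().rsplit(" ", 1)
        if len(parts) == 2:
            return f"{parts[1]}, {parts[0]}".lower()
        return author.lower()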

So... assuming someone wants to build a user interface that shows a brief 
list of records on the screen, it's not going to be of much use to retrieve 
just half a dozen or so from each server. You can trick people with this 
for a demonstration, but it's not going to work for something people will 
actually use.

In principle, you're going to need to get *all* of the records that match 
your search from each server, sort them, deduplicate them, count 'em, and 
show them to the user. This works if you get a few tens of hits per server, 
but otherwise, it's a job that never finishes. BookWhere, which is probably 
the only commercial stand-alone Windows client that has really taken off, 
essentially doesn't stop fetching records. It retrieves and merges, 
retrieves and merges, all the time updating the nicely sorted display of 
records on the screen. But this has drawbacks too. Performance tests I have 
done have suggested that for many servers, retrieving records is at least 
as expensive as executing a search, if not more so. Database servers that 
construct retrieval records by pulling together data from an RDBMS have to 
do a lot of legwork to retrieve a record. Also, a continuously updating 
display can be done relatively prettily in a Windows interface, but it 
requires trickery to get it to work well in a web application.
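
A sketch of that retrieve-and-merge loop in Python. The result-set methods
(has_more(), fetch_batch()) and the key functions are hypothetical stand-ins
for whatever a real client would provide:

    from bisect import insort

    def progressive_merge(result_sets, sort_key, dedup_key, redraw,
                          batch_size=10):
        merged = []            # the display list, kept sorted
        seen = set()           # dedup keys of records already merged
        while any(rs.has_more() for rs in result_sets):
            for rs in result_sets:
                for record in rs.fetch_batch(batch_size):
                    key = dedup_key(record)
                    if key in seen:
                        continue           # duplicate; drop it
                    seen.add(key)
                    insort(merged, (sort_key(record), key))
            redraw(merged)     # show the user something while we work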

*Maybe* we can get around some of these issues by adding in hooks where you 
can inherit/override classes to put in your own heuristics, but it's more 
than just sorting and matching -- the whole strategy for how you do the 
retrieval efficiently should be adapted to the kind of merging/matching you 
need to do.
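
One possible shape for those hooks is a strategy object the application can
subclass, covering the retrieval plan as well as comparison. This is
entirely hypothetical; nothing like it exists in ZOOM:

    class MergeStrategy:
        """Default behavior; applications subclass and override."""

        def sort_key(self, record):
            return record.get("title", "")

        def same_record(self, a, b):
            return self.sort_key(a) == self.sort_key(b)

        def fetch_plan(self, result_sets):
            # Retrieval is overridable too: round-robin batches of ten
            # by default, but a subclass might weight servers by
            # response time or hit count.
            return [(rs, 10) for rs in result_sets]

    class BibliographicStrategy(MergeStrategy):
        def sort_key(self, record):
            title = record.get("title", "").lower()
            return title[4:] if title.startswith("the ") else title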

In Z39.50 implementor's terms, I'm middle-aged if not quite a senior 
citizen, and the last thing I want to do is discourage innovation... I just 
want to get across that it is more intricate than it appears the first time 
you consider it. In our projects, we mostly just ignore it and display 
results separately (although we obviously do the Z39.50 network stuff in 
parallel), but we're beginning to have a hard time convincing customers 
that this is a good thing. Many client systems cheat and just display 
results at random or round-robin from different servers... but I would 
claim that those systems won't be popular with users in the long run.

I really think we're biting off more than we can chew if we try to put 
result set merging into the guts of ZOOM -- it'll bring more confusion than 
help. You can have a HeapingGreatSackofHeterogeneousConnectionsAndStuff, 
but it should provide access to individual result sets.
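
Such a non-merging bundle might look something like this sketch (the class
name and the .host attribute are illustrative only):

    class ConnectionGroup:
        """Fans a search out to every server, but hands back one result
        set per server, leaving any merging to the application."""

        def __init__(self, connections):
            self.connections = connections

        def search(self, query):
            # No unified count, no merged list -- just a handle per
            # server, keyed by host.
            return {conn.host: conn.search(query)
                    for conn in self.connections}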

--Sebastian

_______________________________________________
ZOOM mailing list
ZOOM@indexdata.dk
http://www.indexdata.dk/mailman/listinfo/zoom

Feedback to <mike@indexdata.com> is welcome!