Journal Wonko the Sane's Journal: Aggregating contacts is hard.

I'm writing this mainly to get my own thoughts straight before I take a stab at implementing this myself.

Merging and synchronizing contact information between all the various services that a person might use appears to be an unsolved problem. I've looked high and low and I cannot find a single piece of software that will:

  1. Maintain a definitive list of metacontacts and mappings (e.g. map a Gmail contact to a Facebook contact)
  2. Import data from every service I use
  3. Automatically export information that is missing from one member of a metacontact but present in another member (when the underlying service supports this)
  4. Give me a means to access and edit this information on both my PC and my mobile phone
  5. Present an editable, unified view that eliminates redundant and obsolete data

Accessing the data is the easy part. Most services have an API that gives you either read-only or read-write access to it. In some cases, though, all you have to work with is a CSV file.

So for this to work I need a way to gather all the required information, put it into relational form and find or create the appropriate mappings.

This is where it gets tricky. Very few software projects properly handle contact information. The data model needs to include all the metadata about the contact that you care about, plus meta-metadata (data about each of those attributes). A person can have an unlimited number of email addresses, for example. Any particular email address might be a home address or a work address. It might be an active address that you should send mail to, or it might be an old address that is no longer in use (but you want to keep it associated with that person so you know who all those old emails came from).
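To make the idea of meta-metadata concrete, here is a minimal sketch of an email attribute that carries its own kind and active/retired status. The names `EmailAddress`, `ContactRecord`, `sendable` and `owns` are made up for illustration; they are not from any existing project.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EmailAddress:
    address: str
    kind: Optional[str] = None  # "home", "work", or None when the source does not say
    active: bool = True         # False = old address, kept only to identify old mail


@dataclass
class ContactRecord:
    # A person can have any number of addresses, so this is a list, not a field.
    emails: List[EmailAddress] = field(default_factory=list)

    def sendable(self) -> List[str]:
        """Addresses that mail should actually be sent to."""
        return [e.address for e in self.emails if e.active]

    def owns(self, address: str) -> bool:
        """True if the address, active or retired, belongs to this contact."""
        wanted = address.lower()
        return any(e.address.lower() == wanted for e in self.emails)
```

Keeping the retired address in the list is what lets `owns` still answer "who sent this old email?" while `sendable` leaves it out.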

So creating a robust relational structure in your database is non-trivial, but solvable. The hard part is wrangling the data from the other sources into relational form. Most data services do not have a unique, invariant identifier for each contact. Each and every attribute is subject to change. Usually matching based on name or email address will work but contacts can and do change both of these from time to time.
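A rough sketch of the name-or-email matching mentioned above, assuming each imported contact is a plain dict with optional `name` and `emails` keys (that shape, and the function names, are assumptions for illustration). It trusts a shared email address first and falls back to the name only when exactly one known contact has it; anything else is left for the user to map by hand.

```python
from typing import List, Optional


def normalize_name(name: str) -> str:
    """Lowercase and collapse whitespace so 'John  Doe' equals 'john doe'."""
    return " ".join(name.lower().split())


def find_match(incoming: dict, known: List[dict]) -> Optional[int]:
    """Return the index of the best candidate in `known`, or None."""
    emails = {e.lower() for e in incoming.get("emails", [])}
    for index, candidate in enumerate(known):
        if emails & {e.lower() for e in candidate.get("emails", [])}:
            return index  # any shared address counts as a match

    name = normalize_name(incoming.get("name", ""))
    if name:
        hits = [i for i, c in enumerate(known)
                if normalize_name(c.get("name", "")) == name]
        if len(hits) == 1:  # an ambiguous name is not a match
            return hits[0]
    return None
```

Because both attributes can change over time, a match found this way is only a suggestion; once confirmed, the mapping itself should be stored rather than recomputed.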

Once all the information is pulled into the database, it's time to eliminate duplicate information by merging all those subcontacts into their respective metacontacts. Each subcontact should map to exactly one metacontact. To prevent data duplication, the metacontact itself should NOT have any attribute information (name, email address, phone number, etc.) directly associated with it.
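One possible shape for that structure, sketched with Python's built-in `sqlite3` module. The table and column names are assumptions, and only the email attribute is shown. The metacontact table holds nothing but an id, and the NOT NULL foreign key is what ties each subcontact to exactly one metacontact.

```python
import sqlite3

SCHEMA = """
CREATE TABLE metacontact (
    id INTEGER PRIMARY KEY  -- deliberately no name, email or phone columns
);
CREATE TABLE subcontact (
    id INTEGER PRIMARY KEY,
    metacontact_id INTEGER NOT NULL REFERENCES metacontact(id),
    service TEXT NOT NULL,               -- e.g. 'gmail', 'facebook', 'csv'
    read_only INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE email (
    id INTEGER PRIMARY KEY,
    subcontact_id INTEGER NOT NULL REFERENCES subcontact(id),
    address TEXT NOT NULL
);
"""


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    # SQLite only enforces foreign keys when asked to, per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def emails_for(conn, metacontact_id):
    """Every distinct address reachable through the metacontact's subcontacts."""
    rows = conn.execute(
        "SELECT DISTINCT e.address FROM email e "
        "JOIN subcontact s ON s.id = e.subcontact_id "
        "WHERE s.metacontact_id = ? ORDER BY e.address",
        (metacontact_id,),
    )
    return [row[0] for row in rows]
```

The unified view is then always a query over the subcontacts, so there is no second copy of an attribute to fall out of date.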

Once the subcontacts are merged into metacontacts, the next step is to merge the metadata to eliminate duplication. The easiest way to do this is to aggregate all the information from every subcontact and display anything that is not a duplicate. If Facebook and Google both list the same email address for John Doe, then we only need to display it once. If they have different email addresses, then we should display both. Finding duplicates isn't always easy: +1 (800) 555-1212 and 8005551212 are actually the same number (from the point of view of a caller in the US), but a simple text comparison will not reveal that. The former would be better to display, so ideally you'd just update the latter data source, but what if it's read-only? In that case you need a way to prevent certain subcontact attributes from being pulled into the metacontact. In addition, certain attributes shouldn't allow multiple values. If a person has only one canonical name, should you use their Twitter username, their Facebook user name or the name stored in their associated Gmail contact? The user must decide and the database needs to store this choice.
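The phone number case can be sketched roughly as follows. This is a deliberately naive normalizer that assumes a US caller, as in the example above (a bare 10-digit number gets country code 1); a real implementation would want a proper phone-number parsing library. The function names `phone_key` and `merge_phones` are invented for this sketch.

```python
import re
from typing import Dict, Iterable, List


def phone_key(raw: str, default_country: str = "1") -> str:
    """Reduce a phone number to a comparison key such as '+18005551212'."""
    digits = re.sub(r"\D", "", raw)
    if raw.strip().startswith("+"):
        return "+" + digits                    # country code already present
    if len(digits) == 10:
        return "+" + default_country + digits  # assume a national (US) number
    return digits                              # unknown shape: compare digits only


def merge_phones(values: Iterable[str]) -> List[str]:
    """Drop duplicates, preferring the spelling that includes a country code."""
    best: Dict[str, str] = {}
    for raw in values:
        key = phone_key(raw)
        current = best.get(key)
        has_code = raw.strip().startswith("+")
        if current is None or (has_code and not current.strip().startswith("+")):
            best[key] = raw
    return list(best.values())
```

With this, `merge_phones(["8005551212", "+1 (800) 555-1212"])` keeps only `"+1 (800) 555-1212"`, which is the form the text above prefers to display.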

So after we're all done with this we'll have a nice, unified view of all contact information. This unified view should be editable, and any changes made there should be pushed out to all services which are not read-only. In the case of the read-only services, the stale data should not roll up into the metacontact, unless and until the underlying data changes. Example: a friend lists his phone number on Facebook as 800-555-1212, but I edit this number to include the country code because I want to be able to call him from outside the US: +1 (800) 555-1212. This change can be pushed to Gmail but not to Facebook. So from now on the mapping between metacontacts and subcontacts should exclude the phone number from the Facebook subcontact, unless my friend changes his phone number on Facebook to something else. If that happens, the new number should roll up into the metacontact.
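One way to express that rule, as a sketch: remember the exact stale value that was excluded, and drop the exclusion as soon as the service reports something different. The data shapes here (a dict of per-service attribute dicts, and a `suppressed` dict keyed by service and attribute) are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# (service, attribute) -> the stale value that must not roll up
Suppressions = Dict[Tuple[str, str], str]


def roll_up(attribute: str,
            subcontacts: Dict[str, Dict[str, str]],
            suppressed: Suppressions) -> List[str]:
    """Collect the values of one attribute for the metacontact view."""
    values: List[str] = []
    for service, attrs in subcontacts.items():
        value = attrs.get(attribute)
        if value is None:
            continue
        key = (service, attribute)
        if suppressed.get(key) == value:
            continue               # still the stale value: keep it out
        suppressed.pop(key, None)  # changed upstream: the exclusion lapses
        if value not in values:
            values.append(value)
    return values
```

In the example above, the entry `("facebook", "phone"): "800-555-1212"` hides the Facebook number until it changes on Facebook, at which point the new number shows up next to the one from Gmail.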

Some services do not support attribute metadata. A CSV file might just have an "address" field without specifying whether it is a home or work address, physical or mailing, active or deprecated, etc. So this meta-metadata will need to be stored in the database itself. Meta-metadata should be synchronized to the maximum extent supported by the underlying service.
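A small sketch of the CSV case, assuming a file with `name` and `address` columns (the column names and the shape of the local annotation store are assumptions). The CSV supplies the bare value; anything the file cannot express comes from annotations kept on our side.

```python
import csv
import io
from typing import Dict, List, Tuple

# (name, address) -> locally stored meta-metadata the CSV has no column for
Annotations = Dict[Tuple[str, str], Dict[str, object]]


def load_addresses(csv_text: str, annotations: Annotations) -> List[dict]:
    """Read bare addresses from CSV text and attach local meta-metadata."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        meta = annotations.get((row["name"], row["address"]), {})
        rows.append({
            "name": row["name"],
            "address": row["address"],
            "kind": meta.get("kind"),            # None = unknown so far
            "active": meta.get("active", True),  # assume active until told otherwise
        })
    return rows
```

Since the annotations live only in the local database, re-importing the same file later does not lose them.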

I think I've got enough there to keep me busy for a while. I'm going to try to build a proof of concept of this, but I may not get very far before I throw my hands up in disgust or someone else implements it (maybe Akonadi, but I haven't seen anything that indicates that it will have robust metacontact functionality).
