Re: A draft proposal for UUIDs
From: Stephen Woodbridge
August 16, 2011 07:41
Message ID: 4E4A7543.email@example.com
In line below ...
On 8/16/2011 1:26 AM, Ron Savage wrote:
> Hi Steve
> On Mon, 2011-08-15 at 21:53 -0400, Stephen Woodbridge wrote:
>> On 8/15/2011 6:06 PM, Ron Savage wrote:
>>> Hi Steve
>>> See below
>>> On Mon, 2011-08-15 at 09:12 -0400, Stephen Woodbridge wrote:
>>>> On 8/14/2011 11:41 PM, Ron Savage wrote:
>>>>> Hi Folks
>>>>> Let the replyfest begin!
>>>> Overall this is a great start, here are some comments:
>>>>> Importation of GEDCOM data
>>>>> When a file of GEDCOM data is imported, various cases arise:
>>>>> o Importation into an empty system
>>>>> In this case, UUIDs do not need to be generated.
>>>> I think this is only true if the empty system does not support UUIDs,
>>>> otherwise it needs to create at least one top level UUID to reflect the
>>>> importation. This would be identical to the case of having an empty
>>>> system and adding INDI(s) to it.
>>> Hmmm. Not sure about this.
>>> I guess this raises the question: Is that UUID meant to represent the
>>> source or the new db?
>> This is a good question and I tried to answer it with my implied question:
>> Given an empty system, what happens when you add the first INDI?
> My proposal is to delay creating UUIDs until they're needed, but of
> course any system could create them as soon as they /might/ be needed.
>> I would assume that the act of importation would be similar to creating
>> an INDI. Also if the imported GEDCOM had UUIDs then you would import
> Yes. I've proposed software must always be able to 'handle' UUIDs even if it can't generate them.
>> them. Would you create an extra import action UUID, probably not needed,
> This quote from the Gedcom::Record docs:
> "All the Gedcom tag names can be used as function names."
> I guess that's what you're referring to?
No. The idea of using UUIDs was to support versioning and tracking of
the source(s) of any data item. So the act of importing non-UUID data
into a file must at a minimum create a UUID for the import action that
all the imported data would inherit. Because if you then add a new INDI
to the file, you need to know that it is from a different source than
those that were imported.
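
A minimal sketch of that idea in Python, for illustration only. The
Record class, its field names, and both functions are hypothetical and
not part of any existing Gedcom module. It assumes one UUID is created
per import action and that a hand-entered record gets a fresh source
UUID of its own:

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """Hypothetical stand-in for a GEDCOM record such as an INDI."""
    xref: str                          # e.g. "@I1@"
    tag: str                           # e.g. "INDI"
    source_uuid: Optional[str] = None  # UUID of the action that created it


def import_records(records, db):
    """Tag every imported record with a single UUID for the import action."""
    import_uuid = str(uuid.uuid4())
    for rec in records:
        rec.source_uuid = import_uuid  # all imported data inherits this UUID
        db.append(rec)
    return import_uuid


def add_record(rec, db):
    """A record entered by hand comes from a different source (assumption:
    each manual entry gets its own new UUID)."""
    rec.source_uuid = str(uuid.uuid4())
    db.append(rec)
    return rec.source_uuid
```

With this, two INDIs from the same import share a source UUID, while an
INDI added afterwards can be told apart from them.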
> Gedcom.pm and Gedcom::Record.pm use AUTOLOAD, presumably for responding
> to tags as though they were method names. I'm never going to do that.
> So that means my code will have to do something different when, e.g.
> someone calls a method and gets back a data item representing some
> aspect of an INDI (say).
> I can understand using a tag as a method on an INDI or FAM, but would
> you even need that tag to work on the db as a whole? Presumably not.
> If anyone can suggest use cases for tags as methods, please let me know.
>> but I think you are right to create one for the import action because
>> all imported entities would then inherit that UUID and be tagged for the
>> future that they came from that act.
> I'm still not clear as to whether the UUID generated upon import should
> be seen as belonging to the source or the resultant db.
> Let's just say it's created and stored upon import.
Well this is fine, but I think that the ownership issue is important for
understanding the semantics of the system. So maybe some use cases would help.
1. a user has a birth in the family and enters that into the system
2. a user does some research and copies a family tree from another
source like a book
3. a cousin sends a Family Tree Maker GEDCOM file (i.e. no UUIDs) of his
family tree which might overlap with the existing system data and it
4. another cousin sends GEDCOM file with UUIDs to be merged
5. edits are made to change names, dates, places, notes, etc on existing
records in the system
6. user wants to review the history or variants on a given item, event,
So given these or similar cases/actions, how do the UUIDs come into play?
> Now, if some interface code allows the user to create INDIs, say, then
> they have to be flagged as having a different UUID. Or do they?
> If the original UUID belonged to the source, then yes, since the new
> INDIs are coming from a different source.
I think this is the correct answer. The UUID belongs to the source of
the import action that created the data when the import did not have
UUIDs of its own.
Adding a UUID for the import action would then allow all the data to be
later purged if it needed to be, so there might be value in adding a UUID
to the import even if the imported data already has UUIDs.
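
As a rough sketch of the purge case (Python, with records as plain
dicts; the "import_uuid" key and both function names are assumptions
made up for this example), an import UUID stored alongside any UUID the
record already carries lets a whole import be removed later:

```python
import uuid


def import_batch(db, records):
    """Append records to db, tagging each with one UUID for this import.

    Any "uuid" key a record already has is kept untouched; the import
    UUID is stored next to it under the hypothetical key "import_uuid".
    """
    import_uuid = str(uuid.uuid4())
    for rec in records:
        db.append({**rec, "import_uuid": import_uuid})
    return import_uuid


def purge_import(db, import_uuid):
    """Return a new list without the records that came from one import."""
    return [rec for rec in db if rec.get("import_uuid") != import_uuid]
```

Calling purge_import(db, some_import_uuid) drops only the records from
that import and leaves everything else, including their own UUIDs,
alone.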
> If it belongs to the db, then no. The one UUID can own all INDIs.
> Suggestions welcome.
>>> I was thinking only that a UUID would be generated on demand, but I'll
>>> Pauses a moment ... The source I'd say, so yes, probably better to
>>> generate one at importation time.
>>>> Nice job.
>>> Thanx. Errrr - Your reply stands out in the deafening silence :-)).
>> OK, well I'll throw out another thought that I had on this for
>> discussion. It has to do with optimization of UUIDs for storage purposes
>> in the GEDCOM. This seems risky, as it implies looking up some hierarchy
>> of structure that is not clear to me to find the UUID for some given
>> item. While this might be very straightforward in an object oriented
>> system that supports inheritance, I'm concerned that this might be:
>> 1. difficult to do in/with a GEDCOM file
>> 2. while it decreases spatial complexity, it increases algorithmic and
>> time complexity
>> 3. it might be error-prone or hard to verify algorithmic correctness
>> 4. it might be expensive to find the UUID for a given item
>> OK, so these concerns are somewhat abstract at the moment, but I think they
>> warrant at least some thought and you might want to consider supporting
>> GEDCOM import/export where all items are tagged with their UUIDs.
>> Although, I can imagine that that would double or triple the size of the
>> GEDCOM generated. If you can generate both, then this could be used to
>> generate test cases to validate the correctness of other software.
>> Anyway I did mention this originally, because it seems like a valid
>> optimization, so I'm not suggesting that you change anything, but I
>> thought it worthwhile to raise this point to see what others might
>> think about it.
> These are reasonable questions...
> Some off-the-top-of-the-head (OTTOTH?) answers:
> 1) A GEDCOM file is not random access. So, to read any record implies
> the header has been read. With a list of UUIDs in the header (as per the
> most recent proposal), attaching a UUID to any INDI lacking one is easy.
> And if the INDI has a UUID already, it's available just by having read
> the INDI.
> Nevertheless, there is still the problem of adding the default UUID to
> any item lacking one.
> 2) With a GEDCOM file, (1) should answer most questions.
> With a random access file (e.g. DBD::SQLite), the cumulative header
> (from all file activity) should be only a single read away, and read when
> the program starts.
> 3) Complexity and Correctness. Yes. These are valid concerns. I'll have
> to keep them in mind.
> 4) Expensive. I'm not convinced on that one.
> 5) An option to export/import where each item has in situ UUIDs is not
> unreasonable, and yet ... it's just like indented files, which add size
> without adding value. Still, possible.
> 6) Validation of software. Now that's a great thing to aim for.
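
To make points 1, 5 and 6 concrete, here is a minimal Python sketch,
not a definitive implementation. It assumes the header supplies a single
default UUID (the proposal mentions a list, so this is a simplification)
and that records are plain dicts; all names are hypothetical:

```python
def resolve_uuid(record, default_uuid):
    """Return the record's own UUID, else the default from the header."""
    return record.get("uuid") or default_uuid


def expand(records, default_uuid):
    """Build the 'in situ' form: every record carries its UUID explicitly.

    The input records are not modified; new dicts are returned.
    """
    return [{**rec, "uuid": resolve_uuid(rec, default_uuid)} for rec in records]


def same_uuids(compact, expanded, default_uuid):
    """Check that the compact and expanded forms agree on every UUID."""
    if len(compact) != len(expanded):
        return False
    return all(
        resolve_uuid(c, default_uuid) == e["uuid"]
        for c, e in zip(compact, expanded)
    )
```

Exporting both forms and comparing them with something like same_uuids
is one way the expanded file could serve as a test case for other
software.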