Parsing GEDCOM files in Python
As Dad is getting close to the limit of what we can find using Ancestry.com, we’ve started keeping copies of the tree (downloadable as a .ged file) so we still have all the data once we stop paying for it. I decided to take a look at parsing the file, which turns out to be quite trivial - except I don’t actually know what I want to do with the data once I parse it.
A question on StackOverflow leads to one implementation (unfortunately it simply converts the file to an XML format that isn’t really any more useful), but I went in a bit of a different direction.
I’m pretty new at Python and don’t entirely understand the use of codecs in that implementation, so I threw that away and used a simple file handle (with the advantage that at one point in my hacking around I could read extra lines if I wanted, though that turned out to be a less than useful way of doing it). My regexes are also slightly different, and I’m not entirely sure whether they’re the “Pythonic” way to do it.
import sys, re

tree = {}

def parse(line):
    # set up "static" variables as attributes on the function itself
    if 'person' not in parse.__dict__:
        parse.person = {}
    if 'parent' not in parse.__dict__:
        parse.parent = None

    # level, optional @ID@ cross-reference, tag, optional data
    m = re.match(r'^(\d+) (?:@(\w+)@ )?(\w+)(?: )?(.+)?$', line)
    if m is not None:
        layer, id, type, data = m.group(1, 2, 3, 4)

        # individual records
        if type == 'INDI':
            # new individual: store the previous one first
            if 'id' in parse.person:
                tree[parse.person['id']] = parse.person
            parse.person = {}
            parse.parent = None
            parse.person['id'] = id
        elif type == 'NAME':
            parse.person['name'] = data
        ### SNIP

# open the file and read it line by line
with open(sys.argv[1], 'r') as ged:
    for line in ged:
        parse(line.strip())

# store the last individual, which no later INDI record would flush
last = getattr(parse, 'person', {})
if 'id' in last:
    tree[last['id']] = last

print(tree)
In the interests of being terse, I’ve cut out a lot of the code to paste it here - most of it is simply about detecting record types and storing data accordingly. parse.parent is used to store the previous record type for things like BIRT and MARR records, which can both have child entries like DATE and PLAC. This approach is, of course, not particularly tolerant of malformed databases: lines that don’t match the regex are silently ignored.
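To make the parent-tracking idea concrete, here is a minimal sketch of how child entries like DATE and PLAC could be attached to the record they belong to. It is not the snipped code: the parse_events function, the LINE_RE name, the lower-cased dictionary keys and the sample lines are all made up for illustration, it only handles a single individual, and it uses Python 3 style print().

```python
import re

# Same pattern as the parser above: level, optional @ID@, tag, optional data.
LINE_RE = re.compile(r'^(\d+) (?:@(\w+)@ )?(\w+)(?: )?(.+)?$')

# Hypothetical sample lines, invented for this example.
SAMPLE = """\
0 @I1@ INDI
1 NAME John /Smith/
1 BIRT
2 DATE 1 JAN 1900
2 PLAC Springfield
1 DEAT
2 DATE 2 FEB 1980
"""

EVENTS = ('BIRT', 'DEAT', 'MARR')

def parse_events(lines):
    """Collect NAME plus the DATE/PLAC children of event records."""
    person = {}
    parent = None  # tag of the most recent level-1 record
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip lines that do not look like GEDCOM
        level, _xref, tag, data = m.groups()
        if level == '1':
            parent = tag
            if tag == 'NAME':
                person['name'] = data
            elif tag in EVENTS:
                person[tag.lower()] = {}
        elif level == '2' and tag in ('DATE', 'PLAC') and parent in EVENTS:
            # attach the child entry to the event it sits under
            person[parent.lower()][tag.lower()] = data
    return person

person = parse_events(SAMPLE.splitlines())
print(person)
```

Keeping the parent as a local variable instead of a function attribute is just a choice for this sketch; it makes the function easier to call more than once.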
There are probably far easier ways to deal with the data than what I am doing, but again - I don’t actually know what I want to do with the data once I’ve parsed it. I do know that I would like to store it in a properly linked format where one record links directly to another, instead of having a separate index of relationships. Once I solve the problem of knowing what I want to do with it, I’ll likely post the whole thing on Github.
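As a rough idea of what a "properly linked" format might look like, here is a small sketch that turns id references into direct references between records. GEDCOM family records do use HUSB, WIFE and CHIL tags, but everything else here is an assumption: the link function, the shape of the people and families structures, and the sample names are hypothetical, not something the parser above produces yet.

```python
# Hypothetical parsed data: individuals keyed by id, families holding id references.
people = {
    'I1': {'id': 'I1', 'name': 'John /Smith/'},
    'I2': {'id': 'I2', 'name': 'Mary /Jones/'},
    'I3': {'id': 'I3', 'name': 'Ann /Smith/'},
}
families = [
    {'husb': 'I1', 'wife': 'I2', 'chil': ['I3']},
]

def link(people, families):
    """Give each person direct references to spouses, parents and children."""
    for person in people.values():
        person.setdefault('spouses', [])
        person.setdefault('parents', [])
        person.setdefault('children', [])
    for fam in families:
        # .get() tolerates missing tags and ids that point at no record
        partners = [people.get(fam.get(key)) for key in ('husb', 'wife')]
        partners = [p for p in partners if p is not None]
        children = [people[c] for c in fam.get('chil', []) if c in people]
        for p in partners:
            p['spouses'].extend(q for q in partners if q is not p)
            p['children'].extend(children)
        for c in children:
            c['parents'].extend(partners)
    return people

link(people, families)
print(people['I3']['parents'][0]['name'])
```

After linking, walking the tree is plain attribute-style access (a child's parents, a parent's children) with no lookups by id. The trade-off is that the records now reference each other in cycles, so they would need flattening back to ids before being written out as something like JSON.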