indexpost archiveatom feed syndication feed icon

An XML Deserializer

2024-05-26

Rouding out my last entry with one final thought. How simple can you make the deserialization of XML into Python objects? How clunky is it in practice?

Serializer

My XML serialization is intentionally limited, it tries to reflect the structure and names of my preferred sorts of data objects. It doesn't do much and tries to be about as obvious as possible. The intended means of extending it is through singledispatch methods on new non-dataclass object types.

Deserializer

I spent a while thinking up the most generic thing I could think of. An abstract XML deserializer that "does the right thing" in all cases. There are numerous libraries that attempt the same and while I was browsing around some I got to wondering whether such flexibility really fits my own preferences for obvious code. Where I had done a modest amount of munging through type hints to derive XML tag names I found the idea of trying to reverse the process almost too surprisng. Taking a step back then I got to thinking about my usual struggles with data exchange: fields being renamed, types being altered or defined loosely, too much magic.

With these things in mind then I imagined instead the simplest thing that could possibly work. What I dreamed up was something like this:

@dataclasses.dataclass
class Address:
    street: str
    city: str
    zipcode: str

def from_xml(Address, xml_element):
    return Address(
	street=xml_element.find('street').text,
	city=xml_element.find('city').text,
	zipcode=xml_element.find('zipcode').text
    )

There's some obvious duplication between the attributes of the class and the "find" calls but it is also exceedingly obvious how the deserialization happens this way. Additionally, it makes it easy to cast the textual values from XML into their appropriate type (int, float, UUID). Having written another free floating method to deal with XML (the first being to_xml) I got the feeling I was looking at a nice place to extract some functionality. The question became how I'd prefer to annotate these XML-capable dataclasses, I decided (for now) on a class decorator and to include the from_xml method on the class to keep things tidy. The result then looks like this:

@xml_serializable
@dataclasses.dataclass
class Address:
    street: str
    city: str
    zipcode: str

    @classmethod
    def from_xml(cls, elem):
        return cls(
            street=elem.find('street').text,
            city=elem.find('city').text,
            zipcode=elem.find('zipcode').text
        )

The decorator @xml_serializable is part of a module defined like this:

def from_xml(cls, xml_element):
    raise NotImplementedError(f"{cls} lacks a from_xml method!")

def from_xml_string(cls, xml_string):
    root = xml.etree.ElementTree.fromstring(xml_string)
    return cls.from_xml(root)

def xml_serializable(cls):
    cls.to_xml = to_xml
    if not hasattr(cls, 'from_xml'):
        cls.from_xml = classmethod(from_xml)
    cls.from_xml_string = classmethod(from_xml_string)
    return cls

I've omitted to_xml for brevity but it is copied from the last post almost directly. I did end up changing the return type from returning xml.dom.minidom.Document() to the text-formatted version by adding document.toxml() since I was already doing that by hand.

With this class decorator serialization should mostly just work out of the box and deserialization requires only that a class implements from_xml(cls, xml_element). Rather than require preprocessing of XML strings into documents each class gains the from_xml_string method without any further work. This separation between the string and element handling allows composite objects to compose nicely:

@xml_serializable
@dataclasses.dataclass
class Person:
    name: str
    age: int
    address: Address

    @classmethod
    def from_xml(cls, elem):
        return cls(
            name=elem.find('name').text,
            age=int(elem.find('age').text),
            address=Address.from_xml(elem.find('address'))
        )

Notice the cast from "age" to an integer. It was surprsingly annoying to work out as much using introspection of the type or type hints, enough so that I am actually more convinced this is a good idea for how simple it is. Finally then it is worth touching on how lists are handled since that induced some complexity for serializing. It is actually no more difficult than the prior example, you just write more python code and rely on from_xml being implemented!

@xml_serializable
@dataclasses.dataclass
class Example:
    persons: list[Person]

    @classmethod
    def from_xml(cls, elem):
        return cls(
            persons=[Person.from_xml(p)
                     for p in
                     elem.find('persons').findall('person')]
        )

The calls to from_xml might get marginally more complicated in the face of optional tags but anyone comfortable with Python is probably writing their fair share of if foo is not None already.

A Final Example

Finally to demonstrate the full thing using the above classes it is possible to round-trip an XML string into Python, back to XML and back into Python — just because we can!

example_xml = '''
<example>
  <persons>
    <person>
        <name>Alyssa P. Hacker</name>
        <age>35</age>
        <address>
            <street>321 Main St</street>
            <city>New Mexico</city>
            <zipcode>99999</zipcode>
        </address>
    </person>
    <person>
        <name>Ben Bitdiddle</name>
        <age>30</age>
        <address>
            <street>123 Main St</street>
            <city>New York</city>
            <zipcode>10001</zipcode>
        </address>
    </person>
  </persons>
</example>
'''

e = Example.from_xml_string(example_xml)
ee = e.to_xml()
eee = Example.from_xml_string(ee)

Example(persons=[Person(name='Alyssa P. Hacker',
                        age=35,
                        address=Address(street='321 Main St',
                                        city='New Mexico',
                                        zipcode='99999')),
                 Person(name='Ben Bitdiddle',
                        age=30,
                        address=Address(street='123 Main St',
                                        city='New York',
                                        zipcode='10001'))])

assert e == eee

Thoughts

I am really rather pleased with how this turned out. While I was a little unsure about the real utility in constructing my own XML serializer I am already thinking of places I might use this pattern or at least demonstrate it to others for the kinds of complexity tradeoffs I find palatable. The serializer, deserializer class decorator rigging and "extension points" ends up being less than 70 lines of code:

import functools
import dataclasses
import typing
import xml.dom.minidom
import xml.etree.ElementTree


@functools.singledispatch
def to_serializable(value):
    return str(value)

def to_xml(obj):
    def build(parent, obj, type_hint=None):

        if dataclasses.is_dataclass(obj):
            for key, value in dataclasses.asdict(obj).items():
                tag = document.createElement(key)
                parent.appendChild(tag)
                build(tag, value, typing.get_type_hints(type(obj)).get(key))

        elif isinstance(obj, list):
            elem_type = None
            if type_hint and hasattr(type_hint, '__args__'):
                elem_type = type_hint.__args__[0]
                tag_name = elem_type.__name__.lower()
            elif type_hint:
                tag_name = type_hint.__name__.lower()
            else:
                tag_name = 'value'

            for elem in obj:
                tag = document.createElement(tag_name)
                parent.appendChild(tag)
                build(tag, elem, elem_type)

        elif isinstance(obj, dict):
            for key, value in obj.items():
                tag = document.createElement(key)
                parent.appendChild(tag)
                build(tag, value, typing.get_type_hints(type(obj)).get(key))

        else:
            data = to_serializable(obj)
            tag = document.createTextNode(data)
            parent.appendChild(tag)

    document = xml.dom.minidom.Document()
    root_tag_name = type(obj).__name__.lower()
    root = document.createElement(root_tag_name)
    document.appendChild(root)
    build(root, obj, typing.get_type_hints(type(obj)).get('obj'))
    return document.toxml()

def from_xml(cls, xml_element):
    raise NotImplementedError(f"{cls} lacks a from_xml method!")

def from_xml_string(cls, xml_string):
    root = xml.etree.ElementTree.fromstring(xml_string)
    return cls.from_xml(root)

def xml_serializable(cls):
    cls.to_xml = to_xml
    if not hasattr(cls, 'from_xml'):
        cls.from_xml = classmethod(from_xml)
    cls.from_xml_string = classmethod(from_xml_string)
    return cls

As a final sanity check on whether I was backing myself into a corner I also tried out two different cases, representative of the opposite ends of a spectrum I often find myself working in. The first is how easily the serialization can be made to accommodate arbitrary regular python classes. Of course, most classes of this sort don't tend to have sufficient information to entirely reconstruct them but if you imagine you at least have a database key or other definite reference it seems to hold up, even if additional handling is needed to produce a fully instantiated object:

@xml_serializable
class FooBar:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

    def __repr__(self):
        return f'FooBar X:{self.x} Y:{self.y} Z:{self.z}'

    @classmethod
    def from_xml(cls, elem):
        intermediate = elem.text
        x, y, z = intermediate.split('-')
        return cls(x, y, z)

@to_serializable.register
def ts_foobar(value: FooBar):
    return f'{value.x}-{value.y}-{value.z}'

f = FooBar(1,2,3)

print(f.to_xml())    # <?xml version="1.0" ?><foobar>1-2-3</foobar>

FooBar.from_xml_string(f.to_xml())    # FooBar X:1 Y:2 Z:3

The other case worth at least testing is the possibility of complex types to ensure information doesn't get lost in serializing and deserializating.

@to_serializable.register
def ts_datetime(value: datetime.datetime):
    return value.isoformat()

type Nonsense = typing.Literal["foo", "bar", "baz"]
type X = typing.Literal["X"]
type Y = typing.Literal["y"]

class Quality(enum.StrEnum):
    A = enum.auto()
    B = enum.auto()
    C = enum.auto()

@xml_serializable
@dataclasses.dataclass
class Thing:
    id: uuid.UUID
    date: datetime.datetime
    quality: typing.Optional[Quality] = None
    comments: list[Nonsense] = dataclasses.field(default_factory=list)
    zzz: typing.Optional[X|Y] = None

    @classmethod
    def from_xml(cls, elem):
        optional_quality = elem.find('quaility')
        optional_zzz = elem.find('zzz')
        return cls(
            id=elem.find('id').text,
            date=datetime.datetime.fromisoformat(elem.find('date').text),
            quality=Quality(optional_quality.text) if optional_quality is not None else None,
            comments=[e.text for e in elem.find('comments').findall('nonsense')],
            zzz=optional_zzz.text if optional_zzz is not None else None
        )

What I did end up finding in trying to exercise all of the corner cases that came to mind was that the XML serialization is still a bit rough. For example in the case of an optional list of literal types the generated XML ends up with a tag name of "list" - a workaround is to create a type alias to name the list element tags. Not a deal breaker but something I may still think on some more. The deserialization though seems to stand up to my testing and looks at least okay in the face of numerous None-checks. I remain convinced that the best way to tighten up Python day-to-day is to minimize the number of places where None creeps in.