Rounding out my last entry with one final thought. How simple can you make the deserialization of XML into Python objects? How clunky is it in practice?
My XML serialization is intentionally limited: it tries to reflect the structure and names of my preferred sorts of data objects. It doesn't do much and tries to be about as obvious as possible. The intended means of extending it is through singledispatch methods on new non-dataclass object types.
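As a rough illustration of that extension point, here is a minimal sketch of a singledispatch hook. The name to_serializable matches the function used by the serializer; the Money class and its handler ts_money are hypothetical, invented only for this example.

```python
import functools


@functools.singledispatch
def to_serializable(value):
    # Fallback used for any type without a registered handler.
    return str(value)


class Money:
    # Hypothetical non-dataclass type, made up for this sketch.
    def __init__(self, amount, currency):
        self.amount = amount
        self.currency = currency


@to_serializable.register
def ts_money(value: Money):
    # singledispatch picks this handler from the annotation
    # on the first argument.
    return f"{value.amount} {value.currency}"


print(to_serializable(Money(5, "USD")))  # 5 USD
print(to_serializable(42))               # 42
```

Types without a registered handler simply fall through to str(), so the extension stays opt-in.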
I spent a while thinking up the most generic thing I could think of: an abstract XML deserializer that "does the right thing" in all cases. There are numerous libraries that attempt the same, and while I was browsing around some I got to wondering whether such flexibility really fits my own preferences for obvious code. Where I had done a modest amount of munging through type hints to derive XML tag names, I found the idea of trying to reverse the process almost too surprising. Taking a step back, I got to thinking about my usual struggles with data exchange: fields being renamed, types being altered or defined loosely, too much magic.
With these things in mind then I imagined instead the simplest thing that could possibly work. What I dreamed up was something like this:
@dataclasses.dataclass
class Address:
    street: str
    city: str
    zipcode: str

def from_xml(Address, xml_element):
    return Address(
        street=xml_element.find('street').text,
        city=xml_element.find('city').text,
        zipcode=xml_element.find('zipcode').text
    )
There's some obvious duplication between the attributes of the class and the "find" calls, but it is also exceedingly obvious how the deserialization happens this way. Additionally, it makes it easy to cast the textual values from XML into their appropriate type (int, float, UUID). Having written another free-floating function to deal with XML (the first being to_xml), I got the feeling I was looking at a nice place to extract some functionality. The question became how I'd prefer to annotate these XML-capable dataclasses. I decided (for now) on a class decorator and to include the from_xml method on the class to keep things tidy. The result then looks like this:
@xml_serializable
@dataclasses.dataclass
class Address:
    street: str
    city: str
    zipcode: str

    @classmethod
    def from_xml(cls, elem):
        return cls(
            street=elem.find('street').text,
            city=elem.find('city').text,
            zipcode=elem.find('zipcode').text
        )
The decorator @xml_serializable is part of a module defined like this:
def from_xml(cls, xml_element):
    raise NotImplementedError(f"{cls} lacks a from_xml method!")

def from_xml_string(cls, xml_string):
    root = xml.etree.ElementTree.fromstring(xml_string)
    return cls.from_xml(root)

def xml_serializable(cls):
    cls.to_xml = to_xml
    if not hasattr(cls, 'from_xml'):
        cls.from_xml = classmethod(from_xml)
    cls.from_xml_string = classmethod(from_xml_string)
    return cls
I've omitted to_xml for brevity, but it is copied from the last post almost directly. I did end up changing the return value from an xml.dom.minidom.Document() to the text-formatted version by adding document.toxml(), since I was already doing that by hand.
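To make the difference between the two return values concrete, here is a small sketch using only the standard library. The tag names are borrowed from the Address example; the variable names are just for illustration.

```python
import xml.dom.minidom
import xml.etree.ElementTree

# Build a tiny document by hand, the same way a serializer would.
document = xml.dom.minidom.Document()
root = document.createElement("address")
document.appendChild(root)
street = document.createElement("street")
street.appendChild(document.createTextNode("321 Main St"))
root.appendChild(street)

# toxml() turns the Document object into a plain string.
xml_text = document.toxml()
print(xml_text)
# <?xml version="1.0" ?><address><street>321 Main St</street></address>

# The string can be handed straight to ElementTree for parsing.
parsed = xml.etree.ElementTree.fromstring(xml_text)
print(parsed.find("street").text)  # 321 Main St
```

Returning the string means callers never need to touch the minidom object at all.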
With this class decorator, serialization should mostly just work out of the box, and deserialization requires only that a class implements from_xml(cls, xml_element). Rather than require preprocessing of XML strings into documents, each class gains the from_xml_string method without any further work. This separation between the string and element handling allows composite objects to compose nicely:
@xml_serializable
@dataclasses.dataclass
class Person:
    name: str
    age: int
    address: Address

    @classmethod
    def from_xml(cls, elem):
        return cls(
            name=elem.find('name').text,
            age=int(elem.find('age').text),
            address=Address.from_xml(elem.find('address'))
        )
Notice the cast of "age" to an integer. It was surprisingly annoying to work out as much using introspection of the type or type hints, enough so that I am actually more convinced this is a good idea for how simple it is. Finally, it is worth touching on how lists are handled, since that induced some complexity for serializing. It is actually no more difficult than the prior example: you just write more Python code and rely on from_xml being implemented!
@xml_serializable
@dataclasses.dataclass
class Example:
    persons: list[Person]

    @classmethod
    def from_xml(cls, elem):
        return cls(
            persons=[Person.from_xml(p)
                     for p in
                     elem.find('persons').findall('person')]
        )
The calls to from_xml might get marginally more complicated in the face of optional tags, but anyone comfortable with Python is probably writing their fair share of if foo is not None already.
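One way to keep those None-checks from piling up is a tiny helper. This is only a sketch: optional_text is a hypothetical name, not part of the module, and the sample XML is made up.

```python
import xml.etree.ElementTree as ET


def optional_text(elem, tag, convert=str):
    # Hypothetical helper: return None when the tag is absent,
    # otherwise the converted text of the first matching child.
    child = elem.find(tag)
    # Compare against None explicitly; an Element without
    # children can test as false even though it exists.
    if child is None:
        return None
    return convert(child.text)


root = ET.fromstring("<person><name>Ben</name><age>30</age></person>")
print(optional_text(root, "age", int))  # 30
print(optional_text(root, "nickname"))  # None
```

The sketch does not handle an empty tag such as `<age/>`, whose text is None; that case would need its own check before converting.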
Finally, to demonstrate the full thing using the above classes, it is possible to round-trip an XML string into Python, back to XML, and back into Python, just because we can!
example_xml = '''
<example>
  <persons>
    <person>
      <name>Alyssa P. Hacker</name>
      <age>35</age>
      <address>
        <street>321 Main St</street>
        <city>New Mexico</city>
        <zipcode>99999</zipcode>
      </address>
    </person>
    <person>
      <name>Ben Bitdiddle</name>
      <age>30</age>
      <address>
        <street>123 Main St</street>
        <city>New York</city>
        <zipcode>10001</zipcode>
      </address>
    </person>
  </persons>
</example>
'''
e = Example.from_xml_string(example_xml)
ee = e.to_xml()
eee = Example.from_xml_string(ee)
Example(persons=[Person(name='Alyssa P. Hacker',
                        age=35,
                        address=Address(street='321 Main St',
                                        city='New Mexico',
                                        zipcode='99999')),
                 Person(name='Ben Bitdiddle',
                        age=30,
                        address=Address(street='123 Main St',
                                        city='New York',
                                        zipcode='10001'))])
assert e == eee
I am really rather pleased with how this turned out. While I was a little unsure about the real utility in constructing my own XML serializer, I am already thinking of places I might use this pattern, or at least demonstrate it to others for the kinds of complexity tradeoffs I find palatable. The serializer, deserializer class decorator rigging and "extension points" end up being fewer than 70 lines of code:
import functools
import dataclasses
import typing
import xml.dom.minidom
import xml.etree.ElementTree


@functools.singledispatch
def to_serializable(value):
    return str(value)


def to_xml(obj):
    def build(parent, obj, type_hint=None):
        if dataclasses.is_dataclass(obj):
            for key, value in dataclasses.asdict(obj).items():
                tag = document.createElement(key)
                parent.appendChild(tag)
                build(tag, value, typing.get_type_hints(type(obj)).get(key))
        elif isinstance(obj, list):
            elem_type = None
            if type_hint and hasattr(type_hint, '__args__'):
                elem_type = type_hint.__args__[0]
                tag_name = elem_type.__name__.lower()
            elif type_hint:
                tag_name = type_hint.__name__.lower()
            else:
                tag_name = 'value'
            for elem in obj:
                tag = document.createElement(tag_name)
                parent.appendChild(tag)
                build(tag, elem, elem_type)
        elif isinstance(obj, dict):
            for key, value in obj.items():
                tag = document.createElement(key)
                parent.appendChild(tag)
                build(tag, value, typing.get_type_hints(type(obj)).get(key))
        else:
            data = to_serializable(obj)
            tag = document.createTextNode(data)
            parent.appendChild(tag)

    document = xml.dom.minidom.Document()
    root_tag_name = type(obj).__name__.lower()
    root = document.createElement(root_tag_name)
    document.appendChild(root)
    build(root, obj, typing.get_type_hints(type(obj)).get('obj'))
    return document.toxml()


def from_xml(cls, xml_element):
    raise NotImplementedError(f"{cls} lacks a from_xml method!")


def from_xml_string(cls, xml_string):
    root = xml.etree.ElementTree.fromstring(xml_string)
    return cls.from_xml(root)


def xml_serializable(cls):
    cls.to_xml = to_xml
    if not hasattr(cls, 'from_xml'):
        cls.from_xml = classmethod(from_xml)
    cls.from_xml_string = classmethod(from_xml_string)
    return cls
As a final sanity check on whether I was backing myself into a corner, I also tried out two different cases, representative of the opposite ends of a spectrum I often find myself working in. The first is how easily the serialization can be made to accommodate arbitrary regular Python classes. Of course, most classes of this sort don't tend to have sufficient information to entirely reconstruct them, but if you imagine you at least have a database key or other definite reference it seems to hold up, even if additional handling is needed to produce a fully instantiated object:
@xml_serializable
class FooBar:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

    def __repr__(self):
        return f'FooBar X:{self.x} Y:{self.y} Z:{self.z}'

    @classmethod
    def from_xml(cls, elem):
        intermediate = elem.text
        x, y, z = intermediate.split('-')
        return cls(x, y, z)

@to_serializable.register
def ts_foobar(value: FooBar):
    return f'{value.x}-{value.y}-{value.z}'

f = FooBar(1,2,3)
print(f.to_xml())  # <?xml version="1.0" ?><foobar>1-2-3</foobar>
FooBar.from_xml_string(f.to_xml())  # FooBar X:1 Y:2 Z:3
The other case worth at least testing is the possibility of complex types, to ensure information doesn't get lost in serializing and deserializing.
import datetime
import enum
import uuid

@to_serializable.register
def ts_datetime(value: datetime.datetime):
    return value.isoformat()

type Nonsense = typing.Literal["foo", "bar", "baz"]
type X = typing.Literal["X"]
type Y = typing.Literal["y"]

class Quality(enum.StrEnum):
    A = enum.auto()
    B = enum.auto()
    C = enum.auto()

@xml_serializable
@dataclasses.dataclass
class Thing:
    id: uuid.UUID
    date: datetime.datetime
    quality: typing.Optional[Quality] = None
    comments: list[Nonsense] = dataclasses.field(default_factory=list)
    zzz: typing.Optional[X|Y] = None

    @classmethod
    def from_xml(cls, elem):
        # Optional fields: find() returns None when the tag is absent.
        optional_quality = elem.find('quality')
        optional_zzz = elem.find('zzz')
        return cls(
            # Cast the text back to a UUID to match the annotation.
            id=uuid.UUID(elem.find('id').text),
            date=datetime.datetime.fromisoformat(elem.find('date').text),
            quality=Quality(optional_quality.text) if optional_quality is not None else None,
            comments=[e.text for e in elem.find('comments').findall('nonsense')],
            zzz=optional_zzz.text if optional_zzz is not None else None
        )
What I did end up finding, in trying to exercise all of the corner cases that came to mind, was that the XML serialization is still a bit rough. For example, in the case of an optional list of literal types, the generated XML ends up with a tag name of "list"; a workaround is to create a type alias to name the list element tags. Not a deal breaker, but something I may still think on some more. The deserialization, though, seems to stand up to my testing and looks at least okay in the face of numerous None-checks. I remain convinced that the best way to tighten up Python day-to-day is to minimize the number of places where None creeps in.