
dataclasses and XML

2023-07-01

I don't even remember what set me off about Python's data validation ecosystem recently. Probably some code I maintain was broken yet again by a package upgrade required by code sharing across projects. Without delving into the specifics of any one package, I will say I feel more and more adamant that third-party dependencies are more of a liability than any great boon. Saving some typing doesn't count for much on the timescale of years of maintenance. As a result I tried drumming up a list of things I would like to do with data objects in Python and how validation is handled with them.

The list I came up with is something like:

- plain classes that act as explicit data transfer objects, kept separate from business logic
- straightforward serialization to and from common formats
- validation against a schema that can be shared independently of any code
- little to no reliance on third-party dependencies

I swear I don't think I am asking for much. I like the Python standard library just fine and don't like pulling in very many dependencies. As a result I spent some time investigating my options using just the standard library.

I don't have a specific example to share from my current projects, but the following is demonstrative of the scale I tend to work at: a few obviously named classes exposing compound types built up from simple types. These are the basic building blocks of most of my Python programming; these sorts of things end up shuffled between several computers in distributed systems. Key to much of what I like to do is defining explicit data transfer objects, rather than serializing business domain objects directly. I encounter a significant amount of resistance to this idea, and I think it tends to come across as an enterprise programming idea — which it is — but it also works pretty well. Having dedicated classes that you can modify and format separately from what your program sends and receives is an easy way to introduce seams in legacy code.

Below are a few I've made up. You'll have to imagine a more productive context for them to be used in, I'm only showing how to shuffle them around:

import dataclasses
import datetime
import uuid

@dataclasses.dataclass
class Point:
    when: datetime.datetime
    latitude: float
    longitude: float

@dataclasses.dataclass
class Path:
    identifier: uuid.UUID
    points: list[Point]

@dataclasses.dataclass
class Collection:
    note: str
    paths: list[Path]

Serialization

Python's dataclasses are easily converted into dictionaries with the asdict function. It works like you would expect:

>>> dataclasses.asdict(Collection(note="free-form text etc. ad nauseam", paths=[]))
{'note': 'free-form text etc. ad nauseam', 'paths': []}

With Python dictionaries it is possible to serialize to JSON with a simple json.dumps() call.

>>> my_collection = Collection(note="free-form text etc. ad nauseam", paths=[])
>>> my_collection_dict = dataclasses.asdict(my_collection)
>>> json.dumps(my_collection_dict)
'{"note": "free-form text etc. ad nauseam", "paths": []}'

This is key to encoding dataclasses using the Python standard library's JSON encoder. You have to subclass JSONEncoder and special-case instances of dataclasses to call the asdict function:

import json

class CustomJSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if dataclasses.is_dataclass(obj):
            return dataclasses.asdict(obj)
        return super().default(obj)

Of course, the JSON encoder doesn't support many types out of the box; UUIDs and datetime objects, for example. Encoding an instance of the compound type Collection that I made above requires several more special cases added to the custom encoder:

class CustomJSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, datetime.datetime):
            return str(obj)
        if isinstance(obj, uuid.UUID):
            return str(obj)
        if dataclasses.is_dataclass(obj):
            return dataclasses.asdict(obj)
        return super().default(obj)

document = Collection(note="<strong>very interesting</strong> series of data here",
                      paths=[
                          Path(identifier=uuid.uuid4(),
                               points=[
                                   Point(when=datetime.datetime.now(datetime.timezone.utc),
                                         latitude=40.884514,
                                         longitude=-73.419167),
                                   Point(when=datetime.datetime.now(datetime.timezone.utc),
                                         latitude=0.0,
                                         longitude=0.0)])])


>>> print(json.dumps(document, cls=CustomJSONEncoder, indent=4))

{
    "note": "<strong>very interesting</strong> series of data here",
    "paths": [
        {
            "identifier": "857a2fea-880b-40fb-bf6b-1c9edf2a3dc9",
            "points": [
                {
                    "when": "2023-07-02 03:48:35.446977+00:00",
                    "latitude": 40.884514,
                    "longitude": -73.419167
                },
                {
                    "when": "2023-07-02 03:48:35.446982+00:00",
                    "latitude": 0.0,
                    "longitude": 0.0
                }
            ]
        }
    ]
}

So far, so easy. Patching up the JSON encoder for three separate data types right out of the gate got me thinking, though, about how much work would be involved in targeting a different serialization format instead. Rather than picking something fashionable I tried to keep my motivations in mind and dug through the standard library to find XML.

No wait — don't go! I'm only serious.

In the same way I find myself annoyed when something like Pydantic breaks my code because of transitive dependencies, I don't like breaking interfaces if I don't have to. In pursuit of better asserting that I haven't broken my own APIs, I like sharing schemas rather than sharing datatypes/code in libraries. With JSON there's the option of JSON Schema, which has always felt a bit bolted-on to me. It seems to work, but I haven't used an implementation that left me hugely satisfied. What I unironically remember finding perfectly serviceable is XSD. I don't think it changes much, and you can pull documents and schemas from 20 years ago and expect them to work. That's about the level of boring technology I strive for.

Starting from "I might like to use XSD" is a bit backwards though; first I need to find out how hard it will be to turn my data transfer objects into XML. The bar is pretty high given how simple it was to encode to JSON, despite the need to define a few custom type encodings. The first hurdle is actually picking which XML library to use, as there are a few widely available ones for Python. I am specifically looking to avoid the helper libraries that seem to litter StackOverflow answers. I don't have much faith that they'll be maintained in the long term. I'd much rather just write something myself so I'll at least know exactly how it works.

I ended up picking xml.dom.minidom for having one of the simpler APIs. I think libraries like lxml probably expose more knobs and better leverage the full XML specification, but what I'm looking to do is very bare bones and minidom seems to fit the bill.

Here is a small XML builder function I've written to try exercising my potential APIs.

import dataclasses
from xml.dom.minidom import Document

def to_xml(obj) -> Document:
    def build(parent, obj):
        if dataclasses.is_dataclass(obj):
            # One child element per field, named after the field.
            for field in dataclasses.fields(obj):
                tag = document.createElement(field.name)
                parent.appendChild(tag)
                build(tag, getattr(obj, field.name))
        elif isinstance(obj, list):
            # List items are named after the type of each element.
            for elem in obj:
                tag = document.createElement(type(elem).__name__.lower())
                build(tag, elem)
                parent.appendChild(tag)
        elif isinstance(obj, dict):
            for key, value in obj.items():
                tag = document.createElement(key)
                parent.appendChild(tag)
                build(tag, value)
        else:
            # Everything else is rendered as text content.
            parent.appendChild(document.createTextNode(str(obj)))

    document = Document()
    if dataclasses.is_dataclass(obj):
        # A minidom Document may only hold one root element, so wrap the
        # top-level dataclass in an element named after its class.
        root = document.createElement(type(obj).__name__.lower())
        document.appendChild(root)
        build(root, obj)
    else:
        build(document, obj)
    return document

Running it over my nested dataclasses example object demonstrates that it handles the custom types without further fuss, and I don't actually find the result "noisy" in the way so much XML is justifiably maligned.

>>> doc = to_xml(document)
>>> print(doc.toprettyxml(indent="  "))

<?xml version="1.0" ?>
<collection>
  <note>&lt;strong&gt;very interesting&lt;/strong&gt; series of data here</note>
  <paths>
    <path>
      <identifier>857a2fea-880b-40fb-bf6b-1c9edf2a3dc9</identifier>
      <points>
        <point>
          <when>2023-07-02 03:48:35.446977+00:00</when>
          <latitude>40.884514</latitude>
          <longitude>-73.419167</longitude>
        </point>
        <point>
          <when>2023-07-02 03:48:35.446982+00:00</when>
          <latitude>0.0</latitude>
          <longitude>0.0</longitude>
        </point>
      </points>
    </path>
  </paths>
</collection>

There are a few gotchas that might prove annoying or untenable for larger uses; it is too early for me to say whether things are quite as simple as they seem. Most notable is perhaps the lack of attributes on any of the elements. I haven't yet thought of what I would use them for in the work I am doing, so their absence hasn't really been missed yet. Second is the tag naming: I actually rather like how simple it is to map the dataclass attributes to tags, and I've tried to maintain that with lists by introspecting the type of the list elements to name the list item tags. This method doesn't fare so well with amorphous data types like plain dictionaries, though. I don't really care much, because passing around bare dictionaries has been the bane of my existence in so much legacy code. Just specify your interfaces!

>>> d = {'what_is_this': 'oh it is a string', 'whatAboutThis': 42, 'some_data': [1,2,3,4], 'catchPhrases': ["wow", "oh boy"]}
>>> print(to_xml({'garbage': d}).toprettyxml(indent="  "))
<?xml version="1.0" ?>
<garbage>
  <what_is_this>oh it is a string</what_is_this>
  <whatAboutThis>42</whatAboutThis>
  <some_data>
    <int>1</int>
    <int>2</int>
    <int>3</int>
    <int>4</int>
  </some_data>
  <catchPhrases>
    <str>wow</str>
    <str>oh boy</str>
  </catchPhrases>
</garbage>

The bare type names inside of lists are odd, but the fix is pretty simple: write wrapper types that better describe the data. I haven't looked too deeply into what is output when things like bare dictionaries are nested within each other; I don't imagine it is anything terribly useful. My intention is to create a transfer class with appropriately named attributes, as sketched below.
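As a rough illustration, a small wrapper dataclass gives the list items a sensible tag name. The CatchPhrase class here is hypothetical, invented just for this example:

@dataclasses.dataclass
class CatchPhrase:
    phrase: str

# A list of CatchPhrase instances now serializes as
# <catchphrase><phrase>wow</phrase></catchphrase> instead of <str>wow</str>.

Below then are two different schemas that will validate my XML document above. The first is a DTD: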

<!ELEMENT collection (note, paths)>
<!ELEMENT note (#PCDATA)>
<!ELEMENT paths (path*)>
<!ELEMENT path (identifier, points)>
<!ELEMENT identifier (#PCDATA)>
<!ELEMENT points (point*)>
<!ELEMENT point (when, latitude, longitude)>
<!ELEMENT when (#PCDATA)>
<!ELEMENT latitude (#PCDATA)>
<!ELEMENT longitude (#PCDATA)>

The second is XSD. XSD is much more expansive, but it seems better able to express things like numeric types, and it can do regular expressions, sequences of bounded sizes, and so on. Reading about it, I understand it is (or was) contentious for being so sprawling.

<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="collection" type="Collection"/>

  <xsd:complexType name="Collection">
    <xsd:sequence>
      <xsd:element name="note"   type="xsd:string"/>
      <xsd:element name="paths"  type="Paths"/>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:complexType name="Paths">
    <xsd:sequence>
      <xsd:element name="path" minOccurs="0" maxOccurs="unbounded">
        <xsd:complexType>
          <xsd:sequence>
	    <xsd:element name="identifier" type="xsd:string"/>
	    <xsd:element name="points"     type="Points"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>

  <xsd:complexType name="Points">
    <xsd:sequence>
      <xsd:element name="point" minOccurs="0" maxOccurs="unbounded">
        <xsd:complexType>
          <xsd:sequence>
	    <xsd:element name="when"      type="xsd:dateTime"/>
	    <xsd:element name="latitude"  type="xsd:decimal"/>
	    <xsd:element name="longitude" type="xsd:decimal"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>
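Actually validating a document against this schema is one place the standard library falls short; it ships no validating parser. Checking would take something like lxml, which I'm otherwise avoiding. A minimal sketch, assuming the schema and document are saved as collection.xsd and collection.xml (hypothetical file names):

from lxml import etree

schema = etree.XMLSchema(etree.parse("collection.xsd"))
document = etree.parse("collection.xml")

# validate() returns a boolean; assertValid() raises DocumentInvalid instead.
if not schema.validate(document):
    print(schema.error_log)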

I admittedly pulled a fast one on you, if you read this far. In tightening up my schema for the XSD definition, I realized my datetime string was almost, but not quite, an ISO 8601 formatted string. I was calling str() on the datetime object and ended up with something like 2023-07-02 03:48:35.446977+00:00, but the XSD validation complained it was not a valid dateTime, which looks instead like 2023-07-02T03:48:35.446977+00:00.

That format is available from the isoformat method of the datetime class, and while it is of course possible to extend the XML serialization function to special-case types the way the JSON encoder does, I thought a bit and figured it might make more sense to simply define a custom type that does the "right thing":

class ISO8601Datetime(datetime.datetime):
    def __str__(self):
        return self.isoformat()
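
A quick check at the REPL, reusing the timestamp from earlier for illustration, shows the T separator now appears:

>>> when = ISO8601Datetime(2023, 7, 2, 3, 48, 35, 446977, tzinfo=datetime.timezone.utc)
>>> str(when)
'2023-07-02T03:48:35.446977+00:00'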

The fact that I was able to catch the issue at all makes this feel like the right solution for the problem at hand, even if it is a bit more typing. Also encouraging: because ISO8601Datetime is still a datetime subclass, it passes the JSON encoder's isinstance check, so the JSON-encoded datetimes conform to ISO 8601 as well.

Thoughts

I don't know that XML is really the right solution for any of my problems. I appreciate that XML has had its moment in the sun and there is not a ton of innovation happening in the space. This appeals to me because I don't think anyone is going to break all of my things when they decide to change an interpretation of the specifications or make a breaking change to the APIs that have been in use for more than 20 years.

Dataclasses are newer, but their inclusion in the standard library makes me think they're pretty safe for a while at least. Fundamentally they don't do much that you couldn't do with plain old classes; it is all just convenience features. The approach of defining a schema separate from the logic of the system and building DTOs to match it still feels pretty good. Schema validation goes a long way toward replacing what would otherwise be a lot of integration tests.

Going Overboard?

Of course, once I start down a path I have a tendency to overdo things. Looking through the XSD documentation I noticed it is possible to define reasonably complex types. Take, for example, my latitude and longitude values; you could create a new type with bounds checking like this:

  <xsd:simpleType name="latitude">
    <xsd:restriction base="xsd:decimal">
      <xsd:minInclusive value="-90.0"/>
      <xsd:maxInclusive value="90.0"/>
    </xsd:restriction>
  </xsd:simpleType>
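
Using the new type is then just a matter of pointing the element at it instead of xsd:decimal, something like:

  <xsd:element name="latitude" type="latitude"/>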

That's neat and totally sensible for this type. Could it be useful to implement those same bounds in the Python code? Let's find out!

Browsing the dataclasses documentation, I read that it is possible to annotate a class attribute with a custom descriptor. I've read about descriptors before but don't tend to write them very often; now seems like the perfect opportunity to try one out. There is a number-validating descriptor presented in the documentation which is a good place to start:

from abc import ABC, abstractmethod

class Validator(ABC):
    def __set_name__(self, owner, name):
        self.private_name = '_' + name

    def __get__(self, obj, objtype=None):
        return getattr(obj, self.private_name)

    def __set__(self, obj, value):
        self.validate(value)
        setattr(obj, self.private_name, value)

    @abstractmethod
    def validate(self, value):
        pass

class Number(Validator):
    def __init__(self, minvalue=None, maxvalue=None):
        self.minvalue = minvalue
        self.maxvalue = maxvalue

    def validate(self, value):
        if not isinstance(value, (int, float)):
            raise TypeError(f'Expected {value!r} to be an int or float')
        if self.minvalue is not None and value < self.minvalue:
            raise ValueError(
                f'Expected {value!r} to be at least {self.minvalue!r}'
            )
        if self.maxvalue is not None and value > self.maxvalue:
            raise ValueError(
                f'Expected {value!r} to be no more than {self.maxvalue!r}'
            )

Just some basic indirection to validate things before setting internal attributes. Where I previously wrote my Point class using two bare floats named latitude and longitude, I instead want a single type with the appropriate bounds checking: ±90° for latitude and ±180° for longitude. Here is one way to do just that:

class Latitude(Number):
    def __init__(self):
        super().__init__(-90.0, 90.0)

class Longitude(Number):
    def __init__(self):
        super().__init__(-180.0, 180.0)

@dataclasses.dataclass
class Location:
    latitude: Latitude = Latitude()
    longitude: Longitude = Longitude()

It works just like I had hoped:

>>> Location(latitude=40.884514, longitude=73.419167)
Location(latitude=40.884514, longitude=73.419167)

>>> Location(latitude=40.884514, longitude=273.419167)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 4, in __init__
  File "/tmp/example.py", line 50, in __set__
    self.validate(value)
  File "/tmp/example.py", line 71, in validate
    raise ValueError(
ValueError: Expected 273.419167 to be no more than 180.0

Dropping my new Location dataclass into my previous example serializes in an unsurprising way, and I'm free to amend my schema using the type I outlined above.
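For reference, the revised Point is reconstructed here from the output that follows; it swaps the two bare floats for a Location and uses the ISO8601Datetime type from earlier:

@dataclasses.dataclass
class Point:
    when: ISO8601Datetime
    location: Location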

<?xml version="1.0" ?>
<collection>
  <note>&lt;strong&gt;very interesting&lt;/strong&gt; series of data here</note>
  <paths>
    <path>
      <identifier>fb3dc1cd-563f-41c1-b391-61ae4ffe7b50</identifier>
      <points>
        <point>
          <when>2023-07-02T04:01:52.101247+00:00</when>
          <location>
            <latitude>40.884514</latitude>
            <longitude>-73.419167</longitude>
          </location>
        </point>
        <point>
          <when>2023-07-02T04:01:52.101263+00:00</when>
          <location>
            <latitude>0.0</latitude>
            <longitude>0.0</longitude>
          </location>
        </point>
      </points>
    </path>
  </paths>
</collection>

This has been a pleasant experience, and I am much more comfortable with dataclasses and custom serialization as a result. I think, however, there is a small chance of implementing this level of strictness in most Python projects. It is sufficiently unlike how most systems I have worked in tend to be designed that it might well be a non-starter. People just really love slinging around bare dictionaries serialized to JSON and patching over bad data on receipt. I should probably look more into JSON Schema; it seems to have a better chance of being used in new projects, and despite yet more wacky syntax it at least appears to support the same level of detail that I've found use for in XSD.
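For comparison, my understanding is that the latitude bounds from above would look something like this in JSON Schema (untested on my part):

{
    "type": "number",
    "minimum": -90.0,
    "maximum": 90.0
}

At least my dataclasses and custom validators won't have to change if that day ever comes.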