Friday, March 5, 2010

JSON vs XML vs YAML, and Python Parsing

I've begun working on translating a board game into a computer game. In order to do so, I need to be able to represent the state of the game and its many (many) tokens. In addition, I need to be able to store and load that state quickly. Finally, I need that state to be something that I can represent with a single string.

EDIT: I added YAML testing data to the results. I'm not posting my original files, as I don't have a license to be converting this game, and don't want to trip any legal issues. Suffice to say that the names in the file give it away.

EDIT: I added ElementTree testing data to the results.

EDIT: I added marshal data to the results.

EDIT: I added cPickle data to the results.

Storing the state presents a simple hierarchical structure: Multiple decks of cards are shuffled, and the shuffled state must be saved. This allows for a web front end to the game. The cards are just one aspect: I have to store information about what cards the player has, what state the cards are in, where the player's token is, and what feels like about 500 pieces of information (no joke: I scanned 494 images for this game).
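As a sketch of what that hierarchy might look like (the field names here are illustrative stand-ins, not the actual game's schema), a nested dict serializes to the single string the web front end needs:

```python
import json

# Hypothetical sketch of the kind of nested state described above;
# the names are illustrative, not the real game's data.
state = {
    "decks": {
        "events": ["card-017", "card-003", "card-042"],  # shuffled order
        "items":  ["card-101", "card-099"],
    },
    "players": [
        {"name": "player1", "location": "start", "hand": ["card-101"]},
    ],
}

# A single string representation, as required for the web front end.
blob = json.dumps(state)
restored = json.loads(blob)
assert restored == state
```

The same structure maps naturally onto XML elements or YAML mappings as well, which is why all three formats are candidates.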

So, how do I represent this data? How do I make it so that a web server can load and save the state quickly? XML is good for hierarchical data, as are JSON and YAML. But which one is right this time?

I needed to test it out. So, I wrote three equivalent files, one in XML, one in JSON, one in YAML. I then ran the timeit module against those files, using the following commands:

  • python -m timeit -n 100 -s 'from xml.dom.minidom import parse' 'd=parse("allboard.xml")'
  • python -m timeit -n 100 -s 'from xml.etree.ElementTree import parse' 'd=parse("allboard.xml")'
  • python -m timeit -n 100 -s 'import simplejson' 'd=simplejson.load(open("allboard.json"))'
  • python -m timeit -n 100 -s 'import yaml; from yaml import CLoader as Loader' 'd=yaml.load(open("allboard.yaml"), Loader=Loader)'
  • python -m timeit -n 1000 -s 'from lxml import etree' 'd=etree.parse(open("allboard.xml"))'
  • python -m timeit -n 100 -s 'import marshal' 'd=marshal.load(open("allboard.marshal"))'
  • python -m timeit -n 100 -s 'from xml.etree.cElementTree import parse' 'd=parse("allboard.xml")'
  • python -m timeit -n 1000 -s 'from cPickle import load' 'd=load(open("allboard.pickle"))'
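The original data files aren't published (see above), but a stand-in version of the fixture setup might look like this; note that in Python 3, cPickle's functionality lives in the plain pickle module:

```python
import json, marshal, os, pickle, tempfile

# Stand-in for the real board state; the actual files are not published.
sample = {"decks": {"events": [3, 1, 4, 1, 5]}, "players": 2}
d = tempfile.mkdtemp()

with open(os.path.join(d, "allboard.json"), "w") as f:
    json.dump(sample, f)
with open(os.path.join(d, "allboard.marshal"), "wb") as f:
    marshal.dump(sample, f)
with open(os.path.join(d, "allboard.pickle"), "wb") as f:
    pickle.dump(sample, f)

# Each file round-trips back to the same structure.
assert json.load(open(os.path.join(d, "allboard.json"))) == sample
assert marshal.load(open(os.path.join(d, "allboard.marshal"), "rb")) == sample
assert pickle.load(open(os.path.join(d, "allboard.pickle"), "rb")) == sample
```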

The results were staggeringly different.

  • XML (minidom): 15.3ms per run
  • XML (ElementTree): 19.6ms per run
  • XML (cElementTree): 1.35ms per run
  • XML (lxml): 925usec per run
  • JSON (simplejson): 250usec per run
  • YAML (CLoader): 10.1ms per run (115ms per run without the CLoader)
  • marshal: 1.01ms per run
  • cPickle: 2.5ms per run

JSON wins this one hands down.

Is it possible I chose poorly? Of course. There are many XML parsers for Python, just as there are many JSON parsers. I wanted pure Python here, so as to maximize portability. I chose these modules since they should be completely standard. If I'm wrong, tell me, and I'll revisit this.


Aigars Mahinovs said...

Just a thought - YAML?

Michael Pedersen said...

Good question, actually. I had totally forgotten about YAML.

I used YAML's dump capability by loading the .json and dumping the resulting dict to YAML. I then ran the same sort of commands.
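The conversion itself is only a couple of lines. A sketch with stand-in data (PyYAML is a third-party package, so it's guarded here):

```python
import json

try:
    import yaml  # PyYAML; a third-party package
except ImportError:
    yaml = None

# Stand-in for the real allboard.json, which is not published.
doc = json.loads('{"decks": {"events": [3, 1, 4]}, "players": 2}')

if yaml is not None:
    text = yaml.safe_dump(doc)          # dump the loaded dict as YAML
    assert yaml.safe_load(text) == doc  # and it round-trips
```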

YAML was better than XML, but still loses big time to JSON.

I even modified the profiling commands so that the setup would only be run once (the "import xxx") as opposed to every time. Same results.
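An in-code equivalent of those modified commands uses the timeit module directly, keeping the imports and setup clearly outside the timed statement; the file here is a small stand-in, and the globals= parameter needs Python 3.5+:

```python
import json, os, tempfile, timeit

# Build a small stand-in file; the real allboard.json is not published.
path = os.path.join(tempfile.mkdtemp(), "allboard.json")
with open(path, "w") as f:
    json.dump({"decks": {"events": list(range(100))}}, f)

# Only the statement is timed; the module and path come in via globals,
# so no import cost is counted in any run.
t = timeit.timeit("json.load(open(path))",
                  globals={"json": json, "path": path},
                  number=100)
print("per run: %.6f sec" % (t / 100))
```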

Nate Cumpson said...

xml.etree.ElementTree will yield better results than minidom. Here's a link with a benchmark test:

It would be interesting to see which C module performs best for parsing each format. I would assume their results converge across different sample files.

I still prefer JSON for document storage and its other web perks.

Michael Pedersen said...

I'd never used ElementTree, and was unsure of how to use it. Turns out it's quite easy. My testing, though, shows that I'll stick with minidom from now on: at 19.6ms per run, ElementTree is the worst performer of them all.
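For reference, a minimal ElementTree usage sketch against a stand-in document (the real allboard.xml is not published):

```python
import xml.etree.ElementTree as ET

# A tiny stand-in document with the same shape as the game state.
xml_text = """<board>
  <deck name="events">
    <card id="card-017"/>
    <card id="card-003"/>
  </deck>
</board>"""

root = ET.fromstring(xml_text)
# iter() walks the tree; get() reads an attribute.
ids = [card.get("id") for card in root.iter("card")]
assert ids == ["card-017", "card-003"]
```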

Ian J said...

How about cElementTree and lxml? Both should be worlds faster than minidom and vanilla ElementTree.

Michael Pedersen said...

I've added the numbers for both of them, and JSON still wins. lxml is the fastest of the XML parsers (though I'm not fond of it, due to issues with installing it in a virtualenv; yes, it can be done, but it's not a straightforward "easy_install"), and it still takes almost four times as long to parse XML as it does to parse JSON.

XML has uses, undoubtedly. High speed parsing doesn't appear to be one of them.

jonjon said...

Also, for load/dump there is the marshal format (unsafe in the sense that the format may change between Python versions)...

You could add tests for it too :) (It is really fast)
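A quick illustration of marshal's round trip, with the caveat spelled out in a comment (the data here is a stand-in):

```python
import marshal

# marshal handles containers of built-in types, but the format is
# CPython-internal and may change between Python versions, so it only
# suits short-lived data read back by the interpreter that wrote it.
state = {"decks": {"events": [3, 1, 4, 1, 5]}, "round": 2}

blob = marshal.dumps(state)
assert marshal.loads(blob) == state
```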

Aigars Mahinovs said...

I did some tests for my own needs, and it was much, much faster to save/load data if I saved into a CSV text file and just parsed it like this:

data = {}
for line in open('file.csv'):
    key, value = line.strip().split(',', 1)
    data[key] = value

It was several times faster than JSON for large data sets (tens of MB of CSV).
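For the record, the csv module handles the same key,value scheme and also copes with quoted values; a sketch with inline stand-in data:

```python
import csv, io

# Stand-in data: the first value contains a comma. A plain
# split(',', 1) would keep the literal quote characters around "1,2";
# csv.reader strips them and parses the row correctly.
text = 'a,"1,2"\nb,3\n'
data = {key: value for key, value in csv.reader(io.StringIO(text))}
assert data == {"a": "1,2", "b": "3"}
```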

Michael Pedersen said...

marshal: For the data I've got, it turns out marshal was slower. It might be faster in other cases. My particular case has deeply nested dicts.

CSV: For something with nested data structures, CSV is a bad choice. For data which is tabular, JSON is a bad choice. In my case, I've got deeply nested dicts, and needed something much better than CSV. JSON wins.
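To make the contrast concrete: squeezing nested values into a key,value CSV means inventing an encoding for the values, typically JSON itself, at which point plain JSON is simpler. A sketch with stand-in data:

```python
import csv, io, json

# Hypothetical nested state; names are illustrative.
nested = {"player1": {"hand": ["card-101"]}, "player2": {"hand": []}}

# Flatten to CSV rows by JSON-encoding each nested value.
buf = io.StringIO()
writer = csv.writer(buf)
for key, value in nested.items():
    writer.writerow([key, json.dumps(value)])

# Reading it back requires decoding every value again.
buf.seek(0)
restored = {k: json.loads(v) for k, v in csv.reader(buf)}
assert restored == nested
```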

Joseph Turian said...

I wrote this blog post, comparing the speed of different data deserialization libraries in Python.