I think that XML is cool. It works really well to define processes (like with the ANT build file), with certain data streams (like RSS or ATOM), and the idea of an open document format in XML is very appealing to me.
That said, it was refreshing to read that someone else is apprehensive about XML as a “big data” format.
The first time I used XML for a big datafeed, I was somewhat dismayed by the sheer number of extra characters that all the tags add to the data file. A simple “separated value” file is far more compact, needing only a “separator” between two datafields. Also, needing a special library to read the XML in an intelligent way is extraordinarily frustrating.
On the other hand, it’s really nice to have the data and its characteristics defined in one file. I like the ability to have multiple items of the same type for each data “row” (like an HTML list has multiple List Items, any XML data element can have multiple items). Tables aren’t the best way to represent all data, and XML allows us to skirt around the table metaphor if we choose (if only we could do the same thing with the file system filing cabinet metaphor! But I digress…)
Can’t we have a simple “delimited file” format that allows us to have multi-dimensional items, but has less of the overhead of the XML file? For example, have multiple delimiters so that data elements can have multiple items — here is an example of a 3-column dataset:
col1,subcol1;subcol2;subcol3,col3
This wouldn’t necessarily allow “infinite” dimensionality, but I bet there is a simple way to qualify delimiters to allow that. The other pain about this kind of format is the lack of formatting information within the file itself. The crappy part about loading a CSV file into Oracle is that you have to have a separate file that describes the format of each column and how many characters should be expected at max. (there are all sorts of caveats about storage size there, too, with extra bits for types and allocation block sizes)
I don’t know anything about the JSON format, and I don’t know anything about the SQLite format, which the author of the linked article previously implied was a better format for large datasets. I know I will start looking into them, though, as they might solve my problems. This all builds toward a larger problem that I have with data storage in general — some of my analysis gets complicated, and keeping the data “fresh” on a data stream becomes a difficult problem. More on this later.
XML definitely held sway for a while until people started bumping into the extra characters problem. Some people advocated using Binary XML where all the UTF-8 and ASCII was just converted over into a binary format in order to traverse the Interwebs easily. In that case the issue was packet size and speed of transmission. But for databases, Hmmm. Good Question. The DTD for a given document is supposed to hold all the schema stuff that the Oracle ‘external’ file is supposed to keep records for. In that way it gets schmematized once, then all your XML is just data with each field relating back to th DTD hosted centrally at the WC3 or your own webserver. That way everything is uniform and consistent.
Maybe what you need is an XSL Transform of what you get to map more easily into tables and rows. The real issue is Objects versus Table and Rows again. XML is more about objects, or flat files even whereas SQL is always tables and rows.
Comment by Eric Likness — 08-24-2009 @ 12:56 pm