  1. XML and Data

    08-24-2009 by dan

    I think that XML is cool.  It works really well to define processes (like with the ANT build file), with certain data streams (like RSS or ATOM), and the idea of an open document format in XML is very appealing to me.

    That said, it was refreshing to read that someone else is apprehensive about XML as a “big data” format.

    The first time I used XML for a big datafeed, I was somewhat dismayed by the sheer number of extra characters that all the tags add to the data file.  A simple “separated value” file is far more compact, needing only a “separator” between two datafields.  Also, needing a special library to read the XML in an intelligent way is extraordinarily frustrating.

    On the other hand, it’s really nice to have the data and its characteristics defined in one file.  I like the ability to have multiple items of the same type for each data “row” (like an HTML list has multiple List Items, any XML data element can have multiple items).  Tables aren’t the best way to represent all data, and XML allows us to skirt around the table metaphor if we choose (if only we could do the same thing with the file system filing cabinet metaphor!  But I digress…)

    Can’t we have a simple “delimited file” format that allows us to have multi-dimensional items, but has less of the overhead of the XML file?  For example, have multiple delimiters so that data elements can have multiple items — here is an example of a 3-column dataset:

    col1,subcol1;subcol2;subcol3,col3

    This wouldn’t necessarily allow “infinite” dimensionality, but I bet there is a simple way to qualify delimiters to allow that.  The other pain about this kind of format is the lack of formatting information within the file itself.  The crappy part about loading a CSV file into Oracle is that you have to have a separate file that describes the format of each column and how many characters should be expected at max.  (there are all sorts of caveats about storage size there, too, with extra bits for types and allocation block sizes)
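
    As a rough illustration of the two-delimiter idea, here is a minimal Python sketch. The function names parse_row and format_row are made up for this example, and it assumes the delimiters never appear inside the data itself (there is no quoting or escaping).

```python
def parse_row(line, col_sep=",", sub_sep=";"):
    """Split one line into columns; a column containing sub_sep becomes a list."""
    row = []
    for field in line.split(col_sep):
        if sub_sep in field:
            row.append(field.split(sub_sep))
        else:
            row.append(field)
    return row


def format_row(row, col_sep=",", sub_sep=";"):
    """Inverse of parse_row: join nested items with sub_sep, columns with col_sep."""
    parts = []
    for item in row:
        if isinstance(item, list):
            parts.append(sub_sep.join(item))
        else:
            parts.append(item)
    return col_sep.join(parts)


row = parse_row("col1,subcol1;subcol2;subcol3,col3")
print(row)  # ['col1', ['subcol1', 'subcol2', 'subcol3'], 'col3']
```

    One limitation shows up right away: a column holding a single sub-item looks exactly like a plain column, so some kind of qualifier would still be needed to tell the two apart.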

    I don’t know anything about the JSON format, and I don’t know anything about the SQLite format, which the author of the linked article previously implied was a better format for large datasets.  I know I will start looking into them, though, as they might solve my problems.  This all builds toward a larger problem that I have with data storage in general — some of my analysis gets complicated, and keeping the data “fresh” on a data stream becomes a difficult problem.  More on this later.

  2. Web Services for Data

    07-30-2009 by dan

    The problem:

    I have a large data warehouse stored in an Oracle database.  The existing framework for “consuming” this information is very old and static.

    The question:

    Can I take this rich dataset and construct methods for extracting the data that will be extremely flexible, and run parallel to the existing static framework (so as not to break what is already there)?  What I am looking for is something like a web service, which accepts a basic set of parameters and returns some kind of data “object”.  It should be generalized so that I can ask for one data element, or a long list of data elements.

    Why do I want to do this?  To put it simply, it is too difficult to extract information from this database in the current form.   All custom queries must be constructed off-line, and require a large effort to get into the existing framework.  What I’d like is to supply a set of data “building blocks” that can be “mashed” together to create reports, summaries, unit tests, and new datafeeds that I hadn’t explicitly defined at the outset.  Even better would be the ability to also pull blocks from outside sources — like salesforce.com or google finance.

    So, I’m now researching the best way to provide the data “building blocks”.  I can see some of these individual blocks becoming quite large — someone might ask for a time series of daily transactions over the past 10 years — and so a “pull” architecture probably makes the most sense.  If I am pulling a big block of data out of the system, what format should I transmit it in?  Currently, most of the data is viewed in the tried-and-true table format of a database or spreadsheet.  Keeping it as CSV, then, would be logical, but I want this system to be flexible — it might provide a CSV by default, but it should also provide XML or JSON on request.  (Can I transfer in a compressed format, and decompress on the other end?)
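
    A minimal Python sketch of the "CSV by default, JSON on request, optionally compressed" idea follows. The function name serialize, its parameters, and the sample data are all assumptions for illustration; XML output is left out to keep it short. On the compression question: HTTP already supports gzip-compressed responses through the Content-Encoding header, so a real service could probably lean on that rather than compressing by hand as done here.

```python
import csv
import gzip
import io
import json


def serialize(rows, fmt="csv", compress=False):
    """Render a list of dicts as CSV (default) or JSON bytes, optionally gzipped."""
    if fmt == "csv":
        buf = io.StringIO()
        fieldnames = list(rows[0].keys()) if rows else []
        writer = csv.DictWriter(buf, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
        payload = buf.getvalue().encode("utf-8")
    elif fmt == "json":
        payload = json.dumps(rows).encode("utf-8")
    else:
        raise ValueError("unsupported format: " + fmt)
    return gzip.compress(payload) if compress else payload


# Hypothetical data block: two days of a daily transaction series
block = [
    {"date": "2009-07-29", "transactions": 1200},
    {"date": "2009-07-30", "transactions": 1350},
]

csv_bytes = serialize(block)
json_gz = serialize(block, fmt="json", compress=True)

# The consumer decompresses on the other end and gets the same block back
restored = json.loads(gzip.decompress(json_gz))
```

    Keeping the format choice in one function means every building block can offer the same set of output formats without knowing anything about who consumes it.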

    It should be obvious where I am going with all of this — if I can pull building blocks in whatever configuration I want, then I can insert them into whatever system I want — including the old static system.  Obviously, I’d like to completely supplant the old system, but I will have to build toward that.

    The other advantage to this approach will be more transparency about what is actually in the data catalog.  A side effect of a well designed pull framework will be a somewhat self-documenting catalog of what is actually available and how it was constructed.  This should also allow me to design natural unit tests for verifying the integrity of the data (and the framework).

    Is a web service the correct solution here?  It should be universally available to any device, permission-able, and as fast as possible.  The output from the framework should be completely separate from any service that consumes the data.  Do web services fit this mold?

  3. WordPress Add-ons

    07-27-2009 by dan

    This post is to document the stuff that I have done to WordPress to make it work the way it does on my site.  More for my reference than anything, but you might find it useful.

    • Google Reader widget — this is the sidebar list I have under “Stuff I just read”
    • Twitter Tools: this is the sidebar list I have under “Stuff I’m doing”.  It also tweets every time I post a new blog entry, and has the option to do a tweet digest post on a regular basis.
    • Delicious Linkroll — this is actually just a text widget I added to the sidebar (under “Stuff I’ve Bookmarked”), with the following code as the text:
      <style type="text/css">
      .delicious-posts li {
        list-style-type: circle;
        margin-left: 40px;
      }
      .delicious-posts ul {
        margin-left: 0;
        padding-left: 2.5em;
      }
      </style>
      <script src="http://feeds.delicious.com/v2/js/dandube?title=&amp;count=5&amp;sort=date" type="text/javascript"></script>
    • Google Analytics for WordPress — you have to have some kind of tracking software, and in my opinion, this is the best.  Make sure you disable tracking the admin, so you don’t end up just watching yourself look at your own website!
    • Add to Any — I hate when I see an article out on the web and can’t just add it to my Delicious account with the click of a button.  This solves the problem, but for ANY service.
    • Tweetmeme – This one is somewhat redundant with Add to Any, but I like seeing the tweet counts.
    • Gravatar Signup — if you have a gravatar and are posting to my site, you should be able to use it!
    • Lifestream — this creates a digest post on a regular basis of all the services you wish to post updates from.  I use it for Google Reader, Twitter and Delicious primarily, but it literally does EVERYTHING.  I like having an archive of all this stuff in one place (ie, my blog).
    • Lifestream CSS hack: the Lifestream items were sometimes overflowing the primary content DIV, so I added this style to my CSS file:
      div.lifestream_label {
        width: 450px;
      }
    • Hack to Twitter Tools to use Facebook “selective status”: in the file called YOURBLOG/wp-content/plugins/twitter-tools.php, find the line that begins like this:
      $this->tweet_format = $this->tweet_prefix
      and add #fb to the end of the string, like this:
      $this->tweet_format = $this->tweet_prefix.': %s %s #fb';
      Now Facebook selective status will pick up your new blog post tweets as your status, even though it normally ignores your other tweets.

  4. More on Missing Pipes

    07-22-2009 by dan

    The following is a comment I left on a post at Jon Udell’s blog about “rewiring the web”.  It outlines some of the ideas I have mentioned here, and one of the commenters mentioned that Microsoft, of all places, actually had done something similar to what I suggested.

    I love this wiring the web idea, but I’m getting concerned about where the wires themselves are stored.

    What if your dopplr or tripit or yahoo pipes or whatever you are using goes away unexpectedly?  This could break a lot of stuff you have built.

    It might take forever to restore things to a working state, particularly if you relied heavily on one service.

    Here’s what I think might solve this problem:
    You log in to yahoo pipes and design a filter that takes feed A and creates feed B.

    Rather than the filter being stored at yahoo, the pipes create a small object that you can add to your website (or wherever) that performs that functionality.  All that is needed now is for feed A to continue to exist.

    Now, yahoo pipes suddenly disappears, to be replaced by google hoses.  You still have that one piece of functionality you created with the pipes, which still works since you didn’t store it at yahoo.

    This would give you time to switch to google hoses for new filters, while old filters continued to work (and might even be able to be imported into google hoses for future editing)

    So the filters get stored with your data — after all, you spent time creating them, so they are sort of a type of meta-data, right?

    I think this is something that should be seriously considered.  Pipelines (or wiring, or street networks, etc.  Choose your metaphor) are laid with permanence in mind.  I don’t think the metaphor should break just because we are talking about digital connections rather than physical ones.
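
    To make the "filter stored with your data" idea a bit more concrete, here is a minimal Python sketch. Everything in it is hypothetical: the JSON layout, the field names, and apply_filter are illustrative assumptions, not anything Yahoo Pipes actually exports.

```python
import json

# Hypothetical portable filter definition: the "small object" that would live
# alongside your own data instead of on the pipe service's servers.
FILTER_JSON = '{"field": "title", "contains": "python", "keep": ["title", "link"]}'


def apply_filter(spec, feed_a):
    """Turn feed A (a list of item dicts) into feed B according to spec."""
    needle = spec["contains"].lower()
    feed_b = []
    for item in feed_a:
        # Keep only items whose chosen field mentions the search term
        if needle in item.get(spec["field"], "").lower():
            feed_b.append({key: item[key] for key in spec["keep"] if key in item})
    return feed_b


feed_a = [
    {"title": "Python tips", "link": "http://example.com/1", "author": "a"},
    {"title": "Gardening", "link": "http://example.com/2", "author": "b"},
]
feed_b = apply_filter(json.loads(FILTER_JSON), feed_a)
print(feed_b)  # [{'title': 'Python tips', 'link': 'http://example.com/1'}]
```

    Because the filter is plain data plus a small interpreter, it keeps working as long as feed A exists, and in principle a replacement service could import the same definition for further editing.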

  5. The Tangled Chain Someone Else Weaves (and then YOU have to undo)

    07-16-2009 by dan

    Wow, now that I have written this, it’s a lot more ranting and angry than I had intended…  I guess that says something about how I feel about it!  Prepare yourself!

    Eric just responded to my Crap Filing Cabinet post over at his blog.  As he was painstakingly dissecting my post, I got to thinking about one of the points he made about attaching files to emails:

    …the immediacy of ‘attaching’ just makes it too appealing. Someone is ‘dumping’ the task off on you with the minimum effort necessary…

    I actually think that his point here is broader than he lets on.  Email is a fantastic way to dump work off on other people.  But not in that “forward customer service request to my co-worker” way that immediately jumps to mind.

    Consider an email like this one:

    From: Person5
    To: Dan Dube
    Subject: FW: RE: FW: FW: Question

    Dan,

    Check this out and let me know what you think.

    P5

    From: Person10
    To: Person5
    Subject: RE: FW: FW: Question

    P5,

    Hey can you get Dan to look at this?  I’m sort of stumped.  I added a bunch of stuff to the script, though, so that should help.

    P10

    Office: (555) 666-6666
    P10@company.com
    www.mystuff.com — my blog!

    From: Person3
    To: Person4, Person10, Person11
    Subject: FW: FW: Question

    P4,

    I just looked through the database, it looks like an inner join isn’t working properly, so we are getting a bad match here.  Can you figure out where the source data was coming from, and why the join failed?

    Thanks!
    P3


    ANYTHING SENT TO THIS EMAIL ADDRESS IS CONFIDENTIAL!  IF YOU AREN’T THE INTENDED RECIPIENT OF THIS EMAIL, DELETE IT IMMEDIATELY AND GO WASH YOUR EYES OUT WITH SOAP!  OH MY GOD STOP READING HERE!  I WILL SUE YOU!  YOU KNOW I WILL

    From: Person2
    To: Person3
    Subject: FW: Question

    P3:
    Just got this in, can you take a look?

    Thanks!
    P2


    Without Love there is no War
    Person 2
    (555) 555-5555 (office)
    (555) 555-5556 (cell)
    www.p2.com (website)

    From: Person1
    To: Person2
    Subject: Question

    Hello, I was looking at your website and noticed that Item X shows a price of 5 dollars, but it seems like it should really only be 5 cents.

    This actually wasn’t as difficult to type up as you might expect, since I literally receive this email 20 times a day and have to go through this process 20 times a day.

    How is the work being offloaded here?  Let’s look at the completely misguided ways:

    • First, I have to look back through this entire chain to figure out what the hell it is about.  The email has conveniently been ordered so that the most relevant stuff is at the bottom, where I have to waste as much time as possible getting to it.
    • Second, Person2 took the classic “give it to someone else but don’t help at all” approach.  This guy is probably a manager.  Notice all the crap text in his footer that is now mucking up the email chain.
    • Person3 figured some stuff out, but any logic that he used is lost in the sands of email mess.  To make matters worse, he has a big privacy notice that adds garble to the mix.
    • Person3 is a bit too thorough, though, and forwarded to too many people.   Look at the subject of this email grow!  (to be fair, most systems don’t do this anymore, but I’m just trying to emphasize how lousy the subject is)
    • Person10 makes some more progress, but again, his logic is lost, and now when this finally gets back to me I will have to contend with the extra crap this guy added in.  Now the subject contains a helpful RE in addition to several FWs.
    • Now, my boss has a hold of this thing and he forwards it along again.  Instead of giving me a quick overview of what I should be looking for, I only get a “have a look” message.

    Wow.  There are all kinds of problems here.  Some smart people were doing work here, and their work is lost to the un-collaborative mess of this email.  The only way I’d be able to use what they did is to go find them and talk to them face to face, which is hard to do when you work from home.

    There isn’t any revision control of the email itself, so I have to go with my gut that the lowest thing on the chain was indeed the original email and hasn’t been cut by someone’s truncating email client or spastic copy-n-paste hand movements.

    Each of the people on the chain had to waste time navigating the chain merely because everyone else was too lazy to summarize what had happened up to that point.

    Finally, what if the first person had sent a file?  Imagine how many forked versions of it would have been passed around as people downloaded their own copy, made their own edits and then forwarded it along.  Especially when it starts going to multiple people, you can see how bad the problem can become.  And, to make matters worse, the email server is storing multiple copies of this crap, instead of one version-controlled archive of the edits.

    This is awful.  Email and the filing cabinet metaphor must be discarded now.

  6. R: Nice Number Formatting

    05-20-2009 by dan

    If you are like me, you end up creating all the data tables in statistical publications. That would make you unlucky, but also would give you the opportunity to eliminate all kinds of inefficiencies in the publication process. For example, I find that having R output the data tables pre-formatted with nice number formats helps the layout guys a lot.

    Here’s a simple solution that I came up with just recently for formatting “dollar numbers” and “percentage numbers”:

    
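    # Format numbers as strings with thousands separators and an optional
    # '$' prefix or '%' suffix.  x may be a numeric vector.
    nice_format <- function(x, adec = 2, sign = '') {
    	# Dollar signs go before the number, percent signs after it
    	prefix <- if (sign == '$') '$' else ''
    	suffix <- if (sign == '%') '%' else ''
    	# formatC does the number formatting; paste attaches the sign
    	paste(prefix,
    		formatC(x, format = 'f', big.mark = ',', digits = adec),
    		suffix,
    		sep = '')
    }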
    

    There is nothing mysterious going on here; this is basically just a wrapper around the “paste” and “formatC” functions.  formatC converts the numbers into formatted strings, and paste attaches the sign.  The mildly clever part is that dollar signs typically come before the number and percent signs after it, so the function figures out where the sign should go based on the “sign” argument.  Also, the “x” argument can be a vector, and the code vectorizes properly.  (If only I had the patience to make it accept matrices or data frames, too!)

    Some notes about the formatC function:

    • Yes, that is formatC with a capital C.  Case matters!
    • The format argument specifies decimal, integer, string, etc.   There are several choices, so check out ?formatC for the best one for your circumstances.
    • big.mark tells R to put a comma at the thousands, millions, etc places.
    • digits tells R how many digits to show after the decimal point when format is 'f'.  (With other formats, such as 'g', it means significant digits instead.)