1. Moving Data is Like Moving Mountains

    06-04-2010 by dan

    Why is moving data the hardest thing to do in the world?

    All databases/database apps should have:

    • Easy connection for raw data.  Not a web service connection, but a permissioned, live, SQL interface that requires no special software other than maybe a driver.  The data should be in a readily accessible, universal format once it gets to the app, not converted to Excel or CSV or XML (but these should continue to be options).
    • Fast data transfer.  If my data is bigger than X, it should be compressed before transferring.  Both sides of the connection should handle this seamlessly.
    • Some method for persisting application views of data between apps/users/computers — for example, if my Ajax app pulls out a bunch of data for a table, my Protovis app should be able to read it with minimal (or zero) data movement.
    • For extra credit, my database should have an easy to use semantic layer that can dynamically generate queries (based on a data model I build), and my apps can interface with this to create queries, but use the “easy connection” mentioned above for extracting data.

    SAP and Oracle and Microsoft don’t understand this.  Or, if they do, they hide it because they want to sell you more software.  The problem is that the more complicated the chain of software to get the data, the more likely something fails or the data is bad.  Please — someone build an app-friendly database!

    (as an aside, I wonder how much programming would be required to get some of these things built as interfaces to MySQL or PostGres?  Do the pieces exist, but they just need to be interconnected properly?)

    • Share/Bookmark

  2. Why Not Make it Simple?

    03-17-2010 by dan

    Wow.  I hate most GUI software.  No, let me restate that.  I hate when a piece of software is designed to only be used through a GUI.  At the risk of sounding like a Windows 7 commercial, “just make it SIMPLE!”

    The problem with a GUI is that it adds a layer of complexity the problem you are trying to solve.  Especially when you have several different techologies talking to each other, you want to minimize the number of places something can break.

    For example, this is what the tool chain for Business Objects looks like:

    • Database is housed in MySQL on a Linux server
    • Business Objects is a server application running on a Windows server box
    • Business Objects connects to the MySQL server via ODBC
    • I run two pieces of software on my windows machine: a browser with a BO client, and a desktop app called “universe designer”
    • When I build a query and report, I use the data definition created in the “universe designer”, and the BO client to put it together.
    • I submit the job to the Windows server box in some proprietary format that I cannot revision control.
    • The BO server app has a scheduler, an SQL generator and a report formatter that take my job, submit it to MySQL server and then post-process it.
    • The final result is returned to my browser client.

    Something can go wrong ANYWHERE along that chain of events, and the error messages you get back are not robust enough to tell you where the error occurred.

    I submit that this would be better:

    • Database housed in MySQL on Linux server
    • Query scheduler also housed on same Linux server — only function is to schedule queries based on dependencies and return errors/halt queries at appropriate times
    • All queries should be passed to the scheduler as final SQL code
    • A server application that stores the data definition and a translator of data definition objects to SQL.  Data definitions are stored on the server and revision controlled. This server should probably be a separate server from the MySQL one, but doesn’t have to be.
    • A server application that stores “queries” which are objects created from data definition and translatable to SQL.  Again, revision controlled, and should probably be on the same server as the data definition server.
    • A server application that takes layouts defined for datasets and applies them to data.  The format for the layouts should be simple, probably XML based.  This should also be located on the same server as the data definition server.
    • Any number of clients that know how to do any subset of these things
      • Define a query from data definition (or load queries in the archive on the server)
      • Submit queries to be run (either hand written or data definition generated), or to be housed in the archive on the server
      • Layout data from a datasource (not necessarily one of the queries above)
      • Submit layouts to the archive on the server
      • Do advanced analysis on data (again, not necessarily from either of the above — obviously, I am thinking that R would be this tool)
      • Submit data back to the MySQL database in an intelligent fashion (permission based)

    Here, you could get very good error tracking by viewing logs.  Since all information is passed as text instead of some proprietary format, you can read the code itself to see if something is inserting bad code along the way.  There are probably some data transfer issues, but by using a driver like ODBC at the right place with the right amount of logging, even that could be easy to diagnose should a problem arise.  And, my personal pet peeve, this is very modular.  I don’t have to design a whole report if all I want to see is some raw data.  I don’t have to be tied to a report at all if I want to do some exploratory analysis.  And, I can define my own layouts and formatting without having to rely on “pre-defined” garbage from companies that don’t know how to actually deliver data in an informative way.

    • Share/Bookmark

  3. Testing for Correlation

    08-31-2009 by dan

    I am playing around with a dataset right now that has two specific streams:  the first is a daily price feed for a bunch of securities, and the second is an important event feed that is matched to the pricing data.

    The idea here will be to find some kind of correlation between the pricing data and the important events, to try to “predict” the important events.  Something like, “When the price spikes, we can expect an important event to happen within the next 2 weeks” or the like.

    I am going into this not even convinced that there is a correlation at all.  I don’t have a rough time window that I’d like to look in, and I don’t have even a good idea about whether a price increase or decrease would be more likely to correlate with an important event.

    I have read about cross correlation, but I don’t think it is applicable per se.  I have several hundred securities, each with one or more important events tied to them.  What is the best way to start analyzing this dataset for correlation?  Here is what I plan to try:

    • The first thing I want to do is to normalize time to a specific point.  I’ll put each important event at time 0, and then put the pricing series around time 0 — so the price at -5 is 5 days before the event.  It might be messy, but just plotting this time-normalized graph might give me some insight.
    • The second thing I want to do is to consider normalizing the prices themselves.  Maybe normalize to a 0-1 scale, where 1 is the highest price across all securities, and 0 is the lowest.

    I’m sure there are some more scientific techniques for taking a stab at this problem, but I don’t know what they are.  Is it safe to come up with some kind of distribution of price movements?  Does aggregating the data actually hide some information, or help me get at the underlying trend?  Is it worthwhile to build a linear model (regression) of the price movement and see if there is any kind of aggregate trend to the linear model?

    I want to take a statistics course that focuses less on the math proofs and more on how to approach a problem like this and say something informative with the statistics.  Do such courses exist, and can they ignore some of the proofs to trade off for the techniques?

    • Share/Bookmark

  4. Framework for Storing Ideas

    08-27-2009 by dan

    I need a framework for storing my ideas.  As I’ve discussed before, the filing cabinet metaphor for file storage is crap, but it’s still out there, and is still hindering my ability to manage ideas and thoughts in a cohesive manner.

    For example, I dabble on and off with Actionscript and MXML code in the Flex SDK.  I can download some crap one-off dataset and try to build a Flex app using it, but it always seems so ephemeral.  The app itself ends up feeling pretty temporary, too, and I have nowhere to put it where it can be re-used.  I could just make a pile of files on my webserver, but I don’t think that will accomplish my goal.

    What I really want is to have a solid framework set up that lets me build small apps that I can easily recombine later.  Something like Facebook or Yahoo Pipes, but for my own creations.  I can picture it being a modified CMS like Drupal, but with “containers” for any idea I happen to have — a blog entry, a program, a Flex app, a picture, etc.  Then, I should be able to build a new idea out of the building blocks of other ideas — for example a blog entry about a picture, or a program that reads as input one of my other program’s outputs.

    Of course, all the things I add to this CMS system should have nice version control and possibly even a persistant database instance.

    Then, if it were working properly for my personal set of ideas, it would be a natural fit to start using it at work.  Each of thse stand alone units could include a file I’m working on, a database that populates a report, an email (chain) request, etc.  Having Google index the objects in some way would make the search very simple.  Best of all, anything I built would be immediately reusable by hooking outputs to inputs — and it would all have a nice log of activity via revision control history.

    A major stumbling block to this being useful is that Drupal is built for a specific database.  It might make more sense to use Zend Framework, and build it out using the framework.  Xend has a database abstraction layer that would allow me to use any database backend I wanted.  Either way, it sounds like a lot of development time for something that I’d really like to just start using!

    • Share/Bookmark

  5. XML and Data

    08-24-2009 by dan

    I think that XML is cool.  It works really well to define processes (like with the ANT build file), with certain data streams (like RSS or ATOM), and the idea of an open document format in XML is very appealing to me.

    That said, it was refreshing to read that someone else is apprehensive about XML as a “big data” format.

    The first time I used XML for a big datafeed, I was somewhat dismayed by the sheer number of extra characters that all the tags add to the data file.  A simple “separated value” file is far more compact, needing only a “separator” between two datafields.  Also, needing a special library to read the XML in an intelligent way is extraordinarily frustrating.

    On the other hand, it’s really nice to have the data and its characteristics defined in one file.  I like the ability to have multiple items of the same type for each data “row” (like an HTML list has multiple List Items, any XML data element can have multiple items).  Tables aren’t the best way to represent all data, and XML allows us to skirt around the table metaphor if we choose (if only we could do the same thing with the file system filing cabinet metaphor!  But I digress…)

    Can’t we have a simple “delimited file” format that allows us to have multi-dimensional items, but has less of the overhead of the XML file?  For example, have multiple delimiters so that data elements can have multiple items — here is an example of a 3-column dataset:

    col1,subcol1;subcol2;subcol3,col3

    This wouldn’t necessarily allow “infinite” dimensionality, but I bet there is a simple way to qualify delimiters to allow that.  The other pain about this kind of format is the lack of formatting information within the file itself.  The crappy part about loading a CSV file into Oracle is that you have to have a separate file that describes the format of each column and how many characters should be expected at max.  (there are all sorts of caveats about storage size there, too, with extra bits for types and allocation block sizes)

    I don’t know anything about the JSON format, and I don’t know anything about the SQLite format, which the author of the linked article previously implied was a better format for large datasets.  I know I will start looking into them, though, as they might solve my problems.  This all builds toward a larger problem that I have with data storage in general — some of my analysis gets complicated, and keeping the data “fresh” on a data stream becomes a difficult problem.  More on this later.

    • Share/Bookmark

  6. Technology and Investing

    08-21-2009 by dan

    There’s got to be a way to leverage technology to improve the individual investor’s abilities to invest intelligently.

    The discount brokerages like E*Trade and ScottTrade have set us off in the right direction, with portfolio tools and lots of online data access.  But it’s not enough — I need to be able to access my data any way I want to access it, and I need to be able to build scripts that will monitor for events and let me know about them or take action on my behalf.

    Let’s start simple:  Suppose I want to monitor the price of a particular stock, and buy it if it drops to a specific level.  E*Trade provides some functionality in this regard, but it’s only for the price level itself — I can set a trigger if the price hits X.  I want to be able to define the trigger itself, ie, send me an alert when ANY stock drops FURTHER than other stocks in its industry, or possibly relative to the S&P500.   And then I want to be able to limit it from ANY stock to just a set of stocks I want to monitor.

    Next up, I want to set specific sell orders, again, not relative to the price.  Suppose I monitor a stock and I use some kind of value investing metric to determine whether the stock is still worth the price.  Maybe I look at Earnings to Cash Flow, and if the ratio gets too big I sell the stock.  I might not be monitoring this everyday, but a computer can.  I might build in a buffer that says something like “watch for 2 quarters in a row of high ratio and sell if you see it” or something like this.

    Next, I’d really like to determine how my rate of return is calculated.  I want to account for inflation, the original purchase price (not just the yearly return), the dividends reinvested, etc.  Just giving me one number for this year is stupid and tells me nothing.  Similarly for the whole portfolio — I should be able to tell at a glance if it is worth more today than it was at any particular date, adjusting for inflation if I so choose.

    Now, I want to create my own pricing metric.  Suppose I can come up with a way to measure what I think the price of a stock should be.  I should be able to track that in time series along with the actual market price of the stock, and use the difference between the two to make decisions.  This is what I was alluding to in yesterday’s post about Warren Buffett’s approach.

    Finally, I could do all this myself on my own system if I could just have a really good external API for accessing my information from within E*Trade.  At least let me download my purchase information for every transaction I’ve made on E*Trade, so I can enter it into my own system.

    • Share/Bookmark

  7. Web Sharing Questions

    07-31-2009 by dan

    Why develop anything for the web if it is locked into your proprietary framework?  I’m not talking about your intranet, I’m talking about your publicly available (or maybe even subscription based) internet presence.  In this day and age, anything you create that is not nimble enough for “mashups” will actually give YOU trouble as well as your customers.  For example?  What if you wanted to do summary statistics and unit tests of your database right now?  Would your developers have to start coding like mad?  I know the ones on my database would (mostly because that would be me).

    Why would a nimble architecture help here?  Because you could consume the data in some ready made statistical software (R anyone?) and do some quick analysis without new custom queries or development work.  Same thing with unit tests — pull out specific sets of data with your agile framework, and test them with whatever software you like.

    Here are my questions about developing this sort of framework:

    • How to get your database browse-ready in a RESTful way?  What is the software involved?  Do you need special Apache plugins?
    • How to get applications you do choose to develop to be able to consume your data properly — Do they consume a feed?  Browse the RESTful setup?  Both?
    • If you build an app, how do you make it ultra-portable?  Youtube videos, google gadgets, site searches, etc are all easily plugged into a website, and if you provide the proper XML attributes, you can even reskin them so your clients can easily keep their own look and feel with your technology.  What is the technology involved here?  Is it simply a matter of bundling attributes in HTML/XML and source javascript, with access to your web API?  Or is there more?
    • Scalability!  Do any of the above questions impose a ridiculous amount of overhead on your database or webserver?

    These are questions I don’t yet have answers to.  I’ll write here as I find answers.

    • Share/Bookmark

  8. Web Services for Data

    07-30-2009 by dan

    The problem:

    I have a large data warehouse stored in an Oracle database.  The existing framework for “consuming” this information is very old and static.

    The question:

    Can I take this rich dataset and construct methods for extracting the data that will be extremely flexible, and run parallel to the existing static framework (so as to no break what is already there)?  What I am looking for is something like a web service, which accepts a basic set of parameters and returns some kind of data “object”.  It should be generalized so that I can ask for one data element, or a long list of data elements.

    Why do I want to do this?  To put it simply, it is too difficult to extract information from this database in the current form.   All custom queries must be constructed off-line, and require a large effort to get into the existing framework.  What I’d like is to supply a set of data “building blocks” that can be “mashed” together to create reports, summaries, unit tests, and new datafeeds that I hadn’t explicitly defined at the outset.  Even better would be the ability to also pull blocks from outside sources — like salesforce.com or google finance.

    So, I’m now researching the best way to provide the data “building blocks”.  I can see some of these individual blocks becoming quite large — someone might ask for a time series of daily transactions over the past 10 years — and so a “pull” architecture probably makes the most sense.  If I am pulling a big block of data out of the system, what format should I transmit it in?  Currently, most of the data is viewed in the tried-and-true table format of a database or spreadsheet.  Keeping it as CSV, then, would be logical, but I want this system to be flexible — it might provide a CSV by default, but it should also provide XML or JSON on request.  (Can I transfer in a compressed format, and decompress on the other end?)

    It should be obvious where I am going with all of this — if I can pull building blocks in whatever configuration I want, then I can insert them into whatever system I want — including the old static system.  Obviously, I’d like to completely supplant the old system, but I will have to build toward that.

    The other advantage to this approach will be more transparency about what is actually in the data catalog.  A side effect of a well designed pull framework will be a somewhat self-documenting catalog of what is actually available and how it was constructed.  This should also allow me to design natural unit tests for verifying the integrity of the data (and the framework).

    Is a web service the correct solution here?  It should be universally available to any device, permission-able, and as fast as possible.  The output from the framework should be completely separate from any service that consumes the data.  Do web services fit this mold?

    • Share/Bookmark

  9. Agile Database Integration

    07-29-2009 by dan

    I attended the BTV PHP User Group meeting last Thursday where I listened to a talk about using a better data model for website development.  I couldn’t agree more with the content.

    I hate being locked into a corner by the database scheme, and even more by the code that is using the database scheme.  The slideshow addresses the latter problem, not so much the former (although it does touch on it briefly).  My realm of influence is definitely closer to the data than the code, but I am painfully aware of the interaction between them, and the restrictions each place on each other.  So, while I am not a PHP developer, this topic is very interesting to me, and I am considering pursuing some of the books that were mentioned about developing database-smart systems.

    What this presentation did not get into, however, is what to do if something in your schema DOES have to change.  What if you can’t “decorate” your classes, because some fundamental change has occurred in your data (or your understanding of your data)?  I suppose you can just grep for every occurrence of the old schema details, and replace with the new details.  I want an application that manages the structure of the service ALONG with the structure of the database itself, and is completely dynamic from that perspective.  Designing with this layered approach is definitely a step in this direction, but it doesn’t quite get me the whole way there.

    • Share/Bookmark

  10. More on Missing Pipes

    07-22-2009 by dan

    The following is a comment I left on a post at Jon Udell’s blog about “rewiring the web“.  It outlines some of the ideas I have mentioned here, and one of the commenters mentioned that Microsoft, of all places, actually had done something similar to what I suggested.

    I love this wiring the web idea, but I’m getting concerned about where the wires themselves are stored.

    What if your dopplr or tripit or yahoo pipes or whatever you are using goes away unexpectedly?  This could break a lot of stuff you have built.

    It might take forever to restore things to a working state, particularly if you relied heavily on one service.

    Here’s what I think might solve this problem:
    You log in to yahoo pipes and design a filter that takes feed A and creates feed B.

    Rather than the filter being stored at yahoo, the pipes create a small object that you can add to your website (or wherever) that performs that functionality.  All that is needed now is for feed A to continue to exist.

    Now, yahoo pipes suddenly disappears, to be replaced by google hoses.  You still have that one piece of functionality you created with the pipes, which still works since you didn’t store it at yahoo.

    This would give you time to switch to google hoses for new filters, while old filters continued to work (and might even be able to be imported into google hoses for future editing)

    So the filters get stored with your data — after all, you spent time creating them, so they are sort of a type of meta-data, right?

    I think this is something that should be seriously considered.  Pipelines (or wiring, or street networks, etc.  Choose your metaphor) are laid with permanence in mind.  I don’t think the metaphor should break just because we are talking about digital connections rather than physical ones.

    • Share/Bookmark