Go On, Live a Little. Denormalize Your Data

I have years of professional experience with relational databases and have completed several academic courses on them. This is a pretty familiar topic for me, and I’ve worked with a bunch of them: SQL Server, SQLite, MySQL, PostgreSQL, Oracle, and even MS Access. I’ve always found them, paradoxically, to be both comfortable and awkward. They’re comfortable because of my familiarity with them and awkward because I’ve used them almost exclusively with OO languages and incurred the object-relational impedance mismatch.

While RDBMS are in my comfort zone, NoSQL alternatives are comparatively alien but also intriguing. Ayende has made some pretty convincing arguments as to why one should think outside the RDBMS, including a .NET Rocks episode in which he talks about constraints, such as expensive disk space, that applied decades ago when RDBMS were designed and are no longer a concern. I’m receptive, my interest is piqued, and I’ve played a bit with MongoDB to understand how these things work. But I think it really clicked for me the other day when I was staring at a schema that had some definite complexity issues. I think I finally grokked the use case for document databases.

I was working with a pre-existing schema recently in which (mildly obfuscated) you have a “customer” who can be part of one or more “programs.” The entities “customer” and “program” each have their own properties and, as any diligent normalizer will tell you, that information should not be repeated. So they get their own tables. Since a customer can participate in multiple programs and multiple customers can be in the same program, this is an “M to N” relationship — represented, predictably, by a linking table. What’s more, customers can participate repeatedly in programs so the linking table will sometimes have multiple participation records per customer-program pair, differing by a participation timestamp. I’m sure those of us who have done any significant database development have played out similar scenarios countless times.
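To make the shape of that schema concrete, here’s a minimal sketch in SQLite (via Python). All table and column names are my own illustrations, not the real, obfuscated system’s:

```python
import sqlite3

# Normalized version: customer, program, and a linking table whose
# timestamp allows repeated participation in the same program.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);

CREATE TABLE program (
    program_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);

-- The M-to-N linking table; one row per participation event.
CREATE TABLE participation (
    customer_id     INTEGER NOT NULL REFERENCES customer(customer_id),
    program_id      INTEGER NOT NULL REFERENCES program(program_id),
    participated_at TEXT    NOT NULL,
    PRIMARY KEY (customer_id, program_id, participated_at)
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Acme Corp')")
conn.execute("INSERT INTO program VALUES (10, 'Rewards 2012')")
conn.execute("INSERT INTO participation VALUES (1, 10, '2012-01-15')")
conn.execute("INSERT INTO participation VALUES (1, 10, '2012-06-20')")

# Even the simple question "who participated in what, and when?"
# requires joining all three tables.
rows = conn.execute("""
    SELECT c.name, p.name, pt.participated_at
    FROM participation pt
    JOIN customer c ON c.customer_id = pt.customer_id
    JOIN program  p ON p.program_id  = pt.program_id
    ORDER BY pt.participated_at
""").fetchall()
print(rows)
```

Note that the participation timestamp has to be part of the linking table’s key precisely because the same customer-program pair can recur.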

As I wrangled Entity Framework to deal with this and other similar relationships that were making it relatively complicated to pull certain information, I started thinking that this seemed harder than it needed to be. There was no use case for handling “what if the program name or another attribute changes in the middle of its offering,” and there really are no good answers to that question. Right now, the implementation is “you can’t do that,” but this is hardly satisfying. To administrative users, it seems arbitrary and like a system shortcoming. But the alternative is unpleasant, too — additional complexity to allow “partial programs” in the schema or the code.

I started thinking about normalization and whether it really mattered here. Allowing duplication is a cardinal RDBMS sin, but it sure would make life easier for a lot of use cases. I mean, imagine a scheme where there was just a table of customers, and participation was recorded as you went, with the program information duplicated in each participation record. The downside is the duplicated information, but the upside is that changing programs midstream becomes trivial and there is less table overhead and complexity to maintain. No M-to-N relationships, no worrying about whether a program is current or has been replaced by a new one.
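Sketching that denormalized alternative in the same style (again, all names are illustrative), each participation record simply snapshots the program’s attributes at the moment it happened:

```python
import sqlite3

# Denormalized version: program attributes are copied into every
# participation record, so history is preserved for free.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);

-- No program table and no linking table; each row records what the
-- program looked like when the customer participated.
CREATE TABLE participation (
    customer_id     INTEGER NOT NULL REFERENCES customer(customer_id),
    program_name    TEXT    NOT NULL,
    participated_at TEXT    NOT NULL
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Acme Corp')")
conn.execute("INSERT INTO participation VALUES (1, 'Rewards 2012', '2012-01-15')")
# The program is renamed mid-offering; new records just carry the new
# name, and no 'partial program' machinery is needed.
conn.execute("INSERT INTO participation VALUES (1, 'Rewards 2012 Plus', '2012-06-20')")

history = conn.execute("""
    SELECT program_name, participated_at
    FROM participation
    WHERE customer_id = 1
    ORDER BY participated_at
""").fetchall()
print(history)
```

The trade-off is explicit: you can no longer rename a program in one place, but a mid-offering change stops being an impossible question and becomes just another row.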

And that’s when it hit me. RDBMS are great for maintaining, well, relational information. For instance, if I work at a company, I have a boss, and my boss has many reports. Modeling the manager/line-employee relationship is a great fit for a relational database because it’s about modeling relationships, like “manager to reports.” Tree structures of composition tend to be a good fit as well, such as assemblies like cars and other pieces of machinery. Parts consist of sub-parts and so on. These are relationships in the sense that operations like “delete” and “replace” make sense.
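The manager-to-reports case is the classic self-referencing table, and it shows off what an RDBMS does well. A hypothetical sketch (names invented), again in SQLite:

```python
import sqlite3

# The manager/report relationship as a self-referencing table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    manager_id  INTEGER REFERENCES employee(employee_id)  -- NULL at the top
);
""")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", [
    (1, 'Alice', None),   # Alice manages Bob and Carol
    (2, 'Bob',   1),
    (3, 'Carol', 1),
    (4, 'Dave',  2),      # Dave reports to Bob
])

# "Who is in Alice's chain of command?" -- a recursive walk down
# the tree of reports.
chain = conn.execute("""
    WITH RECURSIVE reports(employee_id, name) AS (
        SELECT employee_id, name FROM employee WHERE manager_id = 1
        UNION ALL
        SELECT e.employee_id, e.name
        FROM employee e JOIN reports r ON e.manager_id = r.employee_id
    )
    SELECT name FROM reports ORDER BY employee_id
""").fetchall()
names = [name for (name,) in chain]
print(names)
```

Here “replace Bob’s manager” or “delete Dave” are one-row operations with obvious meaning, which is exactly the sense in which this is an ongoing relationship rather than a recorded event.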

But what about the one-and-done concept of a customer participating in a program one day? That isn’t a relationship any more than I form a relationship with a waiter that I tip and then never see again. That’s something that happened once — not an ongoing arrangement. And that’s where things really snapped into focus for me. RDBMS and the normalized model are great for managing relationships but not necessarily the be-all and end-all for managing highly transactional data in the sense of event recording.

And while I’m probably not in a position for this project to quickly introduce document databases and stir up the works too much, I am considering modeling what I might do storage-wise after a document database there in SQL Server. Perhaps it’s time to throw caution to the winds and start allowing some information repetition in there for historical modeling. If the experiment fails, I can always normalize it. But if it succeeds, adding document databases to the mix will be a lot easier since I’ll just be learning the syntax of the API as I go rather than the fundamental underlying concepts and use cases.

  • Alex Lysenka

    I recently evaluated a number of NoSQL solutions for one of our
    projects. I looked at all NoSQL families: key-value (Redis), column
    family (Cassandra), document (MongoDB), and finally graph (Neo4j).
    Keeping the CAP theorem in mind, one can quickly notice that unlike
    RDBMS, each one of these sacrifices certain qualities in favor of
    others. On one side of the spectrum you’ve got key-value stores –
    highly scalable (think Amazon’s Dynamo), yet very rudimentary in
    design. On the other side – graphs, which are extremely flexible
    and sophisticated, but less scalable and mostly tailored to
    idiomatic queries. Putting graph databases aside for a moment, the
    basic trend is: the bigger your data, the more denormalized it is.
    Coming from a relational world, the developer has to change her
    mindset from “here’s my data and now I’d like to ask some
    questions” to “here are my questions and how do I store my data
    to answer them.” It’s a radical change for many and is hard to get
    used to. No wonder polyglot persistence is so common these days.
    Each one of these databases was designed with a specific use case in
    mind. I found Neo4j to be the most fascinating – it’s plain graph
    theory after all. It’s also a very natural approach to storing data.
    It actually satisfied about 90% of our requirements, but Neo suffers
    from a hub node issue and that was a deal breaker unfortunately.

  • http://www.daedtech.com/blog Erik Dietrich

    Which did you wind up choosing…?

    Anyway, I really like the way you describe the mindset change — I think that’s spot on. I just kind of stumbled into that mindset myself and started thinking of alternatives. I found myself thinking, “why is it so incredibly hard to get an answer to question X? If I just had the data stored in a different way, answering that question would be easy.” But doing some research or prototyping, knowing what questions you want to ask of your data, and designing storage around answering those questions is a great approach.

    I’ll have to look at graph databases more. I wasn’t familiar with an actual graph-oriented persistence model until you mentioned it and I looked it up. So far for me, that’d be a solution without a problem but I probably haven’t given it enough thought, and the math buff in me loves the concept.

  • Alex Lysenka

    On hold for now. Neo4j has a free book you can download from their website: http://www.neo4j.org/learn. I strongly recommend it, as it will provide insight into the problems graph databases are equipped to solve. Some of the use cases include social, recommendations, geo, master data management, network and data center management, authorization and access control, and communication. For a more concrete example, think of the “People you may know” use case or the triadic closure concept in social network theory (http://en.wikipedia.org/wiki/Triadic_closure).