Go On, Live a Little. Denormalize Your Data

I have years of professional experience with relational databases and have completed several academic courses on them. This is a pretty familiar topic for me, and I’ve worked with a bunch of them: SQL Server, SQLite, MySQL, PostgreSQL, Oracle, and even MS Access. I’ve always found them, paradoxically, to be both comfortable and awkward. They’re comfortable because of my familiarity with them and awkward because I’ve used them almost exclusively with OO languages and incurred the object-relational impedance mismatch.

While RDBMS are in my comfort zone, NoSQL alternatives are comparatively alien but also intriguing. Ayende has made some pretty convincing arguments as to why one should think outside the RDBMS, including a .NET Rocks episode in which he talks about constraints, such as expensive disk space, that applied decades ago when RDBMS were designed and are no longer a concern. I’m receptive, my interest is piqued, and I’ve played a bit with MongoDB to understand how these things work. But I think it really clicked for me the other day when I was staring at a schema that had some definite complexity issues. I think I finally grokked the use case for document databases.

I was working with a pre-existing schema recently in which (mildly obfuscated) you have a “customer” who can be part of one or more “programs.” The entities “customer” and “program” each have their own properties and, as any diligent normalizer will tell you, that information should not be repeated. So they get their own tables. Since a customer can participate in multiple programs and multiple customers can be in the same program, this is an “M to N” relationship — represented, predictably, by a linking table. What’s more, customers can participate repeatedly in programs so the linking table will sometimes have multiple participation records per customer-program pair, differing by a participation timestamp. I’m sure those of us who have done any significant database development have played out similar scenarios countless times.
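To make the shape of that schema concrete, here’s a minimal sketch in SQLite (via Python). All table and column names are my own illustrations, not the real, obfuscated system’s:

```python
import sqlite3

# Normalized version: customer, program, and a linking table whose
# timestamp allows repeated participation in the same program.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);

CREATE TABLE program (
    program_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);

-- The M-to-N linking table; one row per participation event.
CREATE TABLE participation (
    customer_id     INTEGER NOT NULL REFERENCES customer(customer_id),
    program_id      INTEGER NOT NULL REFERENCES program(program_id),
    participated_at TEXT    NOT NULL,
    PRIMARY KEY (customer_id, program_id, participated_at)
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Acme Corp')")
conn.execute("INSERT INTO program VALUES (10, 'Rewards 2012')")
conn.execute("INSERT INTO participation VALUES (1, 10, '2012-01-15')")
conn.execute("INSERT INTO participation VALUES (1, 10, '2012-06-20')")

# Even the simple question "who participated in what, and when?"
# requires joining all three tables.
rows = conn.execute("""
    SELECT c.name, p.name, pt.participated_at
    FROM participation pt
    JOIN customer c ON c.customer_id = pt.customer_id
    JOIN program  p ON p.program_id  = pt.program_id
    ORDER BY pt.participated_at
""").fetchall()
print(rows)
```

Note that the participation timestamp has to be part of the linking table’s key precisely because the same customer-program pair can recur.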

As I wrangled Entity Framework to deal with this and other similar relationships that were making it relatively complicated to pull certain information, I started thinking that this seemed harder than it needed to be. There was no use case for handling “what if the program name or another attribute changes in the middle of its offering,” and there really are no good answers to that question. Right now, the implementation is “you can’t do that,” but this is hardly satisfying. To administrative users, it seems arbitrary and like a system shortcoming. But the alternative is unpleasant, too — additional complexity to allow “partial programs” in the schema or the code.

I started thinking about normalization and whether it really mattered here. Allowing duplication is a cardinal RDBMS sin, but it sure would make life easier for a lot of use cases. I mean, imagine a scheme where there was just a table of customers, and participation was recorded as you went, with the program information duplicated in each participation record. The downside is the duplicated information, but the upside is that changing programs midstream becomes trivial and there is less table overhead and complexity to maintain. No M-to-N relationships, no worrying about whether a program is current or has been replaced by a new one.
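Sketching that denormalized alternative in the same style (again, all names are illustrative), each participation record simply snapshots the program’s attributes at the moment it happened:

```python
import sqlite3

# Denormalized version: program attributes are copied into every
# participation record, so history is preserved for free.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);

-- No program table and no linking table; each row records what the
-- program looked like when the customer participated.
CREATE TABLE participation (
    customer_id     INTEGER NOT NULL REFERENCES customer(customer_id),
    program_name    TEXT    NOT NULL,
    participated_at TEXT    NOT NULL
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Acme Corp')")
conn.execute("INSERT INTO participation VALUES (1, 'Rewards 2012', '2012-01-15')")
# The program is renamed mid-offering; new records just carry the new
# name, and no 'partial program' machinery is needed.
conn.execute("INSERT INTO participation VALUES (1, 'Rewards 2012 Plus', '2012-06-20')")

history = conn.execute("""
    SELECT program_name, participated_at
    FROM participation
    WHERE customer_id = 1
    ORDER BY participated_at
""").fetchall()
print(history)
```

The trade-off is explicit: you can no longer rename a program in one place, but a mid-offering change stops being an impossible question and becomes just another row.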

And that’s when it hit me. RDBMS are great for maintaining, well, relational information. For instance, if I work at a company, I have a boss, and my boss has many reports. Modeling the manager/line-employee relationship is a great fit for a relational database because it’s about modeling relationships, like “manager to reports.” Tree structures of composition tend to be a good fit as well, such as assemblies like cars and other pieces of machinery. Parts consist of sub-parts and so on. These are relationships in the sense that operations like “delete” and “replace” make sense.
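The manager-to-reports case is the classic self-referencing table, and it shows off what an RDBMS does well. A hypothetical sketch (names invented), again in SQLite:

```python
import sqlite3

# The manager/report relationship as a self-referencing table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    employee_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    manager_id  INTEGER REFERENCES employee(employee_id)  -- NULL at the top
);
""")
conn.executemany("INSERT INTO employee VALUES (?, ?, ?)", [
    (1, 'Alice', None),   # Alice manages Bob and Carol
    (2, 'Bob',   1),
    (3, 'Carol', 1),
    (4, 'Dave',  2),      # Dave reports to Bob
])

# "Who is in Alice's chain of command?" -- a recursive walk down
# the tree of reports.
chain = conn.execute("""
    WITH RECURSIVE reports(employee_id, name) AS (
        SELECT employee_id, name FROM employee WHERE manager_id = 1
        UNION ALL
        SELECT e.employee_id, e.name
        FROM employee e JOIN reports r ON e.manager_id = r.employee_id
    )
    SELECT name FROM reports ORDER BY employee_id
""").fetchall()
names = [name for (name,) in chain]
print(names)
```

Here “replace Bob’s manager” or “delete Dave” are one-row operations with obvious meaning, which is exactly the sense in which this is an ongoing relationship rather than a recorded event.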

But what about the one-and-done concept of a customer participating in a program one day? That isn’t a relationship any more than I form a relationship with a waiter that I tip and then never see again. That’s something that happened once — not an ongoing arrangement. And that’s where things really snapped into focus for me. RDBMS and the normalized model are great for managing relationships but not necessarily the be-all and end-all for managing highly transactional data in the sense of event recording.

And while I’m probably not in a position for this project to quickly introduce document databases and stir up the works too much, I am considering modeling what I might do storage-wise after a document database there in SQL Server. Perhaps it’s time to throw caution to the winds and start allowing some information repetition in there for historical modeling. If the experiment fails, I can always normalize it. But if it succeeds, adding document databases to the mix will be a lot easier since I’ll just be learning the syntax of the API as I go rather than the fundamental underlying concepts and use cases.

  • Alex Lysenka

    I recently evaluated a number of NoSQL solutions for one of our
    projects. I looked at all NoSQL families: key-value (Redis), column
    family (Cassandra), document (MongoDB), and finally graph (Neo4j).
    Keeping the CAP theorem in mind, one can quickly notice that unlike
    RDBMS, each one of these sacrifices certain qualities in favor of
    others. On one side of the spectrum you’ve got key-value stores –
    highly scalable (think Amazon’s Dynamo), yet very rudimentary in
    design. On the other side – graphs, which are extremely flexible
    and sophisticated, but less scalable and mostly tailored to
    idiomatic queries. Putting graph databases aside for a moment, the
    basic trend is: the bigger your data, the more denormalized it is.
    Coming from a relational world, the developer has to change her
    mindset from “here’s my data and now I’d like to ask some
    questions” to “here are my questions and how do I store my data
    to answer them.” It’s a radical change for many and is hard to get
    used to. No wonder polyglot persistence is so common these days.
    Each one of these databases was designed with a specific use case in
    mind. I found Neo4j to be the most fascinating – it’s plain graph
    theory after all. It’s also a very natural approach to storing data.
    It actually satisfied about 90% of our requirements, but Neo suffers
    from a hub node issue and that was a deal breaker unfortunately.

  • http://www.daedtech.com/blog Erik Dietrich

    Which did you wind up choosing…?

    Anyway, I really like the way you describe the mindset change — I think that’s spot on. I just kind of stumbled into that mindset myself and started thinking of alternatives. I found myself thinking, “why is it so incredibly hard to get an answer to question X? If I just had the data stored in a different way, answering that question would be easy.” But doing some research or prototyping, knowing what questions you want to ask of your data, and designing storage around answering those questions is a great approach.

    I’ll have to look at graph databases more. I wasn’t familiar with an actual graph-oriented persistence model until you mentioned it and I looked it up. So far for me, that’d be a solution without a problem but I probably haven’t given it enough thought, and the math buff in me loves the concept.

  • Alex Lysenka

    On hold for now. Neo4j has a free book you can download from their website: http://www.neo4j.org/learn. I strongly recommend it, as it will provide insight into the problems graph databases are equipped to solve. Some of the use cases include social, recommendations, geo, master data management, network and data center management, authorization and access control, and communication. For a more concrete example, think of the “People you may know” use case or the triadic closure concept in social network theory (http://en.wikipedia.org/wiki/Triadic_closure).