A perfectly normalized database is a timebomb waiting to explode.

Once upon a time, God came down to earth in the form of E.F. Codd and gave us the Holy Tenets of Database Design.

Among the most important things that every database designer or application developer had to learn was this:

“Thou shalt design your database so that it at least fulfills the conditions of 3NF (3rd normal form)”

Since then, thousands of people have used that adage as the basis of good database design (myself included). And it works, you know! Without following those basic principles of good database design, it just would not be possible to build database-backed applications in any sane manner.

So, what exactly is the problem then? The problem becomes apparent to anyone who ends up with at least a few thousand rows of data spread across multiple tables.

The more a database is optimized for storage, the more effort and time it takes to retrieve data from it.

You see, database normalization is aimed at storing data in the most optimal manner: preserving referential integrity, removing data redundancy, and making inserts, updates and deletes as painless as possible. But that is only one part of the picture.

Because of what a relational database fundamentally is, optimizing it for storage automatically leaves it unoptimized for retrieval.

So let's say you have stored all the relevant details of your employees across 4 tables:

  • Employee master
  • Committees the employee is part of
  • Salary Transactions
  • Projects Handled

So, it's all neatly stored in 4 tables, with all the tables linked by primary and foreign keys, indexed, and retrofitted with the relevant triggers. Now, what do we do when we want to see an employee's details on one page? We join all 4 tables and then retrieve the data, roughly as sketched below. This is, of course, assuming that the other fields in those tables do not themselves require accessing still more tables.

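To make that concrete, here is a rough sketch of what such a normalized schema and the "employee details" query might look like. The table and column names are made up purely for illustration; they are not from any real system.

```sql
-- Hypothetical normalized schema (names invented for illustration)
CREATE TABLE employee_master (
    emp_id    INT PRIMARY KEY,
    full_name VARCHAR(100),
    dept      VARCHAR(50)
);

CREATE TABLE committee_membership (
    emp_id         INT REFERENCES employee_master(emp_id),
    committee_name VARCHAR(100)
);

CREATE TABLE salary_transaction (
    emp_id  INT REFERENCES employee_master(emp_id),
    paid_on DATE,
    amount  DECIMAL(10, 2)
);

CREATE TABLE project_assignment (
    emp_id       INT REFERENCES employee_master(emp_id),
    project_name VARCHAR(100),
    role         VARCHAR(50)
);

-- The "employee details" page ends up pulling from all four tables at once:
SELECT e.full_name, c.committee_name, s.paid_on, s.amount, p.project_name, p.role
FROM employee_master e
LEFT JOIN committee_membership c ON c.emp_id = e.emp_id
LEFT JOIN salary_transaction   s ON s.emp_id = e.emp_id
LEFT JOIN project_assignment   p ON p.emp_id = e.emp_id
WHERE e.emp_id = 12345;
```
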
The above scenario is not much of a performance issue with 40 employee records in the table. But when you have 150,000 employee records, things start to crash. Imagine doing those joins across 4 tables with that kind of data. No amount of indexing is going to save you from the massive load the database server will go through. Let's take the scenario a little further: imagine 40 people trying to see the details of 40 different employees at the same time, when the database has over 500,000 records in each table.

Wouldn't it be ideal if we could just get all the employee data from a single table for retrieval, yet keep it normalized for storage?

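For illustration, the kind of flat, read-only table I am wishing for might look something like the sketch below (again, all names are invented). The details page then becomes a single-row lookup with no joins at all.

```sql
-- Hypothetical denormalized "read" table: one wide row per employee,
-- with the related details pre-collected into plain columns.
CREATE TABLE employee_details_flat (
    emp_id        INT PRIMARY KEY,
    full_name     VARCHAR(100),
    dept          VARCHAR(50),
    committees    TEXT,            -- e.g. 'Safety; Social; Audit'
    latest_salary DECIMAL(10, 2),
    projects      TEXT             -- e.g. 'Payroll revamp (lead); Intranet (developer)'
);

-- The details page is now a single-row lookup, no joins needed:
SELECT * FROM employee_details_flat WHERE emp_id = 12345;
```
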
There was a time when I didn't even give a second thought to normalization. Like everyone else, I took it for granted that once the database was normalized, I didn't really need to worry about its design for the rest of the project.

I was forced to reconsider that assumption when I was developing a web application for a major newspaper a few years back. The job was to put everything they had published over the last 10 years into databases and provide a web application for users to see old issues, and to search for and retrieve articles.

Very early in the project, we realized that even with just 10,000 articles in the database, all fully normalized, keyword searches were taking far too long. It was just not working out.

To cut a long story short, we ended up with two databases: one for storage, into which all the data was entered, and one for retrieval, which the website used simply to read data. I wrote some offline code in Java to process and migrate the data from the storage db to the retrieval db. This approach was possible because updates were made only once a day, at the point when the newspaper was ready to go to print.

Thanks to this approach (and a couple of other innovations which are not really relevant to this article), the search database was very fast for retrievals: we got almost a 100-fold increase in performance. We had simply denormalized the articles retrieval database enough that fetching the main article data never took more than 2 tables.

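The gist of that nightly migration, expressed here as a plain SQL sketch rather than the original Java code, was roughly the following. All database, table and column names are invented for illustration.

```sql
-- Sketch of the nightly storage -> retrieval migration (invented names;
-- the real thing was offline Java code, not a single SQL statement).
TRUNCATE TABLE retrieval_db.article_search;

INSERT INTO retrieval_db.article_search
    (article_id, issue_date, section, headline, author_name, body_text, keywords)
SELECT a.article_id,
       i.issue_date,
       s.section_name,
       a.headline,
       au.author_name,
       a.body_text,
       a.keywords
FROM storage_db.articles a
JOIN storage_db.issues   i  ON i.issue_id   = a.issue_id
JOIN storage_db.sections s  ON s.section_id = a.section_id
JOIN storage_db.authors  au ON au.author_id = a.author_id;
```
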
I personally feel that the optimal solution is to keep one database for storage and another for retrieval, but I also know how impractical that would be, given our current approach to creating database-backed applications. Unless someone comes up with another radical approach to storing data, or maybe some new magic middleware which automatically takes care of separating and optimizing storage and retrieval, we will still be following the normalization process for quite some time to come.

Oh, and by the way, I still can't think of designing a database without normalization 🙂. It's just too much trouble.

8 Comments

  1. Mario: I have heard of Lucene, but as you rightly guessed, it does not really replace the standard database normalization process for any generic RDBMS design.

  2. This is precisely the reason that OLAP databases exist and are completely separate from their OLTP brethren. Typically you use an ETL process to convert one to another.

    An entire industry was formed around business intelligence and data analysis. Nobody expects to report off the data in OLTP form – it’s simply not optimized for it.

  3. I am not at the level you people are at, just a kid in the computer field. But I was thinking: wouldn't it be feasible to build "views" based on the front-end use case? The normalized database would be efficient for saving records and maintaining atomicity, and the views would give faster retrieval of records.

  4. At the scale you're talking about (1/2 million rows), I don't see why there would be an issue with a 4-way join on reasonable hardware. I have a lot of apps that are 4-5 years old, in 3NF with millions of rows and lots of joins, and they generally run fast and with little overhead.

    The biggest performance cost in an RDBMS is IO retrieval: thus if you put those 4 tables into 1 table, your IO cost will be nearly the same as if you do the joins! This is because most relational engines store rows in blocks, and they can never retrieve part of a block, or part of a row.

    How big are your rows?

  5. I’d like to offer a counterpoint here, as this is an issue I have enjoyed debating on a number of occasions.

    I’d like to point out a flaw in your argument as to why normalized data is supposedly slower. By definition normalized data sets involve less redundancy. Therefore, the total amount of data that needs to be read will typically be significantly less. I think that’s why in my experience and readings, for most databases (at least until you get to the REALLY huge ones) normalized reads benchmark faster.

    I also think James’s comment was a correct one.

    Was the database you were using for the newspaper project you mentioned MySQL? Reads of normalized tables joined together really do tend to be slower in MySQL, from what I can tell. I have yet to observe this in one of the highly optimized databases (Oracle, SQL Server, etc.)

    Here is my blog post on the topic if you are interested in following up – http://ramthemdown.wordpress.com/2008/11/21/denormalization-is-bad-umkay/
