Once upon a time, God came down to earth in the form of E. F. Codd and gave us the Holy Tenets of Database Design.
Among the most important things that every database designer or application developer had to learn was
“Thou shalt design your database so that it at least fulfills the conditions of 3NF (3rd normal form)”
Since then, thousands of people have used that adage as the basis of good database design (myself included). And it works, you know! Without following those basic principles of good database design, it just would not be possible to build database applications in any sane manner.
So, what exactly is the problem then? The problem becomes apparent to anyone who ends up with at least a few thousand rows of data in multiple tables.
The more normalized a database is, the more effort and time it takes to retrieve data.
You see, database normalization is aimed at storing data in the most optimal manner – preserving referential integrity, removing data redundancy, and making inserts, updates, and deletes as painless as possible. But this is just looking at one part of the picture.
Given what a relational database conceptually is, optimizing it for storage automatically de-optimizes it for retrieval.
So let's say you have stored all the relevant details of Employees across 4 tables:
- Employee master
- Committees the employee is part of
- Salary Transactions
- Projects Handled
So it's all neatly stored in 4 tables, all linked by primary and foreign keys, indexed, and retrofitted with the relevant triggers. So now, what do we do when we want to see an employee's details on one page? We join all 4 tables and then retrieve the data. This is, of course, assuming that the other fields in those tables do not require accessing still more tables.
The above scenario is not much of a performance issue with a table of 40 employee records. But when you have 150,000 employee records, things start to break down. Imagine doing those join operations across 4 tables with that kind of data. No amount of indexing is going to save you from the massive load the database server will go through. Let's take this scenario a little further… Imagine 40 people trying to see the details of 40 different employees at the same time, when the database has over 500,000 records in each table.
Wouldn’t it be ideal if we could just get all the employee data from a single table for retrieval, yet keep it normalized for storage?
There was a time when I didn’t even give a second thought to normalization. Like everyone else, I took it for granted that once the database was normalized, I didn’t really need to worry about database design for the rest of the project.
I was forced to reconsider that assumption when I was developing a web application for a major newspaper a few years back. The job was to put everything they had published over the last 10 years into databases, and to provide a web application for users to browse old issues, search for articles, and so on.
Very early into the project, we realized that even with just 10,000 articles in the database, all fully normalized, keyword searches were taking far too long. It was just not working out.
To cut a long story short, we ended up having two databases – one for storage, which received all the incoming data, and a second for retrieval, which the website used simply to serve data. I wrote some offline code in Java to process and migrate the data from the storage db to the retrieval db. This approach was possible because updates were done only once, at the end of the day, when the newspaper was ready to go for print.
Due to this approach (and a couple of other innovations which are not really relevant to this article), the search database was very fast for retrievals – we got almost a 100-fold increase in performance. We had simply denormalized the article-retrieval database enough that it took no more than 2 tables to retrieve the main article data.
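The original Java migration code isn't shown in the article, so the following is only a sketch of the general idea, with made-up article tables: an end-of-day batch job reads from the normalized storage database, performs the joins once, and writes flat rows into a retrieval database the website can query without any joins.

```python
import sqlite3

# Hypothetical normalized "storage" schema -- names are assumptions,
# not from the article.
storage = sqlite3.connect(":memory:")
storage.executescript("""
CREATE TABLE article        (art_id INTEGER PRIMARY KEY, title TEXT, body TEXT);
CREATE TABLE author         (auth_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE article_author (art_id INTEGER, auth_id INTEGER);

INSERT INTO article        VALUES (1, 'Budget Passed', 'The council voted...');
INSERT INTO author         VALUES (10, 'J. Reporter');
INSERT INTO article_author VALUES (1, 10);
""")

# Denormalized "retrieval" schema: one flat row per article.
retrieval = sqlite3.connect(":memory:")
retrieval.execute("""
CREATE TABLE article_flat (art_id INTEGER PRIMARY KEY,
                           title TEXT, body TEXT, authors TEXT)
""")

# End-of-day batch: do the joins once here, so the website never has to.
rows = storage.execute("""
    SELECT a.art_id, a.title, a.body, GROUP_CONCAT(au.name, ', ')
    FROM article a
    JOIN article_author aa ON aa.art_id = a.art_id
    JOIN author au         ON au.auth_id = aa.auth_id
    GROUP BY a.art_id
""").fetchall()
retrieval.executemany("INSERT INTO article_flat VALUES (?, ?, ?, ?)", rows)

# The website now does a single-table lookup at query time.
hit = retrieval.execute(
    "SELECT title, authors FROM article_flat WHERE art_id = 1").fetchone()
print(hit)  # ('Budget Passed', 'J. Reporter')
```

The trade-off is deliberate: the flat table duplicates data and would be painful to update in place, but since the batch rebuild runs only once a day, that cost is paid offline instead of on every page view.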
I personally feel that the optimal solution is to keep one database for storage and another for retrieval, but I also know how impractical it would be to do so, given our current approach to creating database applications. Unless someone comes up with another radical approach to storing data, or maybe some new magic middleware which automatically takes care of separating and optimizing storage and retrieval, we will still be following the normalization process for quite some time to come.
Oh, and by the way, I still can’t think of designing a database without normalization 🙂 . It’s just too much trouble.