
SQL Followup



Just to follow up on the discussion of loading data into SQL engines,
the use of keys, and the differences between normalized and
unnormalized databases...

Firstly, normalized vs. unnormalized.  Luke covered the topic quite
well.  Normalization is the act of breaking data up into categorical
groups and assigning identifiers to the individual records of each
group.  Take, for example, an address book.  Any individual you may
wish to place in an address book may have more than one email
address, snail mail address, phone number, or URL.  For maximum
flexibility and maximum reuse of data, you break these elements into
their respective tables.

First, you create a "header" table for the individual, containing
fairly static information such as their name, birthdate (if you want
it), and nickname.  A unique identifier is assigned to this
individual so that you can relate this record to the other tables in
the database.  You then proceed to make tables for each of the other
categories: address, phone, email, etc.
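
To make that concrete, here's a minimal sketch of what the person and
address tables might look like in PostgreSQL.  All of the table and
column names are made up for illustration; the phone and email tables
would follow the same pattern as address.

    CREATE TABLE person (
        person_id   SERIAL PRIMARY KEY,   -- the unique identifier
        name        VARCHAR(80) NOT NULL,
        nickname    VARCHAR(40),
        birthdate   DATE
    );

    CREATE TABLE address (
        address_id  SERIAL PRIMARY KEY,
        -- ties each address back to its owner
        person_id   INTEGER NOT NULL REFERENCES person (person_id),
        street      VARCHAR(80),
        city        VARCHAR(40),
        state       CHAR(2),
        zip         VARCHAR(10)
    );

The REFERENCES clause is also what gives you a referential integrity
check for free, which comes up again below.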

Now, when the individual gains a new address, you simply add a record
to the address table and associate it with the individual.  If you had
implemented this address book as one large table containing all of
these fields, what is referred to as a "wide table", you would have to
edit the individual's entire record simply to change the address.
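
In the normalized schema sketched above, that's a single INSERT; the
person_id of 42 is, of course, made up:

    -- Add a new address without touching the person record at all.
    INSERT INTO address (person_id, street, city, state, zip)
        VALUES (42, '123 Main St', 'Minneapolis', 'MN', '55401');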

There are other advantages to having a normalized database over a
"wide" database.  Changes to a normalized database have less impact on
the performance of the database as a whole.  Indexing, maintenance, or
data locking can be segregated to a smaller portion of the database,
making such activities less intrusive.  To change an address in a wide
database, you may have to lock a table that others are trying to
access at the same time, which is a problem unless you have more
advanced locking mechanisms, such as "row-level" locking or
concurrent versioning.
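
As a taste of what row-level locking looks like, PostgreSQL lets you
lock just the row you're changing with SELECT ... FOR UPDATE.
Something like this, again assuming the toy schema from above:

    BEGIN;
    -- Lock only this one address row; other rows and tables
    -- stay available to everyone else.
    SELECT * FROM address WHERE address_id = 7 FOR UPDATE;
    UPDATE address SET street = '456 Oak Ave' WHERE address_id = 7;
    COMMIT;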

Yes, I know.  Lots of buzz words.  Just bear with me.  Check out the
documentation on http://www.postgres.org/ if you want to know what I'm
talking about.

Anyway, there are disadvantages to having a normalized database, such
as the more complex queries you have to write and the greater
complexity of maintaining referential integrity (making sure you
didn't delete someone from the database while leaving their address or
other data just floating around).  Queries may actually take longer,
depending upon how many tables you're trying to join together for
your query result set.  (I want to see all of the addresses, phone
numbers, and email addresses for Ben Kochie.  I will have to join four
tables to see all of this data at once, etc.)
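
That four-table join might look something like this, using the
hypothetical schema from earlier (the phone and email tables and
their columns are assumed to be built like the address table):

    SELECT p.name, a.street, a.city, ph.phone_number, e.email_addr
      FROM person p, address a, phone ph, email e
     WHERE a.person_id  = p.person_id
       AND ph.person_id = p.person_id
       AND e.person_id  = p.person_id
       AND p.name = 'Ben Kochie';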

There are techniques for making these types of "reports" quicker, such
as maintaining a "report table" or using views.  (Views are pretty
cool... basically a virtual table; see the sketch below.)  So the real
challenge in writing a database schema is deciding whether
normalization is right for the application functionality you're trying
to provide.  For ultimate flexibility, normalization is the key.  For
ultimate speed, either wide tables or a combination of normalization
and wide "report" tables may be the answer.
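
Here's the view idea in rough form, wrapping the same made-up join
from above into a single virtual wide table:

    CREATE VIEW contact_report AS
        SELECT p.name, a.street, a.city, ph.phone_number, e.email_addr
          FROM person p, address a, phone ph, email e
         WHERE a.person_id  = p.person_id
           AND ph.person_id = p.person_id
           AND e.person_id  = p.person_id;

    -- Now the "report" reads like one wide table:
    SELECT * FROM contact_report WHERE name = 'Ben Kochie';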

What does this all have to do with Tim's problem?  Not much.  He
doesn't get to design the database schema, so he's stuck with the data
as he gets it, unless he wants to transform it into a normalized
database himself (if it isn't already).

Well, I've written enough for now...  I didn't touch on keys and such,
but that's a discussion for another day.

-- 
  Chad "^chewie, gunnarr" Walstrom <chewie@wookimus.net>
              http://wookimus.net/chewie
