When are columns better than rows as the way to organize your
database?
The standard business database is a relational system that
organizes data into tables, built with vertical columns and
horizontal rows. Oracle, DB2, SQL Server, Sybase, and MySQL all are
organized around the rows. But when it comes to data warehouses,
said Michael Stonebraker, the former Berkeley database guru, it's
better for performance reasons to focus on the columns.
He should know. He was the original architect of row-based Ingres
and pioneer of relational databases used for the past 25 years in
business. The row-based system is great for executing transactions,
but a column-oriented system is a natural fit for data warehouses,
he said. "You get a factor of 50 times better performance on all
classes of queries used to run most data warehouses," he says.
Stonebraker is now CTO of Vertica Systems, supplier of the
column-based Vertica database system. Vertica 2.0 was unveiled at
the Data Warehousing Institute's World Conference last month.
Columns cut across individual transactions, and store an element of
information that is standard to each transaction, such as the
customer name or address or amount of purchase.
That's because a data warehouse frequently stores information on a
company's transactions, with each transaction having many parts. A
row may hold the 20 to 200 different elements of the typical
transaction, such as customer name, address, and amount of
purchase. Columns, on the other hand, cut across transactions and
represent one piece of information from all transactions. Data
warehouse queries typically want to ask questions that that draw
information out of all transactions, or large sets of them, rather
than look at individual transactions. For example, a typical query
might ask: What was the average sale during the month of February?
Retrieving the Sales column gets all the data the query needs to
determine the average.
A standard relational database, however, would retrieve all the
rows reflecting individual sales for the month, load them into
system memory, then find all sales records and generate an average
from them. So being able to focus on the Sales column, the way
Stonebraker and Vertica 2.0 does, leads to data warehouse query
performance improvement.
Stonebraker said a second performance benefit flows from taking a
column approach. Because columns contain similar information from
each transaction, it's possible to derive a compression scheme for
the data type, then apply it uniformly throughout the column. Rows
can't be compressed as easily because the nature of the data varies
widely from record to record in the row, such as name, ZIP code,
and balance owed. Each would require a different compression
scheme.
Compressing data in columns makes them quicker to store and
retrieve and reduces the amount of disk required for storage. "In
every data warehouse I see, compression is a good thing," said
Stonebraker. And he claimed Vertica's compression approach is "2-3
times more effective. Columns compress better than row stores," he
said.
Arnette said he expects Sonian's e-mail archive to grow rapidly
from a few terabytes to 10 petabytes (10,000 TB) sometime in 2009.
"We needed a system that was optimized for high-volume reporting,"
said Arnette. In Sonian tests, "it was a lot faster than MySQL and
requires less CPU power," he claimed. He didn't test Vertica versus
Oracle because of Oracle's "license fees and cost of
administration. "Stonebraker said the performance advantage is most
evident with large data warehouse systems. "I expect the data
warehouse market to become completely column-store based, just
because it is inherently higher performance," he said.