When are columns better than rows as the way to organize your
database?
The standard business database is a relational system that
organizes data into tables, built with vertical columns and
horizontal rows.Oracle (NSDQ: ORCL), DB2, SQL Server, Sybase, and
MySQL all are organized around the rows. But when it comes to data
warehouses, said Michael Stonebraker, the former Berkeley database
guru, it's better for performance reasons to focus on the
columns.
He should know. He was the original architect of row-based
Ingres and pioneer of relational databases used for the past 25
years in business. The row-based system is great for executing
transactions, but a column-oriented system is a natural fit for
data warehouses, he said.
"You get a factor of 50 times better performance on all classes
of queries used to run most data warehouses," he says. Stonebraker
is now CTO of Vertica Systems, supplier of the column-based Vertica
database system. Vertica 2.0 was unveiled at the Data Warehousing
Institute's World Conference in Las Vegas Tuesday.
Columns cut across individual transactions, and store an element
of information that is standard to each transaction, such as the
customer name or address or amount of purchase.
That's because a data warehouse frequently stores information on
a company's transactions, with each transaction having many parts.
A row may hold the 20 to 200 different elements of the typical
transaction, such as customer name, address, and amount of
purchase. Columns, on the other hand, cut across transactions and
represent one piece of information from all transactions. Data
warehouse queries typically want to ask questions that that draw
information out of all transactions, or large sets of them, rather
than look at individual transactions. For example, a typical query
might ask: What was the average sale during the month of February?
Retrieving the Sales column gets all the data the query needs to
determine the average.
A standard relational database, however, would retrieve all the
rows reflecting individual sales for the month, load them into
system memory, then find all sales records and generate an average
from them. So being able to focus on the Sales column, the way
Stonebraker and Vertica 2.0 does, leads to data warehouse query
performance improvement.
Stonebraker said a second performance benefit flows from taking
a column approach. Because columns contain similar information from
each transaction, it's possible to derive a compression scheme for
the data type, then apply it uniformly throughout the column. Rows
can't be compressed as easily because the nature of the data varies
widely from record to record in the row, such as name, ZIP code,
and balance owed. Each would require a different compression
scheme.
Compressing data in columns makes them quicker to store and
retrieve and reduces the amount of disk required for storage. "In
every data warehouse I see, compression is a good thing," said
Stonebraker. And he claimed Vertica's compression approach is "2-3
times more effective. Columns compress better than row stores," he
said.
Vertica CEO Ralph Breslauer says 20 customers have been equipped
with Vertica since the first version emerged in September. One of
them is Sonian Networks, a Boston firm that archives e-mail for
enterprises. Greg Arnette, CTO of Sonia, said he considered
PostgreSQL, MySQL, and Oracle as the database for pulling business
intelligence out of its large e-mail archive, but opted for Vertica
instead, partly from "knowing Dr. Stonebraker's reputation."
Arnette said he expects Sonian's e-mail archive to grow rapidly
from a few terabytes to 10 petabytes (10,000 TB) sometime in 2009.
"We needed a system that was optimized for high-volume reporting,"
said Arnette. In Sonian tests, "it was a lot faster than MySQL and
requires less CPU power," he claimed. He didn't test Vertica versus
Oracle because of Oracle's "license fees and cost of
administration."
The 10-employee company uses Amazon (NSDQ: AMZN)'s "cloud"
Simple Storage Service and computing service for its e-mail archive
and Vertica systems, with both Sonian and Sonian customers using
Vertica to get reports out of the archive.
Stonebraker said the performance advantage is most evident with
large data warehouse systems. "I expect the data warehouse market
to become completely column-store based, just because it is
inherently higher performance," he said.
An academic experimental column-store database is a joint
project of MIT, Yale, Brandeis, Brown, and the University of
Massachusetts-Boston. An open source project based on a columnar
database approach is MonetDB.