Indexing Improvements for the Hive Database System

Facebook Computer Science, 2010-11

Liaison(s): Jonathan Hsu ’01, John Sichi, Yongqiang He
Advisor(s): Robert Keller
Student(s): Skye Berghel, Jeffrey Lym, Russell Melick (PM-S), Marquis Wang (PM-F)

Apache Hive is an open-source distributed database and data warehouse system built on Hadoop's MapReduce framework. Designed to handle massive datasets, it is heavily used by Facebook. Facebook wants improvements to Hive's indexing framework, both in how it handles different types of data and in how easy indexes are to use. The team has developed a new index type based on bitmaps, which offers improvements for self-similar datasets. They have also augmented the existing framework to enable automatic creation and use of indexes.
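
As a rough illustration of the idea behind a bitmap index, the following sketch builds one bitmap per distinct column value and answers a combined predicate with bitwise operations. It is a minimal, in-memory toy written for this summary and is not the team's Hive implementation; the function names, the sample columns, and the use of Python integers as bitmaps are all assumptions made for illustration. It also assumes that the benefit comes from columns with many repeated values, where a small number of bitmaps covers every row.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when values[i] == value."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)


def rows_matching(bitmap):
    """Return the row numbers whose bits are set in the bitmap, in ascending order."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical sample columns with few distinct values each.
country = ["US", "CA", "US", "US", "MX", "CA"]
status = ["active", "active", "inactive", "active", "active", "inactive"]

country_index = build_bitmap_index(country)
status_index = build_bitmap_index(status)

# country = 'US' AND status = 'active': intersect the two bitmaps.
us_active_rows = rows_matching(country_index["US"] & status_index["active"])

# country = 'US' OR country = 'MX': union the two bitmaps.
us_or_mx_rows = rows_matching(country_index["US"] | country_index["MX"])

print(us_active_rows)  # [0, 3]
print(us_or_mx_rows)   # [0, 2, 3, 4]
```

Combining predicates this way touches only the bitmaps, not the rows themselves, which is why the approach tends to pay off when a column has few distinct values relative to its length.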