The Growing Need for SQL for Hadoop
TDWI research shows that data warehouse professionals are increasingly adopting Hadoop as part of a modern multiplatform data warehouse. A TDWI survey that ran in 2012 found that 10 percent of data warehouse programs canvassed had Hadoop in production in their warehouse environments. Note that this is a count of production systems, not prototypes or skunkworks. Subsequent surveys that repeated the same question saw production systems increase to 14 percent in 2014 and to 20 percent in 2016.
SQL support is one of many challenges that data warehouse professionals face as they adopt Hadoop. Data warehouse and other data management professionals want to use their SQL skills and tools with Hadoop data. They have valuable experience being productive with SQL, and they have portfolios of tools that access data and operate on it via SQL queries and interfaces. The catch is that Hadoop is still weak in relational functions in general, and that includes SQL.
Why Is SQL Critical to Hadoop?
Now I know what you're thinking. Why should anyone need a legacy technology such as SQL on a shiny new platform like Hadoop?
Well, SQL just had its thirtieth birthday, and it's as relevant as ever, regardless of platform type.
SQL continues to be the language of data. Other query languages (notably Object Query Language and XQuery) tried to displace SQL and failed. Although Hive on Hadoop is a popular tool among Hadoop users, most of those users are application developers. Data management professionals have no trouble learning HiveQL, but they prefer SQL, and so do their tools.
SQL is familiar and it works. Many people (both business and technical) have SQL skills and use them daily.
Numerous tools that could be used with a Hadoop-based data lake support ANSI SQL. These include tools for exploration, reporting, analytics, visualization, data integration, and data quality. Many users already have these in their software portfolios, and they wish to use them with a data lake.
Data exploration is a top priority for many Hadoop-based data lake users. Most of them want to explore the lake via ad hoc queries based on ANSI SQL. Without SQL, exploration is harder and slower.
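To make the exploration use case concrete, here is a minimal sketch of the kind of ad hoc, ANSI-standard query involved. SQLite stands in for whichever SQL-for-Hadoop engine is actually deployed (Hive, Impala, and so on), and the table and column names are invented for illustration; only the query style, not the engine, is the point.

```python
import sqlite3

# Stand-in engine: an in-memory SQLite database plays the role of a
# SQL-on-Hadoop engine. The clickstream table is a hypothetical example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clickstream (user_id INTEGER, page TEXT, ms INTEGER)")
conn.executemany(
    "INSERT INTO clickstream VALUES (?, ?, ?)",
    [(1, "home", 120), (1, "search", 340), (2, "home", 90), (2, "checkout", 510)],
)

# A typical ad hoc exploration query: profile page traffic using only
# plain ANSI SQL constructs (aggregate, GROUP BY, ORDER BY).
rows = conn.execute(
    """
    SELECT page, COUNT(*) AS hits, AVG(ms) AS avg_ms
    FROM clickstream
    GROUP BY page
    ORDER BY hits DESC
    """
).fetchall()
for page, hits, avg_ms in rows:
    print(page, hits, avg_ms)
```

Because the query uses nothing beyond standard SQL, the same statement could be issued from a BI tool, a notebook, or a command-line shell, which is exactly why ANSI SQL support matters for exploration.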
SQL on and off Hadoop
One of the hottest debates concerning Hadoop and SQL is about the distinction between:
- SQL on Hadoop: SQL executes natively as a process inside Hadoop
- SQL off Hadoop: Tools outside Hadoop query Hadoop data
TDWI discusses the two approaches together using the phrase SQL for Hadoop. Our recent data lake survey asked Hadoop users to gauge their need for SQL for Hadoop.
Almost all Hadoop users need some kind of SQL support. A mere 1 percent of survey respondents say they have no need for SQL for Hadoop. This is not surprising: most of Hadoop's use cases involve practices that tend to be wedded to SQL, namely decision making, exploration, and analytics.
The majority need both SQL on Hadoop and SQL off Hadoop. Among respondents who need only one approach, SQL on Hadoop (21 percent) edges out SQL off Hadoop (11 percent). However, most Hadoop users surveyed need both (67 percent).
This makes sense. A Hadoop implementation may need to satisfy the requirements of diverse user types, which leads to multiple, diverse approaches to data and tools. Furthermore, a range of data integration and quality tools rely on SQL, and a data lake needs these for data ingestion and data flows. Even traditional data warehouse and reporting environments include multiple SQL-based tools for the same reasons.
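The reliance of data integration and quality tools on SQL usually takes the form of generated, set-based statements that move data from a staging area into curated tables. The sketch below illustrates that pattern under stated assumptions: SQLite again stands in for the target engine, and the staging/curated table names, columns, and quality rules are all hypothetical.

```python
import sqlite3

# Hypothetical staging-to-curated data flow. SQLite is a stand-in for
# whatever SQL engine a data integration tool would target; the orders
# tables and quality rules are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stage_orders (order_id INTEGER, amount REAL, region TEXT);
    CREATE TABLE orders (order_id INTEGER, amount REAL, region TEXT);
    INSERT INTO stage_orders VALUES
        (1, 25.0, 'east'), (2, -5.0, 'west'), (3, 40.0, NULL), (4, 12.5, 'east');
    """
)

# The kind of set-based SQL an integration or quality tool generates:
# promote only rows that pass basic quality checks from staging into
# the curated table, in a single INSERT ... SELECT statement.
conn.execute(
    """
    INSERT INTO orders
    SELECT order_id, amount, region
    FROM stage_orders
    WHERE amount > 0 AND region IS NOT NULL
    """
)
loaded = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(loaded)  # number of rows that passed the quality checks
```

Because the whole flow is expressed in SQL, it can run wherever a SQL interface exists, inside Hadoop or outside it, which is why such tools depend on SQL for Hadoop being available.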
Some emerging practices for self-service with Hadoop data, including data exploration, self-service analytics, visualization, and data prep, assume SQL. Without SQL on Hadoop, self-service is limited, which in turn limits the range of user types who can get business value from Hadoop data.
To be fair and complete, let's recall that there are also many viable use cases for Hadoop that do not require SQL or any other relational function. For example, advanced analytics takes many forms, many of them enabled by technologies for data mining, graph analysis, statistics, associative engines, semantic technologies, and natural language processing.
Keep an eye out for my next TDWI Best Practices Report, Data Lakes: Purposes, Practices, Patterns, and Platforms. TDWI will publish it at www.tdwi.org/bpreports on or about April 1, 2017. In the meantime, to hear other views about SQL on Hadoop, replay the 2016 TDWI Webinar, "SQL for Hadoop: When to Use Which Approach," online at www.tdwi.org/webinars.
Philip Russom is director of TDWI Research for data management and oversees many of TDWI’s research-oriented publications, services, and events. He is a well-known figure in data warehousing and business intelligence, having published over 500 research reports, magazine articles, opinion columns, speeches, Webinars, and more. Before joining TDWI in 2005, Russom was an industry analyst covering BI at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and BI consultant and was a contributing editor with leading IT magazines. Before that, Russom worked in technical and marketing positions for various database vendors. You can reach him at email@example.com, @prussom on Twitter, and on LinkedIn at linkedin.com/in/philiprussom.