Wednesday, August 22, 2012

#Hadoop Myths Debunked

James Kobielus
This was originally published August 20th, 2012 on "The Big Data Hub" (an IBM blog) by James Kobielus, Big Data Evangelist.
Hadoop has acquired a large body of prevailing myths in its short history as the hottest new big data technology. I'm surprised and dismayed when I see these myths propagated in leading business publications, such as in this recent Forbes article. Here now are some quick debunks of the myths in this particular piece that got my goat:

Hadoop Myth #1: Hadoop is primarily for batch processing.

Far from it. Hadoop is being used for the full spectrum of advanced analytics, both batch and real-time, against structured and unstructured data. It has a database (HBase), an analytics environment (MapReduce, Hive, Mahout, etc.), and visualization tools (IBM InfoSphere BigInsights being one of many on the market). Taken together, the entire Hadoop stack, which is mostly open-source Apache software with optional proprietary vendor tools, apps, libraries and accelerators, provides the foundation for complete applications.

Hadoop Myth #2: Hadoop has a computational model called "MapReduce."

MapReduce is not a computational (i.e., statistical) model. Rather, MapReduce is a modeling/abstraction framework and runtime environment for in-database execution of a wide range of data analytic functions in a massively parallel computing fabric. That fabric may be centralized on a single computing hub, cluster or server (which may be highly efficient for some jobs) or spread out across a huge distributed cluster of machines (which may be best for others).
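To make the "framework, not statistical model" point concrete, here is a minimal single-process sketch in Python of the map, shuffle and reduce phases, using word counting as the analytic function. It does not use Hadoop's API; the names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    lines = ["Hadoop is not HDFS", "hadoop is a stack"]
    print(run_job(lines))
```

Swapping in different map and reduce functions changes the analytic being computed while the surrounding plumbing stays the same, which is the sense in which MapReduce is an execution framework. A real cluster would run many map and reduce tasks in parallel rather than in one process as assumed here.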

Hadoop Myth #3: Hadoop is synonymous with HDFS and vice versa.

At its heart, Apache Hadoop is an open-source community that has multiple subprojects (MapReduce, Hadoop Distributed File System (HDFS), HBase, Hive, Pig, Mahout, etc.; the list continues to grow) and has spawned an ecosystem and solution market, one of whose participants is IBM (with InfoSphere BigInsights). Many industry observers seem to assume (but rarely state outright) that Hadoop is HDFS and HDFS is Hadoop. But that's absurd. The core and defining subproject is MapReduce, not HDFS. There are plenty of Hadoop deployments that don't involve HDFS (or have built alternatives that use the HDFS API), but not a single one that does not involve MapReduce (without which it would not qualify as Hadoop).

Hadoop Myth #4: Hadoop is the first successful big data technology.

The first successful big data technologies were enterprise data warehouses that implement massively parallel processing, scale to the petabytes, handle batch and real-time latencies with equal agility, and provide connectors to structured and unstructured sources. In other words, platforms like IBM Netezza and IBM Smart Analytics System have been doing big data (according to the 3 Vs and overlapping considerably with Hadoop in use cases/apps supported) for several years, since before the commercial Hadoop arena (which got going in earnest only a year ago) got off the ground.

Hadoop Myth #5: Hadoop is the commoditized, back-end, bare bottom of the big data analytics stack.

Hadoop is not the bottom of the stack. It is a full stack that is growing fuller all the time, both at the Apache community level and in vendor (e.g., IBM) productization. Hadoop is an evolving solution platform, just as an EDW is a solution platform. Indeed, Hadoop is the very heart of the big data revolution and is the core of the next-generation EDW in the cloud.

Hadoop Myth #6: Hadoop is not a database.

See the comment on HBase under Myth #1. Also, Cassandra, a real-time distributed database, is a separate Apache project (not a Hadoop subproject) that integrates with Hadoop MapReduce. And HDFS, a distributed file system, supports the data persistence/storage features that make it a key database-like platform in Hadoop. A big part of Hadoop's flexibility is the ability to dispense with these databases and file systems, if you wish, and run MapReduce (the core of Hadoop) over non-Hadoop databases, such as Netezza, MySQL, etc.

Hadoop Myth #7: Hadoop is hard to set up, use and maintain.

Sure. The technology, market and solutions are still maturing, and skills are in short supply. But Hadoop is maturing fast, and we and others are improving usability with every product release.

Hadoop Myth #8: Only savvy Silicon Valley engineers can derive value from Hadoop.

Wrong. A growing number of Hadoop case studies come from other regions and other industries.