Kajeepeta reports that Hadoop was inspired by MapReduce, a system designed by Google to support distributed computing on large data sets. Hadoop has many commercial flavors and is supported by a large ecosystem of tools and technologies that can help organizations tackle the broad problem of big data analytics. Several large technology companies (including Amazon.com, Facebook, IBM, Twitter and Yahoo!) and end user companies (such as eBay, Zurich, The New York Times and Fox Network) are effectively using Hadoop to power various big data initiatives, such as enterprise search, social connections, sentiment analysis, log analysis, data mining and even supply-chain reporting.
When backed by an appropriate business case, applying Hadoop to a big data initiative can yield dramatic results for the business. Visa has said, for example, that the processing time for 73 billion transactions, amounting to 36 Tbytes of data, shrank from one month with traditional methods to a mere 13 minutes with Hadoop.
However, Kajeepeta also highlights the fact that more than just Hadoop is needed to deal with the problems of big data. He suggests that a layered approach incorporating best practices is the best path to leveraging big data. "As a set of layers needed to build any data analytics solution, the reference architecture of big data projects does look quite familiar," he said. "Where it differs from the norm is in the layers that account for large volumes of distributed, and potentially heterogeneous, data; modeling tools that deal with the rather flat (and evolving) nature of the data relationships involved; specialized scale-out analytic databases and BI suites; and niche big data analytics packages for customer and sales domains."
Hadoop supports that layered approach with capabilities that address the pragmatic needs of big data, including parallel and batch processing of large data sets (often many gigabytes to terabytes in size); a fault-tolerant clustered architecture; the ability to move compute power closer to the data (rather than the other way around); and an ecosystem of open, portable layers of enterprise architecture extending from the compute/data layer all the way up to the analytics layer.
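To make the parallel, data-local batch processing concrete, here is a minimal sketch of the classic word count job written against Hadoop's Java MapReduce API. The example is illustrative rather than drawn from Kajeepeta's report, and the input and output paths are placeholders; each map task runs near its HDFS block and emits (word, 1) pairs, and the reduce phase sums them after the shuffle.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: each mapper processes one HDFS block locally and emits (word, 1).
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce phase: after the shuffle, the counts for each word are summed.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    // Input and output are HDFS directories passed on the command line.
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Packaged into a jar, a job like this would typically be launched with hadoop jar, and the framework handles task scheduling, data locality and retries on failed nodes across the cluster.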
Nevertheless, companies still need to be selective about which Hadoop projects and components they adopt, and should proceed with caution. Kajeepeta’s research indicates that a good starter set of Hadoop projects might include HDFS and HBase for data management; MapReduce and Oozie as a processing framework; Pig and Hive as development frameworks for developer productivity; and the open source Pentaho for BI.
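To illustrate the developer-productivity point about Hive, the sketch below submits a HiveQL aggregation from Java through HiveServer2's JDBC driver. The connection URL, table name (web_logs) and columns are hypothetical assumptions, but the single query stands in for what would otherwise be a hand-written MapReduce job.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveLogSummary {
  public static void main(String[] args) throws Exception {
    // Register the Hive JDBC driver and point at a HiveServer2 endpoint
    // (host, port and database are placeholders to adjust for a real cluster).
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String url = "jdbc:hive2://localhost:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "", "");
         Statement stmt = conn.createStatement()) {
      // Hive compiles this declarative query into MapReduce jobs under the hood.
      ResultSet rs = stmt.executeQuery(
          "SELECT status_code, COUNT(*) AS hits "
          + "FROM web_logs GROUP BY status_code ORDER BY hits DESC");
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}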
Kajeepeta warns that the effective implementation and management of Hadoop requires a fair amount of expertise. If such expertise does not exist in-house, enterprises may want to partner with a service provider and/or implement one of the commercial versions of Hadoop. It is also important that companies consider the security of the massive amounts of information stored in distributed clusters and potentially in public clouds. Before embarking on projects with live (and quite possibly sensitive) data, it is important to determine the security profile of the data and make the provisions necessary to protect it.
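As one hedged illustration of such a provision, the sketch below assumes a Kerberos-secured cluster and uses Hadoop's UserGroupInformation API to authenticate a client before it reads from HDFS; the principal name, keytab path and directory are placeholders, not prescriptions from the report.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

public class SecureHdfsClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Tell the Hadoop client that the cluster expects Kerberos authentication.
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);

    // Authenticate with a service principal and keytab (placeholder values).
    UserGroupInformation.loginUserFromKeytab(
        "analytics@EXAMPLE.COM", "/etc/security/keytabs/analytics.keytab");

    // Subsequent HDFS calls run as the authenticated principal.
    try (FileSystem fs = FileSystem.get(conf)) {
      FileStatus[] entries = fs.listStatus(new Path("/secure/data"));
      System.out.println("Visible entries: " + entries.length);
    }
  }
}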
All things considered, Kajeepeta makes a strong argument for using Hadoop to get a big data analytics project started, and he effectively points out many of the pitfalls and best practices that enterprises should consider before venturing into the realm of big data analytics.