Apache Hadoop: The Big Data Refinery
Hortonworks Data Platform delivers, in a single, tightly integrated package, popular Apache Hadoop
projects such as HDFS, MapReduce, Pig, Hive, HBase and ZooKeeper. To this base, Hortonworks Data
Platform adds open source technologies that make the Hadoop platform more manageable, open, and
extensible. A complete set of open APIs is provided, making it easier for enterprises and ISVs to
integrate and extend Apache Hadoop.
Making Hadoop accessible begins with installation and configuration. Already a laborious task,
installing and configuring Hadoop is made all the more complex by the fact that the open source
projects that make up the Hadoop platform are independently developed and frequently updated
codebases, each with its own release schedule, versions and dependencies.
To ensure a consistent and stable platform for enterprise use, Hortonworks Data Platform includes only
stable component versions that have been fully integrated, tested and certified as part of Hortonworks’
extensive QA process, and that are supported by the company’s multi-year support and maintenance policy.
Hortonworks Data Platform supplies installation and configuration tools that make it easy to install,
deploy and manage these certified components. Included in the platform is the Hortonworks
Management Center, which is based on Apache Ambari, an open source installation, configuration and
management system for Hadoop. The Hortonworks Management Center provides a comprehensive web
dashboard that integrates monitoring, metrics and alerting information into a unified, Hadoop-specific
management console.
Important metadata management functionality is included in Hortonworks Data Platform via an open
source project called Apache HCatalog. HCatalog provides centralized metadata services, including table
and schema management, to all of the platform components. Additionally, it provides a method for
deeper integration with third-party data management and analysis tools, improving interoperability.
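As an illustration of that shared metadata layer, the sketch below shows a Pig script reading a table whose schema is registered with HCatalog, with no schema re-declaration in the script itself. The table name `pageviews` and its `views` column are hypothetical examples, and the `HCatLoader` package name can differ between HCatalog releases; this is a sketch of the pattern, not a definitive recipe.

```pig
-- Assumes a table named 'pageviews' was created earlier (e.g., via Hive DDL);
-- HCatalog stores its schema and location centrally, so Pig needs only the name.
-- Typically run with HCatalog support enabled: pig -useHCatalog script.pig
raw = LOAD 'pageviews' USING org.apache.hcatalog.pig.HCatLoader();

-- Columns such as 'views' are resolved from the HCatalog schema,
-- not declared in this script.
ranked = ORDER raw BY views DESC;
DUMP ranked;
```

Because Hive, Pig and MapReduce jobs can all resolve the same table definition this way, a schema change made in one tool is immediately visible to the others.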
Beyond technology, as the industry-leading distribution of Apache Hadoop, Hortonworks Data Platform
is backed by a powerful ecosystem of partners, including leading software vendors, hardware vendors
and systems integrators. These partnerships help ensure that your investment in Hadoop extends and
complements existing IT investments and enterprise relationships.
Getting Started With Hadoop
In this white paper, we’ve introduced the notion of Apache Hadoop as a data refinery and illustrated the
analogy with comparisons to an oil refinery. We’ve used this analogy as the context for an introduction to
Hadoop and some of the major projects in the Hadoop ecosystem.
We’ve also introduced Hortonworks Data Platform, a pre-integrated distribution of Apache Hadoop
designed to help you be more successful, more quickly, in your efforts to harness big data.
Extending your knowledge of Hadoop couldn’t be easier. The Hortonworks web site is the place to start,
offering a wealth of practical educational resources including software downloads, video tutorials and
blog posts.
To go further, Hortonworks University is your expert source for Apache Hadoop training and certification.
Public and private courses are available for developers, administrators and other IT professionals
involved in implementing big data solutions. Training courses combine presentation material with
hands-on labs that fully prepare students for real-world Hadoop scenarios. Successfully completing a
Hortonworks training course entitles you to sit for the respective Hortonworks certification exam; earning
Hortonworks certification identifies you as an expert in the Apache Hadoop ecosystem.