1. Hortonworks Data Platform 1.2 focuses on continued innovation with Apache Ambari and enhanced security and performance for Hive and HCatalog.
2. Key features include root cause analysis, usage heat maps, and improved ecosystem integration in Ambari, as well as enhanced security models and concurrency improvements for Hive and HCatalog.
3. Hortonworks ensures tight alignment with open source Apache projects by certifying the latest stable components and contributing leadership and code back to projects.
Committed to building 100% open source Hadoop for the Enterprise
So how does this get brought together into our distribution? It is really pretty straightforward, but also very unique.

We start with the group of open source projects that I described, which we are continually driving forward in the OSS community. [CLICK] We then package the appropriate versions of those open source projects, integrate and test them using a full test suite (including all the IP for regression testing contributed by Yahoo), and [CLICK] contribute all of the bug fixes back to the open source tree. From there, we package and certify a distribution in the form of the Hortonworks Data Platform (HDP) that includes both Hadoop Core and the related projects required by the Enterprise user, and provide it to our customers.

Through this application of Enterprise software development process to the open source projects, the result is a 100% open source distribution that has been packaged, tested, and certified by Hortonworks. It is also 100% in sync with the open source trees.
100% Open Source: Eliminating Lock-In
Quarterly Cadence: regular innovation every three months
Validated & Tested by our ecosystem partners
Embargo Date: January 15
HDP tracks closely to Apache project releases.

CDH forks early and patches often, maintaining CDH distributions off to the side of the Apache community projects, resulting in unnecessary drift and risk of lock-in. The “+923.423” and the “+541” parts of the version numbers represent how many patches these components have drifted away from the corresponding Apache projects. While some drift is to be expected, patch counts in the hundreds result in lock-in and eliminate the virtuous cycle that the upstream community should help drive.
I can’t really talk about Hortonworks without first taking a moment to talk about the history of Hadoop.

What we now know as Hadoop really started back in 2005, when Eric Baldeschwieler – known as “E14” – started work on a project to build a large-scale data storage and processing technology that would allow Yahoo to store and process massive amounts of data to underpin its most critical application, Search. The initial focus was on building out the technology – the key components being HDFS and MapReduce – that would become the core of what we think of as Hadoop today, and continuing to innovate it to meet the needs of this specific application.

By 2008, Hadoop usage had greatly expanded inside of Yahoo, to the point that many applications were now using this data management platform. As a result, the team’s focus extended to include Operations: now that applications were beginning to propagate around the organization, sophisticated capabilities for operating Hadoop at scale were necessary. It was also at this time that usage began to expand well beyond Yahoo, with many notable organizations (including Facebook and others) adopting Hadoop as the basis of their large-scale data processing and storage applications, necessitating a focus on operations to support what was by now a large variety of critical business applications.

In 2011, recognizing that more mainstream adoption of Hadoop was beginning to take off and with an objective of facilitating it, the core team left – with the blessing of Yahoo – to form Hortonworks. The goal of the group was to facilitate broader adoption by addressing the Enterprise capabilities that would enable a larger number of organizations to adopt and expand their usage of Hadoop.

[note: if useful as a talk track, Cloudera was formed in 2008, well BEFORE the operational expertise of running Hadoop at scale was established inside of Yahoo]
In summary, by addressing these elements, we can provide an Enterprise Hadoop distribution which includes the:
Core Services
Platform Services
Data Services
Operational Services
required by the Enterprise user.

And all of this is done in 100% open source, and tested at scale by our team (together with our partner Yahoo) to bring Enterprise process to an open source approach. And finally, this is the distribution that is endorsed by the ecosystem to ensure interoperability in your environment.
As the volume of data has exploded, we increasingly see organizations acknowledge that not all data belongs in a traditional database. The drivers are both cost (as volumes grow, database licensing costs can become prohibitive) and technology (databases are not optimized for very large datasets).

Instead, we increasingly see Hadoop – and HDP in particular – being introduced as a complement to the traditional approaches. It is not replacing the database but rather complementing it, and as such it must integrate easily with existing tools and approaches. This means it must interoperate with:
Existing applications, such as Tableau, SAS, Business Objects, etc. (see the sketch after this list)
Existing databases and data warehouses, for loading data to and from the data warehouse
Development tools used for building custom applications
Operational tools for managing and monitoring
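To make that first integration point concrete, here is a minimal sketch of how a JDBC-based application reaches Hive; BI tools such as Tableau or Business Objects consume Hive through the same JDBC/ODBC interface via their connectors. It assumes a HiveServer2 endpoint, and the host “hdp-master”, the user “hdpuser”, and the “web_logs” table are hypothetical placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcSketch {
        public static void main(String[] args) throws Exception {
            // Register the HiveServer2 JDBC driver shipped with Hive.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Host, port, user, and table below are hypothetical; substitute
            // your own cluster details.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hdp-master:10000/default", "hdpuser", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT status, COUNT(*) FROM web_logs GROUP BY status")) {
                // Results flow back through the standard java.sql API, which
                // is exactly how JDBC-based tools consume Hive tables.
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }

The point of the sketch is that Hive exposes data stored in HDFS over the same SQL-over-JDBC channel existing applications already speak, so complementing the database does not require replacing the surrounding tooling.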
Eric and team created the Hadoop project as open source, and that is and always will be central to our approach. We believe strongly that the technology needs to be community driven and open source.

In terms of open source mechanics, Apache Hadoop is governed by the Apache Software Foundation (ASF), which provides structure to what inside a commercial software company would be a tightly governed development, test, and release process. When we think of Core Hadoop, the ASF has helped to manage this process for several years now.

However, as Hadoop has become more widely used, it has spawned a set of ancillary open source projects that introduce capabilities required for more mainstream use. These projects are generally classified as either being related to:
“Data Services” – those that enable the storage, processing, and accessing of data
“Operational Services” – those that enable the management and operations of the infrastructure

The projects within these categories are run as independent projects with their own teams, and include some of the technologies you likely know of: Data Services include projects such as Hive, Pig, HBase, and HCatalog, while Operational Services include Apache Ambari and more.

Hortonworkers have always played a critical role in the development, test, and release process for Core Apache Hadoop, but also play leading roles in these ancillary projects that are required for enterprise usage. This includes every role from committer to release manager and, in many cases, project lead. For example, Arun Murthy is the project lead for Core Hadoop.

Current Hortonworks PMC members by project:
Hadoop: Arun Murthy, Devaraj Das, Enis Soztutar, Giridharan Kesavan, Jitendra Nath Pandey, Mahadev Konar, Matt Foley, Owen O'Malley, Sanjay Radia, Suresh Srinivas, Nicholas Sze, Vinod Kumar Vavilapalli
Pig: Daniel Dai, Alan Gates, Giridharan Kesavan, Ashutosh Chauhan, Thejas Nair
Hive: Ashutosh Chauhan
HBase: None
Oozie: Devaraj Das, Alan Gates
Sqoop: None
Flume: None
Bigtop: Alan Gates, Steve Loughran, Owen O'Malley
Incubator (not a Hadoop project, but shows who's helping grow new projects in Apache): Arun Murthy, Devaraj Das, Alan Gates, Mahadev Konar, Steve Loughran, Owen O'Malley, Enis Soztutar
We are believers in open source: for us, it is the most efficient way to develop enterprise software.

But more importantly, we believe that 100% open source is the best approach for our customers. In the data management market in particular, our customers are acutely aware of the implications of growing their database usage with a proprietary vendor who can then exert pricing pressure (Oracle).

Particularly when it comes to data storage, which we can all anticipate will continue to grow exponentially, you don’t want to be penalized for scale. By choosing an open source approach, organizations can build their operational processes on open technologies without concern that they will be locked in to a particular vendor. And they can be confident that as their usage grows, they can choose from flexible pricing alternatives – by node or by storage – that align best to their needs.

It is ultimately about mitigating risk, and in this regard open source has been proven the safest approach. I would also caution you to look beyond the open source label used by some vendors: are they harvesting open source work, forking the code, and then working independently (“fork early / patch often”)? Or, like Hortonworks, have they embraced and committed to the community open source approach, which allows them to stay in sync with the innovation of the community? In the Hadoop community, Hortonworks is unquestioned in taking the community-driven approach.