Mike Olson, Co-Founder and Chief Strategy Officer at Cloudera, said: “Helping our customers win in the cloud is a key strategic objective for Cloudera. Today our enterprise platform is uniquely positioned to support any kind of big data workload in the cloud, whether transient or long-lived, handling batch jobs in support of building data ingest pipelines or supporting advanced SQL analytics and complex event processing.
“We deliver true elasticity, scaling to handle workloads on demand, and offering consumption-based pricing that users expect in the cloud. Delivering this customer success requires providing companies with choice in where they run their workloads. They need the ability to react quickly to changing business demands on a platform in a manner that’s secure and meets strict guidelines for data governance.”
Enhancements
Cloudera claims that a significant number of enterprises, including Adecco, Airbnb, GoPro, Nielsen, and Novantas, are running Cloudera Enterprise on public cloud infrastructure. Reasons for deploying in a hybrid, multi-cloud, or single-cloud environment often include the desire to do the following:
- Reduce the cost associated with purchasing, configuring, and maintaining on-premises hardware required to run big data applications
- Increase the ability for data engineers and data analysts to respond to business problems through self-service provisioning
- Meet strategic objectives to “move to the cloud” to reduce a company’s owned data center footprint
According to Cloudera, Cloudera Director makes it easier for customers to deploy and manage the lifecycle of Cloudera Enterprise clusters across cloud environments. Customers can select from templates for AWS, Google Cloud Platform, and now Microsoft Azure to provision clusters, grow or shrink them, and terminate them, and they can monitor and manage all clusters from a single unified interface. Additional features of Cloudera Director now include:
- Integrated usage meter with automated billing for a pay-as-you-go computing experience to go along with node-based pricing in the cloud
- Ability to deploy into multiple regions and availability zones from a single Cloudera Director instance
- Deployment of Cloudera Director via the Azure Marketplace (coming soon)
- Support for spot instance and preemptible instance provisioning
Cloudera also states that Cloudera Enterprise 5.8 enables production-ready big data analytics optimized to run across modern IT environments. The release lets customers run Apache Impala (incubating) against popular cloud-native object stores, including Amazon S3. This means customers can now run high-performance SQL analytics and BI workloads on data in Amazon S3 without having to transform or move that data to another location on Amazon Web Services (AWS). Customers can also run the processing and query engines Apache Hive, Apache Spark, and Hive-on-Spark directly against data in Amazon S3; Cloudera claims Hive-on-Spark is three times faster than Hive on MapReduce.
Open source collaboration for Apache Spot
The company has floated a proposal in collaboration with Intel to donate Spot to the Apache Software Foundation (ASF). Apache Spot (incubating), formerly called Open Network Insight (ONI), is a community-developed open source project started by Intel that aims to increase visibility into security threats by providing advanced threat detection using big data analytics and machine learning.
Cloudera claims that by leveraging Apache Hadoop for log management and data storage at virtually unlimited scale, and Apache Spark for machine learning and near real-time anomaly detection, organizations and cybersecurity application developers are unlocking analytics functionality unmatched by previous applications. Spot allows organizations to more effectively harness this technology and the data science skills in the Apache big data ecosystem to detect unknown cyber threats.
New Spark 2.0
The company has also announced a release built on Apache Spark 2.0 (Beta), with enhancements to the API experience, performance improvements, and enhanced machine learning capabilities. In addition, Cloudera is working with the community to continue developing Apache Kudu 1.0, recently released by the Apache Software Foundation.
The company says it recognizes the growing need to stream and analyze real-time data in high-demand workloads, including machine learning models deployed in production. Its latest contributions therefore go to these open source projects, alongside deeper integration with its platform.
The Spark 2.0 features include better performance and enhanced usability with the new Dataset API, and Structured Streaming for better performance and easier ingest of traditional structured data such as time series, tabular, and Internet of Things (IoT) data. The release also adds compile-time type safety for user-defined functions, for improved reliability in mission-critical applications, along with machine learning model and pipeline persistence and newly supported machine learning libraries to take on new data sets and analytic applications.
Kudu
In September last year, Cloudera announced the public beta release of Apache Kudu, its high performance columnar store for Hadoop that enabled the powerful combination of fast analytics on fast data. Two months later, Cloudera donated Kudu to the Apache Software Foundation (ASF) to open it to the broader developer community to expand the type and variety of fast analytic use cases. While Spark 2.0 will give businesses better access to streaming data, Kudu 1.0 will enable enterprises to adopt real-time use cases at a greater pace.
According to the company, Kudu offers fast scans across data for analytics, and instant read/write capabilities for frequent updates and searches. Along with its integration with Spark, Kudu 1.0 is also tightly integrated with MapReduce and Impala, which the company says enables best-in-class processing.
Kudu 1.0 features include a simplified architecture that enables very fast batch and stream processing; fault tolerance and scalability into the hundreds of nodes; and a columnar structure that enables analysis of the latest data for real-time use cases such as time series data, machine data analytics, and online reporting.
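To make the idea of a columnar structure concrete, the following is a minimal Python sketch of why a column-oriented layout suits analytic scans. It is a toy illustration only, not Kudu's storage format or API; the sample sensor readings and the function names (to_columns, append_row, mean_from_rows, mean_from_columns) are hypothetical and chosen for this example.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# Illustration only: this is NOT Kudu's storage format or API.
# The sample readings below are made up for the example.

rows = [
    {"sensor": "a", "ts": 1, "temp": 20.5},
    {"sensor": "b", "ts": 1, "temp": 22.0},
    {"sensor": "a", "ts": 2, "temp": 21.5},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def append_row(columns, record):
    """Add a newly arrived record by appending one value to each column."""
    for name, value in record.items():
        columns[name].append(value)


def mean_from_rows(records, field):
    """Row layout: every record has to be visited to read a single field."""
    return sum(record[field] for record in records) / len(records)


def mean_from_columns(columns, field):
    """Column layout: only the requested column is read."""
    values = columns[field]
    return sum(values) / len(values)


columns = to_columns(rows)
row_mean = mean_from_rows(rows, "temp")
column_mean = mean_from_columns(columns, "temp")

# A fresh reading becomes visible to the next scan as soon as it is appended.
append_row(columns, {"sensor": "b", "ts": 2, "temp": 24.0})
latest_mean = mean_from_columns(columns, "temp")
```

In this sketch an aggregate over one field touches a single list in the columnar layout, whereas the row layout walks every record. A real columnar store adds encoding, compression, and on-disk organization on top of this basic idea.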