Why Reinvent the Wheel? The Challenges of DIY Open Source Analytics Platforms

by Andreas Skouloudis and David Coursey

Posted in Business | July 24, 2023 | 7 min read

To reduce their technology spend, some organizations that rely on open source projects for advanced analytics consider either building and maintaining their own runtime with the required data processing engines or staying on older, now obsolete versions of legacy Cloudera runtimes (CDH or HDP). Both options carry substantial cost and risk, because organizations underestimate the complexity and expertise required not only to build but also to operate a platform for advanced analytics.

The following sections explain in detail the five major activities involved in managing and operating a custom open source distribution:

Development of custom platform

1. Integration of open source projects and ongoing upgrades

Probably the most common misconception among organizations that consider developing their own platform concerns the size of the initial development effort. That first step requires integrating the latest versions of all required open source projects, including not just data processing engines (e.g., Apache Impala, Apache Spark) but also the foundational services needed for storage (e.g., Apache Ozone), scheduling / orchestration (e.g., Apache ZooKeeper), and security / governance (Apache Ranger and Apache Atlas). This is a complicated development workflow that requires substantial engineering effort. While a given version of each open source project is fully functional on its own, it was not built with the intention of integrating with every version of the other open source packages. As a result, the platform development team needs to test many different combinations to identify the major / minor version of each project that integrates properly with the rest of the custom distribution. Finding a working combination takes several testing cycles, each of which must confirm that the platform meets its functional and non-functional requirements.
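To make the scale of that search concrete, here is a minimal Python sketch of enumerating version combinations and filtering out the ones known to break. The project names, version numbers, and the `INCOMPATIBLE` set are all hypothetical placeholders chosen for illustration; in practice each compatibility verdict would come from a real integration test run rather than a lookup table.

```python
from itertools import product

# Hypothetical candidate versions per project (placeholders, not real releases).
CANDIDATES = {
    "engine": ["3.3", "3.4"],
    "storage": ["1.2", "1.3"],
    "security": ["2.3", "2.4"],
}

# Hypothetical pairs that earlier test cycles showed do not integrate.
INCOMPATIBLE = {
    (("engine", "3.4"), ("storage", "1.2")),
    (("storage", "1.3"), ("security", "2.3")),
}


def is_compatible(combo):
    """Return True if no pair of pinned versions is in the known-bad set."""
    pins = list(combo.items())
    for i, a in enumerate(pins):
        for b in pins[i + 1:]:
            if (a, b) in INCOMPATIBLE or (b, a) in INCOMPATIBLE:
                return False
    return True


def working_combinations(candidates):
    """Enumerate every version combination and keep the compatible ones."""
    names = list(candidates)
    combos = []
    for versions in product(*(candidates[name] for name in names)):
        combo = dict(zip(names, versions))
        if is_compatible(combo):
            combos.append(combo)
    return combos


if __name__ == "__main__":
    for combo in working_combinations(CANDIDATES):
        print(combo)
```

Even this toy example has 2 × 2 × 2 = 8 combinations to evaluate. The count grows multiplicatively with every project added: 25 projects with three candidate versions each would give 3^25, roughly 8.5 × 10^11 combinations, which is why exhaustive testing is not practical and teams end up iterating through test cycles instead.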

The platform development workflow doesn’t end there, as the engineering team needs to upgrade the platform whenever a new version of a relevant open source project is released by the community. The team then needs not only to ensure that the new version is compatible with the rest of the platform (upgrading other open source projects as needed), but also to re-apply all the custom patches / scripts built so far and re-certify all end-user applications (e.g., data engineering pipelines, machine learning models). That process has to be repeated many times a year, given the release frequency of the open source projects included in Cloudera Data Platform (CDP).
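The re-certification burden described above can be sketched as a walk over a dependency graph: when one component is upgraded, everything that depends on it, directly or transitively, has to be re-tested. The component names and edges below are hypothetical and exist only to illustrate the idea; a real platform would derive them from its own build and deployment metadata.

```python
from collections import deque

# Hypothetical "depends on" edges: each key depends on every listed value.
DEPENDS_ON = {
    "sql_engine": ["storage", "security"],
    "batch_engine": ["storage", "coordination"],
    "security": ["coordination"],
    "etl_pipeline": ["batch_engine"],
    "ml_model": ["batch_engine", "sql_engine"],
}


def impacted_by(upgraded, depends_on):
    """Return everything that directly or transitively depends on `upgraded`."""
    # Invert the edges so we can walk from a component to its dependents.
    dependents = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(node)

    # Breadth-first walk over the inverted graph.
    seen = set()
    queue = deque([upgraded])
    while queue:
        current = queue.popleft()
        for nxt in dependents.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


if __name__ == "__main__":
    print(sorted(impacted_by("coordination", DEPENDS_ON)))
```

In this toy graph, upgrading the low-level `coordination` service touches every other engine, pipeline, and model, which is the pattern that makes upgrades of foundational services so expensive to certify.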

In CDP, Cloudera manages dependencies across 25+ projects in the open source ecosystem, dealing with an inflow of hundreds of open source commits annually. To ensure that the platform can meet all functional and non-functional requirements of our customer base, we conduct four different types of tests (preCommit CI Tests, Smoke Tests, Non Functional and Readiness Tests) across a variety of scenarios, in terms of scope, environment footprint, and workload.

To ensure that our customers continuously receive the latest stability, reliability and performance improvements that become available in the open source community, Cloudera provides the latest, pre-integrated and pre-tested runtimes in Long Term Support (LTS) releases that include bug fixes, consolidated hotfixes, CVE security fixes and minor platform certifications. LTS releases drastically simplify the cluster upgrade process by providing the latest improvements as parcels that can be easily distributed to an existing cluster. In addition to LTS releases, Cloudera provides regular maintenance releases called Service Packs that also include security updates, hotfixes, performance and minor updates that guarantee the security posture and reliability of the platform.

2. Integration of custom monitoring and management tooling

An additional layer of complexity in creating and managing a custom runtime is identifying and configuring all the tools required for common platform management tasks, which proprietary Cloudera capabilities available in the CDP runtime (such as Cloudera Manager and Cloudera Observability) perform out of the box. Given the number of administration tasks involved in managing a custom open source platform, many different categories of tools are required: workload optimization tools (or Application Performance Management tools) to optimize the performance of individual workloads, environment monitoring tools for environment-level and host-level metrics and dashboards, log search tools for filtering and searching through logs, and alerting tools for sending alerts based on user-defined triggers.

Some of these tools are open source, whereas others are not (e.g., those for Workload Management and Log Search), which increases the total cost of ownership of the custom platform. By contrast, the Cloudera subscription includes, at every tier, all the management tools required for those tasks at no extra cost.

Ongoing platform management effort

While the tools presented above offer similar functionality to the Cloudera management capabilities, they result in greater management effort throughout the platform lifecycle:

3. Environment Configuration and Monitoring

An analytical stack comprising open source projects has a lot of configuration complexity. In a typical Cloudera deployment of ~100 nodes, there are more than 400 services running, each with its own environment variables (some global and others local), multiple config files, unique command line options, etc. Since there is no third-party solution dedicated to open source projects, most of those configurations need to be made manually, whereas Cloudera Manager offers a simple interface to manage that complexity. A good example of a Cloudera Manager capability not available in any other open source or commercial off-the-shelf software is Kerberos authentication. To streamline the user authentication lifecycle, Cloudera Manager offers automated Kerberos configuration, direct-to-AD Kerberos integration, and tuning / monitoring capabilities for Kerberos services.

In addition to its configuration capabilities, Cloudera Manager visualizes metrics for all open source projects and management services used by platform tenants and delivers insights that help platform administrators make decisions. Those metrics include not just the variables and metrics collected by each service (e.g., throughput, utilization, network I/O, data written) but also composite metrics and alerts that help with issue resolution and environment management. None of the open source or proprietary monitoring tools that could be used to manage / monitor a custom runtime offers that granularity of insight into environment performance and health, which makes platform management more complex and platform downtime more likely.

4. Issue Resolution

In a custom runtime with many analytical services that possess a high degree of configuration and integration complexity, issue resolution becomes a challenging topic. Organizations that maintain their own custom platforms have a limited amount of time and technical expertise to reactively handle problems that arise with mission critical services. On the other hand, Cloudera has decades of deep expertise in the open source projects included in the Cloudera runtime and the necessary resources to help clients pinpoint and resolve platform issues regardless of complexity right down to the actual code level. Cloudera Support has also developed its own troubleshooting blueprint known as CDM (Cloudera Diagnostic Methodology) which offers a plan of attack for achieving thorough and complete problem resolution.

In addition to the Cloudera Support organization, which has over 500 support resources distributed across the globe and provides 24/7 coverage on critical issues, Cloudera has over 150 committers to the various Apache open source projects included in the CDP runtime. Cloudera’s software engineers / Apache committers can be directly involved in resolving support cases when their level of expertise is required.

To further accelerate the issue resolution process, we have introduced Cloudera Observability, which is available to all customers in its Essential tier. Among other things, Cloudera Observability enables users to quickly diagnose platform or workload-related issues with advanced service health and performance metrics, conduct root cause analysis, and proactively prevent issues with Validations.

More specifically, “Validations” is one of Cloudera’s most powerful proactive and predictive support differentiators. It allows customers to obtain self-service recommendations via the MyCloudera Customer Portal on over 320 known problem signatures, drawing on the customer-only Knowledge Base inside MyCloudera as a constantly curated repository of problem summaries and solution paths. Validation alerts are powered by the Cloudera Diagnostic Bundle, which is built into Cloudera Manager and provides a comprehensive collection of both environmental and product-level diagnostics. Customers can rest easy knowing that the bundle contains no personally identifiable information or other sensitive data. Customers see a 30% decrease in time to resolution by using the Cloudera Diagnostic Bundle and have also avoided thousands of known problems by remedying them before they could cause adverse cluster effects and downtime.

5. Security and CVE Remediation

While security is one of the core disciplines in our software engineering process, we cannot ignore the likelihood of security vulnerabilities in the open source projects to which Cloudera contributes, or in the other dependencies that make up our product. Managing that risk is known as supply chain security.

Cloudera performs continuous analysis using a full suite of tools and data feeds. This allows us to identify security issues or vulnerabilities and to perform remediation with minimal delay. Every release candidate goes through extensive analysis including Static Application Security Testing (SAST), Dynamic Application Security Testing (DAST), Software Composition Analysis (SCA), and manual Penetration Testing.

The triage and validation process begins as soon as a vulnerability is identified internally or reported through an external channel. During validation, the Product Security team performs a thorough analysis of the code in question to determine exploitability and impact and to find all uses of the vulnerable code within the codebase. If the vulnerability is determined to be valid, Cloudera develops a hotfix within our Service Level Agreement (SLA) and distributes it to customers through our support portal. Developing a hotfix is often very resource intensive, especially if the development team is not familiar with the OSS code containing the vulnerability. Thankfully, Cloudera developers work in these code repositories daily.

On the other hand, an organization managing its own open source runtime has to develop and implement a hotfix for its custom platform after a security vulnerability becomes public. Doing so requires deep expertise, because the fix must remain compatible with the other components of the platform. The resulting divergence from upstream can also lead to duplicate or wasted work if the upstream project later implements a fix that conflicts with the custom code. As a result, the lack of dedicated security SMEs to identify a vulnerability promptly, combined with the substantial effort of applying a hotfix to a custom runtime, extends the time a self-supported platform is exposed to cybersecurity risks.
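One part of the triage described above, finding every component that bundles a vulnerable dependency, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Cloudera's tooling: the inventory, the library names, and the affected version range are hypothetical, and real Software Composition Analysis tools work from generated dependency manifests and published advisories rather than a hand-written dictionary.

```python
def parse_version(text):
    """Turn '2.14.1' into (2, 14, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def affected_components(inventory, library, first_bad, first_fixed):
    """List components bundling `library` in the range [first_bad, first_fixed)."""
    low, high = parse_version(first_bad), parse_version(first_fixed)
    hits = []
    for component, libraries in inventory.items():
        version = libraries.get(library)
        if version is not None and low <= parse_version(version) < high:
            hits.append(component)
    return sorted(hits)


# Hypothetical inventory: component -> {bundled library: version}.
INVENTORY = {
    "sql_engine": {"logging_lib": "2.14.1", "json_lib": "1.9.0"},
    "batch_engine": {"logging_lib": "2.17.0"},
    "governance": {"logging_lib": "2.9.1"},
    "scheduler": {"json_lib": "1.9.0"},
}

if __name__ == "__main__":
    # Hypothetical advisory: logging_lib is vulnerable from 2.0.0 up to, but not including, 2.15.0.
    print(affected_components(INVENTORY, "logging_lib", "2.0.0", "2.15.0"))
```

A scan like this only answers which components need attention. Building, testing, and rolling out a compatible fix for each of them is the part that requires the engineering depth discussed in this section.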

Conclusion

In the sections above, we provided an overview of the effort and challenges associated with building and managing a custom distribution. That effort translates into additional resources required to build, operate, and secure the platform.

The costs of hiring and retaining those resources are very high and will ultimately offset the savings from not paying for a Cloudera subscription, to say nothing of the ongoing risk of losing the talent that holds that expertise. In addition, a custom platform will negatively impact tenant experience through longer upgrade cycles, delays in provisioning new environments, and more time needed to discover and resolve issues, all of which increase the likelihood of platform downtime and performance degradation. Finally, delays in resolving internally or externally identified security vulnerabilities undermine the security posture of the entire technology organization.

If you would like to learn more about the advanced capabilities of CDP, check out a quick overview of the platform, or contact Cloudera for a discussion tailored to your situation.

Andreas Skouloudis

David Coursey

Senior Manager, Engineering - Product Security