We are in the early stage of the data revolution. Today’s conventional wisdom holds that hybrid latency prevents you from running analytics and machine learning workloads in the cloud while the data stays on-prem. As a result, most companies either do not burst their workloads into the cloud, or they copy their data into a cloud environment and maintain that duplicate data, and even this can take days or weeks.
Counter to conventional wisdom, you can create a high-performance hybrid/multi-cloud data platform for analytics and machine learning by using data orchestration technology. Just as container orchestration technology (e.g. Kubernetes) enables containers to run in any environment, agnostic to the hardware running the application, data orchestration technology enables active working sets of data to be orchestrated on demand. This enables enterprises to create a compute-agnostic, storage-agnostic, and cloud-agnostic stack to serve all their data-driven workloads, including analytics (e.g. Apache Spark, Presto, Apache Hive) and AI (e.g. TensorFlow, PyTorch).
An enterprise hybrid cloud analytics strategy should allow compute to burst across on-prem and cloud data stores. By bringing the hot data to the analytics and machine learning applications, performance matches having the data co-located in the cloud, while computation is offloaded from the on-prem data stores and the additional I/O overhead is minimized.
Specifically, a data orchestration system should:
- Abstract data access across storage systems. Today, data is siloed in various storage systems and deployments. For example, much on-premises data is stored in systems such as Apache HDFS, Ceph, EMC Isilon, and IBM Cleversafe, while data in the cloud is stored in systems such as AWS S3, Google Cloud Storage, Alibaba OSS, etc. Historically, there have been many attempts to unify all data into a single storage deployment, with limited success. As a result, the trend has shifted toward embracing these data silos by abstracting the data across the various storage systems/deployments. To do this, the data orchestration system needs to abstract and virtualize across storage deployments via server-side API translation, which enables users to run existing applications without modifying any code. The system also needs to provide a single global namespace so that applications and users can access all the data in their environments without needing to know which storage system the data resides in.
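To make the single global namespace idea concrete, here is a minimal sketch of the mount-and-resolve pattern: logical paths are mapped to backend storage URIs via mount points, so applications address data through one path scheme regardless of where it physically lives. All names here (`GlobalNamespace`, `mount`, `resolve`, the example URIs) are hypothetical, not the API of any particular system.

```python
class GlobalNamespace:
    """Hypothetical global namespace mapping logical paths to storage URIs."""

    def __init__(self):
        # logical prefix -> physical backend URI
        self._mounts = {}

    def mount(self, logical_prefix, backend_uri):
        """Attach a storage system (HDFS, S3, etc.) under a logical prefix."""
        self._mounts[logical_prefix.rstrip("/")] = backend_uri.rstrip("/")

    def resolve(self, logical_path):
        """Translate a logical path into the backing store's native URI."""
        # Match the longest mounted prefix so nested mounts take precedence.
        for prefix in sorted(self._mounts, key=len, reverse=True):
            if logical_path == prefix or logical_path.startswith(prefix + "/"):
                return self._mounts[prefix] + logical_path[len(prefix):]
        raise FileNotFoundError(f"No mount covers {logical_path}")


# One namespace spanning an on-prem HDFS cluster and a cloud object store.
ns = GlobalNamespace()
ns.mount("/warehouse", "hdfs://onprem-nn:8020/data/warehouse")
ns.mount("/warehouse/archive", "s3://company-archive/warehouse")

print(ns.resolve("/warehouse/sales/2019.parquet"))
# -> hdfs://onprem-nn:8020/data/warehouse/sales/2019.parquet
print(ns.resolve("/warehouse/archive/2015.parquet"))
# -> s3://company-archive/warehouse/2015.parquet
```

The application sees only `/warehouse/...` paths; whether a read is served from HDFS on-prem or S3 in the cloud is a mount-time decision, which is what lets existing code run unmodified when data moves.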
- Bring data locality to a separated compute and storage environment. While cloud vendors and leading companies in data-heavy industries have adopted a decoupled compute and storage architecture, implementing this type of architecture can still be very challenging, especially when it comes to bursting on-premises data into the public cloud. A data orchestration system resolves this challenge with caching functionality that automatically orchestrates hot data closer to compute while keeping cold data in cheaper storage. Because each workload may have a different access pattern, the system must support customizable policies best suited to those specific workloads. Moving forward, policies that span multiple clouds will also be critical to enabling a multi-cloud environment.
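One possible caching policy for keeping hot data near compute is least-recently-used (LRU) with read-through semantics: a miss pulls the block from the remote store and admits it, while cold blocks are evicted from the cache but remain safely in the backing store. The sketch below is illustrative only; `HotDataCache` and its interface are assumptions, not a real system's API, and a production system would plug in per-workload policies as the text describes.

```python
from collections import OrderedDict


class HotDataCache:
    """Hypothetical read-through LRU cache in front of a remote store."""

    def __init__(self, capacity_blocks, backing_store):
        self.capacity = capacity_blocks
        self.backing_store = backing_store  # callable: block_id -> bytes
        self._cache = OrderedDict()         # block_id -> data, in LRU order

    def read(self, block_id):
        if block_id in self._cache:
            # Cache hit: refresh recency so hot blocks stay resident.
            self._cache.move_to_end(block_id)
            return self._cache[block_id]
        # Cache miss: fetch from the (remote) store, then admit the block.
        data = self.backing_store(block_id)
        self._cache[block_id] = data
        if len(self._cache) > self.capacity:
            # Evict the coldest block; it still lives in the backing store.
            self._cache.popitem(last=False)
        return data


fetches = []

def remote_read(block_id):
    fetches.append(block_id)  # record remote I/O to show what was offloaded
    return f"data:{block_id}".encode()

cache = HotDataCache(capacity_blocks=2, backing_store=remote_read)
cache.read("a")
cache.read("b")
cache.read("a")   # hit: served locally, no remote fetch
cache.read("c")   # miss: admits "c" and evicts cold block "b"
print(fetches)    # -> ['a', 'b', 'c']
```

Note that the repeated read of `"a"` never touches the remote store, which is exactly the I/O offload the bullet above describes; swapping the eviction rule (e.g. LFU, TTL-based, or pinning) is where workload-specific policies would plug in.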
- Provide security for data in motion and at rest, with secure access methods. A data orchestration system plays a critical role in the success of an organization’s data strategy. While open source software is important and can help avoid potential vendor lock-in, it is also critical that the system provide the necessary security features, given how much of the organization’s data it touches.
Besides empowering developers of data-driven applications, a data orchestration system also brings value to data and infrastructure engineers. It reduces the risk of vendor lock-in by giving organizations flexibility at the infrastructure level. Transitioning to a different storage system (including cloud storage), adopting another application framework, or even operating a hybrid or multi-cloud environment are all possible without adopting many different technologies or incurring large development costs.