LinkedIn open sources lakehouse tool OpenHouse

The tool is already in use at LinkedIn with more than 3,500 managed OpenHouse tables in production, serving more than 550 daily active users.

Magdalena Petrova

LinkedIn has decided to open source its data management tool, OpenHouse, which it says can help data engineers and related data infrastructure teams in an enterprise to reduce their product engineering effort and decrease the time required to deploy products or applications.

OpenHouse is compatible with open source data lakehouses and is a control plane that comprises a “declarative” catalog and a suite of data services.

A data lakehouse is a data architecture that offers both storage and analytics capabilities, in contrast to data lakes, which store data in its native format, and data warehouses, which store structured data (often queried with SQL).

“Users can seamlessly define Tables, their schemas, and associated metadata declaratively within the catalog. OpenHouse reconciles the observed state of Tables with the desired state by orchestrating various data services,” LinkedIn wrote while describing the offering on GitHub.
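To make the desired-state idea concrete, here is a minimal Python sketch of a reconciliation step. It is not OpenHouse code: the `TableSpec` type, the `reconcile` function, and the action strings are hypothetical names chosen for illustration, and a real control plane would dispatch jobs to data services rather than return a list of strings.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TableSpec:
    """Hypothetical declarative description of a table (not an OpenHouse type)."""
    schema: Dict[str, str]                 # column name -> type name
    retention_days: Optional[int] = None   # None means "keep forever"


def reconcile(desired: Dict[str, TableSpec],
              observed: Dict[str, TableSpec]) -> List[str]:
    """Return the actions needed to move the observed state to the desired state."""
    actions: List[str] = []
    for name, spec in sorted(desired.items()):
        current = observed.get(name)
        if current is None:
            # Declared but not yet provisioned.
            actions.append(f"create {name}")
            continue
        if current.schema != spec.schema:
            actions.append(f"alter schema of {name}")
        if current.retention_days != spec.retention_days:
            actions.append(f"update retention of {name}")
    # Tables that exist but are no longer declared (assumed here to be dropped).
    for name in sorted(observed.keys() - desired.keys()):
        actions.append(f"drop {name}")
    return actions
```

The point of the sketch is the shape of the comparison: users only declare what a table should look like, and the system works out which operations close the gap.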

Fundamental idea behind the product

But why did LinkedIn choose to develop the big data management tool for lakehouses?

According to company engineer Sumedh Sakdeo, it all started with the company opting for open source data lakehouses for internal requirements over cloud data warehouses as the former “allows more scalability and flexibility.”

However, Sakdeo said that despite adopting an open source lakehouse, LinkedIn faced challenges around offering a managed experience for its end-users.

Unlike typical managed offerings for databases or data platforms, the end-users in this case were LinkedIn’s internal data teams, and the management would have to be done by its product engineering team.

“Not having a managed experience often means our end-users have to deal with low-level infrastructure concerns like managing the optimal layout of files on storage, expiring data based on TTL to avoid running out of quota, replicating data across geographies, and managing permissions at a file level,” Sakdeo said.

Moreover, LinkedIn’s data infrastructure teams would be left with little control over the system they had to operate, making it harder for them to regulate proper governance and optimization, Sakdeo explained.

Enter OpenHouse — a tool that solves these challenges by eliminating the need to perform additional data management activities in an open source lakehouse.

According to LinkedIn, the company has implemented more than 3,500 managed OpenHouse tables in production, serving more than 550 daily active users and catering to a broad spectrum of use cases.

“Notably, OpenHouse has streamlined the time-to-market for LinkedIn’s dbt implementation on managed tables, slashing it by over 6 months,” Sakdeo said, adding that onboarding LinkedIn’s go-to-market systems to OpenHouse has helped it achieve a 50% reduction in the end-user toil associated with data sharing.

Inside OpenHouse

But how does it work? At its heart, OpenHouse, which is a control plane for managing tables, is a catalog that comes with a RESTful table service designed to offer secure and scalable table provisioning and declarative metadata management, Sakdeo said.

Additionally, the control plane encompasses data services, which can be customized to seamlessly orchestrate table maintenance jobs, the senior software engineer said.

The catalog service, according to LinkedIn, facilitates the creation, retrieval, updating, and deletion of an OpenHouse table.

“It is seamlessly integrated with Apache Spark so that end-users can utilize standard engine syntax, SQL queries, and the DataFrame API to execute these operations,” LinkedIn said in a statement.

Standard supported syntax includes, but is not limited to: SHOW DATABASES, SHOW TABLES, CREATE TABLE, ALTER TABLE, SELECT FROM, INSERT INTO, and DROP TABLE.

Additionally, the catalog service will allow users to establish retention policies on time-partitioned OpenHouse tables.

“Through these configured policies, data services automatically identify and delete partitions older than the specified threshold. End-users can also employ extended SQL syntax tailored for OpenHouse,” Sakdeo said, adding that the service also allows users to share OpenHouse tables.
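The retention behavior Sakdeo describes comes down to comparing each partition's date with a cutoff. The Python sketch below shows that selection step only; the function name `expired_partitions` and its parameters are assumptions made for illustration, and it does not reflect how OpenHouse's data services are actually implemented.

```python
from datetime import date, timedelta
from typing import Iterable, List


def expired_partitions(partition_dates: Iterable[date],
                       retention_days: int,
                       today: date) -> List[date]:
    """Return partitions strictly older than the retention threshold, oldest first."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_dates if d < cutoff)


if __name__ == "__main__":
    partitions = [date(2024, 3, 1), date(2024, 3, 3), date(2024, 3, 9)]
    # With a 7-day policy evaluated on 10 March 2024, the cutoff is 3 March 2024.
    for d in expired_partitions(partitions, retention_days=7, today=date(2024, 3, 10)):
        print("would delete partition", d.isoformat())
```

Whether a partition that falls exactly on the cutoff is kept or deleted is a policy decision; this sketch keeps it.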

OpenHouse supports Apache Iceberg, Hudi, and Delta table formats.

To help enterprise users replicate tables, the company has extended the data ingestion framework, Apache Gobblin, by contributing cross-geography replication functionality tailored for Iceberg tables.

IcebergDistcp, a component within this framework, ensures high availability for Iceberg tables, allowing users to execute critical workflows from any geographic location, the company said.

“OpenHouse classifies tables as either primary or replica table types, allowing replica tables to be read-only for end-users. Update and write permissions are exclusively granted to the distcp job and the OpenHouse system user,” it added.
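The primary/replica rule in that statement can be expressed as a small permission check. The Python sketch below is illustrative only: the `TableType` enum, the `can_write` function, and the principal names `distcp_job` and `openhouse_system` are hypothetical and are not taken from the OpenHouse codebase.

```python
from enum import Enum


class TableType(Enum):
    PRIMARY = "primary"
    REPLICA = "replica"


# Hypothetical principal names standing in for "the distcp job and the
# OpenHouse system user" mentioned in the statement.
REPLICA_WRITERS = frozenset({"distcp_job", "openhouse_system"})


def can_write(table_type: TableType, principal: str, has_write_grant: bool) -> bool:
    """Decide whether a principal may update or write to a table."""
    if table_type is TableType.REPLICA:
        # Replicas are read-only for end-users, regardless of their grants.
        return principal in REPLICA_WRITERS
    # Primary tables follow ordinary grants (assumed to be checked elsewhere).
    return has_write_grant
```

Under this rule, an end-user who holds a write grant can still only read a replica, while the replication job can write to it without any user-level grant.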

On the storage front, OpenHouse supports a Hadoop Filesystem interface, compatible with HDFS and blob stores that support it. Storage interfaces can be augmented to plug in with native blob store APIs, the company said.

As for database support, OpenHouse utilizes a MySQL database to store metadata pointers for Iceberg table metadata on storage.

“The choice of database is pluggable. OpenHouse uses the Spring Data JPA framework to offer flexibility for integration with various database systems,” Sakdeo said.

Other functionalities of OpenHouse include observability and governance.

Copyright © 2024 IDG Communications, Inc.