Creating the Ultimate Data Orchestrator

By Corey Reed, Head of Data Science & Satya Gandham, Machine Learning Engineering Manager

StubHub Product & Tech Blog


When you serve millions of customers every year, averaging more than 1.3 tickets sold per second, the opportunity to put all of that data to work is enormous. Our ongoing migration to the cloud multiplies that opportunity, opening several new and exciting avenues to explore. We are using Google Cloud Platform (GCP) and its ML tools, like TensorFlow and ML Engine, to enhance our customers’ experience at StubHub.

From a data science perspective, what’s most exciting is the chance to construct our own data orchestrator (DO). It will unlock the power of machine learning, allowing models to access the various data sources we have at our disposal and to integrate seamlessly with our application. The DO component manages all the data services we provide to our customers by acting as a data broker for our ML models.

For ML models that are being trained, or that need data to make a prediction, DO serves as a middle layer between the data sources and the model. For models that have already made a prediction, DO is a middle layer between the model and the StubHub application that acts on the prediction.
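To make the idea concrete, here is a minimal sketch of those two roles in Python, with entirely hypothetical names standing in for our real services:

```python
# Minimal sketch of the broker idea (all names hypothetical):
# gather inputs for models on the way in, route predictions on the way out.
from typing import Any, Callable, Dict, List

class DataOrchestrator:
    def __init__(self,
                 sources: Dict[str, Callable[[dict], dict]],
                 app_handlers: Dict[str, Callable[[dict], None]]):
        self.sources = sources            # data source name -> fetch function
        self.app_handlers = app_handlers  # model name -> application callback

    def inputs_for(self, needed: List[str], request: dict) -> dict:
        """Middle layer between the data sources and the model."""
        features: Dict[str, Any] = dict(request)
        for name in needed:
            features.update(self.sources[name](request))
        return features

    def deliver(self, model: str, prediction: dict) -> None:
        """Middle layer between the model and the application acting on it."""
        self.app_handlers[model](prediction)
```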

With this DO component, we’ll be increasing the ease of data access and speed of model development & deployment, ultimately allowing us to provide more value to our fans than ever before.

Utilizing GCP Tools for ML

As we set out to modernize our data warehouse, we wanted to create a new platform on which to build StubHub’s ML applications. We know other companies have built their own ML platforms in-house from the ground up, but that can be a time- and labor-intensive process.

We turned to GCP for the “out-of-the-box” infrastructure and tools it provides: ML Engine, on which we can build the DO so that data scientists can deploy and upgrade models faster, and TensorFlow, an ML library that Google engineers regularly use on their own platforms.

With these tools in place, we can scale a job as needed, so idle clusters won’t add cost to the bottom line. We can take advantage of custom chips (Tensor Processing Units) built for TensorFlow workloads to perform complex computations as quickly as possible. We can test one version of a model against another. Lastly, we can create an internal REST API, allowing engineers to call that one API instead of building a custom API for every new model our data science team creates.
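For illustration, here is a sketch of what calling a deployed model through ML Engine’s online prediction REST API can look like from Python; the project and model names are placeholders, not our production values:

```python
# Sketch: online prediction against a model deployed on ML Engine.
# Requires google-api-python-client; project/model names are placeholders.
from googleapiclient import discovery

def predict(project: str, model: str, instances: list) -> list:
    service = discovery.build("ml", "v1")
    name = "projects/{}/models/{}".format(project, model)
    response = service.projects().predict(
        name=name, body={"instances": instances}).execute()
    if "error" in response:
        raise RuntimeError(response["error"])
    return response["predictions"]
```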

Data Orchestrator

DO acts as an overseer that knows where everything lives inside our different data sources, and it can collate that data and share it with the API that ML Engine creates.

We decided we needed a DO on top of GCP for a variety of reasons. The DO optimizes data collection and can make near-real-time data available to models. It keeps us compatible with the programming languages data scientists depend on, such as Python and R. In addition, with DO, data scientists can simply specify which data a model needs and rely on the orchestrator to determine how to locate that data, as sketched below.
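As a rough sketch, with hypothetical names throughout, a model’s data needs might be declared like this, leaving the how of fetching to the orchestrator:

```python
# Hypothetical sketch: a data scientist declares *what* a model needs;
# the orchestrator works out *where* it lives and *how* to fetch it.
MODEL_CONTRACT = {
    "model": "example_model",
    "inputs": {
        "request_fields":      {"source": "request"},   # supplied by the caller
        "recent_transactions": {"source": "bigquery",
                                "table": "sales.recent_transactions"},
        "seat_quality":        {"source": "on_prem",
                                "service": "seat-quality-api"},
    },
}
```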

DO adds a layer of security, preventing outside access to certain units of data. It sets up a framework that makes it easy for applications to go live and inherently meet our Information Security (IS) standards.

The Pricing Guidance Engine Example

Here’s a scenario we’ve been testing: a seller wants to create a new listing for a ticket. She enters the details of the event, such as the city, date, and venue. The seller knows the face value of the ticket, but she doesn’t know its current market value. So StubHub provides the seller with a pricing estimate.

A high-level view of our data architecture.

If we exposed only the API from ML Engine, its inputs would be the specific features of the ML model. But those features come from different places. Some inputs come from the front-end; in this case, the page where the seller just added the details of the listing. Others come from various back-end sources. One potential source is a table in BigQuery that stores recent transactions, which the ML model needs to know about. Other data, like details about seat quality and recent transactions, is currently available only on StubHub’s on-premise compute clusters.
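As one example, pulling recent-transaction features from BigQuery might look like the following; the dataset, table, and column names are hypothetical, and it assumes the google-cloud-bigquery client library:

```python
# Sketch: fetching recent-transaction features from BigQuery.
# All dataset/table/column names here are hypothetical.
from google.cloud import bigquery

def recent_transactions(event_id: str, limit: int = 100) -> list:
    client = bigquery.Client()
    query = """
        SELECT sale_price, quantity, sale_timestamp
        FROM `stubhub.sales.recent_transactions`
        WHERE event_id = @event_id
        ORDER BY sale_timestamp DESC
        LIMIT @limit
    """
    job_config = bigquery.QueryJobConfig(query_parameters=[
        bigquery.ScalarQueryParameter("event_id", "STRING", event_id),
        bigquery.ScalarQueryParameter("limit", "INT64", limit),
    ])
    rows = client.query(query, job_config=job_config).result()
    return [dict(row) for row in rows]
```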

Without a DO in place, our engineers and data scientists would have to create custom procedures, specific to this pricing-guidance use case, to collect all the data the ML model needs.

With a DO, the data scientist who builds the model can design the way it finds the data it needs. Instead of calling ML Engine’s API directly, the StubHub application sends a message to the DO stating that it wants to make a specific kind of request, along with the pertinent information it has on hand (i.e., the details of the listing). The DO recognizes what kind of request this is and then references the contract the data scientist created, which specifies that for this kind of request, data must be accessed from specific sources (e.g., Bigtable, Pub/Sub, on-premise systems).
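A minimal sketch of that dispatch step, with placeholder fetchers standing in for the real BigQuery, Pub/Sub, and on-premise integrations:

```python
# Hypothetical sketch: the DO resolves a request against a contract.
# Placeholder fetchers stand in for the real source integrations.
def fetch_bigquery(spec: dict, payload: dict) -> dict:
    return {"recent_transactions": []}   # would run a BigQuery query

def fetch_on_prem(spec: dict, payload: dict) -> dict:
    return {"seat_quality": None}        # would call an on-prem service

FETCHERS = {"bigquery": fetch_bigquery, "on_prem": fetch_on_prem}

def assemble_inputs(contract: dict, payload: dict) -> dict:
    """Gather every input the contract names, wherever it lives."""
    features = {}
    for name, spec in contract["inputs"].items():
        if spec["source"] == "request":
            features[name] = payload     # e.g. the listing details
        else:
            features[name] = FETCHERS[spec["source"]](spec, payload)
    return features                      # forwarded to the model's API
```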

DO has been built in such a way that it knows exactly where to go to find that data.

Data Stress Test

We are continuing to build out and refine our DO, which includes stress-testing the amounts of data DO can handle. For instance, we recently devised a test where we sent increasingly large inputs to the API. The API receives input as JSON, essentially a list of key-value pairs. We looked at what would happen if the JSON input was 150% bigger than our typical request. What would happen if it was 700% bigger, or even 70,000% bigger?

We wanted to see how fast we could get a response from an extremely simple model, and at what point the network latency of sending that data became noticeable. This simple experiment showed that slowdown due to the volume of data being sent to a model is not a concern except in the most extreme cases. That opens the door to some rather interesting possibilities. Will we need a fast response from a model whose input is a picture or a lengthy news article? It might be feasible!
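The test itself can be approximated with a few lines of Python; the endpoint URL and payload multipliers below are assumptions, not our production values:

```python
# Sketch of the payload-size experiment: grow a JSON body and time
# the round trip. Endpoint URL and multipliers are assumptions.
import json
import time
import requests

BASE_INSTANCES = [{"feature_a": 1.0, "feature_b": "value"}]

def timed_request(url: str, multiplier: int) -> float:
    payload = {"instances": BASE_INSTANCES * multiplier}
    start = time.monotonic()
    requests.post(url, json=payload, timeout=60).raise_for_status()
    return time.monotonic() - start

for mult in (1, 2, 8, 700):  # roughly the baseline, 150%, 700%, 70,000% cases
    latency = timed_request("https://do.example.internal/predict", mult)
    size_kb = len(json.dumps({"instances": BASE_INSTANCES * mult})) / 1024.0
    print("{}x payload ({:.1f} KB): {:.1f} ms".format(mult, size_kb, latency * 1000))
```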

Results from a latency response test on DO for various JSON file sizes.

Looking Ahead

We still have components of DO that we want to improve. Our advantage is the GCP platform itself, which lets us iterate faster and gives our data science team the freedom to develop more sophisticated ML programs. It gives our team power comparable to teams that built platforms entirely in-house, along with the ability to build and define our own machine learning products.

We’re now empowered to pursue new ideas that help our customers. Accurate and timely data, accessible to our ML models, will help give our customers the best prices for their tickets. And for our team of data scientists and engineers, DO is just the first step in creating a new system that will operate smarter and faster for years to come.

Learn more about the authors of this post, Corey Reed and Satya Gandham. Do you love discovering trends in data and tackling unique problems with creative solutions? Join us.
