Streamlining Data Virtualization with AI and ML

These days, artificial intelligence (AI) and machine learning (ML) seem to be everywhere around us, answering questions, designing components, and even driving cars. Can they also help with data virtualization? For sure. But first, let’s make sure that when we talk about AI and ML, we are talking about the same things.

For a start, forget about Terminator and the Matrix. These are examples, and very scary ones at that, of artificial general intelligence. That doesn’t exist… yet. Today’s AI and ML are very limited in their ability and scope; hence the more accurate moniker “artificial narrow intelligence.” Perhaps “artificial not-much intelligence” would be more apt.

One other consideration on naming is that AI, ML, deep learning, cognitive computing, and half a dozen other terms are used in this field. For experts, they may mean very specific and different things, but they are often used interchangeably in general discussion, and I’m doing the same here. I will, however, make a different distinction: automation vs. augmentation.

Although current AI tools support specific applications in almost every conceivable field, their functions fall into two broad categories. Automation enables faster and cheaper decisions and actions, often by eliminating human activity from the process. Augmentation improves or expands human decision making and action taking with insights and suggestions. The two may overlap in some applications, but asking whether the human role is diminished or supported always reveals the key focus of AI in a specific application.

AI in Decision-Making Support

In the field of user-facing business intelligence (BI) and analytics tools, AI has been leveraged with increasing frequency over the past few years, with the aim of augmenting businesspeople’s understanding of data and results. AI techniques, such as regression, clustering, and pattern and anomaly detection, are used to offer insights and advice to users, delivered as “stories” about the results via natural language generation (an ML technique). The goal is to improve the lives and effectiveness of businesspeople.
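To make this concrete, here is a minimal sketch of that augmentation pattern, assuming a simple z-score test and a hand-written sentence template in place of the trained models and full natural language generation engines a real BI tool would use; every name and threshold below is illustrative:

```python
# Flag anomalies in a metric series and turn each one into a
# plain-language "story" for a business user. Illustrative only.
from statistics import mean, stdev

def narrate_anomalies(label, values, z_threshold=2.0):
    """Yield one-sentence 'stories' for values far from the series mean."""
    mu, sigma = mean(values), stdev(values)
    for period, value in enumerate(values, start=1):
        z = (value - mu) / sigma if sigma else 0.0
        if abs(z) >= z_threshold:
            direction = "above" if z > 0 else "below"
            yield (f"In period {period}, {label} was {value:,.0f}, "
                   f"{abs(z):.1f} standard deviations {direction} the average.")

monthly_sales = [102, 98, 105, 101, 97, 310, 99, 103]
for story in narrate_anomalies("monthly sales", monthly_sales):
    print(story)
```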

Data preparation is another big focus area, driven particularly by the challenges of understanding externally sourced big data and the observation that crystal-clear data lakes often degrade quickly into murky data swamps. This is where “data wrangling” comes into play: usually performed by data scientists, it is the detective work of discovering the structures and meanings in externally sourced data arriving in these data lakes, and it is often said to take up to 80% of a data scientist’s time.

This figure has also been often quoted in reference to “data archaeology,” a popular name for an ancient yet unpopular pastime in data warehousing. It involved (and still involves) specialized IT resources to dig around for ages in old and often obscure operational systems to find and interpret data in technically complex stores. It is one of the most costly, time-consuming, and challenging aspects of building an extract, transform and load (ETL) system for a data warehouse.

AI, trained using existing data, can assist with both wrangling and archaeology, mostly for automation and to a lesser extent for augmentation. Supervised learning requires training data that has been previously tagged by humans, whereas unsupervised learning discovers patterns in untagged data.
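A toy contrast of the two learning styles may help, assuming scikit-learn is available; the features, labels, and field types below are all illustrative:

```python
# Supervised vs. unsupervised learning on toy field "profiles".
# Each feature vector is [field length, digit ratio] for a sample value.
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

features = [[5, 1.0], [5, 1.0], [2, 0.0], [2, 0.0], [10, 0.3]]

# Supervised: humans have previously tagged each example with a field type.
labels = ["zip_code", "zip_code", "state", "state", "address"]
clf = LogisticRegression().fit(features, labels)
print(clf.predict([[5, 1.0]]))  # -> likely 'zip_code'

# Unsupervised: no tags; the algorithm groups similar fields by itself.
km = KMeans(n_clusters=3, n_init=10).fit(features)
print(km.labels_)  # cluster assignments, with no names attached
```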

Both learning approaches are used to determine the structure and content of new data sources. Suppose, for example, that a large comma-separated values (CSV) file contains dozens of unlabeled fields. AI can recognize that some fields look like zip codes, street addresses, state and country names, and so on, based simply on the content of the fields, enabling an address entity to be constructed, and its content cleansed, based on validation rules. Similarly, customer and product entities can be identified and prepared for further use. Human work, whether that of a data scientist in wrangling or an IT specialist in archaeology, is reduced to the final checking and approval of the AI analysis. Such automation reduces costs and speeds delivery.
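As a sketch of how such content-based recognition might work, here are simple regular-expression heuristics standing in for trained models; the field names and the confidence threshold are illustrative assumptions:

```python
# Guess a column's type from the share of values matching each pattern.
import re

PATTERNS = {
    "zip_code": re.compile(r"^\d{5}(-\d{4})?$"),
    "state":    re.compile(r"^[A-Z]{2}$"),
    "street":   re.compile(r"^\d+\s+\w+(\s+\w+)*\s+(St|Ave|Rd|Blvd)\.?$", re.I),
}

def guess_field_type(values):
    best, best_share = "unknown", 0.0
    for name, pattern in PATTERNS.items():
        share = sum(bool(pattern.match(v)) for v in values) / len(values)
        if share > best_share:
            best, best_share = name, share
    # Low-confidence guesses are left for a human to check and approve.
    return best if best_share >= 0.8 else "unknown"

print(guess_field_type(["94105", "10013", "60601-0001"]))  # -> zip_code
```

A production tool would learn such patterns from tagged training data rather than hard-coding them, but the division of labor stays the same: the machine proposes, the human approves.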

Applying Machine Learning to Data Virtualization

Although lacking a catchy name, the same challenge of understanding data sources is faced by the designers and builders of data virtualization systems: What is the context and meaning of the data to be accessed and processed via query or API by businesspeople through the data virtualization layer? The challenge for data virtualization is greater than for traditional ETL systems. In production, data virtualization delivers data immediately, whereas ETL systems deliver it in the background, in scheduled batches, and often offline. The bar is higher for data virtualization: any contextual errors in data understanding must be caught and fixed early and effectively, lest they impact the business user in real time.

AI techniques used in data wrangling and archaeology can also be applied in the design phase of a data virtualization implementation, both to automate the context discovery process and to augment the knowledge of the data virtualization environment designers. This is particularly useful for identifying and alerting developers to common challenges, such as the use of the same variable name for different data in diverse systems. For example, different sources may contain partially overlapping sets of Customers, as well as incompatible Customer ID schemes. Machine learning can propose matches based on pattern matching techniques and inferences drawn from the data sources, as shown in the sketch below. Note that this is work that data virtualization developers can, and typically do, perform by hand; AI reduces the workload and helps new developers learn what to look out for.
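Here is a minimal sketch of that cross-source customer matching, assuming a simple string-similarity measure from Python’s standard library in place of the trained matching models a real tool would apply; all source names, IDs, and thresholds are invented for illustration:

```python
# Pair customer records across two sources with incompatible ID schemes,
# using name similarity as a crude stand-in for ML-based matching.
from difflib import SequenceMatcher

crm_customers = {"C-0042": "Acme Corporation", "C-0099": "Globex Inc"}
erp_customers = {"7731": "ACME Corp.", "8110": "Initech LLC"}

def normalize(name):
    return name.lower().replace(".", "").replace(",", "")

def match_customers(left, right, threshold=0.7):
    """Yield (left_id, right_id, score) for names that are similar enough."""
    for lid, lname in left.items():
        for rid, rname in right.items():
            score = SequenceMatcher(None, normalize(lname),
                                    normalize(rname)).ratio()
            if score >= threshold:
                yield lid, rid, round(score, 2)

for pair in match_customers(crm_customers, erp_customers):
    print(pair)  # e.g. ('C-0042', '7731', 0.72) as a match candidate
```

The result is a set of scored match candidates for a developer to confirm or reject, which is precisely the augmentation described above.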

Into the Future

Applying AI to data preparation, whether in data wrangling, data archaeology, or especially data virtualization, is still in its early stages. It is more difficult than applying AI in other, more generalized application areas, because training data sets are smaller and more restricted. However, this situation will change as new AI techniques, such as deep reinforcement learning and generative adversarial networks, reduce the amount of training data required. As a result, we may expect to see a more extensive application of machine learning to data virtualization emerge in the near future.

Barry Devlin