Data Virtualization and Data Science

In a typical data science lifecycle, many of the stages have more to do with data than with science. Before data scientists can begin the actual analysis, they often must start by:

  1. Finding the right data
  2. Gaining access to that data, which might involve establishing credentials
  3. Transforming that data into a usable format
  4. Combining that data with data from other sources
  5. Cleansing that data, if necessary, to eliminate incomplete data points

Enter Data Virtualization

Data virtualization provides data scientists with integrated real-time views of the data, across all of its existing locations. The best part is that it provides such views without having to move any data from its original locations to a new, centralized repository, such as a data lake or data warehouse.

Data virtualization can do this because it forms a unified data-access layer above the different data sources. This layer contains no source data; just the critical metadata necessary to access the different data sources.
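
To make the idea of a metadata-only layer concrete, here is a minimal sketch in Python. It is not how the Denodo Platform is implemented; the `VirtualLayer` class, the two stand-in sources (an in-memory SQLite table and a CSV string), and all names are hypothetical and chosen only for illustration. The layer stores nothing but a mapping from a view name to a function that knows how to read the source, so rows are fetched only when a view is queried.

```python
import csv
import io
import sqlite3

# Two hypothetical "sources" standing in for real systems (assumptions for this sketch).
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
orders_db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 20.0), (2, 35.5)])

CUSTOMERS_CSV = "customer_id,name\n1,Ada\n2,Grace\n"


class VirtualLayer:
    """Holds only metadata: a view name mapped to a function that reads the source."""

    def __init__(self):
        self._views = {}

    def register(self, name, reader):
        self._views[name] = reader  # no rows are copied here

    def query(self, name):
        return self._views[name]()  # rows are fetched from the source at query time


def read_orders():
    cur = orders_db.execute("SELECT customer_id, amount FROM orders")
    return [{"customer_id": cid, "amount": amt} for cid, amt in cur]


def read_customers():
    return [
        {"customer_id": int(r["customer_id"]), "name": r["name"]}
        for r in csv.DictReader(io.StringIO(CUSTOMERS_CSV))
    ]


layer = VirtualLayer()
layer.register("orders", read_orders)
layer.register("customers", read_customers)

# A change at the source is visible on the next query, because nothing was replicated.
orders_db.execute("INSERT INTO orders VALUES (1, 5.0)")
orders_rows = layer.query("orders")
```

Because the layer keeps readers rather than rows, the order inserted after registration still appears in `orders_rows`.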

Not only does data virtualization facilitate finding and gaining access to data, but because it is implemented as a separate data-access layer above the disparate sources, it can also perform transformation, combination, and cleansing, all on-the-fly.
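
As a rough illustration of transformation, combination, and cleansing performed on-the-fly, the sketch below defines a view as a Python generator. The source rows, field names, and the `customer_spend_view` function are all hypothetical; a real data virtualization product would express this declaratively rather than in hand-written code. The point is only that the combined result is computed when requested and is never stored in a new repository.

```python
# Hypothetical source rows; in practice these would be fetched at query time.
crm_rows = [
    {"id": "1", "email": "ADA@EXAMPLE.COM "},
    {"id": "2", "email": None},  # incomplete record
    {"id": "3", "email": "grace@example.com"},
]
billing_rows = [
    {"customer": 1, "total_cents": 2500},
    {"customer": 3, "total_cents": 990},
]


def customer_spend_view():
    """Yield combined rows lazily; nothing is materialized in a new store."""
    totals = {b["customer"]: b["total_cents"] for b in billing_rows}
    for row in crm_rows:
        if not row["email"]:  # cleanse: drop incomplete rows
            continue
        cid = int(row["id"])  # transform: normalize the key type
        yield {
            "customer_id": cid,
            "email": row["email"].strip().lower(),  # transform: normalize format
            "total": totals.get(cid, 0) / 100,  # combine: look up billing total
        }


result = list(customer_spend_view())
```

Here the record with the missing email is skipped, and the remaining two records come back with normalized emails and their billing totals attached.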

Here, I’ll provide a little more detail about how data virtualization supports data scientists throughout the typical data science workflow:

  • Identifying Useful Data: Data virtualization provides data scientists with seamless access to all types of data sources, from data lakes to Presto or Spark systems, social media, and even flat or JSON files. The Denodo Platform offers a built-in data catalog, which enables data scientists to find the data they need using simple search functionality, much like that of a search engine.
  • Modifying Data into a Useful Format: The Denodo Platform also provides administrative tools that enable data scientists to add notes alongside their data sets. For this work, data scientists can use their own notebooks, such as Jupyter, or they can leverage the notebooks included in the Denodo Platform, which offer automatically generated recommendations using artificial intelligence/machine learning (AI/ML), based on past usage and behavior.
  • Analyzing Data: With data virtualization, data scientists can run queries on the data as soon as they discover it, after it has been modified into different formats, or at any other point that suits them.
  • Preparing and Executing Data Science Algorithms: The Denodo Platform provides a query optimizer that streamlines query performance through a series of techniques, including pushing down processing to the sources. With this technique, users can push down only a selected part of the operation, depending on what delivers the best results.
  • Sharing Results with Business Users: Data virtualization provides data scientists with a platform for sharing queries and results with other team members, for a more collaborative, iterative workflow, especially using a data catalog like the one provided by the Denodo Platform. With data virtualization, data scientists can also publish data directly to a specific application like MicroStrategy, Power BI, or Tableau, and they can immediately see the results using the tool of their choice.
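
The push-down technique mentioned in the list above can be sketched in a few lines. This is not the Denodo query optimizer; it is a small illustration using Python's built-in `sqlite3` module, with a hypothetical `sales` table standing in for a remote source. One function pulls every row and filters locally, while the other pushes the filter and the aggregation down to the source, so only a single row travels back.

```python
import sqlite3

# Hypothetical source system (an in-memory SQLite table used only for this sketch).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (region TEXT, amount REAL)")
source.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("EU", 10.0), ("EU", 15.0), ("US", 7.5), ("US", 2.5), ("APAC", 4.0)],
)


def total_without_pushdown(region):
    """Fetch every row, then filter and sum in the client."""
    rows = source.execute("SELECT region, amount FROM sales").fetchall()
    return len(rows), sum(amount for r, amount in rows if r == region)


def total_with_pushdown(region):
    """Let the source filter and aggregate; only the result comes back."""
    rows = source.execute(
        "SELECT SUM(amount) FROM sales WHERE region = ?", (region,)
    ).fetchall()
    return len(rows), rows[0][0]


naive_rows, naive_total = total_without_pushdown("EU")
pushed_rows, pushed_total = total_with_pushdown("EU")
```

Both functions return the same total, but the first transfers all five rows while the second transfers one. Whether to push down all of an operation or only part of it depends on what each source can execute efficiently.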

Data Virtualization: The Foundation for Data Science

Data virtualization can be deployed at any phase of the data science lifecycle to streamline data science initiatives. It offers data scientists real-time access to disparate data sources, helps facilitate data preparation and analysis, and enables effortless collaboration.

 

Ravi Shankar