AI datasets need to get smaller—and better

Ever-larger datasets for AI training pose big challenges for data engineers and big risks for the models themselves.

From early-2000s chatbots to the latest GPT-4 model, generative AI continues to permeate the lives of workers both in and out of the tech industry. With giants like Microsoft, Google, and Amazon investing heavily in R&D for their AI solutions, it’s hardly surprising that global adoption of AI technologies more than doubled between 2017 and 2022.

So, what exactly has changed in the last five years of AI development? From an engineering perspective, AI advancements have generally been in three categories:

  1. Models: The most obvious change we’ve seen is the development of transformer models and, subsequently, the evolution of large-scale models like GPT-3 and GPT-4. Transformers overcome the scalability limitations of earlier natural language processing (NLP) architectures through parallelization and the attention mechanism, which accounts for context by weighing how much each part of an input sequence matters to every other part (see the sketch after this list).
  2. Management tooling: The data engineering field has evolved to account for rapidly scaling datasets and advanced reinforcement learning algorithms. In particular, more sophisticated data pipelines are being used to gather, clean, and serve data. We also see the emergence of automated machine learning (autoML) tools, which automate aspects of model development such as feature selection and hyperparameter tuning, and the rise of machine learning operations (MLOps). MLOps introduces solutions for better model monitoring, management, and versioning to facilitate the continuous improvement of deployed models.
  3. Computation and storage: As you might expect, more advanced models and tooling require more capable hardware to accelerate data processing, including GPUs and TPUs. The data, of course, needs somewhere to live, so new storage solutions are emerging to handle and analyze vast amounts of it.
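
To make the attention mechanism in point 1 concrete, here is a minimal sketch of scaled dot-product attention in Python with NumPy. It is illustrative only: the token embeddings are random stand-ins, and real transformers add learned projection matrices, multiple heads, and masking.

    import numpy as np

    def softmax(x, axis=-1):
        # Subtract the row max before exponentiating for numerical stability.
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        # Every query scores every key; higher scores mean "pay more attention here."
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)       # (seq_len, seq_len)
        weights = softmax(scores, axis=-1)    # each row sums to 1
        # Each output position is a weighted mix of all value vectors, which is
        # how the model accounts for context across the whole input sequence.
        return weights @ V, weights

    # Toy example: a 4-token sequence with 8-dimensional embeddings (made-up numbers).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(4, 8))
    output, weights = scaled_dot_product_attention(X, X, X)
    print(weights.round(2))  # each row shows how much one token attends to the others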

With more training data available than ever, AI and machine learning should be at their most effective. So why are data engineers and decision-makers still struggling with data quality and model performance?

From data scarcity to abundance

Initially, the primary challenge in AI development was the scarcity of data. Adequate, relevant, and diverse data was difficult to come by, and AI development was often bottlenecked by these limitations.

Over the last five years, open data initiatives and automated data collection have skyrocketed. Together they created an influx of data for AI and transformed the old limitation into a paradox of plenty. Open-source information and AI-augmented datasets, leveraged to address data gaps, have presented engineers with new and unexpected challenges. The availability of extensive data is crucial for advancing generative AI, but it has simultaneously introduced a set of unforeseen problems and complexities.

More data, more problems?

Vast amounts of available data are no longer purely beneficial and, in fact, may no longer be the best way to improve AI. Today’s training datasets often range from terabytes to petabytes or more, and managing, storing, and processing data at that scale requires sophisticated engineering solutions, such as distributed computing systems, scalable storage, and efficient data processing frameworks.
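
As a small illustration of the principle behind such frameworks, the sketch below streams a large CSV file through pandas in fixed-size chunks instead of loading it into memory at once. The file path and the "value" column are hypothetical placeholders.

    import pandas as pd

    CSV_PATH = "events.csv"  # hypothetical file, assumed too large to load in one go

    total_rows = 0
    value_sum = 0.0

    # read_csv with chunksize returns an iterator of DataFrames,
    # so only one chunk needs to be in memory at a time.
    for chunk in pd.read_csv(CSV_PATH, chunksize=1_000_000):
        chunk = chunk.dropna(subset=["value"])   # basic cleaning, chunk by chunk
        total_rows += len(chunk)
        value_sum += chunk["value"].sum()

    print(f"rows processed: {total_rows}, mean value: {value_sum / total_rows:.3f}")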

Aside from volume, engineers also struggle with the high speed at which datasets are generated, processed, and analyzed. This velocity, combined with the complexity of large datasets (nested structures, high dimensionality, and intricate relationships), demands sophisticated data modeling, transformation, and analysis techniques.

The challenges of large datasets

This near-impossible balancing act unsurprisingly presents a host of problems for engineers. Tech executives widely report the following challenges as their datasets grow:

  1. Information overload: The sheer volume of data can be overwhelming. With large datasets, it quickly becomes challenging to identify relevant or valuable information. This issue trickles all the way down the pipeline, where irrelevant or ambiguous data causes difficulty in extracting meaningful insights.
  2. Increased complexity: More data often means dealing with complex, high-dimensional datasets that require sophisticated (and computationally intensive) development and optimization.
  3. Decreased quality: When large datasets introduce ambiguity or complexity, models tend to compensate by overfitting. Overfitting occurs when a model learns the training data too well, including its noise and outliers, to the extent that it no longer produces accurate results for unseen data. Essentially, the model begins memorizing rather than learning, and data quality problems translate directly into accuracy problems (a simple illustration of overfitting follows this list).
  4. New resource limitations: Despite the computational advancements made in the AI sector, companies continue to face resource constraints when training models. Longer training times demand adequate processing power and storage, which poses logistical and financial challenges for developers and researchers. Perhaps less obviously, advancements in AI also present human-centric challenges, including a widening skills gap: there are simply not enough professionals who can manage big data and AI systems.
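
To see what the overfitting described in point 3 looks like in practice, here is a small, self-contained sketch using synthetic data: a high-degree polynomial fits noisy training points almost perfectly yet typically does worse on held-out points than a simple linear fit. The data and polynomial degrees are invented purely for illustration.

    import numpy as np

    rng = np.random.default_rng(42)

    # Synthetic data: a noisy linear relationship.
    x_train = np.linspace(0, 1, 20)
    y_train = 2 * x_train + rng.normal(scale=0.2, size=x_train.size)
    x_test = np.linspace(0.025, 0.975, 20)
    y_test = 2 * x_test + rng.normal(scale=0.2, size=x_test.size)

    def mse(coeffs, x, y):
        # Mean squared error of a fitted polynomial on a given dataset.
        return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

    simple = np.polyfit(x_train, y_train, deg=1)    # matches the true structure
    wiggly = np.polyfit(x_train, y_train, deg=10)   # enough capacity to memorize noise

    print("degree 1:  train", round(mse(simple, x_train, y_train), 3),
          "test", round(mse(simple, x_test, y_test), 3))
    print("degree 10: train", round(mse(wiggly, x_train, y_train), 3),
          "test", round(mse(wiggly, x_test, y_test), 3))
    # Typically the high-degree fit drives training error down while test error
    # goes up: the model has memorized noise instead of learning the pattern.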

The volume, velocity, variety, and complexity of large datasets necessitate advanced data engineering solutions. When engineers are fighting for quality against resource constraints, disciplined data management is the only way to keep a data model effective, efficient, and secure.

Rethinking datasets for AI training

Large training datasets demand advanced data engineering now more than ever, and proper data management can combat many quality issues, from inconsistent records to degraded model performance.

But what if the best way to manage large datasets is to make them smaller? There’s currently a move afoot to use smaller datasets when developing large language models (LLMs) to promote better feature representation and enhance model generalization. A curated smaller dataset can represent relevant features more distinctly and with less noise, which improves model accuracy, and a model trained on those emphasized, representative features also tends to generalize better.
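
What "curated" means in practice varies by team, but a minimal, hypothetical sketch is shown below: given raw text samples, it drops exact duplicates (after normalization) and near-empty records before anything reaches training. The threshold and the sample strings are invented for illustration.

    def curate(samples, min_words=5):
        # Keep one copy of each normalized sample that clears a minimal length bar.
        seen = set()
        curated = []
        for text in samples:
            normalized = " ".join(text.lower().split())  # collapse whitespace, ignore case
            if len(normalized.split()) < min_words:      # drop near-empty, low-signal records
                continue
            if normalized in seen:                       # drop exact duplicates
                continue
            seen.add(normalized)
            curated.append(text)
        return curated

    raw = [
        "The quick brown fox jumps over the lazy dog.",
        "the quick   brown fox jumps over the lazy dog.",  # duplicate after normalization
        "ok",                                              # too short to carry signal
        "Transformers use attention to weigh parts of an input sequence.",
    ]
    print(curate(raw))  # two samples survive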

Regularization, a technique used to prevent overfitting so that models generalize better to unseen data, also plays a crucial role here. Smaller datasets come with a higher risk of overfitting, especially with complex models, so regularization becomes essential to ensure that the model does not fit the training data too closely and can still generalize well to new data.
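
As a hedged illustration of that point, the sketch below compares ordinary least squares with ridge (L2) regression on a deliberately small, high-dimensional synthetic dataset; the regularized model usually holds up better on the held-out half. It assumes scikit-learn is available, and the data and alpha value are arbitrary.

    import numpy as np
    from sklearn.linear_model import LinearRegression, Ridge
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Small dataset, many features: a classic recipe for overfitting.
    n_samples, n_features = 60, 40
    X = rng.normal(size=(n_samples, n_features))
    true_coef = np.zeros(n_features)
    true_coef[:5] = [2.0, -1.5, 1.0, 0.5, -0.5]   # only five features actually matter
    y = X @ true_coef + rng.normal(scale=1.0, size=n_samples)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

    plain = LinearRegression().fit(X_train, y_train)
    ridge = Ridge(alpha=10.0).fit(X_train, y_train)   # alpha is arbitrary here

    print("plain R^2 on held-out data:", round(plain.score(X_test, y_test), 3))
    print("ridge R^2 on held-out data:", round(ridge.score(X_test, y_test), 3))
    # The L2 penalty shrinks coefficients, so the regularized model usually
    # generalizes better when data is scarce relative to model complexity.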

As you might expect, data accuracy is even more critical with smaller datasets. In addition to normalizing and balancing the data, engineers must ensure adequate model validation and often choose to revisit the model itself. Techniques like pruning decision trees, using dropout in neural networks, and cross-validating can all help a model generalize better. But at the end of the day, the quality of the training data will still make or break your results.
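
The sketch below ties two of those techniques together, assuming scikit-learn: 5-fold cross-validation scores a decision tree whose depth is capped (a simple form of pruning). The bundled toy dataset and depth values are chosen only for illustration.

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)   # small bundled dataset, for illustration only

    # Compare an unconstrained tree with depth-limited ("pre-pruned") trees, using
    # 5-fold cross-validation so every sample serves as validation data exactly once.
    for max_depth in (None, 3, 5):
        tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        scores = cross_val_score(tree, X, y, cv=5)
        label = "unlimited" if max_depth is None else f"max_depth={max_depth}"
        print(f"{label:>12}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
    # Capping depth constrains the tree, which often trades a little training
    # accuracy for more stable accuracy across folds.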

Shifting the focus to curation and management

Engineering managers and leaders should now shift their focus to curating and managing datasets in ways that maximize data variety and relevance while minimizing noise. Not only does a well-managed dataset contribute to better model training, it also fosters innovation by allowing researchers and developers to explore new models and techniques. Companies that can manage data effectively and ensure its quality can gain a competitive edge by developing superior AI models. These models not only boost customer satisfaction, but also support better decision-making at the executive level.

The paradox of plenty captures the risks and challenges that come with so much available information. As generative AI shifts its focus from amassing data to managing and processing it, comprehensive observability and analytics solutions become essential. With the right tools, data engineers and decision-makers can develop more meaningful models, regardless of the size of the datasets they work with.

Ashwin Rajeeva is co-founder and CTO of Acceldata.

