
SMT Perspectives & Prospects: Artificial Intelligence Part 6: Data Module 1
Data is one of the six pillars of AI infrastructure and is critical to the performance of artificial intelligence (AI) models. AI data, essential to both the training and inference of generative AI models, refers to the datasets used to train, validate, and test those models. Training data gives a model its frame of reference: it establishes the baseline against which a pre-trained model compares new data when making predictions or generating new content.
There are three primary types of data: structured, unstructured, and vector. Structured data is highly organized, typically stored in databases, making it easier for algorithms to learn patterns. Traditional machine learning (ML) tasks, such as regression or classification for predicting sales numbers, fall into this category. Unstructured data lacks a predefined format and is widely used in deep learning applications such as natural language processing, computer vision, and speech recognition; examples include text, audio, video, and images. Vector data and embeddings are high-dimensional numerical representations of such data and are commonly used in tasks like similarity search, semantic search, clustering, and recommendation systems. At its core, model output is directly shaped by the input data, and size and quality are data’s most important characteristics.
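To make the vector idea concrete, the sketch below ranks a few toy embeddings against a query by cosine similarity, the comparison at the heart of similarity and semantic search. The four-dimensional vectors and their values are invented for illustration; real embedding models produce hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings; values are illustrative only.
embeddings = {
    "solder joint reliability": np.array([0.9, 0.1, 0.3, 0.2]),
    "BGA assembly process":     np.array([0.7, 0.3, 0.5, 0.2]),
    "quarterly sales report":   np.array([0.1, 0.9, 0.2, 0.7]),
}

# A query embedding close in direction to the reliability text.
query = np.array([0.85, 0.15, 0.35, 0.15])

# Rank stored items by similarity to the query, best match first.
ranked = sorted(embeddings.items(),
                key=lambda kv: cosine_similarity(query, kv[1]),
                reverse=True)
for text, vec in ranked:
    print(text, round(cosine_similarity(query, vec), 3))
```

In a production system the same ranking is done by a vector database over millions of embeddings, but the underlying comparison is this one.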
Quality and Size of Data
The overall volume of data continues to rise; by 2030, experts predict worldwide data will grow to over 660 zettabytes¹. The size of the data is one thing; its quality is another. What data is to be collected and used? Data needs to be complete and accurate, representing the entity described without missing values; consistent, free of internal contradictions; and valid, with integrity and uniqueness, containing no duplicate or overlapping records. Specificity, adaptability, and diversity are also required characteristics. Cleaning and sorting data are prerequisites for data preparation, and tools such as Amazon SageMaker Data Wrangler can facilitate this.
A model trained on massive but nonspecific datasets may not be useful. The goal is to create smart data characterized by accuracy, completeness, consistency, integrity, and uniqueness. From a usefulness perspective, more isn’t always better: a high volume of data may yield diminishing returns while increasing compute costs.
Haphazardly loading data into a large language model (LLM) exacerbates the problem, leading to overwhelming complexity and a lack of confidence in shared data. To mitigate inconsistencies, access-based data collaboration, rather than copy-based integration, eliminates duplicate or overlapping data. High-quality data is imperative in the automotive industry, for example, when developing autonomous vehicle (AV) algorithms. Datasets for AV algorithms typically feature data captured from vehicles’ LiDAR and camera systems to improve object detection and motion prediction; such applications call for stringent six-nines (99.9999%) reliability.
Ensuring privacy and security is another requirement, and lapses can carry business consequences.
Data Infrastructure
Building an effective AI system requires not only raw data but also the infrastructure to collect, store, prepare, transport, process, and analyze it. This includes data collection systems that can gather data from various sources (sensors, user interactions, public datasets), as well as robust databases and data lakes capable of storing, managing, and moving large volumes of structured and unstructured data. It also includes the time-to-live (TTL) parameter, which specifies how long temporary or transient data is retained in a system before being deleted (e.g., training logs; a robot completing a task within a set time).
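The TTL mechanism can be sketched in a few lines. The in-memory store below is illustrative only; its class name, methods, and the 0.05-second TTL are invented for the example, and production systems apply TTLs at the database, cache, or object-lifecycle level instead.

```python
import time

class TTLStore:
    """Minimal in-memory store that drops entries after a time-to-live.
    Illustrative sketch, not a production cache."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._data = {}  # key -> (value, insertion timestamp)

    def put(self, key, value):
        self._data[key] = (value, time.monotonic())

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, t0 = item
        if time.monotonic() - t0 > self.ttl:
            del self._data[key]  # expired: purge and report missing
            return None
        return value

store = TTLStore(ttl_seconds=0.05)
store.put("training_log_batch_17", "loss=0.42")
print(store.get("training_log_batch_17"))  # value is present before expiry
time.sleep(0.1)
print(store.get("training_log_batch_17"))  # None: TTL has elapsed
```

The same retention logic, applied to training logs or intermediate artifacts, keeps transient data from accumulating indefinitely.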
Data management and governance are additional crucial components of data infrastructure that ensure data quality, compliance, and traceability.
Boundaries of Training Data
In considering the boundaries of training data, the first three fundamental questions are:
- What data to include: Topics, languages, time periods, sources.
- What to exclude: Irrelevant, harmful, or low-quality content.
- How much data: Volume and diversity to achieve the desired performance.
The overall goal of defining boundaries for LLM training data is to establish the purpose, scope, domains, and limits of data volume, aligning the datasets with the model’s goals, resources, and capabilities.
Data Preparation
When working on an AI model, data preparation takes a significant portion of the total time. Preparing data to achieve clean and relevant datasets is a critical process for the performance and accuracy of AI models. To prepare data effectively, follow these key steps:
- Define the AI model’s objective.
- Determine the data attributes and format required to achieve the objective.
- Collect raw data with relevance and diversity.
- Study the collected data to understand its content and quality, and spot anomalies, missing values, inconsistencies, and errors.
- Clean and standardize data by removing inconsistencies, errors, and duplicates.
- Convert the data into a suitable format for analysis.
- Divide datasets into training, validation, and test sets; a common rule of thumb is a 70-15-15 split.
- Augment data, if needed, to increase size and diversity by generating data that mimics real-world data.
- Test and iterate by running a small model on a subset of data to spot issues.
- Continue monitoring data quality by regularly assessing datasets to reflect changing conditions or new information.
- Always keep records of data sources.
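The cleaning and splitting steps above can be sketched as a minimal pipeline. The record layout, field names, and helper functions here are invented for illustration; real pipelines would use a data-preparation library, but the logic of deduplicating, dropping incomplete records, and applying a seeded 70-15-15 split is the same.

```python
import random

def clean(records):
    """Drop exact duplicates and records with missing (None) fields."""
    seen, out = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key in seen or any(v is None for v in r.values()):
            continue
        seen.add(key)
        out.append(r)
    return out

def split(records, train=0.70, val=0.15, seed=0):
    """Shuffle and split into train/validation/test (70-15-15 by default)."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rows = records[:]
    rng.shuffle(rows)
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * val)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

# Synthetic raw data: 100 rows containing 50 exact duplicates,
# plus one row with a missing value.
raw = [{"x": i % 50, "y": (i % 50) * 2} for i in range(100)]
raw.append({"x": None, "y": 0})

data = clean(raw)          # 50 unique, complete records remain
train, valid, test = split(data)
print(len(data), len(train), len(valid), len(test))
```

Recording the seed alongside the data sources, per the last step above, makes the split auditable later.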
Challenges and Concerns With AI Data
Data quality and data volume are demanding areas, as LLMs (e.g., GPT-4 and GPT-5) require massive datasets. While the overall volume of data is growing, there is a shortage of diverse, high-quality data, and noisy or biased data leads to poor model performance. Other challenges include data aging and data poisoning. Data aging refers to data losing its relevance, accuracy, or value over time. To mitigate it, implement data retention policies, adopt tiered storage practices, and regularly purge outdated data.
Data poisoning is a type of cyberattack in which malicious actors insert incorrect or misleading information into, or otherwise tamper with, the data used to train AI models. Hackers could poison an AI model's training data to introduce security vulnerabilities, then exploit them later through various attack vectors; for example, they may inject false information into websites, generating corrupted data. The difficulty lies in detecting and removing such poisoned data, as even a small amount can compromise an AI system’s output.
Module 2 on “Data” will highlight alternative approaches to data requirements, synthetic data, data governance, and the roles of Retrieval-Augmented Generation (RAG) and agentic RAG in data.
References
1. UBS Research (www.ubs.com).
Appearances
Dr. Jennie Hwang will deliver a Professional Development Lecture, “Reliability of Electronics—Solder Joint Reliability,” at SMTA International, Oct. 20, in Chicago. She will instruct three Global Electronics Association Professional Development webinar courses: “PoP Packaging and Assembly—Materials, Processes, Reliability,” Nov. 11 and 13; “BTC Packaging and Assembly—Materials, Processes, Reliability,” Nov. 18 and 20; and “Reliability of Electronics—Solder Joint Voids, All You Should Know,” Dec. 2 and 4.
This column originally appeared in the October 2025 issue of SMT007 Magazine.