SMT Perspectives & Prospects: Artificial Intelligence, Part 7—Data Module 2
When I last wrote about data, I focused on data quality and size, data infrastructure, the boundaries of data training, data preparation, and the general challenges and concerns of data. This month, I will focus on alternative approaches, data governance, retrieval-augmented generation (RAG), and agentic RAG.1,2
Alternative Approaches
Large language models (LLMs) demand enormous resources that may exceed the capacity of many academic and industrial organizations. Recent developments point to alternatives: smaller models and the reasoning approach, i.e., taking time to “think” at inference time. Letting a chatbot think for just 20 seconds can yield a performance boost comparable to scaling the model up 100,000x and training it 100,000x longer, while consuming substantially fewer resources.
Taking time to think allows LLMs to tackle complex problems they haven’t been directly trained on. For example, OpenAI’s o1 offers several responses to each question and analyzes them to find the best one. Alibaba’s QwQ-32B is a reasoning model designed to solve complex problems through a reasoning approach with only 32 billion parameters.
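The “several responses, pick the best” behavior described above can be sketched as a best-of-N selection loop. This is a minimal illustration, not OpenAI’s actual method: `generate_candidates` and `score` are hypothetical stand-ins for sampling a model and for a verifier/reward model that rates answers.

```python
import random

def generate_candidates(question, n=5, seed=0):
    """Stand-in for sampling n candidate completions from a model.
    A real system would call the model n times; here we fabricate
    candidates with a random 'quality' value for illustration."""
    rng = random.Random(seed)
    return [(f"answer variant {i} to: {question}", rng.random())
            for i in range(n)]

def score(candidate):
    """Stand-in for a verifier that rates an answer's quality."""
    _text, quality = candidate
    return quality

def best_of_n(question, n=5):
    """Spend extra inference-time compute: sample n answers, keep the best."""
    candidates = generate_candidates(question, n)
    return max(candidates, key=score)[0]

print(best_of_n("What is 17 * 24?"))
```

The design point is that extra compute is spent at inference time (sampling and scoring) rather than at training time (a larger model).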
The DeepSeek R1 model burst onto the AI scene in early 2025, catching the industry off guard. What made it such a jaw-dropping moment? The model was trained at a fraction of the cost, is open source, and does not require high-end chips. The model is significantly smaller, using substantially fewer parameters than ChatGPT-4 or -5. DeepSeek has 671 billion parameters, compared to ChatGPT-4’s estimated more than a trillion parameters, or ChatGPT-5’s tens of trillions of parameters.
DeepSeek uses an iterative refinement technique, while ChatGPT draws on a vast, diverse corpus backed by massive computational resources. Using the technique known as “distillation,” DeepSeek consumes much less computing power by training smaller, more efficient models from larger “teacher” models, extracting strong performance without training from scratch at massive scale every time. The approach begins with a small, high-quality, curated dataset as seed data, which is used to train a classifier model. That model, in turn, retrieves similar documents from larger raw datasets and weeds out duplicates and low-quality data through filtering and preparation.
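The seed-data pipeline above (curate → train a classifier → retrieve similar documents → drop duplicates and low-quality data) can be sketched in miniature. This is a toy illustration under loose assumptions, not DeepSeek’s pipeline: the “classifier” is just a vocabulary-overlap score, and the documents are invented.

```python
from collections import Counter

# Hypothetical seed set: a small, curated collection of high-quality docs.
SEED_DOCS = [
    "reflow soldering profile for lead-free assembly",
    "stencil printing defects and solder paste rheology",
]

def train_classifier(seed_docs):
    """Toy 'classifier': learn the vocabulary of the curated seed data."""
    vocab = Counter()
    for doc in seed_docs:
        vocab.update(doc.lower().split())
    return vocab

def quality_score(doc, vocab):
    """Fraction of a document's words that appear in the seed vocabulary."""
    words = doc.lower().split()
    if not words:
        return 0.0
    return sum(w in vocab for w in words) / len(words)

def filter_corpus(raw_docs, vocab, threshold=0.3):
    """Keep docs that resemble the seed data; drop exact duplicates."""
    seen, kept = set(), []
    for doc in raw_docs:
        key = doc.lower().strip()
        if key in seen:          # weed out duplicates
            continue
        seen.add(key)
        if quality_score(doc, vocab) >= threshold:   # weed out low quality
            kept.append(doc)
    return kept

raw = [
    "solder paste printing and reflow profile tuning",
    "solder paste printing and reflow profile tuning",  # duplicate
    "celebrity gossip and unrelated chatter",           # off-topic
]
vocab = train_classifier(SEED_DOCS)
print(filter_corpus(raw, vocab))
```

A production pipeline would use a trained text classifier and embedding-based similarity, but the shape of the loop, seed → score → filter, is the same.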
By training models on high-quality, curated datasets rather than on massive amounts of raw data (e.g., from the internet), DeepSeek improves training data efficiency by focusing on smart data use. It makes heavy use of synthetic data and leverages reinforcement learning, in which the model essentially teaches itself to reason better through trial and error, reducing dependence on expensive human-labeled data and resulting in significant cost savings.
Overall, the approach of artificially synthesizing data to supplement real-world data has recently seen significant growth. The goal is to do more with less.
Data Governance
Data governance is the framework of policies, processes, roles, accountability, and standards that ensures data is accurate, consistent, secure, and used responsibly across an organization. It defines who owns data, how it is managed, and who can access it, enabling compliance, trust, and better decision-making.
AI models are only as valuable as the quality, trustworthiness, and accuracy of the data used to train and fine-tune them. Industry-specific datasets and up-to-date data are crucial. To ensure data quality, data governance policies and procedures should be established by considering the following areas:
- Identifying both the internal and external datasets.
- Determining performance-specific acceptance criteria before deployment. For example, a failure probability of one in 10,000 computations may be acceptable for a customer chatbot but not for a self-driving vehicle.
- Building the technical infrastructure and gathering, cleaning, moving, storing, and delivering the data to the AI systems at the right time and at the optimal speed. To this end, leading techniques, such as RAG, have often been leveraged.
- For enterprise agentic AI with embedded agents, ensuring a shared understanding of intent or limits among multiple agents acting across systems.
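The acceptance-criteria bullet above amounts to a deployment gate: compare a measured failure rate against a per-application threshold. A minimal sketch, with hypothetical application names and threshold values chosen to mirror the chatbot vs. self-driving example:

```python
# Hypothetical per-application acceptance thresholds: the maximum
# tolerable failure probability per computation.
ACCEPTANCE_CRITERIA = {
    "customer_chatbot": 1e-4,   # 1 in 10,000 may be acceptable
    "self_driving":     1e-9,   # far stricter for safety-critical use
}

def approve_deployment(application, observed_failure_rate):
    """Gate a model release on its measured failure rate."""
    threshold = ACCEPTANCE_CRITERIA[application]
    return observed_failure_rate <= threshold

# The same measured failure rate passes one gate and fails the other.
print(approve_deployment("customer_chatbot", 1e-4))  # True
print(approve_deployment("self_driving", 1e-4))      # False
```

The governance value lies in writing the thresholds down before deployment, so approval is a policy decision rather than an after-the-fact judgment.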
Retrieval-augmented Generation (RAG)
RAG uses NLP techniques to combine an LLM with external knowledge retrieval and consists of a retriever component and a generator component. The retriever searches a large corpus of information in response to the input query, and the generator produces the final output by integrating the query with the retrieved passages. By fetching “facts” from external sources, RAG links private data to the LLM for data analysis, summarization, and other tasks. This technique overcomes the limitation of AI models being confined to what they learned from their initial training on public data (up to a certain point in time). Overall, it enhances the accuracy, relevance, and reliability of AI models.
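The retriever-plus-generator structure can be shown end to end in a few lines. This is a minimal sketch, not a production RAG stack: the corpus is invented, the retriever ranks by simple word overlap (a real system would use embeddings), and `generate` is a stand-in for an LLM call.

```python
# Hypothetical private corpus the base model never saw during training.
CORPUS = [
    "The Q3 yield on line 4 improved to 98.7% after the reflow change.",
    "RAG fetches external passages and injects them into the prompt.",
    "Agentic systems plan retrieval steps before querying sources.",
]

def retrieve(query, corpus, k=2):
    """Retriever: rank passages by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(corpus,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]

def generate(query, passages):
    """Stand-in generator: a real system would call an LLM with the
    query plus the retrieved passages as grounding context."""
    context = " ".join(passages)
    return f"[answer to '{query}' grounded in: {context}]"

def rag(query):
    passages = retrieve(query, CORPUS)   # retriever component
    return generate(query, passages)     # generator component

print(rag("What was the Q3 yield on line 4?"))
```

Because the answer is assembled from retrieved passages, the model can respond about private, post-cutoff facts it was never trained on.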
RAG also facilitates synthetic data generation by combining real data with retrieved documents to generate answers (outputs), which can be stored as synthetic training data for smaller models. Using RAG to fetch authoritative passages, then prompting the LLM to create synthetic corpora (summaries, explanations, reasoning chains) aligned with domain knowledge, is a plausible pathway.
Agentic RAG
Agentic RAG extends RAG with agent-like behavior: it can plan what information it needs before retrieving and choose which tools or sources to query. It iterates by retrieving, evaluating the results, deciding whether more retrieval is needed, and then retrieving again. It can also decompose complex questions into sub-questions and retrieve for each. As an intelligent layer that reasons, retrieves, and verifies, the added agent decides what to retrieve and whether to continue. This step-by-step reasoning yields richer synthetic data than plain LLM outputs by adding context and rationale.
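The plan → retrieve → evaluate → decide loop above can be sketched as a small control loop. Everything here is hypothetical: the `SOURCES` table stands in for real tools or databases, and `decompose` and `enough_evidence` are stubs for steps a real agent would delegate to the LLM itself.

```python
# Hypothetical knowledge sources keyed by sub-question topic.
SOURCES = {
    "solder": "SAC305 is a common lead-free solder alloy.",
    "reflow": "Typical SAC305 peak reflow temperature is about 245C.",
}

def decompose(question):
    """Planning step: split a complex question into sub-questions.
    A real agent would use the LLM for this; here we match topics."""
    return [w for w in question.lower().split() if w in SOURCES]

def retrieve(topic):
    """Retrieval step: query the source chosen for this sub-question."""
    return SOURCES.get(topic, "")

def enough_evidence(evidence, plan):
    """Evaluation step: decide whether more retrieval is needed."""
    return len(evidence) >= len(plan)

def agentic_rag(question, max_steps=5):
    plan = decompose(question)            # 1. plan what is needed
    evidence, trace = [], []
    for topic in plan[:max_steps]:
        evidence.append(retrieve(topic))  # 2. retrieve per sub-question
        trace.append(f"retrieved '{topic}'")
        if enough_evidence(evidence, plan):   # 3. evaluate, maybe stop
            break
    return {"answer": " ".join(evidence), "trace": trace}

result = agentic_rag("Compare solder alloy and reflow temperature")
print(result["answer"])
print(result["trace"])
```

The returned trace is what makes the output useful as synthetic training data: each answer carries the reasoning and retrieval steps that produced it, not just the final text.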
My next column on AI data module 3 will focus on the development of synthetic data and IoT data management.
References
- “Artificial Intelligence, Part 6: Data Module 1,” by Jennie S. Hwang, SMT007 Magazine, October 2025.
- “Artificial Intelligence, Part 4: Prompt Engineering,” by Jennie S. Hwang, SMT007 Magazine, January 2025.
Appearances
Dr. Jennie Hwang will present “The AI Bubble Myth: Intelligence Incorporated and the Global Race,” at The Wilson Science and Technology Forum on April 10, 2026.
This column originally appeared in the April 2026 issue of SMT007 Magazine.