Google Launches Simula: A First-Principles Framework for Generating Controllable, Scalable Synthetic Datasets Across Specialized AI Domains

Training powerful AI models relies on a resource that is running out: specialized data. While the Internet has supplied a seemingly endless stream of text and images for training today's general-purpose models, the next wave of AI breakthroughs – in cybersecurity, forensics, healthcare, and other niche domains – requires data that either does not exist in sufficient volume or is inaccessible due to privacy concerns.
A team of researchers from Google and EPFL presents Simula, a logic-driven framework for synthetic data generation and evaluation that prioritizes transparency, controllability, and scalability. Unlike conventional methods, Simula does not rely on seed data from target distributions, manual annotation, or evolutionary algorithms: it builds each dataset from first principles, treating data generation as a principled design problem.
Why Synthetic Data Generation Is Harder Than It Looks
If you have worked on fine-tuning pipelines or domain-specific model training, you have likely hit a "not enough data" wall. Manually collecting and annotating specialized datasets is expensive, time-consuming, and error-prone. But the obvious shortcut – simply telling a large language model (LLM) to generate training data – runs into its own problems.
Most existing synthetic data methods address only a subset of what the researchers describe as the three axes of "good" data: quality, diversity, and complexity. Quality refers to whether a data point meets certain semantic and syntactic requirements. Diversity includes both global coverage (do you have examples from across the concept space?) and local diversity (do you have many different takes on each concept?). Complexity captures how challenging, unusual, or detailed an example is. Controlling all three simultaneously, at scale, in a measurable way, is an unsolved challenge that Simula addresses directly.
How Simula Works: Taxonomies, Meta-Prompts, and Dual Critics
Simula divides the generation process into four distinct, manageable steps, each targeting a specific property of the data.
The first step addresses global diversity using hierarchical taxonomies. Given a dataset definition – say, "cybersecurity threat intelligence query dataset" – an LLM is instructed to identify the main attributes of that domain (e.g., attack type, threat actor, risk category). Each attribute is then expanded breadth-first into a hierarchical taxonomy tree. To reduce the risk of missing important subcategories, the system uses a Best-of-N proposal strategy combined with a critic-refinement step, in which the model proposes N candidate child nodes and then critiques them for completeness, soundness, and specificity. The resulting taxonomies act as systematic sampling scaffolds, ensuring that when you draw 512K training examples, they truly cover the long tail of the domain rather than clustering around common paths.
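The breadth-first, Best-of-N expansion described above can be sketched as follows. Here `propose` and `score` are hypothetical stand-ins for the paper's LLM proposal and critic calls, not Simula's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)

def expand_taxonomy(root_label: str, depth: int,
                    propose: Callable[[str], List[List[str]]],
                    score: Callable[[str, List[str]], float]) -> Node:
    # Breadth-first expansion: at each level, ask for N candidate
    # child-node lists (Best-of-N) and keep the one the critic rates
    # highest on completeness, soundness, and specificity.
    root = Node(root_label)
    frontier = [root]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            candidates = propose(node.label)  # N candidate child lists
            best = max(candidates, key=lambda c: score(node.label, c))
            node.children = [Node(lbl) for lbl in best]
            next_frontier.extend(node.children)
        frontier = next_frontier
    return root
```

With real LLM calls, `propose` would return the N parsed candidate lists and `score` would wrap the critic prompt's verdict as a number.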
Step two manages local diversity. Sampled combinations of taxonomy nodes – called "mixtures" – are passed to the LLM to generate "meta-prompts." For example, the mixture {house cat, poem, travel enthusiast} becomes "Create an interesting haiku about a house cat that wanders around on its own." To prevent mode collapse when multiple meta-prompts are generated from the same set of nodes, Simula generates many meta-prompts in a single call and subsamples the required fraction, ensuring variation within the same iteration.
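A minimal sketch of this sample-then-subsample idea, with `generate_batch` as a hypothetical stand-in for the batched LLM call:

```python
import random
from typing import Callable, List, Sequence, Tuple

def sample_mixtures(leaves_per_taxonomy: Sequence[Sequence[str]],
                    k: int, rng: random.Random) -> List[Tuple[str, ...]]:
    # Draw k mixtures, picking one leaf node from each taxonomy.
    return [tuple(rng.choice(list(leaves)) for leaves in leaves_per_taxonomy)
            for _ in range(k)]

def metaprompts_for_mixture(generate_batch: Callable[[Tuple[str, ...], int], List[str]],
                            mixture: Tuple[str, ...], needed: int,
                            oversample: int, rng: random.Random) -> List[str]:
    # Generate many meta-prompts jointly in one call, then keep only a
    # random subset: joint generation discourages the near-duplicate
    # outputs (mode collapse) that repeated independent calls produce.
    batch = generate_batch(mixture, needed * oversample)
    return rng.sample(batch, needed)
```

In a real pipeline, `generate_batch` would prompt the teacher LLM once for `needed * oversample` meta-prompts and parse the result.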
Step three addresses complexity. A user-configurable fraction c of the meta-prompts is passed to the complexification step, which prompts the LLM to increase the complexity of the generated instructions and outputs while preserving all other requirements. This decouples complexity control from coverage control: you can raise the difficulty without sacrificing breadth.
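Decoupling complexity from coverage amounts to routing only a fraction c of prompts through the complexification call. A sketch, with `complexify` as a hypothetical stand-in for that LLM call:

```python
import random
from typing import Callable, List

def complexify_fraction(prompts: List[str], c: float,
                        complexify: Callable[[str], str],
                        rng: random.Random) -> List[str]:
    # Route a user-chosen fraction c of meta-prompts through the
    # complexification step; the rest pass through unchanged, so the
    # taxonomy coverage is unaffected by the difficulty knob.
    n_complex = round(c * len(prompts))
    chosen = set(rng.sample(range(len(prompts)), n_complex))
    return [complexify(p) if i in chosen else p
            for i, p in enumerate(prompts)]
```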
The fourth step improves quality using a "dual-critic" approach. Rather than asking the model once whether a generated answer is correct, Simula independently asks the model whether the answer is correct and whether it is incorrect. This cross-validation design reduces sycophancy bias – the tendency of LLMs to agree with plausible-sounding outputs – and is especially important for tasks with a well-defined notion of correctness, such as multiple-choice questions or math problems.
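The dual-critic check can be sketched as two independent judge calls with opposite framings; `ask_yes_no` is a hypothetical stand-in for the LLM judge:

```python
from typing import Callable

def dual_critic(ask_yes_no: Callable[[str], bool],
                question: str, answer: str) -> bool:
    # Ask the judge twice, once in each direction. A sycophantic judge
    # that agrees with any framing answers "yes" to both, producing an
    # inconsistent verdict, so the sample is rejected.
    says_correct = ask_yes_no(f"Is this answer correct?\nQ: {question}\nA: {answer}")
    says_incorrect = ask_yes_no(f"Is this answer incorrect?\nQ: {question}\nA: {answer}")
    return says_correct and not says_incorrect  # accept only a consistent "correct"
```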
What the Experiments Show
The research team evaluated Simula using Gemini 2.5 Flash (non-thinking) as the teacher model and Gemma 3 4B as the student model, running 10 LoRA fine-tuning iterations with different seeds for each configuration and reporting mean accuracy with 95% confidence intervals. They generated datasets of up to 512K data points across five domains: CTI-MCQ, a multiple-choice question dataset assessing understanding of cyber threat intelligence standards, threats, and mitigations; CTI-RCM, an open-ended generation task that requires a model to predict a Common Weakness Enumeration (CWE) category from a Common Vulnerabilities and Exposures (CVE) description; LEXam, which includes Swiss, EU, and international law exams in English and German; GSM8k (grade-school math); and Global MMLU (mathematics, computer science, and physics in English, Korean, and Nepali).
Across all datasets and dataset sizes, the full Simula system – including global diversity, local diversity, complexification, and the dual critic – consistently outperforms the simple baseline configuration. Notably, incorporating both global and local diversity was important; either one in isolation produced weaker results, depending on the dataset and scale.
The complexity results were particularly instructive. On GSM8k, the high-complexity configuration yielded a 10% accuracy gain over the low-complexity configuration at 64K data points. But on LEXam, where the teacher model achieved only 57% accuracy, higher-complexity data actually hurt performance, showing that complex data is only beneficial if the teacher model is strong enough to produce reliable labels. The critic rejection rate on LEXam reached 61%, compared to 2% for CTI-MCQ, 9% for CTI-RCM, and 9% for GSM8k, directly reflecting the teacher model's weakness in that domain.
Another important finding is what the research team calls the student-teacher gap effect on scaling behavior. On CTI-RCM, student model performance plateaued at about 128K data points, after closing about 83% of the gap between the student's initial accuracy (40%) and the teacher model's performance (70%). GSM8k, by contrast, showed no such saturation, because the student's best performance (75%) remained well below the teacher's (88%).
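The saturation point can be read as closing a fixed fraction of the student-teacher gap. A small helper illustrating the arithmetic (the function name is ours, not the paper's):

```python
def gap_closed(student_init: float, student_final: float,
               teacher: float) -> float:
    # Fraction of the initial student-teacher accuracy gap that
    # fine-tuning on synthetic data has closed.
    return (student_final - student_init) / (teacher - student_init)
```

Plugging in the reported CTI-RCM endpoints, closing 83% of the 40%-to-70% gap implies a final student accuracy near 65% (a value implied by, not stated in, the article).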
Evaluation Gets a Rethink
Beyond generation, the research team introduces two new evaluation methods. Taxonomic coverage measures what fraction of taxonomy nodes at each level is represented in the dataset – a systematic, interpretable alternative to embedding-based cosine distance metrics. Weighted complexity rating assigns Elo-style ratings to individual data points using cluster-wise comparisons, and the team shows it correlates well with human-annotated complexity labels on the MATH dataset.
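The Elo-style rating can be sketched with the standard Elo update applied to judged "more complex" comparisons; this sketch uses plain pairwise updates and omits the paper's cluster-wise weighting:

```python
from typing import Dict, Iterable, Tuple

def elo_complexity(comparisons: Iterable[Tuple[str, str]],
                   k: float = 32.0, base: float = 1000.0) -> Dict[str, float]:
    # Each comparison is (more_complex_id, less_complex_id), as judged
    # by an LLM. Standard Elo update: the winner gains rating in
    # proportion to how unexpected the win was.
    ratings: Dict[str, float] = {}
    for winner, loser in comparisons:
        rw = ratings.setdefault(winner, base)
        rl = ratings.setdefault(loser, base)
        expected_win = 1.0 / (1.0 + 10 ** ((rl - rw) / 400.0))
        ratings[winner] = rw + k * (1.0 - expected_win)
        ratings[loser] = rl - k * (1.0 - expected_win)
    return ratings
```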
Another finding stands out: measured taxonomically, real-world reference datasets almost always cover less of the target domain than Simula-generated datasets, even when embedding-based diversity metrics tell the opposite story. This underscores the limitation of relying on cosine distance alone as a proxy for dataset diversity.
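Taxonomic coverage itself is straightforward to compute once examples are tagged with the taxonomy nodes they instantiate; a minimal sketch under that assumption:

```python
from typing import Dict, Iterable, Sequence

def taxonomic_coverage(nodes_by_level: Dict[int, Sequence[str]],
                       dataset_nodes: Iterable[str]) -> Dict[int, float]:
    # Per taxonomy level: the fraction of nodes represented by at least
    # one data point, giving an interpretable 0..1 coverage score.
    present = set(dataset_nodes)
    return {level: sum(n in present for n in nodes) / len(nodes)
            for level, nodes in nodes_by_level.items()}
```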
Key Takeaways
- Simula's seedless, first-principles framework manages quality, diversity, and complexity as independent axes, enabling the design of refined synthetic datasets without relying on manual annotation, evolutionary algorithms, or seed data from target distributions.
- Combining global and local diversity is important: either component alone produces inferior results, but together they improve downstream model performance across all tested datasets and dataset sizes.
- Data complexity helps model performance in many domains, but it can hurt when the teacher model is weak: on LEXam, where Gemini 2.5 Flash (non-thinking) achieved only 57% accuracy, the low-complexity configuration outperformed the high-complexity one.
- Real-world reference datasets almost always cover less of the target domain than Simula's taxonomy-based datasets, even when embedding-based cosine distance metrics suggest otherwise.
- Scaling behavior is driven by data properties, not size alone: Simula's full system achieved superior downstream performance with fewer samples than baseline methods, making it cost-effective over the full data lifecycle despite requiring up to 5x more LLM calls per data point.


