Exploring Synthetic Data Generation for AI using Gen AI
Synthetic data use cases - generative AI for synthetic data using GAN, GPT, and VAE
Audience
This post is for people who work in data analytics and would like to learn about synthetic data generation with AI.
For definitions of AI terms used in this article, refer to the AI Glossary (reference [1]).
1. What is Synthetic Data?
Synthetic data is artificial data produced by algorithms and models rather than collected from real-world sources. It aims to replicate the patterns, distributions, and relationships found in real data. It also helps simulate rare events, edge cases, or hypothetical situations.
People have generated synthetic data for many years without the use of AI or generative AI (‘genAI’). However, new generative AI models are now able to generate new data samples based on patterns learned from existing data.
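As a concrete illustration of the pre-genAI, statistical approach, here is a minimal sketch that estimates the distribution of some “real” data and samples synthetic records that preserve its statistics; all numbers and the two-feature schema are purely illustrative:

import numpy as np

rng = np.random.default_rng(seed=42)

# Illustrative "real" data: two correlated features (say, age and income)
real = rng.multivariate_normal(mean=[40, 60000],
                               cov=[[100, 15000], [15000, 4e8]],
                               size=1000)

# Classic, non-AI approach: estimate the distribution of the real data...
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...then sample new synthetic records that preserve those statistics
synthetic = rng.multivariate_normal(mean, cov, size=1000)

print(np.corrcoef(real, rowvar=False)[0, 1])       # correlation in real data
print(np.corrcoef(synthetic, rowvar=False)[0, 1])  # roughly preserved in synthetic data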
2. Advantages and Disadvantages
Synthetic data generation is becoming increasingly important due to the need for data that mimics real-world information without the privacy and accessibility issues of actual datasets.
We need large amounts of training and validation data for many use cases across industries. For some of them, there is little relevant, high-quality data that is ethically permissible to use for building AI applications. Synthetic data can help.
Advantages of Synthetic Data
Real data is best for making data-driven decisions, but for some use cases it is expensive, biased, or unavailable due to privacy regulations. Synthetic data helps fill those data gaps.
In a study conducted by the MIT-IBM Watson AI Lab (reference [7]), researchers found that video-recognition models were more effectively trained for object recognition using synthetic data derived from three publicly available datasets than with actual footage.
When it comes to video training data, there are privacy concerns (e.g., individuals’ sensitive information such as faces, license plates, or location indicators), so it is often better to use simulated data.
Proprietary rights are another real cause for concern with real data. To avoid legal exposure, high-quality simulated data may be preferred.
Bias in training data can also be mitigated by generating synthetic data with more balanced, less biased distributions.
Disadvantages of Synthetic Data
Synthetic data may not capture all the complexities of real data, potentially leading to inaccurate models and misleading results.
It is also difficult to validate synthetic data against real-world scenarios, which can lead to doubts about the reliability and applicability of the insights derived from it.
The long-term reliability of models trained on synthetic data has also been questioned, as in the post in reference [3].
Careful consideration of the implementation complexity, data quality, and potential ethical implications is essential to ensure meaningful and responsible use of synthetic data. If not generated carefully, synthetic data can introduce biases or misrepresentations, leading to skewed analyses and outcomes.
Real training data is also continuously monitored for:
data drift (when the distribution of incoming data shifts away from the training data; for example, if many more 2-bedroom apartments come on the market than were present in the housing training set) and
concept drift (when the relationship between the inputs and the target changes; for example, if housing prices rise across the board, a model trained on the old prices is no longer good for price prediction).
When either kind of drift is detected, the model is retrained with new training data to keep it relevant, as sketched below.
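A minimal sketch of how data drift might be flagged in practice, using a chi-square test to compare the training-time distribution of a feature with incoming data; the feature and all proportions are illustrative:

import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(seed=0)

# Illustrative feature: bedroom counts seen at training time...
training = rng.choice([1, 2, 3, 4], size=5000, p=[0.30, 0.20, 0.35, 0.15])
# ...versus incoming data where 2-bedroom apartments became more common
incoming = rng.choice([1, 2, 3, 4], size=5000, p=[0.20, 0.45, 0.25, 0.10])

# Chi-square test on the two count distributions; a small p-value flags data drift
counts = np.array([np.bincount(training, minlength=5)[1:],
                   np.bincount(incoming, minlength=5)[1:]])
chi2, p_value, dof, _ = chi2_contingency(counts)
if p_value < 0.01:
    print(f"Possible data drift detected (chi2={chi2:.1f}, p={p_value:.2e})")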
Even with real historical data, an AI model’s output, derived from mathematical equations, is still not an exact result; it is the result of learning from the data and approximating an output that fits the trained model.
With synthetic data, the approximations will still deviate from real-world data, and may be even further off. Configuring synthetic data to vary with real-world conditions, such as inflation affecting house prices, can be a challenge. Data drift and concept drift also need to be monitored when synthetic data is used.
3. Synthetic Data Generation Market
As per reference [2], the global synthetic data generation market, valued at USD 316.11 million in 2023, is expected to grow at a Compound Annual Growth Rate (CAGR) of 34.8%, reaching USD 6,262.27 million by 2033. This growth is driven by the need for high-quality training data in AI, ML, and IoT (Internet of Things), which synthetic data provides by mimicking real data without privacy concerns.
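As a quick sanity check, compounding the 2023 value at the stated CAGR for ten years reproduces the 2033 projection:

# Sanity check: USD 316.11M compounded at 34.8% CAGR over 10 years (2023 -> 2033)
value_2033 = 316.11 * (1 + 0.348) ** 10
print(round(value_2033, 2))  # ~6262.3, matching the projected USD 6,262.27 million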
The market faces challenges due to the lack of standardization, but opportunities abound across various industries like healthcare, finance, and automotive. North America leads with a 39% market share and Asia-Pacific is anticipated to grow at a CAGR of approximately 36.5% from 2024 to 2033, outpacing some other regions due to rapid industrialization and technological adoption.
For example, Nvidia’s “Omniverse Replicator” is a framework for developing custom synthetic data generation pipelines and services. Developers can generate physically accurate 3D synthetic data that serves as a valuable way to enhance the training and performance of AI perception networks used in autonomous vehicles, robotics and intelligent video analytics applications.
4. Applications of Synthetic Data
Healthcare
Examples: synthetic patient records that mimic real patient data while ensuring privacy and compliance with regulations such as the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR); synthetic genomic data; synthetic population health data; etc.
Finance
Examples: simulated, realistic transaction records for various banking operations, including deposits, withdrawals, transfers, and payments, ensuring compliance with privacy standards; mimicked fraudulent transaction patterns; etc. (see the sketch after this list).
Manufacturing
Examples: data representing quality inspection results, defect rates, and process deviations; synthetic maintenance data; etc.
Energy
Examples: Simulated data for demand response events, including consumer participation, load reduction metrics, and response timings.
Environment
Examples: synthetic climate data such as temperature, precipitation, humidity, wind speed, and atmospheric pressure.
Robotics
Examples: Synthetic data for robotic manipulation tasks, including object positions, grasp points, and manipulation sequences.
Autonomous driving
Examples: Artificially generated images from virtual cameras mounted on vehicles, depicting roads, pedestrians, traffic signs, and other vehicles in diverse lighting and weather conditions.
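To make the finance example above concrete, here is a minimal sketch of simulated banking transaction records; the schema, field names, and all distribution parameters are invented for illustration:

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)
n = 1000

# Hypothetical schema for simulated banking transactions; every field is invented
transactions = pd.DataFrame({
    "account_id": rng.integers(10_000, 99_999, size=n),
    "type": rng.choice(["deposit", "withdrawal", "transfer", "payment"],
                       size=n, p=[0.3, 0.3, 0.2, 0.2]),
    "amount": np.round(rng.lognormal(mean=4.0, sigma=1.0, size=n), 2),
    "timestamp": pd.to_datetime("2024-01-01")
                 + pd.to_timedelta(rng.integers(0, 365 * 24 * 3600, size=n), unit="s"),
})
print(transactions.head())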
5. Major Techniques of Producing Synthetic Data
Process-driven methods
Process-driven methods for synthetic data generation are highly effective in scenarios where the underlying processes are well understood and can be accurately modeled using mathematical principles. They offer precision, control, and efficiency across a range of applications but require careful consideration of model assumptions, complexity, and computational requirements to ensure meaningful and applicable results.
Physics, engineering, risk assessment, modeling and simulation, environmental science, manufacturing, and robotics are all fields where process-driven methods are used to produce synthetic data.
Business rules can be used to increase the accuracy of synthetic data. If industry-specific processes can be defined as rules, that helps create meaningful synthetic data. For example, in predictive maintenance of gas turbine engines, the business rule could be “Engines should operate with an MTBF [mean time between failures] of 10,000 hours, and hence the failure rate of a particular engine sub-component must be considered while generating synthetic data used for predictive maintenance purposes”, as sketched below.
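A minimal sketch of encoding that rule, assuming failures follow an exponential distribution with the stated MTBF (a common but simplifying assumption in reliability modeling):

import numpy as np

rng = np.random.default_rng(seed=1)

MTBF_HOURS = 10_000  # business rule: mean time between failures for the sub-component

# Under an exponential failure model, the failure rate is 1 / MTBF;
# sample synthetic times-to-failure consistent with the rule
times_to_failure = rng.exponential(scale=MTBF_HOURS, size=500)

print(times_to_failure.mean())  # should come out close to 10,000 hours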
In telecommunications (and all sectors that use personal data), synthetic data must comply with regulations like the General Data Protection Regulation (GDPR), ensuring that no real user data is exposed.
In the healthcare sector, HIPAA regulations must be complied with when generating patient records and billing information.
Digital twin technology can be used to create simulated data based on the digital twin, which can supplement real-world data for training. Conversely, by creating synthetic data, genAI tools can supplement the training data sets used by digital twins (reference [6]).
Data-driven methods
Data-driven methods for synthetic data generation are powerful tools for creating realistic and representative datasets, particularly when the original data exhibits complex patterns or when privacy is a concern.
By leveraging advanced machine learning techniques and statistical models, these methods enable organizations to enhance their data analysis capabilities while protecting sensitive information.
Generative AI can also be used to create the subset of source data needed to train the model.
6. How Gen AI Produces Synthetic Data
Gen AI is capable of generating text, images, audio, and more. These capabilities are being used to generate synthetic data, which can in turn be used to train models. GPTs, GANs, and VAEs are the models most often used to generate synthetic data. (See this Glossary and the sub-sections below to learn more about GPT, GAN, and VAE.)
Llama 3.1 405B was trained on a massive 15 trillion tokens using 16,000 of Nvidia’s ultra-expensive H100 GPUs. So far, it is among the best models for creating synthetic data, as quoted in this Nvidia blog post (reference [4]): “The Llama 3.1 405B model is ideal for synthetic data generation due to its enhanced ability to recognize complex patterns, generate high-quality data, generalize well, scale efficiently, reduce bias, and preserve privacy.”
OpenAI’s DALL-E 3 image generation model was largely built on synthetic data, as per reference [8].
Generative Pre-trained Transformers (GPTs)
GPT models are a type of transformer-based model trained to predict the next word in a sequence given the preceding words. They are pre-trained on large corpora of text and fine-tuned for specific tasks or domains. This allows them to generate human-like text that can be used as synthetic data.
When it comes to using GPT for synthetic data generation, we have to be very careful.
Example 1: I tried generating sample medical complaint text as synthetic data using the Hugging Face Transformers GPT-2 model.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Initialize the model and tokenizer
model_name = 'gpt2'  # 'gpt2-medium', 'gpt2-large', or 'gpt2-xl' give more complex generation
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Function to generate synthetic text
def generate_synthetic_text(prompt, max_length=100, temperature=0.7):
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    output = model.generate(
        input_ids,
        max_length=max_length,
        num_return_sequences=1,
        do_sample=True,  # required for temperature/top-k/top-p sampling to take effect
        temperature=temperature,
        top_k=50,
        top_p=0.95,
        no_repeat_ngram_size=2,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; avoids a warning
    )
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_text

# Example prompts
prompts = [
    "In the context of healthcare, a patient might experience cold and shivering",
]

# Generate and print synthetic data for each prompt
for prompt in prompts:
    synthetic_text = generate_synthetic_text(prompt)
    print(f"Prompt: {prompt}")
    print(f"Synthetic Data: {synthetic_text}\n")
Running the above code produces continuation text for the prompt. If such outputs are used as examples of complaints for synthetic data, I don’t see that as good practice: the generated health complaints are not realistic and will lead to wrong modeling and results.
Example 2: I used GPT-2 in the same way to create restaurant reviews.
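The snippet is a minimal variation on Example 1; reusing the generate_synthetic_text function with a review-style prompt (the prompt below is illustrative), it might look like this:

# Illustrative reconstruction: reuse generate_synthetic_text from Example 1
review_prompt = "I visited the new Italian restaurant downtown and"  # hypothetical prompt
synthetic_review = generate_synthetic_text(review_prompt, max_length=80)
print(f"Synthetic Review: {synthetic_review}")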
A snippet like this produces short, review-style outputs.
These reviews can be labelled as “Positive”, “Negative”, or “Neutral” by humans. This labeled data can then be used for model training, and the trained model can be used for a “restaurant review sentiment recognition” task. I see no harm here.
Conclusions from Examples 1 and 2: I would not use the Example 1 output as synthetic data for any AI solution development, but there is no harm in using GPT-generated data as in Example 2. This kind of “common sense” needs to be applied when using synthetic data produced by GPTs.
GPTs are used mainly because they have been trained on vast amounts of data, and we cannot be sure what they are capable of generating or hallucinating. Hence, generated text should be manually verified before being used as training data.
In my view, the data for the AI application you are building need not come from a generic pre-trained GPT. A GPT trained on your industry-specific data, and then used within the same industry, seems less risky.
Pre-trained GPTs can also be fed industry-specific data through a Retrieval-Augmented Generation (RAG) approach, with the grounded output then used to generate synthetic data.
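A minimal sketch of that idea follows; the documents, query, and keyword-overlap retrieval are toy stand-ins (a real pipeline would retrieve from a vector store), and it reuses the generate_synthetic_text function from Example 1:

# Toy RAG-style grounding: retrieve an industry-specific snippet, then prompt the model
documents = [
    "Claims for turbine blade cracks must reference inspection interval records.",
    "Patient discharge summaries must include medication reconciliation notes.",
]

def retrieve(query, docs):
    # Naive keyword-overlap retrieval; real systems use embeddings and a vector store
    return max(docs, key=lambda d: len(set(query.lower().split()) & set(d.lower().split())))

query = "generate a synthetic patient discharge summary"
context = retrieve(query, documents)
grounded_prompt = f"Context: {context}\nTask: {query}\nOutput:"
print(generate_synthetic_text(grounded_prompt, max_length=120))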
Generative Adversarial Networks (GANs)
GANs consist of two neural networks (NN):
Generator: Generates fake data trying to mimic real data.
Discriminator: Distinguishes between real data and fake data produced by the generator.
GANs are particularly effective in generating high-fidelity image and text data. The networks are trained in a minimax game:
The generator tries to fool the discriminator by generating realistic fake samples.
The discriminator tries to correctly classify real vs. fake samples.
The discriminator is essential for providing feedback to the generator. It ensures that the generator improves over time by learning to create more realistic images that the discriminator cannot easily distinguish from real images. The adversarial relationship between the two networks is what drives the GAN to produce high-quality synthetic data.
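To make the mechanics concrete, here is a minimal sketch of the adversarial training loop in PyTorch, fitting a generator to a simple 1-D Gaussian rather than images; all architecture and hyperparameter choices are illustrative (the linked notebook below covers the image case):

import torch
import torch.nn as nn

torch.manual_seed(0)

# Generator maps noise to a fake sample; discriminator scores how "real" a sample looks
G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 1.5 + 4.0  # "real" data: samples from N(4, 1.5^2)
    noise = torch.randn(64, 8)
    fake = G(noise)

    # Discriminator step: classify real samples as 1 and fakes as 0
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make the discriminator output 1 for fakes
    g_loss = bce(D(G(noise)), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# Synthetic samples should approach the real distribution (mean ~4.0, std ~1.5)
synthetic = G(torch.randn(1000, 8))
print(synthetic.mean().item(), synthetic.std().item())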
If you are interested in example code for creating synthetic data with a GAN, refer to the following notebook, which produces synthetic handwritten-character images: https://github.com/lakshmiveeramani/synthetic_data/blob/main/synthetic_data-gan.ipynb
Variational Auto-Encoders (VAEs)
Variational Autoencoders (VAEs) are a type of neural network used for generating synthetic data by learning latent representations. They are effective for generating structured data and have applications in fields like genomics and image synthesis.
Unlike GANs, which rely on adversarial training between two neural networks, VAEs use a probabilistic approach to generate synthetic images.
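A minimal sketch of that probabilistic core in PyTorch: the encoder outputs a mean and log-variance, the reparameterization trick draws a latent sample differentiably, and the loss combines reconstruction error with a KL-divergence term; sizes and layers here are illustrative (the linked notebook below shows the full image case):

import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyVAE(nn.Module):
    def __init__(self, input_dim=8, latent_dim=2):
        super().__init__()
        self.enc = nn.Linear(input_dim, 2 * latent_dim)  # outputs [mu, log_var]
        self.dec = nn.Linear(latent_dim, input_dim)

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        # Reparameterization trick: sample z differentiably from N(mu, sigma^2)
        z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
        return self.dec(z), mu, log_var

vae = TinyVAE()
x = torch.randn(64, 8)  # illustrative data batch
recon, mu, log_var = vae(x)

# Loss = reconstruction error + KL divergence from the standard normal prior
recon_loss = ((recon - x) ** 2).sum(dim=-1).mean()
kl = (-0.5 * (1 + log_var - mu ** 2 - log_var.exp()).sum(dim=-1)).mean()
loss = recon_loss + kl
print(loss.item())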
For generating synthetic images using a VAE, refer to the code example at https://github.com/lakshmiveeramani/synthetic_data/blob/main/synthetic_data_vae.ipynb
7. Synthetic Data Selection Criteria
Data Quality
Ensure that the synthetic data accurately replicates the statistical properties of the real data, including distributions, correlations, trends, and outliers. This includes checking metrics like mean, variance, skewness, and kurtosis to verify the data's consistency with the original dataset.
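A minimal sketch of such a check, comparing the moments named above and running a two-sample Kolmogorov-Smirnov test between a real feature and a candidate synthetic version (both illustrative here):

import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
real = rng.normal(loc=50, scale=10, size=5000)         # illustrative real feature
synthetic = rng.normal(loc=50, scale=10.5, size=5000)  # candidate synthetic version

# Compare the moments the criterion mentions: mean, variance, skewness, kurtosis
for name, data in [("real", real), ("synthetic", synthetic)]:
    print(f"{name}: mean={data.mean():.2f} var={data.var():.2f} "
          f"skew={stats.skew(data):.3f} kurtosis={stats.kurtosis(data):.3f}")

# Distribution-level check: two-sample Kolmogorov-Smirnov test
stat, p = stats.ks_2samp(real, synthetic)
print(f"KS statistic={stat:.3f}, p={p:.3f}")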
Data Privacy
Ensure that personal data is anonymised to protect sensitive information, and that industry-specific regulatory compliance and privacy regulations are not compromised.
Relevance
Ensure the synthetic data is well suited to the specific use case, its particular requirements, and the industry in question.
Technical Support
Ensure that adopting, scaling, and integrating the synthetic data is well supported by documentation and customer support.
Pricing
Ensure a cost-benefit analysis is performed to justify the cost of synthetic data. Also confirm that the synthetic data can be scaled when retraining the model for data drift, and check for any license restrictions on doing so.
Conclusion
Synthetic data generation is set to make a big impact in many fields, such as healthcare, finance, and other domains that lack enough real data for machine learning. It helps solve privacy issues, and it also saves cost and time.
Synthetic data can help create balanced datasets, especially for rare events, and supports innovation by allowing rapid testing of new ideas and providing more data for training machine learning models. However, the ethical responsibility still lies with humans, who must weigh the use of synthetic data against the criticality of the use case.
With future advancements in technology and better ways to combine synthetic and real data, synthetic data generation will continue to grow. It promises to transform how we use data, making processes more efficient, private, and innovative.
This article explores synthetic data and its use cases. Please comment if you have any advice or know of other synthetic data use cases!
References:
https://news.mit.edu/2022/synthetic-data-ai-improvements-1103
https://aibusiness.com/ml/google-mit-s-synclr-model-training-using-only-synthetic-data