By Bobby Carlton
One of the main reasons the use of synthetic data is growing is the increasing number and complexity of data sources that software engineers are exposed to.
There are ethical and practical issues when it comes to using live data to train AI and test software platforms. This makes the case for utilizing augmented and synthetic data to address certain development needs. Some of the key stakeholder organizations that have discussed the advantages of utilizing such data include IBM, Datavant, and Gartner.
Developers can speed up the development of AI and software platforms by using data that is generated by algorithms or created by augmenting real data. This can help them address privacy concerns about releasing personal information to third parties when testing and developing software for healthcare and financial transactions. The use of synthetic data is expected to continue growing.
Some people might be concerned that synthetic data could be used to mislead or misinform others. For instance, in a fraud case, a defendant was accused of paying a scientist to create false information for customers.
However, synthetic data is typically used in a legitimate manner to develop and test software and AI models. IBM is one company that uses it this way.

Jonah Leshin, Datavant’s head of privacy research, said that synthetic data is generated by an engine that takes real data and produces outputs that are representative of the original data. Developers need to ensure that the properties of the data they create allow it to stand in for the original. For instance, he says that a synthetic data model can serve as a representation of the patterns that are in the original data.
He also says that developers can use synthetic data to perform analysis and create entries in databases. For instance, if they’re planning on testing a particular workflow, he says that this type of data can provide a useful starting point, particularly when privacy and regulatory issues prevent the use of actual data.
In certain cases, he says, synthetic data can be used to augment the original dataset and increase the amount of data available. He explained that this method involves creating multiple copies of the data, and that it works by treating the synthetic outputs as if they were samples from a larger population.
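The idea of an engine that learns the patterns in real data and then draws new samples from them can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the engine Leshin describes: the "real" rows are made-up age and spending figures, the model is a simple two-column Gaussian, and the names fit_gaussian_2d and sample_synthetic are hypothetical.

```python
import math
import random
import statistics

def fit_gaussian_2d(rows):
    """Estimate means, standard deviations and correlation of two numeric columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}

def sample_synthetic(model, n, rng):
    """Draw n new rows from the fitted distribution; no real row is copied."""
    rho = model["rho"]
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows

# Made-up stand-in for a small real dataset: (age, monthly spend).
rng = random.Random(0)
real = []
for _ in range(200):
    age = rng.gauss(45, 12)
    real.append((age, 20 + 1.5 * age + rng.gauss(0, 10)))

model = fit_gaussian_2d(real)                   # learn the patterns
synthetic = sample_synthetic(model, 5000, rng)  # sample a larger population
```

Here 200 original rows become 5,000 synthetic ones that share the same averages and the same relationship between the two columns, which is the sense in which synthetic outputs can be treated as samples from a larger population. Real engines use far richer models, and a simple fit like this one offers no formal privacy guarantee.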
In a conversation with VentureBeat, Yashar Behzadi, CEO and founder of synthetic data platform Synthesis AI, said synthetic data can be used to enhance and improve the performance of existing models. He noted that the companies he’s working with include tier 1 software firms and auto manufacturers.
Behzadi noted that many companies are not able to use the correct training data for their computer vision models due to various factors. One of these is the lack of consent from their customers. With synthetic data, they can get the exact training data they need.
According to Jim Scheibmeir, a Gartner analyst, software engineers can use synthetic data to develop new features even when there’s no production data available. For instance, if a company is testing an algorithm for driverless cars, it needs to know more about the weather and other factors on the road.
Data scientists who are working on new algorithms might not be able to access production data due to various factors, such as access restrictions and compliance. This makes synthetic data an attractive option as it can provide them with a starting point.
Artificial intelligence (AI) could also play a role in the generation and use of synthetic data. According to Gartner’s Scheibmeir, ChatGPT will likely reinvigorate the notion of how generative AI can be used for the benefit of society. At the same time, the growing number of regulations and laws protecting personal information is making data harder to acquire.
“There’s other states in the US that are picking up legislation, whether it’s Utah, Colorado, Virginia, or Connecticut,” Scheibmeir says.
Scheibmeir also pointed to the increasing number and complexity of data sources that software engineers are exposed to as a reason for the growth of synthetic data. He said that this can lead to a cognitive overload that can affect the development of new software. Equipping software engineers with the necessary tools and resources to manage and collect data is therefore important for reducing the burden on them.
IBM has been using synthetic data to train and test its own AI models. This method can also be used to protect sensitive data by replacing or enhancing it. For instance, it can be used to analyze stock prediction models to see how they respond to fake quotes posted on social media.
Inkit Padhi, an engineer at IBM Research, said that one of the most important factors that companies consider when it comes to using synthetic data is mitigating the risk of unauthorized access. For instance, a financial institution might not be able to share the details of credit card transactions with a third party that is developing a tool to monitor those transactions.
An AI model can learn to paint in the style of Impressionist art if it is shown enough examples. But designing windshield wiper blades in this manner is almost impossible.
A lot of moving machines require a mechanism that transfers force or motion from one part of the assembly to another. For instance, if you want to remove snow and rain from your windshield, you would need a motor that rotates an arm connected to a series of links.
According to Faez Ahmed, a professor of mechanical engineering at MIT, developing images using AI can be simple, but designing mechanical systems can be very challenging. This is because a small change in the design can lead to a complete failure.
Today, most linkage systems are built manually due to the high level of precision required. With the help of a computer-aided design program, engineers can easily move around the bars and joints of a mechanism to find the ideal movement.
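To make the precision problem concrete, here is a minimal Python sketch of the simplest such mechanism, a planar four-bar linkage, where a motor-driven crank is tied to a rocker by a coupler bar. It is an illustration only and not the MIT-IBM team's method: the link lengths are invented, and the function names coupler_joint and full_rotation_possible are hypothetical.

```python
import math

def coupler_joint(crank_angle, ground, crank, coupler, rocker):
    """Return the (x, y) joint shared by coupler and rocker, or None if the links cannot close."""
    # The motor pivot is at the origin; the crank tip moves on a circle around it.
    ax = crank * math.cos(crank_angle)
    ay = crank * math.sin(crank_angle)
    # The rocker pivot is fixed at (ground, 0). The joint must lie on a circle of
    # radius `coupler` around the crank tip and a circle of radius `rocker`
    # around the rocker pivot at the same time.
    dx, dy = ground - ax, -ay
    d = math.hypot(dx, dy)
    if d == 0 or d > coupler + rocker or d < abs(coupler - rocker):
        return None  # the two circles do not meet: the mechanism binds here
    along = (coupler**2 - rocker**2 + d**2) / (2 * d)
    h = math.sqrt(max(0.0, coupler**2 - along**2))
    # Pick one of the two intersection points (one assembly branch).
    px = ax + along * dx / d - h * dy / d
    py = ay + along * dy / d + h * dx / d
    return px, py

def full_rotation_possible(ground, crank, coupler, rocker, steps=360):
    """Check whether the crank can turn a full circle without the links binding."""
    return all(
        coupler_joint(2 * math.pi * i / steps, ground, crank, coupler, rocker) is not None
        for i in range(steps)
    )

# Invented dimensions: a coupler of 2.6 works, a coupler of 2.4 does not.
works = full_rotation_possible(4.0, 1.0, 2.6, 2.5)
fails = full_rotation_possible(4.0, 1.0, 2.4, 2.5)
```

With these numbers, shortening one bar from 2.6 to 2.4 units is enough to stop the crank from completing a rotation, which is the kind of small change that can turn a working design into a failed one.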
A team led by Ahmed and Akash Srivastava at MIT-IBM Watson AI Lab wants to change this process by giving AI a goal and allowing it to design a movement system that fits the needs of the user.
The researchers managed to create a dataset that contains over a hundred million mechanisms, which is almost a thousand times bigger than the next-biggest collection of mechanisms. The complexity of the structures in the dataset is also more than a human could dream up.
The increasing complexity of linkage systems leads to less and less likely outcomes. This concept also applies to the design and creation of AI-based mechanisms. To create the largest dataset of its kind, the researchers ran several billion simulations. They were able to accomplish this by figuring out how to accelerate the process by around 800 times.
The researchers plan to expand the scope of their dataset by adding more complex mechanisms such as gears, sliders, and cams. For Srivastava, who is an IBM Research engineer, the use of AI could help improve the efficiency and creativity of the design process.

“Designing machines using probabilistic generative modeling rather than traditional optimization techniques has the potential to bring more creativity and efficiency into the design process,” said IBM’s Srivastava. “I’m excited to see what this AI can help us achieve.”
You can design your own mechanism with this demo from Ahmed’s lab.
In addition to being useful for validating the accuracy of AI models, synthetic data can also help reveal potential biases or security flaws. This can be done by deploying fake data as adversarial examples. According to Padhi, artificial intelligence models can benefit from testing with synthetic data. “They can help to make AI models more fair, accurate, and trustworthy.”
Despite the various advantages of using synthetic data, it’s still important that organizations have the necessary controls and monitoring to prevent unauthorized access and the spread of private information.
If a company uses synthetic data to mimic the exact details of real data, then its bias will be replicated and its results will be biased, according to Padhi. “If the data is biased, the model that you train, the machine learning that you train will propagate these biases as well.”
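Padhi's point can be illustrated with a toy Python sketch. Everything here is an assumption made for the example: the "real" records are invented approval decisions for two groups, and the generator (the hypothetical mimic function) simply copies each group's approval rate.

```python
import random

# Invented "real" records: (group, approved). Group A is approved far more often.
real = [("A", 1)] * 80 + [("A", 0)] * 20 + [("B", 1)] * 40 + [("B", 0)] * 60

def approval_rates(rows):
    """Share of approved records per group."""
    rates = {}
    for group in sorted({g for g, _ in rows}):
        outcomes = [y for g, y in rows if g == group]
        rates[group] = sum(outcomes) / len(outcomes)
    return rates

def mimic(rows, n, rng):
    """Generate n synthetic records that copy the real data's group mix and approval rates."""
    rates = approval_rates(rows)
    groups = [g for g, _ in rows]
    synthetic_rows = []
    for _ in range(n):
        group = rng.choice(groups)
        approved = 1 if rng.random() < rates[group] else 0
        synthetic_rows.append((group, approved))
    return synthetic_rows

rng = random.Random(0)
synthetic = mimic(real, 10_000, rng)
real_rates = approval_rates(real)
synthetic_rates = approval_rates(synthetic)
```

Comparing real_rates with synthetic_rates shows roughly the same gap between the two groups in both datasets. A generator that faithfully mimics its source carries the source's bias along with everything else, so any model trained on the synthetic records inherits it.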