By Bobby Carlton
In 2012, over 150,000 students from all around the world took advantage of the free online classes offered by edX, a US startup. This led to an explosion in the number of people taking advantage of such programs. To help get a bigger and better picture of what this meant, researchers turned to synthetic data.
But what is synthetic data? Gerard Andrews, Senior Product Marketing Manager for Robotics at NVIDIA, offers a useful definition in an article on the Nvidia site: "It may be artificial, but synthetic data reflects real-world data, mathematically or statistically. Research demonstrates it can be as good or even better for training an AI model than data based on actual objects, events or people."
Although edX was not the first startup to offer online classes, the large number of participants was surprising. It created a lot of data about how people interact with such programs. The data collected by the platform provided researchers with valuable insights into the factors that influence students' decisions to enroll in online classes.
Data scientist Kalyan Veeramachaneni of MIT's Laboratory for Information and Decision Systems (LIDS) noted that the university had amassed a huge amount of information. Although MIT had dealt with large data sets before, this was the first time it held big data of its own, collected locally.
To take advantage of the data, Veeramachaneni asked 20 MIT students to analyze the edX data. Unfortunately, the private information was not allowed to be shared, and this prevented the researchers from accessing it. The data was stored on a single computer, which was not connected to the Internet to prevent unauthorized access.
“I just couldn’t get the work done because the barrier to the data was very high,” said Veeramachaneni in a Nature article.
Veeramachaneni and his team eventually came up with a solution: They created synthetic students, which were computer-generated individuals who had characteristics similar to real students who used the platform. They then used machine learning to analyze the data and identify factors that could affect a person's chances of failing a course.
This approach revealed predictors such as the time students took to submit assignments. The team then used the findings to develop interventions that could help individuals complete their courses.
Through this experience, the researchers created the Synthetic Data Vault, an open-source software package that lets users model their own data sets and then use those models to generate alternative, synthetic versions of the data. In 2020, Veeramachaneni co-founded DataCebo, a company that helps other companies do the same.
One of the main factors that drives the development of synthetic-data research is the desire to protect the privacy of the collected information. Due to the rapid emergence and evolution of AI and machine learning, as well as the increasing number of applications in various fields such as financial analysis and health care, concerns about the data that these systems are using are growing.
To develop systems that can analyze and predict the future behavior of individuals, developers first need to collect vast amounts of data. That data might be used, for example, to predict how likely people are to be able to buy or rent a home.
According to some researchers, the most effective way to address concerns about the privacy of collected data is to create synthetic data. This technology lets computers produce data that closely resembles real-world data without reusing the original records. Mihaela van der Schaar, director of the Cambridge Centre for AI in Medicine in the UK, said that she wants to see data become more useful.
Besides privacy concerns, real data sets come with other problems that can affect the development and operation of AI systems. For instance, a system trying to diagnose a rare condition might not perform the task properly because real-world data on that condition is scarce.
Another issue with data sets is potential bias, which could cause systems to favor certain groups over others. Supporters of synthetic data claim that they can address this issue by adding synthetic information to the data sets, which is faster and cheaper than gathering it in real life.
“To me, it’s about making data this living, controllable object that you can change towards your application and your goals,” says Phillip Isola, a computer scientist at MIT who specializes in machine vision. “It’s a fundamental new way of working with data.”
Although there are many ways to create a synthetic data set, the underlying concept is the same. A computer performs a statistical analysis of a real data set and learns the relationships within it. It then generates new data points that follow those relationships and compiles them into a new set.
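The learn-then-generate idea can be illustrated with a deliberately small sketch. It is not how the Synthetic Data Vault or any production tool works; it assumes just two numeric columns whose relationship can be summarized by a correlation, and the "real" data set (hours an assignment was late, final score) is invented for the example.

```python
import math
import random
import statistics


def fit_model(rows):
    """Learn the means, spreads and correlation of two numeric columns."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows) / len(rows)
    return {"mx": mx, "my": my, "sx": sx, "sy": sy, "rho": cov / (sx * sy)}


def sample_synthetic(model, n, rng):
    """Draw n new rows that follow the learned relationships."""
    rho = model["rho"]
    residual = math.sqrt(max(0.0, 1.0 - rho * rho))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = model["mx"] + model["sx"] * z1
        y = model["my"] + model["sy"] * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


# Stand-in for a private "real" data set (invented for this sketch):
# hours an assignment was submitted late, and the final course score.
rng = random.Random(0)
real_rows = []
for _ in range(500):
    hours_late = rng.gauss(24, 10)
    score = 80 - 0.8 * hours_late + rng.gauss(0, 5)
    real_rows.append((hours_late, score))

real_model = fit_model(real_rows)
synthetic_rows = sample_synthetic(real_model, 500, random.Random(1))
synthetic_model = fit_model(synthetic_rows)

print("real correlation:     ", round(real_model["rho"], 2))
print("synthetic correlation:", round(synthetic_model["rho"], 2))
```

None of the synthetic rows corresponds to a real student, yet the link between late submission and lower scores survives in the new set. Real tools use far richer models, but the two steps are the same.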
ChatGPT is a text generation engine built on a large language model known as the Generative Pre-trained Transformer (GPT). The model analyzed billions of words and built a representation of how they fit together. When given a prompt, ChatGPT draws on those learned relationships to produce a fitting string of words.
The same approach can be used to generate images, audio, and even rows and columns of data, provided the computer is trained correctly. According to Thomas Strohmer, a professor at UC Davis, one of the biggest challenges in developing synthetic data is ensuring that the output is accurate.
Jason Adams, Thomas Strohmer and Rachael Callcut (left to right) are part of the synthetic data research team at UC Davis Health.
One of the most important factors that a computer needs to consider when it comes to developing synthetic data is accuracy. Having the correct statistical relationships is very important to ensure that the results are relevant to the task at hand. Despite the challenges that artificial intelligence faces, it has already made many impressive achievements.
According to Strohmer, if we could make sense of medical data ourselves, we would no longer need a machine to find the relationships between patients and health conditions.
The clearest way to determine whether a synthetic data set has captured the attributes of the original is to compare the predictions of an AI system trained on the synthetic data with those of a system trained on the original. The more capable the generating machine is, the harder it will be for humans to distinguish between the fake and the real.
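A rough sketch of that comparison follows. It assumes a very simple predictive model (a straight-line fit) and invented data; the function names are hypothetical. One model is trained on real data, another on synthetic data, and their predictions are compared on held-out real inputs. A deliberately broken synthetic set is included to show what a poor result looks like.

```python
import random


def make_rows(rng, n):
    """Invented process: later submissions tend to go with lower scores."""
    rows = []
    for _ in range(n):
        hours_late = rng.gauss(24, 10)
        rows.append((hours_late, 80 - 0.8 * hours_late + rng.gauss(0, 5)))
    return rows


def fit_line(rows):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sxx = sum((x - mx) ** 2 for x, _ in rows)
    sxy = sum((x - mx) * (y - my) for x, y in rows)
    slope = sxy / sxx
    return my - slope * mx, slope


def prediction_gap(real_train, synthetic, real_test):
    """Mean absolute difference between the two models' predictions
    on held-out real inputs."""
    a_real, b_real = fit_line(real_train)
    a_syn, b_syn = fit_line(synthetic)
    gaps = [abs((a_real + b_real * x) - (a_syn + b_syn * x)) for x, _ in real_test]
    return sum(gaps) / len(gaps)


real_train = make_rows(random.Random(0), 400)
real_test = make_rows(random.Random(1), 100)

# A faithful synthetic set: here simply fresh draws from the same process.
good_synthetic = make_rows(random.Random(2), 400)

# A poor synthetic set: same columns, but the link between them is destroyed.
shuffled_scores = [y for _, y in good_synthetic]
random.Random(3).shuffle(shuffled_scores)
bad_synthetic = [(x, y) for (x, _), y in zip(good_synthetic, shuffled_scores)]

gap_good = prediction_gap(real_train, good_synthetic, real_test)
gap_bad = prediction_gap(real_train, bad_synthetic, real_test)
print(f"gap with faithful synthetic data: {gap_good:.2f}")
print(f"gap with broken synthetic data:   {gap_bad:.2f}")
```

A small gap suggests the synthetic set preserved the relationship the model depends on; a large one suggests it did not.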
One tool available to create synthetic data is Nvidia's Omniverse Replicator.
With the ability to create synthetic data sets that are specific to their networks' requirements, developers can use Omniverse Replicator to easily create applications that can improve the performance of their neural networks. The platform is built on open standards such as MDL, USD, and PhysX, and it can be expanded with custom randomizers and writers.
This allows data sets to be created quickly with a CUDA-based annotator, and developers can easily preview their output. Connecting the output of Replicator to SwiftStack and Omniverse Farm provides a huge amount of scalability.
The Omniverse Replicator SDK is composed of six primary components for custom synthetic data workflows:
Due to the rapid development of AI technology, the images and text generated by machine learning have become realistic to most people. However, it is still important to keep in mind that these are not real data sets.
In April, Strohmer and his colleagues at UC Davis were awarded a four-year grant from the NIH to develop methods that could help improve the quality of synthetic data. This project will allow scientists to create a more accurate and timely data set. One of the methods that Strohmer is working on is to analyze how accurate the data sets that are created by machine learning are.
He is also working on developing a mathematical algorithm that can guarantee the privacy of the data collected by machine learning. This is because the various laws that are related to the privacy of medical data are very strict. The difficulty is balancing the utility of data with the privacy of the individual.
One way researchers can increase the privacy of their data is to add statistical noise to the data set. For instance, if one of the data points that they collect is a person's age, they can add a random number to each value to make individuals less identifiable. However, if the age of the individual is one of the factors being studied, this method might lead to inaccurate results.
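The trade-off can be seen in a small sketch. It assumes Laplace-distributed noise, a common choice in the privacy literature, applied to an invented list of ages; the scales are arbitrary and the code makes no claim to any formal privacy guarantee.

```python
import random
import statistics


def add_laplace_noise(values, scale, rng):
    """Add zero-mean Laplace noise to each value.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    rate = 1.0 / scale
    return [v + rng.expovariate(rate) - rng.expovariate(rate) for v in values]


rng = random.Random(0)
ages = [rng.randint(18, 90) for _ in range(1000)]  # invented ages

light = add_laplace_noise(ages, scale=2.0, rng=random.Random(1))
heavy = add_laplace_noise(ages, scale=10.0, rng=random.Random(1))


def mean_shift(noisy):
    """Average amount by which an individual's recorded age was changed."""
    return statistics.fmean(abs(a - b) for a, b in zip(ages, noisy))


print("true mean age:          ", round(statistics.fmean(ages), 1))
print("mean age, scale 2:      ", round(statistics.fmean(light), 1))
print("mean age, scale 10:     ", round(statistics.fmean(heavy), 1))
print("typical shift, scale 2: ", round(mean_shift(light), 1))
print("typical shift, scale 10:", round(mean_shift(heavy), 1))
```

More noise hides each individual's true age better, but it also blurs any analysis that depends on age, which is exactly the tension described above.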
One of the biggest issues researchers face when it comes to protecting the privacy of their collected data is the lack of clarity regarding how, and how much, information a synthetic data set can reveal. According to Florimond Houssiau, a computer scientist at the Alan Turing Institute in London, one of the most common ways that secrets can be spilled is when the synthetic data is too similar to its original form.
The more complex a data set is, the harder it is to understand the relationships between its various pieces of information, and the more likely the system creating the synthetic version is to simply copy and reproduce the data it sees.
Researchers can assign a numerical value to the privacy level of a data set, but “we don’t exactly know which values should be considered safe or not. And so it’s difficult to do that in a way that everyone would agree on”.
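One simple numerical check of this kind is the distance from each synthetic row to its closest real row. The sketch below is only an illustration with invented data; the threshold is an arbitrary assumption, which is precisely the difficulty the quote points to, and the metric is one of several used in practice rather than the one Houssieau's group relies on.

```python
import math
import random


def distance_to_closest_record(row, real_rows):
    """Euclidean distance from one synthetic row to its nearest real row."""
    return min(math.dist(row, real) for real in real_rows)


def near_copies(synthetic_rows, real_rows, threshold):
    """Synthetic rows that sit suspiciously close to some real row."""
    return [row for row in synthetic_rows
            if distance_to_closest_record(row, real_rows) < threshold]


rng = random.Random(0)
real_rows = [(rng.gauss(40, 12), rng.gauss(70, 10)) for _ in range(200)]

# A leaky generator, simulated: 95 fresh rows plus 5 rows copied verbatim.
synthetic_rows = [(rng.gauss(40, 12), rng.gauss(70, 10)) for _ in range(95)]
synthetic_rows += rng.sample(real_rows, 5)

leaked = near_copies(synthetic_rows, real_rows, threshold=1e-6)
print(f"{len(leaked)} of {len(synthetic_rows)} synthetic rows are near-exact copies")
```

The check catches verbatim copies easily. Deciding how close is too close for rows that are merely similar is the part on which there is no agreed answer.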
The complexity of medical data sets can also make it hard to create synthetic versions of them. For instance, a medical professional with years of experience and knowledge can put together a diagnosis by drawing on the many different kinds of information recorded in a data set.
Unfortunately, machine learning is not yet good at extracting and combining information from different types of data in this way. This is a major issue that scientists are facing when it comes to developing synthetic versions of their data.
According to Isola, there are theoretical limits on how far machine learning can improve the quality of data. One of these is the information-theory principle known as the data-processing inequality, which states that processing data can only reduce the amount of information it carries, never add to it. All of the problems that are related to real data, such as bias, expense, and privacy, still exist at the start of the pipeline.
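The inequality can be checked numerically on a toy pipeline. In this sketch, X stands for a fact about the world, Y for noisy real data about it, and Z for a further processed version of Y such as a synthetic copy. The flip probabilities are arbitrary assumptions; the point is only that Z never carries more information about X than Y does.

```python
import math


def mutual_information(joint):
    """Mutual information in bits of a joint distribution {(a, b): prob}."""
    pa, pb = {}, {}
    for (a, b), p in joint.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)


def transition(p_flip, a, b):
    """Probability that bit a comes out as bit b when flipped with p_flip."""
    return p_flip if a != b else 1.0 - p_flip


p, q = 0.1, 0.2  # assumed noise in collection (X to Y) and processing (Y to Z)
bits = (0, 1)

# X is a fair coin; Y is X seen through noise; Z is Y processed with more noise.
joint_xy = {(x, y): 0.5 * transition(p, x, y) for x in bits for y in bits}
joint_xz = {
    (x, z): sum(0.5 * transition(p, x, y) * transition(q, y, z) for y in bits)
    for x in bits for z in bits
}

info_xy = mutual_information(joint_xy)
info_xz = mutual_information(joint_xz)
print(f"information about X in the real data Y:      {info_xy:.3f} bits")
print(f"information about X in the processed data Z: {info_xz:.3f} bits")
```

However the processing step is designed, the second number cannot exceed the first as long as Z is computed from Y alone.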
With synthetic data, one is still ultimately learning from the data that was originally collected. Isola explained that nothing comes for free; one is simply reformatting the data into a form that can be controlled better. Creating a synthetic version of the data allows scientists to get a better and more complete picture of their collected information.
Although the use of synthetic data in medicine is still in its early stages, it is already being used in various sectors, such as finance. According to Strohmer, many companies are currently working on developing new data sets that are designed to protect the privacy of individuals.
In finance, if a mistake is made, it may still hurt, but it won't kill you, which helps speed up the process in comparison to medicine.
In 2021, the US Census Bureau revealed that it was planning on creating a synthetic version of the data collected by the American Community Survey to improve the privacy of the individuals who responded to it. However, some researchers criticized the move, saying that it could potentially undermine the data's usefulness.
In February, a partnership that facilitates the sharing of public sector data announced that it would be awarding a grant to study the value of synthetic versions of the data sets held by the UK Data Service and the Office for National Statistics (ONS).
Andrew Elliot, a statistician at Glasgow University, said that some people are also experimenting with using fake data to test the software they hope to use on real data. Although such data looks like the real thing in format, it can be statistically useless, because it is only used for testing the code involved.
Scientists who are only authorized to examine a data set under restricted conditions can create a synthetic version of it and use that to develop and improve their code. This avoids wasting time while waiting to get hold of the real data.
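A minimal sketch of this kind of code-testing data follows. The schema, column names and analysis function are all hypothetical; the values are random and carry no statistical meaning, since the only goal is to exercise the code before the real data is available.

```python
import random
import string

# Hypothetical schema for a restricted data set: column name -> type.
SCHEMA = {"student_id": str, "age": int, "hours_late": float, "passed": bool}


def fake_value(kind, rng):
    """Return a random value of the requested type."""
    if kind is str:
        return "".join(rng.choices(string.ascii_lowercase, k=8))
    if kind is int:
        return rng.randint(0, 100)
    if kind is float:
        return rng.uniform(0.0, 100.0)
    if kind is bool:
        return rng.random() < 0.5
    raise ValueError(f"unsupported column type: {kind}")


def fake_rows(schema, n, rng):
    """Build n rows that match the schema but mean nothing statistically."""
    return [{name: fake_value(kind, rng) for name, kind in schema.items()}
            for _ in range(n)]


def pass_rate(rows):
    """The analysis code under test; it would later run on the real data."""
    return sum(row["passed"] for row in rows) / len(rows)


rows = fake_rows(SCHEMA, 50, random.Random(0))
rate = pass_rate(rows)
print(f"pipeline ran on {len(rows)} fake rows, pass rate {rate:.2f}")
```

The pass rate printed here is meaningless. What matters is that the pipeline runs end to end on rows with the right shape.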
Currently, synthetic data is mainly regarded as a niche technology. More people should start talking about this technology and its potential impact since it could affect everyone.
Strohmer noted that synthetic data raises fascinating scientific questions as well as important societal issues. Data privacy, he said, is essential in the age of surveillance capitalism. Creating good synthetic versions of data that are both transparent and maintain diversity can help improve the performance of artificial intelligence and expand its scope of applications.
“A lot of data is owned by a few big companies, and that creates an imbalance. Synthetic data could help to re-establish this balance a little bit,” Strohmer says. “I think that’s an important, bigger goal behind synthetic data.”