22 June 2023

Synthetic Population Data Can Deliver Valuable Insights Into the Development and Maintenance of Complex Urban Communities

By Bobby Carlton
Photo by NatGeo

The use of synthetic population data can transform communities by providing quality data without impacting poor or vulnerable individuals.

Millions of people are becoming part of the quantified self movement by using wearable tech to track their activities and health. This movement is powered by the data collected by these devices, which can be analyzed and utilized to improve the quality of life for those with chronic conditions.

The benefits of good-quality data collection are numerous, but collecting that data can be invasive and could even harm poor and vulnerable communities.

To balance the interests of the individual with the gathering of valuable information, can we use synthetic populations to help urban communities? This post explores the approach in New York City, which is known for its use of big data to support urban management.

Through a collaboration with the Sloan Foundation, World Data Lab was able to create a synthetic population that can be used to study poverty rates in Brooklyn. The group combined summary statistics and microdata to build it.

Microdata is personal information collected at the individual level. In the US, this type of data is generally published for public use microdata areas (PUMAs), which are geographic regions that each contain at least 100,000 people. Due to privacy concerns, it is not available at the more granular census tract level.

Microdata can include details about an individual such as educational attainment, income, and household size. Census tracts are small statistical subdivisions of a county, each typically home to a few thousand people. The statistics published for them are aggregated over the population of the area, which makes them easier to collect and release than individual records.

Summary statistics are also available for households and individuals. These types of statistics provide a comprehensive view of the population and its various characteristics. For instance, they can identify the number of households that are within a certain income range.

Unfortunately, the distribution across census tracts is not visible in microdata, which is only available at PUMA level or larger. For instance, policymakers might not be able to see income disparities within a neighborhood because microdata is lacking at that scale. Through a synthetic population approach, we can combine the two data sets to create a more accurate representation of the population.

A synthetic population starts from the summary statistics for each area, such as the number of households, demographic information, and income brackets. These are then combined with microdata to create a complete representation of the population. To ensure that the constraints of the summary statistics are met, we use them to control how the microdata is sampled.
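The post does not say which algorithm was used to satisfy the constraints. One common choice is iterative proportional fitting (also called raking), and the sketch below assumes it. The field names, categories, and counts are made up for illustration and are not taken from the Brooklyn data.

```python
from collections import defaultdict


def rake_weights(records, targets, n_iter=50):
    """Reweight microdata records so weighted counts match summary statistics.

    records: list of dicts, one per microdata household (hypothetical fields).
    targets: {variable: {category: target_count}} for one census tract.
    Returns one weight per record.
    """
    weights = [1.0] * len(records)
    for _ in range(n_iter):
        for var, wanted in targets.items():
            # Current weighted count for each category of this variable.
            current = defaultdict(float)
            for rec, w in zip(records, weights):
                current[rec[var]] += w
            # Rescale every record so its category total hits the target.
            for i, rec in enumerate(records):
                cat = rec[var]
                if current[cat] > 0:
                    weights[i] *= wanted.get(cat, 0.0) / current[cat]
    return weights


# Toy PUMA-level microdata (illustrative values only).
microdata = [
    {"income": "low", "size": "small"},
    {"income": "low", "size": "large"},
    {"income": "high", "size": "small"},
    {"income": "high", "size": "large"},
]

# Made-up summary statistics for one census tract (100 households in total).
tract_targets = {
    "income": {"low": 70, "high": 30},
    "size": {"small": 40, "large": 60},
}

weights = rake_weights(microdata, tract_targets)
```

The resulting weights say how many households of each microdata type the tract should contain. Drawing households in proportion to them, for example with `random.choices(microdata, weights=weights, k=100)`, would give one possible synthetic tract.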

By controlling for various factors, we can create a micro dataset that represents the census tracts, but without collecting personal information. This allows us to explore the variations across census tracts in a PUMA and study more detailed questions about income disparities based on sex and age.

Although we can only control for the variables that are included in both datasets, the synthetic population still carries information about the other variables from the original PUMA-level microdata.

Synthetic population

In the Flatbush and Midwood area of Brooklyn, poverty is measured as the number of people living below the New York City-specific poverty threshold. The poverty rate in the synthetic population, however, differs from the single rate reported for the PUMA as a whole.

The PUMA covering Midwood and Flatbush in Kings County, New York, was selected because of its high variance in mean income. It has 44 census tracts and around 57,000 households.

Using PUMA-level microdata, the image shows that about 26.4 percent of the population in the area is below the New York poverty threshold. However, through the synthetic population approach, we were able to see that some census tracts had lower poverty rates than this average.
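To make the comparison concrete, here is a minimal sketch of how a tract-level poverty rate could be computed from a synthetic population and set against the area-wide figure. The records, the tract labels, and the single income threshold are all hypothetical. A real poverty threshold typically varies with household size, which this toy example ignores.

```python
from collections import defaultdict


def poverty_rate_by_tract(households, threshold):
    """Share of people per tract living in households below the threshold."""
    below = defaultdict(int)
    total = defaultdict(int)
    for hh in households:
        total[hh["tract"]] += hh["persons"]
        if hh["income"] < threshold:
            below[hh["tract"]] += hh["persons"]
    return {tract: below[tract] / total[tract] for tract in total}


# Toy synthetic households for two tracts (illustrative values only).
synthetic = [
    {"tract": "A", "income": 18_000, "persons": 3},
    {"tract": "A", "income": 52_000, "persons": 2},
    {"tract": "A", "income": 75_000, "persons": 5},
    {"tract": "B", "income": 15_000, "persons": 4},
    {"tract": "B", "income": 21_000, "persons": 2},
    {"tract": "B", "income": 60_000, "persons": 4},
]

THRESHOLD = 30_000  # assumed flat threshold, for illustration only

rates = poverty_rate_by_tract(synthetic, THRESHOLD)

# The area-wide rate hides the difference between the two tracts.
people_below = sum(h["persons"] for h in synthetic if h["income"] < THRESHOLD)
overall = people_below / sum(h["persons"] for h in synthetic)
```

In this toy data the area-wide rate sits between the two tract rates, which is the kind of within-area variation the synthetic population is meant to expose.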

Big data has become an integral part of the social programs of New York City. For instance, the CIDI, a non-profit organization focused on innovation through data intelligence, launched the NYC Wellbeing Index, which measures the health of the city’s neighborhoods. Through this tool, leaders can get a deeper understanding of how their communities compare.

Although Neighborhood Tabulation Areas (NTAs) are less granular than census tracts, they can still be used to identify areas with high poverty rates. This can help improve the delivery of social services by identifying areas with the most households below the poverty line.

The synthetic population method can also be used to identify areas where the most people are living below the poverty line. In developing countries, it could help governments target areas with high poverty rates now that the average poverty rate has started to fall. Several countries, such as Colombia, Thailand, and the Philippines, have been experimenting with this method.

The synthetic population approach can provide valuable insights into the factors that affect the development and maintenance of complex urban communities. It can also help us design effective interventions while protecting the privacy of the people behind the data.