MLDS Center Investigates the Creation of Synthetic Data

COLLEGE PARK, MD (December 2015) – The State of Maryland has been awarded a four-year, $6.9 million grant under the U.S. Department of Education’s Statewide Data Systems program. Crucially, a portion of these funds will support the Maryland Longitudinal Data System (MLDS) Center’s project to create synthetic data sets, undertaken jointly with the University of Maryland’s School of Social Work, the Joint Program in Survey Methodology, and the College of Education’s program in Measurement, Statistics, and Evaluation.

The project, entitled “Feasibility of Synthetic Data for Population-Averaged and Cluster-Specific Analyses by Researchers Utilizing Integrated State Longitudinal Data Systems” and led by Dr. Laura Stapleton of the Department of Human Development and Quantitative Methodology, aims to increase access to MLDS Center data, allowing policy analysts and researchers to pose questions directly. To do so, the project will create and test synthetic copies of three data warehouses that contain needed data as identified by potential users.

Though the MLDS Center is obligated to make data accessible to researchers, policymakers, and education stakeholders, a constellation of confidentiality laws and policies limits the provision of data, even when stripped of individually identifiable information. A strategy commonly used to disseminate data to researchers while protecting confidentiality – providing aggregate data – faces significant limitations: aggregate data cannot be analyzed to answer highly detailed questions. Alternative methods of disclosure prevention, such as data swapping and perturbation, are not feasible for use with the MLDS since they still involve the release of actual records.

To surmount these difficulties, the MLDS Center is taking a cue from the U.S. Census Bureau by seeking to develop synthetic data – artificial data sets that are similar to, but distinct from, the raw, confidential data from which they are derived. In this way, researchers will have access to microdata closely mimicking the properties of the raw data. They can then analyze the synthetic data to answer a variety of important research questions that cannot be addressed with summarized data alone.

To create synthetic data, conditional probability distributions for the variables of interest to researchers are constructed from the raw values, and values are then randomly drawn from these distributions to form a new data set. Thus, if the statistical model is adequately specified, the synthetic data will reproduce the statistical properties of the original data but will be composed of different values with no correspondence to real people. Because these data can be freely accessed without confidentiality concerns, more research questions can be answered – and in a more timely manner.
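
As a rough illustration of the general idea only (not the MLDS Center's actual method), the sketch below fits a simple distribution for a categorical variable and a conditional normal distribution for a numeric variable within each category, then draws new records from them. The function name, the variable names, the normal model, and the toy records are all hypothetical and chosen purely for illustration.

```python
import random
import statistics
from collections import defaultdict


def synthesize(records, n, seed=0):
    """Draw n synthetic (group, score) pairs from distributions fitted to records.

    Assumed model (illustrative only): group follows its observed
    proportions, and score given group is normal with that group's
    observed mean and standard deviation.
    """
    rng = random.Random(seed)

    # Collect the observed scores for each group.
    by_group = defaultdict(list)
    for group, score in records:
        by_group[group].append(score)

    # Marginal distribution of group: observed counts used as weights.
    groups = sorted(by_group)
    weights = [len(by_group[g]) for g in groups]

    # Conditional distribution of score given group: mean and std. dev.
    params = {
        g: (statistics.mean(by_group[g]), statistics.pstdev(by_group[g]))
        for g in groups
    }

    synthetic = []
    for _ in range(n):
        g = rng.choices(groups, weights=weights, k=1)[0]  # draw a group
        mu, sigma = params[g]
        synthetic.append((g, rng.gauss(mu, sigma)))       # draw a score
    return synthetic


# Made-up toy records, not real student data.
raw = [("A", 71.0), ("A", 78.5), ("A", 83.0), ("B", 60.0), ("B", 66.5)]
synthetic = synthesize(raw, 1000)
```

No synthetic record is a copy of a raw record, yet group proportions and within-group averages should be close to those of the raw data when the assumed model fits. A real synthesizer would model many variables jointly and would need to be checked for both analytic validity and disclosure risk.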

With the pursuit of synthetic data, the MLDS Center will increase the usefulness of the data it collects while preserving the privacy of Maryland students. The project promises to be a model for other longitudinal data systems across the nation.

Dr. Laura Stapleton is an associate professor in the Measurement, Statistics, and Evaluation program in the Department of Human Development and Quantitative Methodology. Her work focuses on the analysis of administrative and survey data obtained under complex sampling designs and on multilevel latent variable models, including tests of mediation within a multilevel framework. She is the associate director of research for the Maryland Longitudinal Data System Center.

-end-

For more information on the College of Education, visit www.education.umd.edu or contact Joshua Lavender, Communications Coordinator, at lavender@umd.edu.