Evaluation of synthetic data generators on complex tabular data

Publication date
2024
Type
04B - Conference paper
Parent work
Privacy in statistical databases. International Conference, PSD 2024, Antibes Juan-les-Pins, France, September 25–27, 2024, Proceedings
Series
Lecture Notes in Computer Science
Series number
14915
Pages / Duration
194-209
Publisher / Publishing institution
Springer
Place of publication / Event location
Cham
Abstract
Synthetic data generators are widely used to produce synthetic data that complements or replaces real data. However, the utility of the generated data is often limited by the complexity of the data being synthesized. This paper evaluates the performance of such generators on a complex data set that includes cluster structures and complex relationships. We compare several synthesizers (synthpop, Synthetic Data Vault, simPop, Mostly AI, Gretel, Realtabformer, and arf), which follow different methodologies and were run with (mostly) default settings, on two properties: syntactical accuracy and statistical accuracy. As a complex and popular data set, we used the European Statistics on Income and Living Conditions data set. Almost all synthesizers produced data with low utility and low syntactical accuracy. For such complex data, simPop, a computational and methodological framework for simulating complex data based on conditional modeling, emerged as the most effective approach for static tabular data and proved superior to the other conditional and joint modeling approaches.
Event
International Conference, PSD 2024
Conference start date
25.09.2024
Conference end date
27.09.2024
ISBN
978-3-031-69650-3
978-3-031-69651-0
Language
English
Created during FHNW affiliation
Yes
Publication status
Published
Review
Peer review of the complete publication
Open access category
Closed
Citation
Thees, O., Novak, J., & Templ, M. (2024). Evaluation of synthetic data generators on complex tabular data. In J. Domingo-Ferrer & M. Önen (Eds.), Privacy in statistical databases. International Conference, PSD 2024, Antibes Juan-les-Pins, France, September 25–27, 2024, Proceedings (pp. 194–209). Springer. https://doi.org/10.1007/978-3-031-69651-0_13