Evaluation of synthetic data generators on complex tabular data

Contributors: Thees, Oscar; Novak, Jiri; Templ, Matthias; Domingo-Ferrer, Josep; Önen, Melek
Date: 2024-12-12
Year: 2024
ISBN: 978-3-031-69650-3, 978-3-031-69651-0
DOI: 10.1007/978-3-031-69651-0_13
URL: https://irf.fhnw.ch/handle/11654/48405
Language: en
Subjects: 330 - Economics; 004 - Computer science, Internet; 510 - Mathematics
Type: 04B - Conference proceedings contribution
Pages: 194-209

Abstract

Synthetic data generators are widely used to produce synthetic data that complements or replaces real data. However, the utility of the data is often limited by its complexity. The aim of this paper is to show how these generators perform on a complex data set that includes cluster structures and complex relationships. We compare several synthesizers (synthpop, Synthetic Data Vault, simPop, Mostly AI, Gretel, Realtabformer, and arf), taking into account their different methodologies and using (mostly) default settings, on two properties: syntactical accuracy and statistical accuracy. As a complex and popular data set, we used the European Statistics on Income and Living Conditions data set. Almost all synthesizers resulted in low data utility and low syntactical accuracy. The results indicated that, for such complex data, simPop, a computational and methodological framework for simulating complex data based on conditional modeling, emerged as the most effective approach for static tabular data and is superior to the other conditional or joint modeling approaches.