Templ, Matthias
Surname: Templ
First name: Matthias
Search results
35 results · Now showing 1 - 10 of 35
Publication: Evaluation of synthetic data generators on complex tabular data (Springer, 2024)
Thees, Oscar; Novak, Jiri; Templ, Matthias; Domingo-Ferrer, Josep; Önen, Melek
Synthetic data generators are widely used to produce synthetic data as a complement to, or replacement for, real data. However, the utility of the generated data is often limited by the complexity of the original data. The aim of this paper is to evaluate the performance of such generators on a complex data set that includes cluster structures and complex relationships. We compare different synthesizers, such as synthpop, Synthetic Data Vault, simPop, Mostly AI, Gretel, Realtabformer, and arf, taking into account their different methodologies with (mostly) default settings, on two properties: syntactical accuracy and statistical accuracy. As a complex and popular data set, we used the European Statistics on Income and Living Conditions data set. Almost all synthesizers resulted in low data utility and low syntactical accuracy. The results indicated that for such complex data, simPop, a computational and methodological framework for simulating complex data based on conditional modeling, emerged as the most effective approach for static tabular data and is superior to other conditional or joint modeling approaches.
04B - Conference proceedings contribution

Publication: Simulation of calibrated complex synthetic population data with XGBoost (MDPI, 2024)
Gussenbauer, Johannes; Templ, Matthias; Fritzmann, Siro; Kowarik, Alexander
Synthetic data generation methods are used to transform original data into privacy-compliant synthetic copies (twin data). With our proposed approach, synthetic data can be simulated at the same size as the input data or at any size, and, in the case of finite populations, even the entire population can be simulated. The proposed XGBoost-based method is compared with known model-based approaches for generating synthetic data using a complex survey data set. The XGBoost method shows strong performance, especially with synthetic categorical variables, and outperforms the other tested methods. Furthermore, the structure of, and relationships between, variables are well preserved. Parameter tuning is performed automatically by a modified k-fold cross-validation. If exact population margins are known, e.g., cross-tabulated population counts on age class, gender, and region, the synthetic data must be calibrated to those known population margins. For this purpose, we have implemented a simulated annealing algorithm that can use multiple population margins simultaneously to post-calibrate a synthetic population. The algorithm is thus able to calibrate simulated population data containing cluster and individual information, e.g., about persons in households, at both the person and the household level. Furthermore, the algorithm is efficiently implemented, so that the adjustment of populations of many millions of persons or more is possible.
01A - Article in scientific journal
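To make the generator comparison in the first entry above concrete, here is a minimal sketch of producing and assessing a synthetic copy with synthpop, one of the compared synthesizers. It uses SD2011, the example survey bundled with the package, as a stand-in for the EU-SILC data evaluated in the paper; the variable selection is illustrative only.

```r
# Minimal synthpop sketch (default CART-based synthesis).
library(synthpop)
vars <- c("sex", "age", "edu", "income")
syn_obj <- syn(SD2011[, vars], seed = 1)   # generate one synthetic copy
compare(syn_obj, SD2011[, vars])           # compare synthetic vs. observed marginals
```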
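The XGBoost entry above also describes a simulated annealing step for post-calibrating a synthetic population to known margins. The following is a toy sketch of that general idea, not the paper's implementation: records are perturbed one at a time, and changes are accepted when they move the cross-tabulated counts closer to assumed known margins. All data and margin values below are made up, and only gender is perturbed, for brevity.

```r
# Toy simulated annealing calibration: push region x gender counts toward
# hypothetical known margins.
set.seed(1)
pop <- data.frame(region = sample(c("A", "B"), 1000, replace = TRUE),
                  gender = sample(c("f", "m"), 1000, replace = TRUE))
target <- matrix(c(260, 280, 230, 230), nrow = 2,
                 dimnames = list(c("A", "B"), c("f", "m")))  # assumed margins
loss <- function(d) sum(abs(table(d$region, d$gender) - target))

temp <- 10
cur_loss <- loss(pop)
for (i in 1:5000) {
  j <- sample(nrow(pop), 1)
  old <- pop$gender[j]
  pop$gender[j] <- sample(c("f", "m"), 1)      # propose a small change
  delta <- loss(pop) - cur_loss
  if (delta <= 0 || runif(1) < exp(-delta / temp)) {
    cur_loss <- cur_loss + delta               # accept (sometimes uphill)
  } else {
    pop$gender[j] <- old                       # reject: undo the change
  }
  temp <- temp * 0.999                         # geometric cooling
}
loss(pop)  # remaining deviation (region totals are fixed, since only gender flips)
```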
Publication: Prof. Rudolf Dutter (1946-2023): Ein Nachruf (Austrian Statistical Society, 07/2023)
Filzmoser, Peter; Templ, Matthias
Rudolf Dutter, former professor at TU Wien, passed away on 5 May 2023 from complications of his long-standing diabetes. From 1997 to 2003, Prof. Dutter was editor of the Austrian Journal of Statistics (Österreichische Zeitschrift für Statistik), a role he carried out with great dedication on behalf of the Austrian Statistical Society. Among his activities was setting up and running a website for the journal that provided open access to its articles. A short obituary in this journal, also as information for the members of the society, therefore seems more than fitting.
01A - Article in scientific journal

Publication: Visualization and imputation of missing values. With applications in R (Springer, 2023)
Templ, Matthias
This book explores visualization and imputation techniques for missing values and presents practical applications using the statistical software R. It explains the concepts of common imputation methods with a focus on visualization, description of data problems, and practical solutions in R, including modern methods of robust imputation, imputation based on deep learning, and imputation for complex data. By describing the advantages, disadvantages, and pitfalls of each method, the book presents a clear picture of which imputation methods are applicable to a given data set. The material covered includes the pre-analysis of data; visualization of missing values in incomplete data; single and multiple imputation; deductive imputation and outlier replacement; model-based methods, including methods based on robust estimates; non-linear methods such as tree-based and deep learning methods; imputation of compositional data; evaluation of imputation quality, from visual diagnostics to precision measures, coverage rates, and prediction performance; and a description of different model- and design-based simulation designs for the evaluation. The book also features a topic-focused introduction to R, and R code is provided in each chapter to explain the practical application of the described methodology. Addressed to researchers, practitioners, and students who work with incomplete data, the book offers an introduction to the subject as well as a discussion of recent developments in the field. It is suitable for beginners to the topic and advanced readers alike.
02 - Monograph
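As a taster of the book's theme, here is a minimal sketch using the VIM package (which Templ co-develops and which implements the kind of missing-value visualization and imputation the book covers); the sleep data set ships with the package.

```r
# Visualize missingness patterns, then impute with k-nearest neighbours.
library(VIM)
data(sleep, package = "VIM")
aggr(sleep, numbers = TRUE)   # per-variable missing counts and joint patterns
imp <- kNN(sleep)             # imputed data plus TRUE/FALSE indicator columns
head(imp)
```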
Publication: Enhancing precision in large-scale data analysis: an innovative robust imputation algorithm for managing outliers and missing values (MDPI, 2023)
Templ, Matthias
Navigating the intricate world of data analytics, one method has emerged as a key tool in confronting missing data: multiple imputation. Its strength is further fortified by its powerful variant, robust imputation, which enhances the precision and reliability of the results. In the challenging landscape of data analysis, non-robust methods can be swayed by a few extreme outliers, leading to skewed imputations and biased estimates. This applies to both representative outliers (true but unusual values from the population) and non-representative outliers (mere measurement errors). Detecting these outliers in large or high-dimensional data sets often becomes as complex as unraveling a Gordian knot. The solution? Turn to robust imputation methods. Robust (imputation) methods effectively manage outliers and exhibit remarkable resistance to their influence, providing a more reliable approach to dealing with missing data. Moreover, these robust methods offer flexibility, remaining effective even if the imputation model used is not a perfect fit. They are akin to a well-designed buffer system, absorbing slight deviations without compromising overall stability. In the latest advancement of statistical methodology, a new robust imputation algorithm has been introduced. This innovative solution addresses three significant challenges with robustness: it utilizes robust bootstrapping to manage model uncertainty during the imputation of a random sample; it incorporates robust fitting to reinforce accuracy; and it takes imputation uncertainty into account in a resilient manner. Furthermore, any complex regression or classification model for any variable with missing data can be run through the algorithm. With this new algorithm, we move one step closer to optimizing the accuracy and reliability of handling missing data. Using a realistic data set and a simulation study including a sensitivity analysis, the new algorithm, imputeRobust, shows excellent performance compared with other common methods. Effectiveness was demonstrated by measures of precision for the prediction error, coverage rates, and mean square errors of the estimators, as well as by visual comparisons.
01A - Article in scientific journal

Publication: A new version of the Langelier-Ludwig square diagram under a compositional perspective (Elsevier, 2022)
Templ, Matthias; Gozzi, Caterina; Buccianti, Antonella
01A - Article in scientific journal

Publication: A systematic overview on methods to protect sensitive data provided for various analyses (Springer, 2022)
Templ, Matthias; Sariyar, Murat
01A - Article in scientific journal

Publication: Statistical analysis of chemical element compositions in food science: problems and possibilities (MDPI, 2022)
Templ, Matthias; Templ, Barbara
In recent years, many analyses have been carried out to investigate the chemical components of food data. However, studies rarely consider the compositional pitfalls of such analyses. This is problematic, as it may lead to arbitrary results when non-compositional statistical analysis is applied to compositional data sets. In this study, compositional data analysis (CoDa), which is widely used in other research fields, is compared with classical statistical analysis to demonstrate how the results vary depending on the approach and to show the best possible statistical analysis. For example, honey and saffron are highly susceptible to adulteration and imitation, so the determination of their chemical elements requires the best possible statistical analysis. Our study demonstrated how principal component analysis (PCA) and classification results are influenced by the pre-processing steps conducted on the raw data and by the replacement strategies for missing values and non-detects. Furthermore, it demonstrated the differences in results when compositional and non-compositional methods were applied. Our results suggested that log-ratio analysis provided better separation between the pure and adulterated data, allowed for easier interpretability of the results, and gave a higher accuracy of classification. Similarly, it showed that classification with artificial neural networks (ANNs) works poorly if the CoDa pre-processing steps are left out. From these results, we advise the application of CoDa methods for analyses of the chemical elements of food and for the characterization and authentication of food products.
01A - Article in scientific journal
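To illustrate the CoDa pre-processing step argued for in the food-science entry directly above, here is a minimal base-R sketch of the centred log-ratio (clr) transformation applied before PCA; the compositions and element names below are made up.

```r
# clr-transform compositions before PCA (toy data, base R only).
clr <- function(x) log(x) - rowMeans(log(x))   # centred log-ratio per sample
set.seed(1)
comp <- matrix(runif(300, 0.01, 1), ncol = 3,
               dimnames = list(NULL, c("Ca", "Mg", "K")))  # hypothetical elements
comp <- comp / rowSums(comp)                   # close each row to a composition
pca <- prcomp(clr(comp))                       # PCA on clr coordinates
summary(pca)
```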
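Returning to the imputeRobust entry above: the package's actual interface is not reproduced here, but the core idea, replacing least-squares fits with robust MM-regression inside an iterative imputation loop, can be sketched roughly as follows. The helper name, defaults, and restriction to numeric variables are all assumptions of this sketch, not the published algorithm, which additionally handles robust bootstrapping and imputation uncertainty.

```r
# Illustrative robust chained imputation via MM-regression (MASS::rlm);
# a sketch of the idea only, not the imputeRobust implementation.
library(MASS)
robust_impute <- function(df, iters = 5) {
  filled <- df
  for (v in names(df))                          # median start values
    filled[[v]][is.na(df[[v]])] <- median(df[[v]], na.rm = TRUE)
  for (k in seq_len(iters)) {
    for (v in names(df)) {
      miss <- is.na(df[[v]])
      if (!any(miss)) next
      fit <- rlm(reformulate(setdiff(names(df), v), v),
                 data = filled[!miss, ], method = "MM")  # robust fit on observed
      filled[[v]][miss] <- predict(fit, newdata = filled[miss, , drop = FALSE])
    }
  }
  filled
}

# Example: impute NAs planted in a small numeric data set.
d <- as.data.frame(scale(mtcars[, 1:4]))
d$mpg[c(3, 10, 20)] <- NA
head(robust_impute(d))
```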
Publication: Privacy of study participants in open-access health and demographic surveillance system data. Requirements analysis for data anonymization (JMIR Publications, 2022)
Templ, Matthias; Kanjala, Chifundo; Siems, Inken
Background: Data anonymization and sharing have become popular topics for individuals, organizations, and countries worldwide. Open-access sharing of anonymized data containing sensitive information about individuals makes the most sense whenever the utility of the data can be preserved and the risk of disclosure can be kept below acceptable levels. In this case, researchers can use the data without access restrictions and limitations.
Objective: This study aimed to highlight the requirements and possible solutions for sharing health surveillance event history data. The challenges lie in the anonymization of multiple event dates and time-varying variables.
Methods: A sequential approach that adds noise to event dates is proposed. This approach maintains the event order and preserves the average time between events. In addition, a nosy-neighbor, distance-based matching approach to estimate the risk is proposed. Regarding key variables that change over time, such as educational level or occupation, we make two proposals: one based on limiting the intermediate statuses of the individual, and the other on achieving k-anonymity in subsets of the data. The proposed approaches were applied to the Karonga health and demographic surveillance system (HDSS) core residency data set, which contains longitudinal data from 1995 to the end of 2016 and includes 280,381 events with time-varying socioeconomic variables and demographic information.
Results: An anonymized version of the event history data, including longitudinal information on individuals over time, with high data utility, was created.
Conclusions: The proposed anonymization of event history data comprising static and time-varying variables, applied to HDSS data, led to acceptable disclosure risk and preserved utility, and the data can be shared for public use. High utility was achieved even with the highest level of noise added to the core event dates. The details are important to ensure consistency and credibility. Importantly, the sequential noise-addition approach presented in this study maintains not only the event order recorded in the original data but also the time between events. We proposed an approach that preserves data utility well but limits the number of response categories for the time-varying variables. Furthermore, using distance-based neighborhood matching, we simulated an attack under a nosy-neighbor situation and a worst-case scenario in which attackers have full information on the original data. We showed that the disclosure risk is very low, even when assuming that the attacker's database and information are optimal. The HDSS and medical science research communities in low- and middle-income country settings will be the primary beneficiaries of the results and methods presented in this paper; however, the results will be useful for anyone anonymizing longitudinal event history data with time-varying variables for the purpose of sharing.
01A - Article in scientific journal

Publication: Artificial neural networks to impute rounded zeros in compositional data (Springer, 2021)
Templ, Matthias; Filzmoser, Peter; Hron, Karel; Martín-Fernández, Josep Antoni; Palarea-Albaladejo, Javier
04A - Contribution to edited volume
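Finally, a toy sketch of the order-preserving noise idea from the Methods of the JMIR entry above (an illustration of the idea, not the paper's exact algorithm): perturb the gaps between an individual's consecutive events with zero-mean noise, truncated so gaps stay non-negative, then rebuild the dates cumulatively. The event order is preserved by construction, and the average gap is approximately preserved because the noise is centered at zero; the dates and noise scale below are made up.

```r
# Order-preserving perturbation of event dates (illustrative only).
noisy_events <- function(dates, sd_days = 30) {
  d <- sort(dates)
  gaps <- diff(as.numeric(d))                            # days between events
  gaps <- pmax(0, gaps + round(rnorm(length(gaps), 0, sd_days)))
  start <- d[1] + round(rnorm(1, 0, sd_days))            # also shift the first event
  start + c(0, cumsum(gaps))                             # rebuild: order is kept
}

events <- as.Date(c("1995-03-01", "1998-07-15", "2004-01-02", "2016-11-30"))
set.seed(1)
noisy_events(events)
```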