Official Microdata-Related Project
Microdata Security Project
Project Summary, Anticipated Outcomes, and Goals
This project establishes data suppression techniques that prevent the disclosure of confidential information about survey participants and enable the safe use of public survey microdata in research. In particular, we are developing cell suppression algorithms for tabular data such as frequency tables and summary tables, which are representative descriptive statistics used in official statistics.
The project will establish data suppression techniques for the secure use of microdata and provide tools that implement the proposed method. It will also make safety verification of outputs for academic research use more efficient.
In recent years, Japan has been promoting the secondary use of questionnaire information, and an on-site use system has been fully operational since 2019. This system allows researchers to conduct exploratory analysis of official microdata from terminals at on-site facilities, provided the analysis is for academic research purposes. However, microdata contains confidential information about survey participants, and researchers must take the necessary steps to prevent its disclosure when publishing analysis results in academic papers. The development of data suppression techniques for producing safe analysis results is therefore an extremely important issue.
Project Research and Development Content
In this study, we devised an algorithm that performs cell suppression on tabular data, namely the frequency tables and summary tables that are representative descriptive statistics used in official statistics. We also developed a suppression tool that supports both researchers taking results out of the on-site system and the checking of those outputs.
When the frequency of a cell in tabular data is low, an attacker with external knowledge may be able to identify the survey subjects contained in that cell and infer sensitive information about them. Such cells must therefore be suppressed (primary suppression). In tabular data, suppressing these cells alone is not enough: the suppressed values can easily be recovered from the row and column totals, so additional, nonconfidential cells must also be suppressed (secondary suppression). Selecting these secondary cells appropriately while preserving the usability of the table is a difficult problem.
We formulated this problem as an integer program whose objective is to minimize the number of suppressed cells, and implemented an efficient algorithm in R based on Benders decomposition. The tool also includes an explanation feature that verifies the safety of the suppressed table, which makes output checking for on-site use more efficient.
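The recovery risk and the minimization objective described above can be illustrated with a small Python sketch. It is not the project's R implementation and uses neither integer programming nor Benders decomposition; it swaps in exhaustive search, which is feasible only for very small tables. The threshold of 3, the example table, and all function names are assumptions made for illustration. The safety check covers only exact recovery from published row and column totals, not the interval estimates that a full audit would also consider.

```python
from itertools import combinations


def primary_cells(table, threshold=3):
    """Cells with a positive frequency below the threshold (assumed rule)."""
    return {(i, j) for i, row in enumerate(table)
            for j, v in enumerate(row) if 0 < v < threshold}


def recoverable(suppressed):
    """Suppressed cells that the row and column totals determine exactly.

    Treat each suppressed cell (i, j) as an edge between row i and column j.
    The total equations pin a cell down exactly when it lies on no cycle,
    i.e. its row and column are disconnected once the cell itself is removed.
    """
    exposed = set()
    for cell in suppressed:
        others = suppressed - {cell}
        start = ("r", cell[0])
        seen, stack = {start}, [start]
        # Depth-first search over the remaining suppressed cells.
        while stack:
            kind, k = stack.pop()
            for (i, j) in others:
                if kind == "r" and i == k:
                    nxt = ("c", j)
                elif kind == "c" and j == k:
                    nxt = ("r", i)
                else:
                    continue
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if ("c", cell[1]) not in seen:
            exposed.add(cell)
    return exposed


def minimal_secondary(table, threshold=3):
    """Fewest extra cells, found by brute force, that protect every primary cell."""
    primary = primary_cells(table, threshold)
    candidates = [(i, j) for i in range(len(table))
                  for j in range(len(table[0])) if (i, j) not in primary]
    for k in range(len(candidates) + 1):
        for extra in combinations(candidates, k):
            if not (recoverable(primary | set(extra)) & primary):
                return primary, set(extra)
    return primary, None  # no protecting set exists


# Hypothetical 3x3 frequency table; marginal totals are assumed to be published.
table = [
    [20, 15, 1],
    [12, 30, 18],
    [25, 10, 22],
]
primary, extra = minimal_secondary(table)
print("primary:", sorted(primary))
print("recoverable without secondary suppression:", sorted(recoverable(primary)))
print("secondary:", sorted(extra))
```

In this example the single low-frequency cell can be recovered by subtracting the other cells in its row from the row total, and protecting it requires three more suppressed cells that form a rectangle with it. The number of candidate subsets grows exponentially with table size, which is why an optimization formulation is needed for realistic tables.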