Imputing Missing Confidential Data without the use of Multiple Imputation

Research output: Contribution to conference › Poster › peer-review


Abstract

Missing data is a particular challenge in data research. Data can be missing at random (MAR), missing completely at random (MCAR) or missing not at random (MNAR) (1). There are numerous reasons why data might be missing, but in the case of MNAR data, the missingness can be a result of deliberate omission. An example is data that is omitted to protect a patient’s identity as part of the robust safeguards in place for storing and managing patient data. Statisticians and data analysts can usually overcome missing data issues using multiple imputation techniques as suggested by Rubin (2). Unfortunately, these methods are more appropriate for MAR and MCAR data and may give biased estimates in MNAR data scenarios.

Supplementary data was requested from the outputs published in the report on routes to a colorectal cancer (CRC) diagnosis in Northern Ireland (3), to inform the build of a larger model of CRC natural history. Rather than simple age-, stage- or gender-based outputs, we sought a stratified data output with short intervals of mortality, to inform future work on comparative COVID-19 outcome assessments. Unfortunately, the incidence and mortality variables had missing values, so any initial data analysis was difficult to complete. If fewer than 10 individuals were present within a given combination of route, age group, stage of disease and sex, the data custodian omitted the value. Likewise, if fewer than three individuals died at the intervals requested (three, six or twelve months) after first diagnosis, the value was omitted. However, the main report (3) records the number of individuals within the route, sex, age group and stage categories, and comparison against these counts made it possible to determine the missing values using linear algebra methods instead.
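The suppression rules described above, and the simplest way a published total can pin down a withheld cell, can be sketched as follows. This is an illustrative sketch only: the helper names `suppress` and `recover_single` are hypothetical, and the counts are made up for the example rather than taken from the report (3).

```python
from typing import List, Optional, Sequence

INCIDENCE_THRESHOLD = 10  # cells with fewer than 10 individuals are withheld
DEATHS_THRESHOLD = 3      # cells with fewer than 3 deaths are withheld


def suppress(count: int, threshold: int) -> Optional[int]:
    """Return the count, or None when it falls below the disclosure threshold."""
    return count if count >= threshold else None


def recover_single(total: int, cells: Sequence[Optional[int]]) -> List[int]:
    """Fill in one withheld cell by subtracting the released cells from a known total."""
    missing = [i for i, c in enumerate(cells) if c is None]
    if len(missing) != 1:
        raise ValueError("a unique answer needs exactly one withheld cell")
    filled = list(cells)
    filled[missing[0]] = total - sum(c for c in cells if c is not None)
    return filled


# Made-up counts for illustration only.
true_counts = [42, 7, 18]
released = [suppress(c, INCIDENCE_THRESHOLD) for c in true_counts]
recovered = recover_single(sum(true_counts), released)
```

When more than one cell in a group is withheld, a single total no longer determines the values uniquely, which is where a system of several linear equations becomes necessary.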

This analysis aimed to impute the unknown values within the routes to a CRC diagnosis dataset without the use of multiple imputation techniques. It was possible to apply singular value decomposition to a system of linear equations for the CRC incidence variable, and to apply random sampling with specific constraints to the other three variables, which relate to mortality at each time period. Within some routes, it was possible to obtain accurate values that correspond to the outcomes reported in (3); however, some values had to be manually adjusted to prevent oversized estimates and to match the reported outcomes. Given the paucity of work in this area, we acknowledge that this approach does not apply to all MNAR data cases, but it may serve as a guide for exploring new avenues for handling missing data where MNAR data is concerned.
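A minimal sketch of the two ideas named above, under stated assumptions, might look like the following. The 2x2 layout, the totals, and the function names `svd_solve` and `sample_deaths` are hypothetical and chosen only for illustration. The abstract does not specify the sampling constraints, so the ones used here (a withheld death count lies between 0 and 2, cannot exceed the cell's incidence, and cumulative deaths do not decrease from three to six to twelve months) are assumptions, not the authors' stated method.

```python
import numpy as np

# Hypothetical 2x2 block of withheld incidence cells, flattened as
# x = [x11, x12, x21, x22]. The totals are made up for illustration.
A = np.array([
    [1, 1, 0, 0],   # row 1 total
    [0, 0, 1, 1],   # row 2 total
    [1, 0, 1, 0],   # column 1 total
    [0, 1, 0, 1],   # column 2 total
], dtype=float)
b = np.array([8.0, 10.0, 7.0, 11.0])


def svd_solve(A, b):
    """Minimum-norm least-squares solution of A x = b via the SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Treat singular values below a numerical tolerance as zero,
    # because A is rank-deficient (the four totals are not independent).
    tol = max(A.shape) * np.finfo(float).eps * s.max()
    s_inv = np.zeros_like(s)
    keep = s > tol
    s_inv[keep] = 1.0 / s[keep]
    return Vt.T @ (s_inv * (U.T @ b))


# The result reproduces the totals but need not be integer-valued or equal
# to the true cells, so rounding or manual adjustment may still be required.
x_hat = svd_solve(A, b)


def sample_deaths(incidence, rng, max_suppressed=2):
    """Draw cumulative deaths at 3, 6 and 12 months under the assumed constraints."""
    upper = min(max_suppressed, int(incidence))
    draws = rng.integers(0, upper + 1, size=3)  # upper bound is exclusive
    return np.sort(draws)                       # enforce non-decreasing counts


rng = np.random.default_rng(seed=0)
deaths = sample_deaths(int(round(x_hat[0])), rng)
```

The SVD route is used here because the system of totals is rank-deficient, so an ordinary matrix inverse does not exist; the minimum-norm solution is one of many that satisfy the totals.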

Original language: English
Publication status: Published - 16 May 2022
Event: Conference for Applied Statistics in Ireland, 2022 - Garryvoe Hotel, East Cork, Ireland
Duration: 16 May 2022 to 18 May 2022

Conference

Conference: Conference for Applied Statistics in Ireland, 2022
Abbreviated title: CASI 2022
Country/Territory: Ireland
Period: 16/05/2022 to 18/05/2022

Keywords

  • Missing data
  • Survival estimates
  • Cancer data
  • Cancer Screening
  • Modelling and simulation
  • modelling designs
