Data Quality Case Study Prepared by ORC Macro
Data Correction Background – Data Correction Tracking system SAS AF query application Guidelines – Profile Analysis SSNs Names 2
Profile Analysis—SSNs Persons: n = 346,381 PPRFs – Missing: n = 96,097 (27%) – Valid-looking SSNs: n = 234,311 (68%) – Invalid: n = 15,973 (5%) – Shared SSNs: n = 7,100 – Repeated SSNs: n = 3,406 3
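The three-way split on this slide (missing, valid-looking, invalid) could be produced by a simple classifier. The sketch below is illustrative only: the slides do not say what made an SSN "valid-looking", so the rules used here (nine digits; area not 000, 666, or 900-999; group not 00; serial not 0000) are assumptions, and `classify_ssn` and `profile` are hypothetical names.

```python
import re
from collections import Counter

def classify_ssn(raw):
    """Return 'missing', 'invalid', or 'valid-looking' for one SSN field.

    Assumed rules (not taken from the slides): nine digits after removing
    hyphens and spaces; area not 000, 666, or 900-999; group not 00;
    serial not 0000.
    """
    if raw is None or not raw.strip():
        return "missing"
    digits = re.sub(r"[\s-]", "", raw)
    if not re.fullmatch(r"[0-9]{9}", digits):
        return "invalid"
    area, group, serial = digits[:3], digits[3:5], digits[5:]
    if area in ("000", "666") or area >= "900":
        return "invalid"
    if group == "00" or serial == "0000":
        return "invalid"
    return "valid-looking"

def profile(ssns):
    """Map each category to (count, rounded percentage of all records)."""
    counts = Counter(classify_ssn(s) for s in ssns)
    total = sum(counts.values())
    return {k: (n, round(100 * n / total)) for k, n in counts.items()}
```

A check of this kind only tests the form of the number; it cannot confirm that an SSN belongs to the person on the record, which is why the slide says "valid-looking".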
Profile Analysis—SSNs Shared SSNs (n = 7,100) – Different Names: 27% – Same or Similar Names: 73% – Candidates for Collapse – Candidates for Correction 4
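One way to separate shared SSNs into the two name buckets on this slide is to group profiles by SSN and compare the names within each group. This is a minimal sketch, not the method used in the study: the similarity measure (`difflib.SequenceMatcher`), the 0.85 threshold, and the function names are all assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    """True when two names are the same or close (threshold is assumed)."""
    a, b = a.strip().lower(), b.strip().lower()
    return SequenceMatcher(None, a, b).ratio() >= threshold

def triage_shared_ssns(profiles):
    """profiles: iterable of (person_id, name, ssn) tuples.

    Returns {ssn: 'same_or_similar_names' | 'different_names'} for every
    SSN that appears on more than one profile.
    """
    by_ssn = defaultdict(list)
    for person_id, name, ssn in profiles:
        if ssn:
            by_ssn[ssn].append(name)
    result = {}
    for ssn, names in by_ssn.items():
        if len(names) < 2:
            continue  # not shared
        first = names[0]
        all_similar = all(similar(first, other) for other in names[1:])
        result[ssn] = "same_or_similar_names" if all_similar else "different_names"
    return result
```

Comparing every name against the first one in the group keeps the sketch short; a fuller version would compare all pairs.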
Profile Analysis—Names Persons: n = 346,381 – Unique Names: n = 232,172 – Repeated Names: n = 114,209 (Name Groups: n = 30,447; Individual Persons: n = 34,909; Possible Duplicates: n = 79,300) – Unique Persons: 77%, n = 267,081 – Possible Duplicates: 23%, n = 79,300 5
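The unique-name, repeated-name, and name-group counts on this slide can be illustrated by grouping profiles on a normalized name. The sketch below assumes a simple normalization rule (lowercase, collapse whitespace, drop periods) that the slides do not specify; `normalize` and `name_profile` are hypothetical names.

```python
from collections import Counter

def normalize(name):
    """Lowercase, drop periods, and collapse whitespace (an assumed rule)."""
    return " ".join(name.replace(".", "").lower().split())

def name_profile(names):
    """Count profiles with unique names, profiles with repeated names,
    and the number of distinct repeated names (name groups)."""
    counts = Counter(normalize(n) for n in names)
    unique = sum(1 for c in counts.values() if c == 1)
    repeated = sum(c for c in counts.values() if c > 1)
    groups = sum(1 for c in counts.values() if c > 1)
    return {"unique_names": unique, "repeated_names": repeated, "name_groups": groups}
```

The slide's own figures reconcile in the same way: 232,172 unique names plus 114,209 repeated names give the 346,381 total, and 232,172 + 34,909 = 267,081 unique persons.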
Profile Analysis—Names Name Groups: n = 30,447 – Sequential/Multiple Profiles: n = 20,375 (Contracts: n = 18,650, 91%; Other: n = 1,725, 9%) – 114,209 Profiles – Invalid/Missing SSNs: n = 83,521 – Apparent Valid SSNs: n = 30,668 (Shared SSNs: n = 2,092; Typo/Data Entry: n = 3,622; Unique SSNs: n = 24,954) 6
OLTP—Commons Cases Definition Statistics Status 7
Data Correction Identifying the extent of the problem Investigating based on type of error Validating the investigation Implementing the change Tracking the identification, investigation, validation, and implementation 8
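The stages listed on this slide can be sketched as an ordered record that a tracking system might keep for each correction. The class and field names below are hypothetical and are not drawn from the actual Data Correction Tracking system; the only thing taken from the slide is the order of the four stages.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Stage order follows the slide: identify, investigate, validate, implement.
STAGES = ("identified", "investigated", "validated", "implemented")

@dataclass
class CorrectionCase:
    person_id: int
    problem: str
    history: list = field(default_factory=list)  # (stage, note, timestamp)

    @property
    def stage(self):
        """Most recent stage reached, or None if nothing is recorded yet."""
        return self.history[-1][0] if self.history else None

    def advance(self, note):
        """Record the next stage in order; stages cannot be skipped."""
        done = len(self.history)
        if done == len(STAGES):
            raise ValueError("case is already implemented")
        self.history.append((STAGES[done], note, datetime.now(timezone.utc)))
        return STAGES[done]
```

Keeping the full history, rather than only the current stage, is what makes the tracking step possible: each entry records what was done and when.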
Data Correction—An Example PERSON ID 3070908—PPRF record Identification of problem – Two different middle initials found Investigation of problem – TA module – Scripts run Validation of information – Name, SSN, degree(s), grant(s) – Sources 9
Data Correction—An Example PERSON ID 3070908—PPRF record Implementation of correction – Grants report submitted to NIH OD Tracking of correction – Internal tracking system Post-correction – Loss of control of data 10
Developing a Data Quality Business Plan
Focus of Our Activities Examination of the Database, Procedures, and Interface Development of Modified Use Cases Unified Modeling Language Identification and Extraction of Business Rules Identification of Business Model 12
Data Quality Issues Type-over of information Generation of duplicate persons Collapsing Changes in degree and address data Generation of orphans 13
Type-Over Practices Intentions: – Assign a new principal investigator (PI) to a grant – Change the name of a PI on a grant – Correct a misspelled name Consequences: – Inclusion of incorrect information in a person profile – Absence of linkages between PIs and grant applications – Creation of false linkages between PIs and grant applications 14
Factors Affecting Quality Relatively easy access to person-related data elements Lack of self-validation routines Interface issues 15
Solutions Restricted access Quality control validation Interface simplification Self-validation algorithm 16
Data Quality Validation Who does it? – ICs – A Quality Assurance group – Other How is it done? – Staging areas – Manual and intelligent filtering – Architecture 17
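A staging area with filtering, as listed on this slide, could work along these lines: proposed changes are checked before they reach the production tables, and anything that fails a check is held for manual review. This is a sketch under assumptions; the record layout, the two example checks, and the names `stage_changes` and `EXAMPLE_CHECKS` are hypothetical.

```python
def stage_changes(changes, checks):
    """Split proposed changes into accepted and held-for-review lists.

    changes: list of dicts; checks: list of (label, predicate) pairs.
    A change is held when any predicate returns False, together with
    the labels of the checks it failed.
    """
    accepted, held = [], []
    for change in changes:
        failed = [label for label, ok in checks if not ok(change)]
        if failed:
            held.append((change, failed))
        else:
            accepted.append(change)
    return accepted, held

# Example checks (assumed, for illustration only).
EXAMPLE_CHECKS = [
    ("name present", lambda c: bool(c.get("name", "").strip())),
    ("ssn is blank or nine digits",
     lambda c: c.get("ssn", "") == "" or (c["ssn"].isdigit() and len(c["ssn"]) == 9)),
]
```

The held list is where manual filtering would take over; the automated checks only decide what a person needs to look at.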
GM Module Screen GM1040 18
GM Module Screen COM1100 19
Self Validation Name-matching algorithm Consistency checking 20
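The two self-validation ideas on this slide can be sketched briefly. The slides do not describe the name-matching algorithm itself, so the rule below (last and first name must agree; a differing or absent middle initial gives a "possible" match for review) is an assumption, as are the "Last, First M" name format, the field names, and the function names.

```python
def parse_name(name):
    """Split 'Last, First M' into (last, first, middle_initial), lowercased."""
    last, _, rest = name.partition(",")
    parts = rest.split()
    first = parts[0] if parts else ""
    middle = parts[1][0] if len(parts) > 1 else ""
    return last.strip().lower(), first.lower(), middle.lower()

def match_names(a, b):
    """Return 'match', 'possible', or 'no match'.

    'possible' means last and first names agree but the middle initials
    differ or one is absent, so a person should review the pair.
    """
    la, fa, ma = parse_name(a)
    lb, fb, mb = parse_name(b)
    if (la, fa) != (lb, fb):
        return "no match"
    return "match" if ma == mb else "possible"

def inconsistent_fields(rec_a, rec_b, fields=("ssn", "degree")):
    """List fields where both records have a value and the values differ."""
    return [f for f in fields
            if rec_a.get(f) and rec_b.get(f) and rec_a[f] != rec_b[f]]
```

Name matching proposes candidate pairs; consistency checking on other fields then helps decide whether a pair is one person or two.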
Higher-Level Analysis The following are being examined relative to their effect on quality: Commons interface with IMPAC II Database redundancy Business rules in the database Master person file Front-end design Human factors Ownership 21
Development of a Data Quality Model
Major Goals Quality improvements plan for personal identifiers Evaluate the different identification algorithms currently in use for IMPAC II Develop identification algorithm(s) and procedures Serve as consultant and guarantor of efficacy of algorithm implementation 23
Moving Forward Understanding the technical infrastructure Identification of specific areas of concern Development/proposal of data quality expectations Development/proposal of appropriate, acceptable solutions 24
Data Quality White Paper Knowledge assets are very real and carry tremendous value. Outline Definition Rules Risks and Costs NIH Expectations Process Measurements/Metrics Testing Continuous Improvements Conclusions 25
Conclusion Examination of the Database, Procedures, and Interface Development of Modified Use Cases Unified Modeling Language Identification and Extraction of Business Rules Identification of Business Model Development/Proposal of Appropriate, Acceptable Solutions Development/Proposal of Data Quality Expectations Identification of Specific Areas of Concern Understanding the Technical Infrastructure 26