Data Cleaning by Gio
Basic Principle: Garbage In, Garbage Out
- Garbage In, Garbage Out (GIGO) is the principle that flawed components of a data set can invalidate the practical use of that data set in data science or machine learning.
- Data cleaning is the act of removing flawed or irrelevant parts of the data so that what remains is better suited to a particular goal, typically data science or machine learning.
NYC Taxi Data Set
- Data dictionary
- CSV and Data Frames
- Resources & Links
Preliminary Inspection
- Statistical breakdown
- NaN counts
- Visual inspection
- Notice: Non-Linear & Geographic columns
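
The three inspection steps above can be sketched with pandas. This is a minimal sketch, not the deck's own code: the small data frame is a toy stand-in for the taxi CSV, and its column names and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the taxi CSV; column names and values are illustrative only.
df = pd.DataFrame({
    "trip_distance": [1.2, 3.4, np.nan, 0.8],
    "fare_amount": [6.5, 14.0, 9.0, np.nan],
    "store_and_fwd_flag": ["N", "Y", "N", None],
})

# Statistical breakdown: count, mean, std, min, quartiles, max (numeric columns only).
summary = df.describe()

# NaN counts: number of missing values per column.
nan_counts = df.isna().sum()

# Visual inspection: look at the first few rows.
preview = df.head(3)

print(summary)
print(nan_counts)
print(preview)
```

With a real file, the data frame would typically come from `pd.read_csv` instead of being built by hand.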
Non-Linear Column Value Engineering
- Categorical columns
- Binary and One-Hot Encoding
- Timestamping dates
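
A minimal sketch of the three transformations named above, assuming pandas. The column names and values are hypothetical, and converting dates to seconds since the Unix epoch is one possible reading of "timestamping", not necessarily the one used in the talk.

```python
import pandas as pd

# Hypothetical sample rows; not taken from the actual data set.
df = pd.DataFrame({
    "store_and_fwd_flag": ["N", "Y", "N"],
    "payment_type": ["card", "cash", "card"],
    "pickup_datetime": [
        "2016-01-01 00:15:00",
        "2016-01-02 13:30:00",
        "2016-01-03 08:05:00",
    ],
})

# Binary encoding: a two-valued column becomes 0/1.
df["store_and_fwd_flag"] = df["store_and_fwd_flag"].map({"N": 0, "Y": 1})

# One-hot encoding: one indicator column per category.
df = pd.get_dummies(df, columns=["payment_type"], dtype=int)

# Timestamping: parse the date strings, then convert to seconds since the Unix epoch.
parsed = pd.to_datetime(df["pickup_datetime"])
df["pickup_epoch_s"] = (parsed - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
df["pickup_hour"] = parsed.dt.hour
df = df.drop(columns=["pickup_datetime"])

print(df)
```

After these steps every column is numeric, which is what most scaling and modelling steps expect.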
Geographic Value Engineering
- Geo-Encoding API and GeoPandas
- Shape files
- European Petroleum Survey Group (EPSG)
- Frequency measurements
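
One possible reading of "frequency measurements" is to replace each geographic label with how often it occurs. The sketch below assumes the coordinates have already been mapped to zone labels (for example with a shape file); the zone names and the `pickup_zone` column are hypothetical.

```python
import pandas as pd

# Hypothetical zone labels, as if each pickup coordinate had already been mapped to a zone.
df = pd.DataFrame({"pickup_zone": ["Midtown", "SoHo", "Midtown", "Harlem", "Midtown"]})

# Relative frequency of each zone across all rows.
zone_freq = df["pickup_zone"].value_counts(normalize=True)

# Replace the label with its frequency so the column becomes numeric.
df["pickup_zone_freq"] = df["pickup_zone"].map(zone_freq)

print(df)
```

This keeps a single numeric column per geographic feature instead of one indicator column per zone.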
Middle Data Inspection
- Statistical breakdown
- NaN counts
- Random visual inspection
- Notice: NaN & total number of columns
Replace Not-a-Number (NaN) Values
- Approaches: drop rows, statistical replacement, etc.
- NaN distribution and random value generation
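
The approaches listed above can be compared side by side. This is a sketch under assumptions: the single toy column is illustrative, the median is just one choice of statistic, and sampling from the observed values is one simple way to generate random replacements that follow the column's distribution.

```python
import numpy as np
import pandas as pd

# Toy column with two missing values; values are illustrative only.
df = pd.DataFrame({"trip_distance": [1.0, np.nan, 3.0, np.nan, 5.0]})

# Approach 1: drop every row that contains a NaN.
dropped = df.dropna()

# Approach 2: statistical replacement, here with the column median.
median_filled = df.fillna({"trip_distance": df["trip_distance"].median()})

# Approach 3: draw replacements at random from the observed (non-NaN) values,
# which roughly preserves the column's empirical distribution.
rng = np.random.default_rng(seed=0)
observed = df["trip_distance"].dropna().to_numpy()
missing = df["trip_distance"].isna()
random_filled = df.copy()
random_filled.loc[missing, "trip_distance"] = rng.choice(observed, size=int(missing.sum()))

print(len(dropped), "rows remain after dropping")
print(median_filled)
print(random_filled)
```

Dropping rows is the simplest option but loses data; the two fill strategies keep every row at the cost of introducing estimated values.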
Feature Scaling, PCA & Correlations
- Feature scaling normalizes the range of each variable so that values measured on different scales become comparable.
- Principal Component Analysis (PCA) projects a data set onto its principal components, the directions of greatest variance, to reduce its dimensionality (number of columns).
- Correlations are statistical relationships between variables; a strong correlation can indicate a dependency between them, though it does not establish one by itself.
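
To make the PCA definition concrete, here is a minimal sketch using only NumPy's singular value decomposition. The data is synthetic, generated for this example, and a library implementation such as scikit-learn's `PCA` would normally be used instead.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic data: 200 rows, 3 columns, where the third column nearly duplicates the first.
x = rng.normal(size=(200, 2))
data = np.column_stack([x[:, 0], x[:, 1], x[:, 0] + 0.01 * rng.normal(size=200)])

# Center each column, then take the SVD; the rows of vt are the principal directions.
centered = data - data.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# Share of the total variance carried by each component.
explained = singular_values**2 / np.sum(singular_values**2)

# Keep the first k components to reduce 3 columns to 2.
k = 2
reduced = centered @ vt[:k].T

print(explained)
print(reduced.shape)
```

Because the third column is almost redundant, the last component carries very little variance and can be dropped with little loss of information.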
Feature Scaling, PCA & Correlations
- Rescaling (Min-Max Normalization)
- Standardization
- Correlation Matrices, Tables and Lists
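
The two scaling methods and the correlation outputs above can be sketched in a few lines of pandas. The data frame is a toy example with hypothetical values, and the "list" form shown here (each column pair once, sorted by strength) is an assumption about what the slide means.

```python
import pandas as pd

# Toy numeric columns; values are illustrative only.
df = pd.DataFrame({
    "trip_distance": [1.0, 2.0, 3.0, 4.0],
    "fare_amount": [5.0, 9.0, 13.0, 17.0],
    "passenger_count": [1.0, 3.0, 2.0, 1.0],
})

# Rescaling (min-max normalization): x' = (x - min) / (max - min), giving values in [0, 1].
rescaled = (df - df.min()) / (df.max() - df.min())

# Standardization: z = (x - mean) / std, giving mean 0 and unit standard deviation.
# ddof=0 uses the population standard deviation.
standardized = (df - df.mean()) / df.std(ddof=0)

# Correlation matrix (Pearson by default).
corr_matrix = df.corr()

# Correlation list: each column pair once, sorted by absolute strength.
pairs = corr_matrix.stack()
pairs = pairs[pairs.index.get_level_values(0) < pairs.index.get_level_values(1)]
corr_list = pairs.sort_values(key=abs, ascending=False)

print(rescaled)
print(standardized)
print(corr_matrix)
print(corr_list)
```

Highly correlated pairs near the top of the list are candidates for removal or for combining through PCA.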
Final Data Inspection
- Compare with original data set
- Statistical breakdown
- Random visual inspection
Thank you!