Chapter 01 Introduction Dr. Steffen Herbold
Introduction to Data Science https://sherbold.github.io/intro-to-data-science
Outline
- Introduction to Big Data
- Data Science and Business Intelligence
- The Skillset of Data Scientists
- Summary
What is "Big Data"?
- Is this really about size?
Naive Definition
- Naive definition: big data depends only on the data size
  - 1 gigabyte? 1 terabyte? 1 petabyte?
- The naive interpretation misses important aspects
  - Time: analyzing 1 gigabyte of data per day is different from analyzing 1 gigabyte of data per second
  - Diversity: analyzing spreadsheets with numeric data is different from analyzing web pages that contain a mixture of text and images
  - Distribution: analyzing data from a single source is different from analyzing data from multiple sources
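The time aspect can be made concrete with a little arithmetic. The following is a minimal sketch, assuming decimal gigabytes (10^9 bytes); the function name `required_throughput` is made up for this illustration.

```python
# Illustrative only: the same 1 GB of data at two different arrival rates.
GIGABYTE = 10**9  # bytes, decimal convention (assumption)
SECONDS_PER_DAY = 24 * 60 * 60


def required_throughput(bytes_per_batch: int, seconds_per_batch: float) -> float:
    """Sustained bytes per second needed to keep up with the incoming data."""
    return bytes_per_batch / seconds_per_batch


per_day = required_throughput(GIGABYTE, SECONDS_PER_DAY)
per_second = required_throughput(GIGABYTE, 1)

print(f"1 GB per day:    {per_day / 1e3:.1f} kB/s")
print(f"1 GB per second: {per_second / 1e6:.1f} MB/s")
print(f"ratio:           {per_second / per_day:,.0f}x")
```

The data size is identical in both cases, yet the required processing rate differs by the number of seconds in a day.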
Definition of Big Data
- Following Gartner's IT Glossary:
  - Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.
- The three Vs
  - Volume
  - Velocity
  - Variety
- Some people actually use 10 Vs to define big data!
  - Variability
  - Veracity
  - Validity
  - Vulnerability
  - Volatility
  - Visualization
  - Value
The 3 Vs: Volume
- The scale of the data must be "big"
- No clear definition
  - "that demand [...] innovative forms of information processing" (Gartner)
- Figure: data center storage worldwide (source: Statista 2018)
The 3 Vs: Velocity
- Speed at which new data is created
- Speed at which data must be processed and analyzed
  - Often close to real-time
The 3 Vs: Variety
- Diversity in data types and data sources
- Structured: data with defined types and structure
  - Example: comma-separated values
- Semi-structured: textual data with a parseable pattern
  - Example: XML files with schema
- Quasi-structured: textual data with erratic formats that can be formatted with effort
  - Example: clickstream data
- Unstructured: data that has no inherent structure, often with multiple formats
  - Example: web sites, videos
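The difference between the four categories shows up in how much work it takes to get the data into a usable form. The following is a rough sketch using only the Python standard library; all sample data, field names, and the clickstream line format are invented for illustration.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: CSV with a fixed set of columns and known types.
csv_text = "id,price\n1,9.99\n2,14.50\n"
rows = [
    {"id": int(r["id"]), "price": float(r["price"])}
    for r in csv.DictReader(io.StringIO(csv_text))
]

# Semi-structured: XML describes its own structure and can be parsed directly.
xml_text = "<orders><order id='1'><price>9.99</price></order></orders>"
root = ET.fromstring(xml_text)
orders = [
    {"id": int(o.get("id")), "price": float(o.findtext("price"))}
    for o in root.iter("order")
]

# Quasi-structured: a made-up clickstream line; a hand-written pattern is needed.
click_line = "2018-10-01T12:00:03 user=42 GET /shop/item?id=1&ref=mail"
pattern = re.compile(
    r"(?P<time>\S+) user=(?P<user>\d+) (?P<method>\w+) (?P<url>\S+)"
)
match = pattern.match(click_line)
click = match.groupdict() if match else None

# Unstructured: free text has no fields to extract; here we only count words.
page_text = "Great product, arrived quickly. Would buy again!"
word_count = len(page_text.split())

print(rows, orders, click, word_count, sep="\n")
```

The CSV and XML cases need only a standard parser, the clickstream line needs a pattern that breaks as soon as the format changes, and the free text offers nothing to parse at all.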
Examples for data types
- Figure: examples of structured, semi-structured, quasi-structured, and unstructured data
Defining Data Science
- Unfortunately, there is no clear definition (yet?)
- The goal is the extraction of knowledge from data
- Combination of techniques from different disciplines
- Scientific principles guide the data analysis
What is "Data Science"?
- Tools?
- Big Data?
- Machine Learning?
Mathematical Aspects
- Computational Geometry
- Optimization
- Scientific Computing
- Stochastics
- Machine Learning
Computer Science Aspects
- Data Structures and Algorithms
- Software Engineering
- Databases
- Artificial Intelligence
- Distributed Computing
- Machine Learning
Statistical Aspects
- Linear Models
- Statistical Tests
- Time Series Analysis
- Inference
- Machine Learning
Applications
- Intelligent Systems
- Robotics
- Marketing
- Medicine
- Autonomous Driving
- Social Networks
Data Science vs. Business Intelligence
- Business Intelligence (Gartner IT Glossary)
  - "[...] best practices that enable access to and analysis of information to improve and optimize decisions and performance."

                    Business Intelligence          Data Science
  Techniques        Dashboards, alerts, queries    Optimization, predictive modelling, forecasting
  Data Types        Structured, data warehouses    Any kind, often unstructured
  Common questions  What happened ...?             What if ...?
                    How much did ...?              What will ...?
                    When did ...?                  How can we ...?

- Figure: Business Intelligence and Data Science placed by depth of insights (low to high) over time (past, present, future)
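The contrast between the two kinds of questions can be sketched on a toy data set. The sales figures below are invented for illustration, and the straight-line fit is just one simple stand-in for predictive modelling, not a recommended forecasting method.

```python
# Hypothetical monthly sales figures, invented for illustration.
sales = [100.0, 110.0, 120.0, 130.0]
months = list(range(len(sales)))  # 0, 1, 2, 3

# Business intelligence style: "How much did we sell?" (describes the past)
total_sold = sum(sales)

# Data science style: "What will we sell next month?" (predicts the future)
# Ordinary least squares fit of: sales = intercept + slope * month
n = len(sales)
mean_x = sum(months) / n
mean_y = sum(sales) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales)) / sum(
    (x - mean_x) ** 2 for x in months
)
intercept = mean_y - slope * mean_x
forecast_next = intercept + slope * n  # month index n is the next, unseen month

print(f"Sold so far:         {total_sold:.0f}")
print(f"Forecast next month: {forecast_next:.0f}")
```

The first answer is a query over existing data; the second depends on a model and its assumptions, here that the linear trend continues.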
More Data More Opportunities
- Figure: volume of information (small to large; terabytes, petabytes, exabytes) over time
  - 1990s: relational databases and data warehouses
  - 2000s: content management
  - 2010s: key-value storages and unstructured data
What are Data Scientists?
- Not computer scientists
  - But should know about databases, data structures, algorithms, etc.
- Not mathematicians
  - But should know about optimization, stochastics, etc.
- Not statisticians
  - But should know about regression, statistical tests, etc.
- Not domain experts
  - But must work together with them
Skills of Data Scientists
- Quantitative
  - Maths
  - Algorithms
  - Statistics
- Collaborative
  - Teamwork
  - Communication skills
- Technical
  - Programming
  - Infrastructures
- Skeptical
  - Create hypotheses, but be skeptical about them
- Data scientists need a bit of everything, but actually as much as possible of everything
Different types of Data Scientists
- According to Microsoft Research:
  - Polymath: "do it all"
  - Data Evangelist: data analysis, disseminating and acting on insights
  - Data Preparer: querying existing data, preparing data for analysis
  - Data Shapers: analyzing and preparing data
  - Data Analyzer: analyzing data
  - Platform Builder: collect data and create infrastructures
  - Moonlighters (50%/20%): "spare time" data scientists
  - Insight Actors: use the outcome and act on insights
- Source: Miryung Kim, Thomas Zimmermann, Robert DeLine, Andrew Begel: Data Scientists in Software Teams: State of the Art and Challenges, IEEE Transactions on Software Engineering (Online First)
Summary
- Big data has a high volume, velocity, and variety
- Different data structures
  - Structured, semi-structured, quasi-structured, unstructured
- Data science is a very diverse discipline
  - Maths, computer science, statistics, applications
- Data scientists require a diverse skillset