CMPE 226 Database Systems May 2 Class Meeting Department of
58 Slides2.48 MB
CMPE 226 Database Systems May 2 Class Meeting Department of Computer Engineering San Jose State University Spring 2017 Instructor: Ron Mak www.cs.sjsu.edu/ mak
Remaining Weeks! April 25 May 2 (today) Team project presentations May 17 Data visualization and the Cisco Information Server Data mining May 9 and May 16 NoSQL databases Team projects due May 23 Final exam Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 2
Final Project Put your emphasis on data management. Data models: How you manage the data are you are using. Operational tables Analytical tables Data operations: Queries and updates of the operational tables. How the analytical tables are loaded. Queries of the analytical tables for data analysis. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 3
Final Project, cont’d A user application that invokes the data operations. Web-based or desktop-based. PHP, Java, etc. Fancy GUI or data visualization not necessary. How well did you use the technologies you learned during the semester? RDBM, DW, XML, NoSQL, data virtualization (CIS) Not all technologies have to be used. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 4
Final Project, cont’d Written report What is the application? What data did you use, and where did you get it? Overview of your data models (in words). ER diagram Relational schemas Star schemas Example user actions and screen shots of results. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 5
Team Presentations PowerPoint talk. What is the application? What data did you use? Where did you get it? Overview of your data models (in words). Briefly show and explain the highlights of: ER diagram Relational schemas Star schemas Maximum 15 minutes per team. Demo Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 6
Presentation Schedule May 9 1. 2. 3. 4. 5. 6. ACID Fancy Coders Spartans Team Ultimate Star Schema Infinite DBtarians 7. May 16 1. 2. 3. 4. 5. 6. Data Miners Forkhead SQL Ninjas Team Longshot Team 666 No Brainers Deadlock 7. Teams can trade days! Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 7
Data Virtualization An approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted or where it is physically located. https://en.wikipedia.org/wiki/Data virtualization An agile data integration approach that organizations use to gain more insight from their data, respond faster to ever changing analytics and BI needs and save 50-75% over data replication and consolidation. http://www.compositesw.com/data-virtualization/ Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 8
Data Virtualization, cont’d Differs from traditional ETL. The original data remains in place. A real-time process accesses the data sources. Reduces the risk of data errors. Reduces the workload of extracting, transforming, and loading data that may never be used by an application. Provides applications a “virtual view” of the data. Applications can treat the disparate and heterogeneous data as a single relational database. 9 Computer Engineering Dept. CMPE 226: Database Systems Spring 2017: May 2 R. Mak
Data Virtualization Functionality Abstraction Virtualized data access Abstract the technical aspects of stored data, such as location, storage structure, API, access language, and storage technology. Connect to different data sources and make them accessible from a common logical data access point. Transformation Transform, improve quality, reformat, etc. source data for consumer use. https://en.wikipedia.org/wiki/Data virtualization Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 10
Data Virtualization Functionality, cont’d Data federation Combine result sets from across multiple source systems. Data delivery Publish result sets as views and/or data services executed by client applications or users when requested. https://en.wikipedia.org/wiki/Data virtualization Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 11
Cisco Information Server (CIS) The Cisco Information Server (CIS) forms the core of the Cisco Data Virtualization Platform. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 12 http://www.compositesw.com/data-virtualization/
Install the CIS Studio Studio is the client-side application that connects to the CIS server. As delivered, it runs only on Windows. A student has enabled it to run on Mac and Linux. Download and install Studio. See http://www.cs.sjsu.edu/ mak/CMPE226/CIS7.0.4/ “Cisco Information Server and Studio” contains both the CIS server and the client Studio app. Install only Studio. See the Installation Guide. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 13
CIS Team Accounts Each team has a CIS account. Same team username and password as your MySQL account. On the server, CIS is installed in /opt/Cisco Systems/CIS 7.0 Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 14
Logging into CIS To log in, start the client-side Studio application. Domain: composite Server: 130.65.159.87 Port: 9400 Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 15
Browse Database Sources Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 16
Browse XML Data Sources Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 17
XML Data Database Table First define a transformation that maps an XML file to a relational database table. Then CIS can dynamically load the XML data into the database table. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 18
XML Data Database Table, cont’d Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 19
XML Data Database Table, cont’d Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 20
Create Views by Joining Tables Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 21
Publish the Results “Publish” the transformations and views. Applications can then access them as if they were actual relational database tables. Cisco provides a JDBC driver for Java programs. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 22
CIS and Your Projects Try the demo described in the Getting Started Guide. Can you find ways to use CIS in your project? Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 23
Assignment #10: CIS Upload (part of) your database to 130.65.159.87 and run it on the MySQL database server. Follow the guidelines of the CIS Getting Started Guide. Create a sources folder in Studio under My Home Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 24
Assignment #10: CIS, cont’d In your sources folder, create and introspect a MySQL 5.5 data source and connect it to your database: Datasource name: (you choose) Host: localhost Port: 3306 (the standard MySQL port) Database name: (based on your group name) Login and password Make a screen shot of the introspection window. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 25
Assignment #10: CIS, cont’d Make a screen shot of Studio displaying the column details and the contents of one of your database tables. Such as the ds orders example. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 26
Assignment #10: CIS, cont’d Upload an XML file to the server and make it a data source. Such as the ds XML example. Make Studio screen shots of the XML contents and associated schema. Create an XML data transformation and make a screen shot. Such as the productCatalog xform example Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 27
Assignment #10: CIS, cont’d Create a view by joining two or more of your database tables (virtual or real), and make a screen shot of the table join graphic. Such as the ViewOrder example. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 28
Break Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 29
Data Mining The analysis step of the overall process of “Knowledge Discovery in Databases” (KDD) We are “data rich but information poor”. Make predictions based on current data. Discover (“mine”) knowledge from large data sets. Discover patterns hidden in the data. Two types of data mining: descriptive predictive Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak Key word: DISCOVERY Find something new! 30
Knowledge Discovery To be defined: knowledge Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber, and Jian Pei Elsevier, 2011 Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 31
Data Mining Techniques Integrates techniques from multiple disciplines: database technology statistics machine learning (AI) high-performance computing pattern recognition fuzzy set theory neural networks data visualization information retrieval knowledge representation image and signal processing spatial data analysis One of the most important frontiers in database systems. One of the most promising interdisciplinary developments in the information industry. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber, and Jian Pei Elsevier, 2011 32
Uses of Data Mining Market segmentation Customer churn Identify the common characteristics of customers who buy the same products from your company. Predict which customers are likely to leave your company and go to a competitor. Fraud detection Identify which transactions are most likely to be fraudulent. http://www.laits.utexas.edu/ anorman/BUS.FOR/course.mat/Alex/ Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 33
Uses of Data Mining, cont’d Direct marketing Identify which prospects should be included in a mailing list to obtain the highest response rate. Interactive marketing Predict what each individual accessing a website is most likely interested in seeing. Market basket analysis Understand what products or services are commonly purchased together; e.g., beer and diapers. http://www.laits.utexas.edu/ anorman/BUS.FOR/course.mat/Alex/ Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 34
New Business Opportunities from Data Mining Trend analysis Reveal the difference between a typical customer this month and last. Automated prediction of trends and behaviors Automate the process of finding predictive information in a large database. Questions that traditionally required extensive hands-on analysis can now be directly answered from the data. http://www.laits.utexas.edu/ anorman/BUS.FOR/course.mat/Alex/ Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 35
New Business Opportunities, cont’d Automated prediction of trends and behaviors, cont’d Example: Targeted marketing. Use data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Example: Forecast bankruptcy and other forms of default. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 36
New Business Opportunities, cont’d Automated discovery of previously unknown patterns. Sweep through databases and identify previously hidden patterns. Example: Analyze retail sales data to identify seemingly unrelated products that are often purchased together. Example: Detect fraudulent credit card transactions. Example: Identify anomalous data that could represent data entry keying errors. http://www.laits.utexas.edu/ anorman/BUS.FOR/course.mat/Alex/ Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 37
History of Data Mining Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber, and Jian Pei Elsevier, 2011 Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 38
Types of Database Queries “Standard” query Extract data that’s in the database. Example: What was the revenue from a certain product at a particular store on a given date? OLAP Slice, dice, drill up, drill down, pivot Generate views using aggregation and “group by”. Example: What revenue did we get during Q3 from our northeastern stores? Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 39
Types of Database Queries, cont’d Predictive analytics (data mining) Use statistical methods to forecast based on existing data. Example: What are the predicted revenue figures for next quarter? Finding patterns (data mining) What interesting patterns exist in the data? You may not even know ahead of time what questions to ask! Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 40
Data Mining Examples Transactional data “Market basket data analysis” What product items sell well together? Spatial data What characteristics do houses have that are located near a park? What climates are found in mountainous areas at various altitudes? How does the poverty rate change with distance Computer Engineering Dept. CMPE 226: Database Systems from major highways? Spring 2017: May 2 R. Mak 41
Data Mining Examples, cont’d Temporal and time-series data How should we schedule bank tellers based on customer volume? What patterns are there in the price of a certain stock? Web data What patterns are there in path traversals? What patterns are there in social networks? Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 42
Mining Patterns in Data Characterization Summarize the general characteristics or features of a target class of data. Example: What is the profile of a customer who spends over 1000/year at our store? Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 43
Mining Patterns in Data, cont’d Discrimination Compare the general characteristics or features of a target class of data with those of another class of data. Example: How do the features of software products whose sales increased by 10% last year compare with those products whose sales decreased by 10%? Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 44
Mining Patterns in Data, cont’d Associations Discover association rules among attributes and values for frequently occurring conditions. Example: Customers aged 20-29 with incomes 3050K/year will purchase Priuses with 60% probability. Rules of the form X Y age(X, 20.29) && income(X, 30K.50K) buys(X, Prius) confidence 60% Example: Amazon recommends books to you based CMPE 226: Database Systems R. Mak on books you’ve previously looked at or purchased. 45 Computer Engineering Dept. Spring 2017: May 2
Mining Patterns in Data, cont’d Classification and prediction Discover a set of models that describe the classes that objects belong to. Use the models to predict what classes objects belong to. Implement with IF-THEN rules, decision trees, or neural networks. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 46
Mining Patterns in Data, cont’d Classification and prediction, cont’d Example: You want to label products in your store as poor sellers, good sellers, or best sellers. Discover a model that uses a product’s characteristics, such as price, color, size, etc. to predict which label to attach to each product. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 47
Classification Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber, and Jian Pei Elsevier, 2011 A classification model can be represented in various forms: (a) IF-THEN rules, (b) a decision tree, or (c) a neural network. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 48
Mining Patterns in Data, cont’d Clusters Cluster or group objects based on the principle of: maximize intraclass similarity (similarity within a class) minimize interclass similarity (separation between classes) In other words, you want separate and distinct clusters. Cluster analysis is often used to generate labels for subsequent classification and prediction. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 49
Clusters Data Mining: Concepts and Techniques by Jiawei Han, Micheline Kamber, and Jian Pei Elsevier, 2011 A 2-D plot of customer data with respect to customer locations in a city, showing three data clusters. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 50
Mining Patterns in Data, cont’d Outliers Discover data objects that do not comply with the general behavior or model of the data. Use statistical tests and distance measures. Do not always discard as noise! Example: Uncover fraudulent credit card usage. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 51
Mining Patterns in Data, cont’d Evolution Discover regularities or trends for data objects whose behavior changes over time. Analysis can include time-series data analysis, sequence, or periodicity pattern matching, etc. Example: Look for patterns in the behavior of stocks. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 52
Interesting Patterns Generally, only a small percentage of patterns discovered by data mining are “interesting”. A pattern is interesting for a given application if it: is easily understood by humans is valid on new data with some degree of certainty is potentially useful is novel or unexpected validates a hypothesis the user wishes to confirm Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 53
Interesting Patterns, cont’d An interesting pattern represents knowledge. Do not expect a data mining system to generate all patterns generate only interesting patterns Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 54
Measuring Interest Objective measures An application sets thresholds for support and confidence. Let X Y be an association rule where X and Y are item-sets. Let T be a set of transactions of a database. The support value of X with respect to T is the proportion of transactions in the database which contains the item-set X. The confidence value of the rule X Y with respect to T is the proportion of the transactions that contain Computer Engineering Dept. 226: Database 55 X and which also CMPE contain Y.Systems Spring 2017: May 2 R. Mak
Measuring Interest, cont’d Subjective measures are based on user beliefs in the data. unexpected patterns patterns that offer strategic information (“actionable patterns”) A good data mining system discovers knowledge (interesting patterns) at all levels of data abstraction. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 56
Major Data Mining Issues Methodology Diverse data types User interactions data mining languages ad hoc data mining presentation and visualization of results handling noisy or incomplete data evaluating how interesting are the discovered patterns Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 57
Major Data Mining Issues Performance Ethics, privacy, social impacts Still a new field! Big Brother is watching you. Big Brother knows everything about you. Computer Engineering Dept. Spring 2017: May 2 CMPE 226: Database Systems R. Mak 58