Querying WebSources within a Data Federation Lynn Wu1, Aykut
26 Slides535.00 KB
Querying WebSources within a Data Federation Lynn Wu1, Aykut Firat2, Tarik Alatovic3, Stuart Madnick1 MIT Sloan School of Management 2 Northeastern University 3 INSEAD 1 International Conference on Information Systems (ICIS) December 11, 2006 1
Motivating Scenario You want: The current stock quotes of all companies listed on the Stock Exchange that are in the biotechnology industry. And you want to see each of the stock quotes in all the major currencies. 2
Good News All of the necessary information is available (and for free) on the Web isting of companies in an industry Stock price for any company Conversion between any two currenci So what’s the problem? 3
Process – Part 1 Web sites are not like Relational (SQL) databases. Must go step-by-step: first find all the biotech companies. 4
Process – Part 2 237 Biotech firms Biotechnology Ticker Acadia Pharmaceuticals Inc. ACAD Accentia Biopharmaceuticals, I ABPI Achillion Pharmaceuticals, Inc ACHN Acorda Therapeutics, Inc. ACOR Adherex Technologies Inc. ADH Advanced Cell Technology Inc. ACTC.OB Advanced Life Sciences Holding ADLS Advaxis Inc. ADXS.OB Adventrx Pharmaceuticals Inc. ANX Alfacell Corp. ACEL Alnylam Pharmaceuticals Inc. ALNY Then must find the stock price of each, one-by-one. ADH ACOR ACHN ABPI ADLS ANX ACAD 5
Process – Part 3 Then must convert stock price of each, one-by-one. Ticker Price( ) ACAD 8.96 ABPI 3.55 ACHN 17.52 ACOR 17.13 ADH 0.31 ACAD 8.96 6.8096 792.2289 ACTC.OB 0.76 ABPI 3.55 2.698 313.8853 ADLS 2.43 ACHN 17.52 13.3152 1549.09 ADXS.OB 0.14 ACOR 17.13 13.0188 1514.607 ANX 2.42 ACEL 1.62 ADH 0.31 0.2356 27.4097 ALNY 23.81 ACTC.OB 0.76 0.5776 67.19798 ADLS 2.43 1.8468 214.8567 ADXS.OB 0.14 0.1064 12.37858 ANX 2.42 1.8392 213.9725 ACEL 1.62 1.2312 143.2378 ALNY 23.81 18.0956 2105.242 Ticker Price( ) EURO JPY 6
General Scenario Users often have to browse through many websites and collect and process a lot of information manually. Wouldn’t it be great if you could get all the stock quotes in the biotech select ticker, price from yahooF industry using one query? where ticker IN (select companyticker from companytable where industry 'Biotechnology') 7
Why is this so difficult? Websites have various capability restrictions. Web sites do not accept general queries (e,g., SQL). Assuming they somehow accepted general queries, there are still problems. For example: select price from yahooF This is not answerable as Yahoo! Finance requires at least one ticker at a time to get the stock quote. select exchanged, expressed, rate, date from olsen where expressed 'USD' and date '12/10/06' Must specify both currencies. 8
Existing Solutions Commercial databases can incorporate heterogeneous data sources through the use of wrappers: However, there is no general-purpose wrapper that can query the entire Web. Need to construct one wrapper per website. This is our focus – how can these be improved ? Other options: Using highly expressive context-free grammars to express the capability restrictions Has not been used widely in commercial systems due to their complexity. 9
How does a Federated database system handle the problem? Example: IBM DB2 Query: Select .from s1,s2,s3 Federation Engine IBM DB2 Wrapper: Request-Reply Protocol Wrapper For web sites (S1, S2, S3), each wrapper must be custom crafted. Wrapper for S1 Wrapper for S2 Wrapper for S3 Capability Handler Capability Handler Capability Handler Data Extraction Data Extraction Data Extraction Web Sources S1-website S2-website S3-website 10
Research Contribution Offer a complete, practical, and scalable solution to easily incorporate websites into a data federation. Abstract wrapper components into separate reasoning engines. Capability reasoning engine for query planning and execution Data extraction engine 11
Our Solution Two-Layered Architecture—current IBM solution Query: Select .from s1,s2,s3 Three-Layered Architecture— with capability declaration Query: Select .from s1,s2,s3 Federation Engine Federation Engine IBM DB2 IBM DB2 Wrapper: Request-Reply Protocol Wrapper Wrapper for S1 Capability Handler Data Extraction Wrapper: Request-Reply Protocol Wrapper, Capability Engine Wrapper for S2 Capability Handler Data Extraction Wrapper for S3 Query planning with capability declaration CR for S1 CR for S2 CR for S3 Capability Handler Data Extraction Data Extraction Engine Data Extraction Engine Capability Record Declaration DE for S1 Data Extraction Spec Files DE for S2 DE for S3 Web Sources Web Sources S1-website S2-website S3-website S1-website S2-website S3-website 12
Adding a web source is simple. Define the data extraction rules. Define the capability record. No procedural coding involved at all ! 13
Data Extraction: Cameleon Engine Extract data from web pages using declarative specifications that extract specific fields within a website. Can answer rudimentary queries involving only a single Input param Regular expression identifying the region and extracts the price Example data extraction rules for Yahoo! Fin 14
Cameleon Studio tool enables quick creation and testing of the data extraction rules 15
Capability Record For Yahoo Finance!, we have two attributes of interest. Cameleon extracts data and form a table format Ticker Price Must provide one (and only one) Ticker at a (some sites allow up to 50 Tickers at a time) Capability Record relation(‘YahooF’, [[‘Ticker’, string, bound(1)], [‘Price’, number, free]], [' ']) relation(‘companytable’, [[‘Industry’, string, bound(1)], [‘CompanyTicker’, string, free]], [' ']) Price is value returned. Can only use equality ( ) operator. relation(olsen, [['Exchanged',string, bound(1)], ['Expressed',string, bound(1)], ['Rate',number, free], ['Date',string, bound(1)]], [' ']). 16
IBM DB2 Uses wrapper to access non-relational data sources. DB2 first decomposes the original query into query fragments and then sends them to wrappers. Wrapper sends the result back to DB2 which then assembles the final results. DB2 XML Wrapper (Adapted from IBM). 17
Request-Reply-Compensate Protocol Query Fragment Request select price * 1.3 from YahooF where ticker in (‘GE’, ‘IBM’, ‘MSFT’); HXP: Price Table: YahooF Predicates: ticker in (‘GE’, ‘IBM’, ‘MSFT’) HXP: Price Table: YahooF Predicate: ticker ‘GE’ Wrapper plan 1 HXP: Price Table: YahooF Predicate: ticker ‘IBM’ Wrapper plan 2 HXP: Price Table: YahooF Predicate: ticker ‘MSFT’ Wrapper plan 3 Request-Reply-Compensate protocol example 18
Query Planning Now we have a capability record defined. Add a secondary mini query planner that is designed specifically to work with capability records. Can answer queries involving multiple web sources. Specify a query execution order of query fragments. Independent query fragments are executed first. Followed by dependent query fragments that can uses the prior results. 19
Our Solution Example 1 Find all the stock quotes of biotech companies. SELECT TICKER, PRICE FROM YAHOOF WHERE TICKER IN (SELECT COMPANYTICKER FROM COMPANYTABLE WHERE INDUSTRY 'BIOTECHNOLOGY' AND COMPANYTICKER 'AD')) Independent query fragment SELECT COMPANYTICKER, INDUSTRY FROM COMPANYTABLE WHERE INDUSTRY BIOTECHNOLOGY’ AND COMPANYTICKER 'AD') Depends on the previous query fragment SELECT TICKER, PRICE FROM YAHOOF WHERE TICKER [ unbound kind ] 20
Example Query Independe SELECT COMPANYTICKER, nt query INDUSTRY FROM fragment COMPANYTABLE WHERE INDUSTRY BIOTECHNOLOGY AND COMPANYTICKER AD Depends on the previous query fragment SELECT ACAD SELECT ACAM SELECT ACOR SELECT ACEL COMPANYTICKER INDUSTRY --------------------------------------------ACAD Biotechnology ACAM Biotechnology ACOR Biotechnology ACEL Biotechnology SELECT TICKER, PRICE FROM YAHOOF WHERE TICKER [ unbound kind ] PRICE, TICKER FROM YAHOOF WHERE TICKER PRICE, TICKER FROM YAHOOF WHERE TICKER PRICE, TICKER FROM YAHOOF WHERE TICKER PRICE, TICKER FROM YAHOOF WHERE TICKERTICKER PRICE ------------------------------------------ACAD 14.90 ACAM 6.51 ACOR 5.10 21 ACEL 3.18
Example 2 Now you want the stock price in all major currencies. select yahooF.ticker, yahooF.price * exchange.rate, exchange.curency from (select ticker, price from yahooF where ticker IN (select companyticker from companytable where industry ‘biotechnology’) (select currency, olsen.rate from (select currency from currency map where currency ‘USD') currency map, (select exchanged, 'USD', rate, ‘12/10/06' from olsen where expressed 'USD' and date ‘12/10/06') olsen where currency map.currency olsen.exchanged and currency map.currency 'USD ') as exchange 22
Example 2 Get all the exchange rates against the USD on Dec 10 2006 Query fragment 1 Query fragment 2 select olsen.rate, from (select currency, from currency map where currency ‘USD') currency map, (select exchanged, ‘USD', rate, ‘12/10/06' from olsen where expressed ‘USD' and date ‘12/10/06') olsen, where currency map.currency olsen.exchanged and currency map.currency ‘USD' 23
Query fragment 2 (select exchanged, ‘USD', rate, ’12/10/06' from olsen where expressed ‘USD' and date ’12/10/06’) olsen Capability record relation(olsen, [['Exchanged',string, bound(1)], ['Expressed',string, bound(1)], ['Rate',number, free], ['Date',string, bound(1)]], [' ']). select olsen.rate from (select currency from currency map where currency 'USD') currency map, (select exchanged, 'USD', rate, '12/10/06' from olsen where expressed 'USD' and date '12/10/06') olsen, where and currency map.currency olsen.exchanged currency map.currency 'USD' (select exchanged, 'USD', rate, ’12/10/2006' from olsen where expressed 'USD' and date '12/10/06' and exchanged in (select currency from currency map where currency ’USD’)) Modified Query fragment 2 24
Currency rate ---------------------------------------AUD 1.46 CAD 1.32 HKD 7.72 YPY 113.00 TICKER PRICE ------------------------------------------ACAD 14.90 ACAM 6.51 ACOR 5.10 select ticker, price * exchange.rate, exchanged.currency TICKER PRICE( ) PRICExRATE CURRENCY -----------------------------------------------------------------------------------------------------ACAD 14.90 21.754 AUD ACAD 14.90 19.668 CAD ACAD 14.90 115.028 HKD ACAD 14.90 1683.7 YPY ACAM 6.51 9.505 AUD ACAM 6.51 8.593 CAD ACAM 6.51 50.257 HKD ACAM 6.51 735.63 YPY ACOR 5.10 7.446 AUD ACOR 5.10 6.732 CAD ACOR 5.10 39.372 HKD ACOR 5.10 576.3 YPY 25
Conclusion Three-layered architecture for querying web sources. Instead of burying capability handling in each wrapper, we created a generic capability handler. Using this capability handler, adding a web source to a federated database is as simple as declaring the extraction rules and capability record for the source. This was implemented and successfully tested. This makes millions of semi-structured web sites into useful “databases.” 26