A walk in cloud (and look for databases) Jian Xu DMM DB-talk, Feb 2010
Outline: Cloud products; Architecture; Database issues; Data integration issues; Other topics
Cloud products
Google App Engine – http://code.google.com/appengine/
Amazon Elastic Compute Cloud (EC2) – http://aws.amazon.com/ec2/
Yahoo! Cloud Testbed / M45 – http://labs.yahoo.com/cloud computing
WestGrid – http://www.westgrid.ca/
Microsoft Azure – http://www.microsoft.com/windowsazure/
What’s in common? A cluster of centrally managed computing resources; application hosting & storage; an API / SDK; encouragement to deploy applications that run in parallel. The goal: fast, scalable & cheap computing
Comparison to cluster/grid: Similar hardware deployment. Cloud hides the resource specification [IBM] – only the API is exposed [Google, Amazon] – underlying resources are transparent to the user. Cluster exposes the resource configuration – “utility” computing – little or no service support
Source: [Foster 09]
Cloud Architecture: Cluster / grid system as hardware platform – IBM cluster line; Virtualization [Barham 03] – Amazon Machine Image; Unified storage and data access model – Google’s Bigtable, Amazon’s SimpleDB
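To make "unified storage and data access model" concrete, here is a minimal sketch of the kind of data model Bigtable is usually described with: a sparse, sorted map from (row key, column key, timestamp) to a value. The class `SparseTable`, its method names, and the sample row are hypothetical illustrations held in memory; this is not Google's or Amazon's API.

```python
import bisect


class SparseTable:
    """Toy in-memory table: (row, column, timestamp) -> value."""

    def __init__(self):
        # (row, column) -> list of (timestamp, value), kept sorted by timestamp
        self._cells = {}

    def put(self, row, column, timestamp, value):
        versions = self._cells.setdefault((row, column), [])
        bisect.insort(versions, (timestamp, value))

    def get(self, row, column, at=None):
        """Return the newest value, or the newest one written at or before `at`."""
        versions = self._cells.get((row, column), [])
        if at is not None:
            versions = [v for v in versions if v[0] <= at]
        return versions[-1][1] if versions else None

    def scan_row(self, row):
        """Return {column: newest value} for one row; unset columns store nothing."""
        return {
            col: versions[-1][1]
            for (r, col), versions in sorted(self._cells.items())
            if r == row
        }


# Hypothetical sample data
table = SparseTable()
table.put("com.example/index", "contents:html", 1, "<html>v1</html>")
table.put("com.example/index", "contents:html", 2, "<html>v2</html>")
table.put("com.example/index", "anchor:home", 1, "Example")
```

Unlike a relational table, there is no fixed schema here: each row carries only the columns that were written, and older versions remain readable by timestamp.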
IBM Architectural Model for Cloud Computing [Web2.0 expo] (Figure: layered architecture diagram. Recoverable labels: Subscribers; Cloud Infrastructure & Application Provider; Web 2.0 Solution Tools; Service Management, covering end user requests, user request management / self service, service lifecycle management, image lifecycle management, provisioning, performance management, availability / backup / restore, license management, usage accounting, and security (identity, access, integrity, isolation, audit & compliance); Image Library (Store); Virtualized Applications (applications & services, mashup interface, content & data); Web 2.0 Platform (image deployment, integrated security, workload mgmt., high availability); Virtualized Infrastructure (virtual resources & aggregations, service catalog, server / storage / network virtualization, standards-based interfaces); System Resources (SMP servers, blades, storage servers, storage, network hardware); Operational Lifecycle of Images (design & build, deployment).)
Where is the database? Database as a service; above / below / crossing the virtualization layer; file system vs. database. Google: Bigtable -- (?); Amazon: S3 -- SimpleDB, RDS (MySQL-based); Microsoft: Azure Table -- SQL Azure
Handling big, big data. Is the database too weak? MapReduce vs. distributed databases [stonebraker 2010]; databases accepting MapReduce [chen 09]. Why databases are missing in clouds: general purpose vs. specific use; programming model; workflow administration
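The contrast in programming model can be sketched with a single-process imitation of a MapReduce job. The helper names (`run_job`, `shuffle`, and so on) and the `visits` data are hypothetical; a real framework would run the map and reduce phases in parallel across machines, which this sketch does not attempt.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the user's mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values that share a key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the user's reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Hypothetical input: one record per page visit
visits = [
    {"url": "/a", "bytes": 120},
    {"url": "/b", "bytes": 300},
    {"url": "/a", "bytes": 80},
]


def bytes_by_url(record):
    yield record["url"], record["bytes"]


def total(key, values):
    return sum(values)


totals = run_job(visits, bytes_by_url, total)
```

In a DBMS the same aggregation is a single declarative statement, `SELECT url, SUM(bytes) FROM visits GROUP BY url`, and the system chooses the execution plan; with MapReduce the programmer writes the two functions and the framework handles only distribution and grouping.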
Cloud is like database. Cloud: the user does not know how resources are allocated; balances resources among user applications; [LINQ]. Database: the user does not care how the DBMS executes an SQL query; handles a large number of concurrent queries; SQL standard
New database architectures? DBMS for high-density clusters and virtualization platforms [Ibrahim 09]; SSD & SSD matrix as storage / high-speed cache / virtualized storage; GPU-aided computing or DBMS on GPU [He 08]
How about data integration? Domain-heterogeneous data sources. Need to scale: more data sources co-operate in query processing; larger scope of schema. Data integration as a cloud service
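As a rough illustration of querying heterogeneous sources through one shared schema, the sketch below wraps each source with a mapping from global attribute names to its local ones and fans a query out to all of them. Everything here (`make_source`, `integrated_query`, the attribute names, the sample rows) is a hypothetical example, not the model presented in the talk.

```python
def make_source(rows, mapping):
    """Wrap a local data source.

    `mapping` translates global attribute names to this source's local names.
    The returned function answers a predicate expressed over the global schema.
    """
    def query(predicate):
        for row in rows:
            translated = {g: row[local] for g, local in mapping.items()}
            if predicate(translated):
                yield translated
    return query


# Two hypothetical sources holding the same kind of data under different schemas
source_a = make_source(
    [{"name": "Ann", "city": "Vancouver"}, {"name": "Cy", "city": "Toronto"}],
    {"person": "name", "location": "city"},
)
source_b = make_source(
    [{"full_name": "Bob", "town": "Vancouver"}],
    {"person": "full_name", "location": "town"},
)


def integrated_query(sources, predicate):
    """Send the same global-schema query to every source and merge the answers."""
    results = []
    for query in sources:
        results.extend(query(predicate))
    return results


in_vancouver = integrated_query(
    [source_a, source_b], lambda r: r["location"] == "Vancouver"
)
```

Each source keeps its own schema and data; only the mapping is shared. Scaling to more sources means more mappings to maintain and more sub-queries per request, which is where the scalability concern on this slide comes from.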
Our preliminary model
Opportunities with cloud: Cloud gathers data sources into highly available clusters. Cloud reduces the overhead of replicating data sources. Cloud simplifies the underlying communication. The peer-to-peer model maintains the autonomy of individual data sources and scales
Other topics: Security & privacy; New business model; Inter-cloud data exchange
References
[stonebraker 2010] Michael Stonebraker. MapReduce and Parallel DBMSs: Friends or Foes? CACM, 2010.
[Ibrahim 09] Shadi Ibrahim et al. CloudLet: Towards MapReduce Implementation on Virtual Machines. HPDC 2009.
[chen 09] Qiming Chen et al. Efficiently Support MapReduce-like Computation Models Inside Parallel DBMS. IDEAS 2009.
[Loebman] Analyzing Massive Astrophysical Datasets: Can Pig/Hadoop or a Relational DBMS Help?
[He 08] Mars: A MapReduce Framework on Graphics Processors.
References (2)
[LINQ] Michael Isard, Yuan Yu. Distributed Data-Parallel Computing Using a High-Level Programming Language. SIGMOD 2009.
[Pavlo 09] Andrew Pavlo et al. A Comparison of Approaches to Large-Scale Data Analysis. SIGMOD 2009.
[Foster 09] Ian Foster et al. Cloud Computing and Grid Computing 360-Degree Compared.
[Web2.0 expo] Scott Gerard. Maximize your web2.0 experience with cloud computing. Presentation at Web 2.0 Expo 2009.
[Barham 03] Xen and the Art of Virtualization. SOSP 2003.