A Management Architecture for Client-defined Cloud Storage
A Management Architecture for Client-defined Cloud Storage Services Jae Yoon Chung Supervisor: Prof. James Won-Ki Hong Distributed Processing and Network Management Lab. Dept. of Computer Science and Engineering Pohang University of Science and Technology [email protected] 2014. 12. 15 Thesis Defense 1/36
Outline
  Introduction
  Related Work
  CLIent-defined storage Management Architecture (CLIMA)
  Client-defined privacY-protected Reliable cloUd Storages (CYRUS) Design
  CYRUS Implementation
  Evaluation
  Conclusion
Introduction
  Top 10 Strategic Technology Trends
    2012 | 2013 | 2014 | 2015
    Big data | Strategic big data | Smart machines | Smart machines
    Extreme low-energy servers | Integrated ecosystems | Web-scale IT | Web-scale IT
    Next generation analytics | Actionable analytics | 3D printing | 3D printing
    App stores and marketplaces | Enterprise app stores | Software-defined anything | Software-defined applications/infrastructure
    IoT | IoT | IoT | IoT
    In-memory computing | In-memory computing | Cloud/client architecture | Cloud/client computing
    Mobile-centric applications/interfaces | Mobile applications/HTML5 | Mobile apps and applications | Risk-based security/self-protection
    Cloud computing | Hybrid IT/cloud computing | Hybrid cloud & IT as a service broker | Advanced pervasive/invisible analytics
    Media tablets and beyond | Mobile device battles | Mobile device diversity/management | Computing everywhere
    Contextual/social user experience | Personal cloud | Era of the personal cloud | Context-rich systems
  Source: Gartner, Top 10 Strategic Technology Trends for 2015, Analysts Examine Top Industry Trends at Gartner Symposium/ITxpo, Orlando, Oct. 5-9, 2014.
Reliability and Security Concerns
  Dropbox service down (Jan. 12, 2014)
    – Lost master-replica pairs while updating the OS
  Outage of Google services (Jan. 24, 2014)
    – 55 minutes of service downtime
  Outage of Microsoft's cloud services (Nov. 18, 2014)
    – One hour of service downtime affecting OneDrive, MS Band, Xbox Live, Apps for Windows Phone
  Private photos leaked from iCloud (Aug. 31, 2014)
    – Almost 500 photos of celebrities: Jennifer Lawrence, Kate Upton, Kaley Cuoco
  SLA guarantee: 99.9% service uptime (8.76 hours of service downtime in a year)
Problem Statement
  How to manage multiple cloud services?
  How to improve privacy protection of public cloud services?
  How to integrate and coordinate multiple cloud storage services?
  How to realize a client-defined storage management architecture?
Research Motivation and Goal
  Propose a management architecture for client-defined cloud storage services
  Realize the proposed architecture as a client application that improves reliability, privacy protection, and performance
Related Work (1/2)
  Distributed storage systems
    – Byzantine fault-tolerant systems
    – HAIL, RACS, Scalia, Ceph
  Adopting storage-system techniques in a client-defined architecture
    – Metadata management strategy
    – Data deduplication: ChunkStash, content-address caching with CZIP
    – Data encoding for distributed storage: (t, n) secret sharing, erasure codes
Related Work (2/2)
  Client-based distributed storage
  (t, n) property Concurrency Data deduplication Versioning Optimal cloud selection Elastic Reliability Client-based Architecture PiCsMu InterCloud RAIDer DepSky No No Yes No Yes Yes Proposed Approach Yes Yes No Yes No Yes No Yes Yes Yes No No No Yes No No No Yes Yes Yes Yes Yes
CLIent-defined storage Management Architecture Overview
  Application Domain
  Cloud Domain
  Network Domain
CLIent-defined storage Management Architecture
  Define domains
    – Different manageability
  Define components
    – Coordination and cloud control
  Define data format
    – Metadata and encoded data
  Define data sync protocol
    – Client-based approach
  Develop optimization algorithm
    – Optimal cloud selection
CLIMA – Application Domain
  Application Domain
    – Fully manageable
    – Similar to legacy Internet applications
  Client
    – Implementation of application functions
    – Coordinator to schedule requests to multiple clouds
    – Controller to access different cloud APIs
  Server (optional)
    – Implementation of supplementary functions
    – Not used as a proxy
CLIMA Application
CLIMA – Cloud Domain
  Cloud Domain
    – Offers resources
    – Partially manageable
    – IaaS: fully manageable
    – PaaS and SaaS: limited by available APIs
  Unify different clouds
    – Understanding different implementations
CLIMA – Network Domain
  Network Domain
    – Link between client and clouds
    – Not manageable
    – No available APIs
  Active probing
    – Send packets through the network
  Passive monitoring
    – Measure statistics
  Limited information
    – Based on inference
Client-defined privacY-protected Reliable cloUd Storages – Design
  Unifying Heterogeneous CSPs
  Ensure Privacy and Reliability
  Metadata and Data Sync Protocol
  CSP Selection Algorithm
Design Considerations
  Unifying heterogeneous CSPs (Cloud Service Providers)
    – Without changing cloud APIs
  Improve storage efficiency for version management
    – Do not store duplicated data
  File chunking for data deduplication
  Chunk encoding for ensuring privacy and reliability
Data Deduplication – File into Chunks
  Size-based chunking vs. content-based chunking
    – Size-based chunking: unchanged data regions can be detected as new chunks
    – Content-based chunking: only changed data regions are detected as new chunks
  Size-based chunking (before and after inserting 0010)
    0101 0010 1010 1111 1010 1001 0110 1100 1011 1010
    0101 0010 1010 0010 1111 1010 1001 0110 1100 1011 1010
  Content-based chunking (before and after inserting 0010)
    0101 0010 1010 1111 1010 1001 0110 1100 1011 1010
    0101 0010 1010 1111 0010 1010 1001 0110 1100 1011 1010
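The contrast between the two chunking strategies can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the window size, the boundary rule, and the toy hash (the sum of the window bytes, standing in for a real fingerprint such as Rabin's) are all assumptions chosen for readability.

```python
import hashlib

WINDOW = 4    # bytes in the sliding window (assumed value)
DIVISOR = 8   # cut a boundary when hash % DIVISOR == 0 (assumed value)

def size_based_chunks(data: bytes, size: int = 8) -> list:
    """Cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_based_chunks(data: bytes) -> list:
    """Cut wherever the hash of the last WINDOW bytes hits the target value."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        toy_hash = sum(data[end - WINDOW:end])   # toy stand-in for a fingerprint
        if toy_hash % DIVISOR == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])              # trailing bytes form the last chunk
    return chunks

def reused(old: list, new: list) -> int:
    """Number of chunks in `new` that are already stored from `old`."""
    stored = set(old)
    return sum(1 for chunk in new if chunk in stored)

# Deterministic pseudo-random test data (256 bytes).
original = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(8))
modified = b"\x2a" + original                    # insert one byte at the front

print("size-based reuse:   ",
      reused(size_based_chunks(original), size_based_chunks(modified)))
print("content-based reuse:",
      reused(content_based_chunks(original), content_based_chunks(modified)))
```

Because a content-based boundary depends only on the bytes inside the window, every chunk after the first one reappears unchanged once the insertion point has been passed, while fixed-size boundaries all shift by one byte.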
Ensure Privacy and Reliability
  (t, n) threshold property
    – Encode a chunk into n shares
    – Any t shares suffice to decode the data
  Ensuring privacy and reliability
    – At least t accounts are needed to read the data (privacy protection)
    – Failures of up to n-t CSPs are tolerated (reliability)
Encode chunk into shares
  Reed-Solomon Code
    – Generate n equations with t variables
    – Hide the data in the t unknown variables
    – Distribute n shares (constant terms) to clouds
    – Any t equations suffice to find the solution (the original data)
  Store n shares to clouds
  Overhead for an original file of size b
                              Shamir    RS Code
    Storage overhead
      One CSP                 b         b/t
      Total                   n*b       n*(b/t)
    Transmission overhead
      Upload                  n*b       n*(b/t)
      Download                t*b       t*(b/t) = b
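The "n equations with t variables" idea can be illustrated with a small polynomial code. The sketch below is an assumption-laden stand-in, not the zfec-based encoder used in CYRUS: it works over the prime field GF(257) for readability (zfec works in GF(2^8)), treats t data symbols as polynomial coefficients, and hands out one evaluation per share. The names `encode` and `decode` are hypothetical.

```python
from itertools import combinations

P = 257  # prime modulus chosen for illustration, so every byte value is a field element

def encode(symbols, n):
    """Treat the t data symbols as coefficients of f; share x is the pair (x, f(x) mod P)."""
    return [(x, sum(c * pow(x, j, P) for j, c in enumerate(symbols)) % P)
            for x in range(1, n + 1)]

def decode(shares, t):
    """Recover the t symbols from any t shares by Gauss-Jordan elimination mod P."""
    rows = [[pow(x, j, P) for j in range(t)] + [y] for x, y in shares[:t]]
    for col in range(t):
        # A Vandermonde system with distinct x values always has a nonzero pivot.
        pivot = next(r for r in range(col, t) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = pow(rows[col][col], P - 2, P)          # modular inverse (Fermat)
        rows[col] = [(v * inv) % P for v in rows[col]]
        for r in range(t):
            if r != col and rows[r][col]:
                factor = rows[r][col]
                rows[r] = [(a - factor * b) % P
                           for a, b in zip(rows[r], rows[col])]
    return [rows[j][t] for j in range(t)]

data = [72, 105]              # t = 2 data symbols
shares = encode(data, n=3)    # (2, 3) configuration: 3 shares, any 2 recover the data
for subset in combinations(shares, 2):
    assert decode(list(subset), t=2) == data
```

Each share carries one symbol for every t data symbols, which is where the b/t per-CSP storage cost in the table comes from.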
Metadata and Data Sync Protocol
  Metadata stored at CSPs
    – Each metadata file represents one file update
    – Listing the files in the metadata folder detects new updates
  Metadata format
    – FileMap, ChunkMap, ShareMap
CSP Selection Algorithm
  Schedule: Optimal, Round-Robin
  CSP selection and performance optimization
    – Minimize the maximum download time over CSPs
        c: CSP index
        r: chunk index
        b_r: size of chunk r
        d_{r,c}: indicator
        β_c: bandwidth of CSP c
    – Indicator d_{r,c} selects t*R shares from n*R shares
        d_{r,c} = 1 if chunk r is downloaded from CSP c, 0 otherwise
  Download more shares from faster CSPs
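One plausible reading of the objective, using the slide's symbols, is to minimize the maximum over c of (1/β_c) Σ_r (b_r/t) d_{r,c}, subject to Σ_c d_{r,c} = t for every chunk r; the exact formulation in the thesis may differ. The Python sketch below compares a round-robin schedule with a simple greedy heuristic under that reading. The greedy rule is an assumption made for illustration and is not claimed to be the optimal algorithm from the thesis; it also assumes every CSP holds a share of every chunk.

```python
def makespan(loads, bandwidth):
    """Completion time: the slowest CSP determines when the download finishes."""
    return max(load / bw for load, bw in zip(loads, bandwidth))

def round_robin(chunk_sizes, bandwidth, t):
    """Assign the t shares of each chunk to consecutive CSPs in turn."""
    n = len(bandwidth)
    loads, nxt = [0.0] * n, 0
    for b in chunk_sizes:
        for _ in range(t):
            loads[nxt] += b / t          # each share has size b/t
            nxt = (nxt + 1) % n
    return loads

def greedy(chunk_sizes, bandwidth, t):
    """For each chunk, pick the t CSPs with the earliest projected finish time."""
    n = len(bandwidth)
    loads = [0.0] * n
    for b in sorted(chunk_sizes, reverse=True):
        share = b / t
        picks = sorted(range(n),
                       key=lambda c: (loads[c] + share) / bandwidth[c])[:t]
        for c in picks:
            loads[c] += share
    return loads

bandwidth = [2.0, 2.0, 15.0, 15.0]   # MB/s per CSP (assumed example values)
chunks = [4.0] * 8                   # eight 4 MB chunks, t = 2

rr_time = makespan(round_robin(chunks, bandwidth, t=2), bandwidth)
greedy_time = makespan(greedy(chunks, bandwidth, t=2), bandwidth)
print(f"round-robin: {rr_time:.2f} s, greedy: {greedy_time:.2f} s")
```

In this example the greedy schedule pulls most shares from the two fast CSPs, which matches the slide's point that more shares should be downloaded from faster CSPs.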
Client-defined privacY-protected Reliable cloUd Storages – Implementation
  Addressing Shares
  Zfec (RS code) Modification
  DEMO
Addressing Shares
  When uploading n shares
    – Select n CSPs from C CSPs
    – Avoid uploading duplicated shares
  Content-based naming scheme
    – Store the same shares at the same locations
  Abstracting CSP-side operations
    – Do not UPDATE or DELETE files in a CSP: lock-free read/write
    – Require only GET and PUT APIs
  Figure: SHA-1 hash and consistent hash; Filename: xyz; share name 1234abcd; overwrite, but the content is the same (1234abcd)
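A minimal Python sketch of content-based naming plus consistent hashing follows. It assumes a plain hash ring without virtual nodes and uses hypothetical CSP labels and a hypothetical chunk identifier; the actual CYRUS placement logic may differ.

```python
import hashlib

def share_name(share: bytes) -> str:
    """Content-based name: identical shares always map to the same name."""
    return hashlib.sha1(share).hexdigest()

def ring_position(label: str) -> int:
    """Position of a CSP label or chunk identifier on the hash ring."""
    return int(hashlib.sha1(label.encode("utf-8")).hexdigest(), 16)

def select_csps(chunk_id: str, csps: list, n: int) -> list:
    """Walk clockwise from the chunk's ring position and take the next n CSPs."""
    if n > len(csps):
        raise ValueError("need at least n CSPs")
    ring = sorted(csps, key=ring_position)
    start = ring_position(chunk_id)
    first = next((i for i, c in enumerate(ring)
                  if ring_position(c) >= start), 0)   # wrap around if none
    return [ring[(first + k) % len(ring)] for k in range(n)]

csps = ["csp-a", "csp-b", "csp-c", "csp-d"]           # hypothetical CSP labels
shares = [b"share-0", b"share-1", b"share-2"]         # n = 3 shares of one chunk
targets = select_csps("chunk-42", csps, n=len(shares))
placement = {share_name(s): csp for s, csp in zip(shares, targets)}
for name, csp in placement.items():
    print(name[:8], "->", csp)
```

Because names and targets depend only on content and identifiers, re-uploading an unchanged share is a PUT of the same bytes to the same name, so no UPDATE, DELETE, or lock is needed.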
Zfec (RS code implementation) Modification
  Figure (example matrix values)
    20 40 30 70 50
    2 4 3 7 5
    22 42 32 72 52
    23 43 33 73 53
    24 44 34 74 54
    25 45 35 75 55
    26 46 36 76 56
  Generating the dispersal matrix
    – Generate a vector with T elements from the key string
    – Each user uses a different dispersal matrix
    – The user only needs to remember the key string
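The slide does not say how the key-derived vector enters the dispersal matrix, so the Python sketch below shows only one possible construction, stated here as an assumption: hash the key string into T nonzero field elements and use them to scale the columns of a Vandermonde matrix. It uses the prime field GF(257) for readability rather than zfec's GF(2^8), and the function names are hypothetical.

```python
import hashlib

P = 257  # prime field used for illustration; zfec itself works in GF(2^8)

def key_vector(key: str, t: int) -> list:
    """Derive t nonzero field elements from the key string (t <= 32)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return [digest[j] % (P - 1) + 1 for j in range(t)]   # values in 1..256

def dispersal_matrix(key: str, t: int, n: int) -> list:
    """n x t Vandermonde matrix with column j scaled by the j-th key element."""
    scale = key_vector(key, t)
    return [[(pow(x, j, P) * scale[j]) % P for j in range(t)]
            for x in range(1, n + 1)]

def encode(matrix: list, symbols: list) -> list:
    """One share per matrix row: the dot product of the row with the data symbols."""
    return [sum(m * s for m, s in zip(row, symbols)) % P for row in matrix]

matrix = dispersal_matrix("my secret key", t=2, n=4)     # hypothetical key string
print(encode(matrix, [72, 105]))
```

Scaling columns by nonzero elements keeps every t x t submatrix invertible, so the (t, n) property is preserved while the shares depend on the key string.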
Evaluation
  Testbed Experiments
  Real-World Benchmarking
  Comparison with DepSky
  Deployment Trial Results
Testbed Experiments (1/3)
  Testbed setup
    – Traffic shaping using NetEm
    – FTP servers as storage
    – Upload/download 172 files (638 MB) in the Documents directory
  Figure labels: 2 MB/s, 15 MB/s
Testbed Experiments (2/3)
  Figure: storage overhead [%] and storage reliability [%] (n: number of distributed shares, t: number of required shares)
  Storage overhead
    – Overhead increases as n increases
    – Overhead decreases as t approaches n
  Reliability (assuming the reliability of a CSP is 99.9%)
    – When t = n, reliability is lower than that of a single CSP
    – When t < n, reliability is close to 100%: 99.999999% with the (2, 4) configuration
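The reliability figures can be reproduced with a short binomial calculation. The Python sketch below assumes CSP failures are independent and that each CSP is available with probability 0.999, as the slide states; the function names are hypothetical, and the overhead is expressed as the ratio n/t of stored bytes to original bytes.

```python
from math import comb

def reliability(t: int, n: int, p: float = 0.999) -> float:
    """Probability that at least t of n independent CSPs are available."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t, n + 1))

def storage_ratio(t: int, n: int) -> float:
    """Total stored bytes relative to the original size, with shares of size b/t."""
    return n / t

for t, n in [(1, 1), (2, 3), (2, 4), (3, 4), (4, 4)]:
    print(f"({t}, {n}): reliability = {reliability(t, n):.10f}, "
          f"storage ratio = {storage_ratio(t, n):.2f}")
```

With t = n every CSP must be up, so the result drops below the single-CSP value of 0.999, whereas (2, 4) tolerates two failures and exceeds 0.99999999.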
Testbed Experiments (3/3)
  Download completion time under different cloud selection algorithms
    – CYRUS with optimal selection shows the best performance
  The performance bottleneck is the upload/download time to/from slow CSPs
    – More shares are downloaded from faster CSPs to reduce completion time
Real-World Benchmarking
  Completely parallel transmission
  Completion time to upload/download a 40 MB file with the (2, 3) configuration
    – Google, Dropbox, Microsoft, Box.com
    – Compare CYRUS with baseline approaches and DepSky
Avoiding CSP Dependency
  Active probing from the client using traceroute
    – Measure the logical topology from the client to the CSPs
    – Understand the location and dependency of CSPs
    – Detected that five CSPs are deployed on Amazon
  Detection time is less important
    – Deployment locations of cloud services do not change frequently
Comparison with DepSky
  Locking overhead of DepSky
  DepSky always selects the fastest clouds
Deployment Trial Results
  Figure legend: CSP1, CSP2, CSP3, CSP4, CYRUS (2,3), CYRUS (2,4)
  Recruited 20 academic users from the US and Korea
    – Collect event logs when CYRUS starts
  Performance in the United States: CSPs are fast
    – Throughput reaches the client's link capacity
  Performance in Korea: CSPs are slow
    – Throughput is boosted by parallelizing transmissions
Conclusion - Contributions
  Proposed the CLIent-defined storage Management Architecture
    – Aimed at covering the full feature set of storage research and commercial storage services
    – Defined domains, protocols, data format, and algorithm
  Realized CLIMA as a client application called CYRUS
    – Specified requirements and functions of storage services
    – Adopted storage research results into CLIMA
    – Implemented and deployed a prototype in the US and Korea
  Evaluated performance
    – Performed experiments with a testbed and commercial CSPs
    – 99.999999% reliability with the (2, 4) configuration
    – Compared completion time with baseline approaches and DepSky
Conclusion – Future Work
  P2P-based metadata exchange
    – Design a direct metadata exchange protocol
    – Develop a file sharing function for multiple users
    – Develop an access control scheme
  Extending the Network Domain
    – Integrate CLIMA with Software-Defined Networking research
    – Develop network-side interfaces to interact with client applications
    – Develop protocols for client applications to obtain network information and to reserve network resources
Appendix
Cloud Storage Market
  Cloud storage market
    – 244 billion dollars in 2017 (Gartner, 2013)
  Trend
    – Security and privacy protection
    – Hybrid cloud
    – Integrating cloud accounts
Content-based Chunking
  File chunking using Rabin fingerprinting
    – Calculate the fingerprint while moving a window over the data
    – If the fingerprint equals a predefined value, set a chunk boundary
Example - Content-based Chunking
  Data blocks 1 to 10: 0101 0010 1010 1111 1010 1001 0110 1100 1011 1010
  Figure: Hash(x) = x%4 for each data block (hash axis from 0 to 4)
  Data deduplication
    – Do not store duplicated chunks
    – High benefit for version control (modification history)
Design Considerations
  Unifying heterogeneous CSPs
  Ensuring privacy and reliability
  Concurrent file access
  Client-based architecture
  CSP selection and performance optimization
CYRUS Implementation - Client
Decoupling metadata and file control
Decoupling Metadata
  Metadata stored at CSPs
    – Each metadata file represents one file update
    – Listing the files in the metadata folder detects new updates
  Metadata format
    – FileMap, ChunkMap, ShareMap
Unifying Heterogeneous CSPs
Metadata Format
  Written in JSON
  FileMap: the file's metadata
  ChunkMap: sequence of required chunks
  ShareMap: mapping between shares and clouds
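To make the three maps concrete, the Python sketch below assembles one metadata record and serializes it as JSON. Only the names FileMap, ChunkMap, and ShareMap come from the slides; every field inside them, along with the file, chunk, share, and CSP identifiers, is a hypothetical placeholder.

```python
import json

def build_metadata(filename: str, chunk_ids: list, share_locations: dict) -> dict:
    """Assemble one metadata record describing a single file update."""
    return {
        # FileMap: the file's own metadata (fields here are assumed)
        "FileMap": {"name": filename, "chunk_count": len(chunk_ids)},
        # ChunkMap: ordered sequence of chunks needed to rebuild the file
        "ChunkMap": list(chunk_ids),
        # ShareMap: for each chunk, which cloud holds which share
        "ShareMap": {cid: share_locations[cid] for cid in chunk_ids},
    }

meta = build_metadata(
    "report.txt",
    ["chunk-a", "chunk-b"],
    {
        "chunk-a": {"csp-1": "share-a1", "csp-2": "share-a2", "csp-3": "share-a3"},
        "chunk-b": {"csp-1": "share-b1", "csp-2": "share-b2", "csp-3": "share-b3"},
    },
)
text = json.dumps(meta, indent=2)
print(text)
```

A client that lists the metadata folder and finds a record it has not seen can read ChunkMap for the chunk order and ShareMap for where to fetch any t shares of each chunk.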
Upload
Upload/Download Procedure
Distributed Conflict Detection
Performance of Modified Zfec
  Figure: original Zfec (solid) vs. modified Zfec (dashed)
  Different configurations
    – (2, 3)
    – (2, 4), (3, 4)
    – (2, 5), (3, 5), (4, 5)
Consistent Hash