HDInsight makes Hadoop Easy

Agenda
- Collect and Load Big Data
- HDInsight Clusters
- Cluster Customizations
- Visual Studio Tooling

Collect and load big data
- Prerequisites
- Blob storage concepts
- Data types and sources
- Performance and scalability
- Administration
- Reliability
- Security
- Data processing
- Pre-processing data
- Serialization and compression
- Choosing tools and technologies

Collect and load big data: tools by data source
(Diagram: each source loads data into Azure blob storage, which the HDInsight cluster reads through HDFS.)
- Interactive: Visual Studio, PowerShell, Hadoop command line, Cloudberry, AzCopy, 3rd party application
- Relational data: Azure Data Factory, Apache Sqoop, SQL Server Integration Services, PolyBase in APS
- Streaming data: Apache Storm on HDInsight, Azure Stream Analytics, Reactive Extensions (Rx), custom or 3rd party application
- Server log files: Apache Flume, SQL Server Integration Services, custom solution using the Azure SDK
- Automated: Azure Data Factory, PowerShell with task scheduler, SQL Server Integration Services, custom solution using the Azure SDK

Blob storage concepts
- Store large amounts of unstructured text or binary data with the fastest read performance
- Access a highly scalable, durable, and available file system
- Expose blobs publicly over HTTP
- Securely lock down permissions

Blob address: http://account.blob.core.windows.net/container/blobname

Hierarchy (diagram): Account, then Container, then Blob, then blocks/pages
- Account: Contoso
- Containers: Images, Video
- Blobs: PIC01.JPG, PIC02.JPG, VID1.AVI (each made up of blocks or pages)
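
The address pattern above can be sketched in a few lines of Python. This is only an illustration of how the account, container, and blob name combine into a URL; the function name `blob_url` is hypothetical, and the lowercase `contoso` and `images` values are assumptions adapted from the diagram labels.

```python
from urllib.parse import quote


def blob_url(account: str, container: str, blob_name: str, scheme: str = "http") -> str:
    """Combine the three levels of the blob hierarchy into one address.

    Hypothetical helper for illustration; it only formats a string and
    does not contact Azure Storage.
    """
    if not account or not container or not blob_name:
        raise ValueError("account, container and blob_name are all required")
    # Percent-encode the blob name; "/" is left alone so virtual folders survive.
    return f"{scheme}://{account}.blob.core.windows.net/{container}/{quote(blob_name)}"


if __name__ == "__main__":
    print(blob_url("contoso", "images", "PIC01.JPG"))
    # http://contoso.blob.core.windows.net/images/PIC01.JPG
```

Passing `scheme="https"` gives the encrypted form of the same address.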

Data types and sources
(Diagram: sources feeding Azure blob storage: devices and sensors, geo-location data, web clickstreams, social media, server logs.)
- Extract data in a form that can be easily consumed
- Stage data before submitting it to a big data cluster
- Submit data accurately to cluster storage, including data conversions
- Choose the right tool for your data
- If using HBase, use an appropriate technique to upload it

Reliability
- Upload tool should handle transient connectivity and transmission failures
- Monitor upload to detect failures early
- Record each stage in a process that raised an error
- Scale out with multiple upload instances
- Validate the data before you upload
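
One common way to handle transient failures and record each failed stage is retry with exponential backoff. The sketch below is a minimal illustration, not a specific tool's behavior: `TransientUploadError`, `upload_with_retry`, and the delay values are all hypothetical, and `upload` stands for whatever callable performs the real transfer.

```python
import time


class TransientUploadError(Exception):
    """Hypothetical error type standing in for a dropped connection or timeout."""


def upload_with_retry(upload, max_attempts=4, base_delay=0.5,
                      sleep=time.sleep, failures=None):
    """Call upload() until it succeeds or max_attempts is reached.

    Each failed attempt is appended to `failures` as (attempt, message),
    so the stage that raised an error is recorded. The delay doubles
    after every failure: base_delay, 2*base_delay, 4*base_delay, ...
    """
    failures = [] if failures is None else failures
    for attempt in range(1, max_attempts + 1):
        try:
            return upload()
        except TransientUploadError as exc:
            failures.append((attempt, str(exc)))
            if attempt == max_attempts:
                raise  # give up and surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

Non-transient errors (for example, data that fails validation) are deliberately not caught, so they fail fast instead of being retried.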

Security
(Diagram: HDInsight clusters.)
- Protect data at rest and in motion with secure authentication
- Leverage local security policies and features
- Employ a robust auditing and monitoring process
- Remove non-essential sensitive data
- Encrypt essential sensitive data

Serialization and compression
- Azure SDK (available from NuGet): tools for Avro serialization and compression
- Codec supplier tools: tools provided by the codec supplier, e.g. GZip and BZip2 for compression
- .NET Framework: use the classes in the .NET Framework to perform GZip and DEFLATE compression on your source files
- HDInsight compression libraries: create a query job that is configured to write output in compressed form using one of the built-in codecs

Format    Codec                                        Extension  Splittable
DEFLATE   org.apache.hadoop.io.compress.DefaultCodec   .deflate   No
GZip      org.apache.hadoop.io.compress.GzipCodec      .gz        No
BZip2     org.apache.hadoop.io.compress.BZip2Codec     .bz2       Yes

Note: BZip2Codec is not enabled by default in configuration.
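
As an illustration of compressing a source file before upload, here is a minimal sketch using Python's standard `gzip` module. The slide itself names the .NET Framework classes for this job, so Python is a swapped-in stand-in, and the helper name `gzip_file` is hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(source: Path) -> Path:
    """Write a GZip-compressed copy of `source` next to it and return its path.

    The ".gz" extension matches the GZip row in the codec table above.
    The original file is left in place.
    """
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        # Stream in chunks so large log files are never fully held in memory.
        shutil.copyfileobj(src, dst)
    return target
```

Per the table, a `.gz` file is not splittable, so very large inputs may be better divided into several files before compression, or compressed with BZip2 instead.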

Interactive data ingestion
- UI-based tools: Cloudberry Explorer, Storage Explorer
- Command line tools: the hadoop dfs -copyFromLocal [source] [destination] command; AzCopy to upload large files
- Codec supplier tools: GZip and BZip2
- PowerShell commands: take advantage of the Azure PowerShell cmdlets

Handling streaming data
- Apache Storm on HDInsight: open-source framework that runs on a Hadoop cluster to capture streaming data
- Azure Stream Analytics: stream processing service on Azure offering ease of use to consume and process event streams
- Microsoft StreamInsight: complex event processing (CEP) engine with a framework API for building apps that consume and process event streams
- Custom event or stream capture solution: feeds data into the cluster data store in real time or in batches
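
The "real time or in batches" choice for a custom capture solution can be sketched as a small buffer that flushes when it fills. This is an assumption-laden illustration: `BatchingWriter` and `write_batch` are hypothetical names, and `write_batch` stands for whatever code actually writes to the cluster data store.

```python
class BatchingWriter:
    """Collect events and hand them to write_batch() in groups.

    With max_events=1 every event is written immediately (real time);
    larger values trade latency for fewer, bigger writes.
    """

    def __init__(self, write_batch, max_events=100):
        if max_events < 1:
            raise ValueError("max_events must be at least 1")
        self._write_batch = write_batch
        self._max_events = max_events
        self._buffer = []

    def add(self, event):
        self._buffer.append(event)
        if len(self._buffer) >= self._max_events:
            self.flush()

    def flush(self):
        """Write any buffered events; call this on shutdown so none are lost."""
        if self._buffer:
            self._write_batch(list(self._buffer))
            self._buffer.clear()
```

A production version would usually also flush on a timer, so that a quiet stream does not hold events in memory indefinitely.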

Loading relational data
- Sqoop: extract required data from a table, view, or query in the source database and save the results as a file in your cluster storage
- Interfaces that support connectivity to big data clusters
- PolyBase: Microsoft Analytics Platform System (APS) contains PolyBase, to expose a SQL-based interface for accessing data stored in Hadoop and HDInsight
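
To make the Sqoop step concrete, the sketch below assembles a `sqoop import` command line in Python. It only builds the argument list and does not run Sqoop. The function name, host, database, table, and target directory are hypothetical placeholders; `--connect`, `--table`, `--target-dir`, and `--num-mappers` are standard Sqoop import options.

```python
import shlex


def sqoop_import_args(jdbc_url, table, target_dir, mappers=1):
    """Return the argv list for importing one table into cluster storage."""
    if mappers < 1:
        raise ValueError("mappers must be at least 1")
    return [
        "sqoop", "import",
        "--connect", jdbc_url,        # JDBC connection string of the source database
        "--table", table,             # table (or view) to extract
        "--target-dir", target_dir,   # where the result files are written
        "--num-mappers", str(mappers),
    ]


if __name__ == "__main__":
    args = sqoop_import_args(
        "jdbc:sqlserver://example-host:1433;databaseName=Sales",  # hypothetical source
        "Orders",                                                 # hypothetical table
        "/data/orders",                                           # hypothetical path
    )
    # shlex.join quotes the ";" in the JDBC URL so the line is safe to paste into a shell.
    print(shlex.join(args))
```

Keeping the arguments as a list, rather than one string, avoids quoting problems if the list is later passed to `subprocess.run`.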

HDInsight Clusters

HDInsight cluster architecture
(Diagram placeholder; labels recovered from the figure.)
- Azure VNet
- HTTP traffic: ODBC/JDBC, WebHCatalog, Oozie, Ambari
- Secure gateway: AuthN, HTTP proxy
- Highly available head nodes
- Worker nodes
- Azure Storage

HDInsight service entry points
(Diagram placeholder: entry points into the HDInsight cluster.)
- Entry points: Remote desktop, Oozie, REST, Command line, SDK, ODBC, Query console, PowerShell, Excel, Visual Studio plugin
- Workloads: Hive, Pig, M/R

Cluster Customizations

Cluster customization options
- Via Azure portal: Hive/Oozie Metastore, Storage accounts, ScriptAction
- Via scripting / SDK: Config values, JAR file placement in cluster
- Ad hoc: RDP to cluster, update config files (nondurable)

HDInsight cluster provisioning states
(Flow diagram placeholder; states in the order recovered from the figure.)
- Ready for deployment
- Accepted
- Cluster storage provisioned
- AzureVM configuration
- Customize cluster? (decision: Yes / No)
- Cluster customization (custom script running)
- Configuring HDInsight
- Cluster operational
- Running
- Timed Out
- Error
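
A script that waits for a cluster to come up needs to tell in-progress states from final ones. The sketch below groups the state names from the slide that way. The grouping is an assumption (in particular that "Running" is the final success state and "Cluster operational" precedes it), and `classify_state`, `wait_for_cluster`, and `get_state` are hypothetical names; `get_state` stands for whatever call reports the current state.

```python
import time

# Assumed grouping of the state names shown on the slide.
SUCCESS_STATES = {"Running"}
FAILURE_STATES = {"Error", "Timed Out"}
IN_PROGRESS_STATES = {
    "Ready for deployment", "Accepted", "Cluster storage provisioned",
    "AzureVM configuration", "Cluster customization",
    "Configuring HDInsight", "Cluster operational",
}


def classify_state(state):
    """Map a provisioning state name to 'succeeded', 'failed' or 'in progress'."""
    if state in SUCCESS_STATES:
        return "succeeded"
    if state in FAILURE_STATES:
        return "failed"
    if state in IN_PROGRESS_STATES:
        return "in progress"
    raise ValueError(f"unknown provisioning state: {state!r}")


def wait_for_cluster(get_state, poll_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Poll get_state() until a final state is reached or max_polls is used up.

    Returns (outcome, states_seen); outcome is 'gave up' if no final
    state appeared within max_polls polls.
    """
    seen = []
    for _ in range(max_polls):
        state = get_state()
        seen.append(state)
        outcome = classify_state(state)
        if outcome != "in progress":
            return outcome, seen
        sleep(poll_seconds)
    return "gave up", seen
```

Returning the list of states seen makes it easy to log which stage a failed deployment had reached.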

Visual Studio Tooling

Visual Studio tooling
- Ships with the Azure SDK
- Supported for VS 2012, 2013 and 2015
- Enables Hive query authoring, submission and debugging

Features
- Navigate Linked Resources
- Create Hive Tables
- Run Hive Queries
- View Hive Jobs
- Hive Script Local Validation
- IntelliSense Support for Hive (Preview)
- Table creation and schema management are also supported

Get started today! For more information visit: http://azure.microsoft.com/en-us/services/hdinsight/
2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation.