Operational Excellence in IT Service Management
    Mehmet Özgür Depren
    Technical Sales Manager, IBM Middleware
The Next IT Operations Focus: Big Data
    “Focus on operational objectives has seen significant uptick since 2013”
IBM Continues to Invest Heavily in Analytics
    More than 17B in acquisitions since 2005, more than any other company.
    Most comprehensive portfolio, from business to IT analytics, while most other vendors offer only point solutions.
    C&SI’s suite of analytics products leverages best-of-breed capabilities from across all of IBM’s portfolio.
    Timeline graphic, 2015 back to 2005: Social Analytics/Consumer Insight, Workload Optimized Systems, Advanced Case Management, Content Analytics, Decision Management, Stream Computing, Pervasive Content, pureScale, pureXML, Deep Compression, Developer Productivity, Autonomic Operations.
IT Operations Analytics Solves New Challenges
    Reducing and preventing outages and slowdowns for the 24/7 application world
    Diagram labels: The Network, End users, Web Servers, Devices, Databases, App Servers
    IT Operations Analytics can help:
    1. Never set performance thresholds manually again
    2. Identify potential issues before customers are impacted
    3. Isolate the problem through analysis of all your IT data
Understanding IBM Operations Analytics
    Business outcomes: Proactive Outage Avoidance, Faster Problem Resolution, Optimized Performance
    Capabilities:
    Predict: Predict problems before they occur
    Search: Search quickly across massive amounts of data
    Optimize: Optimize across your IT app infrastructure
    Diagram labels: Operations Analytics, IBM Big Data Platform, Streams, IBM or 3rd Party Solutions, Operational Environment, Application Performance, SPSS, Cloud Insights, InfoSphere BigInsights, Rave, Watson, Documentation, System & Log Monitoring, Transactions, Assets & Workorders, Alerts, Alarms & Events, Applications, Systems, Workloads, Wireless, Network, Voice, Security, Mainframe, Storage, Assets
IBM Solution for IT Operations Analytics
    Our Capabilities
    Predict: Predict problems before they become service impacting
    Search: Diagnose application and infrastructure issues using all your operational data
    Optimize: Ensure your IT infrastructure is operating as efficiently as possible
    Why IBM?
    60% faster creation of custom, high-impact, mobile-ready operations dashboards
    50% faster application diagnostics
    30% reduction in operator event load
    20% reduction in storage requirements over competitive offerings
    #1 leadership position in Operations Management solutions
    Avoid Outages While Reducing Threshold Management Costs: Consolidated Communications detects 100 percent of its major incidents, including silent failures, and eliminated the human-intensive task of managing manual thresholds, saving 300,000 annually.
    Resolve Problems Faster: Barclays Bank was able to search and diagnose problems 60% faster to quickly resolve application and infrastructure issues. In addition, they identified customer patterns from log data and applied this to channel intelligence.
    Improve Operational Efficiency: Advanced event analytics has allowed Claranet to reduce the number of trouble tickets and focus more time and resources on what truly matters to their customers.
IBM Operations Analytics – Predictive Insights
    Challenge: Reacting to performance thresholds is not enough. IT staff must become proactive to ensure mission-critical apps never go down.
    Automated Threshold Maintenance: No complex manual intervention to set up and maintain, with 5 times faster processing
    Anomaly Detection: Alerting before potential issues become service impacting, enabling IT to shift from reactive to proactive
    On-Prem and SaaS: Predictive Insights is now available as a service, providing additional value to our Performance Management solutions
    Supports Heterogeneous Environments: Out-of-the-box integrations to IBM APM/ITM or 3rd-party monitoring solutions
Why aren’t operations teams proactive today?
    Too much data to analyze manually.
    Existing analytic techniques, such as standard thresholds, are not up to the task: they cannot detect problems while they are emerging (before business impact).
    Set the performance threshold too high and there is insufficient warning before total failure.
    Set the performance threshold too low and there is too much noise, so everything is ignored.
    If there is no early detection before the outage, operations teams can only react once the outage is already in effect and the business is already losing money.
Learn relationships between metrics without static thresholds
    Predictive Insights learns the normal historical range of a metric.
    It raises an alarm if the metric falls outside this range.
    Watson DNA inside
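The slide does not say how Predictive Insights computes the normal range, so the following is only a minimal sketch of the general idea of a learned baseline. It assumes a simple mean plus or minus k standard deviations over the history; the function names, the value of `k`, and the sample data are all hypothetical.

```python
from statistics import mean, stdev


def learn_normal_range(history, k=3.0):
    """Learn (low, high) bounds from historical samples.

    Assumption: "normal" is modelled as mean +/- k standard deviations.
    This is a stand-in for illustration, not the product's algorithm.
    """
    mu = mean(history)
    sigma = stdev(history)
    return mu - k * sigma, mu + k * sigma


def is_anomalous(value, bounds):
    """Return True when the value falls outside the learned range."""
    low, high = bounds
    return value < low or value > high


# Hypothetical response-time samples (ms) gathered during normal operation.
history = [52, 48, 50, 51, 49, 50, 53, 47, 50, 50]
bounds = learn_normal_range(history)

print(is_anomalous(51, bounds))  # inside the learned range
print(is_anomalous(90, bounds))  # far outside the learned range
```

Because the bounds come from the data, nobody has to pick a static threshold per metric; re-running `learn_normal_range` on fresh history adjusts them.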
European Telco – Flatline
    Stopped (crashed) application: regular load absent
    Targeting Situation Detections
    Customer Relationship Management system for a large telco. 100 applications monitored by a Compuware system (40 million metrics).
    In this example, the regular load on one of the servers has changed, indicating an application problem.
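A flatline of this kind can be sketched as a run of samples that stops varying. The example below is an illustration under stated assumptions, not the detection logic of the product: `find_flatline`, its `window` and `tolerance` parameters, and the load figures are all hypothetical.

```python
def find_flatline(samples, window=5, tolerance=0.0):
    """Return the start index of the first run of `window` consecutive
    samples whose spread (max - min) is within `tolerance`, else None."""
    for start in range(len(samples) - window + 1):
        chunk = samples[start:start + window]
        if max(chunk) - min(chunk) <= tolerance:
            return start
    return None


# Hypothetical per-minute request counts: regular load, then a crash.
load = [40, 42, 38, 41, 39, 0, 0, 0, 0, 0, 0]

print(find_flatline(load))
```

A static "too high" threshold would never fire here, since the value drops rather than rises; checking for missing variation catches the silent failure.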
European Gambling Website – Adaptive Threshold
    High disk latency
    Automated Dynamic Thresholds and Early Detection
    A gambling website application monitored by HP. In the run-up to a busy sporting event, traffic increased, causing stress on the system and a negative customer experience. Using PI, the latency issue could have been detected early and tackled to avoid this.
Large US Bank – Adaptive Threshold
    Connection leak
    Automated Dynamic Thresholds and Early Detection
    These are WebSphere metrics taken from the CA Wily performance management system. The number of actual connections to the WebSphere application server has increased dramatically. The poolSize and bytesInUse metrics are also affected, indicating either increased demand or a problem with connections not being freed up.
    Insight: poolSize and bytesInUse on the same node are also behaving anomalously at the same time and are related to each other.
European Bank – Significant Trend
    Disk thrashing
    Targeting Situation Detections
    File server under stress as file control operations and bytes per second increase. This sudden change can be traced back to a patch that was applied.
A sample of technologies Predictive Insights integrates with
    IBM ITM/TDD & IBM APM
    IBM OMEGAMON
    HP BAC, Topaz
    IBM TNPM
    Aircom Optima
Predictive Insights as a Service
    Performance Management, Predictive Insights
    Integrated threshold automation and maintenance
    Anomaly detection
    Get ahead of potential application and resource outages
    Learn, Explore, and Try
    Continuous Delivery
IBM Operations Analytics – Log Analysis
    Challenge: Diagnosing service problems in applications and the infrastructure supporting them involves quickly analyzing very large amounts of both structured and unstructured data.
    Breadth of Searchable Data: Search across all of your IT operational data to quickly resolve issues
    Expert Advice: Any competitor can isolate problems. IBM helps clients quickly resolve them.
    Mainframe Support: Search System z (zLinux & zOS) logs in addition to all your other data
    Embedded Analytics: Out-of-the-box integrations to IBM APM/ITM or 3rd-party monitoring solutions
IBM Operations Analytics – Log Analysis
    Collects large volumes of structured and semi-structured data and transforms it through analytics into actionable intelligence.
    Diagram labels: Collect (Logs, Metrics, Events, Documentation), Normalize, Consolidate, Insight Packs, Search and Visualize
    Users: IT Operations, App Support, Service Desk
Application owner: I got a trouble ticket on my application. I want to quickly find the root cause, fix it, and restore the app/service ASAP.
    Current challenge: a large volume of data to collect and analyze, with manual correlation taking hours or days to find the root cause of the problem. Logs for the problem window cannot always be found. Highly dependent on SME skills. It's an art.
    Data sources: core files, logs, traces, events, metrics, transactions, config
    Sample log line: [10/9/12 5:51:38:295 GMT 05:30] 0000006a servlet E com.ibm.ws.webcontainer.servlet.ServletWrapper service SRVE0068E:
Application owner: I got a trouble ticket on my app. I want to quickly find the root cause, fix it, and restore service ASAP.
    Solution: IBM Operations Analytics – Log Analysis can provide insights from all data in clicks. The app owner can search through the data and leverage dashboards to find the root cause in minutes.
    Diagram labels: IBM Operations Analytics Log Analysis, metrics, expert knowledge, events, tickets, logs, transaction details from App DB
    Sample log line: [10/9/12 5:51:38:295 GMT 05:30] 0000006a servlet E com.ibm.ws.webcontainer.servlet.ServletWrapper service SRVE0068E: Uncaught exception created in one of the service methods of the servlet TradeAppServlet in application DayTrader2-EE5. Exception created : javax.servlet.ServletException:
    Sample transaction details (columns: Tx#, date, status):
    108978 23-Jul-2013 started
    108978 23-Jul-2013 To IN
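To make the idea of turning a raw log line into searchable fields concrete, here is a small sketch that splits the WebSphere sample line from the slide into named fields. The field layout and the message-id shape (four letters, four digits, one letter) are inferred from that single sample, so treat both as assumptions. This is not how the product's Insight Packs are implemented, and `parse_log_line` is a hypothetical helper.

```python
import re

# Field layout inferred from the one sample line on the slide (assumption).
LOG_LINE = re.compile(
    r"\[(?P<timestamp>[^\]]+)\]\s+"      # bracketed timestamp
    r"(?P<thread>[0-9a-fA-F]{8})\s+"     # 8-digit hex thread id
    r"(?P<component>\S+)\s+"             # short component name
    r"(?P<level>[A-Z])\s+"               # one-letter severity
    r"(?P<source>.*?)\s*"                # class and method
    r"(?P<code>[A-Z]{4}\d{4}[A-Z]):\s*"  # message id such as SRVE0068E
    r"(?P<message>.*)"                   # free-text message
)


def parse_log_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


sample = (
    "[10/9/12 5:51:38:295 GMT 05:30] 0000006a servlet E "
    "com.ibm.ws.webcontainer.servlet.ServletWrapper service SRVE0068E: "
    "Uncaught exception created in one of the service methods of the "
    "servlet TradeAppServlet in application DayTrader2-EE5."
)

record = parse_log_line(sample)
print(record["level"], record["code"])
print(record["message"])
```

Once lines are reduced to fields like `code` and `level`, searching for every occurrence of one message id in a problem window becomes a simple filter rather than a manual read-through.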
Out of the Box Insight Packs (IBM Provided)
    IBM WebSphere Application Server
    IBM DB2
    Web Access Logs
    Windows Events
    SysLog
    Java Core
    IBM MQ Series
    IBM Integration Bus (Message Broker)
    Delimiter Separated Value (DSV) log files
    Partner Provided – Microsoft SharePoint, Microsoft Exchange, Microsoft SQL Server, Microsoft Active Directory
    Tivoli Storage Manager
    IBM Systems Disk Storage 8000
    IBM AIX Errpt
    IBM HTTP Server
    HP LiveSite, HP TeamSite
    Oracle Database
    VMware ESXi
    Oracle Siebel
    https://developer.ibm.com/itoa/
IBM Netcool Operations Insight
    Modern dashboards, fully mobile: Visualize the performance and health of your entire operations environment.
    Out-of-the-box integration
    98% reduction in critical events: 22 critical and 100 major events per week
    Improved focus and utilization of first- and second-line staff
    Analytics to increase event value:
    v1.1: 30% reduction in events to Operations
    v1.2: Almost 50% reduction in repeating events
    v1.3: 90% reduction for known event classes
Event Analytics – Seasonal Event Identification
    Improve efficiency by identifying and resolving recurring problems.
    Large bank: 7% of Priority 1 tickets, and 30% of lower-severity tickets, were raised by events that were highly seasonal.
    A report on event history identifies seasonal events, sorted by confidence level and frequency.
    Drill-down shows the time distribution of events so that peaks can be investigated.
    Thresholds can be better aligned to seasonal peaks, reducing events.
Seasonality Analysis of Events
    1. MS SCOM Health Service Heartbeat failures often happen on Sundays at 06:00, probably due to regular maintenance.
    2. A specific Oracle database is not accessible every day at 21:00, probably due to a daily restart or backup.
    3. A node gives file system alerts every day around 01:00, probably due to a daily batch job.
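Patterns like the three above can be surfaced by counting how an event's occurrences fall into weekday and hour buckets. The sketch below assumes a very simple score, the share of occurrences in the busiest bucket, as a rough stand-in for the "confidence level" the slides mention; the product's actual scoring is not described in the deck. `seasonal_profile` and the timestamps are hypothetical.

```python
from collections import Counter
from datetime import datetime


def seasonal_profile(timestamps):
    """Return ((weekday, hour), share) for the busiest weekday/hour bucket.

    weekday follows datetime.weekday(): Monday is 0, Sunday is 6.
    share is the fraction of all occurrences that fall in that bucket.
    """
    buckets = Counter((ts.weekday(), ts.hour) for ts in timestamps)
    bucket, count = buckets.most_common(1)[0]
    return bucket, count / len(timestamps)


# Hypothetical occurrences of one heartbeat-failure event.
heartbeat_failures = [
    datetime(2015, 3, 1, 6, 2),     # Sunday
    datetime(2015, 3, 8, 6, 0),     # Sunday
    datetime(2015, 3, 15, 6, 5),    # Sunday
    datetime(2015, 3, 22, 6, 1),    # Sunday
    datetime(2015, 3, 18, 14, 30),  # Wednesday, off-pattern
]

bucket, share = seasonal_profile(heartbeat_failures)
print(bucket, share)
```

A high share concentrated in one bucket suggests a scheduled cause such as maintenance or a batch job, which is a candidate for suppression or a time-aware threshold rather than a ticket.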
Related Events Grouping
    Relationships I know about: Known Event Analysis
    Grouping and correlation providing powerful situation management of active events
    Out-of-the-box domain expertise for known event relationships
    Vendor and technology dependent
    Significant reduction of incidents presented to the operator
    Extendable by Business Partners and clients with no coding required
Event Analytics – Related Event Analytics
    Relationships I don’t know about
    Improve efficiency: reduce actionable events by grouping events that always occur together
    Automatic detection of event clusters
    Leverages machine learning to analyze the historical event archive and identify groups of events that always occur together
    Presents the identified relationship to the Administrator
    Presents proposed automated actions: Watch, Deploy, Archive or Do nothing
    Groups events in the Event Viewer
    “It is very beneficial to have a tool that can turn historical event data into an event group with a single root event. It helps us turn the data into logic”
    Increase operator efficiency by up to 90% with out-of-the-box alert reduction and advanced alert analytics
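The deck does not describe the machine-learning method used, so the following is only a simplified illustration of "events that always occur together". It assumes plain co-occurrence counting over fixed time windows. `related_event_pairs`, its parameters, and the event names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def related_event_pairs(events, window_seconds=300, min_support=3):
    """Find pairs of event types that always appear in the same time window.

    events: iterable of (epoch_seconds, event_type) tuples.
    A pair qualifies when both types were seen in at least `min_support`
    windows and neither was ever seen without the other.
    """
    # Bucket event types into fixed, non-overlapping time windows.
    windows = defaultdict(set)
    for ts, event_type in events:
        windows[ts // window_seconds].add(event_type)

    seen = defaultdict(int)      # windows containing each type
    together = defaultdict(int)  # windows containing both types of a pair
    for types in windows.values():
        for event_type in types:
            seen[event_type] += 1
        for pair in combinations(sorted(types), 2):
            together[pair] += 1

    return sorted(
        pair for pair, n in together.items()
        if n >= min_support and n == seen[pair[0]] == seen[pair[1]]
    )


# Hypothetical event archive: (epoch seconds, event type).
archive = [
    (0, "LinkDown"), (10, "BGPPeerLost"),
    (1000, "LinkDown"), (1020, "BGPPeerLost"), (1030, "DiskFull"),
    (2000, "LinkDown"), (2005, "BGPPeerLost"),
    (3000, "DiskFull"),
]

print(related_event_pairs(archive))
```

Fixed windows keep the sketch short but can split a related pair that straddles a window boundary; a sliding window would be more robust at the cost of extra bookkeeping.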
Future of Service Management
    Diagram labels: Visibility, Control, Automation, Real-time Analytics and Visualization, Problem Isolation, Data Correlation, Outage Avoidance, Integration, Optimization, Insight & Care, Predictive Analytics
Thank You