Mapping, Monitoring, and Metrics Dr. Natheer Khasawneh. Sara Ismail.
17 Slides127.73 KB
Mapping, Monitoring, and Metrics Dr. Natheer Khasawneh. Sara Ismail.
This chapter will cover . The chapter specifies what Data Center-related information should be documented and maintained and how such data is helpful for managing rooms, troubleshooting during emergencies, and planning future Data Center expansions. The chapter also suggests inexpensive tools that can be used to monitor a server environment and recognize problems before they affect the systems contained within.
Documenting the Data Center To help simplify your management of these rooms, document as much information about them as possible. The more details you collect and maintain about a Data Center, the fewer mysteries that can arise and trigger unanticipated problems or delays. Cabinet locations, electrical and data infrastructure, server names, and installed applications are all key details worthy of keeping track of. There are several choices for how to archive this data. One option is a maintained Data Center handbook, filled with reference materials pertaining to the room. Even more effective is the information posted on a company intranet site. Whenever alterations are made to your server environment, have those changes reflected in the documentation for the room. Data Center map must be kept current at all times, For Data Center details that change frequently, update information on a regular basis, such as monthly or quarterly.
Documenting the Data Center(cont ) Floor Plan: One of the more powerful documents to have for a Data Center is a map of the room. At a minimum, an accurate map shows physical clearances, cabinet locations, the placement of major infrastructure, and the Data Center's numbering scheme. This information is helpful when allocating space for incoming servers and crucial when the time comes to expand the server environment.
Documenting the Data Center(cont ) As-Built: As part of the design package issued for the construction of your Data Center, require the respective cabling and electrical contractors to provide as-built blueprints of the room. An as-built is just what it sounds like—a document showing specific Data Center infrastructure as it was built. A cabling as-built shows the physical paths of all structured cabling and provides termination details. An electrical as-built shows the equivalent information for electrical infrastructure—conduit paths, how many and what types of receptacles, and which circuits specifically terminate where. Many changes, big and small, often happen during the construction of a server environment. As-built documents incorporate all of these and show how a room truly is.
Documenting the Data Center(cont ) Server Inventory: Once a Data Center is operational, inventory its servers, networking devices, and other equipment on a regular basis. Include the name, make and model of machine, and corresponding cabinet location in the room. Follow the same Data Center numbering scheme that you use for cable runs and electrical schedules. inventorying Data Center equipment keeps you in touch with what items are flowing in and out of the room over time. This can help you identify equipment trends, alerting you to changes that need to occur to your existing infrastructure. Consider recording additional physical details about your Data Center equipment as well. Inventorying servers might even save your company money. Store inventory information in an online database.
Documenting the Data Center(cont ) Applications: Other valuable data to inventory are the applications running on each server within the Data Center. This information is useful for two reasons: if you are going to perform work on a machine that hosts a particular application, in your change request you can accurately define all servers that are going to be affected by the scheduled downtime. if an application fails unexpectedly you can quickly determine the scope of the problem and what specific servers are affected. They typically span multiple machines, it is frequently impossible to isolate applications to a particular section of the room. Be aware that application information can be more difficult to obtain and keep current than a physical inventory of servers. That's because applications are added to, upgraded on, or removed from machines more frequently than devices are physically relocated.
Documenting the Data Center(cont ) Processes: Useful processes to document include: Access and change management policies— Instructions for how to gain access to the Data Center . Service level agreements (SLAs)— Involving Data Center-related clients, support organizations, and vendors. An SLA is a contract between someone who is hired to perform a task or service and a customer, specifying the measurable functions and services they are to provide. Server installation guidelines— Spell out for Data Center users how they can most effectively install their incoming equipment. Equipment move procedures— If your business is prone to relocating servers from one Data Center to another, perhaps due to acquiring another company, it is helpful to have some basic instructions on hand. Features and Philosophies: Last, consider documenting and publishing details about your Data Center's infrastructure as well as the design philosophies behind it.
Monitoring from Afar For that real-time information, you need tools that actively monitor the room. The greater the ability you have to "see" in to your Data Center without having to physically be there, the easier it is to manage. Web cameras: A great way to tell what's happening in your Data Center is to deploy web cameras that leverage the room's network. For the small expense of one or two web cameras per Data Center, you can instantly see the condition of the room and know the status of its most vital infrastructure systems, all from any computer connected to your company's internal network. Amperage Meters: An additional method of keeping an eye on your Data Center is having your server cabinet power strips equipped with amperage meters. These devices display the amount of electrical load that is put upon them. This tells a Data Center user how close they are to reaching the maximum electrical capacity of a power strip.
Monitoring from Afar(cont ) It also helps with efforts to balance power within a server cabinet. If someone is installing a server with a single power feed, they can check which of a server cabinet's two power strips is carrying the lesser electrical load and plug in to that one. Temperature Sensors: useful thing to know about your Data Center is how hot or cold it is. Monitoring the temperature of the room can alert you to a malfunctioning air handler, air flow problems, or hot spots that are forming due to increased server density at a particular cabinet location. Many servers and networking devices also enable you to check their internal temperature by entering a certain command. Humidity Sensors: Humidity is generally monitored and controlled by Data Center air handlers. If a server environment is having problems with humidity— condensation or corrosion from too much moisture in the air or static from not enough—humidity sensors can help diagnose the problem.
Gathering Metrics Other information useful to have about a Data Center is metrics— measurements taken regularly to determine how the room functions over time. There are a lot of data points that can be collected about a server environment: Maintaining an Incident Log: To get some perspective on the performance of your server environment and the incidents that happen in and around it, keep a log of Data Center-related events. Record the time, date, and major details of notable occurrences. Also note incidents in which things go right and downtime or a catastrophic event is avoided: when utility power fails but the Data Center runs interrupted thanks to its standby generator. An incident log that thoroughly tracks Data Center events can be extremely valuable for upper management. Such a log provides them with real-world information about the threats posed to company servers and what infrastructure and processes are (or aren't) in place to protect that equipment.
Gathering Metrics(cont ) Here are several useful categories to separate Data Center-related incidents into: Commercial Power (CP)— An interruption in the power that is normally provided to the Data Center by a utility source. Connectivity (CO)— A disruption in data connections, either in the external structured cabling that feeds the company site or those within the Data Center. Mechanical—HVAC (AC)— An incident related to the Data Center's cooling system. Mechanical—Power (MP)— An incident related to the Data Center's primary or standby electrical infrastructure. Miscellaneous (MI)— Events that are worth noting but don't fall in to any other categories. Perhaps a false alarm in the fire suppression system or a problem with the room's physical access controls, for example. Water Leak (WL)— An incident in which unwanted moisture enters the server environment.
Gathering Metrics(cont ) Even more important than knowing what happened in a server environment is understanding the cause of the incident. Here are some typical causes of Data Center-related incidents: External (EX)— External causes are those that originate away from your company site. Such as: Utility power failures, damage to the structured cabling, or an earthquake. Human Error (HU)— Human error applies to incidents that occur because a person made a mistake rather than the failure of a physical component. Such as: Powering down the wrong electrical circuits, or inappropriately pressing an emergency power off button. Mechanical (ME)— A mechanical cause is the malfunction of infrastructure at the company site. A belt breaking within an air handler, or a standby generator not engaging when it is supposed to. Structural (ST)— The rarest of causes are those related to a building's structural integrity. Examples of this are a roof leak or the buckling of a Data Center floor.
Gathering Metrics(cont ) Availability Metrics: Availability: the degree to which a Data Center is online. Measuring your Data Center's availability therefore goes a long way toward evaluating its contribution to the success of your business. Availability metrics can also justify the expense of additional Data Center infrastructure, either when designing a new room or when upgrading an existing one. Example: your server environment was designed and built with the goal of achieving 99.99 percent availability, track the number of outages that occur over a significant time period, perhaps annually, to determine what its availability has turned out to be. You can calculate your Data Center's availability by using the following formula: (TIME—OUTAGES) TIME Percentage of Availability
Gathering Metrics(cont ) TIME is the total number of minutes in a defined time period and OUTAGES is the cumulative number of minutes that a Data Center was offline during that period. For instance, say a Data Center was offline for 20 minutes over the course of a 30-day month. There are 43,200 minutes in that month (30 days x 24 hours in a day x 60 minutes in an hour 43,200 minutes). Being online for all but 20 minutes translates to: (43,200—20 min.) 43,200 min. 99.95 percent availability. By keeping track of the lengths of outages throughout the year, you can calculate availability for any time period—monthly, quarterly, or annually. Example: Say that your company has four Data Centers, two that are 5000 square feet in size, one that is 10,000 square feet, and one that is 30,000 square feet, for a total of 50,000 square feet. Say that the Data Center with a 20-minute outage and 99.91 percent availability is one of the small rooms—5000 square feet. If the other three rooms all stayed on line for the entire month, what's the cumulative availability for all 50,000 square feet of Data Center space?
Gathering Metrics(cont ) The formula then becomes: ((SIZE1 * (TIME-OUTAGES1)) (SIZE2 x (TIME-OUTAGES2)) (SIZE3 * (TIME-OUTAGES3)) (SIZE4 x (TIME-OUTAGES4)) (TOTAL SIZE * TIME) Plugging in the monthly statistics for the four Data Centers, with the smallest having 20 minutes of downtime, you get the following: ((5000 sq. ft. * (43,200—20 min.)) (5000 sq. ft. * 43,200 min.) (10,000 * 43,200 min.) (30,000 sq. ft. * 43,200 min.)) (50,000 sq. ft. * (43,200 min.) 99.995 percent availability.
Gathering Metrics(cont ) Other Useful Data: Cabinet occupancy— How quickly are Data Center cabinet locations filling up? Consumable usage— How many server cabinets, cabinet shelves, and patch cables are used each quarter? This information is helpful for maintaining proper inventory amounts and future budgeting. Supplies and vendors— Document the items you stock in your server environment and the vendors who provide them. Include both everyday consumables (i.e., patch cords and server cabinets) and those items needed to complete a Data Center when it is first built (i.e., storage bins, signage materials, and floor tile pullers). Major infrastructure changes— Then and now" comparisons can be very illustrative. The information can also be useful when future retrofit projects are planned. Data Center trivia— What's the biggest piece of equipment in the Data Center? The smallest? How long does it take to install a typical server? Such trivia might not help you manage the room, but it can be powerful when explaining Data Center challenges.