Configuring Operations Center (and other products) in a High Availability (HA) configuration tends to confuse people. I think the confusion starts with the basic requirement: an application or service needs to be configured with some level of Fault Tolerance (FT) and/or HA in order to reduce the possibility of outages (the system not being available to end users). FT helps with HA, but FT is not HA (more on that later).
Fault Tolerance (FT) is more about configuring the hardware of a system so that, in the event of a component failure (e.g., a hard drive), the system automatically recovers and continues to operate without intervention. The most common example involves hard drives and "RAID" (Redundant Array of Independent Disks). Other options include dual power supplies, redundant NICs, etc.
- RAID 0 sets up a system with multiple hard drives and stripes the data across all of them. The basic concept is, if you have three hard drives and were writing a text file with the word "THE" in it, the "T" would go to the first drive, the "H" to the second and the "E" to the third... all at the same time (real implementations stripe in larger blocks rather than single characters). This is more of a performance option than an FT option. If one drive fails, the entire RAID 0 volume is no longer available. This configuration is typically not used when an HA (or FT) solution is required.
- RAID 1 is mirroring. Using the example above, the word "THE" is written in full to both disks. RAID 1 comes in both software and hardware implementations, with hardware typically performing best. If one drive fails, the other takes over. This is the most basic FT solution selected by administrators. It requires at least two drives. It can become complicated and costly when several drives are needed for storage space, because each of them then needs a redundant drive for mirroring.
- RAID 2 is similar to RAID 0 in that data is striped across the drives (at the bit level), except it adds error-correcting data (a Hamming code) on dedicated drives. This data is used in the event one of the other drives fails. When a system has a failed drive and data is being reconstructed from the redundancy information, the system stays up, but runs slower. RAID 2 is rarely used in practice.
- RAID 3 is similar to RAID 2 except that it stripes at the byte level and uses simple parity on a single dedicated drive.
- RAID 4 is the same as RAID 5 (discussed below), but still utilizes a single dedicated parity drive like RAID 3.
- RAID 5, one of the more popular choices for an FT solution, distributes the data as well as the parity across all drives (i.e., there is no dedicated parity drive). In this configuration, any single drive can fail and the volume will continue to operate. It requires a minimum of three disks, and only one can fail at a time. Most RAID 5 solutions allow hot replacement of a failed drive and a rebuild without taking the system offline.
- RAID 6 is similar to RAID 5 but stores two sets of parity, so up to two drives can fail and the volume continues to operate. It requires a minimum of four disks.
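To make the parity idea a little more concrete, here is a minimal sketch of how XOR parity lets a lost block be rebuilt from the surviving ones. This is only an illustration of the principle; it is not how any particular RAID controller is implemented, and the function name and sample data are made up for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, as if striped across three drives (illustrative values).
data_blocks = [b"THE ", b"QUIC", b"K FO"]

# The parity block is the XOR of all data blocks.
parity = xor_blocks(data_blocks)

# Simulate losing the second drive, then rebuild its block from the
# surviving data blocks plus the parity block.
surviving = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(surviving)

print(rebuilt)  # b'QUIC'
```

Rebuilding means reading every surviving drive, which is why a volume with a failed drive stays up but runs slower.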
There are several RAID solutions (and options) on the market, both hardware and software based, and each has pros and cons. I can tell you that I have set up RAID 5 for Application Servers and Database Servers using many different vendors and have been very happy with the results.
Configuring a system with dual power supplies, redundant NICs, RAID 5, etc. reduces the possibility of an outage due to a hardware failure, but it is typically not the only thing done for a system that needs 99% uptime or better. HA needs to be taken into account as well (e.g., a Windows Blue Screen or a motherboard failure will still cause an outage).
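It helps to turn an uptime percentage into actual minutes. The sketch below is simple arithmetic, assuming a 365-day year and counting every minute of the year as required uptime (no maintenance windows); the function name is just for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def allowed_downtime_minutes(uptime_percent):
    """Return the minutes of downtime per year permitted by an uptime target."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:,.0f} minutes of downtime per year")
```

At 99% that works out to roughly 87.6 hours a year, which is more room than most people expect; each additional 9 cuts the budget by a factor of ten.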
There are many HA solutions; again, there are hardware options, software options, and combinations of the two. HA solutions are usually described as Hot/Hot, Hot/Warm or Hot/Cold. In a Hot/Warm or Hot/Cold setup, one system is up and running (Hot) and the other is either a close standby (Warm, partially started) or ready to be used (Cold, configured but turned off). If you are trying to get the most uptime, you most likely will not be looking at Hot/Cold or Hot/Warm solutions. There are several solutions from large computer manufacturers that are Hot/Cold; if the Hot system fails (e.g., the motherboard dies or the entire RAID fails), you will have an outage while the Cold or Warm system takes over. This outage may be 5 minutes or 20 minutes; it really depends on the startup time. If your uptime requirements are significant, you will have to invest in additional hardware for a Hot/Hot solution. Before selecting a clustering solution, determine whether it is Hot/Hot, Hot/Warm or Hot/Cold.
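As a rough way to see why takeover time matters, the sketch below counts how many failovers per year would fit inside a downtime budget. It uses the 5 and 20 minute takeover times mentioned above as illustrative inputs, assumes a 365-day year, and assumes failover is the only source of downtime, which is optimistic.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def failovers_within_budget(uptime_percent, takeover_minutes):
    """How many full failovers per year fit inside the downtime budget."""
    budget = MINUTES_PER_YEAR * (1 - uptime_percent / 100)
    return int(budget // takeover_minutes)


for target in (99.9, 99.99):
    for takeover in (5, 20):
        count = failovers_within_budget(target, takeover)
        print(f"{target}% uptime, {takeover} min takeover: {count} failovers per year")
```

With a 20 minute takeover and a 99.99% target, only a couple of failovers use up the whole year's budget.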
For users of Operations Center who require high uptime (e.g., 99.nnn% uptime), I only recommend solutions that leverage Hot/Hot configurations. This means that more than one Operations Center system is configured, up and running, with adapters connected, etc. This approach also requires something at the front door, such as a Load Balancer or Content Switch, to direct users. If configured properly, a single complete system can fail, users are automatically moved over to the secondary Hot system (via the Load Balancer), and the users have no idea there was any type of failure. I had one customer that would "fail" users over from one system to another to do maintenance during core hours (seems odd, but it is an option).
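The "front door" behavior can be pictured with a toy model like the one below. This is only a conceptual sketch of round-robin routing that skips failed nodes; it is not how any vendor's Load Balancer or Operations Center itself works, and the class, method and host names are all hypothetical.

```python
from itertools import cycle


class LoadBalancer:
    """Toy round-robin balancer that skips nodes marked unhealthy."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._ring = cycle(self.nodes)

    def mark_down(self, node):
        # In a real product a health check would do this automatically.
        self.healthy.discard(node)

    def mark_up(self, node):
        if node in self.nodes:
            self.healthy.add(node)

    def route(self):
        """Return the next healthy node, or raise if none are left."""
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


lb = LoadBalancer(["noc-a", "noc-b"])  # hypothetical host names
print(lb.route())  # noc-a
print(lb.route())  # noc-b
lb.mark_down("noc-a")  # simulate a failure, or planned maintenance
print(lb.route())  # noc-b
print(lb.route())  # noc-b
```

Marking a node down on purpose is essentially what the customer above was doing to perform maintenance during core hours.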
Operations Center is supported in a clustered (Hot/Hot) environment. As I indicated above, a Load Balancer (more popular with administrators, from what I have seen) or a Content Switch (less popular) is required. Operations Center is not written in a manner that requires a specific Load Balancer; I have seen customers use several vendors' solutions.
The last thing to consider is how many nodes (Hot systems) to have in the cluster. Operations Center can be configured (this may require custom scripts) so that Nodes in the cluster handle different roles. For instance, you can have one of the Nodes responsible for running the BSCM jobs to build and maintain views while the others are for users to access the data.
This blog is not being written to tell you how to configure an FT and/or HA solution, just to advise you of the technical area and make you aware that the full list of requirements needs to be understood and the different solutions then evaluated to determine the best fit. I recommend getting the Consulting team involved to assist you with determining the options available based on the requirements you have.
- FT is not HA; FT is useful within an entire HA solution.
- Solutions that are Hot/Cold or Hot/Warm are nice, but not full HA
- If administrators need to get involved to get a system failed over to the backup, it is not HA.
- The more 9's you require in uptime (e.g., 99.0% versus 99.9%), the more you should plan on spending on hardware/software.
- Operations Center integrates with many third-party products (e.g., adapters to HP, IBM, etc.) and stores data in a backend database. What are the uptime requirements for those systems? (e.g., If Netcool fails, the adapter will no longer surface alarms; is that considered an outage based on your requirements? If the backend database fails, Operations Center will not allow end users to view historical data; is that considered an outage based on your requirements?)
- Understand the uptime requirements, determine if the budget for the HA solution aligns, review the clustering (HA) solutions and how they work, test the configuration, test failing over (automated and manual).
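The point about third-party systems and the backend database can be put into numbers with a back-of-the-envelope model like the one below. It assumes component failures are independent and that failover between Hot/Hot nodes is instantaneous, neither of which is fully true in practice, and the availability figures are illustrative only, not measurements of any real product.

```python
from math import prod


def serial_availability(*components):
    """All components are required: availabilities multiply."""
    return prod(components)


def parallel_availability(*nodes):
    """Any one node is enough: the service is down only if all nodes are down."""
    return 1 - prod(1 - a for a in nodes)


# Illustrative figures only.
app_cluster = parallel_availability(0.99, 0.99)  # two Hot/Hot nodes
database = 0.999
event_source = 0.995

end_to_end = serial_availability(app_cluster, database, event_source)
print(f"cluster: {app_cluster:.4%}, end to end: {end_to_end:.4%}")
```

Under these assumptions the end-to-end number is always lower than the weakest dependency, no matter how well the cluster itself is built.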
Apr 25 2012, 03:08 PM
Filed under: tips, NetIQ, BSM, Business Service Management, Operations Center, Novell, HP, Content Switch, Netcool, Load Balancer, IBM, RAID, Clustering, Redundant Array of Independent Disks, High Availability, Fault Tolerence, NOC, Tobin Isenberg, Data Center Solutions