End-to-End Server High Availability Whitepaper

1.0 Server Internals

Most servers are designed with a certain level of redundancy in mind. The following topics look at the high availability aspects within a single server.

1.1 Memory

1.1.1 Memory Banks

Memory appears in a system in one of two ways: directly on the motherboard, or on its own planar that connects to the motherboard. In either case, the memory slots on the system are split between two or four 'banks', numbered 0 through 3. Memory modules, whether installed by technicians or independent installers, should be spread evenly across all available banks in the server so that the failure of any specific bank won't cripple the entire server. In general, the Operating System and the CPU can recover from a single memory bank failure without halting the server completely.

1.1.2 Memory SIMM/DIMM modules

Within a bank of memory are the individual SIMM or DIMM modules, which typically break the memory down into 128MB, 256MB, or 512MB chunks per SIMM/DIMM. Beyond size and speed, there are various other kinds of SIMM/DIMM module specifications.

First is whether the memory is Parity or Non-Parity. There are two types of Parity. The first is standard Parity: as data moves through your computer (e.g. from the CPU to the main Memory), errors can occur - particularly in older servers. Parity error detection was developed to notify the user of any data errors. A single bit is added to each byte of data, and this bit is responsible for checking the integrity of the other 8 bits while the byte is moved or stored. Once a single-bit error is detected, the user receives an error notification; however, parity checking only notifies, and does not correct, a failed data bit. If your SIMM/DIMM module has 3, 6, 9, 12, 18, or 36 chips then it is more than likely Parity.
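
The parity mechanism can be pictured with a short sketch. This is not firmware from any particular memory controller, just a minimal Python illustration of even parity over a single byte:

# Illustrative sketch of how a parity bit guards a byte (even parity assumed).
def parity_bit(byte):
    """Return the even-parity bit for an 8-bit value."""
    ones = bin(byte & 0xFF).count("1")
    return ones % 2               # 1 when the count of 1-bits is odd, making the total even

def check(byte, stored_parity):
    """Return True if the byte still matches the parity bit stored with it."""
    return parity_bit(byte) == stored_parity

data = 0b10110010
p = parity_bit(data)              # stored alongside the byte as the 9th bit
corrupted = data ^ 0b00000100     # a single bit flips in transit
print(check(data, p))             # True  - byte is consistent with its parity bit
print(check(corrupted, p))        # False - error detected, but not located or corrected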

The second type of Parity is called Logic Parity - also known as Parity Generators or Fake Parity. Some manufacturers produced these modules as a less expensive alternative to True Parity. Fake Parity modules "fool" your system into thinking parity checking is being done. This is accomplished by sending the parity signal the machine looks for, rather than using an actual parity bit. With a module using Fake Parity, you will NOT be notified of a memory error, because the memory is not really being checked. The result of these undetected errors can be corrupted files, wrong calculations, and even corruption of your hard disk. If you need Parity modules, be cautious of suppliers with bargain prices; they may be substituting useless Fake Parity.

Non-Parity modules are just like Parity modules without the extra chips. There are no Parity chips in Apple® computers, later 486, and most Pentium® class systems, simply because memory errors are rare and a single-bit error will most likely be harmless. If your SIMM/DIMM module has 2, 4, 8, 16, or 32 chips, then it is more than likely Non-Parity. Always match new memory with what is already in your system. To determine whether your system requires parity, count the number of small, black IC chips on one of your existing modules.

As the need for simple Parity faded, ECC (Error Correcting Code) modules appeared as an advanced form of error detection often used in servers and critical data applications. ECC modules use multiple check bits per data word to detect double-bit errors, and they will correct single-bit errors without generating an error message. Some systems that support ECC can use a regular Parity module by using the Parity bits to make up the ECC code; however, a Parity system cannot use a true ECC module.
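
The single-bit-correcting behaviour of ECC can be illustrated with a toy Hamming(7,4) code. Real ECC DIMMs use a wider SECDED code (check bits spread over a 64-bit word), but the principle of locating and silently flipping a single bad bit is the same. A minimal Python sketch:

# Toy Hamming(7,4) encoder/decoder: real ECC memory uses a wider code,
# but the mechanism of locating and correcting one flipped bit is the same.
def encode(d):                      # d is a list of 4 data bits
    p1 = d[0] ^ d[1] ^ d[3]
    p2 = d[0] ^ d[2] ^ d[3]
    p3 = d[1] ^ d[2] ^ d[3]
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]   # codeword positions 1..7

def decode(c):                      # c is the 7-bit codeword as read back
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # checks positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # checks positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # checks positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3 # a non-zero syndrome names the bad position
    if syndrome:
        c[syndrome - 1] ^= 1        # correct the single-bit error in place
    return [c[2], c[4], c[5], c[6]] # recovered data bits

word = [1, 0, 1, 1]
stored = encode(word)
stored[5] ^= 1                      # a single bit goes bad in the module
print(decode(stored) == word)       # True - corrected transparently, no error raised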

Most memory modules also have the ability to block the use of known 'bad' portions of the module if corruption regularly occurs in a particular region. This will trigger a server fault message advising that the affected module be replaced, but will not cause any change in the operation of the server.

1.2 CPU

Most server manufacturers offer configurations in which multiple CPUs can be installed. In a single-CPU server, a CPU failure will take down the entire server, whereas a dual-CPU configuration eliminates the chance of a single CPU failure bringing down the entire system. Only about 20% of server-based applications today actually support multiple processors. Symmetrical Multi-Processing (SMP) aware versions of applications usually cost significantly more to license but will utilize all of the CPU resources in the server as needed.

SMP-aware Operating Systems will manage and load balance the services of non-SMP-aware applications within the server, so single CPU failures will usually have minimal impact on server operations. Moving to 4, 6 or 8 CPUs can increase redundancy, but the cost and effort will not be worthwhile unless you are running SMP-aware applications.
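
The load-balancing idea can be sketched very simply: even when each application is single-threaded, the OS can place each service on the least-loaded CPU. This is only an illustration of the concept, not how any particular scheduler is implemented:

# Sketch: an SMP-aware OS can still use both CPUs for single-threaded
# (non-SMP-aware) services by placing each one on the least-loaded CPU.
cpu_load = {0: 0.0, 1: 0.0}                 # hypothetical load per CPU

def schedule(process_name, cost):
    cpu = min(cpu_load, key=cpu_load.get)   # pick the least-loaded CPU
    cpu_load[cpu] += cost
    return cpu

for proc, cost in [("dns", 0.2), ("smtp", 0.5), ("backup", 0.4), ("web", 0.3)]:
    print(proc, "-> CPU", schedule(proc, cost))
print(cpu_load)   # the single-threaded services end up spread across both CPUs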

1.3 PCI Bus

Servers with many PCI slots for peripheral cards are designed to split those slots between 2, 3 or 4 independent buses within the server, each bus having capacity for 2-6 peripheral cards depending on the size of the server. This design is meant to provide internal redundancy between peripheral cards on multiple buses. When designing a server with dual network interface cards or dual SCSI interface cards, they should be split between the different PCI bus planes to ensure continued server operation should one of the PCI bus planes fail.

1.4 SCSI Interface

There are two options for SCSI interface cards on a server: putting in two single-bus SCSI cards and splitting them between two PCI bus planes, or installing a single SCSI card that has two SCSI buses on it. Either method can be used, and each provides a different balance of manageability versus redundancy within the system.

Independent SCSI cards can provide redundancy in the face of a single PCI bus failure, but can only be used for mirroring disk arrays between the two SCSI buses.

With a dual or quad SCSI bus on a single card, the system can be designed to split various levels of RAID across the buses to guard against a single SCSI bus failure. Also, creating large disk arrays across multiple SCSI buses usually requires the buses to be on a single card. While the same operation can be simulated with software on a multi-card SCSI installation, it can place significant load on the server.

1.5 Hard Disk

Hard disks can be internal to the server, on an external drive rack that is dedicated to the server, or within a SAN/NAS solution that provides storage to the server.

For internal disks to be redundant, they should be split between two or more SCSI buses and mirrored across the buses. A disk or bus failure will result in the system switching to the mirror without user intervention.

Servers that use external disk cages should be designed to use two or more disk array cages to ensure that a SCSI bus failure doesn't cut off the entire disk array within the cage from the server. Some drive housings can be split between multiple SCSI buses to provide redundancy within the frame. The disks should be split evenly between the SCSI buses and drive cages, with RAID and hot-spare drives arranged such that a SCSI bus or drive cage failure will not impact the system.
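
One way to reason about the layout is to pair each disk with a mirror partner that lives on the other SCSI bus (or in the other cage), so that no mirrored pair shares a bus. A small sketch; the disk names are purely illustrative:

# Illustrative pairing of mirrored disks across two SCSI buses so that
# losing either bus (or cage) leaves one complete copy of every mirror.
bus_a = ["a0", "a1", "a2", "a3"]   # hypothetical disk IDs on SCSI bus A / cage 1
bus_b = ["b0", "b1", "b2", "b3"]   # hypothetical disk IDs on SCSI bus B / cage 2

mirrors = list(zip(bus_a, bus_b))  # each RAID-1 pair straddles both buses
for primary, mirror in mirrors:
    print(f"mirror set: {primary} (bus A) <-> {mirror} (bus B)")

# If bus A fails, every mirror still has its bus-B member, so data stays online.
surviving = [pair[1] for pair in mirrors]
print("disks still serving data after a bus A failure:", surviving)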

Moving to a SAN/NAS solution should provide the optimum redundancy and recoverability of server storage by eliminating the need for the server to manage any of its disk requirements. Dual connections to the network or SAN backbone should be installed to ensure that connectivity to the SAN/NAS is not lost due to an interface failure. Keep in mind that even with SAN/NAS based solutions, there is usually still a requirement for local disk to run the Operating System. These disks should be designed with internal redundancy as described above to ensure recovery from a failure.

1.6 AC Power

1.6.1 Power Supplies

Any given server should have multiple power supplies with multiple AC cord connections - preferably one AC cord for every power supply. The best case is to have three power supplies, each with its own AC cord. Be wary of single power supplies with multiple AC cords, or multiple power supplies with a single AC cord; both mask single-point-of-failure situations.

1.6.2 UPS

All of the AC cords from the server should be plugged into an Uninterruptible Power Supply (UPS) or some sort of battery-backed power system. Servers with multiple power supplies and multiple AC cords should have those AC cords split between independent UPS systems to ensure that maximum power redundancy is achieved.

All servers connected to a UPS should also have a serial or USB connection to the UPS and some sort of power-management software that can trigger an automated shutdown of the server in the case of a lengthy power outage. Sudden loss of power to a server can result in file or operating system loss/corruption.
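
The shutdown trigger is normally handled by the vendor's power-management software, but the logic amounts to something like the following sketch. The get_ups_status() function here is a stand-in for whatever the serial/USB management interface actually reports; it is not a real API:

import subprocess
import time

GRACE_PERIOD = 600           # seconds to ride out on battery before shutting down

def get_ups_status():
    # Hypothetical stand-in for the UPS management interface. A real
    # deployment would query the vendor's serial/USB protocol or management
    # daemon here; this sketch just returns 'online' or 'on_battery'.
    return "online"

on_battery_since = None
while True:
    status = get_ups_status()
    if status == "on_battery":
        if on_battery_since is None:
            on_battery_since = time.time()
        elif time.time() - on_battery_since > GRACE_PERIOD:
            # Power has been out longer than the grace period: shut down cleanly
            # rather than risk filesystem corruption when the battery runs dry.
            subprocess.run(["shutdown", "-h", "now"])
            break
    else:
        on_battery_since = None   # mains power restored, reset the timer
    time.sleep(30)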

1.7 Network/SAN Interface Cards

Any connections to networks or external devices should be dual-interfaced, and where possible those interfaces should be split among multiple PCI buses. Keep in mind that it is more cost-effective to have dual interfaces on a single card, but you lose the PCI bus failure redundancy of independent interfaces.

1.8 Single Points of Failure (SPOF)

When designed properly, the only real single point of failure for the server is the motherboard itself. The motherboard connects all of the buses and components together, and any failure of this component will cause a critical system outage.

Other peripheral outages such as floppy drive, CD-ROM, video card, sound card etc. are usually not considered to be critical in server environments and thus do not require redundancy.

2.0 Dual (or Multiple) Servers

To get around some of the Single Points of Failure identified in section 1.8, the practical method of providing redundancy is to put in multiple (2 or more) servers to support the application.

2.1 Load Balancing

The load-balancing process is usually achieved with a LAN frame solution like Cisco's 'Load Director' running on a Catalyst 6000/7000/8000/9000 frame, or by using the round-robin feature of a BIND DNS server. Either approach splits inbound traffic among a number of servers providing the same service.

BIND's round-robin feature can be set up to rotate through several IP addresses registered under a single DNS name, handing out a different address to successive clients. This method does not allow for network- or application-level balancing of the load among the servers.
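
The effect of round-robin DNS can be pictured as follows: several addresses sit behind one name, and successive lookups return them in rotating order. The addresses and name below are illustrative only:

from itertools import cycle

# Several servers registered under one DNS name (addresses are illustrative).
records = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
rotation = cycle(records)

def resolve(name):
    """Toy resolver: each lookup of the same name returns the next address."""
    return next(rotation)

for _ in range(5):
    print(resolve("app.example.com"))
# 192.0.2.10, 192.0.2.11, 192.0.2.12, 192.0.2.10, 192.0.2.11 ...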

Cisco's Load Director can distribute users among available servers based on the network loading of the servers, but not the application or service loading on the individual servers. While Load Director can monitor and perform loading based on specific application-level protocols, it does not know the amount of CPU load caused by those application requests. Thus it is only useful for prioritizing application traffic that is going to a group of servers hosting a number of application responders (e.g. HTTP and FTP).

In load-balancing situations, the individual servers servicing the requests are unaware of any of the other servers. When a workstation first contacts the load-balanced service, that workstation is assigned a specific server in the group and will use that server until the workstation is rebooted or shut down.
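
That "stickiness" can be pictured as a simple mapping from client to server that is fixed at first contact. The sketch below is only an illustration of the behaviour, not Load Director's actual algorithm, and the names and addresses are made up:

# Once a workstation is assigned a server it keeps using it; this sketch
# remembers the first assignment per client.
servers = ["web1", "web2", "web3"]       # hypothetical server pool
assignments = {}                          # client IP -> chosen server
next_index = 0

def assign(client_ip):
    global next_index
    if client_ip not in assignments:      # first request from this workstation
        assignments[client_ip] = servers[next_index % len(servers)]
        next_index += 1
    return assignments[client_ip]         # same server for every later request

print(assign("10.1.1.5"))   # web1
print(assign("10.1.1.6"))   # web2
print(assign("10.1.1.5"))   # web1 again - sticky until the client reboots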

This style of multi-system redundancy is best for front-end presentation layer type applications that are stateless such as Internet protocols like FTP, HTTP, SMTP, etc.

2.2 Cold Fail-Over

In a cold fail-over arrangement, a secondary server sits powered off in proximity to the online server until there is a failure in the primary system. Technicians must then power up the stand-by server and restore the application's data before it can take over for the failed system.

This type of recovery may take anywhere from 8 hours to 2 days depending on the volume of data that needs to be recovered and the speed of the tape recovery systems. Users will have to wait for the server to be restored before logging in.

2.3 Warm Fail-Over

In a Warm Fail-Over solution, the secondary server is running in fail-over cluster mode up to the operating system level. When the primary server goes down, the secondary server automatically takes over; support personnel only need to connect the data drives and start up the application on the backup box to restore service.

This recovery can usually be achieved in under an hour and is the best solution for non-cluster aware applications. Users will have to wait for the new server to become available and then re-login to the service.

2.4 Hot Fail-Over Cluster

For hot fail-over solutions, the applications must be fail-over cluster aware. The secondary server is connected to the same data drives and operates with all of the same services running as the primary server. When the primary server goes down, the secondary server instantly assumes the role of the primary server, including its name and IP address. Depending on the application's support for this type of fail-over, users may or may not have to log in to the application again.

To achieve this, the servers must be connected via a high-speed connection dedicated to the cluster.
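
Conceptually, that dedicated interconnect carries a heartbeat: the standby node listens for it and claims the primary's identity when it stops. A very reduced sketch, where heartbeat_received() and promote_to_primary() are placeholders for the cluster software's real mechanisms:

import time

HEARTBEAT_INTERVAL = 1        # seconds between heartbeat checks on the cluster link
FAILOVER_THRESHOLD = 5        # missed heartbeats before the standby takes over

def heartbeat_received():
    # Placeholder: real cluster software checks the dedicated high-speed
    # cluster interconnect for the primary's heartbeat packet.
    return True

def promote_to_primary():
    # Placeholder: a real implementation would take over the primary's
    # name and IP address and present its services to clients.
    print("standby promoted: assuming primary's name and IP address")

missed = 0
while True:
    if heartbeat_received():
        missed = 0
    else:
        missed += 1
        if missed >= FAILOVER_THRESHOLD:
            promote_to_primary()
            break
    time.sleep(HEARTBEAT_INTERVAL)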

2.5 Clustered Multi-Processing (CMP)

In a Clustered Multi-Processing environment, all of the servers in the cluster are operating simultaneously, and are aware of all of the other servers in the cluster. Application software must be designed to support this type of environment. Any server within the cluster that fails is immediately covered by the other server(s) in the cluster.

In some cases, a hybrid of CMP and Hot Fail-Over is used to provide extremely high-performance database services. For example, in a CMP-based Microsoft SQL 2000 implementation, each server in the cluster can be the owner of a set of tables within the database, and each of those servers would have a hot fail-over server to back it up.
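
The hybrid arrangement can be pictured as a table-ownership map in which every owner also has a designated hot standby. The names below are made up for illustration; this is not SQL Server 2000 configuration, just the routing idea:

# Illustrative ownership map for a CMP + hot fail-over hybrid.
ownership = {
    "orders":    {"owner": "sql-a", "standby": "sql-a2"},
    "customers": {"owner": "sql-b", "standby": "sql-b2"},
    "inventory": {"owner": "sql-c", "standby": "sql-c2"},
}
failed_nodes = {"sql-b"}          # suppose one owner goes down

def route(table):
    entry = ownership[table]
    if entry["owner"] in failed_nodes:
        return entry["standby"]   # the hot standby answers for its partner
    return entry["owner"]

print(route("orders"))     # sql-a  - healthy owner handles its own tables
print(route("customers"))  # sql-b2 - standby has taken over for the failed owner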

All servers within this kind of cluster must have a high-speed backbone to manage the cluster traffic.

2.6 Single Points of Failure (SPOF)

These multi-server environments are designed to provide a solution to the SPOF of a single server. The multi-server option is usually dictated by the software's support for these technologies.

Single points of failure within the solution can be attributed to the cluster hardware itself. Solutions 2.4 and 2.5 have a single point of failure in the high-speed cluster backbone; this can be eliminated by adding a redundant cluster backbone to the servers.

This only leaves the actual Networking equipment of the LAN/WAN and SAN/NAS as SPOFs for these clusters. Implementation of networking equipment can create single points of failure that the server clusters will be unaware of.

3.0 Network

Any large high availability network should provide a number of layers of redundancy, described below.

3.1 Local Area Network

Within the Local Area Network, there are many areas of failure that need to be covered:

3.1.1 Hubs/Switches

When installing a server with dual Network Interface Cards, it defeats the purpose of redundancy unless they are plugged into separate switches or hubs, either within the same segment or on two separate segments. For situations where large LAN frames are being used, like a Cisco Catalyst 7000, the two network connections should be connected to two separate cards within the frame.

When possible, the connections should be on two separate segments to reduce the effects of DDoS attacks that may target a single segment. Connecting to two separate segments can also reduce the risk of router failures making a server unavailable.

3.1.2 Routers

The routers that connect the various sites and segments of the network should be doubled up for redundancy, or at least two ports within a router should be providing the connection between networks, to eliminate the chance of a port failure or router failure disconnecting service.

3.1.3 DNS

There should be at least 2 DNS servers providing Name services on a network. WINS or other Name servers can also be implemented for further redundancy between users and servers. More than 2 DNS servers can be put on a network, and DHCP services can help split the users evenly among the various DNS servers.

3.1.4 DHCP

While DHCP is usually not a concern for servers with fixed addresses, the users that have to connect to those servers are typically assigned an IP address, which is a prerequisite to logging on to any server or application service. Overlapping DHCP scopes should be used to ensure duplication of DHCP services for users.

3.1.5 Firewalls

Dual connections to external networks should be implemented in case of a failure of one connection or firewall into that network.

3.2 Wide Area Network

3.2.1 Internet connectivity

In the case of connections to the Internet, each of the connections should come from a different provider to ensure that an ISP outage doesn't cut off both connections to the Internet. Some sort of load-balancing should be enabled between these firewalls to keep traffic even among the connections.

3.2.2 Private Data Network Connectivity

For connections to internal or private networks, dual connections and firewalls should be in place for redundancy, but there is typically only one provider of private data services for any given enterprise.

4.0 Multi-site

Multi-site solutions are usually used to address Server Disaster Recovery Protection (DRP) scenarios.

4.1 Cold-Site recovery

A cold-site recovery uses a dedicated hosting space that contains the hardware required to perform a failed-site recovery. The systems may or may not be racked and cabled, ready to go online if needed, but are powered off. In the case of a disaster, the systems can be brought up, data and applications restored, and the solution put into production. This is usually a cheaper DRP solution because the space provider (typically a 3rd party) provides the space and hardware for a multitude of customers; in a localized disaster, only one, two or a few of the site's customers will require the service at any given time. Widespread disasters may present a problem for the provider.

This solution can usually take 1-2 weeks to perform a site recovery.

4.2 Warm-Site recovery

Warm-site recovery involves a secondary site for applications that are not cluster aware, but with all of the servers at the secondary site online. In case of a disaster, the warm-site servers can be pointed at the data, or a data recovery will be performed, in order to get the warm site operational.

This recovery may take 1-7 days depending on the amount of data to be recovered.

4.3 Hot-Site recovery

A Hot-Site backup utilizes the technologies of Hot Fail-Over or a CMP style solution for applications that support it. Instead of applying these technologies within a LAN environment, servers operate over the wide area network. In the case of a Hot Fail-Over, the fail-over server exists on a remote site and traffic is redirected to that site in a fail-over condition.

In the case of CMP, the servers clustered to provide an application can exist in two or more locations and work simultaneously to service clients. Mechanisms can also be employed to have users talk to the server closest to their location, determined as the server with the least number of network hops to that user.
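
Choosing the "closest" cluster member can be reduced to comparing hop counts. In the sketch below, hop_count() is a stand-in for a real measurement (for example a traceroute-style probe or a routing-protocol metric), and the site names and values are illustrative:

# Pick the cluster member with the fewest network hops to the user.
def hop_count(client_ip, server_site):
    # Hypothetical measurement; real systems would probe the path or
    # consult routing metrics rather than use a fixed table.
    measured = {"site-east": 4, "site-west": 9}   # illustrative values
    return measured[server_site]

def nearest_site(client_ip, sites):
    return min(sites, key=lambda site: hop_count(client_ip, site))

print(nearest_site("10.20.30.40", ["site-east", "site-west"]))  # site-east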
