Introduction: Purpose and Causes
The College of Engineering Computer Network stores university data that is important for business continuity and research efforts. These facilities and data have no life-critical or clinical functions/applications.
Purpose and Goals
This contingency and disaster recovery plan describes the College of Engineering Computer Services (ECS) computer environment, which has been designed to prevent incidents and to allow prompt recovery from a disaster that destroys all or part of the facilities.
This plan is applicable to all College of Engineering Computer Services staff responsible for managing critical facilities, including server hardware, software, and data. The ECS director serves as the administrator of the core server infrastructure for the College of Engineering. In the case of an incident or disaster, the ECS director serves as the Recovery Manager.
The goals of this disaster plan are to:
- Provide for the safety and well-being of people on the premises at the time of a disaster;
- Continue critical business operations;
- Minimize the duration of a serious disruption to business operations and resources;
- Minimize immediate damage and losses;
- Ensure organizational stability;
- Ensure orderly recovery.
Situations that can interrupt or destroy computer, network, or telecommunication services fall into the following major categories:
- Environmental failures, including interruptions of the air conditioning and electrical systems by fire, steam, flooding, and weather.
- Hardware and software failures and malfunctions.
- Application failures caused by sabotage, system malfunction, or an interruption of the computing infrastructure.
Recovery depends on the severity of the failure. This plan covers strategies for both partial and full recovery of critical and non-critical applications and data.
Who Is Affected
ECS staff, guided by the ECS director, handles the work of recovering from a disaster. Faculty, staff, and students in the College of Engineering may be affected by a disaster, as well as clients with equipment or data that is managed by ECS.
Disaster prevention means reducing the impact of problems by minimizing recovery time and effort to keep an incident from escalating into a disaster. Preventive measures strive to decrease recovery time, as well as reduce the probability of a catastrophic event and reduce its impact.
In preparation for a disaster, ECS:
- Has configured a virtual server environment that can suffer the loss of up to two host servers without loss of service. Servers and data storage devices are configured to recover automatically from errors and disruptions.
- Can perform a restoration from the ground up.
- Can restore a fully patched and compliant computing environment.
- Provides central controls (e.g., virus blocking, port blocking) when applicable.
- Has identified other work units that would be affected if the incident were to spread.
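The "loss of up to two host servers" property above can be checked with a simple capacity calculation. The following is a minimal sketch only: the function name `survives_host_loss`, the capacity units, and the example figures are hypothetical and do not describe the actual ECS configuration. It assumes that load can be redistributed freely among the surviving hosts.

```python
def survives_host_loss(host_capacities, total_load, failures=2):
    """Return True if the surviving hosts can still carry total_load
    after the `failures` largest hosts are lost (the worst case).

    host_capacities: capacity of each host, in any consistent unit
                     (hypothetical figures, not ECS data).
    total_load:      combined load of all virtual servers, same unit.
    """
    if failures >= len(host_capacities):
        # No hosts would remain, so service cannot continue.
        return False
    # Worst case: the hosts with the most capacity fail first,
    # so keep only the smallest ones.
    remaining = sorted(host_capacities)[: len(host_capacities) - failures]
    return sum(remaining) >= total_load


if __name__ == "__main__":
    # Five equal hosts carrying a load that three of them can handle.
    print(survives_host_loss([100, 100, 100, 100, 100], total_load=300))
```

A check like this could be rerun whenever virtual servers are added, since growth in load can silently erode the two-host margin.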
ECS has documented the following:
- Person(s) with the authority to declare a disaster (see “Recovery Manager”).
- Multiple staff members capable of restoring IT services (see “Recovery Manager”).
- The process for retrieving backed-up data.
- Procedure for the annual review and testing of the plan, including educating appropriate staff to ensure they are aware of and understand the plan.
Data security means protecting data from damage or attack and keeping the systems that hold it stable, reliable, and free of failure. Securing information means ensuring its confidentiality (levels of privacy), integrity (being complete and true), and availability (being accessible).
ECS has set monitors to detect the following conditions in the server room, network entrance facility, and disaster recovery site:
- Fire or smoke
- Overheating or lack of sufficient cooling
- Power interruption
- Server malfunction
- Change in network operations
- External (network) attack, network intruder(s)
- Fire detection and suppression
When a monitor detects any of the above fault conditions, the system calls and sends email to the ECS director and the sysadmin staff.
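The notification step can be sketched as a small routing function. This is an illustration under stated assumptions, not the monitoring system ECS actually uses: the condition names, the `Notification` type, the `build_notifications` function, and the recipient labels are all hypothetical. The sketch only builds the list of messages to send; delivery by phone or email is left out.

```python
from dataclasses import dataclass

# Hypothetical short names for the fault conditions a monitor may report.
FAULT_CONDITIONS = {
    "fire or smoke",
    "overheating",
    "power interruption",
    "server malfunction",
    "network change",
    "network attack",
}


@dataclass(frozen=True)
class Notification:
    channel: str    # "phone" or "email"
    recipient: str  # e.g. "ECS director" (placeholder label)
    message: str


def build_notifications(condition, location, recipients):
    """Return one phone and one email notification per recipient."""
    if condition not in FAULT_CONDITIONS:
        raise ValueError(f"unknown fault condition: {condition!r}")
    message = f"ECS alert: {condition} detected at {location}"
    return [
        Notification(channel, recipient, message)
        for recipient in recipients
        for channel in ("phone", "email")
    ]


if __name__ == "__main__":
    for note in build_notifications(
        "power interruption", "server room", ["ECS director", "sysadmin staff"]
    ):
        print(note)
```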
The ECS server room, entrance facility, and disaster recovery site have:
- Backup power with uninterruptible power supply backed by a diesel generator*
- Heating, ventilation, and air conditioning controls
- Secure doors, windows, and facility space
- Video surveillance, intruder alarm systems, and motion sensors
- Access control logging
- Two-factor authentication required for entry: physical key and code
- Virtual servers configured to automatically take over the functions of a failing server.
- Sysadmin staff receives ongoing cross-training.
- Key employees have cell phones and Internet access from remote locations.
- VPN to allow key employees to have secure access from remote locations.
- Multiple server and data backups available (see “Backups” below).
- A contingency communication plan that assumes normal electronic communications, including telephones, will not work.
All systems are backed up periodically as described in the ECS Backup and Recovery Policy. Physical access to backups is restricted by access control and password security. Physical access to off-site storage is restricted by access control.
Decision-Making Process - During an Incident or Disaster
In the case of an attack or emergency, the Recovery Manager or designated staff will:
- Report the attack to the University IT Security Officer.
- Block or prevent escalation of the problem, if possible.
- Preserve evidence, where appropriate.
- Change affected account passwords as necessary.
- Change the status of accounts as necessary.
- Stop the service, if necessary.
The ECS director shall be the Recovery Manager, who is the primary person expected to carry out the following basic roles after a disaster situation has been declared by the designated authority. The secondary person shall be the first available first responder from the sysadmin group.
The Recovery Manager will direct coordination, restoration, and communication activities. This manager makes command decisions related to the disaster within the scope of the area and is in charge of the disaster recovery. S/he works with other staff and/or outside vendors to restore computers and other technical systems to the level of functionality the area needs to operate, at a minimum, its critical services. S/he also handles communication with departmental staff and outside entities.
Equipment and Data Protection
Servers and storage devices are configured to automatically switch to another server or storage device in the case of failure. If the disaster is caused by water, the designated staff will verify that the operations of the affected server or storage device have automatically moved to other equipment.
Data are protected by the security protections and backup policy described in the “Prevention” section above.
The Recovery Manager or designate should evaluate damage to the computing equipment, structure, electrical system, air conditioning, and building network. One part of a damage assessment is specifying what equipment must be replaced and in what order replacements will be done. Estimates of repair time should include ordering, shipping, installation, and testing time. After the assessment, the Recovery Manager should estimate and communicate when computing functions are likely to return to normal.
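The repair-time arithmetic described above can be sketched as follows. This is a hedged illustration: the `Replacement` type, the `recovery_estimate` function, the priority numbers, and the day counts are hypothetical and are not taken from any ECS assessment. It also assumes that replacements proceed in parallel, so the slowest item governs the overall estimate; if work must be done one item at a time, the totals would be summed instead.

```python
from dataclasses import dataclass


@dataclass
class Replacement:
    name: str
    priority: int            # 1 = replace first (hypothetical ranking)
    ordering_days: float
    shipping_days: float
    installation_days: float
    testing_days: float

    def total_days(self):
        """Repair time includes ordering, shipping, installation, and testing."""
        return (
            self.ordering_days
            + self.shipping_days
            + self.installation_days
            + self.testing_days
        )


def recovery_estimate(items):
    """Return (items in replacement order, days until the last item is ready).

    Assumes all replacements are worked on in parallel.
    """
    ordered = sorted(items, key=lambda item: item.priority)
    longest = max((item.total_days() for item in ordered), default=0)
    return ordered, longest


if __name__ == "__main__":
    plan, days = recovery_estimate([
        Replacement("storage array", 2, 2, 7, 1, 2),
        Replacement("host server", 1, 1, 5, 1, 1),
    ])
    for item in plan:
        print(item.priority, item.name, item.total_days())
    print("estimated days to normal operation:", days)
```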
The Recovery Manager is responsible for evaluating the necessity of purchasing replacement equipment. The Purchasing Department at the UI will determine the best source for the quick acquisition of hardware and other equipment.
The servers and data storage devices are configured to automatically recover from errors. Should manual intervention be required, several members of the sysadmin staff know the process, and documentation is kept in the Server Room Operations Manual.
The following documents include information on servers, storage devices, software, customers and services, and operating information relevant to the Engineering Computer Network:
- Server Room Operations Manual
- Backup and Recovery Policy (to be reviewed)
* Disaster recovery site in ERF does not have a diesel generator.