Implementing Distributed Systems: Concepts
Reliability
The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).
-
Hardware Faults: The first response is usually to add redundancy to individual hardware components to reduce the system's failure rate. Disks may be set up in a RAID configuration, servers may have dual power supplies and hot-swappable CPUs, and data centers may have batteries and diesel generators for backup power. When one component dies, the redundant component can take its place while the broken component is replaced.
-
Software Errors: Mitigations include carefully considering assumptions and interactions in the system, thorough testing, process isolation, allowing processes to crash and restart, and measuring, monitoring, and analyzing system behavior in production.
-
Human Errors: Design the system to minimize opportunities for error.
Reliability is improved by eliminating single points of failure and introducing redundancy for both components and data.
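The redundancy idea above can be illustrated with a minimal failover sketch. It assumes each redundant component is exposed as a plain callable; the names call_with_failover, broken_primary, and healthy_standby are hypothetical and exist only for this example.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each redundant replica in order and return the first success."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(exc)
    # Every replica failed, so the fault can no longer be hidden from the caller.
    raise RuntimeError(f"all {len(replicas)} replicas failed: {errors!r}")


def broken_primary() -> str:
    # Hypothetical component that simulates a hardware or software fault.
    raise ConnectionError("primary is down")


def healthy_standby() -> str:
    # Hypothetical redundant component that is still working.
    return "response from standby"


result = call_with_failover([broken_primary, healthy_standby])
print(result)
```

The caller still gets an answer although the primary failed, which is the sense in which redundancy removes a single point of failure. Production systems typically add health checks, timeouts, and retries with backoff on top of this.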
Scalability
Scalability is the ability of a system to handle a growing amount of work by adding resources to the system.
-
As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.
-
It is the capability of a system to grow and manage increased demand.
-
A system that can continuously evolve to support growing work is scalable.
Horizontal scaling (scaling out) adds more servers to the pool of resources.
Vertical scaling (scaling up) adds more resources (CPU, RAM, storage, etc.) to an existing server. This approach often requires downtime and has an upper limit set by the largest machine available.
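As a rough illustration of horizontal scaling, the sketch below spreads requests over a pool of servers in round-robin order and lets the pool grow at runtime. The class RoundRobinPool and the server names app-1, app-2, and app-3 are assumptions made for this example, not part of any particular load balancer.

```python
class RoundRobinPool:
    """Distribute requests across a growable pool of servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._next = 0  # count of requests handed out so far

    def add_server(self, server):
        # Horizontal scaling: capacity grows by adding another machine.
        self.servers.append(server)

    def pick(self):
        if not self.servers:
            raise RuntimeError("no servers in the pool")
        server = self.servers[self._next % len(self.servers)]
        self._next += 1
        return server


pool = RoundRobinPool(["app-1", "app-2"])
first_picks = [pool.pick() for _ in range(4)]

pool.add_server("app-3")  # scale out without stopping the pool
later_picks = [pool.pick() for _ in range(3)]

print(first_picks)
print(later_picks)
```

Adding app-3 does not interrupt the existing servers, whereas vertical scaling would usually mean taking a machine offline to upgrade it.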
Availability
Availability is the proportion of time a system remains operational and able to perform its required function over a specific period.
-
It is measured as the percentage of time that the system remains operational under normal conditions.
-
A reliable system is available.
-
An available system is not necessarily reliable.
-
For example, a system with a security hole can remain available as long as nobody attacks it, yet it is not reliable.
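The percentage measure above can be worked through in a few lines. This is a minimal sketch that assumes availability is computed as uptime divided by total observed time; the function names are hypothetical.

```python
def availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """Fraction of the observed period during which the system was operational."""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("observed period must be positive")
    return uptime_seconds / total


def allowed_downtime_minutes_per_year(availability_percent: float) -> float:
    """Downtime budget implied by an availability target, for a 365-day year."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes
    return minutes_per_year * (1 - availability_percent / 100)


observed = availability(uptime_seconds=999, downtime_seconds=1)
budget = allowed_downtime_minutes_per_year(99.9)

print(f"observed availability: {observed:.3%}")
print(f"99.9% target allows about {budget:.1f} minutes of downtime per year")
```

A 99.9% target therefore leaves a budget of roughly 525.6 minutes, a little under nine hours, of downtime per year.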
Efficiency
Efficiency indicates how well the system uses its inputs. It measures productivity, i.e., the output produced relative to the input consumed.
-
Latency (response time): the delay before the first piece of data is obtained.
-
Bandwidth (throughput): the amount of data delivered in a given time.
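To see how the two measures interact, a deliberately simplified model estimates transfer time as latency plus payload size divided by bandwidth. It ignores protocol overhead, congestion, and retransmissions, and the function name and the sample numbers are assumptions chosen for illustration.

```python
def transfer_time_seconds(
    payload_bytes: float,
    latency_seconds: float,
    bandwidth_bytes_per_second: float,
) -> float:
    """Simplified estimate: fixed delay plus time to push the payload through."""
    if bandwidth_bytes_per_second <= 0:
        raise ValueError("bandwidth must be positive")
    return latency_seconds + payload_bytes / bandwidth_bytes_per_second


LATENCY = 0.05          # 50 ms, assumed
BANDWIDTH = 12_500_000  # 100 Mbit/s expressed in bytes per second, assumed

small = transfer_time_seconds(1_000, LATENCY, BANDWIDTH)       # 1 KB payload
large = transfer_time_seconds(10_000_000, LATENCY, BANDWIDTH)  # 10 MB payload

print(f"1 KB:  {small:.5f} s")
print(f"10 MB: {large:.2f} s")
```

Under these assumptions the small request is dominated by latency, while the large one is dominated by bandwidth, which is why the two are tracked as separate measures.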
Maintainability
-
Over time, many people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.
-
Ease of operating and maintaining the system.
-
The simplicity and speed with which a system can be repaired or maintained.