Markov Modeling for Reliability 
Part 3: Considerations for More Complex Systems 

The simple method described in Section 2 works quite well for systems with just dual redundancy, and with component repair rates that are much greater than the component failure rates (which is often the case in practice). However, for systems with triplex (or greater) redundancy, and/or for inspection/repair rates that are relatively low (i.e., long latency times), there are additional factors that must be considered when formulating and solving Markov models. Also, if we are interested in determining not just the longterm average failure rate, but the instantaneous failure rate as a function of time, we must solve the timedependent system equations. These issues are discussed in the following sections. 


3.1 Modeling Infrequent Periodic Repairs for FirstOrder States 

In the simple example discussed in Section 2.2, a Thour periodic inspection/repair of a particular unit was modeled as a transition with constant rate 2/T, because the failure distribution was roughly uniform during each Thour interval, and hence the average time from failure to repair was only T/2 hours. As noted above, we could simply be conservative and use a repair rate of 1/T, but in order to minimize the conservatism in our analysis we used m = 2/T. However, there are two important restrictions on the use of 2/T for modeling periodic repairs. 

First, the asymptotic rate of 2/T for frequent inspection/repairs strictly applies only for failure states that are a single failure away from the fullyrepaired state. In practice this is often all that is needed, because many systems make use of just dual redundancy, and the failure of both redundant units represents complete system failure, so the only candidates for periodic inspection and repairs are states that are just a single fault away from the fullup state. (See Section 3.3 for higher order states.) 

Second, as already mentioned, even for first order systems, if the inspection interval is large compared with the MTBF of the unit, then the average time between failure and repair approaches T, so we should use m = 1/T in such cases. For intermediate cases, i.e., cases when the inspection/repair interval is roughly the same order of magnitude as the unit MTBF, we need to determine the appropriate rate, which will be somewhere between 1/T and 2/T. For firstorder states, a Thour periodic inspection/repair for any firstorder state can be modeled as a transition with a constant rate of 

_{} 

where Sl denotes the sum of all the failure rates exiting the fullup state. Notice that this formula always gives a repair rate in the range from 2/T to 1/T. This formula is valid for any inspection interval, large or small, but only for first order states. The following section discusses how to handle periodic repairs of higherorder states. 


3.2 Simplifying HigherOrder Models By State Aggregation 

If each distinguishable condition of a system is assigned a unique state in the model, the number of states can be very large for even a moderatelysized system. To illustrate, consider a system of N identical components. If each possible combination of failures is treated as a separate state, the model will contain 2^{N} states, as shown in the diagrams below for N = 2, 3, and 4. 



Each failure transition has the same value of l associated with it. Since all the components are identical, we can gather together all the singlefailure conditions into one state, all the doublefailure conditions into another state, and so on. This leads to the simpler state models with just N+1 states shown below. 



In general, if two or more configurations of a system are completely symmetrical, they can be consolidated into a single state. (Section 2.6 included a simple example of state aggregation for the case N = 2.) 

For another example of state aggregation, consider a system in which each channel contains operational components and protective components (e.g., tranzorbs for lightning protection) that are latent in service but that are repaired whenever the unit is serviced for some detected failure of the operational components. It might seem as if each channel should have four possible states, but in such a case only three states need to be considered, corresponding to fully healthy, protection failed, and operationally failed. This is because, for an operationally failed unit, it is irrelevant whether the protection is failed or not, because the unit will remain inoperative until it is repaired, at which time the protection will also be restored if it is failed. In effect, the operational failure and the operationalplusprotection failure states are aggregated together. 

Still another example of state aggregation is when a system has a huge number of possible highorder combinations of failures, and yet only the single and double failure combinations have significant probabilities. All the third, fourth, and higher order combinations are negligible. In such a case, it is permissible to aggregate all the higherorder states into the total system failure state, as shown in the figure below for N = 4 components. 



This reduces the number of states from 16 to just 6. Since many of the higherorder combinations may actually be operational, it is conservative to lump them together with the total system failure state, but since their probabilities are believed to be negligibly small, there should be no objection to aggregating them in this way, thereby greatly reducing the number of states in the model. 


3.3 Iterative Solution of SteadyState Models 

For the simple system described in Section 2.2 it was possible to recursively determine each state probability as fractions of the fullup state probability. This is because the equation for State 1 enabled us to determine the ratio P_{1}/P_{0}, and the equation for State 2 gave the ratio P_{2}/P_{1}, which could be multiplied by P_{1}/P_{0} to give P_{2}/P_{0}. It is often the case with relatively simple Markov models that the probabilities can be computed recursively in this way. When this is possible, it eliminates the need to apply a fully general solution algorithm (such as matrix inversion) to solve the model. This is why it is often possible to easily evaluate simple Markov models by hand calculations. 

However, for more complicated systems, especially those involving partial repair transitions, the state probabilities are bidirectionally interdependent, and a direct recursive solution is not possible. This isn’t an issue if a generalpurpose solution algorithm is being used to invert the coefficient matrix as in 2.3, but it does present a problem for hand calculations and other “ad hoc” solution methods. 

Fortunately the nonrecursive effects in Markov models for reliability analysis are generally higher order (smaller) effects, so it’s possible to apply the basic recursive solution method, solving for each state probability ratio using its respective state equation as if all the other state probability ratios were known, and arrive at a firstorder approximate solution. Then this process can be repeated, using the newly computed probability ratios, to arrive at an even better approximation, and so on. (This solution method has advantages over both the GaussSeidel method and the Jacobi method, because it exploits the homogeneous character of the system equations of a Markov model.) 

To illustrate, consider a model with the simple steadystate equations 

_{} 

Because of the state interdependencies (e.g., P_{1} depends on P_{2} as well as P_{0}), this system of equations can’t be solved by direct recursion. Note, however, that we can proceed as if the probability ratios were already known. We can divide through each equation by P_{0}, discard the first equation (as always, because it is redundant), and solve each of the other equations for the ratio on the diagonal. Letting Q_{j} denote the ratio P_{j}/P_{0}, this gives 

_{} 

Beginning with Q_{1} = Q_{2} = 0, iterating these equations quickly leads to the steadystate ratios. Of course, once the ratios are known, we can compute P_{j} = Q_{j }/ (1+Q_{1}+Q_{2}). In general this approach converts the N homogeneous state equations into a nonhomogeneous system of N1 equations, which can be iterated to arrive at the solution. (They could also be solved directly, but if the purpose is to avoid the need to solve a system of equations, iteration can be used.) 


3.4 Modeling Periodic Repairs for HigherOrder States 

For periodic inspection/repair of states that are more than one failure away from the fullyrepaired state, the equation for m given in the preceding section is not strictly applicable. Nevertheless, for some common types of higherorder models it is still possible to accurately estimate the applicable repair transition rate for representing a Thour inspection/repair interval. 

As noted previously, we always have the option of using m = 1/T, since the mean effective repair rate can’t be any lower than this. However, in many cases we can justify the use of a significantly higher rate. In fact, for higherorder states, we can often justify using a rate even greater than 2/T. This is because the rate of entry into higherorder states is not constant, but increases over time, as the probabilities of the feeding states increase. As a result, entry into a higherorder state is not uniformly distributed over the interval, it is weighted toward the end of any given inspection interval, and hence the mean time to repair may be even smaller than T/2. 

For certain common types of higherorder models, the optimum rates corresponding to periodic inspection/repairs are easy to determine. For example, consider a system consisting of N identical units, each of which has a failure rate of l. The system remains operational unless all N of the units has failed. If the system fails, it is repaired immediately to the fullup state, but aside from this the only maintenance is that the system is inspected every T hours, and any units found failed are repaired at that time. A Markov model for this system, with N = 4, is shown below. 



Here we have modeled the Thour inspection/repair as constantrate transitions for each of the three partiallyfailed stated, but the rates differ because the mean times to enter the respective states are all different. In general, for a system of this type, with N identical components, it can be shown that periodic inspection/repair of a state that is k failures away from the fullyrepaired state can be represented by a transition with constant rate m = (k+1)/T if the inspection interval is small compared with the unit MTBF, whereas it can be represented by a transition with constant rate m = 1/T if the inspection interval is large compared with the unit MTBF. To cover both of these regimes, and the intermediate region between them, it’s convenient to use the simple formula 

_{} 

However, it should be emphasized that this formula is tailored specifically for systems consisting of N identical units, with no repairs other than the Thour periodic/inspection of all the units (along with the immediate repair of a totally failed system). In practice we may encounter more complicated systems, with dissimilar units and a variety of maintenance strategies for each unit. It is always possible to determine a set of constant values to represent the repair transition rates such that the overall system failure rate is correct, but for highly complex systems it essentially becomes necessary to carry out an exact analysis of the system in order to determine suitable repair rates for the approximate solution. Of course, once the exact analysis has been performed, there may be no need to perform an approximate analysis. Hence, for highly complex models involving periodic repair intervals of intermediate length (i.e., intervals that are on the same order of magnitude as the MTBF of the units), the approximate method of solution based on constant repair rates may not be suitable. Such models are relatively rare in practice, but when they arise, it is advisable to use one of the “exact” solution methods described in the next section. 


3.5 Exact Solution of Complex Models with Infrequent Periodic Repair 

The simple system discussed in Section 2.2, including the periodic repair, can be very satisfactorily modeled using the simple method described in that section. However, we will use that system to illustrate the exact solution. The first step is to consider the Markov model with all the transitions that can be accurately modeled with constant rates. We exclude the periodic inspection/repairs, on the assumption that we are not able to accurately represent those as transitions with constant rates. Thus we have the system shown below. 



We have omitted the fullyfailed state (State 3), because as explained in Section 2.2 that state contains no probability (because it is repaired immediately). The timedependent state equations for this system are 

_{} 

It is convenient to express this set of equations in matrix notation as 

_{} 

where 
_{} 

Given any initial conditions P(t) at time t, the state probabilities at some later time t+T are given by 

_{} 

Now, at the initial time t = 0 we have P(0) = Transpose[1 0 0], and at the end of the first Thour periodic inspection interval this formula gives the new state probabilities P(T). At this point we wish to inspect and repair any systems found in State 2 back to State 0. This can be accomplished by multiplying P(T) by redistribution matrix A defined as 

_{} 

After the first interval T and carrying out the first inspection/repair, the state probabilities are P(T) = (Ae^{M}^{T})P(0), and after the second inspection/repair the state probabilities are P(2T) = (Ae^{M}^{T})^{2}P(0), and so on. Therefore, after the nth inspection/repair, the state probabilities are 

_{} 

The average probabilities for the interval beginning at time nT is given by 

_{} 

where I is the identity matrix. Notice that the numerator on the right hand side is divisible by MT, so it isn’t necessary to invert the M matrix. The system failure rate is l_{2}P_{1} + l_{1}P_{2}, so if we define the row vector L = [ 0 l_{2} l_{1} ], the average system failure rate during the interval beginning at t = nT can be expressed as 

_{} 

Inserting the values of the parameters for the example from Section 2.2, this equation shows that the system has already reached equilibrium by the end of the first 1000hour inspection interval, at which point it is has the value l_{sys} = 5.84E07. Recall that the simple algebraic method in Section 2.2 gave the value l_{sys} = 5.79E07. This confirms that, for this simple system, the algebraic method is quite adequate. 

Note: Some analysts prefer to numerically integrate equation 3.51 rather than making use of the explicit closedform solution 3.52. The results are the same. 


3.6 Exact Solution With Multiple Distinct Periodic Repair Intervals 

The example discussed in the preceding section involved only one periodic inspection/repair interval. The solution can be generalized to any number of distinct repair intervals. For example, suppose we have a system for which some of the components and inspected and (if failed) repaired every t hours, whereas the remainder of the components are inspected/repaired only once every T = nt hours. In other words, the longer interval is n times the smaller interval. For this system we have two “redistribution matrices”, which we may call A and B. The first represents the repair of the thour components, and the second represents the repair of all components (because everything is checked and repaired at the longer interval T). It can be shown that the average system failure rate is 

_{} 

We note that the B redistribution matrix doesn’t appear in this expression, because B represents complete restoration of the system to the fullup state, so, beginning from the fullup state, we need only evaluate the system reliability over n of the t periods using the A matrix. At this point the B redistribution would restore everything to the fullup state, so all subsequent iterations would be identical to the first. 

To illustrate, consider again the sample system of Section 2.2, but suppose Unit 1 did not contain continuous health monitoring (so m_{1} = 0), and instead it was maintained by a periodic inspection/repair every t = 200 hours. As before, Unit 2 is inspected and repaired periodically every T = 1000 hours. Thus we have n = 5, and the coefficient and redistribution matrices are 

_{} 

Since the B matrix represents complete restoration of the system, we can use the preceding equation to evaluate the average system failure rate of one complete Thour interval (which consists of five thour intervals). Inserting these matrices into the preceding equation gives l_{sys} = 5.85E07, which is nearly identical to the previouslycomputed results. This was to be expected, because the inspection of 200 hours for Unit 1 is essentially equivalent to the previous system that required repair of Unit 1 at 100 hours after failure. Recalling that a frequent periodic inspection and repair every T hours gives a mean repair time of T/2, these two maintenance strategies give the same result. 


3.7 Model Truncation and Completeness 

One of the main difficulties for any kind of comprehensive reliability analysis is that the number of distinct system states may grow quite large as the number of components increases. A system consisting of N basic components has 2^{N} possible states (even without taking the order of occurrence into account), so any analysis method must somehow accommodate this exponential growth in complexity as the number of basic components increases. This applies regardless of the method of analysis (e.g., fault tree, Markov model, etc.). However, in practice this problem is often mitigated by the fact that the systems to be analyzed are not fully general, partly because they have been specifically designed to isolate redundant functional paths and avoid “common mode faults”. This not only is sensible design practice, it also tends to make systems easier to analyze, because groups of related components can be lumped together into a much smaller number of independent highlevel partitions. 

In addition, the failure probabilities of practical components tend to be many orders of magnitude smaller than 1, so combinations of more than two or three failures are often found to be negligible. It is often possible to greatly simplify an analysis by exploiting these special aspects of practical systems. However, care must be taken not to assume, in advance, that a particular system can be simplified in these ways. Meaningful substantiation of the extremely high levels of reliability (e.g., one per billion hours) mandated for safetyrelated systems requires a commensurate level of reliability in the analysis method, so it is not permissible to employ a method that gives the right answer for “almost all” systems. Every possible combination must be taken into account (though not necessarily explicitly), either in the design phase or in the analysis phase. This highlights the importance of applying good design practices, including strict isolation of redundant functional paths, so that systems are amenable to rigorous analysis. 

Of course, the requirement to account for all combinations doesn’t means that each combination must be explicitly represented in a model. For example, if all combinations of three or more failures are believed to be negligible, it is reasonable to include only the single and double failure combinations explicitly in the model, provided all further failure transitions from the doublefailure states are routed to the complete system failure state. This will be conservative, but the conservatism will be negligible – assuming all the higherorder combinations are indeed negligibly small. If the model turns out to be overly conservative, then it will be necessary to model the higherorder combinations to justify a less conservative analysis. 

It’s also worth noting that modern computers have greatly increased the size of the models that can be easily analyzed. Nevertheless, the potential for exponential growth in the number of states ensures that there will always be incentive to limit the number of states that are explicitly represented in a model. 


3.8 Models with Variable Transition Rates 

In most common Markov models the transition rates are constants, so the transition times are exponentially distributed (as discussed in the Introduction). It is, however, possible to construct models with nonconstant failure rates. The governing state equations are the same, except that each rate l is allowed to be a function l(t) of the time. The need for variable failure rates arises in practice to account for things like “wearour” or “infant mortality”. Of course, in order for the failure rate to correlate with such variations, the time parameter t must represent the age of the component. This is quite different from common Markov models, in which the time variable t is arbitrary, i.e., we can set t = 0 at any instant, and the system equations are valid. This is why common Markov models are said to be “homogeneous in time”, whereas if the state equations are valid only if t represents the age of the unit, the model is said to be “nonhomogeneous in time”. 

One of the main difficulties in dealing with variable failure rates is that a “state” is no longer fully determined by the health status of a finite number of components. The state also depends on how much time has passed since the last time each of the components was replaced or restored to it’s “as new” condition. Even if we agree to begin the model with all new components in the fullup state, a system with repairs will involve units that have failed and then been returned to the fullup state at various times, so very quickly each state will consist of a mixture of systems, all in the same operational condition, but with components of different ages (and therefore different instantaneous failure rates). Consequently, the life of each individual component must essentially be analyzed separately, with no interactions or repairs. 

For example, it would not be possible to use variable failure rates in the simple twostate closedloop model discussed in Section 2.5, because of the mixed ages of repaired and neverfailed units in the fullup state. With ordinary Markov modeling techniques the most we could do is evaluate the openloop model discussed at the beginning of that section. However, after much labor, we would find that the state probabilities are just the products of the individual component probabilities, since the two units are simply following their own individual failure rates. The application of conventional “Markov modeling” to such system’s is of little benefit. 

There
do exist, however, techniques for applying nontrivial Markov modeling to
system with variable failure rates. One “brute force” approach is to use a Monte Carlo simulation,
generating a large number of (simulated) systems and tracking them through
their life cycles. Each unit is failed at a rate depending on the age of that
particular unit (since it’s last overhaul). A more analytical approach is to
actually model the time dependence of the failure rates by means of a
sequence of states. A description of these methods is outside the scope of
this document. 

3.9 Phased Mission Analysis 

Some missions are split into distinct phases with different configurations and/or failure modes or effects. For such missions a Markov model can be developed for each phase, and the final conditions for one phase may be used as the initial conditions for the next. Phased missions are somewhat uncommon in industry. No fundamentally new ideas are required to analyze such missions. 
