Next: Factors That Limit Speedup
Up: Geometric Parallelism

Asynchronous / Relaxed Parallelism

In relaxed algotithms there is no explicit dependency between processes, as occurs in synchronized algorithms. Instead relaxed algorithms never wait for input. If they are ready they use the most recently available data

To illustrate this consider the following.

Say we have two processors A and B. A produces a sequence of numbers a1, a2 .. B inputs ai and performs some calculation F which uses ai.
Say that B runs much faster than A.

Synchronous Operation

A produces a1 passes it to B which calculates F1;
A produces a2 passes it to B which calculates F2;
i.e. B waits for A to finish ( since B is faster than A ) etc..

Asynchronous Operation

A produces a1 passes it to B which calculates F1 but now A is still in the process of computing a2 so instead of waiting B carries on and calculates F2 ( based on old data i.e. a1 and therefore may not be the same as F2 above )and continues to calculate F using the old data until a new input arrives
e.g. Fnew = Fold + ai

The idea in using asynchronous algorithms is that all processors are kept busy and never remain idle (unlike synchronous algorithms ) so speedup is maximized.

A drawback is that they are difficult to analyse ( because we do not know what data is being used ) and also an algorithm that is known to work ( e.g. converge) in synchronous mode may not work (e.g diverge) in asynchronous mode.

[ Examples of calculations that do run successfully as asynchronous algorithms are : Gauss Seidel and Jacobi methods for solving linear systems of equations. ]

Consider the Newton Raphson iteration for solving F (x) = 0 where F is some non-linear function
i.e. Xn+1 = Xn - F(Xn)/F'(Xn)......( 1 )
generates a sequence of approximations to the root, starting from a value X0.

Say we have 3 processors

P1 : given x, P1 calculates F (x ) in time t1, units and sends it to P3
P2 :given y, P2 calculates F'(y) in time t2 units and sends it to P3
P3 : given a, b, c, P3 calculates d = a - b/c in time t3 units; if | d-a | > Epsilon then d is sent to P1 and P2 otherwise d is output.

Serial Mode

P1 computes F(Xn) then P2 computes F'(Xn) then P3 computes Xn+1 using (1)
So time per iteration is t1 + t2 + t3
If k iterations are necessary for convergence then total time is k (t1 + t2 + t3)

Synchronous Parallel Mode.

P1 and P2 compute F(Xn) and F'(Xn) simultaneously and when BOTH have finished the values F(Xn) and F'(Xn) are used by P3 to compute Xn+1
Time per iteration is max( t1, t2) + t3

Again k iterations will be necessary so total time is k [ max( t1, t2) + t3]
X1 = X0 - F(X0)/F'(X0) ...etc

Asynchronous Parallel Mode

P1 and P2 begin computing as soon as a new input value is made available by P3 and they are ready to receive it,

P3 computes a new value using (1) as soon as EITHER P1 OR P2 provide a new input i.e. (1) is now of the form

Xn+1 = Xn - F(SXi)/F'(Xj)

Time per iteration is at most min( t1, t2) + t3 [but may be as low as t3 - e.g. if P1 and P2 complete at consecutive time steps]

we cannot predict how many iterations will be necessary
Say t1 = 2, t2 = 3 and t3 = 1.


		2 units		3 units		1 unit
TIME		P1		P2		P3

2		F(X0)		C0		-
3		-		F'(X0)		-		
4		-		-		X1 = X0 - F(X0)/F'(X0)
5		C1		C1		-
6		F(X1)		C1		-
7		-		F'(X1)		X2 = X1 - F(X1)/F'(X0)
8		C2		C2		X3 = X2 - F(X1)/F'(X1)
9		F(X2)		C2		-
10		C3		F'(X2)		X4 = X3 - F(X2)/F'(X1)
11		F(X3)		C3 or C4	X5 = X4 - F(X2)/F'(X2)
12		C4 or C5			X6 = X5 - F(X3)/F'(X2)
13		  .		   .		  .
14		  .  		   .		  .
etc
- indicates processor is idle
Ci indicates processor is using Xi in its calculation

NOTE
At time 11 P2 has the choice of using X3 to calculate F'(X3) or X4 to calculate F'(X4), i.e. omit X3. Which choice is made should be determined experimentally to see which gives the best results.

we could relax the parameter i.e. use X4 or we could synchronise with the parameter i.e. using X3