Consider adding m numbers using N processors.
Processors 1 -> N-1 compute the partial sum of m / N-1 numbers. These partial sums are passed on to processor PN to calculate the final sum i.e. add N-1 partial sums.
i.e. There are N-1 communications ( Assume PN can read input simultaneously )
Say each floating point addition takes 1 unit of time and it takes 100 units of time to pass the partial sums from P1 .... PN-1 to PN i.e.TOTAL communication time = 100 units.
Parallel time is
Tp = (m / N-1) + ( 100 ) + ( N -1 )
where (m / N-1) is time to add (m / N-1) numbers P1....PN-1
and ( 100 ) is the total communication time
and ( N -1 ) is the time to add N-1 numbers in PN
i.e. TP = M + N * N + 98N - 99 -------------------------- N - 1Time for sequential algorithm Ts is m units (since no communication needed)
therefore
Consider now the grain size i.e. amount of work done in each processor between synchronizations, in this case, the size of m.
if m= 1000 and use 10 processors i.e. N = 10
SPEEDUP = 10 000 - 1 000 = 4.54
----------------------
1 000 + 100 + 980 - 99
and
EFFICIENCY = 4.54 = 0.454 = 45%
----
10
If m= 10 000 and use 10 processors ( N = 10)
SPEEDUP = 100 000 - 10 000 = 8.20 ------------------------ 10 000 + 100 + 980 - 99 EFFICIENCY = 8.20 = 0.82 = 82% ------ 10So we can see that increasing the grain size, in this case by adding more numbers in each processor, increases the speedup and improves the efficiency.
Tp = ( (m / N-1) + N-1 )Tcalc + Tcomm Ts = m Tcalc Therefore % efficiency is given by 100 m ( N-1 ) Tcalc = E ....... ( 1 ) --------------------------------------- N (m + (N-1)*(N-1))Tcalc + N(N-1)TcommSay we require to attain at least 80% efficiency using N = 10 then rearranging ( 1 ) we get :
m > EN( N-1 ) [ ( N-1)Tcalc + Tcomm ] ----------------------------------- Tcalc [ 100(N-1) - EN ]If Tcomm / Tcalc = 1 on machine A then m > 720