- 1. Universitat Politècnica de Catalunya, Facultat d'Informàtica de Barcelona. AMPP Final Project: Smith-Waterman Algorithm Parallelization. Authors: Mário Almeida, Žygimantas Bruzgys, Umit Cavus Buyuksahin. Supervisors: Josep Ramon Herrero Zaragoza, Daniel Jimenez Gonzalez. Barcelona, 2012
- 2. Contents
1 Introduction, 3
2 Main Issues and Solutions, 3
  2.1 Parallelization Techniques, 4
    2.1.1 Blocking Technique, 4
    2.1.2 Blocking and Interleaving Technique, 6
  2.2 Performance Model on Linear Network Topology, 7
    2.2.1 Blocking Technique, 7
    2.2.2 Blocking Technique: Optimum B, 10
    2.2.3 Blocking and Interleaving Technique, 11
    2.2.4 Blocking and Interleaving Technique: Optimum B, 14
    2.2.5 Blocking and Interleaving Technique: Optimum I, 14
  2.3 Performance Model on 2D Torus Network Topology, 15
    2.3.1 Blocking Technique, 15
    2.3.2 Blocking and Interleaving Technique, 15
  2.4 Implementation, 16
3 Performance Results, 21
  3.1 Finding Optimal P and B, 21
  3.2 Finding Optimal I, 21
4 Conclusions, 22
A How to Compile, 24
B How to Execute on ALTIX, 24
C Code, 24
- 3. 1 Introduction

This project presents a parallel implementation of the Smith-Waterman algorithm using the Message Passing Interface (MPI). Smith-Waterman is a well-known algorithm for local sequence alignment, that is, for determining similar regions between two amino-acid sequences. To find the best alignment between two amino-acid sequences, a matrix H of size N × N is computed, where N is the length of each sequence. Every element of this matrix is based on a score matrix (the cost of matching two symbols) and a gap penalty for mismatched symbols. Once H is computed, the optimum alignment is obtained by tracing back through the matrix, starting from its highest value.

In our parallel implementation only the calculation of H was parallelized, as it is our only interest. The trace-back part was removed from both the parallel and the sequential code in order to gather the most accurate computation times for comparison.

For parallelization a pipelining method was used. Following this model, each process communicates with the next one after calculating a block of B columns over the rows assigned to it. This is called blocking. We introduced a parameter B that makes this block width easy to change. Later an interleaving parameter I was added.

During this project several performance models were created: one for a linear interconnection network and another for a 2D torus network. Both models include the B and I parameters. The optimum B and I were then derived from the models, and performance tests were executed to find those two parameters empirically.

2 Main Issues and Solutions

This section describes the parallelization solutions. First, the solution with blocking at column level is explained together with its performance model. Then the solution with both blocking at column level and interleaving at row level is explained, again with its performance model. For each solution the optimum blocking factor B and interleaving factor I are calculated. The second part of the section describes the performance models of both solutions on a different network topology and recalculates the optimum B and I. Finally, our implementation of these techniques in C++ is provided and explained.
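For reference, the H-matrix recurrence outlined in the Introduction can be sketched sequentially as follows. This is a minimal illustration and not the project's code: the function names `fill_h` and `best_score` are hypothetical, and a simple match/mismatch score stands in for the similarity matrix, while `gap` plays the role of the gap penalty.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Fills the Smith-Waterman score matrix H for sequences a and b.
// Row 0 and column 0 form a zero-filled border.
// `match`, `mismatch` and `gap` are illustrative scores (assumptions);
// the real program reads a similarity matrix and a gap penalty instead.
std::vector<std::vector<int>> fill_h(const std::string& a, const std::string& b,
                                     int match, int mismatch, int gap) {
    assert(gap <= 0);  // the penalty is added to a score, so it must not be positive
    std::vector<std::vector<int>> h(a.size() + 1,
                                    std::vector<int>(b.size() + 1, 0));
    for (std::size_t i = 1; i <= a.size(); ++i) {
        for (std::size_t j = 1; j <= b.size(); ++j) {
            // Each cell depends on its upper-left, upper and left neighbours.
            int diag = h[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            int up = h[i - 1][j] + gap;
            int left = h[i][j - 1] + gap;
            h[i][j] = std::max({0, diag, up, left});
        }
    }
    return h;
}

// Highest value in H, i.e. the cell from which a trace-back would start.
int best_score(const std::vector<std::vector<int>>& h) {
    int best = 0;
    for (const auto& row : h)
        for (int v : row) best = std::max(best, v);
    return best;
}
```

Calling `best_score(fill_h(a, b, 2, -1, -1))` gives the score of the best local alignment under these illustrative scores. The three neighbours read in the inner loop are the dependencies that the parallel versions have to respect.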
- 4. 2.1 Parallelization Techniques

2.1.1 Blocking Technique

Figure 1: Parallelization approach by introducing blocking at column level

The P processes share the matrix M (the matrix H of the Introduction) in terms of consecutive rows. To calculate the matrix M of size N × N, each process $P_i$ works with N/P consecutive rows of the matrix. With the blocking technique, the columns are divided into blocks of a defined size B, so each process has to calculate N/B blocks. These parameters are visualized in Figure 1. The top part of the figure shows how the elements of the matrix are divided between processes, and the bottom part shows how the calculations are parallelized. When the first process has computed the first block of the matrix, which has size N/P × B, it communicates with the next process. The next process then starts calculating its own first block while the first process continues with its next block, and so on. This type of parallelization is called pipelining: the problem is divided into a series of tasks that have to be completed one after the other.

Before explaining the parallelization in detail, we analyze the data and task dependencies between processes. Figure 2 shows the data dependency for a particular matrix element. To calculate a matrix element M[i][j], the process $P_{i+1}$ needs the value M[i][j − 1] from the previous column and the values M[i − 1][j − 1] and M[i − 1][j] from the previous row, as seen in the picture. If the previous row is calculated by the process $P_i$, that row is sent after $P_i$ has calculated its block of size N/P × B.

- 5. Figure 2: Data dependency for calculating one matrix element

This introduces data and task dependencies: the process $P_{i+1}$ cannot start its calculations until the process $P_i$ has sent the last row of its block, which is needed to calculate the block of process $P_{i+1}$. To calculate the first row and the first column of the matrix, the preceding row and column are considered to be filled with zeros.

Figure 3: Data dependencies between blocks of the matrix

Figure 3 shows the parallelism of the matrix at a larger scale. The squares represent the blocks of the matrix and the three arrows show the data dependencies between the blocks. As mentioned before, an element needs its upper, left and upper-left values to be calculated; this is its data dependency. Therefore, blocks on the same minor diagonal are independent of each other, so these blocks can be, and are, calculated in parallel. The steps of the calculation are as follows:
- 6.
1. The process waits until the previous process finishes the calculation of a block (if applicable);
2. The process receives the last row of the block that was calculated by the previous process;
3. After receiving that row, the process has all the information needed to calculate its own block, so it performs the calculation of its block;
4. When the process finishes the calculation, it sends the last row of its block to the next process (if applicable);
5. The process repeats these steps until it finishes the calculation of all blocks, that is, until it has calculated all rows that are assigned to it.

2.1.2 Blocking and Interleaving Technique

Figure 4: Matrix calculation with interleaving factor, when I = 2

This parallelization method adds an interleaving factor to the blocking technique described above. With this method the matrix is divided into I parts, so that each part has (N/I) × N elements. Every part is then calculated as explained in the previous section, that is, using the blocking technique. As soon as a process finishes the rows assigned to it in the first interleave part, it continues with the blocks of the next interleave part. For example, in Figure 4, where the interleaving factor is I = 2, the matrix is divided into two smaller parts. Each process $P_i$ calculates N/(P · I) rows of one part before moving to the second part.
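The decomposition described above can be made concrete with a small sketch that maps a matrix row to its interleave part and owning process. The names `RowOwner`, `owner_of_row` and `pipeline_step` are hypothetical, and the sketch assumes that N is divisible by P · I.

```cpp
#include <cassert>

// Position of a matrix row in the blocked, interleaved decomposition.
struct RowOwner {
    int part;     // interleave part, 0 .. I-1
    int process;  // rank of the process that computes the row, 0 .. P-1
};

// Assumes N is divisible by P * I (an assumption of this sketch).
RowOwner owner_of_row(int row, int N, int P, int I) {
    assert(P > 0 && I > 0 && N % (P * I) == 0);
    assert(row >= 0 && row < N);
    int rows_per_chunk = N / (P * I);        // rows one process handles per part
    int rows_per_part = rows_per_chunk * P;  // rows in one interleave part
    return RowOwner{row / rows_per_part,
                    (row % rows_per_part) / rows_per_chunk};
}

// 0-based pipeline step in which `process` computes column block `block`
// with the plain blocking technique: each process starts one step after
// its predecessor, so blocks on the same minor diagonal share a step.
int pipeline_step(int process, int block) { return process + block; }
```

With I = 1 the mapping reduces to the plain blocking technique. The last process finishes its last block at step (P − 1) + (N/B − 1), which gives N/B + P − 1 steps in total.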
- 7. The steps of the calculation are very similar to those of the blocking technique:

1. The process waits until the previous process finishes the calculation of a block (if applicable);
2. The process receives the last row of the block that was calculated by the previous process;
3. After receiving that row, the process has all the information needed to calculate its own block, so it performs the calculation of its block;
4. When the process finishes the calculation, it sends the last row of its block to the next process. If the process is the last one and there is another interleave part to calculate, it sends the row to the first process. Otherwise it does not send anything;
5. The process repeats these steps until it finishes the calculation of all blocks within the current interleave part, that is, until it has calculated all rows assigned to it within that part. If there is another interleave part to calculate, it moves to the next part and repeats these steps until all blocks of all interleave parts are calculated.

2.2 Performance Model on Linear Network Topology

2.2.1 Blocking Technique

In this section we describe the performance model of our implementation with the blocking technique on a linear network topology. Later sections compare it with a non-linear topology, taking into account the differences in the performance models. To focus on the main objectives of this performance analysis, we only take into account the parallel algorithms used for the matrix calculation. Parts of the code that run sequentially on a single process, such as opening and reading the input files, are ignored in this model. For the models of the different network topologies we also assume that the creation of new processes is location aware, that is, that processes are placed in the network in a way that makes communication more efficient.

All the performance models described in this section use the following notation:
- 8.
• $t_s$: startup time (preparing the message, the routing algorithm, and the interface between the local node and the router).
• $t_c$: computation time for each value in the matrix.
• $t_w$: transfer time per word.
• $T_{comm}$: total communication time.
• $T_{comp}$: total computation time.

Figure 5: Communication and computation times of the parallel matrix calculation, per process, using the blocking technique.

The diagram in Figure 5 represents the steps of the matrix calculation performed by our algorithm, as well as the initial declarations and the required communications. The different steps are shown in different colors. Blue represents the scattering of one protein sequence to all the processes. The green areas represent the computation time needed for the matrix calculations in each block, and yellow represents the time taken to send the last row of a block to the next process. To simplify the diagram, the time the last process needs to receive the last row of the block of the previous process is already included in the upper yellow area. This explains why the last process has no yellow areas in its time-line but still has to wait to receive the rows needed for its matrix calculations. All of this is considered in the performance model.

As the diagram shows, the communication time of this model is composed of the scattering of the protein sequence vector (blue area) and several communications that send the last row of each block to the next process (yellow). The scatter method [2] receives a vector of size N and delivers a vector of size N/P to each process.

- 9. The scattering time is given by:

$$T_{scatter} = t_s \cdot \log(P) + \frac{N}{P} \cdot (P - 1) \cdot t_w \tag{1}$$

Sending the last row of a block to the next process costs the communication startup time ($t_s$) plus the transfer time of the B elements in that row:

$$T_{rowComm} = t_s + B \cdot t_w \tag{2}$$

In the total communication time, this startup and transfer happen N/B times for the first process and an extra P − 1 times for the remaining pipeline stages. Since the last process does not need to send its last row to another process, we count P − 2 extra communications instead. The total communication time is therefore:

$$T_{comm} = T_{scatter} + \left(\frac{N}{B} + P - 1 - 1\right) \cdot T_{rowComm}$$

$$T_{comm} = t_s \cdot \log(P) + \frac{N}{P} \cdot (P - 1) \cdot t_w + \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w) \tag{3}$$

The next step is to calculate the total computation time. A block is composed of N/P rows and B columns, so it contains B · N/P elements. The computation time of a single block is therefore:

$$T_{compBlock} = t_c \cdot B \cdot \frac{N}{P} \tag{4}$$

As for the total communication time, this block time is multiplied by N/B + P − 1 to account for all the pipeline stages:

$$T_{comp} = \left(\frac{N}{B} + P - 1\right) \cdot T_{compBlock}$$

$$T_{comp} = \left(\frac{N}{B} + P - 1\right) \cdot \left(t_c \cdot B \cdot \frac{N}{P}\right) \tag{5}$$

To conclude this performance model, the total parallel time is the sum of the total communication and computation times:
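- 10.

$$T_{parallel} = T_{comp} + T_{comm}$$

$$T_{parallel} = \left(\frac{N}{B} + P - 1\right) \cdot t_c \cdot B \cdot \frac{N}{P} + t_s \cdot \log(P) + \frac{N}{P} \cdot (P - 1) \cdot t_w + \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w) \tag{6}$$

2.2.2 Blocking Technique: Optimum B

To find the optimum B for fixed values of N and P, assuming N is much bigger than P, we need the value of B for which the total parallel time of computation and communication is smallest. This value is found by differentiating the total parallel time with respect to B and finding the value of B for which the derivative is zero.

$$\frac{dT_{parallel}}{dB} = 0 \Leftrightarrow$$

$$\Leftrightarrow -\frac{N}{B^2} \left(\frac{t_c B N}{P} + t_s + B t_w\right) + \left(\frac{N}{B} + P - 2\right) \left(\frac{t_c N}{P} + t_w\right) + \frac{t_c N}{P} = 0 \Leftrightarrow$$

$$\Leftrightarrow B = \sqrt{\frac{N \cdot t_s \cdot P}{P \cdot t_c \cdot N + P^2 \cdot t_w - t_c \cdot N - 2 \cdot t_w \cdot P}} \Leftrightarrow$$

$$\Leftrightarrow B = \sqrt{\frac{N \cdot t_s \cdot P}{t_c \cdot N \cdot (P - 1) + P \cdot t_w \cdot (P - 2)}} \Leftrightarrow$$

$$\Leftrightarrow B = \sqrt{\frac{t_s}{\frac{t_w \cdot (P - 2)}{N} + \frac{t_c \cdot (P - 1)}{P}}}$$

For N ≫ P the first term of the denominator vanishes, and taking (P − 1)/P ≈ 1:

$$B \approx \sqrt{\frac{t_s}{t_c}} \tag{7}$$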
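The model and its optimum can be checked numerically. The following is a sketch under stated assumptions: the struct `Model` and the function names are hypothetical, the parameter values used with them are illustrative rather than measured, and the logarithm is taken in base 2 because the text does not state a base (the base does not influence the optimum B).

```cpp
#include <cassert>
#include <cmath>

// Problem and machine parameters of the performance model.
struct Model {
    double N;   // sequence length
    double P;   // number of processes
    double ts;  // startup time
    double tw;  // per-word transfer time
    double tc;  // computation time per matrix element
};

// Equation (6): total parallel time of the blocking technique.
// log base 2 is an assumption of this sketch.
double t_parallel_blocking(const Model& m, double B) {
    assert(B > 0 && m.P >= 2);
    double comp = (m.N / B + m.P - 1) * m.tc * B * m.N / m.P;
    double scatter = m.ts * std::log2(m.P) + m.N / m.P * (m.P - 1) * m.tw;
    double rows = (m.N / B + m.P - 2) * (m.ts + B * m.tw);
    return comp + scatter + rows;
}

// Exact stationary point of equation (6) with respect to B.
double optimum_b_exact(const Model& m) {
    return std::sqrt(m.N * m.ts * m.P /
                     (m.tc * m.N * (m.P - 1) + m.P * m.tw * (m.P - 2)));
}

// Approximation (7), valid for N much larger than P.
double optimum_b_approx(const Model& m) { return std::sqrt(m.ts / m.tc); }
```

Evaluating `t_parallel_blocking` slightly below and slightly above `optimum_b_exact` should give larger times in both cases, since the B-dependent part of the model has the form a · B + c / B and therefore a single minimum.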
- 11. 2.2.3 Blocking and Interleaving Technique

In this section we describe the performance model of our implementation with the blocking and interleaving techniques on a linear network topology. Later sections compare it with a non-linear topology, taking into account the differences in the performance models. As in the previous model, we use the notation introduced above and only take into account the parallel algorithms used for the matrix calculation.

Figure 6: Communication and computation times of the parallel matrix calculation, per process, using the blocking and interleaving techniques.

The diagram in Figure 6 represents the steps of the matrix calculation performed by our algorithm, as well as the initial declarations and the required communications. The different steps are shown in different colors. Blue represents the scattering of one protein sequence to all the processes. The green areas represent the computation time needed for the matrix calculations in each block, and yellow represents the time taken to send the last row of a block to the next process. To simplify the diagram, the time the last process in the last interleave needs to receive the last row of the block of the previous process is already included in the upper yellow area. This explains why this last process has no yellow areas in its time-line but still has to wait to receive the rows needed for its matrix calculations. All of this is considered in the performance model.

As the diagram shows, the communication time of this model is composed of the scattering of a part of the protein sequence vector (blue area) for each interleave and several communications that send the last row of each block to the next process (yellow).

- 12. The scatter method receives a vector of size N and delivers a vector of size N/(P · I) to each process per interleave. The scattering time is given by:

$$T_{scatter} = t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w \tag{8}$$

This scattering is done for each interleave, so $T_{scatter}$ has to be multiplied by I:

$$T_{Tscatter} = I \cdot \left(t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w\right)$$

Sending the last row of a block to the next process costs the communication startup time ($t_s$) plus the transfer time of the B elements in that row:

$$T_{rowComm} = t_s + B \cdot t_w \tag{9}$$

To describe the total communication time clearly, we split it into the communication time of the first I − 1 interleaves and the special case of the last interleave. Each of the first I − 1 interleaves introduces N/B extra yellow areas. The communication time for all the startups and transfers of the first I − 1 interleaves is therefore:

$$T_{commInter} = (I - 1) \cdot \frac{N}{B} \cdot T_{rowComm}$$

$$T_{commInter} = (I - 1) \cdot \frac{N}{B} \cdot (t_s + B \cdot t_w) \tag{10}$$

The last interleave is slightly different: we must take into account the usual P − 1 extra communications caused by the pipeline stages. Since in our implementation the last process does not need to send its last row to another process, there are only P − 2 extra communications. The communication time for all the startups and transfers of the last interleave is therefore:

$$T_{commLastInterleave} = \left(\frac{N}{B} + P - 2\right) \cdot T_{rowComm}$$

$$T_{commLastInterleave} = \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w) \tag{11}$$
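As a numerical cross-check of the communication terms defined so far, the sketch below sums the scatter term and equations (10) and (11). The function names are hypothetical and the logarithm is assumed to be base 2. With I = 1 the result should coincide with the blocking-only communication time of equation (3).

```cpp
#include <cassert>
#include <cmath>

// Equation (9): cost of sending the last row of one block.
double t_row_comm(double B, double ts, double tw) { return ts + B * tw; }

// Sum of the scatter term and equations (10) and (11).
// log base 2 is an assumption; the text only writes "log".
double t_comm_interleaved(double N, double P, double B, double I,
                          double ts, double tw) {
    assert(N > 0 && P >= 2 && B > 0 && I >= 1);
    double scatter = I * (ts * std::log2(P) + N / (P * I) * (P - 1) * tw);
    double first_parts = (I - 1) * (N / B) * t_row_comm(B, ts, tw);
    double last_part = (N / B + P - 2) * t_row_comm(B, ts, tw);
    return scatter + first_parts + last_part;
}

// Equation (3): communication time of the blocking-only model.
double t_comm_blocking(double N, double P, double B, double ts, double tw) {
    assert(N > 0 && P >= 2 && B > 0);
    double scatter = ts * std::log2(P) + N / P * (P - 1) * tw;
    return scatter + (N / B + P - 2) * t_row_comm(B, ts, tw);
}
```

For a fixed B, raising I increases the communication time in this model, because every additional interleave part adds N/B row transfers and one more scatter startup.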
- 13. With these formulas we can finally describe the total communication time as the sum of the scattering times and the startup and transfer times of all the interleaves:

$$T_{comm} = T_{Tscatter} + T_{commInter} + T_{commLastInterleave}$$

$$T_{comm} = I \cdot \left(t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w\right) + (I - 1) \cdot \frac{N}{B} \cdot (t_s + B \cdot t_w) + \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w)$$

$$T_{comm} = I \cdot \left(t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w\right) + \left((I - 1) \cdot \frac{N}{B} + \left(\frac{N}{B} + P - 2\right)\right) \cdot (t_s + B \cdot t_w) \tag{12}$$

The next step is to calculate the total computation time. A block is composed of N/(P · I) rows and B columns, so it contains B · N/(P · I) elements. The computation time of a single block is therefore:

$$T_{compBlock} = t_c \cdot B \cdot \frac{N}{P \cdot I} \tag{13}$$

As for the total communication time, we have to take into account how the interleaving affects the computation. For the first I − 1 interleaves the computation time is given by:

$$T_{compInter} = (I - 1) \cdot \frac{N}{B} \cdot t_c \cdot B \cdot \frac{N}{P \cdot I} \tag{14}$$

Unlike the communication time, the last interleave contributes exactly N/B + P − 1 block computations. The total computation time is therefore:

$$T_{comp} = \left((I - 1) \cdot \frac{N}{B} + \left(\frac{N}{B} + P - 1\right)\right) \cdot t_c \cdot B \cdot \frac{N}{P \cdot I} \tag{15}$$

To conclude this performance model, the total parallel time is the sum of the total communication and computation times:

$$T_{parallel} = T_{comp} + T_{comm}$$
- 14.

$$T_{parallel} = I \cdot \left(t_s \cdot \log(P) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w\right) + \left((I - 1) \cdot \frac{N}{B} + \left(\frac{N}{B} + P - 2\right)\right) \times \left(t_c \cdot B \cdot \frac{N}{P \cdot I} + t_s + B \cdot t_w\right) + t_c \cdot B \cdot \frac{N}{P \cdot I} \tag{16}$$

2.2.4 Blocking and Interleaving Technique: Optimum B

To find the optimum B for given values of N, P and I, assuming N is much bigger than P, we need the value of B for which the total parallel time of computation and communication is smallest. This value is found by differentiating the total parallel time with respect to B and finding the value of B for which the derivative is zero.

$$\frac{dT_{parallel}}{dB} = 0 \Leftrightarrow$$

$$\Leftrightarrow \left(-\frac{(I - 1) N}{B^2} - \frac{N}{B^2}\right) \cdot \left(\frac{t_c B N}{I P} + t_s + B t_w\right) + \left(\frac{(I - 1) N}{B} + \frac{N}{B} + P - 2\right) \cdot \left(\frac{t_c N}{I P} + t_w\right) + \frac{t_c N}{I P} = 0 \Leftrightarrow$$

$$\Leftrightarrow B = \sqrt{\frac{N t_s P I^2}{P t_c N + P^2 t_w I - t_c N - 2 t_w I P}} \tag{17}$$

For N ≫ P the terms proportional to N dominate the denominator, which reduces to $t_c N (P - 1)$. Taking (P − 1)/P ≈ 1 gives:

$$B \approx I \cdot \sqrt{\frac{t_s}{t_c}} \tag{18}$$

For I = 1 this reduces to equation (7).

2.2.5 Blocking and Interleaving Technique: Optimum I

To find the optimum I for given values of N, P and B, assuming N is much bigger than P, we need the value of I for which the total parallel time of computation and communication is smallest. This value is found by differentiating the total parallel time with respect to I and finding the value of I for which the derivative is zero.
- 15.

$$\frac{dT_{parallel}}{dI} = 0 \Leftrightarrow I = \sqrt{\frac{N t_c (P - 1) B^2}{P (B t_s \log(P) + N t_s + N B t_w)}} \tag{19}$$

Taking (P − 1)/P ≈ 1:

$$I \approx \sqrt{\frac{N t_c B^2}{B t_s \log(P) + N t_s + N B t_w}} \tag{20}$$

2.3 Performance Model on 2D Torus Network Topology

2.3.1 Blocking Technique

Assuming that the spawning of processes is location aware in terms of the network topology, the only difference between the linear topology of the previous sections and the 2D torus network topology is in the scattering of data [1]. The performance model for this topology is therefore:

$$T_{parallel} = \left(\frac{N}{B} + P - 1\right) \cdot t_c \cdot B \cdot \frac{N}{P} + 2 \cdot t_s \cdot \log(\sqrt{P}) + \frac{N}{P} \cdot (P - 1) \cdot t_w + \left(\frac{N}{B} + P - 2\right) \cdot (t_s + B \cdot t_w) \tag{21}$$

Only the scatter term changes, and it does not depend on the variable B, so it does not affect the calculation of the optimum B. The optimum B remains:

$$B \approx \sqrt{\frac{t_s}{t_c}} \tag{22}$$

2.3.2 Blocking and Interleaving Technique

Let us again assume that the spawning of processes is location aware in terms of the network topology. The only difference between the linear topology of the previous sections and the 2D torus network topology is then in the scattering of data.

- 16. The performance model for this topology is therefore:

$$T_{parallel} = I \cdot \left(2 \cdot t_s \cdot \log(\sqrt{P}) + \frac{N}{P \cdot I} \cdot (P - 1) \cdot t_w\right) + \left((I - 1) \cdot \frac{N}{B} + \left(\frac{N}{B} + P - 2\right)\right) \times \left(t_c \cdot B \cdot \frac{N}{P \cdot I} + t_s + B \cdot t_w\right) + t_c \cdot B \cdot \frac{N}{P \cdot I} \tag{23}$$

As with the blocking technique, the scattering is not affected by B, but it is affected by I: the scattering depends on the level of interleaving. The equation for the optimum I becomes:

$$I \approx \sqrt{\frac{N t_c B^2}{2 B t_s \log(\sqrt{P}) + N t_s + N B t_w}} \tag{24}$$

The corresponding optimum B is given by:

$$B \approx I \cdot \sqrt{\frac{t_s}{t_c}} \tag{25}$$

By the properties of logarithms, $2 \log(\sqrt{P}) = \log(P)$, so the optimum I is the same for both network topologies. The only difference between the two models is in how the scattering term is expressed.

2.4 Implementation

In this section the implementation of our solution is provided and explained. Compared to the provided sequential program, our solution requires the extra parameters B and I, where B is the blocking factor and I is the interleaving factor. To run without interleaving, the I parameter should be set to 1.

In our solution all required data is first read by the root process and later broadcast or scattered to the other processes. Vector A is scattered to all of the processes. How much of it each process receives depends on the I parameter and on the number of processes: every process receives N/(I · P) rows before computing each of the interleave parts. Usually N is not divisible by I · P, so padding is introduced. The number of elements that each process receives during the scatter is calculated and stored as follows:
- 17.

    // Pad N up to the next multiple of total_processes * I
    sizeA = N % (total_processes * I) != 0
          ? N + (total_processes * I) - (N % (total_processes * I))
          : N;
    chunk_size = sizeA / (total_processes * I);

Then the root process reads the data and shares it as follows:

    // Broadcast the similarity matrix
    MPI_Bcast(sim_ptr, AA * AA, MPI_INT, 0, MPI_COMM_WORLD);
    // Broadcast the size of the portion of vector A that each process
    // will receive during the scatter
    MPI_Bcast(&chunk_size, 1, MPI_INT, 0, MPI_COMM_WORLD);
    // Broadcast N, B, I and DELTA parameters
    MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&B, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&I, 1, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Bcast(&DELTA, 1, MPI_INT, 0, MPI_COMM_WORLD);

Later, each process allocates space for a portion of the H matrix, a portion of the A vector and the whole B vector. Note that in our solution no process allocates the full-sized H matrix, only the portion to which it writes its results. The sum of the sizes of all H matrix portions distributed across the processes is N × N + N + N · (P · I): the whole matrix, the initial column filled with zeros, and the extra rows in which the processes receive data from other processes. The portions are stored in a three-dimensional array where the first dimension is the interleave ID and the other two are the row and the column. The memory is allocated and mapped, and the B vector is broadcast, as follows:

    // For every interleave part the buffer holds chunk_size rows plus one
    // extra row (index 0) for the data received from the previous process.
    // The buffer and the row-pointer table are sized for all I parts,
    // because the loop below maps (chunk_size + 1) * I rows.
    CHECK_NULL((chunk_hptr = (int *) malloc(sizeof(int) * N * (chunk_size + 1) * I)));
    CHECK_NULL((chunk_h = (int **) malloc(sizeof(int *) * (chunk_size + 1) * I)));
    CHECK_NULL((chunk_ih = (int ***) malloc(sizeof(int **) * I)));
    for (int i = 0; i < (chunk_size + 1) * I; i++)
        chunk_h[i] = chunk_hptr + i * N;
    for (int i = 0; i < I; i++)
        chunk_ih[i] = chunk_h + i * (chunk_size + 1);
    CHECK_NULL((chunk_a = (short *) malloc(sizeof(short) * (chunk_size))));
    if (rank != 0) {
        // The root process already has the B vector
        CHECK_NULL((b = (short *) malloc(sizeof(short) * (N))));
    }
    MPI_Bcast(b, N, MPI_SHORT, 0, MPI_COMM_WORLD);

- 18. Later each process calculates how many blocks there are in total and the size of the final block. This is needed because N is usually not divisible by B, so the final block is usually smaller than the rest. The time that marks the beginning of the computation is stored in the variable start. In the main loop, which iterates over the interleave parts, each process receives a portion of the A vector. The main loop is repeated I times, as explained earlier (in the section describing the blocking and interleaving technique).

    int total_blocks = N / B + (N % B == 0 ? 0 : 1);
    int last_block_size = N % B == 0 ? B : N % B;
    MPI_Status status;
    int start, end;
    start = getTimeMilli();
    for (int current_interleave = 0; current_interleave < I; current_interleave++) {
        MPI_Scatter(a + current_interleave * chunk_size * total_processes,
                    chunk_size, MPI_SHORT, chunk_a, chunk_size, MPI_SHORT,
                    0, MPI_COMM_WORLD);
        int current_column = 1;
        // Fill first column with 0
        for (int i = 0; i < chunk_size + 1; i++)
            chunk_ih[current_interleave][i][0] = 0;

- 19. Then the main calculations begin. First, the process checks whether it has to receive data from another process. If so, it receives the data required for the calculations. Then it processes the cells of the current block and stores the results in a separate array, which is gathered later. Finally, the process checks whether it has to send data to another process. If so, it sends the last row of the current block to that process. The process repeats these actions total_blocks times. At the end, it saves the time after execution in the end variable.

        for (int current_block = 0; current_block < total_blocks; current_block++) {
            // Receive
            int block_end = MIN2(current_column - (current_block == 0 ? 1 : 0) + B, N);
            if (rank == 0 && current_interleave == 0) {
                // The very first rows of the matrix have a zero-filled predecessor row
                for (int k = current_column; k < block_end; k++) {
                    chunk_ih[current_interleave][0][k] = 0;
                }
            } else {
                int receive_from = rank == 0 ? total_processes - 1 : rank - 1;
                int size_to_receive = current_block == total_blocks - 1 ? last_block_size : B;
                MPI_Recv(chunk_ih[current_interleave][0] + current_block * B,
                         size_to_receive, MPI_INT, receive_from, 0,
                         MPI_COMM_WORLD, &status);
                if (DEBUG) printf("[%d] Received from %d: ", rank, receive_from);
                if (DEBUG) print_vector(chunk_ih[current_interleave][0] + current_block * B,
                                        size_to_receive);
            }
            // Process
            for (int j = current_column; j < block_end; j++, current_column++) {
                for (int i = 1; i < chunk_size + 1; i++) {
                    int diag = chunk_ih[current_interleave][i - 1][j - 1]
                             + sim[chunk_a[i - 1]][b[j - 1]];
                    int down = chunk_ih[current_interleave][i - 1][j] + DELTA;
                    int right = chunk_ih[current_interleave][i][j - 1] + DELTA;
                    int max = MAX3(diag, down, right);
                    chunk_ih[current_interleave][i][j] = max < 0 ? 0 : max;
                }
            }
            // Send (the last process of the last interleave has no successor)
            if (current_interleave != I - 1 || rank + 1 != total_processes) {
                int send_to = rank + 1 == total_processes ? 0 : rank + 1;
                int size_to_send = current_block == total_blocks - 1 ? last_block_size : B;
                MPI_Send(chunk_ih[current_interleave][chunk_size] + current_block * B,
                         size_to_send, MPI_INT, send_to, 0, MPI_COMM_WORLD);
                if (DEBUG) printf("[%d] Sent to %d: ", rank, send_to);
                if (DEBUG) print_vector(chunk_ih[current_interleave][chunk_size] + current_block * B,
                                        size_to_send);
            }
        }
    }
    end = getTimeMilli();

- 20. When all the calculations are finished, all processes start the gather operation. After the gather, the root process holds the whole H matrix. The root process then prints the execution time to the stderr stream and, if debugging is enabled, prints the H matrix.

    for (int i = 0; i < I; i++) {
        // chunk_ih[i][1] is the first computed row of interleave part i;
        // row 0 of each part only holds the row received from the neighbour.
        MPI_Gather(chunk_ih[i][1], N * chunk_size, MPI_INT,
                   hptr + i * chunk_size * total_processes * N,
                   N * chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
    }
    if (rank == 0) {
        fprintf(stderr, "Execution: %f s\n", (double) (end - start) / 1000000);
    }
    if (DEBUG) {
        if (rank == 0) {
            for (int i = 0; i < N - 1; i++) {
                print_vector(h[i], N);
            }
        }
    }
    MPI_Finalize();

The full code is provided in the Appendix section.
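The index arithmetic used throughout this section can be isolated into small helpers, which makes it easy to check by hand. This is a sketch rather than part of the project code: the function names are hypothetical, although each body mirrors an expression from the listings above.

```cpp
#include <cassert>

// Length of vector A after padding to a multiple of P * I
// (the value called sizeA in the listing).
int padded_size(int N, int P, int I) {
    assert(N > 0 && P > 0 && I > 0);
    int unit = P * I;
    return N % unit != 0 ? N + unit - N % unit : N;
}

// Rows each process receives per interleave part (chunk_size in the listing).
int chunk_size(int N, int P, int I) { return padded_size(N, P, I) / (P * I); }

// Number of column blocks, counting a shorter final block if N % B != 0.
int total_blocks(int N, int B) {
    assert(B > 0);
    return N / B + (N % B == 0 ? 0 : 1);
}

// Width of the final column block.
int last_block_size(int N, int B) {
    assert(B > 0);
    return N % B == 0 ? B : N % B;
}

// Ring neighbours used with interleaving: the last rank wraps around to rank 0.
int send_to(int rank, int P) { return rank + 1 == P ? 0 : rank + 1; }
int receive_from(int rank, int P) { return rank == 0 ? P - 1 : rank - 1; }
```

For example, with N = 10001, P = 8 and I = 2 the padded length is 10016 and each process receives 626 rows per interleave part. With B = 200 there are 51 column blocks, the last of which is a single column wide.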
- 21. 3 Performance Results

In this section the performance results of our implementation on ALTIX are provided and compared to the performance of the sequential code.

3.1 Finding Optimal P and B

To find the optimal P and B, we tested the application with different P and B parameters for N = 10,000. Before that we tested the sequential code, which executed the calculations in 12.598 seconds. The execution times of the parallelized version are shown in Figure 7.

Figure 7: Performance results with different P and B where N = 10,000

From this it can be concluded that with the parameters N = 10,000, B = 100, P = 8 and I = 1 the parallel code executed the calculations 9 times faster than the sequential code.

3.2 Finding Optimal I

To find the optimal I, we selected the best result from the previous test, where P = 8, and ran the test with different I and B parameters. The result is shown in Figure 8. Because environmental factors such as network congestion affect our performance tests, the results might not be completely accurate. With that caveat, we deduced from the results that the optimal parameter configuration for N = 10,000 is I = 2, B = 200, P = 8. With this configuration the parallel code calculates the matrix 8 times faster than the sequential code.

Finally, we tested the parallel code with N = 25,000 and the parameters that we found to be optimal. The parallel code executed the calculations in 11.822213 seconds, whereas the sequential code ran for 76.884 seconds. From this it can be concluded that the parallel code runs 6.5 times faster. The speedup is lower because, as stated earlier, the optimal B and I depend on N, so the parameter configuration found for N = 10,000 is not optimal for calculating the similarity of vectors of size N = 25,000.

- 22. Figure 8: Performance results with different I and B where N = 10,000 and P = 8

4 Conclusions

During this project a parallel implementation of the Smith-Waterman algorithm was made using blocking and interleaving techniques. The techniques and the code were explained in detail. Performance models were derived for both the linear and the 2D torus network topologies. For each network topology we also found the equations for the optimum blocking factor B when using the blocking technique, and for the optimum B and interleaving factor I when using the blocking and interleaving technique. From the models we concluded that the calculation of the B and I factors for our algorithm is the same on these two network topologies.

Performance tests using multiple processes on different processors were carried out. The optimal configuration found for calculating the sequence alignment of two vectors of size N = 10,000 with our implementation is I = 2, B = 200, P = 8. With this configuration the parallel code calculates the matrix 8 times faster than the sequential code. With the same parameter configuration the parallel code calculates the matrix of size N = 25,000 6.5 times faster than the sequential code.
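The conclusion that both topologies lead to the same interleaving factor can be checked numerically by evaluating equations (20) and (24) side by side. The sketch below uses hypothetical function names and assumes base-2 logarithms. Any parameter values passed to it are illustrative, not measurements from ALTIX.

```cpp
#include <cassert>
#include <cmath>

// Equation (20): optimum interleaving factor on the linear topology.
// log base 2 is an assumption; the text does not state the base.
double optimum_i_linear(double N, double P, double B,
                        double ts, double tw, double tc) {
    assert(N > 0 && P >= 2 && B > 0);
    return std::sqrt(N * tc * B * B /
                     (B * ts * std::log2(P) + N * ts + N * B * tw));
}

// Equation (24): the same quantity on the 2D torus topology.
double optimum_i_torus(double N, double P, double B,
                       double ts, double tw, double tc) {
    assert(N > 0 && P >= 2 && B > 0);
    return std::sqrt(N * tc * B * B /
                     (2 * B * ts * std::log2(std::sqrt(P)) + N * ts + N * B * tw));
}
```

Because 2 · log(√P) = log(P), the two functions agree up to floating-point rounding for any valid input, which is the property stated in the Conclusions.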
- 23. References

[1] Peter Harrison, William Knottenbelt, Parallel Algorithms. Department of Computing, Imperial College London, 2009.

[2] Norm Matloff, Programming on Parallel Machines. University of California, Davis, 2011.
- 24. A How to Compile all: seq par seq: gcc SW.c -o seq.out par: icc protein.cpp -o protein.out -lmpi B How to Execute on ALTIX #!/bin/bash # @ job_name = ampp01parallel # @ initialdir = . # @ output = mpi_%j.out # @ error = mpi_%j.err # @ total_tasks = <number_of_process> # @ wall_clock_limit = 00:01:00 mpirun -np <number_of_process> ./protein.out <vector_a> <vector_b> <similarity_matrix> <gap_penalty> <N> <B> <I> C Code #include <stdio.h> #include <stdlib.h> #include <ctype.h> // character handling #include <stdlib.h> // def of RAND_MAX #include <sys/time.h> #include "mpi.h" #define DEBUG 1 #define MAX_SEQ 50 #define CHECK_NULL(_check) { if ((_check)==NULL) { fprintf(stderr, "Null Pointer allocating memoryn"); 24
- 25.
            exit(-1); \
        } \
    }

    #define AA 20 // number of amino acids
    #define MAX2(x,y) ((x)<(y) ? (y) : (x))
    #define MAX3(x,y,z) (MAX2(x,y)<(z) ? (z) : MAX2(x,y))
    #define MIN2(x,y) ((x)>(y) ? (y) : (x))

    // function prototypes
    int getTimeMilli();
    void read_pam(FILE* pam);
    void read_files(FILE* in1, FILE* in2);
    void print_vector(int* vector, int size);
    void print_short_vector(short* vector, int size);
    void memcopy(int* src, int* dst, int count);

    /* begin AMPP*/
    int char2AAmem[256];
    int AA2charmem[AA];
    void initChar2AATranslation(void);
    /* end AMPP */

    /* Define global variables */
    int rank, total_processes;
    int DELTA;
    short *a, *b;
    int *chunk_hptr;
    int **chunk_h, ***chunk_ih;
    int *sim_ptr, **sim; // PAM similarity matrix
    int N, sizeA, B, I, chunk_size;
    short *chunk_a;
    int* hptr;
    int** h;
    FILE *pam;

    int main(int argc, char *argv[]) {
        /* begin AMPP */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &total_processes);
        CHECK_NULL((sim_ptr = (int *) malloc(AA * AA * sizeof(int))));
- 26.
        CHECK_NULL((sim = (int **) malloc(AA * sizeof(int*))));
        for(int i = 0; i < AA; i++) sim[i] = sim_ptr + i * AA;
        if (rank == 0) {
            FILE *in1, *in2;
            /**** Error handling for input file ****/
            if (!(argc >= 5 && argc <= 8)) {
                fprintf(stderr,"%s protein1 protein2 PAM gapPenalty [N] [B] [I]\n",argv[0]);
                // Abort all ranks; a plain exit() here would leave the others blocked in MPI_Bcast
                MPI_Abort(MPI_COMM_WORLD, 1);
            } else {
                in1 = fopen(argv[1],"r");
                in2 = fopen(argv[2],"r");
                N = (argc > 5 ? atoi(argv[5]) : MAX_SEQ) + 1;
                B = argc > 6 ? atoi(argv[6]) : total_processes;
                I = argc > 7 ? atoi(argv[7]) : 1;
                DELTA = atoi(argv[4]);
            }
            /* end AMPP */
            /* begin AMPP */
            // Pad the row count up to a multiple of total_processes * I
            sizeA = N % (total_processes * I) != 0
                ? N + (total_processes * I) - (N % (total_processes * I))
                : N;
            CHECK_NULL((a = (short *) calloc(sizeA, sizeof(short))));
            // Zero-filled so that a short input file leaves no unset entries
            CHECK_NULL((b = (short *) calloc(N, sizeof(short))));
            initChar2AATranslation();
            read_files(in1, in2);
            chunk_size = sizeA / (total_processes * I);
            CHECK_NULL((hptr = (int *) calloc(N * sizeA, sizeof(int))));
            CHECK_NULL((h = (int **) calloc(sizeA, sizeof(int*))));
            for(int i = 0; i < sizeA; i++) h[i] = hptr + i * N;
            pam = fopen(argv[3], "r");
            read_pam(pam);
        }
        MPI_Bcast(sim_ptr, AA * AA, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Bcast(&chunk_size, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Bcast(&N, 1, MPI_INT, 0, MPI_COMM_WORLD);
- 27.
        MPI_Bcast(&B, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Bcast(&I, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPI_Bcast(&DELTA, 1, MPI_INT, 0, MPI_COMM_WORLD);
        // One extra row per interleave pass holds the boundary row received from the previous process
        CHECK_NULL((chunk_hptr = (int *) malloc(sizeof(int) * (N) * (chunk_size + 1) * I)));
        CHECK_NULL((chunk_h = (int **) malloc(sizeof(int*) * (chunk_size + 1) * I)));
        CHECK_NULL((chunk_ih = (int ***) malloc(sizeof(int**) * I)));
        for(int i = 0; i < (chunk_size + 1) * I; i++) chunk_h[i] = chunk_hptr + i * N;
        for (int i = 0; i < I; i++) chunk_ih[i] = chunk_h + i * (chunk_size + 1);
        CHECK_NULL((chunk_a = (short *) malloc(sizeof(short) * (chunk_size))));
        if (rank != 0) {
            CHECK_NULL((b = (short *) malloc(sizeof(short) * (N))));
        }
        MPI_Bcast(b, N, MPI_SHORT, 0, MPI_COMM_WORLD);

        /*** PARALLEL PART ***/
        /** compute "h" local similarity array **/
        int total_blocks = N / B + (N % B == 0 ? 0 : 1);
        int last_block_size = N % B == 0 ? B : N % B;
        MPI_Status status;
        int start, end;
        start = getTimeMilli();
        for (int current_interleave = 0; current_interleave < I; current_interleave++) {
            MPI_Scatter(a + current_interleave * chunk_size * total_processes,
                        chunk_size, MPI_SHORT, chunk_a, chunk_size, MPI_SHORT,
                        0, MPI_COMM_WORLD);
            int current_column = 1;
            // Fill first column with 0
- 28.
            for (int i = 0; i < chunk_size + 1; i++)
                chunk_ih[current_interleave][i][0] = 0;
            for (int current_block = 0; current_block < total_blocks; current_block++) {
                // Receive
                int block_end = MIN2(current_column - (current_block == 0 ? 1 : 0) + B, N);
                if (rank == 0 && current_interleave == 0) {
                    for (int k = current_column; k < block_end; k++) {
                        chunk_ih[current_interleave][0][k] = 0;
                    }
                } else {
                    int receive_from = rank == 0 ? total_processes - 1 : rank - 1;
                    int size_to_receive = current_block == total_blocks - 1 ? last_block_size : B;
                    MPI_Recv(chunk_ih[current_interleave][0] + current_block * B,
                             size_to_receive, MPI_INT, receive_from, 0,
                             MPI_COMM_WORLD, &status);
                    if (DEBUG) printf("[%d] Received from %d: ", rank, receive_from);
                    if (DEBUG) print_vector(chunk_ih[current_interleave][0] + current_block * B, size_to_receive);
                }
                // Process
                for (int j = current_column; j < block_end; j++, current_column++) {
                    for (int i = 1; i < chunk_size + 1; i++) {
                        int diag = chunk_ih[current_interleave][i - 1][j - 1] + sim[chunk_a[i - 1]][b[j - 1]];
                        int down = chunk_ih[current_interleave][i - 1][j] + DELTA;
                        int right = chunk_ih[current_interleave][i][j - 1] + DELTA;
                        int max = MAX3(diag, down, right);
                        chunk_ih[current_interleave][i][j] = max < 0 ? 0 : max;
                    }
                }
                // Send
- 29.
                if (current_interleave != I - 1 || rank + 1 != total_processes) {
                    int send_to = rank + 1 == total_processes ? 0 : rank + 1;
                    int size_to_send = current_block == total_blocks - 1 ? last_block_size : B;
                    MPI_Send(chunk_ih[current_interleave][chunk_size] + current_block * B,
                             size_to_send, MPI_INT, send_to, 0, MPI_COMM_WORLD);
                    if (DEBUG) printf("[%d] Sent to %d: ", rank, send_to);
                    if (DEBUG) print_vector(chunk_ih[current_interleave][chunk_size] + current_block * B, size_to_send);
                }
            }
        }
        end = getTimeMilli();
        for (int i = 0; i < I; i++) {
            // Send from row 1 of pass i: row 0 is the boundary row received from the
            // previous process. The original offset (chunk_hptr + N + i * chunk_size * N)
            // matches this only for i == 0.
            MPI_Gather(chunk_ih[i][1], N * chunk_size, MPI_INT,
                       hptr + i * chunk_size * total_processes * N,
                       N * chunk_size, MPI_INT, 0, MPI_COMM_WORLD);
        }
        if (rank == 0) {
            // getTimeMilli() returns microseconds, hence the division by 1000000
            fprintf(stderr, "Execution: %f s\n", (double) (end - start) / 1000000);
        }
        if (DEBUG) {
            if (rank == 0) {
                for (int i = 0; i < N - 1; i++) {
                    print_vector(h[i], N);
                }
            }
        }
        //Free everything!
        free(sim_ptr);
        free(sim);
        free(b);
- 30.
        free(chunk_ih);
        free(chunk_h);
        free(chunk_hptr);
        free(chunk_a);
        if (rank == 0) {
            free(a);
            free(hptr);
            free(h);
        }
        MPI_Finalize();
        return 0;
    }

    void memcopy(int* src, int* dst, int count) {
        for (int i = 0; i < count; i++) {
            dst[i] = src[i];
        }
    }

    void print_vector(int* vector, int size) {
        for (int i = 0; i < size; i++) {
            printf("%2d ", vector[i]);
        }
        printf("\n");
    }

    void print_short_vector(short* vector, int size) {
        for (int i = 0; i < size; i++) {
            printf("%2d ", vector[i]);
        }
        printf("\n");
    }

    void read_pam(FILE* pam) {
        int i, j;
        int temp;
        /** read PAM250 similarity matrix **/
        /* begin AMPP */
        fscanf(pam,"%*s");
        /* end AMPP */
        for (i = 0; i < AA; i++)
            for (j = 0; j <= i; j++) {
                if (fscanf(pam, "%d ", &temp) == EOF) {
- 31.
                    fprintf(stderr, "PAM file empty\n");
                    fclose(pam);
                    exit(1);
                }
                sim[i][j]=temp;
            }
        fclose(pam);
        for (i = 0; i < AA; i++)
            for (j = i + 1; j < AA; j++)
                sim[i][j] = sim[j][i]; // symmetrify
    }

    void read_files(FILE* in1, FILE* in2) {
        int i=0;
        int nc;
        char ch;
        /** read first file in array "a" **/
        do {
            nc=fscanf(in1,"%c",&ch);
            // Cast to unsigned char so a negative char value never indexes before the table
            if (nc>0 && char2AAmem[(unsigned char) ch]>=0) {
                a[i++] = char2AAmem[(unsigned char) ch];
            }
        } while (nc>0 && (i<N));
        fclose(in1);
        /** read second file in array "b" **/
        i=0;
        do {
            nc=fscanf(in2,"%c",&ch);
            if (nc>0 && char2AAmem[(unsigned char) ch]>=0) {
                b[i++] = char2AAmem[(unsigned char) ch];
            }
        } while (nc>0 && (i<N));
        fclose(in2);
    }

    /* Begin AMPP */
    void initChar2AATranslation(void) {
        int i;
        for(i=0; i<256; i++) char2AAmem[i]=-1;
        char2AAmem['c']=char2AAmem['C']=0;
- 32.
        AA2charmem[0]='c';
        char2AAmem['g']=char2AAmem['G']=1;   AA2charmem[1]='g';
        char2AAmem['p']=char2AAmem['P']=2;   AA2charmem[2]='p';
        char2AAmem['s']=char2AAmem['S']=3;   AA2charmem[3]='s';
        char2AAmem['a']=char2AAmem['A']=4;   AA2charmem[4]='a';
        char2AAmem['t']=char2AAmem['T']=5;   AA2charmem[5]='t';
        char2AAmem['d']=char2AAmem['D']=6;   AA2charmem[6]='d';
        char2AAmem['e']=char2AAmem['E']=7;   AA2charmem[7]='e';
        char2AAmem['n']=char2AAmem['N']=8;   AA2charmem[8]='n';
        char2AAmem['q']=char2AAmem['Q']=9;   AA2charmem[9]='q';
        char2AAmem['h']=char2AAmem['H']=10;  AA2charmem[10]='h';
        char2AAmem['k']=char2AAmem['K']=11;  AA2charmem[11]='k';
        char2AAmem['r']=char2AAmem['R']=12;  AA2charmem[12]='r';
        char2AAmem['v']=char2AAmem['V']=13;  AA2charmem[13]='v';
        char2AAmem['m']=char2AAmem['M']=14;  AA2charmem[14]='m';
        char2AAmem['i']=char2AAmem['I']=15;  AA2charmem[15]='i';
        char2AAmem['l']=char2AAmem['L']=16;  AA2charmem[16]='l';
        char2AAmem['f']=char2AAmem['F']=17;  AA2charmem[17]='f'; // was 'L', which duplicated entry 16
        char2AAmem['y']=char2AAmem['Y']=18;  AA2charmem[18]='y';
        char2AAmem['w']=char2AAmem['W']=19;  AA2charmem[19]='w';
    }

    int getTimeMilli() {
        struct timeval tv;
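- 33.
        gettimeofday(&tv, NULL);
        // Despite the name, the result is in microseconds. Seconds are counted from
        // the first call so that the value fits in an int for runs of up to ~35 minutes
        // (the absolute epoch time in microseconds would overflow).
        static time_t base_sec = 0;
        if (base_sec == 0) base_sec = tv.tv_sec;
        int ret = tv.tv_usec;
        ret += (int) ((long) (tv.tv_sec - base_sec) * 1000000L); // Add seconds
        return ret;
    }
    /* end AMPP*/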
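  For checking the parallel output on small inputs, a sequential reference of the same cell update can be handy. The following is a minimal sketch, not the project's `SW.c`: the function name `sw_max_score` is hypothetical, and a simple match/mismatch score (+2 / -1) stands in for the PAM lookup `sim[..][..]` as an assumption. As in the listing, `delta` is added for a gap, so a penalty is passed as a negative value. Only two rows are kept because no traceback is performed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumed scoring, standing in for the PAM matrix lookup. */
static int score(char x, char y) { return x == y ? 2 : -1; }

static int max3(int x, int y, int z) {
    int m = x > y ? x : y;
    return m > z ? m : z;
}

/* Returns the highest value in the local-alignment matrix H of a and b,
 * using the recurrence H[i][j] = max(0, diag, down, right). */
int sw_max_score(const char *a, const char *b, int delta) {
    size_t n = strlen(a), m = strlen(b);
    int *prev = calloc(m + 1, sizeof *prev); /* row i - 1, starts as zeros */
    int *curr = calloc(m + 1, sizeof *curr); /* row i */
    assert(prev != NULL && curr != NULL);
    int best = 0;
    for (size_t i = 1; i <= n; i++) {
        curr[0] = 0; /* first column is always 0 */
        for (size_t j = 1; j <= m; j++) {
            int diag  = prev[j - 1] + score(a[i - 1], b[j - 1]);
            int down  = prev[j] + delta;
            int right = curr[j - 1] + delta;
            int v = max3(diag, down, right);
            curr[j] = v < 0 ? 0 : v;
            if (curr[j] > best) best = curr[j];
        }
        int *tmp = prev; prev = curr; curr = tmp; /* reuse the two rows */
    }
    free(prev);
    free(curr);
    return best;
}
```

  With this assumed scoring, aligning `"ACGT"` with itself and a gap value of -1 gives a best score of 8, while `"ACGT"` against `"ACT"` gives 5 (three matches and one gap).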