APPENDIX A Installing Microsoft Visual Studio
Figure 1 Installing Visual Studio
Figure 2 Setup Preparation
Figure 3 Installation Path. In this step, specify where Visual Studio will be installed, then click Next and wait for the installation to finish.
Figure 4 Installing the Visual Studio Components
Figure 5 Restarting the Computer
Figure 6 Installation Process After the Computer Restarts
Figure 7 Installation Process Complete

Installing MPICH2

Figure 8 Installing MPICH2
Figure 9 Installation Process and Finishing Setup. Follow the Next prompts after the setup window appears until the installation-path window is shown, then choose where MPICH2 will be installed. Click Next to start the installation, wait until it completes, and click Finish.
Figure 10 Installing smpd and Validating MPI. To use the MPI functionality that will be integrated with Visual Studio, the MPI service must first be enabled. After installing MPICH2, install the smpd service with the command smpd -install, start it with smpd -start, then check it with mpiexec -validate; if this reports success, the MPI service is running.
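In short, the whole sequence at an administrator command prompt is the three commands named above; nothing beyond MPICH2's own smpd and mpiexec tools is assumed:

smpd -install
smpd -start
mpiexec -validate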
Setting Up MPI in Visual Studio

Figure 11 Additional Include Directories. Right-click the project in the Solution Explorer and choose Properties. Under Configuration Properties, expand C/C++ and select General, then in the Additional Include Directories field add the path to the MPICH2 include folder so that the MPI headers can be found.
Figure 12 Additional Library Directories.
Figure 13 Additional Dependencies. Expand the Linker menu and select General; under Additional Library Directories add the path to the lib folder so that mpi.lib, declared in Additional Dependencies under the Linker > Input submenu, can be resolved when the application is compiled and run.
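To confirm that the include, library, and dependency settings are working, a minimal MPI test program such as the sketch below can be compiled and run; the file name mpi_hello.c is only a placeholder, and the code assumes nothing beyond the standard MPI C API:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);               /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes */
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

If it builds without unresolved mpi.h or mpi.lib errors and prints one line per process, the settings above are correct.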
Setting Up the Cluster Connection

Firewall Configuration

The firewall on each user's computer must be opened up so that MPI connections sent from the cluster computers are not blocked by the other machines.
Figure 14 Finding the Firewall with the search box.
Figure 15 Advanced Security Firewall.
Figure 16 Firewall Properties. Set the Firewall State to Off so that inbound and outbound connections do not block MPI traffic when data is sent to or received from the cluster.
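The same setting can also be applied from an administrator command prompt; assuming Windows 7 or later, the following netsh command turns the firewall off for all profiles at once:

netsh advfirewall set allprofiles state off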
IP and User Credential Configuration

Figure 17 Searching for Network and Sharing Center.
Figure 18 Network and Sharing Center. Choose Change adapter settings, then right-click Local Area Connection and choose Properties.
Figure 19 Local Area Connection Properties.
Figure 20 IPv4 Properties.
Configure each user PC the same way and assign each PC its own IP address. In this project the first PC uses IP 192.168.62.10 and the second PC uses IP 192.168.62.11.
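The addresses can also be assigned from an elevated command prompt instead of the dialog; a sketch using netsh, assuming the adapter is named Local Area Connection and a 255.255.255.0 subnet mask:

netsh interface ip set address name="Local Area Connection" static 192.168.62.10 255.255.255.0

On PC 2 the same command is used with 192.168.62.11.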
PC 1
PC 2
Figure 3.19 User accounts on the host and client.
The user name and password on PC 1 and PC 2 must be identical so that PC 2 is detected during MPI execution and MPI can transfer data between PC 1 and PC 2.

Setting Up Component Services

In the Start menu search box, type dcomcnfg.exe and press Enter. Select Component Services, expand the Computers folder, right-click My Computer, and choose Properties.
Figure 21 Component Services.
Figure 22 COM Security limits in My Computer Properties. Click COM Security and choose Edit Limits. Here the connection of users to the main computer is configured so that the PC's security grants Allow status to users connecting to the main computer. First add the user that will be given permission to access the main computer.
Figure 23 Select User search. Click Advanced to open the dialog for choosing the kind of user to add to the permissions.
Figure 24 Advanced Select User. Click Find Now to list the available users, select Everyone, and click OK.
Figure 25 Edit Permissions for the selected user. For the Everyone user, check the boxes under Access Permission and Launch and Activation Permission, selecting Allow for every option. Click OK and close Component Services.

Connection Test and Running the MPI Application

Figure 26 Ping test. Use the ping command followed by the IP address of a cluster computer, for example ping 192.168.62.11, to verify that the cluster machines are connected.
Figure 27 Running MPI from the Command Prompt. An application built with MPI is run from the command prompt with the following commands:
Local: mpirun -np 2 file.exe. The number 2 sets how many processes are simulated virtually on the local host; it can be replaced by any power of two (2^n). Cluster: mpirun -np 2 -host host1,host2 file.exe. This is the same as the local form, with -host and the name of each host computer added; the number of hosts and the process count must likewise be a power of two (2^n).
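For example, assuming the executable is named file.exe and the two machines are registered as host1 and host2 (placeholder host names), four processes can be launched locally or across the cluster as follows:

mpirun -np 4 file.exe
mpirun -np 4 -host host1,host2 file.exe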
Figure 28 Task Manager on the Cluster Computer. When running under MPI, make sure the CPU usage on the cluster computer shows processing activity; this indicates that data is being processed on the cluster computer.
Setting Up NVIDIA Nsight

The first step in using NVIDIA Nsight is to have Visual Studio already installed on the user PC, so that when the NVIDIA Toolkit is installed the Nsight templates are integrated into Visual Studio's New Project dialog and can be used immediately. After the installation succeeds, check whether the GPU hardware is capable of programming and running CUDA.
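Whether the installed GPU can run CUDA can also be checked programmatically; the following sketch, which is not part of the appendix's own listings and assumes only the standard CUDA runtime API, lists every CUDA-capable device:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    /* ask the CUDA runtime how many CUDA-capable GPUs are present */
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA-capable GPU found.\n");
        return 1;
    }
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i); /* name and compute capability */
        printf("Device %d: %s (compute %d.%d)\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}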
Figure 29 NVIDIA Installer summary after installing the Toolkit.
Figure 30 Searching the code samples to test the GPU. To find out whether the GPU installed in the PC supports CUDA, open the NVIDIA CUDA Samples browser, search for the keyword particles, and click Run on Smoke Particles.
Figure 31 Smoke Particles code sample. If the smoke-screen render appears, the GPU supports CUDA.
Figure 32 CUDA templates integrated with Visual Studio. After installation completes, the installation summary lists the CUDA Nsight features and components that were integrated into Visual Studio and the user PC, and Visual Studio now offers a CUDA Runtime project template.
Figure 33 CUDA path in the environment variables. In Environment Variables, reached from My Computer Properties via Advanced system settings, make sure there is a CUDA path pointing to the CUDA bin, include, and library folders so that CUDA programs can be compiled and executed.
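With the path set correctly, the CUDA compiler is reachable from any command prompt; for example, assuming a source file named kernel.cu, the following should compile without typing the compiler's full path:

nvcc kernel.cu -o kernel.exe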
Running a CUDA Application

When a CUDA application is executed, verify that the GPU is working by using GPU-Z or CUDA-Z; these tools read the GPU's sensors and show the user the processing activity.
Figure 34 Running a CUDA application
Figure 35 GPU sensors when idle and while executing a program
APPENDIX B

Source Code CPU Computing

Sorting

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

void quicksort(float [], int, int);

int main()
{
    LARGE_INTEGER frequency;
    LARGE_INTEGER t1, t2;
    double elapsedTime;
    QueryPerformanceFrequency(&frequency);
    int size, i;
    float *x;
    float aa = 100.0;
    printf("Enter size of the array: ");
    scanf("%d", &size);
    x = (float *)malloc((size+1)*sizeof(float));
    for (i = 0; i < size; i++) {
        x[i] = ((float)rand()/(float)(RAND_MAX)) * aa;
    }
    QueryPerformanceCounter(&t1);
    quicksort(x, 0, size-1);
    QueryPerformanceCounter(&t2);
    elapsedTime = (t2.QuadPart - t1.QuadPart)*1000.0/frequency.QuadPart;
    printf("\n\n%f ms\n", elapsedTime);
    system("pause");
    return 0;
}

void quicksort(float x[], int first, int last)
{
    int pivot, j, i;
    float temp;
    if (first < last) {
        pivot = first;
        i = first;
        j = last;
        while (i < j) {
            while (x[i] <= x[pivot] && i < last)
                i++;
            while (x[j] > x[pivot])
                j--;
            if (i < j) {
                temp = x[i];
                x[i] = x[j];
                x[j] = temp;
            }
        }
        temp = x[pivot];
        x[pivot] = x[j];
        x[j] = temp;
        quicksort(x, first, j-1);
        quicksort(x, j+1, last);
    }
}
Binary Search

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

int main()
{
    LARGE_INTEGER frequency;
    LARGE_INTEGER t1, t2;
    double elapsedTime;
    int c, n;
    int first, last, middle;
    float search;
    double *array;
    float c2 = 1.25;
    printf("number of elements\n");
    scanf("%d", &n);
    array = (double *)malloc((n+1) * sizeof(double));
    //printf("Enter %d integers\n", n);
    QueryPerformanceFrequency(&frequency);
    for (c = 0; c < n; c++) {
        array[c] = c2;
        c2 = c2 + 1.25;
    }
    printf("\nvalue to find\n");
    scanf("%f", &search);
    first = 0;
    last = n - 1;
    middle = (first + last)/2;
    QueryPerformanceCounter(&t1);
    while (first <= last) {
        if (array[middle] < search) {
            first = middle + 1;
        } else if (array[middle] == search) {
            printf("%f found at location %d.\n", search, middle+1);
            break;
        } else {
            last = middle - 1;
        }
        middle = (first + last)/2;
    }
    if (first > last) {
        printf("Not found! %f is not present in the list.\n", search);
    }
    QueryPerformanceCounter(&t2);
    elapsedTime = (t2.QuadPart - t1.QuadPart)*1000.0/frequency.QuadPart;
    printf("\n\n\n%f ms\n", elapsedTime);
    system("pause");
    return 0;
}

Matrix Multiplication

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

int main()
{
    //FLOATING
    int i, j, k;
    double **mat1, **mat2, **res;
    long n;
    float aa = 5.0;
    LARGE_INTEGER frequency;
    LARGE_INTEGER t1, t2;
    double elapsedTime;
    // get the order of the matrix from the user
    printf("Size of matrix:");
    scanf("%ld", &n);
    QueryPerformanceFrequency(&frequency);
    // dynamically allocate memory to store elements
    mat1 = (double **)malloc(sizeof(double *) * n);
    mat2 = (double **)malloc(sizeof(double *) * n);
    res  = (double **)malloc(sizeof(double *) * n);
    for (i = 0; i < n; i++) {
        mat1[i] = (double *)malloc(sizeof(double) * n);
        mat2[i] = (double *)malloc(sizeof(double) * n);
        res[i]  = (double *)malloc(sizeof(double) * n);
    }
    // get the input matrix
    printf("\n");
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            //mat1[i][j] = rand() % 10 + 1;
            mat1[i][j] = ((float)rand()/(float)(RAND_MAX)) * aa;
        }
    }
    printf("matrix 1:\n");
    for (int aa = 0; aa < n; aa++) {
        for (int bb = 0; bb < n; bb++)
            printf("%.1f ", mat1[aa][bb]);
        printf("\n");
    }
    /* the second matrix, the timed multiplication, and the timing output
       were unreadable in the original listing; they are reconstructed here
       following the pattern of the other programs */
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            mat2[i][j] = ((float)rand()/(float)(RAND_MAX)) * aa;
        }
    }
    QueryPerformanceCounter(&t1);
    for (i = 0; i < n; i++) {
        for (j = 0; j < n; j++) {
            res[i][j] = 0.0;
            for (k = 0; k < n; k++)
                res[i][j] += mat1[i][k] * mat2[k][j];
        }
    }
    QueryPerformanceCounter(&t2);
    elapsedTime = (t2.QuadPart - t1.QuadPart)*1000.0/frequency.QuadPart;
    printf("\n\n%f ms\n", elapsedTime);
    free(mat1);
    free(mat2);
    free(res);
    system("pause");
    return 0;
}

Gauss Jordan Elimination

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
#include <math.h>
#include <malloc.h>
int main()
{
    int i, j, n;
    double **a, *b, *x;
    LARGE_INTEGER frequency;
    LARGE_INTEGER t1, t2;
    double elapsedTime;
    void gauss_jordan(int n, double **a, double *b, double *x);
    printf("\nNumber of equations: ");
    scanf("%d", &n);
    float aa = 10.0;
    QueryPerformanceFrequency(&frequency);
    x = (double *)malloc((n+1)*sizeof(double));
    b = (double *)malloc((n+1)*sizeof(double));
    a = (double **)malloc((n+1)*sizeof(double *));
    for (i = 1; i <= n; i++)
        a[i] = (double *)malloc((n+1)*sizeof(double));
    for (i = 1; i <= n; i++) {
        for (j = 1; j <= n; j++) {
            //a[i][j] = rand()%10 + 1;
            a[i][j] = ((float)rand()/(float)(RAND_MAX)) * aa;
        }
        //b[i] = rand()%10 + 1;
        b[i] = ((float)rand()/(float)(RAND_MAX)) * aa;
    }
    for (int aa = 1; aa <= n; aa++) {
        for (int bb = 1; bb <= n; bb++) {
            printf("%.1f ", a[aa][bb]);
        }
        printf(" %.1f ", b[aa]);
        printf("\n");
    }
    printf("\n\n");
    QueryPerformanceCounter(&t1);
    gauss_jordan(n, a, b, x);
    QueryPerformanceCounter(&t2);
    elapsedTime = (t2.QuadPart - t1.QuadPart)*1000.0/frequency.QuadPart;
    printf("\n\n\n%f ms\n", elapsedTime);
    printf("\nSolution\n");
    printf("------------------------------------------------\n");
    printf("x = (");
    for (i = 1; i <= n-1; i++)
        printf("%lf, ", x[i]);
    printf("%lf)\n\n", x[n]);
    system("pause");
    return (0);
}

void gauss_jordan(int n, double **a, double *b, double *x)
{
    int i, j, k;
    int p;
    double factor;
    double big, dummy;
    for (k = 1; k <= n; k++) {
        // pivoting
        if (k < n) {
            p = k;
            big = fabs(a[k][k]);
            for (i = k+1; i <= n; i++) {
                if (big < fabs(a[i][k])) {
                    big = fabs(a[i][k]);
                    p = i;
                }
            }
            if (p != k) {
                for (j = 1; j <= n; j++) {
                    dummy = a[p][j];
                    a[p][j] = a[k][j];
                    a[k][j] = dummy;
                }
                dummy = b[p];
                b[p] = b[k];
                b[k] = dummy;
            }
        }
        // Gauss-Jordan elimination
        factor = a[k][k];
        for (j = 1; j <= n; j++)
            a[k][j] /= factor;
        b[k] /= factor;
        for (i = 1; i <= n; i++) {
            if (i == k) continue;
            factor = a[i][k];
            for (j = 1; j <= n; j++)
                a[i][j] -= a[k][j]*factor;
            b[i] -= b[k]*factor;
        }
    }
    for (i = 1; i <= n; i++)
        x[i] = b[i];
    return;
}

Source Code GPU Computing

Sorting

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
using namespace std;

//#define NUM 8

__device__ inline void swap(float & a, float & b)
{
    float tmp = a;
    a = b;
    b = tmp;
}
__global__ void bitonicSort(float *values, int N)
{
    extern __shared__ float shared[];
    const unsigned int tid = threadIdx.x;
    shared[tid] = values[tid];
    __syncthreads();   // make sure all values are loaded before sorting
    for (unsigned int k = 2; k <= N; k *= 2) {
        for (unsigned int j = k / 2; j > 0; j /= 2) {
            unsigned int ixj = tid ^ j;
            if (ixj > tid) {
                if ((tid & k) == 0) {
                    if (shared[tid] > shared[ixj]) {
                        swap(shared[tid], shared[ixj]);
                    }
                } else {
                    if (shared[tid] < shared[ixj]) {
                        swap(shared[tid], shared[ixj]);
                    }
                }
            }
            __syncthreads();   // synchronize between comparison stages
        }
    }
    values[tid] = shared[tid];
}

int main(void)
{
    cudaEvent_t start, stop;
    float time;
    float *dvalues;
    float *values;
    int NUM;
    float aa = 5.0;
    scanf("%d", &NUM);
    values = (float *)malloc((NUM+1)*sizeof(float));
    size_t size = NUM * sizeof(float);
    for (int i = 0; i < NUM; i++) {
        //values[i] = rand()%10 + 1;
        values[i] = ((float)rand()/(float)(RAND_MAX)) * aa;
    }
    /*printf("\n initial values: ");
    for (int i = 0; i < NUM; i++)
        printf("%.2f ", values[i]);*/
    cudaMalloc((void**)&dvalues, size);
    cudaMemcpy(dvalues, values, size, cudaMemcpyHostToDevice);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    // one block: NUM is limited by the maximum number of threads per block
    bitonicSort<<<1, NUM, size>>>(dvalues, NUM);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    cudaMemcpy(values, dvalues, size, cudaMemcpyDeviceToHost);
    cudaFree(dvalues);
    /*printf("\n sorted result: ");
    for (int i = 0; i < NUM; i++)
        printf("%.2f ", values[i]);*/
    printf("\nElapsed Time : %f ms\n", time);
    free(values);
    system("pause");
    return 0;
}

Binary Search

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
#include <assert.h>
__device__ int get_index_to_check(int thread, int num_threads, int set_size, int offset)
{
    return (((set_size + num_threads) / num_threads) * thread) + offset;
}

__global__ void p_ary_search(float search, int array_length, int *arr, int *ret_val)
{
    const int num_threads = blockDim.x * gridDim.x;
    const int thread = blockIdx.x * blockDim.x + threadIdx.x;
    int set_size = array_length;
    while (set_size != 0) {
        int offset = ret_val[1];
        int index_to_check = get_index_to_check(thread, num_threads, set_size, offset);
        if (index_to_check < array_length) {
            int next_index_to_check = get_index_to_check(thread + 1, num_threads, set_size, offset);
            if (next_index_to_check >= array_length) {
                next_index_to_check = array_length - 1;
            }
            if (search > arr[index_to_check] && (search < arr[next_index_to_check])) {
                ret_val[1] = index_to_check;
            }
            else if (search == arr[index_to_check]) {
                ret_val[0] = index_to_check;
            }
        }
        set_size = set_size / num_threads;
    }
}

float chop_position(float search, float *search_array, int array_length)
{
    float time;
    cudaEvent_t start, stop;
    int array_size = array_length * sizeof(int);
    if (array_size == 0) return -1;
    int *dev_arr;
    cudaMalloc((void**)&dev_arr, array_size);
    cudaMemcpy(dev_arr, search_array, array_size, cudaMemcpyHostToDevice);
    int *ret_val = (int*)malloc(sizeof(int) * 2);
    ret_val[0] = -1; // return value
    ret_val[1] = 0;  // offset
    array_length = array_length % 2 == 0 ? array_length : array_length - 1; // array size
    int *dev_ret_val;
    cudaMalloc((void**)&dev_ret_val, sizeof(int) * 2);
    cudaMemcpy(dev_ret_val, ret_val, sizeof(int) * 2, cudaMemcpyHostToDevice);
    // Launch kernel
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    p_ary_search<<<16, 64>>>(search, array_length, dev_arr, dev_ret_val);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    // Get results
    cudaMemcpy(ret_val, dev_ret_val, 2 * sizeof(int), cudaMemcpyDeviceToHost);
    int ret = ret_val[0];
    printf("\nFound %i\n", ret_val[1]);
    printf("\nElapsed Time : %f ms", time);
    // Free memory on device
    cudaFree(dev_arr);
    cudaFree(dev_ret_val);
    free(ret_val);
    return ret;
}

static float *build_array(int length)
{
    float *ret_val = (float*)malloc(length * sizeof(float));
    for (int i = 0; i < length; i++) {
        ret_val[i] = (i * 2 + 0.5) - 1;
        //ret_val[i] = i;
        printf("%.2f ", ret_val[i]);
    }
    return ret_val;
}

static void test_array(int length, float search, float index)
{
    printf("Length %i Search %.2f\n", length, search);
    assert(index == chop_position(search, build_array(length), length) && "test_small_array()");
}

static void test_arrays()
{
    int length;
    float search;
    scanf("%d", &length);
    scanf("%f", &search);
    test_array(length, search, -1);
}

int main()
{
    test_arrays();
    system("pause");
    return 0;
}

Matrix Multiplication

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
using namespace std;
#define BLOCK_SIZE 100

__global__ void gpuMM(float *A, float *B, float *C, int N)
{
    // each thread computes one element of the result matrix
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    int col = blockIdx.x*blockDim.x + threadIdx.x;
    float sum = 0.f;
    for (int n = 0; n < N; ++n)
        sum += A[row*N+n]*B[n*N+col];
    C[row*N+col] = sum;
}

int main(int argc, char *argv[])
{
    LARGE_INTEGER frequency;
    LARGE_INTEGER t1, t2;
    double elapsedTime;
    int N, K, L;
awal:
    scanf("%d", &L);
    if (L < 1000) {
        printf("Input must be greater than 1000\n");
        goto awal;
    }
    K = L/100;
    N = K*BLOCK_SIZE;
    float time;
    cudaEvent_t start, stop;
    float *hA, *hB, *hC;
    hA = new float[N*N];
    hB = new float[N*N];
    hC = new float[N*N];
    float aa = 5.0;
    for (int j = 0; j < N*N; j++) {
        hA[j] = ((float)rand()/(float)(RAND_MAX)) * aa;
        hB[j] = ((float)rand()/(float)(RAND_MAX)) * aa;
    }
    int size = N*N*sizeof(float); // Size of the memory in bytes
    float *dA, *dB, *dC;
    cudaMalloc(&dA, size);
    cudaMalloc(&dB, size);
    cudaMalloc(&dC, size);
    // NOTE: BLOCK_SIZE*BLOCK_SIZE threads per block exceeds the usual
    // 1024-thread limit of CUDA hardware
    dim3 threadBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 grid(K, K);
    // Copy matrices from the host to device
    cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, size, cudaMemcpyHostToDevice);
    // Execute the matrix multiplication kernel
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    gpuMM<<<grid, threadBlock>>>(dA, dB, dC, N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    float *C;
    C = new float[N*N];
    cudaMemcpy(C, dC, size, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    printf("%f ms\n", time);
    system("pause");
}

Gauss Jordan Elimination

main.cpp

#include <stdio.h>
#include <stdlib.h>
#include "Common.h"

int main(int argc, char **argv)
{
    float *a_h = NULL;
    float *b_h = NULL;
    float *result, sum, rvalue;
    int numvar, j;
    float aa = 5.0;
    numvar = 0;
    scanf("%d", &numvar);
    a_h = (float*)malloc(sizeof(float)*numvar*(numvar+1));
    b_h = (float*)malloc(sizeof(float)*numvar*(numvar+1));
    int ii = 0;
    for (int i = 1; i <= numvar; i++) {
        for (int i = 1; i <= numvar+1; i++) {
            //a_h[ii] = rand()%10 + 1;
            a_h[ii] = ((float)rand()/(float)(RAND_MAX)) * aa;
            ii++;
        }
    }
    // Calling device function to copy data to device
    DeviceFunc(a_h, numvar, b_h);
    // Showing the data
    printf("\n\n");
    /*for(int i = 0; i < numvar; i++)
    {
        for(int j = 0; j < numvar+1; j++)
        {
            printf("%.2f ", b_h[i*(numvar+1) + j]);
        }
        printf("\n");
    }*/
    // Using back substitution method
    result = (float*)malloc(sizeof(float)*(numvar));
    for (int i = 0; i < numvar; i++) {
        result[i] = 1.0;
    }
    for (int i = numvar-1; i >= 0; i--) {
        sum = 0.0;
        for (j = numvar-1; j > i; j--) {
            sum = sum + result[j]*b_h[i*(numvar+1) + j];
        }
        rvalue = b_h[i*(numvar+1) + numvar] - sum;
        result[i] = rvalue / b_h[i*(numvar+1) + j];
    }
    // Display the result
    /*for(int i = 0; i < numvar; i++)
        printf("%.2f\n", result[i]);*/
    system("pause");
    return 0;
}
DeviceFunc.cu

#include <cuda.h>
#include "Common.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

__global__ void Kernel(float *, float *, int);

void DeviceFunc(float *temp_h, int numvar, float *temp1_h)
{
    float time;
    float *a_d, *b_d;
    LARGE_INTEGER frequency;
    LARGE_INTEGER t1, t2;
    double elapsedTime;
    cudaEvent_t start, stop;
    // Memory allocation on the device
    cudaMalloc(&a_d, sizeof(float)*(numvar)*(numvar+1));
    cudaMalloc(&b_d, sizeof(float)*(numvar)*(numvar+1));
    // Copying data to device from host
    cudaMemcpy(a_d, temp_h, sizeof(float)*numvar*(numvar+1), cudaMemcpyHostToDevice);
    // Defining size of Thread Block
    dim3 dimBlock(numvar+1, numvar, 1);
    dim3 dimGrid(1, 1, 1);
    // Kernel call
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    Kernel<<<dimGrid, dimBlock>>>(a_d, b_d, numvar);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    // Copying data to host from device
    cudaMemcpy(temp1_h, b_d, sizeof(float)*numvar*(numvar+1), cudaMemcpyDeviceToHost);
    // Deallocating memory on the device
    cudaFree(a_d);
    cudaFree(b_d);
    printf("%f ms\n", time);
}

Kernel.cu

#include <cuda.h>
#include "Common.h"

__global__ void Kernel(float *a_d, float *b_d, int size)
{
    int idx = threadIdx.x;
    int idy = threadIdx.y;
    //int width = size;
    //int height = size;
    // Allocating memory in the shared memory of the device
    __shared__ float temp[16][16];
    // Copying the data to the shared memory
    temp[idy][idx] = a_d[(idy * (size+1)) + idx];
    for (int i = 1; i < size; i++) {
        if ((idy + i) < size) {
            float var1 = (-1)*(temp[i-1][i-1]/temp[i+idy][i-1]);
            temp[i+idy][idx] = temp[i-1][idx] + ((var1) * (temp[i+idy][idx]));
        }
    }
    b_d[idy*(size+1) + idx] = temp[idy][idx];
}

Common.h

#ifndef __Common_H
#define __Common_H

void getvalue(float **, int *);
void DeviceFunc(float *, int, float *);

#endif

Source Code Cluster Computing

Sorting

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#define DEBUG
#define ROOT 0
#define ISPOWER2(x) (!((x)&((x)-1)))

float *merge(float array1[], float array2[], int size)
{
    float *result = (float *)malloc(2*size*sizeof(float));
    int i = 0, j = 0, k = 0;
    while ((i < size) && (j < size))
        result[k++] = (array1[i] <= array2[j]) ? array1[i++] : array2[j++];
    while (i < size)
        result[k++] = array1[i++];
    while (j < size)
        result[k++] = array2[j++];
    return result;
}

int sorted(float array[], int size)
{
    int i;
    for (i = 1; i < size; i++)
        if (array[i-1] > array[i])
            return 0;
    return 1;
}

int compare(const void *p1, const void *p2)
{
    float a = *(const float *)p1, b = *(const float *)p2;
    return (a > b) - (a < b);
}

int main(int argc, char** argv)
{
    int i, b = 1, npes, myrank;
    long datasize;
    int localsize;
    float *localdata, *otherdata, *data = NULL;
    int active = 1;
    MPI_Status status;
    double start, finish, p, s;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    datasize = strtol(argv[1], NULL, 10);
    if (!ISPOWER2(npes)) {
        if (myrank == ROOT)
            printf("Processor number must be power of two.\n");
        return MPI_Finalize();
    }
    if (datasize%npes != 0) {
        if (myrank == ROOT)
            printf("Datasize must be divisible by processor number.\n");
        return MPI_Finalize();
    }
    if (myrank == ROOT) {
        data = (float *)malloc(datasize * sizeof(float));
        for (i = 0; i < datasize; i++)
            data[i] = rand()%99 + 1;
    }
    start = MPI_Wtime();
    localsize = datasize / npes;
    localdata = (float *)malloc(localsize * sizeof(float));
    MPI_Scatter(data, localsize, MPI_FLOAT, localdata, localsize, MPI_FLOAT, ROOT, MPI_COMM_WORLD);
    qsort(localdata, localsize, sizeof(float), compare);
    // merge tree: at each step, odd-numbered groups send their data down
    while (b < npes) {
        if (active) {
            if ((myrank/b)%2 == 1) {
                MPI_Send(localdata, b * localsize, MPI_FLOAT, myrank - b, 1, MPI_COMM_WORLD);
                free(localdata);
                active = 0;
            } else {
                otherdata = (float *)malloc(b * localsize * sizeof(float));
                MPI_Recv(otherdata, b * localsize, MPI_FLOAT, myrank + b, 1, MPI_COMM_WORLD, &status);
                localdata = merge(localdata, otherdata, b * localsize);
                free(otherdata);
            }
        }
        b <<= 1;
    }
    finish = MPI_Wtime();
    if (myrank == ROOT) {
#ifdef DEBUG
        if (sorted(localdata, npes*localsize)) {
            printf("\nParallel sorting succeed.\n\n");
        } else {
            printf("\nParallel sorting failed.\n\n");
        }
#endif
        free(localdata);
        p = finish - start;
        printf(" Parallel : %.8f\n", p);
        /*start = MPI_Wtime();
        qsort(data, datasize, sizeof(float), compare);
        finish = MPI_Wtime();*/
        free(data);
    }
    return MPI_Finalize();
}

Binary Search

#include "mpi.h"
#include <iostream>
#include <math.h>
using namespace std;

int main(int argc, char **argv)
{
    const int Master = 0;
    const int Tag_Size = 1;
    const int Tag_Data = 2;
    const int Tag_Max = 3;
    double max;
    double MaxInAll = 0;
    int MyId, P;
    double *A;
    int ArrSize;
    int n, Start;
    int i, x;
    int Source, dest, Tag;
    int WorkersDone = 0;
    double start, finish, p;
    MPI_Status RecvStatus;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &MyId);
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    start = MPI_Wtime();
    // start working..
    if (MyId == Master) {
        cout << "This is the master process on " << P << " processes.\n";
        cin >> ArrSize;
        A = new double[ArrSize];
        srand(P); /* initialize random seed: */
        for (i = 0; i < ArrSize; i++)
            A[i] = rand();
        // hand each worker an equal share of the array
        n = ArrSize/(P-1);
        for (i = 1; i < P; i++) {
            dest = i;
            Tag = Tag_Size;
            MPI_Send(&n, 1, MPI_INT, dest, Tag, MPI_COMM_WORLD);
            Tag = Tag_Data;
            Start = (i - 1) * (ArrSize/(P-1));
            MPI_Send(A+Start, n, MPI_DOUBLE, dest, Tag, MPI_COMM_WORLD);
        }
        WorkersDone = 0;
        int MaxIndex = 0;
        int GlobIndx;
        while (WorkersDone < P-1) {
            MPI_Recv(&x, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &RecvStatus);
            Source = RecvStatus.MPI_SOURCE;
            Tag = RecvStatus.MPI_TAG;
            if (Tag == Tag_Max) {
                GlobIndx = (Source - 1)*(ArrSize/(P-1)) + x;
                if (A[GlobIndx] > MaxInAll) {
                    MaxInAll = A[GlobIndx];
                    MaxIndex = GlobIndx;
                }
                WorkersDone++;
            }
        }
        if (WorkersDone == P-1)
            cout << "Process " << Source << " found the max of the array "
                 << MaxInAll << " at index " << MaxIndex;
        delete [] A;
    } else {
        max = 0;
        int max_i = 0;
        cout << "Process " << MyId << " is alive...\n";
        Source = Master;
        Tag = Tag_Size;
        MPI_Recv(&n, 1, MPI_INT, Source, Tag, MPI_COMM_WORLD, &RecvStatus);
        A = new double[n];
        Tag = Tag_Data;
        MPI_Recv(A, n, MPI_DOUBLE, Source, Tag, MPI_COMM_WORLD, &RecvStatus);
        cout << "Process " << MyId << " Received " << n << "\n";
        // scan the local block for its maximum
        i = 0;
        while (i < n) {
            if (A[i] > max) {
                max = A[i];
                max_i = i;
            }
            i++;
        }
        dest = Master;
        Tag = Tag_Max;
        cout << "Process " << MyId << " has max equals " << max << endl;
        MPI_Send(&max_i, 1, MPI_INT, dest, Tag, MPI_COMM_WORLD);
        delete [] A;
    }
    finish = MPI_Wtime();
    if (MyId == 0) {
        p = finish - start;
        printf(" Parallel : %.8f\n", p);
    }
    MPI_Finalize();
    return 0;
}
Matrix Multiplication

#include <stdio.h>
#include "mpi.h"

#define N 5000 /* number of rows and columns in matrix */

MPI_Status status;
double a[N][N], b[N][N], c[N][N];

int main(int argc, char **argv)
{
    double start, finish, p;
    int numtasks, taskid, numworkers, source, dest, rows, offset, i, j, k, remainPart, originalRows;
    //struct timeval start, stop;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    numworkers = numtasks-1;
    start = MPI_Wtime();
    if (taskid == 0) {
        /* initialize the input matrices (the original values were lost in
           extraction; simple constants are assumed here) */
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                a[i][j] = 1.0;
                b[i][j] = 2.0;
            }
        }
        //gettimeofday(&start, 0);
        /* send matrix data to the worker tasks */
        rows = N/numworkers;
        offset = 0;
        remainPart = N%numworkers;
        for (dest = 1; dest <= numworkers; dest++) {
            if (remainPart > 0) {
                originalRows = rows;
                ++rows;
                remainPart--;
                MPI_Send(&offset, 1, MPI_INT, dest, 1, MPI_COMM_WORLD);
                MPI_Send(&rows, 1, MPI_INT, dest, 1, MPI_COMM_WORLD);
                MPI_Send(&a[offset][0], rows*N, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD);
                MPI_Send(&b, N*N, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD);
                offset = offset + rows;
                rows = originalRows;
            } else {
                MPI_Send(&offset, 1, MPI_INT, dest, 1, MPI_COMM_WORLD);
                MPI_Send(&rows, 1, MPI_INT, dest, 1, MPI_COMM_WORLD);
                MPI_Send(&a[offset][0], rows*N, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD);
                MPI_Send(&b, N*N, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD);
                offset = offset + rows;
            }
        }
        /* wait for results from all worker tasks */
        for (i = 1; i <= numworkers; i++) {
            source = i;
            MPI_Recv(&offset, 1, MPI_INT, source, 2, MPI_COMM_WORLD, &status);
            MPI_Recv(&rows, 1, MPI_INT, source, 2, MPI_COMM_WORLD, &status);
            MPI_Recv(&c[offset][0], rows*N, MPI_DOUBLE, source, 2, MPI_COMM_WORLD, &status);
        }
    }
    if (taskid > 0) {
        source = 0;
        MPI_Recv(&offset, 1, MPI_INT, source, 1, MPI_COMM_WORLD, &status);
        MPI_Recv(&rows, 1, MPI_INT, source, 1, MPI_COMM_WORLD, &status);
        MPI_Recv(&a, rows*N, MPI_DOUBLE, source, 1, MPI_COMM_WORLD, &status);
        MPI_Recv(&b, N*N, MPI_DOUBLE, source, 1, MPI_COMM_WORLD, &status);
        /* Matrix multiplication */
        for (k = 0; k < N; k++) {
            for (i = 0; i < rows; i++) {
                c[i][k] = 0.0;
                for (j = 0; j < N; j++)
                    c[i][k] = c[i][k] + a[i][j] * b[j][k];
            }
        }
        /* send the computed rows back to the master (reconstructed to match
           the receives above) */
        MPI_Send(&offset, 1, MPI_INT, 0, 2, MPI_COMM_WORLD);
        MPI_Send(&rows, 1, MPI_INT, 0, 2, MPI_COMM_WORLD);
        MPI_Send(&c, rows*N, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
    }
    finish = MPI_Wtime();
    if (taskid == 0) {
        p = finish - start;
        printf(" Parallel : %.8f\n", p);
    }
    MPI_Finalize();
    return 0;
}
Gauss Jordan Elimination

#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include "mpi.h"

double serial_gaussian(double *A, double *b, double *y, int n)
{
    int i, j, k;
    double tstart = MPI_Wtime();
    for (k = 0; k < n; k++) {
        // normalize the pivot row
        for (j = k+1; j < n; j++)
            A[k*n+j] = A[k*n+j] / A[k*n+k];
        y[k] = b[k] / A[k*n+k];
        A[k*n+k] = 1.0;
        // eliminate the pivot column from the rows below
        for (i = k+1; i < n; i++) {
            for (j = k+1; j < n; j++)
                A[i*n+j] = A[i*n+j] - A[i*n+k]*A[k*n+j];
            b[i] = b[i] - A[i*n+k]*y[k];
            A[i*n+k] = 0.0;
        }
    }
    return tstart;
}

int main(int argc, char **argv)
{
    int i, j, row;
    int n;
    double *A, *b, *y, *a, *tmp, *final_y = NULL;
    double tstart;
    /* problem size and input values: this part of the listing was lost;
       a command-line size and random entries are assumed here */
    n = atoi(argv[1]);
    // space for matricies
    A = new double[n*n];
    b = new double[n];
    y = new double[n];
    for (i = 0; i < n; i++) {
        b[i] = rand() % 10 + 1;
        y[i] = 0.0;
        for (j = 0; j < n; j++) {
            A[i*n+j] = rand() % 10 + 1;
        }
    }
    MPI_Init(&argc, &argv);      // Initialize MPI
    MPI_Comm com = MPI_COMM_WORLD;
    int size, rank;
    MPI_Comm_size(com, &size);   // Get rank/size info
    MPI_Comm_rank(com, &rank);
    int manager = (rank == 0);
    if (size == 1)
        tstart = serial_gaussian(A, b, y, n);
    else {
        if ((n % size) != 0) {
            std::cout << "Unknowns must be multiple of processors." << std::endl;
            return -1;
        }
        int np = (int) n/size;
        a = new double[n*np];
        tmp = new double[n*np];
        if (manager) {
            tstart = MPI_Wtime();
            final_y = new double[n];
        }
        // each process receives np consecutive rows of A
        MPI_Scatter(A, n*np, MPI_DOUBLE, a, n*np, MPI_DOUBLE, 0, com);
        // catch up on the pivot rows already produced by lower ranks
        for (i = 0; i < (rank*np); i++) {
            MPI_Bcast(tmp, n, MPI_DOUBLE, i/np, com);
            MPI_Bcast(&(y[i]), 1, MPI_DOUBLE, i/np, com);
            for (row = 0; row < np; row++) {
                /* elimination step (reconstructed; the original lines
                   were lost) */
                for (j = i+1; j < n; j++)
                    a[row*n+j] = a[row*n+j] - a[row*n+i]*tmp[j];
                b[rank*np+row] = b[rank*np+row] - a[row*n+i]*y[i];
                a[row*n+i] = 0.0;
            }
        }
        // process the rows owned by this rank
        for (row = 0; row < np; row++) {
            for (j = rank*np+row+1; j < n; j++) {
                a[row*n+j] = a[row*n+j] / a[row*n+np*rank+row];
            }
            y[rank*np+row] = b[rank*np+row] / a[row*n+rank*np+row];
            a[row*n+rank*np+row] = 1;
            // share this pivot row with the other processes
            for (i = 0; i < n; i++)
                tmp[i] = a[row*n+i];
            MPI_Bcast(tmp, n, MPI_DOUBLE, rank, com);
            MPI_Bcast(&(y[rank*np+row]), 1, MPI_DOUBLE, rank, com);
            /* eliminate this pivot from the remaining local rows
               (reconstructed) */
            for (i = row+1; i < np; i++) {
                for (j = rank*np+row+1; j < n; j++)
                    a[i*n+j] = a[i*n+j] - a[i*n+rank*np+row]*tmp[j];
                b[rank*np+i] = b[rank*np+i] - a[i*n+rank*np+row]*y[rank*np+row];
                a[i*n+rank*np+row] = 0.0;
            }
        }
        // participate in the broadcasts of the higher-ranked pivot rows
        for (i = (rank+1)*np; i < n; i++) {
            MPI_Bcast(tmp, n, MPI_DOUBLE, i/np, com);
            MPI_Bcast(&(y[i]), 1, MPI_DOUBLE, i/np, com);
        }
        // collect the solution pieces on the manager
        MPI_Gather(&(y[rank*np]), np, MPI_DOUBLE, final_y, np, MPI_DOUBLE, 0, com);
    }
    if (manager) {
        double tstop = MPI_Wtime();
        printf(" Parallel : %.8f\n", tstop - tstart);
    }
    MPI_Finalize();
    return 0;
}