pthreads - Problem with the timings of a program that uses 1-8 threads on a server that has 4 dual-core CPUs
I am running a program on a server at my university that has 4 dual-core AMD Opteron(tm) 2210 processors and runs Linux, kernel version 2.6.27.25-78.2.56.fc9.x86_64. The program implements Conway's Game of Life and runs it using pthreads and OpenMP. I timed the parallel part of the program with the gettimeofday() function, using 1-8 threads, but the timings don't seem right. I get the biggest time with 1 thread (as expected), and the time gets smaller as I add threads, but the smallest time is with 4 threads.
Here is an example when I use a 1000x1000 array:

1 thread  ~ 9.62 sec
2 threads ~ 4.73 sec
3 threads ~ 3.64 sec
4 threads ~ 2.99 sec
5 threads ~ 4.19 sec
6 threads ~ 3.84 sec
7 threads ~ 3.34 sec
8 threads ~ 3.12 sec
The timings above are with pthreads. When I use OpenMP the timings are smaller, but they follow the same pattern.
I expected the time to keep decreasing from 1 to 8 threads because of the 4 dual-core CPUs; I thought that since there are 4 CPUs with 2 cores each, 8 threads could run at the same time. Does it have to do with the operating system the server runs?
I also tested the same programs on a server that has 7 dual-core AMD Opteron(tm) 8214 processors and runs Linux version 2.6.18-194.3.1.el5. There the timings are what I expected: they get smaller starting from 1 thread (the biggest) down to 8 threads (the smallest execution time).
The program implements the Game of Life correctly, both with pthreads and with OpenMP; I just can't figure out why I get timings like the example I posted. So, in conclusion, my questions are:
1) Does the number of threads that can run at the same time on a system depend on the cores of the CPUs? Does it depend only on the number of CPUs, even though each CPU has more than one core? Does it depend on all of the above and also on the operating system? (A small sketch of how to query this at run time follows the questions below.)

2) Does it have to do with the way I divide the 1000x1000 array among the number of threads? But if it did, wouldn't the OpenMP code give a different pattern of timings instead of the same one? (See also the pthreads partitioning sketch after the OpenMP code below.)

3) What could be the reason for such timings?
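As a side note for question 1, this is a small sketch (not part of my program, just an illustration) of how I could check how many processors the OS actually exposes at run time; sysconf(_SC_NPROCESSORS_ONLN) and omp_get_num_procs() are the standard calls for that:

#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main(void)
{
    long online = sysconf(_SC_NPROCESSORS_ONLN);   /* logical CPUs the OS can schedule on */
    printf("online processors (OS view): %ld\n", online);
    printf("processors seen by OpenMP  : %d\n", omp_get_num_procs());
    return 0;
}

(Compiled with gcc -fopenmp.)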
Excuse my English, I am from Europe... Thanks in advance.
EDIT: this is the code I use with OpenMP:
#include <stdio.h>
#include <sys/time.h>
#include <omp.h>

#define row (1000 + 2)
#define col (1000 + 2)

int num;              /* number of threads, read from stdin */
int (*temp)[col];
int (*a1)[col];       /* current generation */
int (*a2)[col];       /* next generation    */

int main()
{
    int i, j, l, sum;
    int array1[row][col], array2[row][col];
    struct timeval tim;
    double start, end;

    for (i = 0; i < row; i++)
        for (j = 0; j < col; j++) {
            array1[i][j] = 0;
            array2[i][j] = 0;
        }

    /* initial pattern */
    array1[3][16] = 1;
    array1[4][16] = 1;
    array1[5][15] = 1;
    array1[6][15] = 1;
    array1[6][16] = 1;
    array1[7][16] = 1;
    array1[5][14] = 1;
    array1[4][15] = 1;

    a1 = array1;
    a2 = array2;

    printf("\ngive number of threads:");
    scanf("%d", &num);

    gettimeofday(&tim, NULL);
    start = tim.tv_sec + (tim.tv_usec / 1000000.0);

    omp_set_num_threads(num);
    #pragma omp parallel private(l, i, j, sum)
    {
        printf("number of threads:%d\n", omp_get_num_threads());
        for (l = 0; l < 100; l++) {
            #pragma omp for
            for (i = 1; i < (row - 1); i++) {
                for (j = 1; j < (col - 1); j++) {
                    sum = a1[i-1][j-1] + a1[i-1][j] + a1[i-1][j+1]
                        + a1[i][j-1]                + a1[i][j+1]
                        + a1[i+1][j-1] + a1[i+1][j] + a1[i+1][j+1];
                    if ((a1[i][j] == 1) && (sum == 2 || sum == 3)) a2[i][j] = 1;
                    else if ((a1[i][j] == 1) && (sum < 2))  a2[i][j] = 0;
                    else if ((a1[i][j] == 1) && (sum > 3))  a2[i][j] = 0;
                    else if ((a1[i][j] == 0) && (sum == 3)) a2[i][j] = 1;
                    else if (a1[i][j] == 0)                 a2[i][j] = 0;
                }  /* end of iteration over j */
            }      /* end of iteration over i */
            #pragma omp barrier
            #pragma omp single
            {
                temp = a1;  a1 = a2;  a2 = temp;   /* swap generations */
            }
            #pragma omp barrier
        }  /* end of iteration over l */
    }      /* end of parallel region */

    gettimeofday(&tim, NULL);
    end = tim.tv_sec + (tim.tv_usec / 1000000.0);
    printf("\ntime elapsed:%.6lf\n", end - start);
    printf("all ok\n");
    return 0;
}
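For comparison with the OpenMP version above, here is a minimal pthreads sketch of the row-block partitioning mentioned in question 2. This is not my actual pthreads program, just an illustration of one way the inner rows of the grid could be split by hand among the threads; it creates and joins the workers on every generation for simplicity, whereas a tuned version would keep the threads alive and synchronize with a barrier.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define ROWS (1000 + 2)
#define COLS (1000 + 2)

static int grid_a[ROWS][COLS], grid_b[ROWS][COLS];
static int (*cur)[COLS] = grid_a;     /* current generation */
static int (*nxt)[COLS] = grid_b;     /* next generation    */
static int nthreads;                  /* number of worker threads */

/* One generation: thread `id` updates a contiguous band of inner rows. */
static void *worker(void *arg)
{
    int id    = (int)(long)arg;
    int inner = ROWS - 2;                          /* rows 1 .. ROWS-2 */
    int chunk = (inner + nthreads - 1) / nthreads; /* rows per thread  */
    int first = 1 + id * chunk;
    int last  = first + chunk;
    if (last > ROWS - 1) last = ROWS - 1;

    for (int i = first; i < last; i++)
        for (int j = 1; j < COLS - 1; j++) {
            int sum = cur[i-1][j-1] + cur[i-1][j] + cur[i-1][j+1]
                    + cur[i][j-1]                 + cur[i][j+1]
                    + cur[i+1][j-1] + cur[i+1][j] + cur[i+1][j+1];
            nxt[i][j] = (sum == 3) || (cur[i][j] == 1 && sum == 2);
        }
    return NULL;
}

int main(int argc, char **argv)
{
    nthreads = (argc > 1) ? atoi(argv[1]) : 4;
    pthread_t tid[nthreads];

    for (int gen = 0; gen < 100; gen++) {
        for (long t = 0; t < nthreads; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);
        for (int t = 0; t < nthreads; t++)
            pthread_join(tid[t], NULL);
        int (*tmp)[COLS] = cur;  cur = nxt;  nxt = tmp;   /* swap generations */
    }
    printf("done\n");
    return 0;
}

(Compiled with gcc -O2 -pthread.) The contiguous, roughly equal chunks used here are essentially what a static OpenMP schedule for the i loop gives, which is why I would expect the two versions to show the same pattern.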
These are the timings with the OpenMP code:
a) System with 7 dual-core CPUs:

1 thread  ~ 7.72 sec
2 threads ~ 4.53 sec
3 threads ~ 3.64 sec
4 threads ~ 2.24 sec
5 threads ~ 2.02 sec
6 threads ~ 1.78 sec
7 threads ~ 1.59 sec
8 threads ~ 1.44 sec
b) System with 4 dual-core CPUs:

1 thread  ~ 9.06 sec
2 threads ~ 4.86 sec
3 threads ~ 3.49 sec
4 threads ~ 2.61 sec
5 threads ~ 3.98 sec
6 threads ~ 3.53 sec
7 threads ~ 3.48 sec
8 threads ~ 3.32 sec
These are the timings I get.
One thing you have to remember is that you're doing this on a shared-memory architecture. The more loads and stores you try to do in parallel, the higher the chance that you'll hit contention on memory access, which is a relatively slow operation. In typical applications, in my experience, you don't benefit from more than about 6 cores. (This is anecdotal; I could go into a lot of detail, but I don't feel like typing it all out. Suffice it to say, take these numbers with a grain of salt.)

Try instead to minimize access to shared resources if possible, and see what that does to your performance. Otherwise, optimize what you've got, and remember this:

Throwing more cores at a problem does not mean it will go quicker. Like taxation, there's a curve: at some number of cores, adding more starts becoming a detriment to getting performance out of your program. Find the "sweet spot" and use it.
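To illustrate finding that sweet spot, here is a small self-contained sketch (my own example, with run_generations() as a hypothetical stand-in for your grid update) that times the same workload once per thread count, so the best setting can be read straight off the output:

#include <stdio.h>
#include <omp.h>

#define N 1000

/* Hypothetical stand-in for one run of the real Game of Life update;
 * replace the body with the actual kernel being measured. */
static void run_generations(int nthreads)
{
    static double grid[N][N];
    omp_set_num_threads(nthreads);
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            grid[i][j] = grid[i][j] * 0.5 + 1.0;   /* placeholder work */
}

int main(void)
{
    for (int t = 1; t <= 8; t++) {
        double start = omp_get_wtime();            /* OpenMP wall-clock timer */
        run_generations(t);
        double elapsed = omp_get_wtime() - start;
        printf("%d thread(s): %.3f s\n", t, elapsed);
    }
    return 0;
}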