The Need for High-Performance Computers
Parallel Processing (Kudang B. Seminar)
• Weather forecasting
• Aerodynamics
• Artificial intelligence: robotics
• Genetic engineering
These applications involve intensive computation and demand high processing power.
Example 1: Weather Prediction
Performance: Weather Prediction
• Area, segments
– 3000 * 3000 * 11 cubic miles
– 0.1 * 0.1 * 0.1 cubic-mile segments: ~10^11 segments
• Two-day prediction
– half-hour periods: ~100 periods
• Computation per segment
– Temp, pressure, humidity, wind speed, wind direction
– Assume ~100 FLOPs
• Computational requirement: 10^15 FLOPs
• Serial supercomputer: 10^9 instr/sec
• Total serial time: 10^6 sec = ~280 hours
• Not too good for a 48-hour weather prediction
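As a sanity check, here is a minimal C sketch that reproduces this arithmetic; every constant is one of the slide's assumptions, not a measured value:

#include <stdio.h>

int main(void) {
    /* Slide assumptions: 3000 x 3000 x 11 cubic-mile region, 0.1-mile
       cells, 100 half-hour periods, 100 FLOPs per segment per period. */
    double volume    = 3000.0 * 3000.0 * 11.0;  /* cubic miles            */
    double cell      = 0.1 * 0.1 * 0.1;         /* cubic miles per segment */
    double segments  = volume / cell;           /* ~1e11 segments          */
    double periods   = 100.0;
    double flops_seg = 100.0;

    double total = segments * periods * flops_seg;  /* ~1e15 FLOPs   */
    double rate  = 1e9;                             /* serial machine */
    double secs  = total / rate;                    /* ~1e6 seconds   */

    printf("segments:  %.2e\n", segments);
    printf("total ops: %.2e\n", total);
    printf("serial:    %.2e s = %.0f hours\n", secs, secs / 3600.0);
    return 0;
}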
Parallel Weather Prediction
• 1K workstations, grid connected
– 10^8 segment computations per processor
– 10^8 instructions per second
– 100 instructions per segment computation
– 100 time steps: 10^4 seconds = ~3 hours
• Much more acceptable
– Assumption: communication is not a problem here
• More workstations:
– finer grid
– better accuracy

Example 2: N-body problem
• Astronomy: bodies in space
– They attract each other: gravitational force (Newton's law)
– O(n*n) calculations per "snapshot" (sketched below)
• Galaxy: ~10^11 bodies -> ~10^22 calculations
• One calculation: 1 micro sec
• One snapshot: 10^16 secs = ~10^11 days = ~3*10^8 years
– Is parallelism going to help us? NO
– What does help? A better algorithm: Barnes-Hut
• Divides the space into a "quad tree"
• Treats "far away" quads as one body
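To make the O(n*n) cost concrete, here is a minimal brute-force C sketch of one gravity snapshot; the Body struct, the softening term EPS, and the Earth/Moon test masses are illustrative choices, not from the slide (compile with -lm):

#include <math.h>
#include <stdio.h>

#define G   6.674e-11  /* gravitational constant             */
#define EPS 1e-9       /* softening, avoids division by zero */

typedef struct { double x, y, m, fx, fy; } Body;

/* One brute-force "snapshot": every pair of bodies interacts, so n
   bodies cost O(n*n) force evaluations. Barnes-Hut cuts this to
   O(n log n) by treating far-away quad-tree cells as single bodies. */
void snapshot(Body *b, int n) {
    for (int i = 0; i < n; i++) {
        b[i].fx = b[i].fy = 0.0;
        for (int j = 0; j < n; j++) {
            if (i == j) continue;
            double dx = b[j].x - b[i].x;
            double dy = b[j].y - b[i].y;
            double r  = sqrt(dx * dx + dy * dy) + EPS;
            double f  = G * b[i].m * b[j].m / (r * r);  /* Newton's law */
            b[i].fx += f * dx / r;
            b[i].fy += f * dy / r;
        }
    }
}

int main(void) {
    /* Illustrative two-body test: roughly Earth and Moon. */
    Body b[2] = { {0, 0, 5.97e24, 0, 0}, {3.84e8, 0, 7.35e22, 0, 0} };
    snapshot(b, 2);
    printf("force on body 0: (%.3e, %.3e) N\n", b[0].fx, b[0].fy);
    return 0;
}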
Other Challenging Applications
• Satellite data acquisition: billions of bits / sec
• Satellite data processing
– Pollution levels, remote sensing of materials
– Image recognition
• Discrete optimization problems
– Planning, scheduling, VLSI design
• Material modeling
• Nuclear weapons modeling (ASCI)
• Airplane/satellite/vehicle design

Application Specific Architectures
• Mapping an algorithm directly onto hardware
– ASICs: Application Specific Integrated Circuits
– Levels of 'specificity'
• Full custom ASICs
• Standard cell ASICs
• Field programmable gate arrays
– Computational models
• Dataflow graphs
• Systolic arrays
– Orders of magnitude better performance
– Orders of magnitude lower power
ASICs cont'
• How much faster than general purpose?
– Example: 1D 1024-point FFT
• General purpose machine (G4): 25 micro secs
• ASIC device (MIT Lincoln Labs): 32 nano secs
• ASIC device uses 20 milliwatts (100 * less power)
• Future designs:
– 2 tera-ops in a small (< cubic ft) device
– Target applications
• FFT
• Finite Impulse Response (FIR) filters
• Matrix multiply
• QR decomposition

Motivation of Parallel Computing
• Parallel computing is cost effective
– Off-the-shelf, commodity processors are very fast
– Memory is very cheap
– Building a processor that is a small factor faster costs an order of magnitude more
– NoW is the time!
• Cheapest way to get more performance: multiprocessor
• NoW: Networks of Workstations
• A workstation can be an SMP
• SMP: Symmetric Multi-Processor

A Concrete Example
• A 24-hour weather forecast for the UK involves about 10^12 operations. This takes roughly 2.7 hours on a Cray-1 (capable of 10^8 operations per second). How many operations would weekly, monthly, or yearly forecasts need?
• According to Einstein, the speed of light is 3 x 10^8 m/s. Take two electronic devices, each able to perform 10^12 operations/second, separated by 0.5 mm. A signal then needs more time to travel between the two devices than either device needs to execute one operation (10^-12 sec). The limiting factor is the speed of light.
SOLUTION: exploit parallelism
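A minimal C sketch of that light-speed arithmetic, with the slide's figures hard-coded:

#include <stdio.h>

int main(void) {
    double c        = 3e8;     /* speed of light in m/s (slide's figure) */
    double distance = 0.5e-3;  /* 0.5 mm between the two devices, in m   */
    double op_rate  = 1e12;    /* operations per second per device       */

    double travel = distance / c;   /* signal flight time: ~1.67e-12 s */
    double op     = 1.0 / op_rate;  /* one operation:       1.00e-12 s */

    printf("signal travel: %.2e s\n", travel);
    printf("one operation: %.2e s\n", op);
    printf("light speed is the bottleneck: %s\n", travel > op ? "yes" : "no");
    return 0;
}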
Wile E. Coyote's Parallel Computer
• Get a lot of the fastest processors
• Get a lot of memory per processor
• Get the fastest network
• Hook it all together
– Shared memory
– Bus
• And then what ???
Now you need to program it!
• Parallel programming introduces:
– Task partitioning, task scheduling
– Data partitioning
– Synchronization
– Load balancing
– Latency issues
• hiding
• tolerance
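A minimal Pthreads sketch grounding several of these terms (the names worker and NTHREADS and the all-ones data are illustrative): the array is split across threads (data partitioning) and a mutex guards the shared total (synchronization); link with -lpthread:

#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4

static double data[N];
static double total = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread sums its own slice (data partitioning), then takes the
   lock to add into the shared total (synchronization). */
static void *worker(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
    double partial = 0.0;
    for (long i = lo; i < hi; i++)
        partial += data[i];
    pthread_mutex_lock(&lock);
    total += partial;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long i = 0; i < N; i++) data[i] = 1.0;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("total = %.0f\n", total);   /* expect 1000000 */
    return 0;
}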
Problem with Wile E. Coyote Architecture
• Von Neumann machines were not built for //ism
• To get high speed, processors have lots of state
– Cache, stack, global memory
• To tolerate latency, we need a fast context switch. WHY?
– While one thread waits on memory another must run, which only pays off if saving and restoring all that state is cheap
• No free lunch: can't have both
– Certainly not if the processor was not designed for both
• Memory wall: memory gets slower and slower
– in terms of the number of cycles it takes to access
• Memory hierarchy gets more and more complex
• Memory accesses block
– No split-phase memory access
Sequential vs Parallel Algorithms
• Efficient sequential algorithms
– Minimize time, space
– Efficiency is portable
• An efficient C program on a Pentium is ~ an efficient C program on an Alpha
• Efficient parallel algorithms
– Maximize parallelism
– Minimize synchronization, remote accesses
– Efficiency is architecture dependent

Speedup
• Ideal: n processors, n-fold speedup (the standard definitions are spelled out below)
• The ideal is not always possible. WHY?
– Tasks are data dependent
– Not all processors are always busy
– Remote data
• Superlinear speedup: > n speedup
– Nonsense! Because we could execute the faster parallel program sequentially
– No nonsense!! Because parallel computers do not just have more processors, they also have more caches
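For reference, the standard definitions behind this slide (assumed here; the slide does not state them): with T_1 the best sequential time and T_p the time on p processors,

S(p) = T_1 / T_p   (speedup),        E(p) = S(p) / p   (efficiency)

Ideal speedup means S(p) = p, so E(p) = 1. The cache argument above is what occasionally pushes S(p) past p: once the problem is partitioned, each processor's share may fit in its cache, making every memory access cheaper than in the sequential run.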
Parallel Programming
• Parallel programming paradigms
– Multithreading
• Pthreads, Solaris threads, not much difference
– Message passing
• MPI rules, ..well, there is PVM (Parallel Virtual Machine)
– Data parallel programming
• Niche work, but important
(a minimal MPI sketch follows below)
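As a taste of the message-passing paradigm, here is a minimal MPI sketch (the payload value 42 is arbitrary): rank 0 sends an integer to rank 1; run with, e.g., mpirun -np 2:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                       /* arbitrary payload */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}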
Implicit vs Explicit //ism
• Implicit: super compilers
– Extract parallelism from a sequential program
– The general case is too hard
• pointers, aliases, recursion, separate compilation
• dynamic dependence distances in array references
– Super compilers
• 20 years of parallelizing compilers and what do we get? ..not much: we understand loops (a bit)
• Explicit parallelism: threads or messages
– Complicates programming
• creation, allocation, scheduling of processes
• data partitioning
• synchronization (locks, messages)

Sequential & Parallel Processing
[figure: a parallel execution running 3 x faster than the sequential one]
Classification of Parallel Machines: Models of Computation (Flynn, 1966)
1. Single Instruction stream, Single Data stream: SISD
2. Multiple Instruction stream, Single Data stream: MISD
3. Single Instruction stream, Multiple Data stream: SIMD
4. Multiple Instruction stream, Multiple Data stream: MIMD
5. Single Program, Multiple Data: SPMD
SISD Computers
von Neumann Architecture Computer
For an operation a1 + a2 + a3 + ... + an, the processor makes n memory accesses and performs n-1 additions, so the time complexity of the operation is O(n). (Sketched below.)
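A minimal C sketch of that sequential summation (the five-element array is illustrative):

#include <stdio.h>

int main(void) {
    double a[] = {1, 2, 3, 4, 5};
    int n = 5;

    double sum = a[0];            /* first of the n memory accesses */
    for (int i = 1; i < n; i++)
        sum += a[i];              /* n-1 additions in total: O(n)   */

    printf("sum = %.0f\n", sum);  /* 15 */
    return 0;
}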
MISD Computers
N processors, each with its own control unit, share a common memory. Parallelism is obtained by assigning the processors different operations/tasks to carry out simultaneously on the same data.

SIMD Computers
N processors operate under the control of a single instruction stream issued by a central control unit. The processors operate synchronously, and a global clock is used to ensure lockstep operation.
MIMD Computers
The Potential of the Four Classes of Computers

SPMD Computers
The same program is executed on the processors of a MIMD computer. SPMD is not a hardware paradigm; it is the software equivalent of SIMD, but asynchronous.
Consider the instruction IF X = 0 THEN S1 ELSE S2. Assume X = 0 on processor P1 and X != 0 on processor P2. Processor P1 then executes S1 in parallel with processor P2 executing S2 (this cannot happen on SIMD). A sketch follows below.
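A minimal SPMD sketch of that IF example using MPI (mapping the slide's P1 and P2 onto ranks is an illustrative choice): every processor runs the same program, yet different processors take different branches at the same time, which lockstep SIMD cannot do:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int X = rank;  /* illustrative: X = 0 on rank 0, nonzero elsewhere */

    if (X == 0)
        printf("processor %d executes S1\n", rank);  /* the slide's P1 */
    else
        printf("processor %d executes S2\n", rank);  /* the slide's P2 */

    MPI_Finalize();
    return 0;
}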