1. CPU = latency-oriented cores. The design aims to reduce latency, including operation latency, memory access latency, branch latency, data latency and so on. The CPU suits the sequential parts of a program, where latency matters.
2. GPU = throughput-oriented cores. The design aims to increase throughput, including memory throughput, operation throughput and so on. The GPU suits the parallel parts of a program, where throughput wins.
3. Scalability means the same application runs efficiently on new generations of cores and on more of the same cores. The "Heterogeneous Parallel Programming" course supports scalability.
4. Portability means the same application runs efficiently on different types of cores and on systems with different organizations and interfaces.
5. A thread is a virtualized or abstracted Von-Neumann processor.
6. A CUDA kernel is executed by a grid (array) of threads.
– All threads in a grid run the same kernel code (SPMD).
– Each thread has indexes that it uses to compute memory addresses (to decide what data to work on) and to make control decisions. Multidimensional indexes simplify memory addressing when processing multidimensional data.
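As a minimal sketch of how a thread turns its indexes into a memory address and a control decision, consider vector addition. The names `vecAddKernel` and `launchVecAdd` and the block size of 256 are illustrative assumptions, not taken from the course material. The launcher assumes n > 0 and that all three pointers already refer to device memory.

```cuda
#include <cuda_runtime.h>

// Hypothetical example kernel: every thread in the grid runs this same code (SPMD).
__global__ void vecAddKernel(const float *a, const float *b, float *c, int n)
{
    // Combine the block index and the thread index into one global element index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Control decision: the grid may hold more threads than elements,
    // so threads past the end of the arrays do nothing.
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}

// Hypothetical launcher. d_a, d_b and d_c must point to device memory; n > 0.
void launchVecAdd(const float *d_a, const float *d_b, float *d_c, int n)
{
    const int threadsPerBlock = 256;                           // assumed block size
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
    vecAddKernel<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
}
```

Rounding the block count up guarantees that every element has a thread; the `if (i < n)` test then masks off the surplus threads in the last block.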
7. The thread array is divided into multiple blocks.
• Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
• Threads in different blocks do not interact
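The three cooperation mechanisms can be sketched in one small kernel in which each block sums its own slice of an array. The names `blockSumKernel` and `launchBlockSums` and the block size of 256 are assumptions made for illustration. Each block writes only to its own output slot, so blocks never interact.

```cuda
#include <cuda_runtime.h>

// Hypothetical example: one partial sum per block.
__global__ void blockSumKernel(const int *in, int *blockSums, int n)
{
    __shared__ int sum;              // one copy per block, visible to all its threads

    if (threadIdx.x == 0) {
        sum = 0;                     // one thread initializes the shared value
    }
    __syncthreads();                 // barrier: everyone sees sum == 0

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        atomicAdd(&sum, in[i]);      // atomic update avoids lost additions
    }
    __syncthreads();                 // barrier: all additions are done

    if (threadIdx.x == 0) {
        blockSums[blockIdx.x] = sum; // each block writes its own slot only
    }
}

// Hypothetical launcher; returns the number of blocks (and partial sums). n > 0.
int launchBlockSums(const int *d_in, int *d_blockSums, int n)
{
    const int threadsPerBlock = 256; // assumed block size
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    blockSumKernel<<<blocks, threadsPerBlock>>>(d_in, d_blockSums, n);
    return blocks;
}
```

Both barriers sit outside the `if (i < n)` test so that every thread in the block reaches them. A tree reduction would usually be faster than one atomic per element; the atomic version is used here only because it is short.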
8. A heterogeneous host+device application is a single C program that alternates between:
• Serial parts in host C code
• Parallel parts in device SPMD kernel C code
9. Device code can:
• R/W per-thread registers
• R/W global memory, which is shared by all threads
10. Host code can:
• Transfer data to/from per-grid global memory
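The division of labor in items 9 and 10 can be sketched as follows: the host allocates global memory and copies data in and out, while the kernel reads and writes that memory through per-thread registers. The names `addOneKernel` and `addOneOnDevice` are hypothetical, the block size of 256 is an assumption, and n is assumed to be positive.

```cuda
#include <cuda_runtime.h>

// Device code: the local variable i lives in a per-thread register,
// data[] lives in global memory.
__global__ void addOneKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1;
    }
}

// Host code (hypothetical helper): moves data to the device, runs the kernel,
// and moves the result back. Returns the first CUDA error encountered.
cudaError_t addOneOnDevice(int *h_data, int n)
{
    int *d_data = NULL;
    size_t bytes = (size_t)n * sizeof(int);

    cudaError_t err = cudaMalloc((void **)&d_data, bytes);   // allocate global memory
    if (err != cudaSuccess) {
        return err;
    }

    err = cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        const int threadsPerBlock = 256;                      // assumed block size
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        addOneKernel<<<blocks, threadsPerBlock>>>(d_data, n);
        err = cudaGetLastError();                             // catches launch errors
    }
    if (err == cudaSuccess) {
        // This copy waits for the kernel to finish before reading the result.
        err = cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    }

    cudaFree(d_data);
    return err;
}
```

A caller would pass an ordinary host array, for example `addOneOnDevice(values, 1000)`, and find every element incremented on return.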
11. Compiling procedure
12. There is no dependence between any of the thread blocks. Hardware is free to assign blocks to any processor at any time.
13. All threads in the same block are assigned to the same Streaming Multiprocessor (SM, where threads are executed). In the CUDA generation covered by the course, up to eight thread blocks can be assigned to each SM. This limit depends on the device generation (its compute capability) and on the resources each block uses, so it should not be hard-coded.
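Per-SM limits of this kind can be read at run time instead of being assumed. The sketch below uses `cudaGetDeviceProperties` from the CUDA runtime; the helper name `printSmLimits` is hypothetical, and only a handful of the available fields are printed.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: prints a few SM-related limits for one device.
cudaError_t printSmLimits(int device)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        return err;
    }

    printf("Device %d: %s (compute capability %d.%d)\n",
           device, prop.name, prop.major, prop.minor);
    printf("  SM count:              %d\n", prop.multiProcessorCount);
    printf("  max threads per SM:    %d\n", prop.maxThreadsPerMultiProcessor);
    printf("  max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("  warp size:             %d\n", prop.warpSize);
    return cudaSuccess;
}
```

Calling `printSmLimits(0)` reports the values for the first device in the system.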
14. Each block is executed as 32-thread warps. All threads in a warp execute the same instruction when the warp is selected. Warps are the scheduling units in an SM. This is an implementation decision, not part of the CUDA programming model.
15. Thread blocks are partitioned into warps and the partition is always the same. DO NOT RELY ON ANY ORDERING WITHIN OR BETWEEN WARPS.
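To make the partition concrete: for a one-dimensional block, threads with consecutive `threadIdx.x` values fall into the same warp, starting with thread 0 in warp 0. The sketch below records each thread's warp number and its position (lane) within the warp, using the built-in `warpSize` variable. The names `recordWarpLayout` and `launchRecordWarpLayout` are hypothetical, and a one-dimensional block is assumed.

```cuda
#include <cuda_runtime.h>

// Hypothetical example: each thread writes which warp of its block it belongs to
// and its lane inside that warp. Assumes a 1D block.
__global__ void recordWarpLayout(int *warpIdOut, int *laneOut)
{
    int t = threadIdx.x;
    int global = blockIdx.x * blockDim.x + t;

    warpIdOut[global] = t / warpSize;  // warp number within the block
    laneOut[global]   = t % warpSize;  // position within the warp
}

// Hypothetical launcher; both arrays need blocks * threadsPerBlock device elements.
void launchRecordWarpLayout(int *d_warpId, int *d_lane, int blocks, int threadsPerBlock)
{
    recordWarpLayout<<<blocks, threadsPerBlock>>>(d_warpId, d_lane);
}
```

The sketch only shows which threads share a warp. It says nothing about the order in which warps run, which remains unspecified.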
16. Common Programming Strategy:
a. Partition data into subsets or tiles that fit into shared memory.
b. Use one thread block to handle each tile by: loading the tile from global memory to shared memory using multiple threads, performing computations on the data in shared memory to reduce traffic to global memory, and writing the results from shared memory to global memory on completion.
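Tiled matrix multiplication is a common illustration of this strategy. The sketch below assumes square, row-major `width` x `width` matrices and a tile width of 16; the names `tiledMatMulKernel`, `launchTiledMatMul` and `TILE_WIDTH` are illustrative choices. In this variant each thread keeps its running result in a register and writes it straight to global memory at the end, which is a small departure from the "write results from shared memory" step above.

```cuda
#include <cuda_runtime.h>

#define TILE_WIDTH 16   // assumed tile size; two tiles must fit in shared memory

// Hypothetical example: C = A * B for square row-major matrices.
__global__ void tiledMatMulKernel(const float *A, const float *B, float *C, int width)
{
    __shared__ float tileA[TILE_WIDTH][TILE_WIDTH];
    __shared__ float tileB[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float acc = 0.0f;

    int numTiles = (width + TILE_WIDTH - 1) / TILE_WIDTH;
    for (int t = 0; t < numTiles; ++t) {
        int aCol = t * TILE_WIDTH + threadIdx.x;
        int bRow = t * TILE_WIDTH + threadIdx.y;

        // Each thread loads one element of each tile; out-of-range entries become 0.
        tileA[threadIdx.y][threadIdx.x] =
            (row < width && aCol < width) ? A[row * width + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] =
            (bRow < width && col < width) ? B[bRow * width + col] : 0.0f;
        __syncthreads();   // the whole tile is loaded before anyone reads it

        // Compute from shared memory: each loaded value is reused TILE_WIDTH times.
        for (int k = 0; k < TILE_WIDTH; ++k) {
            acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        }
        __syncthreads();   // everyone is done before the tile is overwritten
    }

    if (row < width && col < width) {
        C[row * width + col] = acc;
    }
}

// Hypothetical launcher; all pointers refer to device memory, width > 0.
void launchTiledMatMul(const float *d_A, const float *d_B, float *d_C, int width)
{
    int tiles = (width + TILE_WIDTH - 1) / TILE_WIDTH;
    dim3 block(TILE_WIDTH, TILE_WIDTH);
    dim3 grid(tiles, tiles);
    tiledMatMulKernel<<<grid, block>>>(d_A, d_B, d_C, width);
}
```

With a tile width of 16, each element of A and B is fetched from global memory once per tile and then read 16 times from shared memory.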
------------------------------------------
The following are notes on the CUDA programming guide:
1. Substantial performance improvements can be realized by taking care that the code seldom requires threads in a warp to diverge.
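A small sketch of what divergence looks like in code: the first kernel below branches on whether the thread index is even or odd, so both paths occur inside every warp; the second branches on the warp number, so all threads of a warp take the same path. The kernel names are hypothetical and a one-dimensional block is assumed. The two kernels deliberately apply the operations to different elements, since the point is the branch pattern and not the result. Whether the difference is measurable depends on the device and on how much work each branch does.

```cuda
#include <cuda_runtime.h>

// Hypothetical example: neighbouring threads take different branches,
// so every warp has to execute both paths.
__global__ void divergentBranchKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) {
        return;
    }
    if (threadIdx.x % 2 == 0) {
        data[i] *= 2.0f;
    } else {
        data[i] += 1.0f;
    }
}

// Hypothetical example: the condition is the same for all threads of a warp,
// so each warp executes only one of the two paths.
__global__ void warpUniformBranchKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) {
        return;
    }
    if ((threadIdx.x / warpSize) % 2 == 0) {
        data[i] *= 2.0f;
    } else {
        data[i] += 1.0f;
    }
}
```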
2.