Graphics Processing Unit
Zhenyu Ye, Henk Corporaal
5SIA0, TU/e, 2015
Background image: Nvidia Tesla P100 with Pascal architecture.

The GPU-CPU Gap (by NVIDIA)
ref: Tesla GPU Computing Brochure

The GPU-CPU Gap (data)
Historical data on 1403 Intel CPUs (left) and 566 NVIDIA GPUs (right).
ref: Figures 2.3 and 2.4 in the PhD thesis of Cedric Nugteren, TU/e, 2014, "Improving the Programmability of GPU Architectures", http://repository.tue.nl/771987

The GPU-CPU Gap (by Intel)
ref: "Debunking the 100X GPU vs. CPU myth", http://dx.doi.org/10.1145/1815961.1816021

In This Lecture, We Will Find Out
The architecture of GPUs:
o Single Instruction Multiple Thread (SIMT)
o Memory hierarchy
The programming model of GPUs:
o The threading hierarchy
o Memory spaces

GPU in Graphics Card
Image: Nvidia GTX 980

GPU in Mobile Processors
Image: Nvidia Jetson TX1 (Tegra X1 SoC)

GPU in High-Performance Computers
Image: Nvidia P100 (Pascal architecture), Chip-on-Wafer-on-Substrate packaging.

NVIDIA (Fermi Arch.)
ref: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

Transistor Count
Categories: multicore, manycore, SoC, GPU, FPGA.
ref: http://en.wikipedia.org/wiki/Transistor_count

Processor                    Transistors                       Tech. Node
61-Core Xeon Phi             5,000,000,000                     22nm
22-Core Xeon Broadwell-E5    7,200,000,000                     14nm
32-Core Sparc M7             10,000,000,000+                   20nm

GPU                          Transistors                       Tech. Node
Nvidia GP100 Pascal          15,300,000,000                    16nm

FPGA                         Transistors                       Tech. Node
Virtex UltraScale XCVU440    20,000,000,000+ (multi-die 2.5D)  20nm
Stratix 10 10GX5500          30,000,000,000+                   14nm

What Can 15bn Transistors Do?
Render triangles. Billions of triangles per second.
ref: "How GPUs Work", http://dx.doi.org/10.1109/MC.2007.59

The Graphics Pipeline
(several slides stepping through the pipeline stages)
ref: K. Fatahalian, et al. "GPUs: a Closer Look", ACM Queue 2008, http://doi.acm.org/10.1145/1365490.1365498

Graphics Pipeline on GPU
(several slides mapping the pipeline stages onto the GPU)

How Do GPUs Spend Their Die Area?
GPUs are designed to match the workload of 3D graphics.
Die photo of GeForce GTX 280 (source: NVIDIA). Texture: for fragment processing. ROP & Frame Buffer: for raster operations.
ref: J. Roca, et al. "Workload Characterization of 3D Games", IISWC 2006.
ref: T. Mitra, et al. "Dynamic 3D Graphics Workload Characterization and the Architectural Implications", Micro 1999.

GPUs Besides Graphics
TOP500 supercomputer list in Nov. 2016. (http://www.top500.org)

Let's Start with Examples
We will start from C and RISC.

Let's Start with C and RISC
int A[2][4];
for(i=0;i<2;i++){
  for(j=0;j<4;j++){
    A[i][j]++;
  }
}
Assembly code of the inner loop:
lw   r0, 4(r1)
addi r0, r0, 1
sw   r0, 4(r1)
Programmer's view of RISC.

Most CPUs Have Vector SIMD Units
Programmer's view of a vector SIMD unit, e.g. SSE.

Let's Program the Vector SIMD
Unroll the inner loop into a vector operation.
int A[2][4];
for(i=0;i<2;i++){
  for(j=0;j<4;j++){
    A[i][j]++;
  }
}
becomes:
int A[2][4];
for(i=0;i<2;i++){
  movups xmm0, [ &A[i][0] ]   // load
  addps  xmm0, xmm1           // add 1
  movups [ &A[i][0] ], xmm0   // store
}
movups = Move Unaligned Packed Single-Precision Floating-Point Values.
This looks like the previous example, but each SSE instruction executes on 4 ALUs.

How Do Vector Programs Run?
(the same unrolled code, stepped through on the SIMD datapath)
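The same unrolled loop can also be written with SSE intrinsics rather than inline assembly. A minimal sketch, assuming A holds single-precision floats (addps is a packed-float instruction) and that unaligned loads/stores are acceptable; the file and variable names are made up:

// sse_incr.c -- hedged sketch of the slides' SSE version of A[i][j]++,
// assuming A holds floats (addps operates on packed single precision).
#include <xmmintrin.h>
#include <stdio.h>

int main(void)
{
    float A[2][4] = {{0, 1, 2, 3}, {4, 5, 6, 7}};
    __m128 ones = _mm_set1_ps(1.0f);            // plays the role of xmm1: four copies of 1.0

    for (int i = 0; i < 2; i++) {
        __m128 row = _mm_loadu_ps(&A[i][0]);    // movups: unaligned packed load
        row = _mm_add_ps(row, ones);            // addps:  add 1 to all four lanes at once
        _mm_storeu_ps(&A[i][0], row);           // movups: unaligned packed store
    }

    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 4; j++)
            printf("%.0f ", A[i][j]);
        printf("\n");
    }
    return 0;
}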

CUDA Programmer's View of GPUs
A GPU contains multiple SIMD units. All of them can access global memory.

What Are the Differences? (SSE vs. GPU)
Let's start with two important differences:
1. GPUs use threads instead of vectors.
2. GPUs have a "shared memory" space.

Thread Hierarchy in CUDA
A grid contains thread blocks; a thread block contains threads.

Let's Start Again from C

Convert into CUDA:
int A[2][4];
for(i=0;i<2;i++)
  for(j=0;j<4;j++)
    A[i][j]++;

int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);  // define 2x4=8 threads; all threads run the same kernel
__device__ kernelF(A){        // each thread block has its id
  i = blockIdx.x;             // each thread has its id
  j = threadIdx.x;
  A[i][j]++;                  // each thread has different i and j
}
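The slide code is schematic: a kernel launched with <<<...>>> must be declared __global__, and A must be a device pointer. A minimal compilable sketch under those assumptions, with A[i][j] flattened to A[i * 4 + j]:

// incr.cu -- hedged sketch of the slides' 2x4 increment kernel.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void kernelF(int *A)            // __global__, not __device__, for a launched kernel
{
    int i = blockIdx.x;                    // each thread block has its id
    int j = threadIdx.x;                   // each thread has its id
    A[i * 4 + j]++;                        // each thread has different i and j
}

int main()
{
    int h[2][4] = {{0, 1, 2, 3}, {4, 5, 6, 7}};
    int *d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    kernelF<<<2, 4>>>(d);                  // 2 blocks x 4 threads = 8 threads, as on the slides
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (int i = 0; i < 2; i++) {
        for (int j = 0; j < 4; j++) printf("%d ", h[i][j]);
        printf("\n");
    }
    return 0;
}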

Thread Hierarchy
Example: thread 3 of block 1 operates on element A[1][3].
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);  // define 2x4=8 threads; all threads run the same kernel
__device__ kernelF(A){        // each thread block has its id
  i = blockIdx.x;             // each thread has its id
  j = threadIdx.x;
  A[i][j]++;                  // each thread has different i and j
}

How Are Threads Scheduled?

Blocks Are Dynamically Scheduled

How Are Threads Executed?
int A[2][4];
kernelF<<<(2,1),(4,1)>>>(A);
__device__ kernelF(A){
  i = blockIdx.x;
  j = threadIdx.x;
  A[i][j]++;
}

mov.u32 %r0, %ctaid.x            // r0 = i = blockIdx.x
mov.u32 %r1, %ntid.x             // r1 = "threads-per-block"
mov.u32 %r2, %tid.x              // r2 = j = threadIdx.x
mad.u32 %r3, %r0, %r1, %r2       // r3 = i * "threads-per-block" + j
ld.global.s32 %r4, [%r3]         // r4 = A[i][j]
add.s32 %r4, %r4, 1              // r4 = r4 + 1
st.global.s32 [%r3], %r4         // A[i][j] = r4
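In CUDA C terms, the mad.u32 line is simply the flattened index i * blockDim.x + j. A hedged line-by-line rendering of the PTX above (the same kernel as the earlier sketch; the handling of A's base address is simplified on the slide and left implicit here as well):

// flat_index.cu -- hedged C-level rendering of the PTX listing above.
#include <cuda_runtime.h>

__global__ void kernelF(int *A)
{
    int i   = blockIdx.x;       // mov.u32 %r0, %ctaid.x
    int n   = blockDim.x;       // mov.u32 %r1, %ntid.x   ("threads-per-block")
    int j   = threadIdx.x;      // mov.u32 %r2, %tid.x
    int idx = i * n + j;        // mad.u32 %r3, ...       (flattened A[i][j])
    A[idx]  = A[idx] + 1;       // ld.global.s32 / add.s32 / st.global.s32
}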

Utilizing Memory Hierarchy
Memory access latency: several cycles for on-chip memory, 100+ cycles for global memory.

Example: Average Filters
Average over a 3x3 window for a 16x16 array.
kernelF<<<(1,1),(16,16)>>>(A);
__device__ kernelF(A){
  i = threadIdx.y;
  j = threadIdx.x;
  tmp = ( A[i-1][j-1] + A[i-1][j] +
          ...
          + A[i+1][j+1] ) / 9;
  A[i][j] = tmp;
}
Each thread loads 9 elements from global memory. It takes hundreds of cycles.

Utilizing the Shared Memory
Average over a 3x3 window for a 16x16 array.
kernelF<<<(1,1),(16,16)>>>(A);
__device__ kernelF(A){
  __shared__ smem[16][16];    // allocate shared mem
  i = threadIdx.y;
  j = threadIdx.x;
  smem[i][j] = A[i][j];       // load to smem
  A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +
              ...
              + smem[i+1][j+1] ) / 9;
}
Each thread loads only one element from global memory.

However, the Program Is Incorrect
Hazards! The kernel above reads shared memory without synchronization.

Let's See What's Wrong
Assume 256 threads are scheduled on 8 PEs.
All threads execute the load into shared memory (smem[i][j] = A[i][j]), but some threads finish the load earlier than others. When such a thread moves on to the averaging, some elements in its 3x3 window have not yet been loaded by the other threads. Error!

How To Solve It?
Assume 256 threads are scheduled on 8 PEs. The loads into shared memory must be completed by all threads before any thread reads its neighbors' elements.

Use a "SYNC" Barrier
Assume 256 threads are scheduled on 8 PEs.
kernelF<<<(1,1),(16,16)>>>(A);
__device__ kernelF(A){
  __shared__ smem[16][16];
  i = threadIdx.y;
  j = threadIdx.x;
  smem[i][j] = A[i][j];       // load to smem
  __SYNC();                   // wait until all threads hit the barrier
  A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +
              ...
              + smem[i+1][j+1] ) / 9;
}
All elements in the window are loaded when each thread starts averaging.
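For reference, a minimal compilable sketch of the shared-memory average filter above, using CUDA's actual barrier __syncthreads() in place of the slides' __SYNC(); the border clamping is an added assumption, since the slide code leaves the handling of the array edges open:

// avg3x3.cu -- hedged sketch of the slides' shared-memory 3x3 average filter.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void kernelF(float *A)              // __global__ for a <<<>>>-launched kernel
{
    __shared__ float smem[16][16];
    int i = threadIdx.y;
    int j = threadIdx.x;
    smem[i][j] = A[i * 16 + j];                // each thread loads one element to shared memory
    __syncthreads();                           // barrier: wait until the whole tile is loaded

    float sum = 0.0f;
    for (int di = -1; di <= 1; di++)           // 3x3 window
        for (int dj = -1; dj <= 1; dj++) {
            int ii = min(max(i + di, 0), 15);  // clamp at the borders (assumption; slides ignore edges)
            int jj = min(max(j + dj, 0), 15);
            sum += smem[ii][jj];
        }
    A[i * 16 + j] = sum / 9.0f;
}

int main()
{
    float h[16 * 16];
    for (int k = 0; k < 256; k++) h[k] = (float)k;

    float *d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    kernelF<<<dim3(1, 1), dim3(16, 16)>>>(d);  // one block of 16x16 threads, as on the slides
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("A[1][3] = %f\n", h[1 * 16 + 3]);
    return 0;
}

Removing the __syncthreads() call reproduces exactly the read-after-write hazard described on the previous slides.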

Review What We Have Learned
1. Single Instruction Multiple Thread (SIMT)
2. Shared memory
Q: What are the pros and cons of explicitly managed memory?
Q: What are the fundamental differences between the SIMT and vector SIMD programming models?

Take the Same Example Again
Average over a 3x3 window for a 16x16 array.
Assume vector SIMD and SIMT both have shared memory. What are the differences?

Vector SIMD vs. SIMT

Vector SIMD version:
int A[16][16];                 // A in global memory
__shared__ int B[16][16];      // B in shared mem
for(i=0;i<16;i++){
  for(j=0;j<16;j+=4){
    movups xmm0, [ &A[i][j] ]       // (1) load to shared mem
    movups [ &B[i][j] ], xmm0
}}
for(i=0;i<16;i++){
  for(j=0;j<16;j+=4){
    addps xmm1, [ &B[i-1][j-1] ]    // (2) compute
    addps xmm1, [ &B[i-1][j] ]
    ...
    divps xmm1, 9
}}
for(i=0;i<16;i++){
  for(j=0;j<16;j+=4){
    addps [ &A[i][j] ], xmm1        // (3) store to global mem
}}

SIMT version:
kernelF<<<(1,1),(16,16)>>>(A);
__device__ kernelF(A){
  __shared__ smem[16][16];
  i = threadIdx.y;
  j = threadIdx.x;
  smem[i][j] = A[i][j];             // (1) load to smem
  __sync();                         // threads wait at barrier
  A[i][j] = ( smem[i-1][j-1] + smem[i-1][j] +   // (2) compute
              ...
              + smem[i+1][j+1] ) / 9;           // (3) store to global mem
}

(1) load to shared mem  (2) compute  (3) store to global mem

Vector SIMD vs. SIMT
(the same code as on the previous slide, annotated)
(a) HW vector width explicit to the programmer
(b) HW vector width transparent to the programmer
(c) each vector executed by all PEs in lock step
(d) threads executed out of order, need explicit sync

Review What We Have Learned
Algorithms contain data-level parallelism (DLP). GPU programmers convert DLP into thread-level parallelism (TLP). On GPU hardware, TLP is dynamically converted back into DLP.

Programmers convert data-level parallelism (DLP) into thread-level parallelism (TLP). In Nvidia GPUs, 32 threads are grouped into a warp, which is scheduled for execution together.

HW Groups Threads Into Warps
Example: 32 threads per warp. Warps are time-multiplexed onto the PEs.
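To make the grouping concrete, a small hypothetical kernel (not from the slides) that reports which warp and which lane each thread falls into; warpSize is a real CUDA built-in variable:

// warp_id.cu -- hedged sketch: which warp and lane does each thread fall into?
#include <cuda_runtime.h>
#include <cstdio>

__global__ void whoAmI()
{
    int tid  = threadIdx.y * blockDim.x + threadIdx.x;  // linear thread id within the block
    int warp = tid / warpSize;                          // warpSize is 32 on current NVIDIA GPUs
    int lane = tid % warpSize;                          // position within the warp
    if (lane == 0)                                      // one printout per warp
        printf("block %d, warp %d starts at thread %d\n", blockIdx.x, warp, tid);
}

int main()
{
    whoAmI<<<1, dim3(16, 16)>>>();   // 256 threads -> 8 warps of 32 threads
    cudaDeviceSynchronize();         // wait so the device printf output is flushed
    return 0;
}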

Execution of Threads and Warps
Let's start with an example.

Example of Implementation
Note: NVIDIA may use a more complicated implementation. See patent: US 8555035 B1.
This example assumes the two warp schedulers are decoupled. There are several components:
1. Warp scheduler, with 46 program counters, one per warp.
2. Register file, with 32 lanes, typically one read port and one write port per lane.
3. Operand collector, with 32 lanes, can have multiple read ports.
4. Processing elements, 16 PEs per 32 data lanes.
5. Result queue, with 32 lanes.

Example of Register Allocation
Assumption: the register file has 32 lanes, each warp has 32 threads, and each thread uses 8 registers.
Acronyms: T: thread number; R: register number.
Note: NVIDIA may use a more complicated allocation method. See patent: US 7634621 B1, Register file allocation.

Example
Address : Instruction
0x0004  : add r0, r1, r2
0x0008  : sub r3, r4, r5
Assume: two data paths; warp 0 on the left data path, warp 1 on the right data path.
Acronyms: AGU: address generation unit; r: register in a thread; w: warp number.

(each of the following steps executes the same two instructions: 0x0004: add r0, r1, r2 and 0x0008: sub r3, r4, r5)

Read Src Op 1
Read source operands: r1 for warp 0, r4 for warp 1.

Buffer Src Op 1
Push the operands to the operand collector: r1 for warp 0, r4 for warp 1.

Read Src Op 2
Read source operands: r2 for warp 0, r5 for warp 1.

Buffer Src Op 2
Push the operands to the operand collector: r2 for warp 0, r5 for warp 1.

Execute Stage 1
Other warps can now read into the operand collector. Compute the first 16 threads in the warp.

Execute Stage 2
Compute the last 16 threads in the warp.

Write Back
Write back: r0 for warp 0, r3 for warp 1.

Recap
What we have learned so far: pros and cons of massive threading; how threads are executed on GPUs.
Next: variations of the GPU multiprocessor cluster.

NVIDIA Fermi (2009)
ref: http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf

NVIDIA Kepler (2012)
ref: http://www.nvidia.com/object/nvidia-kepler.html

NVIDIA Maxwell (2014)
ref: http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF

Variations of Multiprocessor Cluster
From Fermi to Kepler: more cores per cluster.
From Kepler to Maxwell: partition the cluster into sub-clusters.
Question: What are the pros and cons of these variations?

What Are the Possible Hazards?
Three types of hazards we have learned about earlier: structural hazards, data hazards, and control hazards.

Structural Hazards
In practice, structural hazards need to be handled dynamically.

Data Hazards
In practice, we have a much longer pipeline.

Control Hazards
We cannot travel back in time. What should we do?

Additional Hazard: Memory Bank Conflicts
It is similar to a structural hazard. Example: 32 threads try to perform a load or store simultaneously to the 32-bank shared memory.

Shared Memory (compute capability 2.x)
Without bank conflict vs. with bank conflict: programmers take the responsibility to eliminate bank conflicts.

Global Memory Access Coalescing
Example: 32 threads access global memory (off-chip DRAM). A memory transaction loads/stores consecutive data at a 32/64/128-byte boundary. Non-aligned access will cause additional memory transactions. More examples in the backup slides.
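A hedged sketch of the access patterns just described; the kernels, array names, and strides are made up for illustration:

// access_patterns.cu -- hedged sketch of bank-conflict-free vs. conflicting
// shared-memory accesses and of coalesced vs. strided global-memory accesses.
#include <cuda_runtime.h>

__global__ void sharedAccess(float *out)      // launch with one warp: <<<1, 32>>>
{
    __shared__ float smem[32][32];
    int lane = threadIdx.x;

    smem[0][lane] = lane;                     // no bank conflict: consecutive 4-byte words
    float a = smem[0][lane];                  //   map to 32 different banks

    smem[lane][0] = lane;                     // bank conflict: all 32 threads hit the same
    float b = smem[lane][0];                  //   bank (their addresses are 128 bytes apart)

    out[lane] = a + b;
}

__global__ void globalAccess(const float *in, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float c = in[tid];                        // coalesced: a warp touches one contiguous segment
    float s = in[tid * 32];                   // strided: a warp scatters over many segments,
                                              //   causing many more memory transactions
                                              //   (assumes in[] is large enough for the stride)
    out[tid] = c + s;
}

A launch such as sharedAccess<<<1, 32>>>(d_out) with a single warp exercises both shared-memory patterns; profilers such as nvprof or Nsight Compute report the resulting bank conflicts and the number of global-memory transactions.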

Additional Hazard: Branch Divergence
GPUs (and SIMD machines in general) face an additional challenge: branch divergence.

Let's Start Again from a Simple CPU
Let's start from MIPS.

A Naive SIMD Processor
How to handle branch divergence within a vector? Most SIMD machines execute both paths of a branch, masking out part of the vector accordingly.

Handling Branch In GPU
Threads within a warp are free to branch.
if( $r17 > $r19 ){
  $r16 = $r20 + $r31
} else {
  $r16 = $r21 - $r32
}
$r18 = $r15 + $r16
The assembly code on the right is disassembled from the CUDA binary (cubin) using "decuda".

If No Branch Divergence in a Warp
If all threads within a warp take the same path, the other path will not be executed.

Branch Divergence within a Warp
If threads within a warp diverge, both paths have to be executed. Masks are set to filter out the threads not executing on the current path.
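As an illustration (hypothetical kernels, not from the slides): in the first kernel the condition splits lanes within each warp, so both paths are serialized under masks; in the second the condition is uniform per warp, so each warp executes only one path:

// divergence.cu -- hedged sketch of intra-warp divergence vs. warp-uniform branching,
// assuming a 1D launch whose block size is a multiple of 32.
#include <cuda_runtime.h>

__global__ void divergent(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0)                 // odd and even lanes of the SAME warp take different paths:
        out[tid] = tid * 2.0f;        // the warp runs this path with the odd lanes masked off...
    else
        out[tid] = tid - 1.0f;        // ...and then this path with the even lanes masked off
}

__global__ void uniform(float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((tid / 32) % 2 == 0)          // condition is identical for all 32 lanes of a warp:
        out[tid] = tid * 2.0f;        // even-numbered warps execute only this path
    else
        out[tid] = tid - 1.0f;        // odd-numbered warps execute only this path
}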

Dynamic Warp Formation
Example: merge two divergent warps into a new warp if possible. Create warp 3 from warps 0 and 1; create warp 4 from warps 0 and 1.
ref: Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware. http://dx.doi.org/10.1145/1543753.1543756

Recap SIMT Architecture
If there is no divergent branch, SIMT acts like SIMD with interleaved threading: a vector of 32 data lanes has a single program state. At a divergent branch, data lanes are assigned program states in a branch stack: SIMD becomes SIMT.

Classification of SIMT by Hennessy and Patterson:
      Static   Dynamic
ILP   VLIW     Superscalar
DLP   SIMD     SIMT

Elaborated Microarchitecture
The previous slides show a simplified architecture. Below is an elaborated example.
ref: http://www.gpgpu-sim.org/

Performance Modeling
What are the bottlenecks?

Roofline Model
Performance bottleneck: computation bound vs. bandwidth bound. Note the log-log scale. (GeForce 8800 GPU)
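The roofline bound itself is a one-line formula: attainable GFLOP/s = min(peak GFLOP/s, operational intensity x peak GB/s). A small sketch with placeholder peak numbers (not the GeForce 8800's actual specifications):

// roofline.c -- hedged sketch of the roofline bound:
// attainable GFLOP/s = min(peak GFLOP/s, operational intensity * peak GB/s).
#include <stdio.h>

double roofline(double peak_gflops, double peak_gbps, double flops_per_byte)
{
    double bandwidth_bound = flops_per_byte * peak_gbps;
    return bandwidth_bound < peak_gflops ? bandwidth_bound : peak_gflops;
}

int main(void)
{
    double peak_gflops = 500.0, peak_gbps = 80.0;   // placeholder peaks, not real device specs
    for (double oi = 0.25; oi <= 64.0; oi *= 2.0)   // operational intensity in FLOP/byte
        printf("OI %6.2f FLOP/B -> %7.1f GFLOP/s attainable\n",
               oi, roofline(peak_gflops, peak_gbps, oi));
    return 0;
}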

Optimization Is Key for Attainable GFLOP/s
Special Function Units (SFUs) execute transcendental instructions such as sine, cosine, reciprocal, and square root.

Computation, Bandwidth, Latency
Illustrating the three bottlenecks in the Roofline model.

Program Optimization
If I know the bottleneck, how do I optimize?
Optimization techniques for NVIDIA GPUs: "Best Practices Guide", http://docs.nvidia.com/cuda/

Recap: What We Have Learned
GPU architecture (SIMD vs. SIMT)

GPU programming (vector vs. thread)
Performance analysis (potential bottlenecks)

Further Reading
GPU architecture and programming:
o NVIDIA Tesla: A unified graphics and computing architecture, IEEE Micro 2008. (http://dx.doi.org/10.1109/MM.2008.31)
o Scalable Parallel Programming with CUDA, ACM Queue 2008. (http://dx.doi.org/10.1145/1365490.1365500)
o Understanding throughput-oriented architectures, Communications of the ACM 2010. (http://dx.doi.org/10.1145/1839676.1839694)

Recommended Reading (Optional)
Performance modeling:
o Roofline: an insightful visual performance model for multicore architectures, Communications of the ACM 2009. (http://dx.doi.org/10.1145/1498765.1498785)
Handling branch divergence in SIMT:
o Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware, ACM Transactions on Architecture and Code Optimization 2009. (http://dx.doi.org/10.1145/1543753.1543756)

Thank You
Questions?
Disclaimer: The opinions expressed in this presentation are solely those of the presenter, and do not express the views or opinions of his/her employer.

Backup Slides

Common Optimization Techniques
Optimizations on memory latency tolerance:
o Reduce register pressure
o Reduce shared memory pressure

Optimizations on memory bandwidth:
o Global memory coalescing
o Avoid shared memory bank conflicts
o Grouping byte accesses
o Avoid partition camping
Optimizations on computation efficiency:
o Mul/Add balancing
o Increase floating-point proportion
Optimizations on operational intensity:
o Use tiled algorithms (see the sketch below)
o Tuning thread granularity
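As an example of the "tiled algorithm" item above, a hedged sketch of a 16x16-tiled matrix multiply; the matrix size N, the tile width, and the kernel name are assumptions:

// tiled_mm.cu -- hedged sketch of a tiled algorithm: a matrix multiply that
// stages 16x16 tiles in shared memory to raise operational intensity.
#include <cuda_runtime.h>

#define TILE 16

__global__ void matmulTiled(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; t++) {              // assumes N is a multiple of TILE
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                              // tile fully loaded before use

        for (int k = 0; k < TILE; k++)                // each loaded element is reused TILE times
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                              // done with this tile before overwriting it
    }
    C[row * N + col] = acc;
}

A launch like matmulTiled<<<dim3(N/16, N/16), dim3(16, 16)>>>(dA, dB, dC, N) reuses each loaded element 16 times from shared memory, which raises the operational intensity compared with a naive kernel that reloads operands from global memory.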

Multiple Levels of Memory Hierarchy
Name      Cache?   Latency (cycles)       Read-only?
Global    L1/L2    200~400 (cache miss)   R/W
Shared    No       1~3                    R/W
Constant  Yes      1~3                    Read-only
Texture   Yes      ~100                   Read-only
Local     L1/L2    200~400 (cache miss)   R/W
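A small sketch of where variables land in these spaces; the qualifiers (__constant__, __shared__) are real CUDA, while the kernel itself is made up for illustration and assumes blockDim.x <= 256:

// memory_spaces.cu -- hedged sketch of the memory spaces in the table above.
#include <cuda_runtime.h>

__constant__ float coeff[16];              // constant memory: cached, read-only in kernels

__global__ void spaces(const float *g_in, float *g_out)   // g_in, g_out: global memory
{
    __shared__ float tile[256];            // shared memory: on-chip, shared per thread block
    float acc = 0.0f;                      // register (spills would go to local memory)

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = g_in[tid];         // stage the element in shared memory
    __syncthreads();

    for (int k = 0; k < 16; k++)
        acc += coeff[k] * tile[threadIdx.x];

    g_out[tid] = acc;
}

Host code would fill coeff with cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff)) before launching the kernel.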

Shared Mem Contains Multiple Banks

Compute Capability
Need architecture information to perform optimization.
ref: NVIDIA, "CUDA C Programming Guide"

Shared Memory (compute capability 2.x)
Without bank conflict vs. with bank conflict.

Global Memory In Off-Chip DRAM
The address space is interleaved among multiple channels.

Global Memory
(further coalescing examples, shown as figures on the remaining slides)
