ECE 462/562 Computer Architecture and Design T-Th 12:30-1:45

T-Th 12:30-1:45 in HARV210
www.ece.arizona.edu/~ece462

Instructor
• Name: Ali Akoglu (ece.arizona.edu/~akoglu)
• Office: ECE 356-B
• Phone: (520) 626-5149
• Email: [email protected]
• Office Hours: Tuesdays 11:00 AM - 12:00 PM, Thursdays 11:00 AM - 12:00 PM, or by appointment

Computer Architecture: Abstraction Layers
• Application
• Algorithm
• Programming Language
• Operating System/Virtual Machines
• Instruction Set Architecture (ISA)
• Gates/Register-Transfer Level (RTL)
• Circuits
• Devices
• Physics

Computer Architecture is Design and Analysis
• Architecture is an iterative process:
  o Searching the space of possible designs
  o At all levels of computer systems
• [Figure: Design and Analysis cycle, with labels Creativity, Cost / Performance Analysis, Good Ideas, Mediocre Ideas, Bad Ideas]

Computer Architecture
• Applications suggest how to improve technology, provide revenue to fund development

• [Figure: Applications and Technology, linked by Compatibility]
• Improved technologies make new applications possible
• Cost of software development makes compatibility a major force in market

Trends: The End of the Uniprocessor Era
• [Figure: annotation text not recoverable]
• Intel cancelled high performance uniprocessor, joined IBM and Sun for multiple processors

Crossroads: Conventional Wisdom

• Old Conventional Wisdom (CW): Power is free, transistors expensive
• New CW: "Power wall": power expensive, transistors free (can put more on chip than can afford to turn on)
• Old CW: Sufficiently increasing Instruction Level Parallelism (ILP) via compilers, innovation (out-of-order, speculation, VLIW, ...)
• New CW: "ILP wall": law of diminishing returns on more hardware for ILP
• Old CW: Multiplies are slow, memory access is fast
• New CW: "Memory wall": memory slow, multiplies fast (200 clock cycles to DRAM memory, 4 clocks for multiply)
• Old CW: Uniprocessor performance 2X / 1.5 yrs
• New CW: Power Wall + ILP Wall + Memory Wall = Brick Wall
  o Uniprocessor performance now 2X / 5(?) yrs
  o Sea change in chip design: multiple cores (2X processors per chip / ~2 years)
  o More, simpler processors are more power efficient

Instruction Set Architecture: Critical Interface
• [Figure: software / instruction set / hardware]

• Properties of a good abstraction
  o Lasts through many generations (portability)
  o Used in many different ways (generality)
  o Provides convenient functionality to higher levels
  o Permits an efficient implementation at lower levels

ISA vs. Computer Architecture
• Old definition of computer architecture = instruction set design
  o Other aspects of computer design called implementation
• Our view is computer architecture >> ISA
• Architect's job is much more than instruction set design; technical hurdles today are more challenging than those in instruction set design
• What really matters is the functioning of the complete system: hardware, runtime system, compiler, operating system, and application
• Computer architecture is not just about transistors, individual instructions, or particular implementations

Course Focus
Understanding the design techniques, machine structures, technology factors, evaluation methods that will determine the form of computers in 21st Century
• [Figure: Computer Architecture (Organization, Hardware/Software Boundary) at the center, surrounded by Technology, Applications, Parallelism, Operating Systems, Measurement & Evaluation, Programming Languages, Interface Design (ISA), Compilers, History]

Related Courses
• ECE369 (strong prerequisite): Basic computer organization, first look at pipelines + caches
• ECE 462/562: Computer Architecture, first look at parallel architectures
• ECE568: Parallel Processing
• ECE569: High Performance Computing, advanced topics
• ECE 474/574: Computer Aided Logic Design, FPGAs
• ECE576: Computer Based Systems

Introduction
• Text for ECE462/562: Hennessy and Patterson's Computer Architecture: A Quantitative Approach, 5th Edition

Topics
1. Simple machine design (ISAs, microprogramming, unpipelined machines, Iron Law, simple pipelines)
2. Memory hierarchy (DRAM, caches, optimizations) plus virtual memory systems, exceptions, interrupts
3. Complex pipelining (scoreboarding, out-of-order issue)
4. Explicitly parallel processors (vector machines, VLIW machines, multithreaded machines)
5. Multiprocessor architectures (memory models, cache coherence, synchronization)

Your ECE462/562
• How would you like your ECE462/562?
• Mix of lecture vs. discussion
  o Depends on how well reading is done before class
• Goal is to learn how to do good systems research
  o Learn a lot from looking at good work in the past
  o At commit point, you may choose to pursue your own new idea instead.

Coping with ECE462/562
• Undergrads must have taken ECE274 and ECE369
• Grad students with too varied background:
  o Review Appendix A, B, C: review of ISA, Datapath, Pipelining and Memory Hierarchy

Policies
• Background: ECE369 or equivalent, based on Patterson and Hennessy's Computer Organization and Design
• Prerequisite: ECE274 & ECE369 & Programming in C
• 3 to 4 assignments, 2 exams, final project
• Grad students: extra exam questions, survey paper and presentation
• NO LATE ASSIGNMENTS
• Make-ups may be arranged prior to the scheduled activity.
• Inquiries about graded material => within 3 days of receiving a grade.
• You are encouraged to discuss the assignment specifications with your instructor and your fellow students. However, anything you submit for grading must be unique and should NOT be a duplicate of another source.
• Read before the class
• Participate and ask questions
• Manage your time
• Start working on assignments early

Grading

Distribution of Components
Component                             Percentage
Assignments + Quiz + Participation    35
Exam-I                                15
Exam-II                               15
Project                               35
Total                                 100

Grades Scale
Percentage    Grade
90-100%       A
80-89%        B
70-79%        C
60-69%        D
Below 60%     E

Introduction
• Assignments and Project
  o Pairs only
  o Who is my partner? (email by 09/06)
• Assignment-0 due 08/28
• Announcements on the web

Research Paper Reading
• As graduate students, you are now researchers
• Most information of importance to you will be in research papers
• Ability to rapidly scan and understand research papers is key to your success

Project (Undergrad vs Grad)

• Transition from undergrad to grad student
• ECE wants you to succeed, but you need to show initiative
  o pick topic (more on this later)
  o meet 3 times with faculty to see progress
  o give oral presentation (grad students only)
  o written report like conference paper
• 3 weeks work full time for 2 people
• Opportunity to do "research in the small" to help make transition from good student to research colleague

Project (Undergrad vs Grad)

• Recreate results from research paper to see
  o If they are reproducible
  o If they still hold
  o Papers from ISCA, HPCA, MICRO, IPDPS, ISC
• Performance evaluation of an architecture
  o Using industry sponsored tools
  o GEM5: gem5.org
  o Pin: pintool.org
  o SimpleScalar: simplescalar.com
• A complete end-to-end processor (UGs!!)
  o Take advantage of FPGAs!!
• Propose your own research project that is related to computer architecture

Measuring Performance
• Topics: (Chapter 1)
  o Technology trends
  o Performance equations

Technology Trends and This Book
• [Figure: timeline with marks at 1996 (When I took this class!), 2002, 2009, 2011, annotated with:]
  o Reduced ILP to 1 chapter! Shift to multicore!
  o Reduced emphasis on ILP; introduce thread level parallelism
  o Request, Data, Thread, Instruction Level parallelism
  o Introduce: GPU, cloud computing, smart phones, tablets!

Problems

• Algorithms, Programming Languages, Compilers, Operating Systems, Architectures, Libraries, ... not ready to supply Thread Level Parallelism or Data Level Parallelism for 1000 CPUs / chip
• Architectures not ready for 1000 CPUs / chip
  o Unlike Instruction Level Parallelism, cannot be solved by computer architects and compiler writers alone, but also cannot be solved without participation of computer architects
• 5th Edition of Computer Architecture: A Quantitative Approach explores shift from Instruction Level Parallelism to Thread Level Parallelism / Data Level Parallelism

Classes of Parallelism

• In Applications
  o Data Level Parallelism (DLP): data items that can be operated on concurrently
  o Task-level Parallelism (TLP): tasks of work that can operate independently
• In Hardware
  o ILP: exploits DLP with compiler, pipelining, speculative execution
  o Vector Architectures and GPUs: exploit DLP by applying a single instruction to a collection of data
  o Thread-level parallelism: exploits DLP and TLP, tightly coupled hardware, interaction among threads
  o Request level parallelism: exploits largely decoupled tasks specified by the programmer

Processor Technology Trends

• Shrinking of transistor sizes: 250nm (1997), 130nm (2002), 65nm (2007), 32nm (2010), 28nm (2011, AMD GPU, Xilinx FPGA), 22nm (2011, Intel Ivy Bridge, die shrink of the Sandy Bridge architecture)
• Transistor density increases by 35% per year and die size increases by 10-20% per year: more cores!

Trends: Historical Perspective
• [Figure not recoverable]

Power Consumption Trends
• $P_{\text{dyn}} \propto \text{activity} \times \text{capacitance} \times \text{voltage}^2 \times \text{frequency}$
• Capacitance per transistor and voltage are decreasing, but number of transistors is increasing at a faster rate; hence clock frequency must be kept steady
• Leakage power is also rising
• Power consumption is already between 100-150W in high-performance processors today
• 3.3GHz Intel Core i7: 130 watts

Recent Microprocessor Trends
• [Figure: trends from 2004 to 2010]
  o Transistors: 1.43x / year
  o Cores: 1.2 - 1.4x
  o Performance: 1.15x
  o Frequency: 1.05x
  o Power: 1.04x
• Source: Micron University Symp.
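As a rough illustration of the dynamic power relation on the Power Consumption Trends slide, the sketch below compares two operating points. The function name `relative_dynamic_power` and the 15% scaling factors are assumptions chosen for illustration; they are not measurements of any real processor.

```python
def relative_dynamic_power(activity: float, capacitance: float,
                           voltage: float, frequency: float) -> float:
    """Return a value proportional to dynamic power: a * C * V^2 * f.

    Only ratios between two calls are meaningful, because the
    proportionality constant is left out.
    """
    return activity * capacitance * voltage ** 2 * frequency


# Hypothetical baseline, normalized to 1.0 in every factor.
baseline = relative_dynamic_power(1.0, 1.0, 1.0, 1.0)

# Hypothetical scaled point: voltage and frequency both lowered by 15%.
scaled = relative_dynamic_power(1.0, 1.0, 0.85, 0.85)

ratio = scaled / baseline
print(f"Dynamic power falls to {ratio:.2f} of baseline")
```

Because voltage enters squared, lowering voltage and frequency together reduces dynamic power roughly with the cube of the scaling factor, while performance drops only about linearly with frequency.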

Improving Energy Efficiency Despite Flat Clock Rate
• Turn off the clock of inactive modules
  o Disable FP unit, core, etc.
• Dynamic Voltage-Frequency Scaling
  o In periods of low activity, lower the clock rate
• Low power mode
  o DRAMs: lower power mode for extending battery life
• Overclocking
  o Intel Turbo mode (2008): chip decides safe clock rate
  o i7 at 3.3 GHz can run in short bursts at 3.6 GHz

Modern Processor Today
• Intel Core i7
  o Clock frequency: 3.2 - 3.33 GHz
  o 45nm and 32nm products
  o Cores: 4 - 6
  o Power: 95 - 130 W
  o Two threads per core
  o 3-level cache, 12 MB L3 cache
  o Price: $300 - $1000

Other Technology Trends
• DRAM density increases by 40-60% per year, latency has reduced by 33% in 10 years, bandwidth improves twice as fast as latency decreases
• Disk density improves by 100% every year, latency improvement similar to DRAM

First Microprocessor
• Intel 4004, 1971
  o 4-bit accumulator architecture
  o 8 µm pMOS
  o 2,300 transistors
  o 3 x 4 mm²
  o 750 kHz clock
  o 8-16 cycles/inst.

Hardware
• Team from IBM building PC prototypes in 1979
• Motorola 68000 chosen initially, but 68000 was late
• 8088 is 8-bit bus version of 8086 => allows cheaper system
• Estimated sales of 250,000
• 100,000,000s sold
[Personal Computing Ad, 11/81]

DYSEAC, first mobile computer!
• Carried in two tractor trailers, 12 tons + 8 tons
• Built for US Army Signal Corps

Measuring Performance
• Two primary metrics: wall clock time (response time for a program) and throughput (jobs performed in unit time)
• To optimize throughput, must ensure that there is minimal waste of resources
• Performance is measured with benchmark suites: a collection of programs that are likely relevant to the user
  o SPEC CPU 2006: CPU-oriented programs (for desktops)
  o SPECweb, TPC: throughput-oriented (for servers)
  o EEMBC: for embedded processors/workloads

Performance

$$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

                Inst Count    CPI    Clock Rate
Program             X
Compiler            X          X
Inst. Set           X          X
Organization                   X         X
Technology                               X
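The CPU time relation on the Performance slide (instructions per program, times cycles per instruction, times seconds per cycle) can be tried out with a small sketch. The function name `cpu_time_seconds` and all workload numbers below are hypothetical values picked for illustration, not data from the course.

```python
def cpu_time_seconds(instruction_count: float, cpi: float,
                     clock_rate_hz: float) -> float:
    """CPU time = instructions x cycles/instruction x seconds/cycle."""
    seconds_per_cycle = 1.0 / clock_rate_hz
    return instruction_count * cpi * seconds_per_cycle


# Hypothetical program: 2 billion instructions, CPI of 1.5, 3.2 GHz clock.
base = cpu_time_seconds(2e9, 1.5, 3.2e9)

# Hypothetical compiler change: 10% fewer instructions, but CPI rises to 1.6.
tuned = cpu_time_seconds(1.8e9, 1.6, 3.2e9)

print(f"base    = {base:.4f} s")
print(f"tuned   = {tuned:.4f} s")
print(f"speedup = {base / tuned:.3f}x")
```

In this made-up case the instruction count drops by 10% but the overall gain is only about 4%, because the higher CPI gives part of the saving back. This is why the three factors have to be evaluated together rather than one at a time.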

Amdahl's Law
• Architecture design is very bottleneck-driven: make the common case fast, do not waste resources on a component that has little impact on overall performance/power
• Amdahl's Law: performance improvement through an enhancement is limited by the fraction of time the enhancement comes into play

Amdahl's Law
• Consider an enhancement that runs 10 times faster than the original machine but is only usable 40% of the time.
  o Only 1.56x overall speedup
• An application is almost all parallel: 90%. Speedup using
  o 10 processors => 5.3x
  o 100 processors => 9.2x
  o 1000 processors => 9.9x

Principle of Locality
• Most programs are predictable in terms of instructions executed and data accessed
• Temporal locality: a program will shortly re-visit X
• Spatial locality: a program will shortly visit X+1

Exploit Parallelism
• Most operations do not depend on each other, hence execute them in parallel
• At the circuit level, simultaneously access multiple ways of a set-associative cache
• At the organization level, execute multiple instructions at the same time
• At the system level, execute a different program while one is waiting on I/O
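The Amdahl's Law examples from the earlier slides (a 10x enhancement usable 40% of the time, and a 90% parallel application on 10, 100, and 1000 processors) can be checked with a minimal sketch. The function name `amdahl_speedup` is an assumption, and the parallel part is assumed to speed up linearly with the processor count.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only part of the execution time is enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Enhancement that is 10x faster but usable only 40% of the time.
print(f"10x on 40% of the time => {amdahl_speedup(0.40, 10):.2f}x")

# Application that is 90% parallel, run on 10, 100, and 1000 processors
# (assumes the parallel part scales linearly with processor count).
for processors in (10, 100, 1000):
    print(f"{processors:5d} processors => {amdahl_speedup(0.90, processors):.2f}x")
```

The results come out to about 1.56x, 5.26x, 9.17x, and 9.91x. With 10% of the work serial, the speedup can never exceed 10x no matter how many processors are added.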
