Runtime Power Measurement/Modeling and Thermal Modeling Research


Seminar
Canturk ISCI

MOTIVATION
Power matters!
  Performance improves exponentially; so does power density
  Chip areas increase 7%/year
Battery life: improves much more slowly
Thermal issues: follow power density
  Packaging costs: +$1/W over ~40W
Need good measurement/modeling techniques for power- and thermally-aware/adaptive systems
  Using measurement to probe microarchitectural details
  CASTLE, data activity experiment
  Compiler-level power optimizations
  SW power profiling and optimization
  Power-aware OS
  Power modeling for decision making
  Dynamic thermal/power management
  Thermal hotspots & power threshold

MOTIVATION
Power models reflecting modern processors
  Clock gating, power
  Voltage regulation, di/dt
Need for fast, real-time modeling and measurement to observe long time periods
  Thermal time constants: O(s)
  Not feasible even with architectural simulators, i.e.: 1s of real run ~ 5 x IPC hrs of WATTCH simulation
Need live, run-time power/thermal measures
  Dynamic thermal management
  Power-aware OS & systems control

THE BIG PICTURE
Performance Monitoring, Real Power Measurement, Power Modeling, Thermal Modeling
Bottom line:

To estimate component power & temperature breakdowns for P4 at runtime

Remainder of Talk
Related Work
Performance Monitoring
  P4 Performance Counters
  Performance Reader LKM
Real Power Measurement
  P4 Power Measurement Setup
  Examples
Power Modeling
  P4 Power Model
  Model + Measurement Sync Setup, Verification
Thermal Modeling
  Refined Thermal Model
  Ex: Ppro Thermal Model

RELATED WORK
Implementing counter readers:
  PCL [Berrendorf 1998], Intel VTune, Brink & Abyss [Sprunt 2002]
Using counters for performance:

  HPC [Crummey 2001], CPU profilers
Using counters for power:
  CASTLE [Joseph 2001], power profilers
  Event-driven OS/cruise control [Bellosa 2000, 2002]
Real power measurement:
  Compiler optimizations [Seng 2003]
  Cycle-accurate measurement with switch caps [Chang 2002]

RELATED WORK
Power management and modeling support:
  Instruction-level energy [Tiwari 1994]
  PowerScope: procedure-level energy [Flinn 1999]
  Event counter driven energy coprocessor [Haid 2003]
  Power-breakdown driven energy reduction [Huang 2001]
  Virtual Energy Counters for Mem. [Kadayif 2001]
  ECOsystem: OS energy accounting [Ellis 2002]
Thermal management and modeling support:
  PID-based DTM [Skadron 2002]
  Architectural thermal model [Skadron 2003]
  Evaluating DTM techniques [Brooks 2001]

Milestone 1

Live CPU Performance Monitoring with Hardware Counters
Most CPUs have hardware performance counters
P4 performance monitoring HW:

  18 event counters
  18 counter configuration control registers: configure how to count
  45 event selection control registers: configure what to count
  Additional control registers

Counter Overview
Event types
  59 event classes
  100s of events to count
Metric classifications:
  General. Ex: Speculative uops retired
  Branching. Ex: Mispredicted conditionals
  Trace cache and front end. Ex: Processor N deliver mode
  Memory. Ex: MOB load replays
  Bus. Ex: Prefetch bus accesses
  Characterization. Ex: Packed SP retired
  Machine clear. Ex: Memory order machine clear
Counting types
  Non-retirement
  At-retirement: can count BOGUS vs NBOGUS, tag uops, etc.
Mechanisms:
  Front end tagging
  Execution tagging
  Replay tagging
  No tags
Also: event counting, event-based sampling (EBS), precise EBS

Our Event-Counter: Performance Reader
Performance Reader implemented as a Linux loadable kernel module (LKM)
  Implements 6 syscalls:
    select_events()
    reset_event_counter()
    start_event_counter()
    stop_event_counter()
    get_event_counts()
    set_replay_MSRs()
User-level interface:
  Defines the events
  Starts counters
  Stops counters
  Reads counters & TSC

Performance Reader: Example Validation
L1_Dcache benchmark
  Controls cache hit behavior
  Validated against measured cache events
  Vary hit rate from 0-100%
[Figure: L1 Hit Rate Experiment. Series: Ideal Hit Rate, Acquired L1 Hit Rate, L1 hit rate from L2 Access. x-axis: Desired Hit Rate (Benchmark Input), 0.1 to 1; y-axis: Acquired Hit Rates, 0% to 120%.]

Milestone 2

P4 Power Measuring Setup
Clamp ammeter on 12V lines on measured CPU
  1mV/Adc conversion
DMM reading clamp voltages
Serial Reader (PowerMeter, PowerPlotter)
  Voltage readings via RS232 to logging machine
  Convert to power vs. time window

PowerPlotter: Example

[Figure: measured power trace annotated with benchmark phases: Branch exercise (Taken rate: 1); High-Low; L1Dcache with array size 1/100 of L1; L1Dcache with array size x25 of L1 (~L2); L1Dcache with array size x4 of L2; Fast; Initialization; Benchmark Execution.]

SPEC Power Examples
Different programs show very different power characteristics
Timescale of interest can be huge => inaccessible via simulation
[Figure: Spec GCC (O3) with specrun -a run. Power [W], 0 to 80, vs. time (s), 0 to 200.]
[Figure: Spec VPR (O3) with specrun -a run. Power [W], 0 to 60, vs. time (s), 0 to 500.]

Milestone 3

P4 POWER MODEL
Define components (i.e. L1 cache, BPU, Regs, etc.), whose powers we'll model:
  Define components from annotated layout
  Determine the combination of P4 events that best represents component accesses
  Gather counter info with minimal power overhead and program interruption
  Convert counter info into component power breakdowns

  Verify total power against measured processor power
[Diagram labels: Define Events; Performance Monitoring; Real Power Measurement; Power Modeling]

Defining Components
(two figure-only slides)

Defining Events -> Access Rates
We determined 24 events to approximate access rates for 22 components
Used several heuristics to represent each access rate
Ex: 2nd Level BPU:
  Metric 1: Instructions fetched from L2 (predict)
    Event: ITLB_Reference (counts ITLB translations)
    Mask: All hits, misses
  Metric 2: Branches retired (history update)
    Event: branch_retired (counts branches retired)
    Mask: Count all Taken/NT/Predicted/MissP
Need to rotate counters 4 times to collect all event data
  Used 15 counters & 4 rotations to collect all event data

Access Rates -> Component Powers
We gather counter data at the measured computer via the tiny counter reader
We send the access rates to the logger machine
  Don't want to do any computation at host

Logger machine converts access rates to the component power breakdowns
  Computation done externally, still at runtime
Access rates used as a proxy to weight max component power, together with microarchitectural details
  Ex: Trace cache delivers 3 uops/cycle max
  Power(TC) = Access-Rate(TC)/3 * MaxPower(TC) + Non-gated TC CLK power

Generic Equation
Power(Component) = Access-Rate(Component) x Microarchitectural Scaling x MaxPower(Component) + Non-gated component clock power

Experiment Setup
Recall: clamp ammeter on 12V lines on measured CPU
  1mV/Adc conversion
DMM reading clamp voltages
Serial Reader (PowerMeter, PowerPlotter)
  Voltage readings via RS232 to logging machine
  Convert to power vs. time window
POWER SERVER
  Component access rates over ethernet
POWER CLIENT
  Convert voltage to measured power
  Convert access rates to modeled powers
  Sync together in time window

Area Based Power Estimate: Total Power Result
[Figure: measured vs. modeled total power over benchmark phases: Branch exercise (Taken rate: 1), High-Low, L1Dcache, Fast (Hit Rate: 0.1).]
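The generic component power equation from the power model slides can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, its parameters, and the wattage values are hypothetical placeholders, not measured P4 numbers; only the form of the equation and the trace cache limit of 3 uops/cycle come from the slides.

```python
def component_power(access_rate, max_rate, max_power_w, nongated_clock_w):
    """Power(Component) = Access-Rate x scaling x MaxPower + non-gated clock power.

    access_rate      : measured accesses per cycle for the component
    max_rate         : microarchitectural maximum accesses per cycle (scaling = 1/max_rate)
    max_power_w      : assumed maximum power of the component, in watts (placeholder)
    nongated_clock_w : assumed non-gated clock power of the component, in watts (placeholder)
    """
    if max_rate <= 0:
        raise ValueError("max_rate must be positive")
    return (access_rate / max_rate) * max_power_w + nongated_clock_w


# Trace cache example: delivers at most 3 uops/cycle (from the slide).
# The 4.0 W and 0.5 W figures are made up for illustration only.
tc_power = component_power(access_rate=1.5, max_rate=3.0,
                           max_power_w=4.0, nongated_clock_w=0.5)
print(f"Modeled trace cache power: {tc_power:.2f} W")
```

With zero access rate the sketch returns only the non-gated clock term, which is the behavior the equation implies for an idle but clocked component.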

After Tuning?
[Figure: measured vs. modeled total power over benchmark phases: Branch exercise (Taken rate: 1), High-Low, L1Dcache, Fast (Hit Rate: 0.1).]

Component Breakdowns
Component breakdowns for branch_exercise
Colors for 4 CPU subsystems
[Figure labels recovered: Execution, Issue - Retire]

SPEC Results
[Figure: measured vs. modeled power for Gcc, Gzip, Vpr, Vortex, Gap, Crafty.]

Milestone 4

THERMAL MODELING: A Basic Model
Based on lumped R-C model from packaging
Built upon power modeling
  Sampled component powers
  Respective component areas
  Physical processor parameters
  Packaging
  Heat transfer
[Figure: lumped R-C network. DIE blocks Blk_i, Blk_j, Blk_k, Blk_l with temperatures T_{b,i..l}, powers P_i..P_l, and elements R_{th,i..l}, C_{th,i..l}; HEATSINK node T_h with R_{th,h}, C_{th,h}, carrying P_i + P_j + P_k + P_l.]

For each block i:
$$P_i = \frac{T_{b,i} - T_h}{R_{th,i}} + C_{th,i}\,\frac{dT_{b,i}}{dt}$$
Final difference equation:
$$\Delta T_i = \frac{P_i}{C_{th,i}}\,\Delta t - \frac{T_i}{C_{th,i}\,R_{th,i}}\,\Delta t$$
  $\Delta t$: sampling interval
  $T_i$: the temperature difference between the block and the heatsink

Refined Thermal Model
Steady-state analysis reveals that the heatsink-die abstraction is not sufficient for real systems
Proceeding to a multilayer thermal model:
  Active die thickness
  Metalization/insulation
  Chip-package interface
  Package
  Heatsink
Requires searching of several materials/dimensions and thermal properties
  Multiple layers -> multiple T nodes -> multiple DEs
Baseline heat removal structure: Die, Package, Heat Spreader, Thermal Grease, HEATSINK

Physical Structure vs. Thermal Model
[Figure: physical stack (ambient airflow at ambient temperature T_A, heatsink, thermal grease, heat spreader, package, die) beside its thermal model: nodes T_h, T_spr, T_{p,i}, T_{die,i}; elements R_hXA, R_h, C_h, R_grXspr, R_spr, C_spr, R_{p,i}, C_{p,i}, R_{die,i}, C_{die,i}; sources P_i and P_total.]

Analytical Derivation
4 nodes -> 4 DEs
1) T_spr:
$$P_{total} = \frac{T_{spr} - T_h}{R_{spr\text{-}gr}} + C_{spr}\,\frac{dT_{spr}}{dt}$$
Discretizing time:
$$P_{total} = \frac{T_{spr} - T_h}{R_{spr\text{-}gr}} + C_{spr}\,\frac{\Delta T_{spr}}{\Delta t}$$
Final difference equation:
$$\Delta T_{spr} = \frac{P_{total}\,\Delta t}{C_{spr}} - \frac{1}{C_{spr}\,R_{spr\text{-}gr}}\,(T_{spr} - T_h)\,\Delta t$$
$$T_{spr} \leftarrow T_{spr} + \Delta T_{spr}$$

EX: Ppro Thermal Model
Use CASTLE [Joseph, 2001] computed component powers
Determine component areas from die photo
Determine processor/packaging physical parameters
Generate numerical thermal model
Apply component difference equations recursively along power flow: T_{die,i} -> T_{p,i} -> T_spr -> T_h
  Update T_{die,i}, update T_{p,i}, update T_spr, update T_h

Simulation Outputs
Thermal nodes updated every Δt ~ 20ms
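The basic model's difference equation, ΔT_i = (P_i / C_th,i) Δt - (T_i / (C_th,i R_th,i)) Δt, can be stepped forward in a short loop. The sketch below is a minimal illustration, assuming a single block with constant power; the function names and the R, C, and power values are hypothetical placeholders rather than Ppro or P4 parameters. Only the update rule and the ~20ms interval come from the slides.

```python
def step_block(t_diff, power_w, r_th, c_th, dt_s):
    """Advance T_i (block temperature minus heatsink temperature) by one interval.

    Implements: dT = (P / C) * dt - (T / (C * R)) * dt
    """
    d_t = (power_w / c_th) * dt_s - (t_diff / (c_th * r_th)) * dt_s
    return t_diff + d_t


def simulate_block(power_w, r_th, c_th, dt_s, steps, t_start=0.0):
    """Apply the difference equation repeatedly with constant power."""
    t_diff = t_start
    for _ in range(steps):
        t_diff = step_block(t_diff, power_w, r_th, c_th, dt_s)
    return t_diff


# Placeholder values: 10 W into a block with R = 2 K/W and C = 0.5 J/K,
# sampled every 20 ms (the interval quoted for the simulation outputs).
final = simulate_block(power_w=10.0, r_th=2.0, c_th=0.5, dt_s=0.02, steps=1000)
print(f"Temperature difference after 20 s: {final:.3f} K")
```

With constant power the loop settles toward P x R, the steady-state value implied by setting ΔT_i to zero. The explicit update is only stable while Δt stays well below the R x C time constant, which is one reason the sampling interval matters.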

Component temperatures build up to ~350K in ~5hrs
T_heatsink moves very slowly, as expected
[Figure: Pentium Pro Thermal Simulation. Temperature (C), 0 to 80, at startup and after 5 hours, for: Ambient, Heatsink, Heat Spreader, Decode, Issue, Reorder, DMem, IMem, FUs, Other.]

SUMMARY
Performance Monitoring, Real Power Measurement, Power Modeling, Thermal Modeling

Conclusions
Contributions:
  Portable runtime real power measurement system
  Performance counter based runtime power & thermal model, and runtime verification with synchronous real power measurement
  Thermal model, which can be applied to ANY power model (with good physical characterization) as long as physical component based power breakdowns are used
  Runtime modeling & measurement system for arbitrarily long timescales!
Outcomes:
  We can do reasonably accurate real power measurements at runtime without interfering with HW
  We can perform runtime power modeling with the tiny performance reader, without inducing any significant overhead on the power profile

What to do next?
Keep tuning for SPECs [1st Stop]
  Try regression at several corners
    Won't do well due to clk gating??
  Get data from Intel?
  Try runtime self-updating model?
  Compare all to actual data
Experiment with microarchitecture, evaluate several power properties [2nd Stop]
Add thermal
  Try to add lateral heat diffusion
  Get contour results [3rd Result]
P4 thermal monitor stuff
  Could be played from kernel to modulate clock
  Can we use it with our models to do power savings on REAL HW??

RELATED WORK performance monitoring

implementing counter readers:
PCL: Performance Counter Library, by Rudolf Berrendorf (University of Applied Sciences Bonn-Rhein-Sieg), Heinz Ziegler, and Bernd Mohr at the Central Institute for Applied Mathematics (ZAM) at the Research Centre Juelich, Germany
  Uniform interface for several architectures (Intel Pentium, MMX, Pro, III, 4/Linux; IBM Power3, Power3-II/AIX; etc.)
  Software library with C, C++, Java & Fortran bindings
  Kernel patch (Mikael Pettersson), recompile
PAPI: Performance Application Programming Interface Project, by Jack Dongarra, Kevin London, Shirley Moore, Philip Mucci, etc., at Innovative Computing Lab, CS Dept., University of Tennessee
  Standard
  Simple high-level API and low-level programmable interface
  Supports Pentium, MMX, Pro, III/Linux, Windows; Power 3,4/AIX; etc.
  PerfCtr kernel patch (Mikael Pettersson), recompile

RELATED WORK performance monitoring
implementing counter readers:
Perfmon: Performance Monitoring Tool by Richard Enbody, Associate Professor, Department of Computer Science and Engineering, Michigan State University
  For SUN Ultra-Sparc & Ppro
  Device driver (LKM)
Rabbit: Performance Counters Library by Don Heller, Scalable Computing Laboratory, Iowa State University
  For Intel Pentium MMX, Pro, II, III/Linux; AMD/Linux
  Functions to access from within C
  Cleanest of all, but still ~30 files & ~50 instructions
  LKM
Intel's VTune Performance Analyzer
  Windows & Linux
IBM's HPM toolkit
  Power 3,4/AIX
Brink and Abyss: Pentium 4 Performance Counter Tools for Linux, by Brinkley Sprunt, Electrical Engineering, Bucknell University
  brink: high-level perl script to read experiment/config files
  abyss: C program to access counters
  abyss_dev: device driver for counter access
  EBS kernel patches: to handle PMIs

RELATED WORK performance monitoring
using counter readers:
CASTLE Project by Margaret Martonosi and Russ Joseph, Princeton University
  Acquire Ppro counter data to model component power breakdowns
Frank Bellosa, "Benefits of Event Driven Energy Accounting in Power Sensitive Systems", 9th SIGOPS European Workshop, 2000
  Counters to show power ~ k x instr-ns/cycle (PII)
  OS power optimizations: throttle down CPU/extend thread time for cache hit/slow down CPU core if main memory is accessed
Andreas Weissel, Frank Bellosa, "Process Cruise Control: Event driven clock scaling for dynamic power management", CASES 2002
  Use event counter info to scale individual thread frequencies
  Intel XScale / modified Linux kernel

RELATED WORK performance monitoring
using counter readers:
HPC Toolkit, by John Mellor-Crummey, Rob Fowler, CS Dept., Rice University
  Uses perf counter data for profiling
  Converts raw profiling information into platform-independent XML formats and produces performance metric correlations from multiple sources
  Used in compiler optimizations
Jennifer Anderson, et al., "Continuous Profiling: Where Have All the Cycles Gone?", ACM Transactions on Computer Systems, Vol. 15, No. 4, November 1997, pp. 357-390.
  Performance analysis example from DEC
  Data collection by counter sampling, performance info from program level to individual instructions

RELATED WORK real power
CASTLE Project by Margaret Martonosi and Russ Joseph, Princeton University
  Shunt R over Ppro power lines to measure total processor power
John Seng, Dean Tullsen, "Effect of compiler optimizations on Pentium 4 Power consumption", 7th Annual Workshop on Interaction between Compilers and Computer Architectures, February 2003
  Shunt R between VRM and CPU
Marc A. Viredaz, Deborah A. Wallach, "Power Evaluation of Itsy Version 2.3", tech. note TN57, WRL, Compaq Computer Corp., 2000
  Similar series R to estimate battery life of Itsy pocket computer

RELATED WORK real power
Frank Bellosa, "Benefits of Event Driven Energy Accounting in Power Sensitive Systems", 9th SIGOPS European Workshop, 2000
  Crude current measurement with DMM for Pentium II to help define per-instruction powers
Andreas Weissel, Frank Bellosa, "Process Cruise Control: Event driven clock scaling for dynamic power management", CASES 2002
  Series sense resistor added to Intel IQ 80310 evaluation platform power supply, to measure energy effect of frequency scaling
Naehyuck Chang, Kwanho Kim, and Hyun Gyu Lee, "Cycle-Accurate Energy Consumption Measurement and Analysis: Case Study of ARM7TDMI", ISLPED 2000 & IEEE Transactions on VLSI Systems, Vol. 10, pp. 146-154, Apr. 2002.
  Cycle-accurate energy consumption measurement based on charge transfer
  Inserts switch caps between power supply and processor that switch with the same clock frequency!!

RELATED WORK power model
Simulation tools:
WATTCH, by David Brooks and Margaret Martonosi, Princeton University, ISCA 2000
  Architectural power simulator
  Power models integrated upon SimpleScalar
SimplePower, by W. Ye, N. Vijaykrishnan, M. Kandemir, and M. Irwin, Penn State University, "The Design and Use of SimplePower: A cycle-accurate energy estimation tool", DAC, June 2000

  Execution-driven, cycle-accurate, RTL power estimation
  Emulates 5-stage pipe with SimpleScalar's integer ISA

RELATED WORK power model
Power modeling:
R. Joseph and M. Martonosi, "Run-Time Power Estimation in High Performance Microprocessors", International Symposium on Low Power Electronics and Design, 2001
  Complete CASTLE Project: collects Ppro counter data and models component power breakdowns, verifying against measured total power
  Also Wattch simulation vs. counter approximation for SimpleScalar architecture
Russ Joseph, David Brooks, and Margaret Martonosi, "Live, Runtime Power Measurements as a Foundation for Evaluating Power/Performance Tradeoffs", Workshop on Complexity Effective Design (WCED, held in conjunction with ISCA-28), 2001
  Evaluate power vs. performance by measuring total power and acquiring performance data from counters, i.e. cache hit rate, branch prediction, bitline activity

RELATED WORK power model
H. Zeng, X. Fan, C. Ellis, A. Lebeck, and A. Vahdat, "ECOSystem: Managing Energy as a First Class Operating System Resource", Proceedings of ASPLOS X, Oct. 2002
  Uses Currentcy model (fixed power & time budget for a task) for OS-level energy management for battery life
  ECOsystem is the Linux OS implementation
  Considers CPU ON/OFF; could do better with power model
H. Zeng, C. Ellis, A. Lebeck, A. Vahdat, "Currentcy: Unifying Policies for Resource Management", USENIX 2003 Annual Technical Conference
  Detailed description of currentcy (OS scheduling, etc.)
Flinn, J., Satyanarayanan, M., "PowerScope: A Tool for Profiling the Energy Usage of Mobile Applications", Proceedings of the Second IEEE Workshop on Mobile Computing Systems and Applications, February 1999
  Maps energy -> program structure (power profiling -> energy-efficient SW design)
  DMM gets energy for machine
  Kernel modification (system monitor) gets PIDs for processes and identifies procedures for profiling offline

RELATED WORK power model
V. Tiwari, S. Malik, and A. Wolfe, "Power analysis of embedded software: A first step towards software power minimization", International Conference on Computer-Aided Design & IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 1994
  PIONEER WORK in power measurement/modeling
  Measure current drawn by an Intel 486DX2 processor and DRAM
  Generate energy cost table for instructions
  Identify inter-instruction effects: circuit state overhead, resource constraint effect, cache miss effects
  There are 1 million like this (modeling SW energy); I won't put them here
Lee, A. Ermedahl, and S. Min, "An accurate instruction-level energy consumption model for embedded RISC processors", ACM SIGPLAN Conf. on Languages, Compilers, and Tools for Embedded Systems (LCTES'01), Jun 2001
  Derives energy consumption for instructions rather than functional units for RISC ARM7TDMI processor
  Uses their cycle-accurate power measurement scheme
  Black box approach (similar to F. Bellosa) with linear regression

RELATED WORK power model
J. Russell and M.F. Jacome, "Software Power Estimation and Optimization for High Performance, 32-bit Embedded Processors", Proc. of ICCD '98
  Estimates SW energy for i960 family 32-bit embedded RISC processors
  Uses digitizing oscilloscope/series resistor over processor power lines for measurement
  Uses constant P_est for processor power and estimates energy based on runtime (won't work with clock gating!)
J. Haid, G. Kafer, et al., "Run-Time Energy Estimation in System-On-a-Chip Designs", ASP-DAC 2003

  Proposes a coprocessor for runtime energy estimation for SoC
  Defines similar event counters in coprocessor and uses power macromodels
M. Lajolo, A. Raghunathan, S. Dey, L. Lavagno, and A. Sangiovanni-Vincentelli, "Efficient power estimation techniques for hw/sw systems", IEEE Proc. VOLTA'99 International Workshop on Low Power Design, pages 191-199, March 1999.
  Power estimation for HW/SW SoC designs
  RTL HW simulator and instruction set simulator using instruction-level power models

RELATED WORK power model
M. Huang, J. Renau, and J. Torrellas, "Profile-based energy reduction in high-performance processors", 4th Workshop on Feedback-Directed and Dynamic Optimization, December 2001
  Use profiling to determine when to activate/deactivate low power methods, i.e. DVS, clock gating, etc.
  Use energy statistics (power breakdowns) from performance counters for profiling (SIM)
I. Kadayif, T. Chinoda, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, A. Sivasubramaniam, "vEC: virtual energy counters", Proceedings of the 2001 ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, 2001
  Uses Perfmon library for UltraSPARC to read SPARC HW perf counters related to memory
  Converts readings to power using analytical memory energy model
  Estimates memory system energy consumption

RELATED WORK power model
Luca Benini et al.
"System-level power estimation and optimization", Proceedings of the 1998 International Symposium on Low Power Electronics and Design
"System-level power optimization: techniques and tools", Proceedings of the International Symposium on Low Power Electronics and Design, 1999
  Tutorial on power-conscious system-level design
  Memory optimizations, hardware/software partitioning, instruction-level power optimizations, DVS, DPM (allow components to sleep)
"Supporting system-level power exploration for DSP applications", Proceedings of the 10th Great Lakes Symposium on VLSI, 2000
  Modified ARM simulator for instruction-level power estimation

RELATED WORK thermal model
K. Skadron, T. Abdelzaher, and M. R. Stan, "Control-theoretic techniques and thermal-RC modeling for accurate and localized dynamic thermal management", Proc. HPCA-8, pages 17-28, Feb. 2002.
  Single-degree component-based thermal R-C model for MIPS R10000 scaled to 0.18um
  Only die -> heatsink thermal conduction, with constant heatsink and Si properties only
  Power/thermal simulation using Wattch for verification of DTM with PID controller
Sabry, M.-N.; Bontemps, A.; Aubert, V.; Vahrmann, R., "Realistic and efficient simulation of electro-thermal effects in VLSI circuits", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume 5, Issue 3, Sep 1997
  Transistor level with interdevice thermal resistances
Szekely, V.; Poppe, A.; Pahi, A.; Csendes, A.; Hajas, G.; Rencz, M., "Electro-thermal and logi-thermal simulation of VLSI designs", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume 5, Issue 3, Sep 1997
  LOGITHERM simulator module for gate-level thermal simulation, by thermal characterization of logic gates

RELATED WORK thermal model
COSMOS/FloWorks by NIKA
  Fluid flow and thermal analysis program
  Heat flow computation based on mesh analysis
A. Dhodapkar, C. H. Lim, G. Cai, and W. R. Daasch, "TEMPEST: A thermal enabled multi-model power/performance estimator", Proceedings of Workshop on Power-Aware Computer Systems, Nov. 2000.

  Thermally enabled architectural simulator based on SimpleScalar
  Single R, C for the whole processor; packaging oriented
D. Brooks and M. Martonosi, "Dynamic thermal management for high-performance microprocessors", Proceedings of the Seventh International Symposium on High-Performance Computer Architecture, pages 171-82, Jan. 2001.
  Discusses microarchitectural and scaling DTM mechanisms
  Uses moving average of power for ~100K cycles of Wattch simulation as a proxy for temperature, to detect thermal emergencies for DTM triggering

RELATED WORK thermal model
Thermal Monitoring, Intel Architecture SW Developer's Manual vol. 3
  Catastrophic shutdown detector: thermal diode resets stop clock duty cycle
  Automatic thermal monitor: internally modulates stop clock duty cycle
  Software controlled clock modulation: SW modulates stop clock duty cycle
Kevin Skadron et al., "Temperature-aware Microarchitecture", 30th ISCA, 2003
  HotSpot: architecture-level thermal simulator built upon Wattch
  Uses multiple-degree thermal R-C model for die, packaging, heatsink and convection to ambient
  More realistic area estimates based on Alpha 21364

Counter Access Heuristics
1) BUS CONTROL:
  No 3rd-level cache -> BSQ allocations ~ IOQ allocations
  Metric 1: Bus accesses from all agents

    Event: IOQ_allocation
      Counts various types of bus transactions
      Should account for BSQ as well
      Access based rather than duration
    MASK: Default req. type, all read (128B) and write (64B) types, include OWN, OTHER and PREFETCH
  Metric 2: Bus utilization (the % of time the bus is utilized)
    Event: FSB_data_activity
      Counts DataReaDY and DataBuSY events on bus
    Mask: Count when processor or other agents drive/read/reserve the bus
    Expression: FSB_data_activity x BusRatio / Clocks Elapsed
      To account for clock ratios

Counter Access Heuristics
2) L2 Cache:
  Metric: 2nd-level cache references
    Event: BSQ_cache_reference
      Counts cache references as seen by bus unit
    MASK: All MESI read misses (LD & RFO), 2nd-level WR misses
3) 2nd Level BPU:
  Metric 1: Instructions fetched from L2 (predict)
    Event: ITLB_Reference (counts ITLB translations)
    Mask: All hits, misses & UC hits
  Metric 2: Branches retired (history update)
    Event: branch_retired (counts branches retired)
    Mask: Count all Taken/NT/Predicted/MissP

Counter Access Heuristics
4) ITLB & I-Fetch: etc.
10) FP Execution:
  Metric: FP instructions executed
    event1: packed_SP_uop, counts packed single precision uops
    event2: packed_DP_uop, counts packed double precision uops
    event3: scalar_SP_uop, counts scalar single precision uops
    event4: scalar_DP_uop, counts scalar double precision uops
    event5: 64bit_MMX_uop, counts MMX uops with 64-bit SIMD operands
    event6: 128bit_MMX_uop, counts integer SSE2 uops with 128-bit SIMD operands
    event7: x87_FP_UOP, counts x87 FP uops
    event8: x87_SIMD_moves_uop, counts x87, FP, MMX, SSE, SSE2 ld/st/mov uops

INTRODUCTION to RUNTIME
What is runtime power/thermal measurement:
  Methodology for measuring CPU power/temperature and component breakdowns
3 alternatives:
  1. Measuring power/temperature directly from hardware, i.e. with multimeter probes
    Impossible with VLSI
    Runtime speed
  2. Simulating processor execution with SW and extracting power/temperature data
    WATTCH, Tempest, etc.
    Computation time problems, especially with thermal
    Cycle-level detail
  3. Runtime measurement: getting processor power/thermal data at runtime using both hardware and software
    Runtime speed and SW support; not cycle detail!

INTRODUCTION to RUNTIME
Why runtime power/thermal measurement:
  Offers a hybrid technique overlapping slow but detailed simulation and crude but fast real-time measurements
  Hardware performance counters help extract lots of useful information, both performance and power, on the fly
  Can be used for priming instead of a long simulation where the last few million instructions are of the most interest

WHY POWER & THERMAL
Moore's Law:
  Transistor count x4 / 3 years
  DRAM density x4 / 3 years
Performance improves exponentially; SO DOES POWER [1]
Nuclear core example:
(two figure-only slides)

WHY POWER & THERMAL
Battery technology improves much more slowly
Packaging costs: +$1/W over 35-40W [2]

POWER BASICS
Total Power = Dynamic Power + Static Power + Short Circuit Power
Dynamic power (switching power):
  Discharging of capacitances when switching occurs (0 -> 1); data dependent

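  $P_{sw} = \tfrac{1}{2}\,\alpha\,C_L\,V_{dd}^2\,f$
  Where this came from:

Derivation of Switching Power
$$i_C = C\,\frac{dV}{dt}$$
$$\text{Power} = i \cdot V = C\,V\,\frac{dV}{dt}$$
$$\text{Energy} = \tfrac{1}{2}\,C\,V^2$$
At each 1 -> 0 transition:
$$\text{Energy} = C_L\,V_{dd}^2$$
$$\text{Power} = \frac{\text{Energy}}{\text{time}} = \frac{C_L\,V_{dd}^2}{\text{clock period}} = C_L\,V_{dd}^2\,f$$
At each 0 -> 1 transition: this charge is dissipated
Total energy:
$$\text{Energy} = \frac{\text{Energy}}{\text{transition}} \times (\text{total } 0 \to 1 \text{ transitions})$$
$$\text{Power} = \frac{\text{Energy}}{\text{transition}} \times f \times P_{0 \to 1}$$
$P_{0 \to 1}$: probability of switching in a cycle $= \tfrac{1}{2} \times$ switching activity $\alpha$
$$\text{Power} = \tfrac{1}{2}\,\alpha\,C_L\,V_{dd}^2\,f$$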
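The closing result of the derivation, Power = (1/2) x switching activity x C_L x Vdd^2 x f, is easy to evaluate numerically. The sketch below only shows the arithmetic: the function name is hypothetical, and the 1 nF capacitance and activity factor are placeholders, not P4 parameters. The 1.7 V supply and 1.4 GHz clock are taken from the P4 details listed in the backup slides.

```python
def switching_power(activity, c_load_farads, vdd_volts, freq_hz):
    """Dynamic (switching) power: P = 0.5 * activity * C_L * Vdd^2 * f."""
    return 0.5 * activity * c_load_farads * vdd_volts ** 2 * freq_hz


# Placeholder numbers: 1 nF of total switched capacitance, 20% switching
# activity; 1.7 V and 1.4 GHz come from the P4 details slide.
p = switching_power(activity=0.2, c_load_farads=1e-9, vdd_volts=1.7, freq_hz=1.4e9)
print(f"Switching power: {p:.3f} W")

# Voltage enters quadratically: halving Vdd cuts switching power to one quarter.
p_half = switching_power(0.2, 1e-9, 0.85, 1.4e9)
print(f"Ratio at half Vdd: {p_half / p:.2f}")
```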

POWER BASICS
Static power (leakage power):
  Due to leakage through the N channel and through the drain-substrate junctions.

POWER BASICS
Short circuit power:
  Due to finite rise time of input signal.
  Generic CMOS feature
In comparison:
  Currently: 80% Sw. + 10% Leak + 10% SC
  Future: 45% Sw. + 45% Leak + 10% SC [3]

NEED FOR SPEED
WATTCH simulates 80K instr-s/sec
SpecINT 164.GZIP runs ~350s with average upc ~1.3 on 1.4 GHz P4, producing ~665 billion uops
  WATTCH simulation would take ~100 days
Assuming a 1GHz machine: 1s of real run ~ 5 x IPC hrs of WATTCH simulation
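The simulation-time estimates can be checked with simple arithmetic. The sketch below uses only the figures quoted on the slide (80K instructions per second for WATTCH, ~665 billion uops for gzip); the function names are hypothetical. By this arithmetic the gzip run comes to roughly 96 days, and one second of real execution costs about 3.5 x IPC hours at 1 GHz or about 4.9 x IPC hours at 1.4 GHz, the same order as the slide's ~5 x IPC figure.

```python
WATTCH_INSTR_PER_SEC = 80_000  # simulation speed quoted on the slide


def wattch_seconds(instructions):
    """Seconds of WATTCH simulation needed for a given instruction count."""
    return instructions / WATTCH_INSTR_PER_SEC


def wattch_days(instructions):
    return wattch_seconds(instructions) / 86_400


def wattch_hours_per_real_second(clock_hz, ipc):
    """Simulation hours needed to cover one second of real execution."""
    return wattch_seconds(clock_hz * ipc) / 3_600


print(f"gzip (~665e9 uops): {wattch_days(665e9):.1f} days")
print(f"1 GHz,   IPC 1: {wattch_hours_per_real_second(1.0e9, 1.0):.2f} hrs per real second")
print(f"1.4 GHz, IPC 1: {wattch_hours_per_real_second(1.4e9, 1.0):.2f} hrs per real second")
```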

P4 Details
Karelian.ee:
  P4 1.4GHz
  0.18um, C4-FC-PGA-423
  Heatsink: Folded Fin
  M6, Al interconnect
  Die size: 217 mm2
  Package size: 5.34cm x 5.17cm
  Power: Idle/typ./max = ??/51.8/71W
  D$1&T$1/L2: 8K&12KUops/256K
  Voltage: 1.7/1.75V

P4 Details
1st LKM:
  Implements syscall: getCPUinfo()
  Gathers CPU info from:
    /asm/processor.h
    Intel control registers (CR4)
    CPUID instruction
  Reveals:
    Debug Store mechanism exists for PEBS
    TSC exists
    MSRs implemented
    We can read/write performance counters
EX:
  karelian (P4, Willamette): UserLevel_CPUinfo
  viale (P4, Northwood): UserLevel_CPUinfo

P4 Detector - Counter Clusters
[Figure: EVENTS from P4 components feed event detectors, then event counters, over a 4-bit-wide bus.]

Counters, ESCRs & CCCRs
Simplified recipe:
  1. Select event to count
  2. Select a counter (also defines CCCR)
  3. Select an ESCR
  4. Set ESCR fields
  5. Set CCCR fields
  6. Enable CCCR

Counting Mechanisms
Counting types
  Non-retirement: events occur any time during execution
  At-retirement: events at the retirement of instruction
    Can count BOGUS vs NBOGUS, tag uops to count, etc. (see Terminology)
Mechanisms:
  Front end tagging (i.e. LD/ST retired)
  Execution tagging (i.e. packed_DP_retired)
  Replay tagging (i.e. L1 misses)
  No tags (i.e. uops retired)
Also: Event Counting | IEBS | PEBS

At Retirement Counting
Terminology
  BOGUS/NBOGUS (speculative)
  Tagging (count uops that encounter event)
  Replay (data speculation)

Verifying Counter Reader
1) L1Dcache_exercise:
  Uses pointer assignment
  L1=8K, L2=256K
  Array size = (L1 size / hit rate)
    i.e. for 10% hit rate: 80K -> 20K entries
  Array size < L2 size
  Array elements: PRBS of array indices
  Bench loop: new index = array[old index]
  However, gcc puts 5 LDs in the bench loop
    4 static: hit rate ~ 100%
    1 our load: our desired hit rate

Verifying Counter Reader
1) L1Dcache_exercise results:
[Figure: L1Dcache Experiment. Series: Acquired L1 Hit Rate, Our L1 Hit Rate, From L2 Accesses. x-axis: Desired Hit Rate, from 0.04 upward; y-axis: Acquired Rates, -20% to 120%.]
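The sizing rule and the load-count correction for L1Dcache_exercise can be written out as follows. This is a sketch under stated assumptions: the 4-byte entry size is inferred from the slide's "80K -> 20K entries" example, the four compiler-inserted loads are taken to hit exactly 100% of the time, and all function names are hypothetical.

```python
L1_BYTES = 8 * 1024      # L1 data cache size from the slide
L2_BYTES = 256 * 1024    # L2 size from the slide
ENTRY_BYTES = 4          # assumption: 4-byte entries (80K bytes -> 20K entries)
STATIC_LOADS = 4         # extra loads gcc places in the bench loop (~100% hits)


def array_entries(desired_hit_rate):
    """Array size = L1 size / hit rate, expressed as a number of entries."""
    array_bytes = round(L1_BYTES / desired_hit_rate)
    if array_bytes >= L2_BYTES:
        raise ValueError("array must stay smaller than L2")
    return array_bytes // ENTRY_BYTES


def acquired_hit_rate(desired_hit_rate):
    """Hit rate the counters see: 4 always-hit loads plus our 1 load."""
    return (STATIC_LOADS + desired_hit_rate) / (STATIC_LOADS + 1)


def our_hit_rate(acquired):
    """Invert the mix to recover the hit rate of our load alone."""
    return acquired * (STATIC_LOADS + 1) - STATIC_LOADS


print(array_entries(0.1))                     # number of entries for a 10% hit rate
print(acquired_hit_rate(0.25))                # what the counters would report
print(our_hit_rate(acquired_hit_rate(0.25)))  # corrected back to our load
```

The correction matters because the four static loads compress the measured range: a desired hit rate of 0 still reads as 80% at the counters.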

Ex: L1Dcache_exercise, hit rate = 0.25

Verifying Counter Reader 2) branch_exercise
  Uses random number comparison
  Assigns a 400K PRBS array outside the bench loop, to avoid rand() instructions in the bench loop
  Bench loop: compares array[index] to a threshold
  Threshold = RAND_MAX * TakenRate
  Repeats 1000 times, reseeding each time
  However, gcc adds 2 more branches into the bench loop:
    Loop exit condition (prediction ~100%)
    Unconditional JMP (prediction ~100%)
  Our branch's expected mispredict rate: ~ $(0.5 - |\mathrm{TakenRate} - 0.5|)$

Verifying Counter Reader 2) branch_exercise results
  (Figure: "Branch Prediction Experiment". Acquired rates, from 0% to 120%, vs. desired taken rate; series: approximated mispredict rate, our branch's taken rate. Ex: branch_exercise, taken rate = 0.5.)
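The threshold rule and the expected mispredict rate of branch_exercise can be illustrated with a short sketch. It is a Python stand-in for the C benchmark. The function names, the value of `RAND_MAX` (glibc's 2^31 - 1), the direction of the comparison (`value < threshold` means taken) and the use of Python's `random` module instead of a PRBS generator are assumptions.

```python
import random

RAND_MAX = 2**31 - 1   # assumption: glibc's RAND_MAX


def expected_mispredict_rate(taken_rate):
    """0.5 - |TakenRate - 0.5|: equals min(p, 1 - p), i.e. the error of
    always guessing the more frequent direction of a random branch."""
    return 0.5 - abs(taken_rate - 0.5)


def measured_taken_rate(taken_rate, n=400_000, seed=1):
    """Fill the random array outside the bench loop, then count taken branches."""
    rng = random.Random(seed)
    threshold = RAND_MAX * taken_rate
    values = [rng.randint(0, RAND_MAX) for _ in range(n)]  # outside the bench loop
    taken = 0
    for value in values:          # bench loop: one data-dependent branch
        if value < threshold:     # assumption: "taken" means value below threshold
            taken += 1
    return taken / n
```

The mispredict rate peaks at a taken rate of 0.5 and falls to zero at 0 and 1, which is the triangular shape expected for the approximated mispredict rate series.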

P4 POWER MEASUREMENT
  Clamp the current probe over the 12 V lines
  Log voltage readings
  Convert to instantaneous power: 12 x Vsample x 1000
  Complete setup (diagram labels): 1 mV/Adc conversion; voltage readings [V]; serial reader; log power values (PowerMeter); plot power values (PowerPlotter)

MEASUREMENT Method
  Select power lines that reflect CPU power: the P4 uses 12 V lines
  Clamp the current probe over the 12 V lines (1 mV/Adc conversion)
  Connect the clamp to the DMM
  Send the voltage reading over serial
  Log the voltage readings
  Convert to instantaneous power as: 12 x Vsample x 1000
  Log power values
  Plot power values

MEASUREMENT Tools
  Poll the serial port every ~20 ms: quicker is overkill, slower overlooks samples
  Compute a running average and sample every interval t you select: easier to sync with the power model
  PowerMeter: converts the voltage reading to power and logs it, P = 12 x Vread x 1000
  PowerPlotter: plots power samples over a sliding time window; 100 s history with 1000 samples (t = 100 ms)

Current Probe
  Fluke i410
  Uses the Hall voltage to measure current and converts it to a voltage: 1 mV/Adc
  Range: 0.5 to 400 A
  Accuracy: 3.5% + 0.5 A
  The generated voltage is fed to the DMM
  Compared against the Ppro "Amoeba" shunt setup for verification

Clamp vs Shunt
  (Figure: four panels of sampled current [A] vs. time [100 ms]: L1Dcache from clamp, L1Dcache from shunt, grep from shunt, grep from clamp.)
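The conversion used by PowerMeter follows from the probe ratio: at 1 mV per ampere, a reading of Vsample volts corresponds to Vsample x 1000 amperes, and multiplying by the 12 V supply gives watts. A minimal sketch of that conversion with a running average is shown below. The class and function names and the window length are illustrative assumptions, not the original tool's code.

```python
from collections import deque

SUPPLY_VOLTS = 12.0       # the clamped lines are the 12 V lines
AMPS_PER_VOLT = 1000.0    # probe ratio: 1 mV per ampere


def reading_to_power(v_sample):
    """P = 12 x Vsample x 1000, with Vsample in volts and P in watts."""
    return SUPPLY_VOLTS * v_sample * AMPS_PER_VOLT


class RunningAverage:
    """Average of the most recent `window` power samples."""

    def __init__(self, window):
        self.samples = deque(maxlen=window)

    def add(self, value):
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)


# Usage sketch: five ~20 ms polls averaged into one ~100 ms sample
# (the window of 5 is an assumption).
average = RunningAverage(window=5)
for volts in (0.0040, 0.0041, 0.0042, 0.0041, 0.0040):
    latest = average.add(reading_to_power(volts))
```

A reading of 4 mV therefore corresponds to 4 A on the 12 V lines, i.e. about 48 W.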

DMM
  Agilent 34401A
  Measurement motive: we should sample as quickly as possible (grep case)
  Measurement setup: fast 4-digit, autozero OFF, display OFF
    From [8]: 1000 readings/s (x150 faster than fast 6-digit)
  Serial interface: from [9], 55 ASCII readings/s
    Polling the serial port faster than every 20 ms is overkill

P4 Power Lines
  Which power lines should we cut/clamp?
  [5] shows the power lines: 1: CPU power connector; 13: system power connector (P1 = 13 and P2 = 1)
  [6], [7] say the P4 uses 12 V lines for the CPU, rather than 5 V lines
  Both P1 and P2 have 12, 5 and 3.3 V lines
  I run branch_exercise (takenRate = 1) and gzip_static to obtain the current variation on the lines

Current on Power Lines
  (Figure: current I [A] vs. time (s), one panel per line group: connector P1 line 7 (12V); connector P1 lines 1,3,,6,18,19,20,22 (5V); connector lines 11,12,23 (3.3V); connector P2 line 1 (3.3V); connector P2 line 3 (12V); connector P2 line 14 (5V); connector P2 line 9 (5V); connector P2 line 7 (12V).)
  Reveals: the currents on ALL three 12 V lines follow CPU activity
  All add to CPU power!

Validating with Optimizations
  Compare to "Optimizations vs Power" of [Seng & Tullsen]
  (Figure: "SPECINT AVE. Power vs gcc Optimizations". Average power [W], axis from 39 to 53, for GZIP, VPR and GCC under O0, O1, O2, O3, O3 unroll, O3 unroll ALL.)

Optimizations
  -O0: none at all
  -O1 -fomit-frame-pointer: thread-jumps, delayed-branches, defer-pop
  -O2 -fomit-frame-pointer: CSE-related blocks, jumps, expensive optimizations, reschedule instructions, etc.
  -O3 -fomit-frame-pointer: O2 + inline functions heuristically
  -O3 -fomit-frame-pointer -funroll-loops: only for loops whose number of iterations is known at compile/run time
  -O3 -fomit-frame-pointer -funroll-all-loops: done for all loops (usually a bad result)

GZIP power vs time
  (Figure: "Power for GZIP Optimizations". Power [W], 0 to 70, vs. time (s), 0 to 900, for O0, O1, O2, O3, O3unroll, O3unrollALL.)

GZIP power vs time
  All have similar power
  Exec. time(O0) ~ 2 x exec. time(all others)
  Different data sets provide different power profiles

3 specINT average Power
  (Figure: "SPECINT AVE. Power vs gcc Optimizations". Average power [W], axis from 39 to 53, for GZIP, VPR and GCC under O0, O1, O2, O3, O3 unroll, O3 unroll ALL.)

  Optimized code runs quicker, and yet with less average power
  specFP art seems to be the exception?

About the ripples
  Add ripple stuff here!

P4 Architecture vs Layout
  Components to model:
    1) Bus Control
    2) L2 Cache
    3) 2nd Level BPU
    4) ITLB & Ifetch
    5) L1 Cache
    6) MOB
    7) Mem Control
    8) DTLB
    9) Int EXE
    10) FP EXE
    11) Int RF
    12) FP RF
    13) Decode
    14) Trace $
    15) 1st Level BPU
    16) Microcode ROM
    17) Allocation
    18) Rename
    19) Inst-n Qs
    20) Schedule
    21) Inst-n Qs
    22) Retirement

Defining Components

Counter Rotations

Experiment Setup

  (Diagram labels: POWER SERVER, POWER CLIENT)

Component Breakdowns

THERMAL Basics
  Duality: heat flow <-> electrical flow
  Thermal mass (capacitance): $C_{th} = c \cdot A \cdot t$ [J/K]
    c: specific heat [J/m^3 K]
    A: block area [m^2]
    t: wafer thickness [m]
  Thermal resistance: $R_{th,norm} = \rho \cdot t / A$ [K/W]
    $\rho$: thermal resistivity [m.K/W]
    A: block area [m^2]
    t: wafer thickness [m]

Simplified Thermal Model
  Divide the CPU into component blocks
  Each block dissipates a different power, $P_{block}$, which reveals different temperature changes, $\Delta T_{block}$
  (Circuit diagram: each die block i, j, k, l has a node $T_{b,i}$ with power source $P_i$, capacitance $C_{th,i}$ and resistance $R_{th,i}$ to the heatsink node $T_h$; the heatsink has $C_{th,h}$ and $R_{th,h}$ and carries $P_i + P_j + P_k + P_l$.)
  Numerical values? See Quantitative Example
  $$P_i = \frac{T_{b,i} - T_h}{R_{th,i}} + C_{th,i}\,\frac{dT_{b,i}}{dt}$$
  Final difference equation:
  $$\Delta T_i = \frac{P_i\,\Delta t}{C_{th,i}} - \frac{T_i\,\Delta t}{C_{th,i} R_{th,i}}$$
  $\Delta t$: sampling interval
  $T_i$: the temperature difference between the block and the heatsink
  $\Delta t$ should be much smaller than the RC time constant, $\tau_{th,i}$

QUANTITATIVE EXAMPLE
  Use a t = 0.1 mm thinned wafer
  Areas given in the table ($c = 10^{6}$ [J/m^3 K] and $\rho = 10^{-2}$ [m.K/W])
  $\tau_{th} = R_{th} C_{th} = \rho\, c\, t^{2} = 10^{-4}$ s = 100 µs, independent of area!
  (Table of per-block values not recoverable; surviving entries: 2, 1.11, 4, 2.85, 100, 100, 100.)
  Temperature buildup for Regfile with $\Delta t$ = 133.4 ns:
  $$\Delta T_{blk}\ (\text{w.r.t. heatsink}) = \frac{P_{blk}\,\Delta t}{C_{th,blk}} - \frac{T_{blk}\,\Delta t}{R_{th,blk} C_{th,blk}}$$

THERMAL FORMULATION
  For any block i:
  $$P_i = \frac{T_{b,i} - T_h}{R_{th,i}} + C_{th,i}\,\frac{dT_{b,i}}{dt}$$
  Discretizing time:
  $$P_i = \frac{T_{b,i} - T_h}{R_{th,i}} + C_{th,i}\,\frac{\Delta T_{b,i}}{\Delta t}$$
  Define: $T_i = T_{b,i} - T_h$
  Assuming $T_h$ constant: $\Delta T_{b,i} = \Delta(T_{b,i} - T_h) = \Delta T_i$, so
  $$P_i = \frac{T_i}{R_{th,i}} + C_{th,i}\,\frac{\Delta T_i}{\Delta t}$$
  Final difference equation:
  $$\Delta T_i = \frac{P_i\,\Delta t}{C_{th,i}} - \frac{T_i\,\Delta t}{C_{th,i} R_{th,i}}$$
  $\Delta t$: sampling interval
  $T_i$: the temperature difference between the block and the heatsink
  $\Delta t$ should be much smaller than the RC time constant, $\tau_{th,i}$

Refined Thermal Model
  Steady-state analysis reveals that the heatsink-die abstraction is not sufficient for real systems
  Proceeding to a multilayer thermal model:
    Active die thickness
    Metalization/insulation
    Chip-package interface
    Package
    Heatsink
  Requires searching for several materials/dimensions and thermal properties
  Multiple layers -> multiple T nodes -> multiple DEs
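The single-node formulation (THERMAL Basics, QUANTITATIVE EXAMPLE and the final difference equation) can be exercised with a few lines of Python. This is a sketch under stated assumptions: the block area (0.25 mm^2) and the block power (10 W) are hypothetical values chosen for illustration, and the function names are not from the original simulator. The material constants, wafer thickness and the 133.4 ns step are the ones quoted on the slides.

```python
RHO = 1e-2          # thermal resistivity [m.K/W], from the slide
C_VOL = 1e6         # specific heat [J/(m^3.K)], from the slide
THICKNESS = 1e-4    # 0.1 mm thinned wafer [m], from the slide


def thermal_rc(area_m2, t=THICKNESS, rho=RHO, c=C_VOL):
    """R_th = rho * t / A and C_th = c * A * t for one block."""
    r_th = rho * t / area_m2
    c_th = c * area_m2 * t
    return r_th, c_th


def step(temp_diff, power, r_th, c_th, dt):
    """One update: T_i += P_i*dt/C_th - T_i*dt/(C_th*R_th)."""
    return temp_diff + power * dt / c_th - temp_diff * dt / (c_th * r_th)


def simulate(power, area_m2, dt, n_steps, temp_diff=0.0):
    """Apply the difference equation n_steps times, starting at the heatsink temperature."""
    r_th, c_th = thermal_rc(area_m2)
    for _ in range(n_steps):
        temp_diff = step(temp_diff, power, r_th, c_th, dt)
    return temp_diff


# Hypothetical block: 0.25 mm^2 dissipating 10 W, stepped at 133.4 ns.
r_th, c_th = thermal_rc(2.5e-7)
tau = r_th * c_th                          # rho * c * t^2, independent of area
final = simulate(10.0, 2.5e-7, 133.4e-9, 10_000)
```

Because the time constant is about 100 µs for every block, the temperature difference settles at P_i x R_th,i within roughly a millisecond of simulated time, which is why the step must stay far below 100 µs.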

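  (Figure: baseline heat removal structure)

Refined Thermal Model
  Need to define the physical structure: all the layers the heat flux propagates through
  Corresponding thermal model:
    Multinode
    Different assumptions/decisions
  Physical parameters for the different elements:
    Dimensions
    Material types
    $\rho_{th}$ and $c_{th}$
  New set of thermal update DEs

Physical Model vs. Thermal Model
  (Circuit diagram labels, as extracted: TA, R_hXA, Rh, Th, Ch, R_grXspr, Rspr, Tspr, Cspr, Rp,i, Tp,i, Ptotal, Cp,i, Rdie,i, Tdie,i, Pi, Cdie,i)

Analytical Derivation
  4 nodes -> 4 DEs
  1) $T_{spr}$:
  $$P_{total} = \frac{T_{spr} - T_h}{R_{spr+gr}} + C_{spr}\,\frac{dT_{spr}}{dt}$$
  Discretizing time:
  $$P_{total} = \frac{T_{spr} - T_h}{R_{spr+gr}} + C_{spr}\,\frac{\Delta T_{spr}}{\Delta t}$$
  $$P_{total}\,\Delta t - \frac{1}{R_{spr+gr}}\,(T_{spr} - T_h)\,\Delta t = C_{spr}\,\Delta T_{spr}$$
  Final difference equation:
  $$\Delta T_{spr} = \frac{P_{total}\,\Delta t}{C_{spr}} - \frac{1}{C_{spr} R_{spr+gr}}\,(T_{spr} - T_h)\,\Delta t, \qquad T_{spr} \leftarrow T_{spr} + \Delta T_{spr}$$

Analytical Derivation
  2) $T_h$:
  $$\Delta T_h = \frac{T_{spr} - T_h}{R_{spr+gr}}\,\frac{\Delta t}{C_h} - \frac{1}{C_h R_{h \to a}}\,(T_h - T_A)\,\Delta t, \qquad T_h \leftarrow T_h + \Delta T_h$$
  3) $T_{die,i}$:
  $$\Delta T_{die,i} = \frac{P_i\,\Delta t}{C_{die,i}} - \frac{1}{C_{die,i} R_{die,i}}\,(T_{die,i} - T_{p,i})\,\Delta t, \qquad T_{die,i} \leftarrow T_{die,i} + \Delta T_{die,i}$$
  4) $T_{p,i}$:
  $$\Delta T_{p,i} = \frac{T_{die,i} - T_{p,i}}{R_{die,i}}\,\frac{\Delta t}{C_{p,i}} - \frac{1}{C_{p,i} R_{p,i}}\,(T_{p,i} - T_{spr})\,\Delta t, \qquad T_{p,i} \leftarrow T_{p,i} + \Delta T_{p,i}$$

Temperature Updating and Initial Conditions
  D.E.s should be updated along the direction of current (power) flow: $T_{die,i} \to T_{p,i} \to T_{spr} \to T_h$
    Update $T_{die,i}$, then $T_{p,i}$, then $T_{spr}$, then $T_h$
  It is not reasonable to start from the ambient temperature as the initial condition: mostly, the processor is already running
  $T_A$ is given as ~50 °C by the Intel Thermal Design Guidelines
  Assume idle power (Ppro: ~2 W):
    $T_h = T_A + 2\,\mathrm{W} \cdot R_{h \to a} \approx 52$ °C
    $T_{spr} = T_h + 2\,\mathrm{W} \cdot R_{spr+gr} \approx 52$ °C
    $T_{p,i} = T_{die,i} = T_{spr} \approx 52$ °C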
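The four difference equations and the update order can be sketched for a single die block as follows. Everything numerical here is a hypothetical placeholder (the R and C values are not the Ppro parameters), the dictionary keys and function names are illustrative, and with only one block P_total equals P_i. The sketch also assumes that each node update uses the values already updated earlier in the same step, which is one reading of "updated along the direction of power flow".

```python
from dataclasses import dataclass


@dataclass
class Params:
    # All values below are hypothetical placeholders, in K/W and J/K.
    r_h_a: float = 0.5       # heatsink to ambient
    c_h: float = 50.0
    r_spr_gr: float = 0.2    # spreader + grease
    c_spr: float = 5.0
    r_p: float = 0.5         # package slice under the block
    c_p: float = 0.1
    r_die: float = 0.1       # die slice of the block
    c_die: float = 0.01
    t_ambient: float = 323.0  # ~50 C, in K


def idle_state(p, idle_power=2.0):
    """Initial conditions from the slide: start from idle power, not ambient."""
    t_h = p.t_ambient + idle_power * p.r_h_a
    t_spr = t_h + idle_power * p.r_spr_gr
    return {"die": t_spr, "p": t_spr, "spr": t_spr, "h": t_h}


def steady_state(power, p):
    """Temperatures at which all four deltas vanish (single block)."""
    t_h = p.t_ambient + power * p.r_h_a
    t_spr = t_h + power * p.r_spr_gr
    t_p = t_spr + power * p.r_p
    return {"die": t_p + power * p.r_die, "p": t_p, "spr": t_spr, "h": t_h}


def update(state, power, p, dt):
    """One pass in the order T_die -> T_p -> T_spr -> T_h."""
    s = dict(state)
    s["die"] += power * dt / p.c_die - (s["die"] - s["p"]) * dt / (p.c_die * p.r_die)
    s["p"] += (s["die"] - s["p"]) * dt / (p.r_die * p.c_p) - (s["p"] - s["spr"]) * dt / (p.c_p * p.r_p)
    s["spr"] += power * dt / p.c_spr - (s["spr"] - s["h"]) * dt / (p.c_spr * p.r_spr_gr)
    s["h"] += (s["spr"] - s["h"]) * dt / (p.r_spr_gr * p.c_h) - (s["h"] - p.t_ambient) * dt / (p.c_h * p.r_h_a)
    return s
```

A quick sanity check on such a model is that the analytically computed steady state is a fixed point of `update`, and that raising the power above idle makes the die node warm up first.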

Steady State Solution
  Final difference equation:
  $$\Delta T_i = \frac{P_i\,\Delta t}{C_{th,i}} - \frac{T_i\,\Delta t}{C_{th,i} R_{th,i}}$$
  Steady state solution: $\Delta T_i = 0$, so
  $$\frac{1}{C_{th,i}}\left(P_i - \frac{T_{i,ss}}{R_{th,i}}\right)\Delta t = 0 \quad\Rightarrow\quad T_{i,ss} = P_i\,R_{th,i} = \frac{P_i\,\rho_{th}\,t}{A_i}$$
  If $R_{th,i} \to R_{th,i} \times 20$, then $T_{ss,i} \to T_{ss,i} \times 20$
  Regfile ex. of presentation 1: $P_i = 10$ and $R_{th,i} = 4$, so $T_{i,ss} = 40$ K
  Numerically for decode: $T_{i,ss,dec} = 5 \cdot 3 \cdot 10^{-2} = 0.15$ K

EX: Ppro Thermal Model
  Use CASTLE-computed component powers
  Select the thermal sampling interval
  Determine component areas from the die photo
  Determine processor/packaging physical parameters
  Generate the numerical thermal model
  Apply the component difference equations recursively

Simulation
  $\rho$ and c values hardcoded for materials (except Si)
  Areas/relative areas hardcoded for components
  Individual R and C computed for components
  The D.E. loop is re-executed every $\Delta t$, in the discussed order
  Updated thermal nodes displayed every t ~ 20 ms
  Component temperatures build up to ~350 K in ~5 hrs
  Clock temperature shoots up
  T_heatsink moves very slowly, as expected
  For the complete set of computed numerical simulation results, go to the additional slides

Simulation Outputs at Startup

Simulation Outputs After 5 hrs

Thermal Model Parameters
  BASELINE AMBIENT TEMPERATURE: T_ambient = 323; /* in K */ (Intel Thermal Design Guidelines)
  SAMPLING INTERVAL: dt = 5e-6 s (I choose)
  Processor-specific parameters

Physical Parameters
  15% of the heatsink area has fins, 85% doesn't
  Overall $R_{th}$ estimate: $R_{fin} \parallel R_{nofin}$

Physical Parameters
  Temperature assumed uniform along the heat spreader and, therefore, above it

Physical Parameters
  We don't use total R and C for the package, as it's decomposed into component areas in the model
  DIE: process info scaled from P4 data in [7] using ITRS 1999 & 2001 and interpolating MPU pitch vs. wire pitch
    Metal layer and isolation scale factor: 2.15
    ITRS FEP: Si final device thickness ~100 nm (130 nm tech.); I used the overall wafer thickness
    Temperature-dependent Si: $\mathrm{Si}(T) = 1.5486 \cdot 10^{2} \cdot (300/T)^{4/3}$

Physical Parameters
  DIE $R_{th}$ estimate: $R_{die} = R_{Si} + R_{metal} + R_{poly} + R_{SiO2}$
  For 10% die area:
    $R_{Si}$ ~ 0.1 K/W
    $R_{metal}$ ~ 0.0008 K/W
    $R_{poly}$ ~ single layer, ignorable
    $R_{SiO2}$ ~ 0.86 K/W
  $R_{die}$ ~ $R_{Si} + R_{SiO2}$
  DIE $C_{th}$ estimate: only Si considered, as the rest is much thinner

Numerical Values

Computed Thermal values

Computed Thermal v.2 values
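The temperature-dependent Si expression from the Physical Parameters slides can be turned into a resistance estimate for a silicon slab. The sketch below rests on two assumptions that the slide text does not state: that the expression is a thermal conductivity in W/(m.K) (its value at 300 K, about 155, is typical for silicon), and that the slab resistance is thickness / (conductivity x area). The 0.3 mm thickness is a hypothetical value; the area is 10% of the 196 mm^2 Ppro die.

```python
def si_conductivity(temp_k):
    """1.5486e2 * (300/T)^(4/3); assumed to be in W/(m.K)."""
    return 1.5486e2 * (300.0 / temp_k) ** (4.0 / 3.0)


def slab_resistance(thickness_m, area_m2, temp_k):
    """R_th of a uniform slab: thickness / (conductivity * area), in K/W."""
    return thickness_m / (si_conductivity(temp_k) * area_m2)


# Hypothetical slab: 0.3 mm thick, covering 10% of a 196 mm^2 die.
AREA_M2 = 0.10 * 196e-6
r_cool = slab_resistance(3e-4, AREA_M2, 300.0)
r_warm = slab_resistance(3e-4, AREA_M2, 350.0)
```

Since the conductivity falls as the die warms up, the same slab presents a higher thermal resistance at 350 K than at 300 K, which is why Si is the one material whose value cannot simply be hardcoded.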

Ppro info & Areas
  Complete processor info ([4],[5],[6]):
    200 MHz
    4 metal layers
    Package: 387-pin DC-PGA
    Package size: 6.76 cm x 6.25 cm
    0.35 µm BiCMOS
    Die size: 196 mm^2 (14 x 14)
  Area estimates for the die: scale component areas from [1]:
    [1]: 150 MHz, 0.50 µm, die size 306 mm^2
    Ours: 200 MHz, 0.35 µm, 196 mm^2
  I use x0.64 area scaling and [1]'s breakdowns for component area estimates

Component Areas
  Close to Intel data:
  (Die-photo labels, component area percentages: 14.3%, 2.5%, 4.6%, 4.1%, 2.2%, 8.6%, 7.6%, 4.2%, 3.9%, 1.3%, 4.4%, 11.8%, 7.9%, 4.0%)
  Clock area found from Intel data as: $A_{clk} = P_{clk} / \mathrm{PwrDensity}_{clk}$ = 1.7%
  These areas cover ~81.3% of the die

CASTLE Breakdown Areas
  We need to convert the given areas to CASTLE components:
    DECODE: ID + MIS = 11.7%
    ISSUE: RS = 7.6%
    REORDER: RAT + (ROB & RRF) = 8.6%
    DMEM: DCU = 8.6%
    IMEM: IFU = 11.8%
    FUNC_UNIT: AGU + IEU + FEU = 10%
    OTHER: 100 - above = 41.7%
    CLOCK: 1.7%
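The area bookkeeping above is simple arithmetic, and a short sketch makes it easy to check. The percentages and die sizes are the ones quoted on the slides; the dictionary layout and function names are illustrative assumptions.

```python
DIE_AREA_MM2 = 196.0        # Ppro die size quoted above
REFERENCE_DIE_MM2 = 306.0   # die size of the 0.50 µm part in [1]

# CASTLE component areas as a percentage of the die (OTHER and CLOCK handled separately).
CASTLE_AREA_PERCENT = {
    "DECODE": 11.7,
    "ISSUE": 7.6,
    "REORDER": 8.6,
    "DMEM": 8.6,
    "IMEM": 11.8,
    "FUNC_UNIT": 10.0,
}


def other_percent(parts):
    """OTHER = 100 - sum of the explicitly mapped components."""
    return 100.0 - sum(parts.values())


def area_mm2(percent, die_mm2=DIE_AREA_MM2):
    """Absolute block area from its share of the die."""
    return die_mm2 * percent / 100.0


def die_scale_factor(old_mm2=REFERENCE_DIE_MM2, new_mm2=DIE_AREA_MM2):
    """Ratio of the two die sizes."""
    return new_mm2 / old_mm2
```

The ratio of the two die sizes, 196/306, comes out at about 0.64, matching the area scaling factor used above, and the mapped components leave 41.7% for OTHER.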

CASTLE
  Power measurement/profiling tool
  Developed by Prof. Martonosi and Russ
  Implemented on a P6, Linux
  Generates power profiles for benchmarks at runtime
  Uses performance counters to gather utilization information
  Uses WATTCH's per-usage wattage values for max power values ([8] p.3)
  Uses heuristics to extract usage counts for blocks
  Uses register sampling to compute activity factors for single-ended bitlines
  Computes total processor power
  Uses a digital multimeter for validation

Performance Counters
  Exist on most new processors
  Mainly used to track performance-related events: cache misses, committed instructions, etc.
  Can be used to gather power-related data
  P6 has 2 performance counters that count 77 events
  Can be accessed with:
    RDMSR (Read Model Specific Register)
    WRMSR (Write Model Specific Register)
    RDTSC (Read Time Stamp Counter)
    Kernel-level (Ring 0) instructions
  Example events:
    0. TSC: elapsed machine cycles
    03. 03H: L1 read misses
    44. C0H: instructions retired

Heuristics
  To extract power-related data from performance counters
  Platform dependent!

CASTLE implementation
  Platform: P6, 200 MHz | Linux kernel v2.2.16-3
  (Diagram labels: Xmultimeter server, server code, client code, kernel code, series resistance, HW counters)

CASTLE Filesystem: User Code (client code)
  Client: includes cpu-monitor and cpu-network
  cpu-monitor:
    Provides the X windows for the power breakdown bar graphs
    Acquires power breakdowns from cpu-network
  cpu-network:
    Connects to the server side through Ethernet
    Gets event counts and the number of elapsed cycles for each tracked event
    Constructs component power values from event data using heuristics

CASTLE Filesystem: User Code (Xmultimeter server, series resistance)
  Multimeter:
    The real multimeter reads the voltage over the series R and sends it over RS232
    Xmmeter reads the serial port and converts the voltage reading into power as: P = (Vread / Rs) . Vdd
    An X window displays the readings

CASTLE Filesystem: User Code (server code)
  Server:
    Reads the performance counts every second with the syscall getglobaleventcount, defined in the kernel code
    Acquires event counts and elapsed cycles for all events
    Sends the event and cycle data to the client as a stream of chars

CASTLE Filesystem: Kernel Code
  Required to access the counters
  Scattered in:
    /usr/src/linux/arch/i386/kernel/entry.S
    /usr/src/linux/include/linux/sched.h
    /usr/src/linux/kernel/fork.c
    /usr/src/linux/kernel/sched.c
  Defines 2 new system calls:
    Geteventcount
    Getglobaleventcount
  Accesses the counters, gets counter and cycle data
  The syscall returns the event and cycle counts to the server as a 2D array

CASTLE Details
  In the CASTLE code, 12 distinct events are defined
  From [1] and [8], 10 of the events are used:
    instructions decoded
    instructions executed
    instructions retired
    floating point operations executed
    branches retired
    branches decoded
    L1 instruction cache accesses
    L1 data cache accesses
    L2 unified cache accesses
    main memory requests
  [1] and [8] suggest a 10 ms sampling period
  Probe-server samples the counters every second

Power Breakdown Components
  CASTLE tracks 12 events
  Develops power breakdowns for 8 units: DECODE, ISSUE, REORDER, DMEM, IMEM, FUNC_UNIT, OTHER, CLOCK
  Component powers are recomputed every second in cpu-network

Thermal Modeling with CASTLE
  The thermal model requires only power and sampling time information
  The thermal model can be added at user level by:
    extending cpu-network for temperature updates
    extending cpu-monitor for a new thermal X window
  A pitfall lies in the sampling period: the sampling time should be smaller than the time constant for reliable modeling (<< 100 µs)
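The sampling-period pitfall can be made concrete. The explicit update T += P*dt/C - T*dt/(R*C) multiplies the previous temperature by (1 - dt/(R*C)) at every step, so it only stays bounded when dt is below twice the time constant. The sketch below compares stepping directly at the 1 s counter sampling period with holding each power sample constant and sub-stepping at 5 µs. The R and C values (time constant 100 µs) and the 10 W sample are hypothetical, and the function name is illustrative.

```python
def simulate(power, r_th, c_th, dt, duration, temp_diff=0.0):
    """Hold `power` constant for `duration` seconds and apply the explicit update every dt."""
    n_steps = max(1, round(duration / dt))
    for _ in range(n_steps):
        temp_diff += power * dt / c_th - temp_diff * dt / (c_th * r_th)
    return temp_diff


R_TH = 4.0       # hypothetical block resistance [K/W]
C_TH = 2.5e-5    # hypothetical block capacitance [J/K]; R*C = 100 µs
POWER = 10.0     # hypothetical power sample [W]

# Stepping at the 1 s sampling period: dt/(R*C) = 10^4, the update blows up.
coarse = simulate(POWER, R_TH, C_TH, dt=1.0, duration=3.0)

# Sub-stepping at 5 µs while the sample is held: settles at P*R = 40 K.
fine = simulate(POWER, R_TH, C_TH, dt=5e-6, duration=2e-3)
```

One way to reconcile the one-second counter samples with the thermal time constant is therefore to treat each component power as constant over the sample and run many small thermal steps per sample inside the user-level update.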
