
Buffered Compares: Excavating the Hidden Parallelism inside DRAM Architectures with Lightweight Logic
Jinho Lee, Kiyoung Choi, and Jung Ho Ahn
Seoul National University

Outline
- Introduction
- Our Approach
- Buffered Compare Architecture
- Evaluation
- Summary

Introduction - Memory Wall
- The number of cores in a chip is increasing
- The memory bandwidth is not increasing as much --> the memory wall problem
- Emerging big-data applications require even more bandwidth
- Actually, much of the bandwidth is wasted!
[Figure: CPU connected to memory over a narrow link]

Introduction - Table Scan
- Which items are made out of wood?
- Which items are heavier than 5 kg?

  Item#  Material  Weight
  A      Wood      10 kg
  B      Metal     1.5 kg
  C      Metal     7 kg
  D      Stone     3 kg
  E      Wood      2 kg

Introduction - Table Scan
[Figure: the core holds the search key and the comparator; the table data D0, D1, D2, D3, ... stream from DRAM to the core]

- Data are read from DRAM and the comparisons are done at the core
- We only need the result --> waste in bandwidth!

Introduction - Table Scan
[Figure: the key is sent into DRAM, the compare is done in memory, and only the result returns to the core]
- Do the compare within the memory
- Only two transfers are needed instead of many
- Essentially a PIM (processing-in-memory) approach
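The bandwidth waste above can be made concrete with a plain host-side scan: every key crosses the off-chip link even though only a one-bit match result per item is needed. A minimal C sketch (function and variable names are illustrative, not from the paper):

```c
#include <stdint.h>
#include <stddef.h>

/* Host-side table scan: all n keys (8 bytes each) cross the memory
 * bus, but the useful output is only one match bit per item. */
size_t table_scan(const int64_t *keys, size_t n, int64_t search_key,
                  uint8_t *match /* out: one result per item */)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {   /* n * 8 bytes of bus traffic */
        match[i] = (keys[i] == search_key);
        hits += match[i];
    }
    return hits;                       /* useful information: ~n bits */
}
```

With the compare done in memory, only the key and the packed match bits would need to cross the link.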

Introduction - PIM
- PIM research was active from the late '90s to the early '00s
  - EXECUBE, IRAM, FlexRAM, Smart Memories, Yukon, DIVA, etc.
  - Multiple cores in DRAM --> hard to integrate
- Regaining interest for various reasons
  - Big-data workloads
  - Limited improvement of processor speed
  - Limited improvement of memory bandwidth
  - 3D-stacked memory (HMC, HBM, etc.)

Introduction - PIM
- PIM with 3D-stacked memory
[Figures: PEI (PIM-enabled instructions) [J. Ahn et al., ISCA 2015] — an out-of-order host with L1/L2/last-level caches, PCUs, and a PIM directory with a locality monitor over HMCs; Tesseract [J. Ahn et al., ISCA 2015] — in-order cores with list/message-triggered prefetchers and message queues on a crossbar network of HMC vaults]

Our Approach - DRAM Architecture & Motivation

[Figure: DRAM chip organization — banks on an internal shared bus, global/local row decoders, mats of 512 x 512 cells, local sense amps (row buffer), global datalines, global sense amps (bank I/O), and the chip I/O to the off-chip link; one row in one bank is activated]
- A single chip is comprised of 8-16 banks
- When accessing data, a row in a bank is activated and stored in a row buffer
- A cache line (64B) is fetched in one burst

Our Approach - DRAM Architecture & Motivation

[Figure: the same DRAM chip with one bank active and the other banks idle ("Zzz..")]
- One bank can fill up the bandwidth of the off-chip link
- Since the time required to activate a row is very long, multiple banks are used
- We therefore have 8x-16x internal bandwidth
- Most of the internal bandwidth is wasted

Our Approach - DRAM Architecture & Motivation

[Figure: the DRAM chip with a compute unit attached to every bank]
- Compute inside each bank to utilize the excess bandwidth

Our Approach - Goal
- Utilize the unused internal bandwidth
- Minimal area overhead to DRAM
- Less invasive to the existing ecosystem (i.e., leave the DDR3/4 protocol intact as much as possible)

Our Approach - Goal

[Figure: DRAM chip organization, as before]
1. All PIM operations have deterministic latency
   - All DRAM commands (ACT, RD, PRE, ...) have pre-determined latencies
   - DDR protocols have no mechanism for the memory to signal processors
   - No branching, caching, or pipelining allowed
   - Preserves the existing DDR interface and keeps the logic lightweight

Our Approach - Goal
[Figure: DRAM chip with a single row activated in one bank]
1. All PIM operations have deterministic latency
2. Single-row restriction

   - Inter-bank communication is expensive
   - Activating other rows incurs additional overhead
   - The restriction allows the bank I/O to be used as an operand register

Our Approach - What to compute with PIM?
- We focus on compare-n-op patterns over a long range of data
[Figure: DRAM holding D0, D1, D2, ..., DN, each compared against a key]

Our Approach - What to compute with PIM?
- Compare-n-read: returns the match result for each item
  - e.g., Result: (=, <, =, ..., >)

Our Approach - What to compute with PIM?
- Compare-n-select: returns the min/max among the items
  - e.g., Max: (D7)

Our Approach - What to compute with PIM?
- Compare-n-increment: increments the matching items
  - e.g., the key K2 matches, so V2 is incremented
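The three compare-n-op patterns above have simple functional semantics. A hedged C sketch of what each primitive computes — plain loops standing in for the in-bank logic, with illustrative names:

```c
#include <stdint.h>
#include <stddef.h>

/* compare-n-read: a match result per item against the key */
void cmp_n_read(const int64_t *d, size_t n, int64_t key, uint8_t *res) {
    for (size_t i = 0; i < n; i++)
        res[i] = (d[i] == key);
}

/* compare-n-select: the index of the max item (min is symmetric) */
size_t cmp_n_select_max(const int64_t *d, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (d[i] > d[best]) best = i;
    return best;
}

/* compare-n-increment: bump the value of items whose key matches */
void cmp_n_increment(const int64_t *k, int64_t *v, size_t n, int64_t key) {
    for (size_t i = 0; i < n; i++)
        if (k[i] == key) v[i]++;
}
```

In Buffered Compare these loops run inside each bank over a row-sized range, so only the key goes in and only the small result comes out.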

Buffered Compare Architecture
[Figure: a bank augmented with a key buffer, arithmetic unit, result queue, and command generator (CGEN) next to the bank I/O, connected to the internal shared bus]
- Key buffer: holds a value written by the processor
- Arithmetic unit: performs computation (cmp, add, etc.) using the bank I/O and the key buffer as operands
- Result queue: stores the compare results
- CGEN: repeats the bank-local commands
- The datapath is 64 bits wide
- 0.53% overhead in DRAM area

Buffered Compare Architecture

[Figure: block diagram — the mats connect through the bank I/O to a 64-bit masked datapath feeding the key buffer, arithmetic unit, command generator, and result queue, which connects to the internal shared bus]

Buffered Compare Architecture - Compare-n-read
1. A DRAM row is activated and the data become ready
2. The host writes the search key to the key buffer
3. 64B of data are read to the bank I/O
4. The comparison is performed on the arithmetic unit, and the results are queued
5. Repeat for the determined range, using the command generator
6. The results are sent to the host

Buffered Compare Architecture - Problems and Solutions
Problems
- Virtual addresses cannot be handled
  - Physical addresses must be used, or virtual addresses must be translated within the DRAM
- Cache coherence
  - The processor cache and the DRAM have to be coherent
Solutions

- Direct segment with a non-cacheable region
  - Keep base, limit, and offset registers for a large memory segment
  - Translation can be done by simple additions
  - Data are kept non-cacheable within the segment

Buffered Compare Architecture - Problems and Solutions
Problems
- Data placement
  - A 64-bit word is distributed over the multiple chips of a rank, interleaved in units of 8 bits, but we need the whole word in one chip
    (Chip 0: A0 B0, Chip 1: A1 B1, ..., Chip 7: A7 B7 — word A is distributed)
Solutions
- Use word interleaving within the segment
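The direct-segment translation described above reduces to one bounds check plus one addition, which is why the required logic stays trivial. A minimal C sketch (struct and field names are hypothetical):

```c
#include <stdint.h>
#include <stdbool.h>

/* Direct segment: one contiguous region mapped by three registers
 * instead of page tables; data in it are kept non-cacheable. */
struct direct_segment {
    uint64_t base;    /* first virtual address of the segment */
    uint64_t limit;   /* one past the last virtual address    */
    uint64_t offset;  /* physical minus virtual displacement  */
};

/* Translate a virtual address; false means "outside the segment",
 * i.e., fall back to normal paging. */
bool ds_translate(const struct direct_segment *seg,
                  uint64_t vaddr, uint64_t *paddr)
{
    if (vaddr < seg->base || vaddr >= seg->limit)
        return false;
    *paddr = vaddr + seg->offset;     /* a single addition */
    return true;
}
```

A range of virtual addresses handed to a buffered-compare command can thus be translated without any in-DRAM TLB.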

  (Chip 0: A0 A1, Chip 1: B0 B1, ..., Chip 7: H0 H1 — each word stays within one chip)
- Critical-word-first is disabled within the segment

Buffered Compare Architecture - Programming Model
[Figure: work items, one CMP per item over the target data, mapped onto cores]

- SW code:

  __kernel void search(__global int *keys, int searchkey, __global int *d) {
      int id = get_global_id(0);
      if (keys[id] == searchkey)
          d[id] = 1;
  }

- Instruction: BC_cmp_read(searchkey, keys, N)
- The memory controller issues the DRAM command CMP_RD(searchkey, addr, range) to the DRAM banks
- OpenCL-based programming model
- Programmers need not be aware of DRAM parameters (page size, number of banks, ...)

Evaluation - Setup
- McSimA+ simulator
- Processor

  - 22nm, 16 cores running at 3 GHz
  - 16KB private L1, 32MB S-NUCA L2
  - Directory-based MESI coherence
- Memory
  - 28nm DDR4-2000, 4 ranks per channel, 16 banks per chip
  - PAR-BS (parallelism-aware batch scheduling)

Evaluation - Setup
- Six workloads
  - TSC: in-memory linear scan (column-store)
  - TSR: in-memory linear scan (row-store)
  - BT: B+ tree traversal (index scan)
  - MAX: MAX aggregation
  - SA: sequence assembly
  - KV: key-value store
- BC was evaluated against the baseline and AMO (Active Memory Operation)

Evaluation - Speedup

[Chart: speedup of AMO and BC over the baseline across the workloads]
- BC performs 3.62x better than the baseline

Evaluation - Bandwidth Usage
[Charts: external bandwidth (GBps) and internal bandwidth (GBps) for baseline, AMO, and BC across TSC, TSR, MAX, BT, KV, SA, and geomean]

- BC utilizes more than 8.64x the internal bandwidth on geomean

Evaluation - Sensitivity
- Usually, the more aggregate banks, the more speedup
- Sometimes introducing more ranks degrades the speedup

Experimental Result
[Chart: normalized energy, split into processor and memory parts, for baseline, AMO, and BC across the workloads]

- Energy consumption is reduced by 73.3% on average (processor: 77.2%, memory: 43.9%)

Summary
- We proposed Buffered Compare, a processing-in-memory approach that utilizes the internal bandwidth of DRAM
  - Minimal overhead to the DRAM area
  - Less invasive to existing DDR protocols

  - 3.62x speedup and 73.3% energy reduction
- Limitations
  - Restricted to a single large segment
  - When using x4 devices, only operands up to 32 bits are supported

The End
Thank you!
