Deferred Runtime Pipelining for contentious multicore ...

Deferred Runtime Pipelining for contentious multicore ...

Deferred Runtime Pipelining for contentious multicore transactions Shuai Mu Sebastian Angel Dennis Shasha 1 Multicore programming today Dangerous atomics Confusing semaphores Too many locks

2 Yet developers use transactions to interact with databases and distributed systems Simpler, less error-prone, efficient enough... 3 What keeps multicore devs in the dark ages? multicore txs have very high overheads (historically) 4 Three main points in this talk STM is not crazy expensive anymore

Recent proposals (STO [EuroSys 16], TDSL [PLDI 16]) use type information Competitive with fine-grained locking for many workloads DB community has for decades leveraged workload knowledge We extend STO to support more workloads efficiently New concurrency control protocol called DRP DRP is inspired by work in DB, but avoids static analysis Can handle arbitrary transactions defined at runtime 5 Three main points in this talk STM is not crazy expensive anymore DB community has for decades leveraged workload knowledge We extend STO to support many more workloads efficiently

6 Software Transactional Objects [EuroSys 16] void transfer(TArray& bal, TBox& num, int src, int dst, int amt) { TRANSACTION { Begin transaction bal[src] = bal[src] - amt; bal[dst] = bal[dst] + amt; Read operations num = num + 1;

} RETRY(true); } Write operations Try to commit STO uses OCC (or TL2) to execute transactions 1. 2. 3. 4. Reads performed without locks and writes buffered locally Locks are acquired on all objects in write set Reads are certified: check values read are still valid (abort otherwise) Install writes and release locks

7 STO inherits the performance profile of OCC no aborts + low overhead = life is good we want to be here! many aborts = wasted work throughput workload contention (probability of conflicts) 8 Three main points in this talk STM is not crazy expensive anymore

DB community has for decades leveraged workload knowledge We extend STO to support many more workloads efficiently 9 Transaction chopping [Shasha et al., TODS 95] Tx 1 Tx 2 R; W; W; R; R; R; W; R; W; Assume they conflict Static analysis Tx 1

Tx 2 W; R; W; R; R; R; W; R; W; Time R; W; W; R; R; R; W; R; W; W; R; W; R; R; R; W; R; W; Time 10 Runtime Pipelining [Xie et al., SOSP 15] Static analysis + runtime checks finer chopping more concurrency under contention

Tx 1 R; W; W; R; R; R; W; R; W; Tx 2 W; R; W; R; R; R; W; R; W; Time 11 Runtime Pipelining is good at high contention

Runtime Pipelining (RP) throughput OCC workload contention (probability of conflicts) 12 Problem: hard to port RP to STO Transactions are defined at runtime and may have unknown read/write sets void credit(TArray& bal, int clients, int amt) { TRANSACTION { for (int i = 0; i < clients; i++) { bal[i] = bal[i] + amt; } }

} 13 Three main points in this talk STM is not crazy expensive anymore DB community has for decades leveraged workload knowledge We extend STO to support many more workloads efficiently 14 Strawman Give each object a unique rank (for example its memory address) Acquire locks in increasing order of rank

Release locks when all operations on a rank are done Locks currently held TRANSACTION { bal[src] = bal[src] - amt; bal[dst] = bal[dst] + amt; } // rank 1 // rank 2 15 Strawman Give each object a unique rank (for example its memory address) Acquire locks in increasing order of rank Release locks when all operations on a rank are done

Locks currently held TRANSACTION { bal[src] = bal[src] - amt; bal[dst] = bal[dst] + amt; } // rank 1 // rank 2 1 16 Strawman Give each object a unique rank (for example its memory address) Acquire locks in increasing order of rank

Release locks when all operations on a rank are done But what if program order, control flow, orLocks data currently held TRANSACTION dependencies { disagree with ranks? bal[src] = bal[src] - amt; bal[dst] = bal[dst] + amt; // rank 1 // rank 2 2

} Tx 1: Tx 2: R; W; R; W; R; W;

R; W; 17 Strawman Give each object a unique rank (for example its memory address) Acquire locks in increasing order of rank Release locks when all operations on a rank are done Locks currently held TRANSACTION { bal[dst] = bal[dst] + amt; bal[src] = bal[src] - amt; } // rank 2

// rank 1 2 18 Strawman Give each object a unique rank (for example its memory address) Acquire locks in increasing order of rank Release locks when all operations on a rank are done Locks currently held TRANSACTION { bal[dst] = bal[dst] + amt; bal[src] = bal[src] - amt; } Cannot abort either!

// rank 2 // rank 1 cannot lock 1 (not in rank order!) This is an in-place update to the shared state (not a local update as in OCC) And we have already release its lock. oops. 19 Deferred Runtime Pipelining (DRP) During execution: Asynchronously read operations + Buffer write logic Similar to lazy evaluation

Similar to OCC (but for logic instead of values) After call to commit: Acquire locks in rank order + Enqueue write logic onto objects in a pipeline Similar to RP (but enqueues logic instead of performing operations) 20 Example of DRP (actual syntax is less verbose) TRANSACTION { auto bal_dst = bal.async_read(dst);

bal.enqueue(dst, { bal_dst + amt }); auto bal_src = bal.async_read(src); bal.enqueue(src, {bal_src amt}); Locks currently held // rank 2 // rank 2 // rank 1 // rank 1 } Thread-local state Shared global state

bal[dst]: ({bal_dst + amt}) 21 Example of DRP (actual syntax is less verbose) TRANSACTION { auto bal_dst = bal.async_read(dst); bal.enqueue(dst, { bal_dst + amt }); auto bal_src = bal.async_read(src); bal.enqueue(src, {bal_src amt}); Locks currently held // rank 2 // rank 2 // rank 1

// rank 1 } Thread-local state Shared global state bal[dst]: ({bal_dst + amt}) bal[src]: ({bal_src - amt}) 22 Example of DRP (actual syntax is less verbose) TRANSACTION {

auto bal_dst = bal.async_read(dst); bal.enqueue(dst, { bal_dst + amt }); auto bal_src = bal.async_read(src); bal.enqueue(src, {bal_src amt}); Locks currently held // rank 2 // rank 2 // rank 1 // rank 1 } We know write set! Thread-local state

Shared global state bal[dst]: ({bal_dst + amt}) bal[src]: ({bal_src - amt}) 23 Example of DRP (actual syntax is less verbose) TRANSACTION { auto bal_dst = bal.async_read(dst); bal.enqueue(dst, { bal_dst + amt }); auto bal_src = bal.async_read(src); bal.enqueue(src, {bal_src amt}); Locks currently held

// rank 2 // rank 2 1 // rank 1 // rank 1 } Thread-local state bal[dst]: ({bal_dst + amt}) Shared global state append

bal[src]: ({bal_src - amt}) bal[src]: ({bal_src - amt}) 24 Example of DRP (actual syntax is less verbose) TRANSACTION { auto bal_dst = bal.async_read(dst); bal.enqueue(dst, { bal_dst + amt }); auto bal_src = bal.async_read(src); bal.enqueue(src, {bal_src amt}); Locks currently held // rank 2

// rank 2 2 // rank 1 // rank 1 } Thread-local state bal[dst]: ({bal_dst + amt}) bal[src]: ({bal_src - amt}) Shared global state append

bal[src]: ({bal_src - amt}) bal[dst]: ({bal_dst + amt}) 25 Wait a second This requires the developer to rewrite transactions... And some txs cant be expressed with async reads + deferred writes 26 DRP also supports legacy transactions Legacy transactions execute with an OCC-ish protocol New protocol allows async + legacy transactions to coexist

Cool and very efficient mechanism. See paper. DRP automatically detects if a transaction is legacy or not! Read set of legacy tx = not empty Read set of async transaction = empty (only promises) 27 Evaluation questions (more in the paper) How does DRP perform on standard benchmarks? How does DRP perform as contention varies? 28 How does DRP perform on standard

benchmarks? Silo [SOSP 13] multicore database ported to STO and DRP TPC-C benchmark new order, payment, delivery, stock level, order status STAMP: suite of multicore applications See paper or ask me about it 29 TPC-C workload Runtime Pipelining (RP) throughput OCC

STO workload contention (probability of conflicts) (transaction chopping) 30 Evaluation questions (more in the paper) How does DRP perform on standard benchmarks? How does DRP perform as contention varies? 31

TPC-C with varying contention (32 threads) STO (transaction chopping) TPC-C has 5 transaction types. We only make 3 of them async. The other 2 types abort a lot. Workload contention increases 32 Summary DRP expands the workloads that can benefit from using STO DRP works transparently with async and legacy transactions DRP guarantees opacity and avoids deadlock and aborts

33 STAMP Results (32 threads) STO uses workload-specific optimizations called predicates (DRP does not implement these) 34 Impact of async txs at high contention Percentage of async transactions 35

Actual syntax void transfer(TArray& bal, TBox& num, int src, int dst, int amt) { TRANSACTION { // the next two lines explicitly use deferred interfaces to buffer intentions Intention* bal_src = bal.defer_at(src); bal.defer_update(src, new Intention([&](int& val){ return bal_src->result - amt; }, {bal_src})); // the next three lines use syntactic sugar based on C++'s operator // overloading and implicit type conversion to achieve the same effects auto bal_dst = bal[dst]; bal[dst] = bal_dst + amt; num += 1; }

} 36 Lots more in the paper DRP guarantees opacity and deadlock freedom DRP can be implemented incrementally Details on how to implement DRP into STO with low overhead 37 Avoiding rank mismatch If we know read/write set Acquire all locks ahead of time (predeclaration locking) If you ever need to acquire a lock out of rank order, acquire all prior locks

Ask programmer to predefine accesses ACCESS([&bal[dst], 2], [&bal[src], 2], [&num, 1]); TRANSACTION { bal[dst] = bal[dst] + amt; // rank 2 bal[src] = bal[src] - amt; // rank 1 num = num + 1; // rank 3 } This is a nightmare: error-prone, hard to reason about, etc. 38

Recently Viewed Presentations

  • Diapositive 1 - 16.ticfga.ca

    Diapositive 1 - 16.ticfga.ca

    Classe dans les locaux du CJLRS et soutien individuel adapté CFP CÉA Le Moyne-D'Iberville CÉA Antoine-Brossard Comité local de Longueuil 6 projets Parent scolaire Depuis janvier 2005 CFP PARTENAIRES CSSS Pierre-Boucher CJE Longueuil Sécurité du revenu, direction régionale RESPONSABLE CÉA...
  • Come and See Training for Governors A new

    Come and See Training for Governors A new

    Come and See Training for Governors A new Religious Education programme for Catholic Primary Schools. Like Before you Begin this offers an opportunity for reflection at adult level.
  • National Average Temperature (Annual) 56 55 54 53

    National Average Temperature (Annual) 56 55 54 53

    Times New Roman Wingdings Default Design Microsoft Excel Chart PowerPoint Presentation PowerPoint Presentation ...
  • Diapositiva 1 - LUISS Guido Carli

    Diapositiva 1 - LUISS Guido Carli

    The governors have the means of governing, since rationalized parliamentarianism ensures the stability and power of the majority bloc; The governors are effectively answerable to the governed, since the latter always have an alternative solution, at the next election, if...
  • Tips for the fsa writing component - Teague Middle School

    Tips for the fsa writing component - Teague Middle School

    problem/solution. main idea and supporting details. ... Use the planning page to plan your essay - use PEEL-EL for each body paragraph. ONLY USE INFORMATION FROM THE TEXTS. ... Tips for the fsa writing component Last modified by: Struck, Karen...
  • Foal Neonatal care - Marsha Brantley

    Foal Neonatal care - Marsha Brantley

    Parturition and Foal Neonatal Care ... Surgical correction Therapeutic trimming Axial- off set knees No treatment, undesirable in race horses, can lead to unsoundness Rotational- toed out Self correction Muscular development of thoracic muscles Spiral - toed in Fetlock is...
  • Tunneling Devices - University of Notre Dame

    Tunneling Devices - University of Notre Dame

    EE666 - Advanced Semiconductor Devices Tunneling Devices Dane Wheeler April 19, 2005 Outline Motivation Band-to-Band Tunneling Device Proposals Fabrication Techniques Notre Dame Devices Conclusions Motivation Scaling: some proposed tunneling field effect transistor (TFET) designs do not suffer from short channel...
  • Digital Library Architecture- Key Principles

    Digital Library Architecture- Key Principles

    E + - unsigned integer unsigned integer unsigned number • Formal specification does not guarantee correctness • Formal specification does not prescribe the implementation Informal: The function intrt(a) returns the largest integer whose square is less than or equal to...