Chidambaram, V., Pillai, T., Arpaci-Dusseau, A. and Arpaci ...

Optimistic Crash Consistency Chidambaram, V., Pillai, T., Arpaci-Dusseau, A. and Arpaci-Dusseau, R. Presented by Mohammed Alabdulhadi 28 October 2013 Outline

Background Introduction Key Points of the paper Paper main contributions. Pessimistic Crash Consistency Probabilistic Crash Consistency Optimistic Crash Consistency

Implementation of Optimistic File System (OptFS) Evaluation Case Studies Related Work Conclusion Paper Evaluation Questions and Answers?

Background File Systems Operations: Data Operations. Metadata Operations. During a metadata operation, the system must ensure that data are written to disk in such a way that the file system can be recovered to a consistent state after a system crash.

Background Approaches used to handle metadata operations and recovery: Soft Updates. Journaling. NVRAM. Other approaches. The solution presented in this paper is based on the journaling approach.

Background What is journaling? Journaling maintains an auxiliary log that records all metadata operations and ensures that the log and data buffers are synchronized in such a way as to guarantee recoverability. If the system crashes, the log is replayed to bring the file system to a consistent state.

Introduction

The introduction of write buffering, which enables disk writes to be completed out of order, complicated the known techniques for recovering from system crashes. The objective of out-of-order writes is increased performance. The notification received after a write is issued implies only that the disk has received the request, not that the data has been written to the disk surface persistently.

Introduction Without ordering, most file systems cannot ensure that state can be recovered after a crash. Write ordering is achieved in modern drives via expensive cache flush operations; such flushes cause all buffered dirty data in the drive to be written to the surface (i.e., persisted) immediately.

For example, to ensure A is written before B, a client issues the write to A, and then a cache flush; when the flush returns, the client can safely assume that A reached the disk; the write to B can then be safely issued. Introduction Cache Flush Disadvantages: Expensive. Low Performance.
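The A-before-B pattern above can be sketched at the application level. This is a minimal illustration, not the mechanism a file system uses internally: the function names are hypothetical, and os.fsync is only assumed to trigger a drive cache flush (whether it does depends on the OS, file system, and drive configuration).

```python
import os


def write_durably(path: str, payload: bytes) -> None:
    """Write payload, then block until the OS reports the file as flushed."""
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()              # push Python's user-space buffer to the OS
        os.fsync(f.fileno())   # ask the OS to flush to the device (assumed to reach the disk cache)


def write_a_before_b(path_a: str, data_a: bytes, path_b: str, data_b: bytes) -> None:
    """Pessimistic ordering: B is not issued until the flush for A has returned."""
    write_durably(path_a, data_a)      # write A, then flush
    with open(path_b, "wb") as f:      # only now issue the write to B
        f.write(data_b)
```

The cost the deck refers to comes from the blocking flush: the caller waits for the device on every ordering point, even when no crash ever happens.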

This approach of flushing is pessimistic; it assumes a crash will occur and goes to great lengths to ensure that the disk is never in an inconsistent state. So, it is called pessimistic crash consistency. Introduction The poor performance that results from pessimism has led some systems to disable flushing.

Disabling flushes does not necessarily lead to file system inconsistency, but rather introduces it as a possibility. This approach is called probabilistic crash consistency. A probabilistic approach is insufficient for many applications, where certainty in crash recovery is desired. Key points of the paper

A combination of techniques leads to both high performance and deterministic consistency. In the rare event that a crash does occur, optimistic crash consistency either avoids inconsistency by design or ensures that enough information is present on the disk to detect and discard improper updates during recovery. Paper main contributions

Study of probabilistic crash consistency to show which exact factors affect the probability that a crash will leave the file system inconsistent. Introduce optimistic crash consistency, a new approach to building a crash-consistent journaling file system. Types of Crash Consistency Pessimistic Crash Consistency.

Probabilistic Crash Consistency. Optimistic Crash Consistency. Pessimistic Crash Consistency Based on Flush cache operation. A transaction is the atomic update of metadata to the journal.

Pessimistic Crash Consistency Committing a transaction Tx to the journal:
1. The file system first writes any data blocks (D) associated with the transaction to their final destinations.
2. The file system uses the journal to log metadata updates; we refer to these journal writes as JM.
3. The file system issues a write to a commit block (JC).
4. Once JC is written to disk, the transaction Tx is said to be committed.
5. The file system is free to update the metadata blocks in place (M).
If a crash occurs during this checkpointing process, the file system can recover by scanning the journal and replaying committed transactions.

Pessimistic Crash Consistency D → JM → JC → M To achieve this ordering, the file system issues a cache flush wherever order is required. Suggested Optimizations: D|JM → JC → M and D → JM|JC → M. Order between transactions: journaling file systems assume transactions are committed to disk in order (i.e., Txi → Txi+1).

Pessimistic Crash Consistency Drawbacks: An expensive cache flush is issued, thus forcing all pending writes to disk, when perhaps only a subset of them needed to be flushed. The flushes are issued even though the writes may have gone to disk in the correct order anyhow. Crashes are rare. Performance impact.

[Figure: Pessimistic Crash Consistency, performance impact]
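The pessimistic commit sequence D → JM → JC → M, with a flush at every ordering point, can be sketched against a toy disk model. Everything here is hypothetical: SimulatedDisk and pessimistic_commit are illustration names, and the model assumes a flush persists the whole volatile cache at once.

```python
class SimulatedDisk:
    """Toy disk: writes sit in a volatile cache until a flush persists them."""

    def __init__(self) -> None:
        self.cache = []     # acknowledged but not yet durable
        self.durable = []   # persisted blocks
        self.flushes = 0

    def write(self, block: str) -> None:
        self.cache.append(block)

    def flush(self) -> None:
        # A flush forces ALL cached writes out, not just the ones that need ordering.
        self.durable.extend(self.cache)
        self.cache.clear()
        self.flushes += 1


def pessimistic_commit(disk: SimulatedDisk, tx_id: int) -> None:
    """Enforce D -> JM -> JC -> M by flushing at each ordering point."""
    disk.write(f"D:{tx_id}")
    disk.flush()                 # D durable before JM
    disk.write(f"JM:{tx_id}")
    disk.flush()                 # JM durable before JC
    disk.write(f"JC:{tx_id}")
    disk.flush()                 # transaction committed before checkpointing
    disk.write(f"M:{tx_id}")     # in-place metadata update (checkpoint)
```

In this model one transaction costs three flushes, which is the overhead the drawbacks above refer to.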

Probabilistic Crash Consistency Disable flushes. A risk of file-system inconsistency is introduced. Practitioners observed that skipping flush commands sometimes did not lead to observable inconsistency, despite the presence of occasional crashes. Such commentary led to a debate.

No guarantees of consistency vs. performance gain. Probabilistic Crash Consistency Window of vulnerability (W) occurs due to reordering. For example, if A should be written to disk before B, but B is written at time t1 and A is written at t2, the state of the system is vulnerable to inconsistency in the time period between, W = t2 − t1.

Probability of Inconsistency (Pinc): the total time spent in windows of vulnerability divided by the total run time of the workload (Pinc = ΣWi / tworkload).
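As a worked example of the Pinc definition, the sketch below sums the windows of vulnerability and divides by the workload run time. The function name and the (t1, t2) tuple layout are assumptions made for illustration, and the windows are assumed not to overlap.

```python
def probability_of_inconsistency(windows, workload_time):
    """Pinc = (sum of W_i) / t_workload, where each W_i = t2 - t1.

    windows: iterable of (t1, t2) pairs; B hit the disk at t1, A at t2 (t2 >= t1).
    workload_time: total run time of the workload, in the same unit.
    """
    if workload_time <= 0:
        raise ValueError("workload_time must be positive")
    total_vulnerable = sum(t2 - t1 for t1, t2 in windows)
    return total_vulnerable / workload_time


# Two reorderings during a 10-second workload: 0.5 s and 0.25 s of exposure.
p_inc = probability_of_inconsistency([(1.0, 1.5), (4.0, 4.25)], 10.0)
```

Here the system is exposed for 0.75 s out of 10 s, so a crash at a uniformly random instant would find it vulnerable 7.5% of the time.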

Factors affecting Pinc Workload Queue Size (Disk Scheduler) Journal Layout Probabilistic Crash Consistency Factors affecting Pinc :

Workload: early commit (JC → JM|D), early checkpoint (M → D|JM|JC), transaction misorder (Txi → Txi−1), mixed.

Probabilistic Crash Consistency Factors affecting Pinc : Queue Size Probabilistic Crash Consistency Factors affecting Pinc : Journal Layout

Probabilistic Crash Consistency A probabilistic approach is insufficient for many applications, where certainty in crash recovery is desired. Optimistic Crash Consistency Goals:

To commit transactions to persistent storage in a manner that maintains consistency to the same extent as pessimistic journaling. The same performance as with probabilistic consistency. Optimistic Crash Consistency Optimistic crash consistency is based on two main ideas: 1. Checksums can remove the need for ordering writes. 2. Asynchronous durability notifications are used to delay checkpointing a transaction until it has been committed durably. (Minimum extension to the disk interface)

Optimistic Crash Consistency With an asynchronous durability notification the disk informs the upper-level client that a specific write request has completed and is now guaranteed to be durable.

Two notifications from the disk: 1. The disk has received the write. 2. The write has been persisted. Optimistic Crash Consistency Optimistic Consistency Properties: 1. Metadata written in transaction Tx:i+1 cannot be observed unless metadata from transaction Tx:i is also observed. 2. It is not possible for metadata to point to invalid data.

Optimistic journaling allows the disk to perform writes in any order it chooses, but ensures that in the case of a crash, the necessary consistency properties are upheld for ordered transactions.

Optimistic Crash Consistency Optimistic Techniques: In-Order Journal Recovery. In-Order Journal Release. Checksums. Background Write after Notification. Reuse after Notification. Selective Data Journaling.

Optimistic Crash Consistency In-Order Journal Recovery The recovery process reads the journal to observe which transactions were made durable and it simply discards or ignores any write operations that occurred out of the desired ordering. The correction that optimistic journaling applies is to ensure that if any part of a transaction Tx:i was not correctly or completely made durable, then neither transaction Tx:i nor any following transaction Tx:j where j>i is left durable.
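In-order recovery can be sketched as replaying the longest valid prefix of the journal. The sketch assumes a transaction is judged complete by comparing a checksum of JM against a value stored in JC (the idea behind the deck's checksum slides); CRC32 and the JournalTx record are choices made for this illustration, not OptFS's on-disk format.

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JournalTx:
    tx_id: int
    jm: bytes                      # journaled metadata as found on disk after the crash
    jc_checksum: Optional[int]     # checksum stored in JC; None if JC never reached disk


def make_tx(tx_id: int, jm: bytes) -> JournalTx:
    """Build a fully written transaction (hypothetical helper)."""
    return JournalTx(tx_id, jm, zlib.crc32(jm))


def recover(journal: List[JournalTx]) -> List[int]:
    """Return the ids to replay: stop at the first incomplete transaction."""
    replay = []
    for tx in sorted(journal, key=lambda t: t.tx_id):
        if tx.jc_checksum is None or zlib.crc32(tx.jm) != tx.jc_checksum:
            break                  # Tx:i is bad, so Tx:i and every Tx:j with j > i are discarded
        replay.append(tx.tx_id)
    return replay
```

Note that a later transaction is discarded even if it is itself intact, which is what keeps metadata from Tx:i+1 from being observed without Tx:i.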

Optimistic Crash Consistency In-Order Journal Release Ensures that journal transactions are not overwritten until all corresponding checkpoint writes of metadata are confirmed as durable. Optimistic Crash Consistency Checksums

A checksum is used to detect whether or not a write related to a specific transaction has occurred. Metadata transactional checksumming. Data transactional checksumming. Optimistic Crash Consistency Metadata transactional checksumming:

Ensuring metadata is durably written to the journal. A checksum is calculated over JM and placed in JC. If a crash occurs during the commit process, the recovery procedure can detect the mismatch between JM and the checksum in JC and not replay that transaction or any transactions following. Optimistic Crash Consistency Data transactional checksumming:

Used to ensure that data blocks D are written in their entirety as part of the transaction. The data checksums and their on-disk block addresses are stored in JC. The journal recovery process can abort transactions upon mismatch. Optimistic Crash Consistency Background Write after Notification

Ensures that the checkpoint of the metadata (M) occurs after the preceding writes to the data and the journal (i.e., D, JM, and JC). Pessimistic journaling guaranteed this behavior with a flush after JC. Optimistic journaling explicitly postpones the checkpoint write of metadata M until it has been notified that all previous transactions have been durably completed.
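The postponed checkpoint can be sketched as a small scheduler driven by durability notifications. CheckpointScheduler and its methods are hypothetical names; the sketch assumes the disk reports each of D, JM and JC separately and that transactions are numbered in commit order.

```python
class CheckpointScheduler:
    """Delays checkpointing M until D, JM and JC of that and all earlier transactions are durable."""

    def __init__(self) -> None:
        self.awaiting = {}        # tx_id -> blocks not yet reported durable
        self.checkpointed = []    # tx_ids whose M write has been issued, in order

    def commit(self, tx_id: int) -> None:
        """Writes for D, JM and JC were issued; none is known to be durable yet."""
        self.awaiting[tx_id] = {"D", "JM", "JC"}

    def on_durable(self, tx_id: int, block: str) -> None:
        """Asynchronous durability notification from the disk."""
        self.awaiting[tx_id].discard(block)
        # Checkpoint in transaction order; stop at the first one still waiting.
        for pending_id in sorted(self.awaiting):
            if self.awaiting[pending_id]:
                break
            del self.awaiting[pending_id]
            self.checkpointed.append(pending_id)   # safe to write M in place now
```

No flush is issued anywhere: the file system simply waits for notifications in the background instead of blocking the commit path.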

Optimistic Crash Consistency Reuse after Notification Ensures that durable metadata from earlier transactions never points to incorrect data blocks changed in later transactions. Problem: Data block DA is freed from one file MA and allocated to another file, MB, and rewritten with the contents DB.

A durable version of MA may point to the erroneous content of DB. Optimistic Crash Consistency Reuse after Notification Optimistic Solution: The freeing of DA and the update to MA, denoted MA′, are written as part of a transaction JMA′:i.

The allocation of DB to MB is written in a later transaction as DB:i+1 and JMB:i+1. Optimistic journaling guarantees that JMA′:i occurs before DB:i+1 by ensuring that data block DA is not reallocated to another file until the file system has been notified by the disk that JMA′:i has been durably written; at this point, the data block DA is durably free. Optimistic Crash Consistency

Selective Data Journaling Used if update-in-place is desired for performance. Data journaling places both metadata and data in the journal and both are then updated in-place at checkpoint time. Selective data journaling allows ordered journaling to be used for the common case and data journaling only when data blocks are repeatedly overwritten within the same file and the file needs to maintain its original layout on disk.
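The choice between the two modes can be sketched per block. This is a simplified illustration: plan_write is a hypothetical name, and "the block already belongs to the file" is assumed to be the trigger for data journaling.

```python
def plan_write(blocks_to_write, blocks_already_in_file):
    """Split a write into ordered-mode blocks and data-journaled blocks.

    Newly allocated blocks use ordered journaling (the common case).
    Overwrites of blocks the file already owns go through the journal,
    so the file keeps its original on-disk layout.
    """
    plan = {"ordered": [], "data_journaled": []}
    for block in blocks_to_write:
        mode = "data_journaled" if block in blocks_already_in_file else "ordered"
        plan[mode].append(block)
    return plan


# Blocks 10 and 11 are overwritten in place; block 42 is freshly allocated.
plan = plan_write([10, 11, 42], {10, 11})
```

Journaled blocks are written twice (once to the journal, once in place at checkpoint time), which is why this path is reserved for overwrites.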

Optimistic Crash Consistency Selective Data Journaling

Optimistic Crash Consistency Durability vs. Consistency Optimistic journaling uses an array of novel techniques to ensure that writes to disk are properly ordered, or that enough information exists on disk to recover from an untimely crash when writes are issued out of order. The result is file-system consistency and proper ordering of writes, but without guarantees of durability. Some applications may wish to force writes to stable storage for the sake of durability, not ordering.

Optimistic Crash Consistency Durability vs. Consistency Ordering sync, osync(), guarantees ordering between writes. Durability sync, dsync(), ensures when it returns that pending writes have been persisted. Implementation of OptFS OptFS is built on the principles of optimistic crash consistency.

A set of modifications to the Linux ext4 file system. A slight change in the disk interface to provide asynchronous durability notification. Implementation of OptFS Since current disks do not implement the proposed asynchronous durability notification interface, OptFS uses an approximation: durability timeouts.

Durability timeouts represent the maximum time interval that the disk can delay committing a write request to the non-volatile platter. Upon expiration of the time interval, OptFS considers the block to be durable. Apply the optimistic techniques as described earlier. Evaluation Reliability (Consistency guarantees)

Performance. Resource Consumption. Journal Size. Evaluation Reliability

Evaluation Performance: Micro-benchmarks Evaluation Performance: Macro-benchmarks Evaluation

Performance Summary: OptFS significantly outperforms ordered mode with flushes on most workloads, providing the same level of consistency at considerably lower cost. On many workloads, OptFS performs as well as ordered mode without flushes, which offers no consistency guarantees. OptFS may not be suitable for workloads which consist mainly of sequential overwrites.

Evaluation Resource consumption Evaluation Journal size

Case Studies Related Work Soft Updates shows how to carefully order disk updates so as to never leave an on-disk structure in an inconsistent form. While journaling works at the abstraction level of metadata and data, Soft Updates works directly with file system structures, significantly increasing its complexity.

Related Work Optimistic crash consistency is similar to Frost et al.'s work on Featherstitch, which provides a generalized framework to order file-system updates, in either a soft-updating or journal-based approach. Optimistic crash consistency is better in performance and easier for developers.

Related Work Rethink the Sync is a similar approach. Disk writes only need to become durable when some external entity can observe said durability. By delaying persistence until such externalization occurs, huge gains in performance can be realized. Optimistic crash consistency is complementary, in that it reduces the number of such durability events, instead enforcing a weaker and higher-performance ordering among writes, but avoiding the complexity of implementing dependency tracking within the OS.

Related Work No-Order File System (NoFS) removes the need for any ordering to disk at all, thus providing excellent performance. A lack of ordered writes means certain kinds of crashes can lead to a recovered file system that is consistent, but that contains data from partially completed operations. Related Work

Conclusion Optimistic crash consistency is a new approach to crash consistency in journaling file systems that uses a range of novel techniques to obtain both a high level of consistency and excellent performance. The paper introduces two new file-system primitives, osync() and dsync(), which decouple ordering from durability. Decoupling holds the key to resolving the constant tension between consistency and performance in file systems.

Paper Evaluation
+ Successful approach to provide a metadata handling solution that combines a high level of consistency with high performance.
+ Explains the concepts by using examples.
+ Many experiments to evaluate their solution.
- Durability timeout notification based on the disk's maximum write time is not accurate.
- Optimistic crash consistency is not suitable for workloads that contain many sequential overwrites.
- Repeats a lot of information in different sections.

Thank You

Questions
