Meaning of Isolation in ACID

  • Most databases are accessed by multiple clients at the same time, which can lead to race conditions when one transaction (txn) reads data that is concurrently being modified by another txn.
Race condition between 2 clients concurrently increasing a counter

Isolation in the sense of ACID means that concurrent transactions are isolated from each other: each txn can pretend that it is the only txn running on the entire DB.
In practice, this strongest level of isolation is rarely used because of its performance cost, so databases usually offer several isolation levels that trade off performance against isolation guarantees.

Weak Isolation Levels

Dirty read
  • A transaction can see changes made by another txn that have not been committed yet. Those changes may never actually take effect if the other txn rolls back later.
  • There is no locking overhead at this level => very high performance.
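A minimal sketch of a dirty read, assuming a hypothetical table accounts(id, balance) and a database that actually honors READ UNCOMMITTED (e.g. MySQL/InnoDB or SQL Server; PostgreSQL silently upgrades it to Read Committed):

-- Session A
BEGIN;
UPDATE accounts SET balance = 900 WHERE id = 1;   -- not committed yet

-- Session B (running concurrently)
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT balance FROM accounts WHERE id = 1;        -- may return 900: a dirty read

-- Session A
ROLLBACK;                                         -- the value 900 never becomes committed data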
Read Committed

At this level, 2 rules are guaranteed:

  • when reading from the DB, a txn only sees data that has been committed (no dirty reads).
  • when writing to the DB, a txn only overwrites data that has been committed (no dirty writes).

For example, user 1 sets x = 3, but as long as user 1 has not committed, user 2 reading x still gets the old value of x, which is 2.
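The same interleaving expressed as SQL, a minimal sketch assuming a hypothetical table kv(key, value) and a database whose default level is Read Committed (e.g. PostgreSQL):

-- User 1
BEGIN;
UPDATE kv SET value = 3 WHERE key = 'x';    -- old committed value is 2

-- User 2 (concurrently)
SELECT value FROM kv WHERE key = 'x';       -- returns 2: only committed data is visible

-- User 1
COMMIT;

-- User 2
SELECT value FROM kv WHERE key = 'x';       -- now returns 3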

Implementation:

  • Prevent dirty writes: use row-level locks; when a txn wants to modify a row, it has to acquire the lock on that row first. The lock is released when the txn commits or aborts.
  • Prevent dirty reads: the DB keeps both the old committed value and the new value set by the ongoing txn; other txns that read the row are given the old value.
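A sketch of the row-level lock at work, reusing the hypothetical kv table: the second writer simply blocks instead of overwriting uncommitted data:

-- Txn 1
BEGIN;
UPDATE kv SET value = 3 WHERE key = 'x';    -- acquires the row lock on 'x'

-- Txn 2 (concurrently)
UPDATE kv SET value = 5 WHERE key = 'x';    -- blocks, waiting for Txn 1's row lock

-- Txn 1
COMMIT;                                     -- lock released; Txn 2's update now proceeds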
Repeatable read/ Snapshot isolation

Let's consider an example where data still looks inconsistent even at the Read Committed isolation level: Alice has $1,000 split across 2 bank accounts, $500 each. Now she transfers $100 from acc-2 to acc-1. While the transfer txn is being processed, she checks her balances and happens to see acc-1 before the incoming payment has arrived and acc-2 after the outgoing transfer has been made. What she sees is acc-1 with $500 and acc-2 with $400 => $100 appears to be missing.
In fact, this is an acceptable scenario because the data will become consistent again eventually. However, some situations cannot tolerate such temporary inconsistency, e.g. analytic queries that scan over large parts of the DB to check the integrity of the data => such queries can return nonsensical results.

Repeatable Read is the solution to this problem; the rule is simple:

A txn sees exactly the data that was committed in the DB at the moment the txn started (a consistent snapshot).
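A minimal sketch of the rule using the bank-account example, assuming PostgreSQL syntax (MySQL behaves similarly under its REPEATABLE READ level):

-- Alice's read-only txn
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1;   -- acc-1: 500

-- meanwhile, the transfer txn commits: acc-1 = 600, acc-2 = 400

SELECT balance FROM accounts WHERE id = 2;   -- acc-2: still 500, read from the same snapshot
COMMIT;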

Implementation

  • Repeatable Read uses the same locking mechanism as Read Committed to avoid dirty reads/writes.
  • To avoid non-repeatable reads, snapshot isolation uses a technique called multi-version concurrency control (MVCC):
    • when a txn starts, it is given a unique, always-increasing id, txid
    • whenever a txn writes anything to the DB, the written data is tagged with the txid of the writer
    • when a txn reads from the DB, the txids are used to decide which row versions it can see and which are invisible:
      • at the start of each txn, the DB makes a list of the txns in progress at that time; any writes belonging to those txns are ignored
      • any writes made by aborted txns are ignored
      • any writes made by txns with a higher txid (i.e. started after this txn) are ignored
      • all other writes are visible
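These version tags can even be observed directly in some databases; for example, PostgreSQL exposes the writer's txid through the hidden xmin system column:

SELECT txid_current();                   -- the txid assigned to the current txn
SELECT xmin, id, balance FROM accounts;  -- xmin = txid of the txn that wrote each row version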
Serializable

There are still many ways we can end up with concurrency bugs even at the Repeatable Read level. Imagine that Alice and Bob are the 2 on-call doctors for a particular shift. Both are feeling unwell, so they both decide to request leave. Unfortunately, they happen to click the button to go off call at approximately the same time.

Certainly, this scenario doesn't violate the Repeatable Read isolation level, but after the 2 txns commit, no doctor is on call!!! This is called write skew.

One possible solution is to explicitly lock all the rows that the txn depends on:

BEGIN TRANSACTION;
-- Lock every doctor currently on call for this shift so no concurrent txn can change them
SELECT * FROM doctors WHERE on_call = true AND shift_id = 1234 FOR UPDATE;
-- The application checks that at least 2 doctors are currently on call, then proceeds
UPDATE doctors SET on_call = false WHERE name = 'Alice' AND shift_id = 1234;
COMMIT;

Another example: a meeting room booking system

BEGIN TRANSACTION;
-- Check for any existing bookings that overlap with the period noon to 1pm
SELECT COUNT(*) FROM bookings
WHERE room_id = 123
  AND end_time > '2015-01-01 12:00' AND start_time < '2015-01-01 13:00';
-- If the previous query returned zero:
INSERT INTO bookings (room_id, start_time, end_time, user_id)
VALUES (123, '2015-01-01 12:00', '2015-01-01 13:00', 666);
COMMIT;

=> 2 users can run this txn concurrently, both see zero overlapping bookings, and both insert a conflicting meeting.

The pattern of the 2 examples above is:

  • A SELECT query checks whether some requirement is satisfied by searching for rows that match a search condition
  • Depending on the result of the first query, the application code decides how to continue
  • If the app decides to go ahead, it makes a write to the DB and commits the txn. The write changes the outcome of the query in step 1: by the time the write is made, the premise of the decision may no longer hold because of another concurrent txn.

In fact, we cannot always rely on locking to solve this problem, for 2 reasons:

  • the performance cost of locking
  • there may be nothing to lock: in the meeting booking system, the SELECT query returns no rows when the slot is free, so SELECT FOR UPDATE can't attach locks to anything.

We need a stronger isolation level to deal with such cases, Serializable:

  • even though transactions may execute in parallel, the end result is the same as if they had executed one at a time, serially, without any concurrency.
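For example, PostgreSQL implements this level using serializable snapshot isolation; if the meeting-room booking txn from above is re-run concurrently in two sessions at this level, one of them is aborted with a serialization failure (SQLSTATE 40001) and the application must retry it (a sketch of the typical behavior, not a guarantee of exactly where the error is raised):

-- Both sessions run the booking txn concurrently
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT COUNT(*) FROM bookings WHERE room_id = 123
  AND end_time > '2015-01-01 12:00' AND start_time < '2015-01-01 13:00';   -- both sessions see 0
INSERT INTO bookings (room_id, start_time, end_time, user_id)
VALUES (123, '2015-01-01 12:00', '2015-01-01 13:00', 666);
COMMIT;   -- one session commits; the other gets a serialization failure and retries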

To implement serializability, most databases use one of 3 techniques:

  • Literally executing txns in a serial order
  • Two-phase locking (2PL)
  • Optimistic concurrency control (e.g. serializable snapshot isolation)
Two-phase Locking (2PL)

Databases use 2 types of lock in the 2PL protocol:

  • Shared lock (S-Lock): if a txn holds an S-Lock on an object, other txns can also acquire an S-Lock on that object, but cannot acquire an X-Lock on it.
  • Exclusive lock (X-Lock): if a txn holds an X-Lock on an object, no other txn can acquire any type of lock on that object.

The locks are used as follows in 2PL:

  • if a txn wants to read an object, it must acquire the S-Lock first. If another txn holds an X-Lock on the object, it must wait.
  • if a txn wants to write to an object, it must acquire the X-Lock first. If any other txn holds a lock on the object, it must wait.
  • if a txn wants to read and then write an object, it may upgrade its S-Lock to an X-Lock. The upgrade works the same as acquiring an X-Lock directly.
  • after a txn has acquired a lock, it must continue to hold the lock until the end of the txn (commit/abort) => this is why it is called 2PL: the number of locks held increases (growing phase) and then drops to 0 (shrinking phase) at the end of the txn.
Number of locks held over time.

Because the locks are held until the end of the txn, no other txn can interfere; it must wait for the first txn to finish.
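In SQL terms, the reader/writer blocking looks roughly like this (a sketch using explicit row-level locks such as PostgreSQL's FOR SHARE / FOR UPDATE to stand in for S-Locks and X-Locks; a real 2PL implementation takes these locks automatically on every read and write):

-- Txn 1: a reader takes a shared lock
BEGIN;
SELECT * FROM doctors WHERE shift_id = 1234 FOR SHARE;    -- S-Lock on the matching rows

-- Txn 2: another reader is fine
SELECT * FROM doctors WHERE shift_id = 1234 FOR SHARE;     -- S-Locks are compatible

-- Txn 3: a writer must wait
UPDATE doctors SET on_call = false WHERE shift_id = 1234;  -- needs an X-Lock => blocks until Txn 1 and Txn 2 finish

-- Txn 1 and Txn 2
COMMIT;                                                    -- locks released; Txn 3 proceeds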

Performance of 2PL

The big limitation of 2PL is performance: txn throughput and query response times are significantly worse, for 2 reasons:

  • the overhead of acquiring and releasing all those locks
  • reduced concurrency: by design, if 2 txns try to do anything that may result in a race condition, one has to wait for the other to complete first.