Database Reading Transaction Isolation Issues and Gap Locking
Table of Contents
- What problem can happen when a write operation on some rows does not block reads of the same rows?
- Even if the READ COMMITTED level is applied, can the Read Skew problem still occur?
- What problem can happen when I apply REPEATABLE READ?
- Can gap locks prevent this phantom read?
- What are the trade-offs of gap locking?
What problem can happen when a write operation on some rows does not block reads of the same rows?
This scenario describes a potential dirty read problem - a classic concurrency issue in databases. Let me explain what happens:
When writes don't block reads (i.e., there's no read lock during writes), a transaction can read data that's in the middle of being modified by another transaction that hasn't yet committed. This creates several problems:
- Inconsistent data: If Transaction A reads data while Transaction B is updating it but hasn't committed, Transaction A sees intermediate data that may never actually become permanent in the database if Transaction B rolls back.
For example:
Time 1: Transaction B begins updating account balance from $1000 to $800
Time 2: Transaction A reads balance (sees $800) before Transaction B commits
Time 3: Transaction B rolls back due to an error
Time 4: Transaction A is now working with invalid data ($800) when the real balance is still $1000
- Business rule violations: Imagine a banking system where you need to maintain a minimum balance. A dirty read could make it appear that an account has sufficient funds when it actually doesn't, leading to incorrect decisions.
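Here is a minimal sketch of that timeline in SQL (the accounts table and values are illustrative; READ UNCOMMITTED is used so the dirty read is allowed to happen):
-- Session B (the writer)
START TRANSACTION;
UPDATE accounts SET balance = 800 WHERE id = 'A';  -- balance was 1000; not yet committed

-- Session A (the reader), running at an isolation level that permits dirty reads
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
START TRANSACTION;
SELECT balance FROM accounts WHERE id = 'A';       -- sees 800, data that was never committed

-- Session B hits an error and rolls back
ROLLBACK;                                          -- the real balance is still 1000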
This is why databases implement isolation levels like READ COMMITTED or SERIALIZABLE that prevent dirty reads by ensuring a transaction can only read committed data. Would you like me to explain more about how these isolation levels work to prevent this problem?
Even if the READ COMMITTED level is applied, can the Read Skew problem still occur?
Yes, exactly. Even with the READ COMMITTED isolation level, Read Skew can still occur. READ COMMITTED only prevents dirty reads; it doesn't protect against Read Skew (also called a Non-Repeatable Read).
Here's a classic example:
-- Account A starts with $1000, Account B starts with $1000
-- Money transfer of $100 from A to B
Time 1: Transaction 1 reads Account A balance ($1000)
Time 2: Transaction 2 transfers $100 from A to B (commits)
Time 3: Transaction 1 reads Account B balance ($1100)
In this case, Transaction 1 sees an inconsistent state where there appears to be $100 more in the total system ($2100 instead of $2000) because it read Account A before the transfer and Account B after the transfer. This is Read Skew - different reads within the same transaction see different versions of the data.
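A sketch of this timeline in SQL (table and values are illustrative; READ COMMITTED is assumed for Transaction 1):
-- Transaction 1 (the reader)
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
START TRANSACTION;
SELECT balance FROM accounts WHERE id = 'A';   -- 1000

-- Transaction 2 (the transfer) runs and commits in between
START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 'A';
UPDATE accounts SET balance = balance + 100 WHERE id = 'B';
COMMIT;

-- Transaction 1 continues
SELECT balance FROM accounts WHERE id = 'B';   -- 1100; the two reads no longer add up to 2000
COMMIT;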
To prevent Read Skew, you need REPEATABLE READ or SERIALIZABLE isolation level, where:
- REPEATABLE READ ensures that once a transaction reads a row, it will see the same data for that row throughout the transaction
- SERIALIZABLE provides even stronger guarantees by preventing phantom reads as well
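For comparison, a sketch of the same transfer scenario under REPEATABLE READ (as InnoDB implements it with a consistent snapshot; table and values are illustrative):
-- Transaction 1
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION;
SELECT balance FROM accounts WHERE id = 'A';   -- 1000

-- Transaction 2 transfers $100 from A to B and commits here

SELECT balance FROM accounts WHERE id = 'A';   -- still 1000 inside this transaction
SELECT balance FROM accounts WHERE id = 'B';   -- still 1000; both reads come from the same snapshot
COMMIT;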
Would you like me to explain the specific mechanisms (like snapshot isolation) that databases use to implement these stronger consistency guarantees?
What problem can happen when I apply REPEATABLE READ?
Even with REPEATABLE READ isolation level, you can still encounter the Phantom Read problem. Here's how it works:
Let's say you have a banking scenario:
-- Transaction 1: Find all accounts with balance > $1000
Time 1: SELECT * FROM accounts WHERE balance > 1000; -- Returns 2 accounts
-- These two rows will now read consistently for the rest of Transaction 1
Time 2: -- Transaction 2 inserts a new account with $1500 and commits
INSERT INTO accounts VALUES ('new_account', 1500);
Time 3: -- Transaction 1 runs the same query again
SELECT * FROM accounts WHERE balance > 1000; -- Now returns 3 accounts!
Even though REPEATABLE READ ensures you get consistent reads of existing rows, it doesn't prevent new rows (phantoms) from appearing in your result set when you run the same query again within the same transaction.
This happens because:
- REPEATABLE READ only protects the individual rows it has already read
- It doesn't lock the gaps between rows, so it cannot stop inserts that match your WHERE clause
- New rows can therefore appear when the same range query is repeated
(In MVCC engines such as InnoDB, a plain repeated SELECT reads from a snapshot and won't show the phantom; the new row typically surfaces through locking reads or writes such as UPDATE/DELETE over the same range.)
This is why you might need SERIALIZABLE isolation if your application requires complete consistency across range queries. SERIALIZABLE would prevent these phantom reads by effectively locking the entire range specified by your WHERE clause.
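As a sketch, here is what switching to SERIALIZABLE looks like in MySQL/InnoDB, where plain SELECTs inside a transaction are implicitly treated as locking reads and therefore take the range locks described above:
-- Transaction 1
SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE;
START TRANSACTION;
SELECT * FROM accounts WHERE balance > 1000;        -- implicitly a shared locking read

-- Transaction 2
INSERT INTO accounts VALUES ('new_account', 1500);  -- blocks until Transaction 1 finishes

-- Transaction 1
SELECT * FROM accounts WHERE balance > 1000;        -- same result set as the first query
COMMIT;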
Can gap locks prevent this phantom read?
Yes, gap locks can prevent phantom reads. A gap lock covers the space between existing index records, preventing other transactions from inserting new rows into that space.
Here's how it works:
-- Using gap locks (in InnoDB the SELECT must be a locking read, e.g. FOR UPDATE, to take them)
Time 1: SELECT * FROM accounts WHERE balance > 1000 FOR UPDATE;
-- This acquires locks on:
-- 1. Existing rows where balance > 1000
-- 2. The gaps between these rows
-- 3. The gap before the first matching row
-- 4. The gap after the last matching row
Time 2: -- Transaction 2 tries to insert a new account
INSERT INTO accounts VALUES ('new_account', 1500);
-- This will be blocked by the gap lock until Transaction 1 commits
Time 3: -- Transaction 1's repeated query sees the same results
SELECT * FROM accounts WHERE balance > 1000 FOR UPDATE;
InnoDB in MySQL implements this with "next-key locking", which combines:
- Record locks (on the index records)
- Gap locks (on the spaces between index records)
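While Transaction 1 from the example above still holds its locks, MySQL 8.0 lets you inspect the record and gap locks it is holding (a sketch; the exact rows shown vary by version and data):
SELECT engine_transaction_id, index_name, lock_type, lock_mode, lock_data
FROM performance_schema.data_locks
WHERE object_name = 'accounts';
-- Typical LOCK_MODE values here are X (a next-key lock: the record plus the gap before it)
-- and X,GAP (a pure gap lock), which together make up the next-key locking described above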
However, gap locks come with tradeoffs:
- Reduced concurrency since insertions in the range are blocked
- Potential for deadlocks when transactions holding gap locks on the same range then try to insert into it
- Performance overhead from maintaining additional locks
This is why some applications choose to live with phantom reads rather than use gap locks or SERIALIZABLE isolation, especially in high-concurrency scenarios.
What are the trade-offs of gap locking?
Gap locking comes with several important trade-offs:
Advantages:
- Prevents phantom reads by blocking insertions into locked ranges
- Ensures stronger consistency for range queries
- Makes REPEATABLE READ nearly as strong as the SERIALIZABLE level for range queries
Disadvantages:
- Reduced Concurrency:
-- Transaction 1 (locking read; assumes an index on order_status)
SELECT * FROM orders WHERE order_status BETWEEN 'PENDING' AND 'SHIPPED' FOR UPDATE;
-- Locks the entire range, blocking these otherwise valid concurrent operations:
-- Transaction 2 - blocked
INSERT INTO orders VALUES (100, 'PENDING', ...); -- blocked even though it is a brand-new record
-- Transaction 3 - blocked
UPDATE orders SET order_status = 'PROCESSING'
WHERE order_id = 50; -- blocked if that row's current status falls inside the locked range
- Higher Deadlock Probability (a deadlock-inspection sketch follows this list):
-- Transaction 1 (assume no rows currently fall in this range, so only gap locks are taken)
SELECT * FROM accounts WHERE balance BETWEEN 1000 AND 2000 FOR UPDATE;
-- Transaction 2 (gap locks don't conflict with each other, so this succeeds too)
SELECT * FROM accounts WHERE balance BETWEEN 1000 AND 2000 FOR UPDATE;
-- Now each transaction inserts into the gap the other is guarding:
-- Transaction 1: INSERT INTO accounts VALUES ('acct_x', 1500); -- waits on Transaction 2's gap lock
-- Transaction 2: INSERT INTO accounts VALUES ('acct_y', 1600); -- waits on Transaction 1's gap lock
-- Deadlock: the lock waits form a cycle and the database rolls one transaction back
- Performance Impact:
- Extra overhead maintaining gap locks
- More memory usage for lock management
- Longer transaction wait times
- More complex lock management logic
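When one of the deadlocks described above does occur, InnoDB keeps a record of the most recent one; a quick way to inspect it in MySQL:
SHOW ENGINE INNODB STATUS;
-- The output contains a "LATEST DETECTED DEADLOCK" section showing the two
-- transactions involved, the locks each was holding and waiting for, and
-- which transaction was chosen as the victim and rolled back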
This is why many systems choose alternative approaches:
- Use snapshot-based isolation (e.g., PostgreSQL's REPEATABLE READ, which serves reads from a snapshot instead of taking range locks)
- Accept phantom reads for better performance
- Design schemas/queries to minimize range locks
- Use application-level locking for specific cases
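As a sketch of two of these alternatives in MySQL (the advisory lock name below is illustrative): dropping to READ COMMITTED disables gap locking for ordinary statements, and an application-level advisory lock can serialize just the critical section instead of locking the index range:
-- Option 1: accept phantoms, keep concurrent inserts flowing
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;

-- Option 2: application-level (advisory) lock around the critical section
SELECT GET_LOCK('orders_range_report', 10);    -- returns 1 when acquired, 0 on timeout
-- ... run the range query and the logic that depends on it ...
SELECT RELEASE_LOCK('orders_range_report');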