memdb: retain old version nodes of ART to satisfy snapshot read #1503

Merged (5 commits) on Nov 20, 2024

@you06 you06 commented Nov 19, 2024

ref pingcap/tidb#57425

Changes

The snapshot iterators always read from a snapshot of MemBuffer, but writes between Next calls can alter the structure of the ART (see the "Bug explanation" section for details), potentially causing the snapshot iterator to return incorrect results.

This PR introduces a counter for active snapshots. When the counter is greater than 0, it indicates that old versions need to be retained for snapshot reads. In such cases, we store freed nodes in unused slices and delay the actual free operation to prevent them from being reused.
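The deferred-free idea described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `nodeAllocator`, `acquireSnapshot`, `releaseSnapshot`, and the integer addresses are hypothetical names chosen for the example, and the real allocator manages typed ART nodes rather than bare integers.

```go
package main

import "fmt"

// nodeAllocator is a hypothetical, simplified stand-in for the ART node
// allocator. Addresses are plain ints instead of real arena addresses.
type nodeAllocator struct {
	nextAddr        int   // next never-used address
	freeList        []int // addresses available for immediate reuse
	unused          []int // freed addresses retained while snapshots are active
	activeSnapshots int   // number of live snapshot readers
}

// alloc reuses a freed address when one is available, else takes a fresh one.
func (a *nodeAllocator) alloc() int {
	if n := len(a.freeList); n > 0 {
		addr := a.freeList[n-1]
		a.freeList = a.freeList[:n-1]
		return addr
	}
	addr := a.nextAddr
	a.nextAddr++
	return addr
}

// free makes addr reusable, unless a snapshot is active, in which case the
// address is parked in unused so the old node content stays readable.
func (a *nodeAllocator) free(addr int) {
	if a.activeSnapshots > 0 {
		a.unused = append(a.unused, addr)
		return
	}
	a.freeList = append(a.freeList, addr)
}

func (a *nodeAllocator) acquireSnapshot() { a.activeSnapshots++ }

// releaseSnapshot performs the delayed frees once the last snapshot is gone.
func (a *nodeAllocator) releaseSnapshot() {
	a.activeSnapshots--
	if a.activeSnapshots == 0 {
		a.freeList = append(a.freeList, a.unused...)
		a.unused = a.unused[:0]
	}
}

func main() {
	a := &nodeAllocator{}
	n1 := a.alloc()
	a.acquireSnapshot()
	a.free(n1)                   // parked, not reusable yet
	fmt.Println(a.alloc() != n1) // true: a fresh address is handed out
	a.releaseSnapshot()          // delayed free happens here
	fmt.Println(a.alloc() == n1) // true: the old address is reusable again
}
```

The trade-off in this sketch is memory: freed nodes accumulate in `unused` for as long as any snapshot stays open.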

Add a counter for

Bug explanation

1. The snapshot iterator scans to node 1.
  │                                 │ 
  │        iterator range           │ 
  ◄─────────────────────────────────► 
  │                                 │ 
  │                                 │ 
  │            ┌────────┐           │ 
  │            │        │           │ 
  │            │  root  │           │ 
  │            │        │           │ 
  │            └────┬───┘           │ 
  │                 │               │ 
  │         ┌───────┴───────┐       │ 
  │         │               │       │ 
  │    ┌────▼───┐      ┌────▼───┐   │ 
  │    │        │      │        │   │ 
  │    │    1   │      │    2   │   │ 
  │    │        │      │        │   │ 
  │    └─────▲──┘      └────────┘   │ 
  │          │                      │ 
  │          │                      │ 
  │          │                      │ 
  │          │                      │ 
  │          │                      │ 
  │    ┌─────┴───┐                  │ 
  │    │ snapshot│                  │ 
  │    │ iterator│                  │ 
  │    └─────────┘                  │ 
  │                                 │ 
2. Node 1 grows to a larger capacity (becoming node 3) due to incoming writes, and the freed node 1 is reused outside the iterator range.
 │                                 │                
 │        iterator range           │                
 ◄─────────────────────────────────►                
 │                                 │                
 │                                 │                
 │            ┌────────┐           │                
 │            │        │           │                
 │            │  root  │           │                
 │            │        │           │                
 │            └────┬───┘           │                
 │                 │               │                
 │         ┌───────┴───────┬───────┼─────────┐      
 │         │               │       │         │      
 │    ┌────▼───┐      ┌────▼───┐   │    ┌────▼───┐  
 │    │        │      │        │   │    │        │  
 │    │    3   │      │    2   │   │    │    1   │  
 │    │        │      │        │   │    │        │  
 │    └────────┘      └────────┘   │    └─────▲──┘  
 │                                 │          │     
 │                                 │          │     
 │                                 │          │     
 │                                 │          │     
 │                                 │          │     
 │                                 │    ┌─────┴───┐ 
 │                                 │    │ snapshot│ 
 │                                 │    │ iterator│ 
 │                                 │    └─────────┘ 
 │                                 │                
3. The iterator returns keys outside the given range on the following Next call.
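The three steps above can be replayed with a toy arena. This is a sketch under simplifying assumptions: `arena`, `scenario`, and the string keys are hypothetical, nodes are just key slices, and the `retain` flag stands in for "a snapshot is active, so freed nodes are kept".

```go
package main

import "fmt"

// arena is a hypothetical node store: nodes[addr] holds the keys at addr.
type arena struct {
	nodes    [][]string
	freeList []int
	retain   bool // when true, freed addresses are never handed out again
}

// alloc stores keys at a reused address if one is free, else at a new one.
func (a *arena) alloc(keys ...string) int {
	if n := len(a.freeList); n > 0 {
		addr := a.freeList[n-1]
		a.freeList = a.freeList[:n-1]
		a.nodes[addr] = keys
		return addr
	}
	a.nodes = append(a.nodes, keys)
	return len(a.nodes) - 1
}

func (a *arena) free(addr int) {
	if a.retain {
		return // old version is kept for the snapshot reader
	}
	a.freeList = append(a.freeList, addr)
}

// scenario replays the steps and returns what the iterator sees at the
// address it saved before the writes happened.
func scenario(retain bool) []string {
	a := &arena{retain: retain}

	// Step 1: the iterator is positioned on node 1.
	node1 := a.alloc("a1", "a2")
	iterPos := node1

	// Step 2: node 1 grows into node 3, node 1 is freed, and a later write
	// for an out-of-range key allocates a node (possibly at node 1's address).
	a.alloc("a1", "a2", "a3")
	a.free(node1)
	a.alloc("z9")

	// Step 3: the iterator dereferences its saved address.
	return a.nodes[iterPos]
}

func main() {
	fmt.Println(scenario(false)) // [z9]: the stale address now holds a foreign key
	fmt.Println(scenario(true))  // [a1 a2]: the old version is still intact
}
```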

@ti-chi-bot ti-chi-bot bot added dco-signoff: yes Indicates the PR's author has signed the dco. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 19, 2024
@cfzjywxk cfzjywxk requested review from cfzjywxk and ekexium November 19, 2024 08:31
Signed-off-by: you06 <[email protected]>
@ekexium ekexium left a comment

The fix looks OK. I just have some questions:

  1. In your example, which variable in the iterator is pointing to node-1? Is it nodes in baseIter?
  2. Why are we only fixing for SnapshotIter? What about the ART Iterator? Is it currently safe in implementation but not guaranteed by design?
  3. Are there any other inner structure changes that could lead to iterator invalidation, other than the freed nodes? I don't see other cases at first glance. Have you verified this as well?

Signed-off-by: you06 <[email protected]>
you06 commented Nov 19, 2024

  • In your example, which variable in the iterator is pointing to node-1? Is it nodes in baseIter?

Yes, nodes[len(nodes) - 1] is node1 in the example.

  • Why are we only fixing for SnapshotIter? What about the ART Iterator?

SnapshotIter is defined to always read from the snapshot, so we need to protect it against later writes.

For Iterator, there should be no writes during iteration; otherwise the result makes no sense.

Is it currently safe in implementation but not guaranteed by design?

Yes, this usage in TiDB was not something I expected. In the long term, we may deprecate SnapshotIter and replace it with SnapshotScan, which returns all the rows in one call.
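To illustrate why a one-call scan sidesteps the problem, here is a rough sketch. It assumes nothing about the eventual SnapshotScan API: `snapshotScan`, `kv`, and the map-backed store are hypothetical and only show that rows copied out in a single call cannot be disturbed by later writes.

```go
package main

import (
	"fmt"
	"sort"
)

// kv is a hypothetical row type for this example.
type kv struct{ key, value string }

// snapshotScan is a hypothetical one-shot scan: it copies every pair in
// [lower, upper) out of store, in key order, before returning.
func snapshotScan(store map[string]string, lower, upper string) []kv {
	keys := make([]string, 0, len(store))
	for k := range store {
		if k >= lower && k < upper {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)

	rows := make([]kv, 0, len(keys))
	for _, k := range keys {
		rows = append(rows, kv{k, store[k]})
	}
	return rows
}

func main() {
	store := map[string]string{"a": "1", "b": "2", "z": "9"}
	rows := snapshotScan(store, "a", "c")

	// Writes after the call cannot change the rows already returned.
	store["b"] = "changed"
	store["a0"] = "new"

	fmt.Println(rows) // [{a 1} {b 2}]
}
```

The cost of this design is that the whole result is materialized up front, whereas an iterator produces rows lazily.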

  • Are there any other inner structure changes that could lead to iterator invalidation, other than the freed nodes? I don't see other cases at first glance. Have you verified this as well?

I don't see other cases either. GetSnapshotValue can filter out newly added keys or versions.

@cfzjywxk cfzjywxk left a comment

Rest LGTM

internal/unionstore/memdb_test.go (review thread resolved)
@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Nov 19, 2024
cfzjywxk commented Nov 19, 2024

@you06
Please also ensure the read-with-write test cases are covered in the PR for tidb repo.

Signed-off-by: you06 <[email protected]>
@cfzjywxk cfzjywxk requested a review from ekexium November 20, 2024 02:07
@ti-chi-bot ti-chi-bot bot added the lgtm label Nov 20, 2024
ti-chi-bot bot commented Nov 20, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cfzjywxk, ekexium

@ti-chi-bot ti-chi-bot bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Nov 20, 2024
ti-chi-bot bot commented Nov 20, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-11-19 14:06:12.450469174 +0000 UTC m=+969934.641338172: ☑️ agreed by cfzjywxk.
  • 2024-11-20 02:42:27.197463847 +0000 UTC m=+1015309.388332843: ☑️ agreed by ekexium.

@ti-chi-bot ti-chi-bot bot merged commit 05d115b into tikv:master Nov 20, 2024
12 checks passed
rleungx pushed a commit to rleungx/client-go that referenced this pull request Nov 20, 2024