
Fix duplicate key errors on resuming continued txn #555

Merged — 1 commit, Dec 1, 2023

Conversation

@arajkumar (Contributor, author)

This commit addresses and resolves the issue of duplicate key errors when resuming partially executed transactions (continuedTxn) in pgcopydb. We have reintroduced the transaction metadata file, which is essential for identifying the commitLSN of a partial transaction.

Unlike our previous approach, which led to a deadlock between the transform and apply phases, this update uses a more efficient process: the apply phase now creates metadata for any partial (continued) transaction at commit time. This metadata is then used to accurately skip the already applied partial transaction when a resume is needed.

This fix is crucial, particularly for tables with unique constraints, where executing the same continued transaction twice previously resulted in duplicate key errors. With this update, pgcopydb ensures smooth and error-free handling of transaction resumes.
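The mechanism described above can be sketched as follows. This is a hedged illustration in Python, not pgcopydb's actual C implementation; the metadata file layout and the names `record_continued_txn`, `already_applied`, `xid`, and `commit_lsn` are all hypothetical.

```python
import json
import os

# Hypothetical sketch of the approach described above -- not pgcopydb's
# actual implementation; file layout and names are invented.

def record_continued_txn(metadata_path, xid, commit_lsn):
    """At COMMIT time, the apply phase writes metadata identifying the
    partial (continued) transaction that was just applied."""
    with open(metadata_path, "w") as f:
        json.dump({"xid": xid, "commit_lsn": commit_lsn}, f)

def already_applied(metadata_path, xid):
    """On resume, a continued transaction whose metadata is present was
    already applied in full, so it must be skipped rather than replayed."""
    if not os.path.exists(metadata_path):
        return False
    with open(metadata_path) as f:
        meta = json.load(f)
    return meta["xid"] == xid
```

With something like this in place, a resume checks `already_applied(...)` before replaying a continued transaction, avoiding the double execution that triggers duplicate key errors.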

@lospejos (Contributor)

@dimitri We also faced the error ERROR: duplicate key value violates unique constraint... Key (id)=(value) already exists.
Kindly verify and merge this PR; I hope it will help in our case.

Thanks!

@arajkumar (Contributor, author)

Hello @lospejos, did you see the error message while resuming pgcopydb?

@lospejos (Contributor)

> Hello @lospejos , Did you see the error message while resuming the pgcopydb?

I tried to resume the migration process using the --resume command-line option like this:

$ pgcopydb follow --table-jobs 8 --index-jobs 8 --resume --not-consistent --slot-name pgcopydb --no-acl --no-owner --no-role-passwords --no-comments --fail-fast --filter /opt/pgcopydb_filter.ini --plugin wal2json

and faced the same error:

ERROR  [TARGET 26433] ERROR:  duplicate key value violates unique constraint "tablename_pkey"
ERROR  [TARGET 26433] DETAIL:  Key (id)=(value) already exists.
ERROR  Failed to apply SQL from file "/root/.local/share/pgcopydb/00000001000025E700000014.sql", see above for details
INFO   Reported write_lsn 25E7/15235EC0, flush_lsn 25E7/15235EC0, replay_lsn 25E7/14FE1E60
ERROR  Subprocess catchup with pid 18699 has exited with error code 12
ERROR  Postgres query was interrupted: update pgcopydb.sentinel set startpos = $2, write_lsn = $1, flush_lsn = $2 returning startpos, endpos, apply, write_lsn, flush_lsn, replay_lsn
ERROR  Failed to fetch pgcopydb.sentinel current values
INFO   Reported write_lsn 25E7/152DEC50, flush_lsn 25E7/152DEC50, replay_lsn 25E7/14FE1E60
INFO   Streamed up to write_lsn 25E7/152DEC50, flush_lsn 25E7/152DEC50, stopping: endpos is 0/0
INFO   Prefetch process has terminated
ERROR  Some sub-process exited with errors, see above for details
ERROR  Failed to follow changes from source, see above for details

@arajkumar (Contributor, author)

Thanks @lospejos, this is the exact scenario the PR is trying to solve. I recommend compiling pgcopydb with this PR to find out whether it helps.

@lospejos (Contributor)

> Thanks @lospejos, This is the exact scenario the PR is trying to solve. I recommend compiling pgcopydb with this PR to find out whether it helps.

Thanks for your information and for your contribution!

@shubhamdhama (Contributor) commented Nov 28, 2023

@arajkumar can you please elaborate on the root cause? Maybe I need to revisit my understanding of how the whole resume process works in online migration now. I believe #525 wouldn't allow the case you mentioned in the description.

Also, is there a way to reproduce this?

@arajkumar (Contributor, author) commented Nov 29, 2023

@shubhamdhama In issue #525, the implementation permits the execution of transactions up to the COMMIT stage. The decision to either COMMIT or ROLLBACK is made based on the transaction's commit LSN. However, this approach encounters a challenge with transactions (lacking a Commit LSN) that target tables with UNIQUE or PRIMARY KEY constraints. When such a transaction is executed a second time (as might occur during a resume operation), it leads to failures due to constraint violations.

This PR also has a test case which can be used to reproduce the problem.
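To make the failure mode concrete, here is a hedged toy reproduction in Python, with SQLite standing in for Postgres: re-executing the same committed transaction against a table with a PRIMARY KEY violates the constraint on the second pass, just as re-applying an already-applied continued transaction does during resume.

```python
import sqlite3

# Toy reproduction (SQLite standing in for Postgres): applying the same
# transaction twice against a primary key fails the second time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")

def apply_txn(conn):
    with conn:  # one transaction: commit on success, rollback on error
        conn.execute("INSERT INTO t (id, v) VALUES (1, 'row')")

apply_txn(conn)  # first apply: succeeds
try:
    apply_txn(conn)  # naive resume re-applies the same transaction
except sqlite3.IntegrityError as err:
    print("duplicate key:", err)
```

The second `apply_txn` raises `sqlite3.IntegrityError` (a UNIQUE constraint failure), which is the SQLite analogue of the Postgres "duplicate key value violates unique constraint" error seen in the logs above.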

EDIT: In the current version, the txn metadata is written and consumed only by the apply process, so no deadlock is possible, unlike the previous implementation.

@lospejos (Contributor)

I tried to add the changes from this PR (replacing the file ld_apply.c with the copy from your PR) and reran the previously interrupted migration using the command line:

$ pgcopydb follow --table-jobs 8 --index-jobs 8 --resume --not-consistent --slot-name pgcopydb --no-acl --no-owner --no-role-passwords --no-comments --fail-fast --filter /opt/pgcopydb_filter.ini --plugin wal2json

BTW, I'm not sure whether the rerun command was OK; please correct me if I was wrong with this command.

Unfortunately I encountered the same error:

13:52:42 38458 ERROR  [TARGET 31303] ERROR:  duplicate key value violates unique constraint "tablename_pkey"
13:52:42 38458 ERROR  [TARGET 31303] DETAIL:  Key (id)=(247420390) already exists.
13:52:42 38458 ERROR  Failed to apply SQL from file "/root/.local/share/pgcopydb/00000001000025E700000014.sql", see above for details
13:52:42 38456 INFO   Reported write_lsn 25E7/152DEC50, flush_lsn 25E7/152DEC50, replay_lsn 25E7/14FE1E60
13:52:42 38452 ERROR  Subprocess catchup with pid 38458 has exited with error code 12
13:52:43 38456 ERROR  Postgres query was interrupted: update pgcopydb.sentinel set startpos = $2, write_lsn = $1, flush_lsn = $2 returning startpos, endpos, apply, write_lsn, flush_lsn, replay_lsn
13:52:43 38456 ERROR  Failed to fetch pgcopydb.sentinel current values
13:52:43 38456 INFO   Reported write_lsn 25E7/15511EB8, flush_lsn 25E7/15511EB8, replay_lsn 25E7/14FE1E60
13:52:43 38456 INFO   Streamed up to write_lsn 25E7/15511EB8, flush_lsn 25E7/15511EB8, stopping: endpos is 0/0
13:52:43 38456 INFO   Prefetch process has terminated
13:52:43 38452 ERROR  Some sub-process exited with errors, see above for details
13:52:43 38452 ERROR  Failed to follow changes from source, see above for details

I'm not sure that my rerun command was correct, but I hope this information helps.

@arajkumar (Contributor, author) commented Nov 29, 2023

@lospejos Did you retry from the beginning? It won't work if you just resume after building the new binary, because the previous version of apply doesn't know about the txn metadata.

I would recommend doing the following:

pgcopydb snapshot --follow --plugin wal2json &

pgcopydb follow --restart

pgcopydb follow --resume --plugin wal2json

It is not related, but these flags --table-jobs 8 --index-jobs 8 --no-acl --no-owner --no-role-passwords --no-comments --filter /opt/pgcopydb_filter.ini have no impact on the follow command.

@lospejos (Contributor)

@arajkumar

Thanks for your comment. I tried to execute the commands you provided earlier, but there were errors:

# pgcopydb snapshot --follow --plugin wal2json
16:50:21 40981 INFO   Running pgcopydb version 0.14.1.7.gbbbe01d.dirty from "/opt/pgcopydb_0.14.1.7/pgcopydb"
16:50:21 40981 INFO   A previous run has run through completion
16:50:21 40981 FATAL  Please use --restart to allow for removing files that belong to a completed previous run.

# pgcopydb follow --restart
16:51:00 40984 INFO   Running pgcopydb version 0.14.1.7.gbbbe01d.dirty from "/opt/pgcopydb_0.14.1.7/pgcopydb"
16:51:00 40984 INFO   [SOURCE] Copying database from "postgres://<skipped>"
16:51:00 40984 INFO   [TARGET] Copying database into "postgres://<skipped>"
16:51:01 40984 ERROR  Failed to send CREATE_REPLICATION_SLOT command:
16:51:01 40984 ERROR  [SOURCE 891444] [42710] ERROR:  replication slot "pgcopydb" already exists
16:51:01 40984 ERROR  [SOURCE 891444] Context: CREATE_REPLICATION_SLOT "pgcopydb" LOGICAL "test_decoding"
16:51:01 40984 ERROR  Failed to create a logical replication slot and export a snapshot, see above for details

# pgcopydb follow --resume --plugin wal2json
16:51:36 40987 INFO   Running pgcopydb version 0.14.1.7.gbbbe01d.dirty from "/opt/pgcopydb_0.14.1.7/pgcopydb"
16:51:36 40987 ERROR  Options --snapshot is mandatory unless using --not-consistent
16:51:36 40987 FATAL  Option --resume requires option --not-consistent

@arajkumar (Contributor, author)

@lospejos OK, since you are not doing clone and only follow, a snapshot is not really needed. Let's try this:

pgcopydb stream cleanup

pgcopydb follow --restart

pgcopydb follow --resume

@lospejos (Contributor)

> @lospejos Ok, since you are not doing clone and only follow, snapshot is not really needed. Lets try this.
>
> pgcopydb stream cleanup
>
> pgcopydb follow --restart
>
> pgcopydb follow --resume

@arajkumar I've started the commands:

pgcopydb stream cleanup

pgcopydb follow --restart --plugin wal2json 

After that, it seems that the migration process resumed without errors. I see messages like:

11:29:30 48828 INFO   Transformed 11796 JSON messages into SQL file "/root/.local/share/pgcopydb/00000001000025F300000013.sql"
11:29:31 48827 INFO   Reported write_lsn 25F3/14058F30, flush_lsn 25F3/13E02FB8, replay_lsn 0/0
11:29:41 48827 INFO   Reported write_lsn 25F3/142E5A98, flush_lsn 25F3/14058F30, replay_lsn 0/0
11:29:51 48827 INFO   Reported write_lsn 25F3/14464B10, flush_lsn 25F3/142E5A98, replay_lsn 0/0
11:30:01 48827 INFO   Reported write_lsn 25F3/1461ED58, flush_lsn 25F3/14464B10, replay_lsn 0/0
11:30:11 48827 INFO   Reported write_lsn 25F3/147FA298, flush_lsn 25F3/1461ED58, replay_lsn 0/0
11:30:21 48827 INFO   Reported write_lsn 25F3/149AEA78, flush_lsn 25F3/147FA298, replay_lsn 0/0
11:30:31 48827 INFO   Reported write_lsn 25F3/14B5BFD0, flush_lsn 25F3/149BE778, replay_lsn 0/0

This continued, and I had no chance to enter the third command pgcopydb follow --resume.

After checking the data in tables on the Source and Target databases, I see a difference: no new data has been migrated to the Target database since the previous migration failed (a couple of days ago).
I'm not sure whether it is due to my invalid previous commands (something was accidentally disabled or incorrectly updated) or something else.

I decided to clean up the target database and restart the data migration process from scratch. If it also fails with the ERROR: duplicate key value violates unique constraint error, I'll try your approach from the previous message.

@arajkumar (Contributor, author)

@lospejos I missed pgcopydb sentinel set apply. Include that too, and run it from a separate terminal tab/window with all the necessary env; otherwise, the data won't be moved to the target.

@arajkumar (Contributor, author)

@dimitri @shubhamdhama Could you please review this fix?

@lospejos (Contributor)

> @lospejos I missed pgcopydb sentinel set apply. Include that too and run it from a separate terminal tab/window with all the necessary env. Otherwise, the data won't be moved to the target.

Could you please specify where/when exactly I should execute pgcopydb sentinel set apply?

Is this command order correct:

pgcopydb stream cleanup
pgcopydb sentinel set apply
pgcopydb follow --restart --plugin wal2json
pgcopydb follow --resume

?

@arajkumar (Contributor, author) commented Dec 1, 2023

pgcopydb stream cleanup

# run follow in background 
pgcopydb follow --restart --plugin wal2json &

# wait for sentinel to be created by follow
sleep 10

# enable apply
pgcopydb sentinel set apply

# now bring follow to foreground
fg

# run for a few minutes; if you want to test resume, stop with Ctrl + C and then resume like below

pgcopydb follow  --plugin wal2json --resume

@dimitri (Owner) left a comment

LGTM! Just left a nitpicking comment about a DEBUG message, please consider fixing it before merge.

Review comment on src/bin/pgcopydb/ld_apply.c (outdated, resolved)
@arajkumar force-pushed the no-cmt-lsn branch 3 times, most recently from cc3df09 to a965e9d (December 1, 2023 13:03)
@arajkumar requested a review from dimitri (December 1, 2023 13:27)
Review comment on src/bin/pgcopydb/ld_apply.c (outdated, resolved)
@arajkumar force-pushed the no-cmt-lsn branch 2 times, most recently from 3721de2 to 11090a7 (December 1, 2023 14:02)
@arajkumar requested a review from dimitri (December 1, 2023 14:12)
@dimitri (Owner) left a comment

Almost there. Sorry I missed a couple of places in my previous review.

Review comment on src/bin/pgcopydb/ld_apply.c (outdated, resolved)
Review comment on src/bin/pgcopydb/ld_apply.c (outdated, resolved)

Signed-off-by: Arunprasad Rajkumar <[email protected]>
@dimitri dimitri merged commit bb2e3e0 into dimitri:main Dec 1, 2023
16 checks passed
@dimitri added labels bug (Something isn't working) and enhancement (New feature or request) on Dec 1, 2023
@dimitri dimitri added this to the v0.15 milestone Dec 1, 2023
@arajkumar (Contributor, author)

Thank you @dimitri !
