Fix duplicate key errors on resuming continued txn #555
Conversation
Force-pushed from 4927abb to 32d7265
@dimitri We also faced this error. Thanks!
Hello @lospejos, did you see the error message while resuming pgcopydb?
Force-pushed from 32d7265 to 8d72823
I tried to resume the migration process and faced the same error:
Thanks @lospejos, this is the exact scenario this PR is trying to solve. I recommend compiling pgcopydb with this PR applied to find out whether it helps.
Thanks for your information and for your contribution!
@arajkumar Can you please elaborate on the root cause? Maybe I need to revisit my understanding of how resuming works in online migration now. I believe #525 wouldn't allow the case you mentioned in the description. Also, is there a way to reproduce this?
@shubhamdhama In issue #525, the implementation permits executing transactions up to the COMMIT stage; the decision to COMMIT or ROLLBACK is then made based on the transaction's commit LSN. However, this approach runs into trouble with transactions that lack a commit LSN and target tables with UNIQUE or PRIMARY KEY constraints: when such a transaction is executed a second time (as can happen during a resume operation), it fails with constraint violations. This PR also includes a test case that can be used to reproduce the problem.

EDIT: The current version of the txn metadata is written and consumed only by the apply process, so no deadlock is possible, unlike the previous implementation.
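The failure mode and the fix can be illustrated with a small, self-contained sketch. This is hypothetical code, not pgcopydb's implementation: the table, the function names, and the in-memory metadata store are all illustrative. It shows that replaying a continued transaction into a table with a PRIMARY KEY fails with a duplicate key error, unless its commit LSN was recorded at COMMIT time so the replay can be skipped.

```python
# Illustrative sketch only -- not pgcopydb code.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)")

applied_commit_lsns = set()  # stands in for the txn metadata file

def apply_txn(xid, commit_lsn, rows):
    """Apply a transaction once; skip it if its commit LSN was recorded."""
    if commit_lsn in applied_commit_lsns:
        return "skipped"
    with conn:  # the `with` block commits on success
        conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    applied_commit_lsns.add(commit_lsn)  # recorded at COMMIT time
    return "applied"

txn = ("xid-42", "0/16B3748", [(1, "a"), (2, "b")])

first = apply_txn(*txn)   # initial run applies the rows
second = apply_txn(*txn)  # a resume replays the same txn: skipped

# Without the recorded metadata, the replay would raise
# sqlite3.IntegrityError (duplicate key) -- the failure this PR fixes.
print(first, second)  # -> applied skipped
```

Without the `applied_commit_lsns` check, the second `apply_txn` call would hit the PRIMARY KEY constraint, mirroring the duplicate key error reported in this thread.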
Force-pushed from 8d72823 to 9da29fd
I tried to add the changes from this PR (I replaced the file) and ran:

$ pgcopydb follow --table-jobs 8 --index-jobs 8 --resume --not-consistent --slot-name pgcopydb --no-acl --no-owner --no-role-passwords --no-comments --fail-fast --filter /opt/pgcopydb_filter.ini --plugin wal2json

BTW, I'm not sure my rerun command was correct; please correct me if I was wrong. Unfortunately, I encountered the same error:
@lospejos Did you retry from the beginning? It won't work if you just resume after building the new binary, because the previous version of apply doesn't know about the txn metadata. I would recommend doing the following,
It is not related, but these flags
Thanks for your comment. I tried to execute the commands you provided earlier, but there were errors:

# pgcopydb snapshot --follow --plugin wal2json
16:50:21 40981 INFO Running pgcopydb version 0.14.1.7.gbbbe01d.dirty from "/opt/pgcopydb_0.14.1.7/pgcopydb"
16:50:21 40981 INFO A previous run has run through completion
16:50:21 40981 FATAL Please use --restart to allow for removing files that belong to a completed previous run.
# pgcopydb follow --restart
16:51:00 40984 INFO Running pgcopydb version 0.14.1.7.gbbbe01d.dirty from "/opt/pgcopydb_0.14.1.7/pgcopydb"
16:51:00 40984 INFO [SOURCE] Copying database from "postgres://<skipped>"
16:51:00 40984 INFO [TARGET] Copying database into "postgres://<skipped>"
16:51:01 40984 ERROR Failed to send CREATE_REPLICATION_SLOT command:
16:51:01 40984 ERROR [SOURCE 891444] [42710] ERROR: replication slot "pgcopydb" already exists
16:51:01 40984 ERROR [SOURCE 891444] Context: CREATE_REPLICATION_SLOT "pgcopydb" LOGICAL "test_decoding"
16:51:01 40984 ERROR Failed to create a logical replication slot and export a snapshot, see above for details
# pgcopydb follow --resume --plugin wal2json
16:51:36 40987 INFO Running pgcopydb version 0.14.1.7.gbbbe01d.dirty from "/opt/pgcopydb_0.14.1.7/pgcopydb"
16:51:36 40987 ERROR Options --snapshot is mandatory unless using --not-consistent
16:51:36 40987 FATAL Option --resume requires option --not-consistent
@lospejos OK, since you are not doing a clone and only a follow, the snapshot is not really needed. Let's try this.
Force-pushed from 9da29fd to 9d87d33
@arajkumar I've started the commands:
After that, it seems the migration process resumed without errors. I see messages like:

This continued, and I had no chance to enter the third command. After checking the data in the tables, I decided to clean up the target database and restart the data migration process from scratch. If it also fails with the same error
@lospejos I missed
@dimitri @shubhamdhama Could you please review this fix?
Could you please specify where/when exactly I should execute it? Is this command order correct?
pgcopydb stream cleanup
# run follow in background
pgcopydb follow --restart --plugin wal2json &
# wait for sentinel to be created by follow
sleep 10
# enable apply
pgcopydb sentinel set apply
# now bring follow to foreground
fg
# run for few mins & if you want to test resume, stop using Ctrl + C and then resume like below
pgcopydb follow --plugin wal2json --resume
LGTM! Just left a nitpicking comment about a DEBUG message; please consider fixing it before merging.
Force-pushed from cc3df09 to a965e9d
Force-pushed from 3721de2 to 11090a7
Almost there. Sorry, I missed a couple of places in my previous review.
This commit addresses and resolves the issue of duplicate key errors when resuming partially executed transactions (continuedTxn) in pgcopydb. We have reintroduced the transaction metadata file, which is essential for identifying the commitLSN of a partial transaction.

Unlike our previous approach, which led to a deadlock between the transform and apply phases, this update brings a more efficient process: the apply phase now creates metadata for any partial (continued) transaction during the commit. This metadata is then used to accurately skip the already-applied partial transaction if a resume is needed.

This fix is crucial, particularly for tables with unique constraints, where executing the same continued transaction twice previously resulted in duplicate key errors. With this update, pgcopydb ensures smooth and error-free handling of transaction resumes.

Signed-off-by: Arunprasad Rajkumar <[email protected]>
Thank you @dimitri!
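The metadata flow the commit message describes can be sketched as follows. This is a hypothetical illustration, not pgcopydb's actual on-disk format: the file name, the JSON layout, and the function names are all assumptions. The point is that the apply process records the continued transaction's identity at COMMIT time and consults that record on resume; since only the apply process writes and reads the file, no coordination with the transform phase is needed, which avoids the deadlock of the previous design.

```python
# Illustrative sketch only -- file name and layout are hypothetical.
import json, os, tempfile

workdir = tempfile.mkdtemp()
metadata_path = os.path.join(workdir, "txn-metadata.json")

def record_commit(xid, commit_lsn):
    # Written by the apply process when it commits a continued txn.
    with open(metadata_path, "w") as f:
        json.dump({"xid": xid, "commit_lsn": commit_lsn}, f)

def should_skip(xid):
    # Consulted on resume, before replaying a continued txn.
    if not os.path.exists(metadata_path):
        return False
    with open(metadata_path) as f:
        meta = json.load(f)
    return meta["xid"] == xid

record_commit(731, "0/16B3748")
print(should_skip(731))  # -> True: already applied, skip the replay
print(should_skip(732))  # -> False: a different txn, apply normally
```

Because the skip decision is made from durable metadata written at commit time, replaying the stream after an interruption no longer re-executes the already-committed portion of a continued transaction.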