[BUG] Cross cluster replication fails to allocate shard on follower cluster #1465

Open
borutlukic opened this issue Nov 25, 2024 · 4 comments

@borutlukic

What is the bug?
Replication does not start; the shard fails to allocate on the follower cluster.

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Start replication of the index:

     PUT _plugins/_replication/proxy-2024.11/_start
     {
       "leader_alias": "main-cluster",
       "leader_index": "proxy-2024.11",
       "use_roles": {
         "leader_cluster_role": "all_access",
         "follower_cluster_role": "all_access"
       }
     }

  2. Check the replication status and see the error:

     GET _plugins/_replication/proxy-2024.11/_status
     {
       "status": "FAILED",
       "reason": "",
       "leader_alias": "prod-mon-elk-muc",
       "leader_index": "proxy-2024.11",
       "follower_index": "proxy-2024.11"
     }

What is the expected behavior?
Replication should start

What is your host/environment?

  • OS: Ubuntu 22
  • OpenSearch 2.18 Docker image
  • Plugins: only those bundled with the OpenSearch Docker image

Do you have any screenshots?
N/A

Do you have any additional context?
The OpenSearch logs contain many Java stack traces, all of which report java.lang.IllegalStateException: confined:

Example stack trace:
[2024-11-25T16:51:05,128][ERROR][o.o.r.r.RemoteClusterRepository] [opensearch-node-114] Restore of [proxy-2024.11][0] failed due to java.lang.IllegalStateException: confined
at org.apache.lucene.store.MemorySegmentIndexInput.ensureAccessible(MemorySegmentIndexInput.java:103)
at org.apache.lucene.store.MemorySegmentIndexInput.buildSlice(MemorySegmentIndexInput.java:461)
at org.apache.lucene.store.MemorySegmentIndexInput.clone(MemorySegmentIndexInput.java:425)
at org.apache.lucene.store.MemorySegmentIndexInput$SingleSegmentImpl.clone(MemorySegmentIndexInput.java:530)
at org.opensearch.replication.repository.RestoreContext.openInput(RestoreContext.kt:39)
at org.opensearch.replication.repository.RemoteClusterRestoreLeaderService.openInputStream(RemoteClusterRestoreLeaderService.kt:76)
at org.opensearch.replication.action.repository.TransportGetFileChunkAction$shardOperation$1.invoke(TransportGetFileChunkAction.kt:59)
at org.opensearch.replication.action.repository.TransportGetFileChunkAction$shardOperation$1.invoke(TransportGetFileChunkAction.kt:57)
at org.opensearch.replication.util.ExtensionsKt.performOp(Extensions.kt:55)
at org.opensearch.replication.util.ExtensionsKt.performOp$default(Extensions.kt:52)
at org.opensearch.replication.action.repository.TransportGetFileChunkAction.shardOperation(TransportGetFileChunkAction.kt:57)
at org.opensearch.replication.action.repository.TransportGetFileChunkAction.shardOperation(TransportGetFileChunkAction.kt:33)
at org.opensearch.action.support.single.shard.TransportSingleShardAction.lambda$asyncShardOperation$0(TransportSingleShardAction.java:131)
at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74)
at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1005)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.lang.Thread.run(Thread.java:1583)

Followed by:
[2024-11-25T16:51:05,135][ERROR][o.o.r.r.RemoteClusterRepository] [opensearch-node-114] Releasing leader resource failed due to NotSerializableExceptionWrapper[wrong_thread_exception: Attempted access outside owning thread]
at jdk.internal.foreign.MemorySessionImpl.wrongThread(MemorySessionImpl.java:315)
at jdk.internal.misc.ScopedMemoryAccess$ScopedAccessError.newRuntimeException(ScopedMemoryAccess.java:113)
at jdk.internal.foreign.MemorySessionImpl.checkValidState(MemorySessionImpl.java:219)
at jdk.internal.foreign.ConfinedSession.justClose(ConfinedSession.java:83)
at jdk.internal.foreign.MemorySessionImpl.close(MemorySessionImpl.java:242)
at jdk.internal.foreign.MemorySessionImpl$1.close(MemorySessionImpl.java:88)
at org.apache.lucene.store.MemorySegmentIndexInput.close(MemorySegmentIndexInput.java:514)
at org.opensearch.replication.repository.RestoreContext.close(RestoreContext.kt:52)
at org.opensearch.replication.repository.RemoteClusterRestoreLeaderService.removeLeaderClusterRestore(RemoteClusterRestoreLeaderService.kt:142)
at org.opensearch.replication.action.repository.TransportReleaseLeaderResourcesAction.shardOperation(TransportReleaseLeaderResourcesAction.kt:48)
at org.opensearch.replication.action.repository.TransportReleaseLeaderResourcesAction.shardOperation(TransportReleaseLeaderResourcesAction.kt:31)
at org.opensearch.action.support.single.shard.TransportSingleShardAction.lambda$asyncShardOperation$0(TransportSingleShardAction.java:131)
at org.opensearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:74)
at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:89)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1005)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.lang.Thread.run(Thread.java:1583)

borutlukic added the bug and untriaged labels on Nov 25, 2024
@borutlukic (Author)

Setting:

  "plugins.replication.follower.index.recovery.chunk_size": "1gb",
  "plugins.replication.follower.index.recovery.max_concurrent_file_chunks": "1"

seems to fix the issue. It appears that if the files on the leader cluster are too large, replication fails to start unless recovery.chunk_size is big enough to transfer each file in one go (one way to apply these settings is sketched below).
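
For anyone else hitting this, a quick sketch of applying the workaround through the cluster settings API on the follower cluster (values copied from above; this assumes the settings are dynamic, which is how the follower recovery settings are normally changed):

  PUT _cluster/settings
  {
    "persistent": {
      "plugins.replication.follower.index.recovery.chunk_size": "1gb",
      "plugins.replication.follower.index.recovery.max_concurrent_file_chunks": "1"
    }
  }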

@borutlukic (Author)

It appears that by setting 'plugins.replication.follower.index.recovery.chunk_size' to the maximum (which is 1gb) I can get all but one index to replicate. Would it be possible to raise the limit above 1gb? Something strange seems to happen when a file has to be transferred in multiple chunks: any chunk with offset > 0 fails to transfer with 'java.lang.IllegalStateException: confined'.
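
For context on the exception itself (background only, not the plugin's actual code): both traces above point at thread confinement in the JDK foreign-memory API, where a confined arena/memory session may only be accessed or closed by the thread that created it ("Attempted access outside owning thread"). A minimal, hypothetical Java sketch of that rule, assuming JDK 22+:

  import java.lang.foreign.Arena;
  import java.lang.foreign.MemorySegment;
  import java.lang.foreign.ValueLayout;

  // Hypothetical demo, unrelated to the plugin code: a confined Arena belongs to
  // the thread that created it, and any access from another thread fails with a
  // WrongThreadException, analogous to the errors in the logs above.
  public class ConfinedArenaDemo {
      public static void main(String[] args) throws InterruptedException {
          try (Arena arena = Arena.ofConfined()) {
              MemorySegment seg = arena.allocate(ValueLayout.JAVA_INT);
              seg.set(ValueLayout.JAVA_INT, 0, 42);      // OK: owning thread

              Thread other = new Thread(() -> {
                  try {
                      seg.get(ValueLayout.JAVA_INT, 0);  // cross-thread access
                  } catch (RuntimeException e) {
                      System.out.println("other thread: " + e);
                  }
              });
              other.start();
              other.join();
          }
      }
  }

If something like that is going on here, it would be consistent with only multi-chunk transfers failing: chunks with offset > 0 may be served on a different worker thread than the one that opened the leader-side file.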

@dblock (Member) commented Jan 6, 2025

[Catch All Triage - 1, 2, 3, 4]

dblock removed the untriaged label on Jan 6, 2025
@Lavisx commented Jan 9, 2025

I am experiencing the same problem. After upgrading from OpenSearch 2.12.0 to 2.18.0, I am unable to start replication. My index has a 10 GB shard, and it cannot be transferred.
