This repository has been archived by the owner on Oct 23, 2024. It is now read-only.

v1.4.6

@unterstein unterstein released this 10 Aug 15:07
· 1920 commits to master since this release

Overview

Changes around unreachableStrategy

Recent changes in Apache Mesos introduced the ability to handle intermittent connectivity to an agent that may be running a Marathon task, along with the TASK_UNREACHABLE task state. This allows a node to disconnect from and reconnect to the cluster without having its tasks replaced. As a result, with default configurations there is a delay of 75 seconds before Mesos notifies Marathon to replace the task. Previously, Marathon usually replaced a lost task in under a second.

It is now possible to configure unreachableStrategy for apps and pods to instantly replace unreachable apps or pods. To enable this behavior, you need to configure your app or pod as shown below:

{
  ...
  "unreachableStrategy": {
    "inactiveAfterSeconds": 0,
    "expungeAfterSeconds": 0
  },
  ...
}

Note: "Instantly" means as soon as Marathon becomes aware of the unreachable task. By default, Mesos notifies Marathon after 75 seconds that an agent is disconnected. You can change this duration in Mesos by configuring agent_ping_timeout and max_agent_ping_timeouts.
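
The 75-second figure is consistent with the product of the two Mesos settings named above, assuming the Mesos master defaults of 15 seconds for agent_ping_timeout and 5 for max_agent_ping_timeouts. A minimal Python sketch of that arithmetic follows; the function name is illustrative and not part of Mesos or Marathon.

```python
def unreachable_notification_delay(agent_ping_timeout_secs=15,
                                   max_agent_ping_timeouts=5):
    """Approximate seconds before Mesos reports an agent as unreachable.

    The default arguments are assumed to match the Mesos master defaults
    (15-second ping timeout, 5 missed pings tolerated).
    """
    return agent_ping_timeout_secs * max_agent_ping_timeouts


# With the assumed defaults this reproduces the 75 seconds quoted above.
print(unreachable_notification_delay())

# Hypothetical tuning: a 5-second timeout with 3 allowed misses.
print(unreachable_notification_delay(5, 3))
```

Lowering either value shortens the delay, at the cost of treating brief network interruptions as agent failures sooner.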

Migrating unreachableStrategy

If you want all of your apps and pods to adopt an UnreachableStrategy that retains the previous behavior, in which instances were immediately replaced, you can opt in to a migration step so that you do not have to update every single app definition.

To change the unreachableStrategy of all apps and pods, set the environment variable MIGRATION_1_4_6_UNREACHABLE_STRATEGY to true, which leads to the following behavior during migration:

When opting in to the unreachable migration step:

  1. all app and pod definitions that had a config of UnreachableStrategy(300 seconds, 600 seconds) (previous default) are migrated to have UnreachableStrategy(0 seconds, 0 seconds)
  2. all app and pod definitions that had a config of UnreachableStrategy(1 second, x seconds) are migrated to have UnreachableStrategy(0 seconds, x seconds)
  3. all app and pod definitions that had a config of UnreachableStrategy(1 second, 2 seconds) are migrated to have UnreachableStrategy(0 seconds, 0 seconds)
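
The three rules above can be sketched as a small Python function. This is an illustration, not Marathon's actual migration code: the UnreachableStrategy tuple and the migrate function are hypothetical names, and two points are assumptions because the rules do not state them. First, rule 3 is checked before rule 2, since (1 second, 2 seconds) also matches the pattern in rule 2. Second, any configuration not listed is left unchanged.

```python
from typing import NamedTuple


class UnreachableStrategy(NamedTuple):
    """Hypothetical stand-in for (inactiveAfterSeconds, expungeAfterSeconds)."""
    inactive_after_seconds: int
    expunge_after_seconds: int


PREVIOUS_DEFAULT = UnreachableStrategy(300, 600)
INSTANT = UnreachableStrategy(0, 0)


def migrate(strategy: UnreachableStrategy) -> UnreachableStrategy:
    # Rule 1: the previous default becomes instant replacement.
    if strategy == PREVIOUS_DEFAULT:
        return INSTANT
    # Rule 3, checked first (assumption: the more specific rule wins).
    if strategy == UnreachableStrategy(1, 2):
        return INSTANT
    # Rule 2: inactive-after of 1 second becomes 0; expunge-after is kept.
    if strategy.inactive_after_seconds == 1:
        return UnreachableStrategy(0, strategy.expunge_after_seconds)
    # Assumption: anything else is left untouched.
    return strategy


for before in [PREVIOUS_DEFAULT,
               UnreachableStrategy(1, 2),
               UnreachableStrategy(1, 30),
               UnreachableStrategy(10, 20)]:
    print(before, "->", migrate(before))
```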

Note: If you set this variable after upgrading to 1.4.6, it will have no effect. Also, the UnreachableStrategy default has not been changed, so in order for apps and pods created in the future to have the replace-instantly behavior, unreachableStrategy's inactiveAfterSeconds and expungeAfterSeconds must be set to 0 as seen in the JSON above.
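
Because the default is unchanged, new definitions need the two fields set explicitly. One way to avoid forgetting this is to apply the setting in whatever tooling generates your app definitions. The following Python sketch assumes definitions are handled as plain dictionaries before being submitted; the helper name and the example app are hypothetical.

```python
import copy
import json


def with_instant_replacement(app_definition: dict) -> dict:
    """Return a copy of the definition with both unreachableStrategy fields set to 0."""
    updated = copy.deepcopy(app_definition)
    updated["unreachableStrategy"] = {
        "inactiveAfterSeconds": 0,
        "expungeAfterSeconds": 0,
    }
    return updated


# Hypothetical app definition used only for illustration.
app = {"id": "/example-app", "cmd": "sleep 3600", "instances": 1}
print(json.dumps(with_instant_replacement(app), indent=2))
```

The helper copies its input rather than mutating it, so the original definition can still be compared against the updated one.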

Fixed issues

  • MARATHON-7681 - Fixes an issue in WorkQueue that could cause Marathon to drop exceptions and become unresponsive.
  • MARATHON-7653 - Fixes an issue in which Marathon could become unresponsive when pod status wasn't available.
  • MARATHON-7629 - Fixes an issue in which Marathon could, in certain situations, get into an infinite kill loop.
  • MARATHON-7469 - Fixes a bug in which a new leading Marathon would kill tasks launched by an ongoing deployment started under the former leader.
  • MARATHON-7472, MARATHON-7358 - Further improve deployment performance by removing unnecessary thread blocking.
  • MARATHON-7617 - Capture storage cache layer metrics, by category
  • MARATHON-7536 - Disable HTTP TRACE method in API
  • MARATHON-7566 - Fix a regression in which content-type was required for the ping endpoint
  • MARATHON-7334 - Fix a regression in which fetch[].destPath was ignored
  • Replace the lock in WorkQueue with a non-blocking concurrent data structure; the lock was evidently a source of some contention, judging by the thread dump from MARATHON-7400.
  • MARATHON-7433 - Fix a group deployment issue which would cause non-root groups without a nested group to be ignored.
  • MARATHON-7462 - Fix a race condition in WorkQueue which caused various issues, such as dropped events, and could cause components of Marathon to become unresponsive.
  • MARATHON-7628 - Added agentId to the pod status
  • MARATHON-7458 - Reset root group cache when elected as leader
  • MARATHON_EE-1590 - Change unreachableStrategy to be able to start instant replacement tasks
  • MARATHON_EE-1591 - Allow migration from previous UnreachableStrategy default