OTX D-Fine Detection Algorithm Integration (#4142)
* init

* remove convertbox

* Refactor D-FINE detector: remove unused components and update model configuration

* update

* update

* Update

* update recipes

* Add d-fine-m

* Fix recipes

* dfine-l

* Add dfine m - no aug

* format changes

* learnable params + disable teacher distillation

* update

* add recipes

* update

* update

* update recipes

* add dfine_hgnetv2_x

* Update recipes

* add tile DFine recipes

* update recipes and tile batch size

* update

* update LR

* DFine revert LR changes

* make multi-scale optional

* update tile recipes

* update tiling recipes

* add backbone pretrained weights

* update

* update

* loss

* update

* Update

* refactor d-fine criterion

* Fix docstring punctuation and remove unused aux_loss parameter in DFINETransformerModule; refactor DFineCriterion

* Update style changes

* conv batchnorm fuse

* update hybrid encoder

* Refactor DFINE HybridEncoderModule to improve code clarity and remove redundant parameters

* minor update

* Refactor D-FINE module structure by removing obsolete detector file and reorganizing imports

* Refactor import paths in D-FINE module and clean up unused code

* Refactor D-FINE module by removing commented code, cleaning up imports, and updating documentation

* Refactor D-FINE module by updating type hints, improving error messages, and enhancing documentation for RandomIoUCrop

* Refactor D-FINE module by improving the weighting function's return structure and updating type hints in DFINECriterion

* Update d-fine unit test

* Refactor D-FINE module by enhancing docstrings for clarity and updating parameter names for consistency

* Add D-Fine Detection Algorithm entries to CHANGELOG and object detection documentation

* Fix device assignment for positional embeddings in HybridEncoderModule

* Refactor D-FINE module by removing unused functions and integrating dfine_bbox2distance in DFINECriterion

* Update codeowners

* Add advanced parameters to optimization config in DFine model

* Remove DFINE M, S, N model configuration files

* disable tiling mem cache

* Update codeowners

* revert codeowner changes

* Remove unused DFINE model configurations from unit tests

* Add heavy unit test workflow and mark tests accordingly

* Add container configuration for Heavy-Unit-Test job in pre_merge.yaml

* Add additional transformations to D-Fine configuration and update test skips for unsupported models

* Reduce batch size and remove heavy markers from unit tests in test_tiling.py

* Revert "Add additional transformations to D-Fine configuration and update test skips for unsupported models"

This reverts commit d5c66f5.

* Revert "Reduce batch size and remove heavy markers from unit tests in test_tiling.py"

This reverts commit 563e033.

* Add additional transformations to D-Fine configuration in YAML files

* disable pytest heavy tag

* update

* Remove unused DFine-L model configurations and update unit tests

* Add DFine-X model template for class-incremental object detection

* Update docs/source/guide/explanation/algorithms/object_detection/object_detection.rst

Co-authored-by: Samet Akcay <[email protected]>

* Update copyright years from 2024 to 2025 in multiple files

* Rename heavy unit tests to intense unit tests and update related configurations

* Update container image in pre_merge.yaml for Intense-Unit-Test job

* update pre-merge

* update ubuntu container image

* update container image

* Add new object detection model configuration for DFine HGNetV2 X

* update image

* Update pre-merge workflow to use Ubuntu 24.04 and simplify unit test coverage reporting

* install sqlite

* Remove sudo from apt-get command in pre-merge workflow

* Remove sudo from apt-get command in pre-merge workflow

* Update pre-merge workflow to install additional dependencies and correct model name in converter

* Update detection configuration: increase warmup steps and patience, add min_lr, and remove unused callbacks

* Remove D-Fine model recipes from object detection documentation

* Skip tests for unsupported models: add check for D-Fine

* Skip tests for unsupported models: add check for D-Fine

* Skip tests for unsupported models: add check for DFine

* Refactor DFine model: remove unused checkpoint loading and update optimizer configuration documentation; change reg_scale to float in DFINETransformer.

---------

Co-authored-by: Samet Akcay <[email protected]>
eugene123tw and samet-akcay authored Jan 17, 2025
1 parent a6d5795 commit d663fd7
Showing 24 changed files with 3,736 additions and 11 deletions.
32 changes: 32 additions & 0 deletions .github/workflows/pre_merge.yaml
@@ -84,6 +84,38 @@ jobs:
curl -Os https://uploader.codecov.io/latest/linux/codecov
chmod +x codecov
./codecov -t ${{ secrets.CODECOV_TOKEN }} --sha $COMMIT_ID -U $HTTP_PROXY -f .tox/coverage_unit-test-${{ matrix.tox-env }}.xml -F ${{ matrix.tox-env }}
Intense-Unit-Test:
runs-on: [otx-gpu-a10g-1]
container:
image: "ubuntu:24.04"
needs: Code-Quality-Checks
timeout-minutes: 120
strategy:
fail-fast: false
matrix:
include:
- python-version: "3.10"
tox-env: "py310"
- python-version: "3.11"
tox-env: "py311"
name: Intense-Unit-Test-with-Python${{ matrix.python-version }}
steps:
- name: Install dependencies
run: apt-get update && apt-get install -y libsqlite3-0 libsqlite3-dev libgl1 libglib2.0-0
- name: Checkout repository
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
- name: Install Python
uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0
with:
python-version: ${{ matrix.python-version }}
- name: Install tox
run: |
python -m pip install --require-hashes --no-deps -r .ci/requirements.txt
pip-compile --generate-hashes --output-file=/tmp/requirements.txt --extra=ci_tox pyproject.toml
python -m pip install --require-hashes --no-deps -r /tmp/requirements.txt
rm /tmp/requirements.txt
- name: Run unit test
run: tox -vv -e intense-unit-test-${{ matrix.tox-env }}
Integration-Test:
if: |
github.event.pull_request.draft == false &&
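The new job runs the heavyweight tests through dedicated intense-unit-test-* tox environments. As a sketch of the same gating expressed directly in pytest (the RUN_INTENSE_TESTS switch and the skip logic are illustrative assumptions, not what the repository ships):

# conftest.py sketch: skip tests marked "intense" unless explicitly enabled.
import os

import pytest


def pytest_collection_modifyitems(config, items):
    if os.environ.get("RUN_INTENSE_TESTS") == "1":  # hypothetical opt-in switch
        return
    skip_intense = pytest.mark.skip(reason="intense tests run only on the dedicated CI job")
    for item in items:
        if "intense" in item.keywords:
            item.add_marker(skip_intense)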
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -22,6 +22,8 @@ All notable changes to this project will be documented in this file.
(<https://github.com/openvinotoolkit/training_extensions/pull/3979>)
- Add OpenVINO inference for 3D Object Detection task
(<https://github.com/openvinotoolkit/training_extensions/pull/4017>)
- Add D-Fine Detection Algorithm
(<https://github.com/openvinotoolkit/training_extensions/pull/4142>)

### Enhancements

docs/source/guide/explanation/algorithms/object_detection/object_detection.rst
@@ -73,6 +73,8 @@ We support the following ready-to-use model recipes:
+------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+---------------------+-----------------+
| `Object_Detection_ResNeXt101_ATSS <https://github.com/openvinotoolkit/training_extensions/blob/develop/src/otx/recipe/detection/atss_resnext101.yaml>`_ | ResNeXt101-ATSS | 434.75 | 344.0 |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+---------------------+-----------------+
| `D-Fine X Detection <https://github.com/openvinotoolkit/training_extensions/blob/develop/src/otx/recipe/detection/dfine_x.yaml>`_ | D-Fine X | 202.486 | 240.0 |
+------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------+---------------------+-----------------+

The above table can be generated using the following command

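For orientation, a minimal training sketch for the new D-Fine X recipe using the OTX Engine API; the Engine.from_config entry point and the data_root path are assumptions for illustration, not part of this diff:

from otx.engine import Engine

# Sketch under assumptions: recipe path as listed in the table above; data_root is a placeholder.
engine = Engine.from_config(
    config_path="src/otx/recipe/detection/dfine_x.yaml",
    data_root="data/my_detection_dataset",
)
engine.train(max_epochs=1)  # short smoke run; real training uses the recipe's own schedule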
3 changes: 2 additions & 1 deletion pyproject.toml
@@ -398,6 +398,7 @@ convention = "google"
markers = [
"gpu", # mark tests which require NVIDIA GPU
"cpu",
"xpu", # mark tests which require Intel dGPU
"xpu", # mark tests which require Intel dGPU,
"intense", # intense unit tests which require better CI machines
]
python_files = "tests/**/*.py"
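The intense marker is what the Intense-Unit-Test job selects via the intense-unit-test-* tox environments added above. A minimal sketch of how a test opts in (the test name and body are illustrative):

import pytest


@pytest.mark.intense  # collected only where intense tests are enabled
def test_dfine_x_tiling_end_to_end():
    # Placeholder body: a real test would build the D-Fine model and assert on its outputs.
    assert True

Ordinary local runs can deselect these with pytest -m "not intense".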
148 changes: 147 additions & 1 deletion src/otx/algo/common/layers/transformer_layers.py
@@ -1,4 +1,4 @@
# Copyright (C) 2024-2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
#
"""Implementation of common transformer layers."""
@@ -10,6 +10,7 @@
from typing import Callable

import torch
import torch.nn.functional as f
from otx.algo.common.utils.utils import get_clones
from otx.algo.modules.transformer import deformable_attention_core_func
from torch import Tensor, nn
@@ -306,6 +307,151 @@ def forward(
return self.output_proj(output)


class MSDeformableAttentionV2(nn.Module):
"""Multi-Scale Deformable Attention Module V2.
Note:
This is different from vanilla MSDeformableAttention where it uses
distinct number of sampling points for features at different scales.
Refer to RTDETRv2.
Args:
embed_dim (int): The number of expected features in the input.
num_heads (int): The number of heads in the multiheadattention models.
num_levels (int): The number of levels in MSDeformableAttention.
num_points_list (list[int]): Number of distinct points for each layer. Defaults to [3, 6, 3].
"""

def __init__(
self,
embed_dim: int = 256,
num_heads: int = 8,
num_levels: int = 4,
num_points_list: list[int] = [3, 6, 3], # noqa: B006
) -> None:
super().__init__()
self.embed_dim = embed_dim
self.num_heads = num_heads
self.num_levels = num_levels
self.num_points_list = num_points_list

num_points_scale = [1 / n for n in num_points_list for _ in range(n)]
self.register_buffer(
"num_points_scale",
torch.tensor(num_points_scale, dtype=torch.float32),
)

self.total_points = num_heads * sum(num_points_list)
self.head_dim = embed_dim // num_heads

self.sampling_offsets = nn.Linear(embed_dim, self.total_points * 2)
self.attention_weights = nn.Linear(embed_dim, self.total_points)

self._reset_parameters()

def _reset_parameters(self) -> None:
"""Reset parameters of the model."""
init.constant_(self.sampling_offsets.weight, 0)
thetas = torch.arange(self.num_heads, dtype=torch.float32) * (2.0 * math.pi / self.num_heads)
grid_init = torch.stack([thetas.cos(), thetas.sin()], -1)
grid_init = grid_init / grid_init.abs().max(-1, keepdim=True).values # noqa: PD011
grid_init = grid_init.reshape(self.num_heads, 1, 2).tile([1, sum(self.num_points_list), 1])
scaling = torch.concat([torch.arange(1, n + 1) for n in self.num_points_list]).reshape(1, -1, 1)
grid_init *= scaling
self.sampling_offsets.bias.data[...] = grid_init.flatten()

# attention_weights
init.constant_(self.attention_weights.weight, 0)
init.constant_(self.attention_weights.bias, 0)

def forward(
self,
query: Tensor,
reference_points: Tensor,
value: Tensor,
value_spatial_shapes: list[list[int]],
) -> Tensor:
"""Forward function of MSDeformableAttention.
Args:
query (Tensor): [bs, query_length, C]
reference_points (Tensor): [bs, query_length, n_levels, 2], range in [0, 1], top-left (0,0),
bottom-right (1, 1), including padding area
value (Tensor): [bs, value_length, C]
value_spatial_shapes (List): [n_levels, 2], [(H_0, W_0), (H_1, W_1), ..., (H_{L-1}, W_{L-1})]
Returns:
output (Tensor): [bs, Length_{query}, C]
"""
bs, len_q = query.shape[:2]
_, n_head, c, _ = value[0].shape
num_points_list = self.num_points_list

sampling_offsets = self.sampling_offsets(query).reshape(
bs,
len_q,
self.num_heads,
sum(self.num_points_list),
2,
)

attention_weights = self.attention_weights(query).reshape(
bs,
len_q,
self.num_heads,
sum(self.num_points_list),
)
attention_weights = f.softmax(attention_weights, dim=-1)

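        # Two reference-point layouts: last dim 2 means normalized point centers, so offsets
        # are divided by each level's (W, H); last dim 4 means (cx, cy, w, h) boxes, so offsets
        # are scaled by half the box size and the per-point num_points_scale buffer.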
if reference_points.shape[-1] == 2:
offset_normalizer = torch.tensor(value_spatial_shapes)
offset_normalizer = offset_normalizer.flip([1]).reshape(1, 1, 1, self.num_levels, 1, 2)
sampling_locations = (
reference_points.reshape(
bs,
len_q,
1,
self.num_levels,
1,
2,
)
+ sampling_offsets / offset_normalizer
)
elif reference_points.shape[-1] == 4:
num_points_scale = self.num_points_scale.to(query).unsqueeze(-1)
offset = sampling_offsets * num_points_scale * reference_points[:, :, None, :, 2:] * 0.5
sampling_locations = reference_points[:, :, None, :, :2] + offset
else:
            msg = f"Last dim of reference_points must be 2 or 4, but got {reference_points.shape[-1]} instead."
raise ValueError(msg)

        # Map sampling locations from [0, 1] to grid_sample's [-1, 1] coordinate range.
sampling_grids = 2 * sampling_locations - 1

sampling_grids = sampling_grids.permute(0, 2, 1, 3, 4).flatten(0, 1)
sampling_locations_list = sampling_grids.split(num_points_list, dim=-2)

sampling_value_list = []
for level, (h, w) in enumerate(value_spatial_shapes):
value_l = value[level].reshape(bs * n_head, c, h, w)
sampling_grid_l = sampling_locations_list[level]
sampling_value_l = f.grid_sample(
value_l,
sampling_grid_l,
mode="bilinear",
padding_mode="zeros",
align_corners=False,
)

sampling_value_list.append(sampling_value_l)

attn_weights = attention_weights.permute(0, 2, 1, 3).reshape(bs * n_head, 1, len_q, sum(num_points_list))
weighted_sample_locs = torch.concat(sampling_value_list, dim=-1) * attn_weights
output = weighted_sample_locs.sum(-1).reshape(bs, n_head * c, len_q)

return output.permute(0, 2, 1)


class VisualEncoderLayer(nn.Module):
"""VisualEncoderLayer module consisting of MSDeformableAttention and feed-forward network.
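To make the tensor contract concrete, here is a self-contained smoke run of MSDeformableAttentionV2 using box-form reference points; the import path follows the file shown in this diff, and all sizes are illustrative:

import torch

from otx.algo.common.layers.transformer_layers import MSDeformableAttentionV2

bs, len_q, embed_dim, num_heads = 2, 100, 256, 8
head_dim = embed_dim // num_heads
spatial_shapes = [[32, 32], [16, 16], [8, 8]]  # (H_l, W_l) for each level

attn = MSDeformableAttentionV2(embed_dim=embed_dim, num_heads=num_heads, num_levels=3)

query = torch.randn(bs, len_q, embed_dim)
# Per-level values, each [bs, num_heads, head_dim, H_l * W_l], matching forward()'s unpacking.
value = [torch.randn(bs, num_heads, head_dim, h * w) for h, w in spatial_shapes]
# Box-form (cx, cy, w, h) reference points in [0, 1]; the level axis broadcasts, hence size 1.
reference_points = torch.rand(bs, len_q, 1, 4)

output = attn(query, reference_points, value, spatial_shapes)
print(output.shape)  # torch.Size([2, 100, 256])

This exercises the 4-dim reference-point branch; the 2-dim branch instead expects [bs, len_q, n_levels, 2] point coordinates.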