Demo video: demo_mse_cnn.mp4
Code repository with an implementation of the MSE-CNN [1]. Besides the code, the dataset and the coefficients obtained after training are provided.
- MSE-CNN Implementation
Apply the MSE-CNN to your own pictures with the demo here: https://kevinmevin-demo-mse-cnn.hf.space
>>> import torch
>>> import msecnn
>>> import train_model_utils
>>>
>>> # Initialize parameters
>>> path_to_folder_with_model_params = "model_coefficients/best_coefficients"
>>> device = "cuda:0"
>>> qp = 32 # Quantisation Parameter
>>>
>>> # Initialize Model
>>> stg1_2 = msecnn.MseCnnStg1(device=device, QP=qp).to(device)
>>> stg3 = msecnn.MseCnnStgX(device=device, QP=qp).to(device)
>>> stg4 = msecnn.MseCnnStgX(device=device, QP=qp).to(device)
>>> stg5 = msecnn.MseCnnStgX(device=device, QP=qp).to(device)
>>> stg6 = msecnn.MseCnnStgX(device=device, QP=qp).to(device)
>>> model = (stg1_2, stg3, stg4, stg5, stg6)
>>>
>>> model = train_model_utils.load_model_parameters_eval(model, path_to_folder_with_model_params, device)
>>>
>>> # Loss function
>>> loss_fn = msecnn.LossFunctionMSE()
>>>
>>> # Path to labels
>>> l_path_val = "example_data/stg2"
>>>
>>> # Random CTU and labels
>>> CTU = torch.rand(1, 1, 128, 128).to(device)
>>> CTU
tensor([[[[0.9320, 0.6777, 0.4490, ..., 0.0413, 0.6278, 0.5375],
[0.3544, 0.5620, 0.8339, ..., 0.6420, 0.2527, 0.3104],
[0.0555, 0.4991, 0.9972, ..., 0.3898, 0.1169, 0.1661],
...,
[0.9452, 0.3566, 0.9825, ..., 0.3941, 0.7534, 0.8656],
[0.3839, 0.8459, 0.4369, ..., 0.9569, 0.2609, 0.6421],
[0.1734, 0.7182, 0.8074, ..., 0.2122, 0.7573, 0.2492]]]])
>>> cu_pos = torch.tensor([[0, 0]]).to(device)
>>> cu_size = torch.tensor([[64, 64]]).to(device) # Size of the CU of the second stage
>>> split_label = torch.tensor([[1]]).to(device)
>>> RDs = torch.rand(1, 6).to(device) * 10_000
>>> RDs
tensor([[1975.6646, 2206.7600, 1570.3577, 3570.9478, 6728.2612, 527.9994]])
>>> # Compute prediction for stages 1 and 2
>>> pred1_2, CUs, ap = model[0](CTU, cu_size, cu_pos) # Pass CU through network
>>> pred1_2
tensor([[9.9982e-01, 1.8124e-04, 9.9010e-21, 5.9963e-29, 1.9118e-24, 1.0236e-25]],
grad_fn=<SoftmaxBackward0>)
>>> CUs.shape
torch.Size([1, 16, 64, 64])
>>>
>>> # Compute the loss
>>> loss, loss_CE, loss_RD = loss_fn(pred1_2, split_label, RDs)
>>> loss
tensor(177.1340, grad_fn=<AddBackward0>)
>>> loss_CE
tensor(174.3921, grad_fn=<NegBackward0>)
>>> loss_RD
tensor(2.7419, grad_fn=<MeanBackward1>)
The emergence of technologies that provide new audiovisual experiences, such as 360-degree video, virtual reality, augmented reality, 4K, 8K UHD and 16K, together with the rise of video traffic on the web, shows the current demand for video data in the modern world. This pressure created the need for new coding standards and led to the development of Versatile Video Coding (VVC). Despite the advancements achieved with this standard, its complexity has increased considerably. The new partitioning scheme is responsible for the majority of the increase in encoding time, which is tied to the optimisation of the Rate-Distortion cost (RD cost). In short, although VVC offers higher compression rates, its encoding complexity is high.
In light of this, the Multi-Stage Exit Convolutional Neural Network (MSE-CNN) was developed. This Deep Learning-based model is organised in a sequential structure with several stages. Each stage, which represents a different partition depth, encompasses a set of layers for extracting features from a Coding Tree Unit (CTU) and deciding how to partition it. Instead of using recursive approaches to determine the optimal way to fragment an image, this model allows VVC to estimate the most appropriate partitioning directly. This work presents an implementation of the MSE-CNN that employs training procedures distinct from those of the original implementation, as well as the ground-truth used to train and validate the model and an interpretation of the work done by the MSE-CNN's original creators.
The key objective of partitioning is to divide frames into pieces in a way that reduces the RD cost. To achieve a good balance between quality and bitrate, numerous combinations of image fragments must be tested, which is computationally expensive; it is this intensive search that makes high compression rates attainable. Partitioning therefore contributes heavily to both the complexity and the compression gains of VVC. H.266 (VVC) organises a video sequence into frames that are divided into smaller pieces: pictures are first split into Coding Tree Units (CTUs), which are then divided into Coding Units (CUs). For the luma channel, the largest CTU size in VVC is 128x128 and the smallest CU size is 4x4. A quad-tree (QT) is initially applied to the CTUs at the first level, and then a quad-tree with nested multi-type tree (QMTT) is applied recursively.
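As a concrete illustration of the first step of this hierarchy, the sketch below (not taken from the repository) tiles a random luma frame into 128x128 CTUs with PyTorch; the frame dimensions are arbitrary placeholders.

```python
import torch

# Minimal sketch: tile a single-channel (luma) frame into 128x128 CTUs.
# The frame size is illustrative; an encoder pads frames whose dimensions
# are not multiples of the CTU size.
frame = torch.rand(1, 1, 1280, 768)            # (batch, channel, height, width)
ctu_size = 128
ctus = frame.unfold(2, ctu_size, ctu_size).unfold(3, ctu_size, ctu_size)
ctus = ctus.contiguous().view(-1, 1, ctu_size, ctu_size)
print(ctus.shape)                               # torch.Size([60, 1, 128, 128])
```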
The QMTT innovation makes it possible to split CUs into different rectangular forms. Splitting a CU into:
- three rectangles with a 1:2:1 ratio results in a ternary tree (TT), with the centre rectangle being half the size of the original CU; when applied horizontally it is called a horizontal ternary tree (HTT), and when applied vertically a vertical ternary tree (VTT).
- two rectangles results in a binary tree (BT) partition, a block with two symmetrical structures; as with the TT, depending on the direction of the split, it is called either a vertical binary tree (VBT) or a horizontal binary tree (HBT).
The combination of BT and TT is named a multi-type tree (MTT). The introduction of BT and TT partitions enables the creation of many new block shapes, with heights and widths that can be any combination of 128, 64, 32, 16, 8 and 4. The increased number of possible CUs boosts the ability of the codec to fragment an image more efficiently, allowing better predictions and higher compression. Although the standard gains these advantages, as a downside it takes longer to encode.
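To make the split geometries concrete, the short sketch below (illustrative only, not part of the repository's code) computes the child CU sizes produced by each of the split modes just described.

```python
# Given a CU's width and height, return the child CU sizes produced by each
# VVC split mode described above (sizes as (width, height) tuples).
def child_cu_sizes(width, height, split_mode):
    if split_mode == "QT":    # quad-tree: four square quadrants
        return [(width // 2, height // 2)] * 4
    if split_mode == "HBT":   # horizontal binary tree: two halves stacked vertically
        return [(width, height // 2)] * 2
    if split_mode == "VBT":   # vertical binary tree: two halves side by side
        return [(width // 2, height)] * 2
    if split_mode == "HTT":   # horizontal ternary tree: 1:2:1 ratio along the height
        return [(width, height // 4), (width, height // 2), (width, height // 4)]
    if split_mode == "VTT":   # vertical ternary tree: 1:2:1 ratio along the width
        return [(width // 4, height), (width // 2, height), (width // 4, height)]
    return [(width, height)]  # Non-split

print(child_cu_sizes(128, 128, "QT"))   # [(64, 64), (64, 64), (64, 64), (64, 64)]
print(child_cu_sizes(64, 64, "HTT"))    # [(64, 16), (64, 32), (64, 16)]
```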
The Multi-Stage Exit Convolutional Neural Network (MSE-CNN) is a DL model that predicts CU partitions in a waterfall (top-down) architecture. The structure takes a CTU as input, extracts features from it, splits the CU into one of at most six possible partitions (Non-split, QT, HBT, VBT, HTT and VTT), and sends the result to the next stage. The first stage receives CTUs as input, in either the chroma or the luma channel, while subsequent stages receive feature maps. Each stage outputs both feature maps and a split decision. If a stage returns Non-split, the partitioning of that CU ends immediately.
This model is composed of the following blocks:
- Initially, the model adds more channels to its input in order to create more features; this is accomplished with simple convolutional layers.
- To extract further characteristics from the data, the information is then passed through a series of convolutional layers named Conditional Convolution.
- Finally, a last layer determines the optimal manner of partitioning the CU; this layer is a blend of fully connected and convolutional layers.
Note: for more details regarding these layers, check [1].
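The snippet below is a self-contained conceptual sketch of this multi-stage exit idea, not the repository's implementation (see msecnn.py for that): each toy stage extracts features with plain convolutions, a small head predicts one of the six split modes, and processing stops as soon as a stage outputs Non-split. In the real MSE-CNN the feature maps are also split according to the chosen partition before being passed on.

```python
import torch
import torch.nn as nn

SPLIT_MODES = ["Non-split", "QT", "HBT", "VBT", "HTT", "VTT"]

class ToyStage(nn.Module):
    """Simplified stage: feature extraction plus a 6-way split decision."""
    def __init__(self, channels=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, len(SPLIT_MODES)), nn.Softmax(dim=1),
        )

    def forward(self, x):
        feat = self.features(x)
        return feat, self.head(feat)

# The first stage lifts the 1-channel CTU to more feature channels; later
# stages consume the feature maps produced by the previous stage.
stem = nn.Conv2d(1, 16, 3, padding=1)
stages = [ToyStage() for _ in range(5)]

x = stem(torch.rand(1, 1, 128, 128))
for depth, stage in enumerate(stages, start=2):
    x, probs = stage(x)
    mode = SPLIT_MODES[probs.argmax(dim=1).item()]
    print(f"stage {depth}: predicted split = {mode}")
    if mode == "Non-split":   # early exit: stop partitioning this CU
        break
```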
The loss developed for the MSE-CNN combines two other functions:

$$L = L_{CE} + L_{RD}$$

In the above equation, $L_{CE}$ is a cross-entropy member computed between the predicted split probabilities and the ground-truth split, and $L_{RD}$ is a rate-distortion member (in the example above, 177.1340 = 174.3921 + 2.7419). Concerning the second member of the MSE-CNN loss function, this constituent gives the network the ability to also make predictions based on the RD cost. The RD costs weight the predicted probability of each split mode, which ensures that CU partitions with erroneously high predicted probability values or greater RD cost values are penalised more heavily.
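The sketch below mirrors that two-member structure only; it is not the exact formula used by msecnn.LossFunctionMSE or by [1], and the RD normalisation chosen here is an assumption made for illustration.

```python
import torch

def toy_mse_cnn_loss(pred, split_label, rd_costs):
    """Illustrative loss with the two members described above.

    pred:        (N, 6) split-mode probabilities produced by a stage
    split_label: (N, 1) index of the ground-truth split mode
    rd_costs:    (N, 6) RD cost of each split mode
    """
    # Cross-entropy member: penalise low probability on the ground-truth split.
    p_true = pred.gather(1, split_label)
    loss_ce = -torch.log(p_true.clamp_min(1e-12)).mean()

    # RD member: weight each predicted probability by its normalised RD cost,
    # so wrongly favoured splits with high RD cost are penalised more.
    rd_norm = rd_costs / rd_costs.min(dim=1, keepdim=True).values.clamp_min(1e-12)
    loss_rd = (pred * (rd_norm - 1.0)).sum(dim=1).mean()

    return loss_ce + loss_rd, loss_ce, loss_rd
```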
The strategy used to train the MSE-CNN was very similar to the one used in [1]. The first parts of the model to be trained were the first and second stages, in which 64x64 CUs were passed through the second depth. Afterwards, transfer learning was used to pass certain coefficients of the second stage to the third, and the third stage was then trained with 32x32 CUs flowing through it. A similar process was applied to the following stages. It is worth noting that, beginning with stage 4, CUs of various shapes reach the model's input, meaning those stages were fed different kinds of CUs.
At the end of training, six models were obtained, one for each partitioning depth in the luma channel. Although models could be created for the luma and chroma channels and for every possible CU shape, rather than just for each depth, only six were trained so that the model's behaviour could be assessed in a simpler and more understandable configuration.
Due to the deterministic nature of the first stage, where CTUs are always partitioned with a QT, it was implemented together with the second stage. If it had been implemented separately, the first two stages would have had to be trained at the same time; consequently, two distinct optimisers would have been needed, which could result in unpredictable training behaviour.
When implementing the sub-networks in code, those meant to handle varying CU sizes were further split into separate implementations. For example, for the sub-network used when the minimum width or height is 32, two variants of the first two layers were built, because both 64x32 and 32x32 CUs can flow through this block. For this reason, the first two layers were implemented separately from the rest of the block and were then combined with the remaining layers according to the dimensions of the input CU. The same procedure was followed for the other types of sub-networks.
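The following sketch outlines this stage-wise schedule with transfer learning between consecutive stages. It reuses the ToyStage module and toy loss from the sketches above; the dataloaders and the granularity of the copied coefficients are assumptions, not the procedure from the repository (see [2] for that).

```python
import torch

def train_stage(stage, dataloader, loss_fn, epochs=1, lr=1e-3, device="cpu"):
    """Train a single (toy) stage on batches of (cu, split_label, rd_costs)."""
    optimiser = torch.optim.Adam(stage.parameters(), lr=lr)
    stage.train().to(device)
    for _ in range(epochs):
        for cu, split_label, rd_costs in dataloader:
            _, pred = stage(cu.to(device))
            loss, _, _ = loss_fn(pred, split_label.to(device), rd_costs.to(device))
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
    return stage

# Stage-by-stage schedule: train stage N, then seed stage N+1 with its
# coefficients before training it on the CUs of the next depth.
# `loaders_per_depth` is a hypothetical list of dataloaders, one per depth.
# for depth, loader in enumerate(loaders_per_depth):
#     train_stage(stages[depth], loader, toy_mse_cnn_loss)
#     if depth + 1 < len(stages):
#         stages[depth + 1].load_state_dict(stages[depth].state_dict())  # transfer learning
```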
When the network was being trained, some of the RD costs in the input data had very high values. Consequently, the RD loss skyrocketed, resulting in extremely large gradients during training. As a result, the maximum RD cost was clipped to a hard-coded upper bound.
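A minimal sketch of that safeguard is shown below; MAX_RD_COST is a hypothetical placeholder, since the actual bound is defined in the repository's source.

```python
import torch

MAX_RD_COST = 1e5                           # hypothetical upper bound
rd_costs = torch.rand(1, 6) * 1e9           # example batch with extreme RD costs
rd_costs = rd_costs.clamp(max=MAX_RD_COST)  # cap outliers before computing the loss
```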
Please see this page to better understand the dataset and to access it.
Since it was verified that the Rate-Distortion Loss,
Stage | F1-Score | Recall | Precision |
---|---|---|---|
Stage 2 | 0.9111 | 0.9111 | 0.9112 |
Stage 3 | 0.5624 | 0.5767 | 0.5770 |
Stage 4 | 0.4406 | 0.4581 | 0.4432 |
Stage 5 | 0.5143 | 0.5231 | 0.5184 |
Stage 6 | 0.7282 | 0.7411 | 0.7311 |
Results with weighted average for F1-score, recall and precision.
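These weighted averages can be reproduced from per-CU predictions with scikit-learn, as in the hedged sketch below; the label arrays are hypothetical stand-ins for the ground-truth and predicted split modes (0-5).

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 1, 1, 2, 5, 3]   # hypothetical ground-truth split modes
y_pred = [0, 1, 2, 2, 5, 1]   # hypothetical predicted split modes
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"F1={f1:.4f}  Recall={recall:.4f}  Precision={precision:.4f}")
```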
Metric | VTM-7.0 | VTM-7.0+Model | Gain |
---|---|---|---|
Bitrate | 3810.192 kbps | 4069.392 kbps | 6.80% |
Y-PSNR | 35.7927 dB | 35.5591 dB | -0.65% |
Complexity | 1792.88 s | 1048.95 s | -41.49% |
These results were obtained with the "medium" configuration for the multi-thresholding method.
Folder | Description |
---|---|
dataset | This folder contains all of the dataset and all of the data that was processed in order to obtain it |
example_data | Here you can find example data that is used by the scripts in the usefull_scripts folder |
model_coefficients | The last coefficients obtained during training, as well as the best ones in terms of F1-score on the test data |
src | Source code with the implementation of the MSE-CNN and also useful code and examples |
Files | Description |
---|---|
constants.py | Constant values used in other python files |
custom_dataset.py | Dataset class to handle the files with the ground-truth information, as well as other useful classes that work together with the aforementioned class |
dataset_utils.py | Functions to manipulate and process the data; also contains functions to interact with YUV files |
msecnn.py | MSE-CNN and Loss Function classes implementation |
train_model_utils.py | Useful functions to be used during training or evaluation of the artificial neural network |
utils.py | Other functions that are useful not to the model directly but to the code implementation itself |
In order to explore this project, you first need to install the libraries used in it.
For this, please follow the steps below:
1. Create a virtual environment in which to install the libraries; follow this link in case you don't know how to do it. You may also need to install pip.
2. Run the following command:
pip install -r requirements.txt
This will install all of the libraries referenced in the requirements.txt file. When you have finished using the package or working on your project, you can deactivate the virtual environment:
$ deactivate
This command exits the virtual environment and returns you to your normal command prompt.
3. Enjoy! :)
- Locate the `dist` folder in your project's root directory. This folder contains the package distributions, including the source distribution (`*.tar.gz` file) and the wheel distribution (`*.whl` file).
- Install the package using one of the following methods:
  - Install the source distribution:
    pip install dist/msecnn_raulkviana-1.0.tar.gz
  - Install the wheel distribution:
    pip install dist/msecnn_raulkviana-1.0.whl
- Once the package is installed, you can import and use its functionalities in your Python code.
The documentation can be found by following this link.
Feel free to contact me through this email, or create either an issue or a pull request to contribute to this project ^^.
This project is licensed under the MIT License.
Task | Description | Status (d - doing, w - waiting, f - finished) |
---|---|---|
Implement code to test functions | Use a library, such as Pytest, to test some functions from the many modules developed | w |
Update documentation regarding training each stage | Create documentation for the training and data-processing pipeline of each stage. Also create a simple script that can automate these steps, for ease of use | w |
[1] T. Li, M. Xu, R. Tang, Y. Chen, and Q. Xing, “DeepQTMT: A Deep Learning Approach for Fast QTMT-Based CU Partition of Intra-Mode VVC,” IEEE Transactions on Image Processing, vol. 30, pp. 5377–5390, 2021, doi: 10.1109/tip.2021.3083447.
[2] R. K. Viana, “Deep learning architecture for fast intra-mode CUs partitioning in VVC,” Universidade de Aveiro, Nov. 2022.