SLAM-AAC

SLAM-AAC is an LLM-based framework for the Automated Audio Captioning (AAC) task. Inspired by techniques from machine translation and ASR, the model enhances audio captioning by incorporating paraphrasing augmentation and a plug-and-play CLAP-Refine strategy. For more details, please refer to the paper.

Model Architecture

SLAM-AAC uses EAT as the audio encoder and Vicuna-7B as the LLM decoder. During training, only the Linear Projector and LoRA modules are trainable. For inference, multiple candidates are generated using different beam sizes, which are then refined using the CLAP-Refine strategy.

Performance and checkpoints

Pre-trained and fine-tuned checkpoints for the Clotho and AudioCaps datasets are available. These checkpoints include the Linear Projector and LoRA modules. Ensure proper setup of the corresponding environments (e.g., EAT) before use.

Pre-training

SLAM-AAC was pre-trained on AudioCaps, Clotho, WavCaps, and MACS datasets. For more information on these datasets, you can refer to this repository. Additionally, the Clotho dataset was augmented using a back-translation-based paraphrasing technique.

Audio Encoder	LLM	Checkpoint	Pre-training Dataset
EAT-base (fine-tuned)	vicuna-7b-v1.5	link	AudioCaps, Clotho, WavCaps, MACS
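
The back-translation-based paraphrasing mentioned above can be illustrated with a short sketch. This is not the pipeline used to build the released data: the `translate` callable, its signature, and the pivot language codes are hypothetical placeholders for whatever translation system you have available. The sketch only shows the general idea of translating a caption into a pivot language and back, then keeping results that differ from the original.

```python
from typing import Callable, Iterable, List

# Hypothetical translator signature: (text, source_lang, target_lang) -> text.
# Plug in any translation backend; none is assumed here.
Translator = Callable[[str, str, str], str]


def back_translate(
    caption: str,
    translate: Translator,
    pivots: Iterable[str],
    source_lang: str = "en",
) -> List[str]:
    """Return paraphrases of `caption` obtained by round-trip translation."""
    paraphrases: List[str] = []
    # Track normalized text so the original and repeats are not kept.
    seen = {caption.strip().lower()}
    for pivot in pivots:
        forward = translate(caption, source_lang, pivot)
        restored = translate(forward, pivot, source_lang).strip()
        key = restored.lower()
        if restored and key not in seen:
            seen.add(key)
            paraphrases.append(restored)
    return paraphrases
```

Each returned paraphrase could then be written to the training manifest as an extra entry that shares the `source` audio path of the original caption.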

Fine-tuning

We fine-tuned the pre-trained model separately on the Clotho and AudioCaps datasets. The final evaluation was conducted on captions generated with the CLAP-Refine decoding strategy.

Dataset	Audio Encoder	LLM	Checkpoint	METEOR	CIDEr	SPICE	SPIDEr	SPIDEr-FL	FENSE
Clotho	EAT-base (fine-tuned)	vicuna-7b-v1.5	link	19.7	51.5	14.8	33.2	33.0	54.0
AudioCaps	EAT-base (fine-tuned)	vicuna-7b-v1.5	link	26.8	84.1	19.4	51.8	51.5	66.8
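
As a quick sanity check when reading the table, recall that SPIDEr is defined as the arithmetic mean of CIDEr and SPICE. The snippet below recomputes it from the reported CIDEr and SPICE values; the helper name `spider` is ours, and the tolerance accounts for the table being rounded to one decimal place.

```python
def spider(cider: float, spice: float) -> float:
    """SPIDEr is the arithmetic mean of CIDEr and SPICE."""
    return (cider + spice) / 2.0


# (CIDEr, SPICE, reported SPIDEr) copied from the table above.
reported = {
    "Clotho": (51.5, 14.8, 33.2),
    "AudioCaps": (84.1, 19.4, 51.8),
}

for dataset, (cider, spice, expected) in reported.items():
    value = spider(cider, spice)
    # Table entries are rounded to one decimal, so allow a small tolerance.
    assert abs(value - expected) < 0.06, dataset
    print(f"{dataset}: recomputed SPIDEr = {value:.2f} (reported {expected})")
```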

Data preparation

Ensure your jsonl data follows this format:

{"key": "Y7fmOlUlwoNg_1", "source": "/root/data/AudioCaps/waveforms/test/Y7fmOlUlwoNg.wav", "target": "Constant rattling noise and sharp vibrations"}
{"key": "Y6BJ455B1aAs_1", "source": "/root/data/AudioCaps/waveforms/test/Y6BJ455B1aAs.wav", "target": "A rocket flies by followed by a loud explosion and fire crackling as a truck engine runs idle"}

In addition, you can refer to the manifest file we've provided, which includes the Clotho dataset enhanced with paraphrasing augmentation as a bonus.
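
Before training, it can help to check that a manifest matches the format shown above. The following is a minimal sketch, not part of the repository: the function name `parse_manifest` is hypothetical, and it assumes that every non-empty line is a JSON object containing the `key`, `source`, and `target` fields from the example.

```python
import json
from typing import Dict, Iterable, List

# Field names taken from the example entries above.
REQUIRED_FIELDS = ("key", "source", "target")


def parse_manifest(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse jsonl lines and check that each entry has the required fields."""
    entries: List[Dict[str, str]] = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError(f"line {lineno}: expected a JSON object")
        missing = [name for name in REQUIRED_FIELDS if name not in record]
        if missing:
            raise ValueError(f"line {lineno}: missing fields {missing}")
        entries.append(record)
    return entries
```

An open file can be passed directly, since iterating over a text file yields its lines, for example `parse_manifest(open(path, encoding="utf-8"))` with `path` pointing at your own manifest.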

Model Training

To pre-train the SLAM-AAC model on the pre-training datasets, run the following command:

# Pre-train the model
bash scripts/pretrain.sh

You can fine-tune the model on the AudioCaps or Clotho datasets using the provided checkpoint or your own pre-trained model by running the following commands:

# Fine-tune on AudioCaps
bash scripts/finetune_audiocaps.sh

# Fine-tune on Clotho
bash scripts/finetune_clotho.sh

You can also fine-tune the model without loading any pre-trained weights, though this may result in reduced performance.

Note

In the current version of SLAM-LLM, the peft_ckpt parameter is no longer required. However, if you are using the checkpoint provided by us, which was trained with an earlier version, please keep the peft_ckpt parameter in your configuration to ensure compatibility.
Due to differences in dependency versions, there may be slight variations in the performance of the SLAM-AAC model.

Inference

To run inference with a trained model using beam search:

# Inference on AudioCaps (Beam Search)
bash scripts/inference_audiocaps_bs.sh

# Inference on Clotho (Beam Search)
bash scripts/inference_clotho_bs.sh

To generate better captions, use the CLAP-Refine strategy, which runs beam search with several beam sizes and then refines the resulting candidates using our pre-trained CLAP model. It takes longer than a single beam search but generally yields higher-quality captions. Use the following commands to apply it:

# Inference on AudioCaps (CLAP-Refine)
bash scripts/inference_audiocaps_CLAP_Refine.sh

# Inference on Clotho (CLAP-Refine)
bash scripts/inference_clotho_CLAP_Refine.sh

If you already have the generated candidates and want to directly refine them using the CLAP-Refine strategy, you can run the following command:

bash scripts/clap_refine.sh
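
To make the refinement step more concrete, here is a minimal sketch of the selection idea. It assumes that refinement picks the candidate whose CLAP text embedding is most similar (by cosine similarity) to the CLAP audio embedding, and that those embeddings have already been computed. The function names are hypothetical and this is not the code in `scripts/clap_refine.sh`.

```python
import math
from typing import Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def select_caption(
    audio_embedding: Sequence[float],
    candidates: Sequence[Tuple[str, Sequence[float]]],
) -> str:
    """Return the candidate caption whose embedding best matches the audio.

    `candidates` holds (caption, text_embedding) pairs, e.g. one pair per
    beam size used during decoding.
    """
    if not candidates:
        raise ValueError("at least one candidate caption is required")
    best_caption, _ = max(
        candidates,
        key=lambda item: cosine_similarity(audio_embedding, item[1]),
    )
    return best_caption
```

Because the selection only needs embeddings, it can be applied to candidates produced by any decoding setup, which is what makes the strategy plug-and-play.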

Citation

If you find SLAM-AAC useful, please cite the following paper:

@article{chen2024slam,
  title={SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs},
  author={Chen, Wenxi and Ma, Ziyang and Li, Xiquan and Xu, Xuenan and Liang, Yuzhe and Zheng, Zhisheng and Yu, Kai and Chen, Xie},
  journal={arXiv preprint arXiv:2410.09503},
  year={2024}
}
