pre-processing about AudioSet (resample to 16kHz) #108
hi there, I get these warnings on some other datasets (not for AS, because we download AS at 16kHz directly). I believe they are safe and not the cause of the problem. The problem is the dataset: have you downloaded it from the PANNs paper repo? If so, this is a known issue; I guess that version is somehow different from what we use. 0.433 vs 0.449 is a huge difference on AudioSet. FYI, for the purpose of getting video data, we later downloaded another copy independently of the version used for AST, and it is noticeably smaller, but we still get very similar results, so it is also unlikely to be a data-size issue. Can you reproduce our ESC-50 result (with/without AS pretraining)? -Yuan
Btw, what is the format of your AS data? We have .flac.
What if you use our best pretrained model to run inference on your test set? (Note: please keep everything unchanged, including the std/mean.)
Thank you for your quick reply :)
Yes, I downloaded the dataset from the PANNs paper repo (the unbalanced set, 1.1TB), using the shell script provided there: first download the wav files at 32kHz, then downsample to 16kHz. Is this the cause of the problem? I calculated the mean/std with "get_normal_stats.py" and changed them to the new values (-3.539583, 3.4221482) because they differ from your mean/std.
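For reference, dataset-level stats like those from "get_normal_stats.py" can be accumulated in a streaming fashion so the full feature set never has to sit in memory. A minimal sketch using Welford's online algorithm; the function name and the flat-batch input shape are illustrative, not the repo's actual code:

```python
import math

def running_mean_std(batches):
    """Compute the global mean/std over all values in all batches
    with Welford's online algorithm (single pass, constant memory).
    Each batch is any iterable of floats, e.g. flattened fbank frames."""
    n, mean, m2 = 0, 0.0, 0.0
    for batch in batches:
        for x in batch:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)  # uses the updated mean
    std = math.sqrt(m2 / n) if n else 0.0  # population std
    return mean, std
```

The same pass works whether the values come from a list or a generator over feature files, which is why the stats can be recomputed cheaply after any preprocessing change.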
Yes, I can reproduce your ESC-50 result in both settings (accuracy: 88.60% and 95.55%).
My AS data is in .wav format.
My results on the eval set (keeping everything unchanged, including the std/mean) are as below.
Thanks :)
hi, I will need to follow up on this later as I am working on a deadline.
I am not the person who downloaded the dataset, so I don't know the details, but we used youtube-dl.
Thanks so much for providing this. It is much closer than "0.433 and 0.449", am I right? For the first model, your number is actually better? Being a bit off is understandable due to data pre-processing differences. If so, it tells us that the problem is not the data but your training process, e.g., data balancing. How did you do that? -Yuan
Thank you for your quick and kind reply despite your busy schedule.
That's a totally sensible thing to say!
So, one of my options is to re-download AudioSet directly as 16kHz .flac files and test again. Thanks :)
My guess is that won't help, as we saw the data isn't the problem. For data balancing, we give a weight to each sample. But since AudioSet is not a "stable" set, we do not have identical sets, so your sample weight file needs to be regenerated. Have you done so? How did you do that?
I meant this: https://github.com/YuanGongND/ast/blob/master/egs/audioset/gen_weight_file.py A mistake in generating the sample weights could cause a pretty large performance drop; please see our PSLA paper for the comparison.
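The balancing idea behind a script like gen_weight_file.py is to weight each training clip inversely to how common its labels are, so rare classes are sampled more often. A sketch of that idea under assumed inputs (list of dicts with a "labels" key); this is not the repo's exact formula, which should be taken from the linked file:

```python
from collections import Counter

def gen_sample_weights(samples):
    """Assign each sample a weight based on inverse label frequency:
    a clip whose labels are all rare gets a large weight, a clip with
    only common labels gets a small one. `samples` is a list of dicts
    like {"labels": ["a", "b"]} (field name is illustrative)."""
    counts = Counter(label for s in samples for label in s["labels"])
    weights = []
    for s in samples:
        # multi-label clips sum the contribution of each of their labels
        w = sum(1.0 / counts[label] for label in s["labels"])
        weights.append(w)
    return weights
```

Because the weights depend on label counts over *your* copy of the set, any mismatch between the json you train on and the json the weight file was generated from silently skews the sampling distribution, which is why the file must be regenerated per dataset version.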
That makes sense. I think you're right.
Yes. Following Step 2 of the PSLA repo, I created a sample weight file for the full-set json file. Thanks
I am not sure what the problem could be. Since using our norm stats for inference is fine, maybe you could try using our normalization stats for training too? But I still feel that would not cause such a large performance difference. I am working on a deadline so cannot follow up on this further; I will check back after a week. This is not the version we used to train AST, but we got a similar result on this version: https://www.dropbox.com/s/18hoeq92juqsg2g/audioset_2m_cleaned.json?dl=1. You can check whether the labels of the training set are consistent with yours. -Yuan
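One way to do that consistency check is to index both json files by clip file name and diff the label sets. A sketch, assuming AST-style datafiles of the form {"data": [{"wav": path, "labels": "id1,id2"}]} (the field names are a guess at that layout and should be verified against the actual files):

```python
import json

def diff_labels(json_a, json_b):
    """Return the set of clip file names whose labels differ between
    two AST-style datafiles. Keys on the base file name so that
    different directory layouts (or .wav vs .flac paths in the same
    layout) don't produce false mismatches."""
    def index(path):
        with open(path) as f:
            data = json.load(f)["data"]
        return {entry["wav"].split("/")[-1]: set(entry["labels"].split(","))
                for entry in data}
    a, b = index(json_a), index(json_b)
    # only compare clips present in both sets; AudioSet copies rarely
    # contain identical clip lists, so missing clips are expected
    return {name for name in a.keys() & b.keys() if a[name] != b[name]}
```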
Thanks. I will try with the new labels and let you know the result.
When I train with the mean and std written in your code (-4.2677393, 4.5689974), I get the following results.
This is a much better result. However, if I run "get_normal_stats.py" to calculate the mean and std of that training data, the mean is -3.3275723 and the std is 3.8845778.
hi there, thanks so much for reporting this. I am not sure about the reason; in my experiments, the results are not sensitive to the input stats, and we use different values for different projects. But I guess for now you can just use our mean/std. A paper from another group reports some details about mean/std: https://arxiv.org/pdf/2203.13448.pdf -Yuan
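For context on why the exact stats matter less than one might expect: the AST dataloader uses them only for an affine rescaling of the input filterbank, roughly (x - mean) / (2 * std). The sketch below assumes that form and the AudioSet constants quoted in this thread; the exact expression should be checked against the repo's dataloader:

```python
def normalize(fbank, mean=-4.2677393, std=4.5689974):
    """Normalize log-mel filterbank values with dataset-level stats,
    assuming the (x - mean) / (2 * std) form; the factor of 2 keeps
    most values roughly within [-1, 1]. `fbank` is a flat list of
    floats for illustration (the real input is a 2-D tensor)."""
    return [(x - mean) / (2 * std) for x in fbank]
```

Using slightly different stats therefore only shifts and rescales the whole input by a constant, which the first network layer can largely absorb; that is consistent with the report above that results are not very sensitive to the stats.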
Hi Yuan Gong, thank you for providing good research and open-source code.
There is one problem with reproducing it.
When I convert AudioSet (my dataset is 32kHz) to 16kHz using sox (based on ast/egs/esc50/prep_esc50.py), a warning occurs.
The code I used is as below:
# resample each clip to 16kHz with sox
os.system(f'sox {base_dir}/audio/{audio} -r 16000 {base_dir}/audio_16k/{audio}')
This seems to be why the mean and std of my data differ from what you wrote.
(Our mean and std are -3.539583 and 3.4221482, which do not match the -4.2677393 and 4.5689974 in your code.)
And this difference between the datasets seems to change the resulting mAP.
So, in conclusion, my question is this.
Did you also see any warnings while resampling the AudioSet data? If so, I wonder how you solved it.
The sox warnings are shown below.
Thanks :)
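One common source of sox warnings during rate conversion is clipping, which sox's -G (--guard) flag avoids by applying gain protection before the rate effect (per the sox manual). A small sketch that builds such a command as an argument list instead of a concatenated shell string; the paths are illustrative:

```python
import subprocess

def resample_cmd(src, dst, rate=16000):
    """Build a sox resampling command as an argument list.
    '-G' guards against clipping during the sample-rate conversion,
    which silences the typical 'sox WARN ... clipped' messages."""
    return ["sox", "-G", src, "-r", str(rate), dst]

# run it for one file (requires sox on PATH), e.g.:
# subprocess.run(resample_cmd("audio/x.wav", "audio_16k/x.wav"), check=True)
```

Passing an argument list to subprocess.run also avoids shell-quoting problems when file names contain spaces, which the string-concatenation os.system call above would mishandle.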