pre-processing about AudioSet (resample to 16kHz) #108

Open
wisekimm opened this issue Aug 3, 2023 · 13 comments
Labels
reproduction Cannot reproduce the result

Comments

@wisekimm

wisekimm commented Aug 3, 2023

Hi Yuan Gong, thank you for the good research and the open-source code.

I have one problem with reproducing the results.

When I convert AudioSet (my copy is 32 kHz) to 16 kHz using sox (based on ast/egs/esc50/prep_esc50.py), sox prints a warning.

The code I used is below:
os.system('sox ' + base_dir + '/audio/' + audio + ' -r 16000 ' + base_dir + '/audio_16k/' + audio)

This seems to be why the mean and std of my data differ from the values you wrote.
(My mean and std are -3.539583 and 3.4221482, which are not the same as the -4.2677393 and 4.5689974 in your code.)
And the difference between the datasets seems to make the result (mAP) different.

The results of the 5 epochs are:
0.408, 0.425, 0.434, 0.433, 0.433
Compare with the results given in your source code:
0.415, 0.439, 0.448, 0.449, 0.449

So, in conclusion, my question is this:
Did you also get any warnings when resampling the AudioSet data? If so, I wonder how you solved it.

The sox warnings are shown below.
Thanks :)

sox WARN rate: rate clipped 5 samples; decrease volume?
sox WARN dither: dither clipped 5 samples; decrease volume?
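
For readers who would rather avoid the clipping warnings above, here is a minimal sketch of the same resampling step. It assumes sox is installed and uses its `-G` (guard) global option, which lowers the gain as needed before the rate change. The helper names `build_sox_resample_cmd` and `resample_dir` are hypothetical and are not part of the AST repo.

```python
import subprocess
from pathlib import Path


def build_sox_resample_cmd(src, dst, sample_rate=16000, guard=True):
    """Build a sox command that resamples `src` to `sample_rate` Hz."""
    cmd = ["sox"]
    if guard:
        # -G (--guard) asks sox to reduce the gain as needed to avoid clipping.
        cmd.append("-G")
    cmd += [str(src), "-r", str(sample_rate), str(dst)]
    return cmd


def resample_dir(in_dir, out_dir, sample_rate=16000):
    """Resample every .wav file in `in_dir` into `out_dir` (needs sox on PATH)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).glob("*.wav")):
        cmd = build_sox_resample_cmd(src, out_dir / src.name, sample_rate)
        # A list argument avoids shell quoting problems with unusual file names.
        subprocess.run(cmd, check=True)
```

Passing the command as a list instead of a concatenated string also sidesteps quoting issues for file names that contain spaces or leading dashes.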

@YuanGongND
Owner

YuanGongND commented Aug 3, 2023

Hi there,

I get these warnings on some other datasets (not on AudioSet, because we download it at 16 kHz). I believe they are harmless and not the cause of the problem.

The problem is the dataset. Did you download it using the PANNs paper repo? If so, this is a known issue; I guess that version is somehow different from the one we use.

The gap between 0.433 and 0.449 is huge on AudioSet. FYI, to get the video data we later used another version, downloaded independently of the one used for AST and after the AST work, which is noticeably smaller. We still get very similar results with it, so the data size is also unlikely to be the issue.

Can you reproduce our ESC-50 result? (with/without AS pretraining).

-Yuan

@YuanGongND YuanGongND added the reproduction Cannot reproduce the result label Aug 3, 2023
@YuanGongND
Owner

Btw, what is the format of your AS data? We have .flac.

@YuanGongND
Owner

What if you use our best pretrained model to run inference on your test set? (Please keep everything unchanged, including the std/mean.)

@wisekimm
Author

wisekimm commented Aug 3, 2023

Thank you for your quick reply :)

> The problem is the dataset, have you downloaded the dataset from PANNs paper repo?

Yes, I downloaded the dataset using the PANNs paper repo (the unbalanced set is 1.1 TB).
https://github.com/qiuqiangkong/audioset_tagging_cnn

I got it with the shell script provided by the PANNs repo, which first downloads 32 kHz wav files; I then downsampled them to 16 kHz.
https://github.com/qiuqiangkong/audioset_tagging_cnn/blob/master/scripts/1_download_dataset.sh
It uses yt-dlp and ffmpeg (the main code is below). Is there a problem here? Should I download at 16 kHz directly?
os.system("yt-dlp --quiet -o '{}' -x https://www.youtube.com/watch?v={}".format(video_name, audio_id))
os.system("ffmpeg -loglevel panic -i {} -ac 1 -ar 32000 -ss {} -t 00:00:{} {} ".format(video_path, str(datetime.timedelta(seconds=start_time)), duration, audio_path))

Is this the cause of the problem? I calculated the mean/std with "get_normal_stats.py" and changed them to the new values (-3.539583, 3.4221482) because they differ from yours.
Btw, if the data is the problem, I will try to download it again another way (not the PANNs repo method).
Is there a GitHub repo that can help me download AudioSet?

> Can you reproduce our ESC-50 result? (with/without AS pretraining).

Yes, I can reproduce your ESC-50 results in both settings (accuracy: 88.60% and 95.55%).
Also, my weight-averaged model result on the balanced set with label enhancement is 0.340183, which is slightly different from your 0.349992.

> Btw, what is the format of your AS data? We have .flac.

My AS data is in .wav format.
Do you think the file format matters? If it does, I will try to get the data again in flac format.

> What if you use our best-pretrained model to infer your test set? (note please keep everything unchanged, including the std/mean).

My results on the eval set (keeping everything unchanged, including the mean/std) are below.
They seem to be slightly below your results.
The mean/std of my eval dataset is -3.589223 / 4.0911736.

Model 0 /pretrained_models/audioset_10_10_0.4495.pth mAP: 0.449942, AUC: 0.974956, d-prime: 2.770742
Model 1 /pretrained_models/audioset_10_10_0.4483.pth mAP: 0.447466, AUC: 0.974261, d-prime: 2.754138
Model 2 /pretrained_models/audioset_10_10_0.4475.pth mAP: 0.446870, AUC: 0.973542, d-prime: 2.737371
Ensemble 3 Models mAP: 0.474453, AUC: 0.979376, d-prime: 2.886452

Model 0 /pretrained_models/audioset_10_10_0.4495.pth mAP: 0.449942, AUC: 0.974956, d-prime: 2.770742
Model 1 /pretrained_models/audioset_10_10_0.4483.pth mAP: 0.447466, AUC: 0.974261, d-prime: 2.754138
Model 2 /pretrained_models/audioset_10_10_0.4475.pth mAP: 0.446870, AUC: 0.973542, d-prime: 2.737371
Model 3 /pretrained_models/audioset_12_12_0.4467.pth mAP: 0.446010, AUC: 0.973149, d-prime: 2.728355
Model 4 /pretrained_models/audioset_14_14_0.4431.pth mAP: 0.442671, AUC: 0.972391, d-prime: 2.711262
Model 5 /pretrained_models/audioset_16_16_0.4422.pth mAP: 0.439268, AUC: 0.972788, d-prime: 2.720174
Ensemble 6 Models mAP: 0.484173, AUC: 0.980903, d-prime: 2.931331

Thanks :)
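
For reference, the metrics reported above can be sketched in a few lines. This is a minimal illustration and not the repo's evaluation code: it assumes mAP is the class-wise average precision averaged over classes, that the ensemble averages the models' prediction matrices, and that d-prime is derived from AUC as sqrt(2) times the standard normal quantile. All function names are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def average_precision(y_true, y_score):
    """Average precision for one class: mean precision at each positive hit."""
    order = np.argsort(-np.asarray(y_score), kind="stable")
    hits = np.asarray(y_true)[order]
    if hits.sum() == 0:
        return float("nan")  # undefined when the class has no positives
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())


def mean_average_precision(targets, scores):
    """Average precision per class (column), averaged over classes."""
    aps = [average_precision(targets[:, c], scores[:, c])
           for c in range(targets.shape[1])]
    return float(np.nanmean(aps))


def ensemble_scores(per_model_scores):
    """Average the prediction matrices of several models."""
    return np.mean(np.stack(per_model_scores), axis=0)


def d_prime(auc):
    """d' = sqrt(2) * inverse standard normal CDF of the AUC."""
    return (2 ** 0.5) * NormalDist().inv_cdf(auc)
```

Under the d-prime assumption, an AUC of 0.974956 maps to roughly 2.77, which is consistent with the Model 0 line above.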

@YuanGongND
Owner

hi,

I will need to follow up on this later, as I am working on a deadline.

> The problem is the dataset, have you downloaded the dataset from PANNs paper repo?
>
> Yes, I downloaded the dataset from PANNs paper repo. (for unbal set, 1.1TB)
> https://github.com/qiuqiangkong/audioset_tagging_cnn
>
> I got it from shell script file provided by PANNs repo. First, download wav file with 32kHz and then downsample to 16kHz.
> https://github.com/qiuqiangkong/audioset_tagging_cnn/blob/master/scripts/1_download_dataset.sh
> using yt-dlp & ffmpeg. (main code is as below). Is there no problem here? Should I download it to 16kHz directly?
> os.system("yt-dlp --quiet -o '{}' -x https://www.youtube.com/watch?v={}".format(video_name, audio_id))
> os.system("ffmpeg -loglevel panic -i {} -ac 1 -ar 32000 -ss {} -t 00:00:{} {} ".format(video_path, str(datetime.timedelta(seconds=start_time)), duration, audio_path))
>
> Is this the cause of the problem? I calculated the mean/std from "get_normal_stats.py" and changed it to a new value( -3.539583 3.4221482) because it is different from your mean/std.
> btw, If the data is a problem, I will try to download it again through a different path. (Not PANNs repo method)
> Is there any github repo that can help me download AudioSet data?

I am not the person who downloaded the dataset, so I don't know the details, but we used youtube-dl.

> Model 0 /pretrained_models/audioset_10_10_0.4495.pth mAP: 0.449942, AUC: 0.974956, d-prime: 2.770742
> Model 1 /pretrained_models/audioset_10_10_0.4483.pth mAP: 0.447466, AUC: 0.974261, d-prime: 2.754138
> Model 2 /pretrained_models/audioset_10_10_0.4475.pth mAP: 0.446870, AUC: 0.973542, d-prime: 2.737371
> Ensemble 3 Models mAP: 0.474453, AUC: 0.979376, d-prime: 2.886452

Thanks so much for providing this. This is much closer than "0.433 vs. 0.449", am I right? For the first model, your number is actually better? Being a bit off is understandable given the data pre-processing differences.

If so, it suggests that the problem is not the data but your training process, e.g., data balancing. How did you do that?

-Yuan

@wisekimm
Author

wisekimm commented Aug 3, 2023

Thank you for your quick and kind reply despite your busy time.

> If so, it tells that the problem is not the data, but your training process, which could be e.g., data balancing etc. How did you do that?

That's a totally sensible thing to say!
But I didn't change anything except the dataset mean and std in the "run.sh" file.

bal=bal
lr=1e-5
epoch=5
lrscheduler_start=2
lrscheduler_step=1
lrscheduler_decay=0.5
wa_start=1
wa_end=5
freqm=48
timem=192
mixup=0.5
fstride=10
tstride=10
batch_size=12
dataset_mean=-3.539583
dataset_std=3.4221482
audio_length=1024
noise=False
metrics=mAP
loss=BCE
warmup=True
wa=True

So, one of my options is to re-download AudioSet directly as 16 kHz flac files and test again.

Thanks :)

@YuanGongND
Owner

> So, one of my options is to re-download AudioSet directly to 16kHz flac format file and test it again.

My guess is that won't help as we saw the data isn't the problem.

For data balancing, we give a weight to each sample. But because AudioSet is not a "stable" set, we do not have identical sets, so your sample weight file needs to be regenerated. Have you done so? How did you do that?

@YuanGongND
Owner

I meant this: https://github.com/YuanGongND/ast/blob/master/egs/audioset/gen_weight_file.py

A mistake in generating the sample weight could cause a pretty large performance drop. Please see our PSLA paper for the comparison.
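
To make the idea of per-sample weights concrete, here is a minimal sketch of inverse-frequency weighting. It assumes each label is weighted by `scale / (count + eps)` and that a clip's weight is the sum of its labels' weights; the constants and the function name `make_sample_weights` are illustrative assumptions, so check the linked script for the exact formula used.

```python
from collections import Counter


def make_sample_weights(sample_labels, scale=1000.0, eps=0.01):
    """Return one sampling weight per clip.

    Each label gets weight scale / (count + eps). A clip's weight is the sum
    of the weights of its labels, so clips with rare labels are drawn more often.
    """
    label_count = Counter(label for labels in sample_labels for label in labels)
    label_weight = {lab: scale / (cnt + eps) for lab, cnt in label_count.items()}
    return [sum(label_weight[lab] for lab in labels) for labels in sample_labels]


# Toy example: "dog" is rarer than "speech", so the third clip weighs the most.
weights = make_sample_weights([["speech"], ["speech"], ["speech", "dog"]])
```

Because the counts come from the training set itself, weights generated for one copy of AudioSet do not carry over to a copy with a different set of clips. Weights of this kind are typically passed to a weighted sampler such as PyTorch's `WeightedRandomSampler`.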

@wisekimm
Author

wisekimm commented Aug 3, 2023

> My guess is that won't help as we saw the data isn't the problem.

That makes sense. I think you're right.

> For data balancing, we give a weight to each sample. But due to the nature of AudioSet is not a "stable" set, we do not have identical sets, so your sample weight file need to be regenerated. Have you done so? How did you do that?

Yes. Following Step 2 of the PSLA repo, I created a sample weight file for the full-set json file.
Is there anything specific I need to be careful about while doing this?

Thanks

@YuanGongND
Owner

I am not sure what the problem could be. Since using our norm stats for inference works fine, maybe you could try using our normalization stats for training? Still, I feel that alone would not cause such a large performance difference.

I am working on a deadline, so I cannot follow up on this further; I will check back in a week. This is not the version we used to train AST, but we got similar results with it: https://www.dropbox.com/s/18hoeq92juqsg2g/audioset_2m_cleaned.json?dl=1.

You can check whether the labels of this training set are consistent with yours.

-Yuan
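
One way to run the label consistency check suggested above is sketched here. It assumes both data files follow the layout `{"data": [{"wav": <path>, "labels": "<comma-separated ids>"}]}` and that the file name without its extension identifies a clip. Both are assumptions to verify against the actual files, since download scripts may name files differently. The function names are hypothetical.

```python
import json
from pathlib import Path


def labels_by_clip(datafile):
    """Map clip id (file name without extension) to its set of labels."""
    return {
        Path(entry["wav"]).stem: set(entry["labels"].split(","))
        for entry in datafile["data"]
    }


def compare_labels(mine, reference):
    """Report clips missing on either side and shared clips whose labels differ."""
    a, b = labels_by_clip(mine), labels_by_clip(reference)
    shared = a.keys() & b.keys()
    return {
        "only_in_mine": sorted(a.keys() - b.keys()),
        "only_in_reference": sorted(b.keys() - a.keys()),
        "label_mismatch": sorted(k for k in shared if a[k] != b[k]),
    }


def load_datafile(path):
    """Load a JSON data file from disk."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

Comparing label sets instead of raw strings means a different label order within a clip is not reported as a mismatch.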

@wisekimm
Author

wisekimm commented Aug 4, 2023

Thanks. I will try with the new labels and let you know the result.

@wisekimm
Author

wisekimm commented Aug 10, 2023

When I train with the mean and std written in your code (-4.2677393, 4.5689974), I get the following results.

The results of the 5 epochs are:
0.412 0.436 0.444 0.447 0.448
wa_result : 0.457
Compare with the results given in your source code:
0.415, 0.435, 0.448, 0.449, 0.449
wa_result : 0.459

This is a much better result.

However, if I run "get_normal_stats.py" to calculate the mean and std of that training data, the mean is -3.3275723 and the std is 3.8845778.
I can get a similar mAP with the mean and std written in your code, but I wonder what the underlying problem is.
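
As a point of comparison for the statistics discussed above, here is a minimal sketch that computes a global mean and std over all feature values and then applies them. It is not the repo's script. How the values are aggregated (per batch or globally, with or without padded frames) changes the result, and the factor of 2 in `normalize` is an assumption about how the data loader scales inputs.

```python
import numpy as np


def running_mean_std(feature_batches):
    """Global mean and standard deviation over all values in all batches."""
    n, total, total_sq = 0, 0.0, 0.0
    for batch in feature_batches:
        batch = np.asarray(batch, dtype=np.float64)
        n += batch.size
        total += batch.sum()
        total_sq += np.square(batch).sum()
    mean = total / n
    # Clamp at zero to guard against tiny negative values from rounding.
    std = np.sqrt(max(total_sq / n - mean ** 2, 0.0))
    return mean, std


def normalize(fbank, mean, std):
    """Shift and scale features; the factor 2 is an assumption, see above."""
    return (np.asarray(fbank) - mean) / (2.0 * std)
```

Running `normalize` on the same features with each pair of statistics from this thread shows how far the network inputs shift between the two settings.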

@YuanGongND
Owner

hi there,

thanks so much for reporting this.

I am not sure about the reason. In my experiments, the results are not sensitive to the input stats, and we use different values for different projects. But I guess for now you can just use our mean/std.

A paper from another group reports some details about mean/std: https://arxiv.org/pdf/2203.13448.pdf

-Yuan
