pre-processing about AudioSet (resample to 16kHz) #108
hi there, I get these warnings on some other datasets (not for AS, because we download AS at 16kHz directly). I believe they are safe and not the cause of the problem. The problem is the dataset: have you downloaded it from the PANNs paper repo? If so, this is a known issue; I guess that version is somehow different from what we use. 0.433 vs 0.449 is a huge difference on AudioSet. FYI, for the purpose of getting video data, we later downloaded another copy independently of the version used for AST, and it is noticeably smaller, but we still get very similar results, so it is also unlikely to be a data-size issue. Can you reproduce our ESC-50 result (with/without AS pretraining)? -Yuan
Btw, what is the format of your AS data? We have .flac.
What if you use our best pretrained model to run inference on your test set? (Note: please keep everything unchanged, including the std/mean.)
Thank you for your quick reply :)
Yes, I downloaded the dataset from the PANNs paper repo (the unbalanced set, 1.1TB), using the shell script provided there: first download the wav files at 32kHz, then downsample to 16kHz. Is this the cause of the problem? I calculated the mean/std with "get_normal_stats.py" and changed them to the new values (-3.539583, 3.4221482) because they differ from your mean/std.
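For reference, dataset-level stats like those from "get_normal_stats.py" can be accumulated in a streaming fashion so the full feature set never has to sit in memory. A minimal sketch using Welford's online algorithm; the function name and the flat-batch input shape are illustrative, not the repo's actual code:

```python
import math

def running_mean_std(batches):
    """Compute the global mean/std over all values in all batches
    with Welford's online algorithm (single pass, constant memory).
    Each batch is any iterable of floats, e.g. flattened fbank frames."""
    n, mean, m2 = 0, 0.0, 0.0
    for batch in batches:
        for x in batch:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)  # uses the updated mean
    std = math.sqrt(m2 / n) if n else 0.0  # population std
    return mean, std
```

The same pass works whether the values come from a list or a generator over feature files, which is why the stats can be recomputed cheaply after any preprocessing change.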
Yes, I can reproduce your ESC-50 result in both settings (accuracy: 88.60% and 95.55%).
My AS data is in .wav format.
My results on the eval set (keeping everything unchanged, including the std/mean) are as below.
Thanks :)
hi, I will need to follow up on this later as I am working on a deadline.
I am not the person who downloaded the dataset, so I don't know the details, but we used youtube-dl.
Thanks so much for providing this. It is much closer than "0.433 and 0.449", am I right? For the first model, your number is actually better? Being a bit off is understandable due to data pre-processing differences. If so, it tells us that the problem is not the data but your training process, e.g., data balancing. How did you do that? -Yuan
Thank you for your quick and kind reply despite your busy schedule.
That's a totally sensible thing to say!
So, one of my options is to re-download AudioSet directly as 16kHz .flac files and test again. Thanks :)
My guess is that won't help, as we saw the data isn't the problem. For data balancing, we give a weight to each sample. But since AudioSet is not a "stable" set, we do not have identical sets, so your sample weight file needs to be regenerated. Have you done so? How did you do that?
I meant this: https://github.com/YuanGongND/ast/blob/master/egs/audioset/gen_weight_file.py A mistake in generating the sample weights could cause a pretty large performance drop; please see our PSLA paper for the comparison.
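The balancing idea behind a script like gen_weight_file.py is to weight each training clip inversely to how common its labels are, so rare classes are sampled more often. A sketch of that idea under assumed inputs (list of dicts with a "labels" key); this is not the repo's exact formula, which should be taken from the linked file:

```python
from collections import Counter

def gen_sample_weights(samples):
    """Assign each sample a weight based on inverse label frequency:
    a clip whose labels are all rare gets a large weight, a clip with
    only common labels gets a small one. `samples` is a list of dicts
    like {"labels": ["a", "b"]} (field name is illustrative)."""
    counts = Counter(label for s in samples for label in s["labels"])
    weights = []
    for s in samples:
        # multi-label clips sum the contribution of each of their labels
        w = sum(1.0 / counts[label] for label in s["labels"])
        weights.append(w)
    return weights
```

Because the weights depend on label counts over *your* copy of the set, any mismatch between the json you train on and the json the weight file was generated from silently skews the sampling distribution, which is why the file must be regenerated per dataset version.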
That makes sense. I think you're right.
Yes. Following Step 2 of the PSLA repo, I created a sample weight file for the full-set json file. Thanks
I am not sure what the problem could be. Since using our norm stats for inference is fine, maybe you could try using our normalization stats for training too? But I still feel that would not cause such a large performance difference. I am working on a deadline so cannot follow up on this further; I will check back after a week. This is not the version we used to train AST, but we got a similar result on this version: https://www.dropbox.com/s/18hoeq92juqsg2g/audioset_2m_cleaned.json?dl=1. You can check whether the labels of the training set are consistent with yours. -Yuan
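One way to do that consistency check is to index both json files by clip file name and diff the label sets. A sketch, assuming AST-style datafiles of the form {"data": [{"wav": path, "labels": "id1,id2"}]} (the field names are a guess at that layout and should be verified against the actual files):

```python
import json

def diff_labels(json_a, json_b):
    """Return the set of clip file names whose labels differ between
    two AST-style datafiles. Keys on the base file name so that
    different directory layouts (or .wav vs .flac paths in the same
    layout) don't produce false mismatches."""
    def index(path):
        with open(path) as f:
            data = json.load(f)["data"]
        return {entry["wav"].split("/")[-1]: set(entry["labels"].split(","))
                for entry in data}
    a, b = index(json_a), index(json_b)
    # only compare clips present in both sets; AudioSet copies rarely
    # contain identical clip lists, so missing clips are expected
    return {name for name in a.keys() & b.keys() if a[name] != b[name]}
```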
Thanks. I will try with the new labels and let you know the result.
When I train with the mean and std written in your code (-4.2677393, 4.5689974), I get the following results.
This is a much better result. However, if I run "get_normal_stats.py" to calculate the mean and std of that training data, the mean is -3.3275723 and the std is 3.8845778.
hi there, thanks so much for reporting this. I am not sure about the reason; in my experiments, the results are not sensitive to the input stats, and we use different values for different projects. But I guess for now you can just use our mean/std. A paper from another group reports some details about mean/std: https://arxiv.org/pdf/2203.13448.pdf -Yuan
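For context on why the exact stats matter less than one might expect: the AST dataloader uses them only for an affine rescaling of the input filterbank, roughly (x - mean) / (2 * std). The sketch below assumes that form and the AudioSet constants quoted in this thread; the exact expression should be checked against the repo's dataloader:

```python
def normalize(fbank, mean=-4.2677393, std=4.5689974):
    """Normalize log-mel filterbank values with dataset-level stats,
    assuming the (x - mean) / (2 * std) form; the factor of 2 keeps
    most values roughly within [-1, 1]. `fbank` is a flat list of
    floats for illustration (the real input is a 2-D tensor)."""
    return [(x - mean) / (2 * std) for x in fbank]
```

Using slightly different stats therefore only shifts and rescales the whole input by a constant, which the first network layer can largely absorb; that is consistent with the report above that results are not very sensitive to the stats.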
Hi Yuan Gong, thank you for providing good research and open-source code.
There is one problem with reproducing it.
When I convert AudioSet (my dataset is 32kHz) to 16kHz using sox (based on ast/egs/esc50/prep_esc50.py), a warning occurs.
The code I used is as below:
# resample each clip to 16kHz with sox
os.system(f'sox {base_dir}/audio/{audio} -r 16000 {base_dir}/audio_16k/{audio}')
This seems to be why the mean and std of my data differ from what you wrote.
(Our mean and std are -3.539583 and 3.4221482, which do not match the -4.2677393 and 4.5689974 in your code.)
And this difference between the datasets seems to change the resulting mAP.
So, in conclusion, my question is this.
Did you also see any warnings while resampling the AudioSet data? If so, I wonder how you solved it.
The sox warnings are shown below.
Thanks :)
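One common source of sox warnings during rate conversion is clipping, which sox's -G (--guard) flag avoids by applying gain protection before the rate effect (per the sox manual). A small sketch that builds such a command as an argument list instead of a concatenated shell string; the paths are illustrative:

```python
import subprocess

def resample_cmd(src, dst, rate=16000):
    """Build a sox resampling command as an argument list.
    '-G' guards against clipping during the sample-rate conversion,
    which silences the typical 'sox WARN ... clipped' messages."""
    return ["sox", "-G", src, "-r", str(rate), dst]

# run it for one file (requires sox on PATH), e.g.:
# subprocess.run(resample_cmd("audio/x.wav", "audio_16k/x.wav"), check=True)
```

Passing an argument list to subprocess.run also avoids shell-quoting problems when file names contain spaces, which the string-concatenation os.system call above would mishandle.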