Why do we have to manually split training and validation images? #45
Yes, why? It is a huge hassle. Can you implement an auto-split procedure using a scikit-learn function (train_test_split)?
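For reference, a minimal sketch of the kind of automatic split being asked about, using scikit-learn's train_test_split on a list of image paths. The folder layout and the 80/20 ratio are illustrative assumptions, not part of the project:

```python
# Hypothetical sketch: split one class folder into training and
# validation file lists with scikit-learn's train_test_split.
import os
from sklearn.model_selection import train_test_split

image_dir = "dataset/class_0"  # assumed layout: one folder per class
images = [os.path.join(image_dir, f) for f in os.listdir(image_dir)
          if f.lower().endswith((".png", ".jpg", ".jpeg"))]

# 20% of the images are held out for validation (assumed fraction).
train_files, valid_files = train_test_split(
    images, test_size=0.2, random_state=42, shuffle=True)
print(f"{len(train_files)} training / {len(valid_files)} validation images")
```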
Hi, I must admit that I deliberately omitted this feature so far. The reason is that I wanted users to choose their validation data wisely. Lots of things can go wrong when capturing a dataset (wrong labeling, badly focused images, lots of blur, dirt on the camera, oversampling one class, …), and the validation set should be chosen and checked with extra care. Furthermore, users almost always want to train models for predicting events in the future, so it makes sense that the validation set is captured after the training set. Similarly, for biomedical applications, trained models still need to work for data from a new experiment or from another patient. If the existing dataset is very large, a random allocation of validation data might be an option. Therefore, I'm now thinking about how to implement the train_test_split you suggested as an option.
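To make the temporal argument above concrete, here is a hedged sketch of a split where the validation images are the most recently captured ones rather than a random subset. Sorting by file modification time is an assumption; a timestamp encoded in the filename would work the same way:

```python
# Hypothetical temporal split: validation = newest images.
import os

def temporal_split(image_dir, valid_fraction=0.2):
    files = [os.path.join(image_dir, f) for f in os.listdir(image_dir)]
    files.sort(key=os.path.getmtime)           # oldest first
    n_valid = max(1, int(len(files) * valid_fraction))
    return files[:-n_valid], files[-n_valid:]  # train, validation
```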
Maybe I'll write a helper tool for automating that split. It does not have to be embedded in the main project; it could just create the required folder structure based on a single input folder (a sketch of such a helper follows below).

The project seems very promising, thank you for the great contribution. I used NVIDIA DIGITS once in this study: https://link.springer.com/article/10.1007/s11694-020-00707-7 However, installing DIGITS is a hassle, especially for my students who have no programming background. I am looking for an alternative to DIGITS that can be easily installed, and then I saw your project. I could not find the legacy GoogLeNet in the predefined list. It is nice to know there is an opportunity to get some support for the program.
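As an illustration of that helper-tool idea, a minimal sketch that takes one input folder containing a subfolder per class and copies the files into train/valid subtrees. The output layout, folder names, and copy semantics are all assumptions:

```python
# Hypothetical folder-splitting helper: src_dir has one subfolder per
# class; dst_dir receives train/<class> and valid/<class> copies.
import os
import random
import shutil

def split_into_folders(src_dir, dst_dir, valid_fraction=0.2, seed=42):
    rng = random.Random(seed)
    for class_name in os.listdir(src_dir):
        class_dir = os.path.join(src_dir, class_name)
        if not os.path.isdir(class_dir):
            continue
        files = os.listdir(class_dir)
        rng.shuffle(files)
        n_valid = max(1, int(len(files) * valid_fraction))
        for subset, names in (("valid", files[:n_valid]),
                              ("train", files[n_valid:])):
            out_dir = os.path.join(dst_dir, subset, class_name)
            os.makedirs(out_dir, exist_ok=True)
            for name in names:
                shutil.copy2(os.path.join(class_dir, name), out_dir)
```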
Sounds like AIDeveloper could be a helpful tool for you. The students just need to download and unzip it. AIDeveloper even works with GPU support. Maybe you have already discovered the "Python" tab within AIDeveloper: there, you can execute any code you want in the same Python environment that is used by AIDeveloper, so packages like tensorflow, scikit-learn, opencv, and so on are available without having to install anything (see the example below).
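For example, one might paste something like the following into that "Python" tab to confirm which bundled packages are available (the version attributes are standard for these libraries; the specific imports are just an illustration):

```python
# Check the packages bundled with AIDeveloper's Python environment.
import tensorflow, sklearn, cv2
print("tensorflow  ", tensorflow.__version__)
print("scikit-learn", sklearn.__version__)
print("opencv      ", cv2.__version__)
```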
Thank you for the information. It looks like the user has a lot of control over it.
Dear Maik, here is my tt_split implementation. It probably needs more error handling. I had near-zero experience with Qt, and this is my first Qt program, but it looks like it does the job :) https://gist.github.com/aferust/55bb70359fdd3148c7e920b02907084a
Thanks for sharing your code!
While I really appreciate the GUI and the splitting code, I think this should be a feature of the software. Having a simple option where you load a class and then set the percentages for training, validation, and testing would be very useful.

I understand the 'garbage in, garbage out' concern, where you would want people to check their images before using them, but I think a tool like this is more about developing skills and a reasonable model, fast. There are also many free and easy ways to collect pretty good data. Sure, is there going to be some garbage in a dataset? Yeah, probably. But if you have a class with 10,000 images and 90% of them are very high quality, then that's good enough accuracy for a model made from a GUI. No one will be making a model in this tool and using it at Google or Facebook, right? This is for developing understanding, hobby models, etc. Having a model that is 70% good isn't bad at all!

Plus, the problem with using an external script for this is that it just makes life harder than it has to be. The model should be continually trained so it gets better and better, and if it's bad, you retrain it or start going through the dataset. I think having the option to split the dataset in the software would help people in that journey.
@DankMemeGuy thanks for your suggestions. I have implemented a (kind of) quick solution: you can now find a new checkbox, 'Validation split (%)'. You can change that fraction on the fly during the training process.
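This is not the actual AIDeveloper code, but as a hedged illustration of what a fractional validation split means in the underlying Keras API, where validation_split reserves the last portion of the provided data:

```python
# Illustration only: Keras holds out the last 20% of the samples
# (before shuffling) when validation_split=0.2 is passed to fit().
import numpy as np
from tensorflow import keras

x = np.random.rand(1000, 32, 32, 3).astype("float32")  # dummy images
y = np.random.randint(0, 2, size=(1000,))               # dummy labels

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(32, 32, 3)),
    keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x, y, epochs=1, validation_split=0.2)  # 20% held out
```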
Thank you very much!!