The procedures to get and process the datasets are described here.
This is the folder structure you need to handle the data used in the CoMix repository. The data is stored in the following sub-folders:

- original datasets `data/datasets`: contains the downloaded raw data (from other sources) used in the CoMix repository.
- processed datasets `data/datasets.unify`: contains the unified data after all the raw images and the splits have been processed (this folder is already present).
- COCO datasets `data/comix.coco`: in order to evaluate the models, the data is converted to COCO format.
- predictions `predicts.coco`: predictions from detection models; `predicts.caps`: predictions from captioning models.
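As a convenience, `mkdir -p` can create any of these folders that do not exist yet (it is a no-op for the ones already shipped with the repository, such as `data/datasets.unify`):

```bash
# idempotent: creates only the folders that are missing
mkdir -p data/datasets data/datasets.unify data/comix.coco
mkdir -p predicts.coco predicts.caps
```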
The eBDtheque dataset can be downloaded from the website after registration. Once you have downloaded the dataset, place ONLY the `Pages` folder into the `datasets/eBDtheque` folder.
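A minimal sketch of this step, assuming the downloaded archive is named `eBDtheque.zip` and contains the `Pages` folder at its top level (both are assumptions; check your actual download):

```bash
# assumptions: archive name and internal layout may differ from your download
unzip eBDtheque.zip -d /tmp/ebdtheque_raw
mkdir -p data/datasets/eBDtheque
mv /tmp/ebdtheque_raw/Pages data/datasets/eBDtheque/
```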
To convert the images of eBDtheque to the unified format, run the following command:
$ python comix/process/ebdtheque.py
Check the `comix/process/ebdtheque.py` file for the arguments if you want to change the default values.
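Assuming the script parses its options with `argparse` (not verified here), the available flags can be listed with:

```bash
$ python comix/process/ebdtheque.py --help
```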
According to the license of Manga109, redistribution of the Manga109 images is not permitted. Thus, you should download the images of Manga109 via the Manga109 webpage.

After downloading, unzip `Manga109.zip` into the `datasets` folder. Move all the contents of `Manga109_released_x` to the upper folder, then delete the empty `Manga109_released_x` directory (see the sketch below).
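One possible sequence for this step, assuming the zip sits in `data/datasets` and unpacks a single release folder matching `Manga109_released_*` (the exact name depends on the release you downloaded):

```bash
# assumptions: the zip location and the Manga109_released_* folder name
# may differ; adjust to your download
cd data/datasets
unzip Manga109.zip -d Manga109
mv Manga109/Manga109_released_*/* Manga109/
rmdir Manga109/Manga109_released_*
```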
Remove unused files:

```bash
cd data/datasets/Manga109
rm -rf annotations.v20*
rm -rf annotations
```
The folder structure should look like this:

```
datasets/
└── Manga109
    ├── images
    ├── books.txt
    └── readme.txt
```
To convert the images of Manga109 to the unified format, run the following command:
$ python comix/process/manga109.py
which has the following arguments:

- `--input-path`: path to the Manga109 folder (default: `data/datasets/Manga109`)
- `--output-path`: path to the output folder (default: `data/datasets.unify/Manga109`)
- `--override`: override the existing images; annotations are always overwritten (default: `False`)
- `--limit`: stop after the first `{limit}` books (default: `None`)
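For example, to test the conversion on only the first two books using the flags above (the default paths are kept):

```bash
$ python comix/process/manga109.py --limit 2
```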
After downloading, unzip `DCM_dataset_public_images.zip` into the `datasets` folder. Rename the extracted directory to `DCM` and delete the zip file.
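A sketch of this step, assuming the zip was downloaded into `data/datasets` and extracts to a directory named `DCM_dataset_public_images` (check the actual name after unzipping):

```bash
# assumption: the extracted directory name may differ
cd data/datasets
unzip DCM_dataset_public_images.zip
mv DCM_dataset_public_images DCM
rm DCM_dataset_public_images.zip
```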
The DCM dataset needs to be preprocessed before being converted into the unified format. To preprocess the DCM dataset (jpg renaming) and then convert the images to the unified format, run the following command:
$ python comix/process/dcm.py
In DCM, the original enumeration of images starts from '001' rather than '000'; we decided to keep it.
Download the original page images. Unzip `raw_pages_images.tar.gz` into the `datasets` folder and rename the extracted folder to `books`. Then, move this folder into a newly created upper directory named `comics` (see the sketch below).
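One way to do this, assuming the tarball sits in `data/datasets` and extracts to a folder named `raw_pages_images` (an assumption; check the extracted name):

```bash
# assumption: the extracted folder name may differ from raw_pages_images
cd data/datasets
tar -xzf raw_pages_images.tar.gz
mkdir -p comics
mv raw_pages_images comics/books
```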
The folder hierarchy should look like this:

```
datasets/
└── comics
    └── books
```
The Comics dataset needs to be preprocessed before being converted into the unified format. To preprocess the Comics dataset (jpg renaming) and then convert the images to the unified format, run the following command:
$ python comix/process/comics.py
In the Comics dataset some images are not viewable (usually the first/last ones); we renamed them anyway.

Check the `comix/process/comics.py` file for the arguments if you want to change the default values.
To download the dataset, please refer to the magi repository. After downloading, place the `Popmanga` folder into `data/datasets` and rename it to `popmanga`. Then, delete the `annotations` folder inside it (see the sketch below).
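A sketch of the move-and-clean step, assuming the downloaded `Popmanga` folder sits in the current directory (adjust the source path to wherever you downloaded it):

```bash
# assumption: adjust the source path to your download location
mv Popmanga data/datasets/popmanga
rm -rf data/datasets/popmanga/annotations
```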
Now, you can convert the images of PopManga to the unified format by running the following command:
$ python comix/process/popmanga.py
The splits for every dataset except `Manga109` are available in the path `data/datasets.unify/name_of_the_dataset/splits`. In particular, `val.csv` and `test.csv` are available for every dataset. Furthermore, `comics` also has a `train.csv`.
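Given this layout, a quick way to see which split files are available per dataset:

```bash
ls data/datasets.unify/*/splits/*.csv
```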