title: Working with Q1 datasets
date: 2023-11-04T03:00:00.000Z
description: In this post you will learn how to work with STAC metadata to create Q1 datasets.
tags: STAC, Q1, datasets

Q1 Training Datasets

Training Datasets (TDS) in EOTDL are categorized into different quality levels, which in turn will impact the range of functionality that will be available for each dataset.

In this tutorial you will learn about Q1 datasets, which are datasets with STAC metadata.

To ingest a Q1 dataset you will need its STAC metadata.

Some datasets already have STAC metadata and can be ingested directly into EOTDL. If your dataset does not have STAC metadata but you want to ingest it as a Q1 dataset, the EOTDL library also offers functionality to create the metadata. Let's see an example using the EuroSAT dataset.

from eotdl.datasets import download_dataset

download_dataset("EuroSAT-RGB", version=1, path="data", force=True)
100%|██████████| 90.3M/90.3M [00:04<00:00, 22.5MiB/s]
100%|██████████| 2/2 [00:04<00:00,  2.30s/file]





'data/EuroSAT-RGB/v1'
!ls data/EuroSAT-RGB/v1
EuroSAT-RGB.zip  metadata.yml
!unzip -q data/EuroSAT-RGB/v1/EuroSAT-RGB.zip -d data/EuroSAT-RGB

The EuroSAT dataset contains satellite images for classification, so each image has exactly one associated label. In this case, the label can be extracted from the folder structure: every class has its own folder.

import os 

labels = os.listdir('data/EuroSAT-RGB/2750')
labels
['Industrial',
 'Forest',
 'HerbaceousVegetation',
 'PermanentCrop',
 'Highway',
 'Residential',
 'SeaLake',
 'River',
 'AnnualCrop',
 'Pasture']
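
As a minimal sketch of how the labels can be read from the folder structure, the snippet below pairs each image with the name of its parent folder. The helper name `list_labeled_images` is hypothetical and not part of EOTDL; it only assumes one subfolder per class, as in the listing above.

```python
from pathlib import Path


def list_labeled_images(root, suffix=".jpg"):
    """Return sorted (image_path, label) pairs, taking each label from the parent folder name.

    Hypothetical helper for illustration; it is not part of the EOTDL library.
    """
    root = Path(root)
    pairs = []
    # Only directories are treated as classes; stray files in the root are ignored.
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for image in sorted(class_dir.glob(f"*{suffix}")):
            pairs.append((str(image), class_dir.name))
    return pairs


# Example (assumes the dataset was unzipped as shown above):
# pairs = list_labeled_images("data/EuroSAT-RGB/2750")
```

Sorting the folders and files makes the result reproducible, which `os.listdir` alone does not guarantee.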

For faster processing, we will generate a copy of the dataset with only 10 images per class.

import shutil

src_root = 'data/EuroSAT-RGB/2750'
dst_root = 'data/EuroSAT-RGB-small'

os.makedirs(dst_root, exist_ok=True)
for label in labels:
    # one output folder per class, mirroring the original layout
    os.makedirs(os.path.join(dst_root, label), exist_ok=True)
    # keep the first 10 files returned by os.listdir (its order is not guaranteed)
    images = os.listdir(os.path.join(src_root, label))[:10]
    for image in images:
        shutil.copy(os.path.join(src_root, label, image), os.path.join(dst_root, label, image))
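
Before generating metadata, it can be worth checking that the reduced copy has the expected number of images per class. The following is a small sketch under that assumption; `count_images_per_class` is a hypothetical helper, not an EOTDL function.

```python
import os


def count_images_per_class(root):
    """Count the files in each class subfolder of `root`.

    Hypothetical helper for illustration; it is not part of the EOTDL library.
    """
    counts = {}
    for label in os.listdir(root):
        class_dir = os.path.join(root, label)
        # Skip anything in the root that is not a class folder.
        if os.path.isdir(class_dir):
            counts[label] = len(os.listdir(class_dir))
    return counts


# Example (assumes the reduced copy created above exists):
# counts = count_images_per_class("data/EuroSAT-RGB-small")
# assert all(n == 10 for n in counts.values())
```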

You can use the STACGenerator to create the STAC metadata for your dataset in the form of a dataframe. The item parser will depend on the structure of your dataset. We offer some predefined parsers for common datasets, but you can also create your own parser.

from eotdl.curation.stac.parsers import UnestructuredParser
from eotdl.curation.stac.stac import STACGenerator
from eotdl.curation.stac.dataframe_labeling import LabeledStrategy

stac_generator = STACGenerator(image_format='jpg', item_parser=UnestructuredParser, labeling_strategy=LabeledStrategy)

df = stac_generator.get_stac_dataframe('data/EuroSAT-RGB-small')
df.head()
|   | image | label | ix | collection | extensions | bands |
|---|-------|-------|----|------------|------------|-------|
| 0 | data/EuroSAT-RGB-small/Industrial/Industrial_1... | Industrial | 0 | data/EuroSAT-RGB-small/source | None | None |
| 1 | data/EuroSAT-RGB-small/Industrial/Industrial_1... | Industrial | 0 | data/EuroSAT-RGB-small/source | None | None |
| 2 | data/EuroSAT-RGB-small/Industrial/Industrial_1... | Industrial | 0 | data/EuroSAT-RGB-small/source | None | None |
| 3 | data/EuroSAT-RGB-small/Industrial/Industrial_1... | Industrial | 0 | data/EuroSAT-RGB-small/source | None | None |
| 4 | data/EuroSAT-RGB-small/Industrial/Industrial_1... | Industrial | 0 | data/EuroSAT-RGB-small/source | None | None |

Now we save the STAC metadata. The id given to the STAC catalog will be used as the name of the dataset in EOTDL, so it must meet the dataset naming requirements described in the documentation.

output = 'data/EuroSAT-RGB-small-STAC'
stac_generator.generate_stac_metadata(stac_id='EuroSAT-RGB-Q1', description='EuroSAT-RGB dataset', stac_dataframe=df, output_folder=output)
/home/juan/miniconda3/envs/eotdl/lib/python3.8/site-packages/rasterio/__init__.py:304: NotGeoreferencedWarning: Dataset has no geotransform, gcps, or rpcs. The identity matrix will be returned.
  dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)


Generating source collection...


100%|██████████| 100/100 [00:00<00:00, 972.12it/s]

Validating and saving catalog...
Success!
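
To get a quick look at what was written, you can read the generated catalog with the standard `json` module. The sketch below lists the `child` links of a STAC catalog file, which in STAC point to its collections. The helper `list_child_links` is hypothetical, and the example path assumes the catalog is saved as `catalog.json` inside the output folder.

```python
import json


def list_child_links(catalog_path):
    """Return the href of every 'child' link in a STAC catalog file.

    Hypothetical helper for illustration; it reads the JSON directly and
    does not use EOTDL or pystac.
    """
    with open(catalog_path, "r") as f:
        catalog = json.load(f)
    return [
        link["href"]
        for link in catalog.get("links", [])
        if link.get("rel") == "child"
    ]


# Example (assumes the metadata was written to the output folder used above):
# list_child_links("data/EuroSAT-RGB-small-STAC/catalog.json")
```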

Optionally, you can also generate the labels using the STAC label extension.

from eotdl.curation.stac.extensions.label import ImageNameLabeler

catalog = output + '/catalog.json'
labels_extra_properties = {'label_properties': ["label"],
                          'label_methods': ["manual"],
                          'label_tasks': ["classification"]}

labeler = ImageNameLabeler()
labeler.generate_stac_labels(catalog, stac_dataframe=df, **labels_extra_properties)
Generating labels collection...


100it [00:00, 2549.64it/s]


Success on labels generation!

Once the STAC metadata is generated, we can ingest the dataset into EOTDL.

from eotdl.datasets import ingest_dataset

ingest_dataset('data/EuroSAT-RGB-small-STAC')
Loading STAC catalog...
New version created, version: 1


100%|██████████| 200/200 [00:32<00:00,  6.13it/s]


Ingesting STAC catalog...
Done

After the ingestion, you can explore and stage your dataset as shown in the previous tutorial.

from eotdl.datasets import retrieve_datasets

datasets = retrieve_datasets('EuroSAT')
datasets
['EuroSAT-RGB',
 'EuroSAT',
 'EuroSAT-RGB-STAC',
 'EuroSAT-STAC',
 'EuroSAT-small',
 'EuroSAT-RGB-Q1']
from eotdl.datasets import download_dataset

dst_path = download_dataset('EuroSAT-RGB-Q1')
dst_path
'/home/juan/.cache/eotdl/datasets/EuroSAT-RGB-Q1/v1'

By default, only the STAC metadata is downloaded. If you also want to download the actual data, set the assets parameter to True.

The force parameter will overwrite the dataset if it already exists.

from eotdl.datasets import download_dataset

dst_path = download_dataset('EuroSAT-RGB-Q1', force=True, assets=True)
dst_path
100%|██████████| 200/200 [00:31<00:00,  6.39it/s]





'/home/juan/.cache/eotdl/datasets/EuroSAT-RGB-Q1/v1'

You will find the data in the assets subfolder, where a subfolder for each item, named after its id, contains all the assets for that item.

from glob import glob

glob(dst_path + '/assets/*.jpg')[:3]
['/home/juan/.cache/eotdl/datasets/EuroSAT-RGB-Q1/v1/assets/AnnualCrop_1033.jpg',
 '/home/juan/.cache/eotdl/datasets/EuroSAT-RGB-Q1/v1/assets/HerbaceousVegetation_1743.jpg',
 '/home/juan/.cache/eotdl/datasets/EuroSAT-RGB-Q1/v1/assets/HerbaceousVegetation_1977.jpg']
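
Since the downloaded file names start with the class name, the assets can be grouped by class again without reading the metadata. This is only a sketch: `group_by_label` is a hypothetical helper, and it assumes the `<label>_<number>.jpg` naming seen in the listing above.

```python
import os
from collections import defaultdict


def group_by_label(paths):
    """Group image paths by the class prefix of their file name.

    For example, 'AnnualCrop_1033.jpg' is grouped under 'AnnualCrop'.
    Hypothetical helper; it is not part of the EOTDL library.
    """
    groups = defaultdict(list)
    for path in paths:
        name = os.path.basename(path)
        # Split on the last underscore only, so the numeric suffix is dropped.
        label = name.rsplit("_", 1)[0]
        groups[label].append(path)
    return dict(groups)


# Example (assumes dst_path from the download above):
# from glob import glob
# groups = group_by_label(glob(dst_path + '/assets/*.jpg'))
```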

Alternatively, you can download a single asset using its URL, which is stored in the item's STAC metadata.

import json

with open(dst_path + '/EuroSAT-RGB-Q1/source/Highway_594/Highway_594.json', 'r') as f:
    data = json.load(f)

data['assets']
{'Highway_594': {'href': 'https://api.eotdl.com/datasets/654502991c54ab3a79d81007/download/Highway_594.jpg',
  'type': 'image/jpeg',
  'title': 'Highway_594',
  'roles': ['data']}}
from eotdl.datasets import download_file_url

url = data['assets']['Highway_594']['href']
download_file_url(url, 'data')
'data/assets/Highway_594.jpg'
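
If an item has more than one asset, the same idea extends to all of them. The sketch below collects every asset URL from an item dictionary such as `data` above; `asset_hrefs` is a hypothetical helper, not an EOTDL function, and it only relies on the `assets` structure shown in the output.

```python
def asset_hrefs(item):
    """Map each asset key of a STAC item dictionary to its href.

    Hypothetical helper for illustration; it is not part of the EOTDL library.
    """
    return {key: asset["href"] for key, asset in item.get("assets", {}).items()}


# Example (assumes `data` and `download_file_url` from the cells above):
# for key, href in asset_hrefs(data).items():
#     download_file_url(href, 'data')
```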