Glossary of Terms
**AI-readiness**: An emerging concept that, in analogy to ARD, describes a higher level of data preparation that allows for the direct employment of the data in AI/ML workflows. AI-readiness also refers to self-explanatory TDS or models that simplify use by non-experts in the (EO) domain. FAIR data principles will arguably play a key role in the definition of AI-readiness in EO.
**Analysis Ready Data (ARD)**: Level-1 or Level-2 EO datasets processed to a sufficient level of (geometric, radiometric and spectral) pre-processing that allows the direct use of the data in analytics such as time-series analysis. ARD should ensure that observed trends or other features are not artefacts of inconsistent data calibration. ARD specifications are often specific to their application domain, e.g. CEOS defined ARD for land applications (CARD4L), which includes surface reflectance and backscatter measurements. ARD considerations are also critical for enabling interoperability between observations from different sensors. The CARD4L specification is gaining widespread adoption in the EO community and industry.
**Analysis Ready Cloud Optimized (ARCO)**: Analysis Ready Data that is also stored in formats allowing efficient, direct access to data subsets.
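As an illustration of "direct access to data subsets", the following minimal sketch performs a windowed read from a Cloud Optimized GeoTIFF over HTTP, fetching only the bytes needed for the requested subset. The URL is a hypothetical placeholder, not a real product:

```python
# A minimal sketch of subset access from ARCO data, assuming a hypothetical
# Cloud Optimized GeoTIFF reachable over HTTP.
import rasterio
from rasterio.windows import Window

url = "https://example.com/some-cog.tif"  # hypothetical COG location

with rasterio.open(url) as src:
    # Read a 512x512 pixel subset of band 1 without downloading the full file.
    subset = src.read(1, window=Window(col_off=1024, row_off=1024,
                                       width=512, height=512))

print(subset.shape)  # (512, 512)
```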
**Artifact**: The product of a pipeline component. It can be a file (e.g. a raw or curated dataset, a JSON file, an HTML data report, a text file) or a directory.
**Benchmark dataset**: Benchmark datasets are mature (i.e. well described and quality controlled) paired Training Datasets with defined application purposes (e.g. vessel detection in SAR imagery). They are ideally curated and offered together with independent test data and expected metrics, so that the performance of different ML algorithms or models can be compared.
**Cloud-agnostic feature engineering**: The process of feature engineering may involve multiple steps and computations, which may or may not be carried out by third-party services (e.g. the Sentinel Hub Statistical API or another API of a cloud-based EO service provider). Depending on the service used (or even the version of the same service), the same process can result in different features, which may hurt the performance of ML models. Cloud-agnostic feature engineering refers to ensuring that the same features are always retrieved for the same process, which may include storing key metadata alongside the models so that the same service and version are used for feature engineering at inference time.
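One way to do this is to record feature provenance alongside the model. The sketch below is a minimal, illustrative example; the field names and values are assumptions, not a standard schema:

```python
# A minimal sketch of recording feature-engineering provenance alongside a
# model, so inference can verify it uses the same service and version.
import json

feature_provenance = {
    "service": "sentinel-hub-statistical-api",   # hypothetical identifier
    "service_version": "v1.0",                   # illustrative version string
    "features": ["NDVI_mean", "NDVI_p90"],
    "request_parameters": {"aggregation": "P30D"},
}

with open("model_features.json", "w") as f:
    json.dump(feature_provenance, f, indent=2)

# At inference time, load this file and refuse to run (or warn) if the live
# service or version does not match what the model was trained with.
```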
**Computer Vision (CV)**: Field of study concerned with extracting useful information from images and videos. Since images are the primary source of information in EO applications, the majority of training datasets available today are tailored to such applications. Thanks to advances in Deep Learning, such as CNNs, and the availability of Open Source software, the field of CV has seen a tremendous increase in interest lately.
**Data curation**: The organization and integration of data collected from various sources such that the value of the data is maintained over time and the data remains available for reuse and preservation. Data curation includes "all the processes needed for principled and controlled data creation, maintenance, and management, together with the capacity to add value to data".
**Data segregation**: The process of splitting the data, for example into train, validation and test sets.
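A minimal sketch of such a split using scikit-learn, with illustrative 70/15/15 proportions and synthetic data:

```python
# Split a dataset into train, validation and test sets (70/15/15).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)               # synthetic features
y = np.random.randint(0, 2, size=1000)    # synthetic labels

# First split off the test set, then carve a validation set out of the rest.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, random_state=42)
```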
**Data validation**: Procedure to verify that our assumptions about the data are correct and remain correct, both for new data added to an existing dataset and for brand-new datasets. Commonly used techniques are hypothesis testing, deterministic tests (verifying data attributes measurable without uncertainty, such as the number of columns, categorical variable values or the legal range of numerical variables) and non-deterministic tests (comparing the current dataset with a fixed reference one).
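A minimal sketch of deterministic checks with pandas; the column names, ranges and categories are illustrative assumptions about the dataset at hand:

```python
# Deterministic data validation: attributes verifiable without uncertainty.
import pandas as pd

def validate(df: pd.DataFrame) -> None:
    expected_columns = {"ndvi", "land_cover"}            # assumed schema
    assert expected_columns.issubset(df.columns), "missing columns"
    assert df["ndvi"].between(-1.0, 1.0).all(), "NDVI out of legal range"
    assert set(df["land_cover"].unique()) <= {"crop", "forest", "water"}, \
        "unexpected category value"

validate(pd.DataFrame({"ndvi": [0.2, 0.7], "land_cover": ["crop", "water"]}))
```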
**Dataset**: A collection of data representing particular variables of interest, consisting of multiple records of the data in question. Examples include structured datasets such as tabular data, where records are stored as the rows of a table and variables as the columns, and unstructured datasets such as text, images, video or audio (among others). The values represented can be numeric or categorical and may include missing values. Relatedly, an EO data product is a dataset containing observations made by a remote sensing instrument that is generated following a fixed specification and production procedure.
**Deep Learning (DL)**: Subfield of Machine Learning primarily concerned with the study and development of (deep) artificial neural networks, although other algorithms that learn directly from raw data in an end-to-end way, without requiring feature engineering, can also be included. The field has seen an increase in interest in the past decade thanks to increases in data availability, computing resources and Open Source software.
**Exploratory Data Analysis (EDA)**: An interactive analysis performed at the beginning of a project to explore the data and learn as much as possible from it. It informs many decisions about the development of a model: maximizing data understanding (data types, ranges, cardinality, correlations, outliers, etc.), identifying and correcting data problems (missing data, uninformative features), testing assumptions about the problem and uncovering data biases.
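A minimal sketch of a first EDA pass with pandas, covering types, ranges, missing values and correlations; `df` stands in for any tabular dataset:

```python
# First EDA pass: data types, ranges, missing values, correlations.
import pandas as pd

df = pd.DataFrame({"ndvi": [0.1, 0.5, None, 0.8],
                   "temp": [280, 290, 285, 300]})

print(df.dtypes)                    # data types
print(df.describe())                # ranges, means, quartiles
print(df.isna().sum())              # missing values per column
print(df.corr(numeric_only=True))   # pairwise correlations
```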
**ETL (Extract, Transform, Load)**: Classic structure for data pipelines used to fetch, pre-process and store a dataset.
**Feature**: Features are variables that compose the data input of a machine learning (ML) model; in ML they are typically referred to as input or independent variables. In the context of EO these can be the pixel values of different spectral bands in optical data or the backscatter coefficients in different polarisations for SAR data. EO features can also be derived variables, such as vegetation indices (e.g. NDVI) or polarisation ratios. The process of using domain knowledge to extract features suitable for training an ML model from raw input data is termed feature engineering.
**Feature engineering**: The independent variables in an EO TDS can consist of original EO observations (e.g. L1-L3 products) based on a single data acquisition or an entire time series of observations. They may also represent derived variables such as vegetation indices or the results of other data transformations (e.g. PCA or phenological metrics). Often, EO data time series are transformed into more relevant statistical metrics or representations such as percentiles, arithmetic means, variances/standard deviations or interquartile ranges. This process is referred to as feature engineering and is often of critical importance to the effective use of ML in EO.
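A minimal sketch of this kind of feature engineering: compute NDVI per acquisition, then reduce the time dimension to statistical metrics. The arrays are synthetic stand-ins for real reflectance data:

```python
# Engineer statistical features from a synthetic EO reflectance time series.
import numpy as np

# Synthetic reflectance time series with shape (time, height, width).
red = np.random.rand(12, 100, 100)
nir = np.random.rand(12, 100, 100)

ndvi = (nir - red) / (nir + red + 1e-9)   # small epsilon avoids divide-by-zero

# Collapse the time axis into per-pixel statistical metrics.
features = {
    "ndvi_mean": ndvi.mean(axis=0),
    "ndvi_std": ndvi.std(axis=0),
    "ndvi_p10": np.percentile(ndvi, 10, axis=0),
    "ndvi_p90": np.percentile(ndvi, 90, axis=0),
}
```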
**Feature store**: A Machine Learning Operations (MLOps) tool that can store both the definition and the implementation of features, and serve them for online (real-time) inference with low latency and for offline (batch) inference with high throughput. It provides a centralized implementation of features so that different models requiring the same feature access a single storage entity (feature registry).
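A toy sketch of the core idea (a single registry holding both definition and implementation of each feature); real feature stores such as Feast add storage, serving and versioning on top of this:

```python
# Toy feature registry: one place to define and compute each feature.
from typing import Callable, Dict

class FeatureRegistry:
    def __init__(self) -> None:
        self._features: Dict[str, Callable] = {}

    def register(self, name: str, fn: Callable) -> None:
        self._features[name] = fn

    def compute(self, name: str, *args, **kwargs):
        return self._features[name](*args, **kwargs)

registry = FeatureRegistry()
registry.register("ndvi", lambda nir, red: (nir - red) / (nir + red))
print(registry.compute("ndvi", nir=0.6, red=0.2))  # 0.5
```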
**In-situ data**: Measurements of the target variable taken in situ (i.e. in the original place). In an EO context, in-situ data are typically collected in field/ground campaigns, over cal/val sites, or through surveys and instrument measurements.
**Machine Learning (ML)**: Subfield of Artificial Intelligence concerned with the study of algorithms that are capable of learning and adapting to a particular task (or set of tasks) directly from data, without requiring explicit instructions. Different levels of human intervention are required depending on the algorithm, ranging from feature engineering for traditional models such as SVMs or Random Forests to models able to learn end-to-end from raw data, e.g. artificial neural networks.
**MLOps**: Set of best practices and methods for the efficient end-to-end development and operation of performant, scalable, reliable, automated and reproducible ML solutions in a real production setting.
**Pipeline**: A sequence of one or more components linked together by artifacts and controlled by hyperparameters and/or configurations. It should be tracked and reproducible.
**Parameter retrieval**: Even though imagery is the primary source of EO data, most problems relate to the retrieval of bio-geophysical or bio-chemical variables (in general referred to as parameter retrieval). This application usually involves non-imagery data, such as time series or tabular data, where traditional (non-DL) ML algorithms such as XGBoost can give better results. A feature engineering step is usually required first to extract the relevant features from the source data.
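A minimal sketch of parameter retrieval framed as tabular regression: predicting a biophysical variable from engineered features with XGBoost. The data is synthetic:

```python
# Parameter retrieval as tabular regression with a gradient-boosted model.
import numpy as np
from xgboost import XGBRegressor

X = np.random.rand(500, 8)   # engineered features (e.g. indices, statistics)
y = X[:, 0] * 2.0 + np.random.normal(0, 0.1, size=500)  # synthetic target

model = XGBRegressor(n_estimators=200, max_depth=4)
model.fit(X, y)
print(model.predict(X[:3]))  # retrieved parameter values for three samples
```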
**Pre-processing**: The pre-processing step sits right at the beginning of the ML pipeline, just after data fetching. It implements the cleaning steps and other pre-processing identified during the EDA. The pre-processing step should only apply transformations that are needed to make the training and test data look like what the model will encounter in production.
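One way to guarantee that training and production inputs go through identical transformations is to package pre-processing with the model, as in this minimal scikit-learn sketch:

```python
# Bundle pre-processing and model so the same transforms run at fit and
# predict time.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # handle missing values
    ("scale", StandardScaler()),                   # normalise feature ranges
    ("model", LogisticRegression()),
])

X = np.random.rand(200, 4)                 # synthetic features
y = np.random.randint(0, 2, size=200)      # synthetic labels
pipeline.fit(X, y)
print(pipeline.predict(X[:5]))             # same steps re-run automatically
```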
**Pre-trained models**: In contemporary AI and ML, pre-trained models offer the benefit of ready-trained ML model architectures (e.g. with various convolutional layers) which are computationally costly to generate and have general benefits across application scenarios. In addition, a more pragmatic and application-centric perspective on pre-trained models is also considered in the context of this project, where pre-trained models offer the EO community analytical building blocks to easily integrate a specific analytical capability (e.g. vessel detection in Sentinel-2 imagery) into a larger workflow.
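A minimal sketch of reusing a pre-trained backbone: load ImageNet weights and replace the classification head for a new task. PyTorch/torchvision and the 2-class setup are illustrative choices, not part of the original text:

```python
# Adapt a pre-trained ResNet backbone to a new (e.g. 2-class EO) task.
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)  # new head for 2 classes

# Optionally freeze the pre-trained layers and train only the new head.
for name, param in model.named_parameters():
    if not name.startswith("fc"):
        param.requires_grad = False
```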
**Product validation datasets**: Datasets that support the validation of EO-derived products or maps. Typically they are created following specific statistical sampling procedures (e.g. random, systematic or stratified sampling). They support independent accuracy assessments and validation efforts, and are typically created using methods different from those used to create a TDS.
**Production-grade inference**: AI inference can be carried out on cloud servers or at the edge (such as on the final user's device). Cloud inference allows control over the hardware and facilitates monitoring, while compromising on aspects such as privacy or latency, for which edge inference is preferred. Cloud inference is usually done with a web server that runs the model, receiving inputs and returning predictions. Different hardware may be required depending on the model and the available computational budget (more or fewer CPU cores, RAM or even GPUs). Additionally, to ensure high availability of the service, load balancers and orchestration techniques are usually required. All of these features make for "production-grade inference", in contrast to a single web server running on accessible hardware without fine-grained control.
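A minimal sketch of the core request loop of such a web server, using FastAPI as an illustrative framework; the endpoint name and placeholder scoring are assumptions, and a production deployment would add load balancing, monitoring and orchestration around it:

```python
# Minimal inference server: receive features, return a prediction.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Features(BaseModel):
    values: list[float]  # engineered input features

@app.post("/predict")
def predict(features: Features) -> dict:
    # Placeholder for a real model call, e.g. model.predict(...).
    score = sum(features.values) / max(len(features.values), 1)
    return {"prediction": score}

# Run with, e.g.: uvicorn server:app --host 0.0.0.0 --port 8000
# (assuming this file is named server.py)
```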
**Reference datasets**: An umbrella term covering in-situ data, ground truth or similar datasets based on measurements that ideally have very high accuracy for the target variable in question. They are often used as product validation datasets.
**Reproducible workflow**: An orchestrated, tracked and versioned workflow that can be reproduced and inspected.
**Semantic versioning**: Versioning scheme identifying each version with a number containing three parts, X.Y.Z (see the sketch after this list):
- X is the major version, increment this when introducing changes breaking backward compatibility.
- Y is the minor version, increment this when backward-compatible changes are implemented.
- Z is the patch version, increment this when addressing bugs or making smaller feature changes.
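A minimal sketch of applying the X.Y.Z rules above: bump the part that matches the kind of change and reset the parts to its right.

```python
# Apply semantic versioning rules to a version string.
def bump(version: str, change: str) -> str:
    major, minor, patch = map(int, version.split("."))
    if change == "breaking":   # backward-incompatible change
        return f"{major + 1}.0.0"
    if change == "feature":    # backward-compatible change
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # bug fix / small change

print(bump("1.4.2", "breaking"))  # 2.0.0
print(bump("1.4.2", "feature"))   # 1.5.0
print(bump("1.4.2", "patch"))     # 1.4.3
```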
**Test dataset**: Test datasets are typically a subset of the training dataset held out from training an ML model (e.g. via a train-test split) and used to evaluate the predictive performance of ML models after training. In benchmark datasets, they are useful for tracking the progress of the field (e.g. ImageNet).
**Training dataset (TDS)**: Essential dataset used to train and parameterise an ML model, consisting of reference values of the target variable (dependent or Y variable, commonly referred to as the "label") and the predictive features (independent or X variables), which typically represent EO data or derived features spatially and temporally collocated with the reference data. Training datasets are paired combinations of reference values and EO features used to train an ML model. However, unpaired TDS are also important and need to be considered, as these allow pairing with various EO features such as different sensor data, data products or metrics/statistics. An important consideration is the different levels of granularity of EO TDS, such as (1) dataset level (e.g. European Crop Types), (2) measurement level (e.g. wheat parcel #5362) and (3) pixel level (e.g. all pixel-level time series of EO features contained in parcel #5362). TDS have temporal and spatial representativeness: for example, an in-situ measurement of a biophysical variable has a discrete time stamp of when the measurement was taken and, depending on the measurement procedure, will only be representative of a certain spatial unit for the target variable. Training datasets are usually extracted from datasets as a split, leaving some data for validation and/or testing purposes. Datasets without labels can still be considered training datasets for unsupervised learning applications.
**Validation dataset**: Dataset subset used for evaluating ML models during training. It is typically extracted from the training dataset (and hence not used for training) and is useful for hyperparameter tuning. Once the best set of hyperparameters is found, the validation data can be folded back into training to produce the final model (which is evaluated with a test dataset). Validation is usually performed with a held-out validation dataset or using cross-validation.
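A minimal sketch of the two evaluation strategies mentioned above, a held-out validation split and k-fold cross-validation, using scikit-learn and synthetic data:

```python
# Held-out validation vs. 5-fold cross-validation.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(300, 6)
y = np.random.randint(0, 2, size=300)

# Held-out validation set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_val, y_val))

# 5-fold cross-validation over the same data.
scores = cross_val_score(RandomForestClassifier(), X, y, cv=5)
print("CV accuracy:", scores.mean())
```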
**Versioning**: Tracking the dependencies and evolution of a software project (e.g. version 1.0 of a dataset may have 10 features, version 2.0 may have 15). The approach typically followed to manage this is Semantic Versioning.