Results of ALiCT

Table of Contents

  • Linguistic Capability Specifications
  • Baselines
  • Experiment Results

Linguistic Capability Specifications

Table 1: Structural predicates and generative rules for the linguistic capabilities of sentiment analysis.

lc-spec-table

Table 2: Structural predicates and generative rules for the linguistic capabilities of hate speech detection. The slur and profanity placeholders in LC1-LC4 are collections of terms that express slurs and profanity. The identity placeholder in LC11-LC12 is a list of names used to describe social groups. In this work, we reuse these terms from Hatecheck.

hsd-lc-spec-table
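
To make the role of these placeholders concrete, the sketch below shows how a generative rule can expand a template over placeholder term lists. The template and term lists here are hypothetical and used only for illustration; the actual slur, profanity, and identity collections are reused from Hatecheck as noted above.

```python
from itertools import product

# Hypothetical placeholder term lists; the real slur/profanity/identity
# collections are reused from Hatecheck, as described in Table 2.
identity = ["women", "immigrants", "trans people"]
profanity = ["<profanity placeholder>"]  # real terms intentionally omitted

# A generative rule instantiates a template once per combination of
# placeholder values.
template = "I really hate {identity}, they are {profanity}."

test_cases = [
    template.format(identity=i, profanity=p)
    for i, p in product(identity, profanity)
]

for case in test_cases:
    print(case)
```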

Baselines

Capability Testing Baselines

ALiCT is evaluated against the state-of-the-art linguistic capability testing baselines for sentiment analysis and hate speech detection, as follows:

  1. CHECKLIST (paper, repo) for sentiment analysis
  2. Hatecheck (paper, repo) for hate speech detection

Model Under Test

Given the test cases generated by ALiCT and the capability testing baselines, the models in Table 3 are evaluated:

Table 3: The NLP models used in our evaluation.

model-under-test
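
As an illustration of how generated test cases can be run against a model under test, the sketch below scores a small set of labeled test cases with a Hugging Face sentiment analysis pipeline. The checkpoint name and the example test cases are assumptions for illustration only; Table 3 lists the exact models used in the evaluation.

```python
from transformers import pipeline

# Sentiment analysis model under test. This checkpoint is an assumption for
# illustration; Table 3 lists the exact models used in the evaluation.
model_under_test = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# Hypothetical ALiCT-style test cases paired with their expected labels.
test_cases = [
    {"text": "I really enjoyed the film, even though it ran long.", "expected": "POSITIVE"},
    {"text": "The service was slow and the food was cold.", "expected": "NEGATIVE"},
]

predictions = model_under_test([case["text"] for case in test_cases])

# A test case fails when the predicted label disagrees with the expected label.
failures = [
    case
    for case, pred in zip(test_cases, predictions)
    if pred["label"] != case["expected"]
]
print(f"{len(failures)} / {len(test_cases)} test cases failed")
```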

Evaluation of the expansion phase of ALiCT

The test case diversity provided by the expansion phase of ALiCT is also compared against that of one syntax-based approach (MT-NLP) and three adversarial approaches (Alzantot-attack, BERT-Attack, and SememePSO-attack), as follows:

  • Syntax-based approach

MT-NLP: Metamorphic Testing and Certified Mitigation of Fairness Violations in NLP Models

  • Adversarial approaches

Alzantot-attack: Generating Natural Language Adversarial Examples
BERT-Attack: BERT-ATTACK: Adversarial Attack Against BERT Using BERT
SememePSO-attack: Word-level textual adversarial attacking as combinatorial optimization

Experiment Results

RQ1: Diversity

alict-fig4

Figure 1: Results of Self-BLEU (left) and syntactic diversity (right) of ALiCT and the capability-based testing baselines for sentiment analysis and hate speech detection. Using only ALiCT seed sentences and using all ALiCT sentences are denoted as ALiCT and ALiCT+EXP, respectively.
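
For reference, Self-BLEU is computed per test suite: each sentence is scored with BLEU against all other sentences in the suite as references, and the scores are averaged, so lower values indicate more diverse test cases. The snippet below is a minimal sketch of that standard definition using NLTK; ALiCT's evaluation scripts may tokenize and weight n-grams differently.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

def self_bleu(sentences, weights=(0.25, 0.25, 0.25, 0.25)):
    """Average BLEU of each sentence against all others in the suite.

    Lower Self-BLEU means the generated test cases are more diverse.
    """
    smooth = SmoothingFunction().method1
    tokenized = [s.split() for s in sentences]  # simple whitespace tokenization
    scores = []
    for i, hypothesis in enumerate(tokenized):
        references = tokenized[:i] + tokenized[i + 1:]
        scores.append(
            sentence_bleu(references, hypothesis,
                          weights=weights, smoothing_function=smooth)
        )
    return sum(scores) / len(scores)

# Toy example with three hypothetical test sentences.
print(self_bleu([
    "the movie was great",
    "the movie was terrible",
    "i would not recommend this restaurant",
]))
```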

alict-fig5

Figure 2: Results of Self-BLEU (left) and syntactic diversity (right) between the original sentences of the capability-based testing baselines and the ALiCT sentences generated from those originals.

Table 4: Comparison results against MT-NLP.

mtnlp-results

Table 5: Comparison results against adversarial attacks.

adv-attack-results

neuron-coverage.png

Figure 3: Neuron coverage results of ALiCT and CHECKLIST.

Table 6: Examples of text generation compared with the syntax-based and adversarial generation baselines.

text-generation-examples

RQ2: Effectiveness

Table 7: Results of BERT-base, RoBERTa-base, and DistilBERT-base sentiment analysis models on ALiCT test cases using all seeds. The BERT-base, RoBERTa-base, and DistilBERT-base models are denoted as BERT, RoBERTa, and dstBERT, respectively.

sa-test-results

Table 8: Results of dehate-BERT and twitter-RoBERTa hate speech detection models on ALiCT test cases using all seeds. The dehate-BERT and twitter-RoBERTa models are denoted as BERT and RoBERTa, respectively.

hsd-test-results

RQ3: Consistency

Table 9: Consistency Results.

consistency-results