Intended Sarcasm Detection in English (SemEval 2022 Task A)
Transformer-based NLP with bias-aware labeling and BERT fine-tuning
- Built an NLP pipeline to detect intended sarcasm in English conversational text as a binary classification task (SemEval 2022 iSarcasmEval, Task A).
- Consolidated and re-labeled multi-part dataset splits into a unified binary corpus; prepared robust training/evaluation loops.
- Fine-tuned BERT-base (primary) and explored DistilBERT variants; applied class weighting and careful validation.
- Achieved a peak F1 (sarcastic class) of 0.6529 (epoch 5) on our validation split, exceeding the 0.6052 winning score reported for SemEval 2022 Task A.
Repo: GitHub
SemEval (Task site): https://sites.google.com/view/semeval2022-isarcasmeval
Dataset: https://github.com/iabufarha/iSarcasmEval
Associated Paper: https://aclanthology.org/2022.semeval-1.111.pdf
Highlights
- Task & Data: iSarcasmEval (SemEval 2022) English subset; 6,134 training examples and 1,400 test examples after consolidation
- Labeling Strategy: Converted each split into single-text entries with {sarcastic, non-sarcastic} labels; used the majority annotator label where applicable
- Models: BERT-base (primary), DistilBERT (explored)
- Training (BERT-base): batch size 4, epochs 10, learning rate 2e-5, weight decay 0.01
- Metrics: F1 (sarcastic), precision/recall, calibration and error slicing
- Result: F1 (sarcastic) = 0.6529 @ epoch 5 (vs. 0.6052 SemEval winning score for Task A)
Dataset & Labeling Overview
- Part 1: Sarcastic tweets + human unsarcastic rephrases → sarcastic = “sarcastic” / rephrase = “non-sarcastic”
- Part 2: Texts with 5 annotator votes → use majority label
- Part 3: Paired sarcastic vs. non-sarcastic texts → add both items with corresponding labels
This yields a diverse binary dataset reflecting author-reported sarcasm and annotator validation.
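A minimal consolidation sketch of this three-part labeling, assuming the parts ship as CSV files; the file names and column names below (`tweet`, `rephrase`, `votes`, `sarcastic_text`, `non_sarcastic_text`) are hypothetical and may differ from the actual iSarcasmEval release:

```python
import pandas as pd

rows = []

# Part 1: sarcastic tweets paired with human unsarcastic rephrases.
part1 = pd.read_csv("part1.csv")  # hypothetical file/column names
rows += [{"text": t, "label": 1} for t in part1["tweet"].dropna()]
rows += [{"text": t, "label": 0} for t in part1["rephrase"].dropna()]

# Part 2: texts with five annotator votes; keep the majority label.
part2 = pd.read_csv("part2.csv")  # "votes" = sarcastic votes out of 5
rows += [{"text": r.text, "label": int(r.votes >= 3)} for r in part2.itertuples()]

# Part 3: paired sarcastic vs. non-sarcastic texts; add both sides.
part3 = pd.read_csv("part3.csv")
rows += [{"text": t, "label": 1} for t in part3["sarcastic_text"].dropna()]
rows += [{"text": t, "label": 0} for t in part3["non_sarcastic_text"].dropna()]

corpus = pd.DataFrame(rows).drop_duplicates(subset="text").reset_index(drop=True)
corpus.to_csv("isarcasm_binary.csv", index=False)
```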
Example (EN)
Sarcastic: “Gotta love people who follow you and unfollow because you don’t follow them within an hour or 2. Sorry I don’t stay on Twitter 24/7.”
Unsarcastic: “I dislike people who follow me, only to unfollow me when I don’t follow back right away. I’m not on Twitter that much to follow right away.”
Model & Hyperparameters
We fine-tuned BERT-base for binary classification; DistilBERT was also tested as a lighter alternative.
| Parameter | Value |
|---|---|
| Batch size | 4 |
| Epochs | 10 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
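A minimal fine-tuning sketch under these hyperparameters, using the Hugging Face `transformers` Trainer (assuming a recent version). The weighted cross-entropy below is one standard way to implement the class weighting mentioned in the highlights, not necessarily the exact scheme used in this project; the consolidated CSV comes from the sketch above:

```python
import torch
from torch.nn import CrossEntropyLoss
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Consolidated corpus from the labeling sketch above (hypothetical path).
ds = load_dataset("csv", data_files={"train": "isarcasm_binary.csv"})["train"]
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
            batched=True)
ds = ds.train_test_split(test_size=0.1, seed=42)

# Inverse-frequency class weights to counter the label imbalance.
counts = torch.bincount(torch.tensor(ds["train"]["label"]), minlength=2).float()
class_weights = counts.sum() / (2.0 * counts)

class WeightedTrainer(Trainer):
    """Trainer with class-weighted cross-entropy (one common weighting scheme)."""
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fn = CrossEntropyLoss(weight=class_weights.to(outputs.logits.device))
        loss = loss_fn(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

args = TrainingArguments(
    output_dir="bert-isarcasm",
    per_device_train_batch_size=4,  # batch size 4
    num_train_epochs=10,            # epochs 10
    learning_rate=2e-5,             # learning rate 2e-5
    weight_decay=0.01,              # weight decay 0.01
    eval_strategy="epoch",
)

trainer = WeightedTrainer(model=model, args=args,
                          train_dataset=ds["train"], eval_dataset=ds["test"])
trainer.train()
```

Swapping `bert-base-uncased` for `distilbert-base-uncased` reproduces the lighter DistilBERT variant with the same loop.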
Results (Validation)
Best F1 on the sarcastic class reached 0.6529 at epoch 5.
| Epoch | Train Loss | Val Loss | F1 (sarcastic) |
|---|---|---|---|
| 1 | 0.5382 | 0.2890 | 0.5900 |
| 2 | 0.5203 | 0.4868 | 0.6217 |
| 3 | 0.3473 | 0.5875 | 0.6499 |
| 4 | 0.2091 | 0.7926 | 0.6453 |
| 5 | 0.0925 | 0.9241 | 0.6529 |
| 6 | 0.0718 | 0.8687 | 0.6352 |
| 7 | 0.0714 | 1.0984 | 0.6300 |
| 8 | 0.0389 | 1.2328 | 0.6352 |
| 9 | 0.0241 | 1.2649 | 0.6403 |
| 10 | 0.0142 | 1.3279 | 0.6377 |
Note: Validation loss climbs from epoch 2 onward while F1 keeps improving through epoch 5, indicating mild overfitting; we therefore select the epoch-5 checkpoint by F1 rather than by loss.
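The sarcastic-class scores in the table can be computed with scikit-learn, and checkpoint selection by F1 (rather than loss) can be automated; a sketch that plugs into the illustrative Trainer setup above:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Score the sarcastic class (label 1) specifically, as in the table above.
    return {
        "f1": f1_score(labels, preds, pos_label=1),
        "precision": precision_score(labels, preds, pos_label=1),
        "recall": recall_score(labels, preds, pos_label=1),
    }

# Passing compute_metrics=compute_metrics to the Trainer, together with
#   eval_strategy="epoch", save_strategy="epoch",
#   load_best_model_at_end=True, metric_for_best_model="f1",
# keeps the best checkpoint by F1 (here, epoch 5) instead of the final one.
```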
Limitations
- Primary data source is Twitter, which may limit generalization across platforms/domains.
- Sarcasm manifests differently by community and context; cross-domain robustness requires further data and adaptation.
Conclusion
Our BERT-based system (with DistilBERT explored as a lighter alternative) detects intended sarcasm on iSarcasmEval effectively, reaching a peak F1 of 0.6529 on the sarcastic class and surpassing the 0.6052 winning score reported for SemEval 2022 Task A. Future work: domain adaptation beyond Twitter, richer context modeling, and improved calibration.