Intended Sarcasm Detection in English (SemEval 2022 Task A)
Transformer-based NLP with bias-aware labeling and BERT fine-tuning
- Built an NLP pipeline to detect intended sarcasm in English conversational text as a binary classification task (SemEval 2022 iSarcasmEval, Task A).
- Consolidated and re-labeled multi-part dataset splits into a unified binary corpus; prepared robust training/evaluation loops.
- Fine-tuned BERT-base (primary) and explored DistilBERT variants; applied class weighting and careful validation.
- Achieved a peak F1 (sarcastic class) of 0.6529 (epoch 5) on our validation split, exceeding the 0.6052 winning score reported for SemEval 2022 Task A.
Repo: GitHub
SemEval (Task site): https://sites.google.com/view/semeval2022-isarcasmeval
Dataset: https://github.com/iabufarha/iSarcasmEval
Associated Paper: https://aclanthology.org/2022.semeval-1.111.pdf
Highlights
- Task & Data: iSarcasmEval (SemEval 2022) English subset; 6,134 training examples and 1,400 test examples after consolidation
- Labeling Strategy: Converted each split into single-text entries with {sarcastic, non-sarcastic} labels; used the majority annotator label where applicable
- Models: BERT-base (primary), DistilBERT (explored)
- Training (BERT-base): batch size 4, epochs 10, learning rate 2e-5, weight decay 0.01
- Metrics: F1 (sarcastic), precision/recall, calibration and error slicing
- Result: F1 (sarcastic) = 0.6529 @ epoch 5 (vs. 0.6052 SemEval winning score for Task A)
Dataset & Labeling Overview
- Part 1: Sarcastic tweets + human unsarcastic rephrases → sarcastic = “sarcastic” / rephrase = “non-sarcastic”
- Part 2: Texts with 5 annotator votes → use majority label
- Part 3: Paired sarcastic vs. non-sarcastic texts → add both items with corresponding labels
This yields a diverse binary dataset reflecting author-reported sarcasm and annotator validation.
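A minimal consolidation sketch of this three-part labeling, assuming the parts ship as CSV files; the file names and column names below (`tweet`, `rephrase`, `votes`, `sarcastic_text`, `non_sarcastic_text`) are hypothetical and may differ from the actual iSarcasmEval release:

```python
import pandas as pd

rows = []

# Part 1: sarcastic tweets paired with human unsarcastic rephrases.
part1 = pd.read_csv("part1.csv")  # hypothetical file/column names
rows += [{"text": t, "label": 1} for t in part1["tweet"].dropna()]
rows += [{"text": t, "label": 0} for t in part1["rephrase"].dropna()]

# Part 2: texts with five annotator votes; keep the majority label.
part2 = pd.read_csv("part2.csv")  # "votes" = sarcastic votes out of 5
rows += [{"text": r.text, "label": int(r.votes >= 3)} for r in part2.itertuples()]

# Part 3: paired sarcastic vs. non-sarcastic texts; add both sides.
part3 = pd.read_csv("part3.csv")
rows += [{"text": t, "label": 1} for t in part3["sarcastic_text"].dropna()]
rows += [{"text": t, "label": 0} for t in part3["non_sarcastic_text"].dropna()]

corpus = pd.DataFrame(rows).drop_duplicates(subset="text").reset_index(drop=True)
corpus.to_csv("isarcasm_binary.csv", index=False)
```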
Example (EN)
Sarcastic: “Gotta love people who follow you and unfollow because you don’t follow them within an hour or 2. Sorry I don’t stay on Twitter 24/7.”
Unsarcastic: “I dislike people who follow me, only to unfollow me when I don’t follow back right away. I’m not on Twitter that much to follow right away.”
Model & Hyperparameters
We fine-tuned BERT-base for binary classification; DistilBERT was also tested as a lighter alternative.
| Parameter | Value |
|---|---|
| Batch size | 4 |
| Epochs | 10 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
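A minimal fine-tuning sketch under these hyperparameters, using the Hugging Face `transformers` Trainer (assuming a recent version). The weighted cross-entropy below is one standard way to implement the class weighting mentioned in the highlights, not necessarily the exact scheme used in this project; the consolidated CSV comes from the sketch above:

```python
import torch
from torch.nn import CrossEntropyLoss
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Consolidated corpus from the labeling sketch above (hypothetical path).
ds = load_dataset("csv", data_files={"train": "isarcasm_binary.csv"})["train"]
ds = ds.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
            batched=True)
ds = ds.train_test_split(test_size=0.1, seed=42)

# Inverse-frequency class weights to counter the label imbalance.
counts = torch.bincount(torch.tensor(ds["train"]["label"]), minlength=2).float()
class_weights = counts.sum() / (2.0 * counts)

class WeightedTrainer(Trainer):
    """Trainer with class-weighted cross-entropy (one common weighting scheme)."""
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fn = CrossEntropyLoss(weight=class_weights.to(outputs.logits.device))
        loss = loss_fn(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

args = TrainingArguments(
    output_dir="bert-isarcasm",
    per_device_train_batch_size=4,  # batch size 4
    num_train_epochs=10,            # epochs 10
    learning_rate=2e-5,             # learning rate 2e-5
    weight_decay=0.01,              # weight decay 0.01
    eval_strategy="epoch",
)

trainer = WeightedTrainer(model=model, args=args,
                          train_dataset=ds["train"], eval_dataset=ds["test"])
trainer.train()
```

Swapping `bert-base-uncased` for `distilbert-base-uncased` reproduces the lighter DistilBERT variant with the same loop.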
Results (Validation)
Best F1 on the sarcastic class reached 0.6529 at epoch 5.
| Epoch | Train Loss | Val Loss | F1 (sarcastic) |
|---|---|---|---|
| 1 | 0.5382 | 0.2890 | 0.5900 |
| 2 | 0.5203 | 0.4868 | 0.6217 |
| 3 | 0.3473 | 0.5875 | 0.6499 |
| 4 | 0.2091 | 0.7926 | 0.6453 |
| 5 | 0.0925 | 0.9241 | 0.6529 |
| 6 | 0.0718 | 0.8687 | 0.6352 |
| 7 | 0.0714 | 1.0984 | 0.6300 |
| 8 | 0.0389 | 1.2328 | 0.6352 |
| 9 | 0.0241 | 1.2649 | 0.6403 |
| 10 | 0.0142 | 1.3279 | 0.6377 |
Note: Validation loss climbs from epoch 2 onward while F1 keeps improving through epoch 5, indicating mild overfitting; we therefore select the epoch-5 checkpoint by F1 rather than by loss.
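The sarcastic-class scores in the table can be computed with scikit-learn, and checkpoint selection by F1 (rather than loss) can be automated; a sketch that plugs into the illustrative Trainer setup above:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # Score the sarcastic class (label 1) specifically, as in the table above.
    return {
        "f1": f1_score(labels, preds, pos_label=1),
        "precision": precision_score(labels, preds, pos_label=1),
        "recall": recall_score(labels, preds, pos_label=1),
    }

# Passing compute_metrics=compute_metrics to the Trainer, together with
#   eval_strategy="epoch", save_strategy="epoch",
#   load_best_model_at_end=True, metric_for_best_model="f1",
# keeps the best checkpoint by F1 (here, epoch 5) instead of the final one.
```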
Limitations
- Primary data source is Twitter, which may limit generalization across platforms/domains.
- Sarcasm manifests differently by community and context; cross-domain robustness requires further data and adaptation.
Conclusion
Our BERT-based system (with DistilBERT explored as a lighter alternative) detects intended sarcasm on iSarcasmEval effectively, reaching a peak F1 of 0.6529 on the sarcastic class and surpassing the 0.6052 winning score reported for SemEval 2022 Task A. Future work: domain adaptation beyond Twitter, richer context modeling, and improved calibration.