Javin Liu, Hao Yu, Vidya Sujaya, Pratheeksha Nair, Kellin Pelrine, Reihaneh Rabbany

Conference on Empirical Methods in Natural Language Processing

Abstract

In this work, we propose SWEET (Supervise Weakly for Entity Extraction to fight Trafficking), a weak supervision pipeline for extracting person names from noisy escort advertisements. Our method combines the simplicity of rule-matching (through antirules, i.e., negated rules) with the generalizability of large language models fine-tuned on benchmark, domain-specific, and synthetic datasets, treating the outputs of both as weak labels. One of the major challenges in this domain is the scarcity of labeled data. SWEET addresses this by obtaining multiple weak labels through labeling functions and aggregating them effectively. SWEET outperforms the previous supervised SOTA method for this task by 9% in F1 score on domain data and generalizes better to common benchmark datasets. Furthermore, we release HTGEN, a synthetically generated dataset of escort advertisements (built using ChatGPT), to facilitate further research within the community.
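The idea of combining antirules with model predictions as weak labels can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the label constants, the `COMMON_WORDS` lexicon, the two antirules, the capitalization heuristic standing in for a fine-tuned language model, and the majority-vote aggregation. The abstract does not specify the actual labeling functions or the aggregation method, so this should not be read as the pipeline itself.

```python
from collections import Counter

# Hypothetical per-token label scheme (not taken from the paper).
NAME, NOT_NAME, ABSTAIN = 1, 0, -1

# Hypothetical lexicon of common words that are unlikely to be names.
COMMON_WORDS = {"call", "me", "now", "hi", "im", "new", "in", "town"}


def antirule_common_word(tokens):
    """Antirule: a token found in the common-word lexicon is NOT a name."""
    return [NOT_NAME if t.lower() in COMMON_WORDS else ABSTAIN for t in tokens]


def antirule_has_digit(tokens):
    """Antirule: a token containing a digit is NOT a name."""
    return [NOT_NAME if any(c.isdigit() for c in t) else ABSTAIN for t in tokens]


def stub_model(tokens):
    """Stand-in for a fine-tuned model: votes NAME on capitalized words."""
    return [NAME if t.isalpha() and t[0].isupper() else ABSTAIN for t in tokens]


def aggregate_majority(votes_per_lf):
    """Majority vote per token; abstentions are ignored, ties go to NOT_NAME."""
    aggregated = []
    for token_votes in zip(*votes_per_lf):
        counts = Counter(v for v in token_votes if v != ABSTAIN)
        aggregated.append(NAME if counts[NAME] > counts[NOT_NAME] else NOT_NAME)
    return aggregated


def extract_names(text):
    """Apply every labeling function, aggregate, and return tokens voted NAME."""
    tokens = text.split()
    labeling_functions = [antirule_common_word, antirule_has_digit, stub_model]
    votes = [lf(tokens) for lf in labeling_functions]
    labels = aggregate_majority(votes)
    return [t for t, label in zip(tokens, labels) if label == NAME]


print(extract_names("Hi im Lily call me now 5551234"))
```

In this toy input, the stub model votes for both "Hi" and "Lily", but the common-word antirule cancels the vote for "Hi", which shows how negated rules can prune false positives from a more permissive labeler. A learned label model could replace the majority vote where labeling functions differ in reliability.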