Social Media Toxicity Detection

NLP · DistilBERT · TF-IDF · scikit-learn — Spring 2026

Overview

A multi-stage NLP pipeline designed to detect and classify toxic language across social media platforms at scale. The system ingests raw user-generated content, applies platform-aware preprocessing, and combines classical machine learning with transformer-based deep learning to distinguish generic offensive speech from targeted hate speech.

The core challenge: toxic language looks very different on long-form forums versus short-form Twitter. Slang, abbreviations, and cultural context shift dramatically across platforms — a single model trained on one domain routinely fails on the other. This project explores whether domain-adaptive fine-tuning can close that gap.

Built as a university research project at CU Boulder, the pipeline covers the full NLP lifecycle from raw data ingestion through unsupervised exploration, supervised classification, and transformer fine-tuning — producing both quantitative benchmarks and qualitative insight into what makes language toxic.

Dataset

Two publicly available datasets were combined to enable cross-platform analysis — one from long-form forum comments, one from short-form social posts:

  • Google Jigsaw Civil Comments — 1,804,874 comments from the Civil Comments news-commenting platform, labeled with 7 toxicity sub-scores (toxic, severe_toxic, obscene, threat, insult, identity_hate, sexual_explicit). After deduplication and short-text removal: 1,765,331 usable records. Class imbalance: 91.7% non-toxic.
  • CardiffNLP TweetEval Hate — 9,000 tweets with binary hate labels (0 = non-hate, 1 = hate). After cleaning: 8,954 records. Loaded directly via the HuggingFace Datasets API (see the loading sketch after this list).
  • Modeling dataset: 58,954 records total — 50,000 stratified samples from Jigsaw (to control for imbalance) combined with all 8,954 Twitter records. Resulting class balance: 51% non-toxic / 49% toxic.
  • Vocabulary overlap between the two corpora: just 2.6% — confirming a large domain gap and motivating the need for domain-specific preprocessing and fine-tuning.
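A minimal loading sketch for the two corpora: the TweetEval subset comes straight off the HuggingFace Hub, while the Jigsaw CSV path, column names, and short-text threshold here are illustrative assumptions to adapt to the actual export.

```python
import pandas as pd
from datasets import load_dataset

# TweetEval hate subset, fetched directly from the HuggingFace Hub.
tweets = load_dataset("tweet_eval", "hate")
tweet_df = tweets["train"].to_pandas()  # columns: text, label (0 = non-hate, 1 = hate)

# Jigsaw Civil Comments from a local CSV export (path and columns illustrative).
jigsaw = pd.read_csv("data/civil_comments.csv")

# Basic cleaning: drop exact duplicates, then very short comments (threshold assumed).
jigsaw = jigsaw.drop_duplicates(subset="comment_text")
jigsaw = jigsaw[jigsaw["comment_text"].str.split().str.len() >= 3]
```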

Research Questions

The project was structured around three interconnected research questions:

  • Can transformer-based models (DistilBERT) distinguish targeted hate speech from general offensive language better than classical ML baselines (TF-IDF + LinearSVC, LightGBM)?
  • How significant is the domain gap between long-form forum comments (Jigsaw) and short-form tweets (TweetEval), and can two-stage fine-tuning bridge it?
  • What linguistic features — vocabulary, syntax, text length, word frequency — are most predictive of toxicity, and do these signals generalize across platforms?

Methodology

The pipeline runs in three stages: data preparation, unsupervised exploration, and supervised modeling.

Data preparation applies different preprocessing for each modeling approach. For classical ML: contraction expansion → lowercase normalization → mention anonymization (@user → @USER) → URL and hashtag stripping → NLTK tokenization → stopword removal. For transformers: minimal normalization only — preserving subword context for the BERT tokenizer.
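A sketch of the classical-ML preprocessing chain, assuming the contractions library and NLTK resources are installed; the exact token filter is an illustrative choice.

```python
import re

import contractions
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)  # required by newer NLTK releases
nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

def preprocess_classical(text: str) -> str:
    """Contraction expansion -> lowercase -> mention anonymization ->
    URL/hashtag stripping -> tokenization -> stopword removal."""
    text = contractions.fix(text)                       # "don't" -> "do not"
    text = text.lower()
    text = re.sub(r"@\w+", "@USER", text)               # @user -> @USER
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # strip URLs
    text = re.sub(r"#\w+", " ", text)                   # strip hashtags
    tokens = word_tokenize(text)
    return " ".join(t for t in tokens if t.isalpha() and t not in STOPWORDS)
```

For the transformer path, only light normalization (e.g., the mention and URL substitutions above) would be applied, leaving casing and punctuation intact for the BERT tokenizer.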

Unsupervised analysis uses TF-IDF (unigrams + bigrams, top 50K features) → TruncatedSVD (50K → 200 dimensions) → StandardScaler to project comments into a dense feature space. KMeans clustering (k=2 through k=6) and t-SNE visualization (5K sample) explore whether toxicity forms natural clusters in this space.
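A compact scikit-learn version of this stage; `texts` stands in for the preprocessed comments, and the random seeds are illustrative while the dimensions and sample sizes mirror the values above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# TF-IDF over unigrams + bigrams, capped at the top 50K features.
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=50_000)
X_sparse = tfidf.fit_transform(texts)

# Dense 200-dim projection, then standardization.
X_dense = TruncatedSVD(n_components=200, random_state=42).fit_transform(X_sparse)
X = StandardScaler().fit_transform(X_dense)

# Sweep k = 2..6, scoring each clustering on a 5K silhouette sample.
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels, sample_size=5000, random_state=42))

# t-SNE on a 5K random sample for 2-D visualization.
idx = np.random.RandomState(42).choice(len(X), size=5000, replace=False)
X_2d = TSNE(n_components=2, random_state=42).fit_transform(X[idx])
```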

Supervised modeling runs four classical baselines — Logistic Regression, Multinomial Naive Bayes, LinearSVC, LightGBM — benchmarked on accuracy, F1, precision, and recall. The best classical model is then compared against two-stage DistilBERT fine-tuning: Stage 1 on 50K Jigsaw samples, Stage 2 on 9K TweetEval hate records.
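The baseline benchmark loop, sketched with illustrative hyperparameters; `X_train`/`X_test` are the sparse TF-IDF matrices (MultinomialNB needs the non-negative raw TF-IDF features, not the SVD projection).

```python
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

baselines = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC(),
    "LightGBM": LGBMClassifier(),
}
for name, clf in baselines.items():
    preds = clf.fit(X_train, y_train).predict(X_test)
    p, r, f1, _ = precision_recall_fscore_support(y_test, preds, average="binary")
    print(f"{name}: acc={accuracy_score(y_test, preds):.3f} "
          f"f1={f1:.3f} precision={p:.3f} recall={r:.3f}")
```

And a minimal sketch of the two-stage fine-tuning, assuming `jigsaw_ds` and `tweeteval_ds` are HuggingFace Datasets with `text`/`label` columns; epochs, batch size, and learning rate are placeholder values, not the project's tuned settings.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

def finetune(dataset, output_dir, epochs):
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=epochs,
                             per_device_train_batch_size=32, learning_rate=2e-5)
    Trainer(model=model, args=args,
            train_dataset=dataset.map(tokenize, batched=True)).train()

# Stage 1: learn a general toxicity prior from the large Jigsaw sample.
finetune(jigsaw_ds, "ckpt/stage1", epochs=2)
# Stage 2: adapt the same weights to short-form TweetEval hate data.
finetune(tweeteval_ds, "ckpt/stage2", epochs=3)
```

Because `model` is shared across both calls, Stage 2 continues from the Stage 1 weights rather than training from scratch — that continuation is the point of the two-stage design.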

Key Findings & Visualizations

The analysis surfaced the following key findings:

  • Domain gap confirmed: only 2.6% vocabulary overlap between Jigsaw and Twitter corpora. The two datasets share almost no surface-level language despite both covering toxicity.
  • KMeans clustering at k=2 recovered the ground-truth toxic/non-toxic class structure in the TF-IDF feature space without access to labels, though the modest silhouette score (0.117) shows the two clusters are only weakly separated.
  • t-SNE visualization reveals non-linear cluster boundaries, supporting the use of non-linear models (LightGBM, DistilBERT) over linear classifiers.
  • Text length is not predictive of toxicity: median word counts in the Jigsaw data are nearly identical (36 words for non-toxic vs. 34 for toxic).
  • Most toxic content consists of personal insults rather than organized group-based hate speech — consistent with the 91.7% non-toxic / 8.3% toxic Jigsaw class distribution.
  • Two-stage DistilBERT fine-tuning outperforms all classical baselines on the Twitter domain, validating the domain-adaptive approach.

Technical Stack

The pipeline is built entirely in Python with a clear separation between classical ML tooling and transformer-based deep learning.

  • Data — HuggingFace Datasets (streaming ingestion), Pandas, NumPy
  • Preprocessing — NLTK (tokenization, stopwords), contractions library, regex
  • Classical ML — scikit-learn: TF-IDF, TruncatedSVD, StandardScaler, KMeans, t-SNE, LogisticRegression, MultinomialNB, LinearSVC; LightGBM
  • Deep Learning — HuggingFace Transformers (DistilBERT), PyTorch
  • Visualization — matplotlib, seaborn, plotly, wordcloud

Conclusions

Cross-platform toxicity detection is genuinely hard, and the 2.6% vocabulary overlap between Jigsaw and TweetEval is the clearest evidence of why. A model that achieves strong accuracy on forum comments will not transfer cleanly to tweets without domain-specific adaptation.

Classical bag-of-words features (TF-IDF + TruncatedSVD) provide useful signal and competitive baselines, but they miss contextual meaning — sarcasm, dog-whistles, and in-group slang all defeat surface-level features.

Two-stage DistilBERT fine-tuning is the most effective approach tested: Stage 1 on the large Jigsaw corpus provides a strong general toxicity prior; Stage 2 on TweetEval adapts the model to short-form, slang-heavy text. This sequential strategy is more data-efficient than training a single model on a merged dataset.

Future Work

Several directions remain open for future exploration:

  • Real-world data collection via platform APIs or web scraping — current benchmark datasets are curated and may not reflect live content moderation conditions
  • Bias analysis across demographic groups — AAVE, in-group slang, and minority-language toxicity are systematically underrepresented in existing datasets and may produce elevated false-positive rates for those communities
  • Deployment as a moderation API — wrapping the best model in a lightweight inference service to support real-time content scoring (a minimal sketch follows this list)
  • Multilingual extension — most toxicity datasets are English-only; cross-lingual transfer or multilingual fine-tuning would significantly expand practical reach
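
A minimal sketch of such a service, assuming a FastAPI wrapper around the Stage 2 checkpoint; the checkpoint path is hypothetical and must contain both model and tokenizer files.

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
# Hypothetical path to the Stage 2 fine-tuned DistilBERT checkpoint.
classifier = pipeline("text-classification", model="ckpt/stage2")

class Comment(BaseModel):
    text: str

@app.post("/score")
def score(comment: Comment) -> dict:
    # Return the predicted label and its confidence for one comment.
    result = classifier(comment.text)[0]
    return {"label": result["label"], "score": result["score"]}
```

Served with, e.g., `uvicorn moderation_api:app`, this exposes a single `/score` endpoint for real-time content scoring.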