An open-source tool for detecting AI-generated Arabic text
Farras (فرّاس) is a detection system for identifying AI-generated Arabic text. The project went through two major iterations, each informed by the limitations of the previous one.
| Version | Base Model | Training Data | Accuracy | Status |
|---|---|---|---|---|
| v1 | AraBERTv2 + XGBoost + N-grams | Custom Gemini-only (12,796 samples) | 93.16% (internal), 86% (external) | Archived |
| v2 | XLM-RoBERTa | KFUPM-JRCAI multi-generator (28,098 samples) | 97.27% | Deployed |
The first version explored whether stylistic and structural cues could outperform deep transformers for Arabic AI detection. Five approaches were compared:
- Naive Bayes baseline (55.7% accuracy)
- Character N-grams with logistic regression (87.0%)
- Farasa morphological features with XGBoost (81.2%)
- AraBERTv2 fine-tuning (85.2%)
- Hybrid ensemble combining N-grams + linguistic features (93.2%)
Key finding: the hybrid model, which combined surface-level patterns with linguistic features, outperformed the fine-tuned AraBERTv2 transformer (93.2% vs 85.2%). The full analysis is in the research report (Arabic_AI_Text_Detection_Report.pdf in v1-hybrid/).
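
To make the character n-gram approach concrete, here is a minimal sketch of the feature extraction step in plain Python. The n-gram range (2 to 4) and the function name `char_ngrams` are illustrative assumptions; the v1 code in `hybrid_model/` may use a different range and a library vectorizer.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Count overlapping character n-grams for every n in [n_min, n_max].

    The 2-4 range is an assumption for illustration, not the v1 setting.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the text, one character at a time.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    features = char_ngrams("في هذا السياق")
    # A classifier would consume these counts as a sparse feature vector.
    print(features.most_common(5))
```

Because the window runs over raw characters, the counts capture punctuation, spacing, and diacritics as well as word fragments, which is the kind of surface-level pattern the hybrid model relied on.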
Limitations identified:
- Dataset was Gemini-only — the model had never seen GPT-4, Llama, or Jais outputs
- AraBERT's aggressive Arabic normalization (diacritics removal, alef normalization) was destroying detection signals
- External evaluation dropped to 86% accuracy, confirming poor generalization
Informed by the AraGenEval 2025 shared task results and the v1 limitations, the second version made three key changes:
- Switched to XLM-RoBERTa: the AraGenEval findings showed that `xlm-roberta-base` outperforms AraBERT for Arabic AI detection (F1=0.770 vs ~0.618)
- Multi-generator training data: used KFUPM-JRCAI datasets covering 4 generators (ALLaM, Jais, Llama 3.1, GPT-4) across 2 domains (academic abstracts + social media)
- No Arabic normalization: following the BUSTED team's finding that text normalization destroys stylistic cues that differentiate AI from human writing
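
The effect of normalization can be illustrated with a small sketch. The `normalize` function below is a simplified stand-in that only strips diacritics and unifies alef variants; it is not AraBERT's actual preprocessor, and the example word is chosen purely for illustration.

```python
import re

# Arabic diacritics (tashkeel): fathatan U+064B through sukun U+0652.
DIACRITICS = re.compile("[\u064B-\u0652]")
# Alef with madda (U+0622), hamza above (U+0623), hamza below (U+0625).
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
BARE_ALEF = "\u0627"


def normalize(text):
    """Simplified normalization: drop diacritics, unify alef forms."""
    text = DIACRITICS.sub("", text)
    return ALEF_VARIANTS.sub(BARE_ALEF, text)


# The same word written with full marking and with none (hypothetical example).
careful = "\u0623\u064E\u062D\u0652\u0645\u064E\u062F"  # أَحْمَد
casual = "\u0627\u062D\u0645\u062F"                      # احمد

if __name__ == "__main__":
    print(careful == casual)                        # the raw strings differ
    print(normalize(careful) == normalize(casual))  # identical after normalization
```

Two spellings that differ in the raw text become indistinguishable after normalization, so any stylistic signal carried by that difference never reaches the classifier.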
Results:
| Class | Precision | Recall | F1 |
|---|---|---|---|
| Human | 0.93 | 0.94 | 0.93 |
| AI | 0.98 | 0.98 | 0.98 |
| Overall accuracy | | | 97.27% |
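
As a quick consistency check on the table, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values reported above, so it can only confirm agreement to two decimal places.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the results table.
reported = {
    "Human": (0.93, 0.94),
    "AI": (0.98, 0.98),
}

if __name__ == "__main__":
    for label, (p, r) in reported.items():
        print(label, round(f1_score(p, r), 2))
```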
farras.app (Next.js on Vercel)
↓ API calls via @gradio/client
HuggingFace Space (Gradio backend)
↓ loads model at startup
HuggingFace Hub (XLM-RoBERTa weights, 1.1GB)
farras/
├── v1-hybrid/ # First iteration (archived)
│ ├── Arabic_AI_Text_Detection_Report.pdf # Full research report
│ ├── app.py # Gradio app (ensemble backend)
│ ├── hybrid_model/ # N-gram + linguistic feature code
│ └── finetuned_model/ # AraBERTv2 fine-tuning notebook
│
├── v2-xlmr/ # Current deployed version
│ ├── app.py # Gradio app (XLM-RoBERTa backend)
│ ├── train_xlmr.py # Training script
│ └── requirements.txt
│
└── web/ # Landing page (Next.js)
cd v2-xlmr
pip install -r requirements.txt
# Download model from HuggingFace Hub
python -c "from transformers import AutoModel, AutoTokenizer; AutoTokenizer.from_pretrained('Rashidbm/farras-xlmr-arabic-ai-detector'); AutoModel.from_pretrained('Rashidbm/farras-xlmr-arabic-ai-detector')"
python app.py
# Train the model
cd v2-xlmr
python train_xlmr.py
Training requires the KFUPM-JRCAI datasets in Datasets/KFUPM-JRCAI/.
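
Once the weights have been downloaded as in the local setup commands above, the model can also be queried directly from Python instead of through the Gradio app. The following is a minimal sketch, not the code in `app.py`. It assumes the Hub checkpoint ships a sequence-classification head that `AutoModelForSequenceClassification` can load, and it reads label names from the model config rather than assuming their order. The `classify` and `softmax` names and the 512-token truncation are illustrative choices.

```python
import math

MODEL_ID = "Rashidbm/farras-xlmr-arabic-ai-detector"


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(text, model_id=MODEL_ID):
    """Return a {label: probability} dict for one text.

    Assumes the checkpoint has a classification head; heavy imports are
    kept local so the helper above works without torch installed.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    # No Arabic normalization is applied, matching the v2 design.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()

    probs = softmax(logits)
    return {model.config.id2label[i]: p for i, p in enumerate(probs)}
```

Calling `classify("...")` with an Arabic passage returns one probability per label defined in the checkpoint's config. Reloading the model on every call is wasteful; a real service would load it once at startup, as the HuggingFace Space does.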
- Struggles with short texts (<50 words) — training data averages 110-879 words per sample
- Optimized for Modern Standard Arabic and common dialects; may underperform on less common dialects
- Detection accuracy may degrade as LLMs improve their Arabic generation
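
Given the short-text limitation, a caller may want to flag inputs below the 50-word mark before trusting a prediction. The guard below is a hypothetical helper, not part of the Farras codebase, and whitespace splitting is only a rough proxy for Arabic word counts.

```python
MIN_WORDS = 50  # threshold taken from the short-text limitation above


def word_count(text):
    """Approximate word count using whitespace splitting."""
    return len(text.split())


def is_too_short(text, min_words=MIN_WORDS):
    """True when the text falls below the length the model handles well."""
    return word_count(text) < min_words


if __name__ == "__main__":
    sample = "هذا نص قصير"
    if is_too_short(sample):
        print("Warning: text is under 50 words; the prediction may be unreliable.")
```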
- Live app: farras.app
- Model weights: Rashidbm/farras-xlmr-arabic-ai-detector
- HF Space: Rashidbm/farras-ai-detector
- Rashid Binkulaib
- Mohammed Alomar
- Nawaf Alwazrah
MIT