A SURGE through the world of large language models (LLMs), text-detection, and GAN-based attention methods.
Repository: https://github.com/HARSHITJAIS14/SURGE
harshitjais14-surge/
├── README.md
├── Week1/
│ └── LLMIntro-2024-07-22-1743.excalidraw
├── Week2/
│ └── Code/
│ ├── mpi.ipynb
│ └── mpi\_120.csv
├── Week3/
│ └── LLMTextDetectionSurvey.pdf
├── Week4/
│ ├── bert-pretrainingdetector.ipynb
│ ├── description.md
│ └── merged\_dataset.csv
├── Week6/
│ ├── gan-detection.ipynb
│ └── gan\_bertattention.ipynb
└── Week7/
└── GANBERT\_pytorch.ipynb
In Week 1, we dove into the fundamentals and workflow of large language models—from pretraining to supervised fine-tuning, and finally to reinforcement learning (including RLHF). Along the way, we covered key concepts such as base architectures, dataset creation, instruction tuning, and common pitfalls like hallucinations. We also did a hands-on quickstart with the OpenAI API.
Topics Covered:
- LLM architecture & pretraining stages
- Supervised Fine-Tuning (SFT)
- Reinforcement Learning & RLHF
- Prompting basics
- Hallucination in LLMs
- OpenAI API usage
Resources:
- Busy Person's Intro to LLMs (YouTube)
- Deep Dive into LLMs (YouTube)
- Basics of Prompting (YouTube)
- OpenAI API Quickstart Guide
Artifacts:
- Diagrammed workflow in Excalidraw:
Week1/LLMIntro-2024-07-22-1743.excalidraw
- Implemented a Machine Personality Inventory tool using Big Five Personality Factors (OCEAN) for assessment of ChatGPT model on text data.
- Notebook
mpi.ipynbwalks through data preprocessing, feature extraction, and personality inference. - Dataset sample in
mpi_120.csv.
- Read a literature survey of LLM text-detection approaches (PDF in Week3) which gave a full view over the works done over LLM Text Detection till 2023.
- Summarized methods, benchmarks, and open challenges in a concise report and added it to the merged pdf.
- Developed a detector to distinguish pretrained vs. fine-tuned text using BERT.
- Notebook
bert-pretrainingdetector.ipynbincludes model training and evaluation. merged_dataset.csvis a smaller version of the CHEAT dataset;description.mddetails dataset construction.
- Learnt about a lot of dataset for LLM Text Detection and listed some of the major datasets and their papers in the Week5 Directory.
- Explored GAN-based methods for generating and detecting synthetic text/images.
gan-detection.ipynbbuilds a basic GAN over word co-occurence matrix as shown in the pdf of the research paper gan_detection_compressed.pdf for adversarial examples.gan_bertattention.ipynbadds a BERT-attention module to enhance detection robustness.
- Full PyTorch implementation of GANBERT paper: integrating GAN-generated data into BERT training loops.
- Notebook
GANBERT_pytorch.ipynbdemonstrates training and performance analysis over the short version of RAID dataset.