Built as a final-year dissertation project, this pipeline explored how deep learning and cheminformatics could improve early-stage drug discovery. The system represented molecules as SMILES strings, embedded them with ChemBERTa, used OpenAI models to augment effect and class labels, compressed the embeddings with an autoencoder, generated candidates with transformer models, fine-tuned outputs using reinforcement learning, and decoded generated vectors back into chemically valid SMILES. A Next.js/Tailwind frontend connected to a Python backend made the system easier for researchers to test and inspect.
Generated molecules analysed: 57k+
Core dataset: 2.7k molecules
Embedding size: 768 dimensions
Key details
- Designed an end-to-end pipeline rather than an isolated model experiment.
- Used LLMs as a structured data augmentation layer for molecular effect and class labels.
- Validated generated molecules using RDKit, clustering, classifier predictions and novelty analysis.
- Built a usable frontend so technical and non-technical reviewers could interact with the system.
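
To make the validation step more concrete, here is a minimal Python sketch of how validity, uniqueness and novelty could be computed for a batch of generated SMILES. It is an illustration under stated assumptions, not the project's actual code: the function names (`rdkit_canonicalize`, `generation_metrics`) and the metric definitions are hypothetical, chosen to match definitions commonly used for molecular generators. The RDKit calls `Chem.MolFromSmiles` and `Chem.MolToSmiles` are real, but RDKit is imported lazily so the metric logic itself has no dependencies.

```python
from typing import Callable, Dict, Iterable, Optional


def rdkit_canonicalize(smiles: str) -> Optional[str]:
    """Return canonical SMILES, or None if RDKit cannot parse the input.

    Hypothetical helper; requires RDKit to be installed when called.
    """
    from rdkit import Chem  # lazy import keeps the metrics dependency-free

    mol = Chem.MolFromSmiles(smiles)  # returns None for invalid SMILES
    return Chem.MolToSmiles(mol) if mol is not None else None


def generation_metrics(
    generated: Iterable[str],
    training: Iterable[str],
    canonicalize: Callable[[str], Optional[str]],
) -> Dict[str, float]:
    """Compute validity, uniqueness and novelty (assumed definitions).

    validity   = valid molecules / all generated strings
    uniqueness = distinct valid molecules / valid molecules
    novelty    = distinct valid molecules not in training / distinct valid
    """
    generated = list(generated)
    canonical = [canonicalize(s) for s in generated]
    valid = [c for c in canonical if c is not None]
    unique = set(valid)

    # Canonicalise the training set the same way so comparisons are fair.
    known = {c for c in (canonicalize(s) for s in training) if c is not None}
    novel = unique - known

    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }
```

With RDKit installed, a call such as `generation_metrics(generated_smiles, training_smiles, rdkit_canonicalize)` would score a batch against the core dataset. Passing the canonicaliser as a parameter keeps the metric logic testable without a chemistry toolkit.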