AI Drug Discovery Pipeline | Daniel Flockhart

Built as a final-year dissertation project, this pipeline explored how deep learning, reinforcement learning and cheminformatics could improve early-stage drug discovery. The system starts with a hand-curated set of cognitive-related molecules and ZINC 250K SMILES, removes duplicates and invalid structures with RDKit, embeds canonical SMILES as 768-dimensional ChemBERTa vectors, and uses OpenAI models to augment each molecule with effect and class labels. An autoencoder compresses the molecule, effect and class features, and a two-stage transformer then generates new candidate vectors. The first stage is supervised pre-training for chemically plausible outputs. The second is population-based reinforcement learning with genetic search, rewarded on effect similarity, class similarity, structural similarity, validity and length/diversity. A GRU/self-attention decoder converts generated vectors back into canonical SMILES. Classifiers, RDKit checks, PCA/t-SNE/K-means clustering, Morgan fingerprints and Tanimoto similarity analysis then assess target fit, chemical validity and novelty. A Next.js/Tailwind frontend connected to a Flask backend exposes generation, molecule inspection, analysis and experimental LLM synthesis-route features without requiring direct work in the research code.
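The reward terms named above (effect similarity, class similarity, structural similarity, validity and length/diversity) have to be folded into a single score before a reinforcement-learning or genetic step can rank candidates. The sketch below shows one simple way this could be done, a normalised weighted sum. The weighted-sum form, the `RewardParts` and `combined_reward` names, the weight values and the assumption that each term lies in [0, 1] are all illustrative and are not taken from the dissertation.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RewardParts:
    """Per-candidate reward terms, each assumed to be scaled to [0, 1]."""
    effect_similarity: float
    class_similarity: float
    structural_similarity: float
    validity: float          # e.g. 1.0 if the decoded SMILES parses, else 0.0
    length_diversity: float


# Hypothetical weights, chosen only for illustration.
DEFAULT_WEIGHTS: Mapping[str, float] = {
    "effect_similarity": 0.3,
    "class_similarity": 0.2,
    "structural_similarity": 0.2,
    "validity": 0.2,
    "length_diversity": 0.1,
}


def combined_reward(parts: RewardParts,
                    weights: Mapping[str, float] = DEFAULT_WEIGHTS) -> float:
    """Return the weighted mean of the reward terms (assumed aggregation)."""
    total_weight = sum(weights.values())
    score = sum(w * getattr(parts, name) for name, w in weights.items())
    return score / total_weight


if __name__ == "__main__":
    valid = RewardParts(0.8, 0.6, 0.5, 1.0, 0.4)
    invalid = RewardParts(0.8, 0.6, 0.5, 0.0, 0.4)
    # The invalid candidate scores lower because its validity term is zero.
    print(combined_reward(valid), combined_reward(invalid))
```

Dividing by the total weight keeps the score in [0, 1] even if the weights are later retuned, which makes rewards comparable across runs.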

Research architecture

End-to-end generative molecule pipeline

The work tied together representation learning, LLM-assisted labelling, supervised learning, reinforcement learning, cheminformatics validation and a usable web interface so the discovery workflow could be tested as an integrated pipeline rather than isolated proofs of concept.

SMILES to ChemBERTa to autoencoder latent space
Supervised transformer to RL and genetic fine-tuning
RDKit, classifier, clustering and Tanimoto validation
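
To make the "genetic fine-tuning" stage more concrete, here is a minimal sketch of one generation of a generic genetic search over candidate vectors: rank by reward, keep an elite subset, then refill the population with mutated crossovers of elites. It is a textbook-style illustration, not the dissertation's implementation. The function names, uniform crossover, Gaussian mutation, `elite_fraction`, `rate` and `sigma` are all assumptions, and plain lists of floats stand in for the real latent vectors.

```python
import random
from typing import Callable, List

Vector = List[float]


def crossover(a: Vector, b: Vector, rng: random.Random) -> Vector:
    """Uniform crossover: take each coordinate from one parent at random."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def mutate(v: Vector, rng: random.Random,
           rate: float = 0.1, sigma: float = 0.05) -> Vector:
    """Add Gaussian noise to each coordinate with probability `rate`."""
    return [x + rng.gauss(0.0, sigma) if rng.random() < rate else x for x in v]


def next_generation(population: List[Vector],
                    reward_fn: Callable[[Vector], float],
                    rng: random.Random,
                    elite_fraction: float = 0.25) -> List[Vector]:
    """Keep the highest-reward vectors and refill with mutated offspring."""
    ranked = sorted(population, key=reward_fn, reverse=True)
    n_elite = max(2, int(len(ranked) * elite_fraction))
    elites = ranked[:n_elite]
    children: List[Vector] = []
    while len(elites) + len(children) < len(population):
        parent_a, parent_b = rng.sample(elites, 2)
        children.append(mutate(crossover(parent_a, parent_b, rng), rng))
    return elites + children


if __name__ == "__main__":
    rng = random.Random(0)
    population = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(8)]

    def toy_reward(v: Vector) -> float:
        # Toy objective for the demo only: prefer vectors near the origin.
        return -sum(x * x for x in v)

    for _ in range(5):
        population = next_generation(population, toy_reward, rng)
    print(max(toy_reward(v) for v in population))
```

Because the elites are carried over unchanged, the best reward in the population never decreases from one generation to the next.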

Generated candidates: 57.6k
Target tests: 48 combos
Core dataset: 2.7k molecules
Decoder data: ZINC 250K
Embeddings: 768D ChemBERTa
Novelty: 0.133 avg similarity

Key details

Designed an end-to-end research pipeline rather than an isolated model experiment, covering dataset curation, embedding, LLM augmentation, latent compression, generation, decoding, classifier ranking and novelty analysis.
Processed a hand-curated dataset of approximately 2,700 cognitive-related molecules alongside ZINC 250K SMILES, removing duplicate or chemically invalid molecules with RDKit checks and SMILES length filtering.
Used ChemBERTa to convert canonical SMILES into 768-dimensional molecular embeddings, then validated the structure of the embedding space with PCA, t-SNE and K-means clustering.
Used OpenAI models as a structured data augmentation layer for pharmacological effect and molecular class labels, then validated sample predictions against known molecules including methylphenidate, oxymorphone and tramadol.
Trained an autoencoder over combined molecule, effect and class vectors, improving latent-space cluster separation compared with the original ChemBERTa vectors.
Built a two-stage transformer process: supervised pre-training for structurally plausible molecule generation, followed by reinforcement-learning fine-tuning with genetic operators and custom reward metrics.
Trained a GRU/self-attention decoder to reconstruct canonical SMILES from generated vectors and used RDKit to reject or inspect invalid molecular outputs.
Analysed 57,600 generated candidates across 48 target-effect combinations using classifier confidence distributions, attrition filtering, Morgan fingerprints and Tanimoto similarity novelty checks.
Built a Next.js/Tailwind frontend and Flask backend so reviewers could generate molecules, inspect candidate structures, view analysis and request experimental synthesis-route starting points.
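
The Tanimoto novelty check mentioned above can be illustrated with a small, dependency-free sketch. Tanimoto similarity between two fingerprints is the number of shared on-bits divided by the number of on-bits in either. Here fingerprints are represented as Python sets of on-bit indices, which is an assumption made to keep the example self-contained. In the pipeline itself the bits would come from Morgan fingerprints computed with RDKit. The helper names are hypothetical, and scoring each candidate against its nearest reference molecule is only one possible way to aggregate; the source does not state how the 0.133 average was computed.

```python
from typing import Iterable, List, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity: |A & B| / |A | B| over sets of on-bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def nearest_similarity(candidate: Set[int],
                       reference: Iterable[Set[int]]) -> float:
    """Highest similarity between a candidate and any reference molecule."""
    return max((tanimoto(candidate, ref) for ref in reference), default=0.0)


def mean_nearest_similarity(candidates: List[Set[int]],
                            reference: List[Set[int]]) -> float:
    """Average nearest-reference similarity; lower values suggest more novelty."""
    scores = [nearest_similarity(c, reference) for c in candidates]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Toy bit sets standing in for real fingerprints.
    reference = [{2, 3, 4}]
    candidates = [{1, 2, 3}, {9}]
    print(mean_nearest_similarity(candidates, reference))
```

A candidate whose nearest-reference similarity is close to 1.0 is effectively a rediscovery of a known molecule, so a low average is what supports a novelty claim.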