About
Progress in AI for drug discovery depends heavily on access to unified datasets that integrate chemical, biological, therapeutic, and safety-related drug information. However, major public databases such as ChEMBL, SIDER, DrugBank, BindingDB, and DailyMed each cover only part of this pharmacological landscape and differ in schema, scope, and terminology. This fragmentation is a significant barrier to scalable, machine-learning-ready data integration.
This study introduces a comprehensive and reproducible pipeline for constructing a unified dataset of nearly 6,000 small-molecule drugs. Using a locally hosted MySQL instance of the ChEMBL v35 database as the core schema, we systematically extracted and standardized key chemical representations (SMILES, InChI), target-binding affinities (IC₅₀, Ki, Kd), protein interaction data, clinical development stages, and therapeutic indications. To enhance clinical applicability, we incorporated adverse effect data from SIDER and DailyMed, while DrugBank and BindingDB were used to enrich mechanistic insights and pharmacological data.
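The extraction step described above can be pictured as a join across ChEMBL's core tables. The following is a minimal sketch, not the pipeline's actual code: the table and column names follow the public ChEMBL schema, but the filter values (nM units, the three affinity types) are assumptions, the rows are made-up placeholders, and an in-memory SQLite database stands in for the local MySQL instance so that the example runs on its own.

```python
import sqlite3

# Join small molecules to their structures, affinity measurements, and targets.
# Table/column names follow the ChEMBL schema; the WHERE filters are assumptions.
QUERY = """
SELECT md.chembl_id,
       cs.canonical_smiles,
       act.standard_type,
       act.standard_value,
       act.standard_units,
       td.pref_name AS target_name
FROM molecule_dictionary AS md
JOIN compound_structures AS cs ON cs.molregno = md.molregno
JOIN activities AS act ON act.molregno = md.molregno
JOIN assays AS a ON a.assay_id = act.assay_id
JOIN target_dictionary AS td ON td.tid = a.tid
WHERE md.molecule_type = 'Small molecule'
  AND act.standard_type IN ('IC50', 'Ki', 'Kd')
  AND act.standard_units = 'nM'
  AND act.standard_value IS NOT NULL
ORDER BY md.chembl_id, act.standard_type
"""


def build_demo_db():
    """Create a tiny stand-in database with placeholder rows (not real ChEMBL data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE molecule_dictionary (
            molregno INTEGER PRIMARY KEY, chembl_id TEXT, molecule_type TEXT);
        CREATE TABLE compound_structures (
            molregno INTEGER PRIMARY KEY, canonical_smiles TEXT);
        CREATE TABLE target_dictionary (tid INTEGER PRIMARY KEY, pref_name TEXT);
        CREATE TABLE assays (assay_id INTEGER PRIMARY KEY, tid INTEGER);
        CREATE TABLE activities (
            activity_id INTEGER PRIMARY KEY, molregno INTEGER, assay_id INTEGER,
            standard_type TEXT, standard_value REAL, standard_units TEXT);

        INSERT INTO molecule_dictionary VALUES (1, 'DEMO_MOL_1', 'Small molecule');
        INSERT INTO molecule_dictionary VALUES (2, 'DEMO_MOL_2', 'Antibody');
        INSERT INTO compound_structures VALUES (1, 'CCO');
        INSERT INTO target_dictionary VALUES (100, 'DEMO_TARGET');
        INSERT INTO assays VALUES (10, 100);
        INSERT INTO activities VALUES (1, 1, 10, 'IC50', 50.0, 'nM');
        INSERT INTO activities VALUES (2, 1, 10, 'Ki', 5.0, 'nM');
        INSERT INTO activities VALUES (3, 1, 10, 'Potency', 100.0, 'nM');
        INSERT INTO activities VALUES (4, 2, 10, 'IC50', 1.0, 'nM');
        """
    )
    return conn


def fetch_affinities(conn):
    """Run the extraction query and return the rows as a list of tuples."""
    return conn.execute(QUERY).fetchall()


if __name__ == "__main__":
    for row in fetch_affinities(build_demo_db()):
        print(row)
```

In this toy setup, the 'Potency' record and the antibody entry are filtered out, leaving only the IC₅₀ and Ki measurements for the small molecule. Against a real ChEMBL installation, the same query text would be sent through a MySQL client library instead of `sqlite3`.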
All data preprocessing, standardization, and integration were implemented in Python, resulting in a high-dimensional, multi-label dataset suitable for downstream applications in off-target prediction, adverse-effect modelling, multitask learning, and generative drug discovery. This study bridges the gap between chemical structure, pharmacodynamics, and clinical outcomes, providing a scalable platform for hypothesis development and the application of AI in drug discovery and development.
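To illustrate what "multi-label" means here, the sketch below turns per-drug adverse-effect annotations into a multi-hot label matrix, the form typically fed to a multi-label classifier. It is an assumption about one reasonable encoding, not a description of how the dataset is actually laid out; the function name `build_label_matrix` and the drug and effect names are hypothetical placeholders.

```python
def build_label_matrix(drug_to_effects):
    """Encode {drug: set of adverse effects} as multi-hot rows.

    Returns the sorted effect vocabulary and a dict mapping each drug to a
    list of 0/1 flags, one flag per effect in the vocabulary.
    """
    # Collect every distinct effect so each one gets a fixed column position.
    vocabulary = sorted(
        {effect for effects in drug_to_effects.values() for effect in effects}
    )
    column = {effect: i for i, effect in enumerate(vocabulary)}

    matrix = {}
    for drug, effects in drug_to_effects.items():
        row = [0] * len(vocabulary)
        for effect in effects:
            row[column[effect]] = 1
        matrix[drug] = row
    return vocabulary, matrix


if __name__ == "__main__":
    # Placeholder annotations, not taken from SIDER or DailyMed.
    annotations = {
        "drug_a": {"nausea", "headache"},
        "drug_b": {"dizziness"},
        "drug_c": set(),
    }
    vocabulary, matrix = build_label_matrix(annotations)
    print(vocabulary)
    for drug, row in matrix.items():
        print(drug, row)
```

Because each drug can carry any number of effects at once, every row may contain several 1s (or none), which is what distinguishes this setting from ordinary single-label classification.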

Download
The dataset is available at no charge. It is supplied as is, without warranty, and may be freely used and distributed under the terms of the GNU General Public License (GPL).
To download the dataset, follow this link: DrugDataset.xlsx.
