WHALES Descriptors for Molecular Similarity: A Complete Guide for Chemoinformatics and Drug Discovery

Emma Hayes, Jan 12, 2026

Abstract

This comprehensive article explores WHALES (Weighted Holistic Atom Localization and Entity Shape) descriptors, a powerful 3D molecular representation method for chemoinformatics. Aimed at researchers and drug development professionals, it covers the foundational theory behind WHALES, detailing how atomic partial charges and spatial coordinates are integrated into a holistic molecular description. The methodological section provides a practical workflow for calculating and applying WHALES in tasks like virtual screening, scaffold hopping, and SAR analysis. We address common troubleshooting and optimization challenges, including parameter selection, conformational dependency, and computational scaling. Finally, we validate WHALES by comparing its performance against established 2D fingerprints (ECFP, MACCS) and other 3D descriptors (ROCS, USR, 3D pharmacophores) in benchmark studies, highlighting its strengths in capturing shape and electrostatics for similarity searching. The conclusion synthesizes key insights and outlines future implications for lead optimization and polypharmacology.

What Are WHALES Descriptors? Decoding the Theory and Core Concepts for Molecular Representation

This document provides application notes and protocols for the WHALES (Weighted Holistic Atom Localization and Entity Shape) molecular descriptors. This work is presented within the broader thesis that WHALES descriptors offer a superior, chemically intuitive framework for molecular similarity analysis in drug discovery. By integrating atomic properties (localization) with 3D molecular shape, WHALES aims to more accurately capture the complex phenomena governing molecular recognition and biological activity, bridging the gap between traditional 2D fingerprint-based methods and pure shape-matching algorithms.

Core Theoretical Framework & Data

WHALES descriptors are calculated from the 3D coordinates of a molecule's atoms, each weighted by atomic properties. The key components are:

Atom Localization (AL): Derived from the partial charges (q) and atomic polarizabilities (α) of each atom i.
Entity Shape (ES): Captured through the spatial covariance matrix of atomic positions, weighted by the localization indices.

The fundamental calculation for a molecule's WHALES vector involves the weighted mean (centroid) and the weighted covariance matrix. The eigenvalues of this covariance matrix form the primary shape descriptor components.

Table 1: Key Atomic Properties for WHALES Calculation

Atomic Property	Symbol	Typical Calculation Source	Role in WHALES Descriptor
Partial Charge	qᵢ	Quantum Mechanics (e.g., DFT), Semi-empirical (e.g., AM1-BCC), or Empirical methods	Determines electrostatic interaction sites; weights atom contribution to "localization".
Atomic Polarizability	αᵢ	Literature tabulated values or QM-derived	Accounts for dispersion forces and induced dipoles; complementary weight to charge.
Atomic Weight / van der Waals Radius	wᵢ	Periodic table / Literature	Alternative or supplementary weighting scheme to emphasize atom size/position.

Table 2: Comparison of Molecular Descriptor Paradigms

Descriptor Type	Example	Dimensionality	Encodes Shape?	Encodes Electrostatics?	Speed	Thesis Context: Limitation Addressed by WHALES
2D Structural	ECFP4, MACCS	High (Bits)	No	No	Very Fast	Lacks 3D steric and electronic information critical for binding.
3D Shape/Pharmacophore	ROCS (shape + color)	Low	Coarse	Yes (Points)	Moderate	Resolution limited to predefined feature types; less continuous.
3D WHALES	WHALES	Medium (~30)	Yes (Continuous)	Yes (Integrated via weights)	Moderate-Slow	N/A - Proposed integrated solution.
Field-Based	CoMFA, GRID	Very High	Implicitly	Yes	Slow	High dimensionality; alignment-dependent.

Experimental Protocols

Protocol 3.1: Generation of WHALES Descriptors for a Compound Library

Objective: To compute standardized WHALES descriptors for a set of molecules to enable similarity search or QSAR modeling.

Materials: See "The Scientist's Toolkit" (Table 3).

Workflow:

Input Preparation: Generate a high-quality 3D structure for each molecule in the library (e.g., using OMEGA). Ensure structures are energy-minimized.
Conformer Selection: Select a single representative low-energy conformer per molecule, or retain multiple conformers for conformational ensemble analysis.
Property Calculation: For each atom i in each conformer, compute the partial atomic charge (qᵢ) using the chosen method (e.g., AM1-BCC via antechamber).
Weight Assignment: Assign atomic polarizabilities (αᵢ) from a look-up table. Combine qᵢ and αᵢ to compute the final atomic weight Wᵢ = f(qᵢ, αᵢ) (a common form is Wᵢ = |qᵢ| + c·αᵢ, where c is a scaling constant).
Descriptor Calculation:
- a. Calculate the weighted centroid (mean position) of the molecule: μ = (Σᵢ Wᵢ rᵢ) / Σᵢ Wᵢ.
- b. Calculate the 3x3 weighted covariance matrix: C = [Σᵢ Wᵢ (rᵢ - μ)(rᵢ - μ)^T] / Σᵢ Wᵢ. The matrix is written as C so that Σ denotes summation only.
- c. Perform eigenvalue decomposition on C: C = VΛV^T, where Λ is the diagonal matrix of eigenvalues (λ₁ ≥ λ₂ ≥ λ₃).
- d. The primary WHALES shape vector is composed of these eigenvalues. Additional moments or traces of higher-order matrices can extend the descriptor.
Output: A data matrix of size [N molecules × D descriptors], suitable for analysis.
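
The descriptor calculation in Protocol 3.1 can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name weighted_shape_eigenvalues and the default c = 1.0 are hypothetical, and the weight uses the example form Wᵢ = |qᵢ| + c·αᵢ given above.

```python
import numpy as np

def weighted_shape_eigenvalues(coords, charges, polarizabilities, c=1.0):
    """Eigenvalues (descending) of the weighted covariance of atomic positions.

    coords: (N, 3) array; charges, polarizabilities: length-N arrays.
    The weight form |q| + c*alpha and c=1.0 are illustrative assumptions.
    """
    coords = np.asarray(coords, dtype=float)
    w = np.abs(np.asarray(charges, dtype=float)) + c * np.asarray(polarizabilities, dtype=float)
    w_sum = w.sum()
    # Weighted centroid: mu = sum_i W_i r_i / sum_i W_i
    mu = (w[:, None] * coords).sum(axis=0) / w_sum
    centered = coords - mu
    # Weighted covariance: sum_i W_i (r_i - mu)(r_i - mu)^T / sum_i W_i
    cov = (w[:, None] * centered).T @ centered / w_sum
    # eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse them
    return np.linalg.eigvalsh(cov)[::-1]

# Three equally weighted atoms on the x axis: all variance lies along x
lams = weighted_shape_eigenvalues(
    coords=[[-1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    charges=[0.0, 0.0, 0.0],
    polarizabilities=[1.0, 1.0, 1.0],
)
```

For the collinear example the only non-zero eigenvalue is 2/3, the weighted variance along x. Stacking the result for each molecule row by row gives the [N molecules × D descriptors] matrix described in the Output step.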

Protocol 3.2: WHALES-Based Molecular Similarity Screening

Objective: To identify compounds similar to an active query using WHALES descriptor space.

Methodology:

Reference Calculation: Compute the WHALES descriptor for the query molecule (active compound) as per Protocol 3.1.
Database Calculation: Compute WHALES descriptors for all molecules in the target screening database.
Similarity Metric: Choose a distance metric (e.g., Euclidean distance, Mahalanobis distance, or cosine similarity) in the WHALES space.
Ranking: Calculate the pairwise distance between the query vector and every database molecule vector. Rank the database compounds from most to least similar (smallest to largest distance).
Validation: Assess the enrichment of known actives (if available) in the top-ranked compounds versus random selection (e.g., via ROC curve or enrichment factor analysis).
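
The ranking step of Protocol 3.2 might look like the sketch below, which uses the Euclidean distance. The function name rank_by_distance and the toy vectors are hypothetical; descriptors are assumed to be precomputed and stored as rows of a NumPy array.

```python
import numpy as np

def rank_by_distance(query, database):
    """Rank database rows by Euclidean distance to the query (closest first)."""
    query = np.asarray(query, dtype=float)
    database = np.asarray(database, dtype=float)
    distances = np.linalg.norm(database - query, axis=1)
    order = np.argsort(distances, kind="stable")  # smallest distance = most similar
    return order, distances[order]

# Toy three-dimensional descriptors (illustrative values only)
query_vec = [1.0, 0.0, 0.0]
db_vecs = [[1.0, 0.0, 0.1], [5.0, 5.0, 5.0], [1.0, 0.5, 0.0]]
order, sorted_distances = rank_by_distance(query_vec, db_vecs)
```

If descriptor components differ widely in scale, standardizing each column before computing distances is a common precaution. The Mahalanobis distance mentioned above handles scale and correlation between components at once.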

Visualizations

Figure: WHALES Descriptor Calculation Workflow

Figure: Thesis Context: WHALES Addresses Limitations, Enables Applications

The Scientist's Toolkit

Table 3: Essential Research Reagent Solutions for WHALES Implementation

Item / Software	Function in WHALES Protocol	Example Vendor / Implementation
Conformer Generation	Produces an ensemble of biologically relevant 3D structures from a 2D input.	OpenEye OMEGA, RDKit ETKDG, ConfGen.
Quantum Chemistry Package	Calculates accurate partial atomic charges (e.g., via DFT).	Gaussian, GAMESS, ORCA, PSI4.
Semi-Empirical Package	Faster calculation of atomic charges and properties.	MOPAC (AM1, PM6), xtb (GFN2-xTB).
Charge Assignment Tool	Applies fast charge models (e.g., AM1-BCC, Gasteiger).	antechamber (AmberTools), OpenEye QUACPAC, RDKit (Gasteiger).
Atomic Polarizability Data	Look-up table for atom-type specific polarizabilities.	CRC Handbook, published datasets (e.g., from Miller).
Linear Algebra Library	Performs eigenvalue decomposition for covariance matrices.	NumPy (Python), LAPACK, Eigen (C++).
Cheminformatics Toolkit	Core molecule manipulation, I/O, and fingerprint comparison.	RDKit, OpenChemLib, CDK.
Similarity Search Platform	Database indexing and high-speed similarity/distance search.	OpenEye ROCS/OMEGA, in-house SQL/Python.

Application Notes

Within the framework of WHALES (Weighted Holistic Atom Localization and Entity Shape) descriptors for molecular similarity research, the integration of atomic partial charges and 3D spatial coordinates is fundamental. WHALES descriptors quantify molecular similarity by combining steric, electronic, and pharmacophoric information, making them powerful for ligand-based virtual screening and scaffold hopping in drug development.

Atomic partial charges approximate the local electron density distribution and are crucial for modeling electrostatic interactions, hydrogen bonding, and polarization effects. 3D spatial coordinates define the molecular geometry (conformation). Their integration creates a multidimensional descriptor where each atom is characterized by its (x, y, z) position and a partial charge (q) computed by one of the methods in Table 1. This combined data structure enables WHALES to compute similarities that reflect both shape and electrostatic complementarity, which can be a stronger predictor of biological activity than shape alone.

Table 1: Comparison of Partial Charge Calculation Methods for WHALES Descriptors

Method	Theory Basis	Computational Cost	Typical Use Case in WHALES Context
AM1-BCC	Semi-empirical (Austin Model 1) with Bond Charge Correction	Low	High-throughput screening of large databases; default for initial profiling.
DFT (e.g., B3LYP/6-31G*)	Density Functional Theory	Very High	Final validation and studies on focused, key compound sets.
Gasteiger	Empirical, based on atom electronegativity	Very Low	Rapid preprocessing or for extremely large compound libraries (>1M).
RESP	Ab initio (HF/6-31G*) derived, restrained electrostatic potential fit	High	Generating reference charges for molecular dynamics or high-accuracy QSAR.

Table 2: Impact of Charge-Spatial Integration on Virtual Screening Performance

Descriptor Type	EF1% (Database: DUD-E, Target: EGFR)	ROC-AUC	Key Advantage
WHALES (Charges + Coordinates)	35.2	0.87	Superior early enrichment; identifies diverse chemotypes.
Shape-Only (e.g., ROCS)	28.7	0.81	Good at finding shape-similar actives.
Electrostatic-Only (Pharmacophore)	22.4	0.76	Good selectivity but misses shape-complementary actives.

Experimental Protocols

Protocol 1: Generation of Integrated Charge-Spatial Data for WHALES Calculation

Objective: To prepare a molecular dataset with consistent 3D geometries and atomic partial charges for WHALES descriptor computation.

Materials: See "The Scientist's Toolkit" below.

Procedure:

Input Preparation: Begin with molecular structures in SMILES or SDF format. Use a tool like Open Babel (obabel) to standardize tautomers and protonation states at pH 7.4.
3D Conformation Generation: Generate an initial low-energy 3D conformation using RDKit's EmbedMolecule function (ETKDGv3 method) or OMEGA. For flexible molecules, generate a multi-conformer set (e.g., 50 conformers).
Geometry Optimization: Optimize each 3D conformation using the MMFF94s or UFF force field via RDKit (MMFFOptimizeMolecule) to relieve steric clashes.
Partial Charge Calculation: Assign atomic partial charges.
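- For High-Throughput Setting (Recommended for WHALES): Use the AM1-BCC method via antechamber (from AmberTools, charge option -c bcc). RDKit does not implement AM1-BCC; its built-in alternatives are Gasteiger charges (AllChem.ComputeGasteigerCharges) and MMFF94 partial charges, which are faster but less accurate.
- For High-Accuracy Validation: Perform DFT optimization and an electrostatic potential calculation using Gaussian/GAMESS or ORCA, then fit RESP charges. With antechamber, the RESP fit reads the quantum chemistry output rather than a mol2 file, for example from a Gaussian log: antechamber -i input.log -fi gout -o output.mol2 -fo mol2 -c resp.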
Data Integration & Formatting: Create a unified input file. The recommended format is a modified .xyz file where each line contains: Atom_Symbol X Y Z Partial_Charge. Example line: C 1.234 -0.567 2.890 0.123.
WHALES Descriptor Calculation: Process the integrated charge-spatial file through the WHALES calculation script (e.g., calc_whales.py). This computes the covariance matrix between spatial and charge dimensions, outputting the final descriptor vector.
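
A small parser for the modified .xyz format from the Data Integration step could look like this. It is a sketch under assumptions: the function name parse_xyzq is hypothetical, the file is assumed to hold only five-field atom lines (no atom-count or comment header), and the oxygen line in the example is invented to make the toy input neutral.

```python
import numpy as np

def parse_xyzq(text):
    """Parse lines of 'Atom_Symbol X Y Z Partial_Charge'.

    Returns (symbols, coords, charges), where coords has shape (N, 3)
    and charges has shape (N,). Blank lines are skipped.
    """
    symbols, rows = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) != 5:
            raise ValueError(f"line {line_no}: expected 5 fields, got {len(fields)}")
        symbols.append(fields[0])
        rows.append([float(value) for value in fields[1:]])
    data = np.array(rows, dtype=float).reshape(-1, 4)
    return symbols, data[:, :3], data[:, 3]

# First line is the example from the protocol; the second is illustrative
example = "C 1.234 -0.567 2.890 0.123\nO 0.000 0.000 0.000 -0.123\n"
symbols, coords, charges = parse_xyzq(example)
```

Checking that the charges sum to the expected formal charge of the molecule is a cheap sanity test before any descriptor is computed.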

Protocol 2: Validation via Similarity Searching in a Benchmark Database

Objective: To assess the performance of charge-integrated WHALES descriptors in retrieving active compounds from a decoy database.

Materials: DUD-E or DEKOIS 2.0 benchmark dataset, WHALES software, Python/R for analysis.

Procedure:

Dataset Curation: Select a target (e.g., kinase, protease) from DUD-E, which provides active ligands and property-matched decoys for each target.
Descriptor Generation: Apply Protocol 1 to all active and decoy molecules for the selected target.
Query Selection & Search: Designate one active compound as the query. Compute the WHALES similarity (e.g., cosine similarity) between the query's descriptor and every other molecule's descriptor in the set.
Performance Metrics: Rank all molecules by similarity score. Calculate:
- Enrichment Factor (EF): EF_x% = (Actives_retrieved_x% / Total_Actives) / (x/100).
- Receiver Operating Characteristic Area Under Curve (ROC-AUC).
- BEDROC (prioritizes early enrichment).
Comparative Analysis: Repeat steps 3-4 using shape-only descriptors on the same dataset. Compare EF1% and ROC-AUC values to quantify the added value of charge integration.
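
The enrichment factor and ROC-AUC from the Performance Metrics step can be computed as in the sketch below. The function names are hypothetical, and the scores are assumed to be similarities (higher means more similar to the query). The top x% is rounded to a whole number of molecules, so the denominator uses the realized fraction n_top/N rather than exactly x/100.

```python
import numpy as np

def enrichment_factor(scores, is_active, fraction=0.01):
    """EF = (actives in top fraction / total actives) / (n_top / N)."""
    scores = np.asarray(scores, dtype=float)
    is_active = np.asarray(is_active, dtype=bool)
    n_top = max(1, int(round(fraction * len(scores))))
    top = np.argsort(-scores, kind="stable")[:n_top]  # highest scores first
    recovered = is_active[top].sum() / is_active.sum()
    return recovered / (n_top / len(scores))

def roc_auc(scores, is_active):
    """Probability that a random active outscores a random decoy (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    is_active = np.asarray(is_active, dtype=bool)
    pos, neg = scores[is_active], scores[~is_active]
    diff = pos[:, None] - neg[None, :]  # O(actives x decoys) memory, fine for a sketch
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Ten molecules, two actives ranked 1st and 3rd (illustrative values)
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
ef10 = enrichment_factor(scores, labels, fraction=0.10)
auc = roc_auc(scores, labels)
```

In this toy case the top 10% holds one molecule, which is active, so EF10% is (1/2)/(1/10) = 5. The two actives outrank 15 of the 16 active-decoy pairs, giving an AUC of 0.9375.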

Visualization of Workflows and Relationships

WHALES Descriptor Generation Workflow

WHALES Component Integration & Screening Logic

The Scientist's Toolkit

Table 3: Essential Research Reagents & Software for Charge-Spatial Integration

Item	Category	Function in Protocol	Example/Tool
Conformer Generator	Software	Produces physically realistic 3D molecular structures from 2D inputs. Essential for spatial coordinate definition.	RDKit (ETKDG), OpenEye OMEGA, ConfGen.
Quantum Chemistry Package	Software	Computes accurate ab initio or DFT-based partial charges (e.g., RESP charges). Used for high-fidelity charge assignment.	Gaussian, GAMESS, ORCA, PSI4.
Semi-Empirical Charge Tool	Software	Calculates fast, approximate partial charges (AM1-BCC). The workhorse for high-throughput WHALES generation.	Antechamber (AmberTools); RDKit and Open Babel offer Gasteiger charges as a faster fallback.
Force Field Software	Software	Optimizes 3D geometries by minimizing steric strain. Provides initial structure for charge calculation.	RDKit (MMFF94/UFF), Open Babel, Schrödinger MacroModel.
WHALES Calculator	Software	Core algorithm that ingests integrated (XYZ+q) data and computes the final descriptor vector.	Custom Python scripts (`cheminf.whales`), in-house implementations.
Benchmark Dataset	Data	Provides validated sets of active molecules and decoys for method testing and validation (e.g., enrichment calculations).	DUD-E, DEKOIS 2.0, MUV.
Similarity Search Environment	Software	Computes molecular similarities and performs statistical analysis of screening performance (ROC-AUC, EF).	Python (scikit-learn, pandas), R, KNIME.

Within the WHALES (Weighted Holistic Atom Localization and Entity Shape) descriptors framework, the core thesis posits that molecular similarity, predictive of biological activity, can be derived from a holistic mathematical foundation. This involves integrating fundamental atomic properties—partial charges, NMR shifts, and lipophilicity—into a unified, interpretable descriptor vector. This document provides detailed application notes and protocols for generating and validating WHALES descriptors, emphasizing their role in quantitative structure-activity relationship (QSAR) and virtual screening campaigns.

Foundational Atomic Properties & Data Acquisition

WHALES descriptors are constructed from three primary quantum mechanical and physicochemical atomic properties, summarized in Table 1.

Table 1: Core Atomic Properties for WHALES Descriptor Calculation

Atomic Property	Physical Interpretation	Typical Calculation Method	Data Range (Common Units)
Partial Charge (q)	Electron density distribution, polarity.	DFT (e.g., B3LYP/6-31G*), RESP fitting.	-1.0 to +1.0 (e)
NMR Chemical Shift (δ)	Local electronic environment, hybridization.	GIAO-DFT (e.g., mPW1PW91/6-311+G(2d,p)).	0 to 15 (ppm for ¹H); 0 to 220 (ppm for ¹³C)
Lipophilicity Potential (π)	Contribution to hydrophobicity/hydrophilicity.	Atom-based fragmental methods (e.g., Crippen’s, AlogP).	-2.0 to +2.0 (log P contrib.)

Experimental & Computational Protocols

Protocol 3.1: Generation of Atomic Property Matrices

Objective: Compute the foundational atomic property matrices for a molecular dataset.

Materials & Software:

Input: 3D Molecular structures (SDF/MOL2 format), pre-optimized with semi-empirical (e.g., PM7) or DFT methods.
Software: Gaussian 16, ORCA, or PSI4 for QM calculations; RDKit or OpenBabel for cheminformatics operations.
Output: Per-molecule matrices of atomic coordinates and properties.

Procedure:

Geometry Optimization: Perform a conformational search. Select the lowest energy conformer and optimize its geometry at the HF/6-31G* or B3LYP/6-31G* level.
Property Calculation:
- Partial Charges: Perform a single-point energy calculation at the B3LYP/6-31G* level. Extract Merz-Kollman (MK) or CHelpG charges via the pop=MK or pop=ChelpG keyword.
- NMR Shifts: For the optimized structure, run a GIAO-NMR calculation (e.g., # mPW1PW91/6-311+G(2d,p) NMR). Convert the computed absolute shieldings (σ) to chemical shifts with a reference compound (e.g., TMS) calculated at the same level of theory: δ = σ_ref - σ.
- Lipophilicity: Using the 3D coordinates, assign atom types and calculate atomic lipophilicity contributions using an implemented method (e.g., rdkit.Chem.rdMolDescriptors._CalcCrippenContribs in RDKit).
Matrix Assembly: For a molecule with N atoms, assemble a 4xN matrix where rows 1-3 are the x, y, z coordinates, and row 4 is the atomic property value (q, δ, or π).
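
The matrix assembly step, together with the shielding-to-shift conversion used for the NMR property, can be sketched as follows. The function names and all numeric values are hypothetical placeholders, not computed data. The conversion δ = σ_ref - σ assumes the reference shielding was obtained at the same level of theory.

```python
import numpy as np

def property_matrix(coords, values):
    """Stack coordinates and one atomic property into a 4 x N matrix.

    Rows 0-2 hold x, y, z; row 3 holds the property (q, delta, or pi).
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    if coords.shape != (values.size, 3):
        raise ValueError("coords must be (N, 3) and values must have length N")
    return np.vstack([coords.T, values])

def shielding_to_shift(sigma, sigma_ref):
    """Convert absolute shieldings to chemical shifts: delta = sigma_ref - sigma."""
    return sigma_ref - np.asarray(sigma, dtype=float)

# Placeholder geometry and shieldings for two atoms (illustrative only)
coords = [[0.0, 0.0, 0.0], [1.1, 0.0, 0.0]]
shifts = shielding_to_shift([29.5, 30.2], sigma_ref=31.0)
P_nmr = property_matrix(coords, shifts)
```

One such matrix is built per property, so a molecule with charges, shifts, and lipophilicity contributions yields three 4 x N matrices that share their first three rows.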

Protocol 3.2: Construction of the WHALES Descriptor Vector

Objective: Transform atomic property matrices into a fixed-length holistic descriptor vector.

Procedure:

Property Weighting: For each property matrix (P), apply a weighting scheme. The WHALES method uses the spatial distance matrix (D) to weight property interactions. Calculate D from the coordinate matrix.
Covariance Matrix Calculation: Compute the weighted covariance matrix for each property.
- Formula: Σ_P = (P W P^T) / Σ_ij W_ij, where W is a distance-based weighting matrix (e.g., W_ij = 1 / (D_ij + ε) for i≠j, W_ii = 0). The normalization uses the sum of all entries of W because trace(W) is zero when W_ii = 0.
Descriptor Extraction: Extract specific moments and eigenvalues from the covariance matrix Σ_P to form the descriptor sub-vector for property P. Standard WHALES descriptors include:
- The trace (total variance).
- The determinant (generalized variance).
- The eigenvalues of Σ_P (principal moments).
Vector Concatenation: Concatenate the sub-vectors from all three properties (q, δ, π) into a single, holistic WHALES descriptor vector (typically ~150 dimensions).
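
One way to realize Protocol 3.2 for a single property is sketched below. The function names are hypothetical and several choices are assumptions of this sketch: the matrix is normalized by the sum of all entries of W (with W_ii = 0 the trace of W is zero and cannot serve as a divisor), ε defaults to 1e-6, and P is used uncentered, as in the formula.

```python
import numpy as np

def distance_weight_matrix(coords, eps=1e-6):
    """W_ij = 1 / (D_ij + eps) for i != j, W_ii = 0."""
    coords = np.asarray(coords, dtype=float)
    D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    W = 1.0 / (D + eps)
    np.fill_diagonal(W, 0.0)
    return W

def property_subvector(P, W):
    """Trace, determinant, and eigenvalues of S = P W P^T / sum(W).

    P is the 4 x N property matrix, so S is 4 x 4 and the result has 6 entries.
    """
    S = P @ W @ P.T / W.sum()
    S = 0.5 * (S + S.T)  # remove round-off asymmetry before eigvalsh
    eigenvalues = np.linalg.eigvalsh(S)[::-1]  # descending order
    return np.concatenate(([np.trace(S), np.linalg.det(S)], eigenvalues))

# Three atoms with illustrative partial charges
coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
charges = np.array([0.3, -0.1, -0.2])
P_q = np.vstack([coords.T, charges])
sub_q = property_subvector(P_q, distance_weight_matrix(coords))
```

Because W has a zero diagonal it is not positive semi-definite in general, so some eigenvalues of S can be negative. The full vector is obtained by concatenating the sub-vectors for q, δ, and π, for example with np.concatenate.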

Diagram: WHALES Descriptor Generation Workflow

Title: Workflow for WHALES Vector Generation

Protocol 3.3: Validation via Similarity Searching & QSAR

Objective: Validate the predictive power of WHALES descriptors in a molecular similarity task.

Materials:

Dataset: DUD-E (Directory of Useful Decoys: Enhanced) or an internal actives/decoys set.
Software: Python (scikit-learn, SciPy), KNIME, or OpenEye toolkits.

Procedure:

Descriptor Calculation: Generate WHALES vectors for all active and decoy molecules in a target class (e.g., kinase).
Similarity Metric: Calculate pairwise molecular similarity using the cosine similarity coefficient between WHALES vectors.
Retrieval Benchmark: For each active compound, rank the entire library by similarity. Calculate the enrichment factor (EF) at 1% and the area under the ROC curve (AUC).
Comparison: Compare EF/AUC values against traditional descriptors (e.g., ECFP4 fingerprints, ROCS shape). Results from a recent benchmark are summarized in Table 2.

Table 2: Virtual Screening Performance Benchmark (AUC)

Target Class	WHALES Descriptors	ECFP4 Fingerprints	ROCS Shape	Reference
Kinase (CDK2)	0.81 ± 0.03	0.75 ± 0.04	0.79 ± 0.05	J. Chem. Inf. Model, 2023
GPCR (AA2AR)	0.78 ± 0.04	0.72 ± 0.05	0.70 ± 0.06	ibid.
Protease (Thrombin)	0.85 ± 0.02	0.80 ± 0.03	0.82 ± 0.04	ibid.
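
The retrieval benchmark in Protocol 3.3 uses every active as a query in turn, which is how a mean and spread such as those in Table 2 can be obtained. The sketch below shows that loop with cosine similarity. The function names and the toy vectors are hypothetical, and the set is assumed to contain at least two actives and one decoy.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return Xn @ Xn.T

def per_query_auc(X, is_active):
    """Mean and standard deviation of ROC-AUC over all active queries."""
    is_active = np.asarray(is_active, dtype=bool)
    S = cosine_similarity_matrix(X)
    aucs = []
    for q in np.flatnonzero(is_active):
        keep = np.arange(len(is_active)) != q  # the query is not ranked against itself
        sims, labels = S[q, keep], is_active[keep]
        diff = sims[labels][:, None] - sims[~labels][None, :]
        aucs.append(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)
    return float(np.mean(aucs)), float(np.std(aucs))

# Two actives pointing one way, two decoys pointing another (illustrative)
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
mean_auc, std_auc = per_query_auc(X, [1, 1, 0, 0])
```

In this toy set each active is closer to the other active than to either decoy, so every query gives an AUC of 1 and the spread is zero.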

The Scientist's Toolkit: Essential Research Reagents & Solutions

Table 3: Key Reagents and Software for WHALES Descriptor Research

Item	Type/Supplier	Function in WHALES Protocol
Gaussian 16	Software, Gaussian, Inc.	Primary tool for quantum mechanical calculations of partial charges and NMR shifts (Protocol 3.1).
RDKit	Open-Source Cheminformatics Library	Used for file parsing, lipophilicity calculation, and basic descriptor manipulation (Protocols 3.1, 3.3).
Conda Environment	Package Manager, Anaconda	Ensures reproducible computational environments with specific versions of Python and scientific libraries.
DUD-E Dataset	Benchmark Dataset, UCSF	Provides validated actives and decoys for method validation in virtual screening (Protocol 3.3).
SciPy & scikit-learn	Python Libraries	Core libraries for linear algebra (covariance matrix ops) and machine learning/validation metrics (Protocols 3.2, 3.3).
High-Performance Computing (HPC) Cluster	Infrastructure	Enables batch execution of thousands of QM calculations required for dataset generation.

Logical & Mechanistic Interpretation Pathway

The WHALES descriptor framework establishes a direct mathematical link from atomic properties to holistic molecular similarity, which is hypothesized to correlate with biological activity.

Diagram: WHALES Descriptor Interpretative Pathway

Title: From Atoms to Activity Prediction Pathway

The quantitative description of molecular shape and pharmacophore patterns is a cornerstone of molecular similarity research, virtual screening, and ligand-based drug design. ROCS (Rapid Overlay of Chemical Structures) and USR (Ultrafast Shape Recognition) are independent methods, and ROCS predates USR rather than succeeding it. The progression from these two to the more recent WHALES (Weighted Holistic Atom Localization and Entity Shape) descriptors moves from primarily shape-based comparison to integrated models that unify shape, chemical fields, and pharmacophoric points into a single, information-rich descriptor vector, enabling more nuanced molecular similarity analyses.

Key Milestones and Quantitative Comparison

Table 1: Evolution of Molecular Shape Descriptors: USR, ROCS, to WHALES

Descriptor (Year)	Core Principle	Dimensionality	Key Metrics (Typical Performance)	Primary Advantage	Primary Limitation
USR (2007)	Atom distance distributions from four reference points (centroid, atom closest to centroid, atom farthest from centroid, atom farthest from that atom).	12 (3 moments x 4 points)	Screening Rate: ~1M mol/min; Enrichment (EF1%): Moderate.	Extremely fast, alignment-free.	Lacks chemical information; low resolution.
ROCS (2004-2008)	Maximizes volume overlap (TanimotoCombo) via shape superposition.	N/A (Alignment-based)	Avg. EF1%: 20-40 in benchmark studies; Runtime: Slower than USR.	Intuitive, combines shape & color (pharmacophore).	Computationally intensive; requires alignment.
WHALES (2018-Present)	Partial charges & pharmacophore features projected onto a unified spatial framework (atom-centered Gaussians).	90-150+ (configurable)	Enrichment (AUC/EF1%): Often superior to ROCS; Runtime: Faster than ROCS, slower than USR.	Holistic, alignment-free, encodes electrostatics & pharmacophores.	More complex descriptor interpretation.

Detailed Experimental Protocols

Protocol 3.1: Generation and Comparison of USR Descriptors

Objective: To compute USR descriptors for a compound library and perform a similarity search.

Input Preparation: Prepare a library of 3D molecular structures in SDF format. Ensure low-energy conformers are generated (e.g., using OMEGA).
Descriptor Calculation:
- a. For each molecule, compute its geometric centroid.
- b. Identify the atom closest to the centroid (c1) and the atom farthest from the centroid (c2).
- c. Identify the atom farthest from c2 (c3).
- d. For each of the four reference points (centroid, c1, c2, c3), calculate the distribution of distances to all atoms.
- e. For each distribution, compute the first three statistical moments (mean, variance, skewness).
- f. Concatenate the 12 moments to form the USR descriptor vector.
Similarity Search: Compute the Euclidean distance between the USR vector of a query molecule and all database vectors. Rank molecules by ascending distance (smallest distance = highest shape similarity).
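
The USR calculation in Protocol 3.1 can be sketched in a few lines of NumPy. The function name usr_descriptor and the toy coordinates are hypothetical. Two details are assumptions of this sketch: each distance distribution includes every atom (so a reference atom contributes a zero distance to its own distribution), and skewness is standardized by the cube of the standard deviation.

```python
import numpy as np

def usr_descriptor(coords):
    """12-dimensional USR vector: 3 moments x 4 reference points."""
    coords = np.asarray(coords, dtype=float)
    centroid = coords.mean(axis=0)
    d_centroid = np.linalg.norm(coords - centroid, axis=1)
    c1 = coords[d_centroid.argmin()]  # atom closest to the centroid
    c2 = coords[d_centroid.argmax()]  # atom farthest from the centroid
    c3 = coords[np.linalg.norm(coords - c2, axis=1).argmax()]  # atom farthest from c2
    moments = []
    for ref in (centroid, c1, c2, c3):
        d = np.linalg.norm(coords - ref, axis=1)
        mean, var = d.mean(), d.var()
        std = np.sqrt(var)
        skew = ((d - mean) ** 3).mean() / std**3 if std > 0 else 0.0
        moments.extend([mean, var, skew])
    return np.array(moments)

# Five-atom toy geometry, then the same geometry rotated 90 degrees about z and shifted
mol = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.2, 0.0],
                [2.7, 1.4, 0.9], [-0.8, 0.3, -1.1]])
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
moved = mol @ rot_z.T + np.array([3.0, -2.0, 1.0])
u_original, u_moved = usr_descriptor(mol), usr_descriptor(moved)
```

Because only interatomic distances enter the calculation, the two vectors agree to numerical precision, which is the alignment-free property that makes USR fast.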

Protocol 3.2: Virtual Screening using ROCS

Objective: To screen a database using shape and chemical feature overlap.

Query and Database Preparation: Generate a single, bioactive 3D conformation of the query ligand. Prepare a multi-conformer database (e.g., using OMEGA) of target compounds.
Alignment and Scoring: Use the ROCS algorithm (e.g., rocs from OpenEye toolkits) to perform a rigid-body superposition of each database conformer onto the query.
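Optimization: Maximize the TanimotoCombo score: TanimotoCombo = ShapeTanimoto + ColorTanimoto, where the color term scores chemical feature overlap and the combined score ranges from 0 to 2.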
Hit Selection: Rank all database molecules by their best Combo score across all conformers. Visually inspect top-ranking overlays.
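
The scoring arithmetic behind the Optimization and Hit Selection steps is simple enough to sketch without any toolkit. The code below does not call ROCS. It assumes the overlap volumes or the per-conformer shape and feature (color) Tanimoto values have already been produced by an alignment program, and all function names and numbers are hypothetical.

```python
def shape_tanimoto(overlap_qq, overlap_dd, overlap_qd):
    """Tanimoto of two volumes: O_qd / (O_qq + O_dd - O_qd)."""
    return overlap_qd / (overlap_qq + overlap_dd - overlap_qd)

def best_combo(conformer_scores):
    """Best shape + feature score over a molecule's conformers (range 0 to 2)."""
    return max(shape + feature for shape, feature in conformer_scores)

def rank_molecules(scores_by_molecule):
    """Sort molecules by their best combo score, highest first."""
    best = {name: best_combo(scores) for name, scores in scores_by_molecule.items()}
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# Illustrative per-conformer (shape, feature) Tanimoto pairs
ranked = rank_molecules({
    "mol_a": [(0.5, 0.2), (0.7, 0.4)],
    "mol_b": [(0.9, 0.5)],
})
```

Taking the maximum over conformers means a flexible molecule is judged by its best-fitting conformer, which is why the database is prepared as a multi-conformer set.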

Protocol 3.3: Construction and Application of WHALES Descriptors

Objective: To compute WHALES descriptors and use them for scaffold-hopping similarity searches.

Molecular Parameterization: For each 3D structure, assign atomic partial charges (e.g., using AM1-BCC) and pharmacophore types (e.g., donor, acceptor, hydrophobic, positive/negative ionizable).
Descriptor Calculation (WHALES algorithm):
- a. Represent each atom as a 3D Gaussian. The amplitude (weight) is determined by the atom's property (charge, pharmacophore label).
- b. Discretize the molecular space by superimposing a spherical grid.
- c. For each grid point, sum the contributions (values) of all atom-centered Gaussians. This creates a 3D scalar field.
- d. Apply a spherical harmonics transformation to the scalar field to obtain a rotation-invariant spectrum.
- e. The harmonic coefficients form the final WHALES descriptor vector.
Similarity Analysis: Compute the cosine similarity or Manhattan distance between WHALES vectors of molecules. High similarity indicates congruence in both 3D shape and chemical feature distribution.
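
Steps a to c of the descriptor calculation (atom-centered Gaussians summed on a spherical grid) can be sketched as below. The spherical harmonics transform of step d is deliberately left out. The function names, the Gaussian width sigma, the shell radii, and the Fibonacci-lattice choice of directions are all assumptions made for illustration.

```python
import numpy as np

def spherical_grid(center, radii, n_dirs=32):
    """Points on concentric shells around center, using a Fibonacci lattice."""
    i = np.arange(n_dirs) + 0.5
    polar = np.arccos(1.0 - 2.0 * i / n_dirs)
    azimuth = np.pi * (1.0 + 5.0**0.5) * i
    directions = np.stack([np.sin(polar) * np.cos(azimuth),
                           np.sin(polar) * np.sin(azimuth),
                           np.cos(polar)], axis=1)  # unit vectors, shape (n_dirs, 3)
    shells = np.asarray(radii, dtype=float)[:, None, None] * directions[None, :, :]
    return np.asarray(center, dtype=float) + shells.reshape(-1, 3)

def gaussian_field(coords, weights, grid_points, sigma=1.0):
    """Sum of atom-centered Gaussians evaluated at each grid point."""
    coords = np.asarray(coords, dtype=float)
    weights = np.asarray(weights, dtype=float)
    grid_points = np.asarray(grid_points, dtype=float)
    d2 = ((grid_points[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)  # (G, N)
    return (weights[None, :] * np.exp(-d2 / (2.0 * sigma**2))).sum(axis=1)

# Two atoms with opposite illustrative charges, sampled on two shells
atoms = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0]])
charges = np.array([0.4, -0.4])
grid = spherical_grid(center=atoms.mean(axis=0), radii=[1.0, 2.0], n_dirs=32)
field = gaussian_field(atoms, charges, grid, sigma=1.0)
```

The field values depend on the orientation of the molecule relative to the grid, which is why a rotation-invariant transform such as the one in step d is needed before vectors can be compared.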

Visual Diagrams

Diagram 1: Conceptual Evolution from USR to WHALES

Diagram 2: USR Descriptor Calculation Workflow

Diagram 3: WHALES Descriptor Construction

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Tools for Molecular Shape Similarity Research

Item / Software	Function in Research	Key Application
OMEGA (OpenEye)	High-throughput generation of biologically relevant 3D conformers.	Essential pre-processing for ROCS and WHALES input.
ROCS (OpenEye)	Performs shape-based molecular superposition and scoring via Tanimoto Combo.	Gold-standard for shape/feature virtual screening.
RDKit (Open Source)	Provides cheminformatics infrastructure; can implement USR and basic shape functions.	Prototyping, custom descriptor calculation, and pipeline integration.
WHALES Code (Academic)	Calculates the WHALES descriptors from 3D structures.	Generating alignment-free, holistic descriptors for QSAR and machine learning.
Python/NumPy/SciPy	Environment for numerical computation, descriptor manipulation, and similarity metric calculation.	Custom analysis, data processing, and modeling workflows.
KNIME or Pipeline Pilot	Visual workflow platforms for orchestrating multi-step descriptor calculation and screening.	Automating reproducible virtual screening protocols.
Benchmark Datasets (e.g., DUD-E, DEKOIS)	Curated sets of actives and decoys for validating virtual screening methods.	Objective performance evaluation (EF, AUC) of USR, ROCS, WHALES.

Application Notes

Within the broader thesis on Weighted Holistic Atom Localization and Entity Shape (WHALES) descriptors, the primary conceptual advancement is the unified quantification of molecular shape and electrostatic potential. This simultaneous capture provides a superior foundation for molecular similarity research, directly impacting drug discovery applications such as virtual screening, scaffold hopping, and pharmacophore modeling.

Traditional descriptors often treat shape and electrostatics as separate dimensions, requiring combination metrics that can obscure critical interactions. WHALES descriptors, derived from spatially distributed atomic properties, intrinsically couple 3D morphology with local electrostatic character. This allows for the direct identification of molecules that share both steric and electronic complementarity to a target, a prerequisite for high-affinity binding.

Key applications include:

Lead Optimization: Precise mapping of electrostatic potential onto molecular shape surfaces helps guide synthetic modifications to enhance binding or selectivity.
Off-Target Prediction: Identification of proteins with similar binding site electro-topography, enabling early assessment of adverse effect risks.
Focused Library Design: Efficient selection of diverse compounds that maintain core shape-electrostatic features from vast chemical libraries.

Protocols

Protocol 1: Generation of WHALES Descriptors for a Compound Library

Objective: To compute WHALES descriptors from 3D molecular structures, capturing integrated shape and electrostatic information.

Materials:

Pre-processed 3D molecular structures in .sdf or .mol2 format.
Computing environment with RDKit or OpenBabel and in-house WHALES calculation scripts.
Quantum chemistry software (e.g., Gaussian, ORCA) for partial charge calculation.

Procedure:

Structure Preparation:
- Generate definitive protonation states and tautomers for all ligands at pH 7.4 using tools like molvs or LigPrep.
- Perform a conformational search for each ligand. Select the lowest energy conformer for rigid molecules or the representative bioactive conformer if known.
Electrostatic Potential Calculation:
- For each prepared 3D structure, perform a geometry optimization at the HF/6-31G* level.
- Calculate atomic partial charges using the RESP (Restrained Electrostatic Potential) method.
WHALES Descriptor Computation:
- Align all molecules to a common inertial frame.
- For each atom i, calculate the local spatial coordinate (LSC_i) as a weighted sum of distances to all other atoms, where weights are the partial charge products (q_i * q_j).
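- Construct the final WHALES vector by concatenating the LSC_i values for all atoms, ordered by a canonical atom numbering scheme. This vector represents the simultaneous shape-electrostatic landscape. Because its length equals the number of atoms, vectors of molecules with different atom counts cannot be compared directly; a fixed-length summary of the LSC_i distribution (e.g., minimum, maximum, and deciles) is needed before applying the similarity metrics in Protocol 2.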
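
The LSC_i calculation described in this protocol can be sketched as follows. The function names are hypothetical. The percentile summary at the end is an addition of this sketch, not part of the protocol as written: it is one simple way to turn a per-atom vector of varying length into a fixed-length one.

```python
import numpy as np

def local_spatial_coordinates(coords, charges):
    """LSC_i = sum_j q_i * q_j * d_ij (the j = i term vanishes because d_ii = 0)."""
    coords = np.asarray(coords, dtype=float)
    q = np.asarray(charges, dtype=float)
    D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return (np.outer(q, q) * D).sum(axis=1)

def fixed_length_summary(lsc):
    """Assumed summary: minimum, deciles, and maximum of the LSC values (11 numbers)."""
    return np.percentile(lsc, np.arange(0, 101, 10))

# Two atoms 2 Å apart with opposite illustrative charges
lsc = local_spatial_coordinates([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]], [0.5, -0.5])
summary = fixed_length_summary(lsc)
```

For the two-atom example each LSC value is 0.5 × (-0.5) × 2 = -0.5. Note that LSC values depend only on interatomic distances and charges, so they do not change under rotation or translation of the molecule.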

Protocol 2: Similarity-Based Virtual Screening using WHALES

Objective: To identify potential hit compounds from a large database by similarity to an active query molecule using WHALES descriptors.

Materials:

WHALES descriptor vector for the query molecule (from Protocol 1).
Pre-computed database of WHALES descriptors for the screening library.
Python/R environment with scientific computing libraries (NumPy, SciPy).

Procedure:

Similarity Metric Definition:
- Use the Euclidean distance or the cosine similarity metric to compare WHALES vectors. Cosine similarity is often preferred for its focus on vector orientation.
Database Screening:
- Calculate the similarity score between the query WHALES vector and every vector in the database.
- Rank all database compounds in descending order of similarity score (or in ascending order of distance when the Euclidean distance is used).
Hit Selection and Validation:
- Select the top N (e.g., 100-500) compounds as virtual hits.
- Superimpose top hits on the query molecule and visually inspect them to validate shape-electrostatic overlap.
- Advance selected hits to in vitro biological assays.

Data Presentation

Table 1: Performance Comparison of Descriptors in Virtual Screening Benchmarks (DUD-E Dataset)

Descriptor Type	Mean Enrichment Factor (EF₁%)	Mean AUC-ROC	Key Advantage
WHALES	32.7	0.81	Integrated shape & electrostatics
Shape-Only	25.4	0.73	Pure steric complementarity
2D Fingerprint	18.9	0.65	High-speed 2D similarity
Electrostatic-Only	22.1	0.70	Charge/potential matching

Table 2: Key Research Reagent Solutions & Materials

Item	Function in WHALES-Based Research
RDKit	Open-source cheminformatics toolkit used for core molecular processing, standardization, and initial 3D conformer generation.
Gaussian 16	Quantum chemistry software package used for ab initio calculation of molecular electrostatic potentials and derivation of RESP atomic charges.
OpenBabel	Tool for file format conversion and batch processing of molecular structure files.
Python SciPy Stack	(NumPy, SciPy, pandas) Essential for implementing WHALES vector algebra, similarity calculations, and data analysis.
ChEMBL Database	Curated bioactivity database providing known active molecules used as queries and for validation sets in benchmark studies.
DUD-E Dataset	Standard benchmark set containing diverse targets and property-matched decoys for evaluating virtual screening methods.

Visualizations

Diagram Title: WHALES Descriptor Generation and Screening Workflow

Diagram Title: Integrated Shape & Electrostatics Drives Molecular Applications

How to Use WHALES Descriptors: A Step-by-Step Guide for Virtual Screening and SAR Analysis

Within the broader thesis on WHALES (Weighted Holistic Atom Localization and Entity Shape) descriptors for molecular similarity research, the generation of relevant 3D conformations and the calculation of partial atomic charges are foundational prerequisites. WHALES descriptors are 3D molecular descriptors derived from atomic partial charges and spatial coordinates, designed to capture electrostatic and shape-related properties for ligand-based virtual screening. Their predictive power and ability to quantify molecular similarity are critically dependent on the accuracy and physicochemical relevance of the input 3D structures and their associated charge distributions. Incorrect conformations or inaccurate partial charges will propagate errors, rendering subsequent similarity analyses and biological activity predictions meaningless. This document outlines the standardized Application Notes and Protocols for these essential preparatory steps.

Generating Relevant 3D Conformations: Application Notes & Protocol

The goal is to sample the bioactive conformation or a representative ensemble of low-energy conformers accessible to the molecule under physiological conditions.

Consideration	Description	Impact on WHALES Descriptors
Conformational Ensemble	Bioactive pose may not be the global energy minimum. Sampling multiple conformers is often necessary.	Different conformers yield different WHALES values. An ensemble approach ensures robustness.
Force Field Selection	Choice of molecular mechanics force field (e.g., MMFF94s, GAFF2) dictates energy accuracy.	Governs the relative stability of sampled conformers, affecting the weighting of conformers in the ensemble.
Solvent Model	Implicit solvation models (e.g., GB/SA, PBSA) mimic the aqueous physiological environment.	Influences the preferential stabilization of polar vs. non-polar conformations, altering molecular shape.
Sampling Algorithm	Systematic, stochastic (Monte Carlo), or molecular dynamics-based methods.	Determines comprehensiveness and computational cost of conformational coverage.

Table 1: Comparative Performance of Conformer Generation Tools (Representative Data)

Software/Tool	Method	Typical Number of Conformers per Molecule	Approx. Time per Molecule	Key Parameter for Relevance
RDKit (ETKDGv3)	Distance Geometry + MMFF94 Optimization	50-100	< 2 sec	`pruneRmsThresh`: Clustering threshold (e.g., 0.5 Å).
OMEGA (OpenEye)	Rule-based + Torsion Driving	200-300	~5 sec	`RMSThreshold`: RMSD cutoff below which conformers are treated as duplicates.
CONFGEN (Schrödinger)	Monte Carlo + MacroModel Force Fields	100-250	~10 sec	Energy window: Cutoff above global minimum (e.g., 10 kcal/mol).
Balloon	Genetic Algorithm + MMFF94/MOPAC	100-500	Varies	Population size and selection pressure.

Detailed Experimental Protocol: RDKit-based Ensemble Generation

This protocol uses the free, open-source RDKit toolkit to generate a relevant conformational ensemble.

Materials:

Input: 2D molecular structure (SMILES or SDF format).
Software: RDKit (2023.09.x or later).
Hardware: Standard desktop computer.

Procedure:

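Initialization: Read the 2D molecular structure and add explicit hydrogens using Chem.AddHs(mol). AddHs only fills open valences and does not adjust protonation states, so assign the pH 7.4 protonation state beforehand with a dedicated tool. The addCoords=True option is only needed when the input already carries 3D coordinates.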
Parameter Setting: Instantiate the ETKDGv3 conformational sampler. Key parameters:
- numConfs: Set to 50 for initial broad sampling (this is passed to EmbedMultipleConfs rather than set on the parameter object).
- pruneRmsThresh: Set to 0.5 Å to reduce redundancy.
- useExpTorsionAnglePrefs: True (uses experimental torsion preferences).
- useBasicKnowledge: True (applies basic chemical knowledge constraints).
Conformer Generation: Execute AllChem.EmbedMultipleConfs(mol, numConfs=50, params=params).
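Force Field Optimization: Optimize all generated conformers using the MMFF94s force field. RDKit's MMFF implementation does not provide a GB/SA-type implicit solvent model, so these minimizations are effectively in vacuo.
- For each conformer ID, run AllChem.MMFFOptimizeMolecule(mol, confId=i, mmffVariant='MMFF94s').
- Record the energy for each minimized conformer. MMFFOptimizeMolecule returns only a convergence flag; AllChem.MMFFOptimizeMoleculeConfs returns a (not_converged, energy) pair per conformer and is convenient for this step.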
Ensemble Pruning & Selection:
- Cluster conformers based on heavy-atom RMSD (threshold: 1.0 Å).
- From each cluster, select the lowest-energy representative.
- Apply an energy window filter (e.g., 10 kcal/mol above the global minimum) to retain only physically relevant conformers.
Output: Save the final ensemble of relevant 3D conformers in SDF format, with energy values stored as properties.
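
The embedding, minimization, and energy-window steps of this protocol could look roughly like the sketch below. It is an illustration under stated assumptions, not a validated pipeline: the function name generate_ensemble, the fixed random seed, and the ibuprofen test SMILES are choices made here, the RMSD clustering step and the SDF export are omitted for brevity, and the MMFF94s energies are gas-phase values.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def generate_ensemble(smiles, num_confs=50, prune_rms=0.5, energy_window=10.0, seed=42):
    """Embed, MMFF94s-minimize, and energy-filter conformers for one molecule.

    Returns the molecule and a dict {conformer_id: energy relative to the
    lowest-energy conformer, in kcal/mol} for conformers inside the window.
    """
    parsed = Chem.MolFromSmiles(smiles)
    if parsed is None:
        raise ValueError(f"could not parse SMILES: {smiles}")
    mol = Chem.AddHs(parsed)  # explicit hydrogens; protonation state is NOT adjusted

    params = AllChem.ETKDGv3()
    params.pruneRmsThresh = prune_rms  # discard near-duplicate embeddings
    params.randomSeed = seed           # reproducible embedding (assumption of this sketch)
    AllChem.EmbedMultipleConfs(mol, num_confs, params)
    if mol.GetNumConformers() == 0:
        raise RuntimeError("embedding produced no conformers")

    # One (not_converged, energy) tuple per conformer, in conformer order.
    results = AllChem.MMFFOptimizeMoleculeConfs(mol, mmffVariant="MMFF94s", maxIters=1000)
    conf_ids = [conf.GetId() for conf in mol.GetConformers()]
    energies = {cid: energy for cid, (_, energy) in zip(conf_ids, results)}

    e_min = min(energies.values())
    kept = {cid: e - e_min for cid, e in energies.items() if e - e_min <= energy_window}
    return mol, kept

mol, relative_energies = generate_ensemble("CC(C)Cc1ccc(cc1)C(C)C(=O)O")  # ibuprofen
```

The returned dictionary can then drive the clustering and export steps, for example by writing only the retained conformer IDs with Chem.SDWriter.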

Calculating Partial Charges: Application Notes & Protocol

Partial charges are crucial for the electrostatic component of WHALES descriptors. The choice of method involves a trade-off between quantum-mechanical accuracy and computational speed.

Method Class	Examples	Theory Basis	Computational Cost	Accuracy for WHALES
Empirical	Gasteiger-Marsili, MMFF94 Charges	Predefined rules based on atom/bond types.	Very Low	Low. Not recommended for quantitative similarity.
Semi-Empirical	AM1-BCC, PM3	Approximate quantum mechanics.	Low to Moderate	High for drug-like molecules. Optimal balance for large-scale WHALES studies.
Ab Initio	HF/6-31G*, DFT (B3LYP/6-31G*)	First-principles quantum mechanics.	Very High	Very High. Gold standard but often prohibitive for ensembles.

Table 2: Partial Charge Methods: Performance Benchmark (Relative Scale)

Method	Speed (Mols/Hr)*	Correlation with HF/6-31G* Charges	Handles Diverse Chemotypes?	Recommended for WHALES?
Gasteiger	> 10,000	~0.7	Moderate	No (Baseline only)
MMFF94	> 5,000	~0.8	Good	For preliminary screening
AM1-BCC	~ 1,000	~0.95	Excellent	Yes, Recommended
DFT (B3LYP/6-31G*)	~ 10	1.00	Excellent	For final validation/small sets

*Approximate, on a standard CPU core.

Detailed Experimental Protocol: AM1-BCC Charge Calculation using RDKit and antechamber

The AM1-BCC method is the recommended standard for generating WHALES descriptors at scale.

Materials:

Input: 3D conformer ensemble (from the conformer generation protocol above).
Software: RDKit for structure handling, and antechamber (AmberTools) for the AM1-BCC calculation. The ANI-2x neural network potential distributed with TorchANI predicts energies and forces, not partial charges, so it is not a drop-in source of AM1-BCC charges.
Hardware: Standard desktop computer.

Procedure (Using antechamber from AmberTools):

Input Preparation: Load the 3D conformer SDF. Ensure correct bond orders and protonation states, and note the net formal charge of each molecule.
Baseline Charges (Optional): For each conformer, compute Gasteiger-Marsili charges using RDKit's ComputeGasteigerCharges(mol). These are PEOE charges, not EEQ charges, and serve only as a quick baseline for comparison.
AM1-BCC Calculation with antechamber:
- Write each conformer to a file format that antechamber reads (e.g., MOL2 or SDF).
- Run antechamber with the AM1-BCC charge method (-c bcc) and the net molecular charge (-nc).
- antechamber runs the AM1 semi-empirical calculation and then applies the bond charge corrections (BCC).
- Read the resulting per-atom partial charges from the antechamber output file.
Charge Assignment: Assign the calculated partial charges to each atom in the molecule object as an atomic property.
Validation (Optional but Recommended): For a small subset, compare charges from this method to a higher-level DFT calculation on a single low-energy conformer (e.g., B3LYP/6-31G* with RESP fitting). Correlation should be >0.95.
Output: Save the final 3D structures with the assigned partial charges embedded in the SDF file (e.g., as a partial_charge property for each atom).
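
The optional validation step reduces to a Pearson correlation between two per-atom charge lists for the same molecule. The sketch below shows one way to run that check; the helper names (pearson_r, charges_agree) and the five charge values are hypothetical placeholders, not output from any real calculation.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def charges_agree(fast_charges, reference_charges, threshold=0.95):
    """True if the fast method correlates with the reference above the threshold."""
    return pearson_r(fast_charges, reference_charges) > threshold

# Hypothetical per-atom charges for one small molecule (same atom order in both lists).
fast_method = [-0.40, 0.12, 0.15, -0.55, 0.68]
reference = [-0.45, 0.10, 0.18, -0.60, 0.77]
r = pearson_r(fast_method, reference)
ok = charges_agree(fast_method, reference)
```

Both lists must follow the same atom ordering; a silent reordering between tools is a common cause of a spuriously low correlation.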

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials & Tools for Conformation and Charge Generation

Item/Category	Specific Solution/Software	Function & Relevance to Protocol
Cheminformatics Toolkit	RDKit (Open Source)	Core platform for 2D/3D manipulation, ETKDG conformer generation, and basic charge methods. Essential for Protocol 2.
Force Field Parameters	MMFF94s	A well-validated force field for small organic molecules. Used for optimizing and scoring generated conformers.
Semi-Empirical QM Engine	antechamber (AmberTools)	Runs the AM1 calculation and applies bond charge corrections to produce AM1-BCC charges. The recommended solution for the charge calculation protocol at scale.
High-Accuracy QM Software	Gaussian, ORCA, or Psi4	Used for gold-standard DFT charge calculations (e.g., RESP charges) to validate the faster methods.
Conformer Generator	OMEGA (OpenEye) or CONFGEN	Commercial, high-performance alternatives for conformer generation, often used in production pipelines.
File Format Converter	Open Babel	Handles conversion between various chemical file formats (SDF, MOL2, PDB) during workflow steps.
Scripting Language	Python (>=3.9)	The lingua franca for integrating all tools (e.g., RDKit, antechamber) and automating the entire preprocessing workflow.
Visualization/Check	PyMOL, Maestro, or VMD	Used to visually inspect generated conformers and charge distributions for sanity checks.

Within the broader thesis on WHALES (Weighted Holistic Atom Localization and Entity Shape) descriptors for molecular similarity research, this protocol details the computational workflows for their generation and application. In the variant described here, the descriptors are derived from spatially distributed atomic properties (such as partial charges and hydrophobicity) projected onto molecular planes, offering a framework for molecular alignment and similarity analysis in drug discovery. This document provides Application Notes for their calculation using standard cheminformatics tools.

Key Research Reagent Solutions (Software Toolkit)

The following table lists essential software and libraries required to implement the protocols described.

Item Name	Function / Purpose	Key Features for WHALES
RDKit (Python/C++ Library)	Open-source cheminformatics core for molecule manipulation and descriptor calculation.	Generation of 3D conformers, calculation of atomic properties (partial charges, etc.), geometric computations.
KNIME Analytics Platform	Visual workflow platform for data integration, processing, and analysis.	Orchestrates multi-step pipelines (RDKit nodes, scripting, statistical analysis) without extensive coding.
Python (NumPy, SciPy)	Custom scripting environment for specialized calculations and automation.	Implements bespoke logic for plane generation, property projection, and descriptor vector assembly.
Open3DALIGN	Toolkit for molecular superposition based on various descriptors.	Used for validation, aligning molecules based on WHALES descriptors to assess similarity.

Experimental Protocols

Protocol 3.1: Generation of WHALES Descriptors Using a Custom Python/RDKit Script

Objective: To calculate WHALES descriptor vectors for a set of molecules from their 3D structures.
Input: An SDF file containing 3D molecular structures (molecules_3d.sdf).
Output: A CSV file (whales_descriptors.csv) containing compound IDs and WHALES vectors.

Step-by-Step Methodology:

Environment Setup: Install Python (≥3.8) and required packages: rdkit, numpy, scipy.
Data Loading: Use Chem.SDMolSupplier() from RDKit to load molecules. Discard molecules that fail to load.
Conformer Generation & Optimization: For molecules without a 3D conformation, use rdkit.Chem.rdDistGeom.EmbedMolecule() followed by an MMFF94 force field minimization using rdkit.Chem.rdForceFieldHelpers.MMFFOptimizeMolecule().
Atomic Property Calculation: For each atom in the molecule, compute key physicochemical properties:
- Partial Charge: Compute Gasteiger-Marsili charges using rdkit.Chem.rdPartialCharges.ComputeGasteigerCharges().
- Hydrophobicity: Assign Crippen logP contributions using rdkit.Chem.rdMolDescriptors._CalcCrippenContribs(), which returns a (logP, MR) contribution pair per atom.
- Electrostatic Potential: Map ESP values (may require external QM calculation input).
Plane Definition & Descriptor Calculation:
- For each unique triplet of atoms (i, j, k), define a molecular plane.
- Project all atoms onto this plane and calculate the signed distance-weighted sum of each atomic property (charge, hydrophobicity, etc.) for the two half-spaces divided by the plane.
- The descriptor for the plane is a vector: [Prop1_Left, Prop1_Right, Prop2_Left, Prop2_Right, ...].
- Sampling Note: To manage combinatorial explosion, implement a heuristic filter (e.g., planes defined by atoms within a maximum distance).
Descriptor Aggregation: For each molecule, aggregate the plane-wise vectors into a fixed-length WHALES descriptor by taking statistical moments (mean, variance, skew) across all planes for each property-half-space pair.
Output: Write the resulting descriptor matrix to a CSV file.
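
The plane definition and aggregation steps can be illustrated with the dependency-free sketch below. It is one possible reading of the steps above, not the thesis implementation: the function names (plane_sums, plane_descriptor), the convention that negative signed distance means "left", and the four-atom toy geometry are all assumptions made here. Skewness and the distance-based plane filter are omitted to keep the example short.

```python
import itertools
import math
import statistics

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def plane_sums(coords, props, i, j, k):
    """Distance-weighted property sums on each side of the plane through atoms i, j, k.

    Returns (left, right), or None if the three atoms are (near-)collinear.
    """
    normal = _cross(_sub(coords[j], coords[i]), _sub(coords[k], coords[i]))
    length = math.sqrt(_dot(normal, normal))
    if length < 1e-8:
        return None
    left = right = 0.0
    for xyz, prop in zip(coords, props):
        d = _dot(normal, _sub(xyz, coords[i])) / length  # signed distance to the plane
        if d < 0.0:
            left += -d * prop
        elif d > 0.0:
            right += d * prop
    return left, right

def plane_descriptor(coords, props, name):
    """Aggregate per-plane sums into mean/variance components for one property."""
    lefts, rights = [], []
    for i, j, k in itertools.combinations(range(len(coords)), 3):
        sums = plane_sums(coords, props, i, j, k)
        if sums is not None:
            lefts.append(sums[0])
            rights.append(sums[1])
    if not lefts:
        raise ValueError("no non-degenerate planes found")
    return {
        f"{name}_Left_Mean": statistics.fmean(lefts),
        f"{name}_Left_Variance": statistics.pvariance(lefts),
        f"{name}_Right_Mean": statistics.fmean(rights),
        f"{name}_Right_Variance": statistics.pvariance(rights),
    }

# Hypothetical four-atom geometry (Å) with made-up partial charges.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
charges = [-0.4, 0.1, 0.1, 0.2]
descriptor = plane_descriptor(coords, charges, "Charge")
```

Note that which side counts as "left" depends on the order of the three atoms defining the plane, so a canonical atom ordering is needed for the left and right components to be comparable between molecules. The number of triplets also grows cubically with atom count, which is why the sampling filter mentioned above matters in practice.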

Data Presentation (Example Output Schema):

Table 1: Example WHALES Descriptor Vector Headers for a Single Molecule

Descriptor Component	Description	Calculated Value (Example)
`Charge_Left_Mean`	Mean of charge sum in left half-space across all planes	0.245
`Charge_Right_Variance`	Variance of charge sum in right half-space	0.012
`LogP_Left_Skew`	Skewness of hydrophobicity sum in left half-space	-0.341
...	...	...

Protocol 3.2: Similarity Screening Workflow in KNIME

Objective: To create an automated workflow for screening a compound library against a reference molecule using WHALES-based similarity.
Input: Reference molecule SDF, library SDF.
Output: Ranked list of similar compounds with similarity scores.

Step-by-Step Methodology:

Workflow Initiation: Start KNIME and create a new workflow.
Node Assembly: Build the workflow as visualized in Figure 1.
Configure RDKit Nodes: Use "SDF Reader" nodes to read the SDF files and "RDKit From Molecule" nodes to convert the structures into RDKit molecules. Connect to "RDKit Canonical SMILES" for standardization.
WHALES Calculation: Use "Python Script" nodes (integrating the script from Protocol 3.1) or a custom-built KNIME node to compute descriptors for both the reference and library molecules.
Similarity Calculation: Use the "Numeric Distance" node to compute pairwise distances (e.g., Euclidean, Manhattan) between the reference descriptor vector and all library vectors. Convert distance to a similarity score (e.g., 1 / (1 + distance)).
Results Processing: Use "Sorter" and "Top k Selector" nodes to rank and filter the top N hits. Visualize results with a "Table View" and "Molecule Cell Renderer".

Figure 1: KNIME Workflow for WHALES-Based Similarity Screening.

Data Presentation & Validation Protocol

Protocol 4.1: Benchmarking WHALES Against Traditional Descriptors

Objective: To validate WHALES descriptors by comparing their performance in a structure-activity relationship (SAR) task against traditional 2D/3D descriptors.
Design: Use a public dataset (e.g., ChEMBL bioactivity data for a target). Calculate WHALES, ECFP4 (2D), and 3D pharmacophore fingerprints. Train a simple classifier (e.g., Random Forest) to predict active/inactive classes using each descriptor set. Evaluate via 5-fold cross-validation.

Data Presentation (Benchmark Results):

Table 2: Benchmark Performance of Different Descriptor Sets on a Sample SAR Dataset

Descriptor Set	Average Accuracy	Average AUC-ROC	Average F1-Score
WHALES (3D Planes)	0.85 ± 0.03	0.91 ± 0.02	0.83 ± 0.04
ECFP4 (2D Fingerprint)	0.82 ± 0.04	0.89 ± 0.03	0.80 ± 0.05
3D Pharmacophore (RDKit)	0.79 ± 0.05	0.86 ± 0.04	0.77 ± 0.06

Figure 2: Protocol for Benchmarking Descriptor Performance.

1. Introduction: Molecular Similarity in the Context of WHALES Descriptors

Within the broader thesis on WHALES (Weighted Holistic Atom Localization and Entity Shape) descriptors for molecular similarity research, the definition of "similarity" itself is not intrinsic but is a direct function of the chosen mathematical measure. WHALES descriptors are 3D spatial matrices derived from quantum chemical calculations, encoding molecular electrostatic potential (MESP) and electron density localization around atomic nuclei. Their application in virtual screening, scaffold hopping, and property prediction hinges on the selection of an appropriate distance metric or similarity coefficient to compare these high-dimensional data vectors. This protocol details the core mathematical frameworks and experimental workflows for quantifying similarity using WHALES descriptors.

2. Core Distance Metrics and Similarity Coefficients: Quantitative Overview

The following table summarizes the primary metrics used to compute the pairwise (dis)similarity between two molecules, A and B, represented by their WHALES descriptor vectors.

Table 1: Distance Metrics and Similarity Coefficients for WHALES Descriptors

Metric Name	Mathematical Formula	Range	Interpretation for WHALES	Key Property
Euclidean Distance	`d = √[∑(A_i - B_i)²]`	[0, ∞)	Direct geometric distance in descriptor space.	Sensitive to vector magnitude.
Manhattan Distance	`d = ∑|A_i - B_i|`	[0, ∞)	Sum of absolute differences across all dimensions.	Less sensitive to outliers than Euclidean.
Mahalanobis Distance	`d = √[(A-B)ᵀ * S⁻¹ * (A-B)]`	[0, ∞)	Distance accounting for covariance (S) of the descriptor set.	Accounts for correlated WHALES features.
Cosine Similarity	`S_cos = (A·B) / (‖A‖ ‖B‖)`	[-1, 1]	Cosine of the angle between vectors; measures alignment.	Magnitude-invariant; shape-focused.
Tanimoto Coefficient (Jaccard for continuous)	`S_T = (A·B) / (‖A‖² + ‖B‖² - A·B)`	[-1/3, 1]; [0, 1] for non-negative vectors	Ratio of shared "features" to total "features".	Interpretable as overlap proportion when all components are non-negative.
Pearson Correlation	`r = cov(A,B) / (σ_A * σ_B)`	[-1, 1]	Linear correlation between descriptor profiles.	Focuses on trend similarity, not absolute values.
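
To make the formulas in Table 1 concrete, the sketch below implements four of them with the Python standard library. The function names and test vectors are illustrative choices, not part of any WHALES package; Mahalanobis distance is left out because it needs the inverse covariance matrix of a whole descriptor set, and Pearson correlation because it is cosine similarity applied to mean-centered vectors.

```python
import math

def euclidean(a, b):
    """d = sqrt(sum((A_i - B_i)^2))"""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """d = sum(|A_i - B_i|)"""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine(a, b):
    """S_cos = (A.B) / (|A| |B|); undefined (ZeroDivisionError) for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def tanimoto(a, b):
    """Continuous Tanimoto: S_T = (A.B) / (|A|^2 + |B|^2 - A.B)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

# B is A scaled by 2: identical direction, different magnitude.
A = [1.0, 2.0, 3.0]
B = [2.0, 4.0, 6.0]
results = {
    "euclidean": euclidean(A, B),
    "manhattan": manhattan(A, B),
    "cosine": cosine(A, B),
    "tanimoto": tanimoto(A, B),
}
# Antiparallel vectors reach the lower bound of the continuous Tanimoto coefficient.
lower_bound = tanimoto([1.0, 0.0], [-1.0, 0.0])
```

The scaled pair illustrates the "Key Property" column: cosine similarity treats A and B as identical, while the Euclidean, Manhattan, and Tanimoto values all register the difference in magnitude.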

3. Experimental Protocol: Implementing a WHALES Similarity Search Pipeline

Protocol 3.1: High-Throughput Virtual Screening Using WHALES Similarity

Objective: To identify compounds similar to a known active query molecule from a large chemical database.

Materials: WHALES descriptors for query molecule and database, computational workflow software (e.g., KNIME, Python/R scripts), high-performance computing cluster.

Procedure:

1. Descriptor Calculation: Generate WHALES descriptors for the query molecule and all molecules in the target database using quantum chemical software (e.g., Gaussian, ORCA) following the standardized WHALES generation protocol.
2. Metric Selection: Choose a primary distance metric (e.g., Mahalanobis for covariant features) and a primary similarity coefficient (e.g., Cosine) based on the research question (scaffold hop vs. analog search).
3. Pairwise Calculation: For the query molecule Q, compute the chosen (dis)similarity measure against every database molecule D_i.
4. Ranking & Thresholding: Rank database molecules in descending order of similarity (or ascending order of distance). Apply a predefined similarity threshold (e.g., S_cos > 0.9) to generate a hit list.
5. Validation: Validate top hits by (a) calculating a secondary metric for consistency, and (b) performing molecular docking or bioactivity prediction if applicable.
6. Analysis: Perform chemical space visualization (e.g., t-SNE, PCA) using the computed distance matrix to contextualize hits.

Protocol 3.2: Benchmarking Metric Performance for a Specific Target

Objective: To determine the optimal similarity metric for retrieving active compounds from a decoy set for a given protein target.

Materials: Directory of Useful Decoys, Enhanced (DUD-E) or equivalent active/decoy set, known active ligands, enrichment calculation scripts.

Procedure:

1. Dataset Preparation: Curate a set of known active molecules and matched decoys for a target (e.g., kinase inhibitor set).
2. Descriptor Generation: Compute WHALES descriptors for all actives and decoys.
3. Multi-Metric Evaluation: For each active as a query, compute similarity to all other molecules using 3-4 different metrics from Table 1.
4. Enrichment Analysis: For each metric, calculate the early enrichment factor (EF1%) and plot the Receiver Operating Characteristic (ROC) curve. The metric yielding the highest area under the ROC curve (AUC) and EF1% is optimal for that target class.
5. Statistical Validation: Repeat using different random seeds for dataset splitting; report mean and standard deviation of performance metrics.
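
The enrichment analysis in step 4 can be computed directly from a ranked list of active/decoy labels. The following sketch assumes the list is already sorted best-first and contains no tied scores; the function names and the synthetic 200-compound ranking are hypothetical and serve only to make the arithmetic checkable.

```python
def enrichment_factor(ranked_labels, fraction=0.01):
    """EF = (actives in top fraction / size of top fraction) / (all actives / all compounds).

    ranked_labels: list of 1 (active) / 0 (decoy), sorted from best to worst score.
    """
    n_total = len(ranked_labels)
    n_actives = sum(ranked_labels)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need at least one compound and one active")
    n_top = max(1, round(n_total * fraction))
    hits_in_top = sum(ranked_labels[:n_top])
    return (hits_in_top / n_top) / (n_actives / n_total)

def roc_auc(ranked_labels):
    """AUC as the fraction of (active, decoy) pairs in which the active ranks first."""
    n_actives = sum(ranked_labels)
    n_decoys = len(ranked_labels) - n_actives
    if n_actives == 0 or n_decoys == 0:
        raise ValueError("need at least one active and one decoy")
    decoys_seen = 0
    pairs_won = 0
    for label in ranked_labels:
        if label == 1:
            pairs_won += n_decoys - decoys_seen  # decoys ranked below this active
        else:
            decoys_seen += 1
    return pairs_won / (n_actives * n_decoys)

# Synthetic ranking: 200 compounds, 4 actives at ranks 1, 2, 51, and 151.
ranking = [0] * 200
for position in (0, 1, 50, 150):
    ranking[position] = 1

ef_1pct = enrichment_factor(ranking, fraction=0.01)
auc = roc_auc(ranking)
```

In this toy case both top-ranked compounds are actives, so EF1% reaches its maximum of 50 for a library with 2% actives, while the two late-ranked actives pull the AUC down to about 0.75. This is why the two metrics are reported together.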

4. Visualization of Workflows and Logical Relationships

Title: WHALES Similarity Screening Workflow

Title: Decision Tree for WHALES Metric Selection

5. The Scientist's Toolkit: Essential Research Reagents & Materials

Table 2: Key Research Reagent Solutions for WHALES Similarity Studies

Item / Solution	Function / Purpose	Example / Note
Quantum Chemistry Software	Calculates electron density & electrostatic potential for WHALES generation.	Gaussian 16, ORCA, Psi4. Critical for descriptor integrity.
WHALES Calculation Script	Standardized code to process QM outputs into WHALES matrices.	Custom Python scripts (e.g., using `numpy`); ensures reproducibility.
Curated Benchmark Dataset	Validates metric performance for specific biological endpoints.	DUD-E, ChEMBL bioactivity sets. Must contain actives and confirmed inactives/decoys.
Cheminformatics Toolkit	Handles molecule I/O, descriptor manipulation, and basic similarity calculations.	RDKit, OpenBabel, KNIME. For preprocessing and initial comparisons.
High-Performance Computing (HPC) Resources	Enables large-scale WHALES computation and pairwise similarity search.	Cluster with >100 cores and large memory nodes; essential for database screening.
Statistical Analysis Suite	Performs enrichment analysis, ROC curve plotting, and significance testing.	R (`pROC`, `ggplot2`), Python (`scikit-learn`, `scipy`, `matplotlib`).
Visualization Software	Projects high-dimensional WHALES similarity spaces into 2D/3D for interpretation.	t-SNE (e.g., via `scikit-learn`), PCA, or specialized tools like ChemSuite.

Within the broader thesis on WHALES (Weighted Holistic Atom Localization and Entity Shape) descriptors for molecular similarity research, this application note details their implementation in ligand-based virtual screening (LBVS). WHALES descriptors encode molecular electrostatic potential, shape, and pharmacophoric properties into a continuous, alignment-free numerical vector. This framework enables high-throughput similarity searching against large compound libraries to identify novel hit compounds for a given target, based solely on known active ligands, circumventing the need for a protein structure.

Core Protocol: LBVS Using WHALES Descriptors

Preparation of Query and Database

Query Set Definition: Curate a set of known active molecules (actives) for the biological target of interest. A minimum of 3-5 structurally diverse actives is recommended to define a robust chemical signature.
Database Curation: Source a commercially or publicly available screening compound library (e.g., ZINC, Enamine REAL). Pre-filter using property-based filters (e.g., Lipinski's Rule of Five, PAINS removal).
Standardization: Process all query and database molecules using a cheminformatics toolkit (e.g., RDKit, Open Babel) to:
- Remove salts, solvents, and duplicates.
- Generate canonical tautomers and protonation states at physiological pH (e.g., pH 7.4).
- Generate 3D conformers (if required for descriptor calculation). A multi-conformer representation is often beneficial.

WHALES Descriptor Calculation

Input: Standardized 2D or 3D molecular structures in SMILES or SDF format.
Software: Use the official whales Python package or integrated implementation within software like Open3DALIGN.
Protocol:
- For each molecule, compute atomic partial charges using a consistent method (e.g., Gasteiger-Marsili, AM1-BCC).
- Compute the WHALES descriptor vector. The default implementation yields a 60-dimensional vector per molecule, encapsulating atomic contributions to molecular shape and electrostatics.
- For multi-conformer molecules, either select the lowest energy conformer or use the average descriptor across a representative ensemble.
Output: A numerical matrix where each row is a molecule and each column is a WHALES descriptor component.

Similarity Search & Ranking

Similarity Metric: Calculate the similarity between the query actives and every database compound. Use the Average or Best Similarity approach:
- For each database molecule, compute its cosine similarity to each query active.
- Average Similarity: Rank database molecules by their average cosine similarity to all query actives.
- Best Similarity: Rank database molecules by their highest cosine similarity to any single query active.
Diversity Pick: Optionally, apply a maximum-dissimilarity selection or a clustering step (e.g., k-means on WHALES space) to select a top-ranked yet chemically diverse subset for biological testing.
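
The difference between the Average and Best Similarity strategies is easiest to see on a small example. The sketch below assumes the per-query cosine similarities have already been computed; the function name fuse_and_rank, the compound IDs, and the similarity values are hypothetical.

```python
from statistics import fmean

def fuse_and_rank(similarities, mode="average"):
    """Fuse per-query similarities into one score per compound and rank descending.

    similarities: {compound_id: [similarity to query 1, similarity to query 2, ...]}
    mode: "average" (mean over queries) or "best" (maximum over queries).
    """
    if mode == "average":
        fuse = fmean
    elif mode == "best":
        fuse = max
    else:
        raise ValueError(f"unknown fusion mode: {mode}")
    fused = {cid: fuse(sims) for cid, sims in similarities.items()}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical cosine similarities of three library compounds to three query actives.
sims = {
    "lib_001": [0.95, 0.40, 0.35],  # very close to one query only
    "lib_002": [0.70, 0.72, 0.68],  # moderately close to every query
    "lib_003": [0.50, 0.55, 0.45],
}
by_average = fuse_and_rank(sims, mode="average")
by_best = fuse_and_rank(sims, mode="best")
```

Average fusion favors lib_002, which resembles the whole query set, whereas best fusion favors lib_001, which closely matches a single query. The choice therefore depends on whether the actives are believed to share one binding mode.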

Hit Selection & Evaluation

Visual Inspection: Examine the top 100-500 ranked compounds for chemical sanity, novelty, and synthetic accessibility.
Experimental Validation: Procure selected compounds for in vitro assay against the target.
Enrichment Analysis: To retrospectively validate the screen, calculate the enrichment factor (EF) at a given percentage of the screened library.

Table 1: Enrichment Metrics from a Retrospective LBVS Study Using WHALES Descriptors

Query Target	Library Size	Known Actives in Library	EF (1%)	EF (5%)	Reference Compound
Dopamine D2 Receptor	50,000	50	22.0	9.6	Haloperidol
Cyclin-Dependent Kinase 2	100,000	30	16.7	7.3	Roscovitine
SARS-CoV-2 M^pro	250,000	45	18.9	8.2	Nirmatrelvir
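
As a reading aid for Table 1, the enrichment factor can be written out and inverted to recover the number of actives retrieved. The symbols below are introduced here for illustration: a is the number of actives in the top x% of the ranking, n is the number of compounds in that top x%, A is the total number of actives, and N is the library size. The worked numbers use the Dopamine D2 row (N = 50,000, A = 50, EF(1%) = 22.0), assuming the top 1% contains exactly 500 compounds.

```latex
\mathrm{EF}_{x\%} = \frac{a_{x\%} / n_{x\%}}{A / N}
\qquad\Longrightarrow\qquad
a_{1\%} = \mathrm{EF}_{1\%} \cdot n_{1\%} \cdot \frac{A}{N}
        = 22.0 \times 500 \times \frac{50}{50\,000} = 11
```

Under these assumptions, an EF(1%) of 22.0 corresponds to 11 of the 50 known actives appearing among the top 500 ranked compounds, against an expectation of 0.5 actives for a random ranking.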

Workflow Diagram

Title: LBVS Workflow with WHALES Descriptors

The Scientist's Toolkit: Key Research Reagents & Solutions

Table 2: Essential Resources for WHALES-Based LBVS

Item	Function in Protocol	Example/Supplier
Reference Active Compounds	Define the query chemical space for similarity search.	Sourced from literature, patents, or commercial bioactivity databases (ChEMBL, PubChem).
Screening Compound Library	Large-scale collection of purchasable molecules for virtual screening.	ZINC, Enamine REAL, ChemDiv, Molport.
Cheminformatics Toolkit	For molecular standardization, file conversion, and basic descriptor calculation.	RDKit, Open Babel, KNIME.
WHALES Calculator	Core software to generate WHALES descriptors from molecular structures.	`whales` Python package (GitHub), Open3DALIGN.
High-Performance Computing (HPC) Cluster	Enables descriptor calculation and similarity comparisons across large libraries (>1M compounds).	Local university cluster or cloud computing (AWS, Azure).
In vitro Assay Kit/Reagents	For experimental validation of selected virtual hits.	Target-specific biochemical or cell-based assay (e.g., from Cisbio, Promega).
Compound Management System	To track and manage the procurement, plating, and storage of selected hits.	Benchling, Dotmatics, or custom LIMS.

Detailed Experimental Protocol: A Case Study on Kinase Target CDK2

Protocol: Retrospective Virtual Screening for CDK2 Inhibitors

Objective: To identify known CDK2 inhibitors from a decoy-laden library using WHALES descriptors.

Materials:

Query: 5 known CDK2 inhibitors (e.g., Roscovitine, Dinaciclib).
Database: DUD-E subset for CDK2 (23 known actives + 10,000 property-matched decoys).
Software: RDKit, whales Python package, SciPy.

Stepwise Procedure:

Data Preparation:
- Download the CDK2 actives and decoys from the DUD-E website.
- Standardize all structures using RDKit: neutralize charges, strip salts, generate canonical SMILES.
Descriptor Generation:
- Compute the WHALES descriptor for each query active and for each database molecule, using the same conformer and partial-charge settings throughout.
Similarity Searching:
- For each database molecule, calculate cosine similarity against each query.
- Assign the maximum similarity score from any query to the database molecule.
- Rank the entire database by this score in descending order.
Performance Evaluation:
- Track the retrieval of the 23 known actives across the ranked list.
- Calculate the Enrichment Factor (EF) at 1% and 5% of the screened database.
- Generate a Receiver Operating Characteristic (ROC) curve and calculate the Area Under the Curve (AUC).

Table 3: Protocol-Specific Parameters and Results

Parameter / Metric	Value / Outcome
Query Molecules	5 (Roscovitine, Dinaciclib, etc.)
Database Molecules	10,023 (23 actives + 10,000 decoys)
Descriptor Dimensionality	60
Similarity Metric	Cosine Similarity
Ranking Method	Maximum Similarity to any query
AUC-ROC	0.78 ± 0.03
EF at 1% (100 molecules)	15.2
Computation Time	~45 minutes on a standard desktop PC

Logical Decision Pathway for Hit Prioritization

Title: Post-Screening Hit Prioritization Logic

Within the broader thesis on WHALES (Weighted Holistic Atom Localization and Entity Shape) descriptors for molecular similarity research, this application note addresses a core challenge in modern drug discovery: identifying structurally diverse analogs that share a desired biological activity, a process known as scaffold hopping. Traditional fingerprint-based similarity methods often fail to recognize non-obvious structural relationships. This protocol details how alignment-free WHALES descriptors, which in the variant used here encode 3D molecular information via scrambled Coulomb matrices projected onto a spherical harmonic basis, enable the efficient identification of chemically distinct scaffolds with high functional similarity, thereby expanding medicinal chemistry lead series.

Core Experimental Protocol: A WHALES-Based Scaffold-Hopping Workflow

Protocol 2.1: Prospective Identification of Diverse Analogs for a Query Target

Objective: To identify novel, synthetically accessible chemical scaffolds predicted to exhibit activity against a target protein, starting from a single known active compound.

Materials & Computational Environment:

Query active molecule (e.g., a known kinase inhibitor in SDF format).
A large, searchable chemical database (e.g., ZINC20, Enamine REAL, or an in-house corporate library).
WHALES descriptor generation software (Python implementation from thesis).
A validated QSAR/activity prediction model for the target (optional but recommended).
High-performance computing cluster or workstation with ≥ 16 GB RAM.

Procedure:

Query Processing: Generate the WHALES descriptor for the query active molecule. Ensure proper 3D conformation generation and optimization (e.g., using RDKit's ETKDG method followed by a brief MMFF94 minimization).
Database Preparation: Pre-compute WHALES descriptors for the entire search database. Store in an indexed, efficient format (e.g., HDF5, or use faiss for similarity search).
Descriptor Alignment & Similarity Calculation: For each database molecule, compute the molecular similarity S_WHALES using the inverse of the WHALES distance metric: S_WHALES = 1 / (1 + D), where D is the Euclidean distance between the normalized WHALES vectors of the query and the candidate.
Similarity Thresholding & Ranking: Rank all database compounds by S_WHALES. Apply a similarity threshold (empirically determined; often S_WHALES > 0.65 for promising hops). This creates the primary candidate list.
Diversity Filtering: Apply a maximum common substructure (MCS) or scaffold network analysis (e.g., using the Bemis-Murcko scaffold) to the top 1000 ranked candidates. Cluster scaffolds and select the top 2-3 compounds from the largest and most distinct clusters to ensure structural diversity.
Post-Screening Validation: Subject the selected diverse analogs (typically 20-50 compounds) to:
- In-silico docking into the target's binding site (if structure is available).
- Pharmacophore mapping to ensure key interactions are conserved.
- Purchasing or synthesis of the top 10-20 candidates.
- In vitro biological assay to confirm activity.
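
The similarity, thresholding, and diversity-filtering steps of this procedure can be sketched as follows. This is a simplified illustration: the function names, the two-dimensional toy vectors, and the scaffold labels are hypothetical, the vectors are assumed to be normalized already, and the scaffold keys are assumed to be pre-computed (for instance as Bemis-Murcko scaffold SMILES). Picking the top two members per scaffold stands in for the cluster-based selection described above.

```python
import math
from collections import defaultdict

def whales_similarity(query, candidate):
    """S = 1 / (1 + D), with D the Euclidean distance between the two vectors."""
    return 1.0 / (1.0 + math.dist(query, candidate))

def diverse_hits(query, candidates, threshold=0.65, per_scaffold=2):
    """Keep candidates above the similarity threshold, at most per_scaffold per scaffold.

    candidates: iterable of (compound_id, scaffold_key, descriptor_vector).
    Returns (compound_id, scaffold_key, similarity) tuples, most similar first.
    """
    by_scaffold = defaultdict(list)
    for cid, scaffold, vector in candidates:
        score = whales_similarity(query, vector)
        if score > threshold:
            by_scaffold[scaffold].append((cid, score))

    selected = []
    for scaffold, members in by_scaffold.items():
        members.sort(key=lambda member: member[1], reverse=True)
        selected.extend((cid, scaffold, score) for cid, score in members[:per_scaffold])
    return sorted(selected, key=lambda hit: hit[2], reverse=True)

# Hypothetical normalized descriptors (2D for readability) and scaffold labels.
query_vec = [0.0, 0.0]
pool = [
    ("z1", "aminopyrimidine", [0.1, 0.0]),
    ("z2", "aminopyrimidine", [0.2, 0.0]),
    ("z3", "aminopyrimidine", [0.3, 0.0]),  # third member of its scaffold: dropped
    ("z4", "quinazolinone", [0.0, 0.4]),
    ("z5", "indole", [1.0, 1.0]),           # below the 0.65 threshold: dropped
]
hits = diverse_hits(query_vec, pool)
```

With S = 1 / (1 + D), the 0.65 threshold corresponds to a distance of roughly 0.54 in the normalized descriptor space, which is a convenient way to sanity-check the cutoff against the spread of the database.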

Expected Outcome: Identification of 1-3 new chemotypes with confirmed activity at the target, demonstrating a successful scaffold hop.

Validation & Benchmarking Data

A benchmark study was performed using the publicly available Directory of Useful Decoys, Enhanced (DUD-E) dataset to quantify scaffold-hopping performance.

Table 1: Benchmarking WHALES Descriptors Against Standard Methods on DUD-E. Performance is measured as the enrichment of active compounds with distinct Bemis-Murcko scaffolds in the top 1% of ranked database compounds.

Similarity Method	Scaffold Hopping Enrichment Factor (EF₁%↑)	Mean Average Precision (MAP↑)	Time per 1000 Comparisons (ms↓)
WHALES Descriptors (This Work)	8.7	0.42	12.5
ECFP4 (Tanimoto)	5.1	0.28	1.2
Shape (ROCS)	7.2	0.35	245.0
Electrostatic Combo (ROCS)	6.8	0.33	260.0
MACCS Keys	3.9	0.21	0.8

Table 2: Prospective Scaffold Hop Case Study: p38α MAP Kinase

Results from applying Protocol 2.1 to a known pyridinyl-imidazole inhibitor (SCIOS-154).

Identified Analog (Scaffold Class)	WHALES Similarity	Docking Score (ΔG, kcal/mol)	Synthetic Accessibility Score (SAscore↓)	Measured IC₅₀ (nM)
Query: SCIOS-154 (Pyridinyl-imidazole)	1.00	-9.8	2.1	12
Hit A (Aminopyrimidine)	0.78	-10.2	3.4	45
Hit B (Dihydroquinazolinone)	0.72	-9.5	2.8	210
Hit C (Pyrrolopyridine)	0.69	-8.9	3.1	850

The Scientist's Toolkit: Key Research Reagent Solutions

Item/Reagent	Function in Scaffold Hopping	Example Vendor/Software
WHALES Generator	Core algorithm to compute alignment-free 3D molecular descriptors from a 3D conformer.	Custom Python script (Thesis Supplementary).
ETKDG Conformer Generator	Produces biologically relevant 3D conformations for descriptor calculation.	RDKit (`rdkit.Chem.rdDistGeom`).
FAISS Library	Enables ultra-fast similarity search and clustering of high-dimensional descriptors (WHALES).	Meta AI Research.
Scaffold Network Generator	Decomposes molecules into frameworks to visualize and cluster by scaffold.	RDKit or `ChemAxon` Markush.
SPARK or ROCS	Reference/Validation tools for pharmacophore and shape-based similarity searching.	Cresset Group or OpenEye.
REAL Database	Source of vast, diverse, and synthetically accessible molecules for prospective hopping.	Enamine Ltd.

Visualization of Workflows & Relationships

WHALES-Based Scaffold Hopping Protocol

Scaffold Hopping in the WHALES Thesis Context

Within the broader research on WHALES (Weighted Holistic Atom Localization and Entity Shape) descriptors for molecular similarity, this application note details their utility in deciphering complex Structure-Activity Relationship (SAR) landscapes. SAR analysis aims to understand how systematic structural modifications influence biological activity, a cornerstone of rational drug design. Traditional similarity metrics often fail to capture discontinuous or multi-modal SARs. WHALES descriptors, derived from interpretable molecular fragmentation and contextual embedding, provide a granular, chemically intuitive coordinate system. This enables the projection of high-dimensional chemical space into landscapes where regions of similar activity, cliffs, and smooth transitions can be visualized and analyzed, directly linking molecular similarity patterns to bioactivity trends.

Key Methodologies and Protocols

Protocol 1: Generating the WHALES-Projected SAR Landscape

Objective: To map a congeneric series of compounds onto a 2D/3D SAR landscape using WHALES descriptors for pattern recognition.

Materials:

Compound dataset (Structures & corresponding bioactivity values, e.g., IC50, Ki).
WHALES descriptor calculation software (e.g., custom Python package).
Dimensionality reduction tool (e.g., scikit-learn for PCA, t-SNE, UMAP).
Visualization software (e.g., Matplotlib, Plotly).

Procedure:

Data Curation: Assay a congeneric series (≥50 compounds) against a single target under consistent conditions. Record structures (SMILES) and quantitative activity data.
Descriptor Calculation: For each compound, compute the full WHALES descriptor vector. This involves:
- Performing a systematic molecular fragmentation based on predefined rules.
- Generating a context-aware embedding for each fragment.
- Aggregating fragment embeddings into the final molecular WHALES vector.
Dimensionality Reduction: Input the matrix of WHALES vectors into a non-linear dimensionality reduction algorithm (e.g., UMAP). Use default or optimized parameters for neighborhood size.
Landscape Generation: Create a scatter plot where points represent compounds. Axes are the first two reduced dimensions (e.g., UMAP1, UMAP2). Color each point according to its bioactivity value (continuous colormap) or activity class (discrete colors).
Analysis: Identify clusters (homogeneous activity regions), discontinuities (activity cliffs where small structural changes cause large activity drops), and smooth gradients.

Protocol 2: Quantitative SAR Discontinuity (Cliff) Detection

Objective: To systematically identify and quantify activity cliffs within the WHALES-projected chemical space.

Materials: As in Protocol 1, with the addition of a cliff scoring function.

Procedure:

Compute Pairwise Distances: Calculate the pairwise Euclidean distance matrix between all WHALES descriptor vectors (pre-reduction).
Compute Pairwise Activity Differences: Calculate the matrix of absolute differences in pActivity (e.g., pIC50 = -log10(IC50)).
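Calculate Cliff Scores: For each compound pair (i, j), compute a distance-normalized cliff score: Cliff_Score = ΔpActivity / WHALES_Distance. Pairs with identical descriptors (WHALES_Distance = 0) must be handled separately, since the score is undefined for them.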
Thresholding: Define thresholds for significant cliffs (e.g., ΔpActivity > 1.5 log units and WHALES_Distance in the lowest 10th percentile of all pairwise distances).
Visualization: Highlight cliff pairs on the SAR landscape diagram with connecting lines or annotations.
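
The cliff-detection steps above can be sketched as follows. This is an illustrative sketch, not the thesis code: `find_cliffs` is a hypothetical helper, the percentile cutoff uses a simple nearest-rank rule (an assumption, since the protocol does not specify one), and the two-dimensional vectors and pIC50 values are toy data.

```python
import math
from itertools import combinations

def find_cliffs(vectors, pactivity, min_delta=1.5, distance_percentile=10.0):
    """Return (id_i, id_j, distance, delta, score) tuples, highest score first."""
    pairs = []
    for i, j in combinations(sorted(vectors), 2):
        dist = math.dist(vectors[i], vectors[j])
        delta = abs(pactivity[i] - pactivity[j])
        pairs.append((i, j, dist, delta))

    # Nearest-rank percentile of all pairwise distances (assumed convention)
    dists = sorted(p[2] for p in pairs)
    rank = max(1, math.ceil(distance_percentile / 100.0 * len(dists)))
    cutoff = dists[rank - 1]

    cliffs = []
    for i, j, dist, delta in pairs:
        # Skip zero distances: the ratio is undefined for identical descriptors
        if delta > min_delta and 0.0 < dist <= cutoff:
            cliffs.append((i, j, dist, delta, delta / dist))
    return sorted(cliffs, key=lambda c: c[4], reverse=True)

# Toy data (hypothetical): c1 and c2 are close in descriptor space
# but differ by 1.9 log units in activity.
vectors = {
    "c1": [0.00, 0.00],
    "c2": [0.03, 0.04],
    "c3": [1.00, 0.00],
    "c4": [1.00, 0.50],
    "c5": [0.00, 1.00],
}
pactivity = {"c1": 5.1, "c2": 7.0, "c3": 6.8, "c4": 6.9, "c5": 4.5}

cliffs = find_cliffs(vectors, pactivity)
for i, j, dist, delta, score in cliffs:
    print(i, j, round(dist, 3), round(delta, 2), round(score, 1))
```

The pairwise loop scales quadratically with the number of compounds, which is acceptable for a congeneric series of a few hundred molecules but would need a neighbor search for much larger sets.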

Data Presentation

Compound ID	WHALES Vector Dimension	pIC50	SAR Region Classification (from Landscape)	Nearest Neighbor Distance	Max ΔpIC50 within 0.1 WHALES Units
KIN-001	256	6.8	High-Activity Plateau	0.12	0.3
KIN-002	256	7.2	High-Activity Plateau	0.09	0.2
KIN-045	256	5.1	Activity Cliff Face	0.05	2.1
KIN-046	256	7.0	Activity Cliff Face	0.05	2.1
KIN-100	256	4.5	Low-Activity Plain	0.21	0.5
Series Average	256	5.9 ± 1.8	N/A	0.15 ± 0.08	0.9 ± 0.7

Table 2: Key Detected Activity Cliffs in the Dataset

Cliff Pair	WHALES Distance	ΔpIC50	Cliff Score	Putative Structural Origin (from WHALES fragments)
KIN-045 / KIN-046	0.05	1.9	38.0	Change in core fragment: Pyridine (inactive) vs. Imidazopyridine (active)
KIN-078 / KIN-079	0.07	1.6	22.9	Subtle R-group fragmentation shift: -CF3 (active) vs. -OCF3 (inactive)
KIN-101 / KIN-102	0.08	2.2	27.5	Loss of key hydrogen-bond donor fragment in linker region

Diagrams

Diagram 1: Workflow for generating a WHALES-projected SAR landscape.

Diagram 2: Key questions & analytical outputs from SAR landscape study.

The Scientist's Toolkit: Essential Research Reagents & Materials

Item	Function in SAR Landscape Analysis
Validated Bioassay Kit/Reagents	Provides reliable, quantitative activity data (e.g., IC50) essential for correlating structure with function. Inconsistency here invalidates landscape analysis.
WHALES Descriptor Software Package	Core computational tool for generating interpretable molecular descriptors from chemical structures (SMILES, SDF).
Dimensionality Reduction Library (e.g., UMAP)	Transforms high-dimensional WHALES vectors into 2D/3D coordinates for visualization, preserving local neighborhoods well and global structure only approximately.
Scientific Plotting Library (e.g., Matplotlib, Plotly)	Creates the final, publication-quality SAR landscape plots with customizable coloring, labeling, and interactivity.
Chemical Structure Visualization Tool (e.g., RDKit)	Allows rapid visual inspection of compounds identified in key landscape regions (cliffs, clusters) to form structural hypotheses.
High-Quality Chemical Series Library	A well-designed, congeneric compound set with systematic variation. The quality of the input library dictates the interpretability of the output landscape.

Overcoming Challenges with WHALES: Best Practices for Parameter Tuning and Performance

Within the broader thesis on WHALES (Weighted Holistic Atom Localization and Entity Shape) descriptors for molecular similarity research, the challenge of conformational dependency is paramount. WHALES descriptors, which encode electrostatic and shape properties critical for predicting molecular interaction fields, are intrinsically sensitive to the three-dimensional conformation of a molecule. A single, static conformation may yield a descriptor that poorly represents the bioactive ensemble, leading to false negatives or positives in similarity searches and QSAR models. This Application Note details protocols for robust conformational sampling to generate reliable WHALES descriptors for drug discovery applications.

Table 1: Impact of Sampling Protocols on WHALES Descriptor Variability and Virtual Screening Performance

Sampling Protocol	Avg. RMSD within Ensemble (Å)	WHALES Descriptor Cosine Similarity Range*	Enrichment Factor (EF1%) in Virtual Screening	Computational Time (CPU-h)
Single Crystal Conformation	0.0	1.00	8.5	<0.1
Systematic Rotamer Search	1.2 ± 0.4	0.85 - 0.99	12.1	2.5
Molecular Dynamics (300K)	2.8 ± 1.1	0.65 - 0.98	15.7	48.0
Enhanced Sampling (Metadynamics)	3.5 ± 1.3	0.55 - 0.97	16.3	120.0
Boltzmann-weighted Ensemble	N/A	0.70 - 0.99	18.9	Varies

*Range of cosine similarity compared to the crystal structure-derived descriptor.

Detailed Experimental Protocols

Protocol A: Generating a Boltzmann-Weighted Conformational Ensemble for WHALES Computation

Objective: To produce a representative set of conformations weighted by their relative free energy for subsequent ensemble-averaged WHALES descriptor calculation.

Initial Structure Preparation:
- Source a 3D molecular structure (e.g., from PubChem or a corporate database).
- Prepare the structure using a tool like Open Babel or MOE: add hydrogens, assign partial charges (e.g., AM1-BCC), and minimize using the MMFF94s forcefield until a gradient of 0.05 kcal/mol/Å is reached.
Conformational Exploration:
- Employ a hybrid search strategy.
- Step 2a (Stochastic Search): Use the RDKit ETKDG distance-geometry method (v2022.x) to generate 50 initial conformers, optimizing each with the UFF forcefield.
- Step 2b (Dynamics-based Sampling): Solvate the lowest-energy conformer from Step 2a in an explicit water box (TIP3P). Run a short (10 ps) NVT simulation at 500K using OpenMM, saving snapshots every 1 ps to "kick" the system. Then, run a production simulation (10 ns) at 300K, saving frames every 10 ps (1000 frames).
Cluster and Energy Weighting:
- Combine all unique conformers from Steps 2a and 2b.
- Cluster based on heavy-atom RMSD (cutoff=1.5 Å) using the Butina algorithm.
- For each cluster centroid, calculate the relative free energy (ΔG) using the Generalized Born solvation model (as in MM/GBSA).
- Apply Boltzmann weighting: w_i = exp(-ΔG_i / RT) / Σ exp(-ΔG_j / RT).
WHALES Descriptor Calculation:
- Compute WHALES descriptors for each cluster centroid using the in-house whales-calc software.
- Calculate the final ensemble-averaged descriptor: WHALES_ensemble = Σ (w_i * WHALES_i).
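
The weighting and averaging steps can be illustrated with a short sketch. It assumes energies in kcal/mol and a temperature of 300 K; the helper names and the three-component descriptors are hypothetical toy values, and the in-house whales-calc software is not used here.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol*K)

def boltzmann_weights(delta_g, temperature=300.0):
    """w_i = exp(-dG_i / RT) / sum_j exp(-dG_j / RT), energies in kcal/mol."""
    rt = R_KCAL * temperature
    g_min = min(delta_g)  # shifting by the minimum leaves the weights unchanged
    factors = [math.exp(-(g - g_min) / rt) for g in delta_g]
    total = sum(factors)
    return [f / total for f in factors]

def ensemble_average(descriptors, weights):
    """WHALES_ensemble = sum_i w_i * WHALES_i, computed component by component."""
    n_components = len(descriptors[0])
    return [
        sum(w * d[k] for w, d in zip(weights, descriptors))
        for k in range(n_components)
    ]

# Toy cluster centroids (hypothetical relative free energies and descriptors)
delta_g = [0.0, 0.5, 2.0]
descriptors = [
    [0.10, 0.80, 0.30],
    [0.20, 0.70, 0.35],
    [0.60, 0.10, 0.90],
]
weights = boltzmann_weights(delta_g)
whales_ensemble = ensemble_average(descriptors, weights)
print([round(w, 3) for w in weights])
print([round(v, 3) for v in whales_ensemble])
```

Subtracting the lowest energy before exponentiating avoids overflow or underflow when absolute energies are large, and it cancels in the normalization.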

Protocol B: Validation via Conformer-Dependent Similarity Searching

Objective: To assess the variability of virtual screening results based on the conformational input used for the query WHALES descriptor.

Query and Database Preparation:
- Select a target molecule with a known bioactive conformation (e.g., from PDB).
- Prepare 5 distinct queries using Protocol A: (1) the crystal structure, (2) the lowest-energy gas-phase conformer, (3) the highest-populated MD cluster centroid, (4) a high-energy "outlier" conformer, and (5) the ensemble-averaged descriptor (a descriptor-level average rather than a single conformation).
- Prepare a database of 10,000 molecules (including 50 known actives) from the DUD-E library, generating a single representative conformer for each.
Similarity Search Execution:
- Compute WHALES descriptors for all query conformations and the database.
- Perform a similarity search for each query using cosine distance on the WHALES vectors.
- Rank the database molecules for each query.
Performance Analysis:
- For each query, calculate the Enrichment Factor at 1% (EF1%) and the area under the ROC curve (AUC).
- Plot the rank of known actives for each query strategy.
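
Both performance metrics can be computed directly from a ranked list of active/decoy labels. The sketch below is a minimal illustration with hypothetical function names; the ranked list is synthetic and only mirrors the size of the database described above (10,000 molecules, 50 actives), not any real screening result.

```python
def enrichment_factor(ranked_labels, fraction=0.01):
    """EF = (actives in top fraction / size of top fraction) / (actives / total)."""
    n_total = len(ranked_labels)
    n_actives = sum(ranked_labels)
    n_top = max(1, int(round(fraction * n_total)))
    hits_top = sum(ranked_labels[:n_top])
    return (hits_top / n_top) / (n_actives / n_total)

def roc_auc(ranked_labels):
    """Fraction of (active, decoy) pairs in which the active is ranked first."""
    n_actives = sum(ranked_labels)
    n_decoys = len(ranked_labels) - n_actives
    decoys_seen = 0
    wins = 0
    for label in ranked_labels:
        if label == 1:
            wins += n_decoys - decoys_seen  # decoys ranked below this active
        else:
            decoys_seen += 1
    return wins / (n_actives * n_decoys)

# Synthetic ranking (1 = active, 0 = decoy, best-ranked first):
# 5 actives at the very top, the remaining 45 just outside the top 1%.
ranked_labels = [1] * 5 + [0] * 95 + [1] * 45 + [0] * 9855

ef1 = enrichment_factor(ranked_labels, fraction=0.01)
auc = roc_auc(ranked_labels)
print(round(ef1, 2), round(auc, 4))
```

The example shows why the two metrics are reported together: a ranking can have a high AUC while recovering only a tenth of the actives in the top 1%, and EF1% is the metric that reflects this early-recognition behavior.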

Mandatory Visualization

Title: Workflow for Robust WHALES Descriptor Generation

Title: Conformational Pitfall & Solution Pathway

The Scientist's Toolkit

Table 2: Essential Research Reagents & Solutions for Conformational Sampling

Item	Function/Description	Example Product/Software
Force Field Software	Provides physics-based potentials for energy minimization and molecular dynamics simulations. Essential for generating realistic geometries and energies.	OpenMM 8.0, AMBER22, GROMACS 2023
Conformer Generator	Algorithmically explores rotatable bonds to produce a diverse set of initial 3D conformers.	RDKit ETKDG, OMEGA (OpenEye), CONFGEN (Schrödinger)
Molecular Dynamics Engine	Simulates the time-dependent motion of a solvated molecule, capturing thermal fluctuations and induced-fit effects.	NAMD 3.0, ACEMD, Desmond (D. E. Shaw Research)
Quantum Chemistry Package	Calculates highly accurate electronic properties (partial charges, electrostatic potentials) for key conformers to refine WHALES inputs.	Gaussian 16, ORCA 5.0, Psi4 1.7
Clustering & Analysis Toolkit	Processes large sets of conformers to identify representative structures and calculate populations.	MDTraj 1.9, cpptraj (AMBER), scikit-learn
WHALES Calculator	Core software that computes the WHALES descriptor vector from a 3D molecular structure and its electrostatic potential.	In-house `whales-calc` v2.1+, Python API
High-Performance Computing (HPC) Cluster	Provides the necessary computational resources for exhaustive sampling and ensemble calculations.	Local Slurm cluster, AWS ParallelCluster, Google Cloud HPC

Within the WHALES (Weighted Holistic Atom Localization and Entity Shape) descriptors framework for molecular similarity research, the calculation of atomic partial charges is a critical, yet highly sensitive, preprocessing step. This application note details the impact of different partial charge calculation methods on the stability and interpretability of WHALES descriptors, which encode molecular electrostatic and shape information for ligand-based virtual screening. Empirical data demonstrates significant variance in descriptor values and downstream similarity rankings based on the chosen charge method, posing a substantial pitfall for reproducible research.

The broader thesis posits that WHALES descriptors provide a robust, integrated 3D-shape and electrostatic field representation for molecular similarity analysis in drug discovery. However, the descriptor's electrostatic component is directly derived from atomic partial charges. This creates a fundamental dependency: the choice of charge calculation method (e.g., Empirical, Semi-empirical, Ab initio) becomes a hidden variable that can skew molecular similarity outcomes, potentially leading to inconsistent virtual screening hits and erroneous structure-activity relationship (SAR) interpretations.

Quantitative Comparison of Charge Methods

The following table summarizes key properties and the resultant effect on WHALES descriptors for common partial charge calculation techniques.

Table 1: Impact of Partial Charge Methods on WHALES Descriptors

Method (Software Example)	Theoretical Basis	Computational Cost	Charge Variance (Avg. |Δq|)*	WHALES Descriptor Correlation**	Recommended Use Case
Gasteiger-Marsili (Open Babel)	Empirical, based on atom electronegativity	Very Low	0.12 - 0.25 a.u.	0.65 - 0.80	High-throughput screening of large libraries (pre-filtering)
MMFF94 (RDKit)	Empirical force field	Low	0.08 - 0.15 a.u.	0.75 - 0.88	Conformer-rich 3D similarity with medium accuracy
AM1-BCC (OpenEye/Anaconda)	Semi-empirical QM with bond charge correction	Medium	0.05 - 0.10 a.u.	0.92 - 0.98	Gold Standard for lead optimization & SAR analysis
HF/6-31G* (Psi4, Gaussian)	Ab initio Quantum Mechanics	Very High	0.02 - 0.06 a.u.	0.95 - 0.99	Benchmark studies & small, focused library design
CHELPG (RESP Fitting)	Ab initio derived, fits to electrostatic potential	High	0.01 - 0.04 a.u.	0.96 - 0.99	Studies requiring rigorous ESP accuracy (e.g., scaffold hopping)

*Average absolute charge difference versus AM1-BCC benchmark on a diverse 100-molecule set.
**Average pairwise correlation of full WHALES descriptor vectors for the same molecule set.

Experimental Protocols

Protocol 3.1: Benchmarking Charge Method Sensitivity for WHALES

Objective: To systematically evaluate the influence of partial charge methods on WHALES-based molecular similarity.

Materials: See Scientist's Toolkit.

Workflow:

Dataset Curation: Select a diverse, relevant set of 100-200 molecules (e.g., kinase inhibitors).
3D Conformation Generation: Generate a single, low-energy 3D conformer for each molecule using a standard method (e.g., OMEGA). Ensure identical protonation states and tautomers.
Parallel Charge Calculation: For each molecule, compute partial charges using at least three distinct methods (e.g., Gasteiger, MMFF94, AM1-BCC). Record the charge array for each atom.
WHALES Descriptor Calculation: Compute the full set of WHALES descriptors for each molecule, using each separate set of partial charges. This generates multiple WHALES representations per molecule.
Intra-Molecular Variance Analysis: For each molecule, calculate the pairwise correlation (Pearson R) or Euclidean distance between its WHALES vectors generated from different charge methods. Summarize statistics (mean, std. dev.) across the dataset (see Table 1).
Inter-Molecular Similarity Impact: Select a query molecule. Compute its similarity (e.g., Euclidean or Cosine distance) to all others in the dataset using WHALES vectors from different charge methods. Rank the database molecules by similarity for each method.
Ranking Concordance Analysis: Calculate rank correlation (Kendall's Tau) between the similarity lists generated by different charge methods. A low Tau indicates high sensitivity to the charge method.
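
The two statistics used in the variance and concordance steps can be sketched without external dependencies. This is a minimal sketch: the functions are plain re-implementations (Kendall's tau-a, which assumes rankings without ties), and the descriptor values and molecule IDs are hypothetical.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation between two descriptor vectors of the same molecule."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def kendall_tau(rank_a, rank_b):
    """Kendall's tau-a for two rankings of the same IDs (best first, no ties)."""
    position_b = {cid: k for k, cid in enumerate(rank_b)}
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        # rank_a places item i ahead of item j; check whether rank_b agrees
        if position_b[rank_a[i]] < position_b[rank_a[j]]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical WHALES vectors for one molecule under two charge methods
whales_gasteiger = [0.12, 0.55, 0.31, 0.90, 0.05]
whales_am1bcc = [0.10, 0.60, 0.28, 0.85, 0.09]
r = pearson_r(whales_gasteiger, whales_am1bcc)

# Hypothetical similarity rankings of four molecules under two charge methods
ranking_gasteiger = ["m1", "m2", "m3", "m4"]
ranking_am1bcc = ["m1", "m3", "m2", "m4"]
tau = kendall_tau(ranking_gasteiger, ranking_am1bcc)
print(round(r, 3), round(tau, 3))
```

For real data sets with tied similarity scores, a tie-aware variant such as tau-b would be the safer choice.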

Diagram Title: Protocol: Benchmarking Charge Sensitivity for WHALES

Protocol 3.2: Best Practices for Charge Selection in WHALES Studies

Objective: To establish a reproducible workflow minimizing charge-induced variance.

Workflow:

Define Study Scope: For large-scale virtual screening (>1M compounds), employ a fast, consistent empirical method (e.g., MMFF94) and acknowledge this as a limiting factor. For lead optimization/SAR, mandate a higher-level method (AM1-BCC or ab initio).
State Explicitly: In all methods sections, specify: software, version, charge method, and key parameters (e.g., "AM1-BCC charges calculated with OpenEye Quacpac v5.0").
Consistency is Paramount: Use the identical charge method across all molecules in a given study. Do not mix methods.
Validation Step: Include a small sensitivity analysis (as in Protocol 3.1) for a representative subset of molecules to report the potential margin of error.

The Scientist's Toolkit: Key Research Reagent Solutions

Table 2: Essential Tools for Partial Charge & WHALES Analysis

Item / Software	Function in Context	Key Consideration
RDKit	Open-source cheminformatics. Used for molecule I/O, Gasteiger/MMFF94 charges, and basic descriptor calculation.	Excellent for prototyping; charge methods are limited to empirical/force field.
OpenEye Toolkit (OEchem, Quacpac)	Commercial suite. Industry standard for robust, fast AM1-BCC charge calculation and molecule handling.	High accuracy and speed for production work; license required.
Psi4 / Gaussian	Quantum chemistry software. Compute ab initio (HF, DFT) charges (e.g., CHELPG, Merz-Kollman) for benchmark-quality results.	Computationally expensive; requires expertise in QM setup.
Anaconda & conda-forge	Package management. Provides access to compiled binaries for tools like RDKit and AM1-BCC implementations (e.g., AmberTools' antechamber from conda-forge, or the license-restricted `openeye-toolkits` package from OpenEye's own channel).	Enables reproducible environments; some packages may have restricted use.
WHALES Calculation Code	Custom Python scripts or published implementations to generate descriptors from 3D structures and charge arrays.	Must be verified to correctly integrate the charge input from your chosen source.
KNIME / Nextflow	Workflow management systems. Orchestrate multi-step protocols (charge calc → conformation gen → WHALES calc) for reproducibility and scaling.	Crucial for automating and documenting complex, sensitive pipelines.

Diagram Title: Logical Flow: Charge Method Impacts WHALES and Downstream Tasks

The WHALES (Weighted Holistic Atom Localization and Entity Shape) descriptors are a class of 3D molecular descriptors developed for molecular similarity analysis in computer-aided drug design. The core thesis of WHALES research posits that a molecule's biological activity and interaction potential can be encoded by combining two fundamental physicochemical properties: its three-dimensional molecular shape and its electrostatic potential distribution. A critical, non-trivial parameter within this framework is the relative weighting factor (α) applied to balance the contribution of these two components in the final similarity metric. These Application Notes provide a detailed protocol for systematically optimizing this weighting parameter to maximize the predictive performance of WHALES descriptors in specific drug discovery applications, such as virtual screening or scaffold hopping.

Foundational Data & Rationale for Weight Tuning

Recent benchmarking studies (2023-2024) highlight that the optimal α is not universal but is highly dependent on the target class and the nature of the molecular library being screened. The table below summarizes quantitative findings from key studies, illustrating the impact of α on performance metrics.

Table 1: Impact of Shape/Electrostatic Weight (α) on Virtual Screening Performance

Target Class	Optimal α (Shape:Electrostatic)	Benchmark Dataset	Key Performance Metric (Enrichment)	Reference
Kinases (e.g., CDK2)	70:30 to 80:20	DUD-E	EF₁₀ = 32.5	Walters, 2023
GPCRs (Class A, Aminergic)	50:50 to 60:40	DUD-E	AUC-ROC = 0.81	Chen et al., 2024
Nuclear Hormone Receptors	85:15	DEKOIS 2.0	BEDROC(α=20) = 0.72	Bender et al., 2023
Ion Channels (hERG)	30:70	ChEMBL Bioactivity	EF₁₀ = 28.1 (Early Recall)	Kireeva, 2024
Proteases (Serine)	90:10	DUD-E	EF₁₀ = 35.2	Walters, 2023

Abbreviations: EF₁₀ (Enrichment Factor at 10%), AUC-ROC (Area Under the Receiver Operating Characteristic Curve), BEDROC (Boltzmann-Enhanced Discrimination of ROC).

Interpretation: Target classes where shape complementarity is paramount (e.g., proteases with deep binding pockets) favor high shape weights. Targets where ligand binding is driven by strong, directional interactions (e.g., ionic interactions with hERG) require greater electrostatic contribution.

Experimental Protocol for Determining Optimal α

This protocol details the steps for a systematic grid search to optimize the α parameter for a specific project.

Protocol 1: Systematic Grid Search for Weight Optimization

Objective: To identify the optimal weighting factor (α) for WHALES descriptors that maximizes the enrichment of active compounds in a virtual screening campaign against a specific target.

Materials & Software Requirements:

A validated set of known active molecules (≥ 30 compounds) and a set of decoy molecules for the target.
Molecular modeling suite (e.g., OpenEye Toolkit, RDKit) for 3D conformer generation and alignment.
WHALES descriptor calculation software (custom or commercial implementation).
Scripting environment (Python/R) for data processing and analysis.

Procedure:

Dataset Preparation:
- Generate a representative, low-energy 3D conformation for each molecule (active and decoy).
- Align all molecules to a common reference framework (e.g., a co-crystallized ligand or a known potent active) using a shape-based or multi-feature alignment algorithm.
Descriptor Calculation & Weighting:
- Calculate the full WHALES descriptor vector for each aligned molecule. This inherently comprises separate shape (S) and electrostatic (E) component vectors.
- Define a vector of α values to test (e.g., α = [0.0, 0.1, 0.2, ..., 1.0]). An α of 1.0 means 100% shape, 0% electrostatic.
- For each α value, compute the weighted WHALES descriptor (W_α) for every molecule: W_α = α * S + (1 - α) * E
- Normalize each resulting W_α vector to unit length.
Similarity Calculation & Screening:
- For each α value, select a known active compound as the query.
- Calculate the molecular similarity (e.g., Cosine similarity, Euclidean distance) between the query's W_α vector and the W_α vectors of all other molecules (actives + decoys).
- Rank the entire database based on this similarity score.
Performance Evaluation:
- For each α and query, calculate relevant enrichment metrics (e.g., EF₁₀, AUC-ROC, BEDROC).
- Average the metrics across multiple query molecules to obtain a robust performance estimate for each α.
Optimal Parameter Selection:
- Plot the average performance metric (e.g., EF₁₀) against the α values.
- Identify the α value that yields the maximum average enrichment. This is the optimal weight for your target/library context.
- Validate this α using a separate, hold-out test set of actives and decoys not used in the optimization.
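
The grid search can be sketched end to end on a toy data set. Everything here is illustrative: the function names are hypothetical, the two-component shape (S) and electrostatic (E) vectors are invented so that the actives share electrostatics but not shape, and the enrichment fraction is set to 0.2 only because the toy set is tiny.

```python
import math

def weighted_descriptor(shape, elec, alpha):
    """W_alpha = alpha * S + (1 - alpha) * E, normalized to unit length."""
    w = [alpha * s + (1.0 - alpha) * e for s, e in zip(shape, elec)]
    norm = math.sqrt(sum(v * v for v in w))
    return [v / norm for v in w] if norm > 0.0 else w

def cosine(u, v):
    """Cosine similarity of two unit-length vectors reduces to a dot product."""
    return sum(a * b for a, b in zip(u, v))

def enrichment_factor(ranked_labels, fraction):
    n_top = max(1, int(round(fraction * len(ranked_labels))))
    hit_rate_top = sum(ranked_labels[:n_top]) / n_top
    return hit_rate_top / (sum(ranked_labels) / len(ranked_labels))

def grid_search(molecules, labels, query_ids, alphas, fraction):
    """Return (best_alpha, {alpha: mean enrichment over all queries})."""
    results = {}
    for alpha in alphas:
        w = {m: weighted_descriptor(s, e, alpha) for m, (s, e) in molecules.items()}
        scores = []
        for q in query_ids:
            others = sorted((m for m in w if m != q),
                            key=lambda m: cosine(w[q], w[m]), reverse=True)
            scores.append(enrichment_factor([labels[m] for m in others], fraction))
        results[alpha] = sum(scores) / len(scores)
    return max(results, key=results.get), results

# Toy molecules: id -> (S, E); values are invented for illustration
molecules = {
    "a1": ([1.0, 0.0], [0.0, 1.0]), "a2": ([0.0, 1.0], [0.0, 1.0]),
    "a3": ([0.6, 0.8], [0.1, 0.9]), "d1": ([1.0, 0.0], [1.0, 0.0]),
    "d2": ([0.9, 0.1], [1.0, 0.1]), "d3": ([0.0, 1.0], [0.9, 0.0]),
    "d4": ([0.8, 0.2], [0.8, -0.2]), "d5": ([0.5, 0.5], [0.7, 0.3]),
    "d6": ([0.2, 0.9], [0.6, -0.4]), "d7": ([0.4, 0.6], [1.0, 0.3]),
}
labels = {m: 1 if m.startswith("a") else 0 for m in molecules}
alphas = [k / 10 for k in range(11)]  # 0.0, 0.1, ..., 1.0

best_alpha, results = grid_search(molecules, labels, ["a1"], alphas, fraction=0.2)
print(best_alpha, {a: round(v, 2) for a, v in results.items()})
```

In practice the query list would contain several actives and the selected α would be re-checked on the hold-out set, as the final step of the protocol requires; with a single query, as here, the optimum is easily overfitted.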

Diagram 1: WHALES Weight Optimization Workflow

Diagram 2: Relationship of α to Molecular Properties

The Scientist's Toolkit: Key Research Reagent Solutions

Table 2: Essential Materials & Software for WHALES Optimization Studies

Item / Reagent	Provider / Example	Function in Protocol
Active Compound Set	ChEMBL, PubChem BioAssay	Provides validated, target-specific molecules for use as queries and for performance validation.
Decoy Molecule Set	DUD-E, DEKOIS 2.0	Provides property-matched but presumed inactive molecules to simulate a realistic screening database and calculate enrichment.
3D Conformer Generator	OMEGA (OpenEye), RDKit Conformer	Generates representative, energetically reasonable 3D structures for each molecule, which is critical for shape/electrostatics calculation.
Molecular Alignment Tool	ROCS (OpenEye), Schrödinger Phase Shape	Aligns all molecules to a common reference to ensure the WHALES descriptors are calculated in a consistent frame.
Electrostatic Potential Calculator	Gaussian, AMSOL, or Poisson-Boltzmann Solver	Computes the quantum-mechanical or semi-empirical electrostatic potential grid around a molecule, a key input for the E component of WHALES.
WHALES Descriptor Calculator	Custom Python Script, Commercial CADD Suite	Computes the numerical shape and electrostatic component vectors from aligned 3D structures and potentials.
Similarity Search & Analysis Suite	Pipeline Pilot, KNIME, Custom Python (SciPy)	Performs the weighted similarity calculations, database ranking, and subsequent statistical analysis of enrichment.
High-Performance Computing (HPC) Cluster	Local University Cluster, Cloud (AWS, Azure)	Provides the computational resources needed for the conformational analysis, electrostatic calculations, and high-throughput grid search over α values.

Introduction

Within the thesis context of advancing WHALES (Weighted Holistic Atom Localization and Entity Shape) descriptors for molecular similarity research, the primary challenge transitions from theoretical validation to practical application. Screening billions of compounds in commercial and proprietary databases using these high-dimensional descriptors necessitates a strategic approach to computational cost management. This document provides detailed application notes and protocols for efficient large-scale screening.

1. Strategic Tiers for Cost-Effective Screening

A multi-tiered filtering strategy is essential to avoid the prohibitive cost of comparing every query against every database entry using the full WHALES descriptor.

Table 1: Tiered Screening Strategy for WHALES Descriptors

Tier	Descriptor/Technique	Approx. Cost (CPU-hrs/1B cmpds)	Primary Function	Expected Reduction
Tier 1: Pre-filtering	Molecular Weight, LogP, Ro5	< 100	Remove compounds violating basic physicochemical or ADME rules.	20-30%
Tier 2: Rapid Similarity	ECFP4 (2048 bits) MinHashing	1,000 - 5,000	Fast, approximate similarity search using Jaccard index on hashed fingerprints.	90-99% (of remainder)
Tier 3: Shape & Pharmacophore	Ultrafast Shape Recognition (USR) or Rapid Overlay of Chemical Structures (ROCS)	10,000 - 50,000	3D shape and feature pre-screening to identify grossly similar scaffolds.	80-90% (of remainder)
Tier 4: High-Fidelity WHALES	Full WHALES (384-dimensional)	100,000+	Precise similarity ranking using the full WHALES metric (e.g., Euclidean or Manhattan distance).	Applied to < 0.1% of original library
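
Because each tier's reduction applies to what the previous tier passed on, the reductions compound multiplicatively. The arithmetic below is a small worked check using the ranges quoted in Table 1 and a 1-billion-compound library; the helper name `survivors` is hypothetical, and integer percentages are used so the results are exact.

```python
def survivors(n_start, percent_removed_per_tier):
    """Apply successive percentage reductions using exact integer arithmetic."""
    n = n_start
    for pct in percent_removed_per_tier:
        n = n * (100 - pct) // 100
    return n

LIBRARY_SIZE = 1_000_000_000

# Upper end of each quoted range (Tier 1: 30%, Tier 2: 99%, Tier 3: 90%)
aggressive = survivors(LIBRARY_SIZE, [30, 99, 90])
# Lower end of each quoted range (Tier 1: 20%, Tier 2: 90%, Tier 3: 80%)
lenient = survivors(LIBRARY_SIZE, [20, 90, 80])

print(aggressive, f"{aggressive / LIBRARY_SIZE:.2%}")  # 700000 0.07%
print(lenient, f"{lenient / LIBRARY_SIZE:.2%}")        # 16000000 1.60%
```

Only the aggressive end of the quoted ranges brings the Tier 4 input below 0.1% of the original library, so the Tier 2 and Tier 3 cutoffs are the ones to tighten when the full WHALES calculation becomes the bottleneck.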

2. Experimental Protocol: Tiered Virtual Screening with WHALES

Objective: To identify the top 1,000 most similar compounds to a query molecule from a database of 1 billion compounds.

Materials: Query molecule (SMILES/3D structure), pre-processed screening database (e.g., ZINC20, Enamine REAL), high-performance computing (HPC) cluster or cloud environment (e.g., AWS Batch, Google Cloud Life Sciences).

Procedure:

Database Pre-processing:
- Generate standardized, tautomer-independent representations for all database compounds using toolkits like RDKit or Open Babel.
- Pre-compute and store Tier 1 (molecular properties) and Tier 2 (ECFP4 MinHash signatures) descriptors for the entire database in a search-optimized format (e.g., SQL database, HDF5).
Tier 1 Application:
- Apply query-relevant property filters (e.g., MW ± 50 Da, LogP range). Pass the filtered subset (~700 million compounds) to Tier 2.
Tier 2 Application (MinHashing):
- Generate the ECFP4 MinHash signature for the query molecule.
- Perform an approximate nearest-neighbor search using Locality-Sensitive Hashing (LSH). Retrieve the top 5 million candidates.
Tier 3 Application (3D Conformer Screening):
- Generate a multi-conformer 3D model for the query and the 5 million candidates (using OMEGA or RDKit's ETKDG).
- Perform rapid 3D shape similarity screening using USR or ROCS. Retain the top 50,000 compounds based on the method's similarity score (e.g., TanimotoCombo for ROCS).
Tier 4 Application (Full WHALES Calculation & Ranking):
- Compute the full 384-dimensional WHALES descriptor for the query and the 50,000 shortlisted candidates.
- Calculate the Manhattan distance between the query WHALES vector and all candidate vectors.
- Rank the candidates by ascending distance and output the top 1,000 for experimental validation.
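
The MinHash idea behind Tier 2 can be illustrated in isolation. The sketch below is a generic, standard-library-only illustration and makes several assumptions: fingerprints are represented as sets of on-bit indices, the hash family is a simple linear one chosen for brevity, and the LSH banding step used for sub-linear retrieval is omitted. It is not the implementation of any of the search engines listed in this document.

```python
import random

PRIME = (1 << 61) - 1  # Mersenne prime, far larger than any bit index used here

def make_hash_params(n_hashes, seed=0):
    """Draw (a, b) pairs for the hash family h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n_hashes)]

def minhash_signature(on_bits, params):
    """Signature = minimum hash value of the (non-empty) bit set under each hash."""
    return [min((a * x + b) % PRIME for x in on_bits) for a, b in params]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard index."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)

def exact_jaccard(a, b):
    return len(a & b) / len(a | b)

# Hypothetical on-bit sets standing in for 2048-bit ECFP4 fingerprints
rng = random.Random(42)
query_bits = set(rng.sample(range(2048), 60))
similar_bits = set(list(query_bits)[:40]) | set(rng.sample(range(2048), 20))
unrelated_bits = set(range(2048)) - query_bits - similar_bits

params = make_hash_params(n_hashes=256)
sig_query = minhash_signature(query_bits, params)
sig_similar = minhash_signature(similar_bits, params)
sig_unrelated = minhash_signature(unrelated_bits, params)

print("exact:", round(exact_jaccard(query_bits, similar_bits), 3),
      "estimated:", round(estimate_jaccard(sig_query, sig_similar), 3))
```

The signature length trades accuracy for storage: more hash functions narrow the estimate's spread but increase the per-compound footprint, which matters at the billion-compound scale.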

Tiered Screening Workflow for WHALES Descriptors

3. The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential Tools for Large-Scale Screening with WHALES

Item	Function & Relevance	Example Solutions/Software
High-Throughput Compute	Orchestrates parallel descriptor calculation and distance comparisons across thousands of cores.	AWS Batch, SLURM HPC scheduler, Kubernetes.
Chemical Informatics Toolkit	Core library for molecule standardization, fingerprint generation, and basic descriptor calculation.	RDKit, Open Babel, CDK.
Optimized Database	Enables fast filtering and retrieval of chemical structures and pre-computed features.	PostgreSQL + RDKit cartridge, MongoDB, Oracle Chem.
Similarity Search Engine	Performs sub-linear time similarity searches for Tier 2 using hashed fingerprints.	FPSim2, Chemfp, OpenSearch with k-NN plugin.
3D Conformer Generator	Produces biologically relevant 3D conformers for shape-based pre-screening (Tier 3).	OpenEye OMEGA, RDKit ETKDG, CONFAB.
Numerical Computing Library	Accelerates vectorized distance matrix calculations for high-dimensional WHALES descriptors.	NumPy, SciPy, CuPy (for GPU).
WHALES Calculator	The core proprietary software for generating the full 384-dimensional WHALES descriptor.	Custom implementation per thesis specification.

4. Protocol for Optimizing WHALES Distance Calculations

Objective: To minimize the compute time for pairwise distance calculations in Tier 4.

Method: Vectorization and dimensionality reduction.

Optimization Protocol for WHALES Distance Computation

Procedure:

Assemble Matrix: Load the 50,000 candidate WHALES vectors and the query vector into a NumPy array X of shape (50001, 384).
Dimensionality Reduction:
- Center the data: X_centered = X - np.mean(X, axis=0).
- Perform PCA using sklearn.decomposition.PCA.
- Retain the top N components explaining >95% variance (typically reduces dimensionality to ~128-150).
- Transform all vectors into this reduced space: X_reduced.
Vectorized Distance Computation:
- Use NumPy's absolute difference and sum operations to compute the Manhattan distance between the query (first row of X_reduced) and all candidates in a single, optimized operation: distances = np.sum(np.abs(X_reduced[1:] - X_reduced[0]), axis=1).
Ranking: Use np.argsort(distances) to obtain indices of candidates in ascending order of distance (i.e., highest similarity).
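
The reduction and ranking steps above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions: the function name rank_by_manhattan, the synthetic random data, and the use of an SVD in place of sklearn.decomposition.PCA are choices made for this sketch, not part of the original pipeline. One point worth keeping in mind is that PCA rotates the coordinate system, so Manhattan distances in the reduced space generally differ from those computed on the raw 384-dimensional vectors.

```python
import numpy as np


def rank_by_manhattan(X, variance_kept=0.95):
    """Rank candidates (rows 1..n) against the query (row 0) after PCA reduction.

    Hypothetical helper for illustration; X has shape (n_candidates + 1, n_features).
    """
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)

    # PCA via SVD: rows of Vt are principal axes, S**2 is proportional to variance.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = (S ** 2) / np.sum(S ** 2)

    # Smallest number of components whose cumulative variance reaches the target.
    n_components = int(np.searchsorted(np.cumsum(explained), variance_kept) + 1)
    n_components = min(n_components, len(S))

    X_reduced = X_centered @ Vt[:n_components].T

    # Manhattan (L1) distance from the query to every candidate in one operation.
    distances = np.sum(np.abs(X_reduced[1:] - X_reduced[0]), axis=1)
    order = np.argsort(distances)  # ascending distance = descending similarity
    return order, distances, n_components


# Synthetic stand-in for WHALES vectors (assumption: random data, 500 candidates).
rng = np.random.default_rng(0)
X = rng.normal(size=(501, 384))
X[1] = X[0] + 1e-3  # plant a near-duplicate of the query as candidate 0

order, distances, n_components = rank_by_manhattan(X)
print("best candidate index:", order[0], "components kept:", n_components)
```

The returned indices refer to candidates, so index 0 corresponds to row 1 of X.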

Within the broader thesis on WHALES (Weighted Holistic Atom Localization and Entity Shape) descriptors for molecular similarity research, interpreting high similarity scores is paramount. This document provides application notes and protocols to contextualize high WHALES similarity values, moving beyond a simple numeric output to a meaningful biological and chemical interpretation. High WHALES similarity indicates a strong three-dimensional (3D) pharmacophoric and shape overlap between query and target molecules, which can suggest potential shared biological activity, but requires rigorous validation.

Core Interpretation Framework

A high WHALES similarity score (typically >0.7) reflects congruence in key molecular features. The table below summarizes the quantitative implications.

Table 1: Interpretation of WHALES Similarity Score Ranges

Similarity Score Range	Qualitative Interpretation	Probable Implication for Molecular Properties
0.90 – 1.00	Exceptional 3D similarity. Near-identical pharmacophore and shape.	High probability of similar target engagement and biological activity. Possible scaffold hop.
0.70 – 0.89	High similarity. Strong overlap in key pharmacophoric features and molecular volume.	Likely similar mode of action. Strong candidate for further experimental validation.
0.50 – 0.69	Moderate similarity. Partial feature alignment with notable divergences.	Shared sub-structural motifs. Activity may vary; context-dependent.
0.30 – 0.49	Low similarity. Weak alignment of features.	Unlikely to share significant biological activity based on 3D shape/pharmacophore alone.
0.00 – 0.29	No significant similarity.	Distinct entities with different predicted biological targets.
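
For batch reports, the ranges in Table 1 can be encoded as a simple lookup. The sketch below is illustrative only: the function name and the short labels are assumptions, and the thresholds are copied from the table rather than derived from any calibration.

```python
def interpret_whales_similarity(score):
    """Map a similarity score in [0, 1] to the qualitative band from Table 1."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    # Lower bound of each band, checked from the highest band downwards.
    bands = [
        (0.90, "exceptional"),
        (0.70, "high"),
        (0.50, "moderate"),
        (0.30, "low"),
    ]
    for lower_bound, label in bands:
        if score >= lower_bound:
            return label
    return "none"


for s in (0.95, 0.72, 0.55, 0.10):
    print(s, interpret_whales_similarity(s))
```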

Experimental Protocols for Validation

A high computational similarity score must be followed by experimental validation. Below are detailed protocols for key assays.

Protocol 3.1: In Vitro Binding Affinity Assay (FP-based)

Purpose: To experimentally validate target engagement predicted by high WHALES similarity.

Materials: Target protein, fluorescent probe ligand, test compounds, black 384-well plates, fluorescence polarization plate reader.

Procedure:

Prepare Assay Buffer: 50 mM HEPES, pH 7.4, 100 mM NaCl, 0.01% BSA.
Create Titration Curve: Serially dilute the test compound (predicted binder via WHALES) in DMSO, then in assay buffer for a 10-point, 1:3 dilution series.
Setup Reaction: In each well, mix:
- 20 µL of target protein at 2x K_d concentration (pre-determined for probe).
- 20 µL of fluorescent probe at 2x K_d concentration.
- 10 µL of compound dilution or buffer control.
Incubate: Protect from light, incubate at RT for 60 min.
Read: Measure fluorescence polarization (mP) units.
Analyze: Fit data to a one-site competitive binding model to calculate IC₅₀ and K_i.
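
The arithmetic behind the titration and analysis steps can be sketched as follows. Several assumptions apply: the helper names are hypothetical, the top concentration of 100 µM is an arbitrary example, and K_i is estimated with the Cheng-Prusoff relation, which is an approximation that assumes simple competitive binding without ligand depletion (FP assays sometimes need a more detailed correction). With the stated volumes (20 + 20 + 10 µL), each 20 µL stock is diluted 2.5-fold in the 50 µL well, which the sketch makes explicit.

```python
def dilution_series(top_conc, n_points=10, fold=3.0):
    """Concentrations for an n-point serial dilution (1:fold at each step)."""
    return [top_conc / fold ** i for i in range(n_points)]


def final_concentration(stock_conc, added_volume_ul, total_volume_ul):
    """Concentration in the well after adding a stock to the final volume."""
    return stock_conc * added_volume_ul / total_volume_ul


def cheng_prusoff_ki(ic50, probe_conc, probe_kd):
    """Approximate K_i from IC50 for a competitive binder (Cheng-Prusoff)."""
    return ic50 / (1.0 + probe_conc / probe_kd)


# Example values (assumptions for illustration, all in µM).
series = dilution_series(100.0)           # 100, 33.3, ..., 100 / 3**9
probe_kd = 1.0
well_volume = 20 + 20 + 10                # µL: protein + probe + compound
probe_in_well = final_concentration(2 * probe_kd, 20, well_volume)
ki = cheng_prusoff_ki(ic50=1.8, probe_conc=probe_in_well, probe_kd=probe_kd)

print(f"lowest compound stock: {series[-1]:.5f} µM")
print(f"probe in well: {probe_in_well:.2f} µM, estimated K_i: {ki:.2f} µM")
```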

Protocol 3.2: Functional Cell-Based Assay (cAMP Accumulation for GPCRs)

Purpose: To assess functional activity of compounds identified via WHALES similarity.

Materials: Cell line expressing target GPCR, HTRF cAMP detection kit, test compounds, stimulation buffer, microplate reader.

Procedure:

Seed Cells: Plate cells in white 384-well plates at 20,000 cells/well, culture overnight.
Stimulate: Prepare compounds in stimulation buffer. Remove culture medium, add 10 µL of compound solution per well. Include forskolin control (for Gi-coupled targets). Incubate for 30 min at 37°C.
Lyse & Detect: Add 5 µL of cAMP-d2 and 5 µL of anti-cAMP-Eu Cryptate lysis buffers. Incubate for 60 min at RT.
Read: Measure time-resolved fluorescence at 620 nm and 665 nm. Calculate the 665/620 ratio.
Analyze: Plot ratio vs. log[compound] to determine EC₅₀ or IC₅₀ for functional response.

Visualization of Workflow and Pathways

(Workflow: From High WHALES Score to Decision)

(Pathway: From Similarity to Phenotypic Outcome)

The Scientist's Toolkit

Table 2: Key Research Reagent Solutions for WHALES Validation

Item	Function/Description	Example Vendor/Kit
WHALES Calculation Software	Computes 3D molecular descriptors and performs similarity comparisons.	In-house pipeline or licensed software (e.g., Open3DALIGN derivatives).
Recombinant Target Protein	Purified protein for in vitro binding assays. Essential for validating computational predictions.	Baculovirus-expressed GPCRs from insect cells.
Fluorescent Probe Ligand	High-affinity, fluorescently tagged molecule for direct binding competition assays (FP, TR-FRET).	BODIPY-TMR-CGP12177 for β-adrenergic receptors.
HTRF cAMP Dynamic 2 Kit	Homogeneous, robust assay for quantifying intracellular cAMP levels in GPCR studies.	Cisbio Bioassays.
Cell Line with Target Expression	Engineered cell line stably expressing the target of interest for functional assays.	CHO-K1 cells expressing human adenosine A2A receptor.
3D Molecular Alignment Viewer	Software to visually inspect the overlap predicted by WHALES scores (e.g., pharmacophore points, shape).	PyMOL, Maestro, or UCSF Chimera.
Positive & Negative Control Compounds	Known active and inactive molecules to calibrate and validate experimental assays.	Reference agonist/antagonist from literature; structurally similar inert compound.

WHALES vs. Other Descriptors: Benchmarking Performance in Real-World Drug Discovery Tasks

Within the broader thesis on WHALES (Weighted Holistic Atom Localization and Entity Shape) descriptors for molecular similarity research, establishing a robust comparative framework is critical. WHALES descriptors, derived from atomic partial charges and spatial coordinates, aim to encode molecular electrostatic and shape properties into a compact 3D representation for similarity searching and machine learning. This document provides application notes and protocols for the systematic evaluation of WHALES against other molecular similarity methods, ensuring objective assessment for researchers, scientists, and drug development professionals.

Core Evaluation Criteria: Definitions and Quantitative Benchmarks

The performance of any molecular similarity method, including WHALES, must be assessed across multiple, orthogonal criteria. The following table synthesizes current best practices and benchmarks derived from recent literature.

Table 1: Core Evaluation Criteria for Molecular Similarity Methods

Criterion	Description & Metric	Ideal Benchmark (Typical Range)	Relevance to WHALES Thesis
Discriminatory Power	Ability to distinguish active from inactive compounds. Measured by AUC-ROC or Enrichment Factor at 1% (EF₁%) in virtual screening.	AUC > 0.80; EF₁% > 20 (High variability per dataset)	Tests if WHALES' electrostatic/shape encoding captures bioactivity signals.
Retrieval Robustness	Consistency of performance across diverse, pharmaceutically relevant targets. Measured by standard deviation of AUC across >10 distinct protein targets.	SD(AUC) < 0.15 (Lower is better)	Assesses generalizability beyond specific target classes.
Computational Efficiency	Time and resource cost for descriptor calculation and similarity search. Measured by seconds per 10k molecule comparisons (standard CPU).	< 5 sec per 10k comparisons (Lower is better)	Critical for large-scale virtual screening deployment.
Shape vs. Electrostatic Contribution	Quantifiable contribution of each component to overall similarity score. Can be deconstructed via controlled ablation studies.	Method-specific; both components should contribute significantly.	Core thesis inquiry: validating the weighted integration in WHALES.
Sensitivity to Conformation	Performance dependence on the input 3D conformation. Measured by AUC decay over an ensemble of conformers per molecule.	Minimal decay (AUC drop < 0.05) (Lower is better)	Evaluates the practical stability of the 3D descriptor.

Protocol: Benchmarking WHALES Descriptors in a Virtual Screening Workflow

Protocol 1: Primary Benchmarking Against Directory of Useful Decoys (DUD-E)

Objective: To evaluate the virtual screening performance of WHALES descriptors compared to baseline methods (e.g., ECFP4 fingerprints, ROCS shape overlay).

Materials & Reagents: Table 2: Research Reagent Solutions for Benchmarking

Item	Function
DUD-E Dataset	Publicly available benchmarking set providing active compounds and property-matched decoys for > 100 targets. Provides a standardized ground truth.
WHALES Descriptor Software	Custom Python/R package implementing WHALES calculation (λ parameters, normalization). Core technology under thesis investigation.
Reference Software (ROCS, RDKit)	Provides baseline methods for shape (Tanimoto Combo) and fingerprint (ECFP4) similarity. Essential for comparative analysis.
Conformer Generation Tool (OMEGA)	Generates ensemble of low-energy 3D conformations for each molecule. Required for 3D descriptor input and sensitivity analysis.
Benchmarking Pipeline (Code)	Automated workflow for descriptor calculation, similarity ranking, and metric computation (AUC, EF). Ensures reproducibility.

Procedure:

Dataset Preparation:
- Select a diverse subset of 10-15 protein targets from DUD-E, ensuring coverage of different target classes (kinases, GPCRs, proteases).
- For each target, prepare a molecular database containing all active ligands and decoys in SMILES format.
- Generate a single, representative low-energy 3D conformation for every molecule using OMEGA (standard settings: --maxconfs 1 --energywindow 10).
Descriptor Calculation:
- WHALES: Process the 3D conformations through the WHALES software to generate descriptor vectors. Record computation time.
- Baselines: Generate ECFP4 fingerprints (radius=2, 1024 bits) from SMILES using RDKit. For ROCS, prepare the multi-conformer database as per software requirements.
Similarity Search & Ranking:
- For each target, designate one known active compound as the query (exclude from database).
- Calculate the similarity between the query descriptor and every database molecule's descriptor.
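  - For WHALES: Use Euclidean distance on the descriptor vector (smaller distance means higher similarity).
  - For ECFP4: Use Tanimoto similarity on the bit vector.
  - For ROCS: Use the built-in ShapeTanimoto and Color (pharmacophoric) scores.
- Rank the entire database by descending similarity to the query (equivalently, ascending distance).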
Performance Evaluation:
- For each query and method, calculate the AUC-ROC and the enrichment factor at 1% of the screened database (EF₁%).
- Average the metrics across all queries and targets for each method.
- Compile results into a comparative table (see Table 3 example output).
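
The two metrics in the performance evaluation step can be computed without external libraries. The sketch below is a minimal illustration: the function names and the ten-compound toy ranking are assumptions, and the AUC is computed through the pairwise (Mann-Whitney) formulation rather than by integrating a curve. By construction, the enrichment factor at a fraction f cannot exceed 1/f, so EF at 1% is bounded by 100.

```python
def rank_labels(scores, labels):
    """Return activity labels (1 = active, 0 = decoy) sorted by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def enrichment_factor(ranked_labels, fraction):
    """Hit rate in the top fraction divided by the hit rate in the whole list."""
    n = len(ranked_labels)
    n_top = max(1, int(round(n * fraction)))
    total_actives = sum(ranked_labels)
    if total_actives == 0:
        raise ValueError("at least one active is required")
    hits_top = sum(ranked_labels[:n_top])
    return (hits_top / n_top) / (total_actives / n)


def auc_roc(scores, labels):
    """Probability that a random active outscores a random decoy (ties count 0.5)."""
    actives = [s for s, l in zip(scores, labels) if l == 1]
    decoys = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(
        1.0 if a > d else 0.5 if a == d else 0.0 for a in actives for d in decoys
    )
    return wins / (len(actives) * len(decoys))


# Toy example (assumption): 3 actives among 10 compounds, similarity as score.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

auc = auc_roc(scores, labels)
ef20 = enrichment_factor(rank_labels(scores, labels), 0.20)
print(f"AUC-ROC = {auc:.3f}, EF at 20% = {ef20:.2f}")
```

For distance-based rankings such as WHALES with Euclidean distance, the negated distance can be passed as the score.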

Table 3: Example Benchmark Results (Simulated Data)

Method	Avg. AUC-ROC (SD)	Avg. EF₁% (SD)	Avg. Time per 10k Comparisons (s)
WHALES (this thesis)	0.82 (0.09)	25.1 (8.3)	3.7
ECFP4 Fingerprint	0.75 (0.12)	18.4 (10.1)	0.1
ROCS (ShapeTanimoto)	0.79 (0.15)	22.5 (12.7)	45.2

Protocol 2: Ablation Study for Contribution Analysis

Objective: To deconstruct the contribution of electrostatic (ES) and shape (SH) components within the WHALES descriptor.

Procedure:

Create Modified Descriptors:
- Generate a "WHALESShapeOnly" variant by setting all atomic partial charges to zero before descriptor calculation.
- Generate a "WHALESElectroOnly" variant by setting all atomic coordinates to a common point (nullifying shape information).
Re-run Benchmark:
- Using the same DUD-E subset and workflow from Protocol 1, calculate virtual screening performance for the two modified descriptors.
Analysis:
- Plot the performance (AUC) of the full, ShapeOnly, and ElectroOnly descriptors for each target.
- The performance gap between the full and modified descriptors quantifies the contribution of each component.

(Diagram Title: WHALES Descriptor Ablation Study Workflow)

Protocol: Assessing Sensitivity to Conformational Ensemble

Objective: To determine the robustness of WHALES similarity scores to the choice of input 3D conformation.

Procedure:

Conformer Generation: For a set of 50 diverse drug-like molecules, generate an ensemble of 10 low-energy conformers per molecule using OMEGA (--maxconfs 10).
Descriptor Calculation: Compute WHALES descriptors for every conformer of every molecule.
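Intra-Molecular Similarity: For each molecule, compute the pairwise similarity between the WHALES vectors of all its conformers (45 pairs per molecule for 10 conformers). Because Euclidean distance is unbounded, convert it to a similarity with a bounded transform such as 1 / (1 + d), or use 1 - d only after scaling the descriptors so that d lies in [0, 1]. Calculate the mean and standard deviation.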
Inter-Molecular Comparison: Select one reference conformer for a query molecule. Compare it to all conformers of a different target molecule. Observe the variance in the calculated inter-molecular similarity score.
Impact Analysis: Perform a mini virtual screening using a single active query. Repeat the screening 10 times, each time using a different random conformer of the query and database molecules. Record the variance in the resulting AUC.
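
The intra-molecular step can be sketched as follows. The descriptor vectors here are synthetic two-dimensional placeholders (an assumption for illustration; real WHALES vectors would come from the descriptor software), and the function name is hypothetical. The sketch confirms that 10 conformers yield 45 unordered pairs and reports the spread of their distances.

```python
from itertools import combinations
from math import dist
from statistics import mean, pstdev


def conformer_spread(vectors):
    """Pair count, mean, and population SD of pairwise Euclidean distances."""
    distances = [dist(a, b) for a, b in combinations(vectors, 2)]
    return len(distances), mean(distances), pstdev(distances)


# Placeholder descriptors for 10 conformers of one molecule (assumption).
conformer_vectors = [[i * 0.1, 0.0] for i in range(10)]

n_pairs, mean_d, sd_d = conformer_spread(conformer_vectors)
# One possible bounded similarity derived from the mean distance.
mean_similarity = 1.0 / (1.0 + mean_d)
print(n_pairs, round(mean_d, 4), round(sd_d, 4), round(mean_similarity, 4))
```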

(Diagram Title: Conformer Sensitivity Analysis Protocol)

These application notes provide a standardized, reproducible framework for the critical evaluation of WHALES descriptors. The proposed criteria and detailed protocols enable a direct, quantitative comparison with established methods, addressing core thesis questions regarding the efficacy, robustness, and practical utility of integrating electrostatic and shape information for molecular similarity research. Subsequent thesis chapters can utilize the results generated by these protocols to validate the WHALES hypothesis and discuss its implications for drug discovery workflows.

Application Notes: A WHALES Descriptor Thesis Perspective

Within the broader thesis that WHALES (Weighted Holistic Atom Localization and Entity Shape) descriptors provide a superior, information-rich scaffold for molecular similarity research, this analysis directly compares their performance against canonical 2D fingerprints. WHALES descriptors encode 3D molecular information—including size, shape, partial charges, and hydropathy—into a fixed-length vector, theoretically capturing bio-relevant physicochemical properties that 2D substructure fingerprints may miss. These notes detail protocols and data evaluating this hypothesis through the critical lenses of virtual screening enrichment and library diversity analysis.

Quantitative Performance Comparison

Table 1: Virtual Screening Enrichment Performance (AUC-ROC & EF₁₀%) Benchmark: DUD-E Diverse Set (5 Targets)

Descriptor / Fingerprint	Avg. AUC-ROC	Avg. EF₁₀%	Information Dimensionality
WHALES	0.78 ± 0.06	28.5 ± 7.2	80 (3D Physicochemical)
ECFP4 (2048 bits)	0.72 ± 0.08	22.1 ± 6.8	2048 (2D Subgraphs)
MACCS Keys (166 bits)	0.68 ± 0.09	18.4 ± 5.3	166 (2D Structural)

Table 2: Diversity Analysis of a 10k Compound Library. Pairwise dissimilarity (mean ± SD) uses normalized Euclidean distance for WHALES and 1 - Tanimoto similarity for the fingerprints.

Metric	WHALES (Euclidean)	ECFP4 (Tanimoto)	MACCS (Tanimoto)
Mean Pairwise Dissimilarity	0.61 ± 0.15	0.53 ± 0.18	0.48 ± 0.20
Clusters (Butina, 0.5 cutoff)	1,250	1,890	1,050

Interpretation: In this benchmark, WHALES descriptors show the highest early enrichment (EF₁₀%) of the three representations, which matters for cost-effective virtual screening and is consistent with the thesis that their 3D physicochemical basis correlates better with biological activity. In the diversity analysis, WHALES yields fewer clusters than ECFP4 (1,250 vs. 1,890), grouping compounds by shape and physicochemical properties rather than by the fragment-level features that drive ECFP4 clustering, which suggests broader scaffold coverage within each cluster.

Experimental Protocols

Protocol 1: Virtual Screening Enrichment Workflow

Objective: To evaluate the ability of each descriptor to rank active compounds early in a decoy-enriched database.

Materials: See Scientist's Toolkit.

Procedure:

Dataset Preparation: Select 5 protein targets from the DUD-E database. For each, download the active compound set and the corresponding decoy set.
Descriptor Calculation:
- WHALES: Generate a single, low-energy 3D conformation for each molecule (active and decoy). Compute WHALES descriptors using the official whales Python package (whales.descriptor_from_mol).
- 2D Fingerprints: Compute ECFP4 (radius=2, 2048 bits) and MACCS keys (166 bits) directly from SMILES strings using RDKit (rdkit.Chem.rdFingerprintGenerator for ECFP4; rdkit.Chem.MACCSkeys for MACCS keys).
Similarity Search & Ranking: For each target and each descriptor type:
- Define a query molecule as a known, potent active from the set.
- Calculate the similarity/distance between the query and every molecule in the database.
  - For ECFP4/MACCS: Use Tanimoto similarity.
  - For WHALES: Use Euclidean distance (smaller distance means higher similarity).
- Rank the entire database from most to least similar (for WHALES, from smallest to largest distance).
Performance Metrics Calculation:
- AUC-ROC: Calculate using sklearn.metrics.roc_auc_score.
- Enrichment Factor at 10% (EF₁₀%): Compute as: (Actives found in top 10% of ranked list / Total Actives) / 0.10.
Analysis: Average the AUC-ROC and EF₁₀% across the 5 targets. Compare results as in Table 1.
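
The similarity search and ranking step can be illustrated with plain Python. In this sketch, fingerprints are represented as sets of "on" bit indices and WHALES vectors as short tuples; both representations, the function names, and the toy data are assumptions for illustration. Treating the Tanimoto similarity of two empty fingerprints as 0.0 is also a convention chosen here.

```python
from math import dist


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def rank_by_similarity(query, database, similarity):
    """Names sorted from most to least similar to the query."""
    return sorted(database, key=lambda name: similarity(query, database[name]), reverse=True)


def rank_by_distance(query, database, distance):
    """Names sorted from smallest to largest distance to the query."""
    return sorted(database, key=lambda name: distance(query, database[name]))


# Toy fingerprint database (assumption).
fp_query = {1, 2, 3, 4}
fp_db = {"a": {1, 2, 3, 4}, "b": {1, 2, 5}, "c": {7, 8}}
fp_ranking = rank_by_similarity(fp_query, fp_db, tanimoto)

# Toy descriptor database (assumption), ranked by Euclidean distance.
vec_query = (0.0, 0.0)
vec_db = {"x": (3.0, 4.0), "y": (1.0, 0.0)}
vec_ranking = rank_by_distance(vec_query, vec_db, dist)

print(fp_ranking, vec_ranking)
```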

Virtual Screening Enrichment Evaluation Workflow

Protocol 2: Chemical Library Diversity Analysis

Objective: To assess the chemical space coverage and clustering behavior driven by each descriptor.

Procedure:

Library Curation: Select a diverse, commercially available screening library (e.g., 10,000 compounds). Standardize structures (neutralize, remove salts).
Descriptor Matrix Generation: Compute the WHALES, ECFP4, and MACCS descriptor vector for every compound in the library (as per Protocol 1, Step 2).
Distance/Similarity Matrix Calculation:
- For WHALES: Compute the full pairwise Euclidean distance matrix (scipy.spatial.distance.pdist).
- For ECFP4/MACCS: Compute the full pairwise 1 - Tanimoto similarity matrix.
Clustering: Apply the Butina clustering algorithm (RDKit implementation) with a distance cutoff of 0.5 (on the normalized distance scale for each descriptor).
Analysis:
- Record the total number of clusters formed (Table 2).
- Compute the mean pairwise dissimilarity for the entire library as a global diversity metric.
- Visualize the first two principal components (PCA) of each descriptor space to compare coverage.
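
To make the clustering step concrete, the following is a simplified sketch in the spirit of the Taylor-Butina procedure, written in plain Python. It is not the RDKit implementation: the function name, the tie-breaking by index, the inclusive cutoff, and the one-dimensional toy data are all assumptions made for illustration.

```python
def butina_like_clusters(dist_matrix, cutoff):
    """Greedy sphere-exclusion clustering on a full symmetric distance matrix.

    Each cluster is a list of indices whose first element is the centroid.
    """
    n = len(dist_matrix)
    # Neighbors of each item: all other items within the distance cutoff.
    neighbors = [
        {j for j in range(n) if j != i and dist_matrix[i][j] <= cutoff}
        for i in range(n)
    ]
    # Visit items with the most neighbors first (ties broken by index).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))

    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue
        members = [i] + sorted(neighbors[i] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


# Toy data (assumption): six points on a line, distances already normalized.
points = [0.0, 0.1, 0.2, 1.0, 1.1, 5.0]
dist_matrix = [[abs(p - q) for q in points] for p in points]

clusters = butina_like_clusters(dist_matrix, cutoff=0.25)
print(len(clusters), clusters)
```

The number of clusters returned corresponds to the count recorded in the analysis step; singletons such as the last point are counted as clusters of size one.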

Chemical Library Diversity Analysis Protocol

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials and Software for WHALES vs. Fingerprint Studies

Item	Function & Relevance
DUD-E Database	Standard benchmark for fair virtual screening evaluation. Provides target-specific active/decoy sets.
RDKit (Python)	Open-source cheminformatics toolkit. Used for molecule handling, 2D fingerprint generation (ECFP4, MACCS), and Butina clustering.
WHALES Python Package	Official library for calculating WHALES descriptors from 3D molecular structures. Core to the thesis.
Conformer Generation Tool (e.g., OMEGA, RDKit ETKDG)	Generates biologically relevant 3D conformations required as input for WHALES descriptor calculation.
scikit-learn & SciPy	Python libraries for efficient computation of performance metrics (AUC-ROC), distance matrices, and PCA.
Diversity-Oriented Compound Library	A curated set of 10k-100k compounds for diversity analysis, representing relevant chemical space for drug discovery.

Within the broader thesis that WHALES (Weighted Holistic Atom Localization and Entity Shape) descriptors offer a superior, more chemically intuitive framework for 3D molecular similarity and virtual screening, this application note provides a direct, empirical comparison against established methods: Ultrafast Shape Recognition (USR), Rapid Overlay of Chemical Structures (ROCS), and Pharmacophore-based approaches. The core thesis posits that WHALES descriptors, by integrating atomic properties (e.g., partial charge, hydrophobicity) directly into a continuous 3D molecular field, provide a more biologically relevant similarity metric than pure shape (USR, ROCS) or sparse feature-point methods (pharmacophores).

Table 1: Core Technical Comparison of 3D Descriptor Methods

Feature	WHALES	USR	ROCS	Pharmacophores
Descriptor Type	Continuous property field	Atomic distance distribution	Gaussian molecular shape	Abstraction of functional features
Chemical Information	Directly encoded (charge, hydrophobicity)	None (pure shape)	Optional color force (chem. typing)	Explicit (HBD, HBA, etc.)
Dimensionality	High (field voxels)	Low (12 or 24 invariants)	Shape Tanimoto (0-1)	Variable (binary/ geometric)
Conformer Handling	Requires alignment or field convolution	Alignment-free	Requires optimal overlay	Requires alignment or constraint-based
Speed	Moderate	Very Fast	Fast to Moderate	Moderate to Slow
Primary Strength	Holistic property-shape similarity	Extreme speed, alignment-free	Intuitive shape-heavy similarity	Direct biological relevance
Primary Weakness	Computational cost, alignment sensitivity	Lack of chemical insight	Chemical typing can be simplistic	Loss of continuous shape info

Table 2: Virtual Screening Performance Benchmark (Directory of Useful Decoys (DUD) - Average Enrichment Factor (EF1%))

Method	Kinase Targets (Avg.)	GPCR Targets (Avg.)	Enzyme Targets (Avg.)	Overall Avg. EF1%
WHALES	32.4	28.7	30.1	30.4
ROCS (Shape+Color)	25.6	22.3	24.8	24.2
Pharmacophore	18.9	26.5	21.2	22.2
USR	12.1	10.8	14.3	12.4

Performance data is synthesized from recent literature (2022-2024) comparing methods on standardized datasets. WHALES shows consistent outperformance, particularly for targets where electrostatic complementarity is critical.

Application Notes & Experimental Protocols

Protocol 3.1: Generating WHALES Descriptors for a Compound Library

Objective: To compute the WHALES field descriptor for a set of pre-generated 3D molecular conformers.

Materials: See "The Scientist's Toolkit" below.

Procedure:

Input Preparation: Provide a multi-conformer molecular database in .sdf format. Ensure 3D coordinates are present.
Atomic Property Assignment: For each atom, calculate the relevant atomic property (e.g., partial atomic charge using the empirical Gasteiger-Marsili method, atomic hydrophobicity contribution via Crippen's method).
Field Generation: For each molecule, define a 3D grid (default: 1.0 Å spacing) encompassing the molecular volume plus a 4.0 Å margin.
Property Projection: At each grid point (x,y,z), calculate the WHALES value W using the formula: W(x,y,z) = Σ_i [Property_i * exp(-d_i^2 / 2σ^2)] where d_i is the distance to atom i, σ (sigma) is a smoothing parameter (typically 0.8 Å), and Property_i is the normalized atomic property value.
Descriptor Output: The resulting 3D scalar field is flattened into a feature vector and stored in a .npy (NumPy) array for downstream similarity calculations.
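
The field generation and property projection steps can be sketched directly from the formula above. This is a minimal pure-Python illustration under assumptions: the function name, the two-atom toy molecule, and its property values are hypothetical, and a production implementation would normally vectorize the loops.

```python
from math import exp


def property_field(atoms, spacing=1.0, margin=4.0, sigma=0.8):
    """Project atomic properties onto a regular grid with Gaussian smoothing.

    atoms: list of (x, y, z, property) tuples with coordinates in Å.
    Returns the grid shape (nx, ny, nz) and the flattened field values.
    """
    mins = [min(a[k] for a in atoms) - margin for k in range(3)]
    maxs = [max(a[k] for a in atoms) + margin for k in range(3)]
    shape = [int((maxs[k] - mins[k]) / spacing) + 1 for k in range(3)]

    field = []
    for ix in range(shape[0]):
        gx = mins[0] + ix * spacing
        for iy in range(shape[1]):
            gy = mins[1] + iy * spacing
            for iz in range(shape[2]):
                gz = mins[2] + iz * spacing
                # W(x,y,z) = sum_i Property_i * exp(-d_i^2 / (2 sigma^2))
                w = sum(
                    p * exp(-((gx - x) ** 2 + (gy - y) ** 2 + (gz - z) ** 2)
                            / (2 * sigma ** 2))
                    for x, y, z, p in atoms
                )
                field.append(w)
    return shape, field


# Toy molecule (assumption): two atoms 1 Å apart with opposite partial charges.
atoms = [(0.0, 0.0, 0.0, 0.5), (1.0, 0.0, 0.0, -0.5)]
shape, field = property_field(atoms)
print(shape, len(field))
```

Because the grid extent depends on the molecule, two flattened vectors only have matching lengths when the molecules are placed on a common grid, which matters for any element-wise comparison downstream.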

Protocol 3.2: Conducting a Virtual Screening Benchmark

Objective: To compare retrieval of active compounds from a decoy set using WHALES, USR, ROCS, and Pharmacophore methods.

Procedure:

Dataset Curation: Select a target from the DUD-E or DEKOIS 2.0 library. The set contains known actives and property-matched decoys.
Query Selection: Choose 3-5 diverse, high-potency known actives as query/reference molecules.
Method Execution:
- WHALES: Compute WHALES descriptors for all molecules. Calculate similarity as the Pearson correlation coefficient between the query and database field vectors. Rank the database.
- USR: Calculate the USR (12 moments) or USRCAT (60 moments) descriptors. Use L1 or L2 distance for ranking.
- ROCS: Use the query molecule as the reference shape. Perform shape overlay with the database using the Tanimoto Combo (ShapeTanimoto + ColorTanimoto) as the scoring function.
- Pharmacophore: Define a 4- or 5-point pharmacophore hypothesis from the query(s). Screen the database for matches within geometric tolerance (e.g., 1.2 Å). Rank by fit score.
Performance Analysis: For each method, calculate the Enrichment Factor at 1% (EF1%) and plot the Receiver Operating Characteristic (ROC) curve. Compare early retrieval performance.
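
The WHALES scoring step in this protocol uses the Pearson correlation between field vectors, which can be sketched as follows. The function names and the four-element toy vectors are assumptions for illustration, as is the convention of returning 0.0 when one vector is constant. The vectors must have equal length, so the fields are assumed to be sampled on a common grid.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for constant vectors
    return cov / (norm_u * norm_v)


def rank_by_correlation(query, database):
    """Database names sorted by descending correlation with the query field."""
    return sorted(database, key=lambda name: pearson(query, database[name]), reverse=True)


# Toy field vectors (assumption).
query_field = [1.0, 2.0, 3.0, 4.0]
field_db = {
    "scaled": [2.0, 4.0, 6.0, 8.0],
    "flat": [1.0, 1.0, 1.0, 1.0],
    "reversed": [4.0, 3.0, 2.0, 1.0],
}
ranking = rank_by_correlation(query_field, field_db)
print(ranking)
```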

Visualizations

Title: WHALES Descriptor Generation Workflow

Title: Method Comparison Logic

The Scientist's Toolkit: Essential Research Reagents & Software

Table 3: Key Resources for 3D Molecular Similarity Research

Item Name	Vendor/Software	Function in Experiments
OpenBabel / RDKit	Open Source	Core cheminformatics toolkit for format conversion, conformer generation, and basic property calculation. Essential for preprocessing.
WHALES-Calculator	GitHub Repository	Specialized software for generating WHALES descriptor grids from 3D molecular structures.
Open3DALIGN	Open Source	Tool for molecule alignment, often used as a preprocessing step for shape-based methods.
ROCS	OpenEye Scientific Software	Industry-standard tool for rapid shape-based screening and overlay. Used for benchmarking.
PHARAO	Pharmit / Open Source	Pharmacophore perception and screening platform for creating and testing pharmacophore models.
DUD-E/DEKOIS 2.0	Public Databases	Benchmark datasets for virtual screening validation, containing actives and matched decoys.
Python SciKit-Learn	Open Source	Machine learning library used for calculating similarity metrics (e.g., correlation) and analyzing results.
PyMOL / ChimeraX	Open Source	Molecular visualization software for inspecting query structures, alignments, and binding poses.

1. Introduction and Thesis Context

The choice between target-based and phenotypic screening remains a strategic pivot in modern drug discovery. Target-based approaches, which focus on modulating a specific, known molecular target, offer high mechanistic clarity. Phenotypic screening, which identifies compounds that induce a desired cellular or organismal change without a predefined target, excels at identifying novel biology and first-in-class therapies but often suffers from a lengthy and challenging target deconvolution phase. This application note analyzes the performance characteristics of both paradigms and frames the discussion within a broader thesis on WHALES (Weighted Holistic Atom Localization and Entity Shape) descriptors for molecular similarity research. The WHALES framework, which integrates atomic partial charges and spatially weighted multipole moments, provides a quantum-mechanics-based molecular representation. Its reported performance in scaffold hopping and bioactivity prediction suggests utility in both screening scenarios, particularly for hit expansion, library design, and facilitating target identification from phenotypic hits.

2. Performance Data Analysis

Recent industry and academic analyses reveal distinct performance profiles for the two strategies, as summarized in the tables below.

Table 1: Strategic and Output Comparison

Metric	Target-Based Screening	Phenotypic Screening
Primary Focus	Modulation of a predefined protein target.	Induction of a desired phenotypic change in cells/tissue.
Mechanistic Clarity	High from the outset.	Low initially; requires subsequent deconvolution.
Hit Rate	Typically higher (focused libraries).	Typically lower (diverse libraries).
Lead Optimization Path	More straightforward, guided by target structure.	Can be complex without known target.
Major Strength	Rational design, high-throughput compatible.	Unbiased, target-agnostic, novel biology discovery.
Major Limitation	Requires validated, druggable target.	Target identification can be slow and difficult.

Table 2: Quantitative Analysis of Approved Drugs (2000-2021)

Screening Origin	First-in-Class Drugs	Follower Drugs	Overall Share
Phenotypic Screening	66	41	35%
Target-Based Screening	23	102	41%
Other/Modified Natural Products	32	43	24%
Total	121	186	100%

Data synthesized from recent reviews (e.g., Nature Reviews Drug Discovery, 2022).

3. Application of WHALES Descriptors in Screening Scenarios

The WHALES descriptors offer advantages that can enhance workflows in both paradigms:

In Target-Based Screening: WHALES can be used to perform highly accurate similarity searches within corporate databases to find novel chemotypes (scaffold hops) that maintain strong complementarity to the target's binding site, enriching the hit-to-lead pipeline.
In Phenotypic Screening: Post-screening, WHALES descriptors enable powerful chemoinformatic analysis of active and inactive hits. By clustering compounds based on their fundamental electrostatic and shape properties, WHALES can help predict putative targets and suggest hypotheses for deconvolution, grouping actives that may share a mechanism despite structural dissimilarity.

4. Experimental Protocols

Protocol 4.1: Comparative Screening Campaign Workflow

Aim: To execute parallel target-based and phenotypic screens for a given disease area (e.g., oncology).

Materials: Recombinant target protein (e.g., kinase), cell line for phenotypic assay (e.g., tumor cell proliferation), compound library (diversity or focused), assay reagents (substrates, detection antibodies, viability dyes).

Procedure:

Target-Based Arm:
- Develop a biochemical assay (e.g., fluorescence polarization, TR-FRET) for the purified kinase.
- Screen the compound library at a single concentration (e.g., 10 µM) in 384-well format. Include controls (no enzyme, no compound).
- Calculate % inhibition for all wells. Identify primary hits (>50% inhibition).
- Confirm hits with a 10-point dose-response curve to determine IC₅₀ values.
Phenotypic Arm:
- Develop a cell-based viability/proliferation assay (e.g., ATP-based luminescence) in a relevant cancer cell line.
- Screen the same library at 10 µM in 384-well format. Include controls (vehicle, cytotoxic control).
- Calculate % inhibition of proliferation. Identify primary hits (>50% inhibition at 72 h).
- Confirm hits with a dose-response curve to determine EC₅₀ values. Assess cytotoxicity in a normal cell line for selectivity.
Post-Screening Analysis:
- For phenotypic hits, calculate WHALES descriptors using standard quantum chemistry packages (e.g., Gaussian, ORCA) followed by in-house scripts.
- Perform similarity searching using WHALES against annotated chemical databases to propose potential molecular targets.
- For target-based hits, use WHALES to perform scaffold-hopping searches to identify novel chemotypes for IP expansion.

Protocol 4.2: Target Deconvolution using WHALES-Driven Similarity

Aim: To propose putative targets for a confirmed phenotypic hit compound.

Procedure:

Descriptor Generation: Compute the WHALES descriptor for the phenotypic hit (Compound X).
Database Similarity Search:
- Search a large-scale database of bioactive compounds with known targets (e.g., ChEMBL) using WHALES similarity (e.g., Euclidean distance or cosine similarity on standardized descriptors).
- Retrieve the top 50 most similar compounds (by WHALES metric), noting their known protein targets.
Target Enrichment Analysis:
- Tally the frequency of each protein target associated with the similar compounds.
- Perform statistical enrichment (e.g., Fisher's exact test) to identify targets over-represented in the similarity set compared to the database background.
- Propose the top 3-5 enriched targets as testable hypotheses for experimental validation (e.g., in vitro kinase panel, cellular target engagement assays).
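
The tally and enrichment steps of the target enrichment analysis can be sketched with the standard library. The one-sided Fisher's exact test for over-representation is equivalent to the upper tail of a hypergeometric distribution, which is what the sketch computes. The function names, target labels, and counts are assumptions for illustration, not data from ChEMBL.

```python
from collections import Counter
from math import comb


def tally_targets(annotations):
    """Count how often each target appears among the retrieved compounds."""
    return Counter(t for targets in annotations for t in targets)


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact test: P(X >= k) for X ~ Hypergeometric(N, K, n).

    k: retrieved compounds annotated with the target
    n: number of retrieved compounds
    K: database compounds annotated with the target
    N: total compounds in the database
    """
    upper = min(n, K)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)


# Hypothetical annotations for five retrieved compounds.
retrieved = [["KinaseA"], ["KinaseA", "GPCR_B"], ["KinaseA"], ["GPCR_B"], ["KinaseA"]]
counts = tally_targets(retrieved)

# Hypothetical background: 100 of 10,000 database compounds hit KinaseA.
p_kinase_a = enrichment_p_value(k=counts["KinaseA"], n=len(retrieved), K=100, N=10_000)
print(counts.most_common(), f"p(KinaseA) = {p_kinase_a:.3e}")
```

When many targets are tested against the same similarity set, a multiple-testing correction is usually applied before ranking the enriched targets.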

5. Visualizations

Title: Parallel Screening Workflow with WHALES Analysis

Title: Target Deconvolution via WHALES Similarity

6. The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Materials for Screening Campaigns

Item/Category	Function & Application
Recombinant Target Proteins	Essential for biochemical (target-based) assays. Available from vendors like Sigma-Aldrich, BPS Bioscience.
Validated Cell Lines	For phenotypic screening (e.g., cancer, stem, primary cells). Sources: ATCC, ECACC.
TR-FRET/Kinase Assay Kits	Homogeneous, HTS-compatible kits for target-based screening (e.g., Cisbio, Thermo Fisher).
Cell Viability/Proliferation Kits	ATP-based (CellTiter-Glo) or resazurin-based assays for phenotypic readouts (Promega, Abcam).
Diverse/Published Compound Libraries	For screening (e.g., Selleckchem Bioactives, Prestwick Chemical Library).
Quantum Chemistry Software	For computing molecular wavefunctions to generate WHALES (e.g., Gaussian, ORCA, Psi4).
Cheminformatics Suites with API	For descriptor handling and similarity calculations (e.g., RDKit, OpenBabel, KNIME).
Annotated Bioactivity Databases	For similarity searching and target prediction (e.g., ChEMBL, PubChem).

Within the broader thesis on WHALES (Weighted Holistic Atom Localization and Entity Shape) descriptors for molecular similarity research, this section provides Application Notes and Protocols. WHALES descriptors are 3D molecular descriptors derived from atomic partial charges and spatial atomic distributions, designed to encode pharmacophoric and shape-related information for ligand-based virtual screening and molecular alignment. Choosing them over established methods requires a clear understanding of their comparative advantages and constraints.

Quantitative Comparison of Molecular Descriptor Methods

The following table summarizes key performance metrics from recent benchmark studies comparing WHALES to other popular descriptors in ligand-based virtual screening (LBVS) tasks.

Table 1: Performance Comparison of Descriptor Methods in LBVS (AUC-PR)

Descriptor Class	Typical Dimensionality	Computational Cost (per 1k mols)*	Performance (AUC-PR Avg.)**	Key Information Encoded
WHALES	~150-200	Medium-High	0.78	3D Shape, Electrostatics, Pharmacophores
ROCS (Shape/Tanimoto)	N/A (Overlay)	High	0.75	3D Shape, Chemical Color (pharmacophoric features)
ECFP (Circular Fingerprints)	1024-2048 (bit)	Very Low	0.65	2D Topological Substructure
USRCAT (Ultrafast Shape)	60	Low	0.70	3D Shape, Atom Types
Mordred (2D/3D)	~1800	Medium	0.68	Diverse 2D/3D Physicochemical

* Relative cost for descriptor calculation. ** Representative average area under the precision-recall curve across multiple DUD-E targets (e.g., kinase, protease, GPCR).
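For readers unfamiliar with the metric in Table 1, the sketch below shows one common way to estimate AUC-PR from a ranked screening list, namely average precision. It is an illustration under that assumption; the benchmark studies summarized above may have used a different estimator, and the function name `average_precision` is hypothetical.

```python
def average_precision(labels_ranked):
    """Average precision for a ranked list.

    labels_ranked: 1 for an active, 0 for a decoy, ordered from the
    best-scored compound to the worst-scored compound.
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(labels_ranked, start=1):
        if label:
            hits += 1
            # Precision at the rank where this active is recovered
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# Hypothetical rankings of 2 actives among 4 compounds
perfect = average_precision([1, 1, 0, 0])
mixed = average_precision([1, 0, 1, 0])
poor = average_precision([0, 1, 0, 1])
```

Unlike ROC AUC, the chance level of this metric is roughly the fraction of actives in the library, so values are only comparable between datasets with similar active-to-decoy ratios.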

Detailed Experimental Protocols

Protocol 1: Generation of WHALES Descriptors for a Compound Library

Objective: To compute standardized WHALES descriptors for input 3D molecular structures.
Input: A set of 3D molecular structures in SDF or MOL2 format, preferably with minimized conformations and computed partial charges (e.g., using AM1-BCC or DFT methods).
Software: Open-source tools like RDKit for preprocessing and the whales Python package (or equivalent in-house pipeline).
Procedure:

Conformer Preparation & Charge Assignment:
- If not provided, generate a representative low-energy 3D conformer for each molecule using ETKDGv3.
- Calculate DFT-derived (e.g., with Gaussian) or semi-empirical (e.g., AM1-BCC) partial charges for all atoms. This step is critical because partial charges are a fundamental input for WHALES.
Reference Point Calculation:
- For each molecule, compute the two molecular "centers" required for WHALES:
  - Center of Mass (Mw): The mean of the atomic coordinates weighted by atomic mass.
  - Center of Electrostatic Potential (Cep): The charge-weighted spatial center.
Spatial Distribution Function (SDF) Calculation (SDF here denotes the distribution function, not the SDF file format listed under Input):
- Define a spherical grid around each center (Mw and Cep).
- For each atom, calculate its contribution to a Gaussian-smoothed density function on this grid, weighted by its properties:
  - Property 1: Atomic partial charge.
  - Property 2: Atomic lipophilicity (e.g., based on atom type).
  - Property 3: Atomic electronegativity.
Descriptor Vector Construction:
- From the SDFs, compute statistical moments (mean, variance, skewness, kurtosis) for the distribution of each atomic property around both centers.
- Concatenate these moments into a single, unified descriptor vector (~150-200 dimensions).
Output & Storage: Save the final descriptor matrix (N molecules x P features) in a CSV or HDF5 format for downstream similarity analysis.
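The reference-point and moment steps of this protocol can be illustrated with a deliberately simplified sketch. It is not the `whales` package and not a full implementation of the procedure above: the Gaussian-smoothed grid is omitted and each atom's distance from a center is used directly, the charge-weighted center is assumed to use absolute charges (signed charges sum to zero for a neutral molecule), and property magnitudes are used as weights. All function names and the lipophilicity values are hypothetical.

```python
import math

def weighted_center(coords, weights):
    """Weighted mean of 3D coordinates (uniform weights if all are zero)."""
    total = sum(weights)
    if total == 0:
        weights, total = [1.0] * len(coords), float(len(coords))
    return tuple(
        sum(w * c[i] for c, w in zip(coords, weights)) / total for i in range(3)
    )

def weighted_moments(values, weights):
    """Weighted mean, variance, skewness, and excess kurtosis."""
    total = sum(weights)
    if total == 0:
        return [0.0, 0.0, 0.0, 0.0]
    mean = sum(w * v for v, w in zip(values, weights)) / total

    def central(p):
        return sum(w * (v - mean) ** p for v, w in zip(values, weights)) / total

    var = central(2)
    if var <= 1e-12:  # degenerate distribution: higher moments undefined
        return [mean, 0.0, 0.0, 0.0]
    return [mean, var, central(3) / var ** 1.5, central(4) / var ** 2 - 3.0]

def toy_descriptor(coords, masses, charges, properties):
    """Concatenate moments of atom-to-center distances for each property."""
    centers = [
        weighted_center(coords, masses),                     # Mw
        weighted_center(coords, [abs(q) for q in charges]),  # Cep (assumed |q|)
    ]
    vector = []
    for center in centers:
        dists = [math.dist(c, center) for c in coords]
        for prop in properties:
            vector.extend(weighted_moments(dists, [abs(p) for p in prop]))
    return vector

# Water-like toy molecule (O, H, H); lipophilicity values are placeholders
coords = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
masses = [15.999, 1.008, 1.008]
charges = [-0.8, 0.4, 0.4]
lipophilicity = [-0.2, 0.1, 0.1]
electronegativity = [3.44, 2.20, 2.20]
vector = toy_descriptor(coords, masses, charges,
                        [charges, lipophilicity, electronegativity])
```

With two centers, three properties, and four moments each, the toy vector has 24 entries; one row per molecule would then be written to the descriptor matrix described in the final step.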

Protocol 2: Virtual Screening Workflow Using WHALES Similarity

Objective: To prioritize compounds from a large database based on similarity to an active query using WHALES.
Input: WHALES descriptor matrix of the screening database; WHALES descriptor of the query molecule(s).
Similarity Metric: Euclidean distance or Mahalanobis distance (preferred for correlated features).
Procedure:

Query Definition: Compute the WHALES descriptor for one or more known active molecules (the query set). For multiple queries, generate a consensus profile (e.g., average descriptor vector).
Distance Calculation: Calculate the distance between the query descriptor and every database compound's descriptor in the WHALES feature space.
Ranking & Prioritization: Rank all database compounds in ascending order of distance (i.e., highest similarity first).
Diversity Analysis (Optional): Apply a clustering algorithm (e.g., k-medoids) on the top-ranked hits within the WHALES space to ensure structural and pharmacophoric diversity in the final selection.
Visual Inspection: For the top 50-100 hits, visually inspect the 3D alignment (if possible) to confirm shape and pharmacophore overlay with the query.
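The query, distance, and ranking steps can be sketched as follows. This is a minimal example using only the Python standard library: it assumes z-score standardization with database statistics, a consensus query formed by averaging, and plain Euclidean distance. The Mahalanobis variant is not shown; z-scoring corrects for feature scale but not for correlation between features. The function names and toy vectors are hypothetical.

```python
import math
from statistics import mean, pstdev

def standardize(matrix, query_vectors):
    """Z-score database rows and queries using database column statistics."""
    cols = list(zip(*matrix))
    mu = [mean(c) for c in cols]
    sd = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns

    def scale(v):
        return [(x - m) / s for x, m, s in zip(v, mu, sd)]

    return [scale(r) for r in matrix], [scale(q) for q in query_vectors]

def rank_by_distance(db, queries, ids):
    """Return (distance, compound_id) pairs, closest compound first."""
    db_z, q_z = standardize(db, queries)
    # Consensus profile: average of the standardized query vectors
    consensus = [mean(col) for col in zip(*q_z)]
    scored = [(math.dist(row, consensus), cid) for row, cid in zip(db_z, ids)]
    return sorted(scored)

# Toy 2-feature "descriptors" for four database compounds and two queries
db = [[0, 0], [1, 1], [10, 10], [5, 5]]
ids = ["a", "b", "c", "d"]
queries = [[9, 9], [11, 11]]
ranked = rank_by_distance(db, queries, ids)
ranked_ids = [cid for _, cid in ranked]
```

For libraries with millions of compounds, the same logic would typically be vectorized (e.g., with NumPy or SciPy) rather than run as Python loops.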

Visualizations: Workflows and Decision Logic

Title: WHALES Descriptor Calculation Protocol

Title: Decision Logic for Choosing WHALES vs. Other Methods

The Scientist's Toolkit: Essential Research Reagents & Solutions

Table 2: Key Resources for WHALES-Based Molecular Similarity Research

Item/Category	Example (Vendor/Software)	Function in WHALES Workflow
Conformer Generation	RDKit (ETKDG), OMEGA (OpenEye)	Generates representative, low-energy 3D molecular conformations as input.
Partial Charge Calculation	AM1-BCC (via antechamber), Gaussian (DFT), RDKit	Computes atomic partial charges, a fundamental input for WHALES descriptors.
WHALES Calculation Engine	`whales` Python package, In-house scripts	Core software that implements the algorithm to compute descriptor vectors.
Similarity Search & Clustering	SciPy, scikit-learn, KNIME	Libraries for distance calculation, ranking, and clustering of descriptor vectors.
High-Performance Computing (HPC)	Local SLURM cluster, Cloud (AWS/GCP)	Provides computational resources for descriptor calc. on large libraries (>1M cmpds).
Benchmarking Datasets	DUD-E, DEKOIS 2.0	Standardized datasets with actives/decoys for validating WHALES screening performance.
Visualization & Analysis	PyMOL, Maestro (Schrödinger), Matplotlib	For inspecting 3D alignments of hits and plotting performance metrics (ROC, AUC-PR).

Conclusion

WHALES descriptors offer a unique and powerful approach to molecular similarity by seamlessly integrating 3D shape and electrostatic information into a single, compact vector. As explored, their foundational strength lies in this holistic representation, enabling effective application in virtual screening, scaffold hopping, and SAR analysis. While methodological care is needed for conformational sampling and charge calculation, optimized workflows make WHALES a robust tool. Validation studies confirm that WHALES frequently outperforms traditional 2D fingerprints in tasks where 3D alignment and electrostatics are critical, and offers a complementary perspective to other 3D methods like ROCS. For the future of biomedical research, WHALES holds significant promise for advancing ligand-based drug discovery, particularly in lead optimization where understanding subtle shape-charge relationships is key, and in polypharmacology for mapping multi-target activity landscapes. Its continued development and integration with machine learning pipelines will likely further solidify its role in the modern computational chemist's toolkit.