
predictors

The submodule that contains the predictors, i.e., the drug-target affinity (DTA) prediction models, implemented in the DebiasedDTA study. The implemented predictors are BPEDTA, DeepDTA, and LMDTA. Abstract classes are also available to quickly train a custom DTA prediction model with DebiasedDTA.

Predictor

Bases: ABC

An abstract class that implements the interface of a predictor in pydebiaseddta. The predictors are characterized by an n_epochs attribute and a train function, whose signatures are implemented by this class. Any instance of the Predictor class can be trained in the DebiasedDTA training framework, and therefore Predictor can be inherited to debias custom DTA prediction models.

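For illustration, a minimal custom predictor might look like the sketch below. The MeanAffinityPredictor class and its weighted-mean baseline are illustrative assumptions, not part of the library; the point is only which signatures a child class has to provide so that DebiasedDTA can reweight its training samples.

from typing import Any, List

import numpy as np

from pydebiaseddta.predictors.abstract_predictors import Predictor


class MeanAffinityPredictor(Predictor):
    """Toy baseline that predicts the weighted mean of the training affinities."""

    def __init__(self, n_epochs: int = 1):
        super().__init__(n_epochs)
        self.mean_affinity = None

    def train(
        self,
        train_ligands: List[Any],
        train_proteins: List[Any],
        train_labels: List[float],
        val_ligands: List[Any] = None,
        val_proteins: List[Any] = None,
        val_labels: List[float] = None,
        sample_weights_by_epoch: List[np.array] = None,
    ) -> None:
        # Honor the per-epoch sample weights so DebiasedDTA can reweight the samples;
        # for this one-shot baseline only the final epoch's weights matter.
        weights = (
            sample_weights_by_epoch[-1]
            if sample_weights_by_epoch is not None
            else np.ones(len(train_labels))
        )
        self.mean_affinity = float(np.average(train_labels, weights=weights))

    def predict(self, ligands: List[Any], proteins: List[Any]) -> List[float]:
        return [self.mean_affinity] * len(ligands)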
Source code in pydebiaseddta/predictors/abstract_predictors.py
class Predictor(ABC):
    """An abstract class that implements the interface of a predictor in `pydebiaseddta`.
    The predictors are characterized by an `n_epochs` attribute and a `train` function, 
    whose signatures are implemented by this class. 
    Any instance of the `Predictor` class can be trained in the `DebiasedDTA` training framework,
    and therefore, `Predictor` can be inherited to debias custom DTA prediction models.
    """

    @abstractmethod
    def __init__(self, n_epochs: int, *args, **kwargs) -> None:
        """An abstract constructor for `Predictor` to display that `n_epochs` is a necessary attribute for children classes.

        Parameters
        ----------
        n_epochs : int
            Number of epochs to train the model.
        """
        self.n_epochs = n_epochs

    @abstractmethod
    def train(
        self,
        train_ligands: List[Any],
        train_proteins: List[Any],
        train_labels: List[float],
        val_ligands: List[Any] = None,
        val_proteins: List[Any] = None,
        val_labels: List[float] = None,
        sample_weights_by_epoch: List[np.array] = None,
    ) -> Any:
        """An abstract method to train DTA prediction models.
        The inputs can be of any biomolecule representation type.
        However, the training procedure must support sample weighting in every epoch.

        Parameters
        ----------
        train_ligands : List[Any]
            The training ligands as a List.
        train_proteins : List[Any]
            The training proteins as a List.
        train_labels : List[float]
            Affinity scores of the training protein-compound pairs
        val_ligands : List[Any], optional
            Validation ligands as a List, in case validation scores are measured during training, by default `None`
        val_proteins : List[Any], optional
            Validation proteins as a List, in case validation scores are measured during training, by default `None`
        val_labels : List[float], optional
            Affinity scores of validation protein-compound pairs as a List, in case validation scores are measured during training, by default `None`
        sample_weights_by_epoch : List[np.array], optional
            Weight of each training sample in every epoch, as a List of `n_epochs` arrays whose lengths equal the training set size, by default `None` and uniform weights are used.

        Returns
        -------
        Any
            The function is free to return any value after its training, including `None`.
        """
        pass

__init__(n_epochs, *args, **kwargs) abstractmethod

An abstract constructor for Predictor to indicate that n_epochs is a required attribute for child classes.

Parameters:

Name Type Description Default
n_epochs int

Number of epochs to train the model.

required
Source code in pydebiaseddta/predictors/abstract_predictors.py
@abstractmethod
def __init__(self, n_epochs: int, *args, **kwargs) -> None:
    """An abstract constructor for `Predictor` to display that `n_epochs` is a necessary attribute for children classes.

    Parameters
    ----------
    n_epochs : int
        Number of epochs to train the model.
    """
    self.n_epochs = n_epochs

train(train_ligands, train_proteins, train_labels, val_ligands=None, val_proteins=None, val_labels=None, sample_weights_by_epoch=None) abstractmethod

An abstract method to train DTA prediction models. The inputs can be of any biomolecule representation type. However, the training procedure must support sample weighting in every epoch.

Parameters:

Name Type Description Default
train_ligands List[Any]

The training ligands as a List.

required
train_proteins List[Any]

The training proteins as a List.

required
train_labels List[float]

Affinity scores of the training protein-compound pairs

required
val_ligands List[Any], optional

Validation ligands as a List, in case validation scores are measured during training, by default None

None
val_proteins List[Any], optional

Validation proteins as a List, in case validation scores are measured during training, by default None

None
val_labels List[float], optional

Affinity scores of validation protein-compound pairs as a List, in case validation scores are measured during training, by default None

None
sample_weights_by_epoch List[np.array], optional

Weight of each training sample in every epoch, as a List of n_epochs arrays whose lengths equal the training set size, by default None and uniform weights are used.

None

Returns:

Type Description
Any

The function is free to return any value after its training, including None.

Source code in pydebiaseddta/predictors/abstract_predictors.py
@abstractmethod
def train(
    self,
    train_ligands: List[Any],
    train_proteins: List[Any],
    train_labels: List[float],
    val_ligands: List[Any] = None,
    val_proteins: List[Any] = None,
    val_labels: List[float] = None,
    sample_weights_by_epoch: List[np.array] = None,
) -> Any:
    """An abstract method to train DTA prediction models.
    The inputs can be of any biomolecule representation type.
    However, the training procedure must support sample weighting in every epoch.

    Parameters
    ----------
    train_ligands : List[Any]
        The training ligands as a List.
    train_proteins : List[Any]
        The training proteins as a List.
    train_labels : List[float]
        Affinity scores of the training protein-compound pairs
    val_ligands : List[Any], optional
        Validation ligands as a List, in case validation scores are measured during training, by default `None`
    val_proteins : List[Any], optional
        Validation proteins as a List, in case validation scores are measured during training, by default `None`
    val_labels : List[float], optional
        Affinity scores of validation protein-compound pairs as a List, in case validation scores are measured during training, by default `None`
    sample_weights_by_epoch : List[np.array], optional
        Weight of each training sample in every epoch, as a List of `n_epochs` arrays whose lengths equal the training set size, by default `None` and uniform weights are used.

    Returns
    -------
    Any
        The function is free to return any value after its training, including `None`.
    """
    pass

TFPredictor

Bases: Predictor

The models in the DebiasedDTA study (BPE-DTA, LM-DTA, DeepDTA) are implemented in TensorFlow. The TFPredictor class provides an abstraction over these models to minimize code duplication. Child classes only implement the model building, biomolecule vectorization, and __init__ functions.
Model training, prediction, and save/load functions are inherited from this class.

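As a sketch of what a child class has to supply, the hypothetical TinyDTA below implements __init__, build, and the two vectorization hooks with deliberately simple stand-ins (character hashing and a small dense network). It is not one of the DebiasedDTA models; it only outlines the expected shape of a TFPredictor child.

import numpy as np
import tensorflow as tf

from pydebiaseddta.predictors.abstract_predictors import TFPredictor


class TinyDTA(TFPredictor):
    """Illustrative child class: naive character hashing and a small dense network."""

    def __init__(self, n_epochs: int = 10, learning_rate: float = 1e-3, batch_size: int = 32,
                 max_smi_len: int = 100, max_prot_len: int = 1000, vocab_size: int = 256, **kwargs):
        self.max_smi_len = max_smi_len
        self.max_prot_len = max_prot_len
        self.vocab_size = vocab_size
        # The parent constructor stores n_epochs/learning_rate/batch_size and calls self.build().
        TFPredictor.__init__(self, n_epochs, learning_rate, batch_size)

    def build(self):
        ligands = tf.keras.Input(shape=(self.max_smi_len,), dtype="int32")
        proteins = tf.keras.Input(shape=(self.max_prot_len,), dtype="int32")
        lig = tf.keras.layers.GlobalAveragePooling1D()(tf.keras.layers.Embedding(self.vocab_size, 32)(ligands))
        prot = tf.keras.layers.GlobalAveragePooling1D()(tf.keras.layers.Embedding(self.vocab_size, 32)(proteins))
        hidden = tf.keras.layers.Dense(64, activation="relu")(tf.keras.layers.Concatenate()([lig, prot]))
        output = tf.keras.layers.Dense(1)(hidden)
        model = tf.keras.Model(inputs=[ligands, proteins], outputs=[output])
        model.compile(optimizer=tf.keras.optimizers.Adam(self.learning_rate), loss="mean_squared_error")
        return model

    def _encode(self, sequences, max_len):
        # Hash each character into [1, vocab_size - 1]; 0 is reserved for padding.
        encoded = np.zeros((len(sequences), max_len), dtype="int32")
        for i, sequence in enumerate(sequences):
            for j, char in enumerate(sequence[:max_len]):
                encoded[i, j] = ord(char) % (self.vocab_size - 1) + 1
        return encoded

    def vectorize_ligands(self, ligands):
        return self._encode(ligands, self.max_smi_len)

    def vectorize_proteins(self, proteins):
        return self._encode(proteins, self.max_prot_len)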
Source code in pydebiaseddta/predictors/abstract_predictors.py
class TFPredictor(Predictor):
    """The models in DebiasedDTA study (BPE-DTA, LM-DTA, DeepDTA) are implemented in Tensorflow.
    `TFPredictor` class provides an abstraction to these models to minimize code duplication.
    The children classes only implement model building, biomolecule vectorization, and `__init__` functions.  
    Model training, prediction, and save/load functions are inherited from this class.
    """

    @abstractmethod
    def __init__(self, n_epochs: int, learning_rate: float, batch_size: int, **kwargs):
        """An abstract constructor for BPE-DTA, LM-DTA, and DeepDTA.
        The constructor sets the common attributes and calls the `build` function.

        Parameters
        ----------
        n_epochs : int
            Number of epochs to train the model.
        learning_rate : float
            The learning rate of the optimization algorithm.
        batch_size : int
            Batch size for training.
        """
        self.n_epochs = n_epochs
        self.learning_rate = learning_rate
        self.batch_size = batch_size
        self.history = dict()
        self.model = self.build()

    @abstractmethod
    def build(self):
        """An abstract function to create the model architecture.
        Every child has to implement this function.
        """
        pass

    @abstractmethod
    def vectorize_ligands(self, ligands):
        """An abstract function to vectorize ligands.
        Every child has to implement this function.
        """
        pass

    @abstractmethod
    def vectorize_proteins(self, proteins):
        """An abstract function to vectorize proteins.
        Every child has to implement this function.
        """
        pass

    @classmethod
    def from_file(cls, path: str):
        """A utility function to load a `TFPredictor` instance from disk.
        All attributes, including the model weights, are loaded.

        Parameters
        ----------
        path : str
            Path to load the prediction model from.

        Returns
        -------
        TFPredictor
            The previously saved model.
        """
        with open(f"{path}/params.json") as f:
            dct = json.load(f)

        instance = cls(**dct)

        instance.model = tf.keras.models.load_model(f"{path}/model")

        with open(f"{path}/history.json") as f:
            instance.history = json.load(f)
        return instance

    def train(
        self,
        train_ligands: List[str],
        train_proteins: List[str],
        train_labels: List[float],
        val_ligands: List[str] = None,
        val_proteins: List[str] = None,
        val_labels: List[float] = None,
        sample_weights_by_epoch: List[np.array] = None,
    ) -> Dict:
        """The common model training procedure for BPE-DTA, LM-DTA, and DeepDTA.
        The models adopt different biomolecule representation methods and model architectures,
        so, the training results are different.
        The training procedure supports validation for tracking, and sample weighting for debiasing.

        Parameters
        ----------
        train_ligands : List[str]
            SMILES strings of the training ligands.
        train_proteins : List[str]
            Amino-acid sequences of the training proteins.
        train_labels : List[float]
            Affinity scores of the training protein-ligand pairs.
        val_ligands : List[str], optional
            SMILES strings of the validation ligands, by default None and no validation is used.
        val_proteins : List[str], optional
            Amino-acid sequences of the validation proteins, by default None and no validation is used.
        val_labels : List[float], optional
            Affinity scores of the validation pairs, by default None and no validation is used.
        sample_weights_by_epoch : List[np.array], optional
            Weight of each training protein-ligand pair during training across epochs.
            This variable must be a List of size $E$ (number of training epochs),
            in which each element is a `np.array` of $N\times 1$, where $N$ is the training set size and 
            each element corresponds to the weight of a training sample.
            By default `None` and no weighting is used.

        Returns
        -------
        Dict
            Training history.
        """
        if sample_weights_by_epoch is None:
            sample_weights_by_epoch = create_uniform_weights(
                len(train_ligands), self.n_epochs
            )

        train_ligand_vectors = self.vectorize_ligands(train_ligands)
        train_protein_vectors = self.vectorize_proteins(train_proteins)
        train_labels = np.array(train_labels)

        val_tuple = None
        if (
            val_ligands is not None
            and val_proteins is not None
            and val_labels is not None
        ):
            val_ligand_vectors = self.vectorize_ligands(val_ligands)
            val_protein_vectors = self.vectorize_proteins(val_proteins)
            val_tuple = (
                [val_ligand_vectors, val_protein_vectors],
                np.array(val_labels),
            )

        train_stats_over_epochs = {"mse": [], "rmse": [], "r2": []}
        # Use fresh lists; a shallow dict copy would share the same list objects with the train stats.
        val_stats_over_epochs = {metric: [] for metric in train_stats_over_epochs}
        for e in range(self.n_epochs):
            self.model.fit(
                x=[train_ligand_vectors, train_protein_vectors],
                y=train_labels,
                sample_weight=sample_weights_by_epoch[e],
                validation_data=val_tuple,
                batch_size=self.batch_size,
                epochs=1,
            )

            train_stats = evaluate_predictions(
                gold_truths=train_labels,
                predictions=self.predict(train_ligands, train_proteins),
                metrics=list(train_stats_over_epochs.keys()),
            )
            for metric, stat in train_stats.items():
                train_stats_over_epochs[metric].append(stat)

            if val_tuple is not None:
                val_stats = evaluate_predictions(
                    gold_truths=val_labels,
                    predictions=self.predict(val_ligands, val_proteins),
                    metrics=list(val_stats_over_epochs.keys()),
                )
                for metric, stat in val_stats.items():
                    val_stats_over_epochs[metric].append(stat)

        self.history["train"] = train_stats_over_epochs
        if val_tuple is not None:
            self.history["val"] = val_stats_over_epochs

        return self.history

    def predict(self, ligands: List[str], proteins: List[str]) -> List[float]:
        """Predicts the affinities of a `List` of protein-ligand pairs via the trained DTA prediction model,
    *i.e.*, BPE-DTA, LM-DTA, or DeepDTA.

        Parameters
        ----------
        ligands : List[str]
            SMILES strings of the ligands.
        proteins : List[str]
            Amino-acid sequences of the proteins.

        Returns
        -------
        List[float]
            Predicted affinity scores by DTA prediction model.
        """        
        ligand_vectors = self.vectorize_ligands(ligands)
        protein_vectors = self.vectorize_proteins(proteins)
        return self.model.predict([ligand_vectors, protein_vectors]).tolist()

    def save(self, path: str) -> None:
        """A utility function to save a `TFPredictor` instance to the disk.
        All attributes, including the model weights, are saved.

        Parameters
        ----------
        path : str
            Path to save the predictor.
        """        
        self.model.save(f"{path}/model")

        with open(f"{path}/history.json", "w") as f:
            json.dump(self.history, f, indent=4)

        donot_copy = {"model", "history"}
        dct = {k: v for k, v in self.__dict__.items() if k not in donot_copy}
        with open(f"{path}/params.json", "w") as f:
            json.dump(dct, f, indent=4)

__init__(n_epochs, learning_rate, batch_size, **kwargs) abstractmethod

An abstract constructor for BPE-DTA, LM-DTA, and DeepDTA. The constructor sets the common attributes and calls the build function.

Parameters:

Name Type Description Default
n_epochs int

Number of epochs to train the model.

required
learning_rate float

The learning rate of the optimization algorithm.

required
batch_size int

Batch size for training.

required
Source code in pydebiaseddta/predictors/abstract_predictors.py
@abstractmethod
def __init__(self, n_epochs: int, learning_rate: float, batch_size: int, **kwargs):
    """An abstract constructor for BPE-DTA, LM-DTA, and DeepDTA.
    The constructor sets the common attributes and calls the `build` function.

    Parameters
    ----------
    n_epochs : int
        Number of epochs to train the model.
    learning_rate : float
        The learning rate of the optimization algorithm.
    batch_size : int
        Batch size for training.
    """
    self.n_epochs = n_epochs
    self.learning_rate = learning_rate
    self.batch_size = batch_size
    self.history = dict()
    self.model = self.build()

build() abstractmethod

An abstract function to create the model architecture. Every child has to implement this function.

Source code in pydebiaseddta/predictors/abstract_predictors.py
@abstractmethod
def build(self):
    """An abstract function to create the model architecture.
    Every child has to implement this function.
    """
    pass

from_file(path) classmethod

A utility function to load a TFPredictor instance from disk. All attributes, including the model weights, are loaded.

Parameters:

Name Type Description Default
path str

Path to load the prediction model from.

required

Returns:

Type Description
TFPredictor

The previously saved model.

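A sketch of the intended save/load round trip, reusing the hypothetical TinyDTA child class from the TFPredictor section above (the directory name is arbitrary). Note that from_file re-creates the instance as cls(**params) from params.json, so a child constructor must accept every attribute that save stores there.

# TinyDTA as defined in the illustrative sketch above.
predictor = TinyDTA(n_epochs=1)
predictor.save("runs/tinydta_demo")  # writes runs/tinydta_demo/model plus params.json and history.json

# from_file re-instantiates the class from params.json, then restores the Keras model
# and the training history; TinyDTA accepts the stored attributes via its keyword
# arguments (and **kwargs), which is what makes this round trip possible.
restored = TinyDTA.from_file("runs/tinydta_demo")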
Source code in pydebiaseddta/predictors/abstract_predictors.py
@classmethod
def from_file(cls, path: str):
    """A utility function to load a `TFPredictor` instance from disk.
    All attributes, including the model weights, are loaded.

    Parameters
    ----------
    path : str
        Path to load the prediction model from.

    Returns
    -------
    TFPredictor
        The previously saved model.
    """
    with open(f"{path}/params.json") as f:
        dct = json.load(f)

    instance = cls(**dct)

    instance.model = tf.keras.models.load_model(f"{path}/model")

    with open(f"{path}/history.json") as f:
        instance.history = json.load(f)
    return instance

predict(ligands, proteins)

Predicts the affinities of a List of protein-ligand pairs via the trained DTA prediction model, i.e., BPE-DTA, LM-DTA, or DeepDTA.

Parameters:

Name Type Description Default
ligands List[str]

SMILES strings of the ligands.

required
proteins List[str]

Amino-acid sequences of the proteins.

required

Returns:

Type Description
List[float]

Predicted affinity scores by DTA prediction model.

Source code in pydebiaseddta/predictors/abstract_predictors.py
def predict(self, ligands: List[str], proteins: List[str]) -> List[float]:
    """Predicts the affinities of a `List` of protein-ligand pairs via the trained DTA prediction model,
    *i.e.*, BPE-DTA, LM-DTA, or DeepDTA.

    Parameters
    ----------
    ligands : List[str]
        SMILES strings of the ligands.
    proteins : List[str]
        Amino-acid sequences of the proteins.

    Returns
    -------
    List[float]
        Predicted affinity scores by DTA prediction model.
    """        
    ligand_vectors = self.vectorize_ligands(ligands)
    protein_vectors = self.vectorize_proteins(proteins)
    return self.model.predict([ligand_vectors, protein_vectors]).tolist()

save(path)

A utility function to save a TFPredictor instance to the disk. All attributes, including the model weights, are saved.

Parameters:

Name Type Description Default
path str

Path to save the predictor.

required
Source code in pydebiaseddta/predictors/abstract_predictors.py
def save(self, path: str) -> None:
    """A utility function to save a `TFPredictor` instance to the disk.
    All attributes, including the model weights, are saved.

    Parameters
    ----------
    path : str
        Path to save the predictor.
    """        
    self.model.save(f"{path}/model")

    with open(f"{path}/history.json", "w") as f:
        json.dump(self.history, f, indent=4)

    donot_copy = {"model", "history"}
    dct = {k: v for k, v in self.__dict__.items() if k not in donot_copy}
    with open(f"{path}/params.json", "w") as f:
        json.dump(dct, f, indent=4)

train(train_ligands, train_proteins, train_labels, val_ligands=None, val_proteins=None, val_labels=None, sample_weights_by_epoch=None)

The common model training procedure for BPE-DTA, LM-DTA, and DeepDTA. The models adopt different biomolecule representation methods and model architectures, so, the training results are different. The training procedure supports validation for tracking, and sample weighting for debiasing.

Parameters:

Name Type Description Default
train_ligands List[str]

SMILES strings of the training ligands.

required
train_proteins List[str]

Amino-acid sequences of the training proteins.

required
train_labels List[float]

Affinity scores of the training protein-ligand pairs.

required
val_ligands List[str], optional

SMILES strings of the validation ligands, by default None and no validation is used.

None
val_proteins List[str], optional

Amino-acid sequences of the validation proteins, by default None and no validation is used.

None
val_labels List[float], optional

Affinity scores of the validation pairs, by default None and no validation is used.

None
sample_weights_by_epoch List[np.array], optional

Weight of each training protein-ligand pair during training across epochs. This variable must be a List of size \(E\) (number of training epochs), in which each element is a np.array of \(N \times 1\), where \(N\) is the training set size and each element corresponds to the weight of a training sample. By default None and no weighting is used.

None

Returns:

Type Description
Dict

Training history.

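For illustration, the snippet below builds a valid sample_weights_by_epoch argument: a List with one weight array per epoch, each of length equal to the training set size. The toy data, the random importance scores, and the linear interpolation from uniform weights are illustrative stand-ins, not the DebiasedDTA weighting scheme; DeepDTA is used only for concreteness.

import numpy as np

from pydebiaseddta.predictors.deepdta import DeepDTA

# Toy inputs (illustrative only): SMILES strings, amino-acid sequences, affinity scores.
train_ligands = ["CCO", "CC(=O)O", "c1ccccc1O"]
train_proteins = ["MKKLVLSLSLVLAFSSA", "MTEYKLVVVGAGGVGKS", "MAARGSLLRSLLFLLAA"]
train_labels = [5.0, 6.2, 7.1]

n_epochs, n_samples = 5, len(train_ligands)
importance = np.random.rand(n_samples)  # stand-in per-sample importance scores in [0, 1)

# One weight array per epoch: start from uniform weights and interpolate toward `importance`.
sample_weights_by_epoch = [
    (1 - e / (n_epochs - 1)) * np.ones(n_samples) + (e / (n_epochs - 1)) * importance
    for e in range(n_epochs)
]

predictor = DeepDTA(n_epochs=n_epochs, batch_size=2)
history = predictor.train(
    train_ligands,
    train_proteins,
    train_labels,
    sample_weights_by_epoch=sample_weights_by_epoch,
)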
Source code in pydebiaseddta/predictors/abstract_predictors.py
def train(
    self,
    train_ligands: List[str],
    train_proteins: List[str],
    train_labels: List[float],
    val_ligands: List[str] = None,
    val_proteins: List[str] = None,
    val_labels: List[float] = None,
    sample_weights_by_epoch: List[np.array] = None,
) -> Dict:
    """The common model training procedure for BPE-DTA, LM-DTA, and DeepDTA.
    The models adopt different biomolecule representation methods and model architectures,
    so, the training results are different.
    The training procedure supports validation for tracking, and sample weighting for debiasing.

    Parameters
    ----------
    train_ligands : List[str]
        SMILES strings of the training ligands.
    train_proteins : List[str]
        Amino-acid sequences of the training proteins.
    train_labels : List[float]
        Affinity scores of the training protein-ligand pairs.
    val_ligands : List[str], optional
        SMILES strings of the validation ligands, by default None and no validation is used.
    val_proteins : List[str], optional
        Amino-acid sequences of the validation proteins, by default None and no validation is used.
    val_labels : List[float], optional
        Affinity scores of the validation pairs, by default None and no validation is used.
    sample_weights_by_epoch : List[np.array], optional
        Weight of each training protein-ligand pair during training across epochs.
        This variable must be a List of size $E$ (number of training epochs),
        in which each element is a `np.array` of $N\times 1$, where $N$ is the training set size and 
        each element corresponds to the weight of a training sample.
        By default `None` and no weighting is used.

    Returns
    -------
    Dict
        Training history.
    """
    if sample_weights_by_epoch is None:
        sample_weights_by_epoch = create_uniform_weights(
            len(train_ligands), self.n_epochs
        )

    train_ligand_vectors = self.vectorize_ligands(train_ligands)
    train_protein_vectors = self.vectorize_proteins(train_proteins)
    train_labels = np.array(train_labels)

    val_tuple = None
    if (
        val_ligands is not None
        and val_proteins is not None
        and val_labels is not None
    ):
        val_ligand_vectors = self.vectorize_ligands(val_ligands)
        val_protein_vectors = self.vectorize_proteins(val_proteins)
        val_tuple = (
            [val_ligand_vectors, val_protein_vectors],
            np.array(val_labels),
        )

    train_stats_over_epochs = {"mse": [], "rmse": [], "r2": []}
    # Use fresh lists; a shallow dict copy would share the same list objects with the train stats.
    val_stats_over_epochs = {metric: [] for metric in train_stats_over_epochs}
    for e in range(self.n_epochs):
        self.model.fit(
            x=[train_ligand_vectors, train_protein_vectors],
            y=train_labels,
            sample_weight=sample_weights_by_epoch[e],
            validation_data=val_tuple,
            batch_size=self.batch_size,
            epochs=1,
        )

        train_stats = evaluate_predictions(
            gold_truths=train_labels,
            predictions=self.predict(train_ligands, train_proteins),
            metrics=list(train_stats_over_epochs.keys()),
        )
        for metric, stat in train_stats.items():
            train_stats_over_epochs[metric].append(stat)

        if val_tuple is not None:
            val_stats = evaluate_predictions(
                gold_truths=val_labels,
                predictions=self.predict(val_ligands, val_proteins),
                metrics=list(val_stats_over_epochs.keys()),
            )
            for metric, stat in val_stats.items():
                val_stats_over_epochs[metric].append(stat)

    self.history["train"] = train_stats_over_epochs
    if val_tuple is not None:
        self.history["val"] = val_stats_over_epochs

    return self.history

vectorize_ligands(ligands) abstractmethod

An abstract function to vectorize ligands. Every child has to implement this function.

Source code in pydebiaseddta/predictors/abstract_predictors.py
@abstractmethod
def vectorize_ligands(self, ligands):
    """An abstract function to vectorize ligands.
    Every child has to implement this function.
    """
    pass

vectorize_proteins(proteins) abstractmethod

An abstract function to vectorize proteins. Every child has to implement this function.

Source code in pydebiaseddta/predictors/abstract_predictors.py
@abstractmethod
def vectorize_proteins(self, proteins):
    """An abstract function to vectorize proteins.
    Every child has to implement this function.
    """
    pass

create_uniform_weights(n_samples, n_epochs)

Create a list of weights such that every training instance has equal weight across all epochs, i.e., no sample weighting is used.

Parameters:

Name Type Description Default
n_samples int

Number of training instances.

required
n_epochs int

Number of epochs to train the model.

required

Returns:

Type Description
List[np.array]

Sample weights across epochs. Each instance has a weight of 1 for all epochs.

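A quick illustration of the returned structure:

from pydebiaseddta.predictors.abstract_predictors import create_uniform_weights

weights = create_uniform_weights(n_samples=3, n_epochs=2)
# -> [array([1, 1, 1]), array([1, 1, 1])]: one weight vector per epoch, every sample weighted 1.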
Source code in pydebiaseddta/predictors/abstract_predictors.py
def create_uniform_weights(n_samples: int, n_epochs: int) -> List[np.array]:
    """Create a lists of weights such that every training instance has the equal weight across all epoch,
    *i.e.*, no sample weighting is used.

    Parameters
    ----------
    n_samples : int
        Number of training instances.
    n_epochs : int
        Number of epochs to train the model.

    Returns
    -------
    List[np.array]
        Sample weights across epochs. Each instance has a weight of 1 for all epochs.
    """
    return [np.array([1] * n_samples) for _ in range(n_epochs)]

DeepDTA

Bases: TFPredictor

Source code in pydebiaseddta/predictors/deepdta.py
class DeepDTA(TFPredictor):
    def __init__(
        self,
        max_smi_len: int = 100,
        max_prot_len: int = 1000,
        embedding_dim: int = 128,
        learning_rate: float = 0.001,
        batch_size: int = 256,
        n_epochs: int = 200,
        num_filters: int = 32,
        smi_filter_len: int = 4,
        prot_filter_len: int = 6,
    ):
        """Constructor to create a DeepDTA instance.
        DeepDTA segments SMILES strings of ligands and amino-acid sequences of proteins into characters,
        and applies three layers of convolutions to learn latent representations. 
        A fully-connected neural network with three layers is used afterwards to predict affinities.

        Parameters
        ----------
        max_smi_len : int, optional
            Maximum number of characters in a SMILES string, by default 100. 
            Longer SMILES strings are truncated.
        max_prot_len : int, optional
            Maximum number of amino acids in a protein sequence, by default 1000. 
            Longer sequences are truncated.
        embedding_dim : int, optional
            The dimension of the biomolecule characters, by default 128.
        learning_rate : float, optional
            Learning rate during optimization, by default 0.001.
        batch_size : int, optional
            Batch size during training, by default 256.
        n_epochs : int, optional
            Number of epochs to train the model, by default 200.
        num_filters : int, optional
            Number of filters in the first convolution block. The following blocks use two and three times this number, respectively, by default 32.
        smi_filter_len : int, optional
            Length of filters in the convolution blocks for ligands, by default 4.
        prot_filter_len : int, optional
            Length of filters in the convolution blocks for proteins, by default 6.
        """    
        self.max_smi_len = max_smi_len
        self.max_prot_len = max_prot_len
        self.embedding_dim = embedding_dim
        self.num_filters = num_filters
        self.smi_filter_len = smi_filter_len
        self.prot_filter_len = prot_filter_len

        self.chem_vocab_size = 94
        self.prot_vocab_size = 26
        TFPredictor.__init__(self, n_epochs, learning_rate, batch_size)

    def build(self):
        """Builds a `DeepDTA` predictor in `keras` with the parameters specified during construction.

        Returns
        -------
        tensorflow.keras.models.Model
            The built model.
        """    
        # Inputs
        ligands = Input(shape=(self.max_smi_len,), dtype="int32")

        # chemical representation
        ligand_representation = Embedding(
            input_dim=self.chem_vocab_size + 1,
            output_dim=self.embedding_dim,
            input_length=self.max_smi_len,
            mask_zero=True,
        )(ligands)
        ligand_representation = Conv1D(
            filters=self.num_filters,
            kernel_size=self.smi_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(ligand_representation)
        ligand_representation = Conv1D(
            filters=self.num_filters * 2,
            kernel_size=self.smi_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(ligand_representation)
        ligand_representation = Conv1D(
            filters=self.num_filters * 3,
            kernel_size=self.smi_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(ligand_representation)
        ligand_representation = GlobalMaxPooling1D()(ligand_representation)

        # Protein representation
        proteins = Input(shape=(self.max_prot_len,), dtype="int32")
        protein_representation = Embedding(
            input_dim=self.prot_vocab_size + 1,
            output_dim=self.embedding_dim,
            input_length=self.max_prot_len,
            mask_zero=True,
        )(proteins)
        protein_representation = Conv1D(
            filters=self.num_filters,
            kernel_size=self.prot_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(protein_representation)
        protein_representation = Conv1D(
            filters=self.num_filters * 2,
            kernel_size=self.prot_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(protein_representation)
        protein_representation = Conv1D(
            filters=self.num_filters * 3,
            kernel_size=self.prot_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(protein_representation)
        protein_representation = GlobalMaxPooling1D()(protein_representation)

        interaction_representation = Concatenate(axis=-1)(
            [ligand_representation, protein_representation]
        )

        # Fully connected layers
        FC1 = Dense(1024, activation="relu")(interaction_representation)
        FC1 = Dropout(0.1)(FC1)
        FC2 = Dense(1024, activation="relu")(FC1)
        FC2 = Dropout(0.1)(FC2)
        FC3 = Dense(512, activation="relu")(FC2)
        predictions = Dense(1, kernel_initializer="normal")(FC3)

        opt = Adam(self.learning_rate)
        deepdta = Model(inputs=[ligands, proteins], outputs=[predictions])
        deepdta.compile(
            optimizer=opt,
            loss="mean_squared_error",
            metrics=["mean_squared_error"],
        )
        return deepdta

    def vectorize_ligands(self, ligands: List[str]) -> np.array:
        """Segments SMILES strings of ligands into characters and applies label encoding.
        Truncation and padding are also applied to prepare ligands for training and/or prediction.

        Parameters
        ----------
        ligands : List[str]
            The SMILES strings of ligands.

        Returns
        -------
        np.array
            An $N \\times max\\_smi\\_len$ ($N$ is the number of the input ligands) matrix that contains label encoded sequences of SMILES tokens.
        """     
        smi_to_unichar_encoding = load_smiles_to_unichar_encoding()
        unichars = smiles_to_unichar_batch(ligands, smi_to_unichar_encoding)
        word_identifier = load_chemical_word_identifier(vocab_size=94)

        return np.array(
            word_identifier.encode_sequences(unichars, self.max_smi_len)
        )

    def vectorize_proteins(self, aa_sequences: List[str]) -> np.array:
        """Segments amino-acid sequences of proteins into characters and applies label encoding.
        Truncation and padding are also applied to prepare proteins for training and/or prediction.

        Parameters
        ----------
        aa_sequences : List[str]
            The amino-acid sequences of proteins.

        Returns
        -------
        np.array
            An $N \\times max\\_prot\\_len$ ($N$ is the number of the input proteins) matrix that contains label encoded sequences of amino-acids.
        """
        word_identifier = load_protein_word_identifier(vocab_size=26)
        return np.array(
            word_identifier.encode_sequences(aa_sequences, self.max_prot_len)
        )

__init__(max_smi_len=100, max_prot_len=1000, embedding_dim=128, learning_rate=0.001, batch_size=256, n_epochs=200, num_filters=32, smi_filter_len=4, prot_filter_len=6)

Constructor to create a DeepDTA instance. DeepDTA segments SMILES strings of ligands and amino-acid sequences of proteins into characters, and applies three layers of convolutions to learn latent representations. A fully-connected neural network with three layers is used afterwards to predict affinities.

Parameters:

Name Type Description Default
max_smi_len int, optional

Maximum number of characters in a SMILES string, by default 100. Longer SMILES strings are truncated.

100
max_prot_len int, optional

Maximum number of amino acids in a protein sequence, by default 1000. Longer sequences are truncated.

1000
embedding_dim int, optional

The dimension of the biomolecule characters, by default 128.

128
learning_rate float, optional

Learning rate during optimization, by default 0.001.

0.001
batch_size int, optional

Batch size during training, by default 256.

256
n_epochs int, optional

Number of epochs to train the model, by default 200.

200
num_filters int, optional

Number of filters in the first convolution block. The following blocks use two and three times this number, respectively, by default 32.

32
smi_filter_len int, optional

Length of filters in the convolution blocks for ligands, by default 4.

4
prot_filter_len int, optional

Length of filters in the convolution blocks for proteins, by default 6.

6
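A minimal usage sketch; the SMILES strings, amino-acid sequences, and affinity values below are made up, and two epochs are used only to keep the illustration short:

from pydebiaseddta.predictors.deepdta import DeepDTA

train_ligands = ["CC(=O)Oc1ccccc1C(=O)O", "CCO", "c1ccccc1O"]           # SMILES strings
train_proteins = ["MKKLVLSLSLVLAFSSA", "MTEYKLVVVGAGGVGKS", "MAARGSLL"]  # amino-acid sequences
train_labels = [5.1, 6.4, 7.2]                                           # affinity scores

deepdta = DeepDTA(n_epochs=2, batch_size=2)
history = deepdta.train(train_ligands, train_proteins, train_labels)
predictions = deepdta.predict(train_ligands, train_proteins)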
Source code in pydebiaseddta/predictors/deepdta.py
def __init__(
    self,
    max_smi_len: int = 100,
    max_prot_len: int = 1000,
    embedding_dim: int = 128,
    learning_rate: float = 0.001,
    batch_size: int = 256,
    n_epochs: int = 200,
    num_filters: int = 32,
    smi_filter_len: int = 4,
    prot_filter_len: int = 6,
):
    """Constructor to create a DeepDTA instance.
    DeepDTA segments SMILES strings of ligands and amino-acid sequences of proteins into characters,
    and applies three layers of convolutions to learn latent representations. 
    A fully-connected neural network with three layers is used afterwards to predict affinities.

    Parameters
    ----------
    max_smi_len : int, optional
        Maximum number of characters in a SMILES string, by default 100. 
        Longer SMILES strings are truncated.
    max_prot_len : int, optional
        Maximum number of amino acids in a protein sequence, by default 1000. 
        Longer sequences are truncated.
    embedding_dim : int, optional
        The dimension of the biomolecule characters, by default 128.
    learning_rate : float, optional
        Learning rate during optimization, by default 0.001.
    batch_size : int, optional
        Batch size during training, by default 256.
    n_epochs : int, optional
        Number of epochs to train the model, by default 200.
    num_filters : int, optional
        Number of filters in the first convolution block. The following blocks use two and three times this number, respectively, by default 32.
    smi_filter_len : int, optional
        Length of filters in the convolution blocks for ligands, by default 4.
    prot_filter_len : int, optional
        Length of filters in the convolution blocks for proteins, by default 6.
    """    
    self.max_smi_len = max_smi_len
    self.max_prot_len = max_prot_len
    self.embedding_dim = embedding_dim
    self.num_filters = num_filters
    self.smi_filter_len = smi_filter_len
    self.prot_filter_len = prot_filter_len

    self.chem_vocab_size = 94
    self.prot_vocab_size = 26
    TFPredictor.__init__(self, n_epochs, learning_rate, batch_size)

build()

Builds a DeepDTA predictor in keras with the parameters specified during construction.

Returns:

Type Description
tensorflow.keras.models.Model

The built model.

Source code in pydebiaseddta/predictors/deepdta.py
def build(self):
    """Builds a `DeepDTA` predictor in `keras` with the parameters specified during construction.

    Returns
    -------
    tensorflow.keras.models.Model
        The built model.
    """    
    # Inputs
    ligands = Input(shape=(self.max_smi_len,), dtype="int32")

    # chemical representation
    ligand_representation = Embedding(
        input_dim=self.chem_vocab_size + 1,
        output_dim=self.embedding_dim,
        input_length=self.max_smi_len,
        mask_zero=True,
    )(ligands)
    ligand_representation = Conv1D(
        filters=self.num_filters,
        kernel_size=self.smi_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(ligand_representation)
    ligand_representation = Conv1D(
        filters=self.num_filters * 2,
        kernel_size=self.smi_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(ligand_representation)
    ligand_representation = Conv1D(
        filters=self.num_filters * 3,
        kernel_size=self.smi_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(ligand_representation)
    ligand_representation = GlobalMaxPooling1D()(ligand_representation)

    # Protein representation
    proteins = Input(shape=(self.max_prot_len,), dtype="int32")
    protein_representation = Embedding(
        input_dim=self.prot_vocab_size + 1,
        output_dim=self.embedding_dim,
        input_length=self.max_prot_len,
        mask_zero=True,
    )(proteins)
    protein_representation = Conv1D(
        filters=self.num_filters,
        kernel_size=self.prot_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(protein_representation)
    protein_representation = Conv1D(
        filters=self.num_filters * 2,
        kernel_size=self.prot_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(protein_representation)
    protein_representation = Conv1D(
        filters=self.num_filters * 3,
        kernel_size=self.prot_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(protein_representation)
    protein_representation = GlobalMaxPooling1D()(protein_representation)

    interaction_representation = Concatenate(axis=-1)(
        [ligand_representation, protein_representation]
    )

    # Fully connected layers
    FC1 = Dense(1024, activation="relu")(interaction_representation)
    FC1 = Dropout(0.1)(FC1)
    FC2 = Dense(1024, activation="relu")(FC1)
    FC2 = Dropout(0.1)(FC2)
    FC3 = Dense(512, activation="relu")(FC2)
    predictions = Dense(1, kernel_initializer="normal")(FC3)

    opt = Adam(self.learning_rate)
    deepdta = Model(inputs=[ligands, proteins], outputs=[predictions])
    deepdta.compile(
        optimizer=opt,
        loss="mean_squared_error",
        metrics=["mean_squared_error"],
    )
    return deepdta

vectorize_ligands(ligands)

Segments SMILES strings of ligands into characters and applies label encoding. Truncation and padding are also applied to prepare ligands for training and/or prediction.

Parameters:

Name Type Description Default
ligands List[str]

The SMILES strings of ligands.

required

Returns:

Type Description
np.array

An \(N \times max\_smi\_len\) (\(N\) is the number of the input ligands) matrix that contains label encoded sequences of SMILES tokens.

Source code in pydebiaseddta/predictors/deepdta.py
def vectorize_ligands(self, ligands: List[str]) -> np.array:
    """Segments SMILES strings of ligands into characters and applies label encoding.
    Truncation and padding are also applied to prepare ligands for training and/or prediction.

    Parameters
    ----------
    ligands : List[str]
        The SMILES strings of ligands.

    Returns
    -------
    np.array
        An $N \\times max\\_smi\\_len$ ($N$ is the number of the input ligands) matrix that contains label encoded sequences of SMILES tokens.
    """     
    smi_to_unichar_encoding = load_smiles_to_unichar_encoding()
    unichars = smiles_to_unichar_batch(ligands, smi_to_unichar_encoding)
    word_identifier = load_chemical_word_identifier(vocab_size=94)

    return np.array(
        word_identifier.encode_sequences(unichars, self.max_smi_len)
    )

vectorize_proteins(aa_sequences)

Segments amino-acid sequences of proteins into characters and applies label encoding. Truncation and padding are also applied to prepare proteins for training and/or prediction.

Parameters:

Name Type Description Default
aa_sequences List[str]

The amino-acid sequences of proteins.

required

Returns:

Type Description
np.array

An \(N \times max\_prot\_len\) (\(N\) is the number of the input proteins) matrix that contains label encoded sequences of amino-acids.

Source code in pydebiaseddta/predictors/deepdta.py
def vectorize_proteins(self, aa_sequences: List[str]) -> np.array:
    """Segments amino-acid sequences of proteins into characters and applies label encoding.
    Truncation and padding are also applied to prepare proteins for training and/or prediction.

    Parameters
    ----------
    aa_sequences : List[str]
        The amino-acid sequences of proteins.

    Returns
    -------
    np.array
        An $N \\times max\\_prot\\_len$ ($N$ is the number of the input proteins) matrix that contains label encoded sequences of amino-acids.
    """
    word_identifier = load_protein_word_identifier(vocab_size=26)
    return np.array(
        word_identifier.encode_sequences(aa_sequences, self.max_prot_len)
    )

BPEDTA

Bases: TFPredictor

Source code in pydebiaseddta/predictors/bpedta.py
class BPEDTA(TFPredictor):
    def __init__(
        self,
        max_smi_len: int = 100,
        max_prot_len: int = 1000,
        embedding_dim: int = 128,
        learning_rate: float = 0.001,
        batch_size: int = 256,
        n_epochs: int = 200,
        num_filters: int = 32,
        smi_filter_len: int = 4,
        prot_filter_len: int = 6,
    ):
        """Constructor to create a BPE-DTA instance.
        BPE-DTA segments SMILES strings of ligands and amino-acid sequences of proteins into biomolecule words,
        and applies three layers of convolutions to learn latent representations. 
        A fully-connected neural network with three layers is used afterwards to predict affinities.

        Parameters
        ----------
        max_smi_len : int, optional
            Maximum number of chemical words in a SMILES string, by default 100. 
            SMILES strings that contain more chemical words are truncated.
        max_prot_len : int, optional
            Maximum number of protein words in an amino-acid sequence, by default 1000. 
            Amino-acid sequences that contain more protein words are truncated.
        embedding_dim : int, optional
            The dimension of the biomolecule words, by default 128.
        learning_rate : float, optional
            Learning rate during optimization, by default 0.001.
        batch_size : int, optional
            Batch size during training, by default 256.
        n_epochs : int, optional
            Number of epochs to train the model, by default 200.
        num_filters : int, optional
            Number of filters in the first convolution block. The following blocks use two and three times this number, respectively, by default 32.
        smi_filter_len : int, optional
            Length of filters in the convolution blocks for ligands, by default 4.
        prot_filter_len : int, optional
            Length of filters in the convolution blocks for proteins, by default 6.
        """    
        self.max_smi_len = max_smi_len
        self.max_prot_len = max_prot_len
        self.embedding_dim = embedding_dim
        self.num_filters = num_filters
        self.smi_filter_len = smi_filter_len
        self.prot_filter_len = prot_filter_len

        self.chem_vocab_size = 8000
        self.prot_vocab_size = 32000
        TFPredictor.__init__(self, n_epochs, learning_rate, batch_size)

    def build(self) -> Model:
        """Builds a `BPEDTA` predictor in `keras` with the parameters specified during construction.

        Returns
        -------
        tensorflow.keras.models.Model
            The built model.
        """        
        # Inputs
        ligands = Input(shape=(self.max_smi_len,), dtype="int32")
        # chemical representation
        ligand_representation = Embedding(
            input_dim=self.chem_vocab_size + 1,
            output_dim=self.embedding_dim,
            input_length=self.max_smi_len,
            mask_zero=True,
        )(ligands)
        ligand_representation = Conv1D(
            filters=self.num_filters,
            kernel_size=self.smi_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(ligand_representation)
        ligand_representation = Conv1D(
            filters=self.num_filters * 2,
            kernel_size=self.smi_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(ligand_representation)
        ligand_representation = Conv1D(
            filters=self.num_filters * 3,
            kernel_size=self.smi_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(ligand_representation)
        ligand_representation = GlobalMaxPooling1D()(ligand_representation)

        # Protein representation
        proteins = Input(shape=(self.max_prot_len,), dtype="int32")
        protein_representation = Embedding(
            input_dim=self.prot_vocab_size + 1,
            output_dim=self.embedding_dim,
            input_length=self.max_prot_len,
            mask_zero=True,
        )(proteins)
        protein_representation = Conv1D(
            filters=self.num_filters,
            kernel_size=self.prot_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(protein_representation)
        protein_representation = Conv1D(
            filters=self.num_filters * 2,
            kernel_size=self.prot_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(protein_representation)
        protein_representation = Conv1D(
            filters=self.num_filters * 3,
            kernel_size=self.prot_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(protein_representation)
        protein_representation = GlobalMaxPooling1D()(protein_representation)

        interaction_representation = Concatenate(axis=-1)(
            [ligand_representation, protein_representation]
        )

        # Fully connected layers
        FC1 = Dense(1024, activation="relu")(interaction_representation)
        FC1 = Dropout(0.1)(FC1)
        FC2 = Dense(1024, activation="relu")(FC1)
        FC2 = Dropout(0.1)(FC2)
        FC3 = Dense(512, activation="relu")(FC2)
        predictions = Dense(1, kernel_initializer="normal")(FC3)

        opt = Adam(self.learning_rate)
        bpedta = Model(inputs=[ligands, proteins], outputs=[predictions])
        bpedta.compile(
            optimizer=opt, loss="mean_squared_error", metrics=["mean_squared_error"],
        )
        return bpedta

    def vectorize_ligands(self, ligands: List[str]) -> np.array:
        """Segments SMILES strings of ligands into chemical words and applies label encoding.
        Truncation and padding are also applied to prepare ligands for training and/or prediction.

        Parameters
        ----------
        ligands : List[str]
            The SMILES strings of ligands.

        Returns
        -------
        np.array
            An $N \\times max\\_smi\\_len$ ($N$ is the number of the input ligands) matrix that contains label encoded sequences of chemical words.
        """        
        smi_to_unichar_encoding = load_smiles_to_unichar_encoding()
        unichars = smiles_to_unichar_batch(ligands, smi_to_unichar_encoding)
        word_identifier = load_chemical_word_identifier(vocab_size=8000)

        return np.array(word_identifier.encode_sequences(unichars, self.max_smi_len))

    def vectorize_proteins(self, aa_sequences: List[str]) -> np.array:
        """Segments amino-acid sequences of proteins into protein words and applies label encoding.
        Truncation and padding are also applied to prepare proteins for training and/or prediction.

        Parameters
        ----------
        aa_sequences : List[str]
            The amino-acid sequences of proteins.

        Returns
        -------
        np.array
            An $N \\times max\\_prot\\_len$ ($N$ is the number of the input proteins) matrix that contains label encoded sequences of protein words.
        """      
        word_identifier = load_protein_word_identifier(vocab_size=32000)
        return np.array(
            word_identifier.encode_sequences(aa_sequences, self.max_prot_len)
        )

__init__(max_smi_len=100, max_prot_len=1000, embedding_dim=128, learning_rate=0.001, batch_size=256, n_epochs=200, num_filters=32, smi_filter_len=4, prot_filter_len=6)

Constructor to create a BPE-DTA instance. BPE-DTA segments SMILES strings of ligands and amino-acid sequences of proteins into biomolecule words, and applies three layers of convolutions to learn latent representations. A fully-connected neural network with three layers is used afterwards to predict affinities.

Parameters:

Name Type Description Default
max_smi_len int, optional

Maximum number of chemical words in a SMILES string, by default 100. SMILES strings that contain more chemical words are truncated.

100
max_prot_len int, optional

Maximum number of protein words in an amino-acid sequence, by default 1000. Amino-acid sequences that contain more protein words are truncated.

1000
embedding_dim int, optional

The dimension of the biomolecule words, by default 128.

128
learning_rate float, optional

Learning rate during optimization, by default 0.001.

0.001
batch_size int, optional

Batch size during training, by default 256.

256
n_epochs int, optional

Number of epochs to train the model, by default 200.

200
num_filters int, optional

Number of filters in the first convolution block. The following blocks use two and three times this number, respectively, by default 32.

32
smi_filter_len int, optional

Length of filters in the convolution blocks for ligands, by default 4.

4
prot_filter_len int, optional

Length of filters in the convolution blocks for proteins, by default 6.

6
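BPEDTA is used the same way as DeepDTA; only the vectorization differs, segmenting SMILES strings and amino-acid sequences into BPE-learned chemical and protein words with vocabulary sizes of 8,000 and 32,000. A minimal sketch, reusing the toy lists from the DeepDTA example above:

from pydebiaseddta.predictors.bpedta import BPEDTA

# train_ligands, train_proteins, and train_labels as in the DeepDTA example above.
bpedta = BPEDTA(n_epochs=2, batch_size=2)
history = bpedta.train(train_ligands, train_proteins, train_labels)
predictions = bpedta.predict(train_ligands, train_proteins)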
Source code in pydebiaseddta/predictors/bpedta.py
def __init__(
    self,
    max_smi_len: int = 100,
    max_prot_len: int = 1000,
    embedding_dim: int = 128,
    learning_rate: float = 0.001,
    batch_size: int = 256,
    n_epochs: int = 200,
    num_filters: int = 32,
    smi_filter_len: int = 4,
    prot_filter_len: int = 6,
):
    """Constructor to create a BPE-DTA instance.
    BPE-DTA segments SMILES strings of ligands and amino-acid sequences of proteins into biomolecule words,
    and applies three layers of convolutions to learn latent representations. 
    A fully-connected neural network with three layers is used afterwards to predict affinities.

    Parameters
    ----------
    max_smi_len : int, optional
        Maximum number of chemical words in a SMILES string, by default 100. 
        SMILES strings that contain more chemical words are truncated.
    max_prot_len : int, optional
        Maximum number of protein words in an amino-acid sequence, by default 1000. 
        Amino-acid sequences that contain more protein words are truncated.
    embedding_dim : int, optional
        The dimension of the biomolecule words, by default 128.
    learning_rate : float, optional
        Learning rate during optimization, by default 0.001.
    batch_size : int, optional
        Batch size during training, by default 256.
    n_epochs : int, optional
        Number of epochs to train the model, by default 200.
    num_filters : int, optional
        Number of filters in the first convolution block, by default 32. The next blocks use two and three times this number, respectively.
    smi_filter_len : int, optional
        Length of filters in the convolution blocks for ligands, by default 4.
    prot_filter_len : int, optional
        Length of filters in the convolution blocks for proteins, by default 6.
    """    
    self.max_smi_len = max_smi_len
    self.max_prot_len = max_prot_len
    self.embedding_dim = embedding_dim
    self.num_filters = num_filters
    self.smi_filter_len = smi_filter_len
    self.prot_filter_len = prot_filter_len

    self.chem_vocab_size = 8000
    self.prot_vocab_size = 32000
    TFPredictor.__init__(self, n_epochs, learning_rate, batch_size)

build()

Builds a BPEDTA predictor in keras with the parameters specified during construction.

Returns:

tensorflow.keras.models.Model
    The built model.
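
Because build() returns a plain Keras Model, the architecture can be inspected with standard Keras calls. A small sketch, assuming the default constructor arguments:

from pydebiaseddta.predictors import BPEDTA  # assumed import path

bpedta = BPEDTA()              # defaults: 8000-word chemical and 32000-word protein vocabularies
keras_model = bpedta.build()
keras_model.summary()          # prints the two convolutional branches and the fully-connected head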

Source code in pydebiaseddta/predictors/bpedta.py
def build(self) -> Model:
    """Builds a `BPEDTA` predictor in `keras` with the parameters specified during construction.

    Returns
    -------
    tensorflow.keras.models.Model
        The built model.
    """        
    # Inputs
    ligands = Input(shape=(self.max_smi_len,), dtype="int32")
    # chemical representation
    ligand_representation = Embedding(
        input_dim=self.chem_vocab_size + 1,
        output_dim=self.embedding_dim,
        input_length=self.max_smi_len,
        mask_zero=True,
    )(ligands)
    ligand_representation = Conv1D(
        filters=self.num_filters,
        kernel_size=self.smi_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(ligand_representation)
    ligand_representation = Conv1D(
        filters=self.num_filters * 2,
        kernel_size=self.smi_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(ligand_representation)
    ligand_representation = Conv1D(
        filters=self.num_filters * 3,
        kernel_size=self.smi_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(ligand_representation)
    ligand_representation = GlobalMaxPooling1D()(ligand_representation)

    # Protein representation
    proteins = Input(shape=(self.max_prot_len,), dtype="int32")
    protein_representation = Embedding(
        input_dim=self.prot_vocab_size + 1,
        output_dim=self.embedding_dim,
        input_length=self.max_prot_len,
        mask_zero=True,
    )(proteins)
    protein_representation = Conv1D(
        filters=self.num_filters,
        kernel_size=self.prot_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(protein_representation)
    protein_representation = Conv1D(
        filters=self.num_filters * 2,
        kernel_size=self.prot_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(protein_representation)
    protein_representation = Conv1D(
        filters=self.num_filters * 3,
        kernel_size=self.prot_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(protein_representation)
    protein_representation = GlobalMaxPooling1D()(protein_representation)

    interaction_representation = Concatenate(axis=-1)(
        [ligand_representation, protein_representation]
    )

    # Fully connected layers
    FC1 = Dense(1024, activation="relu")(interaction_representation)
    FC1 = Dropout(0.1)(FC1)
    FC2 = Dense(1024, activation="relu")(FC1)
    FC2 = Dropout(0.1)(FC2)
    FC3 = Dense(512, activation="relu")(FC2)
    predictions = Dense(1, kernel_initializer="normal")(FC3)

    opt = Adam(self.learning_rate)
    bpedta = Model(inputs=[ligands, proteins], outputs=[predictions])
    bpedta.compile(
        optimizer=opt, loss="mean_squared_error", metrics=["mean_squared_error"],
    )
    return bpedta

vectorize_ligands(ligands)

Segments SMILES strings of ligands into chemical words and applies label encoding. Truncation and padding are also applied to prepare ligands for training and/or prediction.

Parameters:

ligands : List[str], required
    The SMILES strings of ligands.

Returns:

np.array
    An \(N \times max\_smi\_len\) (\(N\) is the number of the input ligands) matrix that contains label encoded sequences of chemical words.
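
A brief sketch of the expected input and output shapes, using toy SMILES strings and the default max_smi_len of 100:

from pydebiaseddta.predictors import BPEDTA  # assumed import path

bpedta = BPEDTA(max_smi_len=100)
smiles = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1O"]
encoded = bpedta.vectorize_ligands(smiles)
print(encoded.shape)  # expected: (2, 100), one padded/truncated row of chemical-word ids per ligand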

Source code in pydebiaseddta/predictors/bpedta.py
def vectorize_ligands(self, ligands: List[str]) -> np.array:
    """Segments SMILES strings of ligands into chemical words and applies label encoding.
    Truncation and padding are also applied to prepare ligands for training and/or prediction.

    Parameters
    ----------
    ligands : List[str]
        The SMILES strings of ligands.

    Returns
    -------
    np.array
        An $N \\times max\\_smi\\_len$ ($N$ is the number of the input ligands) matrix that contains label encoded sequences of chemical words.
    """        
    smi_to_unichar_encoding = load_smiles_to_unichar_encoding()
    unichars = smiles_to_unichar_batch(ligands, smi_to_unichar_encoding)
    word_identifier = load_chemical_word_identifier(vocab_size=8000)

    return np.array(word_identifier.encode_sequences(unichars, self.max_smi_len))

vectorize_proteins(aa_sequences)

Segments amino-acid sequences of proteins into protein words and applies label encoding. Truncation and padding are also applied to prepare proteins for training and/or prediction.

Parameters:

aa_sequences : List[str], required
    The amino-acid sequences of proteins.

Returns:

np.array
    An \(N \times max\_prot\_len\) (\(N\) is the number of the input proteins) matrix that contains label encoded sequences of protein words.
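
An analogous sketch for proteins, using a toy amino-acid sequence and the default max_prot_len of 1000:

from pydebiaseddta.predictors import BPEDTA  # assumed import path

bpedta = BPEDTA(max_prot_len=1000)
sequences = ["MKTAYIAKQRQISFVKSHFSRQ"]
encoded = bpedta.vectorize_proteins(sequences)
print(encoded.shape)  # expected: (1, 1000), one padded/truncated row of protein-word ids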

Source code in pydebiaseddta/predictors/bpedta.py
def vectorize_proteins(self, aa_sequences: List[str]) -> np.array:
    """Segments amino-acid sequences of proteins into protein words and applies label encoding.
    Truncation and padding are also applied to prepare proteins for training and/or prediction.

    Parameters
    ----------
    aa_sequences : List[str]
        The amino-acid sequences of proteins.

    Returns
    -------
    np.array
        An $N \\times max\\_prot\\_len$ ($N$ is the number of the input proteins) matrix that contains label encoded sequences of protein words.
    """      
    word_identifier = load_protein_word_identifier(vocab_size=32000)
    return np.array(
        word_identifier.encode_sequences(aa_sequences, self.max_prot_len)
    )

LMDTA

Bases: TFPredictor

Source code in pydebiaseddta/predictors/lmdta.py
class LMDTA(TFPredictor):
    def __init__(
        self, n_epochs: int = 200, learning_rate: float = 0.001, batch_size: int = 256
    ):
        """Constructor to create a LMDTA instance.
        LMDTA represents ligands and proteins with pre-trained language model embeddings
        obtained via [`ChemBERTa`](https://arxiv.org/abs/2010.09885) and  [`ProtBert`](https://www.biorxiv.org/content/biorxiv/early/2020/07/21/2020.07.12.199554.full.pdf) models, respectively. 
        A fully-connected neural network with two layers is used afterwards to predict affinities.

        Parameters
        ----------
        n_epochs : int, optional
            Number of epochs to train the model, by default 200.
        learning_rate : float, optional
            Learning rate during optimization, by default 0.001.
        batch_size : int, optional
             Batch size during training, by default 256.
        """
        transformers.logging.set_verbosity(transformers.logging.CRITICAL)
        self.chemical_tokenizer = AutoTokenizer.from_pretrained(
            "seyonec/PubChem10M_SMILES_BPE_450k"
        )
        self.chemberta = AutoModel.from_pretrained("seyonec/PubChem10M_SMILES_BPE_450k")

        self.protein_tokenizer = AutoTokenizer.from_pretrained(
            "Rostlab/prot_bert", do_lower_case=False
        )
        self.protbert = AutoModel.from_pretrained("Rostlab/prot_bert")
        TFPredictor.__init__(self, n_epochs, learning_rate, batch_size)

    def build(self):
        """Builds a `LMDTA` predictor in `keras` with the parameters specified during construction.

        Returns
        -------
        tensorflow.keras.models.Model
            The built model.
        """
        chemicals = Input(shape=(768,), dtype="float32")
        proteins = Input(shape=(1024,), dtype="float32")

        interaction_representation = Concatenate(axis=-1)([chemicals, proteins])

        FC1 = Dense(1024, activation="relu")(interaction_representation)
        FC1 = Dropout(0.1)(FC1)
        FC2 = Dense(512, activation="relu")(FC1)
        predictions = Dense(1, kernel_initializer="normal")(FC2)

        opt = Adam(self.learning_rate)
        lmdta = Model(inputs=[chemicals, proteins], outputs=[predictions])
        lmdta.compile(
            optimizer=opt, loss="mean_squared_error", metrics=["mean_squared_error"]
        )
        return lmdta

    @lru_cache(maxsize=2048)
    def get_chemberta_embedding(self, smiles: str) -> np.array:
        """Computes the [`ChemBERTa`](https://arxiv.org/abs/2010.09885) vector for a ligand. 
        Since creating the vector is computation-heavy, an `lru_cache` of size 2048 is used to store produced vectors.

        Parameters
        ----------
        smiles : str
            SMILES string of the ligand.

        Returns
        -------
        np.array
            [`ChemBERTa`](https://arxiv.org/abs/2010.09885) vector (768-dimensional) of the ligand.
        """        
        tokens = self.chemical_tokenizer(smiles, return_tensors="pt")
        output = self.chemberta(**tokens)
        return output.last_hidden_state.detach().numpy().mean(axis=1)

    def vectorize_ligands(self, ligands: List[str]) -> np.array:
        """Vectorizes the ligands with [`ChemBERTa`](https://arxiv.org/abs/2010.09885) embeddings.

        Parameters
        ----------
        ligands : List[str]
            The SMILES strings of ligands.

        Returns
        -------
        np.array
            An $N \\times 768$ ($N$ is the number of the input ligands) matrix that contains [`ChemBERTa`](https://arxiv.org/abs/2010.09885) vectors of the ligands.
        """        
        return np.vstack(
            [self.get_chemberta_embedding(chemical) for chemical in ligands]
        )

    @lru_cache(maxsize=1024)
    def get_protbert_embedding(self, aa_sequence: str) -> np.array:
        """Computes the [`ProtBert`](https://www.biorxiv.org/content/biorxiv/early/2020/07/21/2020.07.12.199554.full.pdf) vector for a protein. 
        Since creating the vector is computation-heavy, an `lru_cache` of size 1024 is used to store produced vectors.

        Parameters
        ----------
        aa_sequence : str
            Amino-acid sequence of the protein.

        Returns
        -------
        np.array
            [`ProtBert`](https://www.biorxiv.org/content/biorxiv/early/2020/07/21/2020.07.12.199554.full.pdf) vector (1024-dimensional) of the protein.
        """    
        pp_sequence = " ".join(aa_sequence)
        cleaned_sequence = re.sub(r"[UZOB]", "X", pp_sequence)
        tokens = self.protein_tokenizer(cleaned_sequence, return_tensors="pt")
        output = self.protbert(**tokens)
        return output.last_hidden_state.detach().numpy().mean(axis=1)

    def vectorize_proteins(self, aa_sequences: List[str]) -> np.array:
        """Vectorizes the proteins with [`ProtBert`](https://www.biorxiv.org/content/biorxiv/early/2020/07/21/2020.07.12.199554.full.pdf) embeddings.

        Parameters
        ----------
        aa_sequences : List[str]
            The amino-acid sequences of the proteins.

        Returns
        -------
        np.array
            An $N \\times 1024$ ($N$ is the number of the input proteins) matrix that contains [`ProtBert`](https://www.biorxiv.org/content/biorxiv/early/2020/07/21/2020.07.12.199554.full.pdf) vectors of the proteins.
        """   
        return np.vstack(
            [self.get_protbert_embedding(aa_sequence) for aa_sequence in aa_sequences]
        )

__init__(n_epochs=200, learning_rate=0.001, batch_size=256)

Constructor to create an LMDTA instance. LMDTA represents ligands and proteins with pre-trained language model embeddings obtained via ChemBERTa and ProtBert models, respectively. A fully-connected neural network with two layers is used afterwards to predict affinities.

Parameters:

n_epochs : int, optional
    Number of epochs to train the model, by default 200.
learning_rate : float, optional
    Learning rate during optimization, by default 0.001.
batch_size : int, optional
    Batch size during training, by default 256.
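
A minimal usage sketch, analogous to the BPEDTA example above. It assumes LMDTA is importable from pydebiaseddta.predictors and that train follows the Predictor interface; note that the constructor loads the ChemBERTa and ProtBert weights via transformers, which downloads them from the Hugging Face Hub on first use.

from pydebiaseddta.predictors import LMDTA  # assumed import path

# Toy data: two ligand SMILES strings, two protein sequences, two affinity labels.
train_ligands = ["CC(=O)Oc1ccccc1C(=O)O", "CN1CCC[C@H]1c1cccnc1"]
train_proteins = ["MKTAYIAKQRQISFVKSHFSRQ", "MEEPQSDPSVEPPLSQETFSDL"]
train_labels = [5.3, 7.1]

lmdta = LMDTA(n_epochs=2, batch_size=32)  # a few epochs keep the sketch fast
lmdta.train(train_ligands, train_proteins, train_labels)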
Source code in pydebiaseddta/predictors/lmdta.py
def __init__(
    self, n_epochs: int = 200, learning_rate: float = 0.001, batch_size: int = 256
):
    """Constructor to create a LMDTA instance.
    LMDTA represents ligands and proteins with pre-trained language model embeddings
    obtained via [`ChemBERTa`](https://arxiv.org/abs/2010.09885) and  [`ProtBert`](https://www.biorxiv.org/content/biorxiv/early/2020/07/21/2020.07.12.199554.full.pdf) models, respectively. 
    A fully-connected neural network with two layers is used afterwards to predict affinities.

    Parameters
    ----------
    n_epochs : int, optional
        Number of epochs to train the model, by default 200.
    learning_rate : float, optional
        Learning rate during optimization, by default 0.001.
    batch_size : int, optional
         Batch size during training, by default 256.
    """
    transformers.logging.set_verbosity(transformers.logging.CRITICAL)
    self.chemical_tokenizer = AutoTokenizer.from_pretrained(
        "seyonec/PubChem10M_SMILES_BPE_450k"
    )
    self.chemberta = AutoModel.from_pretrained("seyonec/PubChem10M_SMILES_BPE_450k")

    self.protein_tokenizer = AutoTokenizer.from_pretrained(
        "Rostlab/prot_bert", do_lower_case=False
    )
    self.protbert = AutoModel.from_pretrained("Rostlab/prot_bert")
    TFPredictor.__init__(self, n_epochs, learning_rate, batch_size)

build()

Builds a LMDTA predictor in keras with the parameters specified during construction.

Returns:

tensorflow.keras.models.Model
    The built model.

Source code in pydebiaseddta/predictors/lmdta.py
def build(self):
    """Builds a `LMDTA` predictor in `keras` with the parameters specified during construction.

    Returns
    -------
    tensorflow.keras.models.Model
        The built model.
    """
    chemicals = Input(shape=(768,), dtype="float32")
    proteins = Input(shape=(1024,), dtype="float32")

    interaction_representation = Concatenate(axis=-1)([chemicals, proteins])

    FC1 = Dense(1024, activation="relu")(interaction_representation)
    FC1 = Dropout(0.1)(FC1)
    FC2 = Dense(512, activation="relu")(FC1)
    predictions = Dense(1, kernel_initializer="normal")(FC2)

    opt = Adam(self.learning_rate)
    lmdta = Model(inputs=[chemicals, proteins], outputs=[predictions])
    lmdta.compile(
        optimizer=opt, loss="mean_squared_error", metrics=["mean_squared_error"]
    )
    return lmdta

get_chemberta_embedding(smiles) cached

Computes the ChemBERTa vector for a ligand. Since creating the vector is computation-heavy, an lru_cache of size 2048 is used to store produced vectors.

Parameters:

smiles : str, required
    SMILES string of the ligand.

Returns:

np.array
    ChemBERTa vector (768-dimensional) of the ligand.
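
A quick sketch of the output. Note that the mean pooling in the source keeps the batch axis, so a single call returns an array of shape (1, 768):

from pydebiaseddta.predictors import LMDTA  # assumed import path

lmdta = LMDTA()
vec = lmdta.get_chemberta_embedding("CC(=O)Oc1ccccc1C(=O)O")
print(vec.shape)  # expected: (1, 768), the mean-pooled ChemBERTa token embeddings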

Source code in pydebiaseddta/predictors/lmdta.py
@lru_cache(maxsize=2048)
def get_chemberta_embedding(self, smiles: str) -> np.array:
    """Computes the [`ChemBERTa`](https://arxiv.org/abs/2010.09885) vector for a ligand. 
    Since creating the vector is computation-heavy, an `lru_cache` of size 2048 is used to store produced vectors.

    Parameters
    ----------
    smiles : str
        SMILES string of the ligand.

    Returns
    -------
    np.array
        [`ChemBERTa`](https://arxiv.org/abs/2010.09885) vector (768-dimensional) of the ligand.
    """        
    tokens = self.chemical_tokenizer(smiles, return_tensors="pt")
    output = self.chemberta(**tokens)
    return output.last_hidden_state.detach().numpy().mean(axis=1)

get_protbert_embedding(aa_sequence) cached

Computes the ProtBert vector for a protein. Since creating the vector is computation-heavy, an lru_cache of size 1024 is used to store produced vectors.

Parameters:

aa_sequence : str, required
    Amino-acid sequence of the protein.

Returns:

np.array
    ProtBert vector (1024-dimensional) of the protein.
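
As with the ChemBERTa helper, the mean pooling keeps the batch axis, so a single call returns an array of shape (1, 1024). A quick sketch:

from pydebiaseddta.predictors import LMDTA  # assumed import path

lmdta = LMDTA()
vec = lmdta.get_protbert_embedding("MKTAYIAKQRQISFVKSHFSRQ")
print(vec.shape)  # expected: (1, 1024), the mean-pooled ProtBert token embeddings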

Source code in pydebiaseddta/predictors/lmdta.py
@lru_cache(maxsize=1024)
def get_protbert_embedding(self, aa_sequence: str) -> np.array:
    """Computes the [`ProtBert`](https://www.biorxiv.org/content/biorxiv/early/2020/07/21/2020.07.12.199554.full.pdf) vector for a protein. 
    Since creating the vector is computation-heavy, an `lru_cache` of size 1024 is used to store produced vectors.

    Parameters
    ----------
    aa_sequence : str
        Amino-acid sequence of the protein.

    Returns
    -------
    np.array
        [`ProtBert`](https://www.biorxiv.org/content/biorxiv/early/2020/07/21/2020.07.12.199554.full.pdf) vector (1024-dimensional) of the protein.
    """    
    pp_sequence = " ".join(aa_sequence)
    cleaned_sequence = re.sub(r"[UZOB]", "X", pp_sequence)
    tokens = self.protein_tokenizer(cleaned_sequence, return_tensors="pt")
    output = self.protbert(**tokens)
    return output.last_hidden_state.detach().numpy().mean(axis=1)

vectorize_ligands(ligands)

Vectorizes the ligands with ChemBERTa embeddings.

Parameters:

ligands : List[str], required
    The SMILES strings of ligands.

Returns:

np.array
    An \(N \times 768\) (\(N\) is the number of the input ligands) matrix that contains ChemBERTa vectors of the ligands.

Source code in pydebiaseddta/predictors/lmdta.py
def vectorize_ligands(self, ligands: List[str]) -> np.array:
    """Vectorizes the ligands with [`ChemBERTa`](https://arxiv.org/abs/2010.09885) embeddings.

    Parameters
    ----------
    ligands : List[str]
        The SMILES strings of ligands.

    Returns
    -------
    np.array
        An $N \\times 768$ ($N$ is the number of the input ligands) matrix that contains [`ChemBERTa`](https://arxiv.org/abs/2010.09885) vectors of the ligands.
    """        
    return np.vstack(
        [self.get_chemberta_embedding(chemical) for chemical in ligands]
    )

vectorize_proteins(aa_sequences)

Vectorizes the proteins with ProtBert embeddings.

Parameters:

aa_sequences : List[str], required
    The amino-acid sequences of the proteins.

Returns:

np.array
    An \(N \times 1024\) (\(N\) is the number of the input proteins) matrix that contains ProtBert vectors of the proteins.
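
For completeness, a short sketch that vectorizes small batches with both helpers; the shapes follow from the per-molecule embeddings above, and the inputs are toy placeholders:

from pydebiaseddta.predictors import LMDTA  # assumed import path

lmdta = LMDTA()
ligand_mat = lmdta.vectorize_ligands(["CCO", "c1ccccc1"])           # expected shape: (2, 768)
protein_mat = lmdta.vectorize_proteins(["MKTAYIAKQ", "MEEPQSDPS"])  # expected shape: (2, 1024)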

Source code in pydebiaseddta/predictors/lmdta.py
def vectorize_proteins(self, aa_sequences: List[str]) -> np.array:
    """Vectorizes the proteins with [`ProtBert`](https://www.biorxiv.org/content/biorxiv/early/2020/07/21/2020.07.12.199554.full.pdf) embeddings.

    Parameters
    ----------
    aa_sequences : List[str]
        The amino-acid sequences of the proteins.

    Returns
    -------
    np.array
        An $N \\times 1024$ ($N$ is the number of the input proteins) matrix that contains [`ProtBert`](https://www.biorxiv.org/content/biorxiv/early/2020/07/21/2020.07.12.199554.full.pdf) vectors of the proteins.
    """   
    return np.vstack(
        [self.get_protbert_embedding(aa_sequence) for aa_sequence in aa_sequences]
    )