
predictors

The submodule that contains the predictors, i.e., the drug-target affinity (DTA) prediction models, implemented in the DebiasedDTA study. The implemented predictors are BPEDTA, DeepDTA, and LMDTA. Abstract classes are also available to quickly train a custom DTA prediction model with DebiasedDTA.

Predictor

Bases: ABC

An abstract class that implements the interface of a predictor in pydebiaseddta. The predictors are characterized by an n_epochs attribute and a train function, whose signatures are implemented by this class. Any instance of the Predictor class can be trained in the DebiasedDTA training framework, and therefore Predictor can be inherited to debias custom DTA prediction models.

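For illustration, a minimal custom predictor might look like the sketch below. The MeanAffinityPredictor class and its weighted-mean baseline are illustrative assumptions, not part of the library; the point is only which signatures a child class has to provide so that DebiasedDTA can reweight its training samples.

from typing import Any, List

import numpy as np

from pydebiaseddta.predictors.abstract_predictors import Predictor


class MeanAffinityPredictor(Predictor):
    """Toy baseline that predicts the weighted mean of the training affinities."""

    def __init__(self, n_epochs: int = 1):
        super().__init__(n_epochs)
        self.mean_affinity = None

    def train(
        self,
        train_ligands: List[Any],
        train_proteins: List[Any],
        train_labels: List[float],
        val_ligands: List[Any] = None,
        val_proteins: List[Any] = None,
        val_labels: List[float] = None,
        sample_weights_by_epoch: List[np.array] = None,
    ) -> None:
        # Honor the per-epoch sample weights so DebiasedDTA can reweight the samples;
        # for this one-shot baseline only the final epoch's weights matter.
        weights = (
            sample_weights_by_epoch[-1]
            if sample_weights_by_epoch is not None
            else np.ones(len(train_labels))
        )
        self.mean_affinity = float(np.average(train_labels, weights=weights))

    def predict(self, ligands: List[Any], proteins: List[Any]) -> List[float]:
        return [self.mean_affinity] * len(ligands)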
Source code in pydebiaseddta/predictors/abstract_predictors.py
class Predictor(ABC):
    """An abstract class that implements the interface of a predictor in `pydebiaseddta`.
    The predictors are characterized by an `n_epochs` attribute and a `train` function, 
    whose signatures are implemented by this class. 
    Any instance of the `Predictor` class can be trained in the `DebiasedDTA` training framework,
    and therefore, `Predictor` can be inherited to debias custom DTA prediction models.
    """

    @abstractmethod
    def __init__(self, n_epochs: int, *args, **kwargs) -> None:
        """An abstract constructor for `Predictor` to display that `n_epochs` is a necessary attribute for children classes.

        Parameters
        ----------
        n_epochs : int
            Number of epochs to train the model.
        """
        self.n_epochs = n_epochs

    @abstractmethod
    def train(
        self,
        train_ligands: List[Any],
        train_proteins: List[Any],
        train_labels: List[float],
        val_ligands: List[Any] = None,
        val_proteins: List[Any] = None,
        val_labels: List[float] = None,
        sample_weights_by_epoch: List[np.array] = None,
    ) -> Any:
        """An abstract method to train DTA prediction models.
        The inputs can be of any biomolecule representation type.
        However, the training procedure must support sample weighting in every epoch.

        Parameters
        ----------
        train_ligands : List[Any]
            The training ligands as a List.
        train_proteins : List[Any]
            The training proteins as a List.
        train_labels : List[float]
            Affinity scores of the training protein-compound pairs
        val_ligands : List[Any], optional
            Validation ligands as a List, in case validation scores are measured during training, by default `None`
        val_proteins : List[Any], optional
            Validation proteins as a List, in case validation scores are measured during training, by default `None`
        val_labels : List[float], optional
            Affinity scores of validation protein-compound pairs as a List, in case validation scores are measured during training, by default `None`
        sample_weights_by_epoch : List[np.array], optional
            Weight of each training sample in every epoch, as a List of `n_epochs` arrays whose lengths equal the training set size, by default `None` and uniform weights are used.

        Returns
        -------
        Any
            The function is free to return any value after its training, including `None`.
        """
        pass

__init__(n_epochs, *args, **kwargs) abstractmethod

An abstract constructor for Predictor to indicate that n_epochs is a required attribute for child classes.

Parameters:

Name Type Description Default
n_epochs int

Number of epochs to train the model.

required
Source code in pydebiaseddta/predictors/abstract_predictors.py
@abstractmethod
def __init__(self, n_epochs: int, *args, **kwargs) -> None:
    """An abstract constructor for `Predictor` to display that `n_epochs` is a necessary attribute for children classes.

    Parameters
    ----------
    n_epochs : int
        Number of epochs to train the model.
    """
    self.n_epochs = n_epochs

train(train_ligands, train_proteins, train_labels, val_ligands=None, val_proteins=None, val_labels=None, sample_weights_by_epoch=None) abstractmethod

An abstract method to train DTA prediction models. The inputs can be of any biomolecule representation type. However, the training procedure must support sample weighting in every epoch.

Parameters:

Name Type Description Default
train_ligands List[Any]

The training ligands as a List.

required
train_proteins List[Any]

The training proteins as a List.

required
train_labels List[float]

Affinity scores of the training protein-compound pairs

required
val_ligands List[Any], optional

Validation ligands as a List, in case validation scores are measured during training, by default None

None
val_proteins List[Any], optional

Validation proteins as a List, in case validation scores are measured during training, by default None

None
val_labels List[float], optional

Affinity scores of validation protein-compound pairs as a List, in case validation scores are measured during training, by default None

None
sample_weights_by_epoch List[np.array], optional

Weight of each training sample in every epoch, as a List of n_epochs arrays whose lengths equal the training set size, by default None and uniform weights are used.

None

Returns:

Type Description
Any

The function is free to return any value after its training, including None.

Source code in pydebiaseddta/predictors/abstract_predictors.py
@abstractmethod
def train(
    self,
    train_ligands: List[Any],
    train_proteins: List[Any],
    train_labels: List[float],
    val_ligands: List[Any] = None,
    val_proteins: List[Any] = None,
    val_labels: List[float] = None,
    sample_weights_by_epoch: List[np.array] = None,
) -> Any:
    """An abstract method to train DTA prediction models.
    The inputs can be of any biomolecule representation type.
    However, the training procedure must support sample weighting in every epoch.

    Parameters
    ----------
    train_ligands : List[Any]
        The training ligands as a List.
    train_proteins : List[Any]
        The training proteins as a List.
    train_labels : List[float]
        Affinity scores of the training protein-compound pairs
    val_ligands : List[Any], optional
        Validation ligands as a List, in case validation scores are measured during training, by default `None`
    val_proteins : List[Any], optional
        Validation proteins as a List, in case validation scores are measured during training, by default `None`
    val_labels : List[float], optional
        Affinity scores of validation protein-compound pairs as a List, in case validation scores are measured during training, by default `None`
    sample_weights_by_epoch : List[np.array], optional
        Weight of each training sample in every epoch, as a List of `n_epochs` arrays whose lengths equal the training set size, by default `None` and uniform weights are used.

    Returns
    -------
    Any
        The function is free to return any value after its training, including `None`.
    """
    pass

TFPredictor

Bases: Predictor

The models in the DebiasedDTA study (BPE-DTA, LM-DTA, DeepDTA) are implemented in TensorFlow. The TFPredictor class provides an abstraction over these models to minimize code duplication. Child classes only implement the model building, biomolecule vectorization, and __init__ functions.
Model training, prediction, and save/load functions are inherited from this class.

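As a sketch of what a child class has to supply, the hypothetical TinyDTA below implements __init__, build, and the two vectorization hooks with deliberately simple stand-ins (character hashing and a small dense network). It is not one of the DebiasedDTA models; it only outlines the expected shape of a TFPredictor child.

import numpy as np
import tensorflow as tf

from pydebiaseddta.predictors.abstract_predictors import TFPredictor


class TinyDTA(TFPredictor):
    """Illustrative child class: naive character hashing and a small dense network."""

    def __init__(self, n_epochs: int = 10, learning_rate: float = 1e-3, batch_size: int = 32,
                 max_smi_len: int = 100, max_prot_len: int = 1000, vocab_size: int = 256, **kwargs):
        self.max_smi_len = max_smi_len
        self.max_prot_len = max_prot_len
        self.vocab_size = vocab_size
        # The parent constructor stores n_epochs/learning_rate/batch_size and calls self.build().
        TFPredictor.__init__(self, n_epochs, learning_rate, batch_size)

    def build(self):
        ligands = tf.keras.Input(shape=(self.max_smi_len,), dtype="int32")
        proteins = tf.keras.Input(shape=(self.max_prot_len,), dtype="int32")
        lig = tf.keras.layers.GlobalAveragePooling1D()(tf.keras.layers.Embedding(self.vocab_size, 32)(ligands))
        prot = tf.keras.layers.GlobalAveragePooling1D()(tf.keras.layers.Embedding(self.vocab_size, 32)(proteins))
        hidden = tf.keras.layers.Dense(64, activation="relu")(tf.keras.layers.Concatenate()([lig, prot]))
        output = tf.keras.layers.Dense(1)(hidden)
        model = tf.keras.Model(inputs=[ligands, proteins], outputs=[output])
        model.compile(optimizer=tf.keras.optimizers.Adam(self.learning_rate), loss="mean_squared_error")
        return model

    def _encode(self, sequences, max_len):
        # Hash each character into [1, vocab_size - 1]; 0 is reserved for padding.
        encoded = np.zeros((len(sequences), max_len), dtype="int32")
        for i, sequence in enumerate(sequences):
            for j, char in enumerate(sequence[:max_len]):
                encoded[i, j] = ord(char) % (self.vocab_size - 1) + 1
        return encoded

    def vectorize_ligands(self, ligands):
        return self._encode(ligands, self.max_smi_len)

    def vectorize_proteins(self, proteins):
        return self._encode(proteins, self.max_prot_len)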
Source code in pydebiaseddta/predictors/abstract_predictors.py
class TFPredictor(Predictor):
    """The models in DebiasedDTA study (BPE-DTA, LM-DTA, DeepDTA) are implemented in Tensorflow.
    `TFPredictor` class provides an abstraction to these models to minimize code duplication.
    The children classes only implement model building, biomolecule vectorization, and `__init__` functions.  
    Model training, prediction, and save/load functions are inherited from this class.
    """

    @abstractmethod
    def __init__(self, n_epochs: int, learning_rate: float, batch_size: int, **kwargs):
        """An abstract constructor for BPE-DTA, LM-DTA, and DeepDTA.
        The constructor sets the common attributes and calls the `build` function.

        Parameters
        ----------
        n_epochs : int
            Number of epochs to train the model.
        learning_rate : float
            The learning rate of the optimization algorithm.
        batch_size : int
            Batch size for training.
        """
        self.n_epochs = n_epochs
        self.learning_rate = learning_rate
        self.batch_size = batch_size
        self.history = dict()
        self.model = self.build()

    @abstractmethod
    def build(self):
        """An abstract function to create the model architecture.
        Every child has to implement this function.
        """
        pass

    @abstractmethod
    def vectorize_ligands(self, ligands):
        """An abstract function to vectorize ligands.
        Every child has to implement this function.
        """
        pass

    @abstractmethod
    def vectorize_proteins(self, proteins):
        """An abstract function to vectorize proteins.
        Every child has to implement this function.
        """
        pass

    @classmethod
    def from_file(cls, path: str):
        """A utility function to load a `TFPredictor` instance from disk.
        All attributes, including the model weights, are loaded.

        Parameters
        ----------
        path : str
            Path to load the prediction model from.

        Returns
        -------
        TFPredictor
            The previously saved model.
        """
        with open(f"{path}/params.json") as f:
            dct = json.load(f)

        instance = cls(**dct)

        instance.model = tf.keras.models.load_model(f"{path}/model")

        with open(f"{path}/history.json") as f:
            instance.history = json.load(f)
        return instance

    def train(
        self,
        train_ligands: List[str],
        train_proteins: List[str],
        train_labels: List[float],
        val_ligands: List[str] = None,
        val_proteins: List[str] = None,
        val_labels: List[float] = None,
        sample_weights_by_epoch: List[np.array] = None,
    ) -> Dict:
        """The common model training procedure for BPE-DTA, LM-DTA, and DeepDTA.
        The models adopt different biomolecule representation methods and model architectures,
        so, the training results are different.
        The training procedure supports validation for tracking, and sample weighting for debiasing.

        Parameters
        ----------
        train_ligands : List[str]
            SMILES strings of the training ligands.
        train_proteins : List[str]
            Amino-acid sequences of the training proteins.
        train_labels : List[float]
            Affinity scores of the training protein-ligand pairs.
        val_ligands : List[str], optional
            SMILES strings of the validation ligands, by default None and no validation is used.
        val_proteins : List[str], optional
            Amino-acid sequences of the validation proteins, by default None and no validation is used.
        val_labels : List[float], optional
            Affinity scores of the validation pairs, by default None and no validation is used.
        sample_weights_by_epoch : List[np.array], optional
            Weight of each training protein-ligand pair during training across epochs.
            This variable must be a List of size $E$ (number of training epochs),
            in which each element is a `np.array` of $N\times 1$, where $N$ is the training set size and 
            each element corresponds to the weight of a training sample.
            By default `None` and no weighting is used.

        Returns
        -------
        Dict
            Training history.
        """
        if sample_weights_by_epoch is None:
            sample_weights_by_epoch = create_uniform_weights(
                len(train_ligands), self.n_epochs
            )

        train_ligand_vectors = self.vectorize_ligands(train_ligands)
        train_protein_vectors = self.vectorize_proteins(train_proteins)
        train_labels = np.array(train_labels)

        val_tuple = None
        if (
            val_ligands is not None
            and val_proteins is not None
            and val_labels is not None
        ):
            val_ligand_vectors = self.vectorize_ligands(val_ligands)
            val_protein_vectors = self.vectorize_proteins(val_proteins)
            val_tuple = (
                [val_ligand_vectors, val_protein_vectors],
                np.array(val_labels),
            )

        train_stats_over_epochs = {"mse": [], "rmse": [], "r2": []}
        # Use fresh lists; a shallow dict copy would share the same list objects with the train stats.
        val_stats_over_epochs = {metric: [] for metric in train_stats_over_epochs}
        for e in range(self.n_epochs):
            self.model.fit(
                x=[train_ligand_vectors, train_protein_vectors],
                y=train_labels,
                sample_weight=sample_weights_by_epoch[e],
                validation_data=val_tuple,
                batch_size=self.batch_size,
                epochs=1,
            )

            train_stats = evaluate_predictions(
                gold_truths=train_labels,
                predictions=self.predict(train_ligands, train_proteins),
                metrics=list(train_stats_over_epochs.keys()),
            )
            for metric, stat in train_stats.items():
                train_stats_over_epochs[metric].append(stat)

            if val_tuple is not None:
                val_stats = evaluate_predictions(
                    gold_truths=val_labels,
                    predictions=self.predict(val_ligands, val_proteins),
                    metrics=list(val_stats_over_epochs.keys()),
                )
                for metric, stat in val_stats.items():
                    val_stats_over_epochs[metric].append(stat)

        self.history["train"] = train_stats_over_epochs
        if val_tuple is not None:
            self.history["val"] = val_stats_over_epochs

        return self.history

    def predict(self, ligands: List[str], proteins: List[str]) -> List[float]:
        """Predicts the affinities of a `List` of protein-ligand pairs via the trained DTA prediction model,
    *i.e.*, BPE-DTA, LM-DTA, or DeepDTA.

        Parameters
        ----------
        ligands : List[str]
            SMILES strings of the ligands.
        proteins : List[str]
            Amino-acid sequences of the proteins.

        Returns
        -------
        List[float]
            Predicted affinity scores by DTA prediction model.
        """        
        ligand_vectors = self.vectorize_ligands(ligands)
        protein_vectors = self.vectorize_proteins(proteins)
        return self.model.predict([ligand_vectors, protein_vectors]).tolist()

    def save(self, path: str) -> None:
        """A utility function to save a `TFPredictor` instance to the disk.
        All attributes, including the model weights, are saved.

        Parameters
        ----------
        path : str
            Path to save the predictor.
        """        
        self.model.save(f"{path}/model")

        with open(f"{path}/history.json", "w") as f:
            json.dump(self.history, f, indent=4)

        donot_copy = {"model", "history"}
        dct = {k: v for k, v in self.__dict__.items() if k not in donot_copy}
        with open(f"{path}/params.json", "w") as f:
            json.dump(dct, f, indent=4)

__init__(n_epochs, learning_rate, batch_size, **kwargs) abstractmethod

An abstract constructor for BPE-DTA, LM-DTA, and DeepDTA. The constructor sets the common attributes and calls the build function.

Parameters:

Name Type Description Default
n_epochs int

Number of epochs to train the model.

required
learning_rate float

The learning rate of the optimization algorithm.

required
batch_size int

Batch size for training.

required
Source code in pydebiaseddta/predictors/abstract_predictors.py
@abstractmethod
def __init__(self, n_epochs: int, learning_rate: float, batch_size: int, **kwargs):
    """An abstract constructor for BPE-DTA, LM-DTA, and DeepDTA.
    The constructor sets the common attributes and calls the `build` function.

    Parameters
    ----------
    n_epochs : int
        Number of epochs to train the model.
    learning_rate : float
        The learning rate of the optimization algorithm.
    batch_size : int
        Batch size for training.
    """
    self.n_epochs = n_epochs
    self.learning_rate = learning_rate
    self.batch_size = batch_size
    self.history = dict()
    self.model = self.build()

build() abstractmethod

An abstract function to create the model architecture. Every child has to implement this function.

Source code in pydebiaseddta/predictors/abstract_predictors.py
@abstractmethod
def build(self):
    """An abstract function to create the model architecture.
    Every child has to implement this function.
    """
    pass

from_file(path) classmethod

A utility function to load a TFPredictor instance from disk. All attributes, including the model weights, are loaded.

Parameters:

Name Type Description Default
path str

Path to load the prediction model from.

required

Returns:

Type Description
TFPredictor

The previously saved model.

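A sketch of the intended save/load round trip, reusing the hypothetical TinyDTA child class from the TFPredictor section above (the directory name is arbitrary). Note that from_file re-creates the instance as cls(**params) from params.json, so a child constructor must accept every attribute that save stores there.

# TinyDTA as defined in the illustrative sketch above.
predictor = TinyDTA(n_epochs=1)
predictor.save("runs/tinydta_demo")  # writes runs/tinydta_demo/model plus params.json and history.json

# from_file re-instantiates the class from params.json, then restores the Keras model
# and the training history; TinyDTA accepts the stored attributes via its keyword
# arguments (and **kwargs), which is what makes this round trip possible.
restored = TinyDTA.from_file("runs/tinydta_demo")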
Source code in pydebiaseddta/predictors/abstract_predictors.py
@classmethod
def from_file(cls, path: str):
    """A utility function to load a `TFPredictor` instance from disk.
    All attributes, including the model weights, are loaded.

    Parameters
    ----------
    path : str
        Path to load the prediction model from.

    Returns
    -------
    TFPredictor
        The previously saved model.
    """
    with open(f"{path}/params.json") as f:
        dct = json.load(f)

    instance = cls(**dct)

    instance.model = tf.keras.models.load_model(f"{path}/model")

    with open(f"{path}/history.json") as f:
        instance.history = json.load(f)
    return instance

predict(ligands, proteins)

Predicts the affinities of a List of protein-ligand pairs via the trained DTA prediction model, i.e., BPE-DTA, LM-DTA, or DeepDTA.

Parameters:

Name Type Description Default
ligands List[str]

SMILES strings of the ligands.

required
proteins List[str]

Amino-acid sequences of the proteins.

required

Returns:

Type Description
List[float]

Predicted affinity scores by DTA prediction model.

Source code in pydebiaseddta/predictors/abstract_predictors.py
def predict(self, ligands: List[str], proteins: List[str]) -> List[float]:
    """Predicts the affinities of a `List` of protein-ligand pairs via the trained DTA prediction model,
    *i.e.*, BPE-DTA, LM-DTA, or DeepDTA.

    Parameters
    ----------
    ligands : List[str]
        SMILES strings of the ligands.
    proteins : List[str]
        Amino-acid sequences of the proteins.

    Returns
    -------
    List[float]
        Predicted affinity scores by DTA prediction model.
    """        
    ligand_vectors = self.vectorize_ligands(ligands)
    protein_vectors = self.vectorize_proteins(proteins)
    return self.model.predict([ligand_vectors, protein_vectors]).tolist()

save(path)

A utility function to save a TFPredictor instance to the disk. All attributes, including the model weights, are saved.

Parameters:

Name Type Description Default
path str

Path to save the predictor.

required
Source code in pydebiaseddta/predictors/abstract_predictors.py
def save(self, path: str) -> None:
    """A utility function to save a `TFPredictor` instance to the disk.
    All attributes, including the model weights, are saved.

    Parameters
    ----------
    path : str
        Path to save the predictor.
    """        
    self.model.save(f"{path}/model")

    with open(f"{path}/history.json", "w") as f:
        json.dump(self.history, f, indent=4)

    donot_copy = {"model", "history"}
    dct = {k: v for k, v in self.__dict__.items() if k not in donot_copy}
    with open(f"{path}/params.json", "w") as f:
        json.dump(dct, f, indent=4)

train(train_ligands, train_proteins, train_labels, val_ligands=None, val_proteins=None, val_labels=None, sample_weights_by_epoch=None)

The common model training procedure for BPE-DTA, LM-DTA, and DeepDTA. The models adopt different biomolecule representation methods and model architectures, so, the training results are different. The training procedure supports validation for tracking, and sample weighting for debiasing.

Parameters:

Name Type Description Default
train_ligands List[str]

SMILES strings of the training ligands.

required
train_proteins List[str]

Amino-acid sequences of the training proteins.

required
train_labels List[float]

Affinity scores of the training protein-ligand pairs.

required
val_ligands List[str], optional

SMILES strings of the validation ligands, by default None and no validation is used.

None
val_proteins List[str], optional

Amino-acid sequences of the validation proteins, by default None and no validation is used.

None
val_labels List[float], optional

Affinity scores of the validation pairs, by default None and no validation is used.

None
sample_weights_by_epoch List[np.array], optional

Weight of each training protein-ligand pair during training across epochs. This variable must be a List of size \(E\) (number of training epochs), in which each element is a np.array of \(N \times 1\), where \(N\) is the training set size and each element corresponds to the weight of a training sample. By default None and no weighting is used.

None

Returns:

Type Description
Dict

Training history.

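For illustration, the snippet below builds a valid sample_weights_by_epoch argument: a List with one weight array per epoch, each of length equal to the training set size. The toy data, the random importance scores, and the linear interpolation from uniform weights are illustrative stand-ins, not the DebiasedDTA weighting scheme; DeepDTA is used only for concreteness.

import numpy as np

from pydebiaseddta.predictors.deepdta import DeepDTA

# Toy inputs (illustrative only): SMILES strings, amino-acid sequences, affinity scores.
train_ligands = ["CCO", "CC(=O)O", "c1ccccc1O"]
train_proteins = ["MKKLVLSLSLVLAFSSA", "MTEYKLVVVGAGGVGKS", "MAARGSLLRSLLFLLAA"]
train_labels = [5.0, 6.2, 7.1]

n_epochs, n_samples = 5, len(train_ligands)
importance = np.random.rand(n_samples)  # stand-in per-sample importance scores in [0, 1)

# One weight array per epoch: start from uniform weights and interpolate toward `importance`.
sample_weights_by_epoch = [
    (1 - e / (n_epochs - 1)) * np.ones(n_samples) + (e / (n_epochs - 1)) * importance
    for e in range(n_epochs)
]

predictor = DeepDTA(n_epochs=n_epochs, batch_size=2)
history = predictor.train(
    train_ligands,
    train_proteins,
    train_labels,
    sample_weights_by_epoch=sample_weights_by_epoch,
)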
Source code in pydebiaseddta/predictors/abstract_predictors.py
def train(
    self,
    train_ligands: List[str],
    train_proteins: List[str],
    train_labels: List[float],
    val_ligands: List[str] = None,
    val_proteins: List[str] = None,
    val_labels: List[float] = None,
    sample_weights_by_epoch: List[np.array] = None,
) -> Dict:
    """The common model training procedure for BPE-DTA, LM-DTA, and DeepDTA.
    The models adopt different biomolecule representation methods and model architectures,
    so, the training results are different.
    The training procedure supports validation for tracking, and sample weighting for debiasing.

    Parameters
    ----------
    train_ligands : List[str]
        SMILES strings of the training ligands.
    train_proteins : List[str]
        Amino-acid sequences of the training proteins.
    train_labels : List[float]
        Affinity scores of the training protein-ligand pairs.
    val_ligands : List[str], optional
        SMILES strings of the validation ligands, by default None and no validation is used.
    val_proteins : List[str], optional
        Amino-acid sequences of the validation proteins, by default None and no validation is used.
    val_labels : List[float], optional
        Affinity scores of the validation pairs, by default None and no validation is used.
    sample_weights_by_epoch : List[np.array], optional
        Weight of each training protein-ligand pair during training across epochs.
        This variable must be a List of size $E$ (number of training epochs),
        in which each element is a `np.array` of $N\times 1$, where $N$ is the training set size and 
        each element corresponds to the weight of a training sample.
        By default `None` and no weighting is used.

    Returns
    -------
    Dict
        Training history.
    """
    if sample_weights_by_epoch is None:
        sample_weights_by_epoch = create_uniform_weights(
            len(train_ligands), self.n_epochs
        )

    train_ligand_vectors = self.vectorize_ligands(train_ligands)
    train_protein_vectors = self.vectorize_proteins(train_proteins)
    train_labels = np.array(train_labels)

    val_tuple = None
    if (
        val_ligands is not None
        and val_proteins is not None
        and val_labels is not None
    ):
        val_ligand_vectors = self.vectorize_ligands(val_ligands)
        val_protein_vectors = self.vectorize_proteins(val_proteins)
        val_tuple = (
            [val_ligand_vectors, val_protein_vectors],
            np.array(val_labels),
        )

    train_stats_over_epochs = {"mse": [], "rmse": [], "r2": []}
    # Use fresh lists; a shallow dict copy would share the same list objects with the train stats.
    val_stats_over_epochs = {metric: [] for metric in train_stats_over_epochs}
    for e in range(self.n_epochs):
        self.model.fit(
            x=[train_ligand_vectors, train_protein_vectors],
            y=train_labels,
            sample_weight=sample_weights_by_epoch[e],
            validation_data=val_tuple,
            batch_size=self.batch_size,
            epochs=1,
        )

        train_stats = evaluate_predictions(
            gold_truths=train_labels,
            predictions=self.predict(train_ligands, train_proteins),
            metrics=list(train_stats_over_epochs.keys()),
        )
        for metric, stat in train_stats.items():
            train_stats_over_epochs[metric].append(stat)

        if val_tuple is not None:
            val_stats = evaluate_predictions(
                gold_truths=val_labels,
                predictions=self.predict(val_ligands, val_proteins),
                metrics=list(val_stats_over_epochs.keys()),
            )
            for metric, stat in val_stats.items():
                val_stats_over_epochs[metric].append(stat)

    self.history["train"] = train_stats_over_epochs
    if val_tuple is not None:
        self.history["val"] = val_stats_over_epochs

    return self.history

vectorize_ligands(ligands) abstractmethod

An abstract function to vectorize ligands. Every child has to implement this function.

Source code in pydebiaseddta/predictors/abstract_predictors.py
@abstractmethod
def vectorize_ligands(self, ligands):
    """An abstract function to vectorize ligands.
    Every child has to implement this function.
    """
    pass

vectorize_proteins(proteins) abstractmethod

An abstract function to vectorize proteins. Every child has to implement this function.

Source code in pydebiaseddta/predictors/abstract_predictors.py
@abstractmethod
def vectorize_proteins(self, proteins):
    """An abstract function to vectorize proteins.
    Every child has to implement this function.
    """
    pass

create_uniform_weights(n_samples, n_epochs)

Create a list of weights such that every training instance has equal weight across all epochs, i.e., no sample weighting is used.

Parameters:

Name Type Description Default
n_samples int

Number of training instances.

required
n_epochs int

Number of epochs to train the model.

required

Returns:

Type Description
List[np.array]

Sample weights across epochs. Each instance has a weight of 1 for all epochs.

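A quick illustration of the returned structure:

from pydebiaseddta.predictors.abstract_predictors import create_uniform_weights

weights = create_uniform_weights(n_samples=3, n_epochs=2)
# -> [array([1, 1, 1]), array([1, 1, 1])]: one weight vector per epoch, every sample weighted 1.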
Source code in pydebiaseddta/predictors/abstract_predictors.py
def create_uniform_weights(n_samples: int, n_epochs: int) -> List[np.array]:
    """Create a lists of weights such that every training instance has the equal weight across all epoch,
    *i.e.*, no sample weighting is used.

    Parameters
    ----------
    n_samples : int
        Number of training instances.
    n_epochs : int
        Number of epochs to train the model.

    Returns
    -------
    List[np.array]
        Sample weights across epochs. Each instance has a weight of 1 for all epochs.
    """
    return [np.array([1] * n_samples) for _ in range(n_epochs)]

DeepDTA

Bases: TFPredictor

Source code in pydebiaseddta/predictors/deepdta.py
class DeepDTA(TFPredictor):
    def __init__(
        self,
        max_smi_len: int = 100,
        max_prot_len: int = 1000,
        embedding_dim: int = 128,
        learning_rate: float = 0.001,
        batch_size: int = 256,
        n_epochs: int = 200,
        num_filters: int = 32,
        smi_filter_len: int = 4,
        prot_filter_len: int = 6,
    ):
        """Constructor to create a DeepDTA instance.
        DeepDTA segments SMILES strings of ligands and amino-acid sequences of proteins into characters,
        and applies three layers of convolutions to learn latent representations. 
        A fully-connected neural network with three layers is used afterwards to predict affinities.

        Parameters
        ----------
        max_smi_len : int, optional
            Maximum number of characters in a SMILES string, by default 100. 
            Longer SMILES strings are truncated.
        max_prot_len : int, optional
            Maximum number of amino acids in a protein sequence, by default 1000. 
            Longer sequences are truncated.
        embedding_dim : int, optional
            The dimension of the biomolecule characters, by default 128.
        learning_rate : float, optional
            Learning rate during optimization, by default 0.001.
        batch_size : int, optional
            Batch size during training, by default 256.
        n_epochs : int, optional
            Number of epochs to train the model, by default 200.
        num_filters : int, optional
            Number of filters in the first convolution block. The following blocks use two and three times this number, respectively, by default 32.
        smi_filter_len : int, optional
            Length of filters in the convolution blocks for ligands, by default 4.
        prot_filter_len : int, optional
            Length of filters in the convolution blocks for proteins, by default 6.
        """    
        self.max_smi_len = max_smi_len
        self.max_prot_len = max_prot_len
        self.embedding_dim = embedding_dim
        self.num_filters = num_filters
        self.smi_filter_len = smi_filter_len
        self.prot_filter_len = prot_filter_len

        self.chem_vocab_size = 94
        self.prot_vocab_size = 26
        TFPredictor.__init__(self, n_epochs, learning_rate, batch_size)

    def build(self):
        """Builds a `DeepDTA` predictor in `keras` with the parameters specified during construction.

        Returns
        -------
        tensorflow.keras.models.Model
            The built model.
        """    
        # Inputs
        ligands = Input(shape=(self.max_smi_len,), dtype="int32")

        # chemical representation
        ligand_representation = Embedding(
            input_dim=self.chem_vocab_size + 1,
            output_dim=self.embedding_dim,
            input_length=self.max_smi_len,
            mask_zero=True,
        )(ligands)
        ligand_representation = Conv1D(
            filters=self.num_filters,
            kernel_size=self.smi_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(ligand_representation)
        ligand_representation = Conv1D(
            filters=self.num_filters * 2,
            kernel_size=self.smi_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(ligand_representation)
        ligand_representation = Conv1D(
            filters=self.num_filters * 3,
            kernel_size=self.smi_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(ligand_representation)
        ligand_representation = GlobalMaxPooling1D()(ligand_representation)

        # Protein representation
        proteins = Input(shape=(self.max_prot_len,), dtype="int32")
        protein_representation = Embedding(
            input_dim=self.prot_vocab_size + 1,
            output_dim=self.embedding_dim,
            input_length=self.max_prot_len,
            mask_zero=True,
        )(proteins)
        protein_representation = Conv1D(
            filters=self.num_filters,
            kernel_size=self.prot_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(protein_representation)
        protein_representation = Conv1D(
            filters=self.num_filters * 2,
            kernel_size=self.prot_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(protein_representation)
        protein_representation = Conv1D(
            filters=self.num_filters * 3,
            kernel_size=self.prot_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(protein_representation)
        protein_representation = GlobalMaxPooling1D()(protein_representation)

        interaction_representation = Concatenate(axis=-1)(
            [ligand_representation, protein_representation]
        )

        # Fully connected layers
        FC1 = Dense(1024, activation="relu")(interaction_representation)
        FC1 = Dropout(0.1)(FC1)
        FC2 = Dense(1024, activation="relu")(FC1)
        FC2 = Dropout(0.1)(FC2)
        FC3 = Dense(512, activation="relu")(FC2)
        predictions = Dense(1, kernel_initializer="normal")(FC3)

        opt = Adam(self.learning_rate)
        deepdta = Model(inputs=[ligands, proteins], outputs=[predictions])
        deepdta.compile(
            optimizer=opt,
            loss="mean_squared_error",
            metrics=["mean_squared_error"],
        )
        return deepdta

    def vectorize_ligands(self, ligands: List[str]) -> np.array:
        """Segments SMILES strings of ligands into characters and applies label encoding.
        Truncation and padding are also applied to prepare ligands for training and/or prediction.

        Parameters
        ----------
        ligands : List[str]
            The SMILES strings of ligands.

        Returns
        -------
        np.array
            An $N \\times max\\_smi\\_len$ ($N$ is the number of the input ligands) matrix that contains label encoded sequences of SMILES tokens.
        """     
        smi_to_unichar_encoding = load_smiles_to_unichar_encoding()
        unichars = smiles_to_unichar_batch(ligands, smi_to_unichar_encoding)
        word_identifier = load_chemical_word_identifier(vocab_size=94)

        return np.array(
            word_identifier.encode_sequences(unichars, self.max_smi_len)
        )

    def vectorize_proteins(self, aa_sequences: List[str]) -> np.array:
        """Segments amino-acid sequences of proteins into characters and applies label encoding.
        Truncation and padding are also applied to prepare proteins for training and/or prediction.

        Parameters
        ----------
        aa_sequences : List[str]
            The amino-acid sequences of proteins.

        Returns
        -------
        np.array
            An $N \\times max\\_prot\\_len$ ($N$ is the number of the input proteins) matrix that contains label encoded sequences of amino-acids.
        """
        word_identifier = load_protein_word_identifier(vocab_size=26)
        return np.array(
            word_identifier.encode_sequences(aa_sequences, self.max_prot_len)
        )

__init__(max_smi_len=100, max_prot_len=1000, embedding_dim=128, learning_rate=0.001, batch_size=256, n_epochs=200, num_filters=32, smi_filter_len=4, prot_filter_len=6)

Constructor to create a DeepDTA instance. DeepDTA segments SMILES strings of ligands and amino-acid sequences of proteins into characters, and applies three layers of convolutions to learn latent representations. A fully-connected neural network with three layers is used afterwards to predict affinities.

Parameters:

Name Type Description Default
max_smi_len int, optional

Maximum number of characters in a SMILES string, by default 100. Longer SMILES strings are truncated.

100
max_prot_len int, optional

Maximum number of amino acids in a protein sequence, by default 1000. Longer sequences are truncated.

1000
embedding_dim int, optional

The dimension of the biomolecule characters, by default 128.

128
learning_rate float, optional

Learning rate during optimization, by default 0.001.

0.001
batch_size int, optional

Batch size during training, by default 256.

256
n_epochs int, optional

Number of epochs to train the model, by default 200.

200
num_filters int, optional

Number of filters in the first convolution block. The following blocks use two and three times this number, respectively, by default 32.

32
smi_filter_len int, optional

Length of filters in the convolution blocks for ligands, by default 4.

4
prot_filter_len int, optional

Length of filters in the convolution blocks for proteins, by default 6.

6
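A minimal usage sketch; the SMILES strings, amino-acid sequences, and affinity values below are made up, and two epochs are used only to keep the illustration short:

from pydebiaseddta.predictors.deepdta import DeepDTA

train_ligands = ["CC(=O)Oc1ccccc1C(=O)O", "CCO", "c1ccccc1O"]           # SMILES strings
train_proteins = ["MKKLVLSLSLVLAFSSA", "MTEYKLVVVGAGGVGKS", "MAARGSLL"]  # amino-acid sequences
train_labels = [5.1, 6.4, 7.2]                                           # affinity scores

deepdta = DeepDTA(n_epochs=2, batch_size=2)
history = deepdta.train(train_ligands, train_proteins, train_labels)
predictions = deepdta.predict(train_ligands, train_proteins)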
Source code in pydebiaseddta/predictors/deepdta.py
def __init__(
    self,
    max_smi_len: int = 100,
    max_prot_len: int = 1000,
    embedding_dim: int = 128,
    learning_rate: float = 0.001,
    batch_size: int = 256,
    n_epochs: int = 200,
    num_filters: int = 32,
    smi_filter_len: int = 4,
    prot_filter_len: int = 6,
):
    """Constructor to create a DeepDTA instance.
    DeepDTA segments SMILES strings of ligands and amino-acid sequences of proteins into characters,
    and applies three layers of convolutions to learn latent representations. 
    A fully-connected neural network with three layers is used afterwards to predict affinities.

    Parameters
    ----------
    max_smi_len : int, optional
        Maximum number of characters in a SMILES string, by default 100. 
        Longer SMILES strings are truncated.
    max_prot_len : int, optional
        Maximum number of amino acids in a protein sequence, by default 1000. 
        Longer sequences are truncated.
    embedding_dim : int, optional
        The dimension of the biomolecule characters, by default 128.
    learning_rate : float, optional
        Learning rate during optimization, by default 0.001.
    batch_size : int, optional
        Batch size during training, by default 256.
    n_epochs : int, optional
        Number of epochs to train the model, by default 200.
    num_filters : int, optional
        Number of filters in the first convolution block. The following blocks use two and three times this number, respectively, by default 32.
    smi_filter_len : int, optional
        Length of filters in the convolution blocks for ligands, by default 4.
    prot_filter_len : int, optional
        Length of filters in the convolution blocks for proteins, by default 6.
    """    
    self.max_smi_len = max_smi_len
    self.max_prot_len = max_prot_len
    self.embedding_dim = embedding_dim
    self.num_filters = num_filters
    self.smi_filter_len = smi_filter_len
    self.prot_filter_len = prot_filter_len

    self.chem_vocab_size = 94
    self.prot_vocab_size = 26
    TFPredictor.__init__(self, n_epochs, learning_rate, batch_size)

build()

Builds a DeepDTA predictor in keras with the parameters specified during construction.

Returns:

Type Description
tensorflow.keras.models.Model

The built model.

Source code in pydebiaseddta/predictors/deepdta.py
def build(self):
    """Builds a `DeepDTA` predictor in `keras` with the parameters specified during construction.

    Returns
    -------
    tensorflow.keras.models.Model
        The built model.
    """    
    # Inputs
    ligands = Input(shape=(self.max_smi_len,), dtype="int32")

    # chemical representation
    ligand_representation = Embedding(
        input_dim=self.chem_vocab_size + 1,
        output_dim=self.embedding_dim,
        input_length=self.max_smi_len,
        mask_zero=True,
    )(ligands)
    ligand_representation = Conv1D(
        filters=self.num_filters,
        kernel_size=self.smi_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(ligand_representation)
    ligand_representation = Conv1D(
        filters=self.num_filters * 2,
        kernel_size=self.smi_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(ligand_representation)
    ligand_representation = Conv1D(
        filters=self.num_filters * 3,
        kernel_size=self.smi_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(ligand_representation)
    ligand_representation = GlobalMaxPooling1D()(ligand_representation)

    # Protein representation
    proteins = Input(shape=(self.max_prot_len,), dtype="int32")
    protein_representation = Embedding(
        input_dim=self.prot_vocab_size + 1,
        output_dim=self.embedding_dim,
        input_length=self.max_prot_len,
        mask_zero=True,
    )(proteins)
    protein_representation = Conv1D(
        filters=self.num_filters,
        kernel_size=self.prot_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(protein_representation)
    protein_representation = Conv1D(
        filters=self.num_filters * 2,
        kernel_size=self.prot_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(protein_representation)
    protein_representation = Conv1D(
        filters=self.num_filters * 3,
        kernel_size=self.prot_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(protein_representation)
    protein_representation = GlobalMaxPooling1D()(protein_representation)

    interaction_representation = Concatenate(axis=-1)(
        [ligand_representation, protein_representation]
    )

    # Fully connected layers
    FC1 = Dense(1024, activation="relu")(interaction_representation)
    FC1 = Dropout(0.1)(FC1)
    FC2 = Dense(1024, activation="relu")(FC1)
    FC2 = Dropout(0.1)(FC2)
    FC3 = Dense(512, activation="relu")(FC2)
    predictions = Dense(1, kernel_initializer="normal")(FC3)

    opt = Adam(self.learning_rate)
    deepdta = Model(inputs=[ligands, proteins], outputs=[predictions])
    deepdta.compile(
        optimizer=opt,
        loss="mean_squared_error",
        metrics=["mean_squared_error"],
    )
    return deepdta

vectorize_ligands(ligands)

Segments SMILES strings of ligands into characters and applies label encoding. Truncation and padding are also applied to prepare ligands for training and/or prediction.

Parameters:

Name Type Description Default
ligands List[str]

The SMILES strings of ligands.

required

Returns:

Type Description
np.array

An \(N \times max\_smi\_len\) (\(N\) is the number of the input ligands) matrix that contains label encoded sequences of SMILES tokens.

Source code in pydebiaseddta/predictors/deepdta.py
def vectorize_ligands(self, ligands: List[str]) -> np.array:
    """Segments SMILES strings of ligands into characters and applies label encoding.
    Truncation and padding are also applied to prepare ligands for training and/or prediction.

    Parameters
    ----------
    ligands : List[str]
        The SMILES strings of ligands.

    Returns
    -------
    np.array
        An $N \\times max\\_smi\\_len$ ($N$ is the number of the input ligands) matrix that contains label encoded sequences of SMILES tokens.
    """     
    smi_to_unichar_encoding = load_smiles_to_unichar_encoding()
    unichars = smiles_to_unichar_batch(ligands, smi_to_unichar_encoding)
    word_identifier = load_chemical_word_identifier(vocab_size=94)

    return np.array(
        word_identifier.encode_sequences(unichars, self.max_smi_len)
    )

vectorize_proteins(aa_sequences)

Segments amino-acid sequences of proteins into characters and applies label encoding. Truncation and padding are also applied to prepare proteins for training and/or prediction.

Parameters:

Name Type Description Default
aa_sequences List[str]

The amino-acid sequences of proteins.

required

Returns:

Type Description
np.array

An \(N \times max\_prot\_len\) (\(N\) is the number of the input proteins) matrix that contains label encoded sequences of amino-acids.

Source code in pydebiaseddta/predictors/deepdta.py
def vectorize_proteins(self, aa_sequences: List[str]) -> np.array:
    """Segments amino-acid sequences of proteins into characters and applies label encoding.
    Truncation and padding are also applied to prepare proteins for training and/or prediction.

    Parameters
    ----------
    aa_sequences : List[str]
        The amino-acid sequences of proteins.

    Returns
    -------
    np.array
        An $N \\times max\\_prot\\_len$ ($N$ is the number of the input proteins) matrix that contains label encoded sequences of amino-acids.
    """
    word_identifier = load_protein_word_identifier(vocab_size=26)
    return np.array(
        word_identifier.encode_sequences(aa_sequences, self.max_prot_len)
    )

BPEDTA

Bases: TFPredictor

Source code in pydebiaseddta/predictors/bpedta.py
class BPEDTA(TFPredictor):
    def __init__(
        self,
        max_smi_len: int = 100,
        max_prot_len: int = 1000,
        embedding_dim: int = 128,
        learning_rate: float = 0.001,
        batch_size: int = 256,
        n_epochs: int = 200,
        num_filters: int = 32,
        smi_filter_len: int = 4,
        prot_filter_len: int = 6,
    ):
        """Constructor to create a BPE-DTA instance.
        BPE-DTA segments SMILES strings of ligands and amino-acid sequences of proteins into biomolecule words,
        and applies three layers of convolutions to learn latent representations. 
        A fully-connected neural network with three layers is used afterwards to predict affinities.

        Parameters
        ----------
        max_smi_len : int, optional
            Maximum number of chemical words in a SMILES string, by default 100. 
            SMILES strings that contain more chemical words are truncated.
        max_prot_len : int, optional
            Maximum number of protein words in an amino-acid sequence, by default 1000. 
            Amino-acid sequences that contain more protein words are truncated.
        embedding_dim : int, optional
            The dimension of the biomolecule words, by default 128.
        learning_rate : float, optional
            Learning rate during optimization, by default 0.001.
        batch_size : int, optional
            Batch size during training, by default 256.
        n_epochs : int, optional
            Number of epochs to train the model, by default 200.
        num_filters : int, optional
            Number of filters in the first convolution block. The following blocks use two and three times this number, respectively, by default 32.
        smi_filter_len : int, optional
            Length of filters in the convolution blocks for ligands, by default 4.
        prot_filter_len : int, optional
            Length of filters in the convolution blocks for proteins, by default 6.
        """    
        self.max_smi_len = max_smi_len
        self.max_prot_len = max_prot_len
        self.embedding_dim = embedding_dim
        self.num_filters = num_filters
        self.smi_filter_len = smi_filter_len
        self.prot_filter_len = prot_filter_len

        self.chem_vocab_size = 8000
        self.prot_vocab_size = 32000
        TFPredictor.__init__(self, n_epochs, learning_rate, batch_size)

    def build(self) -> Model:
        """Builds a `BPEDTA` predictor in `keras` with the parameters specified during construction.

        Returns
        -------
        tensorflow.keras.models.Model
            The built model.
        """        
        # Inputs
        ligands = Input(shape=(self.max_smi_len,), dtype="int32")
        # chemical representation
        ligand_representation = Embedding(
            input_dim=self.chem_vocab_size + 1,
            output_dim=self.embedding_dim,
            input_length=self.max_smi_len,
            mask_zero=True,
        )(ligands)
        ligand_representation = Conv1D(
            filters=self.num_filters,
            kernel_size=self.smi_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(ligand_representation)
        ligand_representation = Conv1D(
            filters=self.num_filters * 2,
            kernel_size=self.smi_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(ligand_representation)
        ligand_representation = Conv1D(
            filters=self.num_filters * 3,
            kernel_size=self.smi_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(ligand_representation)
        ligand_representation = GlobalMaxPooling1D()(ligand_representation)

        # Protein representation
        proteins = Input(shape=(self.max_prot_len,), dtype="int32")
        protein_representation = Embedding(
            input_dim=self.prot_vocab_size + 1,
            output_dim=self.embedding_dim,
            input_length=self.max_prot_len,
            mask_zero=True,
        )(proteins)
        protein_representation = Conv1D(
            filters=self.num_filters,
            kernel_size=self.prot_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(protein_representation)
        protein_representation = Conv1D(
            filters=self.num_filters * 2,
            kernel_size=self.prot_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(protein_representation)
        protein_representation = Conv1D(
            filters=self.num_filters * 3,
            kernel_size=self.prot_filter_len,
            activation="relu",
            padding="valid",
            strides=1,
        )(protein_representation)
        protein_representation = GlobalMaxPooling1D()(protein_representation)

        interaction_representation = Concatenate(axis=-1)(
            [ligand_representation, protein_representation]
        )

        # Fully connected layers
        FC1 = Dense(1024, activation="relu")(interaction_representation)
        FC1 = Dropout(0.1)(FC1)
        FC2 = Dense(1024, activation="relu")(FC1)
        FC2 = Dropout(0.1)(FC2)
        FC3 = Dense(512, activation="relu")(FC2)
        predictions = Dense(1, kernel_initializer="normal")(FC3)

        opt = Adam(self.learning_rate)
        bpedta = Model(inputs=[ligands, proteins], outputs=[predictions])
        bpedta.compile(
            optimizer=opt, loss="mean_squared_error", metrics=["mean_squared_error"],
        )
        return bpedta

    def vectorize_ligands(self, ligands: List[str]) -> np.array:
        """Segments SMILES strings of ligands into chemical words and applies label encoding.
        Truncation and padding are also applied to prepare ligands for training and/or prediction.

        Parameters
        ----------
        ligands : List[str]
            The SMILES strings of ligands.

        Returns
        -------
        np.array
            An $N \\times max\\_smi\\_len$ ($N$ is the number of the input ligands) matrix that contains label encoded sequences of chemical words.
        """        
        smi_to_unichar_encoding = load_smiles_to_unichar_encoding()
        unichars = smiles_to_unichar_batch(ligands, smi_to_unichar_encoding)
        word_identifier = load_chemical_word_identifier(vocab_size=8000)

        return np.array(word_identifier.encode_sequences(unichars, self.max_smi_len))

    def vectorize_proteins(self, aa_sequences: List[str]) -> np.array:
        """Segments amino-acid sequences of proteins into protein words and applies label encoding.
        Truncation and padding are also applied to prepare proteins for training and/or prediction.

        Parameters
        ----------
        aa_sequences : List[str]
            The amino-acid sequences of proteins.

        Returns
        -------
        np.array
            An $N \\times max\\_prot\\_len$ ($N$ is the number of the input proteins) matrix that contains label encoded sequences of protein words.
        """      
        word_identifier = load_protein_word_identifier(vocab_size=32000)
        return np.array(
            word_identifier.encode_sequences(aa_sequences, self.max_prot_len)
        )

__init__(max_smi_len=100, max_prot_len=1000, embedding_dim=128, learning_rate=0.001, batch_size=256, n_epochs=200, num_filters=32, smi_filter_len=4, prot_filter_len=6)

Constructor to create a BPE-DTA instance. BPE-DTA segments SMILES strings of ligands and amino-acid sequences of proteins into biomolecule words, and applies three layers of convolutions to learn latent representations. A fully-connected neural network with three layers is used afterwards to predict affinities.

Parameters:

Name Type Description Default
max_smi_len int, optional

Maximum number of chemical words in a SMILES string, by default 100. SMILES strings that contain more chemical words are truncated.

100
max_prot_len int, optional

Maximum number of protein words in an amino-acid sequence, by default 1000. Amino-acid sequences that contain more protein words are truncated.

1000
embedding_dim int, optional

The dimension of the biomolecule words, by default 128.

128
learning_rate float, optional

Learning rate during optimization, by default 0.001.

0.001
batch_size int, optional

Batch size during training, by default 256.

256
n_epochs int, optional

Number of epochs to train the model, by default 200.

200
num_filters int, optional

Number of filters in the first convolution block. The following blocks use two and three times this number, respectively, by default 32.

32
smi_filter_len int, optional

Length of filters in the convolution blocks for ligands, by default 4.

4
prot_filter_len int, optional

Length of filters in the convolution blocks for proteins, by default 6.

6
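BPEDTA is used the same way as DeepDTA; only the vectorization differs, segmenting SMILES strings and amino-acid sequences into BPE-learned chemical and protein words with vocabulary sizes of 8,000 and 32,000. A minimal sketch, reusing the toy lists from the DeepDTA example above:

from pydebiaseddta.predictors.bpedta import BPEDTA

# train_ligands, train_proteins, and train_labels as in the DeepDTA example above.
bpedta = BPEDTA(n_epochs=2, batch_size=2)
history = bpedta.train(train_ligands, train_proteins, train_labels)
predictions = bpedta.predict(train_ligands, train_proteins)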
Source code in pydebiaseddta/predictors/bpedta.py
def __init__(
    self,
    max_smi_len: int = 100,
    max_prot_len: int = 1000,
    embedding_dim: int = 128,
    learning_rate: float = 0.001,
    batch_size: int = 256,
    n_epochs: int = 200,
    num_filters: int = 32,
    smi_filter_len: int = 4,
    prot_filter_len: int = 6,
):
    """Constructor to create a BPE-DTA instance.
    BPE-DTA segments SMILES strings of ligands and amino-acid sequences of proteins into biomolecule words,
    and applies three layers of convolutions to learn latent representations. 
    A fully-connected neural network with three layers is used afterwards to predict affinities.

    Parameters
    ----------
    max_smi_len : int, optional
        Maximum number of chemical words in a SMILES string, by default 100. 
        SMILES strings that contain more chemical words are truncated.
    max_prot_len : int, optional
        Maximum number of protein words in an amino-acid sequence, by default 1000. 
        Amino-acid sequences that contain more protein words are truncated.
    embedding_dim : int, optional
        The dimension of the biomolecule words, by default 128.
    learning_rate : float, optional
        Learning rate during optimization, by default 0.001.
    batch_size : int, optional
        Batch size during training, by default 256.
    n_epochs : int, optional
        Number of epochs to train the model, by default 200.
    num_filters : int, optional
        Number of filters in the first convolution block, by default 32. The next blocks use two and three times this number, respectively.
    smi_filter_len : int, optional
        Length of filters in the convolution blocks for ligands, by default 4.
    prot_filter_len : int, optional
        Length of filters in the convolution blocks for proteins, by default 6.
    """    
    self.max_smi_len = max_smi_len
    self.max_prot_len = max_prot_len
    self.embedding_dim = embedding_dim
    self.num_filters = num_filters
    self.smi_filter_len = smi_filter_len
    self.prot_filter_len = prot_filter_len

    self.chem_vocab_size = 8000
    self.prot_vocab_size = 32000
    TFPredictor.__init__(self, n_epochs, learning_rate, batch_size)

build()

Builds a BPEDTA predictor in keras with the parameters specified during construction.

Returns:

tensorflow.keras.models.Model
    The built model.
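
Because build() returns a plain Keras Model, the architecture can be inspected with standard Keras calls. A small sketch, assuming the default constructor arguments:

from pydebiaseddta.predictors import BPEDTA  # assumed import path

bpedta = BPEDTA()              # defaults: 8000-word chemical and 32000-word protein vocabularies
keras_model = bpedta.build()
keras_model.summary()          # prints the two convolutional branches and the fully-connected head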

Source code in pydebiaseddta/predictors/bpedta.py
def build(self) -> Model:
    """Builds a `BPEDTA` predictor in `keras` with the parameters specified during construction.

    Returns
    -------
    tensorflow.keras.models.Model
        The built model.
    """        
    # Inputs
    ligands = Input(shape=(self.max_smi_len,), dtype="int32")
    # chemical representation
    ligand_representation = Embedding(
        input_dim=self.chem_vocab_size + 1,
        output_dim=self.embedding_dim,
        input_length=self.max_smi_len,
        mask_zero=True,
    )(ligands)
    ligand_representation = Conv1D(
        filters=self.num_filters,
        kernel_size=self.smi_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(ligand_representation)
    ligand_representation = Conv1D(
        filters=self.num_filters * 2,
        kernel_size=self.smi_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(ligand_representation)
    ligand_representation = Conv1D(
        filters=self.num_filters * 3,
        kernel_size=self.smi_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(ligand_representation)
    ligand_representation = GlobalMaxPooling1D()(ligand_representation)

    # Protein representation
    proteins = Input(shape=(self.max_prot_len,), dtype="int32")
    protein_representation = Embedding(
        input_dim=self.prot_vocab_size + 1,
        output_dim=self.embedding_dim,
        input_length=self.max_prot_len,
        mask_zero=True,
    )(proteins)
    protein_representation = Conv1D(
        filters=self.num_filters,
        kernel_size=self.prot_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(protein_representation)
    protein_representation = Conv1D(
        filters=self.num_filters * 2,
        kernel_size=self.prot_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(protein_representation)
    protein_representation = Conv1D(
        filters=self.num_filters * 3,
        kernel_size=self.prot_filter_len,
        activation="relu",
        padding="valid",
        strides=1,
    )(protein_representation)
    protein_representation = GlobalMaxPooling1D()(protein_representation)

    interaction_representation = Concatenate(axis=-1)(
        [ligand_representation, protein_representation]
    )

    # Fully connected layers
    FC1 = Dense(1024, activation="relu")(interaction_representation)
    FC1 = Dropout(0.1)(FC1)
    FC2 = Dense(1024, activation="relu")(FC1)
    FC2 = Dropout(0.1)(FC2)
    FC3 = Dense(512, activation="relu")(FC2)
    predictions = Dense(1, kernel_initializer="normal")(FC3)

    opt = Adam(self.learning_rate)
    bpedta = Model(inputs=[ligands, proteins], outputs=[predictions])
    bpedta.compile(
        optimizer=opt, loss="mean_squared_error", metrics=["mean_squared_error"],
    )
    return bpedta

vectorize_ligands(ligands)

Segments SMILES strings of ligands into chemical words and applies label encoding. Truncation and padding are also applied to prepare ligands for training and/or prediction.

Parameters:

ligands : List[str], required
    The SMILES strings of ligands.

Returns:

np.array
    An \(N \times max\_smi\_len\) (\(N\) is the number of the input ligands) matrix that contains label encoded sequences of chemical words.
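
A brief sketch of the expected input and output shapes, using toy SMILES strings and the default max_smi_len of 100:

from pydebiaseddta.predictors import BPEDTA  # assumed import path

bpedta = BPEDTA(max_smi_len=100)
smiles = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1O"]
encoded = bpedta.vectorize_ligands(smiles)
print(encoded.shape)  # expected: (2, 100), one padded/truncated row of chemical-word ids per ligand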

Source code in pydebiaseddta/predictors/bpedta.py
def vectorize_ligands(self, ligands: List[str]) -> np.array:
    """Segments SMILES strings of ligands into chemical words and applies label encoding.
    Truncation and padding are also applied to prepare ligands for training and/or prediction.

    Parameters
    ----------
    ligands : List[str]
        The SMILES strings of ligands.

    Returns
    -------
    np.array
        An $N \\times max\\_smi\\_len$ ($N$ is the number of the input ligands) matrix that contains label encoded sequences of chemical words.
    """        
    smi_to_unichar_encoding = load_smiles_to_unichar_encoding()
    unichars = smiles_to_unichar_batch(ligands, smi_to_unichar_encoding)
    word_identifier = load_chemical_word_identifier(vocab_size=8000)

    return np.array(word_identifier.encode_sequences(unichars, self.max_smi_len))

vectorize_proteins(aa_sequences)

Segments amino-acid sequences of proteins into protein words and applies label encoding. Truncation and padding are also applied to prepare proteins for training and/or prediction.

Parameters:

aa_sequences : List[str], required
    The amino-acid sequences of proteins.

Returns:

np.array
    An \(N \times max\_prot\_len\) (\(N\) is the number of the input proteins) matrix that contains label encoded sequences of protein words.
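
An analogous sketch for proteins, using a toy amino-acid sequence and the default max_prot_len of 1000:

from pydebiaseddta.predictors import BPEDTA  # assumed import path

bpedta = BPEDTA(max_prot_len=1000)
sequences = ["MKTAYIAKQRQISFVKSHFSRQ"]
encoded = bpedta.vectorize_proteins(sequences)
print(encoded.shape)  # expected: (1, 1000), one padded/truncated row of protein-word ids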

Source code in pydebiaseddta/predictors/bpedta.py
def vectorize_proteins(self, aa_sequences: List[str]) -> np.array:
    """Segments amino-acid sequences of proteins into protein words and applies label encoding.
    Truncation and padding are also applied to prepare proteins for training and/or prediction.

    Parameters
    ----------
    aa_sequences : List[str]
        The amino-acid sequences of proteins.

    Returns
    -------
    np.array
        An $N \\times max\\_prot\\_len$ ($N$ is the number of the input proteins) matrix that contains label encoded sequences of protein words.
    """      
    word_identifier = load_protein_word_identifier(vocab_size=32000)
    return np.array(
        word_identifier.encode_sequences(aa_sequences, self.max_prot_len)
    )

LMDTA

Bases: TFPredictor

Source code in pydebiaseddta/predictors/lmdta.py
class LMDTA(TFPredictor):
    def __init__(
        self, n_epochs: int = 200, learning_rate: float = 0.001, batch_size: int = 256
    ):
        """Constructor to create a LMDTA instance.
        LMDTA represents ligands and proteins with pre-trained language model embeddings
        obtained via [`ChemBERTa`](https://arxiv.org/abs/2010.09885) and  [`ProtBert`](https://www.biorxiv.org/content/biorxiv/early/2020/07/21/2020.07.12.199554.full.pdf) models, respectively. 
        A fully-connected neural network with two layers is used afterwards to predict affinities.

        Parameters
        ----------
        n_epochs : int, optional
            Number of epochs to train the model, by default 200.
        learning_rate : float, optional
            Learning rate during optimization, by default 0.001.
        batch_size : int, optional
             Batch size during training, by default 256.
        """
        transformers.logging.set_verbosity(transformers.logging.CRITICAL)
        self.chemical_tokenizer = AutoTokenizer.from_pretrained(
            "seyonec/PubChem10M_SMILES_BPE_450k"
        )
        self.chemberta = AutoModel.from_pretrained("seyonec/PubChem10M_SMILES_BPE_450k")

        self.protein_tokenizer = AutoTokenizer.from_pretrained(
            "Rostlab/prot_bert", do_lower_case=False
        )
        self.protbert = AutoModel.from_pretrained("Rostlab/prot_bert")
        TFPredictor.__init__(self, n_epochs, learning_rate, batch_size)

    def build(self):
        """Builds a `LMDTA` predictor in `keras` with the parameters specified during construction.

        Returns
        -------
        tensorflow.keras.models.Model
            The built model.
        """
        chemicals = Input(shape=(768,), dtype="float32")
        proteins = Input(shape=(1024,), dtype="float32")

        interaction_representation = Concatenate(axis=-1)([chemicals, proteins])

        FC1 = Dense(1024, activation="relu")(interaction_representation)
        FC1 = Dropout(0.1)(FC1)
        FC2 = Dense(512, activation="relu")(FC1)
        predictions = Dense(1, kernel_initializer="normal")(FC2)

        opt = Adam(self.learning_rate)
        lmdta = Model(inputs=[chemicals, proteins], outputs=[predictions])
        lmdta.compile(
            optimizer=opt, loss="mean_squared_error", metrics=["mean_squared_error"]
        )
        return lmdta

    @lru_cache(maxsize=2048)
    def get_chemberta_embedding(self, smiles: str) -> np.array:
        """Computes the [`ChemBERTa`](https://arxiv.org/abs/2010.09885) vector for a ligand. 
        Since creating the vector is computation-heavy, an `lru_cache` of size 2048 is used to store produced vectors.

        Parameters
        ----------
        smiles : str
            SMILES string of the ligand.

        Returns
        -------
        np.array
            [`ChemBERTa`](https://arxiv.org/abs/2010.09885) vector (768-dimensional) of the ligand.
        """        
        tokens = self.chemical_tokenizer(smiles, return_tensors="pt")
        output = self.chemberta(**tokens)
        return output.last_hidden_state.detach().numpy().mean(axis=1)

    def vectorize_ligands(self, ligands: List[str]) -> np.array:
        """Vectorizes the ligands with [`ChemBERTa`](https://arxiv.org/abs/2010.09885) embeddings.

        Parameters
        ----------
        ligands : List[str]
            The SMILES strings of ligands.

        Returns
        -------
        np.array
            An $N \\times 768$ ($N$ is the number of the input ligands) matrix that contains [`ChemBERTa`](https://arxiv.org/abs/2010.09885) vectors of the ligands.
        """        
        return np.vstack(
            [self.get_chemberta_embedding(chemical) for chemical in ligands]
        )

    @lru_cache(maxsize=1024)
    def get_protbert_embedding(self, aa_sequence: str) -> np.array:
        """Computes the [`ProtBert`](https://www.biorxiv.org/content/biorxiv/early/2020/07/21/2020.07.12.199554.full.pdf) vector for a protein. 
        Since creating the vector is computation-heavy, an `lru_cache` of size 1024 is used to store produced vectors.

        Parameters
        ----------
        aa_sequence : str
            Amino-acid sequence of the protein.

        Returns
        -------
        np.array
            [`ProtBert`](https://www.biorxiv.org/content/biorxiv/early/2020/07/21/2020.07.12.199554.full.pdf) vector (1024-dimensional) of the protein.
        """    
        pp_sequence = " ".join(aa_sequence)
        cleaned_sequence = re.sub(r"[UZOB]", "X", pp_sequence)
        tokens = self.protein_tokenizer(cleaned_sequence, return_tensors="pt")
        output = self.protbert(**tokens)
        return output.last_hidden_state.detach().numpy().mean(axis=1)

    def vectorize_proteins(self, aa_sequences: List[str]) -> np.array:
        """Vectorizes the proteins with [`ProtBert`](https://www.biorxiv.org/content/biorxiv/early/2020/07/21/2020.07.12.199554.full.pdf) embeddings.

        Parameters
        ----------
        aa_sequences : List[str]
            The amino-acid sequences of the proteins.

        Returns
        -------
        np.array
            An $N \\times 1024$ ($N$ is the number of the input proteins) matrix that contains [`ProtBert`](https://www.biorxiv.org/content/biorxiv/early/2020/07/21/2020.07.12.199554.full.pdf) vectors of the proteins.
        """   
        return np.vstack(
            [self.get_protbert_embedding(aa_sequence) for aa_sequence in aa_sequences]
        )

__init__(n_epochs=200, learning_rate=0.001, batch_size=256)

Constructor to create an LMDTA instance. LMDTA represents ligands and proteins with pre-trained language model embeddings obtained via ChemBERTa and ProtBert models, respectively. A fully-connected neural network with two layers is used afterwards to predict affinities.

Parameters:

n_epochs : int, optional
    Number of epochs to train the model, by default 200.
learning_rate : float, optional
    Learning rate during optimization, by default 0.001.
batch_size : int, optional
    Batch size during training, by default 256.
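
A minimal usage sketch, analogous to the BPEDTA example above. It assumes LMDTA is importable from pydebiaseddta.predictors and that train follows the Predictor interface; note that the constructor loads the ChemBERTa and ProtBert weights via transformers, which downloads them from the Hugging Face Hub on first use.

from pydebiaseddta.predictors import LMDTA  # assumed import path

# Toy data: two ligand SMILES strings, two protein sequences, two affinity labels.
train_ligands = ["CC(=O)Oc1ccccc1C(=O)O", "CN1CCC[C@H]1c1cccnc1"]
train_proteins = ["MKTAYIAKQRQISFVKSHFSRQ", "MEEPQSDPSVEPPLSQETFSDL"]
train_labels = [5.3, 7.1]

lmdta = LMDTA(n_epochs=2, batch_size=32)  # a few epochs keep the sketch fast
lmdta.train(train_ligands, train_proteins, train_labels)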
Source code in pydebiaseddta/predictors/lmdta.py
def __init__(
    self, n_epochs: int = 200, learning_rate: float = 0.001, batch_size: int = 256
):
    """Constructor to create a LMDTA instance.
    LMDTA represents ligands and proteins with pre-trained language model embeddings
    obtained via [`ChemBERTa`](https://arxiv.org/abs/2010.09885) and  [`ProtBert`](https://www.biorxiv.org/content/biorxiv/early/2020/07/21/2020.07.12.199554.full.pdf) models, respectively. 
    A fully-connected neural network with two layers is used afterwards to predict affinities.

    Parameters
    ----------
    n_epochs : int, optional
        Number of epochs to train the model, by default 200.
    learning_rate : float, optional
        Learning rate during optimization, by default 0.001.
    batch_size : int, optional
         Batch size during training, by default 256.
    """
    transformers.logging.set_verbosity(transformers.logging.CRITICAL)
    self.chemical_tokenizer = AutoTokenizer.from_pretrained(
        "seyonec/PubChem10M_SMILES_BPE_450k"
    )
    self.chemberta = AutoModel.from_pretrained("seyonec/PubChem10M_SMILES_BPE_450k")

    self.protein_tokenizer = AutoTokenizer.from_pretrained(
        "Rostlab/prot_bert", do_lower_case=False
    )
    self.protbert = AutoModel.from_pretrained("Rostlab/prot_bert")
    TFPredictor.__init__(self, n_epochs, learning_rate, batch_size)

build()

Builds a LMDTA predictor in keras with the parameters specified during construction.

Returns:

tensorflow.keras.models.Model
    The built model.

Source code in pydebiaseddta/predictors/lmdta.py
def build(self):
    """Builds a `LMDTA` predictor in `keras` with the parameters specified during construction.

    Returns
    -------
    tensorflow.keras.models.Model
        The built model.
    """
    chemicals = Input(shape=(768,), dtype="float32")
    proteins = Input(shape=(1024,), dtype="float32")

    interaction_representation = Concatenate(axis=-1)([chemicals, proteins])

    FC1 = Dense(1024, activation="relu")(interaction_representation)
    FC1 = Dropout(0.1)(FC1)
    FC2 = Dense(512, activation="relu")(FC1)
    predictions = Dense(1, kernel_initializer="normal")(FC2)

    opt = Adam(self.learning_rate)
    lmdta = Model(inputs=[chemicals, proteins], outputs=[predictions])
    lmdta.compile(
        optimizer=opt, loss="mean_squared_error", metrics=["mean_squared_error"]
    )
    return lmdta

get_chemberta_embedding(smiles) cached

Computes the ChemBERTa vector for a ligand. Since creating the vector is computation-heavy, an lru_cache of size 2048 is used to store produced vectors.

Parameters:

smiles : str, required
    SMILES string of the ligand.

Returns:

np.array
    ChemBERTa vector (768-dimensional) of the ligand.
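
A quick sketch of the output. Note that the mean pooling in the source keeps the batch axis, so a single call returns an array of shape (1, 768):

from pydebiaseddta.predictors import LMDTA  # assumed import path

lmdta = LMDTA()
vec = lmdta.get_chemberta_embedding("CC(=O)Oc1ccccc1C(=O)O")
print(vec.shape)  # expected: (1, 768), the mean-pooled ChemBERTa token embeddings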

Source code in pydebiaseddta/predictors/lmdta.py
@lru_cache(maxsize=2048)
def get_chemberta_embedding(self, smiles: str) -> np.array:
    """Computes the [`ChemBERTa`](https://arxiv.org/abs/2010.09885) vector for a ligand. 
    Since creating the vector is computation-heavy, an `lru_cache` of size 2048 is used to store produced vectors.

    Parameters
    ----------
    smiles : str
        SMILES string of the ligand.

    Returns
    -------
    np.array
        [`ChemBERTa`](https://arxiv.org/abs/2010.09885) vector (768-dimensional) of the ligand.
    """        
    tokens = self.chemical_tokenizer(smiles, return_tensors="pt")
    output = self.chemberta(**tokens)
    return output.last_hidden_state.detach().numpy().mean(axis=1)

get_protbert_embedding(aa_sequence) cached

Computes the ProtBert vector for a protein. Since creating the vector is computation-heavy, an lru_cache of size 1024 is used to store produced vectors.

Parameters:

aa_sequence : str, required
    Amino-acid sequence of the protein.

Returns:

np.array
    ProtBert vector (1024-dimensional) of the protein.
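
As with the ChemBERTa helper, the mean pooling keeps the batch axis, so a single call returns an array of shape (1, 1024). A quick sketch:

from pydebiaseddta.predictors import LMDTA  # assumed import path

lmdta = LMDTA()
vec = lmdta.get_protbert_embedding("MKTAYIAKQRQISFVKSHFSRQ")
print(vec.shape)  # expected: (1, 1024), the mean-pooled ProtBert token embeddings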

Source code in pydebiaseddta/predictors/lmdta.py
@lru_cache(maxsize=1024)
def get_protbert_embedding(self, aa_sequence: str) -> np.array:
    """Computes the [`ProtBert`](https://www.biorxiv.org/content/biorxiv/early/2020/07/21/2020.07.12.199554.full.pdf) vector for a protein. 
    Since creating the vector is computation-heavy, an `lru_cache` of size 1024 is used to store produced vectors.

    Parameters
    ----------
    aa_sequence : str
        Amino-acid sequence of the protein.

    Returns
    -------
    np.array
        [`ProtBert`](https://www.biorxiv.org/content/biorxiv/early/2020/07/21/2020.07.12.199554.full.pdf) vector (1024-dimensional) of the protein.
    """    
    pp_sequence = " ".join(aa_sequence)
    cleaned_sequence = re.sub(r"[UZOB]", "X", pp_sequence)
    tokens = self.protein_tokenizer(cleaned_sequence, return_tensors="pt")
    output = self.protbert(**tokens)
    return output.last_hidden_state.detach().numpy().mean(axis=1)

vectorize_ligands(ligands)

Vectorizes the ligands with ChemBERTa embeddings.

Parameters:

ligands : List[str], required
    The SMILES strings of ligands.

Returns:

np.array
    An \(N \times 768\) (\(N\) is the number of the input ligands) matrix that contains ChemBERTa vectors of the ligands.

Source code in pydebiaseddta/predictors/lmdta.py
def vectorize_ligands(self, ligands: List[str]) -> np.array:
    """Vectorizes the ligands with [`ChemBERTa`](https://arxiv.org/abs/2010.09885) embeddings.

    Parameters
    ----------
    ligands : List[str]
        The SMILES strings of ligands.

    Returns
    -------
    np.array
        An $N \\times 768$ ($N$ is the number of the input ligands) matrix that contains [`ChemBERTa`](https://arxiv.org/abs/2010.09885) vectors of the ligands.
    """        
    return np.vstack(
        [self.get_chemberta_embedding(chemical) for chemical in ligands]
    )

vectorize_proteins(aa_sequences)

Vectorizes the proteins with ProtBert embeddings.

Parameters:

aa_sequences : List[str], required
    The amino-acid sequences of the proteins.

Returns:

np.array
    An \(N \times 1024\) (\(N\) is the number of the input proteins) matrix that contains ProtBert vectors of the proteins.
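
For completeness, a short sketch that vectorizes small batches with both helpers; the shapes follow from the per-molecule embeddings above, and the inputs are toy placeholders:

from pydebiaseddta.predictors import LMDTA  # assumed import path

lmdta = LMDTA()
ligand_mat = lmdta.vectorize_ligands(["CCO", "c1ccccc1"])           # expected shape: (2, 768)
protein_mat = lmdta.vectorize_proteins(["MKTAYIAKQ", "MEEPQSDPS"])  # expected shape: (2, 1024)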

Source code in pydebiaseddta/predictors/lmdta.py
def vectorize_proteins(self, aa_sequences: List[str]) -> np.array:
    """Vectorizes the proteins with [`ProtBert`](https://www.biorxiv.org/content/biorxiv/early/2020/07/21/2020.07.12.199554.full.pdf) embeddings.

    Parameters
    ----------
    aa_sequences : List[str]
        The amino-acid sequences of the proteins.

    Returns
    -------
    np.array
        An $N \\times 1024$ ($N$ is the number of the input proteins) matrix that contains [`ProtBert`](https://www.biorxiv.org/content/biorxiv/early/2020/07/21/2020.07.12.199554.full.pdf) vectors of the proteins.
    """   
    return np.vstack(
        [self.get_protbert_embedding(aa_sequence) for aa_sequence in aa_sequences]
    )