predictors
The submodule that contains the predictors, i.e., the drug-target affinity (DTA) prediction models implemented in the DebiasedDTA study. The implemented predictors are BPEDTA, DeepDTA, and LMDTA. Abstract classes are also available to quickly train a custom DTA prediction model with DebiasedDTA.
Predictor
Bases: ABC
An abstract class that defines the interface of a predictor in pydebiaseddta.
Predictors are characterized by an n_epochs attribute and a train function, whose signatures are defined by this class.
Any instance of a Predictor subclass can be trained in the DebiasedDTA training framework, so Predictor can be inherited to debias custom DTA prediction models.
Source code in pydebiaseddta/predictors/abstract_predictors.py
__init__(n_epochs, *args, **kwargs)
abstractmethod
An abstract constructor for Predictor, indicating that n_epochs is a required attribute of child classes.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_epochs | int | Number of epochs to train the model. | required |
Source code in pydebiaseddta/predictors/abstract_predictors.py
train(train_ligands, train_proteins, train_labels, val_ligands=None, val_proteins=None, val_labels=None, sample_weights_by_epoch=None)
abstractmethod
An abstract method to train DTA prediction models. The inputs can be of any biomolecule representation type. However, the training procedure must support sample weighting in every epoch.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
train_ligands | List[Any] | The training ligands as a List. | required |
train_proteins | List[Any] | The training proteins as a List. | required |
train_labels | List[float] | Affinity scores of the training protein-ligand pairs. | required |
val_ligands | List[Any], optional | Validation ligands as a List, in case validation scores are measured during training, by default None. | None |
val_proteins | List[Any], optional | Validation proteins as a List, in case validation scores are measured during training, by default None. | None |
val_labels | List[float], optional | Affinity scores of the validation protein-ligand pairs as a List, in case validation scores are measured during training, by default None. | None |
Returns:
Type | Description |
---|---|
Any | The function is free to return any value after training. |
Source code in pydebiaseddta/predictors/abstract_predictors.py
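The interface described above can be illustrated with a small custom predictor. The following is only a sketch: the Predictor class below is a local stand-in that mirrors the documented signatures so the example runs without pydebiaseddta installed, and WeightedMeanPredictor is a hypothetical toy model, not part of the library.

```python
from abc import ABC, abstractmethod
from typing import Any, List, Optional


class Predictor(ABC):
    """Local stand-in mirroring the documented interface (not the library class)."""

    @abstractmethod
    def __init__(self, n_epochs: int, *args, **kwargs) -> None:
        self.n_epochs = n_epochs

    @abstractmethod
    def train(self, train_ligands, train_proteins, train_labels,
              val_ligands=None, val_proteins=None, val_labels=None,
              sample_weights_by_epoch=None) -> Any:
        ...


class WeightedMeanPredictor(Predictor):
    """Hypothetical toy model: predicts the weighted mean training affinity."""

    def __init__(self, n_epochs: int) -> None:
        super().__init__(n_epochs)
        self.mean_: Optional[float] = None

    def train(self, train_ligands, train_proteins, train_labels,
              val_ligands=None, val_proteins=None, val_labels=None,
              sample_weights_by_epoch=None) -> List[float]:
        n_samples = len(train_labels)
        if sample_weights_by_epoch is None:
            # No weighting requested: every pair counts equally in every epoch.
            sample_weights_by_epoch = [[1.0] * n_samples for _ in range(self.n_epochs)]
        history = []
        for weights in sample_weights_by_epoch:
            # One "epoch": refit the mean using this epoch's sample weights.
            weighted_sum = sum(w * y for w, y in zip(weights, train_labels))
            self.mean_ = weighted_sum / sum(weights)
            history.append(self.mean_)
        return history


model = WeightedMeanPredictor(n_epochs=2)
history = model.train(
    ["CCO", "CCN"], ["MKT", "MKV"], [5.0, 7.0],
    sample_weights_by_epoch=[[1.0, 1.0], [3.0, 1.0]],
)
```

The key point is that train accepts a separate weight vector for each epoch, which is the property the abstract method requires of any custom model.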
TFPredictor
Bases: Predictor
The models in the DebiasedDTA study (BPE-DTA, LM-DTA, DeepDTA) are implemented in TensorFlow.
The TFPredictor class provides an abstraction over these models to minimize code duplication.
The child classes only implement the model building, biomolecule vectorization, and __init__ functions.
Model training, prediction, and save/load functions are inherited from this class.
Source code in pydebiaseddta/predictors/abstract_predictors.py
__init__(n_epochs, learning_rate, batch_size, **kwargs)
abstractmethod
An abstract constructor for BPE-DTA, LM-DTA, and DeepDTA.
The constructor sets the common attributes and calls the build function.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_epochs | int | Number of epochs to train the model. | required |
learning_rate | float | The learning rate of the optimization algorithm. | required |
batch_size | int | Batch size for training. | required |
Source code in pydebiaseddta/predictors/abstract_predictors.py
build()
abstractmethod
An abstract function to create the model architecture. Every child has to implement this function.
Source code in pydebiaseddta/predictors/abstract_predictors.py
from_file(path)
classmethod
A utility function to load a TFPredictor instance from disk.
All attributes, including the model weights, are loaded.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to load the prediction model from. | required |
Returns:
Type | Description |
---|---|
TFPredictor | The previously saved model. |
Source code in pydebiaseddta/predictors/abstract_predictors.py
predict(ligands, proteins)
Predicts the affinities of a List of protein-ligand pairs via the trained DTA prediction model, i.e., BPE-DTA, LM-DTA, or DeepDTA.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ligands | List[str] | SMILES strings of the ligands. | required |
proteins | List[str] | Amino-acid sequences of the proteins. | required |
Returns:
Type | Description |
---|---|
List[float] | Affinity scores predicted by the DTA prediction model. |
Source code in pydebiaseddta/predictors/abstract_predictors.py
save(path)
A utility function to save a TFPredictor instance to disk.
All attributes, including the model weights, are saved.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to save the predictor. | required |
Source code in pydebiaseddta/predictors/abstract_predictors.py
train(train_ligands, train_proteins, train_labels, val_ligands=None, val_proteins=None, val_labels=None, sample_weights_by_epoch=None)
The common model training procedure for BPE-DTA, LM-DTA, and DeepDTA. The models adopt different biomolecule representation methods and model architectures, so their training results differ. The training procedure supports validation for tracking and sample weighting for debiasing.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
train_ligands | List[str] | SMILES strings of the training ligands. | required |
train_proteins | List[str] | Amino-acid sequences of the training proteins. | required |
train_labels | List[float] | Affinity scores of the training protein-ligand pairs. | required |
val_ligands | List[str], optional | SMILES strings of the validation ligands, by default None and no validation is used. | None |
val_proteins | List[str], optional | Amino-acid sequences of the validation proteins, by default None and no validation is used. | None |
val_labels | List[float], optional | Affinity scores of the validation pairs, by default None and no validation is used. | None |
sample_weights_by_epoch | List[np.array], optional | Weight of each training protein-ligand pair during training across epochs. This variable must be a List of size \(E\) (number of training epochs), in which each element is an np.array of weights, one per training pair. | None |
Returns:
Type | Description |
---|---|
Dict | Training history. |
Source code in pydebiaseddta/predictors/abstract_predictors.py
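To make the role of sample weighting concrete, the sketch below shows one common way a per-sample weight can enter a regression loss within a single epoch. This is an illustration under assumptions: the weighted_mse helper is hypothetical, and this page does not state the exact loss or normalization that TFPredictor uses.

```python
from typing import Sequence


def weighted_mse(labels: Sequence[float],
                 predictions: Sequence[float],
                 weights: Sequence[float]) -> float:
    """Squared errors scaled by per-sample weights, averaged over the samples."""
    if not (len(labels) == len(predictions) == len(weights)):
        raise ValueError("labels, predictions, and weights must have equal length")
    weighted_errors = (w * (y - p) ** 2
                       for y, p, w in zip(labels, predictions, weights))
    return sum(weighted_errors) / len(labels)


labels = [5.0, 7.0]
predictions = [5.0, 6.0]          # the second pair is mispredicted by 1.0

uniform_loss = weighted_mse(labels, predictions, [1.0, 1.0])
emphasized_loss = weighted_mse(labels, predictions, [1.0, 3.0])
```

Raising the weight of the mispredicted pair raises its contribution to the loss, which is how a weight vector per epoch can steer the model toward selected training pairs.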
vectorize_ligands(ligands)
abstractmethod
An abstract function to vectorize ligands. Every child has to implement this function.
Source code in pydebiaseddta/predictors/abstract_predictors.py
vectorize_proteins(proteins)
abstractmethod
An abstract function to vectorize proteins. Every child has to implement this function.
Source code in pydebiaseddta/predictors/abstract_predictors.py
create_uniform_weights(n_samples, n_epochs)
Creates a list of weights such that every training instance has equal weight across all epochs, i.e., no sample weighting is used.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_samples | int | Number of training instances. | required |
n_epochs | int | Number of epochs to train the model. | required |
Returns:
Type | Description |
---|---|
List[np.array] | Sample weights across epochs. Each instance has a weight of 1 for all epochs. |
Source code in pydebiaseddta/predictors/abstract_predictors.py
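The documented behaviour can be sketched in a few lines of NumPy. The function name uniform_weights_sketch is hypothetical and this is not claimed to be the library's implementation. It only reproduces the documented contract: one array per epoch, one weight of 1 per training instance.

```python
from typing import List

import numpy as np


def uniform_weights_sketch(n_samples: int, n_epochs: int) -> List[np.ndarray]:
    """Return n_epochs arrays, each holding a weight of 1 for every sample."""
    return [np.ones(n_samples) for _ in range(n_epochs)]


weights = uniform_weights_sketch(n_samples=4, n_epochs=3)
```

A list built this way has the same layout as the sample_weights_by_epoch argument of train, so it can serve as a "no debiasing" baseline.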
DeepDTA
Bases: TFPredictor
Source code in pydebiaseddta/predictors/deepdta.py
__init__(max_smi_len=100, max_prot_len=1000, embedding_dim=128, learning_rate=0.001, batch_size=256, n_epochs=200, num_filters=32, smi_filter_len=4, prot_filter_len=6)
Constructor to create a DeepDTA instance. DeepDTA segments SMILES strings of ligands and amino-acid sequences of proteins into characters, and applies three layers of convolutions to learn latent representations. A fully-connected neural network with three layers is used afterwards to predict affinities.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
max_smi_len | int, optional | Maximum number of characters in a SMILES string, by default 100. Longer SMILES strings are truncated. | 100 |
max_prot_len | int, optional | Maximum number of amino acids in a protein sequence, by default 1000. Longer sequences are truncated. | 1000 |
embedding_dim | int, optional | The embedding dimension of the biomolecule characters, by default 128. | 128 |
learning_rate | float, optional | Learning rate during optimization, by default 0.001. | 0.001 |
batch_size | int, optional | Batch size during training, by default 256. | 256 |
n_epochs | int, optional | Number of epochs to train the model, by default 200. | 200 |
num_filters | int, optional | Number of filters in the first convolution block, by default 32. The next blocks use two and three times this number, respectively. | 32 |
smi_filter_len | int, optional | Length of filters in the convolution blocks for ligands, by default 4. | 4 |
prot_filter_len | int, optional | Length of filters in the convolution blocks for proteins, by default 6. | 6 |
|
Source code in pydebiaseddta/predictors/deepdta.py
build()
Builds a DeepDTA predictor in Keras with the parameters specified during construction.
Returns:
Type | Description |
---|---|
tensorflow.keras.models.Model | The built model. |
Source code in pydebiaseddta/predictors/deepdta.py
vectorize_ligands(ligands)
Segments SMILES strings of ligands into characters and applies label encoding. Truncation and padding are also applied to prepare ligands for training and/or prediction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ligands | List[str] | The SMILES strings of ligands. | required |
Returns:
Type | Description |
---|---|
np.array | An \(N \times max\_smi\_len\) (\(N\) is the number of the input ligands) matrix that contains label-encoded sequences of SMILES tokens. |
Source code in pydebiaseddta/predictors/deepdta.py
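Character-level label encoding with truncation and padding, as described above, can be sketched as follows. This is a minimal illustration under assumptions: the encode_characters helper, the toy vocabulary, and the use of 0 as the padding label are all hypothetical and are not taken from the DeepDTA source.

```python
from typing import Dict, List

import numpy as np


def encode_characters(sequences: List[str],
                      vocab: Dict[str, int],
                      max_len: int) -> np.ndarray:
    """Label-encode each string per character, truncating or zero-padding to max_len."""
    # Start from an all-zero matrix so unfilled positions act as padding (assumption).
    encoded = np.zeros((len(sequences), max_len), dtype=int)
    for row, sequence in enumerate(sequences):
        # Truncate first, then map every character to its integer label.
        # Characters missing from vocab raise KeyError in this sketch.
        labels = [vocab[char] for char in sequence[:max_len]]
        encoded[row, :len(labels)] = labels
    return encoded


toy_vocab = {"C": 1, "O": 2, "N": 3, "=": 4}   # hypothetical vocabulary
matrix = encode_characters(["CCO", "C=ONCC"], toy_vocab, max_len=4)
```

Here the first string is padded with one trailing 0 and the second is cut to its first four characters, giving an \(N \times max\_len\) matrix with \(N = 2\).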
vectorize_proteins(aa_sequences)
Segments amino-acid sequences of proteins into characters and applies label encoding. Truncation and padding are also applied to prepare proteins for training and/or prediction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
aa_sequences | List[str] | The amino-acid sequences of proteins. | required |
Returns:
Type | Description |
---|---|
np.array | An \(N \times max\_prot\_len\) (\(N\) is the number of the input proteins) matrix that contains label-encoded sequences of amino acids. |
Source code in pydebiaseddta/predictors/deepdta.py
BPEDTA
Bases: TFPredictor
Source code in pydebiaseddta/predictors/bpedta.py
__init__(max_smi_len=100, max_prot_len=1000, embedding_dim=128, learning_rate=0.001, batch_size=256, n_epochs=200, num_filters=32, smi_filter_len=4, prot_filter_len=6)
Constructor to create a BPE-DTA instance. BPE-DTA segments SMILES strings of ligands and amino-acid sequences of proteins into biomolecule words, and applies three layers of convolutions to learn latent representations. A fully-connected neural network with three layers is used afterwards to predict affinities.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
max_smi_len | int, optional | Maximum number of chemical words in a SMILES string, by default 100. SMILES strings that contain more chemical words are truncated. | 100 |
max_prot_len | int, optional | Maximum number of protein words in an amino-acid sequence, by default 1000. Amino-acid sequences that contain more protein words are truncated. | 1000 |
embedding_dim | int, optional | The embedding dimension of the biomolecule words, by default 128. | 128 |
learning_rate | float, optional | Learning rate during optimization, by default 0.001. | 0.001 |
batch_size | int, optional | Batch size during training, by default 256. | 256 |
n_epochs | int, optional | Number of epochs to train the model, by default 200. | 200 |
num_filters | int, optional | Number of filters in the first convolution block, by default 32. The next blocks use two and three times this number, respectively. | 32 |
smi_filter_len | int, optional | Length of filters in the convolution blocks for ligands, by default 4. | 4 |
prot_filter_len | int, optional | Length of filters in the convolution blocks for proteins, by default 6. | 6 |
|
Source code in pydebiaseddta/predictors/bpedta.py
build()
Builds a BPEDTA predictor in Keras with the parameters specified during construction.
Returns:
Type | Description |
---|---|
tensorflow.keras.models.Model | The built model. |
Source code in pydebiaseddta/predictors/bpedta.py
vectorize_ligands(ligands)
Segments SMILES strings of ligands into chemical words and applies label encoding. Truncation and padding are also applied to prepare ligands for training and/or prediction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ligands | List[str] | The SMILES strings of ligands. | required |
Returns:
Type | Description |
---|---|
np.array | An \(N \times max\_smi\_len\) (\(N\) is the number of the input ligands) matrix that contains label-encoded sequences of chemical words. |
Source code in pydebiaseddta/predictors/bpedta.py
vectorize_proteins(aa_sequences)
Segments amino-acid sequences of proteins into protein words and applies label encoding. Truncation and padding are also applied to prepare proteins for training and/or prediction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
aa_sequences | List[str] | The amino-acid sequences of proteins. | required |
Returns:
Type | Description |
---|---|
np.array | An \(N \times max\_prot\_len\) (\(N\) is the number of the input proteins) matrix that contains label-encoded sequences of protein words. |
Source code in pydebiaseddta/predictors/bpedta.py
LMDTA
Bases: TFPredictor
Source code in pydebiaseddta/predictors/lmdta.py
__init__(n_epochs=200, learning_rate=0.001, batch_size=256)
Constructor to create an LMDTA instance.
LMDTA represents ligands and proteins with pre-trained language model embeddings obtained via the ChemBERTa and ProtBert models, respectively.
A fully-connected neural network with two layers is used afterwards to predict affinities.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_epochs | int, optional | Number of epochs to train the model, by default 200. | 200 |
learning_rate | float, optional | Learning rate during optimization, by default 0.001. | 0.001 |
batch_size | int, optional | Batch size during training, by default 256. | 256 |
|
Source code in pydebiaseddta/predictors/lmdta.py
build()
Builds an LMDTA predictor in Keras with the parameters specified during construction.
Returns:
Type | Description |
---|---|
tensorflow.keras.models.Model | The built model. |
Source code in pydebiaseddta/predictors/lmdta.py
get_chemberta_embedding(smiles)
cached
Computes the ChemBERTa vector for a ligand.
Since creating the vector is computationally heavy, an lru_cache of size 2048 is used to store the produced vectors.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
smiles | str | SMILES string of the ligand. | required |
Returns:
Type | Description |
---|---|
np.array | |
Source code in pydebiaseddta/predictors/lmdta.py
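The caching strategy mentioned above relies on functools.lru_cache from the Python standard library. The sketch below illustrates the effect with a stand-in function. The embed function and its fake "embedding" are hypothetical placeholders for the expensive language-model call and do not reproduce the ChemBERTa computation.

```python
from functools import lru_cache

call_count = 0  # counts how often the expensive body actually runs


@lru_cache(maxsize=2048)
def embed(sequence: str) -> tuple:
    """Stand-in for an expensive embedding call (hypothetical placeholder)."""
    global call_count
    call_count += 1
    # A real implementation would run a language model here.
    return tuple(float(ord(char)) for char in sequence)


first = embed("CCO")    # computed
second = embed("CCO")   # served from the cache, body not executed again
third = embed("CCN")    # computed, new input

cache_stats = embed.cache_info()
```

Because the cache is keyed on the input string, repeated ligands or proteins in a dataset are embedded only once until the cache exceeds its 2048 entries and starts evicting the least recently used ones.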
get_protbert_embedding(aa_sequence)
cached
Computes the ProtBert vector for a protein.
Since creating the vector is computationally heavy, an lru_cache of size 2048 is used to store the produced vectors.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
aa_sequence | str | Amino-acid sequence of the protein. | required |
Returns:
Type | Description |
---|---|
np.array | |
Source code in pydebiaseddta/predictors/lmdta.py
vectorize_ligands(ligands)
Vectorizes the ligands with ChemBERTa embeddings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ligands | List[str] | The SMILES strings of ligands. | required |
Returns:
Type | Description |
---|---|
np.array | An \(N \times 768\) (\(N\) is the number of the input ligands) matrix that contains the ChemBERTa embeddings of the ligands. |
Source code in pydebiaseddta/predictors/lmdta.py
vectorize_proteins(aa_sequences)
Vectorizes the proteins with ProtBert embeddings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
aa_sequences | List[str] | The amino-acid sequences of the proteins. | required |
Returns:
Type | Description |
---|---|
np.array | An \(N \times 1024\) (\(N\) is the number of the input proteins) matrix that contains the ProtBert embeddings of the proteins. |
Source code in pydebiaseddta/predictors/lmdta.py