sequence
The submodule for processing SMILES strings.
smiles_processing.py contains utility functions to segment SMILES strings, whereas word_identification.py contains a class to learn biomolecule words and segment biomolecule sequences into those words.
segment_smiles(smiles, segment_sq_brackets=True)
Segments a SMILES string into its tokens.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
smiles | str | Input SMILES string. | required |
segment_sq_brackets | bool, optional | Whether to segment expressions within square brackets (e.g. [C@@H], [Rb]), too. | True |
Returns:
Type | Description |
---|---|
List[str] | The tokens of the SMILES string as a list. |
Source code in pydebiaseddta/sequence/smiles_processing.py
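To make the effect of segment_sq_brackets concrete, here is a minimal sketch of a regex-based SMILES tokenizer. It is not the pydebiaseddta implementation: the name sketch_segment_smiles, both regular expressions, and the assumption that bracket expressions are split into "[", their inner symbols, and "]" are all hypothetical and chosen only for illustration.

```python
import re
from typing import List

# Hypothetical patterns, not taken from pydebiaseddta.
# Outer pass: a whole bracket expression, a two-letter halogen,
# a two-digit ring closure, or any single character.
_OUTER = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")
# Inner pass: an element symbol (e.g. "Rb") or any single character.
_INNER = re.compile(r"[A-Z][a-z]?|.")


def sketch_segment_smiles(smiles: str, segment_sq_brackets: bool = True) -> List[str]:
    """Split a SMILES string into tokens (illustrative sketch only)."""
    tokens: List[str] = []
    for token in _OUTER.findall(smiles):
        is_bracket_expr = token.startswith("[") and token.endswith("]")
        if segment_sq_brackets and is_bracket_expr:
            # Assumed behaviour: emit the brackets and their contents separately.
            tokens.append("[")
            tokens.extend(_INNER.findall(token[1:-1]))
            tokens.append("]")
        else:
            # Keep the token, including a whole bracket expression, as one unit.
            tokens.append(token)
    return tokens


print(sketch_segment_smiles("C[C@@H](N)C(=O)O", segment_sq_brackets=False))
print(sketch_segment_smiles("C[C@@H](N)C(=O)O", segment_sq_brackets=True))
```

With the flag off, [C@@H] stays a single token in this sketch; with it on, the same expression contributes six tokens.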
segment_smiles_batch(smiles_batch, segment_sq_brackets=True)
Segments multiple SMILES strings with a single call by wrapping sequence.smiles_processing.segment_smiles.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
smiles_batch | List[str] | List of input SMILES strings. | required |
segment_sq_brackets | bool, optional | Whether to segment expressions within square brackets. See sequence.smiles_processing.segment_smiles. | True |
Returns:
Type | Description |
---|---|
List[List[str]] | A 2D list of strings where element \([i][j]\) is the \(j^{th}\) token of the \(i^{th}\) input. |
Source code in pydebiaseddta/sequence/smiles_processing.py
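The wrapping pattern and the \([i][j]\) indexing of the result can be sketched as follows. The helper sketch_segment_batch and its segment_fn argument are hypothetical; the built-in list is used as a stand-in tokenizer that splits a string into single characters.

```python
from typing import Callable, List


def sketch_segment_batch(
    smiles_batch: List[str],
    segment_fn: Callable[[str], List[str]],
) -> List[List[str]]:
    """Apply a single-string tokenizer to every string in a batch."""
    # One inner list per input string, in input order.
    return [segment_fn(smiles) for smiles in smiles_batch]


# `list` stands in for a real tokenizer: it splits a string into characters.
segmented = sketch_segment_batch(["CCO", "C=O"], list)
print(segmented[1][2])  # third token of the second input
```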
WordIdentifier
A versatile class to identify biomolecule words in biomolecule strings.
WordIdentifier leverages the Byte Pair Encoding algorithm implemented in the tokenizers library to learn biomolecule vocabularies and segment biomolecule strings into their words.
Source code in pydebiaseddta/sequence/word_identification.py
__init__(vocab_size)
Creates a WordIdentifier instance.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
vocab_size | int | Size of the biomolecule vocabulary. | required |
Source code in pydebiaseddta/sequence/word_identification.py
encode_sequences(sequences, padding_len=None)
Segments a list of biomolecule strings into biomolecule words via the learned vocabulary and returns the IDs of those words, which is convenient for label encoding in subsequent steps. Padding is also supported to ease training deep learning models.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sequences | List[str] | The list of biomolecule strings. | required |
padding_len | int, optional | The desired length of sequences. | None |
Returns:
Type | Description |
---|---|
List[List[int]] | List of the IDs of the biomolecule words of each input string. |
Source code in pydebiaseddta/sequence/word_identification.py
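The step from words to padded ID lists can be illustrated with a small stand-alone sketch. Everything here is an assumption made for illustration: the function sketch_encode, the padding ID 0, the unknown-word ID 1, and the choice to truncate sequences longer than padding_len. The actual behaviour of encode_sequences may differ on each of these points.

```python
from typing import Dict, List, Optional

PAD_ID = 0  # hypothetical padding id
UNK_ID = 1  # hypothetical id for words missing from the vocabulary


def sketch_encode(
    tokenized: List[List[str]],
    vocab: Dict[str, int],
    padding_len: Optional[int] = None,
) -> List[List[int]]:
    """Map words to integer IDs and optionally force a fixed length."""
    encoded: List[List[int]] = []
    for words in tokenized:
        ids = [vocab.get(word, UNK_ID) for word in words]
        if padding_len is not None:
            # Assumed behaviour: cut long sequences, right-pad short ones.
            ids = ids[:padding_len] + [PAD_ID] * max(0, padding_len - len(ids))
        encoded.append(ids)
    return encoded


vocab = {"CC": 2, "O": 3, "N": 4}
print(sketch_encode([["CC", "O"], ["N"]], vocab, padding_len=3))
```

Fixed-length ID lists like these can be stacked into a single integer array, which is what makes padding useful for batched training.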
from_file(loadpath)
classmethod
Loads a WordIdentifier from a file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
loadpath | str | Path to the saved WordIdentifier file. | required |
Returns:
Type | Description |
---|---|
WordIdentifier | Previously saved WordIdentifier instance. |
Source code in pydebiaseddta/sequence/word_identification.py
save(savepath)
Saves a WordIdentifier instance to disk.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
savepath | str | The path to dump the instance. File extension is added automatically. | required |
Source code in pydebiaseddta/sequence/word_identification.py
tokenize_sequences(sequences)
Segments a list of biomolecule strings into biomolecule words via the learned vocabulary.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sequences | List[str] | The list of biomolecule strings. | required |
Returns:
Type | Description |
---|---|
List[List[str]] | List of biomolecule words of each input string. |
Source code in pydebiaseddta/sequence/word_identification.py
train(corpus_path)
Learns a biomolecule vocabulary from a file of biomolecule strings using the Byte Pair Encoding algorithm.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
corpus_path | str | Path to the corpus of biomolecule strings. The corpus file must contain one biomolecule string per line. | required |
Source code in pydebiaseddta/sequence/word_identification.py
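For readers unfamiliar with Byte Pair Encoding, the following toy version shows the core idea: repeatedly merge the most frequent adjacent pair of symbols. It is a pure-Python illustration, not the tokenizers implementation that WordIdentifier relies on. The functions merge_pair, learn_merges, and segment are hypothetical, the corpus is an in-memory list rather than a file, and the number of merges is set directly instead of being derived from a vocabulary size.

```python
from collections import Counter
from typing import List, Tuple


def merge_pair(symbols: List[str], pair: Tuple[str, str]) -> List[str]:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(corpus: List[str], num_merges: int) -> List[Tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of strings."""
    sequences = [list(line) for line in corpus]  # start from single characters
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pair_counts: Counter = Counter()
        for symbols in sequences:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break  # nothing left to merge
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        sequences = [merge_pair(symbols, best) for symbols in sequences]
    return merges


def segment(sequence: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Apply the learned merges, in order, to a new string."""
    symbols = list(sequence)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


merges = learn_merges(["CCO", "CCN", "CCC"], num_merges=1)
print(merges)
print(segment("CCOCC", merges))
```

Each learned merge adds one new symbol, so the vocabulary grows with the number of merges; a target vocabulary size therefore bounds how many merges are learned.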
load_chemical_word_identifier(vocab_size)
A convenience function to load word vocabularies learned for SMILES strings in the study. The possible vocabularies to load are for DeepDTA and BPE-DTA.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
vocab_size | int | Size of the learned SMILES word vocabulary. The allowed values are 94 and 8000, for DeepDTA and BPE-DTA, respectively. | required |
Returns:
Type | Description |
---|---|
type[WordIdentifier] | The WordIdentifier instance with the requested vocabulary. |
Raises:
Type | Description |
---|---|
ValueError | If a vocabulary size other than 94 or 8000 is passed, a ValueError is raised. |
Source code in pydebiaseddta/sequence/word_identification.py
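The size check described above can be sketched as a simple lookup. The sizes and model names are taken from the parameter description; the dictionary CHEMICAL_VOCABS, the function sketch_check_vocab_size, and the wording of the error message are hypothetical and say nothing about how the library stores or loads its vocabularies.

```python
from typing import Dict

# Sizes and model names come from the parameter description;
# the dictionary itself is a hypothetical illustration.
CHEMICAL_VOCABS: Dict[int, str] = {94: "DeepDTA", 8000: "BPE-DTA"}


def sketch_check_vocab_size(vocab_size: int) -> str:
    """Return the model name for an allowed size, or raise ValueError."""
    if vocab_size not in CHEMICAL_VOCABS:
        allowed = sorted(CHEMICAL_VOCABS)
        raise ValueError(f"vocab_size must be one of {allowed}, got {vocab_size}")
    return CHEMICAL_VOCABS[vocab_size]


print(sketch_check_vocab_size(8000))
```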
load_protein_word_identifier(vocab_size)
A convenience function to load word vocabularies learned for amino-acid sequences in the study. The possible vocabularies to load are for DeepDTA and BPE-DTA.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
vocab_size | int | Size of the learned protein word vocabulary. The allowed values are 26 and 32000, for DeepDTA and BPE-DTA, respectively. | required |
Returns:
Type | Description |
---|---|
type[WordIdentifier] | The WordIdentifier instance with the requested vocabulary. |
Raises:
Type | Description |
---|---|
ValueError | If a vocabulary size other than 26 or 32000 is passed, a ValueError is raised. |
Source code in pydebiaseddta/sequence/word_identification.py