I am currently working with SELFIES (self-referencing embedded strings, GitHub: https://github.com/aspuru-guzik-group/selfies), which is a string representation of a molecule. A SELFIES string is a sequence of tokens delimited by brackets; e.g. propane is written as "[C][C][C]". I would like to find the most efficient way to get a list of tokens, like so:
selfies= "[C][C][C]"
tokens= some_function(selfies)
tokens
["[C]","[C]","[C]"]
I have already found three ways to do it:
- With the "native" function from the GitHub repo (https://github.com/aspuru-guzik-group/selfies/blob/master/selfies/utils/selfies_utils.py):
from typing import Iterator

def split_selfies(selfies: str) -> Iterator[str]:
    """Tokenizes a SELFIES string into its individual symbols.

    :param selfies: a SELFIES string.
    :return: the symbols of the SELFIES string one-by-one with order preserved.

    :Example:

    >>> import selfies as sf
    >>> list(sf.split_selfies("[C][=C][F].[C]"))
    ['[C]', '[=C]', '[F]', '.', '[C]']
    """
    left_idx = selfies.find("[")

    while 0 <= left_idx < len(selfies):
        right_idx = selfies.find("]", left_idx + 1)
        if right_idx == -1:
            raise ValueError("malformed SELFIES string, hanging '[' bracket")

        next_symbol = selfies[left_idx: right_idx + 1]
        yield next_symbol

        left_idx = right_idx + 1
        if selfies[left_idx: left_idx + 1] == ".":
            yield "."
            left_idx += 1
%%timeit
tokens = list(sf.split_selfies(selfies))
3.41 µs ± 22.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Edit: "." is never present in my case and it is not considered in solution 2 and 3 for speed's sake
This is somewhat slow, probably because of the generator-to-list conversion.
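If the tokens only need to be consumed once, the generator can also be iterated directly, which skips the list materialization entirely (a minimal sketch using the same sf.split_selfies as above):

import selfies as sf

# Consume the generator lazily instead of building a list first;
# useful when each token is processed exactly once.
for symbol in sf.split_selfies("[C][C][C]"):
    print(symbol)  # prints '[C]' three times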
- One from the creator of the library (https://github.com/aspuru-guzik-group/stoned-selfies/blob/main/GA_rediscover.py):
def get_selfie_chars(selfie):
    '''Obtain a list of all selfie characters in string selfie

    Parameters:
    selfie (string) : A selfie string - representing a molecule

    Example:
    >>> get_selfie_chars('[C][=C][C][=C][C][=C][Ring1][Branch1_1]')
    ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[Branch1_1]']

    Returns:
    chars_selfie: list of selfie characters present in molecule selfie
    '''
    chars_selfie = []  # A list of all SELFIES symbols from string selfie
    while selfie != '':
        chars_selfie.append(selfie[selfie.find('['): selfie.find(']') + 1])
        selfie = selfie[selfie.find(']') + 1:]
    return chars_selfie
%%timeit
tokens = get_selfie_chars(selfies)
3.44 µs ± 43.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Surprisingly, this takes roughly the same amount of time as the native function.
- My implementation, a combination of list comprehension, slicing, and .split():
def selfies_split(selfies):
    return [block + "]" for block in selfies.split("]")][:-1]
%%timeit
tokens = selfies_split(selfies)
1.05 µs ± 53.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
My implementation is roughly threefold faster, but I reckon the most efficient way to tokenize is probably regex via the re package. I have never used it and am not particularly comfortable with regex, so I fail to see how to implement it in a way that yields the best results.
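For reference, one possible regex approach is to match each bracketed block explicitly; precompiling the pattern avoids repeated pattern lookups (a minimal sketch, not benchmarked; the names TOKEN and regex_split are my own):

import re

# Precompiled pattern matching one bracket-delimited token;
# like solutions 2 and 3, this sketch ignores "." separators.
TOKEN = re.compile(r"\[[^\]]*\]")

def regex_split(selfies):
    return TOKEN.findall(selfies)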
Edit:
- First suggestion from the answers:
def stackoverflow_1_split(selfies):
    atoms = selfies[1:-1].replace('][', "$").split("$")
    return list(map('[{}]'.format, atoms))
%%timeit
tokens = stackoverflow_1_split(selfies)
1.75 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Without the list conversion, it is actually faster than my implementation (575 ns ± 10 ns), but the list is a requirement.
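The 575 ns variant simply returns the lazy map object instead of a list; this is my reconstruction of what was measured (the function name is illustrative):

def stackoverflow_1_split_lazy(selfies):
    atoms = selfies[1:-1].replace('][', "$").split("$")
    # Returning the map object defers the '[{}]'.format calls
    # until iteration, so no list is built here.
    return map('[{}]'.format, atoms)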
- Second suggestion from the answers:
import re

def stackoverflow_2_split(selfies):
    return re.findall(r".*?]", selfies)
%%timeit
tokens = stackoverflow_2_split(selfies)
1.81 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Surprisingly, re does not seem to outperform the other solutions.
- Third suggestion from the answers:
def stackoverflow_3_split(selfies):
    return selfies.replace(']', '] ').split()
%%timeit
tokens = stackoverflow_3_split(selfies)
485 ns ± 4.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
This is the fastest solution so far, roughly twice as fast as my implementation. Well done, Kelly!
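For clarity, the trick is that the replace inserts a space after every closing bracket, and whitespace split() then does the tokenizing while silently dropping the trailing space:

>>> "[C][C][C]".replace(']', '] ')
'[C] [C] [C] '
>>> '[C] [C] [C] '.split()
['[C]', '[C]', '[C]']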
CodePudding user response:
Another:
selfies.replace(']', '] ').split()
Benchmark with 50 tokens (since you said that's your mean):
7.29 us original
3.91 us Kelly <= mine
8.06 us keepAlive
8.87 us trincot
With your "[C][C][C]" instead:
0.87 us original
0.44 us Kelly
0.88 us keepAlive
1.45 us trincot
Code (Try it online!):
from timeit import repeat
import re

def original(selfies):
    return [block + "]" for block in selfies.split("]")][:-1]

def Kelly(selfies):
    return selfies.replace(']', '] ').split()

def keepAlive(selfies):
    atoms = selfies[1:-1].split('][')
    return [f'[{a}]' for a in atoms]

def trincot(selfie):
    return re.findall(r".*?]", selfie)

fs = original, Kelly, keepAlive, trincot

selfies = ''.join(f'[{i}]' for i in range(50))
expect = original(selfies)
for f in fs:
    print(f(selfies) == expect, f.__name__)

for _ in range(3):
    print()
    for f in fs:
        number = 1000
        t = min(repeat(lambda: f(selfies), number=number)) / number
        print('%.2f us ' % (t * 1e6), f.__name__)
CodePudding user response:
With regex you can do it as follows:
import re

def get_selfie_chars(selfie):
    return re.findall(r".*?]", selfie)
If a point should be a separate match, then:
    return re.findall(r"\.|.*?]", selfie)
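For example, applied to the docstring example from the native function:

>>> import re
>>> re.findall(r"\.|.*?]", "[C][=C][F].[C]")
['[C]', '[=C]', '[F]', '.', '[C]']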
CodePudding user response:
What about doing
>>> atoms = selfies[1:-1].split('][')
>>> atoms
['C', 'C', 'C']
Assuming you do not need the square brackets anymore. Otherwise, you could ultimately do
>>> [f'[{a}]' for a in atoms]
['[C]', '[C]', '[C]']