I am currently working with SELFIES (self-referencing embedded strings, GitHub: https://github.com/aspuru-guzik-group/selfies), which is a string representation of a molecule. A SELFIES string is a sequence of tokens delimited by brackets; e.g. propane is written as "[C][C][C]". I would like to find the most efficient way to get a list of tokens, like so:
selfies= "[C][C][C]"
tokens= some_function(selfies)
tokens
["[C]","[C]","[C]"]
I have already found three ways to do it:
- With the "native" function from the GitHub repo (https://github.com/aspuru-guzik-group/selfies/blob/master/selfies/utils/selfies_utils.py):
from typing import Iterator

def split_selfies(selfies: str) -> Iterator[str]:
    """Tokenizes a SELFIES string into its individual symbols.

    :param selfies: a SELFIES string.
    :return: the symbols of the SELFIES string one-by-one with order preserved.

    :Example:

    >>> import selfies as sf
    >>> list(sf.split_selfies("[C][=C][F].[C]"))
    ['[C]', '[=C]', '[F]', '.', '[C]']
    """
    left_idx = selfies.find("[")

    while 0 <= left_idx < len(selfies):
        right_idx = selfies.find("]", left_idx + 1)
        if right_idx == -1:
            raise ValueError("malformed SELFIES string, hanging '[' bracket")

        next_symbol = selfies[left_idx: right_idx + 1]
        yield next_symbol

        left_idx = right_idx + 1
        if selfies[left_idx: left_idx + 1] == ".":
            yield "."
            left_idx += 1
%%timeit
tokens = list(sf.split_selfies(selfies))
3.41 µs ± 22.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Edit: "." is never present in my case and it is not considered in solution 2 and 3 for speed's sake
This is somewhat slow, probably because of the generator-to-list conversion.
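If the tokens only need to be consumed once, the generator can also be iterated directly, which skips the list materialization entirely (a minimal sketch using the same sf.split_selfies as above):

import selfies as sf

# Consume the generator lazily instead of building a list first;
# useful when each token is processed exactly once.
for symbol in sf.split_selfies("[C][C][C]"):
    print(symbol)  # prints '[C]' three times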
- One from the creator of the library (https://github.com/aspuru-guzik-group/stoned-selfies/blob/main/GA_rediscover.py):
def get_selfie_chars(selfie):
    '''Obtain a list of all selfie characters in string selfie

    Parameters:
    selfie (string) : A selfie string - representing a molecule

    Example:
    >>> get_selfie_chars('[C][=C][C][=C][C][=C][Ring1][Branch1_1]')
    ['[C]', '[=C]', '[C]', '[=C]', '[C]', '[=C]', '[Ring1]', '[Branch1_1]']

    Returns:
    chars_selfie: list of selfie characters present in molecule selfie
    '''
    chars_selfie = []  # A list of all SELFIES symbols from string selfie
    while selfie != '':
        chars_selfie.append(selfie[selfie.find('['): selfie.find(']') + 1])
        selfie = selfie[selfie.find(']') + 1:]
    return chars_selfie
%%timeit
tokens = get_selfie_chars(selfies)
3.44 µs ± 43.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Surprisingly, this takes roughly the same amount of time as the native function.
- My implementation, a combination of list comprehension, slicing, and .split():
def selfies_split(selfies):
    return [block + "]" for block in selfies.split("]")][:-1]
%%timeit
tokens = selfies_split(selfies)
1.05 µs ± 53.2 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
My implementation is roughly threefold faster, but I reckon the most efficient way to tokenize is probably regex via the re package. I have never used it and am not particularly comfortable with regex, so I fail to see how to implement it in a way that yields the best results.
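For reference, one possible regex approach is to match each bracketed block explicitly; precompiling the pattern avoids repeated pattern lookups (a minimal sketch, not benchmarked; the names TOKEN and regex_split are my own):

import re

# Precompiled pattern matching one bracket-delimited token;
# like solutions 2 and 3, this sketch ignores "." separators.
TOKEN = re.compile(r"\[[^\]]*\]")

def regex_split(selfies):
    return TOKEN.findall(selfies)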
Edit:
- First suggestion from the answers:
def stackoverflow_1_split(selfies):
    atoms = selfies[1:-1].replace('][', "$").split("$")
    return list(map('[{}]'.format, atoms))
%%timeit
tokens = stackoverflow_1_split(selfies)
1.75 µs ± 101 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Without the list conversion, it is actually faster than my implementation (575 ns ± 10 ns), but the list is a requirement.
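The 575 ns variant simply returns the lazy map object instead of a list; this is my reconstruction of what was measured (the function name is illustrative):

def stackoverflow_1_split_lazy(selfies):
    atoms = selfies[1:-1].replace('][', "$").split("$")
    # Returning the map object defers the '[{}]'.format calls
    # until iteration, so no list is built here.
    return map('[{}]'.format, atoms)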
- Second suggestion from the answers:
import re

def stackoverflow_2_split(selfies):
    return re.findall(r".*?]", selfies)
%%timeit
tokens = stackoverflow_2_split(selfies)
1.81 µs ± 110 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Surprisingly, re does not seem to outperform the other solutions.
- Third suggestion from the answers:
def stackoverflow_3_split(selfies):
    return selfies.replace(']', '] ').split()
%%timeit
tokens = stackoverflow_3_split(selfies)
485 ns ± 4.04 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
This is the fastest solution so far, roughly twice as fast as my implementation. Well done, Kelly!
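For clarity, the trick is that the replace inserts a space after every closing bracket, and whitespace split() then does the tokenizing while silently dropping the trailing space:

>>> "[C][C][C]".replace(']', '] ')
'[C] [C] [C] '
>>> '[C] [C] [C] '.split()
['[C]', '[C]', '[C]']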
CodePudding user response:
Another:
selfies.replace(']', '] ').split()
Benchmark with 50 tokens (since you said that's your mean):
7.29 us original
3.91 us Kelly <= mine
8.06 us keepAlive
8.87 us trincot
With your "[C][C][C]" instead:
0.87 us original
0.44 us Kelly
0.88 us keepAlive
1.45 us trincot
Code (Try it online!):
from timeit import repeat
import re

def original(selfies):
    return [block + "]" for block in selfies.split("]")][:-1]

def Kelly(selfies):
    return selfies.replace(']', '] ').split()

def keepAlive(selfies):
    atoms = selfies[1:-1].split('][')
    return [f'[{a}]' for a in atoms]

def trincot(selfie):
    return re.findall(r".*?]", selfie)

fs = original, Kelly, keepAlive, trincot

selfies = ''.join(f'[{i}]' for i in range(50))
expect = original(selfies)
for f in fs:
    print(f(selfies) == expect, f.__name__)

for _ in range(3):
    print()
    for f in fs:
        number = 1000
        t = min(repeat(lambda: f(selfies), number=number)) / number
        print('%.2f us ' % (t * 1e6), f.__name__)
CodePudding user response:
With regex you can do it as follows:
import re

def get_selfie_chars(selfie):
    return re.findall(r".*?]", selfie)
If a point should be a separate match, then:
    return re.findall(r"\.|.*?]", selfie)
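For example, applied to the docstring example from the native function:

>>> import re
>>> re.findall(r"\.|.*?]", "[C][=C][F].[C]")
['[C]', '[=C]', '[F]', '.', '[C]']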
CodePudding user response:
What about doing
>>> atoms = selfies[1:-1].split('][')
>>> atoms
['C', 'C', 'C']
Assuming you do not need the square brackets anymore. Otherwise, you could ultimately do
>>> [f'[{a}]' for a in atoms]
['[C]', '[C]', '[C]']