In Python, how do I write a loop to remove from character # n to a specific character (:) in parts o-CodePudding

I have a list like this:

test = ["Similar to Stxbp2: Syntaxin-binding protein 2 (Mus musculus)", "Protein of unknown function", "Similar to rab18b: Ras-related protein Rab-18-B (Danio rerio)", "Protein of unknown function", "Protein of unknown function"]

The real list is a lot longer than this, but as a simplified example: my goal is to loop through test and edit it so that any value starting with "Similar to" is replaced by the gene name that directly follows (e.g., for this example I'd like to replace the matching items with "Stxbp2" and "rab18b", respectively). I presume this would require starting at character 12 and stopping at the colon. When a value includes "Protein of unknown function", I want it to return "Unknown". Thus, the output would be:

["Stxbp2", "Unknown", "rab18b", "Unknown", "Unknown"]

I know this probably requires a for loop with if statements to match each condition, but I am pretty lost on how to proceed from there to achieve the result I'm looking for.

CodePudding user response：

Variation without regex if you don't like those:

def parse(x):
    if x.startswith("Similar to"):
        return x.split(":")[0].split()[-1]
    if x.startswith("Protein of unknown function"):
        return "Unknown"
    raise ValueError(f"Unknown value: {x}")

print([parse(i) for i in test])

outputs:

['Stxbp2', 'Unknown', 'rab18b', 'Unknown', 'Unknown']
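
Since the question mentions a for loop with if statements, the same logic can also be written as an explicit loop. This is only a sketch of that variant; the name result is a hypothetical choice, and the ValueError branch assumes that any other kind of entry should be treated as an error:

```python
test = [
    "Similar to Stxbp2: Syntaxin-binding protein 2 (Mus musculus)",
    "Protein of unknown function",
    "Similar to rab18b: Ras-related protein Rab-18-B (Danio rerio)",
    "Protein of unknown function",
    "Protein of unknown function",
]

result = []  # hypothetical name for the cleaned-up list
for item in test:
    if item.startswith("Similar to"):
        # Text before the first colon is "Similar to <gene>";
        # the last whitespace-separated word of that is the gene name.
        result.append(item.split(":")[0].split()[-1])
    elif item.startswith("Protein of unknown function"):
        result.append("Unknown")
    else:
        # Assumption: anything else is unexpected and should fail loudly.
        raise ValueError(f"Unknown value: {item}")

print(result)  # ['Stxbp2', 'Unknown', 'rab18b', 'Unknown', 'Unknown']
```

The list comprehension above does the same thing in one line; the loop form is just easier to extend if more conditions turn up later.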

CodePudding user response：

You can use a list comprehension: check each item with str.startswith, slice off the 11-character "Similar to " prefix, and then use str.split to keep only the text before the first colon:

[x[11:].split(':', 1)[0] if x.startswith('Similar to') else 'Unknown' for x in test]
# -> ['Stxbp2', 'Unknown', 'rab18b', 'Unknown', 'Unknown']
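
To see where the 11 comes from, it may help to run the steps on a single item. A small sketch, where prefix, item, rest and gene are hypothetical names used only for illustration:

```python
prefix = "Similar to "  # note the trailing space
item = "Similar to Stxbp2: Syntaxin-binding protein 2 (Mus musculus)"

# The prefix is 11 characters long, so the gene name starts at index 11
# (the 12th character, as guessed in the question).
print(len(prefix))  # 11

rest = item[len(prefix):]     # drop the prefix
gene = rest.split(":", 1)[0]  # keep everything before the first colon
print(gene)  # Stxbp2
```

Using len(prefix) instead of a hard-coded 11 is a design choice, not a requirement; it just avoids recounting if the prefix ever changes.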

CodePudding user response：

We can use a list comprehension along with a regex replacement:

import re

test = ["Similar to Stxbp2: Syntaxin-binding protein 2 (Mus musculus)", "Protein of unknown function", "Similar to rab18b: Ras-related protein Rab-18-B (Danio rerio)", "Protein of unknown function", "Protein of unknown function"]
d = {'Similar to ': '', 'Protein of unknown function': 'Unknown'}
# Build an alternation of the search terms: \b(?:Similar to |Protein of unknown function)\b
regex = r'\b(?:' + r'|'.join(d.keys()) + r')\b'
# re.sub calls the lambda for each match and substitutes the dictionary value
output = [re.sub(regex, lambda m: d[m.group()], x).split(':')[0] for x in test]
print(output)  # ['Stxbp2', 'Unknown', 'rab18b', 'Unknown', 'Unknown']

The strategy here is that the dictionary holds the search terms as keys and their replacements as values. We build a regex alternation of the keys, and then use re.sub() in callback mode: for each match, we look up the replacement for the matched key.
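
To make the callback step more concrete, here is a sketch that applies it to a single string and keeps the intermediate result. The names replace, s, after_sub and gene are hypothetical and only used for illustration:

```python
import re

# 'Unknown' is capitalized here to match the output requested in the question.
d = {"Similar to ": "", "Protein of unknown function": "Unknown"}
regex = r"\b(?:" + "|".join(d.keys()) + r")\b"

def replace(match):
    # match.group() is the exact key text that matched,
    # so it can be used directly as a dictionary key.
    return d[match.group()]

s = "Similar to rab18b: Ras-related protein Rab-18-B (Danio rerio)"
after_sub = re.sub(regex, replace, s)
print(after_sub)  # rab18b: Ras-related protein Rab-18-B (Danio rerio)

gene = after_sub.split(":")[0]
print(gene)  # rab18b
```

The keys here contain no regex metacharacters; if they did, wrapping each key in re.escape() before joining would probably be the safer choice.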