Regex to extract substring from string in Python

I have the following values in two columns. I would like to get only selected values from the 'name' column.

gene 
BCR-ABL (translocation) [HSA:25] [KO:K06619]            MLL-AF4 (translocation) [HSA:4297   4299] [KO:K09186 K15184]            E2A-PBX1 (translocation) [HSA:6929 5087] [KO:K09063 K09355]            TEL-AML1 (translocation) [HSA:861] [KO:K08367]            c-MYC (rearrangement) [HSA:4609] [KO:K04377]            CRLF2 (rearrangement) [HSA:64109] [KO:K05078]            PAX5 (rearrangement) [HSA:5079] [KO:K09383]
(GALAC1) GALT [HSA:2592] [KO:K00965]            (GALAC2) GALK1 [HSA:2584] [KO:K00849]            (GALAC3) GALE [HSA:2582] [KO:K01784]            (GALAC4) GALM [HSA:130589] [KO:K01785]

I am using the following regex in Python to extract them, and I am getting the output below. dict['GENE'] holds these values.

pattern1= re.compile('^(.*) \(.* \[HSA')
for gene in re.findall(pattern1, dict['GENE']):
    re.sub(r"\(.*?\)|\[.*?\]\s+", ' | ', gene)

1 BCR-ABL | ||MLL-AF4 | ||E2A-PBX1 | ||TEL-AML1 | ||c-MYC | ||CRLF2 | ||PAX5
2 | GALT ||| GALK1 ||| GALE ||

The desired output is:

1 BCR-ABL | MLL-AF4 | E2A-PBX1 | TEL-AML1 | c-MYC | CRLF2 | PAX5
2 GALT | GALK1 | GALE | GALM

CodePudding user response：

This is a clunky method, but it returns your desired output:

import re

s = '''BCR-ABL (translocation) [HSA:25] [KO:K06619]            MLL-AF4 (translocation) [HSA:4297   4299] [KO:K09186 K15184]            E2A-PBX1 (translocation) [HSA:6929 5087] [KO:K09063 K09355]            TEL-AML1 (translocation) [HSA:861] [KO:K08367]            c-MYC (rearrangement) [HSA:4609] [KO:K04377]            CRLF2 (rearrangement) [HSA:64109] [KO:K05078]            PAX5 (rearrangement) [HSA:5079] [KO:K09383]
(GALAC1) GALT [HSA:2592] [KO:K00965]            (GALAC2) GALK1 [HSA:2584] [KO:K00849]            (GALAC3) GALE [HSA:2582] [KO:K01784]            (GALAC4) GALM [HSA:130589] [KO:K01785]'''
s = s.split('\n')

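for line in s:
    # drop "(...)" groups, then "[...]" groups
    line = re.sub(r'\([^\)]+\)', '', line)
    line = re.sub(r'\[[^\]]+\]', '', line)
    # runs of two or more whitespace characters become the separator
    r = re.sub(r'\s{2,}', ' | ', line)
    # trim the leftover spaces and the trailing separator
    print(r.strip(' |'))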
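
The same steps could also be folded into a small reusable function. This is only a sketch: the name clean_gene_line is made up for illustration, and it assumes that gene names never contain whitespace, so anything left after removing the bracketed groups can simply be split on whitespace.

```python
import re

# Hypothetical helper name; not part of the original answer.
def clean_gene_line(line):
    # Remove "(...)" and "[...]" groups in a single pass.
    line = re.sub(r'\([^)]*\)|\[[^\]]*\]', '', line)
    # Assumption: what remains is gene names separated by whitespace.
    return ' | '.join(line.split())

row = '(GALAC1) GALT [HSA:2592] [KO:K00965]            (GALAC2) GALK1 [HSA:2584] [KO:K00849]'
print(clean_gene_line(row))  # GALT | GALK1
```

Splitting on any whitespace avoids the trailing separator that the substitution-based version has to strip afterwards.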

CodePudding user response：

It looks like you mainly want to get rid of the text between brackets. With the whole input, header line included, in txt:

>>> nobrackets = re.sub(r'(\[|\().*?(\]|\))', '', txt)
>>> print(nobrackets)
gene 
BCR-ABL               MLL-AF4               E2A-PBX1               TEL-AML1               c-MYC               CRLF2               PAX5   
 GALT               GALK1               GALE               GALM

The regex is quite simple:

(
  \[    # a literal [
  |     # or
  \(    # a literal (
)
.*?     # anything (ungreedy¹)
(
  \]    # a literal ]
  |     # or
  \)    # a literal )
)

1: https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy
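
As a rough illustration, the commented layout above can be compiled as-is with re.VERBOSE, and comparing it with a greedy variant shows why the ? matters. The sample strings and the variable names bracketed and greedy are invented for this sketch.

```python
import re

# re.VERBOSE ignores whitespace and "#" comments inside the pattern.
bracketed = re.compile(r'''
    (\[|\()   # a literal [ or (
    .*?       # anything, as few characters as possible
    (\]|\))   # a literal ] or )
''', re.VERBOSE)

# Same pattern without the "?", so ".*" grabs as much as it can.
greedy = re.compile(r'(\[|\().*(\]|\))')

sample = 'c-MYC (rearrangement) [HSA:4609] PAX5 (rearrangement)'
print(bracketed.sub('', sample).split())  # ['c-MYC', 'PAX5']
print(greedy.sub('', sample).split())     # ['c-MYC']
```

The greedy version matches from the first opening bracket to the last closing one, so every gene name in between is removed as well.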

Then, it's just a matter of cleaning up and formatting:

>>> lines = [ ' | '.join(filter(None, re.split(r'\s+', line))) for line in nobrackets.split('\n') ]
>>> for i, line in enumerate(lines):
...   print(f'{i} {line}')
... 
0 gene
1 BCR-ABL | MLL-AF4 | E2A-PBX1 | TEL-AML1 | c-MYC | CRLF2 | PAX5
2 GALT | GALK1 | GALE | GALM
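
If the "0 gene" line is not wanted, one option is to peel off the header before numbering the rows from 1, which matches the numbering in the question. This is a sketch on a shortened, made-up txt; it assumes the first line is always the column header.

```python
import re

# Shortened sample input; the first line is assumed to be the header.
txt = '''gene
c-MYC (rearrangement) [HSA:4609] [KO:K04377]            PAX5 (rearrangement) [HSA:5079] [KO:K09383]
(GALAC1) GALT [HSA:2592] [KO:K00965]            (GALAC2) GALK1 [HSA:2584] [KO:K00849]'''

header, *rows = txt.split('\n')

numbered = []
for i, row in enumerate(rows, start=1):
    # Same bracket-stripping regex as above, then split on whitespace.
    names = re.sub(r'(\[|\().*?(\]|\))', '', row).split()
    numbered.append(f"{i} {' | '.join(names)}")

print('\n'.join(numbered))
# 1 c-MYC | PAX5
# 2 GALT | GALK1
```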