I have the following values in two columns. I would like to get only selected values from the 'name' column.
gene
BCR-ABL (translocation) [HSA:25] [KO:K06619] MLL-AF4 (translocation) [HSA:4297 4299] [KO:K09186 K15184] E2A-PBX1 (translocation) [HSA:6929 5087] [KO:K09063 K09355] TEL-AML1 (translocation) [HSA:861] [KO:K08367] c-MYC (rearrangement) [HSA:4609] [KO:K04377] CRLF2 (rearrangement) [HSA:64109] [KO:K05078] PAX5 (rearrangement) [HSA:5079] [KO:K09383]
(GALAC1) GALT [HSA:2592] [KO:K00965] (GALAC2) GALK1 [HSA:2584] [KO:K00849] (GALAC3) GALE [HSA:2582] [KO:K01784] (GALAC4) GALM [HSA:130589] [KO:K01785]
I am using the following regex in Python to extract them, and I get the output below. dict['GENE'] holds these values.
pattern1= re.compile('^(.*) \(.* \[HSA')
for gene in re.findall(pattern1, dict['GENE']):
    re.sub("\(.*?\)|\[.*?\]\s+", ' | ', gene)
1 BCR-ABL | ||MLL-AF4 | ||E2A-PBX1 | ||TEL-AML1 | ||c-MYC | ||CRLF2 | ||PAX5
2 | GALT ||| GALK1 ||| GALE ||
The desired output is:
1 BCR-ABL | MLL-AF4 | E2A-PBX1 | TEL-AML1 | c-MYC | CRLF2 | PAX5
2 GALT | GALK1 | GALE | GALM
CodePudding user response:
This is a clunky method, but it returns your desired output:
import re
s = '''BCR-ABL (translocation) [HSA:25] [KO:K06619] MLL-AF4 (translocation) [HSA:4297 4299] [KO:K09186 K15184] E2A-PBX1 (translocation) [HSA:6929 5087] [KO:K09063 K09355] TEL-AML1 (translocation) [HSA:861] [KO:K08367] c-MYC (rearrangement) [HSA:4609] [KO:K04377] CRLF2 (rearrangement) [HSA:64109] [KO:K05078] PAX5 (rearrangement) [HSA:5079] [KO:K09383]
(GALAC1) GALT [HSA:2592] [KO:K00965] (GALAC2) GALK1 [HSA:2584] [KO:K00849] (GALAC3) GALE [HSA:2582] [KO:K01784] (GALAC4) GALM [HSA:130589] [KO:K01785]'''
s = s.split('\n')
for line in s:
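    # Remove every (...) group, then every [...] group
    line = re.sub(r'\([^\)]+\)', '', line)
    line = re.sub(r'\[[^\]]+\]', '', line)
    # The removals leave runs of spaces between names; turn each run into a separator
    r = re.sub(r'\s{2,}', ' | ', line)
    print(r.strip().strip('|'))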
CodePudding user response:
It looks like you mainly want to get rid of the text between brackets:
>>> nobrackets = re.sub('(\[|\().*?(\]|\))', '', txt)
>>> print(nobrackets)
gene
BCR-ABL MLL-AF4 E2A-PBX1 TEL-AML1 c-MYC CRLF2 PAX5
GALT GALK1 GALE GALM
The regex is quite simple:
(
\[ # a literal [
| # or
\( # a literal (
)
.*? # anything (ungreedy¹)
(
\] # a literal ]
| # or
\) # a literal )
)
1: https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy
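To see why the non-greedy `.*?` matters here, this minimal sketch runs the same substitution with a greedy `.*` and with `.*?` on a shortened sample taken from the question's data. The variable names (`sample`, `greedy`, `lazy`) are illustrative only.

```python
import re

sample = "BCR-ABL (translocation) [HSA:25] [KO:K06619] MLL-AF4 (translocation) [HSA:4297 4299]"

# Greedy: .* runs from the first opening bracket to the LAST closing bracket,
# so the gene name MLL-AF4 in between is swallowed as well.
greedy = re.sub(r'(\[|\().*(\]|\))', '', sample)

# Non-greedy: .*? stops at the first closing bracket after each opening one,
# so only the bracketed groups themselves are removed.
lazy = re.sub(r'(\[|\().*?(\]|\))', '', sample)

print(greedy.split())  # ['BCR-ABL']
print(lazy.split())    # ['BCR-ABL', 'MLL-AF4']
```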
Then, it's just a matter of cleaning up and formatting:
>>> lines = [ ' | '.join(filter(lambda x: x, re.split(r'\s+', line))) for line in nobrackets.split('\n') ]
>>> for i, line in enumerate(lines):
...     print(f'{i} {line}')
...
0 gene
1 BCR-ABL | MLL-AF4 | E2A-PBX1 | TEL-AML1 | c-MYC | CRLF2 | PAX5
2 GALT | GALK1 | GALE | GALM
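The removal and clean-up steps can also be folded into one small helper. This is only a sketch: `BRACKETED` and `gene_names` are hypothetical names, and it assumes that gene names never contain whitespace and that brackets are never nested, which holds for the sample data in the question.

```python
import re

# Matches one (...) or one [...] group; each character class stops at the
# first closing bracket, so no non-greedy quantifier is needed.
BRACKETED = re.compile(r'\([^)]*\)|\[[^\]]*\]')

def gene_names(line):
    """Return the whitespace-separated tokens left once bracketed groups are removed."""
    # Replace each group with a space, then let split() discard all extra whitespace.
    return BRACKETED.sub(' ', line).split()

line = "(GALAC1) GALT [HSA:2592] [KO:K00965] (GALAC2) GALK1 [HSA:2584] [KO:K00849]"
print(' | '.join(gene_names(line)))  # GALT | GALK1
```

Because `str.split()` with no arguments drops leading, trailing, and repeated whitespace, no separate stripping step is required.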