I have the following data in a text file.
T79534 TARGETID T79534
T79534 FORMERID TTDI01219
T79534 TARGNAME P450-dependent ergosterol synthesis (PDE synth)
T79534 TARGTYPE Discontinued target
T79534 DRUGINFO D0T5NI Saperconazole Discontinued in Phase 2
T78590 TARGETID T78590
T78590 FORMERID TTDI01580
T78590 TARGNAME Polymorphonuclear neutrophil adhesion (PMNA)
T78590 TARGTYPE Discontinued target
T78590 DRUGINFO D0OB7J NPC-15669 Discontinued in Phase 1
I would like to extract the values of the TARGETID and TARGTYPE fields. I am using the following lines of Python code to get this data (target_file is a variable holding this data; its type is <class 'pandas.core.frame.DataFrame'>).
for index, row in target_file.iterrows():
    if re.match("r^[A-Z] .*", str(row['field'])):
        if row['field'] == 'TARGETID':
            target_id = row['value']
        elif row['field'] == 'TARGTYPE':
            target_type = row['value']
        else:
            continue
    elif re.match("r^\s |$\Z", str(row['field'])): #if matches space or end of file, get the data
        print (target_id, target_type, uniprot_id)
        #empty the variable for new loop
        target_id = ''
        target_type = ''
    else:
        'do nothing'
I assume that the regex in the elif condition (re.match("r^\s |$\Z", str(row['field']))) is not working.
The expected output is:
T79534 Discontinued target
T78590 Discontinued target
Any help here is highly appreciated.
CodePudding user response:
If there are no further conditions that I don't know of, and every chunk always has a TARGETID and a TARGTYPE, then you can do it like this in the df:
out = df.loc[df['field'].isin(['TARGETID','TARGTYPE']),'value']
print(out)
Output:
0 T79534
3 Discontinued target
5 T78590
8 Discontinued target
To match exactly the output you want, you could, if necessary, do something like this with out:
for i,k in zip(out[0::2], out[1::2]):
    print(i,k)
Output:
T79534 Discontinued target
T78590 Discontinued target
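If a chunk could ever list its fields in a different order, the every-other-row slicing would pair the wrong values. A possible alternative is to pivot on the ID column instead. This is only a sketch under assumptions: the column name "id" for the first column and the way the DataFrame is built here are hypothetical, since the question only shows the columns field and value.

```python
import pandas as pd

raw = """T79534 TARGETID T79534
T79534 FORMERID TTDI01219
T79534 TARGNAME P450-dependent ergosterol synthesis (PDE synth)
T79534 TARGTYPE Discontinued target
T79534 DRUGINFO D0T5NI Saperconazole Discontinued in Phase 2

T78590 TARGETID T78590
T78590 FORMERID TTDI01580
T78590 TARGNAME Polymorphonuclear neutrophil adhesion (PMNA)
T78590 TARGTYPE Discontinued target
T78590 DRUGINFO D0OB7J NPC-15669 Discontinued in Phase 1"""

# Assumed layout: ID, field name, then the rest of the line as the value.
rows = [line.split(maxsplit=2) for line in raw.splitlines() if line.strip()]
df = pd.DataFrame(rows, columns=["id", "field", "value"])

# Keep only the two fields of interest, then turn them into columns per ID.
wanted = df[df["field"].isin(["TARGETID", "TARGTYPE"])]
table = wanted.pivot(index="id", columns="field", values="value")

# pivot sorts the index, so restore the order in which the IDs appear.
table = table.reindex(df["id"].unique())

pairs = list(zip(table["TARGETID"], table["TARGTYPE"]))
for target_id, target_type in pairs:
    print(target_id, target_type)
```

Note that pivot raises an error if the same field occurs twice for one ID, so this only fits data where each chunk has at most one TARGETID and one TARGTYPE.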
Does that help?
UPDATE: If you want to do it directly from the text file without creating a DataFrame, you can do it like this:
with open('000_SO_input3.txt', 'r') as f:
    for line in f:
        if len(line.strip())!=0:
            ID, fields, value = [x.strip() for x in line.split(maxsplit=2)]
            if fields=='TARGETID':
                targetid = value
            elif fields=='TARGTYPE':
                targtype = value
        else:
            print(targetid, targtype)
    print(targetid, targtype)
If the row is not empty, split it and check whether its field is one of the searched words; otherwise (you hit an empty row) print the current values of targetid and targtype. Notice that the same print statement is repeated at the very end: there is no empty row after the last chunk, so without it the last values of the two variables would never be printed.
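The same idea can be wrapped in a small function that returns the pairs instead of printing them. This is only a sketch: the name extract_pairs is made up for illustration, and it assumes that all rows of a chunk share the same first column, so a change of ID marks a new chunk whether or not a blank line separates them.

```python
def extract_pairs(lines):
    """Collect (TARGETID, TARGTYPE) pairs, one per chunk of lines."""
    pairs = []
    current_id = None   # first column of the chunk being read
    record = {}         # wanted fields seen so far in that chunk
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) < 3:
            continue    # blank or incomplete line: nothing to extract
        row_id, field, value = (p.strip() for p in parts)
        if row_id != current_id:
            # The first column changed, so a new chunk starts here.
            if record:
                pairs.append((record.get('TARGETID'), record.get('TARGTYPE')))
            current_id, record = row_id, {}
        if field in ('TARGETID', 'TARGTYPE'):
            record[field] = value
    if record:
        # Flush the last chunk; no later row exists to trigger it.
        pairs.append((record.get('TARGETID'), record.get('TARGTYPE')))
    return pairs


sample = [
    'T79534 TARGETID T79534',
    'T79534 FORMERID TTDI01219',
    'T79534 TARGTYPE Discontinued target',
    '',
    'T78590 TARGETID T78590',
    'T78590 TARGTYPE Discontinued target',
]
for target_id, target_type in extract_pairs(sample):
    print(target_id, target_type)
```

To read from a file, the open file object can be passed directly, for example extract_pairs(f) inside a with open(...) block, because iterating over a file yields its lines.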
EDIT on your code:
I just had a look at the code you tried. First of all, there is a typo in both regexes: for a raw string, the r needs to be outside of the "...", so it has to be r"^[A-Z] .*". You actually just search for a word, so r"\w " would do as well. Second, you need to be aware of what you are checking: you don't check one big multiline string, you check every cell of the column field one at a time, so there is no need to wrap row['field'] with str(). The same goes for the second regex pattern. Because you check cell by cell, you won't catch the end of the file; the for loop just stops after the last cell. And the empty line gets dropped automatically (at least it did when I loaded the data from the text file into a df), so you won't hit that either. In general r"^\s " would match an empty string or a string with only whitespace, but there simply is no such cell, so your elif never gets executed. If you insert some print statements into all the if, elif and else branches to see how the code executes, you will see this.
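The effect of the misplaced r can be checked in isolation. The patterns below are illustrative ones written for this check, not the exact patterns from the question.

```python
import re

field = 'TARGETID'

# r inside the quotes: the pattern itself starts with a literal 'r',
# so a cell starting with 'T' cannot match.
inside = re.match("r^[A-Z]+", field)
print(inside)                      # None

# r outside the quotes: a raw string whose pattern starts at '^'.
outside = re.match(r"^[A-Z]+", field)
print(outside.group())             # TARGETID

# ^\s*$ matches an empty or whitespace-only cell ...
blank = re.match(r"^\s*$", "   ")
print(blank is not None)           # True

# ... but never a cell that holds a field name.
not_blank = re.match(r"^\s*$", field)
print(not_blank)                   # None
```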
CodePudding user response:
It seems simpler with re.findall.
(I've omitted the file-reading part because it seems that you have no problem with it and it's easier to code an example in one file.)
import re
myfile=(
    'T79534 TARGETID T79534 ',
    'T79534 FORMERID TTDI01219 ',
    'T79534 TARGNAME P450-dependent ergosterol synthesis (PDE synth) ',
    'T79534 TARGTYPE Discontinued target ',
    'T79534 DRUGINFO D0T5NI Saperconazole Discontinued in Phase 2',
    'T78590 TARGETID T78590 ',
    'T78590 FORMERID TTDI01580 ',
    'T78590 TARGNAME Polymorphonuclear neutrophil adhesion (PMNA) ',
    'T78590 TARGTYPE Discontinued target ',
    'T78590 DRUGINFO D0OB7J NPC-15669 Discontinued in Phase 1',)
target_id = ''
target_type = ''
for f in myfile:
    # skip the ID, then capture the field name and the first word of the value
    g = re.findall(r'\w+\s+(\w+)\s+(\w+)',f)
    if g[0][0] == 'TARGETID':
        target_id = g[0][1]
    if g[0][0] == 'TARGTYPE':
        target_type = g[0][1]
    if target_id:
        print(target_id)
    if target_type:
        print(target_type)
    # reset both variables before the next line
    target_id = ''
    target_type=''
Output:
T79534
Discontinued
T78590
Discontinued
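The output above shows only "Discontinued" because \w stops at the first space, so the second word of the value is lost. A possible variation that keeps the whole value and prints each pair on one line is sketched below. The pattern is an adaptation written for this example, not the one from the answer, and it assumes that TARGETID comes before TARGTYPE within each chunk.

```python
import re

lines = (
    'T79534 TARGETID T79534 ',
    'T79534 FORMERID TTDI01219 ',
    'T79534 TARGNAME P450-dependent ergosterol synthesis (PDE synth) ',
    'T79534 TARGTYPE Discontinued target ',
    'T78590 TARGETID T78590 ',
    'T78590 TARGTYPE Discontinued target ',
)

# ID, field name, then everything up to the last non-space character.
pattern = re.compile(r'(\w+)\s+(\w+)\s+(.*\S)')

results = []
target_id = ''
for line in lines:
    m = pattern.match(line)
    if m is None:
        continue  # blank or malformed line
    _, field, value = m.groups()
    if field == 'TARGETID':
        target_id = value
    elif field == 'TARGTYPE':
        # Pair the type with the most recently seen ID.
        results.append((target_id, value))

for target_id, target_type in results:
    print(target_id, target_type)
```

With the sample lines this prints "T79534 Discontinued target" and "T78590 Discontinued target", which is the format asked for in the question.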