Regex to match space in the beginning or end of file in Python

I have the following data in a text file.

T79534  TARGETID    T79534      
T79534  FORMERID    TTDI01219       
T79534  TARGNAME    P450-dependent ergosterol synthesis (PDE synth)     
T79534  TARGTYPE    Discontinued target     
T79534  DRUGINFO    D0T5NI  Saperconazole   Discontinued in Phase 2
                
T78590  TARGETID    T78590      
T78590  FORMERID    TTDI01580       
T78590  TARGNAME    Polymorphonuclear neutrophil adhesion (PMNA)        
T78590  TARGTYPE    Discontinued target     
T78590  DRUGINFO    D0OB7J  NPC-15669   Discontinued in Phase 1

I would like to extract the values of the TARGETID and TARGTYPE fields. I am using the following lines of Python code to get this data (target_file is a variable holding this data, of type <class 'pandas.core.frame.DataFrame'>).

for index, row in target_file.iterrows():
    if re.match("r^[A-Z]+.*", str(row['field'])):
        if row['field'] == 'TARGETID':
            target_id = row['value']
        elif row['field'] == 'TARGTYPE':
            target_type = row['value']
        else:
            continue
    elif re.match("r^\s+|$\Z", str(row['field'])): #if matches space or end of file, get the data
        print (target_id, target_type, uniprot_id)

        #empty the variable for new loop
        target_id = ''
        target_type = ''
    else:
        'do nothing'

I assume that the regex in the elif condition (re.match("r^\s+|$\Z", str(row['field']))) is not working.

The expected output is:

T79534 Discontinued target
T78590 Discontinued target

Any help here is highly appreciated.

CodePudding user response：

If there aren't more conditions I don't know of, and every chunk always has a TARGETID and a TARGTYPE, then you can do it like this in the df:

out = df.loc[df['field'].isin(['TARGETID','TARGTYPE']),'value']
print(out)

Output:
0                 T79534
3    Discontinued target
5                 T78590
8    Discontinued target

If you need to match your expected output exactly, you could pair up the entries of out like this:

for i,k in zip(out[0::2], out[1::2]):
    print(i,k)

Output:
T79534 Discontinued target
T78590 Discontinued target

Does that help?

UPDATE: If you want to do it directly from the text file without creating a DataFrame, you can do it like this:

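targetid = targtype = ''  # defaults in case an empty row comes first
with open('000_SO_input3.txt', 'r') as f:
    for line in f:
        if len(line.strip())!=0:
            # non-empty row: split into ID, field name and the rest of the line
            ID, fields, value = [x.strip() for x in line.split(maxsplit=2)]
            if fields=='TARGETID':
                targetid = value
            elif fields=='TARGTYPE':
                targtype = value
        else:
            # empty row marks the end of a chunk
            print(targetid, targtype)
    # the last chunk has no empty row after it
    print(targetid, targtype)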

If the row is not empty, split it and check whether the field is one of the two you are looking for; otherwise (you hit an empty row) print the current values of targetid and targtype. Notice that the same print statement is repeated at the very end: the file does not end with an empty row, so without it the values of the last chunk would never be printed.
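The same idea can also be wrapped in a generator so that each chunk is reported exactly once, whether or not the file ends with an empty row. This is only a sketch: the name iter_targets and the in-memory sample (a shortened copy of your data, standing in for the real file) are assumptions made for illustration.

```python
import io

def iter_targets(lines):
    """Yield (targetid, targtype) once per blank-line-separated chunk."""
    targetid = targtype = None
    for line in lines:
        parts = line.split(maxsplit=2)
        if not parts:
            # blank or whitespace-only row: the current chunk is finished
            if targetid is not None or targtype is not None:
                yield targetid, targtype
            targetid = targtype = None
            continue
        if len(parts) < 3:
            # row without a value column: skip it
            continue
        _, field, value = parts
        if field == 'TARGETID':
            targetid = value.strip()
        elif field == 'TARGTYPE':
            targtype = value.strip()
    # the last chunk has no blank row after it
    if targetid is not None or targtype is not None:
        yield targetid, targtype

# Hypothetical sample standing in for the text file.
sample = io.StringIO(
    "T79534  TARGETID    T79534      \n"
    "T79534  TARGTYPE    Discontinued target     \n"
    "                \n"
    "T78590  TARGETID    T78590      \n"
    "T78590  TARGTYPE    Discontinued target     \n"
)
results = list(iter_targets(sample))
for targetid, targtype in results:
    print(targetid, targtype)
```

With a real file you would pass the open file object instead of sample, since both are iterated line by line.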

EDIT on your code: I just had a look at the code you tried. First of all, there is a typo in both regexes. For a raw string the r needs to be outside of the quotes, so it has to be r"^[A-Z]+.*". You actually just search for a word, so r"\w+" will do it as well.

Second, you need to be aware of what you are checking: not one big multiline string, but every cell of the column field, one at a time. There is no need to wrap row['field'] with str(). The same goes for the second regex pattern. Because you check cell by cell, you won't catch the end of the file; the for loop just stops after the last cell. The empty line also gets dropped automatically (at least it did when I loaded the data from the text file into a df), so you won't hit that either. In general, r"^\s+" would match a string that starts with whitespace, and the $\Z alternative would match an empty string, but there is no such cell, so your elif never gets executed. If you insert some print statements into the if, elif and else branches to see how the code executes, you will see.
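Both points about the patterns can be checked in a few lines. This is a minimal sketch with the + quantifiers written out; the test strings are assumptions picked for illustration, not cells from your DataFrame.

```python
import re

# With the r inside the quotes, the pattern begins with a literal "r"
# followed by ^, so re.match can never succeed.
wrong = re.match("r^[A-Z]+.*", "TARGETID")
print(wrong)

# With the r prefix outside the quotes, the pattern behaves as intended.
right = re.match(r"^[A-Z]+.*", "TARGETID")
print(right is not None)

# The second pattern accepts leading whitespace or an empty string,
# but not an ordinary field name.
blank = re.compile(r"^\s+|$\Z")
print(blank.match("   ") is not None)
print(blank.match("") is not None)
print(blank.match("TARGETID") is not None)
```

So the corrected elif pattern only fires if a cell is actually empty or starts with whitespace, which is why it stays silent on a column that contains nothing but field names.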

CodePudding user response：

It seems simpler with re.findall.
(I've omitted the file-reading part because it seems that you have no problem with it and it's easier to code an example in one file.)

import re 
myfile=(
'T79534  TARGETID    T79534      ',
'T79534  FORMERID    TTDI01219       ',
'T79534  TARGNAME    P450-dependent ergosterol synthesis (PDE synth)     ',
'T79534  TARGTYPE    Discontinued target     ',
'T79534  DRUGINFO    D0T5NI  Saperconazole   Discontinued in Phase 2',
'T78590  TARGETID    T78590      ',
'T78590  FORMERID    TTDI01580       ',
'T78590  TARGNAME    Polymorphonuclear neutrophil adhesion (PMNA)        ',
'T78590  TARGTYPE    Discontinued target     ',
'T78590  DRUGINFO    D0OB7J  NPC-15669   Discontinued in Phase 1',)
target_id = ''
target_type = ''
for f in myfile:
  g = re.findall(r'\w+\s+(\w+)\s+(\w+)',f)
  if g[0][0] == 'TARGETID':
    target_id = g[0][1] 
  if g[0][0] == 'TARGTYPE':
    target_type = g[0][1] 
  if target_id:
    print(target_id)
  if target_type:
    print(target_type)
  target_id = ''
  target_type=''

Output:

T79534
Discontinued
T78590
Discontinued
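The output above stops at "Discontinued" because (\w+) cannot cross the space inside "Discontinued target". If the full value is needed, one option is to let the last group run to the final non-space character of the line. This is a sketch only; row_pattern, lines and parsed are names made up for this example, and the three rows are copied from the first chunk of the data.

```python
import re

# Three columns: ID, field name, then everything up to the last
# non-whitespace character of the line.
row_pattern = re.compile(r'(\w+)\s+(\w+)\s+(.*\S)')

lines = (
    'T79534  TARGETID    T79534      ',
    'T79534  TARGTYPE    Discontinued target     ',
    'T79534  DRUGINFO    D0T5NI  Saperconazole   Discontinued in Phase 2',
)

parsed = {}
for line in lines:
    m = row_pattern.match(line)
    if m:
        # group(2) is the field name, group(3) the trimmed value
        parsed[m.group(2)] = m.group(3)

print(parsed['TARGETID'], parsed['TARGTYPE'])
```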