Python : Group tagging of words that have consecutive positions

I have a dataframe with 3 columns: 'text', 'in', 'tar' of type(str, list, list) respectively.

                   text                                              in     tar
0  This is an example text that I use in order to  get an answer     [2]    [6]
1  Discussion: We are examining the possibility of this solution.    [3]    [6, 7, 8]

in and tar represent specific entities that I want to tag in the text. Each list holds the positions (counting from 0) of the words in the text that belong to that entity.

For example, at the 2nd row of the dataframe where in = [3], I take the word at position 3 (counting from 0) from the text column (i.e. "examining") and label it as <IN>examining</IN>.

Similarly, for the same row, since tar = [6,7, 8], I have <TAR>of</TAR>, <TAR>this</TAR>, <TAR>solution</TAR>.

But what I want is when there are consecutive positions (i.e. [1,2,3] or [6,7, 8]) to label them together in one tag, such as <TAR>of this solution</TAR>.

I only want to do this when the positions are consecutive (i.e.: [1,2,3]), not when they aren't (i.e. [1,3,5]).

This is what I have so far:

import pandas as pd

data = {'text': ['This is an example text that I use in order to get an answer',
                 'Discussion: We are examining the possibility of this solution'],
        'in': [[2], [3]],
        'tar': [[6], [6, 7, 8]]}
df = pd.DataFrame(data)
cols = list(df.columns)[1:]
new_text = []
for idx, row in df.iterrows():
    temp = list(row['text'].split())
    for pos, word in enumerate(temp):
        for col in cols:
            if pos in row[col]:
                temp[pos] = f'<{col.upper()}>{word}</{col.upper()}>'
    new_text.append(' '.join(temp))
df['text'] = new_text
print(df.text.to_list())

Output:

['This is <IN>an</IN> example text that <TAR>I</TAR> use in order to get an answer', 
 'Discussion: We are <IN>examining</IN> the possibility <TAR>of</TAR> <TAR>this</TAR> <TAR>solution</TAR>']

Desired output:

 ['Discussion: We are <IN>examining</IN> the possibility <TAR>of this solution</TAR>']

Can someone help?

CodePudding user response:

As a quick fix to your existing code, you could just replace every occurrence of '</TAR> <TAR>' in the output with a single space.
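A minimal sketch of that quick fix, assuming the strings were already tagged word by word as in the question. The helper name merge_adjacent_tags and its tags parameter are hypothetical, chosen only for illustration:

```python
def merge_adjacent_tags(text, tags=("IN", "TAR")):
    """Join neighbouring tagged words: '</TAR> <TAR>' becomes ' '."""
    for tag in tags:
        # A closing tag directly followed by the same opening tag means
        # the two tagged words sit next to each other in the text.
        text = text.replace(f"</{tag}> <{tag}>", " ")
    return text


tagged = ("Discussion: We are <IN>examining</IN> the possibility "
          "<TAR>of</TAR> <TAR>this</TAR> <TAR>solution</TAR>")
merged = merge_adjacent_tags(tagged)
print(merged)
```

Non-consecutive positions such as [1, 3, 5] are left alone, because an untagged word always sits between the closing and the opening tag in that case.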

CodePudding user response:

One of the approaches:

import pandas as pd
data = {'text': ['This is an example text that I use in order to get an answer',
                 'Discussion: We are examining the possibility of this solution'],
        'in': [[2], [3]],
        'tar': [[6], [2, 5, 6, 7, 8]]}
df = pd.DataFrame(data)
cols = list(df.columns)[1:]
for idx, row in df.iterrows():
    # Split the text on spaces
    temp = list(row['text'].split())
    for col in cols:
        # Initialise data with <IN> or <TAR> based on column
        data = f'<{col.upper()}>'
        for i, value in enumerate(row[col]):
            # Append the word at position 'value' in 'temp' to 'data'
            # Eg: <IN>an
            data += temp[value]
            # If the next value is a consecutive number, replace the current word in text
            # with data obtained so far and continue
            if len(row[col]) > i + 1 and row[col][i + 1] - value == 1:
                # example 'possibility' will be replaced by '<TAR>possibility'
                temp[value] = data
                data = ''
                continue
            else:
                # If next index is not consecutive, append </IN> or </TAR> to 'data' based on column
                # Eg: <IN>an</IN>
                data += f'</{col.upper()}>'
                # Replace for eg. 'an' with '<IN>an</IN>'
                temp[value] = data
                data = f'<{col.upper()}>'

    # Write back through the DataFrame; assigning to 'row' may only change a copy
    df.at[idx, 'text'] = ' '.join(temp)
print(df.text.to_list())

Output:

['This is <IN>an</IN> example text that <TAR>I</TAR> use in order to get an answer', 'Discussion: We <TAR>are</TAR> <IN>examining</IN> the <TAR>possibility of this solution</TAR>']
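The same idea can also be sketched without pandas by first splitting each position list into runs of consecutive numbers and then wrapping each run in a single tag. This is only an illustration under some assumptions: group_runs and tag_text are hypothetical helper names, the positions are valid 0-based indexes into the split text, and the positions of different columns do not overlap.

```python
def group_runs(positions):
    """Split positions into runs of consecutive integers."""
    runs = []
    for pos in sorted(positions):
        if runs and pos == runs[-1][-1] + 1:
            runs[-1].append(pos)   # extends the current run
        else:
            runs.append([pos])     # starts a new run
    return runs


def tag_text(text, spans):
    """spans maps a column name (e.g. 'in', 'tar') to a list of word positions."""
    words = text.split()
    for name, positions in spans.items():
        tag = name.upper()
        for run in group_runs(positions):
            # Open the tag on the first word of the run, close it on the last.
            words[run[0]] = f"<{tag}>{words[run[0]]}"
            words[run[-1]] = f"{words[run[-1]]}</{tag}>"
    return " ".join(words)


result = tag_text(
    "Discussion: We are examining the possibility of this solution",
    {"in": [3], "tar": [2, 5, 6, 7, 8]},
)
print(result)
```

For the dataframe in the question, tag_text could be called once per row with that row's text and its 'in' and 'tar' lists.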