I have a dataframe with 3 columns, 'text', 'in' and 'tar', of type (str, list, list) respectively.
   text                                                            in   tar
0  This is an example text that I use in order to get an answer    [2]  [6]
1  Discussion: We are examining the possibility of this solution.  [3]  [6, 7, 8]
'in' and 'tar' represent specific entities that I want to tag in the text, and they hold the position of each found entity term in the text.
For example, at the 2nd row of the dataframe, where in = [3], I take the word at position 3 of the text column, counting from 0 (i.e. "examining"), and label it as <IN>examining</IN>.
Similarly, for the same row, since tar = [6, 7, 8], I get <TAR>of</TAR>, <TAR>this</TAR>, <TAR>solution</TAR>.
But what I want is, when there are consecutive positions (e.g. [1, 2, 3] or [6, 7, 8]), to label them together in one tag, such as <TAR>of this solution</TAR>.
I only want to do this when the positions are consecutive (i.e.: [1,2,3]), not when they aren't (i.e. [1,3,5]).
This is what I have so far:
import pandas as pd

data = {'text': ['This is an example text that I use in order to get an answer',
                 'Discussion: We are examining the possibility of this solution'],
        'in': [[2], [3]],
        'tar': [[6], [6, 7, 8]]}
df = pd.DataFrame(data)
cols = list(df.columns)[1:]
new_text = []
for idx, row in df.iterrows():
    temp = row['text'].split()
    for pos, word in enumerate(temp):
        for col in cols:
            # Wrap every listed position in its own tag
            if pos in row[col]:
                temp[pos] = f'<{col.upper()}>{word}</{col.upper()}>'
    new_text.append(' '.join(temp))
df['text'] = new_text
print(df.text.to_list())
output:
['This is <IN>an</IN> example text that <TAR>I</TAR> use in order to get an answer',
'Discussion: We are <IN>examining</IN> the possibility <TAR>of</TAR> <TAR>this</TAR> <TAR>solution</TAR>']
Desired output:
['Discussion: We are <IN>examining</IN> the possibility <TAR>of this solution</TAR>']
Can someone help?
CodePudding user response:
As a quick fix to your existing code, you could just replace every "</TAR> <TAR>" with a single space.
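A minimal sketch of that replacement, assuming the tagged words are joined with single spaces as in the question's code. The helper name merge_adjacent_tags and its tags parameter are made up for illustration:

```python
def merge_adjacent_tags(text, tags=("IN", "TAR")):
    """Collapse '</TAG> <TAG>' between neighbouring tagged words into one space."""
    for tag in tags:
        text = text.replace(f"</{tag}> <{tag}>", " ")
    return text


tagged = ("Discussion: We are <IN>examining</IN> the possibility "
          "<TAR>of</TAR> <TAR>this</TAR> <TAR>solution</TAR>")
merged = merge_adjacent_tags(tagged)
print(merged)
```

Two tags of the same kind can only end up next to each other when their positions are consecutive, so tags at positions like [1, 3, 5] are left alone.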
CodePudding user response:
One of the approaches:
import pandas as pd

data = {'text': ['This is an example text that I use in order to get an answer',
                 'Discussion: We are examining the possibility of this solution'],
        'in': [[2], [3]],
        'tar': [[6], [2, 5, 6, 7, 8]]}
df = pd.DataFrame(data)
cols = list(df.columns)[1:]
for idx, row in df.iterrows():
    # Split the text on whitespace
    temp = row['text'].split()
    for col in cols:
        # Initialise data with <IN> or <TAR> based on column
        data = f'<{col.upper()}>'
        for i, value in enumerate(row[col]):
            # Append the word at position 'value' in 'temp' to 'data'
            # Eg: <IN>an
            data += temp[value]
            # If the next value is a consecutive number, replace the word in text
            # with data obtained so far and continue
            if len(row[col]) > i + 1 and row[col][i + 1] - value == 1:
                # example 'possibility' will be replaced by '<TAR>possibility'
                temp[value] = data
                data = ''
                continue
            else:
                # If next index is not consecutive, append </IN> or </TAR> to 'data' based on column
                # Eg: <IN>an</IN>
                data += f'</{col.upper()}>'
                # Replace for eg. 'an' with '<IN>an</IN>'
                temp[value] = data
                data = f'<{col.upper()}>'
    # Write back through the dataframe; assigning to 'row' may only change a copy
    df.at[idx, 'text'] = ' '.join(temp)
print(df.text.to_list())
Output:
['This is <IN>an</IN> example text that <TAR>I</TAR> use in order to get an answer', 'Discussion: We <TAR>are</TAR> <IN>examining</IN> the <TAR>possibility of this solution</TAR>']
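Another way to look at the same idea is to first split each position list into runs of consecutive numbers, then put the opening tag on the first word of a run and the closing tag on the last. This is only a sketch under a few assumptions: positions are zero-based and within range, and the helper names group_runs and tag_text are made up for illustration. It works on plain strings, so it does not depend on pandas:

```python
def group_runs(positions):
    """Split positions into lists of consecutive integers, e.g. [2, 5, 6] -> [[2], [5, 6]]."""
    runs = []
    for pos in sorted(positions):
        if runs and pos == runs[-1][-1] + 1:
            runs[-1].append(pos)   # extends the current run
        else:
            runs.append([pos])     # starts a new run
    return runs


def tag_text(text, spans):
    """Tag words in 'text'; 'spans' maps a column name to zero-based word positions."""
    words = text.split()
    for col, positions in spans.items():
        tag = col.upper()
        for run in group_runs(positions):
            # Opening tag on the first word of the run, closing tag on the last
            words[run[0]] = f"<{tag}>{words[run[0]]}"
            words[run[-1]] = f"{words[run[-1]]}</{tag}>"
    return " ".join(words)


result = tag_text("Discussion: We are examining the possibility of this solution",
                  {"in": [3], "tar": [2, 5, 6, 7, 8]})
print(result)
```

With a dataframe like the one above, tag_text could be applied per row, for example with df.apply(..., axis=1), passing the row's 'in' and 'tar' lists as spans.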