I have a dataframe with 3 columns: 'text', 'in', 'tar', of types (str, list, list) respectively.
                                                   text   in     tar
0    This is an example text that I use in order to ...  [2]     [6]
1  Discussion: We are examining the possibility of ...   [3]  [6, 7]
'in' and 'tar' represent specific entities that I want to tag in the text. Each list holds the positions of the found entity terms in the text, counted in words starting from 1.
For example, at the 2nd row of the dataframe, where in = [3], I want to take the 3rd word from the text column (i.e. "are") and label it as <IN>are</IN>.
Similarly, for the same row, since tar = [6, 7], I also want to take the 6th and 7th words from the text column (i.e. "possibility", "of") and label them as <TAR>possibility</TAR>, <TAR>of</TAR>.
Can someone show me how to do this?
CodePudding user response:
This is not the most optimal implementation, but it should be a useful starting point.
import pandas as pd

data = {'text': ['This is an example text that I use in order to',
                 'Discussion: We are examining the possibility of the'],
        'in': [[2], [3]],
        'tar': [[6], [6, 7]]}
df = pd.DataFrame(data)
cols = list(df.columns)[1:]  # the tag columns: 'in' and 'tar'
new_text = []
for idx, row in df.iterrows():
    temp = row['text'].split()
    # positions in 'in'/'tar' are 1-based, so count words starting from 1
    for pos, word in enumerate(temp, start=1):
        for col in cols:
            if pos in row[col]:
                temp[pos - 1] = f'<{col.upper()}>{word}</{col.upper()}>'
    new_text.append(' '.join(temp))
df['text'] = new_text
print(df.text.to_list())
output:
['This <IN>is</IN> an example text <TAR>that</TAR> I use in order to',
'Discussion: We <IN>are</IN> examining the <TAR>possibility</TAR> <TAR>of</TAR> the']
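If you would rather keep the tagging logic in one place, here is a minimal sketch that moves it into a small helper. The names tag_words and positions_by_tag are made up for illustration. It assumes the positions are 1-based, as in the question, and that words are separated by whitespace. Positions outside the text are silently skipped.

```python
def tag_words(text, positions_by_tag):
    """Wrap the words at the given 1-based positions in <TAG>...</TAG>.

    positions_by_tag is a hypothetical mapping such as {'in': [3], 'tar': [6, 7]}.
    """
    words = text.split()
    tagged = list(words)  # copy, so each tag wraps the original word
    for tag, positions in positions_by_tag.items():
        for pos in positions:
            if 1 <= pos <= len(words):  # ignore out-of-range positions
                label = tag.upper()
                tagged[pos - 1] = f"<{label}>{words[pos - 1]}</{label}>"
    return " ".join(tagged)


# Same sample rows as above, as plain Python data.
rows = [
    {"text": "This is an example text that I use in order to",
     "in": [2], "tar": [6]},
    {"text": "Discussion: We are examining the possibility of the",
     "in": [3], "tar": [6, 7]},
]

results = [tag_words(r["text"], {"in": r["in"], "tar": r["tar"]}) for r in rows]
for line in results:
    print(line)
```

With the dataframe itself, the same helper could be applied row-wise, for example df.apply(lambda row: tag_words(row['text'], {'in': row['in'], 'tar': row['tar']}), axis=1). If a position appears in both lists, the tag processed last wins.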