I have a pandas DataFrame containing information in the following format:
sentence_num | sent_word | tag | word_char | word_index |
---|---|---|---|---|
0 | foo | B-foo | f | 1 |
0 | foo | B-foo | o | 1 |
0 | foo | B-foo | o | 1 |
0 | [ ] | B-ws | [ ] | 2 |
0 | bar | B-bar | b | 3 |
0 | bar | B-bar | a | 3 |
0 | bar | B-bar | r | 3 |
1 | john | B-name | j | 1 |
1 | john | B-name | o | 1 |
1 | john | B-name | h | 1 |
1 | john | B-name | n | 1 |
1 | [ ] | B-ws | [ ] | 2 |
1 | doe | B-sur | d | 3 |
1 | doe | B-sur | o | 3 |
1 | doe | B-sur | e | 3 |
I want to rename the tags (B- becomes I-) for every character that is not the first in its word:
sentence_num | sent_word | tag | word_char | word_index |
---|---|---|---|---|
0 | foo | B-foo | f | 1 |
0 | foo | I-foo | o | 1 |
0 | foo | I-foo | o | 1 |
0 | [ ] | B-ws | [ ] | 2 |
0 | bar | B-bar | b | 3 |
0 | bar | I-bar | a | 3 |
0 | bar | I-bar | r | 3 |
1 | john | B-name | j | 1 |
1 | john | I-name | o | 1 |
1 | john | I-name | h | 1 |
1 | john | I-name | n | 1 |
1 | [ ] | B-ws | [ ] | 2 |
1 | doe | B-sur | d | 3 |
1 | doe | I-sur | o | 3 |
1 | doe | I-sur | e | 3 |
Since the word index repeats and the sentence number alone does not help much, I am not sure how to group the data so that I can reach the elements I want to edit.
CodePudding user response:
Use boolean indexing:
# is word_char different from the first letter of sent_word?
# and is sent_word not the whitespace token "[ ]"?
m = (df['sent_word'].str[0].ne(df['word_char'])
     & df['sent_word'].ne('[ ]'))
# for those rows, change the leading B into I
df.loc[m, 'tag'] = 'I' + df.loc[m, 'tag'].str[1:]
Output:
    sentence_num  sent_word     tag  word_char  word_index
 0             0        foo   B-foo          f           1
 1             0        foo   I-foo          o           1
 2             0        foo   I-foo          o           1
 3             0        [ ]    B-ws        [ ]           2
 4             0        bar   B-bar          b           3
 5             0        bar   I-bar          a           3
 6             0        bar   I-bar          r           3
 7             1       john  B-name          j           1
 8             1       john  I-name          o           1
 9             1       john  I-name          h           1
10             1       john  I-name          n           1
11             1        [ ]    B-ws        [ ]           2
12             1        doe   B-sur          d           3
13             1        doe   I-sur          o           3
14             1        doe   I-sur          e           3
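One caveat: comparing each character with the first letter of the word leaves a later character tagged B- whenever it happens to equal that first letter. A position-based variant avoids this. The sketch below is only one possible approach, and it assumes that word_index is unique per word within a sentence and that the rows are already in character order; the word "bob" is made-up sample data chosen to show the repeated-letter case.

```python
import pandas as pd

# Toy frame (hypothetical data): "bob" repeats its first letter.
df = pd.DataFrame({
    "sentence_num": [0, 0, 0, 0, 0, 0, 0],
    "sent_word": ["bob", "bob", "bob", "[ ]", "bar", "bar", "bar"],
    "tag": ["B-name", "B-name", "B-name", "B-ws", "B-bar", "B-bar", "B-bar"],
    "word_char": ["b", "o", "b", "[ ]", "b", "a", "r"],
    "word_index": [1, 1, 1, 2, 3, 3, 3],
})

# Position of each row inside its (sentence, word) group:
# 0 for the first character, 1, 2, ... for the following ones.
pos = df.groupby(["sentence_num", "word_index"]).cumcount()

# Every row that is not the first of its group gets the I prefix.
m = pos.gt(0)
df.loc[m, "tag"] = "I" + df.loc[m, "tag"].str[1:]

print(df["tag"].tolist())
# ['B-name', 'I-name', 'I-name', 'B-ws', 'B-bar', 'I-bar', 'I-bar']
```

Because the "[ ]" rows form single-row groups, they keep their B-ws tag without needing a separate condition.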