I am trying to search a text for chapters and then extract the text chapter by chapter. My search array returns the chapter name and the start and end positions in the text. It looks like this:
SearchTerm Start End
0 ITEM 1. 7219 47441.0
1 ITEM 2. 47441 57712.0
2 ITEM 3. 57712 76730.0
3 ITEM 4. 76730 106927.0
4 ITEM 5. 106927 111973.0
5 ITEM 6. 111973 120362.0
6 ITEM 7. 120362 237727.0
7 ITEM 8. 237727 830655.0
8 ITEM 9. 830655 833033.0
9 ITEM 10. 833033 833709.0
10 ITEM 11. 833709 834662.0
11 ITEM 12. 834662 846594.0
12 ITEM 13. 846594 847172.0
13 ITEM 14. 847172 849550.0
14 ITEM 15. 849550 877408.0
15 Item 15. 877408 913873.0
16 ITEM 1. 913873 914661.0
17 ITEM 2. 914661 914735.0
18 ITEM 3. 914735 914816.0
19 ITEM 4. 914816 915164.0
20 ITEM 6. 915164 915290.0
21 ITEM 7. 915290 915640.0
22 ITEM 8. 915640 917398.0
23 ITEM 9. 917398 917637.0
24 ITEM 10. 917637 917752.0
25 ITEM 11. 917752 917878.0
26 ITEM 12. 917878 918005.0
27 ITEM 13. 918005 918116.0
28 ITEM 14. 918116 918316.0
29 ITEM 15. 918316 919863.0
It contains duplicates because my search finds both the table of contents and the chapters themselves, so I want to drop the duplicates and keep the last entries.
I have tried:
df2= matches_array.drop_duplicates(subset=["SearchTerm"],keep='last',inplace=True)
df2= matches_array.drop_duplicates(subset=["SearchTerm"],keep='last',inplace=False)
matches_array.drop_duplicates(subset=["SearchTerm"],keep="last",inplace=False)
matches_array.drop_duplicates(subset=['SearchTerm'],keep='last',inplace=True)
and several other variations with ignore_index, but I cannot get it to work. What am I doing wrong?
Edit: here is the underlying data as a dict:
{'SearchTerm': ['ITEM\xa01.', 'ITEM\xa02.', 'ITEM\xa03.',
'ITEM\xa04.', 'ITEM\xa05.', 'ITEM\xa06.', 'ITEM\xa07.',
'ITEM\xa08.', 'ITEM\xa09.', 'ITEM\xa010.', 'ITEM\xa011.',
'ITEM\xa012.', 'ITEM\xa013.', 'ITEM\xa014.', 'ITEM\xa015.',
'Item\xa015.', 'ITEM 1.', 'ITEM 2.', 'ITEM 3.', 'ITEM 4.',
'ITEM 6.', 'ITEM 7.', 'ITEM 8.', 'ITEM 9.', 'ITEM 10.',
'ITEM 11.', 'ITEM 12.', 'ITEM 13.', 'ITEM 14.', 'ITEM 15.'], 'Start': [7219, 47441, 57712, 76730, 106927, 111973,
120362, 237727, 830655, 833033, 833709, 834662, 846594,
847172, 849550, 877408, 913873, 914661, 914735, 914816,
915164, 915290, 915640, 917398, 917637, 917752, 917878,
918005, 918116, 918316], 'End': [47441.0, 57712.0, 76730.0, 106927.0, 111973.0, 120362.0, 237727.0, 830655.0, 833033.0, 833709.0, 834662.0, 846594.0, 847172.0, 849550.0, 877408.0, 913873.0, 914661.0, 914735.0, 914816.0, 915164.0, 915290.0, 915640.0, 917398.0, 917637.0, 917752.0, 917878.0, 918005.0, 918116.0, 918316.0, 919863.0]}
CodePudding user response:
IIUC, you can replace your \xa0 characters (non-breaking spaces) with normal spaces and make the strings upper case:
# normalize non-breaking spaces and casing so matching terms compare equal
df['SearchTerm'] = df['SearchTerm'].str.replace('\xa0', ' ', regex=False).str.upper()
# identical terms are now true duplicates; keep the last occurrence of each
df2 = df.drop_duplicates(subset=["SearchTerm"], keep='last')
Output:
SearchTerm Start End
4 ITEM 5. 106927 111973.0
16 ITEM 1. 913873 914661.0
17 ITEM 2. 914661 914735.0
18 ITEM 3. 914735 914816.0
19 ITEM 4. 914816 915164.0
20 ITEM 6. 915164 915290.0
21 ITEM 7. 915290 915640.0
22 ITEM 8. 915640 917398.0
23 ITEM 9. 917398 917637.0
24 ITEM 10. 917637 917752.0
25 ITEM 11. 917752 917878.0
26 ITEM 12. 917878 918005.0
27 ITEM 13. 918005 918116.0
28 ITEM 14. 918116 918316.0
29 ITEM 15. 918316 919863.0
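For completeness, there is a second problem hiding in the original attempts: drop_duplicates with inplace=True returns None, so df2 = matches_array.drop_duplicates(..., inplace=True) leaves df2 as None even though matches_array itself is modified in place. A minimal sketch of that behavior:

import pandas as pd

df = pd.DataFrame({'SearchTerm': ['ITEM 1.', 'ITEM 1.'], 'Start': [0, 10]})

# inplace=True mutates df and returns None, so the assignment stores None
result = df.drop_duplicates(subset=['SearchTerm'], keep='last', inplace=True)
print(result)   # None
print(df)       # df itself was deduplicated in place

So either use inplace=True without assigning, or (more idiomatic) drop inplace and assign the returned copy. Note that even with the assignment fixed, the duplicates would not have been detected before normalizing the \xa0 characters, which is the main point here. If the whitespace in the source text is less predictable than a single \xa0, a regex normalization such as df['SearchTerm'].str.replace(r'\s+', ' ', regex=True).str.upper() covers tabs and repeated spaces as well.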
CodePudding user response:
How about using groupby to pick the last entry for each term? Sort by Start, then take the last row in each group (you could equally keep the row with the max End, if that is what you want):
df = pd.read_csv("/tmp/Book2.csv")
# after sorting, the last row per group is the entry with the largest Start
df.sort_values(by=['Search Term', 'Start']).groupby('Search Term').last()
Search Term Start End
ITEM 1 913873 914661
ITEM 10 917637 917752
ITEM 11 917752 917878
ITEM 12 917878 918005
ITEM 13 918005 918116
ITEM 14 918116 918316
ITEM 15 918316 919863
ITEM 2 914661 914735
ITEM 3 914735 914816
ITEM 4 914816 915164
ITEM 5 106927 111973
ITEM 6 915164 915290
ITEM 7 915290 915640
ITEM 8 915640 917398
ITEM 9 917398 917637
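Either way, once the duplicates are gone, the original goal of extracting the text chapter by chapter is just slicing the source string with each row's offsets. A minimal sketch, assuming the full document lives in a string called text (a hypothetical name, not from the question) and df2 is the deduplicated frame from the first answer:

# text: the full source document as one string (hypothetical variable)
# End is stored as float in the posted data, so cast it before slicing
chapters = {
    row.SearchTerm: text[int(row.Start):int(row.End)]
    for row in df2.itertuples(index=False)
}

print(chapters['ITEM 1.'][:100])  # first 100 characters of chapter 1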