I am trying to search a text for chapters and then extract the text chapter by chapter. My search array returns the chapter name and the start and end positions in the text. It looks like this:
SearchTerm Start End
0 ITEM 1. 7219 47441.0
1 ITEM 2. 47441 57712.0
2 ITEM 3. 57712 76730.0
3 ITEM 4. 76730 106927.0
4 ITEM 5. 106927 111973.0
5 ITEM 6. 111973 120362.0
6 ITEM 7. 120362 237727.0
7 ITEM 8. 237727 830655.0
8 ITEM 9. 830655 833033.0
9 ITEM 10. 833033 833709.0
10 ITEM 11. 833709 834662.0
11 ITEM 12. 834662 846594.0
12 ITEM 13. 846594 847172.0
13 ITEM 14. 847172 849550.0
14 ITEM 15. 849550 877408.0
15 Item 15. 877408 913873.0
16 ITEM 1. 913873 914661.0
17 ITEM 2. 914661 914735.0
18 ITEM 3. 914735 914816.0
19 ITEM 4. 914816 915164.0
20 ITEM 6. 915164 915290.0
21 ITEM 7. 915290 915640.0
22 ITEM 8. 915640 917398.0
23 ITEM 9. 917398 917637.0
24 ITEM 10. 917637 917752.0
25 ITEM 11. 917752 917878.0
26 ITEM 12. 917878 918005.0
27 ITEM 13. 918005 918116.0
28 ITEM 14. 918116 918316.0
29 ITEM 15. 918316 919863.0
It contains duplicates because my search finds both the table of contents and the chapters themselves, so I want to drop the duplicates and keep the last entries.
I have tried:
df2= matches_array.drop_duplicates(subset=["SearchTerm"],keep='last',inplace=True)
df2= matches_array.drop_duplicates(subset=["SearchTerm"],keep='last',inplace=False)
matches_array.drop_duplicates(subset=["SearchTerm"],keep="last",inplace=False)
matches_array.drop_duplicates(subset=['SearchTerm'],keep='last',inplace=True)
and several other variations with ignore_index, but I cannot get it to work. What am I doing wrong?
Edit: here is the underlying data as a dict:
{'SearchTerm': ['ITEM\xa01.', 'ITEM\xa02.', 'ITEM\xa03.',
'ITEM\xa04.', 'ITEM\xa05.', 'ITEM\xa06.', 'ITEM\xa07.',
'ITEM\xa08.', 'ITEM\xa09.', 'ITEM\xa010.', 'ITEM\xa011.',
'ITEM\xa012.', 'ITEM\xa013.', 'ITEM\xa014.', 'ITEM\xa015.',
'Item\xa015.', 'ITEM 1.', 'ITEM 2.', 'ITEM 3.', 'ITEM 4.',
'ITEM 6.', 'ITEM 7.', 'ITEM 8.', 'ITEM 9.', 'ITEM 10.',
'ITEM 11.', 'ITEM 12.', 'ITEM 13.', 'ITEM 14.', 'ITEM 15.'], 'Start': [7219, 47441, 57712, 76730, 106927, 111973,
120362, 237727, 830655, 833033, 833709, 834662, 846594,
847172, 849550, 877408, 913873, 914661, 914735, 914816,
915164, 915290, 915640, 917398, 917637, 917752, 917878,
918005, 918116, 918316], 'End': [47441.0, 57712.0, 76730.0, 106927.0, 111973.0, 120362.0, 237727.0, 830655.0, 833033.0, 833709.0, 834662.0, 846594.0, 847172.0, 849550.0, 877408.0, 913873.0, 914661.0, 914735.0, 914816.0, 915164.0, 915290.0, 915640.0, 917398.0, 917637.0, 917752.0, 917878.0, 918005.0, 918116.0, 918316.0, 919863.0]}
CodePudding user response:
IIUC, you can replace your \xa0 characters (non-breaking spaces) with normal spaces and make the strings upper case:
# normalize non-breaking spaces and casing so matching terms compare equal
df['SearchTerm'] = df['SearchTerm'].str.replace('\xa0', ' ', regex=False).str.upper()
# identical terms are now true duplicates; keep the last occurrence of each
df2 = df.drop_duplicates(subset=["SearchTerm"], keep='last')
Output:
SearchTerm Start End
4 ITEM 5. 106927 111973.0
16 ITEM 1. 913873 914661.0
17 ITEM 2. 914661 914735.0
18 ITEM 3. 914735 914816.0
19 ITEM 4. 914816 915164.0
20 ITEM 6. 915164 915290.0
21 ITEM 7. 915290 915640.0
22 ITEM 8. 915640 917398.0
23 ITEM 9. 917398 917637.0
24 ITEM 10. 917637 917752.0
25 ITEM 11. 917752 917878.0
26 ITEM 12. 917878 918005.0
27 ITEM 13. 918005 918116.0
28 ITEM 14. 918116 918316.0
29 ITEM 15. 918316 919863.0
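For completeness, there is a second problem hiding in the original attempts: drop_duplicates with inplace=True returns None, so df2 = matches_array.drop_duplicates(..., inplace=True) leaves df2 as None even though matches_array itself is modified in place. A minimal sketch of that behavior:

import pandas as pd

df = pd.DataFrame({'SearchTerm': ['ITEM 1.', 'ITEM 1.'], 'Start': [0, 10]})

# inplace=True mutates df and returns None, so the assignment stores None
result = df.drop_duplicates(subset=['SearchTerm'], keep='last', inplace=True)
print(result)   # None
print(df)       # df itself was deduplicated in place

So either use inplace=True without assigning, or (more idiomatic) drop inplace and assign the returned copy. Note that even with the assignment fixed, the duplicates would not have been detected before normalizing the \xa0 characters, which is the main point here. If the whitespace in the source text is less predictable than a single \xa0, a regex normalization such as df['SearchTerm'].str.replace(r'\s+', ' ', regex=True).str.upper() covers tabs and repeated spaces as well.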
CodePudding user response:
How about using groupby to pick the last entry for each term? Sort by Start, then take the last row in each group (you could equally keep the row with the max End, if that is what you want):
df = pd.read_csv("/tmp/Book2.csv")
# after sorting, the last row per group is the entry with the largest Start
df.sort_values(by=['Search Term', 'Start']).groupby('Search Term').last()
Search Term Start End
ITEM 1 913873 914661
ITEM 10 917637 917752
ITEM 11 917752 917878
ITEM 12 917878 918005
ITEM 13 918005 918116
ITEM 14 918116 918316
ITEM 15 918316 919863
ITEM 2 914661 914735
ITEM 3 914735 914816
ITEM 4 914816 915164
ITEM 5 106927 111973
ITEM 6 915164 915290
ITEM 7 915290 915640
ITEM 8 915640 917398
ITEM 9 917398 917637
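Either way, once the duplicates are gone, the original goal of extracting the text chapter by chapter is just slicing the source string with each row's offsets. A minimal sketch, assuming the full document lives in a string called text (a hypothetical name, not from the question) and df2 is the deduplicated frame from the first answer:

# text: the full source document as one string (hypothetical variable)
# End is stored as float in the posted data, so cast it before slicing
chapters = {
    row.SearchTerm: text[int(row.Start):int(row.End)]
    for row in df2.itertuples(index=False)
}

print(chapters['ITEM 1.'][:100])  # first 100 characters of chapter 1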