I am trying to understand how to select only those rows in my dataframe that are between two specific rows. These rows contain two specific strings in one of the columns. I will explain further with this example.
I have the following dataframe:
String Value
-------------------------
0 Blue 45
1 Red 35
2 Green 75
3 Start 65
4 Orange 33
5 Purple 65
6 Teal 34
7 Indigo 44
8 End 32
9 Yellow 22
10 Red 14
There is only one instance of "Start" and only one instance of "End" in the "String" column. I only want the rows of this dataframe that are between the rows that contain "Start" and "Stop" in the "String" column, and so I want to produce this output dataframe:
String Value
-------------------------
3 Start 65
4 Orange 33
5 Purple 65
6 Teal 34
7 Indigo 44
8 End 32
Also, I want to preserve the order of those rows I am preserving, and so preserving the order of "Start", "Orange", "Purple", "Teal", "Indigo", "End".
I know I can index these specific columns by doing:
index_start = df.index[df['String'] == 'Start']
index_end = df.index[df['String'] == 'End']
But I am not sure how to actually filter out all rows that are not between these two strings. How can I accomplish this in python?
CodePudding user response:
If both values are present you temporarily set "String" as index:
df.set_index('String').loc['Start':'End'].reset_index()
output:
String Value
0 Start 65
1 Orange 33
2 Purple 65
3 Teal 34
4 Indigo 44
5 End 32
Alternatively, using isin
(then the order of Start/End doesn't matter):
m = df['String'].isin(['Start', 'End']).cumsum().eq(1)
df[m|m.shift()]
output:
String Value
3 Start 65
4 Orange 33
5 Purple 65
6 Teal 34
7 Indigo 44
8 End 32
CodePudding user response:
This should be enough, iloc[] is useful when you try to locate rows by index, and it works the same as slices in lists.
index_start = df.index[df['String'] == 'Start']
index_end = df.index[df['String'] == 'End']
df.iloc[index_start[0]:index_end[0] 1]
More information: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html
CodePudding user response:
You can build a boolean mask using eq
cummax
and filter:
out = df[df['String'].eq('Start').cummax() & df.loc[::-1, 'String'].eq('End').cummax()]
Output:
String Value
3 Start 65
4 Orange 33
5 Purple 65
6 Teal 34
7 Indigo 44
8 End 32
CodePudding user response:
As you return the index values through your work:
df.iloc[index_start.item(): index_end.item()]