Home > Enterprise >  How to filter dataframe columns between two rows that contain specific string in column?
How to filter dataframe columns between two rows that contain specific string in column?

Time:04-22

I am trying to understand how to select only those rows in my dataframe that are between two specific rows. These rows contain two specific strings in one of the columns. I will explain further with this example.

I have the following dataframe:

       String      Value
-------------------------
 0       Blue         45     
 1        Red         35   
 2      Green         75    
 3      Start         65   
 4     Orange         33   
 5     Purple         65   
 6       Teal         34
 7     Indigo         44
 8        End         32
 9     Yellow         22 
10        Red         14

There is only one instance of "Start" and only one instance of "End" in the "String" column. I only want the rows of this dataframe that are between the rows that contain "Start" and "Stop" in the "String" column, and so I want to produce this output dataframe:

       String      Value
-------------------------  
 3      Start         65   
 4     Orange         33   
 5     Purple         65   
 6       Teal         34
 7     Indigo         44
 8        End         32

Also, I want to preserve the order of those rows I am preserving, and so preserving the order of "Start", "Orange", "Purple", "Teal", "Indigo", "End".

I know I can index these specific columns by doing:

index_start = df.index[df['String'] == 'Start']
index_end = df.index[df['String'] == 'End']    

But I am not sure how to actually filter out all rows that are not between these two strings. How can I accomplish this in python?

CodePudding user response:

If both values are present you temporarily set "String" as index:

df.set_index('String').loc['Start':'End'].reset_index()

output:

   String  Value
0   Start     65
1  Orange     33
2  Purple     65
3    Teal     34
4  Indigo     44
5     End     32

Alternatively, using isin (then the order of Start/End doesn't matter):

m = df['String'].isin(['Start', 'End']).cumsum().eq(1)
df[m|m.shift()]

output:

   String  Value
3   Start     65
4  Orange     33
5  Purple     65
6    Teal     34
7  Indigo     44
8     End     32

CodePudding user response:

This should be enough, iloc[] is useful when you try to locate rows by index, and it works the same as slices in lists.

index_start = df.index[df['String'] == 'Start']
index_end = df.index[df['String'] == 'End']  
df.iloc[index_start[0]:index_end[0] 1]

More information: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html

CodePudding user response:

You can build a boolean mask using eq cummax and filter:

out = df[df['String'].eq('Start').cummax() & df.loc[::-1, 'String'].eq('End').cummax()]

Output:

   String  Value
3   Start     65
4  Orange     33
5  Purple     65
6    Teal     34
7  Indigo     44
8     End     32

CodePudding user response:

As you return the index values through your work:

df.iloc[index_start.item(): index_end.item()]
  • Related