Home > Software engineering >  Select all rows containing NaN until a column with value appears
Select all rows containing NaN until a column with value appears

Time:04-04

So I have a Dataframe that looks like this:

ValueA Value B
a 1
NaN 2
NaN 3
b 4
NaN 5
NaN 6
c 7
NaN 8
NaN 9

I want to create a DataFrame, in which all values between b and c are stored - including b, excluding c.

The expected output is as follows:

ValueA Value B
b 4
NaN 5
NaN 6

CodePudding user response:

I think this will do what your question asks:

import pandas as pd
import numpy as np
import math
df = pd.DataFrame({
    'ValueA' : ['a', np.nan, np.nan, 'b', np.nan, np.nan, 'c', np.nan, np.nan],
    'Value B' : list(range(1, 10))})
print(df)

start = df[df['ValueA'] == 'b'].index.tolist()[0]
end = start   1
while end < df.shape[0] and type(df['ValueA'].loc[end]) is not str and math.isnan(df['ValueA'].loc[end]):
    end  = 1
df2 = df.loc[start:end - 1].reset_index(drop=True, inplace=False)
print(f"\n{df2}")

Output:

  ValueA  Value B
0      a        1
1    NaN        2
2    NaN        3
3      b        4
4    NaN        5
5    NaN        6
6      c        7
7    NaN        8
8    NaN        9

  ValueA  Value B
0      b        4
1    NaN        5
2    NaN        6

CodePudding user response:

Use two masks:

# rows after the first b
m1 = df['ValueA'].eq('b').cummax()
# rows after the first c
m2 = df['ValueA'].eq('c').cummax()

# rows between b and c excluded
df[m1&~m2]

Another approach, which however will give the rows between the first b/c (included) and b/c (excluded). This will also work between two b, two c, or if c comes before b!

df[df['ValueA'].isin(('b', 'c')).cumsum().eq(1)]

output:

  ValueA  ValueB
3      b       4
4    NaN       5
5    NaN       6

CodePudding user response:

If you have the "default" index (composed of consecutive integers, starting from 0), then one of possible solutions is to find indices of the "wanted" rows:

ind1, ind2 = df.index[df['ValueA'].isin(['b', 'c'])].tolist()

Then just retrieve rows from this range (excluding the right border):

result = df[ind1:ind2]

The result is:

  ValueA  Value B
3      b        4
4    NaN        5
5    NaN        6

This code will work also if both "border" values are not in lexicographic order. Values of Value1 contained possibly between both border rows, which are "< 'b'" or "> 'c'", will not do any harm, which is not true in the case of one of other answers.

Note that df.loc[ind1:ind2] will get also the row with ValueA == 'c'.

  • Related