I have a dataframe:
a b c d
0 M11 18 1 1.00
1 M11 17 1 1.00
2 M12 16 2 0.90
3 M12 15 2 0.90
4 S11 15 2 0.90
5 S2 13 3 0.85
6 M21 14 3 0.80
7 S21 14 3 0.80
8 M2 13 3 0.70
9 M22 12 3 0.70
10 S22 12 3 0.70
11 M31 11 4 0.65
12 S31 11 4 0.65
13 M3 10 4 0.60
14 A31 10 4 0.60
15 M32 9 4 0.55
16 S32 9 4 0.55
17 M41 8 4 0.50
18 S41 8 4 0.50
19 M42 7 4 0.45
20 S42 7 4 0.45
21 M43 6 4 0.40
22 S43 6 4 0.40
23 A51 5 4 0.35
24 A52 5 4 0.35
25 T51 5 4 0.35
26 T52 5 4 0.35
27 A53 4 4 0.30
28 T53 4 4 0.30
I wanto to print 'd' values between spesific values on 'a' column
I tried
df.loc[df['a'].between('T52', 'T53'), 'd']
to print 'd' between row 'a' T52 and T53
I expected
26 0.35
27 0.30
28 0.30
but the actually resulted is:
26 0.35
28 0.30
why the 27th row didnt included? I've tried the other range, some of them also not resulted as expected. is this a problem with indexing?
CodePudding user response:
df['a'].between('T52', 'T53')
Will check if each individual value in a
is lexicographically between T51
and T53
, independently of the relative order of the rows
You might rather want:
m = df['a'].isin(['T52', 'T53'])
df[m.cummax() & m[::-1].cummax()]
Output:
a b c d
26 T52 5 4 0.35
27 A53 4 4 0.30
28 T53 4 4 0.30
Intermediates:
a b c d between m cummax reversed_cummax &
0 M11 18 1 1.00 False False False True False
1 M11 17 1 1.00 False False False True False
2 M12 16 2 0.90 False False False True False
3 M12 15 2 0.90 False False False True False
4 S11 15 2 0.90 False False False True False
5 S2 13 3 0.85 False False False True False
6 M21 14 3 0.80 False False False True False
7 S21 14 3 0.80 False False False True False
8 M2 13 3 0.70 False False False True False
9 M22 12 3 0.70 False False False True False
10 S22 12 3 0.70 False False False True False
11 M31 11 4 0.65 False False False True False
12 S31 11 4 0.65 False False False True False
13 M3 10 4 0.60 False False False True False
14 A31 10 4 0.60 False False False True False
15 M32 9 4 0.55 False False False True False
16 S32 9 4 0.55 False False False True False
17 M41 8 4 0.50 False False False True False
18 S41 8 4 0.50 False False False True False
19 M42 7 4 0.45 False False False True False
20 S42 7 4 0.45 False False False True False
21 M43 6 4 0.40 False False False True False
22 S43 6 4 0.40 False False False True False
23 A51 5 4 0.35 False False False True False
24 A52 5 4 0.35 False False False True False
25 T51 5 4 0.35 False False False True False
26 T52 5 4 0.35 True True True True True
27 A53 4 4 0.30 False False True True True
28 T53 4 4 0.30 True True True True True
Alternatively, if you have several occurrences of T52/T53 and want all the rows in between for each, use:
m = df['a'].map({'T52': True, 'T53': False}).ffill().fillna(False)
out = df.loc[m|m.shift()]
Example input:
a b c d
0 M11 18 1 1.00
1 M11 17 1 1.00
2 M12 16 2 0.90
3 M12 15 2 0.90
4 T52 15 2 0.90 #
5 S2 13 3 0.85 # Group 1
6 M21 14 3 0.80 #
7 T53 14 3 0.80 #
8 M2 13 3 0.70
9 S42 7 4 0.45
10 M43 6 4 0.40
11 S43 6 4 0.40
12 A51 5 4 0.35
13 A52 5 4 0.35
14 T51 5 4 0.35
15 T52 5 4 0.35 #
16 A53 4 4 0.30 # Group 2
17 T53 4 4 0.30 #
Output:
a b c d
4 T52 15 2 0.90
5 S2 13 3 0.85
6 M21 14 3 0.80
7 T53 14 3 0.80
15 T52 5 4 0.35
16 A53 4 4 0.30
17 T53 4 4 0.30
CodePudding user response:
If always match start
and end
values in data is possible filter by DataFrame.loc
with compare start and end value with first match value by Series.idxmax
:
start = 'T52'
end = 'T53'
df = df.loc[df['a'].eq(start).idxmax() : df['a'].eq(end).idxmax(), 'd']
print (df)
26 0.35
27 0.30
28 0.30
Name: d, dtype: float64
CodePudding user response:
between doesn't give you the rows between those values, it gives you the rows alphabetically between those values. Just run
df['a'].between('T52','T53')
and see which rows return True
docs: https://pandas.pydata.org/docs/reference/api/pandas.Series.between.html
see the bottom example in the docs for a relevant example.