Problem solved
Using loc
istead of iloc
solves the problem but I'm not sure why.
Medium size of dataframe (80766, 19), composed of ints, floats and dates. While my work, I noticed my results were strange. I started transforming and simplifying expressions to see where the problem was, and came into a contradiction.
Using these two lines I got the same result (as expected):
import pandas
...
data_table[data_table[col_name] == 69][col_name]
data_table.iloc[data_table.index[data_table[col_name] == 69]][col_name]
Result:
23270 69
23271 69
..
25059 69
Name: BBCH, Length: 1790, dtype: int64
But when I changed the searched value for higher, the second line gives a completely incorrect result.
data_table[data_table[col_name] == 71][col_name]
Gives good result:
39556 71
39557 71
..
41353 71
Name: BBCH, Length: 1798, dtype: int64
And for
data_table.iloc[data_table.index[data_table[col_name] == 71]][col_name]
the result is:
7336 30
7337 30
..
9133 30
Name: BBCH, Length: 1798, dtype: int64
My question is why is it that? Is it a problem with size of data?
CodePudding user response:
As long as your index is a RangeIndex
, i.e., has no gaps, you can use loc
and iloc
interchangeably, e.g.,
>>> s = pd.Series('foo', index=range(10))
>>> s
0 foo
1 foo
2 foo
3 foo
4 foo
5 foo
6 foo
7 foo
8 foo
9 foo
dtype: object
>>> s.loc[[1, 2, 7]]
1 foo
2 foo
7 foo
dtype: object
>>> s.iloc[[1, 2, 7]]
1 foo
2 foo
7 foo
dtype: object
But s.loc[[1, 2, 7]]
selects the rows that are labelled 1
, 2
, and 7
, no matter their position, while iloc
extracts the rows that are at the positional indices 1
, 2
, and 7
. If you changed the order of the rows in s
, loc
would still give the same rows, but iloc
would give whatever ends up at the second, third, and 8th row.
If you modify your data so that it's no longer a RangeIndex
(i.e., there are rows missing if you will), loc
and iloc
will give different result once they select something that follows a "missing row". Hence in the example below, with the modified s
, the rows at index 1
and 2
are still labelled as 1
and 2
, so they are selected by both loc
and iloc
, but the 8th row is no longer labelled 7
, but 9
(as we removed two rows in the middle).
>>> s = s.drop([3, 4])
>>> s
0 foo # position = 0
1 foo # 1
2 foo # 2
5 foo # 3 but label == 5!!
6 foo # 4 but label == 6
7 foo # etc.
8 foo
9 foo
dtype: object
>>> s.loc[[1, 2, 7]]
1 foo
2 foo
7 foo
dtype: object
>>> s.iloc[[1, 2, 7]]
1 foo
2 foo
9 foo # != 7 !!
dtype: object
That explains why in the first case, your result was correct, but in the second case, something caused the labels of the index to be "out of sync" with the positional values (probably some dropped rows). As you selected by subsetting the labels of .index
, you need loc
, not iloc
. (If you did a reset_index
before subsetting, iloc
would work again, because then the index would again be identical to the positions of the rows.)