Pandas indexing, searching in dataframes-CodePudding

Problem solved

Using loc istead of iloc solves the problem but I'm not sure why.

Medium size of dataframe (80766, 19), composed of ints, floats and dates. While my work, I noticed my results were strange. I started transforming and simplifying expressions to see where the problem was, and came into a contradiction.

Using these two lines I got the same result (as expected):

import pandas
...
data_table[data_table[col_name] == 69][col_name]
data_table.iloc[data_table.index[data_table[col_name] == 69]][col_name]

Result:

23270    69
23271    69
         ..
25059    69
Name: BBCH, Length: 1790, dtype: int64

But when I changed the searched value for higher, the second line gives a completely incorrect result.

data_table[data_table[col_name] == 71][col_name]

Gives good result:

39556    71
39557    71
         ..
41353    71
Name: BBCH, Length: 1798, dtype: int64

And for

data_table.iloc[data_table.index[data_table[col_name] == 71]][col_name]

the result is:

7336    30
7337    30
        ..
9133    30
Name: BBCH, Length: 1798, dtype: int64

My question is why is it that? Is it a problem with size of data?

CodePudding user response：

As long as your index is a RangeIndex, i.e., has no gaps, you can use loc and iloc interchangeably, e.g.,

>>> s = pd.Series('foo', index=range(10))
>>> s
0    foo
1    foo
2    foo
3    foo
4    foo
5    foo
6    foo
7    foo
8    foo
9    foo
dtype: object
>>> s.loc[[1, 2, 7]]
1    foo
2    foo
7    foo
dtype: object
>>> s.iloc[[1, 2, 7]]
1    foo
2    foo
7    foo
dtype: object

But s.loc[[1, 2, 7]] selects the rows that are labelled 1, 2, and 7, no matter their position, while iloc extracts the rows that are at the positional indices 1, 2, and 7. If you changed the order of the rows in s, loc would still give the same rows, but iloc would give whatever ends up at the second, third, and 8th row.

If you modify your data so that it's no longer a RangeIndex (i.e., there are rows missing if you will), loc and iloc will give different result once they select something that follows a "missing row". Hence in the example below, with the modified s, the rows at index 1 and 2 are still labelled as 1 and 2, so they are selected by both loc and iloc, but the 8th row is no longer labelled 7, but 9 (as we removed two rows in the middle).

>>> s = s.drop([3, 4])
>>> s
0    foo # position = 0
1    foo # 1
2    foo # 2
5    foo # 3 but label == 5!!
6    foo # 4 but label == 6
7    foo # etc.
8    foo
9    foo
dtype: object
>>> s.loc[[1, 2, 7]]
1    foo
2    foo
7    foo
dtype: object
>>> s.iloc[[1, 2, 7]]
1    foo
2    foo
9    foo # != 7 !!
dtype: object

That explains why in the first case, your result was correct, but in the second case, something caused the labels of the index to be "out of sync" with the positional values (probably some dropped rows). As you selected by subsetting the labels of .index, you need loc, not iloc. (If you did a reset_index before subsetting, iloc would work again, because then the index would again be identical to the positions of the rows.)