Pandas KeyError in get_loc when calling entries from dataframe in for loop

Time:12-01

I am using a pandas DataFrame, and for some reason I get an error when I try to access one entry after another in a for loop.

Here is my (simplified) code snippet:


import numpy as np
import pandas as pd

df_original = pd.read_csv(csv_dataframe_filename, sep='\t', header=[0, 1], encoding_errors="replace")
df_original.columns = ['A', 'B',
              'Count_Number', 'D',
              'E', 'F',
              'use_first', 'H', 'I']

df_use = df_original
df_use = df_use.drop(df_use[((df_use['use_first']=='no'))].index)
df_use.columns = ['A', 'B',
              'Count_Number', 'D',
              'E', 'F',
              'use_first', 'H', 'I']


c_mag = np.zeros((len(df_use), 1))
x = 0
for i in range(len(df_use)):
    print(df_use['Count_Number'][x]) #THIS IS THE LINE THAT IS THE ISSUE
    x += 1
print(c_mag)
print(df_use['Count_Number'][x])

The line that is the issue is marked by a comment. If I enter a specific number instead of the variable x, it works (both outside and inside the loop, but inside the loop it of course then prints always the same value each time which is not what I want). It also works with df_original instead of df_use (but for my purpose I really need df_use). The printing in the very last line also works (even with variable x that at that point has a certain value). I also entered the column naming for df_use in the middle later on, so I got the issue with and without it in the same way. I tried whether all other parts of the code work and they do, so both dataframes can be printed correctly etc. Using x instead of i as a variable is also a result of playing around and trying to find a solution, so using i was giving the same result.

The column contains floats, if that matters.

But for the code as it is I get the following error message ("folder of file" is of course just a replacement for the actual file path):


Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 2131, in pandas._libs.hashtable.Int64HashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 2140, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "[folder of file]", line 74, in <module>
    print(df_use['Count_Number'][x])
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py", line 942, in __getitem__
    return self._get_value(key)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py", line 1051, in _get_value
    loc = self.index.get_loc(label)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3363, in get_loc
    raise KeyError(key) from err
KeyError: 0

Process finished with exit code 1

I searched for answers and tried out different things, such as checking the spelling etc. But I can not find a solution and do not understand what I am doing wrong. Does anyone have an idea on how to solve this issue?

Thank you in advance for any helpful comment!

UPDATE: Found a solution after all. Using .iloc[x] instead of just [x] solves the issue. I am still curious, though, why that happens: for other variables it worked even without .iloc, so why not in this case? I feel an answer would help me better understand how things work in Python, so thanks for any hints even though I already got the code working.

What I already tried: In addition to the checks described above (a fixed number instead of x, df_original instead of df_use, renaming the columns, switching between i and x, checking the spelling), I also played around with different ways of running the loop, but that did not help either.

What I am expecting: The entries of the data-frame columns can be called and used successfully (in this simplified case: can be printed) in the for loop one entry after another. If the printing itself can be done differently, that does not help me (of course I can just print the whole column, that is working), because my actual purpose is to do further calculations with each value. print() is just for now to simplify the issue and try to find a solution.

CodePudding user response:

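The for loop already provides a counter, i, so there is no need to maintain a second counter x by hand; use i directly. Note that changing a counter inside a Python for loop over range() does not make the loop skip entries, so removing x is a clean-up rather than the cause of the KeyError. To avoid the KeyError itself, access the values by position with .iloc.

Try:

...
c_mag = np.zeros((len(df_use), 1))

for i in range(len(df_use)):
    print(df_use['Count_Number'].iloc[i])  # loop variable i, accessed by position

print(c_mag)
...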

CodePudding user response:

This answer focuses on the UPDATE section you have provided.

The first thing to understand is the difference between plain indexing and iloc. iloc uses positional indexing (just as the elements of a list have positions 0, 1, ..., len(list)-1), whereas plain indexing on a column, [x] in your case, matches x against the row labels of the index rather than against positions.

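The traceback tells us that there is no row labelled 0, which is why a KeyError is raised. Most likely the row with label 0 had use_first == 'no' and was removed by drop, which keeps the original labels of the remaining rows; df_original still contains label 0, so the same lookup works there. iloc uses positional indexing, so .iloc[x] returns the very first remaining value of the column Count_Number for x=0.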
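The behaviour can be reproduced with a small made-up DataFrame. This is only a sketch: the data values are hypothetical stand-ins for the CSV in the question, and only the column names Count_Number and use_first are taken from it. It shows the label lookup failing after drop, and two ways around it: .iloc, or renumbering the labels with reset_index(drop=True).

```python
import pandas as pd

# Hypothetical stand-in for the asker's data; row 0 is marked 'no'.
df_original = pd.DataFrame({
    "Count_Number": [1.5, 2.5, 3.5, 4.5],
    "use_first": ["no", "yes", "no", "yes"],
})

# Same filtering step as in the question: drop keeps the old row labels.
df_use = df_original.drop(df_original[df_original["use_first"] == "no"].index)
surviving_labels = list(df_use.index)
print(surviving_labels)

col = df_use["Count_Number"]

# Plain [0] is a label lookup, and label 0 was dropped.
try:
    col[0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# .iloc[0] is a positional lookup: the first remaining row.
first_by_position = col.iloc[0]

# Alternatively, renumber the labels to 0..n-1 so that [0] works again.
df_reset = df_use.reset_index(drop=True)
first_after_reset = df_reset["Count_Number"][0]
```

Whether to use .iloc or reset_index is a design choice: reset_index discards the original row labels, which is fine if they are not needed later to refer back to df_original.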

In your case, if you want to use the for loop to process the values of the column in sequence, using iloc is recommended. As for the last line of your code, it is also a label lookup: it succeeds only because a row with the label currently held by x still exists in df_use, and the value it prints is not guaranteed to be the last value of the column.
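Since the stated goal is to do further calculations with each value, here is a minimal sketch of filling c_mag by position. The DataFrame contents and the "times 2" calculation are placeholders invented for illustration; only the names df_use, Count_Number and c_mag come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical filtered frame whose row labels (1 and 3) are not 0..n-1.
df_use = pd.DataFrame({"Count_Number": [2.5, 4.5]}, index=[1, 3])

# Option 1: loop over positions and read each value with .iloc.
c_mag = np.zeros((len(df_use), 1))
for i in range(len(df_use)):
    value = df_use["Count_Number"].iloc[i]
    c_mag[i, 0] = value * 2  # placeholder calculation

# Option 2: iterate over the values directly; enumerate supplies the position.
c_mag_values = np.zeros((len(df_use), 1))
for i, value in enumerate(df_use["Count_Number"].to_numpy()):
    c_mag_values[i, 0] = value * 2  # same placeholder calculation

print(c_mag)
```

Both options ignore the row labels entirely, so they behave the same no matter which rows were dropped earlier.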

I could not fully understand the rest of your issue, so if anything is still unclear, please ask, but keep the question short and specific.
