Pandas Loc Key Error from Second to Nth Column-CodePudding

I have this very simple code that for some reason returns a key error when I select the second column to the last one. But for some reason the first one works as intended.

import csv
import pandas as pd
import sys

#terminal parameter:
csv_data = sys.argv[1]

df = pd.read_csv(csv_data)

print(df)
print("\n")

col_heading = str(input("Enter Column Heading: "))
value = str(input("Enter Lookup Value: "))


print(df.loc[df[col_heading] == value])

The CSV looks like this:

Last name, First name, SSN,        Test1, Test2, Test3, Test4, Final, Grade
Alfalfa,   Aloysius,   123-45-6789, 40.0,    90.0,   100.0,    83.0,    49.0,   D-
Alfred,    University, 123-12-1234, 41.0,    97.0,    96.0,    97.0,    48.0,   D 
Gerty,     Gramma,     567-89-0123, 41.0,    80.0,    60.0,    40.0,    44.0,   C
Android,   Electric,   087-65-4321, 42.0,    23.0,    36.0,    45.0,    47.0,   B-
Bumpkin,   Fred,       456-78-9012, 43.0,    78.0,    88.0,    77.0,    45.0,   A-
Rubble,    Betty,      234-56-7890, 44.0,    90.0,    80.0,    90.0,    46.0,   C-
Noshow,    Cecil,      345-67-8901, 45.0,    11.0,    -1.0,     4.0,    43.0,   F
Buff,      Bif,        632-79-9939, 46.0,    20.0,    30.0,    40.0,    50.0,   B 
Airpump,   Andrew,     223-45-6789, 49.0      1.0,    90.0,   100.0,    83.0,   A
Backus,    Jim,        143-12-1234, 48.0,     1.0,    97.0,    96.0,    97.0,   A 
Carnivore, Art,        565-89-0123, 44.0,     1.0,    80.0,    60.0,    40.0,   D 
Dandy,     Jim,        087-75-4321, 47.0,     1.0,    23.0,    36.0,    45.0,   C 
Elephant,  Ima,        456-71-9012, 45.0,     1.0,    78.0,    88.0,    77.0,   B-
Franklin,  Benny,      234-56-2890, 50.0,     1.0,    90.0,    80.0,    90.0,   B-
George,    Boy,        345-67-3901, 40.0,     1.0,    11.0,    -1.0,     4.0,   B
Heffalump, Harvey,     632-79-9439, 30.0,     1.0,    20.0,    30.0,    40.0,   C

This is the error message I get:

Traceback (most recent call last):
  File "C:\Python310\lib\site-packages\pandas\core\indexes\base.py", line 3621, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'First name'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\krbss\Documents\PlanIT Testing Automation Testing\Challenge 7.py", line 18, in <module>
    print(df.loc[df[col_heading] == value])
  File "C:\Python310\lib\site-packages\pandas\core\frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Python310\lib\site-packages\pandas\core\indexes\base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 'First name'

Any guidance on this would be appreciated.

CodePudding user response：

The columns and values in your dataframe are polluted with leading spaces, which you can verify with these print statements (just after loading the dataframe):

print("Columns: ")
print(df.columns)

print("Example data: ")
print(df[" First name"].values)

Output:

Columns: 
Index(['Last name', ' First name', ' SSN', '        Test1', ' Test2', ' Test3',
       ' Test4', ' Final', ' Grade'],
      dtype='object')
Example data: 
['   Aloysius' '    University' '     Gramma' '   Electric' '   Fred'
 '    Betty' '    Cecil' '      Bif' '   Andrew' '    Jim' ' Art'
 '     Jim' '  Ima' '  Benny' '    Boy' ' Harvey']

Note that the first column 'Last name' has no leading spaces whereas the others do, which explains why you can could access the first column but not the others.

To fix this, you could tell pd.read_csv to skip initial spaces after the delimiter (see skipinitialspaces https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)

df = pd.read_csv(csv_data, skipinitialspace=True)

Now the columns and data do not have unnecessary spaces and your code should work as intended.

Columns: 
Index(['Last name', 'First name', 'SSN', 'Test1', 'Test2', 'Test3', 'Test4',
       'Final', 'Grade'],
      dtype='object')
Example data: 
['Aloysius' 'University' 'Gramma' 'Electric' 'Fred' 'Betty' 'Cecil' 'Bif'
 'Andrew' 'Jim' 'Art' 'Jim' 'Ima' 'Benny' 'Boy' 'Harvey']

You could also explicitly remove spaces using something like df.columns = [col.strip() for col in df.columns], which will fix your column names but you'd have to do the same for your data, so I think in this case setting skipinitialspaces=True is the cleanest solution.

CodePudding user response：

You need to remove the leading and trailing spaces. Assuming you dont want to remove the spaces in between (like 'Last name' changed to 'Lastname'). I would recommend removing both leading and trailing spaces by using strip. This should be done for column names and the data. You can do that like this

df.columns = df.columns.str.strip() #For column names
df.columns = [col.strip() for col in df.columns] #For data in each column

Also change your print statment in the end thus to handle both float and strings

if df[col_heading].dtype == 'float64':
    print("Float my boat")
    print(df.loc[df[col_heading] == float(value)])
else:
    print(df.loc[df[col_heading] == value])

This should enable you to print matching rows for both strings and float type