I have written a function which asks the user for a column name (e.g. 'Age') or a column number (0, 1, ... or -1, -2, ...) and returns that column if it exists. I would like to know if my solution can be improved in terms of code design.
To clarify, I need this piece of code for another function which calculates Shannon entropy on dataframes for which the label column has to be chosen manually.
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})


def read(df):
    while True:
        column = input("Please, enter column name or number:")
        # Input that looks like an integer is treated as a column position
        if column.lstrip('-').isdecimal():
            if (-df.shape[1] > int(column)) or (int(column) >= df.shape[1]):
                print('Such column does not exist. Please, try again. \n')
                continue
            else:
                return df.iloc[:, int(column)]
        # Anything else is treated as a column name
        elif column not in df.columns:
            print('Such column does not exist. Please, try again. \n')
            continue
        else:
            return df.loc[:, column]


read(df)
CodePudding user response:
The columns are available in df.columns, which can be used to get the data you want. If the input isn't in df.columns, try converting it to an int to index df.columns, and use an exception handler to deal with the misses.
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})


def read(df):
    while True:
        column = input("Please, enter column name or number:")
        if column not in df.columns:
            try:
                column = df.columns[int(column)]
            except (IndexError, ValueError):
                print(f"Column {column!r} does not exist, Please try again.")
                continue
        break
    return df.loc[:, column]


print(read(df))
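The lookup described above can also be tried without typing at a prompt. This is only a minimal sketch: resolve_column is a hypothetical helper name, not part of the answer's code, and it assumes the same name-first, position-second order.

```python
import pandas as pd


def resolve_column(df_: pd.DataFrame, text: str):
    """Return the column label matching `text`: by name first, then by position.

    Hypothetical helper for illustration only.
    Raises ValueError or IndexError when nothing matches.
    """
    if text in df_.columns:
        return text
    # int() raises ValueError for non-numeric text;
    # df_.columns[...] raises IndexError for an out-of-range position.
    return df_.columns[int(text)]


df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

print(resolve_column(df, 'B'))   # label given directly
print(resolve_column(df, '-1'))  # last column, given by position
```

Keeping the lookup apart from input() leaves the loop responsible only for prompting and retrying, which may make the lookup easier to test.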
CodePudding user response:
The EAFP ("easier to ask forgiveness than permission") approach would say that we should try to select from the DataFrame and handle the errors that arise, since pandas has already done a lot of work to see whether or not an indexer is valid.
If we go fully in this direction, we end up with something like:
import pandas as pd


def read(df_: pd.DataFrame) -> pd.Series:
    while True:
        column = input("Please, enter column name or number:")
        try:
            # Attempt to return the Column
            return df_[column]
        except KeyError:
            try:
                # Attempt to convert the column to int and return the column
                return df_.iloc[:, int(column)]
            except (ValueError, IndexError):
                # Print Message if both attempts fail
                print('Such column does not exist. Please, try again. \n')
I've changed the function parameter from df to df_ to avoid shadowing a variable from an external scope.
We first read in the column name, then attempt to return the matching column. This raises a KeyError if the label does not exist in the DataFrame. In this case, we attempt to access the values positionally. int(column) will raise a ValueError if the input cannot be converted to an int, and iloc will raise an IndexError if the indexer is out of bounds.
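The three failure modes described above can be checked in isolation. This is a small sketch using the question's example frame; error_name is a hypothetical helper introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})


def error_name(action) -> str:
    """Run `action` and return the name of the exception it raises, or 'no error'."""
    try:
        action()
    except Exception as exc:
        return type(exc).__name__
    return "no error"


print(error_name(lambda: df['Z']))         # KeyError: missing label
print(error_name(lambda: int('Age')))      # ValueError: non-numeric text
print(error_name(lambda: df.iloc[:, 5]))   # IndexError: position out of bounds
print(error_name(lambda: df.iloc[:, -1]))  # no error: valid negative position
```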
A slightly modified version of this is:
def read(df_: pd.DataFrame) -> pd.Series:
    while True:
        try:
            column = input("Please, enter column name or number:")
            try:
                # Try to get int indexer from df_.columns
                indexer = df_.columns.get_loc(column)
            except KeyError:
                # Use int version of Column
                indexer = int(column)
            return df_.iloc[:, indexer]
        except (ValueError, IndexError):
            # Catch Invalid int conversion, or out of bounds indexes
            print('Such column does not exist. Please, try again. \n')
Here we use Index.get_loc, which, per its documentation, will "Get integer location, slice or boolean mask for requested label." This also raises a KeyError if the label is not present in columns; however, in this case we attempt to convert column to an int in the except block and use that as indexer.
This means that indexer is guaranteed to be integer location based and can be passed to iloc. Then we can catch the ValueError raised if the int conversion fails, and the IndexError which occurs if the indexer is out of bounds.
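To see the two paths side by side, here is a short sketch on the question's example frame, assuming unique column labels: a known label goes through get_loc, while numeric text falls back to int.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

# A known label resolves directly to its integer position.
print(df.columns.get_loc('B'))  # 1

# The text '1' is not a label, so get_loc raises KeyError
# and the text is converted to a position instead.
try:
    indexer = df.columns.get_loc('1')
except KeyError:
    indexer = int('1')

print(df.iloc[:, indexer].tolist())  # ['a', 'b', 'c']
```

With duplicate column labels, get_loc may return a slice or boolean mask, as the quoted docstring says, and iloc would then return a DataFrame rather than a Series.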