Python: How to check for the presence of a column in a DataFrame by its name or number

I have written a function that asks the user for a column name (e.g. 'Age') or a column number (0, 1, ... or -1, -2, ...) and returns the column if it exists. I would like to know whether my solution can be improved in terms of code design.

To clarify, I need this piece of code for another function that calculates Shannon entropy on DataFrames for which the label column has to be chosen manually.

import pandas as pd

df = pd.DataFrame({'A': [1,2,3], 'B':['a', 'b', 'c']})

def read(df):
    while True:
        column = input("Please, enter column name or number:") 
        if column.lstrip('-').isdecimal():
            if (-df.shape[1] > int(column)) or (int(column) >= df.shape[1]):
                print('Such column does not exist. Please, try again. \n')
                continue
            else:
                return df.iloc[:, int(column)]
        elif column not in df.columns:
            print('Such column does not exist. Please, try again. \n')
            continue
        else:
            return df.loc[:, column]

read(df)

CodePudding user response：

The column labels are available in df.columns, which you can use to get the data you want. If the input isn't in df.columns, try converting it to an int and using it to index df.columns, with an exception handler to deal with the misses.

import pandas as pd

df = pd.DataFrame({'A': [1,2,3], 'B':['a', 'b', 'c']})

def read(df):
    while True:
        column = input("Please, enter column name or number:")
        if column not in df.columns:
            try:
                column = df.columns[int(column)]
            except (IndexError, ValueError):
                print(f"Column {column!r} does not exist. Please try again.")
                continue
        break
    return df.loc[:, column]

print(read(df))
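
To exercise this lookup without typing at a prompt, the resolution step can be pulled out of the input loop. The following is a minimal sketch, not part of the original answer: resolve_column is a hypothetical helper name, and raising KeyError for every kind of miss is an assumption made here for simplicity.

```python
import pandas as pd


def resolve_column(df: pd.DataFrame, column: str) -> pd.Series:
    """Return the column labelled `column`, or the one at position int(column).

    Hypothetical helper: raises KeyError if neither interpretation matches.
    """
    if column not in df.columns:
        try:
            # Not a label, so treat the input as a position into df.columns
            column = df.columns[int(column)]
        except (IndexError, ValueError):
            raise KeyError(f"Column {column!r} does not exist") from None
    return df.loc[:, column]


df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})
by_name = resolve_column(df, 'B')       # lookup by label
by_position = resolve_column(df, '-1')  # lookup by (negative) position
```

The interactive read function can then call this helper inside its while loop and print the retry message whenever a KeyError is caught.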

CodePudding user response：

The EAFP approach ("easier to ask forgiveness than permission") says that we should simply try to select from the DataFrame and handle the errors that arise, since pandas already does a lot of work to determine whether an indexer is valid.

If we go fully in this direction, we end up with something like:

def read(df_: pd.DataFrame) -> pd.Series:
    while True:
        column = input("Please, enter column name or number:")
        try:
            # Attempt to return the Column
            return df_[column]
        except KeyError:
            try:
                # Attempt to convert the column to int and return the column
                return df_.iloc[:, int(column)]
            except (ValueError, IndexError):
                # Print Message if both attempts fail
                print('Such column does not exist. Please, try again. \n')

I've changed the function parameter from df to df_ to avoid shadowing a variable from an external scope.

We first read the user's input, then attempt to return the matching column. df_[column] raises a KeyError if the label does not exist in the DataFrame. In that case, we attempt to access the column positionally: int(column) raises a ValueError if the input cannot be converted to an int, and iloc raises an IndexError if the position is out of bounds.
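
As a quick check of which exception each failure mode produces, here is a small sketch using the two-column DataFrame from the question. The raised_by helper is hypothetical and exists only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})


def raised_by(action):
    """Call `action` and return the name of the exception it raises, or None."""
    try:
        action()
    except Exception as exc:
        return type(exc).__name__
    return None


missing_label = raised_by(lambda: df['Z'])        # label not in the columns
not_a_number = raised_by(lambda: int('Z'))        # input is not an integer
out_of_bounds = raised_by(lambda: df.iloc[:, 5])  # position past the last column

print(missing_label, not_a_number, out_of_bounds)
```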

A slightly modified version of this is:

def read(df_: pd.DataFrame) -> pd.Series:
    while True:
        try:
            column = input("Please, enter column name or number:")
            try:
                # Try to get int indexer from df_.columns
                indexer = df_.columns.get_loc(column)
            except KeyError:
                # Use int version of Column
                indexer = int(column)
            return df_.iloc[:, indexer]
        except (ValueError, IndexError):
            # Catch Invalid int conversion, or out of bounds indexes
            print('Such column does not exist. Please, try again. \n')

Here we use Index.get_loc, which the pandas documentation describes as: "Get integer location, slice or boolean mask for requested label." It also raises a KeyError if the label is not present in the columns; in that case, we convert column to an int in the except block and use that as the indexer.

This means that indexer is always positional and can be passed to iloc: it is an integer for a unique label (a slice or boolean mask if the label is duplicated), or the integer typed by the user. The outer except then catches the ValueError raised if the int conversion fails and the IndexError raised if the position is out of bounds.
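
The behaviour of Index.get_loc described above can be seen in a short sketch. The DataFrame with a duplicated 'A' label is an assumption added here to show the non-integer case; it is not part of the original question.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'c']})

# Unique label: get_loc returns its integer position
loc_b = df.columns.get_loc('B')
col_b = df.iloc[:, loc_b]

# Input that is not a label: get_loc raises KeyError, so fall back to int()
try:
    indexer = df.columns.get_loc('1')
except KeyError:
    indexer = int('1')
col_by_position = df.iloc[:, indexer]

# Duplicated label (illustrative assumption): get_loc no longer returns a
# plain int, and iloc then selects every matching column as a DataFrame
dup = pd.DataFrame([[1, 2, 3]], columns=['A', 'B', 'A'])
dup_loc = dup.columns.get_loc('A')
dup_selection = dup.iloc[:, dup_loc]
```

If column labels may repeat, the caller has to decide whether getting a DataFrame back instead of a Series is acceptable.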