By "generic" I mean that I do not know the name of the column that needs to be dropped ahead of pulling in the file. The examples I have found assume that you already know the name of the column you wish to drop. Those familiar with the PlayTennis data set are probably used to seeing:
my_df = pd.DataFrame({"Outlook": ["Sunny", "Cloudy", "Rainy"], "Temp": ["Hot", "Cold"],
                      "Humidity": ["high", "low"], ...})
However, in my class we get a first column, 'Days', so it looks more like:
my_df = pd.DataFrame({"Days": ["D1", "D2", ..., "D14"], "Outlook": ["Sunny", "Cloudy", "Rainy"], "Temp": ["Hot", "Cold"], "Humidity": ["high", "low"], ...})
Obviously, looking at this I would want to drop the 'Days' column:
df.drop(columns=['Days'], inplace=True)
The problem is that PlayTennis is just a sample dataset; in the actual dataset, the column I will need to drop for the same reason as 'Days' will not be called 'Days'. I need a method that can look at the number of unique values in a column, recognize that there are too many for the column to be useful, and drop it before I feed the data into my machine learning algorithm.
import pandas as pd
import numpy as np

df_train = pd.read_csv("assets/playtennis.csv")  # read in data (forward slash avoids backslash-escape issues)
df_train.head()  # see first 5 rows

# get a list of attributes excluding the class label (e.g., PlayTennis)
def attributes(df, label):
    return df.columns.drop(label).values.tolist()

def trash(df, attr, label):
    # Do something to trash useless columns
    for x in attr:
        df.drop(columns=[x], inplace=True)

class_label = df_train.columns[-1]  # class label is in the last column
attr = attributes(df_train, class_label)
trash(df_train, attr, class_label)
I only have about six weeks of experience with Python, so please forgive (and point out) any syntax errors.
CodePudding user response:
You first have to define what "too many unique values to be useful" means.
As a starting point, you can calculate the number of unique values per column with nunique.
You can use that value to drop columns. For example, this drops all columns with more than three unique values.
df.drop(columns=df.columns[df.nunique() > 3])
Full example:
import pandas as pd
df = pd.DataFrame({
'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'col2': ['a', 'b', 'c', 'c', 'd', 'd', 'e', 'f', 'f', 'g'],
'col3': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
'col4': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
})
df.nunique()
col1 10
col2 7
col3 2
col4 3
df.drop(columns=df.columns[df.nunique() > 3], inplace=True)
df
col3 col4
0 a 1
1 a 1
2 a 1
3 a 1
4 a 2
5 b 2
6 b 2
7 b 3
8 b 3
9 b 3
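For the question's specific case, a pure identifier column like 'Days' has one distinct value per row, so you can drop any column whose unique count equals the row count without knowing its name. A minimal sketch (the column names and data here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Days": [f"D{i}" for i in range(1, 15)],   # identifier: 14 unique values in 14 rows
    "Outlook": ["Sunny", "Rainy"] * 7,         # useful feature: few unique values
    "PlayTennis": ["Yes", "No"] * 7,           # class label
})

# Drop every column whose number of unique values equals the number of rows.
id_like = df.columns[df.nunique() == len(df)]
df = df.drop(columns=id_like)

print(list(df.columns))  # ['Outlook', 'PlayTennis'] -- 'Days' is gone
```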
CodePudding user response:
When you load the data, e.g. with pd.read_csv, you can load only the columns you want with the argument usecols=[list-of-columns-i-care-about]. That way you don't need to drop them.
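For example, assuming the file has the columns from the question, only the listed columns are ever read and 'Days' never enters the DataFrame (an in-memory stand-in replaces the CSV file here so the sketch is self-contained):

```python
import io
import pandas as pd

# Stand-in for assets/playtennis.csv from the question.
csv_data = io.StringIO(
    "Days,Outlook,Temp,Humidity,PlayTennis\n"
    "D1,Sunny,Hot,high,No\n"
    "D2,Cloudy,Cold,low,Yes\n"
)

# Load only the columns we care about; 'Days' is skipped at read time.
df = pd.read_csv(csv_data, usecols=["Outlook", "Temp", "Humidity", "PlayTennis"])
print(list(df.columns))  # ['Outlook', 'Temp', 'Humidity', 'PlayTennis']
```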
CodePudding user response:
First, it was not quite obvious why you want to drop the Days column in your dataset.
I assume that you want to drop a feature with distinct values on each row or too many unique entries such that the feature has no predictability to your testing label.
You can get the unique values of a column (e.g. 'name') by calling df['name'].unique(), and call len() on the result to get the number of unique values.
I would suggest you set a threshold for the highest proportion of unique values allowed before you drop a column.
def trash(df, attr, label, threshold=0.8):
    for col in attr:
        proportion = len(df[col].unique()) / len(df)
        if proportion >= threshold:
            df.drop(columns=[col], inplace=True)
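A quick check of this idea with the function written out (the typos in the question's draft are fixed: attr instead of att, df[col] indexing, and columns=[col] so a column rather than a row is dropped; the threshold and toy data are just for illustration):

```python
import pandas as pd

def trash(df, attr, label, threshold=0.8):
    # Drop any attribute whose proportion of unique values meets the threshold.
    for col in attr:
        proportion = len(df[col].unique()) / len(df)
        if proportion >= threshold:
            df.drop(columns=[col], inplace=True)

df = pd.DataFrame({
    "Days": ["D1", "D2", "D3", "D4", "D5"],                    # 5/5 unique -> dropped
    "Outlook": ["Sunny", "Sunny", "Rainy", "Rainy", "Sunny"],  # 2/5 unique -> kept
    "PlayTennis": ["Yes", "No", "Yes", "No", "Yes"],
})

trash(df, ["Days", "Outlook"], "PlayTennis")
print(list(df.columns))  # ['Outlook', 'PlayTennis']
```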