By "generic" I mean that I do not know the name of the column that needs to be dropped ahead of pulling in the file. The examples I have found assume that you already know the name of the column you wish to drop. Those familiar with the PlayTennis data set are probably used to seeing:
my_df = pd.DataFrame({"Outlook": ["Sunny", "Cloudy", "Rainy"], "Temp": ["Hot", "Cold"],
                      "Humidity": ["high", "low"], ...})
However, in my class we get a first column, 'Days', so it looks more like:
my_df = pd.DataFrame({"Days": ["D1", "D2", ..., "D14"], "Outlook": ["Sunny", "Cloudy", "Rainy"], "Temp": ["Hot", "Cold"], "Humidity": ["high", "low"], ...})
Obviously, looking at this I would want to drop the 'Days' column:
df.drop(columns=['Days'], inplace=True)
The problem is that PlayTennis is just a sample dataset; in the actual dataset, the column I will need to drop for the same reason as 'Days' will not be called 'Days'. I need a method that can look at the number of unique values in a column, recognize that there are too many for the column to be useful, and drop it before I feed the data into my machine learning algorithm.
import pandas as pd
import numpy as np

df_train = pd.read_csv("assets/playtennis.csv")  # read in data (forward slash avoids backslash-escape issues)
df_train.head()  # see first 5 rows

# get a list of attributes excluding the class label (e.g., PlayTennis)
def attributes(df, label):
    return df.columns.drop(label).values.tolist()

def trash(df, attr, label):
    # Do something to trash useless columns
    for x in attr:
        df.drop(columns=[x], inplace=True)

class_label = df_train.columns[-1]  # class label is in the last column
attr = attributes(df_train, class_label)
trash(df_train, attr, class_label)
I only have about six weeks of experience with Python, so please forgive (and point out) any syntax errors.
CodePudding user response:
You first have to define what "too many unique values to be useful" means.
As a starting point, you can calculate the number of unique values per column with nunique.
You can use that value to drop columns. For example, this drops all columns with more than three unique values.
df.drop(columns=df.columns[df.nunique() > 3])
Full example:
import pandas as pd
df = pd.DataFrame({
'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'col2': ['a', 'b', 'c', 'c', 'd', 'd', 'e', 'f', 'f', 'g'],
'col3': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
'col4': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
})
df.nunique()
col1 10
col2 7
col3 2
col4 3
df.drop(columns=df.columns[df.nunique() > 3], inplace=True)
df
col3 col4
0 a 1
1 a 1
2 a 1
3 a 1
4 a 2
5 b 2
6 b 2
7 b 3
8 b 3
9 b 3
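For the question's specific case, a pure identifier column like 'Days' has one distinct value per row, so you can drop any column whose unique count equals the row count without knowing its name. A minimal sketch (the column names and data here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Days": [f"D{i}" for i in range(1, 15)],   # identifier: 14 unique values in 14 rows
    "Outlook": ["Sunny", "Rainy"] * 7,         # useful feature: few unique values
    "PlayTennis": ["Yes", "No"] * 7,           # class label
})

# Drop every column whose number of unique values equals the number of rows.
id_like = df.columns[df.nunique() == len(df)]
df = df.drop(columns=id_like)

print(list(df.columns))  # ['Outlook', 'PlayTennis'] -- 'Days' is gone
```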
CodePudding user response:
When you load the data, e.g. with pd.read_csv, you can load only the columns you want with the argument usecols=[list-of-columns-i-care-about]. That way you don't need to drop them.
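For example, assuming the file has the columns from the question, only the listed columns are ever read and 'Days' never enters the DataFrame (an in-memory stand-in replaces the CSV file here so the sketch is self-contained):

```python
import io
import pandas as pd

# Stand-in for assets/playtennis.csv from the question.
csv_data = io.StringIO(
    "Days,Outlook,Temp,Humidity,PlayTennis\n"
    "D1,Sunny,Hot,high,No\n"
    "D2,Cloudy,Cold,low,Yes\n"
)

# Load only the columns we care about; 'Days' is skipped at read time.
df = pd.read_csv(csv_data, usecols=["Outlook", "Temp", "Humidity", "PlayTennis"])
print(list(df.columns))  # ['Outlook', 'Temp', 'Humidity', 'PlayTennis']
```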
CodePudding user response:
First, it was not quite obvious why you want to drop the Days column in your dataset.
I assume that you want to drop a feature with distinct values on each row or too many unique entries such that the feature has no predictability to your testing label.
You can get the unique values of a column (e.g. 'name') by calling df['name'].unique(), and call len() on the result to get the number of unique values.
I would suggest you set a threshold for the highest proportion of unique values allowed before you drop a column.
def trash(df, attr, label, threshold=0.8):
    for col in attr:
        proportion = len(df[col].unique()) / len(df)
        if proportion >= threshold:
            df.drop(columns=[col], inplace=True)
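A quick check of this idea with the function written out (the typos in the question's draft are fixed: attr instead of att, df[col] indexing, and columns=[col] so a column rather than a row is dropped; the threshold and toy data are just for illustration):

```python
import pandas as pd

def trash(df, attr, label, threshold=0.8):
    # Drop any attribute whose proportion of unique values meets the threshold.
    for col in attr:
        proportion = len(df[col].unique()) / len(df)
        if proportion >= threshold:
            df.drop(columns=[col], inplace=True)

df = pd.DataFrame({
    "Days": ["D1", "D2", "D3", "D4", "D5"],                    # 5/5 unique -> dropped
    "Outlook": ["Sunny", "Sunny", "Rainy", "Rainy", "Sunny"],  # 2/5 unique -> kept
    "PlayTennis": ["Yes", "No", "Yes", "No", "Yes"],
})

trash(df, ["Days", "Outlook"], "PlayTennis")
print(list(df.columns))  # ['Outlook', 'PlayTennis']
```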