def remove_low_data_states(column_name):
    items = df[column_name].value_counts().reset_index()
    items.columns = ['place', 'value']
    print(f'Items in column: [{column_name}] with low data')
    return list(items[items['value'].apply(lambda val: val < items.value.median())].place)
remove_low_data_states('col1') --> returns ['hello', 'bye']
Original table
col1 | col2 | col3 |
---|---|---|
hello | 2 | 4 |
world | 2 | 4 |
bye | 2 | 4 |
Updated table
col1 | col2 | col3 |
---|---|---|
world | 2 | 4 |
The function above gives me a list of the values in a column whose counts fall below the median count. How can I then use that list to remove the rows whose value in that column appears in the list?
I have tried using df.drop, but that was not very helpful, or I am making some sort of mistake.
Answer:
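We can use .isin() to build a boolean mask of the rows whose value is in the returned list, then negate it with ~ to keep only the remaining rows: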
def remove_low_data_states(column_name):
    items = df[column_name].value_counts().reset_index()
    items.columns = ['place', 'value']
    print(f'Items in column: [{column_name}] with low data')
    return list(items[items['value'].apply(lambda val: val < items.value.median())].place)
df = df[~df['col1'].isin(remove_low_data_states('col1'))]
df.head()
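For a version that can be run on its own, here is a minimal sketch of the same idea. The sample data and the names low_count_values, drop_low_count_rows and sample are made up for illustration, and the DataFrame is passed in as an argument instead of being read from a global df. It also compares the counts to the median directly rather than through apply, which should give the same result.

```python
import pandas as pd


def low_count_values(frame, column_name):
    """Return the values in `column_name` whose row count is below the median count."""
    counts = frame[column_name].value_counts()
    # Vectorised comparison: keep the values whose count is under the median.
    return counts[counts < counts.median()].index.tolist()


def drop_low_count_rows(frame, column_name):
    """Return a copy of `frame` without the rows holding a low-count value."""
    low = low_count_values(frame, column_name)
    # isin() marks the rows to remove; ~ inverts the mask to keep the rest.
    return frame[~frame[column_name].isin(low)].reset_index(drop=True)


# Hypothetical sample data: 'world' appears 3 times, 'foo' twice,
# 'hello' and 'bye' once each, so the median count is 1.5.
sample = pd.DataFrame({
    'col1': ['world', 'world', 'world', 'hello', 'bye', 'foo', 'foo'],
    'col2': [2, 2, 2, 2, 2, 2, 2],
    'col3': [4, 4, 4, 4, 4, 4, 4],
})

filtered = drop_low_count_rows(sample, 'col1')
print(filtered)
```

With this sample, only the 'world' and 'foo' rows should remain. Note that df.drop works on index labels (or column names), not on cell values, which is likely why it did not help here; filtering with a boolean mask avoids having to look up the index labels first.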