I have the following dataset:
my_df = pd.DataFrame({'id':[1,2,3,4,5],
'type':['corp','smb','smb','corp','mid'],
'sales':[34567,2190,1870,22000,10000],
'sales_roi':[.10,.21,.22,.15,.16],
'sales_pct':[.38,.05,.08,.30,.20],
'sales_ln':[4.2,2.1,2.0,4.1,4],
'cost_pct':[22000,1000,900,14000,5000],
'flag':[0,1,0,1,1],
'gibberish':['bla','ble','bla','ble','bla'],
'tech':['lnx','mst','mst','lnx','mc']})
my_df['type'] = pd.Categorical(my_df.type)
my_df
id type sales sales_roi sales_pct sales_ln cost_pct flag gibberish tech
0 1 corp 34567 0.10 0.38 4.2 22000 0 bla lnx
1 2 smb 2190 0.21 0.05 2.1 1000 1 ble mst
2 3 smb 1870 0.22 0.08 2.0 900 0 bla mst
3 4 corp 22000 0.15 0.30 4.1 14000 1 ble lnx
4 5 mid 10000 0.16 0.20 4.0 5000 1 bla mc
And I want to filter out all variables who end in "_pct" or "_ln" or are equal to "gibberish" or "tech". This is what I have tried:
df_selected = df.loc[:, ~my_df.columns.str.endswith('_pct') &
~my_df.columns.str.endswith('_ln') &
~my_df.columns.str.contains('gibberish','tech')]
But it returns me an unwanted column ("tech"):
id type sales sales_roi flag tech
0 1 corp 34567 0.10 0 lnx
1 2 smb 2190 0.21 1 mst
2 3 smb 1870 0.22 0 mst
3 4 corp 22000 0.15 1 lnx
4 5 mid 10000 0.16 1 mc
This is the expected result:
id type sales sales_roi flag
0 1 corp 34567 0.10 0
1 2 smb 2190 0.21 1
2 3 smb 1870 0.22 0
3 4 corp 22000 0.15 1
4 5 mid 10000 0.16 1
Please consider that I have to deal with hundreds of variables and this is just an example of what I need. Any help will be greatly appreciated.
CodePudding user response:
Currently, what you are doing will return every column because of how the conditions are written. endswith
will accept tuples so just put all the columns you are looking for in a single tuple and then filter
my_df[my_df.columns[~my_df.columns.str.endswith(('_pct','_ln','gibberish','tech'))]]
id type sales sales_roi flag
0 1 corp 34567 0.10 0
1 2 smb 2190 0.21 1
2 3 smb 1870 0.22 0
3 4 corp 22000 0.15 1
4 5 mid 10000 0.16 1
CodePudding user response:
I would do it like this:
criterion = ["_pct", "_ln", "gibberish", "tech"]
for column in my_df:
for criteria in criterion:
if criteria in column:
my_df = my_df.drop(column, axis=1)
Ofcourse you can change the if statement in line 3 to endswith or something of your choice. Hope this helped :)