I have a large (45K rows) dataset and I need to remove specific values from specific columns in a handful of cases. The dataset is large enough that I'd like to avoid using apply if at all possible.
Here's a sample dataset:
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [30, 40, 50], "T": ["C", "D", "E"], "S": ["n", "o/p", "p"]})
drops = pd.DataFrame({"ID": [40, 50], "Column": ["S", "S"], "Rule": ["Remove", "Remove"], "Override": ["p", "p"]})
My current solution is to use:
(
    df.merge(
        drops.pivot(index="ID", columns="Column", values="Override").reset_index()[["ID", "S"]],
        how="left",
        on=["ID", "S"],
        indicator="_dropS",
    ).assign(S=lambda d_: d_.S.mask(d_._dropS == "both", np.nan))
)
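For reference, on the sample data this produces the frame below; only ID 50, whose S value equals the override exactly, is matched and masked:

   ID  T    S     _dropS
0  30  C    n  left_only
1  40  D  o/p  left_only
2  50  E  NaN       both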
But this only successfully removes one of the entries: the merge matches on the exact value of S, so "p" is caught but "o/p" is not. My general Python knowledge tells me to split the S column on the delimiter "/", remove the matching entry, and join the list back together again (there may be more than two entries in the S column), but I can't seem to make that work within the DataFrame without using apply.
Edited to add goal state: column S should end up with the entries 'n', 'o', ''. The final one could be NaN as well.
Is there a reasonable way to do this without a separate function call?
CodePudding user response:
IIUC, here is one solution that gives the expected output; no idea about the performance. Would be interested in your feedback on that.
# from your sample data
df = pd.DataFrame({"ID": [30, 40, 50], "T": ["C", "D", "E"], "S": ["n", "o/p", "p"]})
drops = pd.DataFrame({"ID": [40, 50], "Column": ["S", "S"], "Rule": ["Remove", "Remove"], "Override": ["p", "p"]})

# one column per affected target column, holding the value to remove for each ID
pivoted_rules = drops.pivot(index="ID", columns="Column", values="Override").rename(columns={"S": "compare_S"})

# align the rules with the data; the sentinel keeps IDs without a rule from matching anything
res = pd.concat([df.set_index("ID"), pivoted_rules], axis=1).fillna("fill_value")

# split S, drop entries matching the rule, and re-join what is left with the original delimiter
# (joining with "/" so it also works when more than one entry survives)
res["S"] = ["/".join([x for x in a if x != b]) for a, b in zip(res["S"].str.split("/"), res["compare_S"])]

res = res.drop("compare_S", axis=1).reset_index()
print(res)
   ID  T  S
0  30  C  n
1  40  D  o
2  50  E
Didn't use apply :)
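If the list comprehension turns out to be slow on the full 45K rows, one possible fully vectorized variant of the same split/remove/join idea (just a sketch, not benchmarked) is to explode S into one row per entry, anti-join against the rules, and rebuild the strings:

import pandas as pd

df = pd.DataFrame({"ID": [30, 40, 50], "T": ["C", "D", "E"], "S": ["n", "o/p", "p"]})
drops = pd.DataFrame({"ID": [40, 50], "Column": ["S", "S"], "Rule": ["Remove", "Remove"], "Override": ["p", "p"]})

# one row per individual "/"-separated entry of S
exploded = df.assign(S=df["S"].str.split("/")).explode("S")

# anti-join: keep only the (ID, S) pairs that are not listed in drops
keep = exploded.merge(
    drops.rename(columns={"Override": "S"})[["ID", "S"]],
    how="left", on=["ID", "S"], indicator=True,
).query("_merge == 'left_only'")

# re-join the surviving entries; IDs whose entries were all removed come back as NaN
rebuilt = keep.groupby("ID")["S"].agg("/".join).reset_index()
res = df.drop(columns="S").merge(rebuilt, how="left", on="ID")
print(res)

On the sample this gives 'n', 'o', and NaN for ID 50 (rather than an empty string).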