Home > database >  Remove a substring from a pandas dataframe column
Remove a substring from a pandas dataframe column

Time:04-30

I have a large (45K rows) dataset and I need to remove specific values from specific columns in a handful of cases. The dataset is large enough I'd like to avoid using apply if at all possible.

Here's a sample dataset:

df = pd.DataFrame({"ID": [30, 40, 50], "T": ["C", "D", "E"], "S": ["n", "o/p", "p"]})

drops = pd.DataFrame({"ID": [40, 50],"Column": ["S", "S"],"Rule": ["Remove", "Remove"],"Override": ["p", "p"]})

My current solution is to use:

(
df.merge(
    drops.pivot(index="ID", columns="Column", values="Override").reset_index()[["ID", "S"]],
    how="left",
    on=["ID", "S"],
    indicator="_dropS",
).assign(
    S=lambda d_: d_.S.mask(d_._dropS == "both", np.nan)))

But this only successfully removes one of the entries. My general Python knowledge is telling me to split the column S by the delimiter "/", remove the matching entry, and join the list back together again (there may be more than two entries in the S column), but I can't seem to make that work within the DataFrame without using apply.

Edited to add goal state: Column S should have the entries: 'n', 'o', ''. The final could be NaN as well.

Is there a reasonable way to do this without a separate function call?

CodePudding user response:

IIUC here is one solution that gives the expected output, no idea about the perfomance. Would be interested in your feedback on that.

#from your sample data
df = pd.DataFrame({"ID": [30, 40, 50], "T": ["C", "D", "E"], "S": ["n", "o/p", "p"]})
drops = pd.DataFrame({"ID": [40, 50],"Column": ["S", "S"],"Rule": ["Remove", "Remove"],"Override": ["p", "p"]})

pivoted_rules = drops.pivot(index="ID", columns="Column", values="Override").rename(columns={'S': 'compare_S'})

res = pd.concat([df.set_index('ID'),pivoted_rules],axis=1).fillna('fill_value')
res['S'] = [''.join([x for x in a if x!=b]) for a, b in zip(res['S'].str.split('/'), res['compare_S'])]
res = res.drop('compare_S', axis=1).reset_index()

print(res)

    ID  T   S
0   30  C   n
1   40  D   o
2   50  E   

Didn't use apply :)

  • Related