I got a Dataframe with three columns: "Company", "Cost" and "Refund".
raw_df = pd.DataFrame({
'Company': ['A', 'A', 'B', 'A', 'A'],
'Cost': [10, 10, 10, 0, 158],
'Refund': [0, 0, 0, 10, 0]})
If there has been a refund, then my goal is to remove the corresponding cost. For the above example, the solution would be:
filtered_df = pd.DataFrame({
'Company': ['A', 'B', 'A'],
'Cost': [10, 10, 158],
'Refund': [0, 0, 0]})
How can you code it?
Clarification notes:
- A row can never have a cost and refund. It is always a cost OR a refund
- If there has been a refund, there will ALWAYS be a matching cost with a matching company.
- A single refund cancels only a single cost. Like in the example above.
CodePudding user response:
You can try this:
self join the dataframes renaming refund to cost. then filter each company to keep the last index and drop them.
tmp = raw_df.drop("Cost",1).rename(columns={"Refund":"Cost"}).query("Cost > 0")
tmp1 = raw_df.merge(tmp.reset_index(),on=['Company','Cost'],how='left')\
.dropna(subset=['index']).drop_duplicates(["Company","index"],keep='last')
filtered_df = raw_df.drop(sum(map(list,tmp1['index'].items()),[]))
print(filtered_df)
Company Cost Refund
0 A 10 0
2 B 10 0
4 A 158 0