Home > Software engineering >  Filter Pandas DF
Filter Pandas DF

Time:04-02

I got a Dataframe with three columns: "Company", "Cost" and "Refund".

raw_df = pd.DataFrame({
'Company': ['A', 'A', 'B', 'A', 'A'],
'Cost': [10, 10, 10, 0, 158],
'Refund': [0, 0, 0, 10, 0]})

If there has been a refund, then my goal is to remove the corresponding cost. For the above example, the solution would be:

filtered_df = pd.DataFrame({
'Company': ['A', 'B', 'A'],
'Cost': [10, 10, 158],
'Refund': [0, 0, 0]})

How can you code it?

Clarification notes:

  • A row can never have a cost and refund. It is always a cost OR a refund
  • If there has been a refund, there will ALWAYS be a matching cost with a matching company.
  • A single refund cancels only a single cost. Like in the example above.

CodePudding user response:

You can try this:

self join the dataframes renaming refund to cost. then filter each company to keep the last index and drop them.

tmp = raw_df.drop("Cost",1).rename(columns={"Refund":"Cost"}).query("Cost > 0")
tmp1 = raw_df.merge(tmp.reset_index(),on=['Company','Cost'],how='left')\
            .dropna(subset=['index']).drop_duplicates(["Company","index"],keep='last')
filtered_df = raw_df.drop(sum(map(list,tmp1['index'].items()),[]))

print(filtered_df)

  Company  Cost  Refund
0       A    10       0
2       B    10       0
4       A   158       0
  • Related