pandas determine missing value(s) in a row by computing across columns


I'm working on processing sales data. Below is a truncated version of the dataframe. The goal is to determine the actual value behind each "9999" and replace that cell.

    CUSTOMER_ID  ALL_Sales_2017  Toyota_sales_2017  Honda_sales_2017  Ford_sales_2017  9999_count
    3000522      93              9999               70                20               1
    3000530      60              31                 9999              27               1
    3002817      231             9999               43                170              1
    3004201      18              6                  9999              9999             2
    3004573      36              9999               18                17               1
    3004888      9               9999               9999              9999             3

In the above dataset, the "ALL_Sales_2017" column represents total sales and can be assumed to always hold the true value (never 9999), whereas the individual brand columns ("Toyota_sales_2017", "Honda_sales_2017", "Ford_sales_2017") can have one, two, or three 9999 values among them.

The logic to determine a "9999" value in a given row is (a sketch of the general rule follows this list):

  1. row containing one 9999, e.g. customer_id=3000522: 93-(70+20)=3
  2. row containing two 9999s, e.g. customer_id=3004201: (18-6)/2=6
  3. row containing three 9999s: ignore the row
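
Put generally: with k 9999s among the brand columns, each missing value works out to (ALL_Sales minus the sum of the known brand values) divided by k, and a row where all three brands are 9999 is left alone. A minimal sketch of that rule on plain Python values (fill_row is a hypothetical helper, not code from this post; it assumes the remainder splits evenly, as in the examples above):

    def fill_row(all_sales, brands, sentinel=9999):
        # Apply rules 1-3 to one row: spread the remainder over the missing brands.
        known = [v for v in brands if v != sentinel]
        n_missing = len(brands) - len(known)
        if n_missing == 0 or n_missing == len(brands):
            return brands  # nothing missing, or everything missing (rule 3): leave as-is
        fill = (all_sales - sum(known)) // n_missing
        return [fill if v == sentinel else v for v in brands]

    fill_row(93, [9999, 70, 20])   # rule 1 -> [3, 70, 20]
    fill_row(18, [6, 9999, 9999])  # rule 2 -> [6, 6, 6]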

So, after processing, the dataset would look like this:

    CUSTOMER_ID  ALL_Sales_2017  Toyota_sales_2017  Honda_sales_2017  Ford_sales_2017  9999_count
    3000522      93              3                  70                20               1
    3000530      60              31                 2                 27               1
    3002817      231             18                 43                170              1
    3004201      18              6                  6                 6                2
    3004573      36              1                  18                17               1
    3004888      9               9999               9999              9999             3

I came up with this implementation.

  1. Create a new column (9999_count) to track the number of 9999s in each row (a sketch of this step follows the list).
  2. Compute and assign the value in the appropriate column.
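
For step 1, the count column could be built, for instance, like this (a sketch; this step isn't shown in the original snippet):

    # Sketch of step 1: count the 9999s across the three brand columns.
    brand_cols = ["Toyota_sales_2017", "Honda_sales_2017", "Ford_sales_2017"]
    df["9999_count"] = df[brand_cols].eq(9999).sum(axis=1)

Step 2 is then handled column by column: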
df.loc[
    df["Toyota_sales_2017"].eq(9999) & (df["9999_count"] == 1), "Toyota_sales_2017"
] = (df["ALL_Sales_2017"] - df["Honda_sales_2017"] - df["Ford_sales_2017"])
df.loc[
    df["Honda_sales_2017"].eq(9999) & (df["9999_count"] == 1), "Honda_sales_2017"
] = (df["ALL_Sales_2017"] - df["Toyota_sales_2017"] - df["Ford_sales_2017"])
df.loc[df["Ford_sales_2017"].eq(9999) & (df["9999_count"] == 1), "Ford_sales_2017"] = (
    df["ALL_Sales_2017"] - df["Honda_sales_2017"] - df["Toyota_sales_2017"]
)

How can I extend this logic to more columns, e.g. for years 2018, 2019, 2020, etc.? Can the logic be rewritten in a generic way? Is there another, possibly simpler, way to solve this problem?

CodePudding user response:

You did not provide a reproducible example, so I can't check this myself, but here is a way to generalize it using Python f-strings and pandas.concat:

import pandas as pd

dfs = []
for year in range(2017, 2022):  # years 2017 to 2021

    # Get the subset of df for the given year
    tmp = df.loc[
        :, ["CUSTOMER_ID"]   [col for col in df.columns if str(year) in col]
    ]

    # Create a temporary column (9999_count) to track the number of 9999s in each row
    tmp["9999_count"] = tmp.apply(
        lambda x: sum([x[col] == 9999 for col in tmp.columns]), axis=1
    )

    # Change values
    tmp.loc[
        tmp[f"Toyota_sales_{year}"].eq(9999) & (tmp["9999_count"] == 1),
        f"Toyota_sales_{year}",
    ] = (
        tmp[f"ALL_Sales_{year}"]
        - tmp[f"Honda_sales_{year}"]
        - tmp[f"Ford_sales_{year}"]
    )
    tmp.loc[
        tmp[f"Honda_sales_{year}"].eq(9999) & (tmp["9999_count"] == 1),
        f"Honda_sales_{year}",
    ] = (
        tmp[f"ALL_Sales_{year}"]
        - tmp[f"Toyota_sales_{year}"]
        - tmp[f"Ford_sales_{year}"]
    )
    tmp.loc[
        tmp[f"Ford_sales_{year}"].eq(9999) & (tmp["9999_count"] == 1),
        f"Ford_sales_{year}",
    ] = (
        tmp[f"ALL_Sales_{year}"]
        - tmp[f"Honda_sales_{year}"]
        - tmp[f"Toyota_sales_{year}"]
    )

    dfs.append(tmp.drop(columns="9999_count"))

# Get back full dataframe
df = pd.concat(dfs, axis=1)
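
One caveat with this sketch: each yearly slice keeps its own CUSTOMER_ID column, so pd.concat(dfs, axis=1) will repeat that column once per year. Indexing on CUSTOMER_ID before slicing avoids the duplication; a rough, untested variant:

    # Untested variant: use CUSTOMER_ID as the index so the yearly slices align
    # on it and the ID column is not duplicated (assumes CUSTOMER_ID is unique).
    df = df.set_index("CUSTOMER_ID")
    dfs = []
    for year in range(2017, 2022):
        tmp = df.loc[:, [col for col in df.columns if str(year) in col]]
        # ... same 9999_count creation and per-column fixes as above ...
        dfs.append(tmp)
    df = pd.concat(dfs, axis=1).reset_index()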