I'm working on processing sales data. Below is a truncated version of the dataframe. The goal is to determine the actual value behind each "9999" placeholder and replace the cell.
| CUSTOMER_ID | ALL_Sales_2017 | Toyota_sales_2017 | Honda_sales_2017 | Ford_sales_2017 | **9999_count** |
|-------------|----------------|-------------------|------------------|-----------------|----------------|
| 3000522     | 93             | 9999              | 70               | 20              | 1              |
| 3000530     | 60             | 31                | 9999              | 27              | 1              |
| 3002817     | 231            | 9999              | 43               | 170             | 1              |
| 3004201     | 18             | 6                 | 9999              | 9999             | 2              |
| 3004573     | 36             | 9999              | 18               | 17              | 1              |
| 3004888     | 9              | 9999              | 9999              | 9999             | 3              |
In the above dataset, the "ALL_Sales_2017" column represents total sales and can be assumed to always hold a true value (never 9999), whereas the individual brand columns ("Toyota_sales_2017", "Honda_sales_2017", "Ford_sales_2017") can have one, two, or three 9999 values among them.
The logic to determine the actual value behind a 9999 in a given row is:
- row containing one 9999, e.g. customer_id=3000522: 93 - (70 + 20) = 3
- row containing two 9999s, e.g. customer_id=3004201: (18 - 6) / 2 = 6 (the remaining total is split evenly between the two missing cells)
- row containing three 9999s: ignore the row
So, the dataset post-processing would look like this:
| CUSTOMER_ID | ALL_Sales_2017 | Toyota_sales_2017 | Honda_sales_2017 | Ford_sales_2017 | **9999_count** |
|-------------|----------------|-------------------|------------------|-----------------|----------------|
| 3000522     | 93             | 3                 | 70               | 20              | 1              |
| 3000530     | 60             | 31                | 2                | 27              | 1              |
| 3002817     | 231            | 18                | 43               | 170             | 1              |
| 3004201     | 18             | 6                 | 6                | 6               | 2              |
| 3004573     | 36             | 1                 | 18               | 17              | 1              |
| 3004888     | 9              | 9999              | 9999              | 9999             | 3              |
I came up with this implementation:
- Create a new column ("9999_count") to track the number of 9999s in each row (a sketch of this step is shown below).
- Compute and assign the value in the appropriate column.
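The count column can be built along these lines (a sketch, assuming only the three brand columns need to be checked for 9999):

brand_cols = ["Toyota_sales_2017", "Honda_sales_2017", "Ford_sales_2017"]
# Count the 9999 placeholders per row across the brand columns
df["9999_count"] = df[brand_cols].eq(9999).sum(axis=1)

The per-column assignments then look like this: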
# If Toyota is the only 9999 in the row, recover it from the total
df.loc[
    df["Toyota_sales_2017"].eq(9999) & (df["9999_count"] == 1), "Toyota_sales_2017"
] = (df["ALL_Sales_2017"] - df["Honda_sales_2017"] - df["Ford_sales_2017"])
# Same for Honda
df.loc[
    df["Honda_sales_2017"].eq(9999) & (df["9999_count"] == 1), "Honda_sales_2017"
] = (df["ALL_Sales_2017"] - df["Toyota_sales_2017"] - df["Ford_sales_2017"])
# Same for Ford
df.loc[
    df["Ford_sales_2017"].eq(9999) & (df["9999_count"] == 1), "Ford_sales_2017"
] = (df["ALL_Sales_2017"] - df["Honda_sales_2017"] - df["Toyota_sales_2017"])
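The two-9999 rows (e.g. customer_id=3004201) are not handled by the assignments above yet; a rough sketch of that case, applying the (total - known) / 2 rule with the same brand_cols list, would be:

# Rows with exactly two 9999s: split the remaining total evenly over the two missing cells
two_missing = df["9999_count"] == 2
known = df[brand_cols].where(df[brand_cols].ne(9999), 0).sum(axis=1)  # sum of the non-9999 brand values
fill = (df["ALL_Sales_2017"] - known) / 2
for col in brand_cols:
    df.loc[two_missing & df[col].eq(9999), col] = fill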
How can I extend this logic to more columns, e.g. for years 2018, 2019, 2020, etc.? Can the logic be rewritten in a generic way? Is there another, possibly simpler, way to solve this problem?
CodePudding user response:
You did not provide a reproducible example, so I can't check for myself, but here is a way to generalize that should work using Python f-strings and Pandas concat:
import pandas as pd

dfs = []
for year in range(2017, 2022):  # years 2017 to 2021
    # Get the subset of df for the given year (copy so we don't write through a view of df)
    tmp = df.loc[
        :, ["CUSTOMER_ID"] + [col for col in df.columns if str(year) in col]
    ].copy()
    # Create a temporary column (9999_count) to track the number of 9999s in each row
    tmp["9999_count"] = tmp.apply(
        lambda x: sum([x[col] == 9999 for col in tmp.columns]), axis=1
    )
    # Change values
    tmp.loc[
        tmp[f"Toyota_sales_{year}"].eq(9999) & (tmp["9999_count"] == 1),
        f"Toyota_sales_{year}",
    ] = (
        tmp[f"ALL_Sales_{year}"]
        - tmp[f"Honda_sales_{year}"]
        - tmp[f"Ford_sales_{year}"]
    )
    tmp.loc[
        tmp[f"Honda_sales_{year}"].eq(9999) & (tmp["9999_count"] == 1),
        f"Honda_sales_{year}",
    ] = (
        tmp[f"ALL_Sales_{year}"]
        - tmp[f"Toyota_sales_{year}"]
        - tmp[f"Ford_sales_{year}"]
    )
    tmp.loc[
        tmp[f"Ford_sales_{year}"].eq(9999) & (tmp["9999_count"] == 1),
        f"Ford_sales_{year}",
    ] = (
        tmp[f"ALL_Sales_{year}"]
        - tmp[f"Honda_sales_{year}"]
        - tmp[f"Toyota_sales_{year}"]
    )
    # Index on CUSTOMER_ID so the per-year frames align (and the ID column
    # isn't duplicated) when concatenated column-wise
    dfs.append(tmp.drop(columns="9999_count").set_index("CUSTOMER_ID"))

# Get back the full dataframe
df = pd.concat(dfs, axis=1).reset_index()
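If you also want to cover the two-9999 rows and avoid writing one assignment per brand, a more generic variant (untested, since I still can't run it against your data, and assuming the same three brands and the same ALL_Sales_<year> naming for every year) could be:

brands = ["Toyota", "Honda", "Ford"]
for year in range(2017, 2022):
    brand_cols = [f"{brand}_sales_{year}" for brand in brands]
    total = df[f"ALL_Sales_{year}"]
    sales = df[brand_cols]
    missing = sales.eq(9999)                       # True where the 9999 placeholder sits
    n_missing = missing.sum(axis=1)                # number of 9999s per row
    known = sales.where(~missing, 0).sum(axis=1)   # sum of the known brand values
    # Remaining total split evenly over the missing cells,
    # only for rows with one or two 9999s (rows with three are left untouched)
    fillable = n_missing.between(1, 2)
    fill = (total - known) / n_missing.where(fillable)
    for col in brand_cols:
        df.loc[fillable & missing[col], col] = fill

This works on df in place, so there is no need for the per-year concat. Note that the division produces floats; cast the columns back with astype(int) afterwards if you need integers.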