I have the following dataset:
| product_name                  | column_value_1 | column_value_2 | column_value_3 | column_value_4 |
|-------------------------------|----------------|----------------|----------------|----------------|
| Coca Cola 1L                  | 1              | NaN            | 2              | NaN            |
| Carbonated drink Coca Cola 1L | NaN            | 1.8            | NaN            | 1              |
| Yellow Cheese                 | NaN            | 5              | NaN            | NaN            |
| Packed Yellow Cheese          | NaN            | NaN            | NaN            | 4.5            |
| Packed Yellow Cheese 1KG      | 3              | NaN            | NaN            | NaN            |
I know that 'Coca Cola 1L' and 'Carbonated drink Coca Cola 1L' are the same product, just as 'Yellow Cheese', 'Packed Yellow Cheese' and 'Packed Yellow Cheese 1KG' are, and I want to combine their column values into a dataset that holds the info for each product in a single row. I'm trying to get the following dataset:
| product_name             | column_value_1 | column_value_2 | column_value_3 | column_value_4 |
|--------------------------|----------------|----------------|----------------|----------------|
| Coca Cola 1L             | 1              | 1.8            | 2              | 1              |
| Packed Yellow Cheese 1KG | 3              | 5              | NaN            | 4.5            |
How is this achievable? I'm also looking for a solution where I can use a dictionary of similar products.
CodePudding user response:
Given that you somehow know which products are the same and you have this relationship in a map such as:
products_map = {
    "Carbonated drink Coca Cola 1L": "Coca Cola 1L",
    "Coca Cola 1L": "Coca Cola 1L",
    "Yellow Cheese": "Packed Yellow Cheese 1KG",
    "Packed Yellow Cheese": "Packed Yellow Cheese 1KG",
    "Packed Yellow Cheese 1KG": "Packed Yellow Cheese 1KG"
}
You can apply this map to the pandas series with df.product_name.map(products_map). After that you just need to group by this column and aggregate the values as you wish.
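For completeness, here is a minimal sketch of that approach against the sample data from the question; it uses first() as the aggregation, which is one possible choice (it takes the first non-NaN value per column in each group and so matches the desired output), but any other aggregation works the same way:
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "product_name": ["Coca Cola 1L", "Carbonated drink Coca Cola 1L", "Yellow Cheese",
                     "Packed Yellow Cheese", "Packed Yellow Cheese 1KG"],
    "column_value_1": [1, np.nan, np.nan, np.nan, 3],
    "column_value_2": [np.nan, 1.8, 5, np.nan, np.nan],
    "column_value_3": [2, np.nan, np.nan, np.nan, np.nan],
    "column_value_4": [np.nan, 1, np.nan, 4.5, np.nan],
})

# Map every name to its canonical product name
products_map = {
    "Carbonated drink Coca Cola 1L": "Coca Cola 1L",
    "Coca Cola 1L": "Coca Cola 1L",
    "Yellow Cheese": "Packed Yellow Cheese 1KG",
    "Packed Yellow Cheese": "Packed Yellow Cheese 1KG",
    "Packed Yellow Cheese 1KG": "Packed Yellow Cheese 1KG",
}
df["product_name"] = df["product_name"].map(products_map)

# groupby().first() picks the first non-NaN value per column in each group
result = df.groupby("product_name", as_index=False).first()
print(result)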
CodePudding user response:
Thank you for your question. I'm not sure this approach will work well if you have a lot of data, but it should give you an idea.
Solution:
import numpy as np
import pandas as pd

data = {"product_name": ['one', 'oneee', 'two', 'twwo', 'twooo'],
        "column_value_1": [1, np.nan, np.nan, np.nan, 3],
        "column_value_2": [np.nan, 1.8, 5, np.nan, np.nan],
        "column_value_3": [2, np.nan, np.nan, np.nan, np.nan],
        "column_value_4": [np.nan, 1, np.nan, 4.5, np.nan]}
df = pd.DataFrame(data)

# Rename similar products to one common name
df.loc[df['product_name'].str.contains("one"), "product_name"] = "one"
df.loc[df["product_name"] != "one", "product_name"] = "two"  # Or ['two', 'two', 'two']
print(df.groupby("product_name").sum())

# You can store the result in a new variable
new_df = df.groupby("product_name").sum()
# If you want NaN values instead of 0.0:
# new_df.replace(0.0, np.nan, inplace=True)
# print(new_df)
Output:
              column_value_1  column_value_2  column_value_3  column_value_4
product_name
one                      1.0             1.8             2.0             1.0
two                      3.0             5.0             0.0             4.5
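As an alternative to the commented-out replace above, a small sketch (assuming a reasonably recent pandas version) is to pass min_count=1 to sum, so a group whose values are all NaN stays NaN instead of becoming 0.0:
# sum() with min_count=1 returns NaN when a group has no non-NaN values
new_df = df.groupby("product_name").sum(min_count=1)
print(new_df)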
Hope this helps.