I have the following dataset:
| product_name                  | column_value_1 | column_value_2 | column_value_3 | column_value_4 |
|-------------------------------|----------------|----------------|----------------|----------------|
| Coca Cola 1L                  | 1              | NaN            | 2              | NaN            |
| Carbonated drink Coca Cola 1L | NaN            | 1.8            | NaN            | 1              |
| Yellow Cheese                 | NaN            | 5              | NaN            | NaN            |
| Packed Yellow Cheese          | NaN            | NaN            | NaN            | 4.5            |
| Packed Yellow Cheese 1KG      | 3              | NaN            | NaN            | NaN            |
I know that 'Coca Cola 1L' and 'Carbonated drink Coca Cola 1L' are the same product, just as 'Yellow Cheese', 'Packed Yellow Cheese' and 'Packed Yellow Cheese 1KG' are, and I want to combine their column values into a dataset that holds the info for each product in a single row. I'm trying to get the following dataset:
| product_name             | column_value_1 | column_value_2 | column_value_3 | column_value_4 |
|--------------------------|----------------|----------------|----------------|----------------|
| Coca Cola 1L             | 1              | 1.8            | 2              | 1              |
| Packed Yellow Cheese 1KG | 3              | 5              | NaN            | 4.5            |
How is this achievable? I'm also looking for a solution where I can use a dictionary of similar products.
CodePudding user response:
Given that you somehow know which products are the same and you have this relationship in a map such as:
products_map = {
    "Carbonated drink Coca Cola 1L": "Coca Cola 1L",
    "Coca Cola 1L": "Coca Cola 1L",
    "Yellow Cheese": "Packed Yellow Cheese 1KG",
    "Packed Yellow Cheese": "Packed Yellow Cheese 1KG",
    "Packed Yellow Cheese 1KG": "Packed Yellow Cheese 1KG"
}
You can apply this map to the pandas series with df.product_name.map(products_map). After that you just need to group by this column and aggregate the values as you wish.
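For completeness, here is a minimal sketch of that approach against the sample data from the question; it uses first() as the aggregation, which is one possible choice (it takes the first non-NaN value per column in each group and so matches the desired output), but any other aggregation works the same way:
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "product_name": ["Coca Cola 1L", "Carbonated drink Coca Cola 1L", "Yellow Cheese",
                     "Packed Yellow Cheese", "Packed Yellow Cheese 1KG"],
    "column_value_1": [1, np.nan, np.nan, np.nan, 3],
    "column_value_2": [np.nan, 1.8, 5, np.nan, np.nan],
    "column_value_3": [2, np.nan, np.nan, np.nan, np.nan],
    "column_value_4": [np.nan, 1, np.nan, 4.5, np.nan],
})

# Map every name to its canonical product name
products_map = {
    "Carbonated drink Coca Cola 1L": "Coca Cola 1L",
    "Coca Cola 1L": "Coca Cola 1L",
    "Yellow Cheese": "Packed Yellow Cheese 1KG",
    "Packed Yellow Cheese": "Packed Yellow Cheese 1KG",
    "Packed Yellow Cheese 1KG": "Packed Yellow Cheese 1KG",
}
df["product_name"] = df["product_name"].map(products_map)

# groupby().first() picks the first non-NaN value per column in each group
result = df.groupby("product_name", as_index=False).first()
print(result)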
CodePudding user response:
Thank you for your question. I'm not sure this approach will work well if you have a lot of data, but it should give you an idea.
Solution:
import numpy as np
import pandas as pd

data = {"product_name": ['one', 'oneee', 'two', 'twwo', 'twooo'],
        "column_value_1": [1, np.nan, np.nan, np.nan, 3],
        "column_value_2": [np.nan, 1.8, 5, np.nan, np.nan],
        "column_value_3": [2, np.nan, np.nan, np.nan, np.nan],
        "column_value_4": [np.nan, 1, np.nan, 4.5, np.nan]}
df = pd.DataFrame(data)

# Rename similar products to one common name
df.loc[df['product_name'].str.contains("one"), "product_name"] = "one"
df.loc[df["product_name"] != "one", "product_name"] = "two"  # Or ['two', 'two', 'two']
print(df.groupby("product_name").sum())

# You can store the result in a new variable
new_df = df.groupby("product_name").sum()
# If you want NaN values instead of 0.0:
# new_df.replace(0.0, np.nan, inplace=True)
# print(new_df)
Output:
              column_value_1  column_value_2  column_value_3  column_value_4
product_name
one                      1.0             1.8             2.0             1.0
two                      3.0             5.0             0.0             4.5
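As an alternative to the commented-out replace above, a small sketch (assuming a reasonably recent pandas version) is to pass min_count=1 to sum, so a group whose values are all NaN stays NaN instead of becoming 0.0:
# sum() with min_count=1 returns NaN when a group has no non-NaN values
new_df = df.groupby("product_name").sum(min_count=1)
print(new_df)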
Hope this helps.