How many times is a certain string mentioned in a DataFrame column?

I have a dictionary where each key is a color name and each value is a list of color names similar to the key color:

all_colors = {
    'red': ['coral','burgundy'],
    'yellow':['mustard','lemon']}

I also have a pandas DataFrame:

import pandas as pd

df = pd.DataFrame(
    {'market_color': ['red',
                      'coral',
                      'burgundy',
                      'light red',
                      'mustard',
                      'lemon',
                      'red'],
     'color_id': [1, 2, 3, 4, 5, 6, 7]})

I want to count how many times each color name from all_colors, together with its similar colors, is mentioned in the DataFrame's market_color column.

I expect a final dictionary like this: all_colors_frequencies = {'red': 5, 'yellow': 2}

How can I achieve this?

CodePudding user response：

You can define a function that iterates through the mapping and checks whether the value equals a key or appears in that key's list of similar colors:

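def categorize(col, mapping):
    # Return the key color that `col` equals or is listed under
    # (the parameter is named `mapping` to avoid shadowing the built-in map)
    for key, color_list in mapping.items():
        if col == key or col in color_list:
            return key
    # No key color matched
    return "Unknown"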

Then apply that function to the market_color column and use value_counts to get the final count for each key:

df.market_color.apply(lambda col: categorize(col, all_colors)).value_counts()

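The following snippet (note that 'light red' has been added to the list for 'red', because the function only matches exact names):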

import pandas as pd

all_colors = {'red': ['coral', 'burgundy', 'light red'], 'yellow': ['mustard', 'lemon']}
df = pd.DataFrame({
    'market_color': ['red', 'coral', 'burgundy', 'light red', 'mustard', 'lemon', 'red'],
    'color_id': [1, 2, 3, 4, 5, 6, 7]
})

def categorize(col, mapping):
    # Return the key color that `col` equals or is listed under
    for key, color_list in mapping.items():
        if col == key or col in color_list:
            return key
    return "Unknown"

print(df.market_color.apply(lambda col: categorize(col, all_colors)).value_counts())

Would give the following output:

red       5
yellow    2
Name: market_color, dtype: int64
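
The question asks for a plain dictionary rather than a Series. Here is a minimal sketch, assuming the same exact-match logic as above, that appends to_dict() to the result. It uses the question's original all_colors, where 'light red' is not listed, to show that such values end up under "Unknown" unless they are added to the dictionary. The parameter name mapping is just an illustrative choice.

```python
import pandas as pd

# Original dictionary from the question: 'light red' is NOT listed under 'red'
all_colors = {'red': ['coral', 'burgundy'], 'yellow': ['mustard', 'lemon']}

df = pd.DataFrame({
    'market_color': ['red', 'coral', 'burgundy', 'light red',
                     'mustard', 'lemon', 'red'],
})

def categorize(col, mapping):
    # Exact match against the key itself or its list of similar colors
    for key, color_list in mapping.items():
        if col == key or col in color_list:
            return key
    return "Unknown"

# value_counts() returns a Series; to_dict() converts it to a plain dict
counts = (
    df['market_color']
    .apply(lambda col: categorize(col, all_colors))
    .value_counts()
    .to_dict()
)
print(counts)
```

With this input, 'light red' is counted as "Unknown", so 'red' gets 4 rather than the 5 the question expects.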

CodePudding user response：

One approach using str.replace and str.extract:

reverse_lookup = {v: k for k, vs in all_colors.items() for v in vs}


def repl(m):
    return reverse_lookup[m.group()]


# map similar colors to key colors
normal = df["market_color"].str.replace("|".join(reverse_lookup), repl=repl, regex=True)

# extract only colors, i.e. light red -> red
colors_only = normal.str.extract(f'({"|".join(all_colors)})', expand=False)

# count and transform to dict
res = colors_only.value_counts().to_dict()
print(res)

Output

{'red': 5, 'yellow': 2}
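
A possible variant of the same idea, sketched under the assumption that color names should match only as whole words. The reverse lookup also maps each key color to itself, the names are escaped with re.escape, and a single str.extract followed by map replaces the two regex passes. The names lookup and pattern are illustrative and not part of the answer above.

```python
import re

import pandas as pd

all_colors = {'red': ['coral', 'burgundy'], 'yellow': ['mustard', 'lemon']}

df = pd.DataFrame({
    'market_color': ['red', 'coral', 'burgundy', 'light red',
                     'mustard', 'lemon', 'red'],
})

# Map every known name (key colors included) to its key color
lookup = {name: key
          for key, names in all_colors.items()
          for name in [key, *names]}

# Longest names first so that a longer name wins over a shorter one it contains
pattern = "|".join(re.escape(name)
                   for name in sorted(lookup, key=len, reverse=True))

# Pull the first known color name out of each value, e.g. 'light red' -> 'red'
matched = df['market_color'].str.extract(rf"\b({pattern})\b", expand=False)

# Translate names to key colors; values with no match become NaN and are
# dropped by value_counts()
res = matched.map(lookup).value_counts().to_dict()
print(res)
```

Values that contain no known color name are left out of the result rather than being counted under a separate label.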