Categorize column of strings by category name in new column

Time:10-30

I am trying to carry out what should be a pretty simple procedure in Python, but I am having trouble searching for help because I don't know how to put what I am trying to do into searchable words. I am not sure whether it is called reclassifying, using a conditional statement, or something else. Here is an example of what I am trying to do. I have the following DataFrame:

Color     Value
----------------
blue         43
blue         53
blue         25 
orange       44 
orange       33 
orange       35
red          66
red          43
red          65
green        44  
green        35
green        24
green        34 

Now, I want to categorize these colors by whether they are primary or secondary colors, where blue and red are primary colors and orange and green are secondary colors. So I want to create the following DataFrame:

Color     Value      Category
------------------------------
blue         43       Primary
blue         53       Primary
blue         25       Primary
orange       44     Secondary
orange       33     Secondary 
orange       35     Secondary
red          66       Primary
red          43       Primary
red          65       Primary
green        44     Secondary  
green        35     Secondary
green        24     Secondary
green        34     Secondary 

I am not sure whether this requires creating a dictionary or whether I can just apply a simple conditional statement to my DataFrame. How can this be done in Python?

CodePudding user response:

You can use a simple np.where (with numpy imported as np):

import numpy as np

df['Category'] = np.where(df['Color'].str.contains('blue|red'), 'Primary', 'Secondary')

or map the boolean result and assign it to the new column:

df['Category'] = df['Color'].str.contains('blue|red').map({True: 'Primary', False: 'Secondary'})
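
One caveat with str.contains is that it performs a regular-expression search, so the pattern 'blue|red' also matches longer strings that merely contain those substrings. The sketch below compares it with Series.isin, which matches whole values only. The 'reddish' row is a made-up value added purely to show the difference; it is not part of the question's data.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; 'reddish' is a hypothetical extra value
df = pd.DataFrame({
    'Color': ['blue', 'orange', 'red', 'green', 'reddish'],
    'Value': [43, 44, 66, 44, 10],
})

# Regex search: True wherever 'blue' or 'red' appears anywhere in the string
substring_match = df['Color'].str.contains('blue|red')

# Whole-value comparison: True only for exactly 'blue' or 'red'
exact_match = df['Color'].isin(['blue', 'red'])

df['Category'] = np.where(exact_match, 'Primary', 'Secondary')
```

With the question's data both approaches give the same result; isin is probably the safer choice if the column might hold values such as 'reddish' or 'blue-green'.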

CodePudding user response:

Assuming we want to categorize all colours that fall into these categories, the easiest way is to establish a mapping:

colors = {
    'Primary': ['red', 'blue', 'yellow'],
    'Secondary': ['orange', 'purple', 'green']
}

*Note: the dictionary is built this way for convenience, as it assumes there are more colours than categories.


We can then reformat it into a valid mapper for Series.map with a dictionary comprehension:

color_map = {k: v for v, lst in colors.items() for k in lst}
df['Category'] = df['Color'].map(color_map)

df:

     Color  Value   Category
0     blue     43    Primary
1     blue     53    Primary
2     blue     25    Primary
3   orange     44  Secondary
4   orange     33  Secondary
5   orange     35  Secondary
6      red     66    Primary
7      red     43    Primary
8      red     65    Primary
9    green     44  Secondary
10   green     35  Secondary
11   green     24  Secondary
12   green     34  Secondary

color_map for reference (this is the format the dictionary needs in order to work with Series.map; however, it is less human-readable than the format of the colors dictionary):

{'red': 'Primary', 'blue': 'Primary', 'yellow': 'Primary', 
 'orange': 'Secondary', 'purple': 'Secondary', 'green': 'Secondary'}
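
If the dictionary comprehension is hard to read, it can be written out as a nested loop. This is only an equivalent sketch of the same inversion, using the colors dictionary from above:

```python
colors = {
    'Primary': ['red', 'blue', 'yellow'],
    'Secondary': ['orange', 'purple', 'green']
}

# Invert {category: [colours]} into {colour: category}
color_map = {}
for category, color_list in colors.items():
    for color in color_list:
        color_map[color] = category
```

Each colour becomes a key and its category becomes the value, which is the lookup direction Series.map needs.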

We can also chain a str.lower if we expect mixed casing in the Color column:

df['Category'] = df['Color'].str.lower().map(color_map)
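
Note that Series.map with a plain dictionary returns NaN for any value that is not a key in the dictionary. A minimal sketch of handling that case with fillna follows; the colour 'teal' and the fallback label 'Other' are made up for illustration and do not come from the question.

```python
import pandas as pd

color_map = {'red': 'Primary', 'blue': 'Primary', 'yellow': 'Primary',
             'orange': 'Secondary', 'purple': 'Secondary', 'green': 'Secondary'}

# 'teal' is a hypothetical colour that is missing from color_map
df = pd.DataFrame({'Color': ['Blue', 'orange', 'teal'], 'Value': [43, 44, 12]})

# Lower-case first so 'Blue' matches the 'blue' key; unmapped colours become NaN
mapped = df['Color'].str.lower().map(color_map)

# Replace NaN with a fallback label ('Other' is an arbitrary choice)
df['Category'] = mapped.fillna('Other')
```

Leaving the NaN values in place instead can be useful for spotting colours that still need to be added to the mapping.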

Setup and imports:

import pandas as pd

df = pd.DataFrame({
    'Color': ['blue', 'blue', 'blue', 'orange', 'orange', 'orange', 'red',
              'red', 'red', 'green', 'green', 'green', 'green'],
    'Value': [43, 53, 25, 44, 33, 35, 66, 43, 65, 44, 35, 24, 34]
})