Creating a function which creates a new column based on the values of other columns in a dataframe

Here is a simplified version of the DF in question:

import pandas as pd

df = pd.DataFrame({'type': ['terrier', 'toy', 'toy', 'toy', 'hound', 'terrier',
                            'terrier', 'terrier', 'terrier', 'hound'],
                   'breed': ['yorkshire_terrier', 'king_charles_spaniel', 'poodle', 'shih_tzu',
                             'greyhound', 'west_highland', 'bull_terrier', 'fox_terrier',
                             'west_highland', 'afghan'],
                   'colour': ['pink', 'orange', 'brown', 'purple', 'grey', 'white',
                              'black', 'cream', 'brown', 'brown']})
    
df


    type         breed                  colour
0   terrier     yorkshire_terrier       pink
1   toy         king_charles_spaniel    orange
2   toy         poodle                  brown
3   toy         shih_tzu                purple
4   hound       greyhound               grey
5   terrier     west_highland           white
6   terrier     bull_terrier            black
7   terrier     fox_terrier             cream
8   terrier     west_highland           brown
9   hound       afghan                  brown

Using the function below, I am able to create a new new_colour column with the rules presented in these dictionaries.

Dictionaries:

toy = {'black' : ['poodle', 'shih_tzu'], 
       'mixed' : 'king_charles_spaniel',
       'white' : ['poodle', 'shih_tzu']}

terrier = {'black_brown' : ['yorkshire_terrier','bull_terrier'],
           'white' : 'west_highland',
           'white_orange' : 'fox_terrier'}

hound = {'brindle' : 'greyhound',
         'brown' : 'afghan'}

Function:

def colours(x):
    for dog in [hound,toy,terrier]:
        for colour in dog:
            if x in dog[colour]:
                return colour

df['new_colour']=df['breed'].map(colours)

Output:

    type    breed                 colour    new_colour
0   terrier yorkshire_terrier     pink      black_brown
1   toy     king_charles_spaniel  orange    mixed
2   toy     poodle                black     white
3   toy     shih_tzu              purple    black
4   hound   greyhound             grey      brindle
5   terrier west_highland         white     white
6   terrier bull_terrier          black     black_brown
7   terrier fox_terrier           cream     white_orange
8   terrier west_highland         brown     white
9   hound   afghan                brown     brown

The problem, however, is with poodle (and many more cases in the real DF). According to the rules in the dictionaries, a poodle can be white or black. It was originally labelled black in the colour column, but new_colour says white. That is a possible value, but where the original colour is already allowed I would like to keep it as the correct colour.

CodePudding user response:

Let’s first generate all the tuples of allowed configurations:

>>> # Loop over the dicts one by one: merging them with {**toy, **terrier, **hound}
>>> # would keep only the last entry for a colour that appears in several dicts.
>>> ok_tuples = [(breed, color) for rules in (toy, terrier, hound)
...                             for color, breeds in rules.items()
...                             for breed in (breeds if isinstance(breeds, list) else [breeds])]
...
>>> df_colors = pd.DataFrame(ok_tuples, columns=['breed', 'colour'])
>>> df_colors
                   breed        colour
0                 poodle         black
1               shih_tzu         black
2   king_charles_spaniel         mixed
3                 poodle         white
4               shih_tzu         white
5      yorkshire_terrier   black_brown
6           bull_terrier   black_brown
7          west_highland         white
8            fox_terrier  white_orange
9              greyhound       brindle
10                afghan         brown

That way, a simple .merge shows which colours are allowed:

>>> df.merge(df_colors, how='left', on=['breed', 'colour'], indicator=True)
      type                 breed  colour     _merge
0  terrier     yorkshire_terrier    pink  left_only
1      toy  king_charles_spaniel  orange  left_only
2      toy                poodle   brown  left_only
3      toy              shih_tzu  purple  left_only
4    hound             greyhound    grey  left_only
5  terrier         west_highland   white       both
6  terrier          bull_terrier   black  left_only
7  terrier           fox_terrier   cream  left_only
8  terrier         west_highland   brown  left_only
9    hound                afghan   brown       both
>>> colour_ok = df.merge(df_colors, how='left', indicator=True)['_merge'].eq('both')
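To see what indicator=True does in isolation, here is a minimal sketch on a made-up three-row frame (the names left, allowed, flagged and is_allowed are illustrative and not from the question). A left merge keeps every row of the left frame and labels it both when its (breed, colour) pair also exists in the right frame, and left_only otherwise.

```python
import pandas as pd

# Hypothetical toy data, not the frame from the question.
left = pd.DataFrame({'breed': ['poodle', 'poodle', 'afghan'],
                     'colour': ['black', 'pink', 'brown']})
allowed = pd.DataFrame({'breed': ['poodle', 'poodle', 'afghan'],
                        'colour': ['black', 'white', 'brown']})

# how='left' keeps every row of `left`, in order. With no `on=`, pandas
# joins on the columns the two frames share (here breed and colour).
flagged = left.merge(allowed, how='left', indicator=True)

# True where the (breed, colour) pair is an allowed combination.
is_allowed = flagged['_merge'].eq('both')
print(is_allowed.tolist())  # [True, False, True]
```

This only stays one-to-one with the left frame as long as the right frame has no duplicate (breed, colour) rows.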

Now we can keep the allowed colours with .where or remove them with .mask:

>>> df['colour'].where(colour_ok)
0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
5    white
6      NaN
7      NaN
8      NaN
9    brown
Name: colour, dtype: object
>>> df['colour'].mask(colour_ok)
0      pink
1    orange
2     brown
3    purple
4      grey
5       NaN
6     black
7     cream
8     brown
9       NaN
Name: colour, dtype: object
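Both methods also accept a second argument that supplies replacement values instead of NaN, which is what the last step relies on. A minimal sketch with made-up Series (s, fallback and keep are illustrative names):

```python
import pandas as pd

s = pd.Series(['pink', 'white', 'brown'])
fallback = pd.Series(['black', 'black', 'brindle'])
keep = pd.Series([False, True, True])

# where: keep s where the condition is True, otherwise take fallback.
kept = s.where(keep, fallback)

# mask: the mirror image, replace s where the condition is True.
replaced = s.mask(keep, fallback)

print(kept.tolist())      # ['black', 'white', 'brown']
print(replaced.tolist())  # ['pink', 'black', 'brindle']
```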

So we can finally merge to get the new_colour and then use .mask to restore the values that were already correct:

>>> assign_new_colour = df_colors.drop_duplicates(subset=['breed']).rename(columns={'colour': 'new_colour'})
>>> # how='left' keeps df's row order, so colour_ok still lines up with the rows
>>> df = df.merge(assign_new_colour, how='left', on='breed')
>>> df['new_colour'] = df['new_colour'].mask(colour_ok, df['colour'])
>>> df
      type                 breed  colour    new_colour
0  terrier     yorkshire_terrier    pink   black_brown
1      toy  king_charles_spaniel  orange         mixed
2      toy                poodle   brown         black
3      toy              shih_tzu  purple         black
4    hound             greyhound    grey       brindle
5  terrier         west_highland   white         white
6  terrier          bull_terrier   black   black_brown
7  terrier           fox_terrier   cream  white_orange
8  terrier         west_highland   brown         white
9    hound                afghan   brown         brown
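The steps above can be collected into one helper. This is only a sketch under assumptions: the function name add_new_colour and its rules parameter are hypothetical, and the demo data is made up. It uses left merges so the row order of the input is preserved, and converts the mask to a NumPy array so it is applied by position even if the input has a non-default index.

```python
import pandas as pd

def add_new_colour(df, rules):
    """Return a copy of df with a new_colour column.

    rules is an iterable of {colour: breed or list of breeds} dicts.
    (Hypothetical helper; the name and signature are not from the question.)
    """
    # One (breed, colour) row per allowed combination. Looping over the
    # dicts one at a time keeps colours that appear in several dicts.
    pairs = [(breed, colour)
             for rule in rules
             for colour, breeds in rule.items()
             for breed in (breeds if isinstance(breeds, list) else [breeds])]
    allowed = pd.DataFrame(pairs, columns=['breed', 'colour']).drop_duplicates()

    # Rows whose existing (breed, colour) pair is already allowed.
    merged = df.merge(allowed, how='left', on=['breed', 'colour'], indicator=True)
    colour_ok = merged['_merge'].eq('both').to_numpy()

    # Default colour per breed: the first one listed in the rules.
    default = allowed.drop_duplicates(subset='breed').set_index('breed')['colour']

    out = df.copy()
    out['new_colour'] = out['breed'].map(default).mask(colour_ok, out['colour'])
    return out

# Made-up demo data: a white poodle keeps its colour, a pink one does not.
demo_toy = {'black': ['poodle', 'shih_tzu'], 'white': ['poodle', 'shih_tzu']}
demo_hound = {'brown': 'afghan'}
demo = pd.DataFrame({'breed': ['poodle', 'poodle', 'afghan'],
                     'colour': ['white', 'pink', 'brown']})
result = add_new_colour(demo, [demo_toy, demo_hound])
print(result['new_colour'].tolist())  # ['white', 'black', 'brown']
```

A breed that appears in none of the dictionaries ends up with NaN in new_colour, because .map finds no default for it.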

CodePudding user response:

You can modify your colours function:

def colours(x):
    possibilities=[]
    for dog in [hound,toy,terrier]:
        for colour in dog:
            
            if x in dog[colour]:
                possibilities.append(colour)
            
    if df[df.breed==x].colour.values[0] in possibilities:
        return df[df.breed==x].colour.values[0]
    else:
        return possibilities[0]

This assumes that the dataset you are working on is named df; otherwise you can pass it as an argument to colours:

def colours(x,df):
    possibilities=[]
    for dog in [hound,toy,terrier]:
        for colour in dog:
            
            if x in dog[colour]:
                possibilities.append(colour)
            
    if df[df.breed==x].colour.values[0] in possibilities:
        return df[df.breed==x].colour.values[0]
    else:
        return possibilities[0]

df['new_colour']=df['breed'].map(lambda x: colours(x,df))
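One caveat with mapping over breed alone: df[df.breed==x].colour.values[0] always reads the colour of the first row with that breed, so two rows of the same breed with different colours get the same answer. A possible row-wise variant, sketched under the assumption that the dictionaries are passed in explicitly (colours_for_row, its rules parameter and the demo data are hypothetical), looks at each row's own colour instead:

```python
import pandas as pd

def colours_for_row(row, rules):
    """Pick a colour for one row (hypothetical helper, not from the answer)."""
    possibilities = []
    for rule in rules:
        for colour, breeds in rule.items():
            # Wrap a single breed name in a list so `in` tests membership,
            # not substring containment.
            if row['breed'] in (breeds if isinstance(breeds, list) else [breeds]):
                possibilities.append(colour)
    if row['colour'] in possibilities:
        return row['colour']  # the existing colour is allowed: keep it
    # Otherwise fall back to the first allowed colour, or None if the breed
    # is not covered by any rule.
    return possibilities[0] if possibilities else None

# Made-up demo data: three poodles with different original colours.
demo_rules = [{'black': ['poodle', 'shih_tzu'], 'white': ['poodle', 'shih_tzu']}]
demo = pd.DataFrame({'breed': ['poodle', 'poodle', 'poodle'],
                     'colour': ['white', 'black', 'brown']})
demo['new_colour'] = demo.apply(lambda row: colours_for_row(row, demo_rules), axis=1)
print(demo['new_colour'].tolist())  # ['white', 'black', 'black']
```

Applying a Python function row by row is slower than the merge-based approach on large frames, so this is mainly useful when the frame is small or the rules are awkward to express as a lookup table.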