Here is a simplified version of the DF in question:
import pandas as pd

df = pd.DataFrame({'type': ['terrier', 'toy', 'toy', 'toy', 'hound', 'terrier',
                            'terrier', 'terrier', 'terrier', 'hound'],
                   'breed': ['yorkshire_terrier', 'king_charles_spaniel', 'poodle', 'shih_tzu',
                             'greyhound', 'west_highland', 'bull_terrier', 'fox_terrier',
                             'west_highland', 'afghan'],
                   'colour': ['pink', 'orange', 'brown', 'purple', 'grey', 'white',
                              'black', 'cream', 'brown', 'brown']})
df
      type                 breed  colour
0  terrier     yorkshire_terrier    pink
1      toy  king_charles_spaniel  orange
2      toy                poodle   brown
3      toy              shih_tzu  purple
4    hound             greyhound    grey
5  terrier         west_highland   white
6  terrier          bull_terrier   black
7  terrier           fox_terrier   cream
8  terrier         west_highland   brown
9    hound                afghan   brown
Using the function below, I am able to create a new_colour column that follows the rules in these dictionaries.
Dictionaries:
toy = {'black': ['poodle', 'shih_tzu'],
       'mixed': 'king_charles_spaniel',
       'white': ['poodle', 'shih_tzu']}
terrier = {'black_brown': ['yorkshire_terrier', 'bull_terrier'],
           'white': 'west_highland',
           'white_orange': 'fox_terrier'}
hound = {'brindle': 'greyhound',
         'brown': 'afghan'}
Function:
def colours(x):
    for dog in [hound, toy, terrier]:
        for colour in dog:
            if x in dog[colour]:
                return colour

df['new_colour'] = df['breed'].map(colours)
Output:
      type                 breed  colour    new_colour
0  terrier     yorkshire_terrier    pink   black_brown
1      toy  king_charles_spaniel  orange         mixed
2      toy                poodle   black         white
3      toy              shih_tzu  purple         black
4    hound             greyhound    grey       brindle
5  terrier         west_highland   white         white
6  terrier          bull_terrier   black   black_brown
7  terrier           fox_terrier   cream  white_orange
8  terrier         west_highland   brown         white
9    hound                afghan   brown         brown
The problem, however, is with poodle (and with many more cases in the real DF). According to the rules in the dictionaries, a poodle can be white or black. It was originally labelled black in the colour column, but new_colour says white. That is a possible colour, but whenever the original colour is already allowed for the breed, I would like to keep it as the correct colour.
CodePudding user response:
Let’s first generate all the tuples of allowed configurations:
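>>> # Loop over the dictionaries one by one: merging them with {**toy, **terrier, **hound}
>>> # would let terrier's 'white' entry overwrite toy's and drop the white poodle/shih_tzu.
>>> ok_tuples = [(breed, colour)
...              for rules in (toy, terrier, hound)
...              for colour, breeds in rules.items()
...              for breed in (breeds if isinstance(breeds, list) else [breeds])]
>>> df_colors = pd.DataFrame(ok_tuples, columns=['breed', 'colour'])
>>> df_colors
                   breed        colour
0                 poodle         black
1               shih_tzu         black
2   king_charles_spaniel         mixed
3                 poodle         white
4               shih_tzu         white
5      yorkshire_terrier   black_brown
6           bull_terrier   black_brown
7          west_highland         white
8            fox_terrier  white_orange
9              greyhound       brindle
10                afghan         brown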
That way, a simple .merge shows which colours are allowed:
>>> df.merge(df_colors, how='left', on=['breed', 'colour'], indicator=True)
      type                 breed  colour     _merge
0  terrier     yorkshire_terrier    pink  left_only
1      toy  king_charles_spaniel  orange  left_only
2      toy                poodle   brown  left_only
3      toy              shih_tzu  purple  left_only
4    hound             greyhound    grey  left_only
5  terrier         west_highland   white       both
6  terrier          bull_terrier   black  left_only
7  terrier           fox_terrier   cream  left_only
8  terrier         west_highland   brown  left_only
9    hound                afghan   brown       both
>>> colour_ok = df.merge(df_colors, how='left', indicator=True)['_merge'].eq('both')
Now we can keep the allowed colours with .where or remove them with .mask:
>>> df['colour'].where(colour_ok)
0      NaN
1      NaN
2      NaN
3      NaN
4      NaN
5    white
6      NaN
7      NaN
8      NaN
9    brown
Name: colour, dtype: object
>>> df['colour'].mask(colour_ok)
0      pink
1    orange
2     brown
3    purple
4      grey
5       NaN
6     black
7     cream
8     brown
9       NaN
Name: colour, dtype: object
Finally, we can merge to get new_colour and then use .mask to restore the values that were already correct:
>>> assign_new_colour = df_colors.drop_duplicates(subset=['breed']).rename(columns={'colour': 'new_colour'})
>>> # how='left' keeps the rows of df in their original order, so colour_ok still lines up
>>> df = df.merge(assign_new_colour, how='left', on='breed')
>>> df['new_colour'] = df['new_colour'].mask(colour_ok, df['colour'])
>>> df
      type                 breed  colour    new_colour
0  terrier     yorkshire_terrier    pink   black_brown
1      toy  king_charles_spaniel  orange         mixed
2      toy                poodle   brown         black
3      toy              shih_tzu  purple         black
4    hound             greyhound    grey       brindle
5  terrier         west_highland   white         white
6  terrier          bull_terrier   black   black_brown
7  terrier           fox_terrier   cream  white_orange
8  terrier         west_highland   brown         white
9    hound                afghan   brown         brown
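The steps above can be collected into one runnable sketch. It assumes the three dictionaries from the question and a cut-down sample frame that includes a white poodle, which the question's data does not. The helper name allowed_pairs is hypothetical. The dictionaries are walked one at a time so that a colour key shared by two of them (such as 'white') is never overwritten.

```python
import pandas as pd

toy = {'black': ['poodle', 'shih_tzu'],
       'mixed': 'king_charles_spaniel',
       'white': ['poodle', 'shih_tzu']}
terrier = {'black_brown': ['yorkshire_terrier', 'bull_terrier'],
           'white': 'west_highland',
           'white_orange': 'fox_terrier'}
hound = {'brindle': 'greyhound',
         'brown': 'afghan'}

# Hypothetical sample data, reduced from the question's frame.
df = pd.DataFrame({
    'breed': ['poodle', 'poodle', 'west_highland', 'fox_terrier', 'afghan'],
    'colour': ['white', 'brown', 'white', 'cream', 'brown'],
})


def allowed_pairs(*rule_dicts):
    """Return a frame with every (breed, colour) pair the rules permit."""
    pairs = []
    for rules in rule_dicts:
        for colour, breeds in rules.items():
            # A value is either one breed name or a list of breed names.
            for breed in (breeds if isinstance(breeds, list) else [breeds]):
                pairs.append((breed, colour))
    return pd.DataFrame(pairs, columns=['breed', 'colour']).drop_duplicates()


df_colors = allowed_pairs(toy, terrier, hound)

# True where the row's existing (breed, colour) combination is allowed.
merged = df.merge(df_colors, how='left', on=['breed', 'colour'], indicator=True)
colour_ok = merged['_merge'].eq('both').to_numpy()

# Fallback colour per breed: the first colour listed for it in the rules.
default_colour = (df_colors.drop_duplicates(subset='breed')
                           .set_index('breed')['colour'])

# Keep the original colour where it is allowed, else use the fallback.
df['new_colour'] = df['colour'].where(colour_ok, df['breed'].map(default_colour))
print(df)
```

Converting colour_ok to a NumPy array makes the comparison positional, so the sketch does not depend on df having a default integer index.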
CodePudding user response:
You can modify your colours function:
def colours(x):
    possibilities = []
    for dog in [hound, toy, terrier]:
        for colour in dog:
            if x in dog[colour]:
                possibilities.append(colour)
    if df[df.breed == x].colour.values[0] in possibilities:
        return df[df.breed == x].colour.values[0]
    else:
        return possibilities[0]
This assumes that the DataFrame you are working on is named df; otherwise, you can pass it as an argument to colours:
def colours(x, df):
    possibilities = []
    for dog in [hound, toy, terrier]:
        for colour in dog:
            if x in dog[colour]:
                possibilities.append(colour)
    if df[df.breed == x].colour.values[0] in possibilities:
        return df[df.breed == x].colour.values[0]
    else:
        return possibilities[0]

df['new_colour'] = df['breed'].map(lambda x: colours(x, df))
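One caveat: df[df.breed==x].colour.values[0] reads the colour of the first row with that breed, so every row of a breed gets the same result even when their original colours differ. A row-wise variant that checks each row's own colour might look like the sketch below. The names allowed_colours and pick_colour are hypothetical, and the sample frame is a cut-down stand-in for the real data.

```python
import pandas as pd

toy = {'black': ['poodle', 'shih_tzu'],
       'mixed': 'king_charles_spaniel',
       'white': ['poodle', 'shih_tzu']}
terrier = {'black_brown': ['yorkshire_terrier', 'bull_terrier'],
           'white': 'west_highland',
           'white_orange': 'fox_terrier'}
hound = {'brindle': 'greyhound',
         'brown': 'afghan'}

# Invert the rules once: breed -> allowed colours, in the order listed.
allowed_colours = {}
for rules in (hound, toy, terrier):
    for colour, breeds in rules.items():
        # Wrap a single breed name in a list so matching is by equality,
        # not by substring as `x in 'some_string'` would be.
        for breed in (breeds if isinstance(breeds, list) else [breeds]):
            allowed_colours.setdefault(breed, []).append(colour)


def pick_colour(row):
    """Keep the row's colour if its breed allows it, else the first allowed one."""
    options = allowed_colours.get(row['breed'], [])
    if row['colour'] in options:
        return row['colour']
    # Breeds missing from the rules get None instead of raising IndexError.
    return options[0] if options else None


# Hypothetical sample: two poodles whose original colours differ.
df = pd.DataFrame({
    'breed': ['poodle', 'poodle', 'west_highland', 'greyhound'],
    'colour': ['brown', 'white', 'brown', 'grey'],
})
df['new_colour'] = df.apply(pick_colour, axis=1)
print(df)
```

Here the brown poodle falls back to black while the white poodle keeps white, which the breed-only lookup cannot distinguish.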