I have a dataset that has an 'Unknown' entry in each group, and I would like to move 'Unknown' to the end of each group. For example, I have code like this:
import pandas as pd
clients1 = {'Name': ['A','B','Unknown','Y','Z','A','B','Unknown','Y','Z'],
'group': ['1','1','1','1','1','2','2','2','2','2']
}
df = pd.DataFrame(clients1, columns= ['Name', 'group'])
df
the output:
Name group
A 1
B 1
Unknown 1
Y 1
Z 1
A 2
B 2
Unknown 2
Y 2
Z 2
the desired output should be:
Name group
A 1
B 1
Y 1
Z 1
Unknown 1
A 2
B 2
Y 2
Z 2
Unknown 2
Does anyone know how to do that?
CodePudding user response:
You can sort the data by group and by Name == 'Unknown', so the Unknown rows are marked with True and put at the bottom:
df = (df.assign(unknown_name=df['Name'].eq('Unknown'))
.sort_values(['group','unknown_name'])
)
Output:
Name group unknown_name
0 A 1 False
1 B 1 False
3 Y 1 False
4 Z 1 False
2 Unknown 1 True
5 A 2 False
6 B 2 False
8 Y 2 False
9 Z 2 False
7 Unknown 2 True
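If the helper column is not wanted in the result, a possible variation is sketched below. It assumes pandas 1.1 or later (for the key argument of sort_values) and uses two stable single-column sorts, so rows keep their original relative order inside each group. The name result is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'B', 'Unknown', 'Y', 'Z', 'A', 'B', 'Unknown', 'Y', 'Z'],
    'group': ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2'],
})

result = (
    # First push every 'Unknown' row to the bottom (False sorts before True).
    df.sort_values('Name', key=lambda s: s.eq('Unknown'), kind='stable')
      # Then regroup; a stable sort keeps 'Unknown' last inside each group.
      .sort_values('group', kind='stable')
)
print(result)
```

No extra column is created, so nothing has to be dropped afterwards.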
CodePudding user response:
One option to consider is replacing 'Unknown' with NaN, since sort_values puts NaN last by default:
import numpy as np
df["Name"] = df["Name"].replace('Unknown', np.nan)
df = df.sort_values(by=['group','Name'])
Output:
Name group
0 A 1
1 B 1
3 Y 1
4 Z 1
2 NaN 1
5 A 2
6 B 2
8 Y 2
9 Z 2
7 NaN 2
And of course you can replace the NaNs with 'Unknown' again afterwards.
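The whole round trip might look like the sketch below. It assumes the Name column has no genuine missing values, because those would also be turned into 'Unknown' at the end. Note that this variant sorts the remaining names alphabetically rather than keeping their original order. The name out is only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'B', 'Unknown', 'Y', 'Z', 'A', 'B', 'Unknown', 'Y', 'Z'],
    'group': ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2'],
})

out = (
    # Temporarily mark 'Unknown' as missing.
    df.assign(Name=df['Name'].replace('Unknown', np.nan))
      # NaN is placed last within each group.
      .sort_values(['group', 'Name'], na_position='last')
      # Restore the original label.
      .fillna({'Name': 'Unknown'})
)
print(out)
```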
CodePudding user response:
Put the items in the right order when you build the data:
import pandas as pd
clients1 = {'Name': ['A','B','Y','Z', 'Unknown','A','B','Y','Z', 'Unknown'],
'group': ['1','1','1','1','1','2','2','2','2','2']
}
df = pd.DataFrame(clients1, columns= ['Name', 'group'])
df
CodePudding user response:
You can use category dtype like this:
l = df['Name'].unique().tolist()
u = l.pop(l.index('Unknown'))
l.append(u)
n_dtype = pd.CategoricalDtype(l, ordered=True)
df['Name'] = df['Name'].astype(n_dtype)
df.sort_values(['group', 'Name'])
Output:
Name group
0 A 1
1 B 1
3 Y 1
4 Z 1
2 Unknown 1
5 A 2
6 B 2
8 Y 2
9 Z 2
7 Unknown 2
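One thing to keep in mind: unique() returns names in order of first appearance, so the category order above follows the data. If the names are not already in the order you want, a variation such as the following sketch sorts the other categories alphabetically and appends 'Unknown' last. The sample data and the names names, dtype and out are made up for illustration.

```python
import pandas as pd

# Hypothetical data in which the names are not pre-sorted.
df = pd.DataFrame({
    'Name': ['Z', 'Unknown', 'A', 'B', 'Unknown', 'A'],
    'group': ['1', '1', '1', '2', '2', '2'],
})

# Alphabetical categories, with 'Unknown' forced to the end.
names = sorted(n for n in df['Name'].unique() if n != 'Unknown')
dtype = pd.CategoricalDtype(names + ['Unknown'], ordered=True)

out = df.assign(Name=df['Name'].astype(dtype)).sort_values(['group', 'Name'])
print(out)
```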
CodePudding user response:
This approach needs no sorting and no extra column. It preserves the original order of the rows within each group,
so it should be more generic. The logic is similar to manipulating a string. For each group, it does something like group[:Unknown] + group[Unknown + 1:] + Unknown
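import numpy as np
def func(x):
    # integer positions of the 'Unknown' rows within this group, for use with iloc
    idx = np.flatnonzero(x['Name'] == 'Unknown')
    parts = []
    prev = None
    for i in idx:
        parts.append(x.iloc[prev:i, :])  # block before this Unknown
        prev = i + 1
    parts.append(x.iloc[prev:, :])  # last block after Unknown
    parts.append(x.iloc[idx, :])  # the Unknown records go last
    # DataFrame.append was removed in pandas 2.0, so build the result with concat
    return pd.concat(parts)
# group_keys=False keeps the original index, so no droplevel is needed
df.groupby('group', group_keys=False).apply(func)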
Name group
0 A 1
1 B 1
3 Y 1
4 Z 1
2 Unknown 1
5 A 2
6 B 2
8 Y 2
9 Z 2
7 Unknown 2
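The same idea can be written more compactly with a boolean mask instead of slicing around each position. This is only a sketch: unknown_last and out are hypothetical names, and it loops over the groups directly instead of using apply.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['A', 'B', 'Unknown', 'Y', 'Z', 'A', 'B', 'Unknown', 'Y', 'Z'],
    'group': ['1', '1', '1', '1', '1', '2', '2', '2', '2', '2'],
})

def unknown_last(g):
    """Return the group's rows in original order, with 'Unknown' rows moved to the end."""
    mask = g['Name'].eq('Unknown')
    return pd.concat([g[~mask], g[mask]])

# sort=False keeps the groups in the order they first appear in df.
out = pd.concat([unknown_last(g) for _, g in df.groupby('group', sort=False)])
print(out)
```

Boolean indexing keeps the rows in their original order, so neither sorting nor a helper column is involved.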