I have a dataframe like the one below, containing name and address information.
l_name f_name m_name city_state zip
smith sam auburn,wa 98910
smith sam c auburn,wa 98910
smith sam durham,nc 27659
smith sam eddie auburn,wa 98910
smith sam auburn,wa 98910
:
:
aaron joe chris ocala,fl 58910
aaron joe ocala,fl 58910
aaron joe durham,nc 27659
aaron joe ocala,fl 58910
aaron joe a ocala,fl 58910
I am trying to write code that groups/tags rows into the same group if their l_name, f_name, city_state, and zip columns all match.
Here is what I am hoping the output dataframe will look like.
l_name f_name m_name city_state zip tag
smith sam auburn,wa 98910 1
smith sam c auburn,wa 98910 1
smith sam durham,nc 27659 0
smith sam eddie auburn,wa 98910 1
smith sam auburn,wa 98910 1
clair may seattle,wa 98092 2
clair may a seattle,wa 98092 2
clair may seattle,wa 98092 2
:
:
aaron joe chris ocala,fl 58910 n
aaron joe ocala,fl 58910 n
aaron joe durham,nc 27659 0
aaron joe ocala,fl 58910 n
aaron joe a ocala,fl 58910 n
If a row doesn't belong to any group, it is assigned a 0. Four "sam smith" rows meet the conditions, so they are all in the same group, which is 1. Tag/group 2 will be "may clair", and "joe aaron" will be group n, depending on how many names are stored in the dataframe.
Are there any good ways, methods, or suggestions for writing code that produces the results above?
Thanks!
CodePudding user response:
You can take the groups with count > 1 and assign them their group number (ngroup() + 1), then for the groups with count 1 set tag = 0, and finally concatenate.
import pandas as pd

keys = ['l_name', 'f_name', 'city_state', 'zip']
# rows whose key combination appears more than once form real groups
df1 = df[df.groupby(keys)['m_name'].transform('count') > 1].copy()
df1['tag'] = df1.groupby(keys).ngroup() + 1
# rows whose key combination appears only once get tag 0
df2 = df[df.groupby(keys)['m_name'].transform('count') == 1].copy()
df2['tag'] = 0
df_final = pd.concat([df1, df2])
df_final
l_name f_name m_name city_state zip tag
0 smith sam x auburn,wa 98910 2
1 smith sam c auburn,wa 98910 2
3 smith sam eddie auburn,wa 98910 2
4 smith sam x auburn,wa 98910 2
5 aaron joe chris ocala,fl 58910 1
6 aaron joe x ocala,fl 58910 1
8 aaron joe x ocala,fl 58910 1
9 aaron joe a ocala,fl 58910 1
2 smith sam x durham,nc 27659 0
7 aaron joe x durham,nc 27659 0
CodePudding user response:
You can use ngroup and then count the resulting values:
# create an ID variable with a unique identifier for each group
groupvar = df.groupby(['l_name', 'f_name', 'city_state', 'zip']).ngroup(ascending=False) + 1
# assign it to the data
df = df.assign(tag=groupvar)
# determine which IDs occur only once and assign 0 to them
df.loc[df['tag'].value_counts()[df['tag']].values == 1, 'tag'] = 0
df
l_name f_name m_name city_state zip tag
0 smith sam '' auburn,wa 98910 2
1 smith sam c auburn,wa 98910 2
2 smith sam '' durham,nc 27659 0
3 smith sam ed auburn,wa 98910 2
4 smith sam '' auburn,wa 98910 2
5 clair may '' seattle,wa 98092 3
6 clair may a seattle,wa 98092 3
7 clair may '' seattle,wa 98092 3
8 aaron joe ch ocala,fl 58910 4
9 aaron joe '' ocala,fl 58910 4
10 aaron joe '' durham,nc 27659 0
11 aaron joe '' ocala,fl 58910 4
12 aaron joe a ocala,fl 58910 4
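The two steps above (numbering every group with ngroup, then zeroing the singletons) can also be combined into a single pass with `where`. This is a minimal runnable sketch, using a small subset of the question's data for illustration; the `keys` list comes from the question, everything else here is just sample scaffolding:

```python
import pandas as pd

# small sample mirroring the question's data (m_name left empty where missing)
df = pd.DataFrame({
    'l_name': ['smith'] * 5,
    'f_name': ['sam'] * 5,
    'm_name': ['', 'c', '', 'eddie', ''],
    'city_state': ['auburn,wa', 'auburn,wa', 'durham,nc', 'auburn,wa', 'auburn,wa'],
    'zip': ['98910', '98910', '27659', '98910', '98910'],
})

keys = ['l_name', 'f_name', 'city_state', 'zip']
g = df.groupby(keys)

# ngroup() numbers each key combination; keep that number only where the
# group has more than one row, otherwise fall back to 0
df['tag'] = (g.ngroup() + 1).where(g['l_name'].transform('size') > 1, 0)
print(df['tag'].tolist())
```

Note that, as with the answers above, the tags produced this way are not guaranteed to be consecutive 1..n, since singleton groups still consume a group number before being zeroed out.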