I have a dataframe like the one below, containing name and address information.
l_name f_name m_name city_state zip
smith sam auburn,wa 98910
smith sam c auburn,wa 98910
smith sam durham,nc 27659
smith sam eddie auburn,wa 98910
smith sam auburn,wa 98910
:
:
aaron joe chris ocala,fl 58910
aaron joe ocala,fl 58910
aaron joe durham,nc 27659
aaron joe ocala,fl 58910
aaron joe a ocala,fl 58910
I am trying to write code that groups/tags rows into the same group if their l_name, f_name, city_state, and zip columns all match.
Here is what I am hoping the output dataframe will look like.
l_name f_name m_name city_state zip tag
smith sam auburn,wa 98910 1
smith sam c auburn,wa 98910 1
smith sam durham,nc 27659 0
smith sam eddie auburn,wa 98910 1
smith sam auburn,wa 98910 1
clair may seattle,wa 98092 2
clair may a seattle,wa 98092 2
clair may seattle,wa 98092 2
:
:
aaron joe chris ocala,fl 58910 n
aaron joe ocala,fl 58910 n
aaron joe durham,nc 27659 0
aaron joe ocala,fl 58910 n
aaron joe a ocala,fl 58910 n
If a row doesn't belong to any group, it is assigned a 0. Four "sam smith" rows meet the conditions, so they are all in the same group, which is 1. Tag/group 2 will be "may clair", and "joe aaron" will be group n, depending on how many names are stored in the dataframe.
Are there any good ways, methods, or suggestions for writing code that produces the results above?
Thanks!
CodePudding user response:
You can take the groups with count > 1 and assign them their group number (ngroup() + 1), then for the groups with count 1 set tag = 0, and finally concatenate.
import pandas as pd

keys = ['l_name', 'f_name', 'city_state', 'zip']
# rows whose key combination appears more than once form real groups
df1 = df[df.groupby(keys)['m_name'].transform('count') > 1].copy()
df1['tag'] = df1.groupby(keys).ngroup() + 1
# rows whose key combination appears only once get tag 0
df2 = df[df.groupby(keys)['m_name'].transform('count') == 1].copy()
df2['tag'] = 0
df_final = pd.concat([df1, df2])
df_final
l_name f_name m_name city_state zip tag
0 smith sam x auburn,wa 98910 2
1 smith sam c auburn,wa 98910 2
3 smith sam eddie auburn,wa 98910 2
4 smith sam x auburn,wa 98910 2
5 aaron joe chris ocala,fl 58910 1
6 aaron joe x ocala,fl 58910 1
8 aaron joe x ocala,fl 58910 1
9 aaron joe a ocala,fl 58910 1
2 smith sam x durham,nc 27659 0
7 aaron joe x durham,nc 27659 0
CodePudding user response:
You can use ngroup and then count the resulting values:
# create an ID variable with a unique identifier for each group
groupvar = df.groupby(['l_name', 'f_name', 'city_state', 'zip']).ngroup(ascending=False) + 1
# assign it to the data
df = df.assign(tag=groupvar)
# determine which IDs occur only once and assign 0 to them
df.loc[df['tag'].value_counts()[df['tag']].values == 1, 'tag'] = 0
df
l_name f_name m_name city_state zip tag
0 smith sam '' auburn,wa 98910 2
1 smith sam c auburn,wa 98910 2
2 smith sam '' durham,nc 27659 0
3 smith sam ed auburn,wa 98910 2
4 smith sam '' auburn,wa 98910 2
5 clair may '' seattle,wa 98092 3
6 clair may a seattle,wa 98092 3
7 clair may '' seattle,wa 98092 3
8 aaron joe ch ocala,fl 58910 4
9 aaron joe '' ocala,fl 58910 4
10 aaron joe '' durham,nc 27659 0
11 aaron joe '' ocala,fl 58910 4
12 aaron joe a ocala,fl 58910 4
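The two steps above (numbering every group with ngroup, then zeroing the singletons) can also be combined into a single pass with `where`. This is a minimal runnable sketch, using a small subset of the question's data for illustration; the `keys` list comes from the question, everything else here is just sample scaffolding:

```python
import pandas as pd

# small sample mirroring the question's data (m_name left empty where missing)
df = pd.DataFrame({
    'l_name': ['smith'] * 5,
    'f_name': ['sam'] * 5,
    'm_name': ['', 'c', '', 'eddie', ''],
    'city_state': ['auburn,wa', 'auburn,wa', 'durham,nc', 'auburn,wa', 'auburn,wa'],
    'zip': ['98910', '98910', '27659', '98910', '98910'],
})

keys = ['l_name', 'f_name', 'city_state', 'zip']
g = df.groupby(keys)

# ngroup() numbers each key combination; keep that number only where the
# group has more than one row, otherwise fall back to 0
df['tag'] = (g.ngroup() + 1).where(g['l_name'].transform('size') > 1, 0)
print(df['tag'].tolist())
```

Note that, as with the answers above, the tags produced this way are not guaranteed to be consecutive 1..n, since singleton groups still consume a group number before being zeroed out.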