How to add tricky 1-to-1 : M-to-M relations for a dataframe in a new column while grouping data-CodePudding

I have the below sample data frame

In [1]: df = pd.DataFrame({'Name': {0: 'ahmed',
  1: 'ahmed',
  2: 'mohamed',
  3: 'mohamed',
  4: 'hassan',
  5: 'ali',
  6: 'nice',
  7: 'bad'},
 'Location': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'C', 5: 'C', 6: 'D', 7: 'D'},
 'Prof': {0: 'doc',
  1: 'doc',
  2: 'eng',
  3: 'doc',
  4: 'eng',
  5: 'eng',
  6: 'doc',
  7: 'eng'}})

In [2]: df
Out[2]:
      Name Location Prof
0    ahmed        A  doc
1    ahmed        A  doc
2  mohamed        B  eng
3  mohamed        B  doc
4   hassan        C  eng
5      ali        C  eng
6     nice        D  doc
7      bad        D  eng

I would like to add a new column Relations which will contain 1-to-1, 1-to-M, M-to-1 or M-to-M relation between profession and both Name and Location combination

I want this

In [3]: df
Out[3]:
      Name Location      Prof relation
0    ahmed        A       doc   1_to_1
1  mohamed        B  eng, doc   1_to_M
2   hassan        C       eng   M_to_1
3      ali        C       eng   M_to_1
4     nice        D       doc   M_to_M
5      bad        D       eng   M_to_M

beside the Relations column, I only want unique values in ['Prof'];

df.groupby(['Name', 'Location'])['Prof'].agg(','.join)

Using the above line of code resulted in doc,doc for the first record, which is something I don't want and I can't remove duplicates before as I have many other columns in the original data that makes both 1st rows not duplicate.

CodePudding user response：

You can use set in lambda after groupby like below then you can use merge two DataFrame and get what you want like below:

>>> relation_list = []
>>> for col in ['Name', 'Prof']:
...    relation_list.append(dict(df.groupby('Location')[col].apply(lambda x : '1' if len(set(x))==1 else 'M')))

>>> df_rel = pd.DataFrame(relation_list).apply(lambda x : '_to_'.join(x)).to_frame().reset_index().rename({'index': 'Location'}, axis=1)
>>> df_rel
    Location    0
0   A           1_to_1
1   B           1_to_M
2   C           M_to_1
3   D           M_to_M

>>> df_prof = df.groupby(['Name', 'Location'])['Prof'].apply(lambda x : ','.join(set(x))).to_frame().reset_index()
>>> df_prof

     Name    Location   Prof
0   ahmed    A          doc
1   ali      C          eng
2   bad      D          eng
3   hassan   C          eng
4   mohamed  B          eng,doc
5   nice     D          doc


>>> df_final = df_prof.merge(df_rel, on='Location').rename({0: 'relation'}, axis=1)
>>> df_final.sort_values('Location')
    Name        Location    Prof      relation
0   ahmed       A           doc       1_to_1
5   mohamed     B           eng,doc   1_to_M
1   ali         C           eng       M_to_1
2   hassan      C           eng       M_to_1
3   bad         D           eng       M_to_M
4   nice        D           doc       M_to_M

CodePudding user response：

Step 1: You can set the relation for Location as follows:

Get the unique counts of Name and Prof for each Location by .groupby() and nunique().

Set up relation using np.select()

# Get the unique counts of `Name` and `Prof` for each `Location`
df_count = df.groupby('Location', as_index=False)[['Name', 'Prof']].nunique()

# Set up `relation` using `np.select()`
cond_list = [(df_count['Name'] == 1) & (df_count['Prof'] == 1),
             (df_count['Name'] == 1) & (df_count['Prof'] >= 2),
             (df_count['Name'] >= 2) & (df_count['Prof'] == 1),
             (df_count['Name'] >= 2) & (df_count['Prof'] >= 2)
            ]

choice_list = ['1_to_1',
               '1_to_M',
               'M_to_1',
               'M_to_M'
              ]

df_count['relation'] = np.select(cond_list, choice_list)

Step 2: Set the Prof as follows:

We use .unique() to get the unique values of Prof for join

sort=False in .groupby() to keep original sequence.

df_prof = (df.groupby(['Name', 'Location'], as_index=False, sort=False)['Prof']
             .agg(lambda x: ', '.join(x.unique())))

Step 3: Merge them together

df_final = df_prof.merge(df_count[['Location', 'relation']], on='Location')

Results:

df_count

print(df_count)

  Location  Name  Prof relation
0        A     1     1   1_to_1
1        B     1     2   1_to_M
2        C     2     1   M_to_1
3        D     2     2   M_to_M

df_prof

print(df_prof)

      Name Location      Prof
0    ahmed        A       doc
1  mohamed        B  eng, doc
2   hassan        C       eng
3      ali        C       eng
4     nice        D       doc
5      bad        D       eng

df_final

print(df_final)

      Name Location      Prof relation
0    ahmed        A       doc   1_to_1
1  mohamed        B  eng, doc   1_to_M
2   hassan        C       eng   M_to_1
3      ali        C       eng   M_to_1
4     nice        D       doc   M_to_M
5      bad        D       eng   M_to_M