Home > Enterprise >  How to add tricky 1-to-1 : M-to-M relations for a dataframe in a new column while grouping data
How to add tricky 1-to-1 : M-to-M relations for a dataframe in a new column while grouping data

Time:10-11

I have the below sample data frame

In [1]: df = pd.DataFrame({'Name': {0: 'ahmed',
  1: 'ahmed',
  2: 'mohamed',
  3: 'mohamed',
  4: 'hassan',
  5: 'ali',
  6: 'nice',
  7: 'bad'},
 'Location': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'C', 5: 'C', 6: 'D', 7: 'D'},
 'Prof': {0: 'doc',
  1: 'doc',
  2: 'eng',
  3: 'doc',
  4: 'eng',
  5: 'eng',
  6: 'doc',
  7: 'eng'}})

In [2]: df
Out[2]:
      Name Location Prof
0    ahmed        A  doc
1    ahmed        A  doc
2  mohamed        B  eng
3  mohamed        B  doc
4   hassan        C  eng
5      ali        C  eng
6     nice        D  doc
7      bad        D  eng

I would like to add a new column Relations which will contain 1-to-1, 1-to-M, M-to-1 or M-to-M relation between profession and both Name and Location combination

I want this

In [3]: df
Out[3]:
      Name Location      Prof relation
0    ahmed        A       doc   1_to_1
1  mohamed        B  eng, doc   1_to_M
2   hassan        C       eng   M_to_1
3      ali        C       eng   M_to_1
4     nice        D       doc   M_to_M
5      bad        D       eng   M_to_M

beside the Relations column, I only want unique values in ['Prof'];

df.groupby(['Name', 'Location'])['Prof'].agg(','.join)

Using the above line of code resulted in doc,doc for the first record, which is something I don't want and I can't remove duplicates before as I have many other columns in the original data that makes both 1st rows not duplicate.

CodePudding user response:

You can use set in lambda after groupby like below then you can use merge two DataFrame and get what you want like below:

>>> relation_list = []
>>> for col in ['Name', 'Prof']:
...    relation_list.append(dict(df.groupby('Location')[col].apply(lambda x : '1' if len(set(x))==1 else 'M')))

>>> df_rel = pd.DataFrame(relation_list).apply(lambda x : '_to_'.join(x)).to_frame().reset_index().rename({'index': 'Location'}, axis=1)
>>> df_rel
    Location    0
0   A           1_to_1
1   B           1_to_M
2   C           M_to_1
3   D           M_to_M

>>> df_prof = df.groupby(['Name', 'Location'])['Prof'].apply(lambda x : ','.join(set(x))).to_frame().reset_index()
>>> df_prof

     Name    Location   Prof
0   ahmed    A          doc
1   ali      C          eng
2   bad      D          eng
3   hassan   C          eng
4   mohamed  B          eng,doc
5   nice     D          doc


>>> df_final = df_prof.merge(df_rel, on='Location').rename({0: 'relation'}, axis=1)
>>> df_final.sort_values('Location')
    Name        Location    Prof      relation
0   ahmed       A           doc       1_to_1
5   mohamed     B           eng,doc   1_to_M
1   ali         C           eng       M_to_1
2   hassan      C           eng       M_to_1
3   bad         D           eng       M_to_M
4   nice        D           doc       M_to_M

CodePudding user response:

Step 1: You can set the relation for Location as follows:

Get the unique counts of Name and Prof for each Location by .groupby() and nunique().

Set up relation using np.select()

# Get the unique counts of `Name` and `Prof` for each `Location`
df_count = df.groupby('Location', as_index=False)[['Name', 'Prof']].nunique()

# Set up `relation` using `np.select()`
cond_list = [(df_count['Name'] == 1) & (df_count['Prof'] == 1),
             (df_count['Name'] == 1) & (df_count['Prof'] >= 2),
             (df_count['Name'] >= 2) & (df_count['Prof'] == 1),
             (df_count['Name'] >= 2) & (df_count['Prof'] >= 2)
            ]

choice_list = ['1_to_1',
               '1_to_M',
               'M_to_1',
               'M_to_M'
              ]

df_count['relation'] = np.select(cond_list, choice_list)              

Step 2: Set the Prof as follows:

We use .unique() to get the unique values of Prof for join

sort=False in .groupby() to keep original sequence.

df_prof = (df.groupby(['Name', 'Location'], as_index=False, sort=False)['Prof']
             .agg(lambda x: ', '.join(x.unique())))

Step 3: Merge them together

df_final = df_prof.merge(df_count[['Location', 'relation']], on='Location')

Results:

df_count

print(df_count)

  Location  Name  Prof relation
0        A     1     1   1_to_1
1        B     1     2   1_to_M
2        C     2     1   M_to_1
3        D     2     2   M_to_M

df_prof

print(df_prof)

      Name Location      Prof
0    ahmed        A       doc
1  mohamed        B  eng, doc
2   hassan        C       eng
3      ali        C       eng
4     nice        D       doc
5      bad        D       eng

df_final

print(df_final)

      Name Location      Prof relation
0    ahmed        A       doc   1_to_1
1  mohamed        B  eng, doc   1_to_M
2   hassan        C       eng   M_to_1
3      ali        C       eng   M_to_1
4     nice        D       doc   M_to_M
5      bad        D       eng   M_to_M
  • Related