Home > Back-end >  Pivot/crosstab without agg function duplicates issue?
Pivot/crosstab without agg function duplicates issue?

Time:10-07

I have the following dataframe:

import pandas as pd
df = pd.DataFrame.from_dict({'data_point_key': {28: 'a',
      30: 'b',
      31: 'c',
      32: 'd',
      33: 'e',
      34: 'f',
      55: 'a',
      56: 'b',
      57: 'c',
      58: 'd',
      59: 'e',
      60: 'f',
      61: 'a',
      63: 'b',
      64: 'c',
      65: 'd',
      66: 'e',
      67: 'f',
      88: 'a',
      89: 'b',
      90: 'c',
      91: 'd',
      92: 'e',
      93: 'f'},
     'val': {28: '8',
      30: 'A',
      31: '8',
      32: 'kuku',
      33: 'lulu',
      34: 'DE',
      55: '8',
      56: 'B',
      57: '8',
      58: 'kuku',
      59: 'lulu',
      60: 'DE',
      61: '8',
      63: 'A',
      64: '8',
      65: 'kuku',
      66: 'lulu',
      67: 'DE',
      88: '8',
      89: 'B',
      90: '8',
      91: 'kuku',
      92: 'lulu',
      93: 'DE'},
     'request_id': {28: 4,
      30: 4,
      31: 4,
      32: 4,
      33: 4,
      34: 4,
      55: 4,
      56: 4,
      57: 4,
      58: 4,
      59: 4,
      60: 4,
      61: 7,
      63: 7,
      64: 7,
      65: 7,
      66: 7,
      67: 7,
      88: 7,
      89: 7,
      90: 7,
      91: 7,
      92: 7,
      93: 7}})

I am trying to do the following:

pd.crosstab(index=df['request_id'],
            columns=df['data_point_key'],
            values=df['val'],
            aggfunc=','.join)

And getting the following duplicates due to the aggfunc:

enter image description here

How can I avoid these duplicates with commas?

CodePudding user response:

You can remove duplicates before pivoting seems cleanest way in my opinion:

df = df.drop_duplicates(['request_id','data_point_key','val'])
df1 = pd.crosstab(index=df['request_id'],
            columns=df['data_point_key'],
            values=df['val'],
            aggfunc=','.join)
print (df1)
data_point_key  a    b  c     d     e   f
request_id                               
4               8  A,B  8  kuku  lulu  DE
7               8  A,B  8  kuku  lulu  DE

Alternative is change aggfunc - if order is important:

df1 = (df.pivot_table(index='request_id',
                   columns='data_point_key',
                   values='val',
                   aggfunc=lambda x: ','.join(dict.fromkeys(x).keys())))
print (df1)
data_point_key  a    b  c     d     e   f
request_id                               
4               8  A,B  8  kuku  lulu  DE
7               8  A,B  8  kuku  lulu  DE

Or if order isnot important:

df1 = pd.crosstab(index=df['request_id'],
            columns=df['data_point_key'],
            values=df['val'],
            aggfunc=lambda x: ','.join(set(x)))
print (df1)
data_point_key  a    b  c     d     e   f
request_id                               
4               8  B,A  8  kuku  lulu  DE
7               8  B,A  8  kuku  lulu  DE
  • Related