Home > Net >  Grouping pandas dataframe by column specificity to row values - python
Grouping pandas dataframe by column specificity to row values - python

Time:11-17

I have a dataset of this type:

                 id   1   2   3   4   5  
                  A  10  40  80  12  50
                  B  20  60  70  77  60
                  C  30  15  50  20  60
                  C  30  15  20  45  43 
                  B  50 100  70  77  32
                  C  30  15  20  80  21
                  A  50 100  10  12  50

Is there a way to group it somehow to show which columns are specific for which id? For example, we can see that all of the values corresponding to the id 'C' in the first column equal to 30; similarly, column 3 is pretty 'B' id specific - all of the values are the same for 'B' in column 3 etc. Same for column 5 and id 'A'.

So, there are columns specific for each id; is there a way to group them somehow and for each id visualise/list columns specific to each of them?

CodePudding user response:

(df.melt('id')
 .groupby(['id', 'variable'])
 .agg(lambda x: x.max() if x.max() == x.min() else None)
 .unstack())

result

    value
variable    1    2    3     4    5
id                  
A           NaN  NaN  NaN   12.0 50.0
B           NaN  NaN  70.0  77.0 NaN
C           30.0 15.0 NaN   NaN  NaN

or

(df.melt('id')
 .groupby(['id', 'variable'])
 .agg(lambda x: x.max() if x.max() == x.min() else None)
 .dropna().astype('int'))

result:

             value
id  variable    
A   4        12
    5        50
B   3        70
    4        77
C   1        30
    2        15

CodePudding user response:

You can use a double groupby to aggregate the data as sets and define a threshold of a number of values to keep to define uniqueness:

thresh = 1

(df.melt('id', var_name='col')
   .groupby(['col', 'id'], as_index=False)['value']
   .agg(frozenset)
   .loc[lambda d: d['value'].str.len().le(thresh)]
   .groupby(['value', 'col'])['id']
   .agg(set)
   .loc[lambda s: s.str.len().eq(1)]
)

Output:

value  col
(30)   1      {C}
(15)   2      {C}
(70)   3      {B}
(12)   4      {A}
(77)   4      {B}
(50)   5      {A}
Name: id, dtype: object

Example with a threshold of 2 values:

value      col
(10, 50)   1      {A}
(50, 20)   1      {B}
           3      {C}
(30)       1      {C}
(40, 100)  2      {A}
(100, 60)  2      {B}
(15)       2      {C}
(80, 10)   3      {A}
(70)       3      {B}
(12)       4      {A}
(77)       4      {B}
(50)       5      {A}
(32, 60)   5      {B}
Name: id, dtype: object
  • Related