I have a dataset of this type:
id    1    2   3   4   5
A    10   40  80  12  50
B    20   60  70  77  60
C    30   15  50  20  60
C    30   15  20  45  43
B    50  100  70  77  32
C    30   15  20  80  21
A    50  100  10  12  50
Is there a way to group the data to show which columns are specific to which id? For example, all of the values for id 'C' in column 1 equal 30; similarly, column 3 is quite 'B'-specific, since all of the 'B' values in column 3 are the same; the same goes for column 5 and id 'A'.
So some columns are specific to a particular id; is there a way to group them and, for each id, visualise/list the columns specific to it?
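For reference, the sample data can be rebuilt as a DataFrame like this (a minimal sketch; the column labels are assumed to be strings here, integer labels behave the same in the answers below):

import pandas as pd

# Rebuild the question's sample data
df = pd.DataFrame({
    'id': ['A', 'B', 'C', 'C', 'B', 'C', 'A'],
    '1':  [10, 20, 30, 30, 50, 30, 50],
    '2':  [40, 60, 15, 15, 100, 15, 100],
    '3':  [80, 70, 50, 20, 70, 20, 10],
    '4':  [12, 77, 20, 45, 77, 80, 12],
    '5':  [50, 60, 60, 43, 32, 21, 50],
})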
CodePudding user response:
(df.melt('id')
   .groupby(['id', 'variable'])
   # keep a value only when it is constant within the (id, column) group
   .agg(lambda x: x.max() if x.max() == x.min() else None)
   .unstack())
Result:

         value
variable     1     2     3     4     5
id
A          NaN   NaN   NaN  12.0  50.0
B          NaN   NaN  70.0  77.0   NaN
C         30.0  15.0   NaN   NaN   NaN
or, dropping the non-constant combinations instead of unstacking:
(df.melt('id')
   .groupby(['id', 'variable'])
   .agg(lambda x: x.max() if x.max() == x.min() else None)
   # drop the non-constant combinations and restore the integer dtype
   .dropna().astype('int'))
Result:

             value
id variable
A  4            12
   5            50
B  3            70
   4            77
C  1            30
   2            15
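To get the explicit per-id listing the question asks for, the unstacked result can be post-processed; a small sketch, assuming the df built above:

wide = (df.melt('id')
          .groupby(['id', 'variable'])['value']
          .agg(lambda x: x.max() if x.max() == x.min() else None)
          .unstack())

# for each id, keep only the columns whose value was constant (non-NaN)
specific = {i: row.dropna().index.tolist() for i, row in wide.iterrows()}
# {'A': ['4', '5'], 'B': ['3', '4'], 'C': ['1', '2']}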
CodePudding user response:
You can use a double groupby: first aggregate the values of each (column, id) pair into a frozenset, then set a threshold on the number of distinct values a pair may contain and still count as specific:
thresh = 1  # maximum number of distinct values allowed per (column, id) pair

(df.melt('id', var_name='col')
   # collect the distinct values of each (column, id) pair as a frozenset
   .groupby(['col', 'id'], as_index=False)['value']
   .agg(frozenset)
   # keep pairs with at most `thresh` distinct values
   .loc[lambda d: d['value'].str.len().le(thresh)]
   # for each (value set, column), collect the ids that produce it
   .groupby(['value', 'col'])['id']
   .agg(set)
   # keep value sets that occur for exactly one id in that column
   .loc[lambda s: s.str.len().eq(1)]
)
Output:
value  col
(30)   1      {C}
(15)   2      {C}
(70)   3      {B}
(12)   4      {A}
(77)   4      {B}
(50)   5      {A}
Name: id, dtype: object
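Note that frozenset (rather than a plain set) is used for the first aggregation because the result is grouped on afterwards, and groupby keys must be hashable; this is also why the value sets print with parentheses, e.g. (30), while the aggregated id sets print with braces.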
Example with a threshold of 2 values (thresh = 2):
value      col
(10, 50)   1      {A}
(50, 20)   1      {B}
           3      {C}
(30)       1      {C}
(40, 100)  2      {A}
(100, 60)  2      {B}
(15)       2      {C}
(80, 10)   3      {A}
(70)       3      {B}
(12)       4      {A}
(77)       4      {B}
(50)       5      {A}
(32, 60)   5      {B}
Name: id, dtype: object
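If a per-id listing is preferred as the final shape, the Series above can be reshaped further; a possible follow-up (not part of the original answer), assuming the pipeline result is stored in a variable s and thresh = 1:

# unwrap the singleton id sets, then list the specific columns per id
per_id = (s.apply(lambda ids: next(iter(ids)))
           .reset_index()
           .groupby('id')['col']
           .agg(list))
# A -> ['4', '5'], B -> ['3', '4'], C -> ['1', '2']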