Home > Net >  count frequency group by pandas
count frequency group by pandas

Time:09-13

I am trying to add a new column to a data below, in which columns Algorithm1,2 & 3 are derived from Text column. I want to count each element in the 3 columns and create new column to only have elements which have count > 1

df = pd.DataFrame({'T':['AAABBX','AAABBX','AAABBX'],
                     'Algorithm1': ['AX','AB','AAB'],
                     'Algorithm2' : ['BX','AAX','AB'],
                     'Algorithm3' : ['AX','AB','AAX']})

This is how the final output should look like with columns values AB, AX & AAX as they have frequency > 1

df2 = pd.DataFrame({'T':['AAABBX','AAABBX','AAABBX'],
                     'Algorithm1': ['AX','AB','AAB'],
                     'Algorithm2' : ['BX','AAX','AB'],
                     'Algorithm3' : ['AX','AB','AAX'],
                   'Majority': ['AB','AX','AAX']})

CodePudding user response:

IIUC count values in Algorithm columns and filter if greater like 1:

df = pd.DataFrame({'T':['AAABBX','AAABBX','AAABBX'],
                     'Algorithm1': ['AX','AB','AAB'],
                     'Algorithm2' : ['BX','AAX','AB'],
                     'Algorithm3' : ['AX','AB','AAX']})
s = df.filter(like='Algorithm').stack().value_counts()
m = s.gt(1)
print (s)
AB     3
AX     2
AAX    2
AAB    1
BX     1
dtype: int64

df['new'] = pd.Series(s.index[m])[:m.sum()]
print (df)
        T Algorithm1 Algorithm2 Algorithm3  new
0  AAABBX         AX         BX         AX   AB
1  AAABBX         AB        AAX         AB   AX
2  AAABBX        AAB         AB        AAX  AAX

If number of values is greater like length of DataFrame:

df = pd.DataFrame({'T':['AAABBX','AAABBX','AAABBX'],
                     'Algorithm1': ['AX','AB','AAB'],
                     'Algorithm2' : ['AAX','AAX','AB'],
                     'Algorithm3' : ['AX','AB','AAB']})

s = df.filter(like='Algorithm').stack().value_counts()
m = s.gt(1)
print (s)
AB     3
AAB    2
AX     2
AAX    2
dtype: int64

df['new'] = pd.Series(s.index[m])[:m.sum()]
print (df)
        T Algorithm1 Algorithm2 Algorithm3  new
0  AAABBX         AX        AAX         AX   AB
1  AAABBX         AB        AAX         AB  AAB
2  AAABBX        AAB         AB        AAB   AX

If number of values is less like length of DataFrame:

df = pd.DataFrame({'T':['AAABBX','AAABBX','AAABBX'],
                     'Algorithm1': ['AX','AB','AX'],
                     'Algorithm2' : ['AA','AAX','AB'],
                     'Algorithm3' : ['AX','AB','AX']})

s = df.filter(like='Algorithm').stack().value_counts()
m = s.gt(1)
print (s)
AX     4
AB     3
AAX    1
AA     1
dtype: int64

df['new'] = pd.Series(s.index[m])[:m.sum()]
print (df)
        T Algorithm1 Algorithm2 Algorithm3  new
0  AAABBX         AX         AA         AX   AX
1  AAABBX         AB        AAX         AB   AB
2  AAABBX         AX         AB         AX  NaN
  • Related