I am trying to add a new column to a data below, in which columns Algorithm1,2 & 3 are derived from Text column. I want to count each element in the 3 columns and create new column to only have elements which have count > 1
df = pd.DataFrame({'T':['AAABBX','AAABBX','AAABBX'],
'Algorithm1': ['AX','AB','AAB'],
'Algorithm2' : ['BX','AAX','AB'],
'Algorithm3' : ['AX','AB','AAX']})
This is how the final output should look like with columns values AB, AX & AAX as they have frequency > 1
df2 = pd.DataFrame({'T':['AAABBX','AAABBX','AAABBX'],
'Algorithm1': ['AX','AB','AAB'],
'Algorithm2' : ['BX','AAX','AB'],
'Algorithm3' : ['AX','AB','AAX'],
'Majority': ['AB','AX','AAX']})
CodePudding user response:
IIUC count values in Algorithm
columns and filter if greater like 1
:
df = pd.DataFrame({'T':['AAABBX','AAABBX','AAABBX'],
'Algorithm1': ['AX','AB','AAB'],
'Algorithm2' : ['BX','AAX','AB'],
'Algorithm3' : ['AX','AB','AAX']})
s = df.filter(like='Algorithm').stack().value_counts()
m = s.gt(1)
print (s)
AB 3
AX 2
AAX 2
AAB 1
BX 1
dtype: int64
df['new'] = pd.Series(s.index[m])[:m.sum()]
print (df)
T Algorithm1 Algorithm2 Algorithm3 new
0 AAABBX AX BX AX AB
1 AAABBX AB AAX AB AX
2 AAABBX AAB AB AAX AAX
If number of values is greater like length of DataFrame:
df = pd.DataFrame({'T':['AAABBX','AAABBX','AAABBX'],
'Algorithm1': ['AX','AB','AAB'],
'Algorithm2' : ['AAX','AAX','AB'],
'Algorithm3' : ['AX','AB','AAB']})
s = df.filter(like='Algorithm').stack().value_counts()
m = s.gt(1)
print (s)
AB 3
AAB 2
AX 2
AAX 2
dtype: int64
df['new'] = pd.Series(s.index[m])[:m.sum()]
print (df)
T Algorithm1 Algorithm2 Algorithm3 new
0 AAABBX AX AAX AX AB
1 AAABBX AB AAX AB AAB
2 AAABBX AAB AB AAB AX
If number of values is less like length of DataFrame:
df = pd.DataFrame({'T':['AAABBX','AAABBX','AAABBX'],
'Algorithm1': ['AX','AB','AX'],
'Algorithm2' : ['AA','AAX','AB'],
'Algorithm3' : ['AX','AB','AX']})
s = df.filter(like='Algorithm').stack().value_counts()
m = s.gt(1)
print (s)
AX 4
AB 3
AAX 1
AA 1
dtype: int64
df['new'] = pd.Series(s.index[m])[:m.sum()]
print (df)
T Algorithm1 Algorithm2 Algorithm3 new
0 AAABBX AX AA AX AX
1 AAABBX AB AAX AB AB
2 AAABBX AX AB AX NaN