I have a dataframe like this:

import numpy as np
import pandas as pd
from collections import Counter

df = pd.DataFrame({'c0': ['app','e','i','owl','u'],'c1': ['p','app','i','g',''],'c2': ['g','p','app','owl','']})
df


    c0  c1  c2
0   app p   g
1   e   app p
2   i   i   app
3   owl g   owl
4   u

I would like to align the rows based on frequency of items.

Required dataframe:


   c0   c1  c2
0   app app app
1   i   i   
2   owl     owl
3   e   p   p
4   u   g   g

My attempt:

all_cols = df.values.flatten()
all_cols = [i for i in all_cols if i]

freq = Counter(all_cols)
freq

CodePudding user response:

I can get you this far:

import pandas as pd
df = pd.DataFrame({'c0': list('aeiou'),'c1': ['p','a','i','g',''],'c2': ['g','p','a','o','']})
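# Collect every distinct non-empty value in the dataframe
allLetters = set(x for x in df.to_numpy().flatten() if x)
# For each value, record a 0/1 flag per column: does that column contain it?
binaryIncidence = []
for letter in allLetters:
    binaryIncidence.append(tuple(int(letter in df[col].tolist()) for col in df.columns))
x = list(zip(allLetters, binaryIncidence))
# Sort by incidence pattern (descending), breaking ties alphabetically
x.sort(key=lambda y:(y[1], -ord(y[0])), reverse=True)
# Build one output row per value, placing it only in the columns that contain it
x = [[y[0] if b else '' for b in y[1]] for y in x]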
df_results = pd.DataFrame(x, columns=df.columns)
print(df_results)

... with this output:

  c0 c1 c2
0  a  a  a
1  i  i
2  o     o
3  e
4  u
5     g  g
6     p  p

However, in the sample output from your question, 'e' shares a row with 'p', 'p' and 'u' shares a row with 'g', 'g'. It's not clear to me what rule determines that pairing.
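One possible reading of that pairing, offered as an assumption rather than a known rule: after sorting, each value is merged into the first earlier row whose occupied columns do not clash with it, and ties are ordered by first appearance in the original dataframe (row by row). A minimal sketch under that assumption; the function name `align_and_compact` is made up for illustration:

```python
import pandas as pd

def align_and_compact(df):
    # Distinct non-empty values, in row-major order of first appearance
    # (dict.fromkeys keeps insertion order and drops duplicates).
    values = list(dict.fromkeys(v for v in df.to_numpy().flatten() if v))
    first_seen = {v: i for i, v in enumerate(values)}
    # 0/1 flag per column: does the column contain the value?
    pattern = {v: tuple(int(v in set(df[c])) for c in df.columns) for v in values}
    # Patterns filling more leading columns come first; ties by first appearance.
    values.sort(key=lambda v: (tuple(-b for b in pattern[v]), first_seen[v]))

    rows = []
    for v in values:
        flags = pattern[v]
        for row in rows:
            # ASSUMPTION: merge into the first row with no clashing column.
            if all(not (b and cell) for b, cell in zip(flags, row)):
                for j, b in enumerate(flags):
                    if b:
                        row[j] = v
                break
        else:
            # No compatible row found, so start a new one.
            rows.append([v if b else '' for b in flags])
    return pd.DataFrame(rows, columns=df.columns)

df = pd.DataFrame({'c0': list('aeiou'),
                   'c1': ['p', 'a', 'i', 'g', ''],
                   'c2': ['g', 'p', 'a', 'o', '']})
result = align_and_compact(df)
print(result)
```

On this data the sketch puts 'e' with 'p', 'p' and 'u' with 'g', 'g', which matches the pairing in the question, but a different rule could produce the same sample output.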

UPDATE: generalize to strings of arbitrary length

This works for strings of any length, not just single characters:

import pandas as pd
df = pd.DataFrame({'c0': ['app','e','i','owl','u'],'c1': ['p','app','i','g',''],'c2': ['g','p','app','owl','']})
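# Collect every distinct non-empty string in the dataframe
allStrings = set(x for x in df.to_numpy().flatten() if x)
# For each string, record a 0/1 flag per column: does that column contain it?
binaryIncidence = []
for s in allStrings:
    binaryIncidence.append(tuple(int(s in df[col].tolist()) for col in df.columns))
x = list(zip(allStrings, binaryIncidence))
# Negating the flags sorts patterns in descending order; ties are alphabetical
x.sort(key=lambda y:(tuple(-b for b in y[1]), y[0]))
# Build one output row per string, placing it only in the columns that contain it
x = [[y[0] if b else '' for b in y[1]] for y in x]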
df_results = pd.DataFrame(x, columns=df.columns)
print(df_results)

Output:

    c0   c1   c2
0  app  app  app
1    i    i
2  owl       owl
3    e
4    u
5         g    g
6         p    p
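The incidence step can also be expressed with pandas itself. The following is a sketch of an alternative, not a correction: it uses `melt` and `pd.crosstab` to build the same 0/1 table, then applies the same sort as above. The names `long`, `incidence`, and `order` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'c0': ['app', 'e', 'i', 'owl', 'u'],
                   'c1': ['p', 'app', 'i', 'g', ''],
                   'c2': ['g', 'p', 'app', 'owl', '']})

# Long format: one row per (column, value) pair, with empty cells dropped.
long = df.melt(var_name='col', value_name='val')
long = long[long['val'] != '']

# Count occurrences per value and column, then cap at 1 for a 0/1 table.
incidence = (pd.crosstab(long['val'], long['col'])
               .reindex(columns=df.columns, fill_value=0)
               .clip(upper=1))

# Same ordering as above: pattern descending, ties alphabetical.
order = sorted(incidence.index,
               key=lambda s: (tuple(-b for b in incidence.loc[s]), s))
rows = [[s if b else '' for b in incidence.loc[s]] for s in order]
df_results = pd.DataFrame(rows, columns=df.columns)
print(df_results)
```

This should reproduce the seven-row table shown above; whether it is clearer than the explicit loop is a matter of taste.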