Select the most frequent terms from a list of strings else return none?

Time:10-12

I need to group by a column (here, Text) and collect all the strings from the tag column into one list per group. Then I need to find the most frequent term in each list; if there is no single most frequent term, the function must return "none".

I have a dataset that looks like this:

 Text             tag
 drink coke       yes
 eat pizza        mic
 eat fruits       yes
 eat banana       yes
 eat banana       mic
 eat fruits       mic
 eat pizza        no
 eat pizza        mic
 eat pizza        yes
 drink coke       yes
 drink coke       no
 drink coke       no
 drink coke       yes

I used the code below to collect all the tags of each group into a list, which I want in a new column called labels, but I'm missing the last step: select the most frequent term, and if there is no single most frequent term, return none.

 df = pd.DataFrame(df.groupby(['Text'])['tag'].apply(lambda x: list(x.values)))

I need to return this:

  Text           labels               final
  eat pizza      [mic,no,mic,yes]    mic
  eat fruits     [yes,mic]           none
  eat banana     [yes,mic]           none
  drink coke     [yes,yes,no,no,yes] yes
  

My output should be like the one in the column "final".

CodePudding user response:

You can use groupby.agg with a custom function for the most frequent item:

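def unique_mode(s):
    # Series.mode() returns every value tied for the highest count
    m = s.mode()
    # exactly one mode means there is a single most frequent tag
    if len(m) == 1:
        return m.iloc[0]
    # a tie (or an empty group) has no unique mode
    return None

out = (df
   .groupby('Text', as_index=False)
   .agg(**{'labels': ('tag', list),        # all tags of the group as a list
           'final': ('tag', unique_mode),  # unique mode, else None
          })
)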

output:

         Text                   labels final
0  drink coke  [yes, yes, no, no, yes]   yes
1  eat banana               [yes, mic]  None
2  eat fruits               [yes, mic]  None
3   eat pizza      [mic, no, mic, yes]   mic
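
For anyone who wants to reproduce this, here is a minimal self-contained sketch. The DataFrame is rebuilt by hand from the table in the question, so the exact construction is an assumption about how the real data is stored; the aggregation itself is the same idea as above, written with keyword-style named aggregation.

```python
import pandas as pd

# Sample data copied by hand from the question's table (assumed layout).
df = pd.DataFrame({
    "Text": ["drink coke", "eat pizza", "eat fruits", "eat banana", "eat banana",
             "eat fruits", "eat pizza", "eat pizza", "eat pizza", "drink coke",
             "drink coke", "drink coke", "drink coke"],
    "tag": ["yes", "mic", "yes", "yes", "mic", "mic", "no", "mic", "yes",
            "yes", "no", "no", "yes"],
})

def unique_mode(s):
    # Series.mode() returns all values tied for the highest count.
    m = s.mode()
    return m.iloc[0] if len(m) == 1 else None

out = df.groupby("Text", as_index=False).agg(
    labels=("tag", list),        # every tag of the group, in row order
    final=("tag", unique_mode),  # the single most frequent tag, else missing
)
print(out)
```

Groups are sorted by Text by default; pass sort=False to groupby to keep the order of first appearance instead.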

CodePudding user response:

If performance is important, use statistics.multimode and return the value only when exactly one mode is found, otherwise None:

from statistics import multimode

def f_unique(x):
    a = multimode(x)
    return a[0] if len(a) == 1 else None

df1 = (df.groupby('Text', as_index=False, sort=False)
         .agg(labels = ('tag', list), final = ('tag', f_unique)))
print(df1)
         Text                   labels final
0  drink coke  [yes, yes, no, no, yes]   yes
1   eat pizza      [mic, no, mic, yes]   mic
2  eat fruits               [yes, mic]  None
3  eat banana               [yes, mic]  None
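
To see why the length check works, it may help to look at multimode on plain lists, outside pandas. This is only an illustrative sketch; the helper name unique_mode_or_none is made up for the example. multimode returns a list of all values tied for the highest count, so a list of length 1 means a single winner.

```python
from statistics import multimode

def unique_mode_or_none(values):
    # multimode returns every value that shares the highest count.
    modes = multimode(values)
    # Exactly one entry means a unique most frequent value; a tie
    # (or an empty input, which yields []) gives None.
    return modes[0] if len(modes) == 1 else None

pizza = unique_mode_or_none(["mic", "no", "mic", "yes"])  # one clear winner
fruits = unique_mode_or_none(["yes", "mic"])              # tie between two tags
print(pizza, fruits)
```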

CodePudding user response:

Here is a Pythonic way:

df.groupby(['Text'])['tag']\
.agg(lambda ser: ser.mode().iloc[0] if len(ser.mode()) == 1 else None)\
.reset_index()

output:

         Text   tag
0  drink coke   yes
1  eat banana  None
2  eat fruits  None
3   eat pizza   mic
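
The question asks for the string "none" rather than a missing value. One possible way to get that, sketched here under the assumption that a plain string fallback is acceptable, is to return the fallback directly from the aggregation function. The name unique_mode_or_label and its fallback parameter are hypothetical, and the small DataFrame is a cut-down version of the question's data.

```python
import pandas as pd

# Cut-down sample: one group with a clear winner, one with a tie.
df = pd.DataFrame({
    "Text": ["eat pizza", "eat pizza", "eat pizza", "eat fruits", "eat fruits"],
    "tag": ["mic", "no", "mic", "yes", "mic"],
})

def unique_mode_or_label(s, fallback="none"):
    # Compute the mode once; a single entry means a unique winner.
    m = s.mode()
    return m.iloc[0] if len(m) == 1 else fallback

final = df.groupby("Text")["tag"].agg(unique_mode_or_label)
print(final)
```

Returning a string keeps the column free of missing values, at the cost of not being able to tell a real tag named "none" apart from the fallback.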