Home > Blockchain >  Number of character occurences in a list of strings
Number of character occurences in a list of strings

Time:01-16

I have a dataframe with a list of strings and I would like to add columns with the number of occurrences of a character, sorted with the maximum to minimum occurrences The datafrae is very big so I need an efficient way to calculate it

Originale df:

    Item
0   ABABCBF
1   ABABCGH
2   ABABEFR
3   ABABFBF
4   ABACTC3

Wanted df:

    Item    o1  o2  o3  o4  o5
0   ABABCBF 3   2   1   1   null
1   ABABCGH 2   2   1   1   1
2   ABABEFR 2   2   1   1   1
3   ABABFBF 3   2   2   null    null
4   ABACTC3 2   2   1   1   1

I have tried using collection counter but I am not able to convert the result in the column of the dataframe collections.Counter(df['item'])

Thanks

CodePudding user response:

You can use collections.Counter and the DataFrame constructor:

from collections import Counter

out = df.join(pd.DataFrame(
  sorted(Counter(x).values(), reverse=True)
         for x in df['Item'])
  .rename(columns=lambda x: f'o{x 1}')
)

print(out)

Output:

      Item  o1  o2  o3   o4   o5
0  ABABCBF   3   2   1  1.0  NaN
1  ABABCGH   2   2   1  1.0  1.0
2  ABABEFR   2   2   1  1.0  1.0
3  ABABFBF   3   2   2  NaN  NaN
4  ABACTC3   2   2   1  1.0  1.0

CodePudding user response:

Try:

import json
import pandas as pd
from collections import Counter
df = pd.DataFrame({'Item': ['ABACABDF', 'BACBDFHGAAAA']})
result = df.join(
    pd.DataFrame(
        json.loads(
            df['Item']
            .transform(lambda x: sorted(list(Counter(x).values()), reverse=True))
            .to_json(orient='records')
        )
    )
    .rename(columns=(lambda x: f'o{x 1}'))
)

result

           Item  o1  o2  o3  o4  o5   o6   o7
0      ABACABDF   3   2   1   1   1  NaN  NaN
1  BACBDFHGAAAA   5   2   1   1   1  1.0  1.0

CodePudding user response:

try this:

def count_chars(txt: str):
    ser = pd.Series([*txt])
    result = ser.value_counts().tolist()
    return result

result = df.join(
    pd.DataFrame([*df['Item'].apply(count_chars)]).rename(columns=lambda x: f'o{x 1}'))
print(result)
>>>
    Item    o1  o2  o3  o4  o5
0   ABABCBF 3   2   1   1.0 NaN
1   ABABCGH 2   2   1   1.0 1.0
2   ABABEFR 2   2   1   1.0 1.0
3   ABABFBF 3   2   2   NaN NaN
4   ABACTC3 2   2   1   1.0 1.0
  • Related