Counting repeating words with numpy and pandas Python

Time:09-28

I want to write code that outputs how many times each distinct value in a is repeated. Then I want to build a pandas DataFrame to print the result. The sums line below does not work. How can I make it work and get the Expected Output?

import numpy as np
import pandas as pd 

a = np.array([12,12,12,3,43,43,43,22,1,3,3,43])
uniques = np.unique(a)
sums = np.sum(uniques[:-1]==a[:-1])

Expected Output:

Value    Repetition Count
1        1
3        3
12       3
22       1
43       4

CodePudding user response:

You can use groupby:

>>> pd.Series(a).groupby(a).count()
1     1
3     3
12    3
22    1
43    4
dtype: int64

Or value_counts():

>>> pd.Series(a).value_counts().sort_index()
1     1
3     3
12    3
22    1
43    4
dtype: int64
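
Since the question starts from NumPy, a NumPy-only count may also be worth sketching. The original sums line compares uniques[:-1] (4 elements) with a[:-1] (11 elements), so the two arrays never line up element by element. A minimal sketch, assuming np.unique with return_counts=True is acceptable here; the names values, counts and result are illustrative and not from the original post:

```python
import numpy as np
import pandas as pd

a = np.array([12, 12, 12, 3, 43, 43, 43, 22, 1, 3, 3, 43])

# return_counts=True returns the sorted unique values together with
# the number of times each one occurs in a
values, counts = np.unique(a, return_counts=True)

# Column names are taken from the Expected Output in the question
result = pd.DataFrame({'Value': values, 'Repetition Count': counts})
print(result.to_string(index=False))
```

Because np.unique returns the values already sorted, no extra sort_index() step should be needed for this variant.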

CodePudding user response:

The easiest way is to build a pandas DataFrame from the NumPy array and then call value_counts(), which by default sorts the result by count in descending order.

df = pd.DataFrame(data=a, columns=['col1'])

print(df.col1.value_counts())
43    4
12    3
3     3
22    1
1     1
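
The output above is ordered by count and has no column headers, unlike the Expected Output in the question. One possible way to get closer to that layout, sketched under the assumption that the two header names from the question are wanted as column names (the variable table is illustrative):

```python
import numpy as np
import pandas as pd

a = np.array([12, 12, 12, 3, 43, 43, 43, 22, 1, 3, 3, 43])
df = pd.DataFrame(data=a, columns=['col1'])

table = (
    df['col1']
    .value_counts()                        # count occurrences of each value
    .sort_index()                          # order by value instead of by count
    .rename_axis('Value')                  # name the index so it becomes a column
    .reset_index(name='Repetition Count')  # move the index into a regular column
)
print(table.to_string(index=False))
```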

CodePudding user response:

Define a DataFrame df from the array a. Then use .groupby() followed by .size() to count the occurrences of each unique value, as follows:

a = np.array([12,12,12,3,43,43,43,22,1,3,3,43])
df = pd.DataFrame({'Value': a})

df.groupby('Value').size().reset_index(name='Repetition Count')

Result:

   Value  Repetition Count
0      1                 1
1      3                 3
2     12                 3
3     22                 1
4     43                 4

Edit

If you also want each count as a percentage of the total, you can use:

(df.groupby('Value', as_index=False)
   .agg(**{'Repetition Count': ('Value', 'size'), 
           'Percent': ('Value', lambda x: round(x.size/len(a) *100, 2))})
)

Result:

   Value  Repetition Count  Percent
0      1                 1     8.33
1      3                 3    25.00
2     12                 3    25.00
3     22                 1     8.33
4     43                 4    33.33

Or use .value_counts() with normalize=True:

pd.Series(a).value_counts(normalize=True).mul(100)

Result:

43    33.333333
12    25.000000
3     25.000000
22     8.333333
1      8.333333
dtype: float64
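
The normalize=True result is unrounded and ordered by frequency. If counts and rounded percentages are wanted side by side, ordered by value, something like the following might work. It is only a sketch; the names value_counts_sorted, percent and summary are made up for illustration:

```python
import numpy as np
import pandas as pd

a = np.array([12, 12, 12, 3, 43, 43, 43, 22, 1, 3, 3, 43])

# Counts per distinct value, ordered by the value itself
value_counts_sorted = pd.Series(a).value_counts().sort_index()

# Share of each value in percent, rounded to two decimals
percent = (value_counts_sorted / value_counts_sorted.sum() * 100).round(2)

summary = pd.DataFrame({
    'Value': value_counts_sorted.index,
    'Repetition Count': value_counts_sorted.to_numpy(),
    'Percent': percent.to_numpy(),
})
print(summary.to_string(index=False))
```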