Home > Software engineering >  Grouping and taking average length of characters
Grouping and taking average length of characters

Time:06-02

I have a dataset as under

Cty  A1  
AA  A123
AA  1111 
AA  99999
BB  a9999
BB  345689
BB  345699 
CC  1233 

I need to group the data by country, and identify mean length of characters of column A1 and A2 against each country.

So for Country AA, we see that there are 13 character for the 3 rows in A1 column, and while calculating average we would get 4.333 (i.e 13 / 3), same way for B we would get an average of 5.66

I used the below code to achieve the same

df = (df.groupby('Cty')['A1'].apply(lambda x: np.mean(x.str.len())).reset_index(name='mean_len_text'))

But got the below error

~\anaconda3\lib\site-packages\pandas\core\strings\accessor.py in __init__(self, data)
    152         from pandas.core.arrays.string_ import StringDtype
    153 
--> 154         self._inferred_dtype = self._validate(data)
    155         self._is_categorical = is_categorical_dtype(data.dtype)
    156         self._is_string = isinstance(data.dtype, StringDtype)

~\anaconda3\lib\site-packages\pandas\core\strings\accessor.py in _validate(data)
    215 
    216         if inferred_dtype not in allowed_types:
--> 217             raise AttributeError("Can only use .str accessor with string values!")
    218         return inferred_dtype
    219 

AttributeError: Can only use .str accessor with string values!

I tried to convert the column to string and run as below, but got an attribute error

df = (df.groupby('Cty')['A1'].astype(str).str.apply(lambda x: np.mean(x.str.len())).reset_index(name='mean_len_text'))

Error msg
AttributeError: 'SeriesGroupBy' object has no attribute 'astype'

Idea is to have the average length of characters for each country in a separate column.

Not sure how to go about the same

Help would be much appretiated.

CodePudding user response:

You can use GroupBy.mean passing df['Cty'] as grouper:

df['A1'].astype(str).str.len().groupby(df['Cty']).mean()

NB. conversion to string can be skipped if you already have strings.

Output:

Cty
AA    4.333333
BB    5.666667
CC    4.000000
Name: A1, dtype: float64

NB. if you want a DataFrame, add your .reset_index(name='mean_len_text')

A fix of your approach could be:

df.groupby('Cty')['A1'].apply(lambda x: x.str.len().mean())

But this should be significantly less efficient.

CodePudding user response:

You can pass Series to groupby, so solution should be simplify by Series.astype with Series.str.len and then per df['Cty'] aggregate mean:

df1 = (df['A1'].astype(str)
               .str.len()
               .groupby(df['Cty'])
               .mean()
               .reset_index(name='mean_len_text'))

Alternative solution is first create column mean_len_text in DataFrame.assign and then aggregate mean:

df1 = (df.assign(mean_len_text = df['A1'].astype(str)
                                         .str.len())
         .groupby('Cty', as_index=False)['mean_len_text']
         .mean())

Your solution with lambda function and astype:

df1 = (df.groupby('Cty')['A1']
         .apply(lambda x: x.astype(str).str.len().mean())
         .reset_index(name='mean_len_text'))
  • Related