I have a dataset as under
Cty A1
AA A123
AA 1111
AA 99999
BB a9999
BB 345689
BB 345699
CC 1233
I need to group the data by country, and identify mean length of characters of column A1 and A2 against each country.
So for Country AA, we see that there are 13 character for the 3 rows in A1 column, and while calculating average we would get 4.333 (i.e 13 / 3), same way for B we would get an average of 5.66
I used the below code to achieve the same
df = (df.groupby('Cty')['A1'].apply(lambda x: np.mean(x.str.len())).reset_index(name='mean_len_text'))
But got the below error
~\anaconda3\lib\site-packages\pandas\core\strings\accessor.py in __init__(self, data)
152 from pandas.core.arrays.string_ import StringDtype
153
--> 154 self._inferred_dtype = self._validate(data)
155 self._is_categorical = is_categorical_dtype(data.dtype)
156 self._is_string = isinstance(data.dtype, StringDtype)
~\anaconda3\lib\site-packages\pandas\core\strings\accessor.py in _validate(data)
215
216 if inferred_dtype not in allowed_types:
--> 217 raise AttributeError("Can only use .str accessor with string values!")
218 return inferred_dtype
219
AttributeError: Can only use .str accessor with string values!
I tried to convert the column to string and run as below, but got an attribute error
df = (df.groupby('Cty')['A1'].astype(str).str.apply(lambda x: np.mean(x.str.len())).reset_index(name='mean_len_text'))
Error msg
AttributeError: 'SeriesGroupBy' object has no attribute 'astype'
Idea is to have the average length of characters for each country in a separate column.
Not sure how to go about the same
Help would be much appretiated.
CodePudding user response:
You can use GroupBy.mean
passing df['Cty']
as grouper:
df['A1'].astype(str).str.len().groupby(df['Cty']).mean()
NB. conversion to string can be skipped if you already have strings.
Output:
Cty
AA 4.333333
BB 5.666667
CC 4.000000
Name: A1, dtype: float64
NB. if you want a DataFrame, add your .reset_index(name='mean_len_text')
A fix of your approach could be:
df.groupby('Cty')['A1'].apply(lambda x: x.str.len().mean())
But this should be significantly less efficient.
CodePudding user response:
You can pass Series
to groupby
, so solution should be simplify by Series.astype
with Series.str.len
and then per df['Cty']
aggregate mean
:
df1 = (df['A1'].astype(str)
.str.len()
.groupby(df['Cty'])
.mean()
.reset_index(name='mean_len_text'))
Alternative solution is first create column mean_len_text
in DataFrame.assign
and then aggregate mean
:
df1 = (df.assign(mean_len_text = df['A1'].astype(str)
.str.len())
.groupby('Cty', as_index=False)['mean_len_text']
.mean())
Your solution with lambda function and astype
:
df1 = (df.groupby('Cty')['A1']
.apply(lambda x: x.astype(str).str.len().mean())
.reset_index(name='mean_len_text'))