I have a DataFrame and a dictionary. One column of the DataFrame contains sentences. For every word in a sentence, I want to look it up in the dictionary and take its value. I then want to compute the trimmed (truncated) mean of those values per sentence (i.e. per row), trimming 10% at each end, and store the result in an extra column.
import string
import pandas as pd
test_df = pd.DataFrame({
'_id': ['1a','2b','3c','4d'],
'column': ['und der in zu',
'Kompliziertereswort something',
'Lehrerin in zu [Buch]',
'Buch (Lehrerin) kompliziertereswort']})
test_dict = {'und': 20,
'der': 10,
'in': 40,
'zu': 10,
'Kompliziertereswort': 2,
'Buch': 5,
'Lehrerin': 5}
Calculating the arithmetic mean is straightforward:
test_df['extra_col'] = (test_df['column'].str.split(expand=True)
.stack().astype(str)
.str.strip(string.punctuation)
.map(test_dict)
.astype(float)
.groupby(level=0)
.mean())
But for the truncated mean, I need something like:
from scipy import stats
m = stats.trim_mean(X, 0.1)
where X is an array. Is it possible to do that using (part of) my current code and scipy, or should I just use .mean() and "trim" it manually?
Answer:
Sure, you can use GroupBy.agg, which passes any additional arguments (here 0.1) on to the function:
test_df['extra_col'] = (test_df['column'].str.split(expand=True)
.stack().astype(str)
.str.strip(string.punctuation)
.map(test_dict)
.astype(float)
.groupby(level=0)
.agg(stats.trim_mean, 0.1))
This works the same as passing a lambda function:
test_df['extra_col'] = (test_df['column'].str.split(expand=True)
.stack().astype(str)
.str.strip(string.punctuation)
.map(test_dict)
.astype(float)
.groupby(level=0)
.agg(lambda x: stats.trim_mean(x, 0.1)))
print(test_df)
_id column extra_col
0 1a und der in zu 20.0
1 2b Kompliziertereswort something NaN
2 3c Lehrerin in zu [Buch] 15.0
3 4d Buch (Lehrerin) kompliziertereswort NaN
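Note that rows 2b and 4d come out as NaN because each contains a word that is not in the dictionary, and stats.trim_mean does not skip NaN the way .mean() does. If unknown words should simply be ignored (an assumption, since the question does not say so), one possible sketch is to drop the missing values before grouping:

```python
import string

import pandas as pd
from scipy import stats

test_df = pd.DataFrame({
    '_id': ['1a', '2b', '3c', '4d'],
    'column': ['und der in zu',
               'Kompliziertereswort something',
               'Lehrerin in zu [Buch]',
               'Buch (Lehrerin) kompliziertereswort']})

test_dict = {'und': 20, 'der': 10, 'in': 40, 'zu': 10,
             'Kompliziertereswort': 2, 'Buch': 5, 'Lehrerin': 5}

# One entry per word, indexed by (original row, word position).
values = (test_df['column'].str.split(expand=True)
          .stack()
          .str.strip(string.punctuation)
          .map(test_dict)
          .dropna()            # assumption: words missing from the dict are ignored
          .astype(float))

# Trimmed mean per original row (index level 0), cutting 10% at each end.
test_df['extra_col'] = (values.groupby(level=0)
                        .agg(lambda x: stats.trim_mean(x, 0.1)))

print(test_df)
```

With this variant, rows 2b and 4d would get 2.0 and 5.0 instead of NaN. Keep in mind that trim_mean cuts int(0.1 * n) values from each end, so for sentences with fewer than 10 known words nothing is trimmed and the result equals the plain mean.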