Trimmed/truncated mean for contents in a dataframe row

Time:09-23

I have a dataframe and a dictionary. One of the dataframe's columns contains sentences. For every word in a sentence, I want to look up its value in the dictionary. I then want to compute the trimmed (truncated) mean of those values per sentence (that is, per row), trimming 10% at both ends, and save the result in an extra column.

import string  # needed below for string.punctuation

import pandas as pd

test_df = pd.DataFrame({
    '_id': ['1a', '2b', '3c', '4d'],
    'column': ['und der in zu',
               'Kompliziertereswort something',
               'Lehrerin in zu [Buch]',
               'Buch (Lehrerin) kompliziertereswort']})

# word -> value lookup
test_dict = {'und': 20,
             'der': 10,
             'in': 40,
             'zu': 10,
             'Kompliziertereswort': 2,
             'Buch': 5,
             'Lehrerin': 5}

Calculating the arithmetic mean is straightforward:

test_df['extra_col'] = (test_df['column'].str.split(expand=True)
                                     .stack().astype(str)
                                     .str.strip(string.punctuation)
                                     .map(test_dict)
                                     .astype(float)
                                     .groupby(level=0)
                                     .mean())

But for the truncated mean, I need something like:

from scipy import stats
m = stats.trim_mean(X, 0.1) 

where X is an array. Is it possible to do that using (part of) my current code and scipy or should I just use .mean() and "trim" it manually?
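
For reference, trimming "manually" could look roughly like the sketch below. It assumes the same rule that scipy's trim_mean appears to use: int(proportion * n) values are dropped from each end of the sorted data, so with 10% trimming a sentence with fewer than 10 known words loses nothing. The name manual_trim_mean is made up for illustration.

```python
def manual_trim_mean(values, proportion):
    """Mean of `values` after dropping a share of the sorted values at each end."""
    ordered = sorted(values)
    n = len(ordered)
    cut = int(proportion * n)      # rounds down, so small samples are not trimmed
    kept = ordered[cut:n - cut]
    if not kept:
        raise ValueError("nothing left to average after trimming")
    return sum(kept) / len(kept)


short_row = [20, 10, 40, 10]                  # 4 values: int(0.4) == 0 trimmed
long_row = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # 10 values: 1 trimmed per end

print(manual_trim_mean(short_row, 0.1))  # 20.0, same as the plain mean
print(manual_trim_mean(long_row, 0.1))   # 5.5, the outlier 100 is dropped
```

With the short example sentences in this question, the trimmed mean and the plain mean of the known values therefore coincide.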

User response:

Sure, you can use GroupBy.agg:

test_df['extra_col'] = (test_df['column'].str.split(expand=True)
                                         .stack().astype(str)
                                         .str.strip(string.punctuation)
                                         .map(test_dict)
                                         .astype(float)
                                         .groupby(level=0)
                                         .agg(stats.trim_mean, 0.1))

This is equivalent to passing a lambda function:

test_df['extra_col'] = (test_df['column'].str.split(expand=True)
                                         .stack().astype(str)
                                         .str.strip(string.punctuation)
                                         .map(test_dict)
                                         .astype(float)
                                         .groupby(level=0)
                                         .agg(lambda x: stats.trim_mean(x, 0.1)))
print(test_df)
  _id                               column  extra_col
0  1a                        und der in zu       20.0
1  2b        Kompliziertereswort something        NaN
2  3c                Lehrerin in zu [Buch]       15.0
3  4d  Buch (Lehrerin) kompliziertereswort        NaN
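
The NaN in rows 1 and 3 comes from words that are not in the dictionary ('something' and the lowercase 'kompliziertereswort'): they map to NaN, and, as the output above shows, trim_mean propagates NaN where .mean() would skip it. If unknown words should simply be ignored, one option is to drop them before aggregating. The sketch below is a variation, not the code from the answer: it swaps split(expand=True).stack() for str.split().explode(), and the helper name row_trim_mean is made up.

```python
import string

import pandas as pd
from scipy import stats

test_df = pd.DataFrame({
    '_id': ['1a', '2b', '3c', '4d'],
    'column': ['und der in zu',
               'Kompliziertereswort something',
               'Lehrerin in zu [Buch]',
               'Buch (Lehrerin) kompliziertereswort']})

test_dict = {'und': 20, 'der': 10, 'in': 40, 'zu': 10,
             'Kompliziertereswort': 2, 'Buch': 5, 'Lehrerin': 5}


def row_trim_mean(sentences, lookup, proportion=0.1):
    """Per-row trimmed mean of looked-up word values, ignoring unknown words."""
    # One word per entry; the original row label is kept as the index.
    words = sentences.str.split().explode().str.strip(string.punctuation)
    # Unknown words map to NaN and are dropped before trimming.
    values = words.map(lookup).dropna().astype(float)
    return values.groupby(level=0).agg(lambda s: stats.trim_mean(s, proportion))


# Rows without any known word are absent from the result and become NaN
# when the assignment aligns on the index.
test_df['extra_col'] = row_trim_mean(test_df['column'], test_dict)
print(test_df)  # extra_col: 20.0, 2.0, 15.0, 5.0
```

Whether dropping unknown words is the right choice depends on the use case; keeping the NaN makes incomplete rows easy to spot.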