Obtain the average length (in words) of sentences in a dataframe column

Context: I'm trying to obtain the average sentence length, measured in words, for a column in a dataframe.

Basically if we have these 3 sentences in a dataframe:

Sentence1 = "This is a sentence"
Sentence2 = "This is a larger sentence"
Sentence3 = "This is an even larger sentence"

The output should be the average of their lengths, counted in words. So for Sentence1 `len(x.split(" "))` would be 4, for Sentence2 it would be 5, and for Sentence3 it would be 6, so their average would be 5.

How could I do this in a dataframe?

I was trying this:

avg = df['strings'].apply(lambda x: np.mean([len(words.split(" ")) for words in x if isinstance(x,str)]))

This doesn't really make sense, since "x" is already the string, so "words" actually loops through its characters, which is not what I want (splitting a single character never yields a word).
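To make the problem concrete, here is a small sketch (the variable names are only illustrative) comparing what the comprehension iterates over with what a direct split returns:

```python
sentence = "This is a sentence"

# Iterating over a string yields one character at a time,
# so split(" ") is called on every single character.
per_char = [len(ch.split(" ")) for ch in sentence]

# Splitting the whole string once gives the words.
words = sentence.split(" ")

print(len(per_char))  # one entry per character, spaces included
print(len(words))     # one entry per word
```

The first list has 18 entries (one per character), while the second has the 4 words that were wanted.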

Also, it would be nice to filter out "strings" that contain only floats, NaN, or numbers (hence the isinstance(x, str)).

How could I get the length of x.split(" ") if and only if x is a string, and then average the word counts over all the sentences?

Thank you in advance

CodePudding user response:

import pandas as pd

df = pd.DataFrame({'sentence':
                   ["This is a sentence",
                    "This is a larger sentence",
                    "This is an even larger sentence",
                    "",
                    1,
                    None]})

df = 
                          sentence
0               This is a sentence
1        This is a larger sentence
2  This is an even larger sentence
3                                 
4                                1
5                             None

df['length'] = df['sentence'].apply(
    lambda row: min(len(row.split(" ")), len(row)) if isinstance(row, str) else None
)

df['length'] = 
0    4.0
1    5.0
2    6.0
3    0.0
4    NaN
5    NaN

df['length'].mean() = 3.75

If you want to assign the length 1 for "", use len(row.split(" ")) instead of min(len(row.split(" ")), len(row)).
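As a possible alternative to apply, the same counts can probably be obtained with the pandas string accessor. This is only a sketch under one assumption that differs from the code above: str.split() is called without an argument, so it splits on any run of whitespace rather than on a single " ", which makes the empty string count as 0 words. Non-string entries come out as NaN and are skipped by mean().

```python
import pandas as pd

df = pd.DataFrame({'sentence':
                   ["This is a sentence",
                    "This is a larger sentence",
                    "This is an even larger sentence",
                    "",
                    1,
                    None]})

# .str.split() turns each string into a list of words (NaN for non-strings);
# .str.len() then measures the length of each list.
counts = df['sentence'].str.split().str.len()

# mean() ignores NaN by default, so only the four string rows contribute.
avg = counts.mean()
print(avg)
```

With this dataframe the counts are 4, 5, 6 and 0 for the string rows, so the mean matches the 3.75 shown above.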

CodePudding user response:

The code fails because you are looping over the characters of the string instead of splitting the sentence into words on " ". Try it this way:

CODE

import pandas as pd
import numpy as np

df = pd.DataFrame({"strings": ["This is a sentence",
                               "This is a larger sentence",
                               "This is an even larger sentence"]})

# Count words only for string values; anything else becomes NaN, which mean() skips
avg = df['strings'].apply(lambda x: len(x.split(" ")) if isinstance(x, str) else np.nan).mean()
print(avg)

OUTPUT

5.0
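The question also asks to leave out "strings" that only hold numbers. One way this might be handled is sketched below. The helpers is_number_like and word_count are hypothetical names introduced here, not part of pandas, and treating empty or whitespace-only strings as missing is an assumption.

```python
import numpy as np
import pandas as pd

def is_number_like(text):
    """Return True if the string parses as a float (e.g. "42", "3.14", "nan")."""
    try:
        float(text)
        return True
    except ValueError:
        return False

def word_count(value):
    """Number of words in a real sentence, NaN for anything else."""
    # Skip non-strings, blank strings (an assumption), and numeric-only strings.
    if not isinstance(value, str) or not value.strip() or is_number_like(value):
        return np.nan
    return len(value.split())

df = pd.DataFrame({"strings": ["This is a sentence",
                               "This is a larger sentence",
                               "This is an even larger sentence",
                               "3.14", "42", "nan", 7.5, np.nan, None]})

counts = df["strings"].apply(word_count)
avg = counts.mean()  # NaN rows are ignored
print(avg)  # prints 5.0
```

Only the three real sentences contribute to the mean; every other row is NaN in counts.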