Context: I'm trying to obtain the average number of words per sentence for a column in a dataframe.
Basically if we have these 3 sentences in a dataframe:
Sentence1 = "This is a sentence"
Sentence2 = "This is a larger sentence"
Sentence3 = "This is an even larger sentence"
The output should be their average length, counted in words. So for Sentence1, `len(x.split(" "))` would be 4, Sentence2 would be 5, and Sentence3 would be 6, so their average would be 5.
How could I do this in a dataframe?
I was trying this
avg = df['strings'].apply(lambda x: np.mean([len(words.split(" ")) for words in x if isinstance(x,str)]))
This doesn't really make sense: "x" is already the string, so "words" loops through its characters, which is not what I want (and splitting a single character just returns that character).
Also, it would be nice to filter out entries that are floats, NaN, or only numbers (hence the isinstance(x,str)).
How could I get the length of x.split(" ") if and only if x is a string, and then average the word counts over all the sentences?
Thank you in advance
CodePudding user response:
import pandas as pd
df = pd.DataFrame({'sentence':
["This is a sentence",
"This is a larger sentence",
"This is an even larger sentence",
"",
1,
None]})
df =
sentence
0 This is a sentence
1 This is a larger sentence
2 This is an even larger sentence
3
4 1
5 None
df['length'] = df['sentence'].apply(
lambda row: min(len(row.split(" ")), len(row)) if isinstance(row, str) else None
)
df['length'] =
0 4.0
1 5.0
2 6.0
3 0.0
4 NaN
5 NaN
df['length'].mean() = 3.75
If you want to assign the length 1 for "", use len(row.split(" ")) instead of min(len(row.split(" ")), len(row)).
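The same result can also be sketched with pandas string methods instead of a row-wise lambda. This is only an illustration, not part of the answer above: the names only_str, word_counts and avg_words are made up here, and it assumes that splitting on any run of whitespace (str.split() with no argument) is acceptable, which makes the empty string count as 0 words.

```python
import pandas as pd

df = pd.DataFrame({'sentence':
                   ["This is a sentence",
                    "This is a larger sentence",
                    "This is an even larger sentence",
                    "",
                    1,
                    None]})

# Keep only genuine strings; every other entry becomes NaN.
is_text = df['sentence'].map(lambda v: isinstance(v, str))
only_str = df['sentence'].where(is_text)

# Split on whitespace, then count the pieces in each resulting list.
# NaN entries stay NaN, so mean() skips them by default.
word_counts = only_str.str.split().str.len()
avg_words = word_counts.mean()

print(word_counts)
print(avg_words)
```

With this data the average comes out the same as above, since the empty string contributes 0 and the two non-string rows are ignored.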
CodePudding user response:
The code fails because you are not splitting the sentence into words on " ". Try it this way:
CODE
import pandas as pd
import numpy as np
df = pd.DataFrame({"strings": ["This is a sentence",
"This is a larger sentence",
"This is an even larger sentence"]})
# Count words only for real strings; other values become NaN, which mean() skips
avg = df['strings'].apply(lambda x: len(x.split(" ")) if isinstance(x, str) else np.nan).mean()
print(avg)
OUTPUT
5.0
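The question also asks to skip entries that are numbers or strings containing only a number. A possible sketch, under the assumption that "only numbers" means anything float() can parse; the helper name count_words and the extra rows in the sample data are hypothetical and added only for illustration:

```python
import pandas as pd


def count_words(value):
    """Return the word count of a text value, or NaN if it is not text."""
    if not isinstance(value, str):
        return float("nan")   # floats, ints, None, NaN
    try:
        float(value)          # "42" and "3.14" parse as numbers
    except ValueError:
        # Not a number, so treat it as a sentence and count its words.
        return len(value.split())
    return float("nan")       # numeric-looking string, skip it


df = pd.DataFrame({"strings": ["This is a sentence",
                               "This is a larger sentence",
                               "This is an even larger sentence",
                               "3.14",
                               "42",
                               7.5,
                               None]})

counts = df["strings"].apply(count_words)
avg = counts.mean()  # NaN rows are skipped by default
print(avg)
```

Only the three real sentences contribute to the mean here; note that float() also accepts strings such as "nan" and "inf", so those would be skipped too.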