Extracting mean number of words in verb phrases

So I have a bit of a silly question, but being fairly new to Python, I can't seem to find the answer myself. I extracted verb phrases using spaCy's Matcher. Now I'm hoping to get the mean number of words in the extracted verb phrases for each person's text and store it in a new dataframe column. To do this, I'm trying to write a function that I'll then apply to the dataframe column that holds the phrases.

I created this function:

def get_length_phrases(column):
    for phrase in column:
        phrase_length = len(phrase)
        mean_length = np.mean(phrase_length)
    return mean_length

The problem is, when I apply it to the column in which the verb phrases are stored, I get an output that looks like this:

0      1.0
1      1.0
2      1.0
3      1.0
4      1.0
      ... 
235    1.0
236    1.0
237    1.0
238    1.0
239    1.0
Name: verb_phrases_length, Length: 240, dtype: float64

But there is way more than one word per phrase, so clearly I'm doing something wrong, and I can't seem to figure out what. statistics.mean doesn't work either.

CodePudding user response:

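np.mean() takes an array (or something array-like) as its argument. As far as I can tell (correct me if I'm wrong), you are taking the mean of the length of each phrase individually, which is just one number, and the mean of one number is that number. Because mean_length is overwritten on every pass through the loop, the function ends up returning the length of the last phrase only.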

From numpy docs:

Parameters: a:array_like - Array containing numbers whose mean is desired. If a is not an array, a conversion is attempted.
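To make the difference concrete, here is a small sketch that puts the loop from the question next to a version that takes the mean once at the end. The function names and the sample phrases are made up for illustration, and each phrase is assumed to be a list of tokens so that len() counts words.

```python
import numpy as np

def mean_inside_loop(phrases):
    # Mirrors the loop in the question: np.mean sees one number per pass,
    # and mean_length is overwritten each time
    for phrase in phrases:
        phrase_length = len(phrase)
        mean_length = np.mean(phrase_length)
    return mean_length

def mean_after_loop(phrases):
    # Collect every length first, then take the mean once
    lengths = [len(phrase) for phrase in phrases]
    return np.mean(lengths)

# Hypothetical phrases, each represented as a list of tokens
phrases = [["has", "been", "running"], ["will", "go"], ["ran"]]

last_only = mean_inside_loop(phrases)  # length of the last phrase: 1.0
true_mean = mean_after_loop(phrases)   # (3 + 2 + 1) / 3 = 2.0
print(last_only, true_mean)
```

The first function returns 1.0 here only because the last phrase happens to have one token, which may be the same effect that produces the column of 1.0 values in the question.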

You'll want to save each length to a list and then pass that list to np.mean():

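import numpy as np

def get_length_phrases(column):
    # Collect the length of every phrase first
    phrase_lengths = []
    for phrase in column:
        phrase_lengths.append(len(phrase))
    # Take the mean once, over all the collected lengths
    mean_length = np.mean(phrase_lengths)
    return mean_length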

If, at this point, you are still getting 1.0, it is likely an issue with getting the phrases, not this function.
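One thing worth checking in that case is what each cell of the column actually holds. The sketch below assumes every row stores a list of phrases as plain strings, so words are counted with split(); the column names and the sample data are hypothetical. If the phrases are spaCy Span objects instead, len(phrase) already gives the number of tokens and the split() call is not needed.

```python
import numpy as np
import pandas as pd

def get_length_phrases(phrases):
    """Mean number of words per phrase for one row's list of phrases."""
    if len(phrases) == 0:
        # No phrases were matched for this row, so there is no mean to report
        return np.nan
    # Assumption: each phrase is a plain string, so split() yields its words
    return np.mean([len(phrase.split()) for phrase in phrases])

# Hypothetical data: one list of extracted verb phrases per person
df = pd.DataFrame({
    "verb_phrases": [
        ["has been running", "will go"],
        ["ran"],
        [],
    ]
})

# Series.apply calls the function once per row, passing that row's list
df["verb_phrases_length"] = df["verb_phrases"].apply(get_length_phrases)
print(df)
```

Note that Series.apply hands the function one cell at a time, so the loop inside the function runs over the phrases of a single person, not over the whole column.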