How to get the average of averages for a column of lists of lists stored as strings?

Time:02-25

I have a dataframe with a column like this:

data = [
    '[[0.1, 0.2, 0.3], [0, 0.5]]',
    '[[0.1, 0.2], [0.3, 0.4, 0.5], [0, 0.4]]'
]
df = pd.DataFrame(data, columns=['word_probs'])

Each number is the probability of one word, each inner list is one sentence, and each row is one paragraph; the number of words and sentences varies. I would like to add another column, average_prob, that holds the average of the per-sentence averages for each row, so 0.225 and 0.25 here.

The data type of column word_probs is string.

How can I achieve this? Thanks a lot in advance!

CodePudding user response:

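We first need to convert the string to a list with ast.literal_eval, then explode it so that each inner list gets its own row, take the mean of each inner list, and average those means per original row: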

import ast
import numpy as np

df.word_probs.map(ast.literal_eval).explode().map(np.mean).groupby(level=0).mean()
Out[408]: 
0    0.225
1    0.250
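To make the one-liner easier to follow, here is the same chain split into its four steps. This is only a sketch: the intermediate names (parsed, inner_lists, inner_means, average_prob) are mine and not part of the answer, and the data is the sample from the question.

```python
import ast

import numpy as np
import pandas as pd

data = [
    '[[0.1, 0.2, 0.3], [0, 0.5]]',
    '[[0.1, 0.2], [0.3, 0.4, 0.5], [0, 0.4]]',
]
df = pd.DataFrame(data, columns=['word_probs'])

# 1. Parse each string into a real Python list of lists.
parsed = df['word_probs'].map(ast.literal_eval)

# 2. Give every inner list its own row; the original row label
#    is repeated in the index (0, 0, 1, 1, 1 for this data).
inner_lists = parsed.explode()

# 3. Mean of each inner list, i.e. the per-sentence average.
inner_means = inner_lists.map(np.mean)

# 4. Group by the original row label and average the per-sentence means.
average_prob = inner_means.groupby(level=0).mean()
print(average_prob)
```

Because explode keeps the original index, groupby(level=0) is what ties the per-sentence means back to the row they came from.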

CodePudding user response:

There's already a much more compact answer, but I'm including a few lines on storing the computed averages in the dataframe, plus a longer, more explicit version.

Short way, using BENY's answer:

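import ast
import numpy as np
import pandas as pd

data = [
    '[[0.1, 0.2, 0.3], [0, 0.5]]',
    '[[0.1, 0.2], [0.3, 0.4, 0.5], [0, 0.4]]'
]
df = pd.DataFrame(data, columns=['word_probs'])
# the result is indexed by the original row labels, so it aligns with df on assignment
df['average_prob'] = df.word_probs.map(ast.literal_eval).explode().map(np.mean).groupby(level=0).mean()
print(df)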

Longer way, without the ast import:

(This can also be abbreviated; I just figured I'd include every conceivable step. For example, the list-append pattern I reuse can be replaced with generators.)

import numpy as np
import pandas as pd

def row_averages(df: pd.DataFrame) -> list[float]:
    row_average_list: list[float] = []
    # create the new col as floats (an int column would not accept the float averages cleanly)
    df['average_prob'] = [0.0] * len(df['word_probs'])
    # note: df.at[i, ...] below assumes the default 0..n-1 index
    for i, row in enumerate(df['word_probs']):
        temp: str = row
        # drop the outer brackets, then split into one string per inner list
        segments_no_brackets = temp.strip('[[').strip(']]').split('], [')
        average_list: list[float] = []
        for seg in segments_no_brackets:
            list_of_str_float: list[str] = seg.split(', ')
            # any tallying datastructure will do
            internal_list: list[float] = []
            for token in list_of_str_float:
                number = float(token)
                internal_list.append(number)
            inner_avg = np.mean(internal_list)
            average_list.append(inner_avg)
        row_average = np.mean(average_list)
        row_average_list.append(row_average)
        # enter into new col
        df.at[i, 'average_prob'] = row_average  # overwrite the zeroes set above with the average
    print(df)
    # do a rounding here if you want to control sig figs (':.3' keeps 3 significant figures)
    return [float(f'{num:.3}') for num in row_average_list]

output of print(df):

                                word_probs  average_prob
0              [[0.1, 0.2, 0.3], [0, 0.5]]         0.225
1  [[0.1, 0.2], [0.3, 0.4, 0.5], [0, 0.4]]         0.250
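As mentioned above, the append loops can be collapsed. A minimal sketch of that shorter form, still without ast: the helper name row_average is made up for illustration, and it relies on the same assumption as the long version, namely that every string has exactly the '[[a, b], [c, d]]' layout with ', ' separators and no empty inner lists.

```python
import numpy as np
import pandas as pd


def row_average(cell: str) -> float:
    """Average of the per-list averages for one '[[...], [...]]' string."""
    # Strip the outer brackets, then split into one string per inner list.
    # Assumes the exact '[[a, b], [c, d]]' layout; anything else will break.
    segments = cell.strip('[]').split('], [')
    return float(np.mean([
        np.mean([float(token) for token in seg.split(', ')])
        for seg in segments
    ]))


data = [
    '[[0.1, 0.2, 0.3], [0, 0.5]]',
    '[[0.1, 0.2], [0.3, 0.4, 0.5], [0, 0.4]]',
]
df = pd.DataFrame(data, columns=['word_probs'])

# map applies the helper to each string and keeps the original index
df['average_prob'] = df['word_probs'].map(row_average)
print(df)
```

Using map instead of enumerate plus df.at also means the result no longer depends on the dataframe having a default 0..n-1 index.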