I have a dataframe with a column like this:
import pandas as pd

data = [
    '[[0.1, 0.2, 0.3], [0, 0.5]]',
    '[[0.1, 0.2], [0.3, 0.4, 0.5], [0, 0.4]]'
]
df = pd.DataFrame(data, columns=['word_probs'])
Each value is the probability of one word in one sentence of one paragraph; the number of words and sentences varies from row to row. I would like to add another column, average_prob, that holds the average of the per-sentence averages for each row, so 0.225 and 0.25 here.
The data type of the word_probs column is string.
How can I achieve this? Thanks a lot in advance!
CodePudding user response:
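First convert each string to a list with ast.literal_eval, then explode so that every inner list gets its own row, take the mean of each inner list, and average those means per original row with groupby(level=0):
import ast
import numpy as np

df.word_probs.map(ast.literal_eval).explode().map(np.mean).groupby(level=0).mean()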
Out[408]:
0 0.225
1 0.250
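To make the chained one-liner above easier to follow, here is a sketch that splits it into its four steps. The intermediate variable names (parsed, sentences, sentence_means, average_prob) are my own and not part of the answer; the dataframe is rebuilt from the question so the block runs on its own.

```python
import ast

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "word_probs": [
            "[[0.1, 0.2, 0.3], [0, 0.5]]",
            "[[0.1, 0.2], [0.3, 0.4, 0.5], [0, 0.4]]",
        ]
    }
)

# Step 1: parse each string into a real nested list.
parsed = df["word_probs"].map(ast.literal_eval)

# Step 2: one row per sentence; the original row label is repeated in the index.
sentences = parsed.explode()

# Step 3: average the word probabilities inside each sentence.
sentence_means = sentences.map(np.mean)

# Step 4: average the sentence means that share the same original row label.
average_prob = sentence_means.groupby(level=0).mean()

print(average_prob)
```

The key point is that explode keeps the original row label on every exploded row, which is what lets groupby(level=0) put the sentences of one paragraph back together.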
CodePudding user response:
There's a much more compact answer already, but I'm adding a few lines on storing the computed averages in the dataframe.
Short way, using BENY's answer:
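import ast
import numpy as np
import pandas as pd

data = [
    '[[0.1, 0.2, 0.3], [0, 0.5]]',
    '[[0.1, 0.2], [0.3, 0.4, 0.5], [0, 0.4]]'
]
df = pd.DataFrame(data, columns=['word_probs'])
# the grouped result is indexed by the original row labels, so it lines up with df
df['average_prob'] = df.word_probs.map(ast.literal_eval).explode().map(np.mean).groupby(level=0).mean()
print(df)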
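One detail about storing the result this way: assigning a Series to a new column matches values to rows by index label, not by position. The sketch below illustrates that with made-up paragraph labels (para_a, para_b are hypothetical, not from the question) and assumes the index labels are unique, since groupby(level=0) would merge rows that share a label.

```python
import ast

import numpy as np
import pandas as pd

# Hypothetical paragraph labels, used instead of the default 0..n-1 index.
df = pd.DataFrame(
    {
        "word_probs": [
            "[[0.1, 0.2, 0.3], [0, 0.5]]",
            "[[0.1, 0.2], [0.3, 0.4, 0.5], [0, 0.4]]",
        ]
    },
    index=["para_a", "para_b"],
)

averages = (
    df["word_probs"]
    .map(ast.literal_eval)
    .explode()
    .map(np.mean)
    .groupby(level=0)
    .mean()
)

# Column assignment aligns the Series with the dataframe by index label.
df["average_prob"] = averages
print(df)
```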
Longer way, without the ast import:
(This can also be abbreviated; I just figured I'd include every conceivable step. For example, the list-append pattern I reuse can be replaced with generators.)
import numpy as np
import pandas as pd

def row_averages(df: pd.DataFrame) -> list[float]:
    row_average_list: list[float] = []
    # create the new col as floats (0.0, not 0) so the averages fit its dtype
    df['average_prob'] = [0.0] * len(df['word_probs'])
    # iterate by index label so df.at also works when the index is not 0..n-1
    for idx, row in df['word_probs'].items():
        temp: str = row
        # drop the outer brackets, then split into one string per inner list
        segments_no_brackets = temp.strip('[[').strip(']]').split('], [')
        average_list: list[float] = []
        for seg in segments_no_brackets:
            list_of_str_float: list[str] = seg.split(', ')
            # any tallying datastructure will do
            internal_list: list[float] = []
            for char in list_of_str_float:
                number = float(char)
                internal_list.append(number)
            inner_avg = np.mean(internal_list)
            average_list.append(inner_avg)
        row_average = np.mean(average_list)
        row_average_list.append(row_average)
        # enter into new col
        df.at[idx, 'average_prob'] = row_average  # overwrite the zeroes set above with the average
    print(df)
    # do a rounding here if you want to control sig figs
    return [float(f'{num:.3}') for num in row_average_list]
Output of print(df):
                                word_probs  average_prob
0              [[0.1, 0.2, 0.3], [0, 0.5]]         0.225
1  [[0.1, 0.2], [0.3, 0.4, 0.5], [0, 0.4]]         0.250
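As mentioned in the note on the longer way, the append pattern can be replaced with generators. Here is one possible sketch of that idea, still without ast. The helper name row_average is my own, and I swapped np.mean for statistics.fmean from the standard library because it accepts generators directly. Like the longer version, it assumes every cell looks exactly like '[[a, b], [c, d]]', with "], [" between inner lists.

```python
from statistics import fmean

import pandas as pd


def row_average(cell: str) -> float:
    """Average of the per-sentence averages for one '[[...], [...]]' string."""
    # Drop the outer brackets, then split into one string per inner list.
    sentences = cell.strip("[]").split("], [")
    # Inner generator: the numbers of one sentence; outer generator: one mean per sentence.
    return fmean(
        fmean(float(token) for token in sentence.split(","))
        for sentence in sentences
    )


df = pd.DataFrame(
    {
        "word_probs": [
            "[[0.1, 0.2, 0.3], [0, 0.5]]",
            "[[0.1, 0.2], [0.3, 0.4, 0.5], [0, 0.4]]",
        ]
    }
)
df["average_prob"] = df["word_probs"].map(row_average)
print(df)
```

Because map works row by row, this version needs neither explode nor groupby, at the cost of relying on the exact string layout.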