How to get the minimum value from a nested-list-column on Pandas? Why numpy.min() doesn't work

Time:11-07

I have a small snippet of code that I need to modify, and I can't work out why np.mean() works where np.min() doesn't when a pandas column contains lists. Could someone clarify?

This snippet here works perfectly:

import pandas as pd
import numpy as np


def transformation(custom_df):
    dic = dict(zip(custom_df['customers'], custom_df['values']))
    custom_df['values'] = np.where(custom_df['values'].isna() & (custom_df['valid_neighbors'] >= 1),
                                   custom_df['neighbors'].apply(
                                       lambda row: np.mean([dic[v] for v in row if dic.get(v)])),
                                   custom_df['values'])
    return custom_df


customers = [1, 2, 3, 4, 5, 6]
values = [np.nan, np.nan, 10, np.nan, 11, 12]
neighbors = [[6], [3], [], [3, 5], [6], [5]]
vn = [1, 1, 0, 2, 1, 1]
df2 = pd.DataFrame({'customers': customers, 'values': values, 'neighbors': neighbors, 'valid_neighbors': vn})


   customers  values neighbors  valid_neighbors
0          1     NaN       [6]                1
1          2     NaN       [3]                1
2          3    10.0        []                0
3          4     NaN    [3, 5]                2
4          5    11.0       [6]                1
5          6    12.0       [5]                1

df2 = transformation(df2)

The result:

   customers  values neighbors  valid_neighbors
0          1    12.0       [6]                1
1          2    10.0       [3]                1
2          3    10.0        []                0
3          4    10.5    [3, 5]                2
4          5    11.0       [6]                1
5          6    12.0       [5]                1

However, if I change np.mean() to np.min() in the transformation() function, it raises a ValueError, which makes me wonder why the same thing doesn't happen with np.mean():

ValueError: zero-size array to reduction operation minimum which has no identity

I would like to know which condition I'm not meeting and what I can do to get the expected result, which is:

   customers  values neighbors  valid_neighbors
0          1    12.0       [6]                1
1          2    10.0       [3]                1
2          3    10.0        []                0
3          4    10.0    [3, 5]                2
4          5    11.0       [6]                1
5          6    12.0       [5]                1

CodePudding user response:

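There is an empty list in your neighbors column (customer 3). np.min raises an error on an empty list, whereas np.mean returns nan. The condition in np.where does not protect you here: both value arguments are computed in full before np.where selects between them, so the apply call runs on every row, including the one with the empty list.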

import numpy as np

print(np.mean([]))
# Output (NumPy also emits a RuntimeWarning here)
# nan

print(np.min([]))
# Raises
# ValueError: zero-size array to reduction operation minimum which has no identity
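
If you want a fallback value instead of an exception, here is a minimal sketch of two options I know of: Python's built-in min() with its default argument, and np.min() with its initial argument. The variable names are only for illustration, and which fallback (nan or inf) makes sense depends on your data.

```python
import warnings

import numpy as np

empty = []

# np.mean of an empty sequence returns nan; silence the RuntimeWarning
# so the demo output stays clean.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=RuntimeWarning)
    mean_of_empty = np.mean(empty)

# np.min has no identity value to fall back on, so it raises ValueError.
try:
    np.min(empty)
    min_raised = False
except ValueError:
    min_raised = True

# Option 1: the built-in min() returns `default` for an empty iterable.
builtin_min = min(empty, default=np.nan)

# Option 2: np.min() uses `initial` as the starting value of the reduction,
# so an empty input simply returns it.
numpy_min = np.min(empty, initial=np.inf)

print(mean_of_empty, min_raised, builtin_min, numpy_min)
```

Note that with initial=np.inf an empty list gives inf rather than nan, so the built-in min() with default=np.nan is probably the closer match to what np.mean does.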

CodePudding user response:

Use the following code to get the result:

df3 = df2.set_index('customers')
df2['values'].fillna(df2['neighbors'].apply(lambda x: df3.loc[x, 'values'].mean()))

Output (mean):

0   12.00
1   10.00
2   10.00
3   10.50
4   11.00
5   12.00
Name: values, dtype: float64



You can change mean to min:

df2['values'].fillna(df2['neighbors'].apply(lambda x: df3.loc[x, 'values'].min()))

Output (min):

0   12.00
1   10.00
2   10.00
3   10.00
4   11.00
5   12.00
Name: values, dtype: float64

fillna returns a new Series, so assign the result back to the values column to get the desired result.
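
Put together as a runnable sketch, using the sample data from the question (neighbor_min is just a name chosen here). Like the lines above, it fills every NaN that has at least one neighbor with a known value and does not look at valid_neighbors, which is an assumption about what you want.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'customers': [1, 2, 3, 4, 5, 6],
    'values': [np.nan, np.nan, 10, np.nan, 11, 12],
    'neighbors': [[6], [3], [], [3, 5], [6], [5]],
    'valid_neighbors': [1, 1, 0, 2, 1, 1],
})

# Index by customer id so each neighbors list can be used as a .loc key.
df3 = df2.set_index('customers')

# An empty neighbors list selects an empty Series, whose .min() is NaN
# rather than an error.
neighbor_min = df2['neighbors'].apply(lambda x: df3.loc[x, 'values'].min())

# Only missing values are replaced; existing values are kept.
df2['values'] = df2['values'].fillna(neighbor_min)

print(df2)
```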

CodePudding user response:

It would be better to update your transformation function so that it handles an empty list in the neighbors column. Here's a workaround that may work; it returns 0 for rows whose neighbors list is empty.

def transformation(custom_df):
    dic = dict(zip(custom_df['customers'], custom_df['values']))
    custom_df['values'] = np.where(custom_df['values'].isna() & (custom_df['valid_neighbors'] >= 1),
                                   custom_df['neighbors'].apply(
                                       lambda row: np.min([dic[v] for v in row if dic.get(v)]) if len(row) else 0),
                                   custom_df['values'])
    return custom_df
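
One caveat with the len(row) check: it only looks at the original list, so a row whose neighbors are all filtered out by the comprehension would still raise. A possible variant, as a sketch only: transformation_min, neighbor_min and lookup are names made up for this example, and it assumes that neighbors with a NaN value should be skipped and that a value of 0 is a legitimate value (the original "if dic.get(v)" filter drops zeros and keeps NaN). It also returns a copy instead of modifying the input.

```python
import numpy as np
import pandas as pd


def transformation_min(custom_df):
    # Lookup table: customer id -> current value (may be NaN).
    lookup = dict(zip(custom_df['customers'], custom_df['values']))

    def neighbor_min(row):
        # Keep only neighbors that exist in the table and have a non-NaN value.
        known = [lookup[v] for v in row
                 if v in lookup and not pd.isna(lookup[v])]
        # Built-in min() returns the default for an empty list instead of raising.
        return min(known, default=np.nan)

    needs_fill = custom_df['values'].isna() & (custom_df['valid_neighbors'] >= 1)

    result = custom_df.copy()
    result['values'] = np.where(needs_fill,
                                result['neighbors'].apply(neighbor_min),
                                result['values'])
    return result


df2 = pd.DataFrame({
    'customers': [1, 2, 3, 4, 5, 6],
    'values': [np.nan, np.nan, 10, np.nan, 11, 12],
    'neighbors': [[6], [3], [], [3, 5], [6], [5]],
    'valid_neighbors': [1, 1, 0, 2, 1, 1],
})

filled = transformation_min(df2)
print(filled)
```

Using nan rather than 0 as the fallback means a row that cannot be filled stays visibly missing instead of looking like a real value.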