Find the max of a list in a Pandas dataframe column

Time:10-21

I am reading data from an RSS feed into a dataframe, and am trying to convert words (toddlers,kids,adults) to integers that represent min/max ages. I have gotten as far as replacing the words with numerical strings:

df['audience_max'].head(10)

0          10
1       2,4,3
2       2,4,3
3        10,3
4         2,3
5           4
6         4,3
7         4,3
8     10,4,12
9    2,10,4,3
Name: audience_max, dtype: string

Next, I am running a for loop on this column to find the maximum value in each list and create a new column in the data frame. I used str.split to change the string into a list:

maxlist=[]
for x in df['audience_max'].str.split(','):
    maxlist.append(max(x))
    
df['max_age']=maxlist

This gives the following result. Rows 3, 8, and 9 do not show the maximum value:

0    10
1     4
2     4
3     3
4     3
5     4
6     4
7     4
8     4
9     4
Name: max_age, dtype: object

When I checked the line of code in the for loop that creates the list, it appears to create the list correctly:

(df['audience_max'].str.split(',')).head(10)
0             [10]
1        [2, 4, 3]
2        [2, 4, 3]
3          [10, 3]
4           [2, 3]
5              [4]
6           [4, 3]
7           [4, 3]
8      [10, 4, 12]
9    [2, 10, 4, 3]
Name: audience_max, dtype: object

But the max(x) for some reason does not seem to be returning the max consistently. Is there a different function that I should be using?

CodePudding user response:

You also need to convert the values from strings to integers after the split. max() on strings compares them character by character, so '3' ranks above '10':

df['max_age'] = df['audience_max'].apply(lambda x: max(map(int, x.split(','))))
#alternative
#df['max_age'] = df['audience_max'].apply(lambda x: max(int(y) for y in x.split(',')))
print (df)
  audience_max  max_age
0           10       10
1        2,4,3        4
2        2,4,3        4
3         10,3       10
4          2,3        3
5            4        4
6          4,3        4
7          4,3        4
8      10,4,12       12
9     2,10,4,3       10

Alternatively, split the values into a DataFrame with expand=True, convert to floats (shorter lists leave missing values, which an integer column cannot hold), take the row-wise maximum, and finally convert the result back to integers:

df['max_age'] = (df['audience_max'].str.split(',', expand=True)
                                  .astype(float)
                                  .max(axis=1)
                                  .astype(int))
print (df)
  audience_max  max_age
0           10       10
1        2,4,3        4
2        2,4,3        4
3         10,3       10
4          2,3        3
5            4        4
6          4,3        4
7          4,3        4
8      10,4,12       12
9     2,10,4,3       10
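To see why the loop in the question returned the wrong value, here is a minimal sketch in plain Python, using the contents of row 3 from the question. The variable names are only illustrative.

```python
# Row 3 of the question, after splitting on the comma
parts = "10,3".split(",")   # ['10', '3']

# Strings are compared character by character, so '3' beats '10'
# because '3' sorts after '1'
string_max = max(parts)

# Converting each piece to int compares by numeric value instead
numeric_max = max(int(p) for p in parts)

# key=int also compares numerically, but returns the original string
keyed_max = max(parts, key=int)

print(string_max, numeric_max, keyed_max)   # prints: 3 10 10
```

The same thing happens in rows 8 and 9, where '4' sorts after both '10' and '12'.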

CodePudding user response:

Based on this answer, I was able to use Series.str.extractall to get the max without having to use the for loop:

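# \d+ matches each run of digits; groupby(level=0) groups the matches by original row
# (older pandas versions accepted .max(level=0), which was removed in pandas 2.0)
df['max_age_test'] = df.audience_max.astype(str).str.extractall(r'(\d+)')[0].astype(int).groupby(level=0).max()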

Which returned the correct max ages:

0     10
1      4
2      4
3     10
4      3
5      4
6      4
7      4
8     12
9     10
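As a cross-check, the sketch below rebuilds the ten sample rows from the question and confirms that the apply approach and the extractall approach agree. It assumes a recent pandas release, so it groups on the first index level rather than passing level to max; the names via_apply and via_extract are made up for this example.

```python
import pandas as pd

# Sample data copied from the question's df['audience_max'].head(10)
df = pd.DataFrame({
    "audience_max": ["10", "2,4,3", "2,4,3", "10,3", "2,3",
                     "4", "4,3", "4,3", "10,4,12", "2,10,4,3"]
})

# Approach 1: split each string, convert the pieces to int, take the max
via_apply = df["audience_max"].apply(lambda s: max(map(int, s.split(","))))

# Approach 2: pull out every run of digits, then take the max per original row
via_extract = (df["audience_max"]
               .str.extractall(r"(\d+)")[0]   # one row per match, column 0
               .astype(int)
               .groupby(level=0)              # level 0 is the original row index
               .max())

print(via_apply.tolist())
print(via_extract.tolist())
```

Both lists should come out identical, with 10 for row 3, 12 for row 8, and 10 for row 9.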