I am reading data from an RSS feed into a dataframe and trying to convert words (toddlers, kids, adults) to integers that represent min/max ages. So far I have replaced the words with numeric strings:
df['audience_max'].head(10)
0 10
1 2,4,3
2 2,4,3
3 10,3
4 2,3
5 4
6 4,3
7 4,3
8 10,4,12
9 2,10,4,3
Name: audience_max, dtype: string
Next, I am running a for loop on this column to find the maximum value in each list and create a new column in the data frame. I used str.split to change the string into a list:
maxlist = []
for x in df['audience_max'].str.split(','):
    maxlist.append(max(x))
df['max_age'] = maxlist
With the following result. You can see that rows 3, 8 and 9 are not showing the maximum value:
0 10
1 4
2 4
3 3
4 3
5 4
6 4
7 4
8 4
9 4
Name: max_age, dtype: object
When I checked the line of code in the for loop that creates the list, it appears to create the list correctly:
(df['audience_max'].str.split(',')).head(10)
0 [10]
1 [2, 4, 3]
2 [2, 4, 3]
3 [10, 3]
4 [2, 3]
5 [4]
6 [4, 3]
7 [4, 3]
8 [10, 4, 12]
9 [2, 10, 4, 3]
Name: audience_max, dtype: object
But max(x) does not seem to return the maximum consistently. Is there a different function that I should be using?
CodePudding user response:
You also need to convert the values from strings to integers after the split; otherwise max compares them as text, so '3' is greater than '10':
df['max_age'] = df['audience_max'].apply(lambda x: max(map(int, x.split(','))))
#alternative
#df['max_age'] = df['audience_max'].apply(lambda x: max(int(y) for y in x.split(',')))
print(df)
audience_max max_age
0 10 10
1 2,4,3 4
2 2,4,3 4
3 10,3 10
4 2,3 3
5 4 4
6 4,3 4
7 4,3 4
8 10,4,12 12
9 2,10,4,3 10
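To see in isolation why the original loop picked 3 over 10, here is a minimal sketch using a plain Python list. The variable names are illustrative and not from the question; the list simply stands in for one row after str.split(',').

```python
# One row as it comes out of str.split(','): the items are strings, not numbers.
parts = ['10', '3']

# Strings are compared character by character, so '3' beats '10' because '3' > '1'.
max_as_text = max(parts)

# Converting each item to int first gives a numeric comparison.
max_as_number = max(int(p) for p in parts)

print(max_as_text, max_as_number)
```

The same thing happens in rows 8 and 9 of the question, where '4' wins over '10' and '12'.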
Alternatively, split the values into a DataFrame with expand=True, convert to floats (because shorter rows are padded with NaNs), take the row-wise max, and finally convert to integers:
df['max_age'] = (df['audience_max'].str.split(',', expand=True)
                                   .astype(float)
                                   .max(axis=1)
                                   .astype(int))
print(df)
audience_max max_age
0 10 10
1 2,4,3 4
2 2,4,3 4
3 10,3 10
4 2,3 3
5 4 4
6 4,3 4
7 4,3 4
8 10,4,12 12
9 2,10,4,3 10
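The intermediate steps of the expand=True approach may be easier to follow on a small example. This is a sketch on a three-row Series made up for illustration, assuming a default (object or string) dtype column; it is not the asker's full data.

```python
import pandas as pd

# Small stand-in for the audience_max column.
s = pd.Series(['10', '2,4,3', '10,4,12'], name='audience_max')

# expand=True yields one column per position; shorter rows are padded with missing values.
parts = s.str.split(',', expand=True)

# float can represent the missing entries as NaN, which an int column cannot.
as_float = parts.astype(float)

# max(axis=1) skips NaN by default, so each row's maximum is taken over its real values only.
max_age = as_float.max(axis=1).astype(int)

print(max_age.tolist())
```

The final astype(int) only works if every row has at least one value; a completely empty row would leave a NaN behind that cannot be cast to int.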
CodePudding user response:
Based on this answer, I was able to use Series.str.extractall to get the max without a for loop:
# groupby(level=0).max() replaces max(level=0), which was removed in pandas 2.0
df['max_age_test'] = df['audience_max'].astype(str).str.extractall(r'(\d+)')[0].astype(int).groupby(level=0).max()
Which returned the correct max ages:
0 10
1 4
2 4
3 10
4 3
5 4
6 4
7 4
8 12
9 10
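For anyone unsure what extractall produces before the max is taken, here is a small sketch on made-up rows. It assumes a recent pandas version and uses groupby(level=0) to group the matches by their original row.

```python
import pandas as pd

# Small stand-in for the audience_max column.
s = pd.Series(['10,3', '2,3', '10,4,12'])

# extractall returns one row per match, indexed by (original row, match number).
matches = s.str.extractall(r'(\d+)')

# Column 0 holds the captured digits as strings; convert them, then take the max per original row.
max_age = matches[0].astype(int).groupby(level=0).max()

print(max_age.tolist())
```

Note that a row containing no digits at all produces no matches, so it would be missing from the result and show up as NaN when assigned back to the dataframe.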