getting the index of the max value in list and returning the values in subsequent lists-CodePudding

I have a df

review                          review_id   word_count
['abc def', 'ghi']              [1, 2]      [2, 1]
['jkl mno pqr', 'stu', 'vwx']   [3, 4, 5]   [3, 1, 1]

For each row in the df, I would like to create new columns for review and review_id whereby there will only consist of the review/review_id with the maximum word count. Thus, the output should look something like this.

review         review_id
abc def        1      
jkl mno pqr    3

I can only think of getting the index of the max value in word_count. Any sugeestions will be welcomed. Thanks!

CodePudding user response：

Use DataFrame.explode with compare maximal values per groups (from rows) and select rows by boolean indexing if need all maximal values:

df = df.explode(['review','review_id','word_count'])
#for oldier pandas versions
#df = df.apply(lambda x: x.explode())

df = df[df['word_count'].eq(df.groupby(level=0)['word_count'].transform('max'))]
print (df)
        review review_id word_count
0      abc def         1          2
1  jkl mno pqr         3          3

CodePudding user response：

For efficiency, you can use a list comprehension and the DataFrame constructor. This will be at least one order of magnitude faster than using explode:

import numpy as np
df2 = pd.DataFrame([(a[(m:=np.argmax(x))], b[m]) for a,b,x in
                    zip(df['review'], df['review_id'], df['word_count'])],
                   index=df.index, columns=['review', 'review_id']
                   )
print(df2)

output:

        review  review_id
0      abc def          1
1  jkl mno pqr          3

If you want all columns:

df2 = pd.DataFrame([(a[(m:=np.argmax(x))], b[m], x[m]) for a,b,x in
                    zip(df['review'], df['review_id'], df['word_count'])],
                   index=df.index, columns=df.columns
                   )

output:

        review  review_id  word_count
0      abc def          1           2
1  jkl mno pqr          3           3

timing

on 20k rows

# list comprehension
50.2 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# explode   groupby
1.22 s ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)