Home > Blockchain >  getting the index of the max value in list and returning the values in subsequent lists
getting the index of the max value in list and returning the values in subsequent lists

Time:06-20

I have a df

review                          review_id   word_count
['abc def', 'ghi']              [1, 2]      [2, 1]
['jkl mno pqr', 'stu', 'vwx']   [3, 4, 5]   [3, 1, 1]

For each row in the df, I would like to create new columns for review and review_id whereby there will only consist of the review/review_id with the maximum word count. Thus, the output should look something like this.

review         review_id
abc def        1      
jkl mno pqr    3

I can only think of getting the index of the max value in word_count. Any sugeestions will be welcomed. Thanks!

CodePudding user response:

Use DataFrame.explode with compare maximal values per groups (from rows) and select rows by boolean indexing if need all maximal values:

df = df.explode(['review','review_id','word_count'])
#for oldier pandas versions
#df = df.apply(lambda x: x.explode())

df = df[df['word_count'].eq(df.groupby(level=0)['word_count'].transform('max'))]
print (df)
        review review_id word_count
0      abc def         1          2
1  jkl mno pqr         3          3

CodePudding user response:

For efficiency, you can use a list comprehension and the DataFrame constructor. This will be at least one order of magnitude faster than using explode:

import numpy as np
df2 = pd.DataFrame([(a[(m:=np.argmax(x))], b[m]) for a,b,x in
                    zip(df['review'], df['review_id'], df['word_count'])],
                   index=df.index, columns=['review', 'review_id']
                   )
print(df2)

output:

        review  review_id
0      abc def          1
1  jkl mno pqr          3

If you want all columns:

df2 = pd.DataFrame([(a[(m:=np.argmax(x))], b[m], x[m]) for a,b,x in
                    zip(df['review'], df['review_id'], df['word_count'])],
                   index=df.index, columns=df.columns
                   )

output:

        review  review_id  word_count
0      abc def          1           2
1  jkl mno pqr          3           3

timing

on 20k rows

# list comprehension
50.2 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# explode   groupby
1.22 s ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  • Related