I have a df
review review_id word_count
['abc def', 'ghi'] [1, 2] [2, 1]
['jkl mno pqr', 'stu', 'vwx'] [3, 4, 5] [3, 1, 1]
For each row in the df, I would like to create new columns for review and review_id whereby there will only consist of the review/review_id with the maximum word count. Thus, the output should look something like this.
review review_id
abc def 1
jkl mno pqr 3
I can only think of getting the index of the max value in word_count. Any sugeestions will be welcomed. Thanks!
CodePudding user response:
Use DataFrame.explode
with compare maximal values per groups (from rows) and select rows by boolean indexing
if need all maximal values:
df = df.explode(['review','review_id','word_count'])
#for oldier pandas versions
#df = df.apply(lambda x: x.explode())
df = df[df['word_count'].eq(df.groupby(level=0)['word_count'].transform('max'))]
print (df)
review review_id word_count
0 abc def 1 2
1 jkl mno pqr 3 3
CodePudding user response:
For efficiency, you can use a list comprehension and the DataFrame constructor. This will be at least one order of magnitude faster than using explode
:
import numpy as np
df2 = pd.DataFrame([(a[(m:=np.argmax(x))], b[m]) for a,b,x in
zip(df['review'], df['review_id'], df['word_count'])],
index=df.index, columns=['review', 'review_id']
)
print(df2)
output:
review review_id
0 abc def 1
1 jkl mno pqr 3
If you want all columns:
df2 = pd.DataFrame([(a[(m:=np.argmax(x))], b[m], x[m]) for a,b,x in
zip(df['review'], df['review_id'], df['word_count'])],
index=df.index, columns=df.columns
)
output:
review review_id word_count
0 abc def 1 2
1 jkl mno pqr 3 3
timing
on 20k rows
# list comprehension
50.2 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# explode groupby
1.22 s ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)