I have a dataframe looks like:
page reference ids - subject word
1 apple ['aaaa', 'bbbbb', 'cccc'] name app
1 apple ['bndv', 'asasa', 'swdsd'] fruit is
1 apple ['bsnm', 'dfsd', 'dgdf'] fruit text
1 bat ['asas', 'ddfgd', 'ff'] thing sport
1 cat ['sds', 'dffd', 'gdg'] fruit color
1 bat ['sds', 'fsss', 'ssfd'] thing was
1 bat ['fsf', 'sff', 'fss'] place that
2 dog ['fffds', 'gd', 'sdg'] name mud
2 egg ['dfff', 'sdf', 'vcv'] place gun
2 dog ['dsfd', 'fds', 'gfdg'] thing kit
2 egg ['ddd', 'fg', 'dfg'] place hut
I want to groupby reference column and subject column. The output should look like this:
output:
page reference ids subject word
1 apple [['bndv', 'asasa', 'swdsd'],['bsnm', 'dfsd', 'dgdf']] fruit [[is], [text]]
1 apple ['aaaa', 'bbbbb', 'cccc'] name [app]
1 bat [['asas', 'ddfgd', 'ff'], [['sds', 'fsss', 'ssfd']] thing [[sport], [was]]
1 bat ['fsf', 'sff', 'fss'] place [that]
1 cat ['sds', 'dffd', 'gdg'] fruit [color]
2 dog ['fffds', 'gd', 'sdg'] name [mud]
2 dog ['dsfd', 'fds', 'gfdg'] thing [kit]
2 egg [['dfff', 'sdf', 'vcv'], ['ddd', 'fg', 'dfg']] place [[gun], [hut]]
CodePudding user response:
First to group and aggregate necessary fields:
res = df.groupby(["reference", "subject"]).agg({"page": min, "ids": list, "word": lambda l: [[ll] for ll in l]}).reset_index
reference subject page ids word
0 apple fruit 1 [[bndv, asasa, swdsd], [bsnm, dfsd, dgdf]] [[is], [text]]
1 apple name 1 [[aaaa, bbbbb, cccc]] [[app]]
2 bat place 1 [[fsf, sff, fss]] [[that]]
3 bat thing 1 [[asas, ddfgd, ff], [sds, fsss, ssfd]] [[sport], [was]]
4 cat fruit 1 [[sds, dffd, gdg]] [[color]]
5 dog name 2 [[fffds, gd, sdg]] [[mud]]
6 dog thing 2 [[dsfd, fds, gfdg]] [[kit]]
7 egg place 2 [[dfff, sdf, vcv], [ddd, fg, dfg]] [[gun], [hut]]
Note that this also wraps each word
value in a list, just like what you want in your desired output. I'm also just assuming to take the minimum page
value in each group since you didn't mention the rule for this variable. You can update the min
value in agg
function to whatever you think is appropriate.
Then you can cleanup the lists if length is 1:
res["word"] = res["word"].apply(lambda l: l[0] if len(l) == 1 else l)
res["ids"] = res["ids"].apply(lambda l: l[0] if len(l) == 1 else l)
reference subject page ids word
0 apple fruit 1 [[bndv, asasa, swdsd], [bsnm, dfsd, dgdf]] [[is], [text]]
1 apple name 1 [aaaa, bbbbb, cccc] [app]
2 bat place 1 [fsf, sff, fss] [that]
3 bat thing 1 [[asas, ddfgd, ff], [sds, fsss, ssfd]] [[sport], [was]]
4 cat fruit 1 [sds, dffd, gdg] [color]
5 dog name 2 [fffds, gd, sdg] [mud]
6 dog thing 2 [dsfd, fds, gfdg] [kit]
7 egg place 2 [[dfff, sdf, vcv], [ddd, fg, dfg]] [[gun], [hut]]