Grouping together lists in pandas


I have a dataset of patents citing other patents that looks like this:

{'index': {0: 0, 1: 1, 2: 2, 12: 12, 21: 21},
 'docdb_family_id': {0: 57904406,
  1: 57904406,
  2: 57906556,
  12: 57909419,
  21: 57942222},
 'cited_docdbs': {0: [15057621,
   16359315,
   18731820,
   19198211,
   19198218,
   19198340,
   19550248,
   19700609,
   20418230,
   22144166,
   22513333,
   22800966,
   22925564,
   23335606,
   23891186,
   25344297,
   25345599,
   25414615,
   25495423,
   25588955,
   26530649,
   27563473,
   34277948,
   36626718,
   38801947,
   40454852,
   40885675,
   40957530,
   41249600,
   41377563,
   41378429,
   41444278,
   41797413,
   42153280,
   42340085,
   42340086,
   42678557,
   42709962,
   42709963,
   42737942,
   43648036,
   44691991,
   44947081,
   45352855,
   45815534,
   46254922,
   46382961,
   47830116,
   49676686,
   49912209,
   54191614],
  1: [15057621,
   16359315,
   18731820,
   19198211,
   19198218,
   19198340,
   19550248,
   19700609,
   20418230,
   22144166,
   22513333,
   22800966,
   22925564,
   23335606,
   23891186,
   25344297,
   25345599,
   25414615,
   25495423,
   25588955,
   26530649,
   27563473,
   34277948,
   36626718,
   38801947,
   40454852,
   40885675,
   40957530,
   41249600,
   41377563,
   41378429,
   41444278,
   41797413,
   42153280,
   42340085,
   42340086,
   42678557,
   42709962,
   42709963,
   42737942,
   43648036,
   44691991,
   44947081,
   45352855,
   45815534,
   46254922,
   46382961,
   47830116,
   49676686,
   49912209,
   54191614],
  2: [6078355,
   8173164,
   14235835,
   16940834,
   18152411,
   18704525,
   27343995,
   45467248,
   46172598,
   49878759,
   50995553,
   52668238],
  12: [6293366,
   7856452,
   16980051,
   23177359,
   26477802,
   27453602,
   41135094,
   53004244,
   54332594,
   55018863],
  21: [7913900,
   13287798,
   18834564,
   23971781,
   26904791,
   27304292,
   29720924,
   34622252,
   35197847,
   37766575,
   39873073,
   42075013,
   44508652,
   44530218,
   45571357,
   48222848,
   48747089,
   49111776,
   49754218,
   50024241,
   50474222,
   50545849,
   52580625,
   58800268]},
 'doc_std_name': {0: 'SEEO INC',
  1: 'BOSCH GMBH ROBERT',
  2: 'SAMSUNG SDI CO LTD',
  12: 'NAGAI TAKAYUKI',
  21: 'SAMSUNG SDI CO LTD'}}

Now, what I would like to do is perform a groupby by firm as follows:

df_grouped_byfirm=data_min.groupby("doc_std_name").agg(publn_nrs=('docdb_family_id',"unique")).reset_index()

but merging together the lists of cited_docdbs. For instance, in the example above, the final list of cited_docdbs for SAMSUNG SDI CO LTD should become one merged list containing the cited docdbs of both of its family ids:

[6078355,
   8173164,
   14235835,
   16940834,
   18152411,
   18704525,
   27343995,
   45467248,
   46172598,
   49878759,
   50995553,
   52668238,
7913900,
   13287798,
   18834564,
   23971781,
   26904791,
   27304292,
   29720924,
   34622252,
   35197847,
   37766575,
   39873073,
   42075013,
   44508652,
   44530218,
   45571357,
   48222848,
   48747089,
   49111776,
   49754218,
   50024241,
   50474222,
   50545849,
   52580625,
   58800268]

Thank you

CodePudding user response:

You can pass the built-in sum to agg to concatenate the lists within each group:

df.groupby("doc_std_name").agg({"cited_docdbs": sum}).reset_index()

This gives the following:

         doc_std_name                                       cited_docdbs
0   BOSCH GMBH ROBERT  [15057621, 16359315, 18731820, 19198211, 19198...
1      NAGAI TAKAYUKI  [6293366, 7856452, 16980051, 23177359, 2647780...
2  SAMSUNG SDI CO LTD  [6078355, 8173164, 14235835, 16940834, 1815241...
3            SEEO INC  [15057621, 16359315, 18731820, 19198211, 19198...
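One caveat not covered in the answer: sum concatenates the lists via repeated +, which is quadratic in the total number of citations per group. For larger data, a sketch using itertools.chain.from_iterable (with a hypothetical small frame standing in for the patent data) gives the same result in linear time:

```python
import itertools

import pandas as pd

# Hypothetical stand-in for the patent DataFrame
df = pd.DataFrame({
    "doc_std_name": ["A", "A", "B"],
    "cited_docdbs": [[1, 2], [3, 4], [5]],
})

# chain.from_iterable concatenates the group's lists in a single pass
out = (
    df.groupby("doc_std_name")["cited_docdbs"]
      .agg(lambda s: list(itertools.chain.from_iterable(s)))
      .reset_index()
)
# rows: A -> [1, 2, 3, 4], B -> [5]
```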

CodePudding user response:

You can flatten the nested lists and use dict.fromkeys to remove duplicates while preserving the original order:

f = lambda x: list(dict.fromkeys(z for y in x for z in y))
df = df.groupby("doc_std_name").agg(publn_nrs=("cited_docdbs", f))

print (df)
                                                            publn_nrs
doc_std_name                                                         
BOSCH GMBH ROBERT   [15057621, 16359315, 18731820, 19198211, 19198...
NAGAI TAKAYUKI      [6293366, 7856452, 16980051, 23177359, 2647780...
SAMSUNG SDI CO LTD  [6078355, 8173164, 14235835, 16940834, 1815241...
SEEO INC            [15057621, 16359315, 18731820, 19198211, 19198...

If order is not important, use a set to remove duplicates:

f = lambda x: list(set(z for y in x for z in y))
df = df.groupby("doc_std_name").agg(publn_nrs=("cited_docdbs", f))

print (df)
                                                            publn_nrs
doc_std_name                                                         
BOSCH GMBH ROBERT   [19700609, 19198211, 19198340, 44947081, 19198...
NAGAI TAKAYUKI      [27453602, 7856452, 26477802, 23177359, 550188...
SAMSUNG SDI CO LTD  [48222848, 18834564, 42075013, 58800268, 18704...
SEEO INC            [19700609, 19198211, 19198340, 44947081, 19198...
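To also keep the docdb_family_id aggregation from the question, both named aggregations can go into a single groupby call. A sketch assuming the column names from the question, with a hypothetical stand-in for data_min:

```python
import pandas as pd

# Hypothetical stand-in for data_min from the question
data_min = pd.DataFrame({
    "doc_std_name": ["SAMSUNG SDI CO LTD", "SAMSUNG SDI CO LTD", "SEEO INC"],
    "docdb_family_id": [57906556, 57942222, 57904406],
    "cited_docdbs": [[6078355, 8173164], [7913900, 8173164], [15057621]],
})

# Flatten each group's lists, dropping duplicates but keeping first-seen order
flatten = lambda x: list(dict.fromkeys(z for y in x for z in y))

df_grouped_byfirm = (
    data_min.groupby("doc_std_name")
            .agg(publn_nrs=("docdb_family_id", "unique"),
                 cited_docdbs=("cited_docdbs", flatten))
            .reset_index()
)
# SAMSUNG SDI CO LTD -> cited_docdbs [6078355, 8173164, 7913900]
# (8173164 appears in both rows but is kept only once)
```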