I have a database of patents citing other patents looking like this:
{'index': {0: 0, 1: 1, 2: 2, 12: 12, 21: 21},
'docdb_family_id': {0: 57904406,
1: 57904406,
2: 57906556,
12: 57909419,
21: 57942222},
'cited_docdbs': {0: [15057621,
16359315,
18731820,
19198211,
19198218,
19198340,
19550248,
19700609,
20418230,
22144166,
22513333,
22800966,
22925564,
23335606,
23891186,
25344297,
25345599,
25414615,
25495423,
25588955,
26530649,
27563473,
34277948,
36626718,
38801947,
40454852,
40885675,
40957530,
41249600,
41377563,
41378429,
41444278,
41797413,
42153280,
42340085,
42340086,
42678557,
42709962,
42709963,
42737942,
43648036,
44691991,
44947081,
45352855,
45815534,
46254922,
46382961,
47830116,
49676686,
49912209,
54191614],
1: [15057621,
16359315,
18731820,
19198211,
19198218,
19198340,
19550248,
19700609,
20418230,
22144166,
22513333,
22800966,
22925564,
23335606,
23891186,
25344297,
25345599,
25414615,
25495423,
25588955,
26530649,
27563473,
34277948,
36626718,
38801947,
40454852,
40885675,
40957530,
41249600,
41377563,
41378429,
41444278,
41797413,
42153280,
42340085,
42340086,
42678557,
42709962,
42709963,
42737942,
43648036,
44691991,
44947081,
45352855,
45815534,
46254922,
46382961,
47830116,
49676686,
49912209,
54191614],
2: [6078355,
8173164,
14235835,
16940834,
18152411,
18704525,
27343995,
45467248,
46172598,
49878759,
50995553,
52668238],
12: [6293366,
7856452,
16980051,
23177359,
26477802,
27453602,
41135094,
53004244,
54332594,
55018863],
21: [7913900,
13287798,
18834564,
23971781,
26904791,
27304292,
29720924,
34622252,
35197847,
37766575,
39873073,
42075013,
44508652,
44530218,
45571357,
48222848,
48747089,
49111776,
49754218,
50024241,
50474222,
50545849,
52580625,
58800268]},
'doc_std_name': {0: 'SEEO INC',
1: 'BOSCH GMBH ROBERT',
2: 'SAMSUNG SDI CO LTD',
12: 'NAGAI TAKAYUKI',
21: 'SAMSUNG SDI CO LTD'}}
Now, what I would like to do is performing a groupby firm as follows:
df_grouped_byfirm=data_min.groupby("doc_std_name").agg(publn_nrs=('docdb_family_id',"unique")).reset_index()
but merging together the lists of cited_docdbs. So, for instance in the example above, for SAMSUNG SDI CO LTD the final list of cited_docdbs should become a mega list where all the cited docdbs of both ids of SAMSUNG SDI CO LTD are merged together:
[6078355,
8173164,
14235835,
16940834,
18152411,
18704525,
27343995,
45467248,
46172598,
49878759,
50995553,
52668238,
7913900,
13287798,
18834564,
23971781,
26904791,
27304292,
29720924,
34622252,
35197847,
37766575,
39873073,
42075013,
44508652,
44530218,
45571357,
48222848,
48747089,
49111776,
49754218,
50024241,
50474222,
50545849,
52580625,
58800268]
Thank you
CodePudding user response:
You can just use sum
in agg
to concatenate the lists within each group.
df.groupby("doc_std_name").agg({"cited_docdbs": sum}).reset_index()
This will give the follow:
doc_std_name cited_docdbs
0 BOSCH GMBH ROBERT [15057621, 16359315, 18731820, 19198211, 19198...
1 NAGAI TAKAYUKI [6293366, 7856452, 16980051, 23177359, 2647780...
2 SAMSUNG SDI CO LTD [6078355, 8173164, 14235835, 16940834, 1815241...
3 SEEO INC [15057621, 16359315, 18731820, 19198211, 19198...
CodePudding user response:
You can flatten nested lists with dict.fromkeys
for remove duplicates in original order:
f = lambda x: list(dict.fromkeys(z for y in x for z in y))
df=df.groupby("doc_std_name").agg(publn_nrs=('cited_docdbs',f))
print (df)
publn_nrs
doc_std_name
BOSCH GMBH ROBERT [15057621, 16359315, 18731820, 19198211, 19198...
NAGAI TAKAYUKI [6293366, 7856452, 16980051, 23177359, 2647780...
SAMSUNG SDI CO LTD [6078355, 8173164, 14235835, 16940834, 1815241...
SEEO INC [15057621, 16359315, 18731820, 19198211, 19198...
If order is not important use set
s for remove duplicates:
f = lambda x: list(set(z for y in x for z in y))
df=df.groupby("doc_std_name").agg(publn_nrs=('cited_docdbs',f))
print (df)
publn_nrs
doc_std_name
BOSCH GMBH ROBERT [19700609, 19198211, 19198340, 44947081, 19198...
NAGAI TAKAYUKI [27453602, 7856452, 26477802, 23177359, 550188...
SAMSUNG SDI CO LTD [48222848, 18834564, 42075013, 58800268, 18704...
SEEO INC [19700609, 19198211, 19198340, 44947081, 19198...