I have a pandas dataframe looking like this:
docdb tech_classes
1187498 ['Y02P 20/10']
1236571 ['Y02B 30/13' 'Y02B 30/12' 'Y02P 20/10']
1239098 ['Y10S 426/805' 'Y02A 40/81']
...
What I would like to do is create N dummy variables, where N is the number of distinct class names appearing in tech_classes. Please note that each class name is a single label even though it contains a space: Y02P 20/10 should be read as Y02P_20/10, and likewise Y02B 30/13 and the others. Each dummy should equal 1 whenever a docdb has that class inside tech_classes, and 0 otherwise.
In other words the result of the above example should look like this:
docdb Y02P_20/10 Y02B_30/13 Y02B_30/12 Y02A_40/81 Y10S_426/805 ...
1187498 1 0 0 0 0
1236571 1 1 1 0 0
1239098 0 0 0 1 1
...
Thanks a lot!
P.S. I know that pandas has get_dummies, but it does not quite work here because tech_classes is not in list form... Specifically:
df_patents.head().to_dict('list')
gives:
{'docdb_family_id': [1187498, 1226468, 1236571, 1239098, 1239277],
'tech_fields_cited': ["['Y02P_20_10']",
"['Y10T_156_1023']",
"['Y02B_30_13','Y02B_30_12','Y02E_60_14','Y02B_10_70']",
"['Y10S_426_805','Y02A_40_81']",
"['Y02E_60_10','Y02T_90_12','Y02T_10_7072','Y02T_90_14','Y02T_10_70']"],
'patindocdb_years': ['[1998 1999 1996]',
'[1996 1992 1994 1993 1997]',
'[1991 1993 1990 1996]',
'[1995 1992 1993]',
'[1996 1993 1992]'],
'appln_auth': ['DE', 'DE', 'WO', 'WO', 'WO'],
'appln_nr': ['19581932', '4042441', '9002512', '9103158', '9105114'],
'earliest_publn_year': [1998, 1992, 1991, 1992, 1993],
'nb_citing_docdb_fam_y': [5, 17, 35, 32, 35],
'person_ctrycode': ["['RU']", "['DE']", "['US']", "['US']", "['IL']"],
'fronteer': [0, 0, 0, 0, 0],
'distance': [9999, 2, 9999, 9999, 9999],
'oecd_fields': ['[nan]', '[nan]', '[nan]', '[nan]', '[nan]'],
'nr_green': [1, 3, 5, 4, 10],
'pctage_green': [0.2, 0.17647059, 0.14285715, 0.125, 0.2857143],
'id_mas': [1, 2, 3, 4, 5],
'avg_dist_citing': ['[0.6666666666666666]',
'[2.5]',
'[inf]',
'[inf]',
'[inf]'],
'dist_citing_patents2': ['[1, 1, 0]',
'[3, 3, 1, 3, 2, 3]',
'[5, 99999, 5, 2, 5, 99999, 4, 6, 99999, 6, 7, 7, 2, 0, 1, 0, 0, 0, 1, 0, 3, 1, 1]',
'[99999, 99999, 99999, 99999, 99999, 99999, 2, 2, 2, 99999, 99999, 2, 2, 99999, 4, 99999, 3, 2, 0, 1, 1, 1, 3, 99999, 99999]',
'[99999, 1, 1, 1, 1, 3, 1, 1, 1, 99999, 6, 1, 2, 99999, 5, 4, 3, 0, 2, 1, 1, 1, 1, 2, 1, 1, 0, 0, 2, 0, 3, 2]'],
'id_us': [3, 4, 5, 6, 7],
'y_tr1': [0.60000002, 0.05882353, 0.25714287, 0.125, 0.51428574],
'y_tr2': [0.60000002, 0.11764706, 0.31428573, 0.3125, 0.65714288],
'y_tr3': [0.60000002, 0.35294119, 0.34285715, 0.375, 0.74285716],
'y_tr4': [0.60000002, 0.35294119, 0.37142858, 0.40625, 0.77142859],
'y_tr5': [0.60000002, 0.35294119, 0.45714286, 0.40625, 0.80000001]}
CodePudding user response:
It seems you are looking for explode and get_dummies:
pd.get_dummies(df.explode('tech_classes')).groupby('docdb').sum()
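For anyone who wants to try this on the sample rows, here is a minimal sketch of the explode plus get_dummies idea. It assumes tech_classes holds real Python lists (not strings), and the toy frame and the names exploded, dummies and result are made up for illustration. It passes columns=, prefix='' and prefix_sep='' so the new columns carry the bare class names instead of a tech_classes_ prefix.

```python
import pandas as pd

# Toy frame mirroring the question; ASSUMPTION: tech_classes holds real lists.
df = pd.DataFrame({
    "docdb": [1187498, 1236571, 1239098],
    "tech_classes": [
        ["Y02P 20/10"],
        ["Y02B 30/13", "Y02B 30/12", "Y02P 20/10"],
        ["Y10S 426/805", "Y02A 40/81"],
    ],
})

# One row per (docdb, class) pair.
exploded = df.explode("tech_classes")

# One 0/1 column per class; empty prefix keeps the bare class names.
dummies = pd.get_dummies(
    exploded, columns=["tech_classes"], prefix="", prefix_sep="", dtype=int
)

# Collapse back to one row per docdb.
result = dummies.groupby("docdb", as_index=False).sum()
print(result)
```

If a class can appear more than once in the same row, sum() would count the repeats; swapping it for max() keeps the columns strictly 0/1.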
CodePudding user response:
Assuming you have lists in tech_classes, you can join the strings and use str.get_dummies:
df = df.join(df.pop('tech_classes').agg('|'.join).str.get_dummies())
Output:
     docdb  Y02A 40/81  Y02B 30/12  Y02B 30/13  Y02P 20/10  Y10S 426/805
0  1187498           0           0           0           1             0
1  1236571           0           1           1           1             0
2  1239098           1           0           0           0             1
Update:
You actually have string representations of lists. While first converting them to lists with ast.literal_eval would let you use the above method, a more efficient approach would be:
# [2:-2] strips the leading [' and the trailing '] before splitting on ','
df = df.join(df.pop('tech_classes').str[2:-2].str.get_dummies("','"))
If you want a quick test:
df['tech_fields_cited'].head().str[2:-2].str.get_dummies("','")
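For completeness, the ast.literal_eval route mentioned above could look roughly like this. It is only a sketch on three of the rows from the question's dump: it assumes every entry is a valid Python list literal and that "|" never occurs inside a class name. The names parsed, dummies and result are illustrative.

```python
import ast

import pandas as pd

# Toy frame: tech_fields_cited holds strings that merely look like lists.
df = pd.DataFrame({
    "docdb_family_id": [1187498, 1226468, 1236571],
    "tech_fields_cited": [
        "['Y02P_20_10']",
        "['Y10T_156_1023']",
        "['Y02B_30_13','Y02B_30_12','Y02E_60_14','Y02B_10_70']",
    ],
})

# Parse each string into a real Python list.
parsed = df["tech_fields_cited"].apply(ast.literal_eval)

# Join each list with "|" (ASSUMPTION: never part of a label), then expand to 0/1 columns.
dummies = parsed.str.join("|").str.get_dummies(sep="|")

# Attach the dummies to the remaining columns.
result = df.drop(columns="tech_fields_cited").join(dummies)
print(result)
```

This is slower than slicing the raw strings, but it does not depend on the exact quoting or spacing between the items.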