Creating a dummy out of a list variable (Python)

I have a pandas dataframe looking like this:

docdb    tech_classes
1187498     ['Y02P 20/10']
1236571     ['Y02B 30/13' 'Y02B 30/12' 'Y02P 20/10']
1239098     ['Y10S 426/805' 'Y02A 40/81']
...

I would like to create N dummy variables, where N is the total number of distinct names appearing in the variable tech_classes. (Please note that Y02P 20/10 is a single name, as if it were written Y02P_20/10, and likewise for Y02B 30/13 and the others.) Each variable should be a dummy equal to 1 whenever a docdb has that class inside tech_classes, and 0 otherwise.

In other words, the result for the example above should look like this:

docdb    Y02P_20/10  Y02B_30/13  Y02B_30/12  Y02A_40/81  Y10S_426/805 ...
1187498  1           0           0           0           0
1236571  1           1           1           0           0
1239098  0           0           0           1           1
...

Thanks a lot!

P.S. I know that pandas has get_dummies, but it does not quite work here because tech_classes is not actually stored in list form. Specifically:

df_patents.head().to_dict('list')

gives:

{'docdb_family_id': [1187498, 1226468, 1236571, 1239098, 1239277],
 'tech_fields_cited': ["['Y02P_20_10']",
  "['Y10T_156_1023']",
  "['Y02B_30_13','Y02B_30_12','Y02E_60_14','Y02B_10_70']",
  "['Y10S_426_805','Y02A_40_81']",
  "['Y02E_60_10','Y02T_90_12','Y02T_10_7072','Y02T_90_14','Y02T_10_70']"],
 'patindocdb_years': ['[1998 1999 1996]',
  '[1996 1992 1994 1993 1997]',
  '[1991 1993 1990 1996]',
  '[1995 1992 1993]',
  '[1996 1993 1992]'],
 'appln_auth': ['DE', 'DE', 'WO', 'WO', 'WO'],
 'appln_nr': ['19581932', '4042441', '9002512', '9103158', '9105114'],
 'earliest_publn_year': [1998, 1992, 1991, 1992, 1993],
 'nb_citing_docdb_fam_y': [5, 17, 35, 32, 35],
 'person_ctrycode': ["['RU']", "['DE']", "['US']", "['US']", "['IL']"],
 'fronteer': [0, 0, 0, 0, 0],
 'distance': [9999, 2, 9999, 9999, 9999],
 'oecd_fields': ['[nan]', '[nan]', '[nan]', '[nan]', '[nan]'],
 'nr_green': [1, 3, 5, 4, 10],
 'pctage_green': [0.2, 0.17647059, 0.14285715, 0.125, 0.2857143],
 'id_mas': [1, 2, 3, 4, 5],
 'avg_dist_citing': ['[0.6666666666666666]',
  '[2.5]',
  '[inf]',
  '[inf]',
  '[inf]'],
 'dist_citing_patents2': ['[1, 1, 0]',
  '[3, 3, 1, 3, 2, 3]',
  '[5, 99999, 5, 2, 5, 99999, 4, 6, 99999, 6, 7, 7, 2, 0, 1, 0, 0, 0, 1, 0, 3, 1, 1]',
  '[99999, 99999, 99999, 99999, 99999, 99999, 2, 2, 2, 99999, 99999, 2, 2, 99999, 4, 99999, 3, 2, 0, 1, 1, 1, 3, 99999, 99999]',
  '[99999, 1, 1, 1, 1, 3, 1, 1, 1, 99999, 6, 1, 2, 99999, 5, 4, 3, 0, 2, 1, 1, 1, 1, 2, 1, 1, 0, 0, 2, 0, 3, 2]'],
 'id_us': [3, 4, 5, 6, 7],
 'y_tr1': [0.60000002, 0.05882353, 0.25714287, 0.125, 0.51428574],
 'y_tr2': [0.60000002, 0.11764706, 0.31428573, 0.3125, 0.65714288],
 'y_tr3': [0.60000002, 0.35294119, 0.34285715, 0.375, 0.74285716],
 'y_tr4': [0.60000002, 0.35294119, 0.37142858, 0.40625, 0.77142859],
 'y_tr5': [0.60000002, 0.35294119, 0.45714286, 0.40625, 0.80000001]}

CodePudding user response:

It seems you are looking for explode and get_dummies (this assumes tech_classes holds real lists, since explode leaves plain strings untouched):

pd.get_dummies(df.explode('tech_classes')).groupby('docdb').sum()
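
As a minimal sketch of how this one-liner behaves, the snippet below rebuilds the three example rows from the question, assuming tech_classes already contains real Python lists. The prefix, prefix_sep, dtype and ignore_index arguments are additions for illustration (they keep the column names bare and the values as integers) and are not part of the answer above.

```python
import pandas as pd

# Example rows from the question, with tech_classes as real lists (an assumption).
df = pd.DataFrame({
    "docdb": [1187498, 1236571, 1239098],
    "tech_classes": [
        ["Y02P 20/10"],
        ["Y02B 30/13", "Y02B 30/12", "Y02P 20/10"],
        ["Y10S 426/805", "Y02A 40/81"],
    ],
})

# One row per (docdb, class) pair.
exploded = df.explode("tech_classes", ignore_index=True)

# One-hot encode the class column only, then collapse back to one row per docdb.
dummies = (
    pd.get_dummies(exploded, columns=["tech_classes"],
                   prefix="", prefix_sep="", dtype=int)
    .groupby("docdb")
    .sum()
)

print(dummies)
```

Summing after the groupby counts occurrences, so a class listed twice for the same docdb would give 2 rather than 1; replacing .sum() with .max() keeps the result strictly 0/1.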

CodePudding user response:

Assuming you have lists in tech_classes, you can join the strings and use str.get_dummies:

df = df.join(df.pop('tech_classes').agg('|'.join).str.get_dummies())

Output:

     docdb  Y02A 40/81  Y02B 30/12  Y02B 30/13  Y02P 20/10  Y10S 426/805
0  1187498           0           0           0           1             0
1  1236571           0           1           1           1             0
2  1239098           1           0           0           0             1

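Update

You actually have string representations of lists. Converting them to real lists first with ast.literal_eval would let you use the method above, but a more efficient approach is to strip the leading [' and the trailing '] from each string and split directly on the ',' separator:

df = df.join(df.pop('tech_classes').str[2:-2].str.get_dummies("','"))

Use the actual column name (tech_fields_cited in your sample) in place of tech_classes. This relies on the items being separated by ',' with no spaces, as in your sample.

If you want a quick test:

df['tech_fields_cited'].head().str[2:-2].str.get_dummies("','")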
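
For comparison, the ast.literal_eval route mentioned above could look roughly like the sketch below. It uses three of the sample rows from the question; the variable names parsed, dummies and result are only illustrative. It assumes every cell is a valid Python list literal, so columns containing values such as [nan] would need different handling.

```python
import ast

import pandas as pd

# Three rows taken from the sample in the question: lists stored as strings.
df = pd.DataFrame({
    "docdb_family_id": [1187498, 1226468, 1239098],
    "tech_fields_cited": [
        "['Y02P_20_10']",
        "['Y10T_156_1023']",
        "['Y10S_426_805','Y02A_40_81']",
    ],
})

# Parse each string into a real Python list.
parsed = df["tech_fields_cited"].apply(ast.literal_eval)

# Join each list with '|' (the default separator of str.get_dummies) and encode.
dummies = parsed.str.join("|").str.get_dummies()

# Attach the dummy columns to the identifier column.
result = df[["docdb_family_id"]].join(dummies)

print(result)
```

Parsing every cell is slower than slicing the strings, but it does not depend on the exact spacing or quoting used inside the brackets.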