Home > OS >  Get dummies for column of literals, with some values without quotation marks
Get dummies for column of literals, with some values without quotation marks

Time:09-09

I have a .csv file downloaded from an SQL database, where column value is a list of strings.
However some values come without quotation marks around them:

  user_id  B                                    value
0      a1  3                          {no_quotations} #no quotation mark ardoun this word
1      a2  4       {"with quotations", no_quotations} #one with quotations, one without
2      a3  1  {"with quotations", "with quotations 2"} #UPDATE quotations have spaces in them.
3      a4  1          {no_quotations, no_quotations2} #2 without quotations marks
4      a6  1                      {"with quotations"} #UPDATE quotations have spaces in them 

The words are:

  • single alphanumeric for no_quotations
  • string alphanumeric spaces between words for "with quotations"

I want convert the column value into a list of dummies. Expected output: enter image description here

The values in column value might change in a future iteration, so I don't want the code to depend on the exact string of the value.
I am able to do this for values with quotations, but not for all

import pandas as pd 
import ast 

df = pd.read_csv("/content/SQLDB_quotations_in_literal_example.csv")
df_ = df.iloc[[2,4]]['value'].apply(lambda x: list(ast.literal_eval(x))) 
pd.merge(df, pd.get_dummies(df_.apply(pd.Series).stack()).groupby(level=0).sum(), left_index=True, right_index=True, how='outer' )

OUT: 

  user_id  B                                     value  with quotations  \
0      a1  3                           {no_quotations}              NaN   
1      a2  4        {"with quotations", no_quotations}              NaN   
2      a3  1  {"with quotations", "with quotations 2"}              1.0   
3      a4  1          {no_quotations, no_quotations_2}              NaN   
4      a6  1                       {"with quotations"}              1.0   

   with quotations 2  
0                NaN  
1                NaN  
2                1.0  
3                NaN  
4                0.0  
  • How do I handle the missing quotation marks around certain strings in col value?
  • Since the table was taken from SQL, is this an error in how the SQL table data was stored? or is there a way of downloading the data without this issue?

EDIT Other important info - string "with quotations" have a spaces in them, the no_quotations have none

CodePudding user response:

One approach with str.extractall and pivot_table:

out = df.join(df['value']
 .str.extractall('(\w )')[0]
 .droplevel(1).reset_index(name='col').assign(value=1)
 .pivot_table(index='index', columns='col', values='value', fill_value=0)
)

Alternative with str.get_dummies:

out = df.join(df['value']
 .str.extractall('(\w )')[0]
 .groupby(level=0).agg('|'.join).str.get_dummies()
)

output:

  user_id  B                                    value  no_quotations  no_quotations2  with_quotations  with_quotations2
0      a1  3                          {no_quotations}              1               0                0                 0
1      a2  4       {"with_quotations", no_quotations}              1               0                1                 0
2      a3  1  {"with_quotations", "with_quotations2"}              0               0                1                 1
3      a4  1          {no_quotations, no_quotations2}              1               1                0                 0
4      a6  1                      {"with_quotations"}              0               0                1                 0
variant: allow spaces inside a word:

Use the following regex: r'([^",{}]*[^",{}\s])' in extractall.

output:

  user_id  B                                     value   no_quotations   no_quotations2  no_quotations  with quotations  with quotations 2
0      a1  3                           {no_quotations}               0                0              1                0                  0
1      a2  4        {"with quotations", no_quotations}               1                0              0                1                  0
2      a3  1  {"with quotations", "with quotations 2"}               0                0              0                1                  1
3      a4  1           {no_quotations, no_quotations2}               0                1              1                0                  0
4      a6  1                       {"with quotations"}               0                0              0                1                  0
  • Related