I have a .csv file downloaded from an SQL database, where column value is a list of strings.
However some values come without quotation marks around them:
user_id B value
0 a1 3 {no_quotations} #no quotation mark ardoun this word
1 a2 4 {"with quotations", no_quotations} #one with quotations, one without
2 a3 1 {"with quotations", "with quotations 2"} #UPDATE quotations have spaces in them.
3 a4 1 {no_quotations, no_quotations2} #2 without quotations marks
4 a6 1 {"with quotations"} #UPDATE quotations have spaces in them
The words are:
- single alphanumeric for no_quotations
- string alphanumeric spaces between words for "with quotations"
I want convert the column value into a list of dummies. Expected output:
The values in column value might change in a future iteration, so I don't want the code to depend on the exact string of the value.
I am able to do this for values with quotations, but not for all
import pandas as pd
import ast
df = pd.read_csv("/content/SQLDB_quotations_in_literal_example.csv")
df_ = df.iloc[[2,4]]['value'].apply(lambda x: list(ast.literal_eval(x)))
pd.merge(df, pd.get_dummies(df_.apply(pd.Series).stack()).groupby(level=0).sum(), left_index=True, right_index=True, how='outer' )
OUT:
user_id B value with quotations \
0 a1 3 {no_quotations} NaN
1 a2 4 {"with quotations", no_quotations} NaN
2 a3 1 {"with quotations", "with quotations 2"} 1.0
3 a4 1 {no_quotations, no_quotations_2} NaN
4 a6 1 {"with quotations"} 1.0
with quotations 2
0 NaN
1 NaN
2 1.0
3 NaN
4 0.0
- How do I handle the missing quotation marks around certain strings in col value?
- Since the table was taken from SQL, is this an error in how the SQL table data was stored? or is there a way of downloading the data without this issue?
EDIT Other important info - string "with quotations" have a spaces in them, the no_quotations have none
CodePudding user response:
One approach with str.extractall
and pivot_table
:
out = df.join(df['value']
.str.extractall('(\w )')[0]
.droplevel(1).reset_index(name='col').assign(value=1)
.pivot_table(index='index', columns='col', values='value', fill_value=0)
)
Alternative with str.get_dummies
:
out = df.join(df['value']
.str.extractall('(\w )')[0]
.groupby(level=0).agg('|'.join).str.get_dummies()
)
output:
user_id B value no_quotations no_quotations2 with_quotations with_quotations2
0 a1 3 {no_quotations} 1 0 0 0
1 a2 4 {"with_quotations", no_quotations} 1 0 1 0
2 a3 1 {"with_quotations", "with_quotations2"} 0 0 1 1
3 a4 1 {no_quotations, no_quotations2} 1 1 0 0
4 a6 1 {"with_quotations"} 0 0 1 0
variant: allow spaces inside a word:
Use the following regex: r'([^",{}]*[^",{}\s])'
in extractall
.
output:
user_id B value no_quotations no_quotations2 no_quotations with quotations with quotations 2
0 a1 3 {no_quotations} 0 0 1 0 0
1 a2 4 {"with quotations", no_quotations} 1 0 0 1 0
2 a3 1 {"with quotations", "with quotations 2"} 0 0 0 1 1
3 a4 1 {no_quotations, no_quotations2} 0 1 1 0 0
4 a6 1 {"with quotations"} 0 0 0 1 0