Im fairly new dealing with .txt files that has a dictionary within it. Im trying to pd.read_csv
and create a dataframe in pandas.I get thrown an error of Error tokenizing data. C error: Expected 4 fields in line 2, saw 11
. I belive I found the root problem which is the file is difficult to read because each row contains a dict, whose key-value pairs are separated by commas in this case is the delimiter.
Data (store.txt)
id,name,storeid,report
11,JohnSmith,3221-123-555,{"Source":"online","FileFormat":0,"Isonline":true,"comment":"NAN","itemtrack":"110", "info": {"haircolor":"black", "age":53}, "itemsboughtid":[],"stolenitem":[{"item":"candy","code":1},{"item":"candy","code":1}]}
35,BillyDan,3221-123-555,{"Source":"letter","FileFormat":0,"Isonline":false,"comment":"this is the best store, hands down and i will surely be back...","itemtrack":"110", "info": {"haircolor":"black", "age":21},"itemsboughtid":[1,42,465,5],"stolenitem":[{"item":"shoe","code":2}]}
64,NickWalker,3221-123-555, {"Source":"letter","FileFormat":0,"Isonline":false, "comment":"we need this area to be fixed, so much stuff is everywhere and i do not like this one bit at all, never again...","itemtrack":"110", "info": {"haircolor":"red", "age":22},"itemsboughtid":[1,2],"stolenitem":[{"item":"sweater","code":11},{"item":"mask","code":221},{"item":"jack,jill","code":001}]}
How would I read this csv file and create new columns based on the key-values. In addition, what if there are more key-value in other data... for example > 11 keys within the dictionary.
Is there a an efficient way of create a df from the example above?
My code when trying to read as csv##
df = pd.read_csv('store.txt', header=None)
I tried to import json and user a converter but it do not work and converted all the commas to a | `
import json
df = pd.read_csv('store.txt', converters={'report': json.loads}, header=0, sep="|")
In addition I also tried to use: `
import pandas as pd
import json
df=pd.read_csv('store.txt', converters={'report':json.loads}, header=0, quotechar="'")
I also was thinking to add a quote at the begining of the dictionary and at the end to make it a string but thought that was too tedious to find the closing brackets.
CodePudding user response:
I think adding quotes around the dictionaries is the right approach. You can use regex to do so and use a different quote character than "
(I used §
in my example):
from io import StringIO
import re
import json
with open("store.txt", "r") as f:
csv_content = re.sub(r"(\{.*})", r"§\1§", f.read())
df = pd.read_csv(StringIO(csv_content), skipinitialspace=True, quotechar="§", engine="python")
df_out = pd.concat([
df[["id", "name", "storeid"]],
pd.DataFrame(df["report"].apply(lambda x: json.loads(x)).values.tolist())
], axis=1)
print(df_out)
Note: the very last value in your csv isn't valid json: "code":001
. It should either be "code":"001"
or "code":1
Output:
id name storeid Source ... itemtrack info itemsboughtid stolenitem
0 11 JohnSmith 3221-123-555 online ... 110 {'haircolor': 'black', 'age': 53} [] [{'item': 'candy', 'code': 1}, {'item': 'candy...
1 35 BillyDan 3221-123-555 letter ... 110 {'haircolor': 'black', 'age': 21} [1, 42, 465, 5] [{'item': 'shoe', 'code': 2}]
2 64 NickWalker 3221-123-555 letter ... 110 {'haircolor': 'red', 'age': 22} [1, 2] [{'item': 'sweater', 'code': 11}, {'item': 'ma...