Im fairly new dealing with .txt files that has a dictionary within it. Im trying to pd.read_csv and create a dataframe in pandas.I get thrown an error of Error tokenizing data. C error: Expected 4 fields in line 2, saw 11. I belive I found the root problem which is the file is difficult to read because each row contains a dict, whose key-value pairs are separated by commas in this case is the delimiter.

Data (store.txt)

id,name,storeid,report
11,JohnSmith,3221-123-555,{"Source":"online","FileFormat":0,"Isonline":true,"comment":"NAN","itemtrack":"110", "info": {"haircolor":"black", "age":53}, "itemsboughtid":[],"stolenitem":[{"item":"candy","code":1},{"item":"candy","code":1}]}
35,BillyDan,3221-123-555,{"Source":"letter","FileFormat":0,"Isonline":false,"comment":"this is the best store, hands down and i will surely be back...","itemtrack":"110", "info": {"haircolor":"black", "age":21},"itemsboughtid":[1,42,465,5],"stolenitem":[{"item":"shoe","code":2}]}
64,NickWalker,3221-123-555, {"Source":"letter","FileFormat":0,"Isonline":false, "comment":"we need this area to be fixed, so much stuff is everywhere and i     do not like this one bit at all, never again...","itemtrack":"110", "info": {"haircolor":"red", "age":22},"itemsboughtid":[1,2],"stolenitem":[{"item":"sweater","code":11},{"item":"mask","code":221},{"item":"jack,jill","code":001}]}

How would I read this csv file and create new columns based on the key-values. In addition, what if there are more key-value in other data... for example > 11 keys within the dictionary.

Is there a an efficient way of create a df from the example above?

My code when trying to read as csv##

df = pd.read_csv('store.txt', header=None)

I tried to import json and user a converter but it do not work and converted all the commas to a | `

import json
df = pd.read_csv('store.txt', converters={'report': json.loads}, header=0, sep="|")

In addition I also tried to use: `

import pandas as pd
import json
df=pd.read_csv('store.txt', converters={'report':json.loads}, header=0, quotechar="'")

I also was thinking to add a quote at the begining of the dictionary and at the end to make it a string but thought that was too tedious to find the closing brackets.

CodePudding user response：

I think adding quotes around the dictionaries is the right approach. You can use regex to do so and use a different quote character than " (I used § in my example):

from io import StringIO
import re
import json

with open("store.txt", "r") as f:
    csv_content = re.sub(r"(\{.*})", r"§\1§", f.read())

df = pd.read_csv(StringIO(csv_content), skipinitialspace=True, quotechar="§", engine="python")

df_out = pd.concat([
    df[["id", "name", "storeid"]],
    pd.DataFrame(df["report"].apply(lambda x: json.loads(x)).values.tolist())
], axis=1)

print(df_out)

Note: the very last value in your csv isn't valid json: "code":001. It should either be "code":"001" or "code":1

Output:

   id        name       storeid  Source  ...  itemtrack                               info    itemsboughtid                                         stolenitem
0  11   JohnSmith  3221-123-555  online  ...        110  {'haircolor': 'black', 'age': 53}               []  [{'item': 'candy', 'code': 1}, {'item': 'candy...
1  35    BillyDan  3221-123-555  letter  ...        110  {'haircolor': 'black', 'age': 21}  [1, 42, 465, 5]                      [{'item': 'shoe', 'code': 2}]
2  64  NickWalker  3221-123-555  letter  ...        110    {'haircolor': 'red', 'age': 22}           [1, 2]  [{'item': 'sweater', 'code': 11}, {'item': 'ma...