Convert a list of strings containing a list into a dataframe-CodePudding

I have a list of strings that I would like to convert to a dataframe but the ',' inside a list object is messing up every approach I take, any ideas.

For example: y = [ "a,['b'],0.3,1" ,"a,['b,c'],0.4,2" ]

should become: df with 4 columns consisting of 'a',['b'],0.3,1 'a',['b,c'],0.4,2

this is the closest I've found so far but it crashes as it see the second row as having 5 elements instead of 4. df = pd.read_csv(io.StringIO("\n".join(y)), delimiter=",", header=None)) Thanks.

CodePudding user response：

First, I'd search and fix the root problem (if it is possible - how is the data generated)?

If the fix there is not possible, you can try:

import csv
import pandas as pd
from io import StringIO


y = ["a,['b'],0.3,1", "a,['b,c'],0.4,2"]

out = []
for s in y:
    s = StringIO(s.replace("[", "").replace("]", ""))
    out.append(next(csv.reader(s, quotechar="'")))

df = pd.DataFrame(out, columns=["Col1", "Col2", "Col3", "Col4"])
df["Col2"] = df["Col2"].str.split(",")

print(df)

Prints:

  Col1    Col2 Col3 Col4
0    a     [b]  0.3    1
1    a  [b, c]  0.4    2

CodePudding user response：

You can surround the list parts in the strings by " using re.sub and then join the list using \n and then use quotechar parameter of read_csv to ignore ".

import re
s = '\n'.join([re.sub(pattern=r'(\[.*\])', repl='"'   '\\1'   '"',string=s) for s in y])
df = pd.read_csv(StringIO(s), quotechar='"', header=None)

print(df):

   0        1    2  3
0  a    ['b']  0.3  1
1  a  ['b,c']  0.4  2

CodePudding user response：

The list objects in your data are strange. If you really want them as lists, the other answer gets it right. But if you want them as strings e.g., "['b,c']", this is faster:

p = re.compile(",(?![^\[]*\])")
for i in range(len(y)):
  y[i] = re.sub(p, "\t", y[i])
df = pd.read_csv(io.StringIO("\n".join(y)),delimiter='\t',header=None)

CodePudding user response：

Another possible solution, which uses regex split with lookahead to prevent the split by comma in ['b,c']:

import re

df = pd.DataFrame([re.split(r',(?!\w \'\])', s) for s in y])
df.columns = [f'col{i 1}' for i in range(df.shape[1])]

Output:

  col1     col2 col3 col4
0    a    ['b']  0.3    1
1    a  ['b,c']  0.4    2