Home > Software engineering >  Convert a list of strings containing a list into a dataframe
Convert a list of strings containing a list into a dataframe

Time:12-09

I have a list of strings that I would like to convert to a dataframe but the ',' inside a list object is messing up every approach I take, any ideas.

For example: y = [ "a,['b'],0.3,1" ,"a,['b,c'],0.4,2" ]

should become: df with 4 columns consisting of 'a',['b'],0.3,1 'a',['b,c'],0.4,2

this is the closest I've found so far but it crashes as it see the second row as having 5 elements instead of 4. df = pd.read_csv(io.StringIO("\n".join(y)), delimiter=",", header=None)) Thanks.

CodePudding user response:

First, I'd search and fix the root problem (if it is possible - how is the data generated)?

If the fix there is not possible, you can try:

import csv
import pandas as pd
from io import StringIO


y = ["a,['b'],0.3,1", "a,['b,c'],0.4,2"]

out = []
for s in y:
    s = StringIO(s.replace("[", "").replace("]", ""))
    out.append(next(csv.reader(s, quotechar="'")))

df = pd.DataFrame(out, columns=["Col1", "Col2", "Col3", "Col4"])
df["Col2"] = df["Col2"].str.split(",")

print(df)

Prints:

  Col1    Col2 Col3 Col4
0    a     [b]  0.3    1
1    a  [b, c]  0.4    2

CodePudding user response:

You can surround the list parts in the strings by " using re.sub and then join the list using \n and then use quotechar parameter of read_csv to ignore ".

import re
s = '\n'.join([re.sub(pattern=r'(\[.*\])', repl='"'   '\\1'   '"',string=s) for s in y])
df = pd.read_csv(StringIO(s), quotechar='"', header=None)

print(df):

   0        1    2  3
0  a    ['b']  0.3  1
1  a  ['b,c']  0.4  2

CodePudding user response:

The list objects in your data are strange. If you really want them as lists, the other answer gets it right. But if you want them as strings e.g., "['b,c']", this is faster:

p = re.compile(",(?![^\[]*\])")
for i in range(len(y)):
  y[i] = re.sub(p, "\t", y[i])
df = pd.read_csv(io.StringIO("\n".join(y)),delimiter='\t',header=None)

CodePudding user response:

Another possible solution, which uses regex split with lookahead to prevent the split by comma in ['b,c']:

import re

df = pd.DataFrame([re.split(r',(?!\w \'\])', s) for s in y])
df.columns = [f'col{i 1}' for i in range(df.shape[1])]

Output:

  col1     col2 col3 col4
0    a    ['b']  0.3    1
1    a  ['b,c']  0.4    2
  • Related