I have a list of strings that I would like to convert to a dataframe but the ',' inside a list object is messing up every approach I take, any ideas.
For example: y = [ "a,['b'],0.3,1" ,"a,['b,c'],0.4,2" ]
should become: df with 4 columns consisting of 'a',['b'],0.3,1 'a',['b,c'],0.4,2
this is the closest I've found so far but it crashes as it see the second row as having 5 elements instead of 4. df = pd.read_csv(io.StringIO("\n".join(y)), delimiter=",", header=None)) Thanks.
CodePudding user response:
First, I'd search and fix the root problem (if it is possible - how is the data generated)?
If the fix there is not possible, you can try:
import csv
import pandas as pd
from io import StringIO
y = ["a,['b'],0.3,1", "a,['b,c'],0.4,2"]
out = []
for s in y:
s = StringIO(s.replace("[", "").replace("]", ""))
out.append(next(csv.reader(s, quotechar="'")))
df = pd.DataFrame(out, columns=["Col1", "Col2", "Col3", "Col4"])
df["Col2"] = df["Col2"].str.split(",")
print(df)
Prints:
Col1 Col2 Col3 Col4
0 a [b] 0.3 1
1 a [b, c] 0.4 2
CodePudding user response:
You can surround the list parts in the strings by "
using re.sub
and then join the list using \n
and then use quotechar
parameter of read_csv
to ignore "
.
import re
s = '\n'.join([re.sub(pattern=r'(\[.*\])', repl='"' '\\1' '"',string=s) for s in y])
df = pd.read_csv(StringIO(s), quotechar='"', header=None)
print(df):
0 1 2 3
0 a ['b'] 0.3 1
1 a ['b,c'] 0.4 2
CodePudding user response:
The list objects in your data are strange. If you really want them as lists, the other answer gets it right. But if you want them as strings e.g., "['b,c']", this is faster:
p = re.compile(",(?![^\[]*\])")
for i in range(len(y)):
y[i] = re.sub(p, "\t", y[i])
df = pd.read_csv(io.StringIO("\n".join(y)),delimiter='\t',header=None)
CodePudding user response:
Another possible solution, which uses regex split with lookahead to prevent the split by comma in ['b,c']
:
import re
df = pd.DataFrame([re.split(r',(?!\w \'\])', s) for s in y])
df.columns = [f'col{i 1}' for i in range(df.shape[1])]
Output:
col1 col2 col3 col4
0 a ['b'] 0.3 1
1 a ['b,c'] 0.4 2