I have text data that looks like this:
3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"
I want to transform it into a table like this:
a b e r
1 2 5 7
23 45 76 76
I've tried to use a pandas DataFrame for this, but the data is quite large, about 40 MB. How should I solve this? Sorry for the poor explanation; I hope you can understand what I mean. Thanks!
import os
import pandas as pd
from io import StringIO
a = pd.read_csv(StringIO("12test.txt"), sep=",", header=None, error_bad_lines=False)
df = pd.DataFrame([row.split('.') for row in a.split('\n')])
print(df)
I've tried this, but it doesn't work. I get errors such as "'DataFrame' object has no attribute 'split'", the DataFrame contains the string "12test.txt" instead of the data inside the file, there are memory problems, etc.
CodePudding user response:
Try:
>>> import pandas as pd
>>> s = '3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"'
>>> pd.DataFrame([[x.strip('"') for x in i.split(',')[1:]] for i in s.splitlines()[1:]], columns=[x.strip('"') for x in s.splitlines()[0].split(',')[1:]])
    a   b   e   r
0   1   2   5   7
1  23  45  76  76
Use a list comprehension, then convert the result to a pandas.DataFrame.
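One caveat with splitting on ',' by hand is that it would break if a quoted field itself contained a comma. Below is a minimal sketch of the same idea using the standard-library csv module, which handles the quoting. It assumes, as in the sample, that the first field of every line is a count that can be discarded:

```python
import csv
import io

import pandas as pd

s = '3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"'

# csv.reader strips the surrounding quotes and respects commas inside quoted fields.
rows = list(csv.reader(io.StringIO(s)))

# Drop the leading count from every row; the first row becomes the header.
header, *body = [row[1:] for row in rows]

df = pd.DataFrame(body, columns=header)
print(df)
```

The values stay as strings here, just as in the list comprehension above; convert them afterwards (for example with df.astype(int)) if numbers are needed.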
CodePudding user response:
To read a string as if it were a file, you can wrap it in StringIO. Removing the leading number that follows each \n makes the string readable by read_csv; the first column, which is then left with only NaN values, can be dropped afterwards.
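import io
import re
import pandas as pd
s = '3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"'
# strip the leading number from every line after the first
s = re.sub(r'\n[0-9]+', "\n", s)
df = pd.read_csv(io.StringIO(s))
# drop the first column (header "3"), which now holds only NaN values;
# drop() returns a new DataFrame, so the result has to be assigned
df = df.drop(df.columns[0], axis=1)
print(df)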
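For the actual 40 MB file, the string does not need to be built by hand at all. The following is a rough sketch, not a definitive solution. It assumes the file contains real line breaks and that the first field of every line can be thrown away. The temporary file is only a hypothetical stand-in for 12test.txt so the example runs on its own, and the chunk size is an arbitrary choice:

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the asker's file, written to a temp directory.
sample = '3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"\n'
path = os.path.join(tempfile.mkdtemp(), "12test.txt")
with open(path, "w", newline="") as f:
    f.write(sample)

# Pass the path itself (not StringIO(path)) so pandas reads the file contents.
# header=0 uses the first line as column names; chunksize streams the file
# in pieces instead of parsing it all at once.
chunks = pd.read_csv(path, header=0, chunksize=100_000)

# Drop the leading count column from each chunk, then stitch the chunks together.
df = pd.concat((chunk.iloc[:, 1:] for chunk in chunks), ignore_index=True)
print(df)
```

A 40 MB file usually fits in memory without chunksize, so plain pd.read_csv(path, header=0) followed by .iloc[:, 1:] may be enough; the chunked form is only worth it if memory errors persist.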