Create DataFrame from String "like" data

Time:12-17

I have text data that looks like this:

3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"

I want to transform it into a table like this:

a  b  e  r
1  2  5  7
23 45 76 76

I've tried to use a pandas DataFrame for that, but the data is quite big, around 40 MB. How should I approach this? Sorry for my bad explanation. I hope you can understand what I mean. Thanks!

import os
import pandas as pd
from io import StringIO
a = pd.read_csv(StringIO("12test.txt"), sep=",", header=None, error_bad_lines=False)
df = pd.DataFrame([row.split('.') for row in a.split('\n')])


print(df)

I've tried this, but it doesn't work. I ran into several problems: the error "'DataFrame' object has no attribute 'split'", a DataFrame that contains the string "12test.txt" instead of the data inside the file, memory problems, etc.

CodePudding user response:

Try:

>>> import pandas as pd
>>> s = '3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"'
>>> # split each line on commas, skip the leading number, strip the quotes
>>> header, *rows = [[x.strip('"') for x in line.split(',')[1:]] for line in s.splitlines()]
>>> pd.DataFrame(rows, columns=header)
    a   b   e   r
0   1   2   5   7
1  23  45  76  76

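Use a list comprehension to split each line, skip the leading number, and strip the quotes, then convert the result to a pandas.DataFrame. Note that the values stay strings.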
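
The same list-building idea can also be sketched with the standard-library csv module, which handles the quoting itself, so no manual strip('"') is needed and a quoted value containing a comma would not be split. This is only a sketch: it assumes the first number on every line is something to discard, as the desired output in the question suggests.

```python
import csv
import io

import pandas as pd

s = '3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"'

# csv.reader parses the quoted fields; row[1:] skips the leading number.
parsed = [row[1:] for row in csv.reader(io.StringIO(s))]

# The first parsed line holds the column names, the rest are data rows.
header, *body = parsed
df = pd.DataFrame(body, columns=header)
print(df)
```

As with the comprehension above, every value is still a string; convert with df.astype(int) if numbers are needed.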

CodePudding user response:

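To parse text that is held in a string, wrap it in io.StringIO and pass it to read_csv. Removing the number that follows each \n leaves an empty first field in every data row, so read_csv parses the remaining values cleanly and the leftover first column (all NaN) can be dropped afterwards.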

import io
import re

import pandas as pd

s = '3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"'
# remove the number (one or more digits) at the start of each data row
s = re.sub(r'\n[0-9]+', '\n', s)

df = pd.read_csv(io.StringIO(s))

# drop the first column, which comes from the leading numbers and holds only NaN values;
# drop() returns a new DataFrame, so assign the result back
df = df.drop(df.columns[0], axis=1)
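
For data that lives in a file, as in the question, read_csv can be given the file path directly. StringIO("12test.txt") only wraps the file name itself as text, which is why the DataFrame in the question contained that string. Below is a minimal sketch, not a definitive solution: the helper name read_counted_csv is made up for illustration, and it assumes the first field on every line is a number to discard. The demo uses an in-memory buffer in place of the real file.

```python
import io

import pandas as pd


def read_counted_csv(source):
    """Read CSV data whose lines start with an unwanted number, and drop it.

    `source` can be a file path or an open text buffer; read_csv accepts both.
    Hypothetical helper, named for this example only.
    """
    # The first line supplies the column names; dtype=str keeps values as text.
    df = pd.read_csv(source, header=0, dtype=str)
    # Column 0 holds the leading numbers, so keep everything after it.
    return df.iloc[:, 1:]


# Demo: a small buffer stands in for a file such as "12test.txt".
sample = '3,"a","b","e","r"\n4,"1","2","5","7"\n4,"23","45","76","76"'
df = read_counted_csv(io.StringIO(sample))
print(df)
```

With the real file the call would be read_counted_csv("12test.txt"). If memory is still a problem at 40 MB, read_csv also accepts a chunksize argument that yields the file as a sequence of smaller DataFrames.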