I've got a bunch of 20MB text files with the following structure:
{'col-1': ['data-11', 'data-12'], 'col-2': [1, 2], 'col-n': [1, 2]}
{'col-1': ['data-21', 'data-22'], 'col-2': [1, 2], 'col-n': [1, 2]}
{'col-1': ['data-31', 'data-32', 'data-33'], 'col-2': [1, 2, 3], 'col-n': [1, 2, 3]}
...
The number of records in each line varies. I want to read them as:
     col-1  col-2  col-n
0  data-11      1      1
1  data-12      2      2
2  data-21      1      1
3  data-22      2      2
4  data-31      1      1
5  data-32      2      2
6  data-33      3      3
I've already prepared the following code, which works just fine:
import ast
import itertools
import pandas as pd

def _eval(line: str):
    '''Turns: {'col-1': ['data-11', 'data-12'], 'col-2': [1, 2], 'col-n': [1, 2]}
    Into: [{'col-1': 'data-11', 'col-2': 1, 'col-n': 1}, {'col-1': 'data-12', 'col-2': 2, 'col-n': 2}]
    '''
    try:
        v = ast.literal_eval(line.rstrip())  # Parse the line into a dict
        # Transpose the dict of lists into a list of per-row dicts
        v = [{k: v[k][i] for k in v} for i in range(len(v['col-n']))]
        return v
    except Exception as e:  # In case of a file structure error
        print('err', e)
        return []

def read_structure(fp):
    with open(fp) as fh:
        return pd.DataFrame(itertools.chain(*map(_eval, fh.readlines())))
But there must be a better way. With itertools and map I've already got it under 5 seconds.
I want to:
A) optimize the transpose step v = [{k: v[k][i] for k in v} for i in range(len(v['col-n']))]
B) know whether pandas already has a method that can read such a structure (I've searched and tried the dict- and records-based constructors with different parameters)
C) optimize it overall.
The function is already running in threads, so multithreading/multiprocessing isn't the best idea.
CodePudding user response:
Can you try the code below:
    v = [dict(zip(v.keys(), t)) for t in zip(*v.values())]
    return v
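For reference, here is a minimal, self-contained sketch of that zip-based transpose on one of the example rows from the question (it assumes every value list in the dict has the same length):

```python
# Transpose a dict of equal-length lists into a list of per-row dicts.
v = {'col-1': ['data-11', 'data-12'], 'col-2': [1, 2], 'col-n': [1, 2]}

# zip(*v.values()) yields one tuple per row; re-pair each tuple with the keys.
rows = [dict(zip(v.keys(), t)) for t in zip(*v.values())]
print(rows)
# [{'col-1': 'data-11', 'col-2': 1, 'col-n': 1},
#  {'col-1': 'data-12', 'col-2': 2, 'col-n': 2}]
```

This avoids the repeated `v[k][i]` indexing of the original comprehension by letting zip walk all the lists in lockstep.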
CodePudding user response:
First, I would try to avoid readlines, which loads the whole file into memory as a list of lines. Then I would provide a dict of lists to the DataFrame constructor:
import ast
import pandas as pd

def read_structure(fp):
    cols = 'col-1', 'col-2', 'col-n'  # declare the column names
    data = {col: [] for col in cols}  # initialize the data structure
    with open(fp) as fh:
        for i, line in enumerate(fh, 1):  # iterate lazily, line by line
            try:
                row = ast.literal_eval(line)
                for k, v in row.items():
                    data[k].extend(v)
            except Exception as e:
                print('Error line', i, line, e)
                raise  # unsure that we can continue after an error
    return pd.DataFrame(data)
Unsure whether it is fast, but at least it is simple.
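On question B: I'm not aware of a pandas reader that parses these Python-literal dict lines directly (they are not valid JSON, so read_json won't help). One alternative sketch, assuming pandas >= 1.3 (where DataFrame.explode accepts a list of columns) and equal list lengths within each row, is to build a frame of list-valued cells and explode everything at once:

```python
import ast
import pandas as pd

# Example input lines as in the question; in practice these come from the file.
lines = [
    "{'col-1': ['data-11', 'data-12'], 'col-2': [1, 2], 'col-n': [1, 2]}",
    "{'col-1': ['data-31', 'data-32', 'data-33'], 'col-2': [1, 2, 3], 'col-n': [1, 2, 3]}",
]

# One DataFrame row per line, each cell holding a list.
df = pd.DataFrame(map(ast.literal_eval, lines))

# Explode all columns at once (pandas >= 1.3); list lengths must match per row.
flat = df.explode(list(df.columns), ignore_index=True)
print(flat)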