Reading file with reverse dict to DataFrame efficiently


I've got a bunch of 20MB text files with such a structure:

{'col-1': ['data-11', 'data-12'], 'col-2': [1, 2], 'col-n': [1, 2]}
{'col-1': ['data-21', 'data-22'], 'col-2': [1, 2], 'col-n': [1, 2]}
{'col-1': ['data-31', 'data-32', 'data-33'], 'col-2': [1, 2, 3], 'col-n': [1, 2, 3]}
...

The number of records (rows) in each line varies.

I want to read this as:

        col-1       col-2   col-n
0       data-11     1       1
1       data-12     2       2
2       data-21     1       1
3       data-22     2       2
4       data-31     1       1
5       data-32     2       2
6       data-33     3       3

I've already written the following code, which works fine:

import ast
import pandas as pd
import itertools

def _eval(line: str):
    ''' Turns: {'col-1': ['data-11', 'data-12'], 'col-2': [1, 2], 'col-n': [1, 2]}
        Into: [{'col-1': 'data-11', 'col-2': 1, 'col-n': 1}, {'col-1': 'data-12', 'col-2': 2, 'col-n': 2}]
    '''
    try:
        v = ast.literal_eval(line.rstrip()) # Reads line to dict
        v = [{k: v[k][i] for k in v} for i in range(len(v['col-n']))] # Inverts the dict of lists into a list of per-row dicts
        return v
    
    # In case of file structure error
    except Exception as e:
        print('err', e)
        return []

def read_structure(fp):
    with open(fp) as fh:
        return pd.DataFrame(itertools.chain(*list(map(_eval, fh.readlines()))))

But there must be a better way. With itertools and map I've already got it under 5 seconds.

I want to:

A) optimize the loop v = [{k: v[k][i] for k in v} for i in range(len(v['col-n']))]

B) know whether pandas already has a method that can read such a structure (I've searched and tried the from_dict and from_records constructors with different parameters)

C) optimize it overall.

The function is already running in threads, so multithreading/multiprocessing isn't the best idea.

CodePudding user response:

Can you try the code below:

v = [{k: v for k, v in zip(d.keys(), t)} for t in zip(*d.values())]
return v
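
For reference, here is a minimal sketch of how that comprehension might slot into the original _eval, assuming d is the dict parsed by ast.literal_eval and keeping the question's error handling:

import ast

def _eval(line: str):
    try:
        d = ast.literal_eval(line.rstrip())  # parse the line into a dict of lists
        # zip(*d.values()) yields one tuple per row; pair each tuple back with the column keys
        return [{k: val for k, val in zip(d.keys(), t)} for t in zip(*d.values())]
    except Exception as e:
        print('err', e)
        return []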

CodePudding user response:

First, I would try to avoid readlines, which loads the whole file into memory as a list of lines. Then I would try to provide a dict of lists to the DataFrame constructor:

def read_structure(fp):
    cols = 'col-1', 'col-2', 'col-n'  # declare the column names
    data = {col: [] for col in cols}  # initialize the data structure

    with open(fp) as fh:
        for i, line in enumerate(fh, 1):
            try:
                row = ast.literal_eval(line)
                for k, v in row.items():
                    data[k].extend(v)
            except Exception as e:
                print('Error line', i, line, e)
                raise           # unsure that we can continue after an error
    return pd.DataFrame(data)

Unsure whether it is fast, but at least it is simple.
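
If the column names aren't known in advance, a collections.defaultdict variant avoids hard-coding them. This is just a sketch along the same lines, not benchmarked:

import ast
from collections import defaultdict
import pandas as pd

def read_structure(fp):
    data = defaultdict(list)  # columns are created the first time they appear
    with open(fp) as fh:
        for i, line in enumerate(fh, 1):
            try:
                row = ast.literal_eval(line)
                for k, v in row.items():
                    data[k].extend(v)
            except Exception as e:
                print('Error line', i, line, e)
                raise
    return pd.DataFrame(data)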
