Read multiple dataframes (chunks) from a file in pandas

I want to read a file in pandas with read_fwf, but the file contains multiple chunks. I am aware of the chunksize option when reading, but that splits the file into chunks of a given size. Instead, I want to read a file that already consists of multiple chunks, and have each chunk saved in its own variable, like df1, df2, etc.

Each chunk in the file has its own header line, which also marks where the next chunk begins. The file has the following format:

a1   b   c1    d            // chunk 1
1    2    3    4
2    3    4    5
a2   b    c    d2            // chunk 2
3    9    1    4

The headers are not always identical, but they always start with the same string, and no other rows start with that string. In this example, each header (and thus the start of the next chunk) starts with a, and no other row starts with a.

One not-so-efficient way would be to split the file into multiple files and then read each of them separately as a pandas DataFrame, but surely there is a more efficient way to do this?

The file itself is rather small, so reading the whole file is not an issue.

CodePudding user response:

You could first read the initial file (at the text level) and split it into chunks in memory; then, as soon as one chunk is ready, pass it to pandas through an io.StringIO object. That way you only read the file once, yet still let pandas guess the column datatypes. Something like:

import io
import pandas as pd

chunk = []
with open(file) as fd:                # file is the path to your input file
    for line in fd:
        if line.startswith('a'):
            # found the start of a new chunk: process the previous one, if any
            if len(chunk) != 0:
                df = pd.read_fwf(io.StringIO(''.join(chunk)), ...)
                # process df
                ...
                chunk = []    # DO NOT FORGET TO RESET chunk HERE...
        chunk.append(line)
# process the last chunk
if len(chunk) != 0:
    df = pd.read_fwf(io.StringIO(''.join(chunk)), ...)
    # process df
    ...

But beware: this only makes sense if the chunks are small enough, because it merely trades disk accesses for memory.
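
If you want the chunks available as df1, df2, etc. as in the question, a variant of the same loop can collect them in a dict instead of processing each frame inline. This is only a sketch; data.txt stands in for your actual file path:

import io
import pandas as pd

dfs = {}                                  # will hold 'df1', 'df2', ...
chunk = []
with open('data.txt') as fd:
    for line in fd:
        if line.startswith('a') and chunk:
            # previous chunk is complete: parse and store it
            dfs[f'df{len(dfs) + 1}'] = pd.read_fwf(io.StringIO(''.join(chunk)))
            chunk = []
        chunk.append(line)
if chunk:
    # parse the final chunk
    dfs[f'df{len(dfs) + 1}'] = pd.read_fwf(io.StringIO(''.join(chunk)))

dfs['df1'] and dfs['df2'] then hold the two frames from the example file.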

CodePudding user response:

You can create one DataFrame from all the values and then split it wherever the first column starts with a:

print(df)
   a1  b  c   d
0   1  2  3   4
1   2  3  4   5
2  a2  b  c  d2
3   3  9  1   4
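
For reference, a df like this could be produced from the question's file with pd.read_fwf, e.g. (a minimal sketch; data.txt stands in for the real path):

import pandas as pd

df = pd.read_fwf('data.txt')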

# mark the embedded header rows (rows whose first column starts with 'a')
m = df.iloc[:, 0].str.startswith('a')
# drop the header rows, then group the remaining rows by chunk number
dfs = {f'df{i}': v for i, (k, v) in enumerate(df[~m].groupby(m.cumsum()), 1)}
print(dfs['df1'])
  a1  b  c  d
0  1  2  3  4
1  2  3  4  5

print(dfs['df2'])
  a1  b  c  d
3  3  9  1  4

If you need to set each chunk's column names from its own first row, read the file without a header first:

print(df)
    0  1  2   3
0  a1  b  c   d
1   1  2  3   4
2   2  3  4   5
3  a2  b  c  d2
4   3  9  1   4
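
For this variant the file would be read without a header row, e.g. (same data.txt assumption as above):

df = pd.read_fwf('data.txt', header=None)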


# mark the embedded header rows
m = df.iloc[:, 0].str.startswith('a')
# group rows by chunk, then promote each group's first row to column names
dfs = {f'df{i}': v.iloc[1:].set_axis(v.iloc[0], axis=1).rename_axis(None, axis=1)
       for i, (k, v) in enumerate(df.groupby(m.cumsum()), 1)}
print(dfs['df1'])
  a1  b  c  d
1  1  2  3  4
2  2  3  4  5

print(dfs['df2'])
  a2  b  c d2
4  3  9  1  4
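
Note that with this approach the values stay as strings (object dtype), because the embedded header rows were read together with the data. If you need numeric columns back, one possible follow-up (a sketch, not part of the original answer) is:

dfs = {k: v.apply(pd.to_numeric) for k, v in dfs.items()}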