Home > Back-end >  read file into data frame with grouping variable
read file into data frame with grouping variable

Time:10-14

file.txt

///// A
13 32 12 13
4 22 34 42    
///// B
3 1 34 11
0 NaN 21 1
44 32 33 32  
///// C
5 32 11 21
43 23 NaN 3

I'm trying to read this file such that values are grouped according to the letter after /////. Desired output:

0   1   2   3   Group
13  32  12  13  A
4   22  34  42  A
3   1   34  11  B
0   NaN 21  1   B
44  32  33  32  B
5   32  11  21  C
43  23  NaN 3   C

I tried most options in pd.read_table but I don't know how to deal with the grouping, as I can only manage to read the file if I ignore the ///// lines df = pd.read_table('file.txt', sep=' ', header=None, comment='/')

CodePudding user response:

Try this:

import numpy as np
import pandas as pd

df_list = []


def converter(x):
    try:
        return int(x)
    except:
        return np.nan


with open('file.txt', 'r') as f:
    for line in f:
        line = line.strip()
        if line.startswith('/////'):
            group = line[-1]
        else:
            values = map(converter, line.split())
            df_list.append([*values, group])

df = pd.DataFrame(df_list, columns=[*[i for i in range(4)], 'Group'])

CodePudding user response:

You can use a regex to read the group header and count the rows, then read your file treating the intermediate headers as comments and add the groups:

# data is the file content here
with open('file.txt') as f:
    groups = re.findall(r'/////\s*(\w )|^', f.read(), flags=re.M)
s = pd.Series(groups)
m = s.eq('')

df = pd.read_table('file.txt', sep='\s ', header=None, comment='/')
df['group'] = s.mask(m).ffill()[m].values

output:

    0     1     2   3 group
0  13  32.0  12.0  13     A
1   4  22.0  34.0  42     A
2   3   1.0  34.0  11     B
3   0   NaN  21.0   1     B
4  44  32.0  33.0  32     B
5   5  32.0  11.0  21     C
6  43  23.0   NaN   3     C
  • Related