Go through dataframe line by line-CodePudding

I have a text file which has multiple tables inside, and I was wondering what a good approach would be for this? Below is an example of what the text file might look like.

Employee Table:
Name Description Type
Bob  Employee    Standard
Jim  Employee    Standard
james Employee   Standard


Tools:
Item    Serial  Tag
Battery  0101    B.
Drill    9292    D.
Phone    8464    P.

Locations:
Station code len
West     12   9
North     1   9
East     21   9

I originally tried splicing by index:

instance_of_employee=df.loc[df['x'].str.contains("Employee table", case=True, na=False)]
employees=df.loc[instance_of_employee.index[0]:Instance_of_tools.index[0]-1 ]

But I found out these files can show up in random orders. The name however is always the same... 'Employee table', 'Tools', and 'Locations'

Is it possible to read the dataframe line by line and then where these titles exist, make them new dataframes?

CodePudding user response：

This comes pretty close to your idea of "reading the file line-by-line and make new dataframe with each section":

def parse_file(path):
    from collections import defaultdict
    from io import StringIO
    
    data = {}
    with open(path) as fp:
        section, content = None, ""
        for line in fp:
            if line.endswith(":\n"):
                section = line[:-2]
                content = ""
            elif line == "\n" and section:
                data[section] = pd.read_csv(StringIO(content), sep="\s ")
                section, content = None, ""
            else:
                content  = line
    
    return data

The function returns a dictionary whose keys are the section names and values are a dataframe representing the section:

data = parse_file("data.txt")

data["Employee Table"] # returns the Employee Table section
data["Tools"]          # returns the Tools section

CodePudding user response：

https://regex101.com/r/WELf2x/1

import re
import pandas as pd
from io import StringIO
data = []
with open(path) as fp:
    s = fp.read()
    regex = r"(?P<name>\w[^:] ):(?P<df>[^:] )$"
    matches = re.finditer(regex, s, re.MULTILINE)
    for matchNum, match in enumerate(matches, start=1):
        name = match['name']
        df = pd.read_csv(StringIO(match['df']), sep="\s ")
        data  = [{name:df}]
print(data)

output

[{'Employee Table':     
Name Description      Type
0    Bob    Employee  Standard
1    Jim    Employee  Standard
2  james    Employee  Standard}, 
{'Tools':       
Item  Serial Tag
0  Battery     101  B.
1    Drill    9292  D.
2    Phone    8464  P.}, 
{'Locations':   
Station  code  len
0    West    12    9
1   North     1    9
2    East    21    9}]