I have a text file which has multiple tables inside, and I was wondering what a good approach would be for this? Below is an example of what the text file might look like.
Employee Table:
Name Description Type
Bob Employee Standard
Jim Employee Standard
james Employee Standard
Tools:
Item Serial Tag
Battery 0101 B.
Drill 9292 D.
Phone 8464 P.
Locations:
Station code len
West 12 9
North 1 9
East 21 9
I originally tried splicing by index:
instance_of_employee=df.loc[df['x'].str.contains("Employee table", case=True, na=False)]
employees=df.loc[instance_of_employee.index[0]:Instance_of_tools.index[0]-1 ]
But I found out these files can show up in random orders. The name however is always the same... 'Employee table', 'Tools', and 'Locations'
Is it possible to read the dataframe line by line and then where these titles exist, make them new dataframes?
CodePudding user response:
This comes pretty close to your idea of "reading the file line-by-line and make new dataframe with each section":
def parse_file(path):
from collections import defaultdict
from io import StringIO
data = {}
with open(path) as fp:
section, content = None, ""
for line in fp:
if line.endswith(":\n"):
section = line[:-2]
content = ""
elif line == "\n" and section:
data[section] = pd.read_csv(StringIO(content), sep="\s ")
section, content = None, ""
else:
content = line
return data
The function returns a dictionary whose keys are the section names and values are a dataframe representing the section:
data = parse_file("data.txt")
data["Employee Table"] # returns the Employee Table section
data["Tools"] # returns the Tools section
CodePudding user response:
https://regex101.com/r/WELf2x/1
import re
import pandas as pd
from io import StringIO
data = []
with open(path) as fp:
s = fp.read()
regex = r"(?P<name>\w[^:] ):(?P<df>[^:] )$"
matches = re.finditer(regex, s, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
name = match['name']
df = pd.read_csv(StringIO(match['df']), sep="\s ")
data = [{name:df}]
print(data)
output
[{'Employee Table':
Name Description Type
0 Bob Employee Standard
1 Jim Employee Standard
2 james Employee Standard},
{'Tools':
Item Serial Tag
0 Battery 101 B.
1 Drill 9292 D.
2 Phone 8464 P.},
{'Locations':
Station code len
0 West 12 9
1 North 1 9
2 East 21 9}]