Home > database >  Reading and parsing a txt file with nested headers
Reading and parsing a txt file with nested headers

Time:12-26

I have a .txt file having multiple nested headers (textual version down below)

sample input

  • The Sr.No. column is repeated in the outer as well as inner table
  • The Inner table headers are repeated every time the outer Sr.No. Changes (For ex. here from Sr.N. from 1 to 2)
  • There could be n number of rows for each inner table. I want to parse this and flatten this table such that the actual columns are (Name, Account Number,Sr. No., Section,Transaction Date,Status of Booking,Date of Booking,Remarks,Amount Paid,Tax Deducted,Amount Deposited).

I have tried using pandas read_csv function with multi header option

>>> pd.read_csv('sample.txt.csv', sep="^", header=[0],index_col=[0,1])

but that does not seems to help

pandas read csv

Pardon any mistakes, this is my first question.Can provide more info

Sample Data (^ column divider):

Sr. No.^Name ^Account Number^^^^^Total Amount Paid^Total Tax Deducted^Total TDS Deposited
1^ABC^ABC354^^^^^100^100^100
^Sr. No.^Section^Transaction Date^Status of Booking^Date of Booking^Remarks^Amount Paid^Tax Deducted^TDS Deposited
^1^278^2020-08-08^ABC^2020-10-20^^150^1000^100
^2^78^2020-08-08^ABC^2020-10-20^^150^1000^100
^3^278^2020-08-08^ABC^2020-10-20^^150^1000^100
^^^^^^^^^
2^XYZ^ABC123^^^^^900^900^900
^Sr. No.^Section^Transaction Date^Status of Booking^Date of Booking^Remarks^Amount Paid^Tax Deducted^TDS Deposited
^1^178^2020-08-08^ABCD^2020-10-20^^150^1000^100
^2^978^2020-08-08^ABCE^2020-10-20^^150^1000^100
^3^778^2020-08-08^ABCF^2020-10-20^^150^1000^100

CodePudding user response:

Here is a more concise approach, which does the following:

  1. It splits the text file to extract the top header
  2. It creates a list of items by splitting the data on ^^^^^^^^^
  3. It then iterates all items and creates separate dataframes for the top and bottom parts
  4. It joins top and bottom dfs
  5. It concatenates all dfs to a single file

Example:

import pandas as pd
import io

# read file here, I'll proceed without it for the reproducibility of the answer
#with open('filename.csv') as f:
#    data = f.read()

data = '''Sr. No.^Name ^Account Number^^^^^Total Amount Paid^Total Tax Deducted^Total TDS Deposited
1^ABC^ABC354^^^^^100^100^100
^Sr. No.^Section^Transaction Date^Status of Booking^Date of Booking^Remarks^Amount Paid^Tax Deducted^TDS Deposited
^1^278^2020-08-08^ABC^2020-10-20^^150^1000^100
^2^78^2020-08-08^ABC^2020-10-20^^150^1000^100
^3^278^2020-08-08^ABC^2020-10-20^^150^1000^100
^^^^^^^^^
2^XYZ^ABC123^^^^^900^900^900
^Sr. No.^Section^Transaction Date^Status of Booking^Date of Booking^Remarks^Amount Paid^Tax Deducted^TDS Deposited
^1^178^2020-08-08^ABCD^2020-10-20^^150^1000^100
^2^978^2020-08-08^ABCE^2020-10-20^^150^1000^100
^3^778^2020-08-08^ABCF^2020-10-20^^150^1000^100'''

headers, content = data.split('\n', 1)
content = [i.strip() for i in content.split('^^^^^^^^^')]

dfs = []
for i in content:
    content_top, content_bottom = i.split('\n', 1)
    df_top = pd.read_csv(io.StringIO(headers   '\n'   content_top), sep='^').dropna(axis=1, how='all')
    top_data = headers   '\n'   content_top
    df_bottom = pd.read_csv(io.StringIO(content_bottom), sep='^').dropna(axis=1, how='all')
    df = df_bottom.join(df_top, lsuffix='_').fillna(method='ffill')
    dfs.append(df)
    
final_df = pd.concat(dfs)

Output:

Sr. No._ Section Transaction Date Status of Booking Date of Booking Amount Paid Tax Deducted TDS Deposited Sr. No. Name Account Number Total Amount Paid Total Tax Deducted Total TDS Deposited
0 1 278 2020-08-08 ABC 2020-10-20 150 1000 100 1 ABC ABC354 100 100 100
1 2 78 2020-08-08 ABC 2020-10-20 150 1000 100 1 ABC ABC354 100 100 100
2 3 278 2020-08-08 ABC 2020-10-20 150 1000 100 1 ABC ABC354 100 100 100
0 1 178 2020-08-08 ABCD 2020-10-20 150 1000 100 2 XYZ ABC123 900 900 900
1 2 978 2020-08-08 ABCE 2020-10-20 150 1000 100 2 XYZ ABC123 900 900 900
2 3 778 2020-08-08 ABCF 2020-10-20 150 1000 100 2 XYZ ABC123 900 900 900

CodePudding user response:

If your data is well formed, you could transform it into something more readable by python. with 500k to 700k lines this should be done line wise - on block may be a bit memory hungry.

This code may work or this may also not be feasable if your file does not adhere to your shown structure:

Structure:
- outer header
- outer data (starts with number in col 0)
-   inner header
-   [inner data]
- empty line
- [outer header
- outer data (starts with number in col 0)
-   inner header
-   inner data 
- empty line]

Demo file:

infile = "t.txt"

with open("t.txt","w") as f:
    f.write("""Sr. No.^Name ^Account Number^^^^^Total Amount Paid^Total Tax Deducted^Total TDS Deposited
1^ABC^ABC354^^^^^100^100^100
^Sr. No.^Section^Transaction Date^Status of Booking^Date of Booking^Remarks^Amount Paid^Tax Deducted^TDS Deposited
^1^278^2020-08-08^ABC^2020-10-20^^150^1000^100
^2^78^2020-08-08^ABC^2020-10-20^^150^1000^100
^3^278^2020-08-08^ABC^2020-10-20^^150^1000^100
^^^^^^^^^
2^XYZ^ABC123^^^^^900^900^900
^Sr. No.^Section^Transaction Date^Status of Booking^Date of Booking^Remarks^Amount Paid^Tax Deducted^TDS Deposited
^1^178^2020-08-08^ABCD^2020-10-20^^150^1000^100
^2^978^2020-08-08^ABCE^2020-10-20^^150^1000^100
^3^778^2020-08-08^ABCF^2020-10-20^^150^1000^100""")

Code:

import csv

infile = "t.txt"
outfile = "processed.txt"

with open(infile) as f, open(outfile,"w", newline='') as out:
    writer = csv.writer(out)
    headers = ["outer "   k if k.strip() else '' for k in next(f).strip().split("^")]
    inner_header = []
    outer_data = None
    curr_data = []
    for line in f:
        line = line.strip().split("^")
        if all(not l for l in line): 
            line = None       # line is empty == end of an inner data block
        if line and line[0]:  # start of outer data block
            outer_data = line[:]
            curr_data = []
        elif line and not inner_header: # 1st inner header
            headers.extend(line)
            writer.writerow(headers)
            inner_header = line[:]
        elif line and inner_header and all(
             inner_header[i] == line[i] for i in range(len(line))):
            # skip all other inner headers
            continue
        elif line:            # inner data
            curr_data.append(line[:])
        else:                 # line == None => inner data done, write it
            for cur in curr_data:
                writer.writerow(outer_data   cur)
            curr_data = []

    # last part of inner data, not yet written, no last empty line, write it
    if curr_data:
        # write it
        for cur in curr_data:
            writer.writerow(outer_data   cur)
        curr_data = []

Result (i did not see any point in using ^ instead of , - change it if you want):

outfile = "processed.txt"

with open(outfile) as f:
    for l in f:
        print(l, end = "")

outer Sr. No.,outer Name ,outer Account Number,,,,,outer Total Amount Paid,outer Total Tax Deducted,outer Total TDS Deposited,,Sr. No.,Section,Transaction Date,Status of Booking,Date of Booking,Remarks,Amount Paid,Tax Deducted,TDS Deposited
1,ABC,ABC354,,,,,100,100,100,,1,278,2020-08-08,ABC,2020-10-20,,150,1000,100
1,ABC,ABC354,,,,,100,100,100,,2,78,2020-08-08,ABC,2020-10-20,,150,1000,100
1,ABC,ABC354,,,,,100,100,100,,3,278,2020-08-08,ABC,2020-10-20,,150,1000,100
2,XYZ,ABC123,,,,,900,900,900,,1,178,2020-08-08,ABCD,2020-10-20,,150,1000,100
2,XYZ,ABC123,,,,,900,900,900,,2,978,2020-08-08,ABCE,2020-10-20,,150,1000,100
2,XYZ,ABC123,,,,,900,900,900,,3,778,2020-08-08,ABCF,2020-10-20,,150,1000,100

This cleaned file should be easily understood by pandas - you can drop any columns you do not want.

  • Related