Home > Software engineering >  How to read fixed-width data?
How to read fixed-width data?

Time:04-20

data looks like

212253820000025000.00000002500.00000000375.00111120211105202117
212456960000000750.00000000075.00000000011.25111120211102202117
212387470000010000.00000001000.00000000150.00111120211105202117

need to add separator like

21225382,0000025000.00,000002500.00,000000375.00,11112021,11052021,17
21245696,0000000750.00,000000075.00,000000011.25,11112021,11022021,17
21238747,0000010000.00,000001000.00,000000150.00,11112021,11052021,17

The CSV file length is high nearly 20000 rows are there is there any possibility to do

CodePudding user response:

This question is generally about reading "fixed width data".

If you're stuck with this data, you'll need to parse it line by line then column by column. I'll show you how to do this with Python.

First off, the columns you counted off in the comment do not match your sample output. You seemed to have omitted the last column with a count of 2 characters.

You'll need accurate column widths to perform the task. I took your sample data and counted the columns for you and got these numbers:

8, 13, 12, 12, 8, 8, 2

So, we'll read the input data line by line, and for every line we'll:

  • Read 8 chars and save it as a column, then 13 chars and save it as a column, then 12 chars, etc... till we've read all the specified column widths
  • As we move through the line we'll keep track of our position with the variables beg and end to denote where a column begins (inclusive) and where it ends (exclusive)
  • The end of the first column becomes the beginning of the next, and so on down the line
  • We'll store those columns in a list (array) that is the new row
  • At the end of the line we'll save the new row to a list of all the rows
  • Then, we'll repeat the process for the next line

Here's how this looks in Python:

import pprint

Col_widths = [8, 13, 12, 12, 8, 8, 2]

all_rows = []
with open("data.txt") as in_file:
    for line in in_file:
        row = []
        beg = 0
        for width in Col_widths:
            end = beg   width
            col = line[beg:end]
            row.append(col)
            beg = end
        all_rows.append(row)

pprint.pprint(all_rows, width=100)

all_rows is just a list of lists of text:

[['21225382', '0000025000.00', '000002500.00', '000000375.00', '11112021', '11052021', '17'],
 ['21245696', '0000000750.00', '000000075.00', '000000011.25', '11112021', '11022021', '17'],
 ['21238747', '0000010000.00', '000001000.00', '000000150.00', '11112021', '11052021', '17']]

With this approach, if you miscounted the column width or the number of columns you can easily modify the Column_widths to match your data.

From here we'll use Python's CSV module to make sure the CSV file is written correctly:

import csv

with open("data.csv", "w", newline="") as out_file:
    writer = csv.writer(out_file)
    writer.writerows(all_rows)

and my data.csv file looks like:

21225382,0000025000.00,000002500.00,000000375.00,11112021,11052021,17
21245696,0000000750.00,000000075.00,000000011.25,11112021,11022021,17
21238747,0000010000.00,000001000.00,000000150.00,11112021,11052021,17

CodePudding user response:

If you have access to the command-line tool awk, you can fix your data like the following:

  • substr() gives a portion of the string $0, which is the entire line
  • you start at char 1 then specify the width of your first column, 8
  • for the next substr(), you again use $0, you start at 9 (1 8 from the last substr), and give it the second column's width, 13
  • and repeat for each column, starting at "the start of the last column plus the last column's width"
#!/bin/sh

# Col_widths = [8, 13, 12, 12, 8, 8, 2]

awk '{print substr($0,1,8) "," substr($0,9,13) "," substr($0,22,12) "," substr($0,34,12) "," substr($0,46,8) "," substr($0,54,8) "," substr($0,62,2)}' data.txt > data.csv
  • Related