Home > OS >  Reading txt files with no delimiter and irregular width
Reading txt files with no delimiter and irregular width

Time:11-18

I'm trying to read in a large number of text files and clean up into a dataframe. The files do not include delimiters, have irregular widths due to the inclusion of some rows with data that corresponds to the subsequent set of rows.

Here are 2 samples:

    ITEM NBR  ITEM DESCRIPTION                                    UNIT OF     UNIT        BIDDER         CALCULATED  BIDR CALC
 BIDR NBR  BIDDER NAME                                  QUANTITY  MEASURE     PRICE       EXTENSION      EXTENSION   EXTENSION DIFF


    X0326806  WASHOUT BASIN                                1.000    L SUM
 1216      Copenhaver Construction, Inc.                                1,000.0000        1,000.00        1,000.00
 1320      D. Construction, Inc.                                        1,500.0000        1,500.00        1,500.00
 3069      K-Five Construction Corporation                              1,000.0000        1,000.00        1,000.00
 3702      Martam Construction Incorporated                             1,500.0000        1,500.00        1,500.00
 4741      Phoenix Corporation of the Quad Cities                       5,000.0000        5,000.00        5,000.00
 4786      Pir Tano Construction Company, Inc.                          1,200.0000        1,200.00        1,200.00
 1560      R. W. Dunteman Company                                         450.0000          450.00          450.00
 5378      Schroeder Asphalt Services, Inc.                             5,100.0000        5,100.00        5,100.00

    X0327036  BIKE PATH REM                              120.000    SQ YD
 1216      Copenhaver Construction, Inc.                                   16.0000        1,920.00        1,920.00
 1320      D. Construction, Inc.                                           20.0000        2,400.00        2,400.00
 3069      K-Five Construction Corporation                                  5.0000          600.00          600.00
 3702      Martam Construction Incorporated                                10.0000        1,200.00        1,200.00
 4741      Phoenix Corporation of the Quad Cities                          14.0000        1,680.00        1,680.00
 4786      Pir Tano Construction Company, Inc.                             32.0000        3,840.00        3,840.00
 1560      R. W. Dunteman Company                                          12.8400        1,540.80        1,540.80
 5378      Schroeder Asphalt Services, Inc.                                18.0000        2,160.00        2,160.00


                           

and another file here:

    ITEM NBR  ITEM DESCRIPTION                                    UNIT OF     UNIT        BIDDER         CALCULATED  BIDR CALC
 BIDR NBR  BIDDER NAME                                  QUANTITY  MEASURE     PRICE       EXTENSION      EXTENSION   EXTENSION DIFF


    X0320050  CONSTRUCTN LAYOUT SPL                        1.000    L SUM
 2341      Builders Paving, LLC                                         5,000.0000        5,000.00        5,000.00
 3020      J. A. Johnson Paving Company                                 5,000.0000        5,000.00        5,000.00
 0280      Peter Baker & Son Co.                                        1,500.0000        1,500.00        1,500.00

    X0327611  REM & REIN BRIC PAVER                       55.000    SQ FT
 2341      Builders Paving, LLC                                            20.0000        1,100.00        1,100.00
 3020      J. A. Johnson Paving Company                                    40.0000        2,200.00        2,200.00
 0280      Peter Baker & Son Co.                                           20.0000        1,100.00        1,100.00

I'm open to using R or Python and have tried a variety of approaches with base R, readr & data.table as well as pandas & looping over rows with open() with little success. My delimiter usage has been wrong as my results have either parsed every single space into a column or given me with a single column with all contents from each row.

Is there a clean way of accomplishing this? Thanks.

CodePudding user response:

Here is a working workflow:

  1. split the text by multiple newlines (and process all items in the list except the first one that only contains headers)
  2. Use pandas read_fwf to read the first line (item data) as a dataframe by identifying the fixed-width fields of the columns
  3. Do the same for the rest of the text (bidder data)
  4. Concatenate the two dataframes and append to a list
  5. Concatenate all dataframes in the list to one df

Code:

import re
import pandas as pd

data = '''
    ITEM NBR  ITEM DESCRIPTION                                    UNIT OF     UNIT        BIDDER         CALCULATED  BIDR CALC
 BIDR NBR  BIDDER NAME                                  QUANTITY  MEASURE     PRICE       EXTENSION      EXTENSION   EXTENSION DIFF


    X0320050  CONSTRUCTN LAYOUT SPL                        1.000    L SUM
 2341      Builders Paving, LLC                                         5,000.0000        5,000.00        5,000.00
 3020      J. A. Johnson Paving Company                                 5,000.0000        5,000.00        5,000.00
 0280      Peter Baker & Son Co.                                        1,500.0000        1,500.00        1,500.00

    X0327611  REM & REIN BRIC PAVER                       55.000    SQ FT
 2341      Builders Paving, LLC                                            20.0000        1,100.00        1,100.00
 3020      J. A. Johnson Paving Company                                    40.0000        2,200.00        2,200.00
 0280      Peter Baker & Son Co.                                           20.0000        1,100.00        1,100.00'''

#with open('filename.txt') as f:
#    data = f.read()

tables = [i for i in re.split(r'\n\n ', data)[1:] if i]

dfs= []
for i in tables:
    item_df = pd.read_fwf(io.StringIO(i.splitlines()[0]), names=['ITEM NBR','ITEM DESCRIPTION','QUANTITY','UNIT OF MEASURE'], colspecs=[(0,12),(12,45),(45,64),(64,73)])

    headings = ['BIDR NBR','BIDDER NAME','UNIT PRICE','BIDDER EXTENSION','CALCULATED EXTENSION']
    colspecs = [(1, 11), (11, 64), (64, 82), (82,98), (98, 114)]
    buyers_df = pd.read_fwf(io.StringIO(i), names=headings, index=False, colspecs=colspecs, skiprows=1, thousands=',')
    dfs.append(pd.concat([item_df, buyers_df], axis=1).ffill())
    
df = pd.concat(dfs)

Output:

ITEM NBR ITEM DESCRIPTION QUANTITY UNIT OF MEASURE BIDR NBR BIDDER NAME UNIT PRICE BIDDER EXTENSION CALCULATED EXTENSION
0 X0320050 CONSTRUCTN LAYOUT SPL 1 L SUM 2341 Builders Paving, LLC 5000 5000 5000
1 X0320050 CONSTRUCTN LAYOUT SPL 1 L SUM 3020 J. A. Johnson Paving Company 5000 5000 5000
2 X0320050 CONSTRUCTN LAYOUT SPL 1 L SUM 280 Peter Baker & Son Co. 1500 1500 1500
0 X0327611 REM & REIN BRIC PAVER 55 SQ FT 2341 Builders Paving, LLC 20 1100 1100
1 X0327611 REM & REIN BRIC PAVER 55 SQ FT 3020 J. A. Johnson Paving Company 40 2200 2200
2 X0327611 REM & REIN BRIC PAVER 55 SQ FT 280 Peter Baker & Son Co. 20 1100 1100

CodePudding user response:

Try this

import pandas as pd

data=open("filename.txt","r").read()
df = pd.DataFrame({"data": data.split()})
print(df)

CodePudding user response:

Read in the file, remove empty lines, create groups and split by them, remove the junk at the beginning, read in each group and create columns, remove commas from all fields except 2nd, convert columns that should be numeric to numeric and set the column names. We used a, b, c, ..., i as the column names but you can replace that with whatever you want. Note that this dynamically locates the fields rather than hard coding their locations.

library(magrittr)
library(data.table)

DF <- "data.txt" %>%
  readLines %>%
  Filter(nzchar, .) %>%
  split(., cumsum(grepl("^    ", .))) %>%
  tail(-1) %>%
  lapply(function(x) {
    read.table(text = gsub("   ", ";", paste0(x[-1], x[1])), fill = NA, sep = ";")
  }) %>% 
  rbindlist(fill = TRUE) %>%
  as.data.frame %>%
  replace(., -2, lapply(.[-2], gsub, pattern = ",", replacement = "")) %>%
  type.convert(as.is = TRUE) %>%
  setNames(letters[1:9])

Note

Create input file for testing.

Lines <- "    ITEM NBR  ITEM DESCRIPTION                                    UNIT OF     UNIT        BIDDER         CALCULATED  BIDR CALC
 BIDR NBR  BIDDER NAME                                  QUANTITY  MEASURE     PRICE       EXTENSION      EXTENSION   EXTENSION DIFF


    X0320050  CONSTRUCTN LAYOUT SPL                        1.000    L SUM
 2341      Builders Paving, LLC                                         5,000.0000        5,000.00        5,000.00
 3020      J. A. Johnson Paving Company                                 5,000.0000        5,000.00        5,000.00
 0280      Peter Baker & Son Co.                                        1,500.0000        1,500.00        1,500.00

    X0327611  REM & REIN BRIC PAVER                       55.000    SQ FT
 2341      Builders Paving, LLC                                            20.0000        1,100.00        1,100.00
 3020      J. A. Johnson Paving Company                                    40.0000        2,200.00        2,200.00
 0280      Peter Baker & Son Co.                                           20.0000        1,100.00        1,100.00"
writeLines(Lines, "data.txt")
  • Related