Reading txt files with no delimiter and irregular width-CodePudding

I'm trying to read in a large number of text files and clean up into a dataframe. The files do not include delimiters, have irregular widths due to the inclusion of some rows with data that corresponds to the subsequent set of rows.

Here are 2 samples:

    ITEM NBR  ITEM DESCRIPTION                                    UNIT OF     UNIT        BIDDER         CALCULATED  BIDR CALC
 BIDR NBR  BIDDER NAME                                  QUANTITY  MEASURE     PRICE       EXTENSION      EXTENSION   EXTENSION DIFF


    X0326806  WASHOUT BASIN                                1.000    L SUM
 1216      Copenhaver Construction, Inc.                                1,000.0000        1,000.00        1,000.00
 1320      D. Construction, Inc.                                        1,500.0000        1,500.00        1,500.00
 3069      K-Five Construction Corporation                              1,000.0000        1,000.00        1,000.00
 3702      Martam Construction Incorporated                             1,500.0000        1,500.00        1,500.00
 4741      Phoenix Corporation of the Quad Cities                       5,000.0000        5,000.00        5,000.00
 4786      Pir Tano Construction Company, Inc.                          1,200.0000        1,200.00        1,200.00
 1560      R. W. Dunteman Company                                         450.0000          450.00          450.00
 5378      Schroeder Asphalt Services, Inc.                             5,100.0000        5,100.00        5,100.00

    X0327036  BIKE PATH REM                              120.000    SQ YD
 1216      Copenhaver Construction, Inc.                                   16.0000        1,920.00        1,920.00
 1320      D. Construction, Inc.                                           20.0000        2,400.00        2,400.00
 3069      K-Five Construction Corporation                                  5.0000          600.00          600.00
 3702      Martam Construction Incorporated                                10.0000        1,200.00        1,200.00
 4741      Phoenix Corporation of the Quad Cities                          14.0000        1,680.00        1,680.00
 4786      Pir Tano Construction Company, Inc.                             32.0000        3,840.00        3,840.00
 1560      R. W. Dunteman Company                                          12.8400        1,540.80        1,540.80
 5378      Schroeder Asphalt Services, Inc.                                18.0000        2,160.00        2,160.00

and another file here:

    ITEM NBR  ITEM DESCRIPTION                                    UNIT OF     UNIT        BIDDER         CALCULATED  BIDR CALC
 BIDR NBR  BIDDER NAME                                  QUANTITY  MEASURE     PRICE       EXTENSION      EXTENSION   EXTENSION DIFF


    X0320050  CONSTRUCTN LAYOUT SPL                        1.000    L SUM
 2341      Builders Paving, LLC                                         5,000.0000        5,000.00        5,000.00
 3020      J. A. Johnson Paving Company                                 5,000.0000        5,000.00        5,000.00
 0280      Peter Baker & Son Co.                                        1,500.0000        1,500.00        1,500.00

    X0327611  REM & REIN BRIC PAVER                       55.000    SQ FT
 2341      Builders Paving, LLC                                            20.0000        1,100.00        1,100.00
 3020      J. A. Johnson Paving Company                                    40.0000        2,200.00        2,200.00
 0280      Peter Baker & Son Co.                                           20.0000        1,100.00        1,100.00

I'm open to using R or Python and have tried a variety of approaches with base R, readr & data.table as well as pandas & looping over rows with open() with little success. My delimiter usage has been wrong as my results have either parsed every single space into a column or given me with a single column with all contents from each row.

Is there a clean way of accomplishing this? Thanks.

CodePudding user response：

Here is a working workflow:

split the text by multiple newlines (and process all items in the list except the first one that only contains headers)
Use pandas read_fwf to read the first line (item data) as a dataframe by identifying the fixed-width fields of the columns
Do the same for the rest of the text (bidder data)
Concatenate the two dataframes and append to a list
Concatenate all dataframes in the list to one df

Code:

import re
import pandas as pd

data = '''
    ITEM NBR  ITEM DESCRIPTION                                    UNIT OF     UNIT        BIDDER         CALCULATED  BIDR CALC
 BIDR NBR  BIDDER NAME                                  QUANTITY  MEASURE     PRICE       EXTENSION      EXTENSION   EXTENSION DIFF


    X0320050  CONSTRUCTN LAYOUT SPL                        1.000    L SUM
 2341      Builders Paving, LLC                                         5,000.0000        5,000.00        5,000.00
 3020      J. A. Johnson Paving Company                                 5,000.0000        5,000.00        5,000.00
 0280      Peter Baker & Son Co.                                        1,500.0000        1,500.00        1,500.00

    X0327611  REM & REIN BRIC PAVER                       55.000    SQ FT
 2341      Builders Paving, LLC                                            20.0000        1,100.00        1,100.00
 3020      J. A. Johnson Paving Company                                    40.0000        2,200.00        2,200.00
 0280      Peter Baker & Son Co.                                           20.0000        1,100.00        1,100.00'''

#with open('filename.txt') as f:
#    data = f.read()

tables = [i for i in re.split(r'\n\n ', data)[1:] if i]

dfs= []
for i in tables:
    item_df = pd.read_fwf(io.StringIO(i.splitlines()[0]), names=['ITEM NBR','ITEM DESCRIPTION','QUANTITY','UNIT OF MEASURE'], colspecs=[(0,12),(12,45),(45,64),(64,73)])

    headings = ['BIDR NBR','BIDDER NAME','UNIT PRICE','BIDDER EXTENSION','CALCULATED EXTENSION']
    colspecs = [(1, 11), (11, 64), (64, 82), (82,98), (98, 114)]
    buyers_df = pd.read_fwf(io.StringIO(i), names=headings, index=False, colspecs=colspecs, skiprows=1, thousands=',')
    dfs.append(pd.concat([item_df, buyers_df], axis=1).ffill())
    
df = pd.concat(dfs)

Output:

	ITEM NBR	ITEM DESCRIPTION	QUANTITY	UNIT OF MEASURE	BIDR NBR	BIDDER NAME	UNIT PRICE	BIDDER EXTENSION	CALCULATED EXTENSION
0	X0320050	CONSTRUCTN LAYOUT SPL	1	L SUM	2341	Builders Paving, LLC	5000	5000	5000
1	X0320050	CONSTRUCTN LAYOUT SPL	1	L SUM	3020	J. A. Johnson Paving Company	5000	5000	5000
2	X0320050	CONSTRUCTN LAYOUT SPL	1	L SUM	280	Peter Baker & Son Co.	1500	1500	1500
0	X0327611	REM & REIN BRIC PAVER	55	SQ FT	2341	Builders Paving, LLC	20	1100	1100
1	X0327611	REM & REIN BRIC PAVER	55	SQ FT	3020	J. A. Johnson Paving Company	40	2200	2200
2	X0327611	REM & REIN BRIC PAVER	55	SQ FT	280	Peter Baker & Son Co.	20	1100	1100

CodePudding user response：

Try this

import pandas as pd

data=open("filename.txt","r").read()
df = pd.DataFrame({"data": data.split()})
print(df)

CodePudding user response：

Read in the file, remove empty lines, create groups and split by them, remove the junk at the beginning, read in each group and create columns, remove commas from all fields except 2nd, convert columns that should be numeric to numeric and set the column names. We used a, b, c, ..., i as the column names but you can replace that with whatever you want. Note that this dynamically locates the fields rather than hard coding their locations.

library(magrittr)
library(data.table)

DF <- "data.txt" %>%
  readLines %>%
  Filter(nzchar, .) %>%
  split(., cumsum(grepl("^    ", .))) %>%
  tail(-1) %>%
  lapply(function(x) {
    read.table(text = gsub("   ", ";", paste0(x[-1], x[1])), fill = NA, sep = ";")
  }) %>% 
  rbindlist(fill = TRUE) %>%
  as.data.frame %>%
  replace(., -2, lapply(.[-2], gsub, pattern = ",", replacement = "")) %>%
  type.convert(as.is = TRUE) %>%
  setNames(letters[1:9])

Note

Create input file for testing.

Lines <- "    ITEM NBR  ITEM DESCRIPTION                                    UNIT OF     UNIT        BIDDER         CALCULATED  BIDR CALC
 BIDR NBR  BIDDER NAME                                  QUANTITY  MEASURE     PRICE       EXTENSION      EXTENSION   EXTENSION DIFF


    X0320050  CONSTRUCTN LAYOUT SPL                        1.000    L SUM
 2341      Builders Paving, LLC                                         5,000.0000        5,000.00        5,000.00
 3020      J. A. Johnson Paving Company                                 5,000.0000        5,000.00        5,000.00
 0280      Peter Baker & Son Co.                                        1,500.0000        1,500.00        1,500.00

    X0327611  REM & REIN BRIC PAVER                       55.000    SQ FT
 2341      Builders Paving, LLC                                            20.0000        1,100.00        1,100.00
 3020      J. A. Johnson Paving Company                                    40.0000        2,200.00        2,200.00
 0280      Peter Baker & Son Co.                                           20.0000        1,100.00        1,100.00"
writeLines(Lines, "data.txt")