I'm trying to read in a large number of text files and clean up into a dataframe. The files do not include delimiters, have irregular widths due to the inclusion of some rows with data that corresponds to the subsequent set of rows.
Here are 2 samples:
ITEM NBR ITEM DESCRIPTION UNIT OF UNIT BIDDER CALCULATED BIDR CALC
BIDR NBR BIDDER NAME QUANTITY MEASURE PRICE EXTENSION EXTENSION EXTENSION DIFF
X0326806 WASHOUT BASIN 1.000 L SUM
1216 Copenhaver Construction, Inc. 1,000.0000 1,000.00 1,000.00
1320 D. Construction, Inc. 1,500.0000 1,500.00 1,500.00
3069 K-Five Construction Corporation 1,000.0000 1,000.00 1,000.00
3702 Martam Construction Incorporated 1,500.0000 1,500.00 1,500.00
4741 Phoenix Corporation of the Quad Cities 5,000.0000 5,000.00 5,000.00
4786 Pir Tano Construction Company, Inc. 1,200.0000 1,200.00 1,200.00
1560 R. W. Dunteman Company 450.0000 450.00 450.00
5378 Schroeder Asphalt Services, Inc. 5,100.0000 5,100.00 5,100.00
X0327036 BIKE PATH REM 120.000 SQ YD
1216 Copenhaver Construction, Inc. 16.0000 1,920.00 1,920.00
1320 D. Construction, Inc. 20.0000 2,400.00 2,400.00
3069 K-Five Construction Corporation 5.0000 600.00 600.00
3702 Martam Construction Incorporated 10.0000 1,200.00 1,200.00
4741 Phoenix Corporation of the Quad Cities 14.0000 1,680.00 1,680.00
4786 Pir Tano Construction Company, Inc. 32.0000 3,840.00 3,840.00
1560 R. W. Dunteman Company 12.8400 1,540.80 1,540.80
5378 Schroeder Asphalt Services, Inc. 18.0000 2,160.00 2,160.00
and another file here:
ITEM NBR ITEM DESCRIPTION UNIT OF UNIT BIDDER CALCULATED BIDR CALC
BIDR NBR BIDDER NAME QUANTITY MEASURE PRICE EXTENSION EXTENSION EXTENSION DIFF
X0320050 CONSTRUCTN LAYOUT SPL 1.000 L SUM
2341 Builders Paving, LLC 5,000.0000 5,000.00 5,000.00
3020 J. A. Johnson Paving Company 5,000.0000 5,000.00 5,000.00
0280 Peter Baker & Son Co. 1,500.0000 1,500.00 1,500.00
X0327611 REM & REIN BRIC PAVER 55.000 SQ FT
2341 Builders Paving, LLC 20.0000 1,100.00 1,100.00
3020 J. A. Johnson Paving Company 40.0000 2,200.00 2,200.00
0280 Peter Baker & Son Co. 20.0000 1,100.00 1,100.00
I'm open to using R or Python and have tried a variety of approaches with base R, readr & data.table as well as pandas & looping over rows with open() with little success. My delimiter usage has been wrong as my results have either parsed every single space into a column or given me with a single column with all contents from each row.
Is there a clean way of accomplishing this? Thanks.
CodePudding user response:
Here is a working workflow:
- split the text by multiple newlines (and process all items in the list except the first one that only contains headers)
- Use pandas read_fwf to read the first line (item data) as a dataframe by identifying the fixed-width fields of the columns
- Do the same for the rest of the text (bidder data)
- Concatenate the two dataframes and append to a list
- Concatenate all dataframes in the list to one df
Code:
import re
import pandas as pd
data = '''
ITEM NBR ITEM DESCRIPTION UNIT OF UNIT BIDDER CALCULATED BIDR CALC
BIDR NBR BIDDER NAME QUANTITY MEASURE PRICE EXTENSION EXTENSION EXTENSION DIFF
X0320050 CONSTRUCTN LAYOUT SPL 1.000 L SUM
2341 Builders Paving, LLC 5,000.0000 5,000.00 5,000.00
3020 J. A. Johnson Paving Company 5,000.0000 5,000.00 5,000.00
0280 Peter Baker & Son Co. 1,500.0000 1,500.00 1,500.00
X0327611 REM & REIN BRIC PAVER 55.000 SQ FT
2341 Builders Paving, LLC 20.0000 1,100.00 1,100.00
3020 J. A. Johnson Paving Company 40.0000 2,200.00 2,200.00
0280 Peter Baker & Son Co. 20.0000 1,100.00 1,100.00'''
#with open('filename.txt') as f:
# data = f.read()
tables = [i for i in re.split(r'\n\n ', data)[1:] if i]
dfs= []
for i in tables:
item_df = pd.read_fwf(io.StringIO(i.splitlines()[0]), names=['ITEM NBR','ITEM DESCRIPTION','QUANTITY','UNIT OF MEASURE'], colspecs=[(0,12),(12,45),(45,64),(64,73)])
headings = ['BIDR NBR','BIDDER NAME','UNIT PRICE','BIDDER EXTENSION','CALCULATED EXTENSION']
colspecs = [(1, 11), (11, 64), (64, 82), (82,98), (98, 114)]
buyers_df = pd.read_fwf(io.StringIO(i), names=headings, index=False, colspecs=colspecs, skiprows=1, thousands=',')
dfs.append(pd.concat([item_df, buyers_df], axis=1).ffill())
df = pd.concat(dfs)
Output:
ITEM NBR | ITEM DESCRIPTION | QUANTITY | UNIT OF MEASURE | BIDR NBR | BIDDER NAME | UNIT PRICE | BIDDER EXTENSION | CALCULATED EXTENSION | |
---|---|---|---|---|---|---|---|---|---|
0 | X0320050 | CONSTRUCTN LAYOUT SPL | 1 | L SUM | 2341 | Builders Paving, LLC | 5000 | 5000 | 5000 |
1 | X0320050 | CONSTRUCTN LAYOUT SPL | 1 | L SUM | 3020 | J. A. Johnson Paving Company | 5000 | 5000 | 5000 |
2 | X0320050 | CONSTRUCTN LAYOUT SPL | 1 | L SUM | 280 | Peter Baker & Son Co. | 1500 | 1500 | 1500 |
0 | X0327611 | REM & REIN BRIC PAVER | 55 | SQ FT | 2341 | Builders Paving, LLC | 20 | 1100 | 1100 |
1 | X0327611 | REM & REIN BRIC PAVER | 55 | SQ FT | 3020 | J. A. Johnson Paving Company | 40 | 2200 | 2200 |
2 | X0327611 | REM & REIN BRIC PAVER | 55 | SQ FT | 280 | Peter Baker & Son Co. | 20 | 1100 | 1100 |
CodePudding user response:
Try this
import pandas as pd
data=open("filename.txt","r").read()
df = pd.DataFrame({"data": data.split()})
print(df)
CodePudding user response:
Read in the file, remove empty lines, create groups and split by them, remove the junk at the beginning, read in each group and create columns, remove commas from all fields except 2nd, convert columns that should be numeric to numeric and set the column names. We used a, b, c, ..., i as the column names but you can replace that with whatever you want. Note that this dynamically locates the fields rather than hard coding their locations.
library(magrittr)
library(data.table)
DF <- "data.txt" %>%
readLines %>%
Filter(nzchar, .) %>%
split(., cumsum(grepl("^ ", .))) %>%
tail(-1) %>%
lapply(function(x) {
read.table(text = gsub(" ", ";", paste0(x[-1], x[1])), fill = NA, sep = ";")
}) %>%
rbindlist(fill = TRUE) %>%
as.data.frame %>%
replace(., -2, lapply(.[-2], gsub, pattern = ",", replacement = "")) %>%
type.convert(as.is = TRUE) %>%
setNames(letters[1:9])
Note
Create input file for testing.
Lines <- " ITEM NBR ITEM DESCRIPTION UNIT OF UNIT BIDDER CALCULATED BIDR CALC
BIDR NBR BIDDER NAME QUANTITY MEASURE PRICE EXTENSION EXTENSION EXTENSION DIFF
X0320050 CONSTRUCTN LAYOUT SPL 1.000 L SUM
2341 Builders Paving, LLC 5,000.0000 5,000.00 5,000.00
3020 J. A. Johnson Paving Company 5,000.0000 5,000.00 5,000.00
0280 Peter Baker & Son Co. 1,500.0000 1,500.00 1,500.00
X0327611 REM & REIN BRIC PAVER 55.000 SQ FT
2341 Builders Paving, LLC 20.0000 1,100.00 1,100.00
3020 J. A. Johnson Paving Company 40.0000 2,200.00 2,200.00
0280 Peter Baker & Son Co. 20.0000 1,100.00 1,100.00"
writeLines(Lines, "data.txt")