I am parsing PDF files using Tika and then converting the output to a text file. This gives me a large text file with many spaces in between the lines. After reading in the text file, I get a list like below:
['1 ARBOR HILLS WATER WORKS, CORP BEQ-5264-16-PW $1,700.00 LEW',
'2 COFFEE LAB ROASTERS, INC BEQ-5456-16-AP $900.00 TTN',
'3 HAZEN & SAWYER, P.C BEQ-5277-17-PW $1,500.00 HAR',
'4',
'JOHN PUFF',
'CONSTRUCTION',
'COMPANY, INC',
'OEHRC-611-16-PBS $2,500.00 ELM',
'5 HORIZON OWNERS, CORP OEHRC-601-17-PBS $2,450.00 YON',
'6',
'ONE FRANKLIN',
'OWNERS, CORP. c/o',
'BENCHMARK LM',
'MANAGEMENT, SVCS',
'OEHRC-1204-17-PBS $250.00 WHP',
'7 TACO PROJECTS, INC/ THE TACO PROJECT PHP-6717-16-FSE $3,000.00 TTN',
'8 NUTTIN TO IT CATERING, LLC PHP-6801-16-FSE $2,000.00 YTN',
'9',
'JET ENTERPRISES',
'M13, LLC/ MOE’S',
'SOUTHWEST GRILL',
'PHP-6811-16-FSE $2,250.00 YTN',
'10',
'TWO MEN & A LADY,',
'INC/ GORDON’S DELI',
'CAFÉ',
'PHP-6816-16-FSE $1,300.00 CRO',
'11 EJG LEGACY, CORP/ LICENSE 2 GRILL PHP-6827-17-FSE $1,450.00 MTP']
However, I want to get a list where each string starts with a number and contains the rest of that record's information, like below, but I can't figure out how to do this.
['1 ARBOR HILLS WATER WORKS, CORP BEQ-5264-16-PW $1,700.00 LEW',
'2 COFFEE LAB ROASTERS, INC BEQ-5456-16-AP $900.00 TTN',
'3 HAZEN & SAWYER, P.C BEQ-5277-17-PW $1,500.00 HAR',
'4 JOHN PUFF CONSTRUCTION COMPANY, INC OEHRC-611-16-PBS $2,500.00 ELM',
'5 HORIZON OWNERS, CORP OEHRC-601-17-PBS $2,450.00 YON',
'6 ONE FRANKLIN OWNERS, CORP. c/o BENCHMARK LM MANAGEMENT, SVCS OEHRC-1204-17-PBS $250.00 WHP',
'7 TACO PROJECTS, INC/ THE TACO PROJECT PHP-6717-16-FSE $3,000.00 TTN',
'8 NUTTIN TO IT CATERING, LLC PHP-6801-16-FSE $2,000.00 YTN',
'9 JET ENTERPRISES M13, LLC/ MOE’S SOUTHWEST GRILL PHP-6811-16-FSE $2,250.00 YTN',
'10 TWO MEN & A LADY INC/ GORDON’S DELI CAFÉ PHP-6816-16-FSE $1,300.00 CRO',
'11 EJG LEGACY, CORP/ LICENSE 2 GRILL PHP-6827-17-FSE $1,450.00 MTP']
The ultimate goal is to convert this to a dataframe.
id name code amount area
1 'ARBOR HILLS WATER WORKS, CORP' 'BEQ-5264-16-PW' $1,700.00 'LEW'
2 'COFFEE LAB ROASTERS, INC' 'BEQ-5456-16-AP' $900.00 'TTN'
3 'HAZEN & SAWYER, P.C' 'BEQ-5277-17-PW' $1,500.00 'HAR'
I don't think it'll be too hard to get it into the dataframe format, but I can't figure out how to get the list into the format I need first. Thanks
CodePudding user response:
Well, the basic philosophy is to create a new list, copy the items, and if a new item doesn't start with a digit, append it to the previous item.
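newlist = []
for row in oldlist:
    if row[0].isdigit():
        # A leading digit marks the start of a new record.
        newlist.append(row)
    else:
        # Otherwise this is a continuation; join it to the previous record.
        newlist[-1] += ' ' + row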
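Once the rows are merged, splitting each one into the columns from the question could be sketched with a regular expression. This is only a sketch under assumptions that are not stated in the question: that every code looks like `BEQ-5264-16-PW` (letters, two numeric groups, letters), that every amount starts with `$` and has two decimals, and that every area is three capital letters. The names `ROW_PATTERN` and `parse_rows` are made up for illustration.

```python
import re

# Assumed row layout: "<id> <name> <code> <amount> <area>".
# The name is matched lazily so it stops right before the code.
ROW_PATTERN = re.compile(
    r"^(?P<id>\d+)\s+"
    r"(?P<name>.+?)\s+"
    r"(?P<code>[A-Z]+-\d+-\d+-[A-Z]+)\s+"
    r"(?P<amount>\$[\d,]+\.\d{2})\s+"
    r"(?P<area>[A-Z]{3})$"
)


def parse_rows(rows):
    """Turn merged row strings into a list of dicts, one per record."""
    records = []
    for row in rows:
        match = ROW_PATTERN.match(row.strip())
        if match is None:
            # Skip rows that do not fit the assumed layout.
            continue
        record = match.groupdict()
        record["id"] = int(record["id"])
        records.append(record)
    return records


# Two merged rows taken from the question's desired output.
merged = [
    "1 ARBOR HILLS WATER WORKS, CORP BEQ-5264-16-PW $1,700.00 LEW",
    "4 JOHN PUFF CONSTRUCTION COMPANY, INC OEHRC-611-16-PBS $2,500.00 ELM",
]
records = parse_rows(merged)
```

If pandas is installed, `pd.DataFrame(records)` should give a dataframe with the `id`, `name`, `code`, `amount` and `area` columns. Rows that don't match are silently skipped here, so it is worth comparing `len(records)` against the number of merged rows to spot any the pattern missed.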