Combine strings in list that don't start with a number until there is a string with a number?

I am parsing PDF files using Tika and then converting the output to a txt file. This gives me a large txt file with many spaces in between the lines. After reading in the text file, I get a list like the one below:

 ['1 ARBOR HILLS WATER WORKS, CORP BEQ-5264-16-PW $1,700.00 LEW',
 '2 COFFEE LAB ROASTERS, INC BEQ-5456-16-AP $900.00 TTN',
 '3 HAZEN & SAWYER, P.C BEQ-5277-17-PW $1,500.00 HAR',
 '4',
 'JOHN PUFF',
 'CONSTRUCTION',
 'COMPANY, INC',
 'OEHRC-611-16-PBS $2,500.00 ELM',
 '5 HORIZON OWNERS, CORP OEHRC-601-17-PBS $2,450.00 YON',
 '6',
 'ONE FRANKLIN',
 'OWNERS, CORP. c/o',
 'BENCHMARK LM',
 'MANAGEMENT, SVCS',
 'OEHRC-1204-17-PBS $250.00 WHP',
 '7 TACO PROJECTS, INC/ THE TACO PROJECT PHP-6717-16-FSE $3,000.00 TTN',
 '8 NUTTIN TO IT CATERING, LLC PHP-6801-16-FSE $2,000.00 YTN',
 '9',
 'JET ENTERPRISES',
 'M13, LLC/ MOE’S',
 'SOUTHWEST GRILL',
 'PHP-6811-16-FSE $2,250.00 YTN',
 '10',
 'TWO MEN & A LADY,',
 'INC/ GORDON’S DELI',
 'CAFÉ',
 'PHP-6816-16-FSE $1,300.00 CRO',
 '11 EJG LEGACY, CORP/ LICENSE 2 GRILL PHP-6827-17-FSE $1,450.00 MTP']

However, I want a list in which every string starts with a number and contains the rest of that record's information, like the one below, but I can't figure out how to do this.

['1 ARBOR HILLS WATER WORKS, CORP BEQ-5264-16-PW $1,700.00 LEW',
 '2 COFFEE LAB ROASTERS, INC BEQ-5456-16-AP $900.00 TTN',
 '3 HAZEN & SAWYER, P.C BEQ-5277-17-PW $1,500.00 HAR',
 '4 JOHN PUFF CONSTRUCTION COMPANY, INC OEHRC-611-16-PBS $2,500.00 ELM',
 '5 HORIZON OWNERS, CORP OEHRC-601-17-PBS $2,450.00 YON',
 '6 ONE FRANKLIN OWNERS, CORP. c/o BENCHMARK LM MANAGEMENT, SVCS OEHRC-1204-17-PBS $250.00 WHP',
 '7 TACO PROJECTS, INC/ THE TACO PROJECT PHP-6717-16-FSE $3,000.00 TTN',
 '8 NUTTIN TO IT CATERING, LLC PHP-6801-16-FSE $2,000.00 YTN',
 '9 JET ENTERPRISES M13, LLC/ MOE’S SOUTHWEST GRILL PHP-6811-16-FSE $2,250.00 YTN',
 '10 TWO MEN & A LADY, INC/ GORDON’S DELI CAFÉ PHP-6816-16-FSE $1,300.00 CRO',
 '11 EJG LEGACY, CORP/ LICENSE 2 GRILL PHP-6827-17-FSE $1,450.00 MTP']

The ultimate goal is to convert this to a dataframe.

id         name                         code             amount       area
1   'ARBOR HILLS WATER WORKS, CORP'  'BEQ-5264-16-PW'   $1,700.00     'LEW'
2   'COFFEE LAB ROASTERS, INC'       'BEQ-5456-16-AP'   $900.00       'TTN'
3   'HAZEN & SAWYER, P.C'            'BEQ-5277-17-PW'   $1,500.00     'HAR'

I don't think it'll be too hard to get the data into the dataframe format, but I can't figure out how to build the list I need to get there. Thanks.

CodePudding user response：

Well, the basic philosophy is to create a new list, copy the items, and if a new item doesn't start with a digit, append it to the previous item.

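newlist = []
for row in oldlist:
    if row[0].isdigit():
        # A row that starts with a digit begins a new record.
        newlist.append(row)
    else:
        # Anything else continues the previous record.
        newlist[-1] += ' ' + row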
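
For the dataframe step, here is a rough sketch of the same merge plus a split into fields. It is only one possible approach and rests on a few assumptions: the names merge_rows, parse_row and ROW_RE are made up for this example, blank lines are skipped, no continuation line starts with a digit, and every merged row ends with a code, a dollar amount and an area token in that order.

```python
import re

def merge_rows(rows):
    """Join continuation lines onto the preceding numbered row."""
    merged = []
    for row in rows:
        row = row.strip()
        if not row:
            continue  # skip blank lines left over from the text extraction
        if row[0].isdigit() or not merged:
            # Starts a new record (a leading non-numbered row is kept as-is).
            merged.append(row)
        else:
            merged[-1] += ' ' + row
    return merged

# Assumed layout: id, name, code, amount like $1,700.00, area.
ROW_RE = re.compile(r'^(\d+)\s+(.*?)\s+(\S+)\s+(\$[\d,]+\.\d{2})\s+(\S+)$')

def parse_row(row):
    """Return a dict of fields, or None if the row does not fit the layout."""
    m = ROW_RE.match(row)
    if m is None:
        return None
    id_, name, code, amount, area = m.groups()
    return {'id': int(id_), 'name': name, 'code': code,
            'amount': amount, 'area': area}

sample = [
    '3 HAZEN & SAWYER, P.C BEQ-5277-17-PW $1,500.00 HAR',
    '4',
    'JOHN PUFF',
    'CONSTRUCTION',
    'COMPANY, INC',
    'OEHRC-611-16-PBS $2,500.00 ELM',
    '',
    '5 HORIZON OWNERS, CORP OEHRC-601-17-PBS $2,450.00 YON',
]

merged = merge_rows(sample)
records = [parse_row(r) for r in merged]
for rec in records:
    print(rec)
```

If pandas is installed, pandas.DataFrame(records) should turn the list of dicts into the table. Rows that come back as None did not match the assumed layout and would need to be checked by hand first.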