I am trying to split a string into multiple strings (like observations).
For example, a sample text with 3 "bidder id" observations is:
BID RANK BID TOTAL BIDDER ID BIDDER INFORMATION (NAME/ADDRESS/LOCATION)
-------- ----------- --------- -------------------------------------------------
1 1,486,399.87 5 ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
00814766
P O BOX 883 FAX 909 386-1288
COLTON CA 92324
2 1,534,243.00 3 EXCEL PAVING COMPANY 562 599-5841 SB PREF CLAIMED
00688659
2230 LEMON AVENUE FAX 562 591-7485
LONG BEACH CA 90806
3 1,593,549.40 2 SECURITY PAVING COMPANY INC 818 767-8418 CC PREF CLAIMED
00116307
P O BOX 1489 FAX 818 767-3169
SUN VALLEY CA 91353-1489
The ultimate goal is to create a dataset that mimics this text document. The first step is to split this big string into multiple small strings. For example, the three small strings would look as follows:
Split string 1
1 1,486,399.87 5 ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
00814766
P O BOX 883 FAX 909 386-1288
COLTON CA 92324
Split string 2
2 1,534,243.00 3 EXCEL PAVING COMPANY 562 599-5841 SB PREF CLAIMED
00688659
2230 LEMON AVENUE FAX 562 591-7485
LONG BEACH CA 90806
Split string 3
3 1,593,549.40 2 SECURITY PAVING COMPANY INC 818 767-8418 CC PREF CLAIMED
00116307
P O BOX 1489 FAX 818 767-3169
SUN VALLEY CA 91353-1489
I started with the split pattern [\r\n]+\s, but unfortunately it splits on every newline, not just on a line that contains no other characters.
Code:
# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword
txt = """\
1 1,486,399.87 5 ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
00814766
P O BOX 883 FAX 909 386-1288
COLTON CA 92324
2 1,534,243.00 3 EXCEL PAVING COMPANY 562 599-5841 SB PREF CLAIMED
00688659
2230 LEMON AVENUE FAX 562 591-7485
LONG BEACH CA 90806
3 1,593,549.40 2 SECURITY PAVING COMPANY INC 818 767-8418 CC PREF CLAIMED
00116307
P O BOX 1489 FAX 818 767-3169
SUN VALLEY CA 91353-1489"""
# splits on every run of newline characters
p = re.split(r"[\r\n]+", txt)
But this splits the text at every newline. Is there a way to split only on a line that contains no other characters? Thank you so much!!
P.S. if you think I'm doing something wildly wrong or if there's a much simpler way to create a dataset - please let me know. Any help is appreciated. Thanks!!
CodePudding user response:
You can try re.findall with this pattern (regex101):
(?ms)^\s{,20}\d.*?(?=^\s{,20}\d|\Z)
import re
text = """\
BID RANK BID TOTAL BIDDER ID BIDDER INFORMATION (NAME/ADDRESS/LOCATION)
-------- ----------- --------- -------------------------------------------------
1 1,486,399.87 5 ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
00814766
P O BOX 883 FAX 909 386-1288
COLTON CA 92324
2 1,534,243.00 3 EXCEL PAVING COMPANY 562 599-5841 SB PREF CLAIMED
00688659
2230 LEMON AVENUE FAX 562 591-7485
LONG BEACH CA 90806
3 1,593,549.40 2 SECURITY PAVING COMPANY INC 818 767-8418 CC PREF CLAIMED
00116307
P O BOX 1489 FAX 818 767-3169
SUN VALLEY CA 91353-1489"""
groups = re.findall(r"(?ms)^\s{,20}\d.*?(?=^\s{,20}\d|\Z)", text)
for group in groups:
    print(group)
    print('-' * 80)
Prints:
1 1,486,399.87 5 ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
00814766
P O BOX 883 FAX 909 386-1288
COLTON CA 92324
--------------------------------------------------------------------------------
2 1,534,243.00 3 EXCEL PAVING COMPANY 562 599-5841 SB PREF CLAIMED
00688659
2230 LEMON AVENUE FAX 562 591-7485
LONG BEACH CA 90806
--------------------------------------------------------------------------------
3 1,593,549.40 2 SECURITY PAVING COMPANY INC 818 767-8418 CC PREF CLAIMED
00116307
P O BOX 1489 FAX 818 767-3169
SUN VALLEY CA 91353-1489
--------------------------------------------------------------------------------
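Note that \s{,20} only works while the continuation lines are indented by more than 20 characters. If that indentation is lost (for example after copy-pasting the text), lines such as 00814766 or 2230 LEMON AVENUE would also look like the start of a record. A minimal sketch of an alternative that keys on the shape of the first line instead, assuming every record starts with a rank, an amount with two decimals, and a bidder id (RECORD_START and split_records are names made up for this example):

```python
import re

# Sample input with the leading indentation stripped, as it might look
# after a copy-paste; the header lines are left out for brevity.
text = """\
1 1,486,399.87 5 ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
00814766
P O BOX 883 FAX 909 386-1288
COLTON CA 92324
2 1,534,243.00 3 EXCEL PAVING COMPANY 562 599-5841 SB PREF CLAIMED
00688659
2230 LEMON AVENUE FAX 562 591-7485
LONG BEACH CA 90806
"""

# Assumption: a record starts with "<rank> <amount with cents> <bidder id> ".
RECORD_START = re.compile(r"^[ \t]*\d+[ \t]+[\d,]+\.\d{2}[ \t]+\d+[ \t]", re.M)


def split_records(text):
    """Cut the text at every line that looks like the start of a record."""
    starts = [m.start() for m in RECORD_START.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


records = split_records(text)
for record in records:
    print(record)
    print("-" * 80)
```

Because the cut points come from the first-line pattern, this does not depend on blank lines or indentation, but it will miss records whose first line deviates from the assumed shape.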
CodePudding user response:
You can capture those blocks with (using the re.M flag):
(?=^[ \t]*(?:\d+[ \t]+[\d,.]+[ \t]+\d))([\s\S]+?)(?=(?:^[ \t]*\d+[ \t]+[\d,.]+[ \t]+\d)|\Z)
Or split like this and deal with the header by popping the first 2 lines off before the split:
re.split(r'(?:\r?\n){2,}', s)
Python demo:
s='''\
BID RANK BID TOTAL BIDDER ID BIDDER INFORMATION (NAME/ADDRESS/LOCATION)
-------- ----------- --------- -------------------------------------------------
1 1,486,399.87 5 ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
00814766
P O BOX 883 FAX 909 386-1288
COLTON CA 92324

2 1,534,243.00 3 EXCEL PAVING COMPANY 562 599-5841 SB PREF CLAIMED
00688659
2230 LEMON AVENUE FAX 562 591-7485
LONG BEACH CA 90806

3 1,593,549.40 2 SECURITY PAVING COMPANY INC 818 767-8418 CC PREF CLAIMED
00116307
P O BOX 1489 FAX 818 767-3169
SUN VALLEY CA 91353-1489'''
import re
# with re.split (requires a blank line between records):
print(
    "\n-------\n".join(re.split(r'(?:\r?\n){2,}', "\n".join(s.splitlines()[2:])))
)
# with re.findall:
print(
    "\n-------\n".join(re.findall(r'(?=^[ \t]*(?:\d+[ \t]+[\d,.]+[ \t]+))([\s\S]+?)(?=(?:^[ \t]*\d+[ \t]+[\d,.]+[ \t]+)|\Z)', s, flags=re.M))
)
Both methods print:
1 1,486,399.87 5 ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
00814766
P O BOX 883 FAX 909 386-1288
COLTON CA 92324
-------
2 1,534,243.00 3 EXCEL PAVING COMPANY 562 599-5841 SB PREF CLAIMED
00688659
2230 LEMON AVENUE FAX 562 591-7485
LONG BEACH CA 90806
-------
3 1,593,549.40 2 SECURITY PAVING COMPANY INC 818 767-8418 CC PREF CLAIMED
00116307
P O BOX 1489 FAX 818 767-3169
SUN VALLEY CA 91353-1489
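Since the end goal is a dataset, each block can then be turned into one row. A rough sketch, assuming the first line always has the form rank, total, bidder id, name, phone, optional note. FIRST_LINE, parse_block and the field names are hypothetical, and the remaining lines are kept unparsed:

```python
import re

# Assumed layout of the first line of a block:
# <rank> <total> <bidder id> <name> <phone as "ddd ddd-dddd"> <optional note>
FIRST_LINE = re.compile(
    r"^\s*(?P<rank>\d+)\s+(?P<total>[\d,]+\.\d{2})\s+(?P<bidder_id>\d+)\s+"
    r"(?P<name>.+?)\s+(?P<phone>\d{3}\s+\d{3}-\d{4})\s*(?P<note>.*)$"
)


def parse_block(block):
    """Turn one bidder block into a dict; raises if the first line is unexpected."""
    lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
    m = FIRST_LINE.match(lines[0])
    if m is None:
        raise ValueError(f"unexpected first line: {lines[0]!r}")
    row = m.groupdict()
    row["rank"] = int(row["rank"])
    row["total"] = float(row["total"].replace(",", ""))
    row["bidder_id"] = int(row["bidder_id"])
    # License number, address and fax lines are kept as-is for later parsing.
    row["extra_lines"] = lines[1:]
    return row


block = """\
1 1,486,399.87 5 ORTIZ ASPHALT PAVING INC 909 386-1200 SB PREF CLAIMED
00814766
P O BOX 883 FAX 909 386-1288
COLTON CA 92324"""

row = parse_block(block)
print(row)
```

A list of such dicts, one per block, can be passed to pandas.DataFrame to get a table with one row per bidder.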