Good evening,
I am converting a PDF into CSV using Python, and I am using a regex to extract the information.
After extracting the text from the PDF, the raw text can look like this:
Account Transaction Details
Twin Account 123-456-789-1
Date Description Withdrawals Deposits Balance
01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78
mBK-4653112690
03 Jan Inward Credit-QUICK 3,000.84 123,456.78
WIRE OTHR
ANTON HARLEY
Other
03 Jan Funds Trf - SPEED 3,500.00 123,345.78
PIB8452145632845963
Abricot 480
OTHR Transfer
I used a RegEx [0-3]{1}[0-9]{1}\s[A-Z]{1}[a-z]{2}\s[?A-Za-z]{1,155}
and managed to get the needed transactions:
01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78
03 Jan Inward Credit-QUICK 3,000.84 123,456.78
03 Jan Funds Trf - SPEED 3,500.00 123,345.78
However, the additional information between the matches was dropped, because I split the text on \n
and then ran the regex on each line.
How do I write the code so that I keep the additional information that sits between the regex matches, with each piece attached to the previous match? This is my envisaged output:
01 Jan BALANCE B/F 123,456.78
03 Jan Funds Transfer 195.04 123,456.78 mBK-4653112690
03 Jan Inward Credit-QUICK 3,000.84 123,456.78 OTHR ANTON HARLEY Other
03 Jan Funds Trf - SPEED 3,500.00 123,345.78 PIB8452145632845963 Abricot 480 OTHR Transfer
Edit:
I have adapted @dcsuka's solution and now get the following:
06 Jan Debit-Consumer 12.60 123,456.78 SNIP AVENU13568100 4265884035605848
06 Jan Inward DR - 828.24 123,456.78 SHIP G12345HUJ ITX
07 Jan Funds Transfer 50.00 123,456.78 Pleasenotethatyouareboundbyadutyundertherulesgoverningtheoperationofthisaccount,tochecktheentriesintheabovestatement. Ifyoudonotnotifyusinwritingofanyerrors, omissionsorunauthoriseddebitswithinfourteen(14)daysofthisstatement,theentriesaboveshallbedeemedvalid,correct,accurateandconclusivelybindinguponyou,andyoushallhaveno claim against the bank in relation thereto. XYZ Ltd • 80 QuincyPlace ABC Plaza XXX 12345 • Co. Reg. No. 1234567890Z • GST Reg. No. YY-8121234-2 • www.xyzabc.com
07 Jan Inward CR - SPEED 9,092.06 123,456.78 SALAD SALAS Payment CARL QWE 817264950
How do I remove the excess text ("Pleasenotethatyouareboundbyadut...")? The only pattern I can see is that it contains a very long word (probably more than 20 characters). Is that the way to go?
Answer:
You can split the string only at newlines that are followed by a number, using a positive lookahead. Each chunk then holds a transaction line together with the detail lines after it, which is closer to your expected output:
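import re
# text1 is the raw text extracted from the PDF
split_text = re.split(r"\n(?=\d{1,3}\s)", text1)
# normalise whitespace and drop chunks (such as the header) that do not start with a two-digit day
[" ".join(i.split()) for i in split_text if re.search(r"^\d\d\s", i)]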
# ['01 Jan BALANCE B/F 123,456.78',
# '03 Jan Funds Transfer 195.04 123,456.78 mBK-4653112690',
# '03 Jan Inward Credit-QUICK 3,000.84 123,456.78 WIRE OTHR ANTON HARLEY Other',
# '03 Jan Funds Trf - SPEED 3,500.00 123,345.78 PIB8452145632845963 Abricot 480 OTHR Transfer']
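As a follow-up to the edit in the question, here is a minimal sketch of a line-by-line variant that also tries the "very long word" idea for the footer text. It is only a sketch under assumptions: the names START, group_transactions and strip_footer are made up for illustration, the month is assumed to be a three-letter abbreviation such as "Jan", and the footer is assumed to begin with a run-together token longer than any genuine reference, so everything from that token onwards is discarded.

```python
import re

# Assumed start of a transaction line: two-digit day, a space,
# then a three-letter month abbreviation such as "Jan".
START = re.compile(r"^\d{2}\s[A-Z][a-z]{2}\s")


def group_transactions(text):
    """Attach each detail line to the transaction line that precedes it."""
    records = []
    for raw_line in text.splitlines():
        line = " ".join(raw_line.split())  # normalise whitespace
        if not line:
            continue
        if START.match(line):
            records.append(line)        # a new transaction begins here
        elif records:
            records[-1] += " " + line   # detail line: tag onto the previous match
        # lines before the first transaction (page headers) are skipped
    return records


def strip_footer(record, max_len=20):
    """Cut a record at the first token longer than max_len characters.

    max_len=20 is a guess taken from the question; tune it for real statements.
    """
    kept = []
    for token in record.split():
        if len(token) > max_len:
            break  # assumed to be the start of the footer text
        kept.append(token)
    return " ".join(kept)


# Small made-up sample in the same shape as the statement text in the question.
sample = """Date Description Withdrawals Deposits Balance
03 Jan Funds Transfer 195.04 123,456.78
mBK-4653112690
07 Jan Funds Transfer 50.00 123,456.78
Pleasenotethatyouareboundbyadutyundertherules claim against the bank
"""

for row in group_transactions(sample):
    print(strip_footer(row))
```

Requiring the month after the day is slightly stricter than splitting on any leading number, so a detail line that happens to start with digits is less likely to be taken for a new transaction. The 20-character threshold is tight: the longest genuine token in the sample data, PIB8452145632845963, is already 19 characters, so a larger value may be safer if the real references can be longer.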