I need to extract all the characters that come before a certain date using regex.
I tried something like this before, since I knew the pattern:
(\d{7,8})|([A-Za-z0-9\/]{12})|([0-9\/-]{8,9})
As and when I receive new invoices, the PDFs contain different invoice number formats. One thing that is certain is that the invoice number is followed by an invoice date in the format DD/MM/YYYY.
So I need all the data before this date.
Sample Data
91504458 26/04/2022
TYRES/REEXPORT 04/07/2022
TYRES/RE-EXPORT 23/09/2022
SAM0112/2022 23/05/2021
020/22-23 17/02/2022
SAM0141/2022 19/03/1975
91/22-23 01/01/2022
SAM0159/2022 15/08/2021
111/22-23 09/09/2021
SAM0106/2022 09/09/2022
017/2022 08/08/2022
91/22-23 07/07/2022
Expected Output Data
91504458
TYRES/REEXPORT
TYRES/RE-EXPORT
SAM0112/2022
020/22-23
SAM0141/2022
91/22-23
SAM0159/2022
111/22-23
SAM0106/2022
017/2022
91/22-23
Appreciate your feedback on the same.
Regards, Manjesh
CodePudding user response:
You could use word boundaries and an alternation to list the allowed formats, capturing them in group 1, followed by a match for the date format that comes after the invoice number.
\b([a-zA-Z]+(?:[/-][A-Za-z]+)+|\d{7,8}|(?:[a-zA-Z]+\d+|\d+\/\d\d-?\d\d)(?:/\d{4})?)\s+\d\d/\d\d/\d{4}\b
The pattern matches:
\b  A word boundary to prevent a partial word match
(  Capture group 1
    [a-zA-Z]+  Match 1+ ASCII letters
    (?:[/-][A-Za-z]+)+  Repeat 1+ times - or / and again 1+ letters
    |  Or
    \d{7,8}  Match 7-8 digits
    |  Or
    (?:  Non capture group
        [a-zA-Z]+\d+  Match 1+ letters and 1+ digits
        |  Or
        \d+\/\d\d-?\d\d  Match 1+ digits, / and then 2 digits, optional - and 2 digits
    )  Close non capture group
    (?:/\d{4})?  Optionally match / and 4 digits
)  Close group 1
\s+\d\d/\d\d/\d{4}  Match 1+ whitespace chars and a date like format
\b  A word boundary
See a regex demo.
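As a rough sketch of how this pattern could be applied, here is a small Python example run against the sample data from the question. It assumes the `+` quantifiers after each character class and after `\s`, and the names `PATTERN`, `SAMPLE` and `extract_invoice_numbers` are made up for illustration.

```python
import re

# The pattern from this answer, split over two string literals for readability.
# Group 1 captures the invoice number; the date itself is matched but not captured.
PATTERN = re.compile(
    r"\b([a-zA-Z]+(?:[/-][A-Za-z]+)+|\d{7,8}|(?:[a-zA-Z]+\d+|\d+/\d\d-?\d\d)(?:/\d{4})?)"
    r"\s+\d\d/\d\d/\d{4}\b"
)

# Sample rows copied from the question.
SAMPLE = """\
91504458 26/04/2022
TYRES/REEXPORT 04/07/2022
TYRES/RE-EXPORT 23/09/2022
SAM0112/2022 23/05/2021
020/22-23 17/02/2022
SAM0141/2022 19/03/1975
91/22-23 01/01/2022
SAM0159/2022 15/08/2021
111/22-23 09/09/2021
SAM0106/2022 09/09/2022
017/2022 08/08/2022
91/22-23 07/07/2022
"""


def extract_invoice_numbers(text: str) -> list[str]:
    # With exactly one capture group, findall returns the group 1 values.
    return PATTERN.findall(text)


for number in extract_invoice_numbers(SAMPLE):
    print(number)
```

Because the alternation only lists the formats seen so far, a new invoice number format would need another alternative added to group 1.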
CodePudding user response:
Assuming that the dates would always end each row, you could try doing a regex replacement:
Find: \s*\b\d{2}/\d{2}/\d{4}$
Replace: (empty)
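A minimal Python sketch of this find-and-replace, assuming each row is handled as a separate string with the date at its very end. The names `TRAILING_DATE` and `strip_trailing_date` are only illustrative.

```python
import re

# Optional whitespace followed by a DD/MM/YYYY date at the end of the string.
TRAILING_DATE = re.compile(r"\s*\b\d{2}/\d{2}/\d{4}$")


def strip_trailing_date(line: str) -> str:
    # Replace the trailing date (and the whitespace before it) with nothing.
    return TRAILING_DATE.sub("", line)


rows = [
    "91504458 26/04/2022",
    "TYRES/RE-EXPORT 23/09/2022",
    "017/2022 08/08/2022",
]
print([strip_trailing_date(row) for row in rows])
# ['91504458', 'TYRES/RE-EXPORT', '017/2022']
```

Since this approach does not depend on the shape of the invoice number, it should keep working when new invoice number formats appear, as long as the date stays at the end of the row.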