Can anyone help me fix this issue? I am trying to extract the GSTIN/UIN from text.
import re

# None of these work
#GSTIN_REG = re.compile(r'^\d{2}([a-z?A-Z?0-9]){5}([a-z?A-Z?0-9]){4}([a-z?A-Z?0-9]){1}?[Z]{1}[A-Z\d]{1}$')
#GSTIN_REG = re.compile(r'[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[A-Z0-9]{1}Z{1}[A-Z0-9]{1}')
#GSTIN_REG = re.compile(r'^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[A-Z0-9]{1}[Z]{1}[A-Z0-9]{1}$')
GSTIN_REG = re.compile(r'^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[1-9A-Z]{1}Z[0-9A-Z]{1}$')
#GSTIN_REG = re.compile(r'19AISPJ4698P1ZX') #This works
#GSTIN_REG = re.compile(r'06AACCE2308Q1ZK') #This works
def extract_gstin(text):
    return re.findall(GSTIN_REG, text)

text = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'
print(extract_gstin(text))
CodePudding user response:
You might be able to simplify this and instead use re.findall to search for the key GSTIN, followed by a colon and then the value:
text = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'
gstin = re.findall(r'GSTIN\s*:\s*([A-Z0-9]+)', text)
print(gstin) # ['06AACCE2308Q1ZK']
CodePudding user response:
Your second commented-out pattern works because it is not anchored with ^ and $, so it can match inside a longer string. You can also omit {1}, as it is the default.
To make the pattern a bit more specific, you can add word boundaries \b to the left and right to prevent a partial word match.
If the value should come after GSTIN : you can use a capture group as well.
Example with the commented pattern:
import re
GSTIN_REG = re.compile(r'[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9]')
def extract_gstin(s):
    return re.findall(GSTIN_REG, s)

s = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'
print(extract_gstin(s))
Output
['06AACCE2308Q1ZK']
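The effect of the ^ and $ anchors can be checked directly. This is only a minimal sketch: the names anchored and unanchored are made up for illustration, and the character classes are copied from the active pattern in the question.

```python
import re

text = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'

# Anchored: ^ and $ require the whole string to be a GSTIN.
anchored = re.compile(r'^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]$')
# Unanchored: same character classes, allowed to match anywhere in the text.
unanchored = re.compile(r'[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]')

print(anchored.findall(text))               # []
print(unanchored.findall(text))             # ['06AACCE2308Q1ZK']
print(anchored.findall('06AACCE2308Q1ZK'))  # ['06AACCE2308Q1ZK']
```

So the anchored variants are probably better suited to validating a string that contains only the GSTIN, while the unanchored one is suited to extraction from longer text.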
A slightly more specific pattern (it produces the same output, because re.findall returns the value of the capture group when the pattern contains one):
\bGSTIN : ([0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9])\b
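As a rough sketch of how that pattern behaves, assuming the label is always written exactly as "GSTIN : " with single spaces (the name GSTIN_AFTER_LABEL is hypothetical):

```python
import re

# Pattern from the answer above: label, capture group, word boundaries.
GSTIN_AFTER_LABEL = re.compile(
    r'\bGSTIN : ([0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9])\b'
)

def extract_gstin(s):
    # With one capture group, findall returns only the captured values.
    return GSTIN_AFTER_LABEL.findall(s)

print(extract_gstin('Haryana, India, GSTIN : 06AACCE2308Q1ZK'))  # ['06AACCE2308Q1ZK']
# The trailing \b rejects a value that runs on into more word characters.
print(extract_gstin('GSTIN : 06AACCE2308Q1ZKX'))                 # []
# Without the label there is nothing to match.
print(extract_gstin('Ref 06AACCE2308Q1ZK'))                      # []
```

If the spacing around the colon varies in your data, replacing the literal spaces with \s* may be more forgiving.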