Unable to extract GSTIN using python regex

Time:02-18

Can anyone help me fix this issue? I am trying to extract a GSTIN/UIN from text.

import re

# None of these work
#GSTIN_REG = re.compile(r'^\d{2}([a-z?A-Z?0-9]){5}([a-z?A-Z?0-9]){4}([a-z?A-Z?0-9]){1}?[Z]{1}[A-Z\d]{1}$')
#GSTIN_REG = re.compile(r'[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[A-Z0-9]{1}Z{1}[A-Z0-9]{1}')
#GSTIN_REG = re.compile(r'^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[A-Z0-9]{1}[Z]{1}[A-Z0-9]{1}$')
GSTIN_REG = re.compile(r'^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z]{1}[1-9A-Z]{1}Z[0-9A-Z]{1}$')
    
    
#GSTIN_REG = re.compile(r'19AISPJ4698P1ZX') #This works
#GSTIN_REG = re.compile(r'06AACCE2308Q1ZK') #This works


def extract_gstin(text):
    return re.findall(GSTIN_REG, text)
text = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'
print(extract_gstin(text))

CodePudding user response:

You might be able to simplify this: use re.findall to search for the key GSTIN, followed by a colon, and capture the value after it:

import re

text = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'
gstin = re.findall(r'GSTIN\s*:\s*([A-Z0-9]+)', text)
print(gstin)  # ['06AACCE2308Q1ZK']

CodePudding user response:

Your second commented-out pattern (the one without the ^ and $ anchors) works, and you can omit {1} because matching once is the default.
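
To illustrate why the anchored variants return an empty list: ^ and $ tie the match to the start and end of the whole string, so they can only succeed when the text is nothing but the GSTIN. The sketch below compares the active pattern from the question (with the redundant {1} dropped) against the same pattern without anchors. The names ANCHORED and UNANCHORED are only illustrative.

```python
import re

text = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'

# Anchored: ^ and $ require the entire string to be a GSTIN.
ANCHORED = re.compile(r'^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]$')
# Unanchored: the same pattern may match anywhere inside the string.
UNANCHORED = re.compile(r'[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]')

anchored_result = ANCHORED.findall(text)
unanchored_result = UNANCHORED.findall(text)

print(anchored_result)    # []
print(unanchored_result)  # ['06AACCE2308Q1ZK']
```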

What you might do to make it a bit more specific is add word boundaries \b to the left and right to prevent a partial word match.
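
A small sketch of what the word boundaries change, assuming a made-up input in which a GSTIN-shaped run of characters sits inside a longer token. The names PLAIN and BOUNDED and the string 'ref X06AACCE2308Q1ZK99' are hypothetical, chosen only to show the difference.

```python
import re

PLAIN = re.compile(r'[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9]')
BOUNDED = re.compile(r'\b[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9]\b')

# Hypothetical input: the GSTIN-like characters are part of a longer token.
embedded = 'ref X06AACCE2308Q1ZK99'
# Input from the question: the GSTIN stands on its own.
standalone = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'

plain_embedded = PLAIN.findall(embedded)
bounded_embedded = BOUNDED.findall(embedded)
bounded_standalone = BOUNDED.findall(standalone)

print(plain_embedded)      # ['06AACCE2308Q1ZK'] (partial match inside the token)
print(bounded_embedded)    # [] (rejected, no word boundary on either side)
print(bounded_standalone)  # ['06AACCE2308Q1ZK']
```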

If the value must come right after GSTIN : you can match that label as well and put the value in a capture group.

Example with the commented pattern:

import re

GSTIN_REG = re.compile(r'[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9]')


def extract_gstin(s):
    return re.findall(GSTIN_REG, s)


s = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'
print(extract_gstin(s))

Output

['06AACCE2308Q1ZK']

A slightly more specific pattern (the output is the same, because re.findall returns the value of the capture group when the pattern contains one):

\bGSTIN : ([0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9])\b
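
As a rough usage sketch, the pattern above can be compiled and passed through re.findall in the same way as before. The name GSTIN_AFTER_LABEL and the second, label-free input string are assumptions made for illustration. Note that the pattern expects exactly one space on each side of the colon; using \s*:\s* as in the other answer would be more tolerant of spacing.

```python
import re

# Matches the value only when it directly follows the literal label "GSTIN : ".
GSTIN_AFTER_LABEL = re.compile(
    r'\bGSTIN : ([0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]Z[A-Z0-9])\b'
)

s = 'Haryana, India, GSTIN : 06AACCE2308Q1ZK'
labelled = GSTIN_AFTER_LABEL.findall(s)
print(labelled)  # ['06AACCE2308Q1ZK'] (only the captured group is returned)

# Hypothetical input without the label: the pattern no longer matches.
unlabelled = GSTIN_AFTER_LABEL.findall('Haryana, India, 06AACCE2308Q1ZK')
print(unlabelled)  # []
```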
