Question about matching RE in a complicated form

Time:10-05

How can I use a regular expression to match a word in the following format: letter, digit, alphanumeric, dot (.), then zero to four alphanumerics?

Examples:

A24.L
A2F.L9
A2F.LG4

This is what I've come up with so far:

answer=re.findall(r'[A-Za-z]\d\w\.\w{0-4}', text)

CodePudding user response:

As you are using re.findall, I assume you are looking for partial matches inside longer text. Bearing that in mind, you need to fix the following:

  • \w matches not only alphanumeric chars, but also the _ char
  • {0-4} is not a valid limiting ("range", or "interval") quantifier; the syntax is {min,max}. Do not omit the min value: some regex engines default it to 0, but others either do not support the omission or handle it incorrectly
  • In Python 3, \d matches any Unicode digit (like ٠١٢٣٤٥٦٧٨٩۰۱۲۳۴۵۶۷۸۹߀߁߂߃߄߅߆߇߈߉०१२३४५६७८९০১২৩৪৫৬৭৮৯੦੧੨੩੪੫੬੭੮੯૦૧૨૩૪૫૬૭૮૯୦୧୨୩୪୫୬୭୮୯௦௧௨௩௪௫௬௭௮௯౦౧౨౩౪౫౬౭౮౯೦೧೨೩೪೫೬೭೮೯൦൧൨൩൪൫൬൭൮൯๐๑๒๓๔๕๖๗๘๙໐໑໒໓໔໕໖໗໘໙༠༡༢༣༤༥༦༧༨༩၀၁၂၃၄၅၆၇၈၉႐႑႒႓႔႕႖႗႘႙០១២៣៤៥៦៧៨៩᠐᠑᠒᠓᠔᠕᠖᠗᠘᠙᥆᥇᥈᥉᥊᥋᥌᥍᥎᥏᧐᧑᧒᧓᧔᧕᧖᧗᧘᧙᭐᭑᭒᭓᭔᭕᭖᭗᭘᭙᮰᮱᮲᮳᮴᮵᮶᮷᮸᮹᱀᱁᱂᱃᱄᱅᱆᱇᱈᱉᱐᱑᱒᱓᱔᱕᱖᱗᱘᱙꘠꘡꘢꘣꘤꘥꘦꘧꘨꘩꣐꣑꣒꣓꣔꣕꣖꣗꣘꣙꤀꤁꤂꤃꤄꤅꤆꤇꤈꤉꩐꩑꩒꩓꩔꩕꩖꩗꩘꩙0123456789), so you probably want to use (?a) inline modifier (to only match ASCII digits) or an explicit [0-9].
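
The three points above can be checked directly in Python. This is a minimal sketch; the sample strings are made up for illustration and are not from the question.

```python
import re

# \w also accepts the underscore, so "A2_.L" slips through.
underscore_hits = re.findall(r'[A-Za-z]\d\w\.\w{0,4}', 'A2_.L')
print(underscore_hits)  # ['A2_.L']

# {0-4} is not a quantifier: Python's re treats it as the literal
# text "{0-4}", so only the string that contains it verbatim matches.
literal_hits = re.findall(r'[A-Za-z]\d\w\.\w{0-4}', 'A24.L and A24.L{0-4}')
print(literal_hits)  # ['A24.L{0-4}']

# \d matches any Unicode decimal digit on str patterns in Python 3;
# the (?a) inline flag (re.ASCII) restricts it to 0-9.
mixed = '5\u0663'  # ASCII 5 followed by ARABIC-INDIC DIGIT THREE
unicode_hits = re.findall(r'\d', mixed)
ascii_hits = re.findall(r'(?a)\d', mixed)
print(len(unicode_hits), len(ascii_hits))  # 2 1
```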

So, you can use

answer=re.findall(r'\b[A-Za-z][0-9][A-Za-z0-9]\.[A-Za-z0-9]{1,4}\b', text)

if the alphanumeric after . is obligatory, and the following if the match can end in a dot:

answer=re.findall(r'\b[A-Za-z][0-9][A-Za-z0-9]\.[A-Za-z0-9]{0,4}(?<!\w\B)', text)

Details:

  • \b - word boundary
  • [A-Za-z] - a letter
  • [0-9] - an ASCII digit
  • [A-Za-z0-9] - an ASCII alphanumeric
  • \. - a . char
  • [A-Za-z0-9]{1,4}\b - one to four alphanumeric chars at the word boundary.

The second regex has no word boundary at the end because the match must be able to end in a . (which is not a word char). The (?<!\w\B) is a right-hand dynamic word boundary: it requires a non-word char or the end of the string only when the preceding char is a word char.
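
To compare the two patterns side by side, here is a small sketch. The sample text, and the variable names strict and lenient, are assumptions for illustration only.

```python
import re

# Hypothetical sample text: three codes from the question, one code
# ending in a dot, and one code glued to a preceding letter.
text = "Codes: A24.L, A2F.L9, A2F.LG4, B7C. and XA24.L"

# At least one alphanumeric is required after the dot.
strict = re.compile(r'\b[A-Za-z][0-9][A-Za-z0-9]\.[A-Za-z0-9]{1,4}\b')

# The match may end in the dot itself.
lenient = re.compile(r'\b[A-Za-z][0-9][A-Za-z0-9]\.[A-Za-z0-9]{0,4}(?<!\w\B)')

strict_hits = strict.findall(text)
lenient_hits = lenient.findall(text)

print(strict_hits)   # ['A24.L', 'A2F.L9', 'A2F.LG4']
print(lenient_hits)  # ['A24.L', 'A2F.L9', 'A2F.LG4', 'B7C.']
```

Neither pattern matches inside XA24.L, because the leading \b fails between X and A.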


CodePudding user response:

The best way to solve these types of problems is via an online regex checker. You were very close. Only a slight modification is required.

Try: [a-zA-Z][0-9]\w\.\w{0,4}
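
Note that this pattern has no word boundaries, so it can also match part of a longer token. A quick sketch of the difference, using a made-up sample string and the boundary-anchored pattern from the other answer for comparison:

```python
import re

sample = "XA24.LONGER"  # hypothetical input, not from the question

# Without boundaries, the pattern picks a fragment out of the token.
loose_hits = re.findall(r'[a-zA-Z][0-9]\w\.\w{0,4}', sample)
print(loose_hits)  # ['A24.LONG']

# With \b on both sides, the same token yields no match at all.
bounded_hits = re.findall(
    r'\b[A-Za-z][0-9][A-Za-z0-9]\.[A-Za-z0-9]{1,4}\b', sample
)
print(bounded_hits)  # []
```

Which behavior is wanted depends on whether the codes always appear as standalone words in the input.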
