Extract words from a string with specific conditions

Time:02-16

I'm extracting the users who are tagged in messages. The usernames contain digits, so I need to extract words from a long string that meet the following conditions:

  • it has to be 10 to 14 characters long
  • it can start with an @, but that is optional (@ is the only special character allowed, and if the word contains it, it has to be the first character)
  • it can contain digits and letters
  • it can consist only of digits, but it can't consist only of letters
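The conditions above can also be written as a plain predicate on a single word, which is handy for checking any regular expression against them. This is only a sketch: `is_candidate` is a hypothetical helper, and it assumes that "characters" means ASCII letters and that the 10 to 14 length applies to the part after the optional @.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)


def is_candidate(word: str) -> bool:
    """Return True if `word` meets the four conditions (hypothetical helper)."""
    # Strip one optional leading @; any other @ is rejected below.
    body = word[1:] if word.startswith("@") else word
    # Assumption: the 10-14 length applies to the part after the optional @.
    if not 10 <= len(body) <= 14:
        return False
    # Only ASCII letters and digits are allowed.
    if any(ch not in ALNUM for ch in body):
        return False
    # Digits only is fine, letters only is not: require at least one digit.
    return any(ch in string.digits for ch in body)
```

For instance, `is_candidate("@THYSSEN1145")` is true, while `is_candidate("THY@SSEN1145")` and `is_candidate("123456789")` are false.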

Example:

str = "I have a pretty nice gaming experience with the user: @THYSSEN1145 and his brother THYSSEN1146. 
His username was first THY@SSEN1145, his brother's was 1234567891011. I played with them 123456789 times up until this point. "

Words that the regular expression should extract:

@THYSSEN1145
THYSSEN1146
1234567891011

CodePudding user response:

You might use

(?<!\S)@?(?=[A-Za-z\d]{10,14}\b)[A-Za-z]*\d[A-Za-z\d]*
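  • (?<!\S) Assert a whitespace boundary to the left (whitespace or the start of the string)
  • @? Match an optional @
  • (?=[A-Za-z\d]{10,14}\b) Assert that the next 10 to 14 characters are letters or digits (the optional @ is not counted), followed by a word boundary
  • [A-Za-z]*\d[A-Za-z\d]* Match optional letters, then a digit, then optional letters or digits, so the match contains at least one digit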
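The same breakdown can be kept next to the pattern itself by compiling it with `re.VERBOSE`, which ignores whitespace and allows `#` comments inside the pattern. A minimal sketch; the name `USERNAME` is just an illustrative choice:

```python
import re

USERNAME = re.compile(
    r"""
    (?<!\S)                   # whitespace or start of string to the left
    @?                        # optional leading @
    (?=[A-Za-z\d]{10,14}\b)   # 10 to 14 letters/digits, then a word boundary
    [A-Za-z]*                 # any leading letters
    \d                        # one required digit
    [A-Za-z\d]*               # remaining letters or digits
    """,
    re.VERBOSE,
)

matches = USERNAME.findall("user: @THYSSEN1145 and his brother THYSSEN1146.")
```

Here `matches` should hold the two usernames, with the trailing period left out because of the word boundary.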


import re

# Optional @, then 10 to 14 letters/digits that include at least one digit
pattern = r"(?<!\S)@?(?=[A-Za-z\d]{10,14}\b)[A-Za-z]*\d[A-Za-z\d]*"

s = ("I have a pretty nice gaming experience with the user: @THYSSEN1145 and his brother THYSSEN1146. \n"
     "His username was first THY@SSEN1145, his brother's was 1234567891011. I played with them 123456789 times up until this point.")

# findall returns every non-overlapping match as a list of strings
print(re.findall(pattern, s))

Output

['@THYSSEN1145', 'THYSSEN1146', '1234567891011']
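It can be worth probing the pattern at the edges of the length rule as well. The sketch below runs the same pattern over a few made-up words (the test words are illustrative, not from the question). Note that the @ is not counted toward the 10 to 14 characters, so if it should count, the lookahead would need adjusting.

```python
import re

pattern = r"(?<!\S)@?(?=[A-Za-z\d]{10,14}\b)[A-Za-z]*\d[A-Za-z\d]*"

cases = [
    "ABCDEFGH1",        # 9 characters: too short
    "ABCDEFGHI1",       # 10 characters: lower bound
    "ABCDEFGHIJKLM1",   # 14 characters: upper bound
    "ABCDEFGHIJKLMN1",  # 15 characters: too long
    "ABCDEFGHIJ",       # letters only
    "@ABCDEFGHIJKLM1",  # @ plus 14 characters
    "x@ABCDEFGHI1",     # @ not at the start of the word
]

# Map each test word to the list of matches found in it
results = {case: re.findall(pattern, case) for case in cases}

for case, found in results.items():
    print(f"{case!r:20} -> {found}")
```

Only the 10-character word, the 14-character word, and the @-prefixed word should produce a match; the other entries map to an empty list.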