extract names from string

Time:11-08

I have a string like this:

s="(2021-07-29 01:00:00 AM BST)  
---  
peter.j.matthew has joined the conversation  
  
  

(2021-07-29 01:00:00 AM BST)  
---  
john cheung has joined the conversation  
  
  


(2021-07-29 01:11:19 AM BST)  
---  
allen.p.jonas  
Hi, james  
  
  
(2021-07-30 12:51:16 AM BST)  
---  
karren wenda  
how're you ? 
  
  
  
---  
  
* * *"

I want to extract the names as:

names_list= ['allen.p.jonas','karren wenda']

What I have tried:

names_list=re.findall(r'--- [\S\n](\D [\S\n])',s)

CodePudding user response:

The rules for where a name ends are not entirely clear from the question. This answer assumes that the name is on the first line after ---, which in turn appears after the timestamp. If that line also contains "has joined the conversation", then we only capture the text leading up to that phrase.

names = re.findall(r'\(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [AP]M [A-Z]{3}\)\s+---\s+(.*?)(?:\s+has joined the conversation)?[ ]*\r?\n', s)
print(names)  # ['peter.j.matthew', 'john cheung', 'allen.p.jonas', 'karren wenda']
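As a minimal sketch of how this timestamp-anchored approach could also produce the two-name list asked for in the question: the optional phrase can be turned into a capturing group, so each match records whether it was a join notice. The shortened sample string and the names `pattern`, `all_names` and `posters` are assumptions made for illustration, not part of the original answer.

```python
import re

# Shortened stand-in for the string in the question (trailing spaces kept).
s = "\n".join([
    "(2021-07-29 01:00:00 AM BST)  ",
    "---  ",
    "peter.j.matthew has joined the conversation  ",
    "  ",
    "(2021-07-29 01:11:19 AM BST)  ",
    "---  ",
    "allen.p.jonas  ",
    "Hi, james  ",
    "  ",
    "(2021-07-30 12:51:16 AM BST)  ",
    "---  ",
    "karren wenda  ",
    "how're you ? ",
    "  ",
    "---  ",
    "  ",
    "* * *",
])

# Same idea as above, but the join phrase is captured as group 2,
# which is None whenever the phrase is absent.
pattern = re.compile(
    r"\(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [AP]M [A-Z]{3}\)\s+---\s+"
    r"(.*?)(\s+has joined the conversation)?[ ]*\r?\n"
)

matches = list(pattern.finditer(s))
all_names = [m.group(1) for m in matches]
posters = [m.group(1) for m in matches if m.group(2) is None]

print(all_names)  # ['peter.j.matthew', 'allen.p.jonas', 'karren wenda']
print(posters)    # ['allen.p.jonas', 'karren wenda']
```

Filtering on group 2 keeps a single pattern for both use cases; whether that is preferable to a dedicated pattern depends on how regular the real data is.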

CodePudding user response:

Supposing you want to match names that are not followed by "has joined the conversation":

name_pattern = re.compile(r'---\s*\n(\w(?:[\w\. ](?!has joined the conversation))*?)\s*\n', re.MULTILINE)
print(re.findall(name_pattern, s))

Explanation:

  • ---\s*\n matches the dashes, possibly followed by whitespace, and a required newline

  • Then comes our matching group composed of:

    • \w the name starts with a 'word' character (a-z, A-Z, 0-9 or _)
    • (?:[\w\. ](?!has joined the conversation))*? a non-capturing group that repeats \w, . or a space, each of which must not be followed by "has joined the conversation". The *? makes the repetition lazy instead of greedy, so the capture stops as soon as the rest of the pattern (\s*\n, optional whitespace and a newline) can match.

Output:

['allen.p.jonas', 'karren wenda']
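To see the negative lookahead at work, the pattern can be tried on one join notice and one ordinary message. This is only a small sketch; the two test strings `joined` and `posted` are made up for illustration and mirror the layout of the question's data.

```python
import re

name_pattern = re.compile(
    r'---\s*\n(\w(?:[\w\. ](?!has joined the conversation))*?)\s*\n',
    re.MULTILINE,
)

# A join notice: the space before "has joined the conversation" is rejected
# by the lookahead, so the name can never reach the end of the line.
joined = "---  \njohn cheung has joined the conversation  \n  \n"

# An ordinary message: the name is followed only by spaces and a newline.
posted = "---  \nallen.p.jonas  \nHi, james  \n"

print(name_pattern.findall(joined))  # []
print(name_pattern.findall(posted))  # ['allen.p.jonas']
```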

CodePudding user response:

If you only want to match ['allen.p.jonas','karren wenda'], you can require a non-whitespace char at the start of the line that follows the name:

^---[^\S\n]*\n(\S.*?)[^\S\r\n]*\n\S

The pattern matches:

  • ^ Start of a line (the pattern is used with re.M)
  • --- Match ---
  • [^\S\n]*\n Match optional spaces and a newline
  • (\S.*?) Capture group 1 (returned by re.findall): match a non-whitespace char followed by as few chars as possible
  • [^\S\r\n]* Match optional whitespace chars without a newline
  • \n\S Match a newline and a non whitespace char


For example

print(re.findall(r"^---[^\S\n]*\n(\S.*?)[^\S\r\n]*\n\S", s, re.M))

Output

['allen.p.jonas', 'karren wenda']
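The breakdown above can also be kept next to the pattern itself with the re.X (verbose) flag, which ignores whitespace and allows comments inside the regex. The following is a sketch under that assumption; `poster_pattern` and the shortened `sample` string are illustrative names, not part of the original answer.

```python
import re

# Same pattern as above, written in verbose mode so each part is annotated.
poster_pattern = re.compile(
    r"""
    ^---            # three dashes at the start of a line (needs re.M)
    [^\S\n]*\n      # optional non-newline whitespace, then the line break
    (\S.*?)         # group 1: the name, starting with a non-whitespace char
    [^\S\r\n]*      # optional trailing whitespace after the name
    \n\S            # the next line must start with a non-whitespace char
    """,
    re.M | re.X,
)

# Shortened stand-in for the question's data.
sample = "\n".join([
    "---  ",
    "john cheung has joined the conversation  ",
    "  ",
    "(2021-07-29 01:11:19 AM BST)  ",
    "---  ",
    "allen.p.jonas  ",
    "Hi, james  ",
    "  ",
    "(2021-07-30 12:51:16 AM BST)  ",
    "---  ",
    "karren wenda  ",
    "how're you ? ",
    "  ",
    "---  ",
    "  ",
    "* * *",
])

posters = poster_pattern.findall(sample)
print(posters)  # ['allen.p.jonas', 'karren wenda']
```

The join notice is skipped because the line after it contains only spaces, so the final \n\S cannot match there.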

If you want to match all names:

^---[^\S\n]*\n(\S.*?)(?= has joined the conversation\b|[^\S\n]*$)


For example

print(re.findall(r"^---[^\S\n]*\n(\S.*?)(?= has joined the conversation\b|[^\S\n]*$)", s, re.M))

Output

['peter.j.matthew', 'john cheung', 'allen.p.jonas', 'karren wenda']
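One property of the lookahead variant worth checking is that the captured names come out without trailing whitespace, because the spaces before the end of the line are only inspected by the lookahead, not consumed by the group. A small sketch, using a made-up `sample` string with extra trailing spaces and a tab:

```python
import re

all_names_pattern = re.compile(
    r"^---[^\S\n]*\n(\S.*?)(?= has joined the conversation\b|[^\S\n]*$)",
    re.M,
)

sample = "\n".join([
    "---  ",
    "john cheung has joined the conversation  ",
    "  ",
    "---  ",
    "allen.p.jonas  \t",  # trailing spaces and a tab after the name
    "Hi, james  ",
    "  ",
    "---  ",
    "  ",
    "* * *",
])

names = all_names_pattern.findall(sample)
print(names)  # ['john cheung', 'allen.p.jonas']
```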