I have a string like this:
s = """(2021-07-29 01:00:00 AM BST)
---
peter.j.matthew has joined the conversation
(2021-07-29 01:00:00 AM BST)
---
john cheung has joined the conversation
(2021-07-29 01:11:19 AM BST)
---
allen.p.jonas
Hi, james
(2021-07-30 12:51:16 AM BST)
---
karren wenda
how're you ?
---
* * *"""
I want to extract the names as:
names_list= ['allen.p.jonas','karren wenda']
What I have tried:
names_list=re.findall(r'--- [\S\n](\D [\S\n])',s)
CodePudding user response:
It is not entirely clear what the rules are for deciding where a name ends. This answer assumes that the name is on the first line after ---, which in turn appears after the timestamp. If that line also contains has joined the conversation, then we only capture the text leading up to that phrase.
import re
names = re.findall(r'\(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [AP]M [A-Z]{3}\)\s+---\s+(.*?)(?:\s+has joined the conversation)?[ ]*\r?\n', s)
print(names) # ['peter.j.matthew', 'john cheung', 'allen.p.jonas', 'karren wenda']
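If only the people who actually wrote a message are wanted, one possible variation is to give the join phrase its own capturing group and filter on it afterwards. This is a minimal sketch, assuming the pattern uses `\s+` around `---` and that the sample is exactly as posted in the question; `ENTRY` and `speakers` are hypothetical names, not part of the answer.

```python
import re

# Sample text copied from the question.
s = """(2021-07-29 01:00:00 AM BST)
---
peter.j.matthew has joined the conversation
(2021-07-29 01:00:00 AM BST)
---
john cheung has joined the conversation
(2021-07-29 01:11:19 AM BST)
---
allen.p.jonas
Hi, james
(2021-07-30 12:51:16 AM BST)
---
karren wenda
how're you ?
---
* * *"""

# Same structure as the pattern above, but the join phrase is captured
# in a second group so it can be tested afterwards.
ENTRY = re.compile(
    r'\(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [AP]M [A-Z]{3}\)\s+---\s+'
    r'(.*?)(\s+has joined the conversation)?[ ]*\r?\n'
)


def speakers(text):
    """Return names from entries that are not join notices (hypothetical helper)."""
    # With two groups, findall yields (name, joined) tuples; joined is ''
    # when the optional group did not take part in the match.
    return [name for name, joined in ENTRY.findall(text) if not joined]


print(speakers(s))  # ['allen.p.jonas', 'karren wenda']
```

Filtering in Python keeps the regex close to the one in the answer, at the cost of one extra group.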
CodePudding user response:
Supposing you want to match names that are not followed by "has joined the conversation":
name_pattern = re.compile(r'---\s*\n(\w(?:[\w\. ](?!has joined the conversation))*?)\s*\n', re.MULTILINE)
print(re.findall(name_pattern, s))
Explanation:
---\s*\n
matches the dashes, possibly followed by whitespace, and a required newline.
Then comes the capturing group, composed of:
\w
starts with a 'word' character (a letter, a digit or _).
(?:[\w\. ](?!has joined the conversation))*?
a non-capturing group repeating \w, . or a space, each not followed by "has joined the conversation". The capture goes on until the trailing \s*\n, that is, optional whitespace and the newline that ends the line (*? makes the expression lazy instead of greedy).
Output:
['allen.p.jonas', 'karren wenda']
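The explanation above can also be written into the pattern itself with `re.VERBOSE`. This is only a sketch of the same regex: the spaces in the join phrase are escaped because verbose mode ignores unescaped whitespace outside character classes, and `NAME` and `sample` are hypothetical names with a shortened sample.

```python
import re

NAME = re.compile(r"""
    ---\s*\n          # the dashes, optional whitespace, a required newline
    (                 # start of the captured name
      \w              # first character is a word character
      (?:
        [\w. ]        # a word character, a dot or a space ...
        (?!has\ joined\ the\ conversation)  # ... not followed by the join phrase
      )*?             # repeated lazily
    )
    \s*\n             # optional trailing whitespace, then a newline
""", re.VERBOSE)

# Shortened sample (an assumption; timestamps left out for brevity).
sample = (
    "---\n"
    "john cheung has joined the conversation\n"
    "---\n"
    "allen.p.jonas\n"
    "Hi, james\n"
    "---\n"
    "karren wenda\n"
    "how're you ?\n"
)

print(NAME.findall(sample))  # ['allen.p.jonas', 'karren wenda']
```

The join line is rejected because the space before "has joined the conversation" cannot be consumed, so the required newline is never reached.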
CodePudding user response:
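If you only want to match ['allen.p.jonas','karren wenda'], and each "has joined the conversation" line is followed by an empty line (otherwise that line is captured as well), you can also match a non-whitespace char at the start of the next line: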
^---[^\S\n]*\n(\S.*?)[^\S\r\n]*\n\S
The pattern matches:
^
Start of a line (with re.M)
---
Match ---
[^\S\n]*\n
Match optional spaces and a newline
(\S.*?)
Capture group 1 (returned by re.findall): match a non-whitespace char followed by as few chars as possible
[^\S\r\n]*
Match optional whitespace chars without a newline
\n\S
Match a newline and a non-whitespace char
For example
print(re.findall(r"^---[^\S\n]*\n(\S.*?)[^\S\r\n]*\n\S", s, re.M))
Output
['allen.p.jonas', 'karren wenda']
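If you want to match all names (any non-empty line directly after --- is captured, so this assumes the closing --- is not immediately followed by * * *):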
^---[^\S\n]*\n(\S.*?)(?= has joined the conversation\b|[^\S\n]*$)
For example
print(re.findall(r"^---[^\S\n]*\n(\S.*?)(?= has joined the conversation\b|[^\S\n]*$)", s, re.M))
Output
['peter.j.matthew', 'john cheung', 'allen.p.jonas', 'karren wenda']
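To see how the two patterns behave side by side, here is a small sketch on an assumed, shortened sample in which the join notice is followed by an empty line. `SPEAKERS`, `ALL_NAMES`, `text` and `no_gap` are hypothetical names, and the real spacing of the data may differ.

```python
import re

# Assumed sample: the join notice is followed by an empty line.
text = (
    "(2021-07-29 01:00:00 AM BST)\n"
    "---\n"
    "john cheung has joined the conversation\n"
    "\n"
    "(2021-07-29 01:11:19 AM BST)\n"
    "---\n"
    "allen.p.jonas\n"
    "Hi, james\n"
)

# Names followed by a non-empty line (people who wrote a message).
SPEAKERS = re.compile(r"^---[^\S\n]*\n(\S.*?)[^\S\r\n]*\n\S", re.M)
# All names, with the join phrase cut off by the lookahead.
ALL_NAMES = re.compile(
    r"^---[^\S\n]*\n(\S.*?)(?= has joined the conversation\b|[^\S\n]*$)", re.M
)

print(SPEAKERS.findall(text))   # ['allen.p.jonas']
print(ALL_NAMES.findall(text))  # ['john cheung', 'allen.p.jonas']

# Without the empty line, the first pattern also captures the join notice.
no_gap = text.replace("conversation\n\n", "conversation\n")
print(SPEAKERS.findall(no_gap))
# ['john cheung has joined the conversation', 'allen.p.jonas']
```

The last call shows why the first pattern depends on the layout of the input: it tells names apart only by what the following line starts with.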