Remove digits outside alphanumeric characters using regex

Time:07-29

I have a string that looks like this:

details = "| 4655748765321 | _jeffybion5                    | John Dutch                                                    |"

The end product I want is this:

>>> details
_jeffybion5 John Dutch

My current code removes all digits, including those attached to words, and it also drops the whitespace between words.

>>> import re
>>>
>>> details = "| 47574802757 | _jeffybion5                    | John Dutch                                                    |"
>>> details = re.sub("[0-9]", "", details)
>>> details = re.sub("  ", "", details)
>>> details = details.replace("|", " ")
>>> details
_jeffybion JohnDutch

Any help achieving the desired result would be really appreciated.

CodePudding user response:

Non-Regex Solution

One approach:

chunks = details.split()
res = " ".join(chunk for chunk in chunks if not chunk.isnumeric() and (chunk != "|"))
print(res)

Output

_jeffybion5 John Dutch

Regex Solution

An alternative using re.findall:

res = " ".join(re.findall(r"\w*[a-zA-Z]\w*", details))
print(res)

Output

_jeffybion5 John Dutch

A third alternative, also using re.findall:

res = " ".join(re.findall(r"\w*[^\W\d]\w*", details))

The pattern:

[^\W\d]  

matches any word character that is not a digit.

Both regex solutions assume you want tokens made of letters, digits, and underscores that contain at least one letter (or, for the [^\W\d] pattern, at least one letter or underscore).
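To make the difference between a letter class and [^\W\d] concrete, here is a small sketch. The sample string and the variable names are made up for illustration, and the letter class is written as [a-zA-Z]:

```python
import re

# Hypothetical sample: "_1" contains an underscore but no letter.
sample = "| 123 | _1 | abc9 |"

# Requires at least one ASCII letter in each token.
with_letter = re.findall(r"\w*[a-zA-Z]\w*", sample)

# Requires at least one word character that is not a digit
# (a letter or an underscore).
with_non_digit = re.findall(r"\w*[^\W\d]\w*", sample)

print(with_letter)     # ['abc9']
print(with_non_digit)  # ['_1', 'abc9']
```

Which of the two is preferable depends on whether a token such as "_1" should count as text or as noise in your data.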

CodePudding user response:

With the exact samples you have shown, please try the following regex.

^[^|]*\|[^|]*\|\s+(\S+)\s+\|\s+([^|]*)

Python 3 code, using the split function from the re module to get the required output:

import re
## Creating the var variable here...
var = """
| 4655748765321 | _jeffybion5                    | John Dutch                                                    |
"""
## Getting the required output from the split function and some data manipulation here.
[x.strip(' |\n') for x in re.split(r'^[^|]*\|[^|]*\|\s+(\S+)\s+\|\s+([^|]*)', var) if x][0:-1]

##Output:
['_jeffybion5', 'John Dutch']

Explanation: The regex ^[^|]*\|[^|]*\|\s+(\S+)\s+\|\s+([^|]*) creates 2 capturing groups, and re.split returns their contents as list items. The strip call then removes spaces, pipes, and newlines from each item. Finally, the slice drops the last item of the list, which is the trailing pipe and newline and is empty after stripping.
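Since the pattern defines two capturing groups, the values can also be read directly from a match object instead of going through re.split. A minimal sketch, assuming every row has the three-column layout from the question; the names FIELDS and extract_fields and the tuple return shape are illustrative only:

```python
import re

# Skip the leading digit column, then capture the two remaining fields.
FIELDS = re.compile(r"^[^|]*\|[^|]*\|\s+(\S+)\s+\|\s+([^|]*)")

def extract_fields(row):
    """Return (username, full_name), or None if the row does not match."""
    match = FIELDS.match(row)
    if match is None:
        return None
    username, full_name = match.groups()
    # The second group runs up to the closing pipe, so trim its padding.
    return username, full_name.strip()

row = "| 4655748765321 | _jeffybion5                    | John Dutch                                                    |"
print(extract_fields(row))  # ('_jeffybion5', 'John Dutch')
```

Note that (\S+) only fits a second column without internal spaces, so this sketch is tied to the sample layout.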

CodePudding user response:

For the example data, you could replace each pipe, together with any surrounding whitespace, with a single space. If the pipe is followed by a run of digits, optional whitespace, and another pipe, that part is removed as well.

Then strip the surrounding spaces.

\s*\|\s*(?:\d+\s*\|)?

details = "| 4655748765321 | _jeffybion5                    | John Dutch                                                    |"
res = re.sub(r"\s*\|\s*(?:\d+\s*\|)?", " ", details).strip()
print(res)

Output

_jeffybion5 John Dutch
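To see why the final strip() is needed, it can help to look at the intermediate string. This is a sketch on a shortened copy of the sample row; the variable names are illustrative:

```python
import re

# Shortened copy of the sample row (less padding, same structure).
details = "| 4655748765321 | _jeffybion5     | John Dutch      |"

# A pipe with surrounding whitespace, optionally followed by a
# digits-only column and its closing pipe.
pattern = r"\s*\|\s*(?:\d+\s*\|)?"

substituted = re.sub(pattern, " ", details)
res = substituted.strip()

# repr() makes the leftover leading and trailing spaces visible.
print(repr(substituted))
print(repr(res))
```

Each match is replaced by one space, so the matches at the start and end of the row leave padding that strip() removes.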

If each kept field must contain at least one character in A-Za-z, you could split on | with its surrounding whitespace and keep only the parts that contain such a character:

details = "| 4655748765321 | _jeffybion5                    | John Dutch                                                    |  | "
res = " ".join([m for m in re.split(r"\s*\|\s*", details) if re.search(r"[A-Za-z]", m)])
print(res)

Output

_jeffybion5 John Dutch
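The same split-and-filter idea extends to several rows. A minimal sketch, assuming one table row per line; clean_row and the second row of data are hypothetical additions for illustration:

```python
import re

def clean_row(row):
    """Split a row on pipes and keep only fields containing a letter."""
    fields = re.split(r"\s*\|\s*", row)
    return " ".join(f for f in fields if re.search(r"[A-Za-z]", f))

# The second row is invented test data, not from the question.
table = """\
| 4655748765321 | _jeffybion5   | John Dutch   |
| 47574802757   | mary_77       | Mary Ann Lee |"""

cleaned = [clean_row(line) for line in table.splitlines()]
print(cleaned)
# ['_jeffybion5 John Dutch', 'mary_77 Mary Ann Lee']
```

Empty fields and digit-only fields both fail the letter check, so no separate handling of the leading and trailing pipes is needed.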