Below is an example of a test case:
inpoot = "A.p.p.l.e (45) Orange (5.11) Kiwi" # WE HAVE
outpoot = "A.p.p.l.e () Orange () Kiwi" # WE WANT
The only reason I spelled inpoot incorrectly is that input is the name of a Python built-in function.
One might think that the following would work:
import string
def kill_numbers(text: str) -> str:
    text = str(text)
    return "".join(filter(lambda ch: ch not in string.digits, text))
However, the decimal point (.) in a decimal number will be preserved.
inpoot = "A.p.p.l.e (45) Orange T5.11T Kiwi 99 Apricot"
outpoot = kill_numbers(inpoot)
print(repr(outpoot))
# prints 'A.p.p.l.e () Orange T.T Kiwi  Apricot'
# We want `TT`, not `T.T`:
# the output contains a stray decimal point.
outpoot = kill_numbers("Strawberry 3.145 Plum")
print(repr(outpoot))
# fails to delete the `.` in `3.145`
INPUT | BAD OUTPUT | DESIRED OUTPUT |
---|---|---|
"3.14" | "." | "" (empty string) |
So, how can we delete all numbers, including decimal numbers?
A substitution using regular expressions is theoretically possible.
import re
test_case = "(.4) A.p.p.l.e (44) Orange .... (4.44) Kiwi . . . . ."
result = re.sub(r"[0-9]+\.?[0-9]*|\.[0-9]+", "", test_case)
print(result) # () A.p.p.l.e () Orange .... () Kiwi . . . . .
The regular expression shown above works for that one test case, but not all test cases.
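To make the failure concrete, here is a small sketch that runs the same pattern over a few other inputs taken from this question. The names `PATTERN` and `samples` are only for illustration.

```python
import re

# Same pattern as above, applied to inputs it does not handle as desired.
PATTERN = r"[0-9]+\.?[0-9]*|\.[0-9]+"

samples = ["3.10.4", "There were 44.", "-5000", "0x77F"]
for s in samples:
    print(repr(s), "->", repr(re.sub(PATTERN, "", s)))
# '3.10.4' -> ''                      (the version number is deleted)
# 'There were 44.' -> 'There were '   (the sentence-ending period is deleted)
# '-5000' -> '-'                      (the minus sign is left behind)
# '0x77F' -> 'xF'                     (only the decimal digits are removed)
```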
The table below shows how various regular expressions perform on various test inputs.
KEY FOR TABLE
- means that the regex does NOT match the string
- meh means that the regex matches a small part of the string, but not the whole thing.
REGEX | ' 1 ' | '2' | '3' | '365' | '9.43' | '-5000' | ' 10' | '3.10.4' | '0001' | '.5' | '.' | '591.' | '' | '0x77F' | '3.456e11' |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
[0-9]+\\.?[0-9]*\|\\.[0-9]+ | - | - | - | - | - | meh | meh | meh | - | - | - | meh | meh | | |
[ -]?[0-9]+\\.?[0-9]*\|\\.[0-9]+ | - | - | - | - | - | - | - | meh | - | - | - | meh | meh | | |
[ -]?([0-9]+\\.?[0-9]*\|\\.[0-9]+) | - | - | - | - | - | - | - | meh | - | - | - | meh | meh | | |
[0-9]*\\.?[0-9]* | meh | - | - | - | - | meh | meh | meh | - | - | - | - | - | meh | meh |
[0-9]+\\.?[0-9]+ | - | - | meh | meh | meh | - | meh | meh | meh | | | | | | |
[0-9]+\\.?[0-9]* | - | - | - | - | - | meh | meh | meh | - | meh | - | meh | meh | | |
[0-9]*\\.?[0-9]+ | - | - | - | - | - | meh | meh | meh | - | - | meh | meh | meh | | |
\\d+ | - | - | - | - | meh | meh | meh | meh | - | meh | meh | meh | meh | | |
[0-9] | - | - | - | meh | meh | meh | meh | meh | meh | meh | meh | meh | meh | | |
\\d | - | - | - | meh | meh | meh | meh | meh | meh | meh | meh | meh | meh | | |
\\d* | meh | - | - | - | meh | meh | meh | meh | - | meh | meh | meh | - | meh | meh |
The same table in ASCII form might be easier to read and understand:
' 1 ' '2' '3' '365' '9.43' '-5000' ' 10' '3.10.4' '0001' '.5' '.' '591.' '' '0x77F' '3.456e11'
[0-9]+\.?[0-9]*|\.[0-9]+ - - - - - meh meh meh - - - meh meh
[ -]?[0-9]+\.?[0-9]*|\.[0-9]+ - - - - - - - meh - - - meh meh
[ -]?([0-9]+\.?[0-9]*|\.[0-9]+) - - - - - - - meh - - - meh meh
[0-9]*\.?[0-9]* meh - - - - meh meh meh - - - - - meh meh
[0-9]+\.?[0-9]+ - - meh meh meh - meh meh meh
[0-9]+\.?[0-9]* - - - - - meh meh meh - meh - meh meh
[0-9]*\.?[0-9]+ - - - - - meh meh meh - - meh meh meh
\d+ - - - - meh meh meh meh - meh meh meh meh
[0-9] - - - meh meh meh meh meh meh meh meh meh meh
\d - - - meh meh meh meh meh meh meh meh meh meh
\d* meh - - - meh meh meh meh - meh meh meh - meh meh
In my humble opinion, regular expressions are a nightmare.
To digress, it took me a long time to realize that:
IMHO = "In my humble opinion". I don't speak acronym very well.
Back to business...
I cannot find a regex which satisfies the following requirements:
- the regex must not match the empty string ("")
- the regex must not match any sub-string of a version number, such as "3.10.4". At most one decimal point is allowed to appear in what we call a "number"
- the regex must not match free-floating decimal points (".").
Desired behavior is as follows:
PSEUDO-NUMBER | IS_A_NUMBER() | NOTES |
---|---|---|
"1" | Yes | int |
"2" | Yes | int |
"365" | Yes | int |
"365." | No | 365. is a float equivalent to 365.0. However, I do not want to delete the (.) at the end of the string "The number of houses was 44." |
"9.43" | Yes | one decimal point |
"-5000" | Yes | |
" 10" | Yes | |
"0001" | Yes | |
".5" | Yes | .5 is equivalent to 0.5 |
"1" | Yes | |
"0x77F" | Yes | |
"3.456e11" | Yes | pseudo-scientific-notation |
"3.10.4" | Not a number | two decimal points |
"." | Not a number | |
"" | Not a number | do not match the empty string |
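The desired behavior can also be written down as test data, which makes it easy to score any candidate pattern. The sketch below is an illustration only: `CASES` and `mismatches` are hypothetical names, and it assumes that "is a number" means the whole string matches (`re.fullmatch`).

```python
import re

# (candidate string, should the whole string count as a number?)
CASES = [
    ("1", True), ("2", True), ("365", True), ("365.", False),
    ("9.43", True), ("-5000", True), (" 10", True), ("0001", True),
    (".5", True), ("0x77F", True), ("3.456e11", True),
    ("3.10.4", False), (".", False), ("", False),
]

def mismatches(pattern: str) -> list[str]:
    """Return the strings where re.fullmatch disagrees with the table."""
    regex = re.compile(pattern)
    return [s for s, expected in CASES
            if (regex.fullmatch(s) is not None) != expected]

# Score the first regex from the table above.
print(mismatches(r"[0-9]+\.?[0-9]*|\.[0-9]+"))
```

An empty list would mean the pattern agrees with every row of the table.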
EDIT:
The following are defined to be seed numbers ...
(1, 365, 9.43, -5000, 10, 0001, .5, .5, 0x77F, 3.456e11)
A valid number is defined to be any seed number, or a string formed from a seed number by doing one of the following:
- Iteratively replacing any digit in a seed number with 99
- Replacing any digit in a valid number with a different digit.
- Replacing F in 0xF with 2F or F2 or A, B, C, D, or E.
For example, you could replace the 5 in -5000 with 9 to get -9000.
Also, you could replace the 5 in .5 with 99 to get .99.
The above defines language L.
My question could be re-worded as follows:
What algorithm A will return s′ from input string s such that:
- s is any finite-length string of ASCII characters.
- string s′ is like string s except that all maximal substrings of s which are in language L have been replaced by empty strings.
A substring t of string s that is in language L is maximal if it is not possible to tack on one more character to the left or to the right of t to form t′, such that t′ is a string in language L and t′ is a substring of s.
In layman's terms, if you see "apple 12.345" you should go after "12.345" not "2.34".
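One possible reading of this requirement can be sketched with lookarounds, so that a match can neither start nor stop in the middle of a longer run of digits and dots. This is only an approximation of language L under my own assumptions (an optional leading minus, lowercase `e` exponents, no handling of a leading space); `NUMBER` and `delete_numbers` are hypothetical names, not something defined in the question.

```python
import re

# Hypothetical approximation of language L; an assumption, not a definition.
NUMBER = re.compile(
    r"""
    (?<![0-9.])              # do not start right after a digit or a dot
    -?                       # optional minus sign, as in -5000
    (?:
        0x[0-9A-Fa-f]+       # hexadecimal, as in 0x77F
      |
        (?:[0-9]+(?:\.[0-9]+)?|\.[0-9]+)   # 365, 9.43, .5
        (?:e[0-9]+)?         # optional exponent, as in 3.456e11
    )
    (?![0-9]|\.[0-9])        # do not stop where more digits or .digit follow
    """,
    re.VERBOSE,
)

def delete_numbers(s: str) -> str:
    """Replace every match of NUMBER in s with the empty string."""
    return NUMBER.sub("", s)

print(delete_numbers("A.p.p.l.e (45) Orange (5.11) Kiwi"))
# A.p.p.l.e () Orange () Kiwi
print(delete_numbers("version 3.10.4"))
# version 3.10.4
print(delete_numbers("The number of houses was 44."))
# The number of houses was .
```

Because `re.sub` scans left to right with greedy quantifiers, each match is the longest one starting at its position, and the lookarounds reject matches that would split a longer run such as "3.10.4".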
Indices matter. Sometimes, it makes no sense to say that the letter "a" is a sub-string of "abracadabra". Which letter "a" is it? Is it the letter "a" third-from-the-left, or second-from-the-left?
We define a string to be a mathematical mapping M from a finite subset of the natural numbers to the ASCII character set such that the difference between the maximum and the minimum of the domain of M is one less than the cardinality of the domain of M. In other words, the domain is a contiguous run of indices.
For any string SML and any string LRG, we say that SML is a sub-string of LRG if and only if SML[k] = LRG[k] for all k in the domain of string SML.
END OF EDIT
CodePudding user response:
>>> import re
>>> inpoot = "A.p.p.l.e (45) Orange (5.11) Kiwi" # WE HAVE
>>> pattern = re.compile(r"\d+\.?\d*")
>>> re.sub(pattern, "", inpoot)
'A.p.p.l.e () Orange () Kiwi'
>>>
CodePudding user response:
Try this:
>>> import re
>>> inpoot = "A.p.p.l.e (45) Orange (5.11) Kiwi"
>>> re.sub(r'(\d+\.\d+)|(\d+)', '', inpoot)
'A.p.p.l.e () Orange () Kiwi'
The first part tries to find a decimal number with the pattern: digits, decimal point, digits.
The second part looks for just a number without a decimal point.
The first part goes first because alternation picks the first alternative that matches, and we want the longer of the two.
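The effect of the ordering can be seen by swapping the two alternatives. This is a small sketch; the variable names are only for illustration.

```python
import re

text = "Orange (5.11) Kiwi"

# Decimal alternative first: "5.11" is consumed as a single match.
longer_first = re.sub(r"\d+\.\d+|\d+", "", text)

# Integer alternative first: "5" and "11" match separately, so the "." survives.
shorter_first = re.sub(r"\d+|\d+\.\d+", "", text)

print(repr(longer_first))   # 'Orange () Kiwi'
print(repr(shorter_first))  # 'Orange (.) Kiwi'
```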
CodePudding user response:
Quite a lot of requirements, so I could be missing something here, but still worth a try:
import re

def filter_nums(text: str) -> str:
    def is_a_number(x: str) -> bool:
        # Hex literals are parsed with int(); everything else with float().
        try:
            if re.search('^0x', x):
                int(x, 16)
            else:
                float(x)
            return True
        except ValueError:
            return False

    # Replace each run of number-like characters in place, but only when the
    # whole run parses as a number, so "3.10.4" is left untouched.
    return re.sub(
        r"[A-Fa-f0-9\-.+x]+",
        lambda m: '' if is_a_number(m.group()) else m.group(),
        text,
    )

text = "A.p.p.l.e (45) Orange (5.11) Kiwi [0x77F] 10 .,.,-5000!343£ ///3.456e11sd 3.10.4"
print(filter_nums(text))
# "A.p.p.l.e () Orange () Kiwi []  .,.,!£ ///sd 3.10.4"