How to delete all numbers from a string?

Time:07-15

Below is an example of a test case:

inpoot = "A.p.p.l.e (45) Orange (5.11) Kiwi"  # WE HAVE
outpoot = "A.p.p.l.e () Orange () Kiwi" # WE WANT

The only reason I spelled inpoot incorrectly is that input is the name of a Python built-in function, and I did not want to shadow it.

One might think that the following would work:

import string
def kill_numbers(text: str) -> str:
    text = str(text)
    return "".join(filter(lambda ch: ch not in string.digits, text))

However, the decimal point (.) in a decimal number will be preserved.

inpoot = "A.p.p.l.e (45) Orange T5.11T Kiwi 99 Apricot"

outpoot = kill_numbers(inpoot)
print(repr(outpoot))

# prints 'A.p.p.l.e () Orange T.T Kiwi  Apricot'
# We want `TT`, not `T.T`:
# the output contains a stray decimal point.

outpoot = kill_numbers("Strawberry 3.145 Plum")
print(repr(outpoot))

# fails to delete the `.` in `3.145`

INPUT     BAD OUTPUT   DESIRED OUTPUT
"3.14"    "."          "" (empty string)

So, how can we delete all numbers, including decimal numbers?

A substitution using regular expressions is theoretically possible.

import re
test_case =  "(.4) A.p.p.l.e (44) Orange .... (4.44) Kiwi . . . . ."
result = re.sub(r"[0-9]+\.?[0-9]*|\.[0-9]+", "", test_case)
print(result) # () A.p.p.l.e () Orange .... () Kiwi . . . . .

The regular expression shown above works for that one test case, but not all test cases.
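To make the failure concrete, here is a small sketch of two inputs that this pattern handles badly. It assumes the pattern was meant to read `[0-9]+\.?[0-9]*|\.[0-9]+`; the names `PATTERN`, `version_result`, and `sentence_result` are only illustrative.

```python
import re

# Assumed reading of the pattern above, with `+` quantifiers.
PATTERN = re.compile(r"[0-9]+\.?[0-9]*|\.[0-9]+")

# A version number is consumed in two pieces ("3.10", then ".4").
version_result = PATTERN.sub("", "Python 3.10.4")

# The sentence-ending period is swallowed along with the number.
sentence_result = PATTERN.sub("", "The number of houses was 44.")

print(repr(version_result))   # 'Python '
print(repr(sentence_result))  # 'The number of houses was '
```

Both results conflict with what I want: the version number should be left alone, and the period after 44 should survive.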

The table below shows how various regular expressions perform on various test inputs.

KEY FOR TABLE

  • - means that the regex does NOT match the string
  • means that the regex matches the entire string
  • meh means that the regex matches a small part of string, but not the whole thing.
                                ' 1  ' '2' '3' '365' '9.43' '-5000' ' 10' '3.10.4' '0001' '.5'  '.' '591.' '' '0x77F' '3.456e11'
[0-9]+\.?[0-9]*|\.[0-9]+             -   -   -     -      -     meh   meh      meh      -    -           -        meh        meh
[ -]?[0-9]+\.?[0-9]*|\.[0-9]+        -   -   -     -      -       -     -      meh      -    -           -        meh        meh
[ -]?([0-9]+\.?[0-9]*|\.[0-9]+)      -   -   -     -      -       -     -      meh      -    -           -        meh        meh
[0-9]*\.?[0-9]*                    meh   -   -     -      -     meh   meh      meh      -    -    -      -  -     meh        meh
[0-9]+\.?[0-9]                                     -      -     meh   meh      meh      -              meh        meh        meh
[0-9]+\.?[0-9]*                      -   -   -     -      -     meh   meh      meh      -  meh           -        meh        meh
[0-9]*\.?[0-9]                       -   -   -     -      -     meh   meh      meh      -    -         meh        meh        meh
\d+                                  -   -   -     -    meh     meh   meh      meh      -  meh         meh        meh        meh
[0-9]                                -   -   -   meh    meh     meh   meh      meh    meh  meh         meh        meh        meh
\d                                   -   -   -   meh    meh     meh   meh      meh    meh  meh         meh        meh        meh
\d*                                meh   -   -     -    meh     meh   meh      meh      -  meh  meh    meh  -     meh        meh

In my humble opinion, regular expressions are a nightmare.

To digress, it took me a long time to realize that:

IMHO = In my humble opinion. I don't speak acronym very well.

Back to business...

I cannot find a regex which satisfies the following requirements:

  • the regex must not match the empty string ("")
  • the regex must not match any sub-string of a version number, such as "3.10.4". At most one decimal point is allowed to appear in what we call a "number".
  • the regex must not match free-floating decimal points (".").

Desired behavior is as follows:

PSEUDO-NUMBER   IS_A_NUMBER()   NOTES
"1"             Yes             int
"2"             Yes             int
"365"           Yes             int
"365."          No              365. is a float equivalent to 365.0. However, I do not want to delete the (.) at the end of the string "The number of houses was 44."
"9.43"          Yes             one decimal point
"-5000"         Yes
" 10"           Yes
"0001"          Yes
".5"            Yes             .5 is equivalent to 0.5
"1"             Yes
"0x77F"         Yes
"3.456e11"      Yes             pseudo-scientific-notation
"3.10.4"        Not a number    two decimal points
"."             Not a number
""              Not a number    do not match the empty string
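For illustration only, the desired behavior listed above could be approximated with lookarounds. This is a rough sketch, not a proven solution: the names `NUMBER` and `strip_numbers` are hypothetical, and it assumes that a number may touch letters (as in `T5.11T`) but may not touch another digit or a `.digit` continuation.

```python
import re

# Hypothetical pattern, written only to mirror the rows listed above.
NUMBER = re.compile(
    r"""
    (?<![0-9.])                  # not glued to a preceding digit or dot
    -?                           # optional minus sign
    (?:
        0[xX][0-9A-Fa-f]+        # hexadecimal, e.g. 0x77F
      |
        (?:[0-9]+(?:\.[0-9]+)?   # 365 or 9.43
          |\.[0-9]+)             # .5
        (?:[eE][+-]?[0-9]+)?     # optional exponent, e.g. e11
    )
    (?![0-9]|\.[0-9])            # not followed by a digit or by ".digit"
    """,
    re.VERBOSE,
)


def strip_numbers(text: str) -> str:
    """Delete every substring matched by NUMBER."""
    return NUMBER.sub("", text)


print(repr(strip_numbers("The number of houses was 44.")))  # 'The number of houses was .'
print(repr(strip_numbers("version 3.10.4")))                # 'version 3.10.4'
print(repr(strip_numbers("Orange T5.11T Kiwi")))            # 'Orange TT Kiwi'
```

The lookbehind and lookahead are what keep the pattern away from version numbers and from a trailing period. Inputs not covered by the rows above (for example `1e5.2` or `3-5`) may still behave in surprising ways.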

EDIT:

The following are defined to be seed numbers ...

(1, 365, 9.43, -5000, 10, 0001, .5, .5, 0x77F, 3.456e11)

A valid number is defined to be any seed number or a string formed by a seed number by doing one of the following:

  1. Iteratively replacing any digit in a seed number with 99
  2. Replacing any digit in a valid number with a different digit.
  3. Replacing F in 0xF with 2F, F2, A, B, C, D, or E.

For example, you could replace the 5 in -5000 with 9 to get -9000

Also, you could replace the 5 in .5 with 99 to get .99

The above defines language L.

My question could be re-worded as follows:

What algorithm A will return s′ from input string s such that:

  • s is any finite-length string of ASCII characters.
  • string s′ is like string s except that all maximal substrings of s which are in language L, have been replaced by empty strings.

A substring t of string s is maximal and t is in language L if it is not possible to tack on one more character to the left or to the right of t to form t′, such that t′ is a string in language L and t′ is a substring of s.

In layman's terms, if you see "apple 12.345" you should go after "12.345" not "2.34".
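As a small illustration of going after the whole number: Python's `re` module scans from left to right and its quantifiers are greedy, so a simple decimal pattern picks up `12.345` rather than an inner piece such as `2.34`. The pattern here is only an example and does not cover all of language L.

```python
import re

match = re.search(r"[0-9]+\.[0-9]+", "apple 12.345")

print(match.group())  # 12.345
print(match.span())   # (6, 12)
```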

Indices matter. Sometimes, it makes no sense to say that the letter "a" is a sub-string of "abracadabra". Which letter "a" is it? Is it the letter "a" third-from-the-left, or second-from-the-left?

We define a string to be a mathematical mapping M from a finite subset of the natural numbers to the ASCII character set such that the absolute difference between the maximum and the minimum of the domain of M is one less than the cardinality of the domain of M. In other words, the domain is a contiguous range of indices.

For any string SML and any string LRG, we say that SML is a sub-string of LRG if and only if SML[k] = LRG[k] for all k in the domain of string SML.
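In Python terms, an occurrence of a sub-string can be pinned down by its indices rather than by its characters alone. The sketch below uses `re.finditer` to list where each "a" sits in "abracadabra"; the variable names are only illustrative.

```python
import re

word = "abracadabra"

# Every occurrence of "a", identified by its starting index.
positions = [m.start() for m in re.finditer("a", word)]
print(positions)  # [0, 3, 5, 7, 10]

# A (start, end) pair selects one specific occurrence.
start, end = 3, 4
print(word[start:end])  # a
```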

END OF EDIT

CodePudding user response:

>>> import re
>>> inpoot = "A.p.p.l.e (45) Orange (5.11) Kiwi"  # WE HAVE
>>> pattern = re.compile(r"\d+\.?\d*")
>>> re.sub(pattern, "", inpoot)
'A.p.p.l.e () Orange () Kiwi'
>>>

CodePudding user response:

Try this:

>>> import re
>>> inpoot = "A.p.p.l.e (45) Orange (5.11) Kiwi"
>>> re.sub(r'(\d+\.\d+)|(\d+)', '', inpoot)
'A.p.p.l.e () Orange () Kiwi'
  • The first part tries to find a decimal number with the pattern: digits decimalpoint digits

  • The second part looks for just a number without a decimal point.

The first part goes first because alternation picks the first match and we want the longer of the two.
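A quick way to see why the order matters is to swap the two alternatives. This sketch assumes the pattern was meant as `\d+\.\d+|\d+`; the variable names are only illustrative.

```python
import re

text = "Orange (5.11) Kiwi (45)"

# Longer alternative first: "5.11" is removed as a whole.
decimal_first = re.sub(r"\d+\.\d+|\d+", "", text)

# Shorter alternative first: "5" and "11" are removed separately,
# and the decimal point between them is left behind.
integer_first = re.sub(r"\d+|\d+\.\d+", "", text)

print(decimal_first)  # Orange () Kiwi ()
print(integer_first)  # Orange (.) Kiwi ()
```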

CodePudding user response:

Quite many requirements, so I could be missing something here, but still worth a try:

import re


def filter_nums(text: str) -> str:

    def is_a_number(x: str) -> bool:
        # Hex literals go through int(); everything else through float().
        try:
            if x.startswith('0x'):
                int(x, 16)
            else:
                float(x)
            return True
        except ValueError:
            return False

    def drop_if_number(match: re.Match) -> str:
        chunk = match.group(0)
        return '' if is_a_number(chunk) else chunk

    # Candidate chunks are maximal runs of hex digits, '-', '.', and 'x'.
    # Each chunk is judged and replaced in place, so deleting "10" or "45"
    # cannot damage a longer chunk such as "3.10.4" or "3.456e11".
    return re.sub(r"[A-Fa-f0-9\-.x]+", drop_if_number, text)

text = "A.p.p.l.e (45) Orange (5.11) Kiwi [0x77F]  10 .,.,-5000!343£ ///3.456e11sd 3.10.4"
print(filter_nums(text))
# "A.p.p.l.e () Orange () Kiwi []   .,.,!£ ///sd 3.10.4"