Python - remove punctuation marks at the end and at the beginning of one or more words

I want to know how to remove punctuation marks at the beginning and at the end of one or more words. Punctuation marks inside a word should not be removed.

For example:

input:

word = "!.test-one,-"

output:

word = "test-one"

CodePudding user response：

Use str.strip with string.punctuation:

>>> import string
>>> word = "!.test-one,-"
>>> word.strip(string.punctuation)
'test-one'
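
Since the question mentions "one or more words", here is a minimal sketch that applies the same strip call to every whitespace-separated word. The helper name strip_words is hypothetical, and dropping words that consist only of punctuation is an assumption about the desired behavior, not something the question states.

```python
import string

def strip_words(text, chars=string.punctuation):
    # Hypothetical helper: split on whitespace, strip the given characters
    # from both ends of each word, and drop words that become empty.
    stripped = (w.strip(chars) for w in text.split())
    return " ".join(w for w in stripped if w)

print(strip_words("!.test-one,- (hello) ... 'world!'"))
# prints: test-one hello world
```

Punctuation inside a word, such as the hyphen in test-one, is left alone because strip only looks at the two ends.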

CodePudding user response：

The best solution is to use the .strip(chars) method of the built-in class str.

Another approach is to use a regular expression with the re module.

To understand what strip() and the regular expression do, take a look at two functions that duplicate the behavior of strip(). The first one uses recursion, the second one uses while loops:


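chars = r'''!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~'''

def cstm_strip_1(word, chars):
    # Approach using recursion: drop one matching character from each end
    # per call until nothing changes.
    if not word:
        return word  # nothing left to strip; also avoids an IndexError
    w = word[1 if word[0] in chars else 0: -1 if word[-1] in chars else None]
    if w == word:
        return w
    else:
        return cstm_strip_1(w, chars)

def cstm_strip_2(word, chars):
    # Approach using while loops: move both indexes inward past chars.
    i, j = 0, len(word)
    while i < j and word[i] in chars:
        i += 1
    while j > i and word[j - 1] in chars:
        j -= 1
    return word[i:j]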

import re, string

chars = string.punctuation
word = "~!.test-one^&test-one--two???"

wsc = word.strip(chars)
assert wsc == cstm_strip_1(word, chars)
assert wsc == cstm_strip_2(word, chars)
assert wsc == re.sub(r"(^[^\w]+)|([^\w]+$)", "", word)

word = "__~!.test-one^&test-one--two??__"

wsc = word.strip(chars)
assert wsc == cstm_strip_1(word, chars)
assert wsc == cstm_strip_2(word, chars)
# The next assert would fail: '_' is a word character, so the regex removes nothing here.
# assert wsc == re.sub(r"(^[^\w]+)|([^\w]+$)", "", word)
assert re.sub(r"(^[^\w]+)|([^\w]+$)", "", word) == word

print(re.sub(r"(^[^\w]+)|([^\w]+$)", "", word), '!=', wsc)
print('"', re.sub(r"(^[^\w]+)|([^\w]+$)", "", "\tword\t"), '" != "', "\tword\t".strip(chars), '"', sep='')

Notice that the result of the given regular expression pattern can differ from the result of .strip(string.punctuation), because the set of characters matched by [^\w] differs from the set of characters in string.punctuation: [^\w] also matches whitespace such as '\t', but it does not match the underscore '_'.
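
If the regular expression should remove exactly the characters in string.punctuation, one option is to build the character class from that string. The sketch below is only an illustration; the names PUNCT_CLASS, EDGE_PUNCT and regex_strip are hypothetical, and \A and \Z are used instead of ^ and $ so that only the true start and end of the string count.

```python
import re
import string

# Character class holding exactly the characters of string.punctuation.
# re.escape protects characters such as ']', '\', '^' and '-' that are
# special inside [...].
PUNCT_CLASS = "[" + re.escape(string.punctuation) + "]"
EDGE_PUNCT = re.compile(rf"\A{PUNCT_CLASS}+|{PUNCT_CLASS}+\Z")

def regex_strip(word):
    # Remove runs of punctuation at the very start and very end only.
    return EDGE_PUNCT.sub("", word)

for w in ["!.test-one,-", "__~!.test-one^&test-one--two??__", "\tword\t"]:
    assert regex_strip(w) == w.strip(string.punctuation)
```

With this pattern the underscore is removed and the tab is kept, which matches the behavior of strip for these inputs.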

SUPPLEMENT

What does the regular expression pattern:

(^[^\w]+)|([^\w]+$)

mean?

A detailed explanation follows:

The '|' character means 'or'. It provides two alternatives for the
   sub-string (called a match) which is to be found in the provided string.

'(^[^\w]+)' is the first of the two alternatives for a match

    '(' ')' enclose what is called a "capturing group" (^[^\w]+)

    The first of the two '^' asserts the position at the start of the string
      (or at the start of each line if re.MULTILINE is set)

    '\w' : 'w' escaped with \ means: "word character"
        (i.e. letters, digits and the underscore '_';
         with re.ASCII only a-z, A-Z, 0-9 and '_').

    The second of the two '^' means: logical "not"
      (here not a "word character")
      i.e. all characters except letters, digits and '_'
        (for example '~' or '\t')
      Notice that the meaning of '^' depends on context:
        '^' outside of [ ] means start of line/string
        '^' inside  of [ ] as first char means logical not
            and not as first char means itself

    '[', ']' enclose the specification of a set of characters
      and mean the occurrence of exactly one of them

    '+' means occurrence between one and unlimited times
        of what was defined in the preceding token

'([^\w]+$)' is the second alternative for a match
    differing from the first by stating that the match
    should be found at the end of the string
    '$' means: "end of the string" (or "end of the line" if re.MULTILINE is set)
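
How \w treats non-ASCII letters such as 'ö' depends on the matching mode. For str patterns in Python 3 matching is Unicode-aware by default, so 'ö' counts as a word character; the re.ASCII flag restricts \w to a-z, A-Z, 0-9 and '_'. A small sketch with a made-up sample word shows the difference:

```python
import re

pattern = r"(^[^\w]+)|([^\w]+$)"
word = "~ö-test-ö~"  # sample input chosen for illustration

unicode_result = re.sub(pattern, "", word)               # 'ö' is a word character
ascii_result = re.sub(pattern, "", word, flags=re.ASCII)  # 'ö' is not

print(unicode_result)  # prints: ö-test-ö
print(ascii_result)    # prints: test
```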

The regular expression pattern tells the regular expression engine to work as follows:

The engine looks at the start of the string for an occurrence of a non-word character. If one is found, it is remembered as part of a match, and the next character is checked and added to the match if it is also a non-word character. This way the start of the string is checked for a run of non-word characters, which are then removed when the pattern is used in re.sub(r"(^[^\w]+)|([^\w]+$)", "", word), because re.sub replaces every match with an empty string (in other words, it deletes the matched characters from the string).

After the engine hits the first word character, the first alternative can no longer match because it is limited to the start of the string. From then on only the second alternative, which is anchored at the end of the string, can produce a match.

This way any non-word characters in the intermediate part of the string are not removed.

The engine then looks for a run of non-word characters that extends all the way to the end of the string, which ensures that the found non-word characters really are at the end.
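
To see which parts of the string the two alternatives actually match, the matches can be listed with re.finditer. This is only an illustrative sketch using the word from the question; capturing group 1 belongs to the start alternative and group 2 to the end alternative.

```python
import re

pattern = re.compile(r"(^[^\w]+)|([^\w]+$)")
word = "!.test-one,-"

for m in pattern.finditer(word):
    # Group 1 is set when the start alternative matched, otherwise group 2.
    which = "start" if m.group(1) is not None else "end"
    print(which, m.span(), repr(m.group(0)))
# prints:
# start (0, 2) '!.'
# end (10, 12) ',-'
```

The hyphen inside test-one never appears in the list, because it is neither at the start nor part of a run that reaches the end of the string.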

CodePudding user response：

Using re.sub

import re
word = "!.test-one,-"
out = re.sub(r"(^[^\w]+)|([^\w]+$)", "", word)
print(out)

Output:

test-one

CodePudding user response：

Check this example using slicing. Note that it removes at most one punctuation character from each end:

import string
sentence = "_blogs that are consistently updated by people that know about the trends, and market, and care about giving quality content to their readers."    
if sentence[0] in string.punctuation:
    sentence = sentence[1:]
if sentence[-1] in string.punctuation:
    sentence = sentence[:-1]
print(sentence)

Output:

blogs that are consistently updated by people that know about the trends, and market, and care about giving quality content to their readers
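
For an input like "!.test-one,-", which has several punctuation characters at each end, the two slices can be repeated in loops. The following is a sketch of that idea, not part of the original answer; the function name strip_by_slicing is hypothetical.

```python
import string

def strip_by_slicing(text, chars=string.punctuation):
    # Repeat the single-character slices until neither end is in chars.
    # The "text and" checks stop the loops once the string is empty.
    while text and text[0] in chars:
        text = text[1:]
    while text and text[-1] in chars:
        text = text[:-1]
    return text

print(strip_by_slicing("!.test-one,-"))
# prints: test-one
```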