Removing non-alphabetic characters with a function returns incorrect output

For an assignment, I am writing a function, remove_extraneous, that takes any string and returns it with only the letters of the alphabet kept. Here is my attempt so far:

alphabet = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

def remove_extraneous(text):
    '''
    Description: 
        
    Examples:
    >>> remove_extraneous('test !')
    
    >>> remove_extraneous('code??')
    '''
    
    return ([text.replace(i, "") for i in text if i not in alphabet])

My examples return:

Examples:
    >>> remove_extraneous('test !')
    ['test!', 'test ']
    >>> remove_extraneous('code??')
    ['code', 'code']

This partly works, but it returns a list of strings instead of a single cleaned string. It should return:

Examples:
    >>> remove_extraneous('test !')
    'test'
    >>> remove_extraneous('code??')
    'code'

Also, my teacher's example says that this call should return:

>>> remove_extraneous('boo!\n')
    'boo'

but when I try it, mine returns the following error:

raise ValueError('line %r of the docstring for %s has '

ValueError: line 10 of the docstring for __main__.remove_extraneous has inconsistent leading whitespace: "')"

The newline part really confuses me, so bear with me on that. Overall, what should I change in my code so that the correct string is returned?

CodePudding user response：

You could simplify this drastically. Make sure to return a str, not a list:

from string import ascii_lowercase

alphabet = set(ascii_lowercase)

def remove_extraneous(text):
    return "".join(c for c in text if c in alphabet)

>>> remove_extraneous('test !')
'test'
>>> remove_extraneous('code??')
'code'
>>> remove_extraneous('boo!\n')
'boo'
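
On the docstring error from the question: in a regular (non-raw) docstring, Python turns the \n in 'boo!\n' into a real line break before doctest reads the text, so the example is split across two lines and the parser complains about inconsistent leading whitespace. A minimal sketch of one way around that, assuming the docstring can be written as a raw string (the r prefix); the description line is only illustrative:

```python
from string import ascii_lowercase

alphabet = set(ascii_lowercase)

def remove_extraneous(text):
    r'''
    Return text with everything except lowercase ASCII letters removed.

    The r prefix keeps the backslash-n below as two literal characters
    inside the docstring, so doctest sees the example on a single line.

    >>> remove_extraneous('test !')
    'test'
    >>> remove_extraneous('code??')
    'code'
    >>> remove_extraneous('boo!\n')
    'boo'
    '''
    return "".join(c for c in text if c in alphabet)
```

Doubling the backslash ('boo!\\n') in a normal docstring should work as well. The examples can be checked with python -m doctest -v yourfile.py, where yourfile.py is a placeholder name.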


CodePudding user response：

Here is why your code does not work.

When you do:

[text.replace(i, "") for i in text if i not in alphabet]

you produce a list with one item for each character in text that is not in alphabet. Each item is a copy of text with that one character removed.

So for 'abc' you get an empty list, for 'abc!' you get ['abc'] because there is one invalid character, and for 'abc!!!!!!!!' you get as many items as there are exclamation marks.

Second, calling replace inside a loop over the characters is inefficient: each call scans the whole string, and you make one call per invalid character, so in the worst case the work grows roughly with the square of the string's length. This means your code will become slow quite quickly on long inputs.

The correct approach is to check the characters one by one and keep them if they are in the whitelist:

[char for char in text if char in alphabet]

Then you obtain a list, which you need to convert back to a string by joining the characters:

''.join(char for char in text if char in alphabet)
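
To see the two behaviours side by side, here is a small sketch using the same alphabet as in the question (built with string.ascii_lowercase for brevity; the variable names are only illustrative):

```python
from string import ascii_lowercase

alphabet = set(ascii_lowercase)
text = 'abc!!'

# Original approach: one copy of text per invalid character.
per_invalid_char = [text.replace(ch, "") for ch in text if ch not in alphabet]

# Filtering approach: keep valid characters, then join them once.
filtered = ''.join(ch for ch in text if ch in alphabet)

print(per_invalid_char)  # ['abc', 'abc']
print(filtered)          # abc
```

The first expression answers "what does text look like without this one character?" once per invalid character, while the second builds a single string in one pass.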

CodePudding user response：

I would suggest using the re (regular expression) module:

import re

non_letters = re.compile('[^A-Za-z]')

def remove_extraneous(text):
    return non_letters.sub('', text)
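
One thing to keep in mind with this pattern: [^A-Za-z] keeps uppercase letters too, whereas the lowercase-only alphabet list in the question would drop them. A small sketch of the difference; non_lowercase is a hypothetical name added here purely for comparison:

```python
import re

non_letters = re.compile('[^A-Za-z]')  # matches anything that is not an ASCII letter
non_lowercase = re.compile('[^a-z]')   # hypothetical variant: matches anything but a-z

sample = 'Boo!\n'
keeps_both_cases = non_letters.sub('', sample)
keeps_lowercase_only = non_lowercase.sub('', sample)

print(keeps_both_cases)      # Boo
print(keeps_lowercase_only)  # oo
```

Which of the two is right depends on whether the assignment counts uppercase letters as part of the alphabet.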