How can I simplify my function and make it more pythonic?-CodePudding

I have written a Python function for some post processing in my text recognition algorithm. It works fine, but it seems to be kind of lengthy and has many if-else-conditions. I am looking for a way to simplify my code:

def postProcessing(new_s): #new_s is a list
  import string
    
  new_s=removeFrontLetters(new_s) #This function has to be called first
 
  if all(ch in string.digits for ch in new_s[:4]): #checking if first 4 elements are digits.
    
    if all(ch in string.ascii_letters for ch in new_s[4:5]): #checking if the [4] and [5] are letters
      
      if len(new_s)>=7:         
        if new_s[6]=='C': 
          
          new_s=new_s[:7] #if length>=7 and the [6] =='C' the reversed of the first 7 elements has to be returned.
          new_s=list(reversed(new_s)) 
          return(new_s)

          
        else:
          new_s=new_s[:6] #if length>=7 but the [6] =!'C' the reversed of the first 6 elements has to be returned.
          new_s=list(reversed(new_s))

          return(new_s)
      

      else:
        new_s=list(reversed(new_s)) #if length<7, the reversed of the given list has to be returned.
        return(new_s)
    else:
      print('not valid') #if the [4] and [5] are not letters, it is not valid
  else:
    print('not valid') #if the [:4] are not digits, it is not valid

This seems very beginner-level and lengthy. I am a beginner, but I am trying to improve my function. Do you have suggestions?

CodePudding user response：

You can invert your if statements and use early returns to reduce the indentation of your code.

def postProcessing(new_s):  # new_s is a list
    import string

    new_s = removeFrontLetters(new_s)  # This function has to be called first

    if not all(ch in string.digits for ch in new_s[:4]):  # checking if first 4 elements are digits.
        raise ValueError("First four elements must be digits")
    
    if not all(ch in string.ascii_letters for ch in new_s[4:5]):  # checking if the [4] and [5] are letters
        raise ValueError("First elements 4 and 5 must be digits")
    
    if len(new_s) <= 7:
        new_s = list(reversed(new_s))  # if length<7, the reversed of the given list has to be returned.
        return (new_s)
    
    if new_s[6] == 'C':

        new_s = new_s[
                :7]  # if length>=7 and the [6] =='C' the reversed of the first 7 elements has to be returned.
        new_s = list(reversed(new_s))
        return (new_s)

    new_s = new_s[:6]  # if length>=7 but the [6] =!'C' the reversed of the first 6 elements has to be returned.
    new_s = list(reversed(new_s))
    return (new_s)

CodePudding user response：

It's quite neat you ask 'the world' for advise. Did you know there's a dedicated stackexchange site for this? https://codereview.stackexchange.com/.

Unless you insist writing Python code for this, it seems that you need a regular expression here.

So some tips:

use regex for pattern matching
use variables to express what an expression means
use exceptions instead of an 'invalid value' string
separate the 'parsing' from the processing to keep functions small and focused
use doctest to document and test small functions

def post_process(new_s):
    """
    reverse a string (with a twist)

    what follows is a doctest.  You can run it with
    $ python -m doctest my_python_file.py

    >>> post_process('1234abCdef')
    'Cba4321'
    >>> post_process('1234abdef')
    'ba4321'
    """

    cmds = {
      'C': cmd_c,
      '': cmd_none
    }

    command, content = command_and_content(new_s)
    process = cmds[command]
    return process(content)


def cmd_c(content):
  return 'C'   "".join(reversed(content))
def cmd_none(content):
  return "".join(reversed(content))

The command_and_content function replaces the parsing logic:

def command_and_content(new_s):
    # get help on https://regex101.com/ to find the
    # regular expression for four digits and two letters
    digits_then_ascii = re.compile(r"(\d{4}[a-z]{2})(C?)(.*)")
    if match := digits_then_ascii.match(new_s):
      content = match.group(1)
      command = match.group(2)
      return command, content

    # pylint will ask you to not use an else clause after a return
    # Also, Python advises exceptions for notifying erroneous input
    raise ValueError(new_s)

CodePudding user response：

From the context you provided, I assume that all this processing can happen in-place (i.e. without the need to allocate additional memory). The benefit of lists is that they are mutable, so you can actually do all your operations in-place.

This adheres to the style conventions (PEP 8) and uses correct type annotations (PEP 484):

from string import digits, ascii_letters


def remove_front_letters(new_s: list[str]) -> None:
    ...
    raise NotImplementedError


def post_processing(new_s: list[str]) -> None:
    remove_front_letters(new_s)
    if any(ch not in digits for ch in new_s[:4]):
        raise ValueError("The first 4 characters must be digits")
    if any(ch not in ascii_letters for ch in new_s[4:6]):
        raise ValueError("The 5th and 6th characters must be letters")
    if len(new_s) >= 7:
        if new_s[6] == 'C':
            del new_s[7:]
        else:
            del new_s[6:]
    new_s.reverse()

If you do want a new list, you can just call this function with a .copy() of your input list.

References: list methods; del statement

PS: If you use Python version 3.8 or lower, instead of list[str] you'll need to use typing.List[str].

Also someone mentioned the possibility of replacing the iteration via all() (or any()) with a "".join(...).isdigit() for example. While this is certainly also correct and technically less code, I am not sure it is necessarily more readable. More importantly it creates a new string in the process, which I don't think is necessary.

By the way, you could even reduce that conditional deletion of list elements to a one liner like this:

...
    if len(new_s) >= 7:
        del new_s[7 if new_s[6] == 'C' else 6:]
    new_s.reverse()

But I would argue that this is worse because it is less readable. Personal preference I guess.