How can I parse only string without using regex in python?

Time:07-06

I am learning Python and have a question about parsing strings without regex. We are required to use a while loop. Here is the question:

We read a string from the user with the input function, and then extract only the words (runs of alphabetic characters) from that sentence into a list.

For example, sentence: "The weather is so lovely today. Jack (our Jack) – Jason - and Alex went to park..? "

Example output: ["The", "weather", "is", "so","lovely","today","Jack","our","Jack","and","Alex","went","to","park"]

I have to note that punctuation marks and special characters such as parentheses are not part of words.

Below is the code I tried. I couldn't find where the error is.


    s="    The weather is so lovely today. Jack (our Jack) – Jason - and Alex went to park..?"
    
    i = 0
    j = 0
    l=[]
    k=[]
    count = 0
    while s:
        while j<len(s) and not s[j].isalpha():
            j += 1
            l = s[j:]
            s=s[j:]
            while j < len(s) and l[j].isalpha():
                j += 1
                s=s[j:]
    k.append(l[0:i])
    print(k)
    print(l)

On the other hand, I did manage to parse the first word with the code below.

s="    The weather is so lovely today. Jack (our Jack) – Jason - and Alex went to park..?"

i = 0
j = 0
l=[]
k=[]

while j<len(s) and not s[j].isalpha():
    j += 1
    l = s[j:]
    while i < len(l) and l[i].isalpha():
        i += 1
        s=s[i:]
k.append(l[0:i])
print(k)
print(l)

Thanks for your help.

CodePudding user response:

By and large, if your goal is to parse a string and you find yourself modifying the string, you're probably doing it wrong. That's particularly true of languages like Python where strings are immutable, and modifying a string really means creating a new one, which takes time proportional to the length of the string. Doing that in a loop effectively turns a linear scan into a quadratic-time algorithm; you might not notice the dramatic consequences with a few short test cases, but sooner or later you (or someone) will try your code out on a significantly longer string, and the quadratic time will come back to bite you.

Anyway, there's no need. All you need to do is to look at the characters, or more accurately, look at each position between two characters, in order to find the positions of the beginnings of the words (where an alphabetic character follows a non-alphabetic character) and the ends of the words (where a non-alphabetic character follows an alphabetic character). Once the beginning and end of each word is discovered, the complete word can be added to the word list.

Note that we don't actually care what each character is, only whether it is alphabetic. So in the following code, I don't save the previous character; rather I save the boolean value of whether the previous character was alphabetic. At the start of the scan, previous_was_alphabetic is set to False, so if the first character in the string is alphabetic, that counts as the start of a word.

There's one little Python trick here, to handle the end of the string. If the last character in the string is alphabetic, then it's the end of a word, so it would be convenient to ensure that the string ends with a non-alphabetic character. But I don't really want to create a modified string, and I'd prefer not to have to write special purpose code for the end of the string. What I do instead is to use a slice; instead of looking at s[i] (the ith character), I use s[i:i+1], the one-character slice starting at position i. Conveniently, if i happens to be the length of s, then s[i:i+1] is an empty string, '', and even more conveniently, ''.isalpha() is False. So that will act as though there were an invisible non-alphabetic character at the end of the string.
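
A quick check of the slice behaviour described above. This is only a minimal sketch; the sample string and the variable names are illustrative and not part of the answer's code.

```python
s = "park"

# Indexing at len(s) would raise IndexError, but slicing there does not:
# it simply returns an empty string.
end_slice = s[len(s):len(s) + 1]
last_slice = s[3:4]  # one-character slice holding the last character

# The empty string is not alphabetic, so it behaves like an invisible
# non-alphabetic character after the end of the string.
end_is_alpha = end_slice.isalpha()
last_is_alpha = last_slice.isalpha()
```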

This is not really very Pythonic, but your assignment seems to be insisting that you use a while loop rather than the much more natural for loop (which would require a different way of dealing with the end of the string).

def words_from(s):
    """Returns a list of the "words" (contiguous sequences of alphabetic
       characters) from the string s
    """
    words = []
    previous_was_alphabetic = False
    i = 0
    while i <= len(s):
        next_is_alphabetic = s[i:i+1].isalpha()
        if not previous_was_alphabetic and next_is_alphabetic:
            # i is the start of a word
            start = i
        elif previous_was_alphabetic and not next_is_alphabetic:
            # i is the position after the end of a word
            words.append(s[start:i])
        # Move to the next position
        previous_was_alphabetic = next_is_alphabetic
        i += 1
    return words
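
For comparison, the for-loop alternative mentioned above might look roughly like the sketch below. The function name words_from_for is hypothetical and this is not code from the answer; it assumes the same definition of a word (a run of characters for which isalpha() is true) and handles the end of the string with an explicit check after the loop instead of the slice trick.

```python
def words_from_for(s):
    """Hypothetical for-loop variant: returns the runs of alphabetic
    characters found in s, in order."""
    words = []
    start = None  # index where the current word began, or None if outside a word
    for i, ch in enumerate(s):
        if ch.isalpha():
            if start is None:
                # i is the start of a word
                start = i
        elif start is not None:
            # i is the position after the end of a word
            words.append(s[start:i])
            start = None
    # A word that runs up to the end of the string is still open here
    if start is not None:
        words.append(s[start:])
    return words


sentence = "The weather is so lovely today. Jack (our Jack) – Jason - and Alex went to park..? "
result = words_from_for(sentence)
```

Like the while-loop version, this only reads the string by index and never rebuilds it, so the scan stays linear in the length of the input.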

CodePudding user response:

I think you might want something like this:

s = "The weather is so lovely today. Jack (our Jack) – Jason - and Alex went to park..? "

punc = '''!()-[]{};:'"\\,–,<>./?@#$%^&*_~'''
 
# Removing punctuation from the string
# Using loop + punctuation string
for i in s:
    if i in punc:
        s = s.replace(i, "")
print(s.split())

output:

['The', 'weather', 'is', 'so', 'lovely', 'today', 'Jack', 'our', 'Jack', 'Jason', 'and', 'Alex', 'went', 'to', 'park']
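
A hard-coded punctuation string only removes the characters it lists. As a hedged alternative, and assuming a word means a run of characters for which isalpha() is true, every non-alphabetic character can be turned into a space before splitting. The variable names cleaned and words are illustrative only.

```python
s = "The weather is so lovely today. Jack (our Jack) – Jason - and Alex went to park..? "

# Replace every non-alphabetic character with a space, building the
# cleaned string in a single pass, then split on whitespace.
cleaned = "".join(ch if ch.isalpha() else " " for ch in s)
words = cleaned.split()
print(words)
```

Note that this also drops digits, and that isalpha() accepts non-ASCII letters, which may or may not be what the assignment intends.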
