Replacing String Text That Contains Double Quotes

I have a number series contained in a string, and I want to remove everything but the number series. But the double quotes are giving me errors. Here are examples of the strings and a sample command that I have used. All I want is 127.60-02-15, 127.60-02-16, etc.

<span id="lblTaxMapNum">127.60-02-15</span>
<span id="lblTaxMapNum">127.60-02-16</span>

I have tried all sorts of methods (e.g., triple double quotes, single quotes, quotes with backslashes, etc.). Here is one inelegant attempt that still doesn't work, because it leaves "> behind:

text = text.replace("<span id=", "")
text = text.replace("\"lblTaxMapNum\"", "")
text = text.replace("</span>", "")

Here is what I am working with (more specific code). I'm retrieving the data from a CSV and just trying to clean it up.

text = open("outputA.csv", "r")
text = ''.join([i for i in text])
text = text.replace("<span id=", "")
text = text.replace("\"lblTaxMapNum\"", "")
text = text.replace("</span>", "")
outputB = open("outputB.csv", "w")
outputB.writelines(text)
outputB.close()

CodePudding user response：

If you add a > to the second replace, it is still not elegant, but it works:

text = text.replace("<span id=", "")
text = text.replace("\"lblTaxMapNum\">", "")
text = text.replace("</span>", "")
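The backslashes are not actually required: a Python string can be delimited with single quotes, and then the double quotes inside it need no escaping. A minimal sketch, assuming every line carries exactly the opening tag shown in the question (the helper name strip_span is made up for illustration):

```python
def strip_span(text):
    # Single-quoted literal, so the inner double quotes need no escaping.
    text = text.replace('<span id="lblTaxMapNum">', '')
    return text.replace('</span>', '')

sample = (
    '<span id="lblTaxMapNum">127.60-02-15</span>\n'
    '<span id="lblTaxMapNum">127.60-02-16</span>\n'
)
cleaned = strip_span(sample)
print(cleaned)
```

Replacing the whole opening tag in one call also avoids the leftover "> from the question.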

Alternatively, you could use a regex:

import re

text = "<span id=\"lblTaxMapNum\">127.60-02-16</span>"

pattern = r">(\d+\.\d+-\d+-\d+)<"  # the group in parentheses matches the number; \. is a literal dot
match = re.search(pattern, text)  # this searches for the pattern in the text

print(match.group(1))  # this prints out only the number
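To pull every number out of the whole file contents in one go, re.findall may be more convenient than re.search. A sketch, assuming the numbers always have the shape digits.digits-digits-digits as in the question's examples:

```python
import re

text = (
    '<span id="lblTaxMapNum">127.60-02-15</span>\n'
    '<span id="lblTaxMapNum">127.60-02-16</span>\n'
)

# \. matches a literal dot; \d+ requires at least one digit in each part.
pattern = re.compile(r"\d+\.\d+-\d+-\d+")
numbers = pattern.findall(text)  # list of every match, in order
print(numbers)
```

The resulting list can then be written out one number per line with "\n".join(numbers).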

CodePudding user response：

You can use BeautifulSoup.

from bs4 import BeautifulSoup

strings = ['<span id="lblTaxMapNum">127.60-02-15</span>', '<span id="lblTaxMapNum">127.60-02-16</span>']

# Use BeautifulSoup to extract the text from the <span> tags
for string in strings:
    soup = BeautifulSoup(string, 'html.parser')
    number_series = soup.span.text
    print(number_series)

output:

127.60-02-15
127.60-02-16
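If installing bs4 is not an option, a similar result can probably be had with the standard library's html.parser module. This is a different technique from BeautifulSoup, shown only as a sketch; the class name SpanText is made up for illustration, and it assumes the spans are not nested:

```python
from html.parser import HTMLParser

class SpanText(HTMLParser):
    """Collects the text found inside <span> elements."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False

    def handle_data(self, data):
        # Keep only text that appears between <span> and </span>.
        if self.in_span:
            self.values.append(data)

parser = SpanText()
parser.feed(
    '<span id="lblTaxMapNum">127.60-02-15</span>\n'
    '<span id="lblTaxMapNum">127.60-02-16</span>\n'
)
parser.close()
print(parser.values)
```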

CodePudding user response：

It's a little bit long; I hope my comments are readable.

path = r'c:\users\GH\desktop\test.csv'  # assumed: the cleaned text is written back to this same file
with open(path, 'r') as f:
    text = f.read().strip()
stRange = '<'  # we remove the junk text from the file using the index-range method,
endRange = '>'  # which means deleting everything between < and >, brackets included
# cast the data to a list so that it can be modified by index
text = list(text)
i = 0
length = len(text)
# the text is modified while we iterate over it,
# so the loop bound lives in a variable that can be updated
# (this assumes every '>' is preceded by a matching '<')
while i < length:
    if text[i] == stRange:
        stRange = text.index(text[i])  # integer index of the opening '<'
    elif text[i] != endRange and text[i] != stRange:
        i += 1
        continue
    elif text[i] == endRange:
        endRange = text.index(text[i])  # an integer to be used as the end of the range
        del text[stRange : endRange + 1]  # delete the unwanted characters
        length = len(text)  # get the new length of the data
        stRange = '<'  # and assign the marker characters to their variables again
        endRange = '>'
        i = 0  # restart the scan from the first character
        continue
    i += 1
result = ''.join(text)  # join the remaining characters back into a string
with open(path, 'w') as f:
    f.write(result)
with open(path, 'r') as f:
    print('the result ==>')
    print(f.read())
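The same idea (drop everything between < and >) can also be written as a single pass that never deletes from the list or restarts the scan. A minimal sketch; the function name strip_tags is hypothetical, and it assumes tags do not contain a literal > inside attribute values:

```python
def strip_tags(text):
    """Return text with every <...> range removed, in a single pass."""
    kept = []
    inside_tag = False
    for ch in text:
        if ch == '<':
            inside_tag = True       # start skipping characters
        elif ch == '>' and inside_tag:
            inside_tag = False      # the tag is closed; resume keeping characters
        elif not inside_tag:
            kept.append(ch)
    return ''.join(kept)

cleaned = strip_tags('<span id="lblTaxMapNum">127.60-02-15</span>')
print(cleaned)
```

Because each character is visited once, this stays linear in the size of the file, whereas restarting the scan after every deletion can become slow on large inputs.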