Python on Windows throws AttributeError: 'NoneType' object has no attribute 'group'

This is a sample from the head of my file:

1.1.1.0 - 1.1.1.255
2.2.2.0 - 2.2.2.255
3.3.3.0 - 3.3.3.100

This is my Python regex code:

import re
import netaddr

regex = r'(?P<start>\w+\.\w+\.\w+\.\w+) \- (?P<end>\w+\.\w+\.\w+\.\w+)'
with open('sorted_french_ips.txt', 'r') as file:
    with open('sorted_french_edited_ips.txt', 'a') as sorted_french_edited_ips:
        for line in file:
            find = re.match(regex, line)
            start = find.group('start')
            end = find.group('end')
            cidr = netaddr.iprange_to_cidrs(start, end)
            cidr = str(cidr)
            cidr = cidr.replace(', IPNetwork', '\n')
            cidr = cidr.replace('[IPNetwork(\'', '')
            cidr = cidr.replace('\')]', '')
            cidr = cidr.replace('(\'', '')
            cidr = cidr.replace('\')', '')
            sorted_french_edited_ips.write(f'{cidr}\n')

When I run this code on Windows, I get this error:

Traceback (most recent call last):
  File "C:\Users\Saeed\Desktop\python-projects\win.py", line 131, in <module>
    write()
  File "C:\Users\Saeed\Desktop\python-projects\win.py", line 81, in write
    start = find.group('start')
AttributeError: 'NoneType' object has no attribute 'group'

But if I run the same code on Linux, it works fine.

Why do I get this error on Windows? Do regular expressions behave differently on Windows and Linux? Should the pattern be changed?

User response:

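Your input text file is saved in a Unicode encoding (such as UTF-8) with a byte order mark (BOM) at the start. When you open a file in Python on Windows without an encoding argument, the system default encoding is used. That is typically a legacy code page such as cp1252 rather than UTF-8, so the BOM bytes are decoded as ordinary characters and end up as part of the text.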
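To check which encoding open() picks by default on a given machine, a quick probe along these lines may help. It is only a sketch: the name default_encoding is illustrative, and the value printed depends on the OS, locale, and Python settings.

```python
import os

# Open any file in text mode without an encoding argument and ask
# the file object which encoding it was given by default.
with open(os.devnull, 'r') as probe:
    default_encoding = probe.encoding

# Often a code page such as cp1252 on Windows, and UTF-8 on most Linux systems.
print(default_encoding)
```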

The re.match function only finds a match if the match occurs at the start of the string.

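The first line does not start with the pattern you defined; it starts with the characters produced by the BOM sequence, so re.match returns None and calling .group() on it raises the AttributeError.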
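The failure can be reproduced without the original file. The sketch below assumes the file is UTF-8 with a BOM and that the Windows default is cp1252 (the actual code page depends on the system locale); the variable names are illustrative.

```python
import codecs
import re

regex = r'(?P<start>\w+\.\w+\.\w+\.\w+) \- (?P<end>\w+\.\w+\.\w+\.\w+)'

# First line of a UTF-8 file that begins with a BOM (bytes EF BB BF).
raw = codecs.BOM_UTF8 + b'1.1.1.0 - 1.1.1.255\n'

# Decoded as cp1252 (assumed Windows default), the BOM turns into three
# stray characters at the start of the line.
as_cp1252 = raw.decode('cp1252')

# Decoded as utf-8-sig, the BOM is recognised and removed.
as_utf8_sig = raw.decode('utf-8-sig')

bad = re.match(regex, as_cp1252)     # pattern cannot match at position 0
good = re.match(regex, as_utf8_sig)  # line now starts with the IP address

print(repr(as_cp1252[:3]), bad)
print(good.group('start'), good.group('end'))
```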

To fix the problem, make sure you read the file with the right encoding.

If your file is encoded as UTF-8 with a BOM, use:

with open('sorted_french_ips.txt', 'r', encoding="utf-8-sig") as file:

On Linux, the default encoding is usually UTF-8 and text files are usually saved without a BOM, so the same code works there without setting the encoding argument explicitly.
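Putting the pieces together, a reading loop could look roughly like this. It is a sketch rather than a drop-in replacement: read_ranges and RANGE_RE are hypothetical names, the netaddr conversion from the question is left out so the example runs with the standard library only, and the demo writes a temporary BOM-prefixed file instead of using sorted_french_ips.txt.

```python
import codecs
import os
import re
import tempfile

RANGE_RE = re.compile(r'(?P<start>\w+\.\w+\.\w+\.\w+) \- (?P<end>\w+\.\w+\.\w+\.\w+)')

def read_ranges(path):
    """Return (start, end) pairs from a file of 'a.b.c.d - e.f.g.h' lines."""
    ranges = []
    # utf-8-sig strips a leading BOM if present and reads plain UTF-8 otherwise.
    with open(path, 'r', encoding='utf-8-sig') as file:
        for line_number, line in enumerate(file, start=1):
            found = RANGE_RE.match(line)
            if found is None:
                # Report malformed lines and skip blank ones instead of
                # crashing on found.group().
                if line.strip():
                    print(f'line {line_number}: no match for {line!r}')
                continue
            ranges.append((found.group('start'), found.group('end')))
    return ranges

# Demo: a temporary file that starts with a UTF-8 BOM and contains a blank line.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(codecs.BOM_UTF8 + b'1.1.1.0 - 1.1.1.255\n\n2.2.2.0 - 2.2.2.255\n')

demo_ranges = read_ranges(tmp.name)
os.remove(tmp.name)
print(demo_ranges)
```

Checking for None before calling .group() also means that a stray blank or malformed line later in the file is reported or skipped rather than ending the run with the same AttributeError.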