how to edit the code to be able to read non-English characters from CSV file

Time:09-30

This code downloads images from the links in a CSV column called "link" and names each file after the value in another column called "name" (`title` in the code below). It stops working when it encounters a non-English character. I want it to handle non-English characters as well.

Here is the code:

import urllib.request
import csv
import os

with open('booklogo.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    
    for row in reader:
        print(row)
        if row["link"] != '' and row["title"] != '':
            name, ext = os.path.splitext(row['link'])
            if ext == '':
                ext = ".png"
            title_filename = f"{row['title']}{ext}".replace('/', '-')
            urllib.request.urlretrieve(row['link'], title_filename)

Here is the error:


OSError
Input In [5], in <cell line: 5>()
     13 ext = ".png"
     14 title_filename = f"{row['title']}{ext}".replace('/', '-')
---> 15 urllib.request.urlretrieve(row['link'], title_filename)

File ~\anaconda3\lib\urllib\request.py:249, in urlretrieve(url, filename, reporthook, data)
    247 # Handle temporary file setup.
    248 if filename:
--> 249     tfp = open(filename, 'wb')
    250 else:
    251     tfp = tempfile.NamedTemporaryFile(delete=False)

OSError: [Errno 22] Invalid argument: 'Albert ?eská republika.png

CodePudding user response:

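I think you're correct (in your comment below) that it's probably the question mark: Windows does not allow `?` in file names, so the `open(filename, 'wb')` call inside `urlretrieve` fails with Errno 22.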
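To check which characters in a title would trip this up, here is a minimal sketch. The helper name `find_invalid_chars` and the constant `WINDOWS_RESERVED` are made up for illustration; the character set is the one Windows rejects in file names.

```python
# Characters Windows does not accept in a file name.
# ASCII control characters (code points below 32) are rejected as well.
WINDOWS_RESERVED = set('<>:"/\\|?*')


def find_invalid_chars(name):
    """Hypothetical helper: list the characters in `name` that Windows rejects."""
    return [ch for ch in name if ch in WINDOWS_RESERVED or ord(ch) < 32]


# The file name from the traceback contains exactly one offending character.
print(find_invalid_chars('Albert ?eská republika.png'))
```

Note that `á` is not reported: accented letters are fine in file names, only the `?` is the problem here.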

You need to sanitize your filename. Python's standard library does not include a function for this, so we'll borrow from the most popular answer to the same question, "Turn a string into a valid filename?".

You'll need to add this function to your file:

import unicodedata
import re

def slugify(value, allow_unicode=False):
    """
    Taken from https://github.com/django/django/blob/master/django/utils/text.py
    Convert to ASCII if 'allow_unicode' is False. Convert spaces or repeated
    dashes to single dashes. Remove characters that aren't alphanumerics,
    underscores, or hyphens. Convert to lowercase. Also strip leading and
    trailing whitespace, dashes, and underscores.
    """
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value.lower())
    return re.sub(r'[-\s]+', '-', value).strip('-_')

Then modify your existing code, like:

...
# Sanitize the filename. This strips periods too, so append ext afterwards.
title_filename = slugify(row['title'])
title_filename += ext
...
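Putting the pieces together, the following is a sketch rather than a definitive fix. It assumes the CSV is saved as UTF-8 (hence `encoding='utf-8-sig'`, which also tolerates a byte-order mark), and it passes `allow_unicode=True` so that letters such as `Č` are kept instead of being reduced to ASCII. The helper names `build_filename` and `download_all` are hypothetical. If the `?` is already stored in the CSV itself, because the file was exported in an encoding that cannot represent `Č`, the file would have to be re-exported as UTF-8 first.

```python
import csv
import os
import re
import unicodedata
import urllib.request


def slugify(value, allow_unicode=False):
    """Same logic as the slugify function shown earlier in this answer."""
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value.lower())
    return re.sub(r'[-\s]+', '-', value).strip('-_')


def build_filename(title, link):
    """Hypothetical helper: sanitized title plus the link's extension."""
    _, ext = os.path.splitext(link)
    if ext == '':
        ext = '.png'
    # slugify removes periods, so the extension is appended afterwards.
    return slugify(title, allow_unicode=True) + ext


def download_all(csv_path):
    """Hypothetical helper: download every image listed in the CSV."""
    # Assumption: the file is UTF-8; adjust the encoding if it is not.
    with open(csv_path, newline='', encoding='utf-8-sig') as csvfile:
        for row in csv.DictReader(csvfile):
            if row['link'] != '' and row['title'] != '':
                filename = build_filename(row['title'], row['link'])
                urllib.request.urlretrieve(row['link'], filename)


if __name__ == '__main__':
    download_all('booklogo.csv')
```

With the default `allow_unicode=False`, accented letters are reduced to their closest ASCII form instead, which is the safer choice if the files must also be handled by tools that cope poorly with non-ASCII names.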