Generator comprehension with open function

I'm trying to figure out the best way to use a generator when parsing a file line by line. Which of these two generator expressions is better?

First option.

with open('some_file') as file:
    lines = (line for line in file)

Second option.

lines = (line for line in open('some_file'))

I know both will produce the same results, but which one will be faster or more efficient?

CodePudding user response：

You can't create a generator expression inside a context manager (with statement) and then consume it after the block has ended.

Generators are lazy. They will not actually read their source data until something requests an item from them.

This appears to work:

with open('some_file') as file:
    lines = (line for line in file)

but when you actually try to read a line later in your program

for line in lines:
    print(line)

it will fail with ValueError: I/O operation on closed file.

This is because the context manager has already closed the file - that's its sole purpose in life - and the generator does not start reading until the for loop actually requests data.
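The failure can be reproduced in a few lines. This is a minimal sketch, assuming a throwaway temporary file in place of 'some_file'; the names path, closed_inside and error are only there for illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\n")
    path = tmp.name

with open(path) as file:
    lines = (line for line in file)  # nothing has been read yet
    closed_inside = file.closed      # still open inside the with block

# The with block has exited, so the file is closed before the first read.
try:
    next(lines)
    error = None
except ValueError as exc:
    error = str(exc)

print(closed_inside, error)
os.remove(path)
```

Creating the generator expression succeeds; only the first request for a line touches the closed file.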

Your second suggestion

lines = (line for line in open('some_file'))

suffers from the opposite problem. You open() the file, but nothing ever close()s it, and you can't do it manually because you have no reference to the file object. The file stays open until the garbage collector happens to reclaim that object, which is not something you control. That's the very situation that context managers fix.
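To see that running the generator to the end does not close anything, here is a small sketch that keeps a reference to the file object (which the one-liner does not) purely so it can be inspected. The temporary file and the variable names are assumptions for the demonstration.

```python
import os
import tempfile

# Throwaway input file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\n")
    path = tmp.name

file = open(path)                 # keep a reference, unlike the one-liner
lines = (line for line in file)
collected = list(lines)           # exhaust the generator completely

# Exhausting the generator does not close the underlying file.
still_open = not file.closed
print(collected, still_open)

file.close()                      # only possible because we kept the reference
os.remove(path)
```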

Overall, if you want to read the file, you can either ... read the file:

with open('some_file') as file:
    lines = list(file)

or you can use a real generator:

def lazy_reader(*args, **kwargs):
    with open(*args, **kwargs) as file:
        yield from file

and then you can do

for line in lazy_reader('some_file', encoding="utf8"):
    print(line)

and lazy_reader() will close the file once the last line has been read (or, if you stop early, when the generator itself is closed or garbage-collected).
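If the loop might stop before the last line, one option is to close the generator explicitly, for example with contextlib.closing. The sketch below assumes a temporary file and adds a hypothetical opened list to lazy_reader() only so the example can check whether the file was closed; it is not needed in real code.

```python
import os
import tempfile
from contextlib import closing

# Throwaway input file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")
    path = tmp.name

opened = []  # hypothetical bookkeeping, only used to inspect the file objects

def lazy_reader(*args, **kwargs):
    with open(*args, **kwargs) as file:
        opened.append(file)
        yield from file

# Case 1: read everything; the with block inside the generator then exits.
for line in lazy_reader(path):
    pass
closed_after_exhaustion = opened[-1].closed

# Case 2: stop after the first line; closing() calls reader.close() on exit,
# which unwinds the with block inside the generator.
with closing(lazy_reader(path)) as reader:
    first = next(reader)
closed_after_early_exit = opened[-1].closed

print(closed_after_exhaustion, first, closed_after_early_exit)
os.remove(path)
```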

CodePudding user response：

If you want to test stuff like this, I recommend looking at the timeit module.

Let's set up working versions of your two tests, and I will add some additional options that all perform about the same.

Here are several options:

def test1(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return [line for line in file_in]

def test2(file_path):
    return [line for line in open(file_path, "r", encoding="utf-8")]

def test3(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return file_in.readlines()

def test4(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return list(file_in)

def test5(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        yield from file_in

Let's test them with a text file containing ten copies of the complete works of Shakespeare, which I happen to have for tests like this.

If I do:

print(test1('shakespeare2.txt') == test2('shakespeare2.txt'))
print(test1('shakespeare2.txt') == test3('shakespeare2.txt'))
print(test1('shakespeare2.txt') == test4('shakespeare2.txt'))
print(test1('shakespeare2.txt') == list(test5('shakespeare2.txt')))

I see that all tests produce the same results.

Now let's time them:

import timeit

setup = '''
file_path = "shakespeare2.txt"

def test1(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return [line for line in file_in]

def test2(file_path):
    return [line for line in open(file_path, "r", encoding="utf-8")]

def test3(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return file_in.readlines()

def test4(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        return list(file_in)

def test5(file_path):
    with open(file_path, "r", encoding="utf-8") as file_in:
        yield from file_in
'''

print(timeit.timeit("test1(file_path)", setup=setup, number=100))
print(timeit.timeit("test2(file_path)", setup=setup, number=100))
print(timeit.timeit("test3(file_path)", setup=setup, number=100))
print(timeit.timeit("test4(file_path)", setup=setup, number=100))
print(timeit.timeit("list(test5(file_path))", setup=setup, number=100))

On my laptop this prints (total seconds for 100 runs each):

9.65
9.79
9.29
9.08
9.85

The differences are small enough to suggest that it does not matter which one you pick from a performance perspective. So there is no reason to use your test2() strategy, which never closes the file :-)

Note, though, that test5() (credit to @tomalak) might be important from a memory management perspective, because it yields one line at a time instead of building the whole list in memory.
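The memory difference can be made visible with the tracemalloc module. This is a rough sketch under stated assumptions: a generated temporary file stands in for the Shakespeare text, and read_all(), read_lazily() and peak_bytes() are hypothetical helper names that mirror test4() and test5(). The exact numbers will vary by machine and Python version.

```python
import os
import tempfile
import tracemalloc

# Generated throwaway file standing in for a large text.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.writelines(f"line {i}\n" for i in range(50_000))
    path = tmp.name

def read_all(file_path):
    # Same idea as test4(): the whole file ends up in one list.
    with open(file_path, "r", encoding="utf-8") as file_in:
        return list(file_in)

def read_lazily(file_path):
    # Same idea as test5(): one line at a time.
    with open(file_path, "r", encoding="utf-8") as file_in:
        yield from file_in

def peak_bytes(count_lines):
    # Run the callable while tracing and report the peak traced allocation.
    tracemalloc.start()
    count = count_lines()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return count, peak

eager_count, eager_peak = peak_bytes(lambda: len(read_all(path)))
lazy_count, lazy_peak = peak_bytes(lambda: sum(1 for _ in read_lazily(path)))

print("list:     ", eager_count, "lines, peak", eager_peak, "bytes")
print("generator:", lazy_count, "lines, peak", lazy_peak, "bytes")
os.remove(path)
```

The advantage only holds while you process lines as they arrive; wrapping the generator in list(), as the timing test does, gives it up again.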