Read contents from zipfile, apply transformation and write to new zip file in Python

Time:08-20

I have a zip file which contains a text file (with millions of lines). I need to read it line by line, apply some transformation to each line, write the result to a new file, and zip it.

import io
import zipfile

with zipfile.ZipFile("orginal.zip") as zf, zipfile.ZipFile("new.zip", "w") as new_zip:
    with io.TextIOWrapper(zf.open("orginal_file.txt"), encoding="UTF-8") as fp, open("new.txt", "w") as new_txt:
        for line in fp:
            new_txt.write(f"{line} - NEW")  # Some transformation

        new_zip.writestr("new.txt", new_txt)  # This line raises the error

But I am getting the following error at new_zip.writestr("new.txt", new_txt):

TypeError: object of type '_io.TextIOWrapper' has no len()

  1. If I do the transformation using the above method, will there be any out-of-memory issue (since the file can have millions of lines)?
  2. How do I identify the first line (since the first line is a header record)?
  3. When I write using new_txt.write(f"{line} - NEW"), - NEW comes first in the line (for example, if the line is 003000000011000000, the output will be - NEW003000000011000000).
  4. How can we ensure file integrity (for example, to check that all lines were written to the new file)?
  5. What causes the TypeError: object of type '_io.TextIOWrapper' has no len() error?

Thank You.

CodePudding user response:

When you're doing:

new_zip.writestr("new.txt", new_txt)

you are passing the file object new_txt as the data to store in the zip under the name "new.txt". writestr expects the contents themselves, not an open file, and it tries to take the length of what it is given. That is what produces the error: TypeError: object of type '_io.TextIOWrapper' has no len(). From the docs:

Write a file into the archive. The contents is data, which may be either a str or a bytes instance;

Instead, what you probably want to do is use write(file):

new_zip.write("new.txt")

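which should write the file "new.txt" from disk into the zip file. Make this call after the with block that opens new_txt has exited, so that the file is closed and all buffered lines have reached the disk first.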
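Putting that together, here is a minimal sketch of the temporary-file approach. The helper name rewrite_via_temp_file and its parameters are made up for illustration, and the " - NEW" suffix just stands in for your real transformation.

```python
import io
import os
import tempfile
import zipfile


def rewrite_via_temp_file(src_zip, member, dst_zip, new_member):
    """Transform `member` of src_zip line by line and store it in dst_zip.

    Hypothetical helper: the transformed text goes to a temporary file
    first, and that file is added to the new archive once it is closed.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with zipfile.ZipFile(src_zip) as zf:
            with io.TextIOWrapper(zf.open(member), encoding="utf-8") as fp, \
                    open(tmp_path, "w", encoding="utf-8", newline="") as out:
                for line in fp:
                    # Placeholder transformation: suffix before the newline.
                    out.write(line.rstrip("\n") + " - NEW\n")
        # The temporary file is closed and flushed here, so it is safe to add.
        with zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as new_zip:
            new_zip.write(tmp_path, arcname=new_member)
    finally:
        os.remove(tmp_path)
```

You would call it as rewrite_via_temp_file("orginal.zip", "orginal_file.txt", "new.zip", "new.txt").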

Regarding your other questions:

If I do the transformation using the above method, will there be any out-of-memory issue (since the file can have millions of lines)?

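Probably not. Iterating over the file object reads one line at a time and each transformed line goes straight to the output file, so memory use does not grow with the number of lines.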
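If you would rather not create new.txt on disk at all, one option is to stream straight into the new archive with ZipFile.open(name, "w"), which is available from Python 3.6. This is only a sketch: the function name stream_transform is hypothetical and the suffix is a placeholder for your transformation.

```python
import io
import zipfile


def stream_transform(src_zip, member, dst_zip, new_member):
    """Copy `member` into a new zip, transforming one line at a time."""
    with zipfile.ZipFile(src_zip) as zf, \
            zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as new_zip:
        # For output that may exceed 2 GiB, new_zip.open() also accepts
        # force_zip64=True.
        with io.TextIOWrapper(zf.open(member), encoding="utf-8") as fp, \
                io.TextIOWrapper(new_zip.open(new_member, "w"),
                                 encoding="utf-8", newline="") as out:
            for line in fp:
                # Only the current line is held in memory.
                out.write(line.rstrip("\n") + " - NEW\n")
```

Both archives are read and written incrementally here, so the size of the text file should not matter much for memory.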

How do I identify the first line (since the first line is a header record)?

Use a flag that starts out true and gets cleared in the first iteration of the line loop.
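A small sketch of the flag idea, written as a generator so it can wrap any iterable of lines. The name transform_lines is made up, and passing the header through unchanged is just an assumption about what you want to do with it.

```python
def transform_lines(lines):
    """Yield the header unchanged and every other line with a suffix."""
    first = True  # True only while the first line is being handled
    for line in lines:
        text = line.rstrip("\n")
        if first:
            first = False
            yield text + "\n"          # header record: passed through as-is
        else:
            yield text + " - NEW\n"    # placeholder transformation
```

Inside the loop from the question this would look like: for out_line in transform_lines(fp): new_txt.write(out_line).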

When I write using new_txt.write(f"{line} - NEW"), - NEW comes first in the line (for example, if the line is 003000000011000000, the output will be - NEW003000000011000000).

Each line you read from the file still has its newline \n at the end, so f"{line} - NEW" puts - NEW after that newline. The - NEW you see at the front of a line is therefore the suffix of the previous line you wrote. Strip the existing newline from the end of the input string, then add a \n at the end of the transformed line.
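Here is a tiny illustration of the difference, using the sample value from the question. The variable names are only for the example.

```python
# A line as yielded when iterating over a text file: the newline is included.
line = "003000000011000000\n"

# Original transformation: the suffix lands after the newline,
# so it shows up at the start of the next output line.
broken = f"{line} - NEW"

# Strip the newline, append the suffix, then add the newline back.
stripped = line.rstrip("\n")
fixed = f"{stripped} - NEW\n"

print(repr(broken))  # '003000000011000000\n - NEW'
print(repr(fixed))   # '003000000011000000 - NEW\n'
```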

How can we ensure file integrity (for example, to check that all lines were written to the new file)?

Count the lines in the original and in the new file and compare the two numbers. Ideally, unless some error occurs, all lines will be read and written without you having to worry about it.
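If you want an explicit check, something along these lines could work. It is a sketch with hypothetical helper names (count_lines, same_line_count); it only compares line counts and runs the archive's own CRC check via ZipFile.testzip(), so it does not validate what the lines contain.

```python
import io
import zipfile


def count_lines(zip_path, member):
    """Count the lines of a text member without loading it into memory."""
    with zipfile.ZipFile(zip_path) as zf:
        with io.TextIOWrapper(zf.open(member), encoding="utf-8") as fp:
            return sum(1 for _ in fp)


def same_line_count(src_zip, src_member, dst_zip, dst_member):
    """Return True if the new archive is readable and the line counts match."""
    with zipfile.ZipFile(dst_zip) as zf:
        # testzip() returns the name of the first corrupt member, or None.
        if zf.testzip() is not None:
            return False
    return count_lines(src_zip, src_member) == count_lines(dst_zip, dst_member)
```

A one-line-per-line transformation should leave the two counts equal, so a mismatch points to lines being dropped or merged.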
