Home > OS >  Extracting excel data after unzipping using rubyzip
Extracting excel data after unzipping using rubyzip

Time:12-24

I'm trying to get spreadsheet data from zipped .xlsx files. I'm using rubyzip to access the contents of the zipfile

Zip::File.open(file_path) do |zip_file|
   zip_file.each do |entry|
     *process entry*
   end
end

My problem is that rubyzip gives a Zip::Entry object, which, I cant get to work with gems like roo or creek.

I've done something similar, but with .csv file. This was as simple as CSV.parse(entry.get_input_stream.read). However, that just gives me a string of encoded gibberish when using it on an .xlsx file.

I've looked around and the closest answer I got was temporarily extracting the files, but I want to avoid doing this since the files can get pretty large.

Does anyone have any suggestions? Thanks in advance.

CodePudding user response:

So what you need to do is convert the stream into an IO object that Roo can understand.

To determine if the object passed to Roo::Spreadsheet.open is a "stream" Roo uses the following method:

 def is_stream?(filename_or_stream)
    filename_or_stream.respond_to?(:seek)
  end

Since a Zip::InputStream does not respond to seek you cannot use this object directly. To get around this we simply need an object that does respond to seek (like a StringIO)

We can just read the input stream into the StringIO directly:

stream = StringIO.new(entry.get_input_stream.read)

Or the Zip library also provides a method to copy a Zip::InputStream to another IO object through the IOExtras module, which I think reads fairly nicely as well.

Knowing all of the above we can implement as follows:

Zip::File.open(file_path) do |zip_file|
   zip_file.each do |entry|
     # make sure Roo can handle the file (at least based on the extension)
     ext = File.extname(entry.name)&.to_sym 
     next unless Roo::CLASS_FOR_EXTENSION[ext]
     # stream = StringIO.new(entry.get_input_stream.read)
     ::Zip::IOExtras.copy_stream(stream = StringIO.new, entry.get_input_stream)
     spreadsheet = Roo::Spreadsheet.open(stream, extension: ext)
     # process file
   end
end
  • Related