Finding output cells causing large file size in jupyter notebook

Time:08-10

I have a Jupyter notebook with ~400 cells. The total file size is 8 MB, so I'd like to suppress the largest output cells to reduce the overall file size.

Quite a few output cells could be causing this (mainly matplotlib and seaborn plots), so to avoid spending time on trial and error, is there a way of finding the size of each output cell? I'd like to keep as many output plots as possible, as I'll be pushing the work to GitHub for others to see.

Answer:

Here is my idea, spelled out with nbformat, for running in a Jupyter notebook cell. It lists the code cell numbers ordered from largest to smallest output (it first fetches an example notebook so there is something to try it on):

############### Get test notebook ########################################
import os
notebook_example = "matplotlib3d-scatter-plots.ipynb"
if not os.path.isfile(notebook_example):
    !curl -OL https://raw.githubusercontent.com/fomightez/3Dscatter_plot-binder/master/matplotlib3d-scatter-plots.ipynb
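### Use nbformat to estimate the output size of each code cell. ##########
import nbformat as nbf
ntbk = nbf.read(notebook_example, nbf.NO_CONVERT)
size_estimate_dict = {}
for cell in ntbk.cells:
    if cell.cell_type == 'code':
        # Key: the cell's execution count (the n in `In [n]:`).
        # Value: length of the outputs' string form, a rough proxy for size.
        # Note: cells that were never run all share the key None.
        size_estimate_dict[cell.execution_count] = len(str(cell.outputs))
# Execution counts ordered from largest estimated output to smallest.
out_size_info = [k for k, v in sorted(size_estimate_dict.items(), key=lambda item: item[1], reverse=True)]
out_size_info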

(To have a place to easily run that code go here and click on the launch binder button. When the session spins up, open a new notebook and paste in the code and run it. Static form of the notebook is here.)

The example I tried didn't include Plotly, but the approach seemed to work similarly on a notebook containing only Plotly plots. I don't know how it will handle a mix, though; the ordering may not be perfect when the outputs are of different kinds.
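One possible way to make the estimate less dependent on the kind of output, offered as an untested-in-the-wild sketch rather than part of the original idea: since an .ipynb file is JSON, each code cell's outputs can be re-serialized and counted in bytes, which is close to what they occupy on disk. The helper name `output_sizes` is made up for illustration, and the sketch assumes the notebook uses the version 4 layout with a top-level "cells" list.

```python
import json


def output_sizes(notebook_path):
    """Return (cell_index, execution_count, size_in_bytes) tuples, largest first.

    Hypothetical helper; assumes an nbformat 4 notebook (top-level "cells").
    """
    with open(notebook_path, encoding="utf-8") as fh:
        nb = json.load(fh)

    sizes = []
    for index, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue
        # Serialize the outputs back to JSON and count the UTF-8 bytes,
        # whatever mix of plot or text outputs the cell holds.
        n_bytes = len(json.dumps(cell.get("outputs", [])).encode("utf-8"))
        sizes.append((index, cell.get("execution_count"), n_bytes))

    # Largest outputs first.
    return sorted(sizes, key=lambda item: item[2], reverse=True)
```

Calling `output_sizes("matplotlib3d-scatter-plots.ipynb")[:10]` in a notebook cell would then show the ten largest code cells. The cell index is included alongside the execution count because cells that were never run have no execution count to identify them by.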
Hopefully this gives you an idea of how to do what you asked. The code example could be expanded further to use the retrieved size estimates so that nbformat writes a copy of the input notebook with the output removed for, say, the ten largest code cells.
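A minimal sketch of that expansion might look like the following. It reuses the rough `len(str(cell.outputs))` estimate, the function name `strip_largest_outputs` and its parameters are made up for illustration, and it assumes a version 4 notebook (so that `ntbk.cells` exists). The original file is left untouched; only the copy loses outputs.

```python
import nbformat


def strip_largest_outputs(src_path, dst_path, top_n=10):
    """Write a copy of the notebook with outputs cleared for the top_n
    code cells with the largest estimated output size.

    Hypothetical helper; returns the positions of the cleared cells.
    """
    ntbk = nbformat.read(src_path, as_version=nbformat.NO_CONVERT)

    # Pair each code cell's rough size estimate with its position.
    sizes = [
        (len(str(cell.outputs)), index)
        for index, cell in enumerate(ntbk.cells)
        if cell.cell_type == "code"
    ]

    # Keep only the top_n largest, then empty their outputs.
    largest = sorted(sizes, reverse=True)[:top_n]
    for _, index in largest:
        ntbk.cells[index]["outputs"] = []

    nbformat.write(ntbk, dst_path)
    return [index for _, index in largest]
```

For example, `strip_largest_outputs("analysis.ipynb", "analysis_slim.ipynb", top_n=10)` (both file names are placeholders) would produce a slimmer copy to push while keeping the full notebook locally. Comparing the two file sizes afterwards shows how much was saved.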
