Writing large Excel in Java causing high CPU usage using apache-poi

Time:01-29

I am writing a large data set: around half a million records with 25 columns each.

I am using the apache-poi streaming workbook (SXSSFWorkbook) to write the data from a list to an Excel file. When tested locally, it causes high CPU spikes on the local machine too. The spikes appear to occur while the workbook data is written to the file:

workbook.write(fileOutputStream); // causes the CPU spikes (debugged and confirmed)

It also causes high CPU usage in our cloud app (deployed in Kubernetes), and the application restarts because it hits its resource limits. It is a simple app configured with 2042Mi memory and 1024m CPU.

Is there any way to write a large Excel file efficiently, without such an impact on CPU, memory, and Java heap?

(NOTE: We can't use CSV or other formats, as the business requirement is for Excel files.)

Code using:

import java.io.File;
import java.io.FileOutputStream;
import java.util.List;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.CellStyle;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.springframework.stereotype.Service;

import com.king.medicalcollege.model.Medico;

@Service
public class ExcelWriterService {

    // file is an empty file already created
    // Large List around 500K records of medico data [Medico is POJO]

    public File writeData(File file, List<Medico> medicos) {

        SXSSFWorkbook sxssfWorkbook = null;
        try (SXSSFWorkbook workbook = sxssfWorkbook = new SXSSFWorkbook(1);
                FileOutputStream fileOutputStream = new FileOutputStream(file)) {

            Sheet sheet = workbook.createSheet();
            CellStyle cellStyle = workbook.createCellStyle();
            int rowNum = 0;
            for (Medico medico : medicos) {
                Row row = sheet.createRow(rowNum);
                // just adding the POJO values (25 fields) into the row
                addDataInRow(medico, row, cellStyle);
                rowNum++;
            }

            //workbook.write causing CPU spike
            workbook.write(fileOutputStream);

            workbook.dispose();

        } catch (Exception exception) {
            return null;
        } finally {
            if (sxssfWorkbook != null) {
                sxssfWorkbook.dispose();
            }
        }

        return file;
    }

    private void addDataInRow(Medico medico, Row row, CellStyle cellStyle) {
        Cell cell_0 = row.createCell(0);
        cell_0.setCellValue(medico.getFirstName());
        cell_0.setCellStyle(cellStyle);
        
        Cell cell_1 = row.createCell(1);
        cell_1.setCellValue(medico.getMiddleName());
        cell_1.setCellStyle(cellStyle);
        
        Cell cell_2 = row.createCell(2);
        cell_2.setCellValue(medico.getLastName());
        cell_2.setCellStyle(cellStyle);
        
        Cell cell_3 = row.createCell(3);
        cell_3.setCellValue(medico.getFirstName());
        cell_3.setCellStyle(cellStyle);
        
        //...... around 25 columns will be added like this
    }
}

CodePudding user response:

You seem to be doing the right thing by giving SXSSFWorkbook a window size (although 1 might be too small and cause problems of its own). The workbook should be flushed to disk whenever the number of rows exceeds the limit you set, which reduces memory usage. I doubt there is a workaround for reducing CPU usage, though.

You can try to limit your memory usage by adjusting the JVM parameters so that it doesn't trigger the K8s limit. Have a look at these: -Xmx, -Xms, -XX:MaxRAM, -XX:+UseSerialGC
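To see whether such flags and the container limits actually take effect, it can help to print what the JVM itself believes it has available. The following is a minimal sketch using only the standard `Runtime` API; the class name `JvmLimitsCheck` is made up for illustration.

```java
public class JvmLimitsCheck {

    /** Maximum heap the JVM will attempt to use, in MiB (driven by -Xmx or container limits). */
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    /** Number of processors the JVM believes it may use. */
    static int cpus() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("max heap (MiB): " + maxHeapMiB());
        System.out.println("available CPUs: " + cpus());
    }
}
```

Running this inside the pod with the same JVM options as the application shows whether the heap ceiling stays below the memory limit of the container.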

Have you considered using an alternate library for writing Excel files? For example, have a look at this SO answer: Are there any alternatives to using Apache POI Java for Microsoft Office?

CodePudding user response:

The question "Is there any way to write a large excel file without impacting CPU and Memory?" is something like: Let me have my cake and eat it too. In German we say: Wash me but don't get me wet. In other words: all content that is to end up in a file on a computer must pass through the CPU and must be in memory while it is processed.

To get an idea of the amount of memory we are talking about, let's do a simple calculation of what it means to have 500,000 rows with 25 columns:

                     Cell value                                                                     Length (bytes)  Length x 500,000  KiByte        MiByte
Single value         some cell value                                                                15
25 columns           some cell value, some cell value, some cell value, some cell value, some ...   425             212,500,000       207,519.5313  202.6557922
XML of single value  <c r="C99" t="inlineStr" s="9"><is><t>some cell value</t></is></c>             66
XML of 25 columns    <c r="C99" t="inlineStr" s="9"><is><t>some cell value</t></is></c><c r="...    1,650           825,000,000       805,664.0625  786.781311

This shows that, even as plain text only, 500,000 rows with 25 columns, each cell holding the value "some cell value", take 202.6557922 MiByte of memory.

But an Excel file is not simply plain text. The current Office Open XML Excel format (*.xlsx) stores XML, and that needs much more memory because of the XML overhead. The above shows that 500,000 rows with 25 columns, each cell holding the value "some cell value", take 786.781311 MiByte of memory when stored as XML.

Those 786.781311 MiByte are needed only to store the cells; there is more overhead to store the rows, the sheets, the workbook, the styles, the relations, ...
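The arithmetic above can be reproduced with a few lines of Java. This is only a sketch of the estimate: it assumes one byte per character (ASCII text encoded as UTF-8) and a two-character separator after each plain-text value, and the class name `SheetSizeEstimate` is made up for illustration.

```java
public class SheetSizeEstimate {
    static final long ROWS = 500_000L;
    static final int COLUMNS = 25;
    static final String VALUE = "some cell value";

    /** Bytes of one plain-text row: 25 values, each followed by ", ". */
    static long plainRowBytes() {
        return (long) (VALUE + ", ").length() * COLUMNS;
    }

    /** Bytes of one row when every cell is an inline-string XML element. */
    static long xmlRowBytes() {
        String cell = "<c r=\"C99\" t=\"inlineStr\" s=\"9\"><is><t>" + VALUE + "</t></is></c>";
        return (long) cell.length() * COLUMNS;
    }

    /** Bytes needed for all rows when a single row takes bytesPerRow bytes. */
    static long totalBytes(long bytesPerRow) {
        return bytesPerRow * ROWS;
    }

    static double toMiB(long bytes) {
        return bytes / 1024.0 / 1024.0;
    }

    public static void main(String[] args) {
        System.out.printf("plain text: %d bytes/row, %.2f MiB total%n",
                plainRowBytes(), toMiB(totalBytes(plainRowBytes())));
        System.out.printf("XML:        %d bytes/row, %.2f MiB total%n",
                xmlRowBytes(), toMiB(totalBytes(xmlRowBytes())));
    }
}
```

These are sizes of the serialized data only. Java objects that hold the same values on the heap will typically take more space than that.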

SXSSFWorkbook claims to be a streaming approach. But it only streams the cells into rows into sheets as temporary files. It additionally needs memory to hold the workbook, the styles, the relations, ... After streaming, it needs memory to put all of that together into the workbook. And at least during this process, the whole workbook must be processed through CPU and memory.
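An *.xlsx file is a ZIP container of XML parts, so every byte of sheet XML has to go through a compressor before it lands in the final file. The following sketch is not POI code and does not claim to mirror what POI does internally. It only uses `java.util.zip` from the JDK to illustrate the point: memory can stay small when rows are streamed, but the CPU still has to process the whole volume. The class name, the entry name and the simplified cell XML are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipStreamingSketch {

    /**
     * Streams `rows` rows of 25 XML cells into a single zip entry.
     * Only one cell buffer is held in memory, yet every byte is deflated.
     * Returns the number of uncompressed bytes that were written.
     */
    static long writeRows(OutputStream target, int rows, int level) {
        byte[] cell = "<c t=\"inlineStr\"><is><t>some cell value</t></is></c>"
                .getBytes(StandardCharsets.UTF_8);
        long written = 0;
        try (ZipOutputStream zip = new ZipOutputStream(target)) {
            zip.setLevel(level); // lower level: less CPU, larger file
            zip.putNextEntry(new ZipEntry("sheet1.xml"));
            for (int r = 0; r < rows; r++) {
                for (int c = 0; c < 25; c++) {
                    zip.write(cell, 0, cell.length);
                    written += cell.length;
                }
            }
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long start = System.nanoTime();
        long raw = writeRows(out, 100_000, Deflater.BEST_SPEED);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(raw + " raw bytes -> " + out.size() + " zipped bytes in " + millis + " ms");
    }
}
```

Comparing the timing for `Deflater.BEST_SPEED` and `Deflater.BEST_COMPRESSION` gives a feeling for how much of the CPU time goes into compression alone.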

Conclusion: Excel is a spreadsheet application. It is not a good format for data exchange. Good formats for data exchange are: Plain text (CSV) or plain XML or JSON, as those really can contain streams of plain data rows without the overhead of sheets in a workbook.
