Add XLSX spreadsheets import optimisations for small/readonly devices
Needs ReviewPublic

Authored by davidllewellynjones on Oct 28 2019, 3:08 PM.

Details

Reviewers
pvuorela
dcaliste
Group Reviewers
Calligra: 3.0
Summary

Loading large XLSX files on resource-constrained devices can be very slow, and the problem is made worse by the fact that spreadsheet applications make it easy to add styles to large numbers of rows and columns, even though the number of cells with actual content may be quite small (e.g. select a single entire column/row and apply a style). In this case it's much faster to only convert the cells inside the boundary containing values, and ignore the cells beyond this which only have styles applied.

This patch adds a number of compile-time defines that can be used to optimise loading of XLSX spreadsheet documents on restricted devices.

For the desktop version I expect none of these flags will be particularly interesting, so if they're not provided to cmake, the current behaviour is left unchanged.

MSOOXML_MAX_SPREADSHEET_COLS=<integer>

This controls the maximum number of columns that the importer will load. Any columns outside the range will be ignored. The default value is 0x7FFF, the maximum number of columns supported by Calligra.

MSOOXML_MAX_SPREADSHEET_ROWS=<integer>

This controls the maximum number of rows that the importer will load. Any rows outside the range will be ignored. The default value is 0xFFFFF, the maximum number of rows supported by Calligra.

MSOOXML_SPREADSHEET_CONTENT_BORDER=<integer>

If this value is set to a positive value, the importer will calculate the smallest spreadsheet size that can accommodate all cells containing values or formulae. It adds a border of the given number of cells and uses that for the maximum bounds of the imported spreadsheet. This can be useful because it's easy to create spreadsheets where large numbers of cells have style attributes set, but which in practice only have content in a smaller area of the spreadsheet. Setting this value therefore provides an optimised way to import the spreadsheet, but with style data lost outside of the bounded region.

If this value is left unset, or is negative, the full spreadsheet will be imported.

MSOOXML_IMPORT_READ_ONLY=<true|false>

Formula cells in XLSX documents store both the formula and the calculated value at the time the document was saved. When importing a file for read-only viewing, the value is more useful than the cell. By setting this flag, the importer can optimise by using the values rather than the formulae.

Test Plan

The following example spreadsheet has data in columns IR (252) and IW (257):

http://www.flypig.co.uk/dnload/dnload/other/calligra-optimise01.zip

  1. Load the file and notice there's data in columns IR and IW.
  1. Following the standard instructions, rebuild Calligra using the additional flags:
cd ~/kde/src/calligra
cmake -DCMAKE_INSTALL_PREFIX=$HOME/kde/inst5 $HOME/kde/src/calligra -DCMAKE_BUILD_TYPE=RelWithDebInfo -DPRODUCTSET=SHEETS -DMSOOXML_MAX_SPREADSHEET_COLS=0xFF -DMSOOXML_MAX_SPREADSHEET_ROWS=0x1400 -DMSOOXML_SPREADSHEET_CONTENT_BORDER=5 -DMSOOXML_IMPORT_READ_ONLY=true
make -j6
make install -j6
  1. Open the file with the newly build version of Calligra and notice that the less data has been loaded.

Playing around with the file and different values for the flags should give a mixture of results.

Diff Detail

Repository
R8 Calligra
Lint
Lint Skipped
Unit
Unit Tests Skipped
Restricted Application added a subscriber: Calligra-Devel-list. · View Herald TranscriptOct 28 2019, 3:08 PM
davidllewellynjones requested review of this revision.Oct 28 2019, 3:08 PM

@davidllewellynjones, I hadn't time yet to test the patch, but the idea and implementation is already looking great in my opinion. Maybe, just one remark after looking at the proposed additions: not saving the formale, but the values should not be named read-only. Indeed, maybe in the future, we would like to be able to "inspect" the content of a cell, even on portable devices. Being able to read the formulae should then be possible, even if not for modification. I may suggest the following naming changes (functionality to load by values to help loading time is great) :

  • MSOOXML_IMPORT_READ_ONLY -> MSOOXML_IMPORT_BY_VALUES or MSOOXML_IMPORT_VALUES_ONLY
  • MSOOXML::readonlySpreadsheet() -> MSOOXML::byValuesSpreadsheet() or MSOOXML::valuesOnlySpreadsheet()

Thanks @dcaliste, this is an excellent suggestion and much better naming. I've updated the diff to make the changes you suggested.

  • MSOOXML_IMPORT_READ_ONLY -> MSOOXML_IMPORT_BY_VALUES
  • MSOOXML::readonlySpreadsheet() -> MSOOXML::byValuesSpreadsheet()
anthonyfieroni added inline comments.
CMakeLists.txt
124–129 ↗(On Diff #68954)

Why 2 times?

CMakeLists.txt
124–129 ↗(On Diff #68954)

That's a very good question; nicely noticed. I'll update the patch to remove one.

Remove duplicated cmake definition.

davidllewellynjones marked 2 inline comments as done.Oct 30 2019, 1:47 PM
davidllewellynjones added inline comments.
CMakeLists.txt
124–129 ↗(On Diff #68954)

Thanks again, I've removed the duplication now.

A bit mixed feelings for this. In a way nice to avoid unnecessary work, but then again it seems effectively a compilation option if xlsx support is for read-write or view-only. On main level CMake it's also a bit random detail compared to others.

Could be nice to have this sort of things as some kind of hints to file opening api, but not spotting any existing mechanism.

filters/sheets/xlsx/XlsxXmlWorksheetReader.cpp
1603

Doesn't latter already imply non-empty?