Refactoring and improving {tidyods}

2023-05-01 | Matt | 5 min read

#r #package #tidyods

Animated GIF from Schitts Creek (CBC television) of David Rose asking if something is a spreadsheet

Last summer I started the development of {tidyods}, an R package for reading cells in ODS (OpenDocument Spreadsheet) files, and effectively a {tidyxl} equivalent for ODS files. After developing the first iteration of the package in June I left it alone, but I’ve recently revisited the package and refactored the code which has improved performance.

What’s new

In late June, after I posted about {tidyods}, I implemented a “smart” rectifier function that can handle more complex spreadsheets, especially those that follow UK government advice on releasing statistics in spreadsheets. While this is a working function, it is something I want to review and therefore should not be considered stable at this point.

In the last couple of weeks I’ve been working on some code where {tidyods} came in handy and was more useful than if I’d relied on {readODS}, which inspired me to go and spend some time revisiting the package.

Animated GIF from The Orville (Fox/Hulu/Disney+) of Isaac saying his method is more efficient

Refactoring the code has split the underlying code for the “quick” and “full” cell extraction into separate functions, at the cost of some duplication it makes it easier in the codebase to follow what happens when a user sets the quick argument in the reading functions. The quick will also now provide the underlying raw numeric value for cells with a float or percentage value-type.

The “full” cell extraction method has been improved and now handles the more complex ways that text can be represented in ODS files¹, it also now correctly processes multi-line text entries within cells. Information about merged cells has also been improved with cells being classified either as a merge-lead when they are the top-left cell of a merge group or merge-hidden when they are subordinate cells in a merge group. The lead cell in a merge group now also now has a merge-shape attribute of either vertical or horizontal when the merge group covers a set of cells in either a single column or row respectively, when a group covers multiple columns and rows its shape attribute is set to rectangle².

Finally, I’ve explicitly set a minimum R version of 4.1.0, given the complexity of the workflow I’ve made a fair use of pipes³. In the first version of the code I used the {magrittr} pipe (%>%), but I’ve switched to the native pipe which means needing to use R 4.1.0 or greater. I’ve also explicitly set the version of {dplyr} to 1.1.0 since the refactored code makes use of some of its newest features.

The main goal of the refactor was to optimise the code, to generally review and consider if there were better ways of achieving the desired outcome. While performance improvement was a desired goal but not an explicit aim⁴ the improvements are certainly notable⁵. For a simple test file the quick method has dropped from 27ms to 17ms whereas {readODS} takes 25ms, while the full process takes around 32ms (down from 39). Using a more complex file as input⁶ to test real-world performance perhaps better illustrates performance differences, here the quick method takes just 4 seconds where {readODS} takes 11 seconds, the full read process for {tidyods} also now takes 11 seconds but this is down from 36 seconds.

The package is fully functional and performs as well as the only other package for reading ODS files. Therefore I’ve decided to bump it from its initial development version number of 0.0.0.90000 to 0.0.1` to mark the occasion. I’d like to do a couple of more refinements before bumping it to 0.1.0 at which point I think I’d like to consider getting it on CRAN.

What next

So what needs to be done before I’ll release the package as 0.1.0 and consider submitting it to CRAN?

Although I’ve included rectifier functions to reshape the extracted cell data back into 2-dimensional datasets I’ve not yet focussed on those in my refactoring. The “smart” rectifier definitely needs to be reviewed to see if there are opportunities to optimise it and/or improving how it works for end-users.

The package needs tests if it’s going to be considered a serious package.

I also ought to write some vignettes to demonstrate the functionality of package, its similarities and differences with {tidyxl}, {unpivotr} and {readODS}.

Finally I need to rediscover the checklist of things to do before trying to submit to CRAN, to test the package on different systems, it’s probably also a good idea to get someone else to review the package/code.

Particularly by Microsoft Excel⁷ which makes use of a text repetition entity, via a tag of the format <text:s c="3"> where c governs the number of repetitions, but which xml2::xml_text() ignores when extracting the text content of a node. ↩︎
Yes, everything in a spreadsheet is some sort of rectangle, but in this case I’m saying that vertical/horizontal sets of cells are simply a “line” and you need lines in both directions, i.e. multiple columns and multiple rows, to make a rectangular merge group. ↩︎
I should review these further to try and reduce them, but the complexity of the workflows makes using a pipe (be it {magrittr} or native) very handy. And if ultimately the main user base of {tidyods} ends up being government folk it might not be wise to limit the package to a version of R that’s only 2 years old. ↩︎
Fundamental performance improvement can only really come from writing custom C++ to handle the parsing of the underlying XML. ↩︎
This is performance on my personal M1 MacBook Air with 8GB RAM. ↩︎
A rather hideous spreadsheet published by the Cabinet Office on the number of civil servants by postcode which has 5,554 rows and 11 columns. ↩︎
Speaking of text in Microsoft Excel…

↩︎