Last summer I started
the development of {tidyods}
, an
R package for reading cells in ODS (OpenDocument Spreadsheet) files, and
effectively a {tidyxl}
equivalent for
ODS files. After developing the first iteration of the package in June I left
it alone, but I’ve recently revisited the package and refactored the code which
has improved performance.
What’s new
In late June, after I posted about {tidyods}
, I implemented a “smart”
rectifier function that can handle more complex spreadsheets, especially those
that follow UK government advice on
releasing statistics in spreadsheets.
While this is a working function, it is something I want to review and
therefore should not be considered stable at this point.
In the last couple of weeks I’ve been working on some code where {tidyods}
came in handy and was more useful than if I’d relied on {readODS}
, which
inspired me to go and spend some time revisiting the package.
Refactoring the code has split the underlying code for the “quick” and “full”
cell extraction into separate functions, at the cost of some duplication it
makes it easier in the codebase to follow what happens when a user sets the
quick
argument in the reading functions. The quick
will also now provide
the underlying raw numeric value for cells with a float or percentage
value-type.
The “full” cell extraction method has been improved and now handles the more
complex ways that text can be represented in ODS files1, it also now
correctly processes multi-line text entries within cells. Information about
merged cells has also been improved with cells being classified either as a
merge-lead
when they are the top-left cell of a merge group or
merge-hidden
when they are subordinate cells in a merge group. The lead
cell in a merge group now also now has a merge-shape
attribute of either
vertical
or horizontal
when the merge group covers a set of cells in either
a single column or row respectively, when a group covers multiple columns and
rows its shape attribute is set to rectangle
2.
Finally, I’ve explicitly set a minimum R version of 4.1.0, given the complexity
of the workflow I’ve made a fair use of pipes3. In the first version of the
code I used the {magrittr}
pipe (%>%
), but I’ve switched to the native
pipe which means needing to use R 4.1.0 or greater. I’ve also explicitly set
the version of {dplyr}
to 1.1.0 since the refactored code makes use of some
of its newest features.
The main goal of the refactor was to optimise the code, to generally review and
consider if there were better ways of achieving the desired outcome. While
performance improvement was a desired goal but not an explicit aim4 the
improvements are certainly notable5. For a simple test file the quick
method has dropped from 27ms to 17ms whereas {readODS}
takes 25ms, while the
full process takes around 32ms (down from 39). Using a more complex file as
input6 to test real-world performance perhaps better illustrates performance
differences, here the quick
method takes just 4 seconds where {readODS}
takes 11 seconds, the full read process for {tidyods}
also now takes 11
seconds but this is down from 36 seconds.
The package is fully functional and performs as well as the only other package for reading ODS files. Therefore I’ve decided to bump it from its initial development version number of 0.0.0.90000 to 0.0.1` to mark the occasion. I’d like to do a couple of more refinements before bumping it to 0.1.0 at which point I think I’d like to consider getting it on CRAN.
What next
So what needs to be done before I’ll release the package as 0.1.0 and consider submitting it to CRAN?
Although I’ve included rectifier functions to reshape the extracted cell data back into 2-dimensional datasets I’ve not yet focussed on those in my refactoring. The “smart” rectifier definitely needs to be reviewed to see if there are opportunities to optimise it and/or improving how it works for end-users.
The package needs tests if it’s going to be considered a serious package.
I also ought to write some vignettes to demonstrate the functionality of
package, its similarities and differences with {tidyxl}
, {unpivotr}
and
{readODS}
.
Finally I need to rediscover the checklist of things to do before trying to submit to CRAN, to test the package on different systems, it’s probably also a good idea to get someone else to review the package/code.
-
Particularly by Microsoft Excel7 which makes use of a text repetition entity, via a tag of the format
<text:s c="3">
wherec
governs the number of repetitions, but whichxml2::xml_text()
ignores when extracting the text content of a node. ↩︎ -
Yes, everything in a spreadsheet is some sort of rectangle, but in this case I’m saying that vertical/horizontal sets of cells are simply a “line” and you need lines in both directions, i.e. multiple columns and multiple rows, to make a rectangular merge group. ↩︎
-
I should review these further to try and reduce them, but the complexity of the workflows makes using a pipe (be it
{magrittr}
or native) very handy. And if ultimately the main user base of{tidyods}
ends up being government folk it might not be wise to limit the package to a version of R that’s only 2 years old. ↩︎ -
Fundamental performance improvement can only really come from writing custom C++ to handle the parsing of the underlying XML. ↩︎
-
This is performance on my personal M1 MacBook Air with 8GB RAM. ↩︎
-
A rather hideous spreadsheet published by the Cabinet Office on the number of civil servants by postcode which has 5,554 rows and 11 columns. ↩︎
-
Speaking of text in Microsoft Excel… ↩︎