Introducing shrthnd

2024-04-11 | Matt | 13 min read

#r #package #tidyods #shrthnd

Animated GIF of Kelly Rowland trying to send a text message using the
Microsoft Excel app on a Nokia mobile phone

About a year ago I started development of a new R package to deal with dirty data, specifically to process the use of shorthand, symbols and other annotations in columns of spreadsheets and tables. I realised I’ve not blogged about it, and thus if one subscribes to “blog-driven-development”¹ then if I’ve not blogged about it the package doesn’t exist.

Background

In my slow going development of {tidyods} I implemented two “rectifier” functions, the first is based on the approach in {unpivotr} to convert a data.frame of tidy cells back into a 2-dimensional sheet. The simple_rectifier() uses the row and column positions to replicate the original sheet, albeit with options to skip rows and process column headers.

However, spreadsheets are often mystical beasts that don’t conform to laws of nature (tidy data), and so I’ve also been working on a experimental “rectifier” that tries to identify where the actual sheet of data starts.

As well as header/footer rows, we can also try and coerce the sheet data away from character vectors to possible data types. Within {tidyods} we get information on the data types of cells, so we can use that as a starter. But how to process columns with multiple data-types. One of the most common situations is a column that is in essence a numeric column but in some cells we might have only non-numeric values or non-numeric values in addition to the data.

Common practice is often to clean out these values, sometimes this might be appropriate if the symbol conveys that data is missing, but symbols are used for a wide variety of purposes. Publishers of statistical data have different standards when it comes to the use of symbols but a dataset may have markers to indicate the different reasons for missing-ness, it may use asterisks to denote levels of statistical significance, it may use a symbol or shorthand to indicate a value is estimated rather than actual, or markers might relate to explanatory notes associated with the data point or value.

In {tidyods} I included an experimental function for trying to do a “smart” rectifying process that attempts to analyse the structure of the cells, extract the table of data from within a sheet with header rows, and use data type information to coerce columns to their relevant type.

As my use case at the time was working with UK government publications I included in the tidyods::smart_rectify() function the option to strip out shorthand notation based on guidance published by the UK Government Analysis Function as well as similar guidance from the US Census Bureau (at least I certainly remember trying to find something from them but a in the time that’s passed I can no longer find the page/PDF I originally consulted).

This was a very basic approach that looked for common shorthand, symbols and markers and simply used gsub() to remove them from the vector. But that didn’t feel like a “safe” long-term solution to handling this sort of problem. Around the same time Matt posted a blog about data types in R, which was in response to a similar issue regarding vectors that contained numbers and shorthand notation. Which got me thinking, maybe there was a need for a more systematic approach to handling shorthand in tables.

The {shrthnd} package

Shorthand and other symbolic markers convey information just as much as the numbers they are associated with. While some datasets/sources will put these in a separate notes column many sources do not, and even if everything published from here on out did do this there’s all those legacy spreadsheets out there.

Let’s start with a vector (x) that has 8 values that we might see in a spreadsheet column and has a mix of numeric and non-numeric components. Two of these values are explicitly missing values ("NA" and "")², so we have six values that convey some quantum of information. If we coerce this vector using as.numeric() then half of these values are coerced to NA, and so out of our original 8 values we only have 3 remaining.

x <- c("12", "34.567", "[c]", "NA", "56.78[e]", "78.9", "90.123[e]", "")

as.numeric(x)
#> [1] 12.000 34.567     NA     NA     NA 78.900     NA     NA
#> Warning message:
#> NAs introduced by coercion

We could do some cleaning to ensure the numeric values are extracted, from looking at the vector we can see that some of the numeric values have the suffix [e] (meaning in this case the values were estimated), so we can use gsub() to remove this and then coerce the vector.

y <- gsub("\\[e\\]", "", x)

y
#> [1] "12"     "34.567" "[c]"    "NA"     "56.78"  "78.9"   "90.123" ""

as.numeric(y)
#> [1] 12.000 34.567     NA     NA 56.780 78.900 90.123     NA
#> Warning message:
#> NAs introduced by coercion

So if we know the structure of our data then we construct a way to strip out the non-numeric components and get a nice neat numeric vector. But what if later down the line it would be useful to know which of those values had been estimated, or the value that was suppressed for confidentially reasons, the [c] value. To get to this information we’d need to go back to the source data and re-process it.

But, what if we didn’t and we had a way to retain that information. The shrthnd_num() function allows us to separate but retain both the numeric and non-numeric components in the vector.

sh_x <- shrthnd_num(x)

sh_x
#> <shrthnd_num[8]>
#> [1] 12.00    34.57    [c]      <NA>     56.78[e] 78.90    90.12[e] <NA>

as.numeric(sh_x)
#> [1] 12.000 34.567     NA     NA 56.780 78.900 90.123     NA

As that preview my suggest can also still access the non-numeric components, which I’ll call tags from here on. shrthnd_tags() provides a vector of the tags that is the same length as the vector itself, while shrthnd_list() provides a summary of the tag information.

shrthnd_tags(sh_x)
#> [1] NA    NA    "[c]" NA    "[e]" NA    "[e]" NA

shrthnd_list(sh_x)
#> <shrthnd_list[2]>
#> [c] (1 location): 3 
#> [e] (2 locations): 5, 7

While these examples have included shorthand markers that are surrounded by square brackets this isn’t necessary. When converting a character vector to a shrthnd_num() vector, any non-numeric content following a number is extracted as tag component. By default values with a trailing percentage sign will be coerced to a numeric value (i.e. x/100). By default, shrthnd_num will also convert any numbers contained in parentheses into negative numbers as this is commonly used in accounting formats. Both these behaviours can be reversed by changing the relevant arguments to shrthnd_num(). There is also support for scientific notation (e.g. 1.234e5), note that when a number is followed only by an e and not an exponent (e.g. e-05) it will be treated as a tag.

y <- c("12", "0.23*", "78e", "19%", "56 x", "13(1)", "(12)")
sh_y <- shrthnd_num(y)

sh_y
#> <shrthnd_num[7]>
#> [1] 12.00    0.23*    78.00e   0.19     56.00x   13.00(1) -12.00

shrthnd_tags(sh_y)
#> [1] NA    "*"   "e"   NA    "x"   "(1)" NA 

z <- c("12e", "1.23e4", "1.23e-4e*")
sh_z <- shrthnd_num(z)

sh_z
#> <shrthnd_num[3]>
#> [1] 12.00e   12300.00 0.00* 

as.numeric(sh_z)
#> [1] 1.20e+01 1.23e+04 1.23e-04

shrthnd_tags(sh_z)
#> [1] "e" NA  "*"

Coercion

Mot of the time a shrthnd_num() will try to behave as if it is a numeric vector, for example is.na() will return where the numeric component is missing and arithmetic/logic will work on the numeric component.

is.na(sh_x)
#> [1] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE

sh_x + 1
#> [1] 13.000 35.567     NA     NA 57.780 79.900 91.123     NA

sh_x > 60
#> [1] FALSE FALSE    NA    NA FALSE  TRUE  TRUE    NA

One thing to remember when working with a shrthnd_num() is that unless {shrthnd} can determine that the underlying vector is an integer then it will assume it is a double (and thus uses R’s floating point arithmetic).

In addition to simple arithmetic, there is also support for statistical functions.

sum(sh_x, na.rm = TRUE)
#> [1] 272.37

range(sh_x, na.rm = TRUE)
#> [1] 12.000 90.123

mean(sh_x, na.rm = TRUE)
#> [1] 54.474

Using as.character() on a shrthnd_num() vector will return a character vector solely of the numeric component, use as_shrthnd() to generate a character vector that includes both the numeric and non-numeric components. While the intent of as_shrthnd() is to mimic the original input, due to whitespace processing the output vector might not be identical to the input.

as.character(sh_x)
#> [1] "12"     "34.567" NA       NA       "56.78"  "78.9"   "90.123" NA

as_shrthnd(sh_x)
#> [1] "12.00"    "34.57"    "[c]"      NA         "56.78[e]" "78.90" 
#> [7] "90.12[e]" NA

Working with tag information

While the default behaviour is to treat a shrthnd_num() as a numeric vector the package also provides a variety of functions to work with the tag information. For example is_na_tag() assesses whether tags are missing, while is_na_both() assesses whether both the numeric and tag components are missing (i.e. a completely missing value in the underlying data). There are also a number of locational functions to test for whether tags exist and in what locations within a vector.

is_na_tag(sh_x)
#> [1]  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE

is_na_both(sh_x)
#> [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE

Working with table annotations

As well as processing numeric vectors that include annotations, {shrthnd} also includes a set of functions to process table annotations. Many spreadsheets have header and footer rows that include useful information about the table, such as a title, details of the data source and other explanatory notes.

The shrthnd_tbl() function takes a tibble::tibble() object (or something that can be coerced to a tibble), and you can set a title, notes and a source_note. The title and source_note must be singular character vectors, while notes can be a character vector of any length.

When a shrthnd_tbl() is printed it will prepend the title to the top of the console output for the tibble, if there is a source note this will be appended after the tibble output followed by a summary of any notes. The annotations() function provides a display of all the annotations associated with the shrthnd_tbl() including the notes in full.

x <- c("12", "34.567", "[c]", "NA", "56.78[e]", "78.9", "90.123[e]", "")
sh_x <- shrthnd_num(x)
tbl <- tibble::tibble(x = x, sh_x = sh_x)

shrthnd_tbl(
  tbl,
  title = "Example table",
  notes = c("Note 1", "Note 2"),
  source_note = "Shrthnd documentation, 2023"
)
#> # Title:    Example table
#> # A tibble: 8 × 2
#>   x             sh_x
#>   <chr>     <sh_dbl>
#> 1 12           12.00
#> 2 34.567       34.57
#> 3 [c]            [c]
#> 4 NA              NA
#> 5 56.78[e]  56.78[e]
#> 6 78.9         78.90
#> 7 90.123[e] 90.12[e]
#> 8 ""              NA
#> # ☰ Source: Shrthnd documentation, 2023
#> # ☰ There are 2 notes, use `annotations(x)` to view

annotations(sh_tbl)
#> ── Notes for `sh_tbl` ──────────────────────────────────────────────────────────
#> Title: My Example Table
#> Source: Shrthnd documentation (2023)
#> Notes:
#> • Note 1
#> • Note 2

You can also access the individual components using functions for the specific type of note and modify them. If you want to append to existing notes use the add_notes() function rather than the set_notes() otherwise you’ll overwrite the existing notes.

shrthnd_title(sh_tbl)
#> [1] "My Example Table"
shrthnd_source_note(sh_tbl)
#> [1] "Shrthnd documentation (2023)"
shrthnd_notes(sh_tbl)
#> [1] "Note 1" "Note 2"

add_notes(sh_tbl) <- "Note 3"
shrthnd_notes(sh_tbl)
#> [1] "Note 1" "Note 2" "Note 3"

While you can set these values manually, the intention is for these to integrate with my future plans for {tidyods} and so there is also a find_annotations() function (and its low-level helper counterpart guess_annotations()) which analyse an input table and try to identify annotations.

example_df <- tibble::tibble(
  col1 = c(
    "Table 1", "An example sheet", "species", "Adelie", "Gentoo", "Chinstrap",
    "This table is based on data in the palmerpenguins R package",
    "Source: {palmerpenguins} R package"
  ),
  col2 = c(NA_character_, NA_character_, "bill_length_mm", "38.791",
           "47.505", "48.834", NA_character_, NA_character_),
  col3 = c(NA_character_, NA_character_, "bill_depth_mm", "18.346",
           "14.982", "18.421", NA_character_, NA_character_)
)

example_df
#> # A tibble: 8 × 3
#>   col1                                                        col2         col3 
#>   <chr>                                                       <chr>        <chr>
#> 1 Table 1                                                     NA           NA   
#> 2 An example sheet                                            NA           NA   
#> 3 species                                                     bill_length… bill…
#> 4 Adelie                                                      38.791       18.3…
#> 5 Gentoo                                                      47.505       14.9…
#> 6 Chinstrap                                                   48.834       18.4…
#> 7 This table is based on data in the palmerpenguins R package NA           NA   
#> 8 Source: {palmerpenguins} R package                          NA           NA   

find_annotations(example_df)
#> ── Notes found in `example_df` ─────────────────────────────────────────────────
#> Title: Table 1
#> Source: Source: {palmerpenguins} R package
#> Notes:
#> • An example sheet
#> • This table is based on data in the palmerpenguins R package

guess_annotations(example_df)
#> # A tibble: 4 × 3
#>     row   col annotation                                                 
#>   <int> <int> <chr>                                                      
#> 1     1     1 Table 1                                                    
#> 2     2     1 An example sheet                                           
#> 3     7     1 This table is based on data in the palmerpenguins R package
#> 4     8     1 Source: {palmerpenguins} R package

Under the hood

The {shrthand} package relies heavily on the {vctrs} package, especially vectors::new_rcrd() which allows for the creation of a record data type where you can store multiple values in a linked way. A rcrd is in some respects a more formalised approach to a list, in which you have a declared structure for which its components must all be vectors of the same length (but not necessarily the same type). The easiest way I find of thinking it about it is a set of paired vectors, a buy-one-get-one-free sort of deal if you like. Though, you’re not limited on the number of vectors in a record.

In development of the package I thought about using attributes, for example creating a list of cell annotations and locations and then storing that information as an attribute of the vector/table. Attributes however didn’t feel like a “safe” option as they can easily be lost when mutating a dataset. The lookup table approach also felt unsafe if for example you went on to subset the vector³, either explicitly such as x[c(1,3)] or more likely an implicit subset created when filtering a tibble that contains the vector.

Instead, with rcrd type and some of the in-built features of the {vctrs} package it’s much harder to lose information. The component vectors inside a rcrd have to have the same length, hence why I described them as a paired vector, because in the case of shrthnd_num() thats what we have: a numeric vector (num) and a character vector (tag) of the same length and when you subset the record it will subset both of those vectors in the same way.

It’s through the {vctrs} package that all the things like coercion and most maths is handled. I’ve also added methods for median() and quantile() as those are not supported out of the box by {vectrs}.

There’s a great example from Jesse Sadler of using the rcrd format for storing non-decimal currency information (e.g. pounds, shillings and pence), and making calculations with it, which give you more of a flavour of how else you could use the rcrd format⁴.

I don’t, but I know someone that does. ↩︎
Assuming that this dataset doesn’t include 2-letter country code data and includes data on Namibia (who’s ISO 3166 2-character country code is NA). ↩︎
A lookup table approach is still used in the initial conversion from a character vector to a shrthnd_num() vector, but it is not used for ongoing storage/position information of tags. ↩︎
Jesse’s even published a CRAN package based on this, if you’re particularly in need of working with historical/non-decimal currency data. ↩︎