About a year ago I started development of a new R package to deal with dirty data, specifically to process the use of shorthand, symbols and other annotations in columns of spreadsheets and tables. I realised I’ve not blogged about it, and thus if one subscribes to “blog-driven-development”1 then if I’ve not blogged about it the package doesn’t exist.
Background
In my slow going development of
{tidyods}
I implemented two
“rectifier” functions, the first is based on the approach in
{unpivotr}
to
convert a data.frame of tidy cells back into a 2-dimensional sheet. The
simple_rectifier()
uses the row and column positions to replicate the
original sheet, albeit with options to skip rows and process column headers.
However, spreadsheets are often mystical beasts that don’t conform to laws of nature (tidy data), and so I’ve also been working on a experimental “rectifier” that tries to identify where the actual sheet of data starts.
As well as header/footer rows, we can also try and coerce the sheet data away
from character vectors to possible data types. Within {tidyods}
we get
information on the data types of cells, so we can use that as a starter. But
how to process columns with multiple data-types. One of the most common
situations is a column that is in essence a numeric column but in some cells
we might have only non-numeric values or non-numeric values in addition to the
data.
Common practice is often to clean out these values, sometimes this might be appropriate if the symbol conveys that data is missing, but symbols are used for a wide variety of purposes. Publishers of statistical data have different standards when it comes to the use of symbols but a dataset may have markers to indicate the different reasons for missing-ness, it may use asterisks to denote levels of statistical significance, it may use a symbol or shorthand to indicate a value is estimated rather than actual, or markers might relate to explanatory notes associated with the data point or value.
In {tidyods}
I included an experimental function for trying to do a “smart”
rectifying process that attempts to analyse the structure of the cells,
extract the table of data from within a sheet with header rows, and use data
type information to coerce columns to their relevant type.
As my use case at the time was working with UK government publications I
included in the tidyods::smart_rectify()
function the option to strip out
shorthand notation based on
guidance
published by the UK Government Analysis Function as well as similar guidance
from the US Census Bureau (at least I certainly remember trying to find
something from them but a in the time that’s passed I can no longer find the
page/PDF I originally consulted).
This was a very basic approach that looked for common shorthand, symbols
and markers and simply used gsub()
to remove them from the vector. But that
didn’t feel like a “safe” long-term solution to handling this sort of problem.
Around the same time Matt posted a blog about
data types in R,
which was in response to a similar issue regarding vectors that contained
numbers and shorthand notation. Which got me thinking, maybe there was a need
for a more systematic approach to handling shorthand in tables.
The {shrthnd} package
Shorthand and other symbolic markers convey information just as much as the numbers they are associated with. While some datasets/sources will put these in a separate notes column many sources do not, and even if everything published from here on out did do this there’s all those legacy spreadsheets out there.
Let’s start with a vector (x
) that has 8 values that we might see in a
spreadsheet column and has a mix of numeric and non-numeric components. Two
of these values are explicitly missing values ("NA"
and ""
)2, so we
have six values that convey some quantum of information. If we coerce this
vector using as.numeric()
then half of these values are coerced to NA
, and
so out of our original 8 values we only have 3 remaining.
x <- c("12", "34.567", "[c]", "NA", "56.78[e]", "78.9", "90.123[e]", "")
as.numeric(x)
#> [1] 12.000 34.567 NA NA NA 78.900 NA NA
#> Warning message:
#> NAs introduced by coercion
We could do some cleaning to ensure the numeric values are extracted, from
looking at the vector we can see that some of the numeric values have the
suffix [e]
(meaning in this case the values were estimated), so we can
use gsub()
to remove this and then coerce the vector.
y <- gsub("\\[e\\]", "", x)
y
#> [1] "12" "34.567" "[c]" "NA" "56.78" "78.9" "90.123" ""
as.numeric(y)
#> [1] 12.000 34.567 NA NA 56.780 78.900 90.123 NA
#> Warning message:
#> NAs introduced by coercion
So if we know the structure of our data then we construct a way to strip out
the non-numeric components and get a nice neat numeric vector. But what if
later down the line it would be useful to know which of those values had been
estimated, or the value that was suppressed for confidentially reasons, the
[c]
value. To get to this information we’d need to go back to the source data
and re-process it.
But, what if we didn’t and we had a way to retain that information. The
shrthnd_num()
function allows us to separate but retain both the numeric
and non-numeric components in the vector.
sh_x <- shrthnd_num(x)
sh_x
#> <shrthnd_num[8]>
#> [1] 12.00 34.57 [c] <NA> 56.78[e] 78.90 90.12[e] <NA>
as.numeric(sh_x)
#> [1] 12.000 34.567 NA NA 56.780 78.900 90.123 NA
As that preview my suggest can also still access the non-numeric components,
which I’ll call tags from here on.
shrthnd_tags()
provides a vector of the tags that is the same
length as the vector itself, while
shrthnd_list()
provides a summary of the tag information.
shrthnd_tags(sh_x)
#> [1] NA NA "[c]" NA "[e]" NA "[e]" NA
shrthnd_list(sh_x)
#> <shrthnd_list[2]>
#> [c] (1 location): 3
#> [e] (2 locations): 5, 7
While these examples have included shorthand markers that are surrounded by
square brackets this isn’t necessary. When converting a character vector to
a shrthnd_num()
vector, any non-numeric content following a number is
extracted as tag component. By default values with a trailing percentage sign
will be coerced to a numeric value (i.e. x/100
). By default, shrthnd_num
will also convert any numbers contained in parentheses into negative numbers
as this is commonly used in accounting formats. Both these behaviours can be
reversed by changing the relevant arguments to shrthnd_num()
. There is also
support for scientific notation (e.g. 1.234e5
), note that when a number is
followed only by an e
and not an exponent (e.g. e-05
) it will be treated
as a tag.
y <- c("12", "0.23*", "78e", "19%", "56 x", "13(1)", "(12)")
sh_y <- shrthnd_num(y)
sh_y
#> <shrthnd_num[7]>
#> [1] 12.00 0.23* 78.00e 0.19 56.00x 13.00(1) -12.00
shrthnd_tags(sh_y)
#> [1] NA "*" "e" NA "x" "(1)" NA
z <- c("12e", "1.23e4", "1.23e-4e*")
sh_z <- shrthnd_num(z)
sh_z
#> <shrthnd_num[3]>
#> [1] 12.00e 12300.00 0.00*
as.numeric(sh_z)
#> [1] 1.20e+01 1.23e+04 1.23e-04
shrthnd_tags(sh_z)
#> [1] "e" NA "*"
Coercion
Mot of the time a shrthnd_num()
will try to behave as if it is a numeric
vector, for example is.na()
will return where the numeric component is
missing and arithmetic/logic will work on the numeric component.
is.na(sh_x)
#> [1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE TRUE
sh_x + 1
#> [1] 13.000 35.567 NA NA 57.780 79.900 91.123 NA
sh_x > 60
#> [1] FALSE FALSE NA NA FALSE TRUE TRUE NA
One thing to remember when working with a shrthnd_num()
is that unless
{shrthnd}
can determine that the underlying vector is an integer then it
will assume it is a double (and thus uses R’s floating point arithmetic).
In addition to simple arithmetic, there is also support for statistical functions.
sum(sh_x, na.rm = TRUE)
#> [1] 272.37
range(sh_x, na.rm = TRUE)
#> [1] 12.000 90.123
mean(sh_x, na.rm = TRUE)
#> [1] 54.474
Using as.character()
on a shrthnd_num()
vector will return a character
vector solely of the numeric component, use
as_shrthnd()
to generate a character vector that includes both the numeric and non-numeric
components. While the intent of as_shrthnd()
is to mimic the original input,
due to whitespace processing the output vector might not be identical to the
input.
as.character(sh_x)
#> [1] "12" "34.567" NA NA "56.78" "78.9" "90.123" NA
as_shrthnd(sh_x)
#> [1] "12.00" "34.57" "[c]" NA "56.78[e]" "78.90"
#> [7] "90.12[e]" NA
Working with tag information
While the default behaviour is to treat a shrthnd_num()
as a numeric vector
the package also provides a variety of functions to work with the tag
information. For example
is_na_tag()
assesses whether tags are missing, while
is_na_both()
assesses whether both the numeric and tag components are missing (i.e. a
completely missing value in the underlying data). There are also a number of
locational functions to test for whether tags exist and in what locations
within a vector.
is_na_tag(sh_x)
#> [1] TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
is_na_both(sh_x)
#> [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
Working with table annotations
As well as processing numeric vectors that include annotations, {shrthnd}
also includes a set of functions to process table annotations. Many
spreadsheets have header and footer rows that include useful information about
the table, such as a title, details of the data source and other explanatory
notes.
The
shrthnd_tbl()
function takes a
tibble::tibble()
object (or something that can be coerced to a tibble), and you can set a
title
, notes
and a source_note
. The title
and source_note
must be
singular character vectors, while notes
can be a character vector of any
length.
When a shrthnd_tbl()
is printed it will prepend the title to the top of the
console output for the tibble, if there is a source note this will be appended
after the tibble output followed by a summary of any notes. The
annotations()
function provides a display of all the annotations associated with the
shrthnd_tbl()
including the notes in full.
x <- c("12", "34.567", "[c]", "NA", "56.78[e]", "78.9", "90.123[e]", "")
sh_x <- shrthnd_num(x)
tbl <- tibble::tibble(x = x, sh_x = sh_x)
shrthnd_tbl(
tbl,
title = "Example table",
notes = c("Note 1", "Note 2"),
source_note = "Shrthnd documentation, 2023"
)
#> # Title: Example table
#> # A tibble: 8 × 2
#> x sh_x
#> <chr> <sh_dbl>
#> 1 12 12.00
#> 2 34.567 34.57
#> 3 [c] [c]
#> 4 NA NA
#> 5 56.78[e] 56.78[e]
#> 6 78.9 78.90
#> 7 90.123[e] 90.12[e]
#> 8 "" NA
#> # ☰ Source: Shrthnd documentation, 2023
#> # ☰ There are 2 notes, use `annotations(x)` to view
annotations(sh_tbl)
#> ── Notes for `sh_tbl` ──────────────────────────────────────────────────────────
#> Title: My Example Table
#> Source: Shrthnd documentation (2023)
#> Notes:
#> • Note 1
#> • Note 2
You can also access the individual components using functions for the specific
type of note and modify them. If you want to append to existing notes use the
add_notes()
function rather than the set_notes()
otherwise you’ll
overwrite the existing notes.
shrthnd_title(sh_tbl)
#> [1] "My Example Table"
shrthnd_source_note(sh_tbl)
#> [1] "Shrthnd documentation (2023)"
shrthnd_notes(sh_tbl)
#> [1] "Note 1" "Note 2"
add_notes(sh_tbl) <- "Note 3"
shrthnd_notes(sh_tbl)
#> [1] "Note 1" "Note 2" "Note 3"
While you can set these values manually, the intention is for these to
integrate with my future plans for {tidyods}
and so there is also a
find_annotations()
function (and its low-level helper counterpart guess_annotations()
) which
analyse an input table and try to identify annotations.
example_df <- tibble::tibble(
col1 = c(
"Table 1", "An example sheet", "species", "Adelie", "Gentoo", "Chinstrap",
"This table is based on data in the palmerpenguins R package",
"Source: {palmerpenguins} R package"
),
col2 = c(NA_character_, NA_character_, "bill_length_mm", "38.791",
"47.505", "48.834", NA_character_, NA_character_),
col3 = c(NA_character_, NA_character_, "bill_depth_mm", "18.346",
"14.982", "18.421", NA_character_, NA_character_)
)
example_df
#> # A tibble: 8 × 3
#> col1 col2 col3
#> <chr> <chr> <chr>
#> 1 Table 1 NA NA
#> 2 An example sheet NA NA
#> 3 species bill_length… bill…
#> 4 Adelie 38.791 18.3…
#> 5 Gentoo 47.505 14.9…
#> 6 Chinstrap 48.834 18.4…
#> 7 This table is based on data in the palmerpenguins R package NA NA
#> 8 Source: {palmerpenguins} R package NA NA
find_annotations(example_df)
#> ── Notes found in `example_df` ─────────────────────────────────────────────────
#> Title: Table 1
#> Source: Source: {palmerpenguins} R package
#> Notes:
#> • An example sheet
#> • This table is based on data in the palmerpenguins R package
guess_annotations(example_df)
#> # A tibble: 4 × 3
#> row col annotation
#> <int> <int> <chr>
#> 1 1 1 Table 1
#> 2 2 1 An example sheet
#> 3 7 1 This table is based on data in the palmerpenguins R package
#> 4 8 1 Source: {palmerpenguins} R package
Under the hood
The {shrthand}
package relies heavily on the
{vctrs}
package, especially
vectors::new_rcrd()
which
allows for the creation of a record data type where you can store multiple
values in a linked way. A rcrd
is in some respects a more formalised approach
to a list, in which you have a declared structure for which its components must
all be vectors of the same length (but not necessarily the same type). The
easiest way I find of thinking it about it is a set of paired vectors, a
buy-one-get-one-free sort of deal if you like. Though, you’re not limited on
the number of vectors in a record.
In development of the package I thought about using attributes, for example
creating a list of cell annotations and locations and then storing that
information as an attribute of the vector/table. Attributes however didn’t
feel like a “safe” option as they can easily be lost when mutating a dataset.
The lookup table approach also felt unsafe if for example you went on to
subset the vector3, either explicitly such as x[c(1,3)]
or more likely an
implicit subset created when filtering a tibble that contains the vector.
Instead, with rcrd
type and some of the in-built features of the {vctrs}
package it’s much harder to lose information. The component vectors inside a
rcrd
have to have the same length, hence why I described them as a
paired vector, because in the case of shrthnd_num()
thats what we have: a
numeric vector (num
) and a character vector (tag
) of the same length and
when you subset the record it will subset both of those vectors in the same way.
It’s through the {vctrs}
package that all the things like coercion and most
maths is handled. I’ve also added methods for median()
and quantile()
as those are not supported out of the box by {vectrs}
.
There’s a great example from
Jesse Sadler of using the rcrd
format for storing non-decimal currency
information (e.g. pounds, shillings and pence), and making calculations with
it, which give you more of a flavour of how else you could use the rcrd
format4.
-
Assuming that this dataset doesn’t include 2-letter country code data and includes data on Namibia (who’s ISO 3166 2-character country code is NA). ↩︎
-
A lookup table approach is still used in the initial conversion from a character vector to a
shrthnd_num()
vector, but it is not used for ongoing storage/position information of tags. ↩︎ -
Jesse’s even published a CRAN package based on this, if you’re particularly in need of working with historical/non-decimal currency data. ↩︎