Unpivot Complex and Irregular Data Layouts

Tools for converting data from complex or irregular layouts to a columnar structure. For example, tables with multilevel column or row headers, or spreadsheets. Header and data cells are selected by their contents and position, as well as formatting and comments where available, and are associated with one other by their proximity in given directions. Functions for data frames and HTML tables are provided.


Travis-CI BuildStatus AppVeyor BuildStatus CranStatus CranDownloads codecov

unpivotr deals with non-tabular data, especially from spreadsheets. Use unpivotr when your source data has any of these ‘features’:

  • Multi-headered hydra
  • Meaningful formatting
  • Headers anywhere but at the top of each column
  • Non-text headers e.g. dates
  • Other stuff around the table
  • Several similar tables in one sheet
  • Sentinel values
  • Superscript symbols
  • Meaningful comments
  • Nested HTML tables

If that list makes your blood boil, you’ll enjoy the function names.

  • behead() deals with multi-headered hydra tables one layer of headers at a time, working from the edge of the table inwards. It’s a bit like using header = TRUE in read.csv(), but because it’s a function, you can apply it to as many layers of headers as you need. You end up with all the headers in columns.
  • spatter() is like tidyr::spread() but preserves mixed data types. You get into a mixed-data-type situation by delaying type coercion until after the table is tidy (rather than before, like read.csv() et al). And yes, it usually follows behead().

More positive, corrective functions:

  • justify() aligns column headers before behead()ing, and has deliberate moral overtones.
  • enhead() attaches a header to the body of the data, a la Frankenstein. The effect is the same as behead(), but is more powerful because you can choose exactly which header cells you want, paying attention to formatting (which behead() doesn’t understand).
  • isolate_sentinels() separates meaningful symbols like "N/A" or "confidential" from the rest of the data, giving them some time alone think about what they’ve done.
  • partition() takes a sheet with several tables on it, and slashes into pieces that each contain one table. You can then unpivot each table in turn with purrr::map() or similar.

Make cells tidy

Unpivotr uses data where each cells is represented by one row in a dataframe. Like this.

Gif of tidyxl converting cells into a tidy representation of one rowper cell

What can you do with tidy cells? The best places to start are:

Otherwise the basic idea is:

  1. Read the data with a specialist tool.
    • For spreadsheets, use tidyxl.
    • For plain text files, you might soon be able to use readr, but for now you’ll have to install a pull-request on that package with devtools::install_github("tidyverse/readr#760").
    • For tables in html pages, use unpivotr::tidy_html()
    • For data frames, use unpivotr::as_cells() – this should be a last resort, because by the time the data is in a conventional data frame, it is often too late – formatting has been lost, and most data types have been coerced to strings.
  2. Either behead() straight away, else dplyr::filter() separately for the header cells and the data cells, and then recombine with enhead().
  3. spatter() so that each column has one data type.
library(unpivotr)
library(tidyverse)
#> ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 3.1.0           ✔ purrr   0.2.5.9000 
#> ✔ tibble  1.4.99.9006     ✔ dplyr   0.7.8      
#> ✔ tidyr   0.8.2           ✔ stringr 1.3.1      
#> ✔ readr   1.2.1.9000      ✔ forcats 0.3.0
#> ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
x <- purpose$`NNW WNW`
x # A pivot table in a conventional data frame.  Four levels of headers, in two
#>                            X2      X3     X4     X5    X6     X7
#> 1                        <NA>    <NA> Female   <NA>  Male   <NA>
#> 2                        <NA>    <NA>  0 - 6 7 - 10 0 - 6 7 - 10
#> 3           Bachelor's degree 15 - 24   7000  27000  <NA>  13000
#> 4                        <NA> 25 - 44  12000 137000  9000  81000
#> 5                        <NA> 45 - 64  10000  64000  7000  66000
#> 6                        <NA>     65+   <NA>  18000  7000  17000
#> 7                 Certificate 15 - 24  29000 161000 30000 190000
#> 8                        <NA> 25 - 44  34000 179000 31000 219000
#> 9                        <NA> 45 - 64  30000 210000 23000 199000
#> 10                       <NA>     65+  12000  77000  8000 107000
#> 11                    Diploma 15 - 24   <NA>  14000  9000  11000
#> 12                       <NA> 25 - 44  10000  66000  8000  47000
#> 13                       <NA> 45 - 64   6000  68000  5000  58000
#> 14                       <NA>     65+   5000  41000  1000  34000
#> 15           No Qualification 15 - 24  10000  43000 12000  37000
#> 16                       <NA> 25 - 44  11000  36000 21000  50000
#> 17                       <NA> 45 - 64  19000  91000 17000  75000
#> 18                       <NA>     65+  16000 118000  9000  66000
#> 19 Postgraduate qualification 15 - 24   <NA>   6000  <NA>   <NA>
#> 20                       <NA> 25 - 44   5000  86000  7000  60000
#> 21                       <NA> 45 - 64   6000  55000  6000  68000
#> 22                       <NA>     65+   <NA>  13000  <NA>  18000
  # rows and two columns.
 
y <- as_cells(x) # 'Tokenize' or 'melt' the data frame into one row per cell
y
#> # A tibble: 132 x 4
#>      row   col data_type chr              
#>    <int> <int> <chr>     <chr>            
#>  1     1     1 chr       <NA>             
#>  2     2     1 chr       <NA>             
#>  3     3     1 chr       Bachelor's degree
#>  4     4     1 chr       <NA>             
#>  5     5     1 chr       <NA>             
#>  6     6     1 chr       <NA>             
#>  7     7     1 chr       Certificate      
#>  8     8     1 chr       <NA>             
#>  9     9     1 chr       <NA>             
#> 10    10     1 chr       <NA>             
#> # … with 122 more rows
 
rectify(y) # useful for reviewing the melted form as though in a spreadsheet
#> # A tibble: 22 x 7
#>    `row/col` `1(A)`            `2(B)`  `3(C)` `4(D)` `5(E)` `6(F)`
#>        <int> <chr>             <chr>   <chr>  <chr>  <chr>  <chr> 
#>  1         1 <NA>              <NA>    Female <NA>   Male   <NA>  
#>  2         2 <NA>              <NA>    0 - 6  7 - 10 0 - 6  7 - 10
#>  3         3 Bachelor's degree 15 - 24 7000   27000  <NA>   13000 
#>  4         4 <NA>              25 - 44 12000  137000 9000   81000 
#>  5         5 <NA>              45 - 64 10000  64000  7000   66000 
#>  6         6 <NA>              65+     <NA>   18000  7000   17000 
#>  7         7 Certificate       15 - 24 29000  161000 30000  190000
#>  8         8 <NA>              25 - 44 34000  179000 31000  219000
#>  9         9 <NA>              45 - 64 30000  210000 23000  199000
#> 10        10 <NA>              65+     12000  77000  8000   107000
#> # … with 12 more rows
 
y %>%
  behead("NNW", "sex") %>%               # Strip headers
  behead("N", "life-satisfication") %>%  # one
  behead("WNW", "qualification") %>%     # by
  behead("W", "age-band") %>%            # one.
  select(-row, -col, -data_type, count = chr) %>% # cleanup
  mutate(count = as.integer(count))
#> # A tibble: 80 x 5
#>     count sex    `life-satisfication` qualification     `age-band`
#>     <int> <chr>  <chr>                <chr>             <chr>     
#>  1   7000 Female 0 - 6                Bachelor's degree 15 - 24   
#>  2  12000 Female 0 - 6                Bachelor's degree 25 - 44   
#>  3  10000 Female 0 - 6                Bachelor's degree 45 - 64   
#>  4     NA Female 0 - 6                Bachelor's degree 65+       
#>  5  27000 Female 7 - 10               Bachelor's degree 15 - 24   
#>  6 137000 Female 7 - 10               Bachelor's degree 25 - 44   
#>  7  64000 Female 7 - 10               Bachelor's degree 45 - 64   
#>  8  18000 Female 7 - 10               Bachelor's degree 65+       
#>  9     NA Male   0 - 6                Bachelor's degree 15 - 24   
#> 10   9000 Male   0 - 6                Bachelor's degree 25 - 44   
#> # … with 70 more rows

Note the compass directions in the code above, which hint to behead() where to find the header cell for each data cell.

  • "NNW" means the header (Female, Male) is positioned up and to the left of the columns of data cells it describes.
  • "N" means the header (0 - 6, 7 - 10) is positioned directly above the columns of data cells it describes.
  • "WNW" means the header (Bachelor's degree, Certificate, etc.) is positioned to the left and upwards of the rows of data cells it describes.
  • "W" means the header (15 - 24, 25 - 44, etc.) is positioned directly to the left of the rows of data cells it describes.

Installation

# install.packages("devtools") # If you don't already have devtools
devtools::install_github("nacnudus/unpivotr", build_vignettes = TRUE)

The version 0.4.0 release had somee breaking changes. See NEWS.md for details. The previous version can be installed as follow:

devtools::install_version("unpivotr", version = "0.3.1", repos = "http://cran.us.r-project.org")

Similar projects

unpivotr is inspired by Databaker, a collaboration between the United Kingdom Office of National Statistics and The Sensible Code Company. unpivotr.

jailbreaker attempts to extract non-tabular data from spreadsheets into tabular structures automatically via some clever algorithms. unpivotr differs by being less magic, and equipping you to express what you want to do.

News

unpivotr 0.5.1

  • Patch for tibble .name_repair compatibility (#2144 @krlmlr)
  • Patch to switch to tidyselect from dplyr functions.

unpivotr 0.5.0

New features

  • behead_if() is for tiered headers within the same row or column. It takes filter functions similarly to dplyr::filter() to decide which cells to treat as headers, and can be applied more than once to the same row or column of headers until every tier has been dealt with.
  • merge_rows() and merge_cols() combines header text when it is split over multiple cells.

Other changes

  • behead(), enhead(), partition() and rectify() give a more informative error message for non-distinct cells, for example when trying to pass cells from more than one sheet to these functions (@gregrs-uk, #15).

unpivotr 0.4.0

This version makes some big breaking changes for the sake of a more intuitive grammar. It comes with much more documentation in the online book Spreadsheet Munging Strategies.

The main new workhorses:

  • behead() takes one level of headers from a pivot table and make it part of the data. Chain this function to gradually strip every level of header away until you have tidy data.
  • spatter() is a data-type aware version of tidyr::spread() and is a common final step.
  • partition() breaks up small-multiples on a single sheet, so you can handle them individually.
  • rectify() visualises the cells in the console as they would look in a spreadsheet.

Breaking changes

The previous version can be installed as follows.

devtools::install_version("unpivotr", version = "0.3.1", repos = "http://cran.us.r-project.org")
  • The family of functions NNW() etc. has been removed in favour of the verbose join_header(), which has itself been renamed to enhead() to suggest its similarity to behead() (though they are not complements).
  • enhead() (formerly join_header() now follows the tidyverse convention of fct for 'factor' and ord for 'ordered factor'.
  • enhead() (formerly join_header()) now uses col_names and row_names as arguments instead of colnames and rownames, for consistency with tidyr.

New features

  • behead() is takes one level of headers from a pivot table and make it part of the data. Think of it like tidyr::gather(), except that it works when there is more than one row of headers (or more than one column of row-headers), and it only works on tables that have first come through enhead() (formerly join_header() or tidyxl::xlsx_cells().
  • rectify() displays cells as though in a spreadsheet, rather than in the 'melted' form of enhead() (formerly join_header()) and tidyxl::xlsx_cells(). This is useful for understanding the structure of a pivot table as a human, when planning how to unpivot it. A print method is available to render large datasets in the browser or the RStudio viewer pane.
  • partition() divides a grid of cells into partitions containing individual tables. Give it the corner cells of each table on a spreadsheet.
  • pack() packs cells values from separate columns per data type into one list-column. unpack() is the complement.
  • isolate_sentinels() move sentinel values into a separate column, leaving NA behind (or NULL for list-columns).
  • spatter() is like tidyr::spread(), but preserves mixed data types.
  • enhead() (formerly join_header()) now returns a data_type column that names the column that contains the value of a cell, similar to tidyxl::xlsx_cells().
  • enhead() (formerly join_header() now follows the tidyverse convention of fct for 'factor' and ord for 'ordered factor'.
  • enhead() (formerly join_header()) gains a drop = TRUE argument to control whether to discard cells that don't have a matching header (e.g. ones that are left of the leftmost header in enhead(x, y, "NNW")).
  • justify() moves one set of cells to the same positions as another set. This is useful when header cells aren't at the corner of the cells they describle. Put the header cells into justify(), along with cells that are at the corner.
  • New vignette 'worked-examples' of common tasks when munging spreadsheets.
  • The 'small-multiples' vignette has been refactored to use the new features.
  • purpose (built-in dataset) gains a new list-member small-multiples.

Under the hood

  • No longer depends on the data.table package.

Many other tweaks especially to documentation

unpivotr 0.3.1

  • Performance improvements to tidy_table().
  • Fixed a CRAN test on some platforms.

unpivotr 0.3.0

  • Made compatible with tidyxl version 1.0.0 (avoids dplyr::distinct(), which doesn't handle list columns).
  • Updated to use the new dplyr/rlang combination instead of the old dplyr/lazyeval one.

unpivotr 0.2.1

This release overhauls the tidy_table() function of unpivotr to preserve the original data types of table cells and to support HTML tables.

  • tidytable() has been renamed tidy_table(). tidytable() is an error, rather than a deprecation warning, because tidy_table() is so different from before.
  • There is a new tidy_table() method and vignette for HTML.
  • There is no tidy_table() method for matrices. Convert matrices to data.frames first, choosing what to do with row and column names.
  • tidy_table() returns only relevant columns, according to the data types of the columns in the given data frame. It uses tibble::type_sum() to determine the column type and to name the columns, so whereas characters used to be returned in a column called character, they are now returned in a column called chr. The full list of column names is in ?tidy_table and is chr, cplx, cplx, dbl, fctr, int, lgl, list. The columns fctr and list are list-columns, where each element is itself a list. This means that factors with different levels are kept separate. For HTML tables, an html column is returned containing the standalone HTML of each cell.
  • Both rowname and colname arguments to tidy_table() now default to FALSE.
  • All functions consistently return a tibble.
  • Some error messages are more helpful.

unpivotr 0.1.1

  • Fixed breakages introduced by dplyr 0.6.

unpivotr 0.1.0

  • Moved images from inst/extdata to vignettes.

unpivotr 0.1.0

  • Bumped version to match tidyxl, which is now more mature.
  • Updated README and vignettes to use the new tidyxl api.

unpivotr 0.0.0.9000

  • Added a NEWS.md file to track changes to the package.

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.

install.packages("unpivotr")

0.5.1 by Duncan Garmonsway, 6 months ago


https://github.com/nacnudus/unpivotr


Report a bug at https://github.com/nacnudus/unpivotr/issues


Browse source code at https://github.com/cran/unpivotr


Authors: Duncan Garmonsway [aut, cre]


Documentation:   PDF Manual  


MIT + file LICENSE license


Imports methods, rlang, magrittr, dplyr, forcats, purrr, tidyr, pillar, tibble, cellranger, xml2, tidyselect

Suggests knitr, rmarkdown, readr, tidyxl, readxl, stringr, htmltools, rvest, selectr, DT, testthat, covr


Imported by tidycells.


See at CRAN