Unpivot Complex and Irregular Data Layouts

Tools for converting data from complex or irregular layouts to a columnar structure. For example, tables with multilevel column or row headers, or spreadsheets. Header and data cells are selected by their contents and position, as well as formatting and comments where available, and are associated with one other by their proximity in given directions. Functions for data frames and HTML tables are provided.


Travis-CI BuildStatus AppVeyor BuildStatus CranStatus CranDownloads codecov

unpivotr provides tools for converting data from complex or irregular layouts to a columnar structure. For example, tables with multi-level column or row headers, or spreadsheets, or nested HTML tables. Header and data cells can selected by their contents, position and formatting, and can be associated with one other by their relative positions.

Excel (.xlsx) files can be prepared for unpivotr via the tidyxl package.

Mailing list

For bugs and/or issues, create a new issue on GitHub For other questions or comments, please subscribe to the tidyxl-devel mailing list. You must be a member to post messages, but anyone can read the archived discussions.

Installation

devtools::install_github("nacnudus/unpivotr", build_vignettes = TRUE)

There are several dependencies:

  • data.table (>= 1.9.7)
  • dplyr
  • dtplyr
  • purrr
  • tidyr
  • xml2

Example

The package includes a dataset, purpose, which is a list of pivot tables, derived from a survey by Statistics New Zealand of people’s sense-of-purpose. A ‘tidy’ version of the data is also included.

library(unpivotr)
#> 
#> Attaching package: 'unpivotr'
#> The following object is masked from 'package:stats':
#> 
#>     offset
head(purpose$Tidy) # 'Tidy' version of the data
#>    Sex Age group (Life-stages) Highest qualification Sense of purpose
#> 1 Male                 15 - 24      No Qualification            0 - 6
#> 2 Male                 15 - 24      No Qualification           7 - 10
#> 3 Male                 15 - 24           Certificate            0 - 6
#> 4 Male                 15 - 24           Certificate           7 - 10
#> 5 Male                 15 - 24               Diploma            0 - 6
#> 6 Male                 15 - 24               Diploma           7 - 10
#>    Value Flags
#> 1  12000  <NA>
#> 2  37000  <NA>
#> 3  30000  <NA>
#> 4 190000  <NA>
#> 5   9000     *
#> 6  11000     *
(pivoted <- purpose$`NNW WNW`) # The same data, pivoted
#>                            X2      X3     X4     X5    X6     X7
#> 1                        <NA>    <NA> Female   <NA>  Male   <NA>
#> 2                        <NA>    <NA>  0 - 6 7 - 10 0 - 6 7 - 10
#> 3           Bachelor's degree 15 - 24   7000  27000  <NA>  13000
#> 4                        <NA> 25 - 44  12000 137000  9000  81000
#> 5                        <NA> 45 - 64  10000  64000  7000  66000
#> 6                        <NA>     65+   <NA>  18000  7000  17000
#> 7                 Certificate 15 - 24  29000 161000 30000 190000
#> 8                        <NA> 25 - 44  34000 179000 31000 219000
#> 9                        <NA> 45 - 64  30000 210000 23000 199000
#> 10                       <NA>     65+  12000  77000  8000 107000
#> 11                    Diploma 15 - 24   <NA>  14000  9000  11000
#> 12                       <NA> 25 - 44  10000  66000  8000  47000
#> 13                       <NA> 45 - 64   6000  68000  5000  58000
#> 14                       <NA>     65+   5000  41000  1000  34000
#> 15           No Qualification 15 - 24  10000  43000 12000  37000
#> 16                       <NA> 25 - 44  11000  36000 21000  50000
#> 17                       <NA> 45 - 64  19000  91000 17000  75000
#> 18                       <NA>     65+  16000 118000  9000  66000
#> 19 Postgraduate qualification 15 - 24   <NA>   6000  <NA>   <NA>
#> 20                       <NA> 25 - 44   5000  86000  7000  60000
#> 21                       <NA> 45 - 64   6000  55000  6000  68000
#> 22                       <NA>     65+   <NA>  13000  <NA>  18000

To unpivot this table, first get a tabular representation of each cell, its value and its position in the original pivot table.

cells <- tidy_table(pivoted, colnames = FALSE)
cells <- cells[!is.na(cells$chr), ]
head(cells)
#> # A tibble: 6 x 3
#>     row   col chr                       
#>   <int> <int> <chr>                     
#> 1     3     1 Bachelor's degree         
#> 2     7     1 Certificate               
#> 3    11     1 Diploma                   
#> 4    15     1 No Qualification          
#> 5    19     1 Postgraduate qualification
#> 6     3     2 15 - 24

This can easily be subset for header and data cells.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
# Select the cells containing the values
datacells <-
  cells %>%
  filter(row >= 3, col >= 3)
head(datacells)
#> # A tibble: 6 x 3
#>     row   col chr  
#>   <int> <int> <chr>
#> 1     3     3 7000 
#> 2     4     3 12000
#> 3     5     3 10000
#> 4     7     3 29000
#> 5     8     3 34000
#> 6     9     3 30000
 
# Select the row headers
row_headers <-
  cells %>%
  filter(col <= 2) %>%
  select(row, col, header = chr) %>%
  split(.$col) # Separate each column of headers
lapply(row_headers, head)
#> $`1`
#> # A tibble: 5 x 3
#>     row   col header                    
#>   <int> <int> <chr>                     
#> 1     3     1 Bachelor's degree         
#> 2     7     1 Certificate               
#> 3    11     1 Diploma                   
#> 4    15     1 No Qualification          
#> 5    19     1 Postgraduate qualification
#> 
#> $`2`
#> # A tibble: 6 x 3
#>     row   col header 
#>   <int> <int> <chr>  
#> 1     3     2 15 - 24
#> 2     4     2 25 - 44
#> 3     5     2 45 - 64
#> 4     6     2 65+    
#> 5     7     2 15 - 24
#> 6     8     2 25 - 44
 
# Select the column headers
col_headers <-
  cells %>%
  filter(row <= 2) %>%
  select(row, col, header = chr) %>%
  split(.$row) # Separate each row of headers
col_headers
#> $`1`
#> # A tibble: 2 x 3
#>     row   col header
#>   <int> <int> <chr> 
#> 1     1     3 Female
#> 2     1     5 Male  
#> 
#> $`2`
#> # A tibble: 4 x 3
#>     row   col header
#>   <int> <int> <chr> 
#> 1     2     3 0 - 6 
#> 2     2     4 7 - 10
#> 3     2     5 0 - 6 
#> 4     2     6 7 - 10

Using unpivotr functions, we associate the data cells with the headers, by proximity in given compass directions.

# From each data cell, search for the nearest one of each of the headers
unpivoted <-
  datacells %>%
  NNW(col_headers$`1`) %>% # Search north (up) and north-west (up-left)
  N(col_headers$`2`) %>% # Search directly north (up)
  WNW(row_headers$`1`) %>% # Search west (left) and north-west
  W(row_headers$`2`) # Search directly left (west)
unpivoted
#> # A tibble: 72 x 7
#>      row   col chr   header.data header.header header.data.data 
#>    <int> <int> <chr> <chr>       <chr>         <chr>            
#>  1     3     3 7000  Female      0 - 6         Bachelor's degree
#>  2     4     3 12000 Female      0 - 6         Bachelor's degree
#>  3     5     3 10000 Female      0 - 6         Bachelor's degree
#>  4     7     3 29000 Female      0 - 6         Certificate      
#>  5     8     3 34000 Female      0 - 6         Certificate      
#>  6     9     3 30000 Female      0 - 6         Certificate      
#>  7    10     3 12000 Female      0 - 6         Certificate      
#>  8    12     3 10000 Female      0 - 6         Diploma          
#>  9    13     3 6000  Female      0 - 6         Diploma          
#> 10    14     3 5000  Female      0 - 6         Diploma          
#> # ... with 62 more rows, and 1 more variable: header.header.header <chr>

We can re-pivot the final data in R using ftable() to check that it has been imported correctly.

ftable(unpivoted[3:7],
       row.vars = 4:5,
       col.vars = 2:3)
#>                                                 header.data   Female         Male       
#>                                                 header.header  0 - 6 7 - 10 0 - 6 7 - 10
#> header.data.data           header.header.header                                         
#> Bachelor's degree          15 - 24                                 1      1     0      1
#>                            25 - 44                                 1      1     1      1
#>                            45 - 64                                 1      1     1      1
#>                            65+                                     0      1     1      1
#> Certificate                15 - 24                                 1      1     1      1
#>                            25 - 44                                 1      1     1      1
#>                            45 - 64                                 1      1     1      1
#>                            65+                                     1      1     1      1
#> Diploma                    15 - 24                                 0      1     1      1
#>                            25 - 44                                 1      1     1      1
#>                            45 - 64                                 1      1     1      1
#>                            65+                                     1      1     1      1
#> No Qualification           15 - 24                                 1      1     1      1
#>                            25 - 44                                 1      1     1      1
#>                            45 - 64                                 1      1     1      1
#>                            65+                                     1      1     1      1
#> Postgraduate qualification 15 - 24                                 0      1     0      0
#>                            25 - 44                                 1      1     1      1
#>                            45 - 64                                 1      1     1      1
#>                            65+                                     0      1     0      1

HTML tables

HTML tables can be imported in a similar way.

urls <- system.file("extdata", "url.html", package = "unpivotr")
htmltools::includeHTML(urls)

<!DOCTYPE html>

HTML table with rowspan


Scraping HTML.

Sweet

as?

Yeah,
right.


 
cell_url <- function(x) {
  if (is.na(x)) return(NA)
  x %>%
    xml2::read_html() %>%
    xml2::xml_find_all("//a") %>%
    xml2::xml_attr("href")
}
 
cell_text <- function(x) {
  if (is.na(x)) return(NA)
  x %>%
    xml2::read_html() %>%
    xml2::xml_find_all("//a") %>%
    rvest::html_text()
}
 
urls %>%
  xml2::read_html() %>%
  tidy_table() %>%
  .[[1]] %>%
  mutate(text = purrr::map(html, cell_text),
         url = purrr::map(html, cell_url)) %>%
  tidyr::unnest(text, url) %>%
  select(-html)
#> # A tibble: 8 x 4
#>     row   col text     url                  
#>   <int> <int> <chr>    <chr>                
#> 1     1     1 Scraping example1.co.nz       
#> 2     1     1 HTML.    example2.co.nz       
#> 3     2     1 Sweet    example3.co.nz       
#> 4     1     2 <NA>     <NA>                 
#> 5     2     2 as?      example4.co.nz       
#> 6     1     3 <NA>     <NA>                 
#> 7     2     3 Yeah,    example5.co.nz       
#> 8     2     3 right.   http://www.tui.co.nz/

Compass directions

The concept of compass directions is compactly expressed by this triptych.

The most useful ones are more fully explained here.

A complete explanation is in ‘poster’ form here.

Philosophy

Information in in many spreadsheets cannot be easily imported into R. Why?

Most R packages that import spreadsheets have difficulty unless the layout of the spreadsheet conforms to a strict definition of a ‘table’, e.g.:

  • observations in rows
  • variables in columns
  • a single header row
  • all information represented by characters, whether textual, logical, or numeric

These rules are designed to eliminate ambiguity in the interpretation of the information. But most spreadsheeting software relaxes these rules in a trade of ambiguity for expression via other media:

  • proximity (other than headers, i.e. other than being the first value at the top of a column)
  • formatting (colours and borders)

Humans can usually resolve the ambiguities with contextual knowledge, but computers are limited by their ignorance. Programmers are hampered by:

  • their language’s expressiveness
  • loss of information in transfer from spreadsheet to programming library

Information is lost when software discards it in order to force the data into tabular form. Sometimes date formatting is retained, but mostly formatting and comments are lost, and position has to be inferred.

Similar projects

unpivotr is inspired by the United Kingdom Office of National Statistics ONSdatabaker, which is implemented in Python. unpivotr differs already by using compass directions and searching for headers between boundaries.

The rsheets project of several R packages is in the early stages of importing spreadsheet information from Excel and Google Sheets into R, manipulating it, and potentially parsing and processing formulas and writing out to spreadsheet files. In particular, jailbreaker attempts to extract non-tabular data from spreadsheets into tabular structures automatically via some clever algorithms.

unpivotr differs from jailbreaker in that it provides tools for unpivoting complex and non-tabular data layouts using I not AI (intelligence, not artificial intelligence).

Roadmap

As common unpivoting idioms emerge, they will be generalised and added to the package. E.g. combining long header titles that are spelled out across cells, rather than being in one cell merged across others.

News

unpivotr 0.3.1

  • Performance improvements to tidy_table().
  • Fixed a CRAN test on some platforms.

unpivotr 0.3.0

  • Made compatible with tidyxl version 1.0.0 (avoids dplyr::distinct(), which doesn't handle list columns).
  • Updated to use the new dplyr/rlang combination instead of the old dplyr/lazyeval one.

unpivotr 0.2.1

This release overhauls the tidy_table() function of unpivotr to preserve the original data types of table cells and to support HTML tables.

  • tidytable() has been renamed tidy_table(). tidytable() is an error, rather than a deprecation warning, because tidy_table() is so different from before.
  • There is a new tidy_table() method and vignette for HTML.
  • There is no tidy_table() method for matrices. Convert matrices to data.frames first, choosing what to do with row and column names.
  • tidy_table() returns only relevant columns, according to the data types of the columns in the given data frame. It uses tibble::type_sum() to determine the column type and to name the columns, so whereas characters used to be returned in a column called character, they are now returned in a column called chr. The full list of column names is in ?tidy_table and is chr, cplx, cplx, dbl, fctr, int, lgl, list. The columns fctr and list are list-columns, where each element is itself a list. This means that factors with different levels are kept separate. For HTML tables, an html column is returned containing the standalone HTML of each cell.
  • Both rowname and colname arguments to tidy_table() now default to FALSE.
  • All functions consistently return a tibble.
  • Some error messages are more helpful.

unpivotr 0.1.1

  • Fixed breakages introduced by dplyr 0.6.

unpivotr 0.1.0

  • Moved images from inst/extdata to vignettes.

unpivotr 0.1.0

  • Bumped version to match tidyxl, which is now more mature.
  • Updated README and vignettes to use the new tidyxl api.

unpivotr 0.0.0.9000

  • Added a NEWS.md file to track changes to the package.

Reference manual

It appears you don't have a PDF plugin for this browser. You can click here to download the reference manual.

install.packages("unpivotr")

0.3.1 by Duncan Garmonsway, 5 months ago


https://github.com/nacnudus/unpivotr


Report a bug at https://github.com/nacnudus/unpivotr/issues


Browse source code at https://github.com/cran/unpivotr


Authors: Duncan Garmonsway [aut, cre]


Documentation:   PDF Manual  


MIT + file LICENSE license


Imports methods, rlang, magrittr, data.table, dplyr, dtplyr, purrr, tidyr, tibble, xml2

Suggests knitr, rmarkdown, tidyxl, stringr, htmltools, rvest, selectr, testthat


See at CRAN