Missing values are ubiquitous in data and need to be explored and
handled in the initial stages of analysis. 'naniar' provides data structures
and functions that facilitate the plotting of missing values and examination
of imputations. This allows missing data dependencies to be explored with
minimal deviation from the common work patterns of 'ggplot2' and tidy data.
The work is fully discussed in Tierney & Cook (2018).
naniar provides principled, tidy ways to summarise, visualise, and
manipulate missing data with minimal deviations from the workflows in
ggplot2 and tidy data. It does this by providing:
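- Shadow matrices, a tidy data structure for missing data: bind_shadow() and nabular()
- Shorthand summaries of missing data: n_miss() and n_complete(), pct_miss() and pct_complete()
- Numerical summaries of missing data in variables and cases: miss_var_summary() and miss_var_table(), miss_case_summary() and miss_case_table()
- Visualisations of missing data: geom_miss_point(), gg_miss_var(), gg_miss_case(), gg_miss_fct()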
For more details on the workflow and theory underpinning naniar, read the vignette Getting started with naniar.
For a short primer on the data visualisation available in naniar, read the vignette Gallery of Missing Data Visualisations.
You can install naniar from CRAN:
install.packages("naniar")
Or you can install the development version from GitHub using remotes:
# install.packages("remotes")
remotes::install_github("njtierney/naniar")
Visualising missing data might sound a little strange - how do you
visualise something that is not there? One approach to visualising
missing data comes from ggobi and manet, which replace NA
values with values 10% lower than the minimum value in that variable.
This visualisation is provided by the geom_miss_point() ggplot2 geom.
library(ggplot2)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_point()
#> Warning: Removed 42 rows containing missing values (geom_point).
ggplot2 does not plot these missing values; it drops them and issues a warning.
We can instead use geom_miss_point() to display the missing data:
library(naniar)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_miss_point()
geom_miss_point() has shifted the missing values to now be 10% below
the minimum value. The missing values are a different colour so that
missingness becomes pre-attentive. As it is a ggplot2 geom, it supports
standard ggplot2 features such as faceting:
p1 <-
  ggplot(data = airquality,
         aes(x = Ozone,
             y = Solar.R)) +
  geom_miss_point() +
  facet_wrap(~Month, ncol = 2) +
  theme(legend.position = "bottom")

p1
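The shifting idea behind geom_miss_point() can be sketched in base R. This is a minimal illustration and not naniar's implementation: `shift_below()` is a hypothetical helper, and it assumes that "10% below" means 10% of the observed range below the minimum, with no jitter added.

```r
# Hypothetical helper (not a naniar function): replace missing values with a
# value placed `prop_below` of the observed range under the observed minimum.
shift_below <- function(x, prop_below = 0.1) {
  rng <- range(x, na.rm = TRUE)
  shifted_value <- rng[1] - prop_below * diff(rng)
  ifelse(is.na(x), shifted_value, x)
}

ozone_shifted <- shift_below(airquality$Ozone)

# Every formerly missing value now sits below the smallest observed value,
# so it can be drawn on the same axis as the observed data.
range(ozone_shifted)
```

Plotting `ozone_shifted` against a similarly shifted `Solar.R`, coloured by whether the original value was missing, gives roughly the picture that the geom produces.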
naniar provides a data structure for working with missing data, the shadow matrix (Swayne and Buja, 1998). The shadow matrix has the same dimensions as the data and consists of binary indicators of missingness, where missing is represented as “NA” and not missing as “!NA”. Variable names are kept the same, with the suffix “_NA” added.
head(airquality)
#>   Ozone Solar.R Wind Temp Month Day
#> 1    41     190  7.4   67     5   1
#> 2    36     118  8.0   72     5   2
#> 3    12     149 12.6   74     5   3
#> 4    18     313 11.5   62     5   4
#> 5    NA      NA 14.3   56     5   5
#> 6    28      NA 14.9   66     5   6

as_shadow(airquality)
#> # A tibble: 153 x 6
#>    Ozone_NA Solar.R_NA Wind_NA Temp_NA Month_NA Day_NA
#>    <fct>    <fct>      <fct>   <fct>   <fct>    <fct>
#>  1 !NA      !NA        !NA     !NA     !NA      !NA
#>  2 !NA      !NA        !NA     !NA     !NA      !NA
#>  3 !NA      !NA        !NA     !NA     !NA      !NA
#>  4 !NA      !NA        !NA     !NA     !NA      !NA
#>  5 NA       NA         !NA     !NA     !NA      !NA
#>  6 !NA      NA         !NA     !NA     !NA      !NA
#>  7 !NA      !NA        !NA     !NA     !NA      !NA
#>  8 !NA      !NA        !NA     !NA     !NA      !NA
#>  9 !NA      !NA        !NA     !NA     !NA      !NA
#> 10 NA       !NA        !NA     !NA     !NA      !NA
#> # … with 143 more rows
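To make the structure concrete, a shadow matrix can be approximated in base R. This is only a sketch of the idea: `make_shadow()` is a hypothetical helper, it returns a plain data.frame rather than a tibble, and it does not reproduce any classes that as_shadow() may attach.

```r
# Hypothetical helper (not part of naniar): one factor column per variable,
# with levels "!NA" (observed) and "NA" (missing), and names suffixed "_NA".
make_shadow <- function(data) {
  shadow <- lapply(data, function(column) {
    factor(ifelse(is.na(column), "NA", "!NA"), levels = c("!NA", "NA"))
  })
  names(shadow) <- paste0(names(data), "_NA")
  data.frame(shadow, check.names = FALSE)
}

shadow <- make_shadow(airquality)

dim(shadow)              # same dimensions as airquality
table(shadow$Ozone_NA)   # counts of observed and missing Ozone values
```

Binding this to the original data with `cbind(airquality, shadow)` gives a base R analogue of the nabular format.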
Binding the shadow to the data helps you keep better track of the
missing values. This format is called “nabular”, a portmanteau of NA
and tabular. You can bind the shadow to the data using bind_shadow()
or nabular():
bind_shadow(airquality)
#> # A tibble: 153 x 12
#>    Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA
#>    <int>   <int> <dbl> <int> <int> <int> <fct>    <fct>      <fct>
#>  1    41     190   7.4    67     5     1 !NA      !NA        !NA
#>  2    36     118   8      72     5     2 !NA      !NA        !NA
#>  3    12     149  12.6    74     5     3 !NA      !NA        !NA
#>  4    18     313  11.5    62     5     4 !NA      !NA        !NA
#>  5    NA      NA  14.3    56     5     5 NA       NA         !NA
#>  6    28      NA  14.9    66     5     6 !NA      NA         !NA
#>  7    23     299   8.6    65     5     7 !NA      !NA        !NA
#>  8    19      99  13.8    59     5     8 !NA      !NA        !NA
#>  9     8      19  20.1    61     5     9 !NA      !NA        !NA
#> 10    NA     194   8.6    69     5    10 NA       !NA        !NA
#> # … with 143 more rows, and 3 more variables: Temp_NA <fct>,
#> #   Month_NA <fct>, Day_NA <fct>

nabular(airquality)
#> # A tibble: 153 x 12
#>    Ozone Solar.R  Wind  Temp Month   Day Ozone_NA Solar.R_NA Wind_NA
#>    <int>   <int> <dbl> <int> <int> <int> <fct>    <fct>      <fct>
#>  1    41     190   7.4    67     5     1 !NA      !NA        !NA
#>  2    36     118   8      72     5     2 !NA      !NA        !NA
#>  3    12     149  12.6    74     5     3 !NA      !NA        !NA
#>  4    18     313  11.5    62     5     4 !NA      !NA        !NA
#>  5    NA      NA  14.3    56     5     5 NA       NA         !NA
#>  6    28      NA  14.9    66     5     6 !NA      NA         !NA
#>  7    23     299   8.6    65     5     7 !NA      !NA        !NA
#>  8    19      99  13.8    59     5     8 !NA      !NA        !NA
#>  9     8      19  20.1    61     5     9 !NA      !NA        !NA
#> 10    NA     194   8.6    69     5    10 NA       !NA        !NA
#> # … with 143 more rows, and 3 more variables: Temp_NA <fct>,
#> #   Month_NA <fct>, Day_NA <fct>
Using the nabular format helps you keep track of where missing values are in your dataset and makes it easy to create visualisations that split by missingness:
airquality %>%
  bind_shadow() %>%
  ggplot(aes(x = Temp,
             fill = Ozone_NA)) +
  geom_density(alpha = 0.5)
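The same split by missingness can be summarised numerically. The following base R sketch is an illustration only; it builds an indicator with the same “NA” / “!NA” labels by hand instead of calling bind_shadow(), and the variable names are made up for this example.

```r
# Label each row by whether Ozone is missing, mirroring the Ozone_NA column.
ozone_missingness <- ifelse(is.na(airquality$Ozone), "NA", "!NA")

# Mean temperature within each missingness group.
mean_temp_by_missingness <- tapply(airquality$Temp, ozone_missingness, mean)
mean_temp_by_missingness
```

A clear difference between the two group means would suggest that Ozone missingness is related to temperature, which is the pattern the density plot is designed to reveal.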
And even visualise imputations:
airquality %>%
  bind_shadow() %>%
  simputation::impute_lm(Ozone ~ Temp + Solar.R) %>%
  ggplot(aes(x = Solar.R,
             y = Ozone,
             colour = Ozone_NA)) +
  geom_point()
#> Warning: Removed 7 rows containing missing values (geom_point).
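The reason for recording missingness before imputing can be shown with a base R sketch. It uses lm() and predict() in place of simputation::impute_lm(), so it is an approximation of the workflow above and not the same code path; the object names are assumptions made for this example.

```r
# Record which Ozone values were missing BEFORE imputing, as the shadow does.
ozone_was_missing <- is.na(airquality$Ozone)

# Fit a linear model on the observed rows and predict for every row.
fit <- lm(Ozone ~ Temp + Solar.R, data = airquality)
predictions <- predict(fit, newdata = airquality)

# Fill in only the missing Ozone values and keep the missingness indicator.
imputed <- airquality
imputed$Ozone[ozone_was_missing] <- predictions[ozone_was_missing]
imputed$Ozone_NA <- factor(ifelse(ozone_was_missing, "NA", "!NA"),
                           levels = c("!NA", "NA"))

# Rows whose Solar.R is also missing cannot be predicted and stay NA.
sum(is.na(imputed$Ozone))
```

Because `Ozone_NA` is stored alongside the imputed values, the imputed points can still be coloured differently from the observed ones.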
Or create an upset plot, which shows the combinations of missingness
across cases, using the gg_miss_upset() function:
gg_miss_upset(airquality)
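The quantities an upset plot displays are the counts of each combination of missing variables. A base R sketch of that tabulation follows; it is not how gg_miss_upset() is implemented, and the labels are chosen for this illustration.

```r
# Logical matrix of missingness for the two variables that contain NAs.
miss_flags <- is.na(airquality[, c("Ozone", "Solar.R")])

# Describe each row by the set of variables that are missing in it.
pattern <- apply(miss_flags, 1, function(row) {
  if (!any(row)) "complete" else paste(names(row)[row], collapse = " & ")
})

# Count rows per missingness pattern, largest first.
pattern_counts <- sort(table(pattern), decreasing = TRUE)
pattern_counts
```

Each entry of `pattern_counts` corresponds to one intersection bar in an upset plot.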
naniar does this while following consistent principles that are easy to read, thanks to the tools of the tidyverse.
naniar also provides handy visualisations for each variable:
gg_miss_var(airquality)
Or the number of missings in a given variable at a repeating span:
gg_miss_span(pedestrian,
             var = hourly_counts,
             span_every = 1500)
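Counting missings over a repeating span amounts to cutting the row order into fixed-size blocks and counting NAs in each. The sketch below uses base R and the built-in airquality data, since the pedestrian data ships with naniar; `count_miss_by_span()` is a hypothetical helper.

```r
# Hypothetical helper: number of missing values in consecutive blocks of
# `span_every` observations (the last block may be shorter).
count_miss_by_span <- function(x, span_every) {
  span_id <- ceiling(seq_along(x) / span_every)
  data.frame(
    span = sort(unique(span_id)),
    n_miss = as.vector(tapply(is.na(x), span_id, sum))
  )
}

ozone_spans <- count_miss_by_span(airquality$Ozone, span_every = 50)
ozone_spans
```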
You can read about all of the visualisations in naniar in the vignette Gallery of missing data visualisations using naniar.
naniar also provides handy helpers for calculating the number, proportion, and percentage of missing and complete observations:
n_miss(airquality)
#> [1] 44
n_complete(airquality)
#> [1] 874
prop_miss(airquality)
#> [1] 0.04793028
prop_complete(airquality)
#> [1] 0.9520697
pct_miss(airquality)
#> [1] 4.793028
pct_complete(airquality)
#> [1] 95.20697
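For reference, the same quantities can be computed directly with is.na() in base R. This is a sketch of what the helpers report, not their source code, and the variable names are chosen for this example.

```r
# is.na() on a data frame returns a logical matrix, one cell per value.
missing_cells <- is.na(airquality)

n_missing     <- sum(missing_cells)     # number of missing values
n_observed    <- sum(!missing_cells)    # number of complete values
prop_missing  <- mean(missing_cells)    # proportion missing
prop_observed <- mean(!missing_cells)   # proportion complete
pct_missing   <- 100 * prop_missing     # percentage missing

c(n_missing = n_missing, n_observed = n_observed, pct_missing = pct_missing)
```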
naniar provides numerical summaries of missing data that follow a
consistent naming rule: they all begin with miss_. Summaries
focussing on variables, or a single selected variable, start with
miss_var_, and summaries for cases (the initial collected row order of
the data) start with miss_case_. All of these functions that
return dataframes also work with dplyr’s group_by().
For example, we can look at the number and percent of missings in each
case and variable with miss_var_summary() and miss_case_summary(),
which both return output ordered by the number of missing values.
miss_var_summary(airquality)
#> # A tibble: 6 x 3
#>   variable n_miss pct_miss
#>   <chr>     <int>    <dbl>
#> 1 Ozone        37    24.2
#> 2 Solar.R       7     4.58
#> 3 Wind          0     0
#> 4 Temp          0     0
#> 5 Month         0     0
#> 6 Day           0     0

miss_case_summary(airquality)
#> # A tibble: 153 x 3
#>     case n_miss pct_miss
#>    <int>  <int>    <dbl>
#>  1     5      2     33.3
#>  2    27      2     33.3
#>  3     6      1     16.7
#>  4    10      1     16.7
#>  5    11      1     16.7
#>  6    25      1     16.7
#>  7    26      1     16.7
#>  8    32      1     16.7
#>  9    33      1     16.7
#> 10    34      1     16.7
#> # … with 143 more rows
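The per-variable summary is a column-wise count of NAs plus a percentage, sorted in descending order. A base R sketch, which returns a data.frame and not a tibble and is not naniar's implementation, looks like this:

```r
# Count missing values in each column.
n_miss_per_var <- colSums(is.na(airquality))

var_summary <- data.frame(
  variable = names(n_miss_per_var),
  n_miss = as.integer(n_miss_per_var),
  pct_miss = 100 * as.numeric(n_miss_per_var) / nrow(airquality)
)

# Order so that the variables with the most missings come first.
var_summary <- var_summary[order(-var_summary$n_miss), ]
var_summary
```

The per-case summary follows the same pattern with rowSums() and ncol() in place of colSums() and nrow().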
You could also use group_by() to work out the number of missings in each
variable within each level of a grouping variable.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

airquality %>%
  group_by(Month) %>%
  miss_var_summary()
#> # A tibble: 25 x 4
#>    Month variable n_miss pct_miss
#>    <int> <chr>     <int>    <dbl>
#>  1     5 Ozone         5     16.1
#>  2     5 Solar.R       4     12.9
#>  3     5 Wind          0      0
#>  4     5 Temp          0      0
#>  5     5 Day           0      0
#>  6     6 Ozone        21     70
#>  7     6 Solar.R       0      0
#>  8     6 Wind          0      0
#>  9     6 Temp          0      0
#> 10     6 Day           0      0
#> # … with 15 more rows
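The grouped summary can be cross-checked without dplyr by splitting the data on the grouping variable. This base R sketch produces a wide matrix (months in rows, variables in columns) and not the long tibble shown above; the object names are assumptions for this example.

```r
# Split the non-grouping columns by Month, then count NAs per column in each piece.
by_month <- split(airquality[, names(airquality) != "Month"], airquality$Month)
n_miss_by_month <- t(sapply(by_month, function(group) colSums(is.na(group))))

n_miss_by_month
```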
You can read more about all of these functions in the vignette “Getting Started with naniar”.
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
- geom_miss_* family to include categorical variables, bivariate plots: scatterplots, density overlays
- nullabor package)

Firstly, thanks to Di Cook for giving the
initial inspiration for the package and laying down the rich theory and
literature that the work in naniar is built upon. Naming credit (once
again!) goes to Miles McBain. Among
various other things, Miles also worked out how to overload the missing
data and make it work as a geom. Thanks also to Colin Fay for helping me
understand tidy evaluation and for features such as replace_to_na,
miss_*_cumsum, and more.
naniar was previously named ggmissing and initially provided a ggplot
geom and some other visualisations. ggmissing was renamed to naniar
to reflect the fact that this package is going to be bigger in scope,
and is not just related to ggplot2. Specifically, the package is
designed to provide a suite of tools for visualising missing values and
imputations, and for manipulating and summarising missing data.
Well, I think it is useful to think of missing values in data being like
this other dimension, perhaps like C.S. Lewis’s Narnia - a
different world, hidden away. You go inside, and sometimes it seems like
you’ve spent no time in there but time has passed very quickly, or the
opposite. Also, NAniar = na in r, and if you so desire, naniar may
sound like “noneoya” in an nz/aussie accent. Full credit to @MilesMcbain
for the name, and @Hadley for the rearranged spelling.
- The geom_miss_point() ggplot2 layer can now be converted into an interactive web-based version by the ggplotly() function in the plotly package. In order for this to work, naniar now exports the geom2trace.GeomMissPoint() function (users should never need to call geom2trace.GeomMissPoint() directly -- ggplotly() calls it for you).
- usethis::use_spell_check()
- @seealso bug (#228) (@sfirke)

Thanks to a PR (#223) from @romainfrancois:
This fixes two problems that were identified as part of reverse dependency checks of dplyr 0.8.0 release candidate. https://github.com/tidyverse/dplyr/blob/revdep_dplyr_0_8_0_RC/revdep/problems.md#naniar
n() must be imported or prefixed like any other function. In the PR, I've changed 1:n() to dplyr::row_number() as naniar seems to prefix all dplyr functions.
update_shadow was only restoring the class attributes, changed so that it restores all attributes, this was causing problems when data was a grouped_df. This likely was a problem before too, but dplyr 0.8.0 is stricter about what is a grouped data frame.
- new_tibble #220 - Thanks to Kirill Müller.
- rlang #218 - Thanks to Lionel Henry.
- Add custom label support for missings and not missings with the functions add_label_missings(), add_label_shadow(), and add_any_miss(). So you can now do `add_label_missings(data, missing = "custom_missing_label", complete = "custom_complete_label")`.
- impute_median() and scoped variants
- any_shade() returns a logical TRUE or FALSE depending on whether there are any shade values
- nabular(), an alias for bind_shadow(), to tie the nabular term into the work.
- is_nabular() checks if input is nabular.
- geom_miss_point() now gains the arguments from shadow_shift() / impute_below() for altering the amount of jitter and the proportion below (prop_below).
Added two new vignettes, "Exploring Imputed Values", and "Special Missing Values"
- miss_var_summary and miss_case_summary no longer provide the cumulative sum of missingness in the summaries - this summary can be added back to the data with the option add_cumsum = TRUE. #186
- gg_miss_upset to replace the workflow of:
  data %>%
    as_shadow_upset() %>%
    UpSetR::upset()
- recode_shadow now works! This function allows you to recode your missing values into special missing values. These special missing values are stored in the shadow part of the dataframe, which ends in _NA.
- shade where appropriate throughout naniar, and also added verifiers is_shade, are_shade, which_are_shade, and removed which_are_shadow.
- as_shadow and bind_shadow now return data of class shadow. This will feed into recode_shadow methods for flexibly adding new types of missing data.
- shadow might be changed to nabble or something similar.
- add_label_shadow() and add_label_missings() gain arguments so you can only label according to the missingness / shadowy-ness of given variables.
- which_are_shadow(), to tell you which values are shadows.
- long_shadow(), which converts data in shadow/nabular form into a long format suitable for plotting. Related to #165
- miss_scan_count
- gg_miss_upset gets a better default presentation by ordering by the largest intersections, and also an improved error message when data with only 1 or no variables have missing values.
- shadow_shift gains a more informative error message when it doesn't know the class.
- common_na_string to include escape characters for "?", "*", "." so that if they are used in replacement or searching functions they don't return the wildcard results from the characters "?", "*", and ".".
- miss_case_table and miss_var_table now have final column names pct_vars and pct_cases instead of pct_miss
- fixes #178.

| old_names | new_names |
|---|---|
| miss_case_pct | pct_miss_case |
| miss_case_prop | prop_miss_case |
| miss_var_pct | pct_miss_var |
| miss_var_prop | prop_miss_var |
| complete_case_pct | pct_complete_case |
| complete_case_prop | prop_complete_case |
| complete_var_pct | pct_complete_var |
| complete_var_prop | prop_complete_var |

These old names will be made defunct in 0.5.0, and removed completely in 0.6.0.
- impute_below has changed to be an alias of shadow_shift - that is, it operates on a single vector. impute_below_all operates on all columns in a dataframe (as specified in #159)
- miss_scan_count actually return'd something.
- gg_miss_var(airquality) now prints the ggplot - a typo meant that this did not print the plot

This is a patch release that removes tidyselect from the package Imports, as
it is unnecessary. Fixes #174
=========================
Added all_miss() / all_na(), equivalent to all(is.na(x))
Added any_complete(), equivalent to all(complete.cases(x))
Added any_miss(), equivalent to anyNA(x)
Added common_na_numbers and finalised common_na_strings - to provide a
list of commonly used NA values #168
Added miss_var_which, to list the variable names with missings
Added as_shadow_upset, which gets the data into a format suitable for
plotting as an UpSetR plot:
airquality %>% as_shadow_upset() %>% UpSetR::upset()
Added some imputation functions to assist with exploring missingness structure and visualisation:
- impute_below performs as for shadow_shift, but on all columns. This means that it imputes missing values 10% below the range of the data (powered by shadow_shift), to facilitate graphical exploration of the data. Closes #145. There are also scoped variants that work for specific named columns, impute_below_at, and for columns that satisfy some predicate function, impute_below_if.
- impute_mean imputes the mean value, with scoped variants impute_mean_at and impute_mean_if.
- impute_below and shadow_shift gain arguments prop_below and jitter to control the degree of shift and the extent of jitter.
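Mean imputation, and the idea of a scoped variant that only touches columns satisfying a predicate, can be sketched in base R. Both functions below are hypothetical helpers written for this illustration; they are not the naniar functions and make no claim about how those are implemented.

```r
# Hypothetical helper: replace the missing values of one vector with its mean.
impute_mean_vec <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Hypothetical "_if"-style variant: apply the imputation only to columns
# for which the predicate returns TRUE.
impute_mean_where <- function(data, predicate) {
  selected <- vapply(data, predicate, logical(1))
  data[selected] <- lapply(data[selected], impute_mean_vec)
  data
}

airquality_imputed <- impute_mean_where(airquality, is.numeric)
anyNA(airquality_imputed)
```

Mean imputation is used here for exploration and visualisation; it is a convenient placeholder and not a recommendation for final analysis.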
Added complete_{case/var}_{pct/prop}, which complement
miss_{var/case}_{pct/prop} #150
Added unbind_shadow and unbind_data as helpers to remove shadow columns
from data, and data from shadows, respectively.
Added is_shadow and are_shadow to determine if something contains a
shadow column. Similar to rlang::is_na and rlang::are_na, is_shadow
returns a logical vector of length 1, and are_shadow returns a logical
vector with one element per name of a data.frame. This might be
revisited at a later point (see any_shade in add_label_shadow).
Aesthetics now map as expected in geom_miss_point(). This means you can write
things like geom_miss_point(aes(colour = Month)) and it works appropriately.
Fixed by Luke Smith in Pull request #144, fixing #137.
miss_var_summary and miss_case_summary now use order = TRUE by
default, so cases and variables with the most missings are presented in
descending order. Fixes #163
Changes for Visualisation:
- gg_miss_case and gg_miss_var to lorikeet purple (from ochRe package: https://github.com/ropenscilabs/ochRe)
- gg_miss_case order_cases = TRUE.
- show_pct option to be consistent with gg_miss_var #153
- gg_miss_which is rotated 90 degrees so it is easier to read variable names
- gg_miss_fct uses a minimal theme and tilts the axis labels #118.
- imported is_na and are_na from rlang.
Added common_na_strings, a list of common NA values #168.
Added some detail on alternative methods for replacing with NA in the vignette "replacing values with NA".
=========================
Speed improvements. Thanks to the help, contributions, and discussion with Romain François and Jim Hester, naniar now has greatly improved speed for calculating the missingness in each row. These speedups should continue to improve in future releases.
New "scoped variants" of replace_with_na, thank you to Colin Fay for his
work on this:
- replace_with_na_all replaces with NA all values across the dataframe that meet a specified condition (using the syntax ~.x == -99)
- replace_with_na_at replaces values with NA for specified variables
- replace_with_na_if replaces values with NA for those variables that satisfy some predicate function (e.g., is.character)
- added which_na - replacement for which(is.na(x))
- miss_scan_count. This makes it easier for users to search for particular occurrences of these values across their variables. #119
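The idea of replacing a sentinel value such as -99 with NA across a whole data frame can be sketched in base R. `replace_value_with_na()` and the small `survey` data frame are hypothetical and exist only for this illustration; the naniar functions take a formula condition instead of a single value.

```r
# Hypothetical helper: set every cell equal to `value` to NA, in all columns.
replace_value_with_na <- function(data, value) {
  data[] <- lapply(data, function(column) {
    # Guard with !is.na() so existing NAs do not produce NA in the index.
    column[!is.na(column) & column == value] <- NA
    column
  })
  data
}

# Made-up data in which -99 encodes "missing".
survey <- data.frame(
  age = c(34, -99, 51),
  income = c(-99, 42000, -99)
)

cleaned <- replace_value_with_na(survey, -99)
cleaned
```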
n_miss_row calculates the number of missing values in each row, returning a
vector. There are also 3 other functions which are similar in spirit:
n_complete_row, prop_miss_row, and prop_complete_row, which return a
vector of the number of complete observations, the proportion of missings in a
row, and the proportion of complete observations in a row
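These row-wise quantities correspond to rowSums() and rowMeans() over the missingness matrix. The base R sketch below illustrates the calculations only and does not reflect the optimised code in naniar; the variable names are chosen for this example.

```r
missing_cells <- is.na(airquality)

n_miss_per_row        <- rowSums(missing_cells)    # missing values per row
n_complete_per_row    <- rowSums(!missing_cells)   # complete values per row
prop_miss_per_row     <- rowMeans(missing_cells)   # proportion missing per row
prop_complete_per_row <- rowMeans(!missing_cells)  # proportion complete per row

head(unname(n_miss_per_row))
```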
add_miss_cluster is a new function that calculates a cluster of missingness
for each row, using hclust. This can be useful in exploratory modelling
of missingness, similar to Tierney et al. 2015 and Barnett et al. 2017.
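One way to cluster rows by their missingness with hclust is sketched below. The distance measure (Euclidean on a 0/1 matrix), the default linkage, and the choice of two clusters are assumptions made for this illustration; add_miss_cluster may use different settings.

```r
# Encode missingness as a 0/1 matrix with one row per case.
miss_matrix <- is.na(airquality) * 1

# Hierarchical clustering of rows on their missingness patterns.
row_dist <- dist(miss_matrix)
row_tree <- hclust(row_dist)

# Cut the tree into two groups and attach the label to the data.
airquality_clustered <- airquality
airquality_clustered$miss_cluster <- cutree(row_tree, k = 2)

table(airquality_clustered$miss_cluster)
```

The resulting cluster label can then be used as a response in an exploratory model, for example a decision tree, to see which variables predict the missingness pattern.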
Now exported where_na - a function that returns the positions of NA values.
For a dataframe it returns a matrix of row and col positions of NAs, and for
a vector it returns a vector of positions of NAs. (#105)
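Base R's which() gives the same two kinds of result, which may help when reading the description above. This sketch shows the underlying idea and is not the where_na source; the object names are chosen for this example.

```r
# For a vector: the positions of the NA values.
na_positions_vector <- which(is.na(airquality$Ozone))
head(na_positions_vector)

# For a data frame: a matrix with one row per NA, giving its row and column.
na_positions_frame <- which(is.na(airquality), arr.ind = TRUE)
head(na_positions_frame)
```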
- facet features and order_cases.
- bind_shadow gains an only_miss argument. When set to FALSE (the default) it will bind a dataframe with all of the variables duplicated with their shadow. Setting this to TRUE will bind only those variables that contain missing values.
- gg_miss_case to be clearer and less cluttered (#117), also added an order_cases option to order by cases.
- facet argument to gg_miss_var, gg_miss_case, and gg_miss_span. This makes it easier for users to visualise these plots across the values of another variable. In the future I will consider adding facet to the other shorthand plotting functions, but at the moment these seemed to be the ones that would benefit the most from this feature.
- oceanbuoys now is numeric type for year, latitude, and longitude, previously it was factor. See related issue
- shadow_shift when there are Inf or -Inf values (see #117)

Deprecated replace_to_na, with replace_with_na, as it is a more natural
phrase ("replace coffee to tea" vs "replace coffee with tea"). This will be
made defunct in the next version.
cast_shadow no longer works when called as cast_shadow(data). This
action used to return all variables, and then shadow variables only for the
variables that contained missing values. This was inconsistent with
the use of cast_shadow(data, var1, var2). A new option has been added
to bind_shadow that controls this - discussed below. See more details at
issue 65.
Change behaviour of cast_shadow so that the default option is to return
only the variables that contain missings. This is different to bind_shadow,
which binds a complete shadow matrix to the dataframe. A way to think about
this is that the shadow is only cast on variables that contain missing values,
whereas a bind is binding a complete shadow to the data. This may change in
the future to be the default option for bind_shadow.
naniar
=========================
naniar onto CRAN, updates to naniar will happen reasonably regularly after this, approximately every 1-2 months
=========================
naniar
- miss_case_cumsum / miss_var_cumsum / replace_to_na
- gg_var_cumsum & gg_case_cumsum
- group_by is now respected by the following functions:
miss_case_cumsum()
miss_case_summary()
miss_case_table()
miss_prop_summary()
miss_var_cumsum()
miss_var_run()
miss_var_span()
miss_var_summary()
miss_var_table()
- label_missing* to label_miss to be more consistent with the rest of naniar
- pct and prop helpers (#78)
- miss_df_pct - this was literally the same as pct_miss or prop_miss.
- gg_miss_var gets a show_pct argument to show the percentage of missing values (Thanks Jennifer for the helpful feedback! :))
- miss_var_summary & miss_case_summary now have consistent output (one was ordered by n_missing, not the other).
- miss_case_pct
- enquo_x is now x (as advised by Hadley)
=========================
- replace_to_na is a complement to tidyr::replace_na and replaces a specified value from a variable to NA.
- gg_miss_fct returns a heatmap of the number of missings per variable for each level of a factor. This feature was very kindly contributed by Colin Fay.
- gg_miss_ functions now return a ggplot object, which behaves as such. gg_miss_ basic themes can be overridden with ggplot functions. This fix was very kindly contributed by Colin Fay.
- add_* functions handle bare unquoted names where appropriate as per #61
- add_* family
- geom_missing_point() to geom_miss_point(), to keep consistent with the rest of the functions in naniar.
=========================
- brfss and tao as per #59
=========================
add_label_missings()
add_label_shadow()
cast_shadow()
cast_shadow_shift()
cast_shadow_shift_label()
added github issue / contribution / pull request guides
ts generic functions are now miss_var_span, miss_var_run, and
gg_miss_span, and work on data.frames, as opposed to just ts objects.
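The notion of a "run" of missing or complete values can be illustrated with base R's rle(). This is a sketch of the concept on a single column of airquality; the column names in `run_summary` are chosen for this example and are not claimed to match the output of miss_var_run.

```r
# Run-length encode the missingness of one variable, in row order.
runs <- rle(is.na(airquality$Ozone))

run_summary <- data.frame(
  run_length = runs$lengths,
  is_na = ifelse(runs$values, "missing", "complete")
)

head(run_summary)
```

Long "missing" runs point to stretches where measurement stopped altogether, as opposed to isolated gaps.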
add_shadow_shift() adds a column of shadow_shifted values to the current
dataframe, adding "_shift" as a suffix
cast_shadow() - acts like bind_shadow() but allows for specifying which
columns to add
shadow_shift now has a method for factors - powered by
forcats::fct_explicit_na() #3
- is_na function to label_na
- tidy-miss-[topic]
- gg_missing_* is changed to gg_miss_* to fit with other syntax
- miss_cat, shadow_df and shadow_cat, as they are no longer needed, and have been superseded by label_missing_2d, as_shadow, and is_na.
- pedestrian - contains hourly counts of pedestrians
- miss_ts_run(): return the number of missings / complete in a single run
- miss_ts_summary(): return the number of missings in a given time period
- gg_miss_ts(): plot the number of missings in a given time period
- naniar to narnia - I had to explain the spelling a few times when I was introducing the package and I realised that I should change the name. Fortunately it isn't on CRAN yet.
=========================
- prop_miss and the complement prop_complete. Where n_miss returns the number of missing values, prop_miss returns the proportion of missing values. Likewise, prop_complete returns the proportion of complete values.

The left hand side functions have been made defunct in favour of the right hand side.
- percent_missing_case() --> miss_case_pct()
- percent_missing_var() --> miss_var_pct()
- percent_missing_df() --> miss_df_pct()
- summary_missing_case() --> miss_case_summary()
- summary_missing_var() --> miss_var_summary()
- table_missing_case() --> miss_case_table()
- table_missing_var() --> miss_var_table()
=========================
- miss_* = I want to explore missing values
- miss_case_* = I want to explore missing cases
- miss_case_pct = I want to find the percentage of cases containing a missing value
- miss_case_summary = I want to find the number / percentage of missings in each case
- miss_case_table = I want a tabulation of the number / percentage of cases missing

This is more consistent and easier to reason with.
Thus, I have renamed the following functions:
- percent_missing_case() --> miss_case_pct()
- percent_missing_var() --> miss_var_pct()
- percent_missing_df() --> miss_df_pct()
- summary_missing_case() --> miss_case_summary()
- summary_missing_var() --> miss_var_summary()
- table_missing_case() --> miss_case_table()
- table_missing_var() --> miss_var_table()
These will be made defunct in the next release, 0.0.6.9000 ("The Wood Between Worlds").
=========================
- n_complete is a complement to n_miss, and counts the number of complete values in a vector, matrix, or dataframe.
- shadow_shift now handles cases where there is only 1 complete value in a vector.
- testthat.
=========================
After a burst of effort on this package I have done some refactoring and thought hard about where this package is going to go. This meant that I had to make the decision to rename the package from ggmissing to naniar. The name may strike you as strange but it reflects the fact that there are many changes happening, and that we will be working on creating a nice utopia (like Narnia by CS Lewis) that helps us make it easier to work with missing data
add_n_miss and add_prop_miss are helpers that add columns to a dataframe
containing the number and proportion of missing values. An example has been
provided to use decision trees to explore missing data structure as in
Tierney et al
geom_miss_point() now supports transparency, thanks to @seasmith (Luke Smith)
more shadows. These are mainly around bind_shadow and gather_shadow,
which are helper functions to assist with creating
geom_missing_point() broke after the new release of ggplot2 2.2.0, but this
is now fixed by ensuring that it inherits from GeomPoint, rather than just a new
Geom. Thanks to Mitchell O'hara-Wild for his help with this.
missing data summaries table_missing_var and table_missing_case also now
return more sensible numbers and variable names. It is possible these function
names will change in the future, as these are kind of verbose.
semantic versioning was incorrectly entered in the DESCRIPTION file as 0.2.9000, so I changed it to 0.0.2.9000, and then to 0.0.3.9000 now to indicate the new changes, hopefully this won't come back to bite me later. I think I accidentally did this with visdat at some point as well. Live and learn.
gathered related functions into single R files rather than leaving them in their own.
correctly imported the %>% operator from magrittr, and removed a lot of
chaff around @importFrom - really don't need to use @importFrom that often.
=========================
geom_missing_point() now works in a way that we expect! Thanks to Miles
McBain for working out how to get this to work.
=========================
- percent_missing_df returns the percentage of missing data for a data.frame
- percent_missing_var the percentage of variables that contain missing values
- percent_missing_case the percentage of cases that contain missing values.
- table_missing_var table of missing information for variables
- table_missing_case table of missing information for cases
- summary_missing_var summary of missing information for variables (counts, percentages)
- summary_missing_case summary of missing information for cases (counts, percentages)