Toolbox for Model Building and Forecasting
Implements functions and instruments for regression model building and its
application to forecasting. The main scope of the package is in variables selection
and models specification for cases of time series data. This includes promotional
modelling, selection between different dynamic regressions with non-standard
distributions of errors, selection based on cross validation, solutions to the fat
regression model problem and more. Models developed in the package are tailored
specifically for forecasting purposes. So as a results there are several methods
that allow producing forecasts from these models and visualising them.
The package greybox contains functions for model building, which is currently done via the model selection and combinations based on information criteria. The resulting model can then be used in analysis and forecasting.
There are several groups of functions in the package.
Regression model functions
- alm - advanced linear regression model that implements likelihood estimation of parameters for Normal, Laplace, Asymmetric Laplace, Logistic, Student's t, S, Folded Normal, Log Normal, Chi-Squared, Beta, Poisson, Negative Binomial, Cumulative Logistic and Cumulative Normal distributions. In a sense this is similar to
glm() function, but with a different set of distributions and with a focus on forecasting.
- stepwise - function implements stepwise IC based on partial correlations.
- lmCombine - function combines the regression models from the provided data, based on IC weights and returns the combined alm object.
Exogenous variables transformation functions
- xregExpander - function produces lags and leads of the provided data.
- xregTransformer - function produces non-linear transformations of the provided data (logs, inverse etc).
- xregMultiplier - function produces cross-products of the variables in the matrix. Could be useful when exploring interaction effects of dummy variables.
The data analysis functions
- cramer - calculates Cramer's V for two categorical variables. Plus tests the significance of such association.
- mcor - function returns the coefficients of multiple correlation between the variables. This is useful when measuring association between categorical and numerical variables.
- association (aka 'assoc()') - function returns matrix of measures of association, choosing between cramer(), mcor() and cor() depending on the types of variables.
- determination (aka 'determ()') - function returns the vector of coefficients of determination (R^2) for the provided data. This is useful for the diagnostics of multicollinearity.
- tableplot - plots the graph for two categorical variables.
- spread - plots the matrix of scatter / boxplot / tableplot diagrams - depending on the type of the provided variables.
- graphmaker - plots the original series, the fitted values and the forecasts.
Models evaluation functions
- ro - rolling origin evaluation (see the vignette).
- rmc - Regression for Multiple Comparison of forecasting methods. Can be used, for example, when RelMAE is calculated for several forecasting methods and an analysis of statistical significance in accuracy of methods needs to be carried out. This can be especially useful when you have a lot of methods to compare. The test is faster than Nemenyi in this case and becomes more powerful and accurate.
- measures - the error measures for the provided forecasts. Includes MPE, MAPE, MASE, sMAE, sMSE, RelMAE, RelRMSE, MIS, sMIS, RelMIS, pinball and others.
- qlaplace, dlaplace, rlaplace, plaplace - functions for Laplace distribution.
- qalaplace, dalaplace, ralaplace, palaplace - functions for Asymmetric Laplace distribution.
- qs, ds, rs, ps - functions for S distribution.
- qfnorm, dfnorm, rfnorm, pfnorm - functions for folded normal distribution.
- qtplnorm, dtplnorm, rtplnorm, ptplnorm - functions for three parameter log normal distribution.
Methods for the introduced and some existing classes:
- pointLik - point likelihood method for the time series models.
- pAIC, pAICc, pBIC, pBICc - respective point values for the information criteria, based on pointLik.
- summary - returns summary of the regression (either selected or combined).
- vcov - covariance matrix for combined models. This is an approximate thing. The real one is quite messy and not yet available.
- confint - confidence intervals for combined models.
- predict, forecast - point and interval forecasts for the response variable. forecast method relies on the parameter h (the forecast horizon), while predict is focused on the newdata. See vignettes for the details.
- nParam - returns number of estimated parameters in the model (including variance, location, shift).
- getResponse - returns the response variable from the model.
- plot - plots the basic linear graph of actuals and fitted. Similar thing plots graphs for forecasts of greybox functions.
- AICc - AICc for regression with normally distributed residuals.
- BICc - BICc for regression with normally distributed residuals.
- is.greybox, is.alm etc. - functions to check if the object was generated by respective functions.
- lmDynamic - linear regression with time varying parameters based on pAIC.
The stable version of the package is available on CRAN, so you can install it by running:
A recent, development version, is available via github and can be installed using "devtools" in R. First make sure that you have devtools:
and after that run:
greybox v0.5.1 (Release data: 2019-04-27)
- The parameter "silent" is now renamed into "quiet" in ro(), xregExpander(), xregMultiplier() and xregTransformer() functions. It only specifies the output in console now.
- Similarly, "bruteForce" is now "bruteforce" in lmCombine(), lmDynamic() and determ(),
- "B" is now "parameters" in alm(),
- "checks" is now "fast" (note, the opposite meaning),
- "style" is now "outplot" in rmc().
- All the functions now return "fitted" instead of "fitted.values". This is not a big deal, the fitted() method will work as always.
- forecast() function now accepts h - the forecast horizon. If h!=nrow(newdata), then the function will either cut off values, or produce forecasts for the explanatory variables.
- alm() would not do checks properly when only one explanatory variable was used.
- lmCombine() and lmDynamic() would not work because they would refer to non-existent "ourModel". Thanks leungi for the bug report!
- lmCombine() and lmDynamic() used to warn about the computational time after doing things, not before that.
- predict.alm() would not work well if only one-step-ahead prediction was needed.
greybox v0.5.0 (Release data: 2019-04-20)
- New function - polyprod(), returning the product of two polynomials.
- alm() now has parameters ar and i, which define the order of respective elements of ARIMA model.
- alm() now checks for stationarity of AR.
- alm() now produces the fitted for the zero observations.
- alm() now accepts parameters for the nloptr.
- In case of occurrence model, the expected entropy is added to the likelihood.
- Tuning of the initials for the recursive model.
- alm() with distribution=c("pnorm","plogis") and ARI(p,d) now produces adequate estimates of probability and forecasts.
- Error measures imported from smooth. Accuracy() function is renamed into measures().
greybox v0.4.2 (Release data: 2019-03-10)
- nParam, nobs, sigma and AICc, BICc methods for the "varest" class of the functions for "vars" package. Should allow selection using the corrected AIC and BIC.
- graphmaker() now does not need for forecast to start at the right place. If start(actuals)=start(forecast), it will place forecast at the end.
- graphmaker() now also allows not to reset par, so that you can add arbitrary elements to the graph.
- rmc() now returns the exponentiated values of means and intervals in case of distribution="dlnorm".
- nParam() method has been renamed into nparam().
- A new method - actuals(), which returns actuals from the model (similar to "getResponse" of forecast package).
- nobs() and actuals() now also have a hidden parameter all, which determines, whether to return all the values or only demand sizes in case of occurrence model. The default behaviour for nobs is FALSE, and for actuals is TRUE. This is driven by the functions relying on these methods.
- xregExpander() now allows specifying how to fill in the gaps for the lagged variables.
- Fixed a bug in stepwise(), due to which the wrong fitted values were generated.
- Annoying bug with occurrence!="none" models and factors.
- lmCombine and lmDynamic did not work correctly with NAs.
- alm produced errors for simple regressions and occurrence.
- stepwise would produce errors, when there was no variability in the data.
greybox v0.4.1 (Release data: 2019-01-27)
- The extended vignette on marketing analytics tools (tableplot, spread, cramer, mcor, assoc and determ).
- determination() can now be smart and use stepwise(). This might be especially useful for cases of fat regressions diagnostics.
- stepwise() now also works with factors. And faster than the previous version, when dat is numeric.
- use parameter in mcor(), cramer() and assoc(). By default NAs are removed.
- lmDynamic and lmCombine now work with factors and with all the alm distributions.
- lmDynamic and lmCombine now also have the parameter paralle, which defines whether to make calculations in parallel or not.
- alm() now removes NaNs if they are present in the data.
- Fixes in mcor and assoc, which did not work correctly in some cases of factors provided as x or y.
- xregExpander() did not work appropriately when extrapolate=FALSE.
- Fixed an annoying bug in predict.greybox(), when the newdata did not contain the response variable.
- stepwise() sometimes produced data with wrong colnames. Now it doesn't.
greybox v0.4.0 (Release data: 2019-01-04)
- Added Burnham & Anderson in the library.bib in the vignettes.
- summary() now prints the name of the response variable.
- lmCombine now returns logLik adequate to the selected distribution.
- Sample size, estimated parameters and degrees of freedom are now also returned in the summary.
- Use Choleski decomposition in vcov.alm function instead of solve.
- xregMultiplier() function, allowing producing cross-products of variables.
- Two new cool functions: tableplot() - produces plots for the two categorical variables, showing graphically, where the most frequent values happen; spread() - plots a matrix of scatterplots / boxplots / tableplots, depending on the type of the provided variable.
- Added a clarification about the most efficient use of RMC (together with RelMAE / RelMSE).
- alm() now works with factors.
- spread() now allows doing log transforms of numerical data.
- New function: cramer() - that calculates Cramer's V and the according statistics. Good for measuring the association between the categorical variables.
- New function: mcor() - multiple correlation between the numerical and catgorical variables.
- New function: association() aka assoc() - returns the matrix of measures of association (the values depend on the types of variables under consideration).
- xregExpander() now has 'extrapolate' parameter which allows deciding whether the missing values need to be extrapolated or not.
- graphmaker() now does not plot forecast if it is NA. In addition, the legend is now slightly more flexible.
- summary() now prints the df for dchisq and size for the dnbinom.
- tableplot() now also accepts dataframes, plotting the first two columns.
- rmc() now should work much faster in cases of distribution=c("dnorm","dlnorm").
- plot.rmc() now only resets par(), when style="lines". In the "mcb" style, it won't change par, so that the user can add any elements they want.
- alm() now estimates sigma parameter for dfnorm directly using likelihood.
- lmCombine() and lmDynamic() did not work well when the data.frame was provided as data.
- pointLik() did not work with lmCombine() because scale parameter was not available. Similarly, it did not work with stepwise() in case of normal distribution.
- predict.alm() was misbehaving in case of non-null occurrence.
- determination() now works with factors.
- Additional explanations for RMC.
- Bugfix in alm() for cases of occurrence and the provided factors.
- vcov.alm would not work in cases of occurrence model having different set of variables than the sizes one.
- nParam() did not take the number of parameters in the occurrence part into account.
- plot.predict.alm() now works fine in case of newdata=NULL.
- Fixes in alm() for dalaplace, dnbinom and dchisq distributions and the usage of factors.
- predict.alm() would not work for the models with intercept only.
- alm() in cases of distribution="plogis" or "pnorm" some times could not produce errors correctly (due to exp(huge number)). Now it does.
greybox v0.3.3 (Release data: 2018-11-27)
- Student t distribution in alm.
- Beta distribution in alm. When will I stop? I guess, I'll do that when I stop procrastinating...
- New functions for three parameter log normal distribution.
- New function for the non-linear transformation of the provided variables - xregTransformer. Use with care!
- Renamed parameter "b" into "scale" in laplace, alaplace and s functions.
- lmCombine now returns a matrix with the selected variables and the respective information criteria.
- Corrected a typo of "plogos" in alm.
- is-functions for greybox now rely on "inherits" function.
- Some bugfixes in alm() with dchisq. But there's a lot of confusion there, including stuff in predict.alm.
- Stepwise was not calculating the number of degrees of freedom correctly in case of distribution="dnorm".
- Stepwise did not call for alm in case of distribution!="dnorm".
- Bugfix in rs, rlaplace and other r-functions, where thethe duplicates of the provided parameters were removed. This caused problems in cases of huge samples, when identical random numbers could have appeared.
greybox v0.3.2 (Release data: 2018-10-25)
- Updates in the vignette of alm.
- Although the square link is tricky in case of Chi Squared distribution, it is the correct thing to do. The alm() function now checks if the generated mu_t is positive, and if not, it returns a big number, forcing the solver to stick with the positive solution.
- If inverting Hessian fails, return very big values (meaning high uncertainty).
- predict.alm() now saves the original level of probability.
- New initialisation for plogis, pnorm, dpois and dnbinom in alm().
- dpois, plogis, pnorm and dnbinom now use maxeval=500. All the others have 100. This should improve the estimates of parameters in difficult cases.
- We now return only that data, that was used in the model construction in alm().
- lmCombine and lmDynamic now should work with the distributions of alm(). The only two that are not 100% correct are dchisq and dfnorm - the fitted values of those are incorrect.
- removed getResponse.alm. Now getResponse.greybox does what is necessary.
- Residuals of dnbinom and dpois are now calculated as y - fitted.
- predict with distriubion="dnbinom" in cases, when scale is not available, is now calculated based on the definition of scale via variance and mean.
- pointLik.ets() is now calculated differently, so that sum(pointLik) is close to the logLik produced by ets() function. The problem with logLik of ets() is that it is not calculated correctly, chopping off some parts of normal distribution. Total disaster!
- New set of distribution functions - for Asymmetric Laplace Distribution (ALD).
- alm() now estimates models with Asymmetric Laplace Distribution with predefined alpha parameter. This is equivalent to the quantile regression with alpha quantile, but is done from the likelihood point of view. It also allows estimating alpha in sample.
- The correct prediction and confidence intervals for the alm() with ALD.
- predict function now also works, when newdata is not provided (although why would you want to do that?).
- predict.alm() sometimes produced NAs in the lower bound.
- When having varying probability, plot.predict sometimes struggled to use the correct value.
- plot.predict.greybox() now passes values from ellipsis to graphmaker.
- The intervals for dnorm are now corrected for the cases of occurrence model.
- plot.predict works differently when there are Inf values in the bounds.
- predict() did not work correctly for simple linear regression.
- alm() returned a vector in data for cases of the model with intercept only.
- predict.alm() with distribution=c(dlaplace, ds, dfnorm) did not work in some cases of fixed level of probability.
- predict.alm() now writes lower and upper values in the existing elements of the list instead of creating the new ones.
- predict.alm() did not produce prediction intervals for "dnorm".
- plot.greybox() now checks whether there is a need to transform the data to the binary variable or not.
greybox v0.3.1 (Release data: 2018-09-07)
- Corrected some typos in README.md and added description of several functions.
- predict() and forecast() functions now produce confidence and prediction intervals for the provided holdout sample data. forecast() is just a wrapper around predict().
- Normal and log-normal distributions are now available in alm().
- rmc() now uses alm().
- stepwise(), lmCombine() and lmDynamic() can now also be constructed with distributions from alm(). They use lm() in case of "dnorm" and alm() otherwise.
- alm() now does not return vcov if you didn't ask for it (should increase speed of computation for large datasets).
- alm() can be constructed with the provided vector of parameters (needed for vcov method).
- We now use well-known analytical solutions for the cases of distribution="dnorm" of alm() and other functions.
- Code of lmCombine and lmDynamic is slightly simplified.
- We now use Choleski decomposition for the calculation of the inverse of matrices in alm.
- distribution="dlogis" is now available for alm().
- alm() now also supports logit and probit models, which are called using distribution="plogis" and distribution="pnorm" respectively (reference to the names of respective CDFs in R).
- alm() now has occurrence parameter, which allows dealing with zeroes in the data. In this case, a mixture distribution can be used.
- alm() with dlnorm now also returns analytical covariance matrix instead of hessian based one.
- stepwise(), lmCombine() and lmDynamic() now rely on .lm.fit() function, when distribution="dnorm", so the speed of calculation should be substantially higher.
- New functions for class checks: is.greybox(), is.alm(), is.greyboxC(), is.greyboxD(), is.rmc() and is.rollingOrigin().
- stepwise() now calculates only the necessary correlations. This allows further inceasing the speed of computation.
- alm() uses its own mean function, so this should also increas its speed.
- Correct prediction intervals for the model with the occurrence part and a new parameter in prediction function - side - which allows producing one-sided PIs.
- stepwise() should now work better with big data.
- Futher optimisation of stepwise in order to decrease the used memory.
- alm() and all the other functions now return "data" instead of "model" and don't produce terms and qr. This should save some space.
- vcov.alm() now uses call in order to reestimate the model.
- rmc() now returns groups of methods. This can be used for analytical purposes.
- alm() now uses a more refined parameters for vcov calculation for "dchisq" and returns a slightly different call with vcov.
- pointLik.alm() method for alm class.
- alm() now extracts meaningful residuals depending on the distribution used. e.g. dnorm -> y - mu, dlnorm -> log(y) - mu
- stepwise() now allows defining occurrence model. So now you can do something like: stepwise(ourData, distribution="dlnorm", occurrence=stepwise(ourData, distribution="plogis"))
- predict function now returns probabilities for the lower and upper intervals. So if you had side="upper", then the lower will be "0", and the upper will be the specified level.
- dpois and dnbinom distributions in alm. alm() allows producing prediction intervals for both of them. But covariance matrix of parameters for dnbinom might be hard to calculate...
- The dispersion parameter of dnbinom in alm() is now estimated separately, which now solves a lot of problems.
- Renamed parameter A into B for alm(). Very serious thing!
- distribution="dchisq" in alm() now estimates the non-central Chi Squared distribution. The returned scale corresponds to the estimated number of degrees of freedom, while mu is the exponent of the expectation.
- rmc() now colours the lines depending on the number of groups. If there's only one, then there's one group and the differences are not significant.
- Started a new vignette for the alm() function.
- graphmaker() is now moved from smooth to greybox.
- Fixed a bug with the style="line" in rmc(), where the grouping would be wrong in cases, when one method significantly differs from the others.
- logLik previously was not calculated correctly for the mixture models.
- Bugfix in hessian calculation, when Choleski decomposition works...
- Bugfix in pointLik for the models with occurrence.
- predict() function failed with newdata with one observation.
- Initials of both Poisson and NegBin in case of non-zero data are now taken with logs. This leads to more robust starting points.
greybox v0.3.0 (Release data: 2018-08-05)
- New cool function - lmDynamic() - that constructs a dynamic linear regression based on point ICs.
- New set of functions for Folded normal distribution.
- New function - alm - Advanced Linear Model.
- Folded normal distribution for rmc() with value="a".
- Proper model for chi-squared distribution in alm and rmc.
- Renamed distributions in the alm function.
- determination() function did not work in cases of 2 variables.
- vcov() and confint() were misbehaving when nVars==1.
greybox v0.2.3 (Release data: 2018-08-02)
- determination() now automatically drops variables with no variability.
- New function - nemenyi() - imported from TStools with minor bugfixes and corrections.
- It appears that Nikos is against the move of nemenyi() function from TStools to greybox. This was a misunderstanding between the two of us. So no nemenyi() function here, nothing to see here, move along!
- New function for multiple comparison of methods based on regression analysis - rmc(). This is a parametric analogue of nemenyi test. The function works with errors, their absolute and squared values and relies on lm / glm.
- New methods imported from smooth: errorType, pointLik and pAIC.
- plots of ro() were misaligned in case of co=FALSE
- ro() now also returns the correct actual values (previously they could be cut off when ci=FALSE).
greybox v0.2.2 (Release data: 2018-05-25)
- New description of the package and badges in README.md
- New function - determination() - returns R-squares for the provided data. This can be useful when you need to analyse the multicollinearity effect.
- nParam method for logLik class.
- BICc - new method for the classes, implementing, guess what?
- Updated description of the package in the help file.
- ro() now returns a class and has print and plot methods associated with it.
- ro() is much more flexible now, returning whatever you want in an adequate format.
- New methods for the greybox functions: confint, vcov.
- Renamed "combiner" into "lmCombine", because it makes more sense. We will use "combine" name for a more general function that would combine forecasts from arbitrary provided models (e.g. smooth, forecast and lm classes).
- sigma() method returned the wrong standard error in cases of combined models.
greybox v0.2.1 (Release data: 2018-05-01)
- New description of the package and badges in README.md
greybox v0.2.1 (Release data: 2018-05-01)
- print.summary now specifies digits. Summary does not round up anything. This corresponds to the normal behaviour of these methods.
- Implemented Laplace distribution, which is useful when models are estimated using MAE.
- Sped up qs() and qlaplace() functions using the inverse cumulative functions.
- New function - ro() - Rolling origin.
- qs() returned weird values when several 0 and 1 were specified as probabilities.
greybox v0.2.0 (Release data: 2018-03-10)
- combiner now uses a more clever mechanism in case of bruteForce==FALSE.
- combiner now also checks if the provided data has ncol>nrow and sets bruteForce if it has.
- Use Kendall Tau as default in cor() for stepwise.
- Don't use Kendall Tau as default everywhere - only for fat regressions.
- New summary and print methods for models from stepwise. No statistical tests printed, only confidence intervals and ICs.
- AICc for smooth functions in case of iSS models should take only the demand sizes into account, not all the parameters.
greybox v0.1.1 (Release data: 2018-03-05)
- We now do not depend on smooth. We suggest it. It's smooth that should depend on greybox!
- New function imported from smooth - AICc.
- New functions for the S distribution (the maximisation of likelihood of which corresponds to the minimum of HAM): ds, ps, qs, rs.
- stepwise now returns the object of two classes: greybox and lm.
- combiner now returns three classes: greybox, lm and greyboxC.
- nParam is moved to greybox from smooth.
- If smooth is not installed, plot forecasts using simpler function.
- The forecasts are now produced for the combined models in cases of fat regressions.
greybox v0.1.0 (Release data: 2018-03-03)
- Initial release. stepwise() and xregExpander() are imported here from smooth package.
- combiner() function that combines lm() models. This thing is in the development right now.
- combiner() has a meaningful summary() now. Working to make it more accesible to lm functions.
- summary() for combiner now returns the list of values.
- stepwise() should now perform slightly better.
- combiner() can now be smart and use stepwise for the models pool creation.
- combined lm model can now be used together with predict() and forecast() functions.
- plot() and forecast() methods for the combined functions.