Refs:
The typical way to fit a distribution is to use function MASS::fitdistr
:
library(MASS)
set.seed(101)
my_data <- rnorm(250, mean=1, sd=0.45) # unkonwn distribution parameters
fit <- fitdistr(my_data, densfun="normal") # we assume my_data ~ Normal(?,?)
fit
## mean sd
## 0.99613836 0.42405694
## (0.02681972) (0.01896440)
hist(my_data, pch=20, breaks=25, prob=TRUE, main="")
curve(dnorm(x, fit$estimate[1], fit$estimate[2]), col="red", lwd=2, add=T)
fitdistr
uses optim
to estimate the parameter values by maximizing the likelihood function. So it works like this:
log_likelihood <- function(params) { -sum(dnorm(my_data, params[1], params[2], log=TRUE)) }
fit2 <- optim(c(0,1), log_likelihood) # c(0,1) are just initial guesses
fit2
## $par
## [1] 0.9962070 0.4240523
##
## $value
## [1] 140.2628
##
## $counts
## function gradient
## 65 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
hist(my_data, pch=20, breaks=25, prob=TRUE)
curve(dnorm(x, fit2$par[1], fit2$par[2]), col="blue", lwd=6, add=T) # optim fit
curve(dnorm(x, fit$estimate[1], fit$estimate[2]), col="red", lwd=2, add=T) # fitdistr fit
fitdistrplus
This tutorial uses the fitdistrplus
package for fitting distributions.
library(fitdistrplus)
Let’s use one of the datasets provided by the packge:
data("groundbeef", package = "fitdistrplus")
my_data <- groundbeef$serving
plot(my_data, pch=20)
We can first plot the empirical density and the histogram to gain insight of the data:
plotdist(my_data, histo = TRUE, demp = TRUE)
Another tool is to show some descriptive statistics, like the moments, to help us in making a decision:
descdist(my_data, discrete=FALSE, boot=500)
## summary statistics
## ------
## min: 10 max: 200
## median: 79
## mean: 73.64567
## estimated sd: 35.88487
## estimated skewness: 0.7352745
## estimated kurtosis: 3.551384
The plot also includes a nonparametric bootstrap procedure for the values of kurtosis and skewness.
Say, in the previous eg, we chose the weibull, gamma and log-normal to fit:
fit_w <- fitdist(my_data, "weibull")
fit_g <- fitdist(my_data, "gamma")
fit_ln <- fitdist(my_data, "lnorm")
summary(fit_ln)
## Fitting of the distribution ' lnorm ' by maximum likelihood
## Parameters :
## estimate Std. Error
## meanlog 4.1693701 0.03366988
## sdlog 0.5366095 0.02380783
## Loglikelihood: -1261.319 AIC: 2526.639 BIC: 2533.713
## Correlation matrix:
## meanlog sdlog
## meanlog 1 0
## sdlog 0 1
we can plot the results:
par(mfrow=c(2,2))
plot.legend <- c("Weibull", "lognormal", "gamma")
denscomp(list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
cdfcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
qqcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
ppcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
The fitting can work with other non-base distribution. It only needs that the correspodent, d, p, q functions are implemented.
In the next eg, the endosulfan
dataset cannot be properly fit by the basic distributions like the log-normal:
data("endosulfan", package = "fitdistrplus")
my_data <- endosulfan$ATV
fit_ln <- fitdist(my_data, "lnorm")
cdfcomp(fit_ln, xlogscale = TRUE, ylogscale = TRUE)
To solve this it is used the Burr and Pareto distributions available at package actuar
library(actuar)
fit_ll <- fitdist(my_data, "llogis", start = list(shape = 1, scale = 500))
fit_P <- fitdist(my_data, "pareto", start = list(shape = 1, scale = 500))
fit_B <- fitdist(my_data, "burr", start = list(shape1 = 0.3, shape2 = 1, rate = 1))
cdfcomp(list(fit_ln, fit_ll, fit_P, fit_B), xlogscale = TRUE, ylogscale = TRUE,
legendtext = c("lognormal", "loglogistic", "Pareto", "Burr"), lwd=2)
The package also provides some goodness-of-it statistics:
gofstat(list(fit_ln, fit_ll, fit_P, fit_B), fitnames = c("lnorm", "llogis", "Pareto", "Burr"))
## Goodness-of-fit statistics
## lnorm llogis Pareto Burr
## Kolmogorov-Smirnov statistic 0.1672498 0.1195888 0.08488002 0.06154925
## Cramer-von Mises statistic 0.6373593 0.3827449 0.13926498 0.06803071
## Anderson-Darling statistic 3.4721179 2.8315975 0.89206283 0.52393018
##
## Goodness-of-fit criteria
## lnorm llogis Pareto Burr
## Aikake's Information Criterion 1068.810 1069.246 1048.112 1045.731
## Bayesian Information Criterion 1074.099 1074.535 1053.400 1053.664
Visually and using the previous statistics, it seems that the Burr distribution seems the preferred one among the candidates we chose to explore.
We can apply a bootstrap to estimate the uncertainty in the parameters:
ests <- bootdist(fit_B, niter = 1e3)
summary(ests)
## Parametric bootstrap medians and 95% percentile CI
## Median 2.5% 97.5%
## shape1 0.2024304 0.09504934 0.377547
## shape2 1.5802731 1.01204175 2.959759
## rate 1.4655764 0.65204235 2.702179
##
## The estimation method converged only for 995 among 1000 iterations
plot(ests)
quantile(ests, probs=.05) # 95% percentile bootstrap confidence interval
## (original) estimated quantiles for each specified probability (non-censored data)
## p=0.05
## estimate 0.2939259
## Median of bootstrap estimates
## p=0.05
## estimate 0.3077854
##
## two-sided 95 % CI of each quantile
## p=0.05
## 2.5 % 0.1707940
## 97.5 % 0.4960287
##
## The estimation method converged only for 995 among 1000 bootstrap iterations.