Fitting a Statistical Distribution to Sampled Data
I’m generally not too interested in fitting analytical distributions to my data. With large enough samples (which I am normally fortunate enough to have!) I can safely assume normality for most statistics of interest.
Recently I had a relatively small chunk of data and finding a decent analytical approximation was important. So I had a look at the tools available in R for addressing this problem. The fitdistrplus package seemed like a good option. Here’s a sample workflow.
Create Some Data
To have something to work with, generate 1000 samples from a log-normal distribution.
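A minimal sketch of that step (the seed and the `meanlog`/`sdlog` values are arbitrary choices for illustration, not from the original analysis):

```r
# Reproducible sample: 1000 draws from a log-normal distribution.
set.seed(37)
x <- rlnorm(1000, meanlog = 0, sdlog = 0.5)
summary(x)
```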
Load up the package and generate a skewness-kurtosis plot.
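Something along these lines, assuming the sample is in a vector `x` as above; `descdist()` prints the summary statistics and draws the skewness-kurtosis (Cullen and Frey) plot:

```r
library(fitdistrplus)

# Simulated data (same arbitrary parameters as before).
set.seed(37)
x <- rlnorm(1000, meanlog = 0, sdlog = 0.5)

# Summary statistics plus the skewness-kurtosis plot;
# boot adds bootstrapped points to indicate uncertainty.
descdist(x, boot = 500)
```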
There’s nothing magical in those summary statistics, but the plot is most revealing. The data are represented by the blue point. Various distributions are represented by symbols, lines and shaded areas.
We can see that our data point lies close to the log-normal curve (no surprises there!), suggesting that a log-normal distribution is the most likely candidate.
We don’t need to take this at face value though because we can fit a few distributions and compare the results.
We’ll start out by fitting a log-normal distribution using fitdist().
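A sketch of the fit, again assuming the simulated sample from earlier:

```r
library(fitdistrplus)

# Simulated data (same arbitrary parameters as before).
set.seed(37)
x <- rlnorm(1000, meanlog = 0, sdlog = 0.5)

# Maximum likelihood fit of a log-normal distribution.
fit.lnorm <- fitdist(x, "lnorm")
summary(fit.lnorm)

# Diagnostic panel: density, CDF, Q-Q and P-P plots.
plot(fit.lnorm)
```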
The quantile-quantile plot indicates that, as expected, a log-normal distribution gives a pretty good representation of our data. We can compare this to the results of fitting a normal distribution, where there is significant divergence in the tails of the quantile-quantile plot.
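The comparison might look like this (a sketch, reusing the simulated sample); `qqcomp()` overlays the quantile-quantile plots of both fits:

```r
library(fitdistrplus)

# Simulated data (same arbitrary parameters as before).
set.seed(37)
x <- rlnorm(1000, meanlog = 0, sdlog = 0.5)

fit.lnorm <- fitdist(x, "lnorm")
fit.norm  <- fitdist(x, "norm")

# Overlay Q-Q plots for the two candidate distributions.
qqcomp(list(fit.lnorm, fit.norm),
       legendtext = c("log-normal", "normal"))
```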
If we fit a selection of plausible distributions then we can objectively evaluate the quality of those fits.
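One way to do this (the particular set of candidate distributions here is an assumption for illustration) is to pass all the fits to `gofstat()`, which tabulates goodness-of-fit statistics including AIC and BIC:

```r
library(fitdistrplus)

# Simulated data (same arbitrary parameters as before).
set.seed(37)
x <- rlnorm(1000, meanlog = 0, sdlog = 0.5)

# Fit a selection of plausible distributions.
fits <- list(
  lnorm   = fitdist(x, "lnorm"),
  norm    = fitdist(x, "norm"),
  gamma   = fitdist(x, "gamma"),
  weibull = fitdist(x, "weibull")
)

# Tabulated goodness-of-fit statistics (AIC, BIC, KS, etc.).
gofstat(fits, fitnames = names(fits))

# Log-likelihoods for direct comparison.
sapply(fits, function(f) f$loglik)
```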
According to these results the log-normal distribution gives the best fit: it has the smallest AIC and the largest log-likelihood.
Of course, with real (as opposed to simulated) data, the situation will probably not be as clear cut. But with these tools it’s generally possible to select an appropriate distribution and derive appropriate parameters.