Introducing R: Descriptive Statistics
In the previous installment we derived two new categorical variables for the National Health and Nutrition Examination Survey data. This time we will get some simple descriptive statistics from the data.
Let’s start by looking at a summary of the entire data set. We can exclude the identifier field, since it is just a label and carries no statistical meaning.
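A minimal sketch of that call is given here. Since the survey file is not bundled with this post, the snippet builds a tiny made-up stand-in for DS0012; the column names (id, gender, age, mass) and all values are assumptions for illustration only, and the identifier is assumed to sit in the first column.

```r
# Hypothetical stand-in for the survey data; names and values are assumptions.
DS0012 <- data.frame(
  id     = 1:8,
  gender = factor(c("M", "F", "F", "M", "F", "M", "F", "M")),
  age    = c(4, 9, 15, 27, 38, 52, 66, 71),
  mass   = c(16.5, 30.2, 55.0, 82.3, 64.8, 95.1, 70.4, 77.9)
)

# Summarise every column except the first (the identifier).
overview <- summary(DS0012[, -1])
print(overview)
```

A negative column index drops that column, so the summary covers everything but the identifier.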
This gives the minimum, quartiles, mean and maximum for each of the numerical variables, and the counts for each of the categorical variables. The average age of the subjects is 35. The subjects have masses between 10.4 and 218.2 kg.
We could have extracted these statistics for each of the numerical variables individually.
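As a rough illustration, the individual statistics can be pulled out with mean(), range() and quantile(). The data frame below is again a small invented stand-in, and the column names age and mass are assumptions. The na.rm = TRUE argument is included because survey data often contain missing values; it makes no difference on this toy data.

```r
# Hypothetical stand-in for the survey data; names and values are assumptions.
DS0012 <- data.frame(
  age  = c(4, 9, 15, 27, 38, 52, 66, 71),
  mass = c(16.5, 30.2, 55.0, 82.3, 64.8, 95.1, 70.4, 77.9)
)

# Each statistic is computed on a single column, selected with $.
mean_age       <- mean(DS0012$age, na.rm = TRUE)
mass_range     <- range(DS0012$mass, na.rm = TRUE)
mass_quartiles <- quantile(DS0012$mass, na.rm = TRUE)

print(mean_age)
print(mass_range)
print(mass_quartiles)
```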
It gets a little painful to type out the data frame name every time, but we can attach the DS0012 data frame to R’s search path, which lets us refer to its columns directly and makes things much more compact.
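A short sketch of what attaching looks like, once more on an invented stand-in for DS0012 (the column names age and mass are assumptions):

```r
# Hypothetical stand-in for the survey data; names and values are assumptions.
DS0012 <- data.frame(
  age  = c(4, 9, 15, 27, 38, 52, 66, 71),
  mass = c(16.5, 30.2, 55.0, 82.3, 64.8, 95.1, 70.4, 77.9)
)

# Put the data frame on the search path so its columns resolve by bare name.
attach(DS0012)

mean_age   <- mean(age)
mass_range <- range(mass)

print(mean_age)
print(mass_range)
```

One caveat: attach() works on a copy of the data frame, so later changes to DS0012 are not visible through the bare names until it is detached and attached again.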
That’s better. We can also get a table of counts for an individual categorical variable.
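A one-way table of counts might be produced as sketched here. The column name age.category and its levels are assumptions, and the values are invented. So that the snippet runs on its own it uses with(); when the data frame is attached, the bare call table(age.category) should give the same result.

```r
# Hypothetical stand-in for the survey data; names, levels and values are assumptions.
DS0012 <- data.frame(
  age.category = factor(
    c("child", "child", "child", "teenager", "adult", "adult", "senior"),
    levels = c("child", "teenager", "adult", "mature", "senior")
  )
)

# Count the observations in each level of the factor.
counts <- with(DS0012, table(age.category))
print(counts)
```

Levels with no observations (here "mature") still appear in the table with a count of zero.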
This is precisely the information that we got in the summary above: children make up the largest portion of the sample, followed by adults and then seniors. Teenagers are in the minority. What about generating a contingency table which cross-tabulates two categorical variables?
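Passing two factors to table() yields the cross-tabulation. The sketch below uses invented data, and the column names age.category and BMI.category are assumptions. It also shows prop.table(), which is not mentioned in the text but can make rows of very different sizes easier to compare.

```r
# Hypothetical stand-in for the survey data; names, levels and values are assumptions.
DS0012 <- data.frame(
  age.category = factor(
    c("child", "child", "child", "teenager", "adult", "adult", "adult", "senior"),
    levels = c("child", "teenager", "adult", "senior")
  ),
  BMI.category = factor(
    c("underweight", "underweight", "normal", "normal",
      "normal", "overweight", "obese", "overweight"),
    levels = c("underweight", "normal", "overweight", "obese")
  )
)

# Rows are age categories, columns are BMI categories.
xtab <- with(DS0012, table(age.category, BMI.category))
print(xtab)

# Optional: convert counts to proportions within each age category (each row sums to 1).
row_props <- prop.table(xtab, margin = 1)
print(round(row_props, 2))
```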
Now that is interesting: it seems that the majority of children in the data are underweight. Should we be concerned? No, the interpretation of BMI for children is different: the nominal thresholds between each of the categories no longer apply and BMI is compared to typical values for children of similar age. Among teenagers and adults the majority of the sample have normal BMIs. However, even the overweight and obese categories for adults are already well populated. In the mature and senior portion of the sample, BMIs more often indicate overweight or obese.
Finally, let’s generate a three-way contingency table of BMI, age and gender.
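A three-way table follows the same pattern with a third factor. This is a sketch on invented data with assumed column names (BMI.category, age.category, gender). The call to ftable() is an optional extra that flattens the result into a single, more compact display.

```r
# Hypothetical stand-in for the survey data; names, levels and values are assumptions.
DS0012 <- data.frame(
  gender = factor(c("M", "F", "F", "M", "F", "M", "F", "M")),
  age.category = factor(
    c("child", "child", "child", "teenager", "adult", "adult", "adult", "senior"),
    levels = c("child", "teenager", "adult", "senior")
  ),
  BMI.category = factor(
    c("underweight", "underweight", "normal", "normal",
      "normal", "overweight", "obese", "overweight"),
    levels = c("underweight", "normal", "overweight", "obese")
  )
)

# One BMI-by-age slice is printed for each gender.
three_way <- with(DS0012, table(BMI.category, age.category, gender))
print(three_way)

# Flattened view of the same counts.
print(ftable(three_way))
```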
It’s a little difficult to make sense of all that, but as we will see later on, there are great tools for understanding the contents of multiway contingency tables.
Right, that has given us a general feel for what the data looks like. The next step is to generate some plots.
The last thing that we need to do is detach the DS0012 data frame from the search path.
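Detaching is a single call. The sketch below attaches a small invented stand-in first so that it runs on its own; if DS0012 is already attached from earlier in the session, only the detach() line is needed.

```r
# Hypothetical stand-in for the survey data; names and values are assumptions.
DS0012 <- data.frame(age = c(4, 9, 15, 27), mass = c(16.5, 30.2, 55.0, 82.3))

attach(DS0012)

# Remove the data frame from the search path again.
detach(DS0012)

# Check that it is gone from the search path.
still_attached <- "DS0012" %in% search()
print(still_attached)
```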