Introducing R: Descriptive Statistics

In the previous installment we derived two new categorical variables for the National Health and Nutrition Examination Survey data. This time we will get some simple descriptive statistics from the data.

Firstly, let’s start by looking at a summary of the entire data set. We can exclude the identifier field, since this has no real significance.

> summary(DS0012[, c(-1, -7)])
 gender        age             mass            height           BMI             BMI.category
 M:4448   Min.   : 2.00   Min.   : 10.40   Min.   :0.815   Min.   :12.50   underweight:1759
 F:4413   1st Qu.:12.00   1st Qu.: 49.00   1st Qu.:1.503   1st Qu.:19.97   normal     :2589
          Median :33.00   Median : 68.70   Median :1.624   Median :25.16   overweight :2260
          Mean   :35.45   Mean   : 66.68   Mean   :1.561   Mean   :25.71   obese      :2253
          3rd Qu.:56.00   3rd Qu.: 85.20   3rd Qu.:1.717   3rd Qu.:30.08 
          Max.   :80.00   Max.   :218.20   Max.   :2.038   Max.   :73.43

This gives the quantiles and mean for each of the numerical variables, and the counts for each of the categorical variables. The average age of the subjects is 35. The subjects have masses between 10.4 and 218.2 kg.

We could have extracted these statistics for each of the numerical variables individually.

> mean(DS0012$BMI)
[1] 25.7057
> median(DS0012$BMI)
[1] 25.15504
> quantile(DS0012$BMI)
      0%      25%      50%      75%     100%
12.50312 19.97228 25.15504 30.08150 73.42526

It gets a little painful to type out the variable name every time, but we can attach the DS0012 variable to R’s search path, which makes things much more compact.

> attach(DS0012)
> mean(BMI)
[1] 25.7057

That’s better. We can also get a table of counts for an individual categorical variable.

> table(age.category)
age.category
   child teenager    adult   mature   senior
    2220      757     2105     1793     1986

This is precisely the information that we got in the summary above: children make up the largest portion of the sample, followed by adults and then seniors. Teenagers are in the minority. What about generating a contingency table which cross-tabulates two categorical variables?

> table(age.category, BMI.category)
            BMI.category
age.category underweight normal overweight obese
    child           1537    519        117    47
    teenager         111    390        143   113
    adult             50    765        630   660
    mature            31    421        638   703
    senior            30 4   94        732   730

Now that is interesting: it seems that the majority of children in the data are underweight. Should we be concerned? No, the interpretation of BMI for children is different: the nominal thresholds between each of the categories no longer apply and BMI is compared to typical values for children of similar age. Among teenagers and adults the majority of the sample have normal BMIs. However, even the overweight and obese categories for adults are already well populated. In the mature and senior portion of the sample, BMIs more often indicate overweight or obese.

Finally, let’s generate a three way contingency table of BMI, age and gender.

> (bmi.age.gender = table(BMI = BMI.category, age = age.category, gender))
, , gender = M

             age
BMI           child teenager adult mature senior
  underweight   818       67    14     10     15
  normal        259      199   393    184    227
  overweight     52       75   354    374    390
  obese          23       53   291    313    337

, , gender = F

             age
BMI           child teenager adult mature senior
  underweight   719       44    36     21     15
  normal        260      191   372    237    267
  overweight     65       68   276    264    342
  obese          24       60   369    390    393

It’s a little difficult to make sense of all that, but as we will see later on, there are great tools for understanding the contents of multiway contingency tables.

Right, that has given us a general feel for what the data looks like. The next step is to generate some plots.

The last thing that we need to do is detach the DS0012 variable

> detach(DS0012)

Categorically Variable