Introducing R: Categorical Variables

In the previous installment we sucked some data from the National Health and Nutrition Examination Survey into R and did some preliminary work: selecting only the fields of interest, renaming columns and removing missing data. Now we are going to play with some categorical data.

There is already one categorical field in the data representing gender. However, the labels are not ideal:

> head(DS0012)
     id gender age  mass height      BMI
1 41475      2  62 138.9  1.547 58.03923
2 41476      2   6  22.0  1.204 15.17643
3 41477      1  71  83.9  1.671 30.04755
5 41479      1  52  65.7  1.544 27.55946
6 41480      1   6  27.0  1.227 17.93390
7 41481      1  21  77.9  1.827 23.33782
> unique(DS0012$gender)
[1] 2 1

Reference to the excellent codebook accompanying the data reveals that one should interpret 1 as male and 2 as female. We can make things a little more transparent by converting this field to a factor and introducing appropriate labels.

> DS0012$gender <- factor(DS0012$gender, labels = c("M", "F"))
> head(DS0012)
     id gender age  mass height      BMI
1 41475      F  62 138.9  1.547 58.03923
2 41476      F   6  22.0  1.204 15.17643
3 41477      M  71  83.9  1.671 30.04755
5 41479      M  52  65.7  1.544 27.55946
6 41480      M   6  27.0  1.227 17.93390
7 41481      M  21  77.9  1.827 23.33782

That’s better! Next we introduce a new categorical field which indicates age group. The boundaries between these fields are somewhat arbitrary (and might be rather politically incorrect), but they more or less make sense. Note that respondents above the age of 80 had their ages simply coded as 80.

# [ 0, 13) - child
# [13, 18) - teenager
# [18, 40) - adult
# [40, 60) - mature
# > 60 - senior
#
> DS0012$age.category <- cut(DS0012$age, breaks = c(0, 13, 18, 40, 60, 81), right = FALSE,
                             labels = c("child", "teenager", "adult", "mature", "senior"))
> head(DS0012)
     id gender age  mass height      BMI age.category
1 41475      F  62 138.9  1.547 58.03923       senior
2 41476      F   6  22.0  1.204 15.17643        child
3 41477      M  71  83.9  1.671 30.04755       senior
5 41479      M  52  65.7  1.544 27.55946       mature
6 41480      M   6  27.0  1.227 17.93390        child
7 41481      M  21  77.9  1.827 23.33782        adult

Finally we introduce BMI categories. These are rather broad categories, but will suffice for our analysis.

# < 18.5 - underweight
# 18.5 to 25.0 - normal
# 25.0 to 30.0 - overweight
# > 30 - obese
#
DS0012$BMI.category <- cut(DS0012$BMI, breaks = c(0, 18.5, 25, 30, 100),
labels = c("underweight", "normal", "overweight", "obese"))

This is what the final data look like

> head(DS0012)
     id gender age  mass height      BMI age.category BMI.category
1 41475      F  62 138.9  1.547 58.03923       senior        obese
2 41476      F   6  22.0  1.204 15.17643        child  underweight
3 41477      M  71  83.9  1.671 30.04755       senior        obese
5 41479      M  52  65.7  1.544 27.55946       mature   overweight
6 41480      M   6  27.0  1.227 17.93390        child  underweight
7 41481      M  21  77.9  1.827 23.33782        adult       normal

Next installment: some descriptive statistics.

Categorically Variable