Introducing R: Loading Data
I have just started preparing a series of talks aimed at introducing the use of R to a rather broad audience consisting of physicists, chemists, statisticians, biologists and computer scientists (plus a few other disciplines thrown in for good measure). I want to use a single consistent set of data throughout the talks. Finding something that would resonate with such a disparate set of people was quite a challenge. After playing around with a couple of options, I settled on using data for age, height and mass. These are things that we can all identify with. The next challenge was to actually find a suitable data set, which was surprisingly difficult. Eventually I stumbled upon the data from the National Health and Nutrition Examination Survey (NHANES). The data from the survey are available here. These data have been divided into a number of sets, each of which has been excellently curated and has a detailed codebook.
I started with the Body Measurements data (DS12), which I downloaded in tab-delimited format. The first task was to load this into R.
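A minimal sketch of the loading step follows. The real file name is not given here, so the snippet writes a tiny stand-in file with made-up values to a temporary path. The data frame name `body` and every column name other than BMXWT are assumptions for illustration.

```r
# Stand-in for the downloaded tab-delimited file (values are made up).
# Replace `path` with the location of the real download.
path <- tempfile(fileext = ".tsv")
writeLines(c(
  "SEQN\tRIAGENDR\tRIDAGEYR\tBMXWT\tBMXHT",
  "1\t1\t34\t80.5\t178.2",
  "2\t2\t27\t\t165.0",
  "3\t2\t51\t62.3\t160.4"
), path)

# read.delim() assumes a header row and tab-separated fields by default.
# Blank numeric fields are read as NA.
body <- read.delim(path)

dim(body)   # number of records, number of fields
str(body)   # field names and types
```

Run on the real file, `dim()` is the quickest way to check the number of records and fields.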
So there are 9762 records, each of which has 65 fields. We will only retain a subset of those (sequence number, gender, age, mass and height). Although the field name for mass (BMXWT) suggests that it might in fact be weight, it is actually mass in kilograms. Height is given in centimetres.
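Selecting the subset of fields might look something like this. Only BMXWT is named in the text above; the other column names (SEQN, RIAGENDR, RIDAGEYR, BMXHT) are assumed, and the toy data frame stands in for the full 65-field one.

```r
# Toy stand-in for the full data frame (values are made up).
body <- data.frame(
  SEQN     = 1:3,
  RIAGENDR = c(1, 2, 2),
  RIDAGEYR = c(34, 27, 51),
  BMXWT    = c(80.5, NA, 62.3),      # mass in kg
  BMXHT    = c(178.2, 165.0, 160.4), # height in cm
  OTHER    = c(33.1, 28.4, 30.0)     # hypothetical field that gets dropped
)

# Keep only sequence number, gender, age, mass and height.
keep <- c("SEQN", "RIAGENDR", "RIDAGEYR", "BMXWT", "BMXHT")
body <- body[, keep]

names(body)
```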
Some records have missing values in the mass and height fields; we remove those records.
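One way to drop the incomplete records is `complete.cases()`, sketched below on a toy data frame with made-up values. The column name BMXHT for height is an assumption.

```r
# Toy stand-in with gaps in the mass and height fields (values are made up).
body <- data.frame(
  SEQN  = 1:4,
  BMXWT = c(80.5, NA, 62.3, 70.1),
  BMXHT = c(178.2, 165.0, NA, 171.3)
)

n_before <- nrow(body)

# Keep only rows where both mass and height are present.
body <- body[complete.cases(body[, c("BMXWT", "BMXHT")]), ]

# Fraction of records lost by removing incomplete rows.
1 - nrow(body) / n_before
```

`na.omit(body)` would do much the same, but it drops a row if any column is missing, not just mass or height.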
So, we lost around 10% of the original data, but at least what we are left with is clean. Next we change the column labels to something a little less cryptic and convert the units for height to metres.
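Renaming and unit conversion can be sketched as follows. The new labels (`id`, `gender`, `age`, `mass`, `height`) are my own choice for illustration, and the original column names other than BMXWT are assumed.

```r
# Toy stand-in for the cleaned data (values are made up).
body <- data.frame(
  SEQN     = c(1, 3),
  RIAGENDR = c(1, 2),
  RIDAGEYR = c(34, 51),
  BMXWT    = c(80.5, 62.3),
  BMXHT    = c(178.2, 160.4)
)

# Replace the cryptic field names; the order must match the existing columns.
names(body) <- c("id", "gender", "age", "mass", "height")

# Convert height from centimetres to metres.
body$height <- body$height / 100

body
```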
Lastly we add in a derived column for Body Mass Index (BMI). There were already BMI data in the original data set, but it is illustrative to calculate it again here.
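BMI is mass in kilograms divided by the square of height in metres, so the derived column is a one-liner. The sketch below uses a toy data frame with made-up values and the assumed column names `mass` and `height`.

```r
# Toy stand-in with mass in kg and height in m (values are made up).
body <- data.frame(
  mass   = c(81.0, 62.3),
  height = c(1.80, 1.604)
)

# BMI = mass / height^2, computed element-wise for every record.
body$BMI <- body$mass / body$height^2

body
```

Because R arithmetic is vectorised, no explicit loop over the records is needed.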
This is what the final data look like:
Next installment: defining some meaningful categories.