Introducing R: Categorical Variables

In the previous installment we sucked some data from the National Health and Nutrition Examination Survey into R and did some preliminary work: selecting only the fields of interest, renaming columns and removing missing data. Now we are going to play with some categorical data.

There is already one categorical field in the data representing gender. However, the labels are not ideal:

> head(DS0012)
     id gender age  mass height      BMI
1 41475      2  62 138.9  1.547 58.03923
2 41476      2   6  22.0  1.204 15.17643
3 41477      1  71  83.9  1.671 30.04755
5 41479      1  52  65.7  1.544 27.55946
6 41480      1   6  27.0  1.227 17.93390
7 41481      1  21  77.9  1.827 23.33782
> unique(DS0012$gender)
[1] 2 1

Reference to the excellent codebook accompanying the data reveals that one should interpret 1 as male and 2 as female. We can make things a little more transparent by converting this field to a factor and introducing appropriate labels.

> DS0012$gender <- factor(DS0012$gender, labels = c("M", "F"))
> head(DS0012)
     id gender age  mass height      BMI
1 41475      F  62 138.9  1.547 58.03923
2 41476      F   6  22.0  1.204 15.17643
3 41477      M  71  83.9  1.671 30.04755
5 41479      M  52  65.7  1.544 27.55946
6 41480      M   6  27.0  1.227 17.93390
7 41481      M  21  77.9  1.827 23.33782

That’s better! Next we introduce a new categorical field which indicates age group. The boundaries between these fields are somewhat arbitrary (and might be rather politically incorrect), but they more or less make sense. Note that respondents above the age of 80 had their ages simply coded as 80.

# [ 0, 13) - child
# [13, 18) - teenager
# [18, 40) - adult
# [40, 60) - mature
# > 60     - senior
#
> DS0012$age.category <- cut(DS0012$age, breaks = c(0, 13, 18, 40, 60, 81), right = FALSE,
                             labels = c("child", "teenager", "adult", "mature", "senior"))
> head(DS0012)
     id gender age  mass height      BMI age.category
1 41475      F  62 138.9  1.547 58.03923       senior
2 41476      F   6  22.0  1.204 15.17643        child
3 41477      M  71  83.9  1.671 30.04755       senior
5 41479      M  52  65.7  1.544 27.55946       mature
6 41480      M   6  27.0  1.227 17.93390        child
7 41481      M  21  77.9  1.827 23.33782        adult

Finally we introduce BMI categories. These are rather broad categories, but will suffice for our analysis.

# < 18.5       - underweight
# 18.5 to 25.0 - normal
# 25.0 to 30.0 - overweight
# > 30         - obese
#
DS0012$BMI.category <- cut(DS0012$BMI, breaks = c(0, 18.5, 25, 30, 100),
                           labels = c("underweight", "normal", "overweight", "obese"))

This is what the final data look like

> head(DS0012)
     id gender age  mass height      BMI age.category BMI.category
1 41475      F  62 138.9  1.547 58.03923       senior        obese
2 41476      F   6  22.0  1.204 15.17643        child  underweight
3 41477      M  71  83.9  1.671 30.04755       senior        obese
5 41479      M  52  65.7  1.544 27.55946       mature   overweight
6 41480      M   6  27.0  1.227 17.93390        child  underweight
7 41481      M  21  77.9  1.827 23.33782        adult       normal

Next installment: some descriptive statistics.

Introducing R: Loading Data

I have just started preparing a series of talks aimed at introducing the use of R to a rather broad audience consisting of physicists, chemists, statisticians, biologists and computer scientists (plus a few other disciplines thrown in for good measure). I want to use a single consistent set of data throughout the talks. Finding something that would resonate with such a disparate set of people was quite a challenge. After playing around with a couple of options, I settled on using data for age, height and mass. These are things that we can all identify with. The next challenge was to actually find a suitable data set, which was surprisingly difficult. Eventually I stumbled upon the data from the National Health and Nutrition Examination Survey (NHANES), The data from the survey are available here. These data have been divided into a number of sets, each of which has been excellently curated and has a detailed codebook.

I started with the Body Measurements data (DS12), which I downloaded in tab-delimited format. The first task was to load this into R.

> DS0012 <- read.delim("data/DS0012/25505-0012-Data.tsv")
> dim(DS0012)
[1] 9762   65

So there are 9762 records, each of which has 65 fields. We will only retain a subset of those (sequence number, gender, age, mass and height). Although the field name for mass suggests that it might in fact be weight (BMXWT), it is actually mass in kilograms. Height is given in centimetres.

> DS0012 = DS0012[,c("SEQN", "RIAGENDR", "RIDAGEYR", "BMXWT", "BMXHT")]

There is some missing data (mass and height fields), which we remove.

> summary(DS0012)
      SEQN          RIAGENDR        RIDAGEYR         BMXWT            BMXHT
 Min.   :41475   Min.   :1.000   Min.   : 0.00   Min.   :  3.10   Min.   : 81.5
 1st Qu.:44008   1st Qu.:1.000   1st Qu.:10.00   1st Qu.: 36.70   1st Qu.:150.3
 Median :46540   Median :1.000   Median :29.00   Median : 66.10   Median :162.4
 Mean   :46547   Mean   :1.497   Mean   :32.88   Mean   : 62.17   Mean   :156.1
 3rd Qu.:49094   3rd Qu.:2.000   3rd Qu.:54.00   3rd Qu.: 83.65   3rd Qu.:171.7
 Max.   :51623   Max.   :2.000   Max.   :80.00   Max.   :218.20   Max.   :203.8
                                                 NA's   :131      NA's   :889
> DS0012 = subset(DS0012, !(is.na(BMXWT) | is.na(BMXHT)))
> > dim(DS0012)
[1] 8861    6

So, we lost around 10% of the original data, but at least what we are left with is clean. Next we change the column labels to something a little less cryptic and convert the units for height to metres

> names(DS0012) <- c("id", "gender", "age", "mass", "height")
> DS0012$height = DS0012$height / 100

Lastly we add in a derived column for Body Mass Index (BMI). There was already BMI data in the original data set, however, it is illustrative to calculate it again here.

> DS0012 = transform(DS0012, BMI = mass / height**2)

This is what the final data look like:

> head(DS0012)
     id gender age  mass height      BMI
1 41475      2  62 138.9  1.547 58.03923
2 41476      2   6  22.0  1.204 15.17643
3 41477      1  71  83.9  1.671 30.04755
5 41479      1  52  65.7  1.544 27.55946
6 41480      1   6  27.0  1.227 17.93390
7 41481      1  21  77.9  1.827 23.33782

Next installment: defining some meaningful categories.

Support & Resistance Indicator

I was recently browsing through the variety of of MetaTrader indicators for support and resistance levels. None of them ticked all of my boxes. Either they were not aesthetically pleasing (making a mess of my pristine charts) or they failed to produce what I consider to be reasonable levels. So, embracing my pioneering spirit, I set out to fashion my own indicator, one which will ultimately tick all of my boxes!

Sample output from v1.2 of my support-resistance indicator is shown below for the weekly chart of GBPUSD.

support-resistance-GBPUSD-W1

I am pretty happy with this as a starting point. At present it caters for

  • up to 10 levels of either support or resistance;
  • labelling of levels;
  • only updating on new candle (reduces computation load).

There are some obvious refinements, which I will implement shortly. Foremost among these is the elimination of “trivial” levels (e.g., the first support level).

Algorithmic Trading Status [April 2013]

I am currently trading three automated strategies on an account with a balance of around $3000. The strategies are named ernie, gemini #1 and gemini #2. The statistics for each of these are given below for the month of April 2013. The first strategy, ernie, is of the “slow and steady” variety and achieved a net profit of $48.42 off 31 trades with a 77.4% success rate. Next, gemini #1 is a scalping strategy, which yielded a net profit of only $29.51 off 70 trades with a 70% success rate. Finally, gemini #2 is looking for bigger moves, scoring a net profit of $208.23 off only 9 trades with a 55.6% success rate.

The total net profit for the month, $286.16, corresponds to a monthly growth in equity of roughly 9%. Not too bad, considering that all I did each day was check in to see that all was going according to plan. So the time spent developing these strategies finally seems to be paying off. Looking forward to seeing the performance over coming months.

pair open close volume profit swap
EURUSD 2013.04.01 08:38 2013.04.01 15:12 sell 0.05 -8.90 0.00
EURUSD 2013.04.01 15:12 2013.04.01 18:31 buy 0.01 5.50 0.00
EURUSD 2013.04.03 08:03 2013.04.03 09:03 sell 0.05 5.40 0.00
GBPUSD 2013.04.03 08:47 2013.04.03 22:52 sell 0.05 -24.30 0.00
GBPUSD 2013.04.03 22:52 2013.04.04 17:03 buy 0.01 5.50 0.02
EURUSD 2013.04.05 02:23 2013.04.05 03:23 buy 0.05 3.15 0.00
EURUSD 2013.04.05 05:43 2013.04.05 06:49 buy 0.05 3.05 0.00
EURUSD 2013.04.09 08:40 2013.04.09 12:28 buy 0.05 0.05 0.00
EURUSD 2013.04.09 23:47 2013.04.10 03:36 buy 0.05 0.25 -0.05
GBPUSD 2013.04.10 00:29 2013.04.10 04:06 buy 0.05 0.00 0.00
AUDUSD 2013.04.10 01:03 2013.04.10 02:03 buy 0.05 5.40 0.00
EURUSD 2013.04.10 15:02 2013.04.10 16:25 buy 0.05 0.15 0.00
AUDUSD 2013.04.11 16:16 2013.04.11 17:24 buy 0.05 4.15 0.00
GBPUSD 2013.04.11 22:49 2013.04.12 02:32 buy 0.01 0.01 0.01
EURUSD 2013.04.15 03:03 2013.04.15 04:06 buy 0.05 3.70 0.00
AUDUSD 2013.04.16 02:33 2013.04.16 03:58 sell 0.05 3.20 0.00
AUDUSD 2013.04.16 04:18 2013.04.16 05:15 sell 0.05 2.55 0.00
EURUSD 2013.04.17 00:07 2013.04.17 01:13 buy 0.05 6.50 0.00
GBPUSD 2013.04.17 02:52 2013.04.17 04:11 buy 0.05 0.50 0.00
EURUSD 2013.04.17 06:09 2013.04.17 07:54 buy 0.05 1.95 0.00
AUDUSD 2013.04.17 10:08 2013.04.17 11:08 sell 0.05 5.05 0.00
AUDUSD 2013.04.18 00:49 2013.04.18 01:54 sell 0.05 1.95 0.00
GBPUSD 2013.04.18 03:00 2013.04.18 04:00 sell 0.05 3.30 0.00
AUDUSD 2013.04.22 04:00 2013.04.22 08:07 sell 0.05 -1.05 0.00
GBPUSD 2013.04.22 04:28 2013.04.22 07:20 sell 0.05 4.30 0.00
AUDUSD 2013.04.22 08:07 2013.04.24 23:55 buy 0.01 -0.25 0.11
EURUSD 2013.04.23 16:15 2013.04.23 20:09 sell 0.05 0.00 0.00
GBPUSD 2013.04.24 06:21 2013.04.24 08:03 sell 0.05 0.25 0.00
AUDUSD 2013.04.24 23:55 2013.04.26 21:30 sell 0.01 0.19 -0.39
GBPUSD 2013.04.25 17:46 2013.04.25 21:49 buy 0.06 6.84 0.00
GBPUSD 2013.04.26 09:01 2013.04.26 12:57 buy 0.06 0.00 0.00
48.69 -0.27
pair open close volume profit swap
EURUSD 2013.04.01 10:00 2013.04.01 15:15 sell 0.02 -0.88 0.00
EURUSD 2013.04.01 15:15 2013.04.01 15:44 buy 0.02 3.60 0.00
EURUSD 2013.04.01 18:30 2013.04.01 23:00 sell 0.02 2.78 0.00
EURUSD 2013.04.01 23:00 2013.04.02 03:37 buy 0.02 3.60 -0.02
EURUSD 2013.04.02 04:30 2013.04.02 08:30 sell 0.02 2.06 0.00
EURUSD 2013.04.02 08:30 2013.04.03 01:15 buy 0.02 -7.48 -0.02
EURUSD 2013.04.03 01:30 2013.04.03 03:15 sell 0.02 2.14 0.00
EURUSD 2013.04.03 04:15 2013.04.03 11:30 buy 0.02 2.90 0.00
EURUSD 2013.04.03 11:30 2013.04.04 00:30 sell 0.02 -5.64 0.02
EURUSD 2013.04.04 04:30 2013.04.04 16:45 buy 0.02 -1.66 0.00
EURUSD 2013.04.04 20:30 2013.04.05 01:45 sell 0.02 -3.16 0.00
EURUSD 2013.04.05 01:45 2013.04.05 03:45 buy 0.02 1.20 0.00
EURUSD 2013.04.05 03:45 2013.04.05 05:45 sell 0.02 2.66 0.00
EURUSD 2013.04.05 09:45 2013.04.05 13:02 buy 0.02 3.60 0.00
EURUSD 2013.04.05 15:15 2013.04.05 22:31 sell 0.02 -12.50 0.00
EURUSD 2013.04.08 00:00 2013.04.08 00:15 buy 0.02 3.60 0.00
EURUSD 2013.04.08 08:45 2013.04.08 10:00 buy 0.02 1.72 0.00
EURUSD 2013.04.08 13:45 2013.04.08 16:35 sell 0.02 3.60 0.00
EURUSD 2013.04.08 18:45 2013.04.08 20:29 buy 0.02 3.60 0.00
EURUSD 2013.04.09 01:00 2013.04.09 07:45 sell 0.02 -6.16 0.00
EURUSD 2013.04.09 08:00 2013.04.09 13:15 buy 0.02 0.72 0.00
EURUSD 2013.04.09 18:30 2013.04.09 23:00 sell 0.02 2.52 0.00
EURUSD 2013.04.09 23:00 2013.04.10 05:15 buy 0.02 -0.06 -0.02
EURUSD 2013.04.10 05:15 2013.04.10 08:30 sell 0.02 1.52 0.00
EURUSD 2013.04.10 08:30 2013.04.10 10:01 buy 0.02 3.60 0.00
EURUSD 2013.04.10 11:30 2013.04.10 11:52 sell 0.02 3.60 0.00
EURUSD 2013.04.10 15:15 2013.04.11 07:45 buy 0.02 -4.54 -0.06
EURUSD 2013.04.11 08:15 2013.04.11 21:45 sell 0.02 -7.88 0.00
EURUSD 2013.04.11 21:45 2013.04.12 02:45 buy 0.02 1.86 -0.02
EURUSD 2013.04.12 02:45 2013.04.12 09:35 sell 0.02 3.60 0.00
EURUSD 2013.04.12 10:30 2013.04.12 16:15 buy 0.02 0.58 0.00
EURUSD 2013.04.12 16:30 2013.04.12 22:12 sell 0.02 -0.48 0.00
EURUSD 2013.04.12 22:12 2013.04.12 22:56 buy 0.02 3.60 0.00
EURUSD 2013.04.12 23:30 2013.04.15 05:06 sell 0.02 3.60 0.01
EURUSD 2013.04.15 06:30 2013.04.15 10:35 buy 0.02 3.60 0.00
EURUSD 2013.04.15 21:15 2013.04.16 06:45 buy 0.02 -0.44 -0.02
EURUSD 2013.04.16 07:15 2013.04.16 10:41 sell 0.02 3.60 0.00
EURUSD 2013.04.16 11:15 2013.04.16 12:31 buy 0.02 3.60 0.00
EURUSD 2013.04.16 13:45 2013.04.17 00:00 sell 0.02 -14.46 0.01
EURUSD 2013.04.17 06:15 2013.04.17 10:30 buy 0.02 1.00 0.00
EURUSD 2013.04.17 12:45 2013.04.17 17:56 buy 0.02 -24.34 0.00
EURUSD 2013.04.17 17:56 2013.04.17 18:04 buy 0.02 3.60 0.00
EURUSD 2013.04.17 19:30 2013.04.18 05:15 buy 0.02 1.18 -0.07
EURUSD 2013.04.18 05:15 2013.04.18 10:15 sell 0.02 1.08 0.00
EURUSD 2013.04.18 10:15 2013.04.18 10:30 buy 0.02 3.60 0.00
EURUSD 2013.04.18 12:45 2013.04.18 14:46 sell 0.02 3.60 0.00
EURUSD 2013.04.18 19:15 2013.04.18 20:43 sell 0.02 3.60 0.00
EURUSD 2013.04.19 02:00 2013.04.19 16:15 sell 0.02 -3.82 0.00
EURUSD 2013.04.19 19:00 2013.04.22 04:30 buy 0.02 1.80 -0.02
EURUSD 2013.04.22 11:00 2013.04.22 19:00 buy 0.02 2.28 0.00
EURUSD 2013.04.22 21:30 2013.04.23 04:00 sell 0.02 0.72 0.01
EURUSD 2013.04.23 04:00 2013.04.23 10:00 buy 0.02 1.10 0.00
EURUSD 2013.04.23 10:00 2013.04.23 10:28 sell 0.02 3.60 0.00
EURUSD 2013.04.23 12:15 2013.04.23 14:09 buy 0.02 3.60 0.00
EURUSD 2013.04.23 16:15 2013.04.23 21:15 sell 0.02 -0.54 0.00
EURUSD 2013.04.23 21:30 2013.04.24 11:48 buy 0.05 9.00 -0.07
EURUSD 2013.04.24 13:00 2013.04.24 15:15 sell 0.05 9.00 0.00
EURUSD 2013.04.25 01:00 2013.04.25 03:30 buy 0.05 4.85 0.00
EURUSD 2013.04.25 03:30 2013.04.25 09:30 sell 0.05 -8.00 0.00
EURUSD 2013.04.25 12:00 2013.04.25 16:24 sell 0.05 9.00 0.00
EURUSD 2013.04.25 16:30 2013.04.25 21:30 buy 0.05 -8.30 0.00
EURUSD 2013.04.26 02:30 2013.04.26 04:17 buy 0.05 9.00 0.00
EURUSD 2013.04.26 04:45 2013.04.26 09:00 sell 0.05 4.35 0.00
EURUSD 2013.04.26 09:00 2013.04.26 10:38 buy 0.05 9.00 0.00
EURUSD 2013.04.26 13:30 2013.04.26 15:30 buy 0.05 9.00 0.00
EURUSD 2013.04.29 00:45 2013.04.29 01:27 sell 0.05 9.00 0.00
EURUSD 2013.04.29 10:15 2013.04.30 03:45 sell 0.05 -7.15 0.03
EURUSD 2013.04.30 04:15 2013.04.30 08:15 buy 0.05 1.70 0.00
EURUSD 2013.04.30 09:30 2013.04.30 16:15 buy 0.05 -1.15 0.00
EURUSD 2013.04.30 16:30 2013.05.01 04:30 sell 0.05 -29.35 0.02
29.73 -0.22
pair open close volume profit swap
EURUSD 2013.04.04 16:45 2013.04.08 10:21 buy 0.09 139.14 -0.20
EURUSD 2013.04.15 21:04 2013.04.15 22:01 buy 0.09 -30.15 0.00
EURUSD 2013.04.16 12:45 2013.04.17 16:59 buy 0.09 22.77 -0.11
EURUSD 2013.04.17 17:15 2013.04.17 20:19 sell 0.09 32.58 0.00
EURUSD 2013.04.19 16:45 2013.04.19 17:50 buy 0.09 -29.70 0.00
EURUSD 2013.04.23 10:35 2013.04.23 11:17 sell 0.09 35.55 0.00
EURUSD 2013.04.25 16:30 2013.04.26 17:51 sell 0.19 0.00 0.04
EURUSD 2013.04.30 09:40 2013.04.30 16:30 sell 0.20 -66.00 0.00
EURUSD 2013.04.30 16:40 2013.05.01 17:12 buy 0.19 104.50 -0.19
208.69 -0.46

Locations of Geosynchronous Satellites

A year or so ago I went to a talk which included the diagram below. It shows the locations of the Earth’s fleet of geosynchronous satellites. According to the speaker, the information in this diagram was already quite dated: the satellites and their locations had changed.

geosynch-old

I decided to update the diagram using the locations of satellites from the list of geosynchronous satellites published on Wikipedia. Probably not the most definitive source of data on this subject, but it was a good starting point.

First, grab the Wikipedia page and extract the two tables (one for satellites over each of the eastern and western hemispheres. Then concantenate those tables.

download.file("http://en.wikipedia.org/wiki/List_of_satellites_in_geosynchronous_orbit",
              "wiki-geosynchronous-satellites.html")
#
W = readHTMLTable("wiki-geosynchronous-satellites.html", which = 3, trim = TRUE,
                  stringsAsFactors = FALSE)
E = readHTMLTable("wiki-geosynchronous-satellites.html", which = 4, trim = TRUE,
                  stringsAsFactors = FALSE)
#
geosat = rbind(W, E)

The longitude column indicates whether the satellites are east (E) or west (W) of the prime meridian. It is going to be more convenient to convert these to numerical values and make the ones that are to the west negative.

index = grep("°W", geosat[,1])
geosat[index, 1] = paste0("-", geosat[index, 1])
#
geosat[,1] = sub("°[WE]$", "", geosat[,1])

After some housekeeping, converting the range of longitudes from [-180°,180°) to [0°,360°), and retaining only the necessary columns, we end up with data that looks like this:

> head(geosat)
  Location   Satellite      Type
1      212  EchoStar-1 Broadcast
2      221  Americom-8 Broadcast
3      223  Americom-7 Broadcast
4      225 Americom-10 Broadcast
5      227   Galaxy-12 Broadcast
6      229 Americom-11 Broadcast

Well, that is really the tricky stuff. Generating the plots was quite straight forward. First I grouped the longitudes into bins. Then, for each bin

  1. assigned radial distance vector for satellites (not realistic, but allowed bins with multiple satellites to be plotted neatly);
  2. converted longitude and radial distance into Cartesian coordinates;
  3. plotted points at these locations; and
  4. inserted labels;

This is the result:

geosynchronous-orbit

Have a look at a higher resolution pdf version.