## Package MatchIt: Balancing experimental data

A balanced experimental design is one in which the distribution of the covariates is the same in both the control and treatment groups. However, although achievable in an experimental scenario, for observational data this ideal is seldom attained. The MatchIt package provides a means of pre-processing data so that the treated and control groups are as similar as possible, minimising the dependence between the treatment variable and the other covariates.

I will illustrate this with a simple (and somewhat contrived) example. Consider the following situation: a group of school children is being taught history. There are 150 children in the grade and they are split between two teachers, Claire and Jane. Jane has some revolutionary ideas about teaching and she claims that she is able to achieve better results than Claire. They set out to test this claim using a set of standardised questions. The results of the questions along with the gender and age of each of the children are recorded. Claire’s husband loads the data into R and it looks like this:

Claire is teaching 90 children, almost two thirds of which are boys. Jane is only teaching 60 children with almost the opposite gender ratio. The distribution of ages is reflected in the plot below. There is clearly a wider range of ages in the children taught by Jane and their median ages are higher than those taught by Claire.

The distribution of the aggregate result of the questions (expressed as a %) are given below. It appears that for both teachers the girls fare better than the boys.

If we make a simple comparison of the average grade for children taught by Claire and Jane, then it seems like Jane might be onto something: her average is about 5% higher than Claire’s.

But that is not the full story: if we break down the averages by gender and teacher then the results are less convincing: boys do slightly better under Jane’s regime, while girls do slightly worse.

Let’s apply the tried and tested t-test to see if there is a significant difference between the average grades.

This indicates that the difference in the mean grade for children taught by Claire and Jane is significant at the 5% level (p-value = 0.02537). So perhaps Jane is right. However, contradictory performance of the two genders taken separately should cause us to think carefully about this result. Perhaps the fact that Jane is teaching a larger proportion of girls is having an effect? Or maybe the older children in Jane’s class are skewing the statistics?

This is where MatchIt comes into play. First we add a dichotomous treatment variable, which is 1 for every child taught by Jane and 0 otherwise.

The resulting data is then fed into the matching routine.

We see that in the original (unmatched) data there were 90 records in the control (Claire) group and only 60 in the treated (Jane) group. After matching there are 60 records in each group. The remaining 30 records are unmatched. We then extract the matched records.

The gender and age disparities between the control and treated group have not been removed but they are now much improved. What influence does this have on a comparison of the grades?

The difference in average grade is no longer significant. So perhaps Jane has not really discovered anything revolutionary after all. The initial difference in results was probably due to the imbalance between the gender ratios in the control and treatment groups. Certainly the data indicate that there is a marked difference in performance between the girls and boys!

Finally, it seems to me that MatchIt is most effective when the control group is larger than the treatment group. This was the situation for the example above. If the converse applies then MatchIt does not appear to have much of an effect.

## Things That Might Go Wrong

There are some situations where things might not run quite so smoothly.

1. If you get an error message like this then it normally means that the output variable from your model is not dichotomous. You need to make it either a boolean or 0/1 variable.