Conrad Wolfram: Teaching kids real math with computers

Conrad Wolfram gives a thought-provoking talk on a different way to teach Mathematics in schools.

We’ve had the biggest transformation of any ancient subject that I could ever imagine with computers. Calculating was typically the limiting step, and now often it isn’t. So I think in terms of the fact that math has been liberated from calculating. But that math liberation didn’t get into education yet. See, I think of calculating, in a sense, as the machinery of math. It’s the chore. It’s the thing you’d like to avoid if you can, like to get a machine to do. It’s a means to an end, not an end in itself, and automation allows us to have that machinery.
Conrad Wolfram

Mathematics is a tool that we use to solve problems. So we should enthuse children with its power to do precisely this: solve problems, answer questions and build things.

Mortality by Year and Age

Taking another look at the data from the lifespan package: the plot below shows the evolution of mortality in the US as a function of year and age.

deaths-year-age

> library(lifespan)
> library(ggplot2)
> ggplot(deathsage, aes(x = year, y = age)) +
+   geom_raster(aes(fill = count)) +
+   labs(x = "Year", y = "Age") +
+   scale_y_continuous(breaks = seq(0, 120, 10), limits = c(0, 110)) +
+   scale_fill_gradient("Deaths", low = "#FFFFFF") +
+   facet_wrap(~ sex) +
+   theme_minimal() +
+   theme(panel.grid = element_blank())

Also, based on a suggestion from @robjohnnoble, population data have been included in the package.

> tail(population)
    year  count
112 2011 310.50
113 2012 312.86
114 2013 315.18
115 2014 317.68
116 2015 320.22
117 2016 322.48

Life Expectancy by Country

I was rather inspired by this plot on Wikipedia’s List of Countries by Life Expectancy.

Life expectancy by country from Wikipedia

Shouldn’t be too hard to reproduce with a bit of scraping. Here are the results (click on the static image to view the interactive plot):

Life expectancy by country

The bubble plot above compares female and male life expectancies for a number of countries. The diagonal line corresponds to equal female and male life expectancy. The size of each bubble is proportional to the corresponding country’s population while its colour indicates its continent. Countries in Africa generally have the lowest life expectancies, while those in Europe generally have the highest. Since all of the bubbles are located above the diagonal, we find that females consistently live longer than males, regardless of country.


Mortality Rate by Age

Working further with the mortality data from http://www.cdc.gov/, I’ve added a breakdown of deaths by age and gender to the lifespan package on GitHub.

Here’s a summary plot:

deaths-by-age

> library(lifespan)
> library(ggplot2)
> NYEARS = length(unique(deaths$year))
> ggplot(deathsage, aes(x = age, y = count / NYEARS / 1000)) +
+   geom_area(aes(fill = sex), position = "identity", alpha = 0.5) +
+   geom_line(aes(group = sex)) +
+   # facet_wrap(~ sex, ncol = 1) +
+   labs(x = "Age", y = "Deaths per Year [thousands]") +
+   scale_x_continuous(breaks = seq(0, 150, 10), limits = c(0, 120)) +
+   theme_minimal() + theme(legend.title = element_blank())

There are a few interesting observations to be made. We’ll start with the most obvious:

  • on average, females live longer than males;
  • modal age at death is 81 for males and 86 for females;
  • there are more infant deaths among males than females (probably linked to the greater number of male births); and
  • there is a rapid escalation in deaths among teenage males (consistent with the fact that teenage males are more likely to commit suicide, be involved in fatal vehicle accidents, or be victims of homicide).

Another way of looking at the same data is with a stacked area plot. It makes comparing genders more difficult, but gives a better indication of the overall mortality rate.

deaths-by-age-stacked

Escalating Life Expectancy

I’ve added mortality data to the lifespan package. A result that immediately emerges from these data is that average life expectancy is steadily climbing.

death-average-age

> library(lifespan)
> library(ggplot2)
> ggplot(deaths, aes(x = year, y = avgage)) +
+   geom_boxplot(aes(group = year, fill = sex)) +
+   facet_wrap(~ sex) +
+   labs(x = "", y = "Average Age at Death") +
+   theme_minimal() + theme(legend.title = element_blank())

The effect is more pronounced for men, rising from around 66.5 in 1994 to 70.0 in 2014. The corresponding values for women are 74.6 and 76.5 respectively. Good news for everyone.

When do most deaths occur? It would seem that the peak lies in Winter, specifically January. There is a broad trough during the Summer months. Fractionally more women die in Winter, whereas slightly more men die during Summer.

deaths-per-day

Birth Month by Gender

Based on some feedback to a previous post I normalised the birth counts by the (average) number of days in each month. As pointed out by a reader, the results indicate a gradual increase in the number of conceptions during (northern hemisphere) Autumn and Winter, roughly up to the end of December. Normalising the data to give births per day also shifts the peak from August to September.

births-per-day

> library(lifespan)
> library(dplyr)
> library(ggplot2)
> month.days <- data.frame(
+   month = month.abb,
+   days = c(31, 28.25, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
+ )
> group_by(births, month) %>% summarise(count = sum(count)) %>%
+   merge(month.days) %>%
+   mutate(perday = count / days) %>%
+   ggplot(aes(x = month, y = perday / 1000)) +
+   geom_bar(stat = "identity", fill = "#39A75E") +
+   labs(x = "", y = "Total Births per Day [thousands]") +
+   theme_classic()
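
The shift of the peak from August to September can be checked without the package. The following is a minimal sketch in base R, assuming the monthly totals for 1994 to 2014 listed in the Most Probable Birth Month post further down; the names `totals` and `perday` are illustrative and are not part of lifespan.

```r
# Total US births per month, 1994-2014 (copied from the summary table in the
# "Most Probable Birth Month" post).
totals <- c(
  Jan = 6906798, Feb = 6448725, Mar = 7080880, Apr = 6788266,
  May = 7112239, Jun = 7059986, Jul = 7461489, Aug = 7552007,
  Sep = 7365904, Oct = 7220646, Nov = 6813037, Dec = 7079453
)

# Average number of days per month (February averaged over leap years).
days <- c(31, 28.25, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

# Births per calendar day of each month, summed over all years.
perday <- totals / days

names(which.max(totals))  # month with the largest raw total
names(which.max(perday))  # month with the largest per-day total
```

With these totals the raw maximum falls in August, while the per-day maximum falls in September.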

I also broke the births data down by gender. The September peak persists for both genders, but something else that’s interesting pops out: there are consistently more boys being born than girls. The average ratio of boys to girls between 1994 and 2014 is 1.048. The slightly higher birth rate for boys is a well-known phenomenon. The ratio varies somewhat between countries, with the global value being around 1.07.

births-boxplot

> group_by(births, year, month, sex) %>% summarise(count = sum(count)) %>%
+   merge(month.days) %>%
+   mutate(perday = count / days) %>% ggplot() +
+   geom_boxplot(aes(x = month, y = perday, fill = sex)) +
+   labs(x = "", y = "Births per Day") +
+   theme_classic() + theme(legend.title = element_blank())

That ratio is remarkably consistent from month to month and year to year.

births-gender-ratio

> library(reshape2)
> group_by(births, year, month, sex) %>% summarise(count = sum(count)) %>%
+   dcast(year + month ~ sex) %>% mutate(ratio = M/F) %>%
+   ggplot() +
+   geom_boxplot(aes(x = year, y = ratio, group = year)) +
+   labs(x = "", y = "Monthly Birth Ratio: Boys/Girls") +
+   theme_classic() + theme(legend.title = element_blank())

Most Probable Birth Month

In a previous post I showed that the data from www.baseball-reference.com support Malcolm Gladwell’s contention that more professional baseball players are born in August than any other month. Although this might be explained by the 31 July cutoff for admission to baseball leagues, it was suggested that it could also be linked to a larger proportion of babies being born in August.

In order to explore this idea I gathered data from http://www.cdc.gov/ for births in the USA between 1994 and 2014. These data as well as the baseball data have been published as an R package here. Install it using:

> devtools::install_github("DataWookie/lifespan")
> library(lifespan)

Let’s explore the hypothesis regarding non-uniform birth months.

> library(dplyr)
> group_by(births, month) %>% summarise(count = sum(count))
Source: local data frame [12 x 2]

    month   count
   (fctr)   (int)
1     Jan 6906798
2     Feb 6448725
3     Mar 7080880
4     Apr 6788266
5     May 7112239
6     Jun 7059986
7     Jul 7461489
8     Aug 7552007
9     Sep 7365904
10    Oct 7220646
11    Nov 6813037
12    Dec 7079453

There is definitely significant non-uniformity:

> chisq.test(.Last.value$count, p = c(31, 28.25, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31),
+            rescale.p = TRUE)

	Chi-squared test for given probabilities

data:  .Last.value$count
X-squared = 77600, df = 11, p-value <2e-16
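
Note that `.Last.value` only works when the test is run immediately after the summary. A minimal sketch that avoids this by storing the counts explicitly (values copied from the table above; the names `counts`, `days` and `test` are illustrative):

```r
# Monthly birth totals, January to December, from the summary table.
counts <- c(6906798, 6448725, 7080880, 6788266, 7112239, 7059986,
            7461489, 7552007, 7365904, 7220646, 6813037, 7079453)
names(counts) <- month.abb

# Average number of days per month (February averaged over leap years).
days <- c(31, 28.25, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

# Null hypothesis: births are spread evenly over days, so the expected share
# of each month is proportional to its length.
test <- chisq.test(counts, p = days, rescale.p = TRUE)

test$statistic        # chi-squared statistic
round(test$expected)  # expected counts under the null
round(test$residuals) # Pearson residuals: (observed - expected) / sqrt(expected)
```

The residuals show which months drive the result: July, August and September lie well above expectation, while January lies well below.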

We can dig into that a little deeper and see the total number of births between 1994 and 2014 broken down by month. The aggregate for August is certainly higher than any other month, but only marginally larger than that for July.

births-totals

Delving still deeper we find that the monthly counts exhibit significant variation from year to year and that August has some appreciable outliers.

births-boxplot

Specifically, August 2006 and August 2007 appear to have been bumper birth months. Interesting!

> group_by(births, year, month) %>% summarise(count = sum(count)) %>% ungroup() %>%
+   arrange(desc(count))
Source: local data frame [252 x 3]

    year  month  count
   (int) (fctr)  (int)
1   2007    Aug 391117
2   2006    Aug 388481
3   2007    Jul 380356
4   2008    Jul 376105
5   2006    Sep 375389
6   2008    Aug 374028
7   2007    Oct 370069
8   2005    Aug 370045
9   2009    Jul 369117
10  2008    Sep 368660
..   ...    ...    ...

Of course, a peak in overall births in August does not mean that there’s a direct causative link to the peak in professional baseball players’ births. But the contribution cannot be ignored.

Streaming from zip to bz2

I’ve got a massive bunch of zip archives, each of which contains only a single file. And the name of the enclosed file varies. Dealing with these data is painful.

It’d be a lot more convenient if the files were compressed with gzip or bzip2 and had a consistent naming convention. How would you go about making that conversion without actually unpacking the zip archive, finding the name of the enclosed file and then recompressing? Enter funzip.

To illustrate, first we create a zip archive with a single file.

$ ls -l foo.txt 
-rw-rw-r-- 1 user user 2311 Jul  8 14:06 foo.txt
$ zip foo.zip foo.txt 
  adding: foo.txt (deflated 62%)
$ ls -l foo.zip 
-rw-rw-r-- 1 user user 1031 Jul  8 14:06 foo.zip

Then extract the single file to standard output using funzip and pipe the results through bzip2.

$ funzip foo.zip | bzip2 >foo.bz2
$ ls -l foo.bz2
-rw-rw-r-- 1 user user 924 Jul  8 14:06 foo.bz2

Another, more robust, approach is to simply use `unzip` with `-c` (extract to stdout) and `-qq` (be super quiet).

$ unzip -qq -c foo.zip | bzip2 >foo.bz2

Voila!

Major League Baseball Birth Months

The cutoff date for almost all nonschool baseball leagues in the United States is July 31, with the result that more major league players are born in August than in any other month.
Malcolm Gladwell, Outliers

A quick analysis to confirm Gladwell’s assertion above. Used data scraped from www.baseball-reference.com. Here’s the evidence:

Distribution of birth months for Major League Baseball players.

We can make a quick check to see whether the non-uniformity is statistically significant.

> chisq.test(table(baseball$month))

	Chi-squared test for given probabilities

data:  table(baseball$month)
X-squared = 135, df = 11, p-value <2e-16

Yup, it appears to be highly significant.

Obviously the length of the months should make a small difference to the number of births. For example, all else being equal we would expect there to be more births in August (with 31 days) than in September (with only 30 days). We can be a bit more rigorous and take month lengths into account too.

> chisq.test(table(baseball$month), p = month$length / sum(month$length))

	Chi-squared test for given probabilities

data:  table(baseball$month)
X-squared = 115, df = 11, p-value <2e-16

Looks like the outcome is the same: there is a significant non-uniformity in the birth months of Major League Baseball players.
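
The `month` data frame used above is not defined in this post. Here is a minimal sketch of one way it could be constructed, assuming a `length` column holding the average number of days per month. The `name` column and the sanity check are illustrative additions, and the counts in the check are artificial, not baseball data.

```r
# Hypothetical reconstruction of `month`: one row per calendar month.
month <- data.frame(
  name = month.abb,
  length = c(31, 28.25, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
)

# Expected share of births per month if every day were equally likely.
p <- month$length / sum(month$length)

# Sanity check on artificial counts that are exactly proportional to month
# length: the length-weighted null fits them (statistic near zero), whereas
# the default uniform null does not fit them exactly.
x <- month$length * 4
weighted <- chisq.test(x, p = p)
uniform <- chisq.test(x)
c(weighted = unname(weighted$statistic), uniform = unname(uniform$statistic))
```

Passing `p = month$length, rescale.p = TRUE` would be equivalent to normalising by hand, since `rescale.p` rescales the weights so that they sum to 1.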

Upgrading Ubuntu 16.04 to Linux Kernel 4.4.12

I’ve had a few minor hardware issues with the default kernel in Ubuntu 16.04. For example, hibernate does not work on my laptop. So, in an effort to resolve these problems, I upgraded from the 4.4.0 version of the kernel to 4.4.12. Nothing tricky involved, but here’s the process.

Grab the headers and image.

$ wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.12-xenial/linux-headers-4.4.12-040412-generic_4.4.12-040412.201606011712_amd64.deb
$ wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.12-xenial/linux-headers-4.4.12-040412_4.4.12-040412.201606011712_all.deb
$ wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4.12-xenial/linux-image-4.4.12-040412-generic_4.4.12-040412.201606011712_amd64.deb

Then, become root and install the kernel.

# dpkg -i linux-headers-4.4*.deb linux-image-4.4*.deb

All hardware snags resolved. Enjoy!