Clustering the Words of William Shakespeare
In my previous post I used the tm package to do some simple text mining on the Complete Works of William Shakespeare. Today I am taking some of those results and using them to generate word clusters.
Preparing the Data
I will start with the Term Document Matrix (TDM) consisting of 71 words commonly used by Shakespeare.
This matrix is first converted from a sparse data format into a conventional matrix.
Next the TDM is normalised so that the rows sum to unity. Each entry in the normalised TDM then represents the number of times that a word occurs in a particular document relative to the number of occurrences across all of the documents.
We will be using a hierarchical clustering technique which operates on a dissimilarity matrix. We will use the Euclidean distance between each of the rows in the TDM, where each row is treated as a vector in a space of 182 dimensions.
Finally we perform agglomerative clustering using agnes() from the cluster package.
Plotting a Dendrogram
Let’s have a look at the results of our labours.
This dendrogram reflects the tree-like structure of the word clusters. We can see that the words “enter”, “exeunt” and “scene” are clustered together, which makes sense since they are related to stage directions. Also “thee” and “thou” have similar usage. In the previous analysis we found that the occurrences of “love” and “eye” were highly correlated and consequently we find them clustered here too.
This is rather cool. No doubt a similar analysis applied to contemporary literature would yield extremely different results. Anybody keen on clustering the Complete Works of Terry Pratchett?