Text Mining the Complete Works of William Shakespeare

I am starting a new project that will require some serious text mining. So, in the interests of bringing myself up to speed on the tm package, I thought I would apply it to the Complete Works of William Shakespeare and just see what falls out.

The first order of business was getting my hands on all that text. Fortunately it is available from a number of sources. I chose to use Project Gutenberg.

> TEXTFILE = "data/pg100.txt"
> if (!file.exists(TEXTFILE)) {
+     dir.create(dirname(TEXTFILE), FALSE)
+     download.file("http://www.gutenberg.org/cache/epub/100/pg100.txt", destfile = TEXTFILE)
+ }
> shakespeare = readLines(TEXTFILE)
> length(shakespeare)
[1] 124787

That’s quite a solid chunk of data: 124787 lines. Let’s take a closer look.

> head(shakespeare)
[1] "The Project Gutenberg EBook of The Complete Works of William Shakespeare, by"
[2] "William Shakespeare"
[3] ""
[4] "This eBook is for the use of anyone anywhere at no cost and with"
[5] "almost no restrictions whatsoever.  You may copy it, give it away or"
[6] "re-use it under the terms of the Project Gutenberg License included"
> tail(shakespeare)
[1] "http://www.gutenberg.org/2/4/6/8/24689"    ""
[3] "An alternative method of locating eBooks:" "http://www.gutenberg.org/GUTINDEX.ALL"
[5] ""                                          "*** END: FULL LICENSE ***"

There seems to be some header and footer text. We will want to get rid of that! Using a text editor I checked to see how many lines were occupied with metadata and then removed them before concatenating all of the lines into a single long, long, long string.

> shakespeare = shakespeare[-(1:173)]
> shakespeare = shakespeare[-(124195:length(shakespeare))]
> shakespeare = paste(shakespeare, collapse = " ")
> nchar(shakespeare)
[1] 5436541
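
Hard-coding those line numbers works, but it is fragile if Project Gutenberg ever reshuffles the file. A more robust approach is to locate the boundary lines with grep(). This is just a sketch: the marker patterns below are placeholders, so check the actual wording of the metadata lines in your copy of the file.

```r
# Strip everything up to and including a header marker line, and from a
# footer marker line onwards. The patterns passed in are up to the caller.
strip_meta <- function(lines, start_pattern, end_pattern) {
  start <- grep(start_pattern, lines)[1]
  end   <- grep(end_pattern, lines)[1]
  lines[(start + 1):(end - 1)]
}

# Toy illustration:
strip_meta(c("HEADER", "To be,", "or not to be.", "FOOTER"),
           "^HEADER$", "^FOOTER$")
# returns the two middle lines: "To be," and "or not to be."
```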

While I had the text open in the editor I noticed that the individual works were separated by blocks of Project Gutenberg licensing text wrapped in double angle brackets, of the form <<...>>.

Obviously that boilerplate is going to taint the analysis. But it also serves as a convenient marker for dividing that long, long, long string into separate documents.

> shakespeare = strsplit(shakespeare, "<<[^>]*>>")[[1]]
> length(shakespeare)
[1] 218

This left me with a list of 218 documents. On further inspection, some of them appeared to be a little on the short side (in my limited experience, the bard is not known for brevity). As it turns out, the short documents were the dramatis personae for his plays. I removed them as well.
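
The "little on the short side" check can be done directly in R rather than by eyeballing. A sketch, operating on the shakespeare vector produced by the split above (so it assumes that session state; output not shown):

```r
# Document lengths in characters; the dramatis personae stand out as a
# cluster of conspicuously short documents.
doc.lengths <- nchar(shakespeare)
summary(doc.lengths)
head(sort(doc.lengths))   # the shortest "documents"
```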

> (dramatis.personae <- grep("Dramatis Personae", shakespeare, ignore.case = TRUE))
 [1]   2   8  11  17  23  28  33  43  49  55  62  68  74  81  87  93  99 105 111 117 122 126 134 140 146 152 158
[28] 164 170 176 182 188 194 200 206 212
> length(shakespeare)
[1] 218
> shakespeare = shakespeare[-dramatis.personae]
> length(shakespeare)
[1] 182

Down to 182 documents, each of which is a complete work.

The next task was to convert these documents into a corpus.

> library(tm)
> doc.vec <- VectorSource(shakespeare)
> doc.corpus <- Corpus(doc.vec)
> summary(doc.corpus)
A corpus with 182 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:

There is a lot of information in those documents which is not particularly useful for text mining. So before proceeding any further, we will clean things up a bit. First we convert all of the text to lowercase and then remove punctuation, numbers and common English stopwords. Possibly the list of English stop words is not entirely appropriate for Shakespearean English, but it is a reasonable starting point.

> doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))
> doc.corpus <- tm_map(doc.corpus, removePunctuation)
> doc.corpus <- tm_map(doc.corpus, removeNumbers)
> doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
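
If the archaic forms prove distracting, one could extend the stopword list by hand. I have not done that here (which is why "thee" and "thou" still turn up among the frequent terms later on), and the word list below is my own ad hoc selection rather than a canonical Early Modern English stopword list.

```r
# Not run in this analysis: remove a hand-picked set of archaic function
# words on top of the standard English stopwords already removed above.
archaic.stopwords <- c("thee", "thou", "thy", "thine", "hath", "doth", "ye")
# doc.corpus <- tm_map(doc.corpus, removeWords, archaic.stopwords)
```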

Next we perform stemming, which removes affixes from words (so, for example, “run”, “runs” and “running” all become “run”).

> library(SnowballC)
> doc.corpus <- tm_map(doc.corpus, stemDocument)
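
As a quick sanity check on what the stemmer actually does to the example words above, wordStem() from SnowballC can be called on a character vector directly:

```r
library(SnowballC)

# All three inflected forms collapse to the same stem.
wordStem(c("run", "runs", "running"))
# [1] "run" "run" "run"
```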

All of these transformations have left behind a lot of excess whitespace, which we now remove.

> doc.corpus <- tm_map(doc.corpus, stripWhitespace)

If we have a look at what’s left, we find that it’s just the lowercase, stripped down version of the text (which I have truncated here).

> inspect(doc.corpus[8])
A corpus with 1 text document

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator 
Available variables in the data frame are:

 act ii scene messina pompey hous enter pompey menecr mena warlik manner pompey great god just shall
 assist deed justest men menecr know worthi pompey delay deni pompey while suitor throne decay thing
 sue menecr ignor beg often harm wise powr deni us good find profit lose prayer pompey shall well
 peopl love sea mine power crescent augur hope say will come th full mark antoni egypt sit dinner
 will make war without door caesar get money lose heart lepidus flatter flatterd neither love either

This is where things start to get interesting. Next we create a Term Document Matrix (TDM) which reflects the number of times each word in the corpus is found in each of the documents.

> TDM <- TermDocumentMatrix(doc.corpus)
> TDM
A term-document matrix (18651 terms, 182 documents)

Non-/sparse entries: 182898/3211584
Sparsity           : 95%
Maximal term length: 31 
Weighting          : term frequency (tf)
> inspect(TDM[1:10,1:10])
A term-document matrix (10 terms, 10 documents)

Non-/sparse entries: 1/99
Sparsity           : 99%
Maximal term length: 9 
Weighting          : term frequency (tf)

Terms       1 2 3 4 5 6 7 8 9 10
  aaron     0 0 0 0 0 0 0 0 0  0
  abaissiez 0 0 0 0 0 0 0 0 0  0
  abandon   0 0 0 0 0 0 0 0 0  0
  abandond  0 1 0 0 0 0 0 0 0  0
  abas      0 0 0 0 0 0 0 0 0  0
  abashd    0 0 0 0 0 0 0 0 0  0
  abat      0 0 0 0 0 0 0 0 0  0
  abatfowl  0 0 0 0 0 0 0 0 0  0
  abbess    0 0 0 0 0 0 0 0 0  0
  abbey     0 0 0 0 0 0 0 0 0  0

The extract from the TDM shows, for example, that the word “abandond” occurred once in document number 2 but was not present in any of the other first ten documents. We could equally have generated the transpose, a Document Term Matrix (DTM).

> DTM <- DocumentTermMatrix(doc.corpus)
> inspect(DTM[1:10,1:10])
A document-term matrix (10 documents, 10 terms)

Non-/sparse entries: 1/99
Sparsity           : 99%
Maximal term length: 9 
Weighting          : term frequency (tf)

Docs aaron abaissiez abandon abandond abas abashd abat abatfowl abbess abbey
  1      0         0       0        0    0      0    0        0      0     0
  2      0         0       0        1    0      0    0        0      0     0
  3      0         0       0        0    0      0    0        0      0     0
  4      0         0       0        0    0      0    0        0      0     0
  5      0         0       0        0    0      0    0        0      0     0
  6      0         0       0        0    0      0    0        0      0     0
  7      0         0       0        0    0      0    0        0      0     0
  8      0         0       0        0    0      0    0        0      0     0
  9      0         0       0        0    0      0    0        0      0     0
  10     0         0       0        0    0      0    0        0      0     0

Which of the two proves more convenient will depend on the relative numbers of documents and terms in your data.

Now we can start asking questions like: what are the most frequently occurring terms?

> findFreqTerms(TDM, 2000)
 [1] "come"  "enter" "good"  "king"  "let"   "lord"  "love"  "make"  "man"   "now"   "shall" "sir"   "thee"
[14] "thi"   "thou"  "well"  "will"

Each of these words occurred more than 2000 times.
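
findFreqTerms() returns only the terms themselves. To see the actual counts, one can sum across the rows of the TDM. A sketch, using the TDM built above (output not shown):

```r
# Total count of each term across all documents, highest first.
term.freq <- sort(rowSums(as.matrix(TDM)), decreasing = TRUE)
head(term.freq)
```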

What about associations between words? Let’s have a look at what other words had a high association with “love”.

> findAssocs(TDM, "love", 0.8)
beauti    eye
  0.83   0.80

Well that’s not too surprising!

From our first look at the TDM we know that there are many terms which do not occur very often. It might make sense to simply remove these sparse terms from the analysis.

> TDM.common = removeSparseTerms(TDM, 0.1)
> dim(TDM)
[1] 18651   182
> dim(TDM.common)
[1]  71 182

From the 18651 terms that we started with, we are now left with a TDM which considers only 71 commonly occurring terms.
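
For the record, the second argument to removeSparseTerms() is the maximum allowed sparsity: a term is dropped if the proportion of documents in which it does not appear exceeds the threshold. So 0.1 retains only terms present in at least 90% of the documents. A sketch of how the threshold trades off (the 0.4 count will vary with the corpus):

```r
dim(removeSparseTerms(TDM, 0.1))   # strict: terms in at least 90% of documents
dim(removeSparseTerms(TDM, 0.4))   # looser: terms in at least 60% of documents
```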

> inspect(TDM.common[1:10,1:10])
A term-document matrix (10 terms, 10 documents)

Non-/sparse entries: 94/6
Sparsity           : 6%
Maximal term length: 6
Weighting          : term frequency (tf)

Terms     1 2  3  4  5  6  7  8 9 10
  act     1 4  7  9  6  3  2 14 1  0
  art    53 0  9  3  5  3  2 17 0  6
  away   18 5  8  4  2 10  5 13 1  7
  call   17 1  4  2  2  1  6 17 3  7
  can    44 8 12  5 10  6 10 24 1  5
  come   19 9 16 17 12 15 14 89 9 15
  day    43 2  2  4  1  5  3 17 2  3
  enter   0 7 12 11 10 10 14 87 4  6
  exeunt  0 3  8  8  5  4  7 49 1  4
  exit    0 6  8  5  6  5  3 31 3  2

Finally we are going to put together a visualisation. The TDM is stored as a sparse matrix. This was an apt representation for the initial TDM, but the reduced TDM containing only frequently occurring words is probably better stored as a normal matrix. We’ll make the conversion and see.

> library(slam)
> TDM.dense <- as.matrix(TDM.common)
> object.size(TDM.common)
207872 bytes
> object.size(TDM.dense)
112888 bytes

So, as it turns out, the sparse representation was actually wasting space! (This will generally not be true: it only applies here because the matrix consists of just the commonly occurring terms, and so is not very sparse at all.)
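
A back-of-envelope calculation shows why. The slam package stores a matrix in triplet form, keeping roughly three numbers (row index, column index and value) per non-zero entry, so it only pays off when most entries are zero. Using the figures from the 10x10 extract of TDM.common above:

```r
# 94 of the 100 entries in the extract were non-zero.
non.sparse <- 94
total <- 100
triplet.cost <- 3 * non.sparse   # row index, column index and value per entry
dense.cost <- total              # one value per cell, indices implicit
triplet.cost > dense.cost        # TRUE: the dense layout wins at this density
```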

There are numerous options for visualising these data, of which we will look at only two. Let’s start with a simple word cloud.

> library(wordcloud)
> library(RColorBrewer)
> palette <- brewer.pal(9,"BuGn")[-(1:4)]
> wordcloud(rownames(TDM.dense), rowSums(TDM.dense), min.freq = 1, colors = palette)


To produce the other plot we first need to convert the data into a tidier format.

> library(reshape2)
> TDM.dense = melt(TDM.dense, value.name = "count")
> head(TDM.dense)
  Terms Docs count
1   act    1     1
2   art    1    53
3  away    1    18
4  call    1    17
5   can    1    44
6  come    1    19

It’s then an easy matter to use ggplot2 to make up an attractive heat map.

> library(ggplot2)
> ggplot(TDM.dense, aes(x = Docs, y = Terms, fill = log10(count))) +
+     geom_tile(colour = "white") +
+     scale_fill_gradient(high="#FF0000" , low="#FFFFFF")+
+     ylab("") +
+     theme(panel.background = element_blank()) +
+     theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())


The colour scale indicates the number of times that each of the terms cropped up in each of the documents. I applied a logarithmic transform to the counts since there was a very large disparity in the numbers across terms and documents. The grey tiles correspond to terms which are not found in the corresponding document.

One can see that some terms, like “will” turn up frequently in most documents, while “love” is common in some and rare or absent in others.

That was interesting. Not sure that I would like to make any conclusions on the basis of the results above (Shakespeare is well outside my field of expertise!), but I now have a pretty good handle on how the tm package works. As always, feedback will be appreciated!


  • Build a search engine in 20 minutes or less
  • Feinerer, I. (2013). Introduction to the tm Package: Text Mining in R.
  • Feinerer, I., Hornik, K., & Meyer, D. (2008). Text Mining Infrastructure in R. Journal of Statistical Software, 25(5).


    Nice article – can you please give me some pointers as to how to analyse “Job Descriptions”, clean up removing all the unnecessary terms, and then create a table of important terms along with frequency.

    Regards /

    • Hi, thanks for the comment. Well, my first suggestion would be to load up these job descriptions into R and follow through the steps in the analysis that I did. That should at least give you a flavour for how it should be done. This is certainly an interesting application and something that I had not previously considered. Best regards, Andrew.

  • Hey, I also do some text mining stuff on R largely by pulling data from Twitter. I was planning to do something on Shakespeare as well anyway. It would have been great if you did some basic clustering on terms or created word clouds. I tend to do that for Tweets about Bollywood movies (Indian Cinema). If you have time do visit my blog : http://tweetsent.wordpress.com/

    • Good suggestion about the word cloud: I have added one in. Clustering will have to wait for another day.

  • Andy

    Cool! I think it would be really interesting to order the ‘Docs’ by year of publication, to see how his use of words changed over time.

  • Pingback: Clustering the Words of William Shakespeare | Exegetic Analytics

  • Pingback: No Need To Sit Through Shakespeare, Thanks To Statistical Exegesis « Irregular Times

    • Sadly, it seems like Rowan didn’t really grasp the intention of the article, which was to have a bit of fun while learning about a new technology, while documenting the process for the general edification of anybody who might be interested.

      The objective of any statistical analysis is to reduce a large body of information into a few concise summary measures. So, yes, the word cloud may not come close to doing justice to the breadth of the writings of William Shakespeare, but it does summarise those words which he used most frequently.

      • Sandyn Skudneski

        A bit late to this but yup, he seemed to miss the point that you were trying out a new bike on a particular road that looked fun. Enjoyed it; makes me want to overfit like a maniac.

  • Madhavan Ramani

    Nice job, Andrew. It really was explained well and I feel up to more complex tasks. Thanks a lot!

    BTW, I have a task to do the opposite of stemming. e.g. Given a root word, I would like to create ontological words and also build a list of affixed words. Are you aware of a way to do this with tm or any of the related R packages? Also can you recommend an Ontology Editor that interfaces with R?

    • Thanks for the feedback, Madhavan. In answer to your questions: 1. No, I don’t, but I suspect that you might be able to brute force this with a dictionary file. There must be a more elegant solution though! 2. I’m afraid not. A bit of Googling will probably turn up some answers though.

  • Fabsi10

    Just wanted to thank you for that tutorial! The best I read about textmining in R so far!

    • Thank you! I am very happy that you found it useful.

  • JM THeler

    Superb article. I am reproducing the code, installing the needed packages as I go. It works great up to the last ggplot graph, where I get the following error:

    ggplot(TDM.dense, aes(x = Docs, y = Terms, fill = log10(count))) +
    + geom_tile(colour = "white") +
    + scale_fill_gradient(high="#FF0000" , low="#FFFFFF")+
    + ylab("") +
    + theme(panel.background = element_blank()) +
    + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
    Don’t know how to automatically pick scale for object of type function. Defaulting to continuous
    Don’t know how to automatically pick scale for object of type function. Defaulting to continuous
    Erreur dans Math.factor(count) :
    log10 ceci n’est pas pertinent pour des variables facteurs
    [Error in Math.factor(count): log10 is not meaningful for factor variables]

    run in rstudio on ubuntu 13.10

    Thank you for this outstanding tutorial

    • Hi! Thanks for the positive comment. Not quite sure what the problem is there. Using my almost non-existent knowledge of French though, it seems that your TDM.dense$count is a factor variable… that’s not right. See the extract from my TDM.dense just a little further up the post. Does yours look like that? Should have three columns (Terms, Docs and count) with types factor, integer and numeric. Best regards, Andrew.

  • Bastian

    The TextDocumentMatrix function causes an error. This is my code:
    TEXTFILE = "data/pg100.txt"
    shakespeare = readLines(TEXTFILE)
    shakespeare = shakespeare[-(1:173)]
    shakespeare = shakespeare[-(124195:length(shakespeare))]
    shakespeare = paste(shakespeare, collapse = " ")
    shakespeare = strsplit(shakespeare, "<<[^>]*>>")[[1]]
    (dramatis.personae <- grep("Dramatis Personae", shakespeare, ignore.case = TRUE))
    shakespeare = shakespeare[-dramatis.personae]
    doc.vec <- VectorSource(shakespeare)
    doc.corpus <- VCorpus(doc.vec)
    doc.corpus <- tm_map(doc.corpus, tolower)
    doc.corpus <- tm_map(doc.corpus, removePunctuation)
    doc.corpus <- tm_map(doc.corpus, removeNumbers)
    doc.corpus <- tm_map(doc.corpus, removeWords, stopwords("english"))
    doc.corpus <- tm_map(doc.corpus, stemDocument)
    doc.corpus <- tm_map(doc.corpus, stripWhitespace)
    TDM <- TermDocumentMatrix(doc.corpus)
    The Error:
    Fehler in UseMethod("meta", x) :
    nicht anwendbare Methode für 'meta' auf Objekt der Klasse "character" angewendet
    Zusätzlich: Warnmeldung:
    In mclapply(unname(content(x)), termFreq, control) :
    all scheduled cores encountered errors in user code
    [Error in UseMethod("meta", x): no applicable method for 'meta' applied to an object of class "character". In addition, a warning: all scheduled cores encountered errors in user code.]

    I am a bit frustrated. I did everything step by step.

    • Hi Bastian,

      Hmmm. My German is not great, but I gather that the error message is complaining that TermDocumentMatrix() does not have a method to handle data of type “character”. What is the class of your doc.corpus variable? Here is what I have

      > class(doc.corpus)
      [1] "VCorpus" "Corpus"  "list"

      Would it be helpful if I emailed you the complete script that I compiled for this article?

      Best regards,

      • Bastian

        Ok, there is the error. My output is
        > class(doc.corpus)
        [1] "VCorpus" "Corpus"
        I can’t figure out how the "list" was dropped or never created. I followed your steps precisely.
        The funny thing is, when I do this
        doc.corpusclean <- tm_map(doc.corpus, PlainTextDocument)
        TDM <- TermDocumentMatrix(doc.corpusclean)
        it works, but the doc structure is destroyed (cleared). I came to this conclusion after I read this info:
        "It seems this would have worked just fine in tm 0.5.10 but changes in tm 0.6.0 seem to have broken it. The problem is that the functions tolower and trim won't necessarily return TextDocuments (it looks like the older version may have automatically done the conversion). They instead return characters and the DocumentTermMatrix isn't sure how to handle a corpus of characters."
        Actually I run R 3.1.0 on the current RStudio with the latest packages installed.

        If you send me the whole source it would be great, maybe I can figure out where the break is. Thanks!

  • Bastian

    HA!! On my Windows machine it works now.
    I left out the line
    doc.corpus <- tm_map(doc.corpus, tolower)
    and now I have a TDM that looks like yours.
    Now… how can we get the tolower function to work as desired? Maybe the new version is a problem? Is there a possible workaround?

    • Piyush Bansal

      Got the same problem; you can do it as follows:

      doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))

      Reason: This problem appears in tm 0.6 and has to do with using functions that are not in the list returned by getTransformations() from tm. The problem is that tolower just returns a character vector, and not a "PlainTextDocument" like tm_map would like. The tm package provides the content_transformer function to take care of managing the PlainTextDocument.

      • Hi Piyush,

        Your feedback is greatly appreciated and apologies for taking so long to get back to you. I have updated my post to reflect the changes you suggest. Thanks again!

        Best regards,

  • Thomas

    Hey there. Great tutorial. I am just starting with R and i run into some difficulties when trying to execute your code fragments. Is there any way that you can provide me your full source code? It would be great to have a working code base to start off with this. Thanks in advance.

    • Hi! Sure, I would be happy to provide you with the source. Best regards, Andrew.

  • Libardo

    Hi Andrew, thanks so much for your tutorial.
    I am reproducing the code; it works great up to the visualisation, where I get the following errors:
    [1] "data.frame"

    Docs: int
    Count: num

    > wordcloud(rownames(TDM.dense), rowSums(TDM.dense), min.freq = 1, color = palette)
    Error in rowSums(TDM.dense) : 'x' must be numeric

    + + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
    Error in +geom_tile(colour = "white") :
    invalid argument to unary operator

    Thanks for your help

    • Hi Libardo,

      Thanks for the feedback. Okay, I think that your second problem is caused by one too many "+" signs. When R echoes its output it uses a "+" to indicate a line continuation. You should NOT include these in your commands. So you should have something like

      ggplot(..) + geom_tile(colour = "white") + …

      Note that there is only one "+" between function calls. Not too sure about the first problem. Can you send me the output from head() on your TDM.dense?


      • Libardo

        Andrew, thanks for your reply.
        You are right, second problem solved.

        > TDM.dense = melt(TDM.dense, value.name = "count")
        > head(TDM.dense)
          Terms Docs count
        1   act    1     1
        2   art    1    53
        3  away    1    18
        4  call    1    17
        5   can    1    44
        6  come    1    19

        > library(wordcloud)
        > library(RColorBrewer)
        > palette <- brewer.pal(9, "BuGn")[-(1:4)]
        > wordcloud(rownames(TDM.dense), rowSums(TDM.dense), min.freq = 1, color = palette)
        Error in rowSums(TDM.dense) : 'x' must be numeric


        • Hi Libardo,

          Okay, I can see the problem. Your data are in the form of a data frame. Your call to melt() converts the data from a matrix to a data frame. But you actually want the matrix to do the word cloud… So, scroll back up the post a bit to where you see the call to as.matrix(). That’s the point in the analysis when the data are in the right format to create the word cloud.

          Best regards,

          • Libardo

            Andrew, thanks very much for your help. Everything is working fine including the clustering.

  • Hi Andrew ,

    I have followed the code , including updating tolower to

    > doc.corpus <- tm_map(doc.corpus, content_transformer(tolower))

    But when I call inspect(TDM[1:10,1:10]) I am getting

    Terms       character(0) character(0) character(0) character(0) character(0)
      aaron                0            0            0            0            0
      aba                  0            0            0            0            0

    Why is it not showing as 1 2 3 4 ?

    Thanks for the great tutorial 😛

    • Hi Billy,

      Thanks for your comments and feedback. Without seeing your code it’s hard to tell where something might have gone wrong. I’ll email you a copy of my complete script. Hopefully that will resolve the issue for you.

      Best regards,

    • Dawood

      In case you use PlainTextDocument, just remove that command, because it destroys the document structure.

  • Vincent Feltkamp

    Thanks for the tutorial. Am going to try to subtract the average (English/Shakespearan) relative frequencies from each document’s relative word frequencies, to see which words are more common in each text. This might be useful for text-tagging.

  • Hello Andrew,

    Thank you for this great tutorial. I am trying to apply this (great) piece of code to my project and I have two problems:
    1. My corpus is built on 700 txt files stored in five folders, (1.txt, 2.txt….123.txt) . How can I import them in one corpus ?
    2. I work with hebrew text. How can I modify the english stopwords list in the tm package?

    Thank you.


  • Hi Yohanan,

    Okay, suppose that you have all those *.txt files in a folder called “data” then you would read them all and concatenate into a single vector as follows:

    text = lapply(list.files(path = "data", pattern = "\\.txt$", full.names = TRUE), readLines)
    text = do.call(c, text)

    It doesn’t look like the tm package currently supports Hebrew stopwords. It does cater for English, Catalan, Romanian, Danish, Dutch, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Russian, Spanish and Swedish though. But that’s obviously no help for your particular problem.

    I think you might find this [1] thread on StackOverflow pertinent and this [2] GitHub project provides a list of Hebrew stopwords. Hope that helps!


    [1] http://stackoverflow.com/questions/1365510/where-can-i-find-a-list-of-hebrew-stop-words
    [2] https://github.com/gidim/HebrewStopWords

  • Pingback: text_mining_the_complete_works_of_william_shakespeare [CommRes.net]