Python: First Steps with MongoDB

I’m busy working my way through Kyle Banker’s MongoDB in Action. Much of the example code in the book is given in Ruby. Although I’d love to learn more about Ruby, for the moment it makes more sense for me to follow along in Python.


MongoDB Installation

If you haven’t already installed MongoDB, now is the time to do it! On a Debian Linux system the installation is very simple.

$ sudo apt install mongodb

Python Package Installation

Next install PyMongo, the Python driver for MongoDB.

$ pip3 install pymongo

Check that the install was successful.

>>> import pymongo
>>> pymongo.version
'3.3.0'

Detailed documentation for PyMongo can be found in the official PyMongo documentation.

Creating a Client

To start interacting with the MongoDB server we need to instantiate a MongoClient.

>>> client = pymongo.MongoClient()

This will connect to localhost using the default port. Alternative values for host and port can be specified.
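
For example, to connect to a MongoDB instance on another machine (the host below is just a placeholder):

>>> client = pymongo.MongoClient(host="mongo.example.com", port=27017)  # placeholder host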

Connect to a Database

Next we connect to a particular database called test. If the database does not yet exist then it will be created.

>>> db = client.test

Create a Collection

A database will hold one or more collections of documents. We’ll create a users collection.

>>> users = db.users
>>> users
Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True),
'test'), 'users')

As mentioned in the documentation, MongoDB is lazy about the creation of databases and collections. Neither the database nor collection is actually created until data are written to them.
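
We can confirm this lazy behaviour: before any documents are written, the list of collections is empty (assuming a fresh test database).

>>> db.collection_names()
[]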

Working with Documents

As you would expect, MongoDB caters for the four basic CRUD operations: Create, Read, Update and Delete.

Create

Documents are represented as dictionaries in Python. We’ll create a couple of light user profiles.

>>> smith = {"last_name": "Smith", "age": 30}
>>> jones = {"last_name": "Jones", "age": 40}

We use the insert_one() method to store each document in the collection.

>>> users.insert_one(smith)
<pymongo.results.InsertOneResult object at 0x7f57d36d9678>

Each document is allocated a unique identifier which can be accessed via the inserted_id attribute.

>>> jones_id = users.insert_one(jones).inserted_id
>>> jones_id
ObjectId('57ea4adfad4b2a1378640b42')

Although these identifiers look pretty random, there is actually a well-defined structure: the first 8 characters (4 bytes) are a timestamp, followed by a 6 character (3 byte) machine identifier, then a 4 character (2 byte) process identifier and finally a 6 character (3 byte) counter.
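
PyMongo exposes the timestamp portion via the generation_time attribute of an ObjectId, which yields a timezone-aware datetime (the output will look something like this):

>>> jones_id.generation_time
datetime.datetime(2016, 9, 27, 10, 33, 3, tzinfo=<bson.tz_util.FixedOffset object at 0x...>)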

We can verify that the collection has been created.

>>> db.collection_names()
['users', 'system.indexes']

There’s also an insert_many() method which can be used to simultaneously insert multiple documents.
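
For example, a couple of additional (hypothetical) users could be inserted in a single call; the returned InsertManyResult has an inserted_ids attribute.

>>> more_users = [{"last_name": "Brown", "age": 35}, {"last_name": "Black", "age": 25}]  # hypothetical
>>> result = users.insert_many(more_users)
>>> result.inserted_ids
[ObjectId('...'), ObjectId('...')]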

Read

The find_one() method can be used to search the collection. As its name implies, it returns only a single document.

>>> users.find_one({"last_name": "Smith"})
{'_id': ObjectId('57ea4acfad4b2a1378640b41'), 'age': 30, 'last_name': 'Smith'}
>>> users.find_one({"_id": jones_id})
{'_id': ObjectId('57ea4adfad4b2a1378640b42'), 'age': 40, 'last_name': 'Jones'}

A more general query can be made using the find() method which, rather than returning a document, returns a cursor which can be used to iterate over the results. With our minimal collection this doesn’t seem very useful, but a cursor really comes into its own with a massive collection.

>>> users.find({"last_name": "Smith"})
<pymongo.cursor.Cursor object at 0x7f57d77fe3c8>
>>> users.find({"age": {"$gt": 20}})
<pymongo.cursor.Cursor object at 0x7f57d77fe8d0>

A cursor is an iterable and can be used to neatly access the query results.

>>> cursor = users.find({"age": {"$gt": 20}})
>>> for user in cursor:
...     user["last_name"]
... 
'Smith'
'Jones'

Operations like count() and sort() can be applied to the results returned by find().
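
For example (assuming the collection still holds just our two users):

>>> users.find({"age": {"$gt": 20}}).count()
2
>>> [user["last_name"] for user in users.find().sort("age", pymongo.DESCENDING)]
['Jones', 'Smith']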

Update

The update() method is used to modify existing documents. It takes two documents as arguments: the first is used to match those documents to which the change is to be applied, while the second gives the details of the change.

>>> users.update({"last_name": "Smith"}, {"$set": {"city": "Durban"}})
{'updatedExisting': True, 'nModified': 1, 'n': 1, 'ok': 1}

The example above uses the $set modifier. There are a number of other modifiers available like $inc, $mul, $rename and $unset.
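
For instance, $inc increments a numeric field in place, so the following would bump Smith’s age by one:

>>> users.update({"last_name": "Smith"}, {"$inc": {"age": 1}})
{'updatedExisting': True, 'nModified': 1, 'n': 1, 'ok': 1}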

By default the update is only applied to the first matching document. The change can be applied to all matching documents by specifying multi=True.
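
For example, to set a (hypothetical) country field on all users in one pass:

>>> users.update({}, {"$set": {"country": "South Africa"}}, multi=True)  # hypothetical field
{'updatedExisting': True, 'nModified': 2, 'n': 2, 'ok': 1}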

Delete

Documents are deleted via the remove() method, with an argument which specifies which documents are to be removed.

>>> users.remove({"age": {"$gte": 40}})
{'n': 1, 'ok': 1}

Conclusion

Well, those are the basic operations. Nothing too scary. I’ll be back with the Python implementation of the Twitter archival sample application.

Review: Mastering Python Scientific Computing


I was asked to review “Mastering Python Scientific Computing” (authored by Hemant Kumar Mehta and published in 2015 by Packt Publishing). I was disappointed by the book. The title led me to believe that it would help me to achieve mastery. I don’t feel that it brought me any closer to this goal. To be sure, the book contains a lot of useful information. But the way that it is written is a weird compromise between high-level overview and low-level details, skipping the middle ground in between, where explanations are given and the details are linked to the global picture. It’s that middle ground which I normally find most instructive.

A complete guide for Python programmers to master scientific computing using Python APIs and tools.
Mastering Python Scientific Computing

I had high hopes for this book. I tried to love this book. But in the end, despite my best efforts, I got completely frustrated after Chapter 5 and just gave up. Here’s the Table of Contents and some notes on the first two chapters.

  1. The Landscape of Scientific Computing and Why Python?
    The first chapter presents an overview of Scientific Computing with some examples of applications and an outline of a typical workflow. It then digs rather deeply into the issue of error analysis before giving some background on the Python programming language. Although Python and Scientific Computing are central to this book, I found an introduction to both topics within a single chapter to be a little disconcerting.

  2. A Deeper Dive into Scientific Workflows and the Ingredients of Scientific Computing Recipes
    This chapter starts off by addressing a range of techniques used in Scientific Computing, including optimisation, interpolation and extrapolation, numerical integration and differentiation, differential equations and random number generation. Attention then shifts to Python as a platform for Scientific Computing. A lengthy list of relevant packages is given before Python’s capabilities for interactive computing via IPython are introduced. The chapter concludes with a description of symbolic computing using SymPy and some examples of Python’s plotting capabilities.

  3. Efficiently Fabricating and Managing Scientific Data
  4. Scientific Computing APIs for Python
  5. Performing Numerical Computing
  6. Applying Python for Symbolic Computing
  7. Data Analysis and Visualization
  8. Parallel and Large-scale Scientific Computing
  9. Revisiting Real-life Case Studies
  10. Best Practices for Scientific Computing

The book is riddled with small errors. These really should have been picked up by the author and editor, or indeed by either of the two reviewers credited at the front of the book. I have submitted errata via the publisher’s web site, which were duly acknowledged.

Installing XGBoost on Ubuntu


XGBoost is the flavour of the moment for serious competitors on Kaggle. It was developed by Tianqi Chen and provides a particularly efficient implementation of the Gradient Boosting algorithm. Although there is a CLI implementation of XGBoost, you’ll probably be more interested in using it from either R or Python. Below are instructions for getting it installed for each of these languages. It’s pretty painless.

Installing for R

Installation in R is extremely simple.

> install.packages('xgboost')
> library(xgboost)

It’s also supported as a model in caret, which is especially handy for feature selection and model parameter tuning.

Installing for Python

Download the latest version from the GitHub repository. The simplest way to do this is to grab the archive of a recent release. Unpack the archive, become root and then execute the following:

# cd xgboost-master
# make
# cd python-package/
# python setup.py install --user

And you’re ready to roll:

import xgboost

If you run into trouble during the process you might have to install a few other packages:

# apt-get install g++ gfortran
# apt-get install python-dev python-numpy python-scipy python-matplotlib python-pandas
# apt-get install libatlas-base-dev

Conclusion

Enjoy building great models with this absurdly powerful tool. I’ve found that it effortlessly consumes vast data sets that grind other algorithms to a halt. Get started by looking at some code examples.
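
As a quick check that everything works end to end, here’s a minimal sketch which trains a small model on synthetic data using the native Python API (all parameter values below are arbitrary):

import numpy as np
import xgboost as xgb

# Synthetic binary classification data: 100 samples, 5 features.
X = np.random.rand(100, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)

# Pack the data into XGBoost's optimised DMatrix structure.
dtrain = xgb.DMatrix(X, label=y)

# Train a small gradient boosted tree model.
params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.1}
model = xgb.train(params, dtrain, num_boost_round=20)

# Predictions are probabilities in [0, 1].
print(model.predict(dtrain)[:5])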

Review: Beautiful Data

I’ve just finished reading Beautiful Data (published by O’Reilly in 2009), a collection of essays edited by Toby Segaran and Jeff Hammerbacher. The 20 essays from 39 contributors address a diverse array of topics relating to data and how it’s collected, analysed and interpreted.


Since this is a collection of essays, the writing style and level of technical detail varies considerably between chapters. To be honest, I didn’t find every chapter absolutely riveting, but I generally came away from each of them having learned a thing or two. Below is a list of chapter titles with occasional comments.

  1. Seeing Your Life in Data
    Nathan Yau writes about personal data collection, highlighting your.flowingdata which is a Twitter app for gathering personal data. Although I am keenly interested in the data logged by my Garmin 910XT, I don’t think that I have the discipline to tweet every time I go out for a run. Regardless though, it’s a cool idea.

  2. The Beautiful People: Keeping Users in Mind When Designing Data Collection Methods
  3. Embedded Image Data Processing on Mars
    Instruments on planetary probes operate under significant technological constraints. So it’s fascinating to learn the details behind the imaging system on the Phoenix Mars lander.

  4. Cloud Storage Design in a PNUTShell
  5. Information Platforms and the Rise of the Data Scientist
    Jeff Hammerbacher along with DJ Patil coined the term “Data Scientist”. The chapter starts with a story of how a 17-year-old Hammerbacher was fired from his job as a cashier in a grocery store and ends with the creation of the “Data Scientist” role at Facebook, reflecting the various and disparate tasks currently undertaken by people working intensively with data.

    By decoupling the requirements of specifying structure from the ability to store data and innovating on APIs for data retrieval, the storage systems of large web properties are starting to look less like databases and more like dataspaces.
    Beautiful Data, p. 83

  6. The Geographic Beauty of a Photographic Archive
  7. Data Finds Data
  8. Portable Data in Real Time
    Jud Valeski describes the evolution of APIs for public access to data. Having spent a lot of time recently messing around with the APIs for Twitter and Quandl, I found this to be interesting stuff.

  9. Surfacing the Deep Web
  10. Building Radiohead’s House of Cards
    I’ve watched the video for Radiohead’s House of Cards a few times before and thought that it was a cool concept. Now that I know what went into making the video (courtesy of this chapter), my appreciation has gone up a number of notches. The authors explain in appreciable detail how they gathered data using LIDAR systems and used Processing to generate the ethereal animations in the video.

    Although at the time of publishing some of the data for the video were released as open source, they appear to have subsequently been withdrawn. That’s a pity. I think I would have enjoyed hacking on that. And it would have been good motivation to learn more about Processing.

  11. Visualizing Urban Data
  12. The Design of Sense.us
  13. What Data Doesn’t Do
    Coco Krumme provides a somewhat dissenting view, writing about the limitations of data. She reminds us that a naive interpretation of statistics can be very misleading; that more data is not always better data; that data alone do not provide explanations; and that even good models have limitations.

  14. Natural Language Corpus Data
    This is probably the most technical chapter in the book. Peter Norvig gives a tutorial on Natural Language Processing (NLP) with sample code in Python. He certainly provides enough information to get you up and running with NLP. He also points out a number of potential gotchas and ways to get around them.

  15. Life in Data: The Story of DNA
    char(3*10^9) human_genome;
    

    I’m not sure why, but that snippet of code really amused me. Great way of capturing an obscure biological fact in a form which resonates with my inner geek.

  16. Beautifying Data in the Real World
  17. Superficial Data Analysis: Exploring Millions of Social Stereotypes
    The authors write about processing the data gathered at FaceStat.com, providing numerous handy snippets of R code. Although the FaceStat web site has been discontinued, the principles of their analysis will find applications elsewhere.

  18. Bay Area Blues: The Effect of the Housing Crisis
  19. Beautiful Political Data
  20. Connecting Data
    This chapter addresses the issue of connecting data from disparate sources. Here I found something that is going to be of immediate use to me: Collective Entity Resolution. It appears that this algorithm will solve a problem I have grappled with for a few months. This bit of information alone made the book a worthwhile read.

You’re not going to learn the details of any new technical skills from this book. But you will definitely uncover many inspiring thoughts and ideas. And you’ll probably find a gem or two, just like I did.

#MonthOfJulia Day 25: Interfacing with Other Languages


Julia has native support for calling C and FORTRAN functions. There are also add-on packages which provide interfaces to C++, R and Python. We’ll have a brief look at the support for C and R here. Further details on these and the other supported languages can be found on GitHub.

Why would you want to call other languages from within Julia? Here are a couple of reasons:

  • to access functionality which is not implemented in Julia;
  • to exploit some efficiency associated with another language.

The second reason should apply relatively seldom because, as we saw some time ago, Julia provides performance which rivals native C or FORTRAN code.

C

C functions are called via ccall(), where the name of the C function and the library it lives in are passed as a tuple in the first argument, followed by the return type of the function and the types of the function arguments, and finally the arguments themselves. It’s a bit clunky, but it works!

julia> ccall((:sqrt, "libm"), Float64, (Float64,), 64.0)
8.0

It makes sense to wrap a call like that in a native Julia function.

julia> csqrt(x) = ccall((:sqrt, "libm"), Float64, (Float64,), x);
julia> csqrt(64.0)
8.0

This function will not be vectorised by default (just try calling csqrt() on a vector!), but it’s a simple matter to produce a vectorised version using the @vectorize_1arg macro.

julia> @vectorize_1arg Real csqrt;
julia> methods(csqrt)
# 4 methods for generic function "csqrt":
csqrt{T<:Real}(::AbstractArray{T<:Real,1}) at operators.jl:359
csqrt{T<:Real}(::AbstractArray{T<:Real,2}) at operators.jl:360
csqrt{T<:Real}(::AbstractArray{T<:Real,N}) at operators.jl:362
csqrt(x) at none:6

Note that a few extra specialised methods have been introduced and now calling csqrt() on a vector works perfectly.

julia> csqrt([1, 4, 9, 16])
4-element Array{Float64,1}:
 1.0
 2.0
 3.0
 4.0

R

I’ll freely admit that I don’t dabble in C too often these days. R, on the other hand, is a daily workhorse. So being able to import R functionality into Julia is very appealing. The first thing that we need to do is load up a few packages, the most important of which is RCall. There’s also great documentation for the package.

julia> using RCall
julia> using DataArrays, DataFrames

We immediately have access to R’s built-in data sets and we can display them using rprint().

julia> rprint(:HairEyeColor)
, , Sex = Male

       Eye
Hair    Brown Blue Hazel Green
  Black    32   11    10     3
  Brown    53   50    25    15
  Red      10   10     7     7
  Blond     3   30     5     8

, , Sex = Female

       Eye
Hair    Brown Blue Hazel Green
  Black    36    9     5     2
  Brown    66   34    29    14
  Red      16    7     7     7
  Blond     4   64     5     8

We can also copy those data across from R to Julia.

julia> airquality = DataFrame(:airquality);
julia> head(airquality)
6x6 DataFrame
| Row | Ozone | Solar.R | Wind | Temp | Month | Day |
|-----|-------|---------|------|------|-------|-----|
| 1   | 41    | 190     | 7.4  | 67   | 5     | 1   |
| 2   | 36    | 118     | 8.0  | 72   | 5     | 2   |
| 3   | 12    | 149     | 12.6 | 74   | 5     | 3   |
| 4   | 18    | 313     | 11.5 | 62   | 5     | 4   |
| 5   | NA    | NA      | 14.3 | 56   | 5     | 5   |
| 6   | 28    | NA      | 14.9 | 66   | 5     | 6   |

rcopy() provides a high-level interface to function calls in R.

julia> rcopy("runif(3)")
3-element Array{Float64,1}:
 0.752226
 0.683104
 0.290194

However, for some complex objects there is no simple way to translate between R and Julia, and in these cases rcopy() fails. We can see in the case below that the object of class lm returned by lm() does not diffuse intact across the R-Julia membrane.

julia> "fit <- lm(bwt ~ ., data = MASS::birthwt)" |> rcopy
ERROR: `rcopy` has no method matching rcopy(::LangSxp)
 in rcopy at no file
 in map_to! at abstractarray.jl:1311
 in map_to! at abstractarray.jl:1320
 in map at abstractarray.jl:1331
 in rcopy at /home/colliera/.julia/v0.3/RCall/src/sexp.jl:131
 in rcopy at /home/colliera/.julia/v0.3/RCall/src/iface.jl:35
 in |> at operators.jl:178

But the call to lm() was successful and we can still look at the results.

julia&gt; rprint(:fit)

Call:
lm(formula = bwt ~ ., data = MASS::birthwt)

Coefficients:
(Intercept)          low          age          lwt         race  
    3612.51     -1131.22        -6.25         1.05      -100.90  
      smoke          ptl           ht           ui          ftv  
    -174.12        81.34      -181.95      -336.78        -7.58 

You can use R to generate plots with either the base functionality or that provided by libraries like ggplot2 or lattice.

julia> reval("plot(1:10)");                # Will pop up a graphics window...
julia> reval("library(ggplot2)");
julia> rprint("ggplot(MASS::birthwt, aes(x = age, y = bwt)) + geom_point() + theme_classic()")
julia> reval("dev.off()")                  # ... and close the window.

Watch the videos below for some other perspectives on multi-language programming with Julia. Also check out the complete code for today (including examples with C++, FORTRAN and Python) on github.