Python: First Steps with MongoDB

I’m busy working my way through Kyle Banker’s MongoDB in Action. Much of the example code in the book is given in Ruby. Despite the fact that I’d love to learn more about Ruby, for the moment it makes more sense for me to follow along with Python.

MongoDB Installation

If you haven’t already installed MongoDB, now is the time to do it! On a Debian Linux system the installation is very simple.

$ sudo apt install mongodb

Python Package Installation

Next install PyMongo, the Python driver for MongoDB.

$ pip3 install pymongo

Check that the install was successful.

>>> import pymongo
>>> pymongo.version
'3.3.0'

Detailed documentation for PyMongo is available online.

Creating a Client

To start interacting with the MongoDB server we need to instantiate a MongoClient.

>>> client = pymongo.MongoClient()

This will connect to localhost using the default port. Alternative values for host and port can be specified.
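
As a rough sketch of what specifying alternative values might look like, the helper below builds a connection URI from a host and port. The name mongo_uri() and the host db.example.com are made up for illustration; the MongoClient calls in the comments assume PyMongo is installed and a server is reachable.

```python
def mongo_uri(host="localhost", port=27017):
    """Build a MongoDB connection URI (mongo_uri is a hypothetical helper)."""
    return f"mongodb://{host}:{port}/"


print(mongo_uri())
print(mongo_uri("db.example.com", 27018))

# Usage with PyMongo (not executed here; db.example.com is a placeholder host):
#   import pymongo
#   client = pymongo.MongoClient(mongo_uri("db.example.com", 27018))
#   client = pymongo.MongoClient("db.example.com", 27018)  # host and port as arguments
```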

Connect to a Database

Next we connect to a particular database called test. If the database does not yet exist then it will be created.

>>> db = client.test

Create a Collection

A database will hold one or more collections of documents. We’ll create a users collection.

>>> users = db.users
>>> users
Collection(Database(MongoClient(host=['localhost:27017'], document_class=dict, tz_aware=False, connect=True),
'test'), 'users')

As mentioned in the documentation, MongoDB is lazy about the creation of databases and collections. Neither the database nor collection is actually created until data are written to them.

Working with Documents

As you would expect, MongoDB caters for the four basic CRUD operations.

Create

Documents are represented as dictionaries in Python. We’ll create a couple of light user profiles.

>>> smith = {"last_name": "Smith", "age": 30}
>>> jones = {"last_name": "Jones", "age": 40}

We use the insert_one() method to store each document in the collection.

>>> users.insert_one(smith)
<pymongo.results.InsertOneResult object at 0x7f57d36d9678>

Each document is allocated a unique identifier which can be accessed via the inserted_id attribute.

>>> jones_id = users.insert_one(jones).inserted_id
>>> jones_id
ObjectId('57ea4adfad4b2a1378640b42')

Although these identifiers look pretty random, there is actually a well-defined structure. The first 8 hexadecimal characters (4 bytes) are a timestamp, followed by a 6 character machine identifier, then a 4 character process identifier and finally a 6 character counter.
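
To make that layout concrete, here is a small sketch that pulls the two identifiers from this post apart using only the standard library. The function name split_object_id() is my own invention. It assumes the timestamp/machine/process/counter layout described above; newer drivers replace the machine and process fields with a random value, so those two fields are only meaningful for identifiers generated by older drivers.

```python
from datetime import datetime, timezone


def split_object_id(hex_id):
    """Split a 24-character ObjectId hex string into the four fields described above."""
    if len(hex_id) != 24:
        raise ValueError("expected 24 hexadecimal characters")
    return {
        # First 4 bytes: seconds since the Unix epoch.
        "timestamp": datetime.fromtimestamp(int(hex_id[:8], 16), tz=timezone.utc),
        # Next 3 bytes: machine identifier (kept as hex).
        "machine": hex_id[8:14],
        # Next 2 bytes: process identifier.
        "process": int(hex_id[14:18], 16),
        # Last 3 bytes: counter.
        "counter": int(hex_id[18:], 16),
    }


smith_parts = split_object_id("57ea4acfad4b2a1378640b41")
jones_parts = split_object_id("57ea4adfad4b2a1378640b42")

print(jones_parts["timestamp"].isoformat())
print(smith_parts["machine"] == jones_parts["machine"])  # same machine field
print(jones_parts["counter"] - smith_parts["counter"])   # consecutive inserts
```

The two documents were inserted from the same session, so they share the machine and process fields and differ only in the timestamp and counter.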

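We can verify that the collection has been created. (Newer PyMongo releases provide list_collection_names() for this; collection_names() was removed in PyMongo 4.0.)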

>>> db.collection_names()
['users', 'system.indexes']

There’s also an insert_many() method which can be used to simultaneously insert multiple documents.
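
A minimal sketch of how insert_many() might be used. The helper add_users() and the extra surnames are hypothetical; the users argument is assumed to be a PyMongo collection like the one created earlier.

```python
def add_users(users, profiles):
    """Insert several (last_name, age) profiles in a single call.

    `users` is assumed to be a PyMongo collection; add_users is a
    hypothetical helper, not part of PyMongo.
    """
    docs = [{"last_name": last_name, "age": age} for last_name, age in profiles]
    result = users.insert_many(docs)
    # One identifier per inserted document, in insertion order.
    return result.inserted_ids


# Usage (needs a running server):
#   add_users(db.users, [("Brown", 25), ("Green", 35)])
```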

Read

The find_one() method can be used to search the collection. As its name implies it only returns a single document.

>>> users.find_one({"last_name": "Smith"})
{'_id': ObjectId('57ea4acfad4b2a1378640b41'), 'age': 30, 'last_name': 'Smith'}
>>> users.find_one({"_id": jones_id})
{'_id': ObjectId('57ea4adfad4b2a1378640b42'), 'age': 40, 'last_name': 'Jones'}

A more general query can be made using the find() method which, rather than returning a document, returns a cursor which can be used to iterate over the results. With our minimal collection this doesn’t seem very useful, but a cursor really comes into its own with a massive collection.

>>> users.find({"last_name": "Smith"})
<pymongo.cursor.Cursor object at 0x7f57d77fe3c8>
>>> users.find({"age": {"$gt": 20}})
<pymongo.cursor.Cursor object at 0x7f57d77fe8d0>

A cursor is an iterable and can be used to neatly access the query results.

>>> cursor = users.find({"age": {"$gt": 20}})
>>> for user in cursor:
...     user["last_name"]
... 
'Smith'
'Jones'

Operations like count() and sort() can be applied to the results returned by find().
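
For instance, sorting could look something like the sketch below. The function users_older_than() is a made-up name and users is assumed to be a PyMongo collection. The second argument to sort() is the direction, where -1 means descending.

```python
def users_older_than(users, min_age):
    """Return users older than min_age, oldest first.

    `users` is assumed to be a PyMongo collection; this helper is
    hypothetical and only illustrates chaining sort() onto find().
    """
    cursor = users.find({"age": {"$gt": min_age}}).sort("age", -1)
    # Materialise the cursor so the caller gets a plain list.
    return list(cursor)


# Usage (needs a running server):
#   for user in users_older_than(db.users, 20):
#       print(user["last_name"], user["age"])
```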

Update

The update() method is used to modify existing documents. It takes two documents as arguments: the first is a filter which selects the documents to which the change is to be applied and the second gives the details of the change. Note that update() has been deprecated since PyMongo 3.0 in favour of update_one() and update_many(), and was removed in PyMongo 4.0.

>>> users.update({"last_name": "Smith"}, {"$set": {"city": "Durban"}})
{'updatedExisting': True, 'nModified': 1, 'n': 1, 'ok': 1}

The example above uses the $set modifier. There are a number of other modifiers available like $inc, $mul, $rename and $unset.

By default the update is only applied to the first matching document. The change can be applied to all matching documents by specifying multi=True.
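
With the newer update_one() and update_many() methods (available since PyMongo 3.0), the choice between the first match and all matches is made by picking the method rather than by passing a flag. Here is a sketch under the assumption that users is a PyMongo collection; set_city() is a hypothetical helper.

```python
def set_city(users, last_name, city, all_matches=False):
    """Set the city on one or all users with the given last name.

    `users` is assumed to be a PyMongo collection; set_city is a
    hypothetical helper. Returns the number of documents modified.
    """
    query = {"last_name": last_name}
    change = {"$set": {"city": city}}
    if all_matches:
        result = users.update_many(query, change)  # every matching document
    else:
        result = users.update_one(query, change)   # first matching document only
    return result.modified_count


# Usage (needs a running server):
#   set_city(db.users, "Smith", "Durban")
#   set_city(db.users, "Smith", "Durban", all_matches=True)
```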

Delete

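Documents are deleted with the remove() method, whose argument specifies which documents are to be deleted. Like update(), remove() has been deprecated since PyMongo 3.0 (in favour of delete_one() and delete_many()) and was removed in PyMongo 4.0.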

>>> users.remove({"age": {"$gte": 40}})
{'n': 1, 'ok': 1}
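
The same deletion could be written with delete_many(), which has been available since PyMongo 3.0. This is only a sketch: remove_users_aged_at_least() is a made-up name and users is assumed to be a PyMongo collection.

```python
def remove_users_aged_at_least(users, min_age):
    """Delete every user whose age is at least min_age.

    `users` is assumed to be a PyMongo collection; this helper is
    hypothetical. Returns the number of documents deleted.
    """
    result = users.delete_many({"age": {"$gte": min_age}})
    return result.deleted_count


# Usage (needs a running server):
#   remove_users_aged_at_least(db.users, 40)
```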

Conclusion

Well, those are the basic operations. Nothing too scary. I’ll be back with the Python implementation of the Twitter archival sample application.

Review: Mastering Python Scientific Computing

I was asked to review “Mastering Python Scientific Computing” (authored by Hemant Kumar Mehta and published in 2015 by Packt Publishing). I was disappointed by the book. The title led me to believe that it would help me to achieve mastery. I don’t feel that it brought me any closer to this goal. To be sure, the book contains a lot of useful information. But the way that it is written is a weird compromise between high level overview and low level details. And it’s the middle ground in between, where explanations are given and a link is made between the details and the global picture, which I normally find most instructive.

“A complete guide for Python programmers to master scientific computing using Python APIs and tools.” (Mastering Python Scientific Computing)

I had high hopes for this book. I tried to love this book. But in the end, despite my best efforts, I got completely frustrated after Chapter 5 and just gave up. Here’s the Table of Contents and some notes on the first two chapters.

  1. The Landscape of Scientific Computing and Why Python?
    The first chapter presents an overview of Scientific Computing with some examples of applications and an outline of a typical workflow. It then digs rather deeply into the issue of error analysis before giving some background on the Python programming language. Although Python and Scientific Computing are central to this book, I found an introduction to both topics within a single chapter to be a little disconcerting.

  2. A Deeper Dive into Scientific Workflows and the Ingredients of Scientific Computing Recipes
    This chapter starts off by addressing a range of techniques used in Scientific Computing, including optimisation, interpolation and extrapolation, numerical integration and differentiation, differential equations and random number generation. Attention then shifts to Python as a platform for Scientific Computing. A lengthy list of relevant packages is given before Python’s capabilities for interactive computing via IPython are introduced. The chapter concludes with a description of symbolic computing using SymPy and some examples of Python’s plotting capabilities.

  3. Efficiently Fabricating and Managing Scientific Data
  4. Scientific Computing APIs for Python
  5. Performing Numerical Computing
  6. Applying Python for Symbolic Computing
  7. Data Analysis and Visualization
  8. Parallel and Large-scale Scientific Computing
  9. Revisiting Real-life Case Studies
  10. Best Practices for Scientific Computing

The book is riddled with small errors. These really should have been picked up by the author and editor, or indeed by either of the two reviewers credited at the front of the book. I have submitted errata via the publisher’s web site, which were duly acknowledged.

Installing XGBoost on Ubuntu

XGBoost is the flavour of the moment for serious competitors on Kaggle. It was developed by Tianqi Chen and provides a particularly efficient implementation of the Gradient Boosting algorithm. Although there is a CLI implementation of XGBoost, you’ll probably be more interested in using it from either R or Python. Below are instructions for getting it installed for each of these languages. It’s pretty painless.

Installing for R

Installation in R is extremely simple.

> install.packages('xgboost')
> library(xgboost)

It’s also supported as a model in caret, which is especially handy for feature selection and model parameter tuning.

Installing for Python

Download the latest version from the GitHub repository. The simplest way to do this is to grab the archive of a recent release. Unpack the archive, become root, then execute the following:

# cd xgboost-master
# make
# cd python-package/
# python setup.py install --user

And you’re ready to roll:

import xgboost

If you run into trouble during the process you might have to install a few other packages:

# apt-get install g++ gfortran
# apt-get install python-dev python-numpy python-scipy python-matplotlib python-pandas
# apt-get install libatlas-base-dev

Conclusion

Enjoy building great models with this absurdly powerful tool. I’ve found that it effortlessly consumes vast data sets that grind other algorithms to a halt. Get started by looking at some code examples. Also worth looking at are