Posted by Andrew B. Collier on 2017-07-04.
I’m busy experimenting with Spark. This is what I did to set up a local cluster on my Ubuntu machine. Before you embark on this you should first set up Hadoop.
- Download the latest release of Spark here.
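The snippets below assume version 2.1.1 with the pre-built Hadoop 2.7 binaries (the current release at the time of writing); adjust the version and package to match whatever you actually download.

```bash
# Grab the pre-built package from the Apache archive (a mirror from the
# downloads page will also work).
wget https://archive.apache.org/dist/spark/spark-2.1.1/spark-2.1.1-bin-hadoop2.7.tgz
```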
- Unpack the archive.
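Assuming the archive from the previous step:

```bash
tar -xzf spark-2.1.1-bin-hadoop2.7.tgz
```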
- Move the resulting folder and create a symbolic link so that you can have multiple versions of Spark installed.
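I'm assuming an install location under /usr/local, but anywhere sensible will do:

```bash
sudo mv spark-2.1.1-bin-hadoop2.7 /usr/local/
sudo ln -s /usr/local/spark-2.1.1-bin-hadoop2.7 /usr/local/spark
```

The symbolic link means that a newer release can later be dropped in alongside and activated simply by repointing the link.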
- Add SPARK_HOME to your environment.
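Something like this in ~/.bashrc (assuming the symbolic link created above):

```bash
export SPARK_HOME=/usr/local/spark
export PATH=$SPARK_HOME/bin:$PATH
```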
- Start a standalone master server.
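The scripts for managing the standalone cluster live in the sbin folder:

```bash
$SPARK_HOME/sbin/start-master.sh
```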
At this point you can browse to http://127.0.0.1:8080/ to view the status screen.
- Start a worker process.
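The worker needs to be pointed at the master. By default the master listens on port 7077 and its URL is displayed at the top of the status screen, so something like this should do it:

```bash
# Substitute the master URL shown on the status page if it differs.
$SPARK_HOME/sbin/start-slave.sh spark://$(hostname):7077
```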
To get this to work I had to make an entry for my machine in /etc/hosts.
- Test out the Spark shell.
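Point the shell at the master, otherwise it will just run in local mode:

```bash
spark-shell --master spark://$(hostname):7077
```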
You’ll note that this exposes the native Scala interface to Spark.
To get this to work properly it might be necessary to first set up the path to the Hadoop libraries.
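One way to do that (assuming HADOOP_HOME points at your Hadoop install) is to add the native Hadoop libraries to the loader path:

```bash
# Silences the "unable to load native-hadoop library" warning by exposing
# the native libraries to Spark.
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
```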
- Maybe Scala is not your cup of tea and you’d prefer to use Python. No problem!
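The Python shell connects to the master in the same way:

```bash
pyspark --master spark://$(hostname):7077
```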
Of course you’ll probably want to interact with Python via a Jupyter Notebook, in which case take a look at this.
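One approach is to let PySpark launch Jupyter as its driver via a couple of environment variables (this assumes that Jupyter is already installed and on your path):

```bash
# Starts a Jupyter Notebook server in place of the plain Python REPL.
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS=notebook \
  pyspark --master spark://$(hostname):7077
```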
- Finally, if you prefer to work with R, that’s also catered for.
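Again, the launcher script is in the bin folder:

```bash
sparkR --master spark://$(hostname):7077
```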
This is a lightweight interface to Spark from R. To find out more about SparkR, check out the documentation here. For a more user-friendly experience you might want to look at sparklyr.
- When you are done you can shut down the slave and master Spark processes.
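There are matching stop scripts in the sbin folder:

```bash
$SPARK_HOME/sbin/stop-slave.sh
$SPARK_HOME/sbin/stop-master.sh
```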