A Docker-Based Lab for Big Data
As organizations become increasingly reliant on data services for their core missions, access to systems such as storage, cluster computing, and streaming technologies becomes increasingly important. These systems allow for the use and analysis of data that would be very difficult to work with using other technologies.
There are three broad categories of data systems:
- Storage. Data needs a place to live. Storage systems like HDFS or object stores (such as Amazon S3), distributed relational databases like Galera and Citus, and NoSQL databases provide the medium for information to be consumed for analytics and machine learning. Because such systems frequently require that data be processed in large volume, they also store information redundantly so that it can be read and written in parallel.
- Compute. Compute provides the brains of Big Data. Compute systems, like Hadoop MapReduce and Spark, provide the toolbox for performing scalable computation. Such resources can be used for analytics, stream processing, machine learning, and other workloads.
- Streaming. Streaming systems like Apache Kafka provide ways to integrate Big Data technologies through a common interchange. This allows for systematic enrichment and analysis, and for the integration of data from multiple sources.
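To make the streaming category more concrete, here is a minimal sketch of how a program might publish and read messages through Kafka using the kafka-python library. The broker address, topic name, and message contents are illustrative assumptions rather than part of the lab deployment described later.

# Illustrative sketch: publish and read a message through Kafka with kafka-python.
# The broker address ("localhost:9092") and topic name ("sensor-readings") are assumptions.
from kafka import KafkaProducer, KafkaConsumer

# Publish a small JSON payload to a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("sensor-readings", b'{"sensor": "thermostat-1", "temp_c": 21.5}')
producer.flush()

# Read messages back from the same topic, starting from the earliest offset.
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
for message in consumer:
    print(message.value)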
Experimenting With Integrated Technologies
In this guide, we will walk you through the deployment of a Docker-based environment that provides:
- Python 3.7, the standard Python data libraries, Spark (which has been integrated with Python so that you are able to use the pyspark shell; a short example follows this list), TensorFlow, PyTorch, and Jupyter (along with its new integrated development environment, JupyterLab)
- MinIO, an S3-compatible object store which is often used for cloud-native data storage
- Apache Kafka, a high-performance distributed streaming platform used to exchange data between systems and to build real-time streaming data pipelines that power analytics and machine learning
- ZooKeeper, a runtime dependency of Apache Kafka
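As a taste of the Python and Spark integration listed above, here is a minimal sketch of the kind of code you can run from the pyspark shell or a Jupyter notebook once the environment is up. The sample rows and column names are invented for illustration.

# Minimal PySpark sketch (illustrative data only).
from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession; inside the pyspark shell this is
# already available as the `spark` variable.
spark = SparkSession.builder.appName("lab-smoke-test").getOrCreate()

# Build a tiny DataFrame from in-memory rows.
readings = spark.createDataFrame(
    [("thermostat-1", 21.5), ("thermostat-2", 19.8)],
    ["sensor", "temp_c"],
)

# Run a simple aggregation to confirm that Spark is working.
readings.groupBy("sensor").avg("temp_c").show()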
The deployment files and a set of example notebooks which demonstrate the capabilities of the environment are located in the Oak-Tree PySpark public repository and have been validated to work on Ubuntu 18.04.2 LTS. In this article you will perform the following steps required to deploy the environment:
- Installing and using Git
- Installing Docker and Docker Compose
- Deploying and utilizing Spark, Jupyter, Kafka, and ZooKeeper as a set of integrated containers
Before following this guide, you will need access to a machine running Ubuntu 18.04.2 LTS or above.
Task: Install the Prerequisites and Deploy JupyterLab
All code text that looks like this represents terminal input/output. Any code that starts with $ signifies a command you must enter in the command line. Example command and resulting output:
$ input command
output resulting from command
1. Open up a terminal and type in the following commands to install git.
$ sudo apt update
$ sudo apt install git
If you are following this guide as part of a course, the instructor will provide the password needed at the sudo prompt. When prompted to continue [Y/n], press Y and hit Enter.
When git has finished installing, your terminal will look similar to the image below:
2. Install docker and docker-compose on your machine.
Upgrade the system packages and install docker.
$ sudo apt-get upgrade
$ sudo apt-get install docker.io
Verify that docker was installed properly.
$ sudo systemctl status docker
The output will look similar to the following:
Press Q to exit this output.
Install curl.
$ sudo apt-get install curl
Download docker-compose and save it as docker-compose in your home directory.
$ cd ~
$ sudo curl -L https://oak-tree.tech/documents/101/docker-compose-Linux-x86_64 -o ./docker-compose
Set executable permissions for docker-compose so that it can be used from the command line:
$ sudo chmod +x docker-compose
Test if docker-compose works:
$ ./docker-compose --version
docker-compose version 1.21.2, build a133471
Move the local file to /usr/local/bin:
$ sudo mv docker-compose /usr/local/bin
Validate that docker-compose is in that directory:
$ which docker-compose
/usr/local/bin/docker-compose
Execute a docker-compose command to verify it has been installed properly.
$ docker-compose --version
docker-compose version 1.21.2, build a133471
3. Accessing JupyterLab
In the home directory, use git to clone the example files to your machine:
$ cd ~
$ git clone https://code.oak-tree.tech/courseware/oak-tree/pyspark-examples.git
Navigate inside the repository and start up the docker-compose deployment:
$ cd pyspark-examples/
Execute the docker-compose.yaml file. This file will deploy ZooKeeper, Spark, Jupyter, and Kafka instances. After you run the command, a large number of logs will stream through your terminal; these are the instances being deployed.
$ sudo docker-compose up
Once the deployments have finished initializing, search the console output for the Jupyter URL. You can use the Find function by pressing CTRL+SHIFT+F in your terminal. Search for 127.0.0.1:8888 and you will find the entire Jupyter URL to access the hub. The complete Jupyter URL will look similar to http://127.0.0.1:8888/?token=0276e1837789712feay4982fh91274
Copy the URL by selecting it in the terminal and pressing CTRL+SHIFT+C, then paste it into a Firefox web browser. The UI will then present JupyterLab.
4. Accessing MinIO Storage
In the JupyterLab launcher section, navigate to Console Python 3 and click on the icon to view the in-browser terminal.
You can type env and press Enter to view all of the environment variables. Among these variables are the MinIO object storage access and secret keys. In the JupyterLab terminal, type the following command to view the MinIO variables:
$ env | grep OBJECTS
The OBJECTS_ENDPOINT URL can be used alongside the ACCESSID and SECRET to retrieve data from within Jupyter. Several of the example notebooks show how these can be used from within Spark or with a library like boto3.
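As a hedged illustration, the sketch below uses boto3 to connect to the MinIO endpoint and list the available buckets. The exact environment variable names are assumptions based on the text above; confirm them against the output of env | grep OBJECTS in your own deployment.

# Illustrative sketch: connect to MinIO through its S3-compatible API with boto3.
import os
import boto3

# The variable names below are assumptions; check `env | grep OBJECTS`
# for the exact names used in your deployment.
endpoint = os.environ["OBJECTS_ENDPOINT"]
access_key = os.environ["ACCESSID"]
secret_key = os.environ["SECRET"]

# Create an S3 client pointed at the MinIO endpoint instead of AWS.
s3 = boto3.client(
    "s3",
    endpoint_url=endpoint,
    aws_access_key_id=access_key,
    aws_secret_access_key=secret_key,
)

# List the buckets available in the object store.
for bucket in s3.list_buckets().get("Buckets", []):
    print(bucket["Name"])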
You can access the MinIO storage UI from port 9000 of your localhost: http://127.0.0.1:9000. The username and password will be the ACCESSID and SECRET defined in the Jupyter environment.