Docker
Learning Objectives
- Understand when to use Docker vs. virtual environments
- Learn basic Docker concepts and commands
- Create Dockerfiles for data analysis projects
- Develop best practices for Docker in scientific computing
Why Docker?
While virtual environments (venv, conda, renv) handle package dependencies, Docker goes further by providing:
- Complete Environment: OS, system libraries, and all dependencies
- Reproducibility: The same image gives the same environment on every machine that runs it
- System Dependencies: Handle complex system-level requirements
- Cross-Platform: Works the same on Linux, Mac, and Windows
- Production Ready: Easy transition from development to deployment
When to Use Docker vs. Virtual Environments
Use Docker when you need:
- System-level dependencies (e.g., CUDA, system libraries)
- Cross-platform compatibility
- Multiple language environments (e.g., R + Python)
- Production deployment
- Continuous Integration/Deployment (CI/CD)
Use virtual environments when you:
- Work on a single-language project
- Need a lighter-weight solution
- Want quick local development
- Have limited system resources
Docker Basics
Key Concepts
- Image: Blueprint for your environment
- Container: Running instance of an image
- Dockerfile: Instructions to build an image
- Registry: Storage for Docker images (e.g., Docker Hub)
Essential Commands
# Build an image from the Dockerfile in the current directory
docker build -t myproject .
# Run a container interactively
docker run -it myproject
# List running containers (add -a to include stopped ones)
docker ps
# Stop a running container
docker stop container_id
# Remove a stopped container
docker rm container_id
Dockerfile for Data Analysis
Basic Structure
# Base image
FROM python:3.9
# Set working directory
WORKDIR /app
# Copy requirements
COPY requirements.txt .
# Install dependencies
RUN pip install -r requirements.txt
# Copy project files
COPY . .
# Command to run
CMD ["python", "analysis.py"]
R + Python Environment
# Use rocker/tidyverse as base
FROM rocker/tidyverse:4.2.0
# Install Python
RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*
# Install Python packages
COPY requirements.txt .
RUN pip3 install -r requirements.txt
# Install R packages
RUN R -e "install.packages(c('reticulate', 'renv'))"
# Copy project files
COPY . .
# Set up renv
RUN R -e "renv::restore()"
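Best Practices
1. Layer Optimization
# Good: update, install, and clean up in a single RUN so the package index
# is never cached stale and the apt lists do not bloat the layer
RUN apt-get update && apt-get install -y \
    package1 \
    package2 \
    && rm -rf /var/lib/apt/lists/*
# Bad: separate RUN commands add layers, and a cached "apt-get update"
# layer can go stale before the installs run
RUN apt-get update
RUN apt-get install -y package1
RUN apt-get install -y package2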
2. Use .dockerignore
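A .dockerignore file in the root of the build context lists paths that docker build should leave out of the context. This keeps images smaller and stops unrelated file changes from invalidating the cache for COPY . . steps. The snippet below is a minimal sketch that writes one from the shell. Every entry (data/raw, results, and so on) is an assumption about a typical analysis project rather than a name taken from this guide, so adjust the list to your own layout. Note that the command overwrites any existing .dockerignore.

```shell
# Write a starter .dockerignore into the current directory (the build context).
# All entries are illustrative assumptions about a typical analysis project.
cat > .dockerignore <<'EOF'
# Version control and editor files
.git
.gitignore
.vscode

# Python caches and local virtual environments
**/__pycache__
**/*.pyc
.venv

# R session files and the renv package library (rebuilt by renv::restore())
.Rhistory
.RData
renv/library

# Large or generated files that do not belong in the image
data/raw
results
EOF

# Show what will be excluded from the build context
cat .dockerignore
```

Data that the analysis needs at run time can then be supplied with a bind mount instead of being baked into the image.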
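3. Multi-stage Builds
# Build stage: install dependencies into /root/.local
FROM python:3.9 AS builder
COPY requirements.txt .
RUN pip install --user -r requirements.txt
# Runtime stage: copy only the installed packages into a slim image
FROM python:3.9-slim
WORKDIR /app
COPY --from=builder /root/.local /root/.local
# Put console scripts installed with --user on the PATH
ENV PATH=/root/.local/bin:$PATH
COPY . .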
Package Development Workflow
1. Development Container
# Dockerfile.dev
FROM python:3.9
WORKDIR /app
# Install dev tools
COPY requirements-dev.txt .
RUN pip install -r requirements-dev.txt
# Declare /app as a mount point; bind-mount the source at run time,
# e.g. docker run -it -v "$(pwd)":/app <image>
VOLUME /app
# Keep container running
CMD ["bash"]
2. Testing Container
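A testing image usually installs the development requirements plus the package itself and runs the test suite as its default command. The sketch below writes such a file from the shell. It rests on assumptions this guide does not state: that pytest is listed in requirements-dev.txt, that tests live in a tests/ directory, and that the file is called Dockerfile.test (a hypothetical name).

```shell
# Write a hypothetical Dockerfile.test next to Dockerfile.dev.
# Assumes pytest is in requirements-dev.txt and tests live in tests/.
cat > Dockerfile.test <<'EOF'
FROM python:3.9
WORKDIR /app

# Install dev/test dependencies first so this layer is cached between code edits
COPY requirements-dev.txt .
RUN pip install --no-cache-dir -r requirements-dev.txt

# Copy the source and install the package itself
COPY . .
RUN pip install --no-cache-dir .

# Run the test suite by default
CMD ["pytest", "tests/"]
EOF

# Build and run it with (requires a running Docker daemon):
#   docker build -f Dockerfile.test -t myproject-test .
#   docker run --rm myproject-test
```

Because the tests are the image's default command, the exit code of docker run reflects whether the suite passed, which is what a CI job needs.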
3. Production Container
# Dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY . .
RUN pip install .
CMD ["python", "-m", "mypackage"]
Recommended Workflow
- Initial Setup
- Development Workflow
- Testing Workflow
- Release Workflow
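The four stages above might map onto commands like the following. This is a sketch under assumptions: the image tags (myproject-dev, myproject-test), the Dockerfile.test file name, and the function names are all hypothetical. The functions are only defined here, not run, because they need a Docker daemon and a real project directory.

```shell
# Hypothetical helper functions, one per workflow stage.
# Source this file, then call e.g. `dev_shell` from the project root.

# Initial setup: build the development image from Dockerfile.dev
setup_dev_image() {
  docker build -f Dockerfile.dev -t myproject-dev .
}

# Development: open a shell with the source tree bind-mounted at /app,
# so edits on the host are visible inside the container immediately
dev_shell() {
  docker run --rm -it -v "$(pwd)":/app myproject-dev
}

# Testing: build the (hypothetical) test image and run its default command
run_tests() {
  docker build -f Dockerfile.test -t myproject-test . &&
    docker run --rm myproject-test
}

# Release: build the production image from Dockerfile and tag it with a
# version supplied by the caller, e.g. `release 1.0.0`
release() {
  version="${1:?usage: release <version>}"
  docker build -t "myproject:${version}" .
}
```

Keeping the commands in one sourced file means each collaborator runs the same build and run flags at every stage.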
Example docker-compose.yml
version: '3'
services:
  app:
    build:
      context: .
      dockerfile: Dockerfile.dev
    volumes:
      - .:/app
    ports:
      - "8888:8888"  # For Jupyter
    environment:
      - PYTHONPATH=/app
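One way to drive this Compose file from the shell is sketched below. Because Dockerfile.dev sets bash as its command, a plain docker compose up typically exits at once: no terminal is attached, so bash has nothing to read. The sketch uses docker compose run instead. It assumes the Compose v2 plugin (docker compose rather than docker-compose) and that jupyterlab is listed in requirements-dev.txt. The function names are hypothetical, and the functions are defined but not run.

```shell
# Hypothetical wrappers around the Compose file above; defined here, not run.

# Open an interactive shell in the "app" service. `run` attaches a terminal,
# and --service-ports publishes the ports listed in the Compose file (8888).
compose_shell() {
  docker compose run --rm --service-ports app
}

# Start Jupyter instead of bash. Assumes jupyterlab is listed in
# requirements-dev.txt, which this guide does not state.
compose_jupyter() {
  docker compose run --rm --service-ports app \
    jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
}

# Stop and remove the containers and network created for this project
compose_down() {
  docker compose down
}
```

Binding Jupyter to 0.0.0.0 matters here: a server listening only on the container's localhost would not be reachable through the published port.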
Common Issues and Solutions
- Large Image Sizes
- Slow Builds
- Permission Issues
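One way to investigate each of these issues is sketched below. These are common diagnostics and workarounds, not remedies stated in this guide. The function names are hypothetical, and myproject is the image tag used in the earlier build command. The functions are defined but not run.

```shell
# Hypothetical diagnostics, one per issue above; defined here, not run.
# "myproject" is the image tag used earlier in this guide.

# Large image sizes: show the total size, then the size of each layer
inspect_size() {
  docker image ls myproject
  docker history myproject
}

# Slow builds: print plain build logs so cached steps (marked CACHED)
# can be told apart from steps that re-ran
build_verbose() {
  docker build --progress=plain -t myproject .
}

# Permission issues: run as the host user so files written to the
# bind mount are not owned by root
run_as_host_user() {
  docker run --rm -it --user "$(id -u):$(id -g)" -v "$(pwd)":/app myproject
}
```

If the layer listing shows one oversized step, a slim base image or a multi-stage build usually helps. If the build log shows dependency installation re-running after every code edit, check that the requirements file is copied and installed before the rest of the source.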