Installation¶
This guide will help you install and set up FlumenData on your system.
Prerequisites¶
Required Software¶
- Python 3.6+: Required for FlumenData CLI
    - Linux/macOS: Usually pre-installed
    - Windows: Install via Microsoft Store or python.org
- Docker: Version 20.10 or higher
- Docker Compose: Version 2.0 or higher
- Git: For cloning the repository
Hardware Requirements¶
Minimum:

- 4 CPU cores
- 16 GB RAM
- 20 GB free disk space
Recommended:

- 8+ CPU cores
- 32 GB RAM
- 50 GB free disk space (for data storage)
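Before installing, you can sanity-check a host against these minimums. The sketch below uses only the Python standard library; the `SC_PHYS_PAGES` sysconf used to read total RAM is available on Linux (and most macOS builds), and the thresholds simply mirror the minimum figures above:

```python
import os

# Minimums from this guide
MIN_CORES = 4
MIN_RAM_GB = 16

cores = os.cpu_count() or 0

# Total physical memory via POSIX sysconf (Linux/macOS)
ram_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3

print(f"CPU cores: {cores} (minimum {MIN_CORES})")
print(f"RAM: {ram_gb:.1f} GB (minimum {MIN_RAM_GB})")
if cores < MIN_CORES or ram_gb < MIN_RAM_GB:
    print("Warning: host is below the minimum requirements")
```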
Operating System¶
FlumenData has been tested on:

- Linux (Ubuntu 20.04+, Debian 11+, RHEL 8+)
- macOS (11.0+)
- Windows (via WSL2)
Installation Steps¶
1. Clone the Repository¶
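The clone command itself is missing from this section. It presumably follows the standard pattern below; the repository URL is a placeholder, so substitute the actual remote (the target directory name `FlumenData` matches the one removed during uninstallation later in this guide):

```shell
git clone <repository-url> FlumenData
cd FlumenData
```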
2. Verify Docker Installation¶
```shell
# Check Docker version
docker --version

# Check Docker Compose version
docker compose version

# Test Docker is running
docker ps
```
3. Configure Environment Variables¶
FlumenData uses a .env file for configuration. Create it from the template:
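The copy command was lost here; conventionally (and assuming the template is named `.env.example`, as the next paragraph suggests):

```shell
cp .env.example .env
```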
If no .env.example exists, the Makefile will generate default values. Common variables:
```bash
# PostgreSQL
POSTGRES_USER=flumen
POSTGRES_PASSWORD=flumen123
POSTGRES_DB=flumendata
POSTGRES_PORT=5432

# MinIO
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin123
MINIO_BUCKET=lakehouse

# Hive Metastore
HIVE_METASTORE_URI=thrift://hive-metastore:9083

# Spark
SPARK_MASTER_HOST=spark-master
SPARK_MASTER_PORT=7077

# Delta Lake
DELTA_VERSION=4.0.0
SCALA_BINARY_VERSION=2.13
```
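Tools outside Docker Compose can consume the same file. A minimal sketch of how `KEY=VALUE` lines like those above can be parsed (illustrative only, not FlumenData's implementation):

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

sample = """\
# PostgreSQL
POSTGRES_USER=flumen
POSTGRES_PORT=5432
"""
print(parse_env(sample))
# → {'POSTGRES_USER': 'flumen', 'POSTGRES_PORT': '5432'}
```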
4. Initialize the Environment¶
Run the complete initialization process:
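The exact command is missing from this page. Given the `python3 flumen` CLI used throughout this guide, it is most likely a single subcommand; the name `init` below is an assumption, so verify it against the CLI's built-in help:

```shell
# Assumed subcommand name; check the CLI's help output for the real one
python3 flumen init
```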
This command will:

1. Generate all configuration files
2. Build custom Docker images (Hive, Spark)
3. Start Tier 0 services (PostgreSQL, MinIO)
4. Initialize MinIO buckets
5. Start Tier 1 services (Hive Metastore, Spark cluster)
6. Run health checks
7. Display environment summary
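The tier ordering above (foundation services first, then the services that depend on them) can be sketched in a few lines. The service names mirror this guide, but the orchestration code is illustrative, not FlumenData's implementation:

```python
# Tiered startup: a tier starts only after the previous tier is healthy
TIERS = [
    ["postgres", "minio"],           # Tier 0: foundation
    ["hive-metastore", "spark"],     # Tier 1: data platform
]

def is_healthy(service: str) -> bool:
    # Placeholder: a real check would probe the container's healthcheck
    return True

def start(service: str) -> None:
    print(f"starting {service}")

for tier, services in enumerate(TIERS):
    for svc in services:
        start(svc)
    if not all(is_healthy(svc) for svc in services):
        raise RuntimeError(f"tier {tier} failed its health checks")
    print(f"[tier{tier}] All services healthy")
```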
Expected Output:
```text
[config] Generating all configuration files...
[tier0] Starting foundation services...
[tier0] All services healthy
[minio] Creating lakehouse bucket...
[tier1] Starting data platform services...
[tier1] All services healthy
[summary] Environment is ready!
```
5. Verify Installation¶
Check that all services are running:
```shell
# View all containers
python3 flumen ps

# Check health status
python3 flumen health

# Display environment summary
python3 flumen summary
```
6. Access Web Interfaces¶
Open your browser and visit:
- Spark Master UI: http://localhost:8080
- MinIO Console: http://localhost:9001
    - Username: `minioadmin`
    - Password: `minioadmin123`
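A quick way to confirm both UIs respond from the host, using only the Python standard library (the ports are this guide's defaults; adjust if you changed them in `.env`):

```python
import urllib.request

# Default ports from this guide
UIS = {
    "Spark Master UI": "http://localhost:8080",
    "MinIO Console": "http://localhost:9001",
}

for name, url in UIS.items():
    try:
        with urllib.request.urlopen(url, timeout=3) as resp:
            print(f"{name}: HTTP {resp.status}")
    except OSError as exc:  # URLError and HTTPError both subclass OSError
        print(f"{name}: unreachable ({exc})")
```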
Post-Installation¶
Test the Installation¶
Run integration tests to verify everything works:
```shell
# Test all services
python3 flumen test

# Test specific tier
python3 flumen test --tier 0   # PostgreSQL, MinIO
python3 flumen test --tier 1   # Hive Metastore, Spark
```
Create Your First Database¶
```shell
# Open Spark SQL shell
python3 flumen shell-spark-sql
```

```sql
-- Create a database
CREATE DATABASE my_database
LOCATION 's3a://lakehouse/warehouse/my_database.db';

-- Verify it was created
SHOW DATABASES;
```
Troubleshooting Installation¶
Docker permission denied¶
If you get permission errors:
```shell
# Add your user to docker group (Linux)
sudo usermod -aG docker $USER
newgrp docker

# Verify
docker ps
```
Port already in use¶
If ports are already in use, update the .env file:
```shell
# Example: Change PostgreSQL port in .env
POSTGRES_PORT=5433

# Regenerate configuration
python3 flumen config

# Restart services
python3 flumen restart
```
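The `.env` edit can also be done non-interactively. A sketch using GNU `sed` against a stand-in file (on a real checkout you would target `.env` itself):

```shell
# Stand-in .env with the default port (illustration only)
printf 'POSTGRES_PORT=5432\n' > /tmp/flumen-example.env

# Rewrite the port in place (GNU sed syntax)
sed -i 's/^POSTGRES_PORT=.*/POSTGRES_PORT=5433/' /tmp/flumen-example.env

# Confirm the change
grep '^POSTGRES_PORT=' /tmp/flumen-example.env   # prints POSTGRES_PORT=5433
```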
Services not starting¶
Check Docker resources:
```shell
# View Docker resource usage
docker stats

# View Docker system info
docker system info | grep -i "memory\|cpus"
```
Increase Docker Desktop resources if needed:

- Settings → Resources → Memory: 16 GB minimum
- Settings → Resources → CPUs: 4 cores minimum
Slow initialization¶
First-time startup downloads Docker images and JAR dependencies:
- Docker images: ~2 GB (Spark, Hive, PostgreSQL, MinIO)
- JAR dependencies: ~500 MB (Delta Lake, Hadoop AWS, etc.)
Subsequent starts are much faster as everything is cached.
Next Steps¶
- Quick Start Guide - Create your first Delta table
- Architecture Overview - Understand FlumenData components
- Configuration Guide - Customize your setup
Uninstallation¶
To completely remove FlumenData:
```shell
# Stop and remove all containers and volumes
python3 flumen clean

# Remove Docker images
docker rmi flumendata/hive:standalone-metastore-4.1.0
docker rmi flumendata/spark:4.0.1-health

# Remove cloned directory
cd ..
rm -rf FlumenData
```
> **Data Loss**
>
> The `python3 flumen clean` command permanently deletes all data stored in Docker volumes. Export any important data before running this command.