Installation

This guide will help you install and set up FlumenData on your system.

Prerequisites

Required Software

- Python 3.6+: Required for the FlumenData CLI
  - Linux/macOS: Usually pre-installed
  - Windows: Install via Microsoft Store or python.org
- Docker: Version 20.10 or higher
- Docker Compose: Version 2.0 or higher
- Git: For cloning the repository
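
The software prerequisites can be checked up front with a short script. The following is a minimal sketch that is not part of FlumenData; it uses only the Python standard library, and the names `REQUIRED_TOOLS`, `missing_tools`, and `python_ok` are illustrative. It only confirms that the executables are on `PATH`, not that their versions are recent enough.

```python
import shutil
import sys

# Executables named in the prerequisites list. Docker Compose v2 ships as a
# Docker plugin ("docker compose"), so it has no separate executable to look up.
REQUIRED_TOOLS = ["docker", "git"]
MIN_PYTHON = (3, 6)


def missing_tools(tools, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the interpreter meets the minimum (major, minor) version."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    print("Python OK:", python_ok())
    missing = missing_tools(REQUIRED_TOOLS)
    print("Missing tools:", ", ".join(missing) if missing else "none")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is what you want.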

Hardware Requirements

Minimum:
- 4 CPU cores
- 16 GB RAM
- 20 GB free disk space

Recommended:
- 8+ CPU cores
- 32 GB RAM
- 50 GB free disk space (for data storage)
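
A rough way to compare a machine against the minimum figures is sketched below. It is an illustration, not a FlumenData tool: the function names are hypothetical, the RAM lookup relies on `os.sysconf` and so returns `None` on platforms without it (such as native Windows), and disk space is measured for the current directory only. Operating systems often report slightly less memory than the nominal capacity, so treat a near miss on RAM as a prompt to check manually.

```python
import os
import shutil

# Minimum figures from the hardware requirements list.
MIN_CORES = 4
MIN_RAM_GB = 16
MIN_DISK_GB = 20


def check_hardware(cores, ram_gb, free_disk_gb):
    """Return a list of problems; an empty list means every known value passes."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"{cores} CPU cores found, {MIN_CORES} required")
    # ram_gb is None when the platform gives no way to read it; skip the check.
    if ram_gb is not None and ram_gb < MIN_RAM_GB:
        problems.append(f"{ram_gb:.1f} GB RAM found, {MIN_RAM_GB} GB required")
    if free_disk_gb < MIN_DISK_GB:
        problems.append(f"{free_disk_gb:.1f} GB free disk, {MIN_DISK_GB} GB required")
    return problems


def total_ram_gb():
    """Best-effort physical RAM in GB, or None if it cannot be determined."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1024**3
    except (AttributeError, ValueError, OSError):
        return None


if __name__ == "__main__":
    free_gb = shutil.disk_usage(".").free / 1024**3
    found = check_hardware(os.cpu_count() or 0, total_ram_gb(), free_gb)
    for problem in found:
        print("WARNING:", problem)
    if not found:
        print("Minimum hardware requirements met")
```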

Operating System

FlumenData has been tested on:
- Linux (Ubuntu 20.04+, Debian 11+, RHEL 8+)
- macOS (11.0+)
- Windows (via WSL2)

Installation Steps

1. Clone the Repository

git clone https://github.com/lucianomauda/FlumenData.git
cd FlumenData

2. Verify Docker Installation

# Check Docker version
docker --version

# Check Docker Compose version
docker compose version

# Test Docker is running
docker ps
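
To compare the reported versions against the minimums listed under Required Software (Docker 20.10, Docker Compose 2.0), the output of the first two commands can be parsed. The sketch below assumes the output contains a dotted version number, as in `Docker version 24.0.7, build afdd53b`; the helper names are hypothetical and the script is not part of FlumenData.

```python
import re
import subprocess


def parse_version(text):
    """Extract the first dotted version number in `text` as a tuple of three ints."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    # A missing patch component (e.g. "20.10") is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def meets_minimum(output, minimum):
    """Return True when the version found in `output` is at least `minimum`."""
    return parse_version(output) >= minimum


def command_output(args):
    """Run a command and return its combined stdout and stderr as text."""
    completed = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return completed.stdout


if __name__ == "__main__":
    checks = [
        (["docker", "--version"], (20, 10, 0)),
        (["docker", "compose", "version"], (2, 0, 0)),
    ]
    for args, minimum in checks:
        try:
            ok = meets_minimum(command_output(args), minimum)
            status = "OK" if ok else "older than required"
        except (FileNotFoundError, ValueError):
            status = "not found or version not recognized"
        print(" ".join(args), "->", status)
```

Versions are compared as integer tuples rather than strings, so `20.10` correctly ranks above `20.9`.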

3. Configure Environment Variables

FlumenData uses a .env file for configuration. Create it from the template:

# If .env.example exists
cp .env.example .env

# Edit with your preferred editor
nano .env

If no .env.example exists, the Makefile will generate default values. Common variables:

# PostgreSQL
POSTGRES_USER=flumen
POSTGRES_PASSWORD=flumen123
POSTGRES_DB=flumendata
POSTGRES_PORT=5432

# MinIO
MINIO_ROOT_USER=minioadmin
MINIO_ROOT_PASSWORD=minioadmin123
MINIO_BUCKET=lakehouse

# Hive Metastore
HIVE_METASTORE_URI=thrift://hive-metastore:9083

# Spark
SPARK_MASTER_HOST=spark-master
SPARK_MASTER_PORT=7077

# Delta Lake
DELTA_VERSION=4.0.0
SCALA_BINARY_VERSION=2.13
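
Because the sample values above include well-known passwords, it can be worth checking whether a `.env` file still uses them before exposing the services beyond your own machine. The following is a minimal sketch with hypothetical helper names. It handles only plain `KEY=VALUE` lines and `#` comments (no quoting or `export` prefixes), which is an assumption about the file format rather than a description of how FlumenData itself reads the file.

```python
import os

# Sample credentials shown in this guide; flag them if they are left unchanged.
SAMPLE_DEFAULTS = {
    "POSTGRES_PASSWORD": "flumen123",
    "MINIO_ROOT_PASSWORD": "minioadmin123",
}


def parse_env(text):
    """Parse KEY=VALUE lines into a dict, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may themselves contain "=".
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def unchanged_defaults(values, defaults=SAMPLE_DEFAULTS):
    """Return the sorted keys whose value still equals the sample default."""
    return sorted(key for key, default in defaults.items() if values.get(key) == default)


if __name__ == "__main__":
    if os.path.exists(".env"):
        with open(".env", encoding="utf-8") as handle:
            flagged = unchanged_defaults(parse_env(handle.read()))
        print("Still using sample values:", ", ".join(flagged) or "none")
    else:
        print("No .env file in the current directory")
```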

4. Initialize the Environment

Run the complete initialization process:

python3 flumen init

This command will:
1. Generate all configuration files
2. Build custom Docker images (Hive, Spark)
3. Start Tier 0 services (PostgreSQL, MinIO)
4. Initialize MinIO buckets
5. Start Tier 1 services (Hive Metastore, Spark cluster)
6. Run health checks
7. Display environment summary

Expected Output:

[config] Generating all configuration files...
[tier0] Starting foundation services...
[tier0] All services healthy
[minio] Creating lakehouse bucket...
[tier1] Starting data platform services...
[tier1] All services healthy
[summary] Environment is ready!

5. Verify Installation

Check that all services are running:

# View all containers
python3 flumen ps

# Check health status
python3 flumen health

# Display environment summary
python3 flumen summary

6. Access Web Interfaces

Open your browser and visit:

- Spark Master UI: http://localhost:8080
- MinIO Console: http://localhost:9001
  - Username: minioadmin
  - Password: minioadmin123
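
If you prefer to confirm from a terminal that both interfaces respond, a small standard-library script is enough. This is a sketch under the assumption that the two URLs above are the ones published on your host; `ENDPOINTS` and `is_reachable` are illustrative names, not part of the FlumenData CLI. Any HTTP response, including an error status, is counted as reachable, because it shows that a server is listening.

```python
import http.client
import urllib.error
import urllib.request

# URLs taken from the list above.
ENDPOINTS = {
    "Spark Master UI": "http://localhost:8080",
    "MinIO Console": "http://localhost:9001",
}

# The targets are local, so bypass any proxy configured in the environment.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_reachable(url, timeout=3.0):
    """Return True if the URL answers with any HTTP response at all."""
    try:
        with _OPENER.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a success status.
        return True
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure, or a malformed reply.
        return False


if __name__ == "__main__":
    for name, url in ENDPOINTS.items():
        state = "reachable" if is_reachable(url) else "NOT reachable"
        print(f"{name} ({url}): {state}")
```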

Post-Installation

Test the Installation

Run integration tests to verify everything works:

# Test all services
python3 flumen test

# Test specific tier
python3 flumen test --tier 0    # PostgreSQL, MinIO
python3 flumen test --tier 1    # Hive Metastore, Spark

Create Your First Database

# Open the Spark SQL shell (run this in your terminal)
python3 flumen shell-spark-sql

-- Inside the Spark SQL shell: create a database
CREATE DATABASE my_database
LOCATION 's3a://lakehouse/warehouse/my_database.db';

-- Verify it was created
SHOW DATABASES;

Troubleshooting Installation

Docker permission denied

If you get permission errors:

# Add your user to docker group (Linux)
sudo usermod -aG docker $USER
newgrp docker

# Verify
docker ps

Port already in use

If ports are already in use, update the .env file:

# Example: Change PostgreSQL port
POSTGRES_PORT=5433

# Regenerate configuration
python3 flumen config

# Restart services
python3 flumen restart
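
To find out which port is taken before editing .env, you can probe the ports mentioned in this guide. The sketch below is illustrative and uses only the standard library; the port list is collected from the values shown earlier (5432, 7077, 8080, 9001, 9083), and whether each one is actually published on the host depends on your configuration. Run it while FlumenData is stopped, since its own running services would also show up as in use.

```python
import socket

# Ports that appear in this guide. Which of them are published on the host
# is an assumption; adjust the list to match your .env file.
PORTS = {
    "PostgreSQL": 5432,
    "Spark Master": 7077,
    "Spark Master UI": 8080,
    "MinIO Console": 9001,
    "Hive Metastore": 9083,
}


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already accepting TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for name, port in PORTS.items():
        state = "in use" if port_in_use(port) else "free"
        print(f"{name} ({port}): {state}")
```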

Services not starting

Check Docker resources:

# View Docker resource usage
docker stats

# View Docker system info
docker system info | grep -iE "memory|cpus"

Increase Docker Desktop resources if needed:
- Settings → Resources → Memory: 16 GB minimum
- Settings → Resources → CPUs: 4 cores minimum

Slow initialization

First-time startup downloads Docker images and JAR dependencies:

- Docker images: ~2 GB (Spark, Hive, PostgreSQL, MinIO)
- JAR dependencies: ~500 MB (Delta Lake, Hadoop AWS, etc.)

Subsequent starts are much faster because the images and JARs are cached locally.

Next Steps

- Quick Start Guide: Create your first Delta table
- Architecture Overview: Understand FlumenData components
- Configuration Guide: Customize your setup

Uninstallation

To completely remove FlumenData:

# Stop and remove all containers and volumes
python3 flumen clean

# Remove Docker images
docker rmi flumendata/hive:standalone-metastore-4.1.0
docker rmi flumendata/spark:4.0.1-health

# Remove cloned directory
cd ..
rm -rf FlumenData

Data Loss

The python3 flumen clean command permanently deletes all data stored in Docker volumes. Export any important data before running this command.