FlumenData¶

Composable, Docker Compose–based Lakehouse with Spark 4, Delta Lake 4, Trino, Superset, and MinIO.


Project Status

Tier 0 is validated: PostgreSQL and MinIO expose healthchecks, named volumes, and generated config under /config.
Tier 1 is operational: Apache Spark 4.0.1, Hive Metastore 4.1.0, and Delta Lake 4.0 are deployed and tested.
Tier 2 and Tier 3 are live: JupyterLab, Trino, and Superset are ready for demos.

Quickstart¶

# 1) Clone the repository
git clone https://github.com/lucianomauda/FlumenData.git
cd FlumenData

# 2) Initialize the complete environment
make init

# 3) Verify all services are healthy
make health

# 4) View environment summary
make summary

Architecture¶

FlumenData implements a lakehouse architecture organized into four tiers:

graph TD
    subgraph Tier0
        MINIO[MinIO S3]
        POSTGRES[PostgreSQL]
    end
    subgraph Tier1
        SPARK[Spark 4.0.1]
        HIVE[Hive Metastore]
        DELTA[Delta Lake Tables]
    end
    subgraph Tier2
        JUPYTER[JupyterLab]
    end
    subgraph Tier3
        TRINO[Trino]
        SUPERSET[Superset]
    end

    MINIO --> DELTA
    POSTGRES --> HIVE
    HIVE --> SPARK
    SPARK --> DELTA
    TRINO --> HIVE
    TRINO --> MINIO
    SUPERSET --> TRINO
    JUPYTER --> SPARK
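
As a way to read the diagram, the sketch below copies its arrows into a plain edge list and walks them to answer "what does this service ultimately point to?". The node names and edges are taken from the diagram; the `downstream` helper is hypothetical and is not part of FlumenData.

```python
from typing import Dict, List, Set, Tuple

# Arrows copied from the architecture diagram (source, target).
EDGES: List[Tuple[str, str]] = [
    ("MINIO", "DELTA"),
    ("POSTGRES", "HIVE"),
    ("HIVE", "SPARK"),
    ("SPARK", "DELTA"),
    ("TRINO", "HIVE"),
    ("TRINO", "MINIO"),
    ("SUPERSET", "TRINO"),
    ("JUPYTER", "SPARK"),
]


def downstream(node: str, edges: List[Tuple[str, str]] = EDGES) -> Set[str]:
    """Return every node reachable from `node` by following arrows."""
    graph: Dict[str, List[str]] = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)

    seen: Set[str] = set()
    stack = [node]
    while stack:
        for nxt in graph.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


if __name__ == "__main__":
    for name in ("SUPERSET", "JUPYTER", "POSTGRES"):
        print(name, "->", sorted(downstream(name)))
```

For example, following the arrows from SUPERSET passes through TRINO to both HIVE and MINIO, which matches the intent that dashboards query Delta tables through Trino.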

Technology Stack¶

Storage Layer:
MinIO - S3-compatible object storage for the data lake
Delta Lake 4.0 - ACID table format with time travel capabilities

Metadata Layer:
Hive Metastore 4.1.0 - Industry-standard catalog (2-level namespace: database.table)
PostgreSQL - Backend for Hive Metastore metadata

Compute Layer:
Apache Spark 4.0.1 - Distributed query and processing engine (Master + 2 Workers)

Analytics Layer:
JupyterLab - Browser-based PySpark IDE baked into the stack

SQL & BI Layer:
Trino - Distributed SQL gateway across the lakehouse
Apache Superset - Dashboards, charts, and SQL Lab

Project Structure¶

/FlumenData/
├── config/             # Rendered configuration (auto-generated, do not edit)
├── docker/             # Custom Dockerfiles
├── docs/               # MkDocs Material documentation (EN + PT)
├── makefiles/          # Service-specific Makefile modules
├── templates/          # Configuration templates
├── .env                # Environment variables
├── docker-compose.tier0.yml  # Foundation services
├── docker-compose.tier1.yml  # Data platform services
└── Makefile            # Main orchestration

Services¶

Tier 0 - Foundation¶

PostgreSQL 17.6 – Relational metadata store (image: postgres:17.6-alpine3.22)
MinIO – S3-compatible object storage (image: minio/minio:RELEASE.2025-09-07T16-13-09Z)

Tier 1 - Data Platform¶

Hive Metastore 4.1.0 – Lakehouse catalog (custom image: flumendata/hive:standalone-metastore-4.1.0)
Apache Spark 4.0.1 – Distributed compute engine (custom image: flumendata/spark:4.0.1-health)

Tier 2 - Analytics & Development¶

JupyterLab (Spark 4.0.1) – Ready-to-use PySpark IDE (custom image: flumendata/jupyterlab:spark-4.0.1)

Tier 3 - SQL & BI¶

Trino 450 – Federated SQL query engine (image: trinodb/trino:450)
Apache Superset 5.0.0 – BI dashboards and charts (custom image: flumendata/superset:5.0.0)

Key Features¶

Delta Lake Integration¶

ACID transactions on object storage
Time travel (historical queries)
Schema evolution
Unified batch and streaming
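
To make the time travel feature concrete, here is a minimal sketch that builds the Delta Lake SQL for reading an older snapshot of a table. `VERSION AS OF` and `TIMESTAMP AS OF` are Delta Lake SQL syntax; the helper name `time_travel_query` and any table names passed to it are hypothetical.

```python
from typing import Optional


def time_travel_query(
    table: str,
    version: Optional[int] = None,
    timestamp: Optional[str] = None,
) -> str:
    """Build a SELECT that reads a Delta table at a past version or timestamp.

    Exactly one of `version` or `timestamp` must be given. The table name is
    inserted as-is, so only pass trusted identifiers.
    """
    if (version is None) == (timestamp is None):
        raise ValueError("pass exactly one of version or timestamp")
    if version is not None:
        return f"SELECT * FROM {table} VERSION AS OF {int(version)}"
    if "'" in timestamp:
        raise ValueError("timestamp must not contain quotes")
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


if __name__ == "__main__":
    print(time_travel_query("demo.events", version=0))
    print(time_travel_query("demo.events", timestamp="2025-01-01 00:00:00"))
```

In a PySpark or JupyterLab session the result would typically be passed to `spark.sql(...)`, assuming `demo.events` (a placeholder) is an existing Delta table.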

Hive Metastore Catalog¶

2-level namespace (database.table)
PostgreSQL backend for reliability
Compatible with Spark, Presto, Trino
Standard Thrift API (port 9083)
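
The two points above that clients trip over most often are the 2-level naming and the Thrift port. The sketch below validates a `database.table` name and checks that a TCP connection to the metastore port can be opened. Both helper names are hypothetical, and the host to probe depends on where the script runs (an assumption, not something the stack defines here). A successful connect shows only that the port is open, not that the Thrift service is healthy.

```python
import socket
from typing import Tuple


def split_qualified_name(name: str) -> Tuple[str, str]:
    """Split 'database.table' into its two parts, rejecting other shapes."""
    parts = name.split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected database.table, got {name!r}")
    return parts[0], parts[1]


def metastore_reachable(host: str, port: int = 9083, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the metastore port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print(split_qualified_name("sales.orders"))
    # "localhost" assumes port 9083 is published to the host machine.
    print(metastore_reachable("localhost"))
```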

Spark Cluster¶

1 Master + 2 Workers
Pre-configured for Delta Lake
S3A integration with MinIO
Ivy cache for fast dependency resolution
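
The cluster ships pre-configured, so none of this is required. For readers who want to see what "Delta Lake plus S3A plus Hive Metastore" usually means in Spark settings, here is a sketch that assembles the standard property names. It is not a copy of the generated files under /config; the endpoint and URI values are parameters because the actual service hostnames are not given in this page, and both function names are hypothetical.

```python
from typing import Dict


def lakehouse_spark_conf(
    s3_endpoint: str,
    access_key: str,
    secret_key: str,
    metastore_uri: str,
) -> Dict[str, str]:
    """Return Spark properties for Delta Lake on an S3-compatible store."""
    return {
        # Enable Delta Lake SQL extensions and catalog.
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        # Point the S3A filesystem at the object store.
        "spark.hadoop.fs.s3a.endpoint": s3_endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # MinIO is normally addressed path-style rather than by bucket subdomain.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        # Thrift endpoint of the Hive Metastore.
        "spark.hadoop.hive.metastore.uris": metastore_uri,
    }


def build_session(conf: Dict[str, str], app_name: str = "flumendata-sketch"):
    """Create a SparkSession from the properties (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    builder = SparkSession.builder.appName(app_name).enableHiveSupport()
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

A call might look like `lakehouse_spark_conf("http://minio:9000", "minioadmin", "minioadmin123", "thrift://hive-metastore:9083")`, where both hostnames and the S3 API port are assumptions, and the credentials are assumed to match the MinIO console login listed under Web Interfaces.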

Make Commands¶

Initialization¶

make init          # Complete environment setup
make config        # Generate all configuration files
make up            # Start all services

Service Management¶

make up-tier0      # Start foundation services
make up-tier1      # Start data platform services
make down          # Stop all services
make restart       # Restart all services

Health & Validation¶

make health        # Check all services health
make health-tier0  # Check Tier 0 services
make health-tier1  # Check Tier 1 services
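
Because every service defines a Docker healthcheck, the same status can also be read straight from Docker. The sketch below is not how `make health` is implemented (that is not shown here); it simply asks `docker inspect` for each container's health state. The function names are hypothetical, and container names must be taken from `make ps`.

```python
import subprocess
from typing import Callable, Dict, Iterable


def docker_health(container: str) -> str:
    """Return the Docker health status of a container, or 'missing'."""
    try:
        result = subprocess.run(
            ["docker", "inspect", "--format", "{{.State.Health.Status}}", container],
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        # The docker CLI is not installed or not on PATH.
        return "missing"
    return result.stdout.strip() if result.returncode == 0 else "missing"


def not_healthy(
    containers: Iterable[str],
    probe: Callable[[str], str] = docker_health,
) -> Dict[str, str]:
    """Map each container that is not 'healthy' to its reported status."""
    statuses = {name: probe(name) for name in containers}
    return {name: status for name, status in statuses.items() if status != "healthy"}
```

An empty result from `not_healthy([...])` means every listed container reported `healthy`.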

Testing¶

make test          # Run all tests
make test-tier0    # Test foundation services
make test-tier1    # Test data platform services

Verification¶

make verify-hive   # Verify Hive Metastore setup
make summary       # Display environment summary
make ps            # Show running containers

Logs¶

make logs          # Show logs for all services
make logs-tier0    # Show Tier 0 logs
make logs-tier1    # Show Tier 1 logs
make logs-spark    # Show Spark logs
make logs-hive     # Show Hive Metastore logs

Development¶

make shell-postgres    # Open PostgreSQL shell
make shell-spark       # Open Spark shell
make shell-pyspark     # Open PySpark shell
make shell-spark-sql   # Open Spark SQL shell
make mc                # Open MinIO client

Maintenance¶

make reset         # Complete reset and reinitialize
make clean         # Stop and remove all data (DANGEROUS)

Conventions¶

All code and comments are in English
Configuration is generated via Makefile targets into /config/ - never edit rendered files manually
Every service must have a healthcheck, named volumes, and static config under /config/
Documentation is maintained in both English (docs/*.md) and Portuguese (docs/*.pt.md)
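
The bilingual documentation convention is easy to check mechanically. As a sketch (a hypothetical helper, not an existing project script), the function below lists every English page under a docs directory that has no `*.pt.md` counterpart next to it.

```python
from pathlib import Path
from typing import List, Union


def missing_translations(docs_dir: Union[str, Path]) -> List[str]:
    """Return English pages (relative paths) lacking a .pt.md sibling."""
    docs = Path(docs_dir)
    missing: List[str] = []
    for page in sorted(docs.rglob("*.md")):
        if page.name.endswith(".pt.md"):
            continue  # this is itself a Portuguese page
        counterpart = page.with_name(page.stem + ".pt.md")
        if not counterpart.exists():
            missing.append(page.relative_to(docs).as_posix())
    return missing


if __name__ == "__main__":
    for path in missing_translations("docs"):
        print("missing Portuguese version:", path)
```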

Web Interfaces¶

After running make init, access these UIs:

Spark Master UI: http://localhost:8080
MinIO Console: http://localhost:9001 (minioadmin / minioadmin123)
Buckets: lakehouse (Delta tables) and storage (staging files)
JupyterLab: http://localhost:8888 (run make token-jupyterlab for access)
Trino Console: http://localhost:${TRINO_PORT}
Superset: http://localhost:${SUPERSET_PORT} (login: admin / admin123)
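
The Trino and Superset ports are variables rather than fixed numbers, so the final URLs depend on your environment. The sketch below resolves the list above into concrete URLs. The fixed ports come from the list; `TRINO_PORT` and `SUPERSET_PORT` are read from the environment, and the script deliberately raises instead of guessing a default. The `ui_urls` helper is hypothetical.

```python
import os
from typing import Dict, Mapping


def ui_urls(env: Mapping[str, str] = os.environ, host: str = "localhost") -> Dict[str, str]:
    """Return the web UI URLs, resolving variable ports from `env`."""
    urls = {
        "Spark Master UI": f"http://{host}:8080",
        "MinIO Console": f"http://{host}:9001",
        "JupyterLab": f"http://{host}:8888",
    }
    for label, var in (("Trino Console", "TRINO_PORT"), ("Superset", "SUPERSET_PORT")):
        port = env.get(var)
        if not port:
            raise KeyError(f"{var} is not set")
        urls[label] = f"http://{host}:{port}"
    return urls


if __name__ == "__main__":
    for label, url in ui_urls().items():
        print(f"{label}: {url}")
```

Export both variables in the shell before running it, for example by sourcing the values you use in `.env` (assuming that is where they are defined).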

Roadmap¶

✅ Tier 0 – Foundation: PostgreSQL, MinIO
✅ Tier 1 – Data Platform: Spark, Hive Metastore, Delta Lake
✅ Tier 2 – Analytics & Development: JupyterLab
✅ Tier 3 – SQL & BI: Trino, Superset