FlumenData

Composable, Docker Compose–based Lakehouse with Spark 4, Delta Lake 4, Trino, Superset, and MinIO.

Project Status

Tier 0 is validated: PostgreSQL and MinIO expose healthchecks, named volumes, and generated config under /config. Tier 1 is operational: Apache Spark 4.0.1, Hive Metastore 4.1.0, and Delta Lake 4.0 are deployed and tested. Tier 2 & Tier 3 are live: JupyterLab, Trino, and Superset are ready for demos.

Quickstart

# 1) Clone the repository
git clone https://github.com/lucianomauda/FlumenData.git
cd FlumenData

# 2) Initialize the complete environment
make init

# 3) Verify all services are healthy
make health

# 4) View environment summary
make summary
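Services can take a while to become healthy after `make init`. The sketch below polls a command until it succeeds. The `retry` helper is hypothetical (it is not part of FlumenData), and it assumes that `make health` exits with a non-zero status while any service is still unhealthy, which the documentation does not state.

```shell
#!/usr/bin/env bash
# retry <attempts> <delay-seconds> <command...>
# Hypothetical helper: runs the command until it succeeds
# or the given number of attempts is used up.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

Under that assumption, `retry 30 5 make health` would check health up to 30 times, waiting 5 seconds between checks.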

Architecture

FlumenData implements a modern lakehouse architecture organized into four tiers:

graph TD
    subgraph Tier0
        MINIO[MinIO S3]
        POSTGRES[PostgreSQL]
    end
    subgraph Tier1
        SPARK[Spark 4.0.1]
        HIVE[Hive Metastore]
        DELTA[Delta Lake Tables]
    end
    subgraph Tier2
        JUPYTER[JupyterLab]
    end
    subgraph Tier3
        TRINO[Trino]
        SUPERSET[Superset]
    end

    MINIO --> DELTA
    POSTGRES --> HIVE
    HIVE --> SPARK
    SPARK --> DELTA
    TRINO --> HIVE
    TRINO --> MINIO
    SUPERSET --> TRINO
    JUPYTER --> SPARK

Technology Stack

Storage Layer:

  • MinIO – S3-compatible object storage for the data lake
  • Delta Lake 4.0 – ACID table format with time travel capabilities

Metadata Layer:

  • Hive Metastore 4.1.0 – Industry-standard catalog (2-level namespace: database.table)
  • PostgreSQL – Backend for Hive Metastore metadata

Compute Layer:

  • Apache Spark 4.0.1 – Distributed query and processing engine (Master + 2 Workers)

Analytics Layer:

  • JupyterLab – Browser-based PySpark IDE baked into the stack

SQL & BI Layer:

  • Trino – Distributed SQL gateway across the lakehouse
  • Apache Superset – Dashboards, charts, and SQL Lab

Project Structure

/FlumenData/
├── config/             # Rendered configuration (auto-generated, do not edit)
├── docker/             # Custom Dockerfiles
├── docs/               # MkDocs Material documentation (EN + PT)
├── makefiles/          # Service-specific Makefile modules
├── templates/          # Configuration templates
├── .env                # Environment variables
├── docker-compose.tier0.yml  # Foundation services
├── docker-compose.tier1.yml  # Data platform services
└── Makefile            # Main orchestration

Services

Tier 0 - Foundation

  • PostgreSQL 17.6 – Relational metadata store Image: postgres:17.6-alpine3.22

  • MinIO – S3-compatible object storage Image: minio/minio:RELEASE.2025-09-07T16-13-09Z

Tier 1 - Data Platform

  • Hive Metastore 4.1.0 – Lakehouse catalog Custom image: flumendata/hive:standalone-metastore-4.1.0

  • Apache Spark 4.0.1 – Distributed compute engine Custom image: flumendata/spark:4.0.1-health

Tier 2 - Analytics & Development

  • JupyterLab – Browser-based PySpark IDE baked into the stack

Tier 3 - SQL & BI

  • Trino 450 – Federated SQL query engine Image: trinodb/trino:450

  • Apache Superset 5.0.0 – BI dashboards & charts Custom image: flumendata/superset:5.0.0

Key Features

Delta Lake Integration

  • ACID transactions on object storage
  • Time travel (historical queries)
  • Schema evolution
  • Unified batch and streaming
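
To make the time travel feature concrete, here is a minimal sketch that prints the Spark SQL for inspecting a Delta table's history and reading an earlier version. The helper name `delta_time_travel_sql` and the table `demo.events` are hypothetical. The statements are meant to be pasted into a Spark SQL session such as the one opened by `make shell-spark-sql`.

```shell
#!/usr/bin/env bash
# delta_time_travel_sql <database.table> <version>
# Hypothetical helper: prints Spark SQL for Delta Lake time travel.
delta_time_travel_sql() {
  local table="$1" version="$2"
  # Delta versions are non-negative integers; reject anything else.
  case "$version" in
    '' | *[!0-9]*)
      echo "version must be a non-negative integer" >&2
      return 1
      ;;
  esac
  # List the commits (versions) recorded for the table.
  printf 'DESCRIBE HISTORY %s;\n' "$table"
  # Read the table as it was at the requested version.
  printf 'SELECT * FROM %s VERSION AS OF %s;\n' "$table" "$version"
}

# demo.events is an illustrative table name, not one shipped with the project.
delta_time_travel_sql demo.events 0
```

`VERSION AS OF` selects a snapshot by commit number. Delta Lake also accepts `TIMESTAMP AS OF` for selecting by time.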

Hive Metastore Catalog

  • 2-level namespace (database.table)
  • PostgreSQL backend for reliability
  • Compatible with Spark, Presto, Trino
  • Standard Thrift API (port 9083)
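
A quick way to see whether the Thrift endpoint is accepting connections is a plain TCP check. The sketch below uses Bash's built-in `/dev/tcp` redirection. The `port_open` helper is hypothetical, and it assumes port 9083 is published to the host, which depends on the Compose configuration.

```shell
#!/usr/bin/env bash
# port_open <host> <port>
# Hypothetical helper: returns 0 if a TCP connection succeeds within 2 seconds.
port_open() {
  local host="$1" port="$2"
  # A child bash opens the socket on fd 3; timeout bounds the wait.
  timeout 2 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}
```

For example, `port_open localhost 9083 && echo "metastore reachable"` reports success only when something is listening on the port. It does not verify that the listener speaks Thrift.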

Spark Cluster

  • 1 Master + 2 Workers
  • Pre-configured for Delta Lake
  • S3A integration with MinIO
  • Ivy cache for fast dependency resolution

Make Commands

Initialization

make init          # Complete environment setup
make config        # Generate all configuration files
make up            # Start all services

Service Management

make up-tier0      # Start foundation services
make up-tier1      # Start data platform services
make down          # Stop all services
make restart       # Restart all services

Health & Validation

make health        # Check all services health
make health-tier0  # Check Tier 0 services
make health-tier1  # Check Tier 1 services
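
The per-tier targets can be chained so that Tier 1 starts only after Tier 0 passes its health check. This is a minimal sketch, assuming the `up-*` and `health-*` targets exit with a non-zero status on failure. The function name and the `MAKE_CMD` override are hypothetical, and only the Tier 0 and Tier 1 targets listed on this page are used.

```shell
#!/usr/bin/env bash
# Hypothetical helper: start each tier, then gate on its health check.
# MAKE_CMD can be overridden (for example with "echo") for a dry run.
start_tiers_in_order() {
  local make_cmd="${MAKE_CMD:-make}" tier
  for tier in tier0 tier1; do
    "$make_cmd" "up-$tier" || return 1
    "$make_cmd" "health-$tier" || return 1
  done
}
```

Running `MAKE_CMD=echo start_tiers_in_order` prints the targets in the order they would be invoked without starting any containers.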

Testing

make test          # Run all tests
make test-tier0    # Test foundation services
make test-tier1    # Test data platform services

Verification

make verify-hive   # Verify Hive Metastore setup
make summary       # Display environment summary
make ps            # Show running containers

Logs

make logs          # Show logs for all services
make logs-tier0    # Show Tier 0 logs
make logs-tier1    # Show Tier 1 logs
make logs-spark    # Show Spark logs
make logs-hive     # Show Hive Metastore logs

Development

make shell-postgres    # Open PostgreSQL shell
make shell-spark       # Open Spark shell
make shell-pyspark     # Open PySpark shell
make shell-spark-sql   # Open Spark SQL shell
make mc                # Open MinIO client

Maintenance

make reset         # Complete reset and reinitialize
make clean         # Stop and remove all data (DANGEROUS)

Conventions

  • All code and comments are in English
  • Configuration is generated via Makefile targets into /config/ - never edit rendered files manually
  • Every service must have a healthcheck, named volumes, and static config under /config/
  • Documentation is maintained in both English (docs/*.md) and Portuguese (docs/*.pt.md)

Web Interfaces

After running make init, access these UIs:

  • Spark Master UI: http://localhost:8080
  • MinIO Console: http://localhost:9001 (minioadmin / minioadmin123)
      • Buckets: lakehouse (Delta tables) and storage (staging files)
  • JupyterLab: http://localhost:8888 (run make token-jupyterlab for access)
  • Trino Console: http://localhost:${TRINO_PORT}
  • Superset: http://localhost:${SUPERSET_PORT} (login: admin / admin123)
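
Because the Trino and Superset ports come from environment variables, it can help to print the resolved URLs. The sketch below does that. The function name `print_ui_urls` is hypothetical, and it assumes `TRINO_PORT` and `SUPERSET_PORT` are defined in `.env` as plain `KEY=VALUE` lines that a shell can source.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the UI URLs with the ports resolved
# from the environment.
print_ui_urls() {
  # Refuse to print half-resolved URLs when a port is missing.
  if [ -z "${TRINO_PORT:-}" ] || [ -z "${SUPERSET_PORT:-}" ]; then
    echo "TRINO_PORT and SUPERSET_PORT must be set (for example via .env)" >&2
    return 1
  fi
  printf 'Spark Master UI: http://localhost:8080\n'
  printf 'MinIO Console: http://localhost:9001\n'
  printf 'JupyterLab: http://localhost:8888\n'
  printf 'Trino Console: http://localhost:%s\n' "$TRINO_PORT"
  printf 'Superset: http://localhost:%s\n' "$SUPERSET_PORT"
}
```

From the repository root, `set -a; . ./.env; set +a; print_ui_urls` would export the variables from `.env` and then print the list, provided the file is shell-compatible.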

Roadmap

  • Tier 0 – Foundation: PostgreSQL, MinIO
  • Tier 1 – Data Platform: Spark, Hive Metastore, Delta Lake
  • Tier 2 – Analytics & Development: JupyterLab
  • Tier 3 – SQL & BI: Trino, Superset