Architecture¶

FlumenData implements a lakehouse architecture: table data lives in S3-compatible object storage, as in a data lake, while Delta Lake and the Hive Metastore add the ACID transactions and SQL catalog of a data warehouse.

Overview¶

graph TB
    subgraph "Tier 1 - Data Platform"
        SPARK[Apache Spark 4.0.1<br/>Master + 2 Workers]
        HIVE[Hive Metastore 4.1.0<br/>Catalog Service]
    end

    subgraph "Tier 0 - Foundation"
        POSTGRES[PostgreSQL 17.6<br/>Metadata Store]
        MINIO[MinIO<br/>Object Storage]
    end

    subgraph "Data Layer"
        DELTA[Delta Lake 4.0<br/>ACID Tables]
    end

    SPARK --> HIVE
    SPARK --> DELTA
    DELTA --> MINIO
    HIVE --> POSTGRES
    HIVE --> MINIO
    SPARK --> MINIO

Architecture Layers¶

1. Storage Layer (Tier 0)¶

MinIO - Object Storage¶

Purpose: S3-compatible object storage for all table data
Technology: MinIO (S3 API)
Port: 9000 (API), 9001 (Console)
Data Format: Parquet files organized by Delta Lake

Bucket Structure:

lakehouse/
└── warehouse/
    ├── database1.db/
    │   ├── table1/
    │   └── table2/
    └── database2.db/
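
As a rough illustration of this layout, the helper below builds the S3A URI under which a table's files would sit. The function name `table_location` is hypothetical, and the rule that every database directory carries a `.db` suffix is an assumption read off the tree above, not taken from FlumenData's code.

```python
BUCKET = "lakehouse"         # bucket name from the tree above
WAREHOUSE_DIR = "warehouse"  # warehouse root inside the bucket


def table_location(database: str, table: str) -> str:
    """Return the assumed S3A URI of a table's root directory."""
    return f"s3a://{BUCKET}/{WAREHOUSE_DIR}/{database}.db/{table}"


print(table_location("database1", "table1"))
```

A URI of this shape is what the Hive Metastore would record as the table location.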

PostgreSQL - Metadata Backend¶

Purpose: Store Hive Metastore metadata
Technology: PostgreSQL 17.6
Port: 5432
Stores:
Database definitions
Table schemas
Partition information
Column statistics
Table locations (S3A paths)

2. Metadata Layer (Tier 1)¶

Hive Metastore¶

Purpose: Centralized catalog for lakehouse
Technology: Apache Hive 4.1.0 (standalone metastore)
Port: 9083 (Thrift)
Architecture:
Thrift service for metadata API
2-level namespace: database.table
Metadata stored in PostgreSQL
Compatible with Spark, Trino, Presto
Key Features:
ACID transaction metadata
Schema evolution tracking
Partition management
Statistics storage
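
To make the 2-level namespace concrete, here is a minimal sketch of how a table reference could be split into its database and table parts. `TableName` and `parse_table_name` are hypothetical names; falling back to a database called `default` mirrors the usual Hive and Spark convention and is an assumption about this setup.

```python
from typing import NamedTuple


class TableName(NamedTuple):
    database: str
    table: str


def parse_table_name(name: str, current_database: str = "default") -> TableName:
    """Split 'table' or 'database.table' into a (database, table) pair."""
    parts = name.split(".")
    if not all(parts) or len(parts) > 2:
        raise ValueError(f"expected 'table' or 'database.table', got {name!r}")
    if len(parts) == 1:
        # Unqualified name: resolve against the current database.
        return TableName(current_database, parts[0])
    return TableName(parts[0], parts[1])
```

A three-part name such as `catalog.database.table` is rejected here because the metastore itself only knows databases and tables.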

3. Compute Layer (Tier 1)¶

Apache Spark Cluster¶

Purpose: Distributed query and processing engine
Technology: Apache Spark 4.0.1
Ports: 7077 (Master), 8080 (UI)
Components:
Master Node: Job coordination and scheduling
Worker Nodes (x2): Task execution
- 2 CPU cores per worker
- 2 GB memory per worker

4. Table Format Layer¶

Delta Lake¶

Purpose: ACID table format with time travel
Technology: Delta Lake 4.0
Features:
ACID transactions
Schema evolution
Time travel
Audit history
Unified batch/streaming

Structure:

table/
├── _delta_log/
│   ├── 00000000000000000000.json
│   ├── 00000000000000000001.json
│   └── _last_checkpoint
└── part-*.parquet
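
The commit files in `_delta_log/` are named after the table version, zero-padded to 20 digits. The two helpers below (hypothetical names) sketch the mapping in both directions.

```python
def commit_file_name(version: int) -> str:
    """Return the log file name for a table version, e.g. version 1."""
    if version < 0:
        raise ValueError("table versions start at 0")
    return f"{version:020d}.json"


def commit_version(file_name: str) -> int:
    """Recover the table version from a commit file name."""
    return int(file_name.split(".")[0])


print(commit_file_name(1))
```

Because the names sort lexicographically in version order, a plain object listing of `_delta_log/` already yields the commits in the order they must be replayed.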

Data Flow¶

1. Write Path¶

sequenceDiagram
    participant User
    participant Spark
    participant Hive
    participant Postgres
    participant Delta
    participant MinIO

    User->>Spark: CREATE TABLE / INSERT
    Spark->>Hive: Register metadata
    Hive->>Postgres: Store schema
    Spark->>Delta: Write data + transaction log
    Delta->>MinIO: Store Parquet files
    MinIO-->>Delta: Confirm write
    Delta-->>Spark: Transaction complete
    Spark-->>User: Success

Steps:

1. User submits SQL query to Spark
2. Spark contacts Hive Metastore for table metadata
3. Hive stores/updates schema in PostgreSQL
4. Spark writes data using Delta Lake format
5. Delta Lake writes Parquet files to MinIO
6. Delta Lake commits transaction log
7. Success returned to user
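
The order of steps 5 and 6 is what makes a write atomic: data files are uploaded first and only become visible once the log entry is committed. The sketch below models that with an in-memory stand-in for a bucket. `ObjectStore` and `write_batch` are hypothetical, JSON stands in for Parquet, and a real Delta commit carries more actions and performs conflict checks before retrying.

```python
import json


class ObjectStore:
    """In-memory stand-in for a MinIO bucket: object key -> bytes."""

    def __init__(self):
        self.objects = {}

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def put_if_absent(self, key: str, data: bytes) -> bool:
        """Store data only if the key is free; report whether it was stored."""
        if key in self.objects:
            return False
        self.objects[key] = data
        return True


def write_batch(store: ObjectStore, table_path: str, file_name: str, rows: list) -> int:
    """Upload one data file, then commit it; return the new table version."""
    # Step 5: upload the data file. Readers cannot see it yet because
    # no log entry references it.
    store.put(f"{table_path}/{file_name}", json.dumps(rows).encode())

    # Step 6: claim the next free version in the log. Simplification:
    # on a collision this sketch just tries the next version number.
    action = json.dumps({"add": {"path": file_name}}).encode()
    version = 0
    while not store.put_if_absent(f"{table_path}/_delta_log/{version:020d}.json", action):
        version += 1
    return version
```

If the process dies between the two steps, the table is unchanged from a reader's point of view; the orphaned data file is simply never referenced.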

2. Read Path¶

sequenceDiagram
    participant User
    participant Spark
    participant Hive
    participant Postgres
    participant Delta
    participant MinIO

    User->>Spark: SELECT query
    Spark->>Hive: Get table metadata
    Hive->>Postgres: Fetch schema
    Postgres-->>Hive: Return schema
    Hive-->>Spark: Table metadata
    Spark->>Delta: Plan query
    Delta->>MinIO: Read Parquet files
    MinIO-->>Delta: Return data
    Delta-->>Spark: Data + stats
    Spark-->>User: Query results

Steps:

1. User submits SELECT query to Spark
2. Spark requests table metadata from Hive
3. Hive retrieves schema from PostgreSQL
4. Spark uses Delta Lake to plan optimal read
5. Delta Lake reads transaction log from MinIO
6. Delta Lake performs partition pruning
7. Only required Parquet files are read
8. Results returned to user
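
Steps 5 to 7 can be sketched as a replay of the transaction log followed by a filter on partition values. The functions `live_files` and `prune` are hypothetical and handle only `add` and `remove` actions; the sample commits and the `date` partition column are made up for illustration, and checkpoints and file statistics are ignored.

```python
import json


def live_files(commits: list) -> dict:
    """Replay commits (newline-delimited JSON, in version order).

    Returns a mapping of data file path -> partition values for every
    file that is still part of the table.
    """
    files = {}
    for commit in commits:
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                add = action["add"]
                files[add["path"]] = add.get("partitionValues", {})
            elif "remove" in action:
                files.pop(action["remove"]["path"], None)
    return files


def prune(files: dict, column: str, value: str) -> list:
    """Keep only the files whose partition value matches the filter."""
    return sorted(path for path, parts in files.items() if parts.get(column) == value)


def _add(path: str, date: str) -> str:
    return json.dumps({"add": {"path": path, "partitionValues": {"date": date}}})


commit_0 = "\n".join([
    _add("date=2024-01-01/part-0000.parquet", "2024-01-01"),
    _add("date=2024-01-02/part-0001.parquet", "2024-01-02"),
])
commit_1 = "\n".join([
    json.dumps({"remove": {"path": "date=2024-01-01/part-0000.parquet"}}),
    _add("date=2024-01-01/part-0002.parquet", "2024-01-01"),
])

to_read = prune(live_files([commit_0, commit_1]), "date", "2024-01-01")
```

Only `part-0002` remains for that date: the file removed in the second commit and the file from the other partition are never fetched from MinIO.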

Component Interactions¶

Spark ↔ Hive Metastore¶

Protocol: Thrift (port 9083)
Purpose:
Table creation/deletion
Schema evolution
Partition management
Statistics collection

Spark ↔ MinIO¶

Protocol: S3A (HTTP on port 9000)
Purpose:
Read/write Parquet files
Read Delta transaction logs
List partitions

Hive ↔ PostgreSQL¶

Protocol: JDBC
Purpose:
Store table metadata
Schema versioning
Partition tracking

Delta Lake ↔ MinIO¶

Protocol: S3A
Purpose:
ACID transaction logs
Parquet data files
Checkpoint files
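
The interactions above map onto a small set of Spark settings. The sketch below collects them in a plain dictionary. The configuration keys are standard Spark, Hadoop S3A, and Delta Lake options, but the host names (`spark-master`, `hive-metastore`, `minio`) are assumed to match the Compose service names, and the function `spark_conf` and its credential parameters are hypothetical, not FlumenData's actual configuration.

```python
def spark_conf(minio_user: str, minio_password: str) -> dict:
    """Return Spark settings that wire a session to the platform services."""
    return {
        # Spark <-> Spark Master (RPC, port 7077)
        "spark.master": "spark://spark-master:7077",
        # Spark <-> Hive Metastore (Thrift, port 9083)
        "spark.sql.catalogImplementation": "hive",
        "spark.hadoop.hive.metastore.uris": "thrift://hive-metastore:9083",
        "spark.sql.warehouse.dir": "s3a://lakehouse/warehouse",
        # Spark / Delta Lake <-> MinIO (S3A over HTTP, port 9000)
        "spark.hadoop.fs.s3a.endpoint": "http://minio:9000",
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": "false",
        "spark.hadoop.fs.s3a.access.key": minio_user,
        "spark.hadoop.fs.s3a.secret.key": minio_password,
        # Delta Lake as the table format
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }
```

In a PySpark session, each pair could be passed to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`.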

Deployment Architecture¶

Docker Compose Tiers¶

Tier 0 (Foundation):

postgres:      # Metadata backend
minio:         # Object storage

Tier 1 (Data Platform):

hive-metastore:   # Catalog service
spark-master:     # Spark coordinator
spark-worker1:    # Spark executor
spark-worker2:    # Spark executor

Tier 2 (Analytics & Development):

jupyterlab:    # PySpark workspace

Tier 3 (SQL & BI):

trino:         # Distributed SQL engine
superset:      # BI dashboards

Startup Dependencies¶

graph TD
    T0[Tier 0 Services] --> POSTGRES[PostgreSQL]
    T0 --> MINIO[MinIO]

    POSTGRES --> HIVE[Hive Metastore]
    MINIO --> HIVE

    HIVE --> SPARK_MASTER[Spark Master]
    SPARK_MASTER --> SPARK_W1[Spark Worker 1]
    SPARK_MASTER --> SPARK_W2[Spark Worker 2]

    style T0 fill:#e1f5ff
    style HIVE fill:#fff4e1
    style SPARK_MASTER fill:#fff4e1
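
The dependency graph can also be read as a set of startup waves: services in the same wave have no dependency on each other and may start in parallel. Below is a minimal sketch using Python's standard `graphlib`. The dictionary restates the diagram with the Compose service names, and `startup_waves` is a hypothetical helper.

```python
from graphlib import TopologicalSorter

# service -> services that must be up first (taken from the diagram)
DEPENDS_ON = {
    "postgres": set(),
    "minio": set(),
    "hive-metastore": {"postgres", "minio"},
    "spark-master": {"hive-metastore"},
    "spark-worker1": {"spark-master"},
    "spark-worker2": {"spark-master"},
}


def startup_waves(depends_on: dict) -> list:
    """Group services into waves that can be started together."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorter.get_ready()
        waves.append(sorted(ready))
        sorter.done(*ready)
    return waves


for number, wave in enumerate(startup_waves(DEPENDS_ON), start=1):
    print(number, wave)
```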

Health Check Chain¶

All services have health checks to ensure proper startup:

PostgreSQL: pg_isready command
MinIO: HTTP GET on /minio/health/live
Hive Metastore: Process check (pgrep -f HiveMetaStore)
Spark Master: HTTP GET on port 8080
Spark Workers: Check Master UI accessibility
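
The two HTTP checks can be reproduced from the host with the standard library alone. This is a minimal sketch: `http_healthy` is a hypothetical helper, and the `localhost` URLs assume the MinIO API and Spark Master UI ports are published on the host unchanged.

```python
import http.client
import urllib.request

# Assumed host-side URLs for the two HTTP health checks listed above.
HEALTH_URLS = {
    "minio": "http://localhost:9000/minio/health/live",
    "spark-master": "http://localhost:8080/",
}


def http_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False
```

Calling `http_healthy(HEALTH_URLS["minio"])` in a retry loop gives a simple wait-until-ready gate for scripts that run outside Docker Compose.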

Network Architecture¶

Container Network¶

Services communicate over two Docker networks (tier0_network and tier1_network):

DNS: Container names resolve to IP addresses
Internal Communication: No encryption (TLS optional)
External Access: Only selected ports exposed to host

Port Mapping¶

| Service | Host Port | Container Port | Protocol |
|---------|-----------|----------------|----------|
| PostgreSQL | 5432 | 5432 | TCP |
| MinIO API | 9000 | 9000 | HTTP |
| MinIO Console | 9001 | 9001 | HTTP |
| Hive Metastore | 9083 | 9083 | Thrift |
| Spark Master | 7077 | 7077 | RPC |
| Spark Master UI | 8080 | 8080 | HTTP |
| JupyterLab | 8888 | 8888 | HTTP |
| Trino | ${TRINO_PORT:-8082} | 8080 | HTTP |
| Superset | ${SUPERSET_PORT:-8088} | 8088 | HTTP |
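
The `${VAR:-default}` entries for Trino and Superset use shell-style defaults: the variable's value is used when it is set and non-empty, otherwise the default after `:-` applies. The sketch below mimics that rule for this one form of substitution. `resolve` and its `env` parameter are hypothetical, and it is not a full reimplementation of Compose interpolation.

```python
import os
import re

# Matches ${NAME:-default} only; other interpolation forms are left untouched.
_DEFAULT_PATTERN = re.compile(r"\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*):-(?P<default>[^}]*)\}")


def resolve(value: str, env=None) -> str:
    """Expand ${NAME:-default} using env (defaults to os.environ)."""
    env = os.environ if env is None else env

    def substitute(match: re.Match) -> str:
        current = env.get(match["name"])
        # An unset or empty variable falls back to the default.
        return current if current else match["default"]

    return _DEFAULT_PATTERN.sub(substitute, value)


print(resolve("${TRINO_PORT:-8082}", env={}))
```

With `TRINO_PORT=9090` in the environment, the same call would yield `9090`, which is how the host port can be moved without editing the Compose file.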

Storage Architecture¶

Bind Mounts (User-Accessible Data)¶

Critical data directories mounted from ${DATA_DIR}:

${DATA_DIR}/minio/            # MinIO lakehouse storage (can grow to TBs)
  ├── lakehouse/              # Delta Lake tables
  └── storage/                # Staging bucket
${DATA_DIR}/notebooks/        # JupyterLab notebooks (version control this!)

These directories are on your host filesystem and can be:

- Backed up easily
- Accessed from Windows Explorer (if using WSL)
- Version controlled (especially notebooks/)

Docker Volumes (Internal Data)¶

Tier 0 Volumes:

postgres_data    # PostgreSQL database files (Hive metadata)

Tier 1 Volumes:

spark_conf       # Spark configuration
spark_ivy        # JAR dependency cache
spark_work       # Spark work directory
spark_logs       # Spark logs

Tier 3 Volumes:

flumen_superset_home       # Superset dashboards, uploads, config

Data Retention¶

Delta Lake: Configurable retention (transaction log history is kept for 30 days by default)
Hive Metadata: Retained indefinitely in PostgreSQL
Spark Logs: Retained in volume (manual cleanup)
MinIO Objects: Retained until explicitly deleted

Scalability Considerations¶

Current Setup¶

1 Spark Master
2 Spark Workers
1 Hive Metastore instance
Single-node PostgreSQL

Future Scaling Options¶

Add more Spark Workers: Scale compute horizontally
Partition tables: Improve query performance
PostgreSQL High Availability: Primary + replica
Multiple Hive Metastore instances: Load balancing
Distributed MinIO: Multi-node object storage

Security Architecture¶

Current Security¶

No encryption: Internal communication unencrypted
Basic authentication: MinIO username/password
Docker network isolation: Services isolated from host network
No TLS: HTTP only for web interfaces

Production Considerations¶

Enable TLS for all services
Use secrets management (HashiCorp Vault, AWS Secrets Manager)
Implement network policies
Enable audit logging
Use RBAC for access control

High Availability¶

The current setup is not highly available and is suitable for development only.

For production:

- PostgreSQL: Primary-replica setup with automatic failover
- Hive Metastore: Multiple instances behind load balancer
- MinIO: Distributed mode with erasure coding
- Spark: Multiple masters in HA mode

Next Steps¶

Installation Guide - Deploy FlumenData
Quick Start - Create your first table
Configuration - Customize the setup