Quick Start¶

This guide will walk you through creating your first Delta Lake table and running queries in FlumenData.

Prerequisites¶

Complete the Installation guide and verify all services are running:

python3 flumen health

Step 1: Create a Database¶

Open the Spark SQL shell:

python3 flumen shell-spark-sql

Create your first database:

CREATE DATABASE quickstart
LOCATION 's3a://lakehouse/warehouse/quickstart.db';

USE quickstart;

SHOW DATABASES;

Step 2: Create a Delta Table¶

Create a sample Delta table with customer data:

CREATE TABLE customers (
  customer_id BIGINT,
  name STRING,
  email STRING,
  signup_date DATE,
  country STRING
) USING DELTA
LOCATION 's3a://lakehouse/warehouse/quickstart.db/customers';

Verify the table was created:

SHOW TABLES;

DESCRIBE EXTENDED customers;

Step 3: Insert Data¶

Insert sample records:

INSERT INTO customers VALUES
  (1, 'Alice Johnson', 'alice@example.com', '2024-01-15', 'USA'),
  (2, 'Bob Smith', 'bob@example.com', '2024-02-20', 'Canada'),
  (3, 'Carlos Garcia', 'carlos@example.com', '2024-03-10', 'Mexico'),
  (4, 'Diana Lee', 'diana@example.com', '2024-04-05', 'USA'),
  (5, 'Eva Mueller', 'eva@example.com', '2024-05-12', 'Germany');

Verify the data:

SELECT * FROM customers ORDER BY customer_id;

Step 4: Query Data¶

Run analytical queries:

-- Count customers by country
SELECT country, COUNT(*) as total_customers
FROM customers
GROUP BY country
ORDER BY total_customers DESC;

-- Find customers who signed up in Q1 2024
SELECT name, email, signup_date
FROM customers
WHERE signup_date BETWEEN '2024-01-01' AND '2024-03-31'
ORDER BY signup_date;

Step 5: Update Records¶

Delta Lake supports ACID updates:

-- Update a customer's email
UPDATE customers
SET email = 'alice.johnson@newdomain.com'
WHERE customer_id = 1;

-- Verify the update
SELECT * FROM customers WHERE customer_id = 1;

Step 6: Time Travel¶

Query previous versions of the table:

-- View table history
DESCRIBE HISTORY customers;

-- Query the original data (before update)
SELECT * FROM customers VERSION AS OF 0;

-- Compare with current version
SELECT * FROM customers;

Step 7: Delete Records¶

Delete specific records:

DELETE FROM customers WHERE country = 'Germany';

-- Verify deletion
SELECT * FROM customers ORDER BY customer_id;

Step 8: Create Partitioned Table¶

Create a partitioned table for better performance:

CREATE TABLE orders (
  order_id BIGINT,
  customer_id BIGINT,
  amount DECIMAL(10,2),
  order_date DATE,
  status STRING
) USING DELTA
PARTITIONED BY (DATE(order_date))
LOCATION 's3a://lakehouse/warehouse/quickstart.db/orders';

Insert partitioned data:

INSERT INTO orders VALUES
  (101, 1, 150.00, '2024-11-01', 'completed'),
  (102, 2, 75.50, '2024-11-01', 'completed'),
  (103, 3, 200.00, '2024-11-02', 'pending'),
  (104, 1, 320.00, '2024-11-03', 'completed'),
  (105, 4, 95.00, '2024-11-03', 'cancelled');

Query with partition pruning:

-- This query only scans the relevant partition
SELECT * FROM orders
WHERE order_date = '2024-11-01';

Step 9: Join Tables¶

Perform joins across Delta tables:

-- Find customer names with their order totals
SELECT
  c.name,
  c.country,
  COUNT(o.order_id) as total_orders,
  SUM(o.amount) as total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.name, c.country
ORDER BY total_spent DESC;

Step 10: Explore with PySpark¶

Exit Spark SQL (Ctrl+D) and open PySpark:

python3 flumen shell-pyspark

Run Python code:

# Read Delta table
df = spark.read.format("delta").table("quickstart.customers")
df.show()

# Filter and transform
from pyspark.sql.functions import year, month

df_with_month = df.withColumn("signup_month", month("signup_date"))
df_with_month.show()

# Write to new Delta table
df_with_month.write.format("delta") \
    .mode("overwrite") \
    .saveAsTable("quickstart.customers_with_month")

# Verify
spark.sql("SELECT * FROM quickstart.customers_with_month").show()

Viewing Data in MinIO¶

Your Delta tables are stored as Parquet files in MinIO:

Open MinIO Console: http://localhost:9001
Login with: minioadmin / minioadmin123
Navigate to: lakehouse → warehouse → quickstart.db
You'll see:
customers/ - Customer data and Delta logs
orders/ - Order data partitioned by date
Each table has a _delta_log/ directory with transaction logs

Monitoring with Spark UI¶

View query execution details:

Open Spark Master UI: http://localhost:8080
Click on "Running Applications" or "Completed Applications"
Click on an application to see:
DAG visualization
Stage details
Task execution times
Memory usage

Common Operations Cheat Sheet¶

-- List all databases
SHOW DATABASES;

-- Use a database
USE quickstart;

-- List all tables
SHOW TABLES;

-- Describe table structure
DESCRIBE customers;

-- View table history
DESCRIBE HISTORY customers;

-- Optimize table (compaction)
OPTIMIZE customers;

-- Vacuum old files (removes files older than 7 days)
VACUUM customers RETAIN 168 HOURS;

-- Analyze table for query optimization
ANALYZE TABLE customers COMPUTE STATISTICS;

-- Drop table
DROP TABLE IF EXISTS customers;

-- Drop database
DROP DATABASE IF EXISTS quickstart CASCADE;

Clean Up¶

To remove the quickstart data:

# In Spark SQL
DROP DATABASE quickstart CASCADE;

Next Steps¶

Now that you've completed the quickstart:

Architecture Overview - Understand how components work together
Hive Metastore - Learn about catalog management
Apache Spark - Deep dive into Spark features
Configuration - Customize your environment

Troubleshooting¶

Table not found error¶

Ensure you're using the correct database:

USE quickstart;
SHOW TABLES;

Permission denied on S3A¶

Verify MinIO credentials in configuration:

docker exec flumen_spark_master cat /opt/spark/conf/spark-defaults.conf | grep s3a

Slow query performance¶

Optimize your tables:

-- Compact small files
OPTIMIZE customers;

-- Z-order for better data skipping
OPTIMIZE customers ZORDER BY (country);

Out of memory errors¶

Increase Spark worker memory in docker-compose.tier1.yml or reduce data size for testing.