[Architecture diagram]

Stateful Streaming Lakehouse

Overview

Discover how Spark Structured Streaming powers a stateful lakehouse for industrial sales and shipment data. Learn how to process incremental events from OneLake and Eventstream with a single API that serves both batch and streaming workloads, and explore a production-ready, streaming-first reference implementation.

Class: Core
Type: Tutorial
Difficulty: Advanced
Deploy Time: ~8 min
Complete Time: ~30 min

Workloads

Data Engineering · Real-Time Intelligence

Fabric Items Deployed

  • Lakehouse
  • Eventstream
  • Notebook
  • Spark Job Definition
  • Environment

Scenarios

Streaming · Batch Processing

This jumpstart deploys a full Spark Structured Streaming lakehouse into your Microsoft Fabric workspace — a production-grade medallion pipeline (raw → bronze → silver) processing 9 entities from file and Eventstream sources, powered by ArcFlow, an open-source PySpark streaming ELT framework.

What You'll Learn

This jumpstart is designed as a hands-on workshop (Module 2 of the Stateful Streaming Lakehouse series). It covers two progression levels:

| Part | Notebook | Focus |
|------|----------|-------|
| Part 1 | explore_streaming | Core streaming concepts with plain PySpark — no frameworks |
| Part 2 | arcflow_elt_framework | Same transforms, now packaged as production software with ArcFlow |

By the end you will have:

  • Built a working raw → bronze → silver streaming pipeline
  • Understood checkpoint state — how Spark remembers what it already processed
  • Used the memory sink to develop safely without writing to Delta
  • Seen how packaging transforms as a framework changes testability, observability, and scalability
  • Understood the power of Spark's Structured Streaming API for managing state in both streaming and batch use cases

Entities

| Entity | Landing Format | Source | Custom Logic |
|--------|----------------|--------|--------------|
| shipmentscanevent | Eventstream (Kafka) | Real-time event stream | Explode + flatten 30+ fields |
| shipment | JSON files | Files/landing/ | Flatten nested addresses |
| order | JSON files | Files/landing/ | Explode OrderLines[] to line-level grain |
| item | JSON files | Files/landing/ | Cast types, validate SKU |
| customer | Parquet files | Files/landing/ | Flatten address structs |
| route | Parquet files | Files/landing/ | Dimensional lookup |
| facility | Parquet files | Files/landing/ | Dimensional lookup |
| servicelevel | Parquet files | Files/landing/ | Dimensional lookup |
| exceptiontype | Parquet files | Files/landing/ | Dimensional lookup |

JSON files follow an envelope pattern: {"_meta": {...}, "data": [...]}. Parquet files are flat with OrganizationId and GeneratedAt columns added by the data generator.
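
As a concrete illustration, here is a minimal sketch of reading that envelope and exploding `data` into one row per record; the path, the batch-style read, and the inferred schema are illustrative stand-ins for the generator's real layout:

```python
from pyspark.sql import functions as F

# Illustrative batch read of the envelope files; a streaming version would
# use spark.readStream with an explicit schema instead of inference.
envelope = spark.read.json('Files/landing/shipment/')  # {"_meta": {...}, "data": [...]}

records = (envelope
    .select('_meta', F.explode('data').alias('record'))  # one row per element of data[]
    .select('_meta.*', 'record.*'))                      # flatten both structs into columns
```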

How It Works

Part 1 — Core Concepts with Native Spark

The explore_streaming notebook walks through streaming primitives with zero abstractions:

  1. Why Structured Streaming for batch pipelines? — The availableNow trigger gives you batch scheduling semantics with streaming state management. Re-run a job and it only processes new data.

  2. Write streaming queries on top of files — Read a file source as a stream and write to Delta with a checkpoint. Re-run it and watch only the new data get processed.

  3. The memory sink development pattern — Stream data into driver memory to inspect and iterate on transforms without touching Delta tables. No checkpoints, no cleanup (see the first sketch after this list).

  4. Eventstream (Kafka) source pattern — The same transforms work whether the source is files or a message broker. Parse the binary Kafka payload through cast → from_json → explode → expand (see the second sketch after this list).

  5. Build raw → bronze → silver — Schema enforcement, snake_case column normalization, and audit timestamps in bronze; flattened and expanded nested structs in silver.
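
A minimal sketch of the memory sink pattern from step 3; the query name and the source DataFrame `df` are placeholders:

```python
# Stream into an in-memory table on the driver: no Delta writes and no
# checkpoint directory to clean up between iterations.
dev_query = (df.writeStream
    .format('memory')
    .queryName('dev_preview')      # registers the results as a queryable temp table
    .trigger(availableNow=True)
    .start())
dev_query.awaitTermination()

spark.sql('SELECT * FROM dev_preview LIMIT 10').show()  # inspect, tweak, re-run
```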
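
And a sketch of step 4's cast → from_json → explode → expand chain, assuming `kafka_df` comes from a `spark.readStream.format('kafka')` source and using a two-field placeholder for the real 30+ field event schema:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Placeholder schema -- the real scan-event payload carries 30+ fields.
event_schema = ArrayType(StructType([
    StructField('ShipmentId', StringType()),
    StructField('ScanType', StringType()),
]))

events = (kafka_df
    .select(F.col('value').cast('string').alias('payload'))        # cast: binary -> string
    .select(F.from_json('payload', event_schema).alias('events'))  # from_json: string -> array
    .select(F.explode('events').alias('event'))                    # explode: array -> rows
    .select('event.*'))                                            # expand: struct -> columns
```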

Part 2 — Thinking Like a Software Engineer

The arcflow_elt_framework notebook takes the exact same transforms and packages them with ArcFlow:

| Concept | What changes |
|---------|--------------|
| Loose notebook functions → @register_zone_transformer | Transforms are named, discoverable, and referenced by config |
| Manual writeStream boilerplate → YAML configuration | Define sources, zones, and transforms declaratively |
| One stream at a time → controller.run_full_pipeline() | All tables, all zones, one call |
| query.status per stream → controller.get_status() | Every stream's health in one call |
| Manual zone sequencing → event-driven chaining | StreamingQueryListener cascades downstream zones automatically |
| Ad-hoc Spark configs → SparkConfigurator | Auto-applies AQE, auto-compaction, V-Order, and zstd compression on init |

The test_input / test_output methods let you validate the full transform chain end-to-end — raw through silver — without writing a single row to storage.
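
For a flavor of the first row in the table above, a hypothetical registration sketch; the decorator arguments, import path, and transform body are assumptions, with only the register_zone_transformer name taken from this page:

```python
from pyspark.sql import functions as F
from arcflow import register_zone_transformer  # import path is an assumption

@register_zone_transformer('silver', 'order')  # zone/entity arguments are assumptions
def explode_order_lines(df):
    # Same explode-to-line-grain transform as the notebook version, now named
    # and discoverable so the YAML config can reference it.
    return (df
        .select('*', F.explode('OrderLines').alias('order_line'))
        .drop('OrderLines'))
```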

Spark Job Definition

The production pipeline runs as a headless Spark Job Definition (main.py) that:

  1. Starts the LakeGen McMillan Industrial Group synthetic data generator — producing orders, shipments, scan events, and reference data
  2. Initializes the ArcFlow Controller with pipeline configuration
  3. Runs controller.run_full_pipeline(zones=['bronze', 'silver']) with await_termination=True

The same framework, same config, same transforms — just a different execution model. See Notebooks vs. Spark Jobs in Production for the tradeoffs, and Creating Your First Spark Job Definition for a step-by-step guide.
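
A skeleton of what such a headless entry point could look like; the import paths, constructor argument, and generator call are assumptions, while the run_full_pipeline call shape comes from the steps above:

```python
# Hypothetical main.py skeleton for the Spark Job Definition.
from arcflow import Controller         # import path is an assumption
from lakegen import start_generator    # hypothetical generator entry point

def main():
    start_generator('mcmillan-industrial-group')                   # 1. synthetic data feed
    controller = Controller(config='Files/config/pipeline.yaml')   # 2. illustrative config path
    controller.run_full_pipeline(                                  # 3. run and block until done
        zones=['bronze', 'silver'],
        await_termination=True)

if __name__ == '__main__':
    main()
```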

Key Patterns

Stateful Incremental Processing

Every streaming query uses checkpoint state to track what has been processed. Re-running a pipeline only processes new data — no duplicates, no full re-scans, no custom bookmarking logic.

```python
# availableNow processes everything not yet seen, then stops: batch
# scheduling semantics with streaming state management.
query = (df.writeStream
    .format('delta')
    .option('checkpointLocation', 'Files/checkpoints/my_table')  # processed-data state lives here
    .trigger(availableNow=True)
    .toTable('bronze.my_table')
)
```

File Archival for Streaming from Object Storage
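
Long-running file-source streams leave already-processed files in the landing path, which slows source listing over time. Spark's file source can move completed files aside for you via its cleanSource and sourceArchiveDir options; a minimal sketch, assuming illustrative paths and a landing_schema defined elsewhere:

```python
# cleanSource='archive' moves each fully processed file out of the landing
# path into sourceArchiveDir, so re-runs list only what is still pending.
stream = (spark.readStream
    .format('json')
    .schema(landing_schema)                                # file sources need an explicit schema
    .option('cleanSource', 'archive')
    .option('sourceArchiveDir', 'Files/archive/shipment')  # illustrative archive path
    .load('Files/landing/shipment'))
```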

Target Schema

By the end of the workshop, you'll have a complete silver layer:

McMillan Industrial Group Silver Schema

Resources