This jumpstart deploys a full Spark Structured Streaming lakehouse into your Microsoft Fabric workspace: a production-grade medallion pipeline (raw → bronze → silver) processing 9 entities from file and Eventstream sources, powered by ArcFlow, an open-source PySpark streaming ELT framework.
What You'll Learn
This jumpstart is designed as a hands-on workshop (Module 2 of the Stateful Streaming Lakehouse series). It covers two progression levels:
| Part | Notebook | Focus |
|---|---|---|
| Part 1 | explore_streaming | Core streaming concepts with plain PySpark, no frameworks |
| Part 2 | arcflow_elt_framework | Same transforms, now packaged as production software with ArcFlow |
By the end you will have:
- Built a working raw → bronze → silver streaming pipeline
- Understood checkpoint state: how Spark remembers what it already processed
- Used the `memory` sink to develop safely without writing to Delta
- Seen how packaging transforms as a framework changes testability, observability, and scalability
- Understood the power of Spark's Structured Streaming API for managing state in both streaming and batch use cases
Entities
| Entity | Landing Format | Source | Custom Logic |
|---|---|---|---|
| shipmentscanevent | Eventstream (Kafka) | Real-time event stream | Explode + flatten 30+ fields |
| shipment | JSON files | Files/landing/ | Flatten nested addresses |
| order | JSON files | Files/landing/ | Explode OrderLines[] to line-level grain |
| item | JSON files | Files/landing/ | Cast types, validate SKU |
| customer | Parquet files | Files/landing/ | Flatten address structs |
| route | Parquet files | Files/landing/ | Dimensional lookup |
| facility | Parquet files | Files/landing/ | Dimensional lookup |
| servicelevel | Parquet files | Files/landing/ | Dimensional lookup |
| exceptiontype | Parquet files | Files/landing/ | Dimensional lookup |
JSON files follow an envelope pattern: `{"_meta": {...}, "data": [...]}`. Parquet files are flat, with `OrganizationId` and `GeneratedAt` columns added by the data generator.
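To make the envelope concrete, here is a minimal sketch of reading one of these JSON entities as a stream and exploding it down to line-level grain (the custom logic listed for the order entity). The schema, field names, and landing path are illustrative assumptions, not the generator's actual contract:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

# Hypothetical envelope schema for the order entity; field names are assumptions.
order_envelope_schema = StructType([
    StructField("_meta", StructType([
        StructField("OrganizationId", StringType()),
        StructField("GeneratedAt", StringType()),
    ])),
    StructField("data", ArrayType(StructType([
        StructField("OrderId", StringType()),
        StructField("CustomerId", StringType()),
        StructField("OrderLines", ArrayType(StructType([
            StructField("Sku", StringType()),
            StructField("Quantity", StringType()),
        ]))),
    ]))),
])

orders_raw = (
    spark.readStream
    .schema(order_envelope_schema)      # streaming file sources need an explicit schema
    .json("Files/landing/order/")       # illustrative landing folder
)

# Unwrap the envelope: one row per order, then one row per order line.
order_lines = (
    orders_raw
    .select(F.explode("data").alias("order"), "_meta")
    .select("order.*", "_meta.OrganizationId", "_meta.GeneratedAt")
    .withColumn("line", F.explode("OrderLines"))
    .select("OrderId", "CustomerId", "OrganizationId", "GeneratedAt", "line.*")
)
```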
How It Works
Part 1: Core Concepts with Native Spark
The explore_streaming notebook walks through streaming primitives with zero abstractions:
- Why Structured Streaming for batch pipelines? The `availableNow` trigger gives you batch scheduling semantics with streaming state management. Re-run a job and it only processes new data.
- Write streaming queries on top of files: read a file source as a stream, write to Delta with a checkpoint. Re-run it and watch only new data process.
- The `memory` sink development pattern: stream data into driver memory to inspect and iterate on transforms without touching Delta tables. No checkpoints, no cleanup. (A minimal sketch follows this list.)
- Eventstream (Kafka) source pattern: the same transforms work whether the source is files or a message broker. Parse the binary Kafka payload through cast → `from_json` → explode → expand.
- Build raw → bronze → silver: schema enforcement, `snake_case` column normalization, audit timestamps in bronze. Flatten and expand nested structs in silver.
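The `memory` sink pattern above takes only a few lines of plain PySpark. A minimal sketch, where the schema, landing path, and query name are illustrative rather than the notebook's actual values:

```python
from pyspark.sql.types import StringType, StructField, StructType

# Illustrative schema and path; streaming file sources require an explicit schema.
customer_schema = StructType([
    StructField("CustomerId", StringType()),
    StructField("Name", StringType()),
    StructField("OrganizationId", StringType()),
])

dev_stream = (
    spark.readStream
    .format("parquet")
    .schema(customer_schema)
    .load("Files/landing/customer/")
    # ...apply the transforms you are iterating on here...
)

query = (
    dev_stream.writeStream
    .format("memory")                 # results land in an in-memory table on the driver
    .queryName("customer_preview")    # the table name to query with Spark SQL
    .trigger(availableNow=True)       # process what is there, then stop
    .start()
)
query.awaitTermination()

# Inspect the output without touching Delta: no checkpoint, nothing to clean up.
spark.sql("SELECT * FROM customer_preview LIMIT 10").show()
```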
Part 2: Thinking Like a Software Engineer
The arcflow_elt_framework notebook takes the exact same transforms and packages them with ArcFlow:
| Concept | What changes |
|---|---|
| Loose notebook functions → `@register_zone_transformer` | Transforms are named, discoverable, and referenced by config |
| Manual `writeStream` boilerplate → YAML configuration | Define sources, zones, and transforms declaratively |
| One stream at a time → `controller.run_full_pipeline()` | All tables, all zones, one call |
| `query.status` per stream → `controller.get_status()` | Every stream's health in one call |
| Manual zone sequencing → Event-driven chaining | `StreamingQueryListener` cascades downstream zones automatically |
| Ad-hoc Spark configs → `SparkConfigurator` | Auto-applies AQE, auto-compaction, V-Order, zstd compression on init |
The `test_input` / `test_output` methods let you validate the full transform chain end-to-end (raw through silver) without writing a single row to storage.
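As a rough illustration of how these pieces might compose, the sketch below uses only the names introduced above; the import path, decorator signature, and config file name are assumptions, not ArcFlow's documented API:

```python
# Hypothetical shape only: module path, decorator arguments, and config keys are
# assumptions based on the names above, not ArcFlow's actual API.
from arcflow import Controller, register_zone_transformer
from pyspark.sql import DataFrame

@register_zone_transformer("customer_silver")      # transform is named and discoverable
def customer_silver(df: DataFrame) -> DataFrame:
    # The same logic you wrote in the notebook, now referenced by config instead
    # of wired up with manual writeStream boilerplate.
    return df.withColumnRenamed("CustomerId", "customer_id")

controller = Controller(config="pipeline.yaml")             # sources, zones, transforms declared in YAML
controller.run_full_pipeline(zones=["bronze", "silver"])    # all tables, all zones, one call
print(controller.get_status())                              # every stream's health in one call
```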
Spark Job Definition
The production pipeline runs as a headless Spark Job Definition (`main.py`) that:
- Starts the LakeGen McMillan Industrial Group synthetic data generator, producing orders, shipments, scan events, and reference data
- Initializes the ArcFlow Controller with pipeline configuration
- Runs `controller.run_full_pipeline(zones=['bronze', 'silver'])` with `await_termination=True`
The same framework, same config, same transforms, just a different execution model. See Notebooks vs. Spark Jobs in Production for the tradeoffs, and Creating Your First Spark Job Definition for a step-by-step guide.
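A skeletal sketch of what such a `main.py` might look like; the import paths, the generator entry point, and the config file name are assumptions, not the jumpstart's actual source:

```python
# main.py -- hypothetical skeleton; import paths, class names, and the config file
# name are assumptions, not the jumpstart's actual code.
from arcflow import Controller
from lakegen import McMillanIndustrialGroupGenerator

def main() -> None:
    # Start the synthetic data generator: orders, shipments, scan events, reference data.
    generator = McMillanIndustrialGroupGenerator()
    generator.start()

    # Initialize the controller from the pipeline configuration.
    controller = Controller(config="pipeline.yaml")

    # Headless execution: block on the streaming queries instead of returning
    # control to a notebook cell.
    controller.run_full_pipeline(zones=["bronze", "silver"], await_termination=True)

if __name__ == "__main__":
    main()
```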
Key Patterns
Stateful Incremental Processing
Every streaming query uses checkpoint state to track what has been processed. Re-running a pipeline only processes new data: no duplicates, no full re-scans, no custom bookmarking logic.
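A minimal sketch of the mechanics, with illustrative table and checkpoint paths:

```python
# Illustrative locations; the jumpstart's actual tables and checkpoint paths differ.
query = (
    spark.readStream
    .format("delta")
    .load("Tables/bronze_order")                       # stream from the bronze Delta table
    .writeStream
    .format("delta")
    .option("checkpointLocation", "Files/checkpoints/silver_order")  # where progress is recorded
    .trigger(availableNow=True)                        # batch-style run: process new data, then stop
    .toTable("silver_order")
)
query.awaitTermination()

# Re-run this later and the checkpoint ensures only data added since the last run
# is processed: no duplicates, no full re-scan, no custom bookmarking.
```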
File Archival for Streaming from Object Storage
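With a file source, consumed files otherwise accumulate in the landing folder and slow down directory listing. This pattern maps to Spark's built-in file-source cleanup options; whether the jumpstart uses these exact settings is an assumption, and the paths below are illustrative:

```python
# Sketch of Spark's built-in file-source archival; option values and paths are
# illustrative assumptions, not necessarily the jumpstart's configuration.
archived_orders = (
    spark.readStream
    .schema(order_envelope_schema)                        # e.g. the envelope schema sketched earlier
    .option("cleanSource", "archive")                     # move consumed files out of the landing folder
    .option("sourceArchiveDir", "Files/archive/order/")   # archive target, outside the source path
    .json("Files/landing/order/")
)
```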
Target Schema
By the end of the workshop, you'll have a complete silver layer:
