Streaming providers
Fleans publishes domain events (EvaluateConditionEvent, EvaluateActivationConditionEvent,
ExecuteScriptEvent, …) through Orleans Streams. The stream provider is pluggable at startup:
the default is Orleans’ in-memory provider, and a Kafka-backed provider ships alongside it for
cross-silo event durability.
Provider switch
The provider name is read once at silo startup from configuration:
| Configuration key | Values | Default |
|---|---|---|
| Fleans:Streaming:Provider | memory, kafka, azurequeue | memory |
| Fleans:Streaming:Kafka:Brokers | host:port[,…] | (none) |
| Fleans:Streaming:Kafka:ConsumerGroup | string | fleans |
| Fleans:Streaming:Kafka:TopicPrefix | string | fleans- |
| Fleans:Streaming:Kafka:QueueCount | integer | 1 |
| Fleans:Streaming:Kafka:NumPartitions | integer | 1 |
| Fleans:Streaming:Kafka:ReplicationFactor | integer | 1 |
| Fleans:Streaming:AzureQueue:ConnectionString | Azure Storage connection string | (none) |
| Fleans:Streaming:AzureQueue:AccountName | Azure Storage account name | (none) |
| Fleans:Streaming:AzureQueue:QueueNames | JSON array of strings | ["fleans-stream-0"…"fleans-stream-7"] |
| Fleans:Streaming:AzureQueue:MessageVisibilityTimeout | TimeSpan string (e.g. 00:01:00) | Azure SDK default (30 s) |
Matching on the provider value is case-insensitive. Any other value throws ArgumentException at startup;
add a new case to FleanStreamingExtensions.AddFleanStreaming if you ship another provider.
For azurequeue, exactly one of ConnectionString or AccountName must be set.
ConnectionString is used for local dev (Azurite). AccountName enables Managed Identity
(DefaultAzureCredential) for production Azure deployments.
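The validation rules above can be sketched in a few lines. This is a language-neutral illustration in Python, not the engine's C# code: the function name `resolve_provider` and the use of `ValueError` (standing in for ArgumentException) are assumptions made for the example.

```python
from typing import Mapping

# Provider names taken from the configuration table above.
KNOWN_PROVIDERS = ("memory", "kafka", "azurequeue")


def resolve_provider(config: Mapping[str, str]) -> str:
    """Return the normalized provider name or raise ValueError (hypothetical helper)."""
    raw = config.get("Fleans:Streaming:Provider", "memory")
    name = raw.lower()  # matching is case-insensitive
    if name not in KNOWN_PROVIDERS:
        raise ValueError(f"Unknown streaming provider: {raw!r}")
    if name == "azurequeue":
        conn = config.get("Fleans:Streaming:AzureQueue:ConnectionString")
        account = config.get("Fleans:Streaming:AzureQueue:AccountName")
        # Exactly one of the two auth settings must be present.
        if bool(conn) == bool(account):
            raise ValueError("Set exactly one of ConnectionString or AccountName")
    return name
```

With an empty configuration the sketch falls back to `memory`; setting both Azure Queue auth keys, or neither, is rejected.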
Choosing a provider
memory — the default
Backed by AddMemoryStreams. Zero infrastructure. Events are lost on silo crash. Subscription
state survives because Aspire wires PubSubStore to Redis, but in-flight stream messages do not.
Right for: local development, all default tests, and single-silo deployments where you accept the event-loss profile of a silo crash.
kafka — opt-in cross-silo durability
Backed by an in-repo Fleans.Streaming.Kafka adapter built on Confluent.Kafka 2.6.1
(librdkafka 2.6.x bundled).
Right for: production rollouts that need event durability beyond a single silo, CI environments that want broker parity with production, and topologies where a non-Orleans consumer may join the topics later (note: the codec is currently the Orleans codec — see limitations below).
azurequeue — opt-in cross-silo durability (Azure-native)
Backed by Microsoft.Orleans.Streaming.AzureStorage, the first-party Microsoft Orleans
adapter for Azure Queue Storage. This is a thin wrapper, not a custom adapter, so behaviour,
retry semantics, and dead-letter handling are governed by the Microsoft package.
Right for: Azure-native deployments that need cross-silo event durability without running Kafka.
Uses Azure Queue Storage (cheaper than Service Bus at scale; Orleans StreamSequenceToken
provides per-queue ordering). In dev mode, Aspire auto-provisions Azurite so no Azure
subscription is needed.
For auth options: set ConnectionString for Azurite or an explicit connection string; set
AccountName to use DefaultAzureCredential (Managed Identity, recommended for Container Apps).
Production-readiness gaps
The Kafka adapter shipped in v1 is intentionally minimal. The following gaps are tracked under #474 (Production-ready Kafka streaming); pick the tier that matches your risk tolerance.
🔴 Won’t connect
These are deploy-day blockers: the silo will fail to connect.
| Gap | Failure mode | Affected services |
|---|---|---|
| No SecurityProtocol knob | librdkafka prints Disconnected: SASL authentication required and the silo health check goes red | Confluent Cloud, Aiven, Redpanda Cloud, any TLS-only listener |
| No SASL/PLAIN, SCRAM, or OAUTHBEARER | librdkafka cannot authenticate; broker rejects the connection during the AdminClient GetMetadata warm-up | All managed brokers requiring credential auth |
| No AWS MSK IAM | librdkafka ssl.ca.location + token vendor not wired | Amazon MSK with IAM auth |
| No client-cert / mTLS | No ssl.certificate.pem / ssl.key.pem config | Self-managed clusters with mutual TLS |
🟡 Will lose data on failure
These are SLO trade-offs: the connection succeeds, but the durability profile is below what production deployments typically expect.
| Gap | Failure mode | Mitigation today |
|---|---|---|
| ReplicationFactor=1 default | Broker crash with unflushed log segments → events permanently lost | Set Fleans:Streaming:Kafka:ReplicationFactor=3 once you have a 3-broker cluster |
| At-least-once delivery (offset commit after handler success) | Silo crash between handler success and offset commit → event replays | Safe by design — every consumer routes to a WorkflowInstance method guarded by HasActiveEntry. Do not remove that guard. See Delivery contract |
| No idempotent producer | Producer retries can reorder messages within a partition | None today; consumer-side guard above absorbs duplicates |
🟢 Operational gaps
These won’t break a deploy, but they will cost you operationally.
| Gap | Why it hurts | Workaround |
|---|---|---|
| NumPartitions=1 default | New topics are created with 1 partition, so there is no broker-side write fan-out per topic | Bump Fleans:Streaming:Kafka:NumPartitions, then re-create existing topics or grow them with kafka-topics --alter --partitions N (forward-only; cannot shrink). See Tuning throughput |
| No schema registry integration | Codec is the Orleans codec; non-Orleans consumers cannot decode the stream | Treat Kafka topics as silo-internal until a follow-up adds Avro/Protobuf framing |
| No DLQ for poison messages | A consistently-failing handler replays forever | Manual offset commit via AdminClient is your only escape today |
| No metrics/health-check for consumer lag | Lag is invisible to the Aspire dashboard’s health UI | Run kafka-consumer-groups.sh --describe --group fleans against the broker |
Manual test plan #35 exercises the happy path against a single-broker Aspire-provisioned fleans-kafka resource — it validates that the at-least-once contract holds across a silo restart, but it does not test any production failure mode listed above.
Production-readiness gaps — Azure Queue Storage
The Azure Queue provider is backed by the Microsoft-maintained Microsoft.Orleans.Streaming.AzureStorage package, which covers several production concerns automatically.
| Concern | Status |
|---|---|
| Managed Identity (AccountName path) | ✅ Covered: uses DefaultAzureCredential |
| Message retry / poison-message handling | ✅ Covered: the Microsoft provider retries up to a configurable limit; exhausted messages move to the *-poison queue automatically |
| OTLP monitoring | ✅ Covered: Orleans Streams telemetry is emitted via Orleans.Telemetry; no extra wiring needed |
| Managed Identity token rotation | ✅ Covered: DefaultAzureCredential handles refresh transparently |
| Cross-queue message ordering | ⚠️ Per-queue FIFO only: ordering across the 8 default fleans-stream-* queues is not guaranteed (same trade-off as Kafka partitions) |
Manual test plan #44 exercises the happy path with Azurite.
Delivery contract — at-least-once
The Kafka adapter checkpoints offsets after the consumer-side handler completes. On a silo crash between handler success and offset commit, the next consumer instance resumes from the last committed offset, which means the in-flight event will replay.
The replay is safe because every Fleans consumer eventually routes to a method on
WorkflowInstance (CompleteActivity, FailActivity, …) that opens with a stale-callback
guard:

```csharp
if (!State.HasActiveEntry(activityInstanceId))
{
    LogStaleCallbackIgnored(...);
    return;
}
```

Once the activity instance has completed once, its entry leaves HasActiveEntry, so a second
delivery is a logged no-op. Do not remove that guard without re-reading this page:
it is what makes the at-least-once contract safe for the workflow engine.
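To see why the guard matters, the crash-and-replay sequence can be simulated in a toy model. The sketch below is Python rather than the engine's C#, and every name in it (`WorkflowInstance` as a plain class, `consume`, `crash_before_commit_at`) is hypothetical; it only mirrors the ordering described above, where the handler runs first and the offset is committed afterwards.

```python
class WorkflowInstance:
    """Toy stand-in for the grain; it only tracks active activity entries."""

    def __init__(self, active_ids):
        self.active = set(active_ids)
        self.completed = []      # side effects that were actually applied
        self.stale_ignored = 0   # replays absorbed by the guard

    def complete_activity(self, activity_instance_id):
        # Stale-callback guard: a replayed event finds no active entry.
        if activity_instance_id not in self.active:
            self.stale_ignored += 1
            return
        self.active.remove(activity_instance_id)
        self.completed.append(activity_instance_id)


def consume(log, committed_offset, instance, crash_before_commit_at=None):
    """Process events starting at committed_offset; return the new committed offset."""
    offset = committed_offset
    while offset < len(log):
        instance.complete_activity(log[offset])   # handler runs first
        if offset == crash_before_commit_at:
            return committed_offset               # simulated crash: commit never happened
        offset += 1
        committed_offset = offset                 # commit only after handler success
    return committed_offset


log = ["a1", "a2"]
wf = WorkflowInstance(["a1", "a2"])
committed = consume(log, 0, wf, crash_before_commit_at=1)  # crash after handling a2
committed = consume(log, committed, wf)                     # restart replays a2
```

After the restart, `a2` is delivered a second time, but the guard turns it into a counted no-op, so each activity is completed exactly once in this model.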
Topic provisioning
KafkaQueueAdapterFactory.CreateAdapter() runs a one-shot topic-ensure step on first activation:
- Build the expected topic list (`{TopicPrefix.TrimEnd('-')}-{0..QueueCount-1}`).
- Call `AdminClient.GetMetadata` to discover what already exists on the broker.
- Call `AdminClient.CreateTopicsAsync` for missing topics with the configured `NumPartitions` and `ReplicationFactor`.
- Treat `TopicAlreadyExists` and `NoError` as success (a peer silo may have just created the topic).
The broker-side auto.create.topics.enable setting is not required. It is mentioned here as
a fallback for self-managed clusters that prefer it, but managed clusters (Confluent Cloud, MSK)
disable it by default, and the client-side path is what we ship.
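The naming and diffing parts of the topic-ensure step can be sketched without a broker. This is an illustrative Python sketch, not the adapter's code: the helper names are hypothetical, and the calls to the real AdminClient are deliberately left out.

```python
def expected_topics(topic_prefix: str, queue_count: int) -> list[str]:
    """Build topic names as {prefix without trailing dashes}-{index}."""
    base = topic_prefix.rstrip("-")  # mirrors TrimEnd('-')
    return [f"{base}-{i}" for i in range(queue_count)]


def topics_to_create(expected: list[str], existing: list[str]) -> list[str]:
    """Return the expected topics that the broker metadata did not list."""
    existing_set = set(existing)
    return [t for t in expected if t not in existing_set]


def is_success(error_code: str) -> bool:
    """Both outcomes count as success; a peer silo may have won the race."""
    return error_code in ("NoError", "TopicAlreadyExists")
```

With the default prefix `fleans-` and a queue count of 3, the sketch yields `fleans-0`, `fleans-1`, and `fleans-2`; only the names missing from the broker metadata would be passed on for creation.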
Switching the streaming provider
Set Fleans__Streaming__Provider on every fleans-* workload via your deployment tool. The engine accepts four values: Redis (default), Memory, Kafka, AzureQueue. Both the Docker Compose bundle and the Helm chart ship with sensible defaults, so you only need an explicit switch when moving off the default.
The Docker Compose bundle ships with Redis as the default streaming provider, so you only need an override file to switch away from Redis.
Drop a docker-compose.override.yaml next to the bundle’s docker-compose.yaml with the same env var on every fleans-* service:
```yaml
services:
  fleans-core:
    environment:
      Fleans__Streaming__Provider: Memory  # or Kafka, AzureQueue
  fleans-management:
    environment:
      Fleans__Streaming__Provider: Memory
  fleans-mcp:
    environment:
      Fleans__Streaming__Provider: Memory
  fleans-worker:
    environment:
      Fleans__Streaming__Provider: Memory
```

For Kafka, add Fleans__Streaming__Kafka__Brokers: "kafka:9092" (replace with your broker host:port). For Azure Queue, add Fleans__Streaming__AzureQueue__ConnectionString: "<your-connection-string>". Then run docker compose up -d from the bundle directory; Compose merges the override automatically.
This streaming override is a pure service-level env addition (unlike the plugin-host override in Writing custom-task plugins, which depends on the fleans_default network being created by an engine up first). You can docker compose down && docker compose up -d from the bundle directory at any time to apply the change — engine-first ordering is not required.
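The environment variable Fleans__Streaming__Provider and the configuration key Fleans:Streaming:Provider name the same setting: the .NET environment-variable configuration provider treats a double underscore as the `:` section separator. A minimal Python sketch of that mapping follows; `env_to_config` is a hypothetical helper for illustration, not part of the engine.

```python
from typing import Mapping


def env_to_config(environ: Mapping[str, str], prefix: str = "Fleans__") -> dict:
    """Map Fleans__A__B style variables to Fleans:A:B style configuration keys."""
    config = {}
    for name, value in environ.items():
        if name.startswith(prefix):
            # Double underscore is the environment-safe spelling of ':'.
            config[name.replace("__", ":")] = value
    return config
```

For example, an environment containing Fleans__Streaming__Kafka__Brokers maps to the key Fleans:Streaming:Kafka:Brokers, while unrelated variables are ignored.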
Memory is the Helm chart default; nothing to set unless you’re switching away from it.
```shell
# Kafka: the chart's first-class supported provider.
helm install fleans nightbaker/fleans \
  --set streaming.provider=Kafka \
  --set streaming.kafka.brokers="kafka.kafka.svc:9092"

# Redis: via extraEnv, until #599 lands.
helm install fleans nightbaker/fleans \
  --set 'extraEnv[0].name=Fleans__Streaming__Provider' \
  --set 'extraEnv[0].value=Redis'

# Azure Queue Storage: via extraEnv.
helm install fleans nightbaker/fleans \
  --set 'extraEnv[0].name=Fleans__Streaming__Provider' \
  --set 'extraEnv[0].value=AzureQueue' \
  --set 'extraEnv[1].name=Fleans__Streaming__AzureQueue__ConnectionString' \
  --set 'extraEnv[1].value=<your-connection-string>'
```

(For the source-tree dev workflow, which is only relevant if you cloned the engine repo, see the engine CLAUDE.md’s Things to Know section for the AppHost-side FLEANS_STREAMING_PROVIDER knob.)
Provider notes
Redis (default): Reuses the same Redis instance the engine already runs for clustering and PubSubStore, so stream events cost zero extra infrastructure. Uses the third-party Universley.OrleansContrib.StreamsProvider.Redis package (MIT, Orleans 10.x-compatible), pinned in Fleans.ServiceDefaults.csproj. Production requirement: enable Redis persistence (AOF or RDB). Without it, a Redis restart loses in-flight messages (same caveat as PubSubStore). The chart’s bundled fleans-redis enables AOF by default; see Self-host with Helm for the Redis subchart values.
Memory: No external infrastructure. Drops in-flight events on silo restart — single-silo debug-only. Suitable for local dev and CI smoke tests; never for production.
Kafka: Supply your own Kafka cluster — the chart’s streaming.kafka.brokers value takes a Kafka bootstrap-server string. For an in-cluster Kafka, the Bitnami Kafka Helm chart is the conventional choice. The silo forwards Fleans__Streaming__Provider=Kafka + Fleans__Streaming__Kafka__Brokers to the Kafka producer.
AzureQueue: Supply an Azure Queue Storage connection string — see Microsoft’s Azure Storage connection-string docs for the format. For local dev, run Azurite as a sibling Compose service and point the connection string at it. The silo forwards Fleans__Streaming__Provider=AzureQueue + Fleans__Streaming__AzureQueue__ConnectionString to the Azure Queue producer.
Connection multiplexer aliasing (Redis only)
The third-party Redis stream provider resolves a non-keyed IConnectionMultiplexer from DI. Aspire registers it as a keyed service (AddKeyedRedisClient("orleans-redis")). FleanStreamingExtensions.AddRedisStreams bridges the two with TryAddSingleton<IConnectionMultiplexer>(...), reusing the same connection pool with no duplicate sockets. If another component needs a different non-keyed multiplexer, register it explicitly before AddFleanStreaming runs and TryAddSingleton will defer to it.
PubSubStore vs. event durability
PubSubStore is the Orleans grain-storage location for subscription state (who is
subscribed to which stream id). Aspire wires it to Redis in Fleans.Aspire/Program.cs, so
subscription state already survives silo restarts regardless of the streaming provider.
What changes when you opt into Kafka is event durability — the bytes flowing between
WorkflowEventsPublisher.Publish and WorkflowExecuteScriptEventHandler.OnNextAsync. With the
in-memory provider, those bytes live in the silo’s process memory. With Kafka, they land in a
broker topic before the handler runs.
Compatibility note
Confluent.Kafka 2.6.1 / librdkafka 2.6.x are the supported versions. The adapter targets
net10.0 (matching every other silo csproj in this repo).
Tuning throughput
Throughput across the queue-backed providers is governed by two independent classes of knobs. The first is an Orleans-level concept shared by every provider; the second is Kafka-specific and has no analogue in the other providers.
Orleans parallelism (all queue-backed providers)
Fleans:Streaming:Redis:TotalQueueCount, Fleans:Streaming:Kafka:QueueCount, and the length of Fleans:Streaming:AzureQueue:QueueNames all set the same thing: the number of Orleans pulling-agent grains that activate across the cluster. Stream IDs hash across this many partitions; each partition is consumed by one pulling agent. More partitions mean more parallel consumption.
| Provider | Knob | Default | Notes |
|---|---|---|---|
| Redis | Fleans__Streaming__Redis__TotalQueueCount | 8 | Configurable as of v0.2.0 (#567). |
| Kafka | Fleans__Streaming__Kafka__QueueCount | 8 | Was 1 before v0.2.0 (#567); this is a behavior change for deployments that left it unset. |
| AzureQueue | Fleans__Streaming__AzureQueue__QueueNames__0..N | 8 entries (fleans-stream-0..7) | No scalar count — tuning above OR below 8 requires populating the explicit list by hand. Mid-deployment length changes rehash Stream IDs across queues. |
Sizing heuristic. Start with max(8, 2 × silo_count × cpu_per_silo) for high-volume deployments and measure with the Orleans dashboard (wired at Fleans.Api/Program.cs) before raising further. In a multi-silo deployment, agents distribute across silos via the hash ring — total cluster-wide count = the configured value, per-silo count ≈ ceil(value / silo_count).
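As a worked example of the heuristic, the two formulas can be written out directly. The function names below are hypothetical; the arithmetic is exactly the max(8, 2 × silo_count × cpu_per_silo) starting point and the ceil(value / silo_count) per-silo estimate from the paragraph above.

```python
import math


def suggested_queue_count(silo_count: int, cpu_per_silo: int) -> int:
    """Starting point for high-volume deployments; measure before raising further."""
    return max(8, 2 * silo_count * cpu_per_silo)


def agents_per_silo(queue_count: int, silo_count: int) -> int:
    """Approximate per-silo pulling-agent count once agents spread across silos."""
    return math.ceil(queue_count / silo_count)
```

For 3 silos with 4 CPUs each, this gives a cluster-wide count of 24 and roughly 8 agents per silo; a single 2-CPU silo stays at the floor of 8.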
Rehash caveat (all providers). Bumping the count rehashes Stream IDs across queues. Consistent with the project’s pre-v1 stance, expect in-flight workflow stalls across the bump window — no formal drain procedure is provided. After the bump, new workflow instances distribute over the new queue count; in-flight instances that straddle the rehash may stall until reactivation picks them up on the new queue assignment.
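The size of the rehash effect can be illustrated with a simple hash-modulo mapping. This is an assumption made purely for illustration: the actual Orleans queue mapper may assign stream IDs differently, and `queue_for` and `moved_fraction` are hypothetical names.

```python
import hashlib


def queue_for(stream_id: str, queue_count: int) -> int:
    """Assumed mapping: stable hash of the stream ID, modulo the queue count."""
    digest = hashlib.sha256(stream_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % queue_count


def moved_fraction(stream_ids, old_count: int, new_count: int) -> float:
    """Share of stream IDs whose queue assignment changes after a count bump."""
    moved = sum(
        1 for s in stream_ids
        if queue_for(s, old_count) != queue_for(s, new_count)
    )
    return moved / len(stream_ids)


stream_ids = [f"workflow-{i}" for i in range(1000)]
fraction_moved = moved_fraction(stream_ids, 8, 12)
```

Under this assumed mapping, a large share of stream IDs lands on a different queue after going from 8 to 12 queues, which is why in-flight instances can stall across the bump window.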
Tradeoff. Each partition adds a pulling-agent grain per silo plus provider-side resource overhead (a Redis connection, a Kafka consumer, an AzureQueue receiver lease).
Production caveats already documented elsewhere. Redis streaming requires AOF or RDB persistence — see the Streaming section above; Kafka requires its own cluster lifecycle ops — see Production-readiness gaps — Kafka.
Kafka-specific broker-side tuning
These knobs control the Kafka broker-side topology only. They are independent of QueueCount and do not multiply Orleans consumer parallelism.
| Knob | Default | What it does |
|---|---|---|
| Fleans__Streaming__Kafka__NumPartitions | 1 | Partition count for each Kafka topic at topic creation. Adds broker-side write parallelism (the producer can write to N partitions concurrently) and partition-level ordering granularity within one Orleans queue. Does not add Orleans consumer parallelism: the consumer subscribes via consumer-group Subscribe mode (one consumer per topic, regardless of partition count). Forward-only: kafka-topics --alter --partitions N can grow but not shrink, and downgrading the engine after a bump leaves topics over-partitioned. |
| Fleans__Streaming__Kafka__ReplicationFactor | 1 | Per-topic replication factor at topic creation. Production deployments should set this to 3 once a 3-broker cluster is available; see the durability gap above. |
See also
- Observability: health checks, metrics, logging, tracing, dashboards, alerting.
- Deployment: how to wire `Fleans:Streaming:Provider` / `Fleans:Streaming:Kafka:Brokers` into Docker Compose, Kubernetes, and bare-VM deployments.
- Self-Hosting on Kubernetes: Kafka opt-in on the Helm chart.