Streaming providers

Fleans publishes domain events (EvaluateConditionEvent, EvaluateActivationConditionEvent, ExecuteScriptEvent, …) through Orleans Streams. The stream provider is pluggable at startup: the default is Orleans’ in-memory provider, and a Kafka-backed provider ships alongside it for cross-silo event durability.

Provider switch

The provider name is read once at silo startup from configuration:

Configuration key	Values	Default
`Fleans:Streaming:Provider`	`memory`, `kafka`, `azurequeue`	`memory`
`Fleans:Streaming:Kafka:Brokers`	`host:port[,…]`	—
`Fleans:Streaming:Kafka:ConsumerGroup`	string	`fleans`
`Fleans:Streaming:Kafka:TopicPrefix`	string	`fleans-`
`Fleans:Streaming:Kafka:QueueCount`	integer	`1`
`Fleans:Streaming:Kafka:NumPartitions`	integer	`1`
`Fleans:Streaming:Kafka:ReplicationFactor`	integer	`1`
`Fleans:Streaming:AzureQueue:ConnectionString`	Azure Storage connection string	—
`Fleans:Streaming:AzureQueue:AccountName`	Azure Storage account name	—
`Fleans:Streaming:AzureQueue:QueueNames`	JSON array of strings	`["fleans-stream-0"…"fleans-stream-7"]`
`Fleans:Streaming:AzureQueue:MessageVisibilityTimeout`	TimeSpan string (e.g. `00:01:00`)	Azure SDK default (30 s)

Match is case-insensitive. Any other value throws ArgumentException at startup — add a new case to FleanStreamingExtensions.AddFleanStreaming if you ship another provider.

For azurequeue, exactly one of ConnectionString or AccountName must be set. ConnectionString is used for local dev (Azurite). AccountName enables Managed Identity (DefaultAzureCredential) for production Azure deployments.

Choosing a provider

`memory` — the default

Backed by AddMemoryStreams. Zero infrastructure. Events are lost on silo crash. Subscription state survives because Aspire wires PubSubStore to Redis, but in-flight stream messages do not.

Right for: local development, all default tests, and single-silo deployments where you accept the event-loss profile of a silo crash.

`kafka` — opt-in cross-silo durability

Backed by an in-repo Fleans.Streaming.Kafka adapter built on Confluent.Kafka 2.6.1 (librdkafka 2.6.x bundled).

Right for: production rollouts that need event durability beyond a single silo, CI environments that want broker parity with production, and topologies where a non-Orleans consumer may join the topics later (note: the codec is currently the Orleans codec — see limitations below).

`azurequeue` — opt-in cross-silo durability (Azure-native)

Backed by Microsoft.Orleans.Streaming.AzureStorage — the first-party Microsoft Orleans adapter for Azure Queue Storage. This is a thin wrapper, not a custom adapter, so behaviour, retry semantics, and dead-letter handling are governed by the Microsoft package.

Right for: Azure-native deployments that need cross-silo event durability without running Kafka. Uses Azure Queue Storage (cheaper than Service Bus at scale; Orleans StreamSequenceToken provides per-queue ordering). In dev mode, Aspire auto-provisions Azurite so no Azure subscription is needed.

For auth options: set ConnectionString for Azurite or an explicit connection string; set AccountName to use DefaultAzureCredential (Managed Identity, recommended for Container Apps).

Production-readiness gaps

The Kafka adapter shipped in v1 is intentionally minimal. The following gaps are tracked under #474 — Production-ready Kafka streaming; pick the tier that matches your risk tolerance.

🔴 Won’t connect

These are deploy-day blockers — your silo will fail to connect, period.

Gap	Failure mode	Affected services
No `SecurityProtocol` knob	librdkafka prints `Disconnected: SASL authentication required` and the silo health check goes red	Confluent Cloud, Aiven, Redpanda Cloud, any TLS-only listener
No SASL/PLAIN, SCRAM, or OAUTHBEARER	librdkafka cannot authenticate; broker rejects the connection during the AdminClient `GetMetadata` warm-up	All managed brokers requiring credential auth
No AWS MSK IAM	librdkafka `ssl.ca.location` + token vendor not wired	Amazon MSK with IAM auth
No client-cert / mTLS	No `ssl.certificate.pem` / `ssl.key.pem` config	Self-managed clusters with mutual TLS

🟡 Will lose data on failure

These are SLO trade-offs — connection succeeds but the durability profile is below production-typical.

Gap	Failure mode	Mitigation today
`ReplicationFactor=1` default	Broker crash with unflushed log segments → events permanently lost	Set `Fleans:Streaming:Kafka:ReplicationFactor=3` once you have a 3-broker cluster
At-least-once delivery (offset commit after handler success)	Silo crash between handler success and offset commit → event replays	Safe by design — every consumer routes to a `WorkflowInstance` method guarded by `HasActiveEntry`. Do not remove that guard. See Delivery contract
No idempotent producer	Producer retries can reorder messages within a partition	None today; consumer-side guard above absorbs duplicates

🟢 Operational gaps

These won’t break a deploy, but they will cost you operationally.

Gap	Why it hurts	Workaround
`NumPartitions=1` default	New topics created with 1 partition — no broker-side write fan-out per topic	Bump `Fleans:Streaming:Kafka:NumPartitions` and re-create or `kafka-topics --alter --partitions N` existing topics (forward-only; cannot shrink). See Tuning throughput
No schema registry integration	Codec is the Orleans codec; non-Orleans consumers cannot decode the stream	Treat Kafka topics as silo-internal until a follow-up adds Avro/Protobuf framing
No DLQ for poison messages	A consistently-failing handler replays forever	Manual offset commit via `AdminClient` is your only escape today
No metrics/health-check for consumer lag	Lag is invisible to the Aspire dashboard’s health UI	Run `kafka-consumer-groups.sh --describe --group fleans` against the broker

Manual test plan #35 exercises the happy path against a single-broker Aspire-provisioned fleans-kafka resource — it validates that the at-least-once contract holds across a silo restart, but it does not test any production failure mode listed above.

Production-readiness gaps — Azure Queue Storage

The Azure Queue provider is backed by the Microsoft-maintained Microsoft.Orleans.Streaming.AzureStorage package, which covers several production concerns automatically.

Concern	Status
Managed Identity (`AccountName` path)	✅ Covered — uses `DefaultAzureCredential`
Message retry / poison-message handling	✅ Covered — MS provider retries up to configurable limit; exhausted messages move to `*-poison` queue automatically
OTLP monitoring	✅ Covered — Orleans Streams telemetry emitted via `Orleans.Telemetry`; no extra wiring needed
Managed Identity token rotation	✅ `DefaultAzureCredential` handles refresh transparently
Cross-queue message ordering	⚠️ Per-queue FIFO only — ordering across the 8 default `fleans-stream-*` queues is not guaranteed (same trade-off as Kafka partitions)

Manual test plan #44 exercises the happy path with Azurite.

Delivery contract — at-least-once

The Kafka adapter checkpoints offsets after the consumer-side handler completes. On a silo crash between handler success and offset commit, the next consumer instance resumes from the last committed offset, which means the in-flight event will replay.

The replay is safe because every Fleans consumer eventually routes to a method on WorkflowInstance (CompleteActivity, FailActivity, …) that opens with a stale-callback guard:

if (!State.HasActiveEntry(activityInstanceId))
{
    LogStaleCallbackIgnored(...);
    return;
}

Once the activity instance has completed once, its entry leaves HasActiveEntry, so a second delivery is a logged no-op. Do not remove that guard without re-reading this page — it is what makes the at-least-once contract safe for the workflow engine.

Topic provisioning

KafkaQueueAdapterFactory.CreateAdapter() runs a one-shot topic-ensure step on first activation:

Build the expected topic list ({TopicPrefix.TrimEnd('-')}-{0..QueueCount-1}).
AdminClient.GetMetadata to discover what already exists on the broker.
AdminClient.CreateTopicsAsync for missing topics with the configured NumPartitions and ReplicationFactor.
TopicAlreadyExists and NoError are both treated as success (a peer silo may have just created the topic).

The broker-side allow.auto.create.topics flag is not required — it is documented here as a fallback for self-managed clusters that prefer it, but managed clusters (Confluent Cloud, MSK) disable it by default and the client-side path is what we ship.

Switching the streaming provider

Set Fleans__Streaming__Provider on every fleans-* workload via your deployment tool. The engine accepts four values: Redis (default), Memory, Kafka, AzureQueue. Both the Docker Compose bundle and the Helm chart ship with sensible defaults — you only need an explicit switch when moving off the default.

Compose bundle
Helm

The bundle ships with Redis as the default streaming provider. You only need an override file to switch away from Redis — for the production-default Redis behaviour, no edits are required.

Drop a docker-compose.override.yaml next to the bundle’s docker-compose.yaml with the same env var on every fleans-* service:

services:
  fleans-core:
    environment:
      Fleans__Streaming__Provider: Memory    # or Kafka, AzureQueue
  fleans-management:
    environment:
      Fleans__Streaming__Provider: Memory
  fleans-mcp:
    environment:
      Fleans__Streaming__Provider: Memory
  fleans-worker:
    environment:
      Fleans__Streaming__Provider: Memory

For Kafka, add Fleans__Streaming__Kafka__Brokers: "kafka:9092" (replace with your broker host:port). For Azure Queue, add Fleans__Streaming__AzureQueue__ConnectionString: "<your-connection-string>". Then docker compose up -d from the bundle directory — Compose merges the override automatically.

This streaming override is a pure service-level env addition (unlike the plugin-host override in Writing custom-task plugins, which depends on the fleans_default network being created by an engine up first). You can docker compose down && docker compose up -d from the bundle directory at any time to apply the change — engine-first ordering is not required.

Memory is the chart default; nothing to set unless you’re switching away from it.

# Kafka — the chart's first-class supported provider.
helm install fleans nightbaker/fleans \
  --set streaming.provider=Kafka \
  --set streaming.kafka.brokers="kafka.kafka.svc:9092"

# Redis — via extraEnv, until #599 lands.
helm install fleans nightbaker/fleans \
  --set 'extraEnv[0].name=Fleans__Streaming__Provider' \
  --set 'extraEnv[0].value=Redis'

# Azure Queue Storage — via extraEnv.
helm install fleans nightbaker/fleans \
  --set 'extraEnv[0].name=Fleans__Streaming__Provider' \
  --set 'extraEnv[0].value=AzureQueue' \
  --set 'extraEnv[1].name=Fleans__Streaming__AzureQueue__ConnectionString' \
  --set 'extraEnv[1].value=<your-connection-string>'

(For the source-tree dev workflow — only relevant if you cloned the engine repo — see the engine CLAUDE.md’s Things to Know section for the AppHost-side FLEANS_STREAMING_PROVIDER knob.)

Provider notes

Redis (default): Reuses the same Redis instance the engine already runs for clustering and PubSubStore, so stream events cost zero extra infrastructure. Uses the third-party Universley.OrleansContrib.StreamsProvider.Redis package (MIT, Orleans 10.x-compatible), pinned in Fleans.ServiceDefaults.csproj. Production requirement: enable Redis persistence (AOF or RDB) — without it, a Redis restart loses in-flight messages (same caveat as PubSubStore). The chart’s bundled fleans-redis enables AOF by default; see Self-host with Helm for the Redis subchart values.

Memory: No external infrastructure. Drops in-flight events on silo restart — single-silo debug-only. Suitable for local dev and CI smoke tests; never for production.

Kafka: Supply your own Kafka cluster — the chart’s streaming.kafka.brokers value takes a Kafka bootstrap-server string. For an in-cluster Kafka, the Bitnami Kafka Helm chart is the conventional choice. The silo forwards Fleans__Streaming__Provider=Kafka + Fleans__Streaming__Kafka__Brokers to the Kafka producer.

AzureQueue: Supply an Azure Queue Storage connection string — see Microsoft’s Azure Storage connection-string docs for the format. For local dev, run Azurite as a sibling Compose service and point the connection string at it. The silo forwards Fleans__Streaming__Provider=AzureQueue + Fleans__Streaming__AzureQueue__ConnectionString to the Azure Queue producer.

Connection multiplexer aliasing (Redis only)

The third-party Redis stream provider resolves a non-keyed IConnectionMultiplexer from DI. Aspire registers it as a keyed service (AddKeyedRedisClient("orleans-redis")). FleanStreamingExtensions.AddRedisStreams bridges the two with TryAddSingleton<IConnectionMultiplexer>(...) — reusing the same connection pool, no duplicate sockets. If another component needs a different non-keyed multiplexer, register it explicitly before AddFleanStreaming runs and TryAddSingleton will defer.

PubSubStore vs. event durability

PubSubStore is the Orleans grain-storage location for subscription state (who is subscribed to which stream id). Aspire wires it to Redis at Fleans.Aspire/Program.cs, so subscription state already survives silo restarts under both providers.

What changes when you opt into Kafka is event durability — the bytes flowing between WorkflowEventsPublisher.Publish and WorkflowExecuteScriptEventHandler.OnNextAsync. With the in-memory provider, those bytes live in the silo’s process memory. With Kafka, they land in a broker topic before the handler runs.

Compatibility note

Confluent.Kafka 2.6.1 / librdkafka 2.6.x are the supported versions. The adapter targets net10.0 (matching every other silo csproj in this repo).

Tuning throughput

Throughput across the queue-backed providers is governed by two independent classes of knobs. The first is an Orleans-level concept shared by every provider; the second is Kafka-specific and does not have analogues elsewhere.

Orleans parallelism (all queue-backed providers)

Fleans:Streaming:Redis:TotalQueueCount, Fleans:Streaming:Kafka:QueueCount, and the length of Fleans:Streaming:AzureQueue:QueueNames all set the same thing: the number of Orleans pulling-agent grains that activate across the cluster. Stream IDs hash across this many partitions; each partition is consumed by one pulling agent. More partitions = more parallel consumption.

Provider	Knob	Default	Notes
Redis	`Fleans__Streaming__Redis__TotalQueueCount`	`8`	Configurable as of v0.2.0 (#567).
Kafka	`Fleans__Streaming__Kafka__QueueCount`	`8`	Was `1` before v0.2.0 (#567) — call out as a behavior change for deployments that left it unset.
AzureQueue	`Fleans__Streaming__AzureQueue__QueueNames__0..N`	8 entries (`fleans-stream-0..7`)	No scalar count — tuning above OR below 8 requires populating the explicit list by hand. Mid-deployment length changes rehash Stream IDs across queues.

Sizing heuristic. Start with max(8, 2 × silo_count × cpu_per_silo) for high-volume deployments and measure with the Orleans dashboard (wired at Fleans.Api/Program.cs) before raising further. In a multi-silo deployment, agents distribute across silos via the hash ring — total cluster-wide count = the configured value, per-silo count ≈ ceil(value / silo_count).

Rehash caveat (all providers). Bumping the count rehashes Stream IDs across queues. Consistent with the project’s pre-v1 stance, expect in-flight workflow stalls across the bump window — no formal drain procedure is provided. After the bump, new workflow instances distribute over the new queue count; in-flight instances that straddle the rehash may stall until reactivation picks them up on the new queue assignment.

Tradeoff. Each partition adds a pulling-agent grain per silo plus provider-side resource overhead (a Redis connection, a Kafka consumer, an AzureQueue receiver lease).

Production caveats already documented elsewhere. Redis streaming requires AOF or RDB persistence — see the Streaming section above; Kafka requires its own cluster lifecycle ops — see Production-readiness gaps — Kafka.

Kafka-specific Kafka-side tuning

These knobs control the Kafka broker-side topology only. They are independent of QueueCount and do not multiply Orleans consumer parallelism.

Knob	Default	What it does
`Fleans__Streaming__Kafka__NumPartitions`	`1`	Partition count for each Kafka topic at topic creation. Adds broker-side write parallelism (the producer can write to N partitions concurrently) and partition-level ordering granularity within one Orleans queue. Does not add Orleans consumer parallelism — the consumer subscribes via consumer-group `Subscribe` mode (one consumer per topic, regardless of partition count). Forward-only: `kafka-topics --alter --partitions N` can grow but not shrink, and downgrading the engine after a bump leaves topics over-partitioned.
`Fleans__Streaming__Kafka__ReplicationFactor`	`1`	Per-topic replication factor at topic creation. Production deployments should set to `3` once a 3-broker cluster is available — see the durability gap above.