Long-Running Ensembles¶
AgentEnsemble v3.0 introduces long-running mode: an ensemble that starts, listens for work, and runs continuously until explicitly stopped. This is the foundation for the Ensemble Network -- distributed multi-ensemble systems where autonomous ensembles communicate peer-to-peer.
One-shot vs. Long-running¶
| Mode | Description | Example |
|---|---|---|
| One-shot (run()) | Execute tasks, return output, done. | Research + report generation |
| Long-running (start()) | Bind a port, accept work, run until stopped. | Kitchen service in a hotel |
The existing Ensemble.run() API is completely unchanged.
Lifecycle States¶
A long-running ensemble transitions through four states:
| State | Behavior | Accepting work? |
|---|---|---|
| STARTING | Binding server port, registering capabilities | No |
| READY | Running, accepting and processing work | Yes |
| DRAINING | Finishing in-flight work, rejecting new requests | No |
| STOPPED | Shutdown complete, connections closed | No |
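The table above can be read as a small state machine. The following is a minimal sketch, assuming a strictly forward lifecycle (STARTING, then READY, then DRAINING, then STOPPED); the LifecycleState enum and its methods are hypothetical names for illustration and are not taken from the AgentEnsemble API.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of the four lifecycle states; not the library's own type.
enum LifecycleState {
    STARTING, READY, DRAINING, STOPPED;

    /** Only READY accepts new work, matching the "Accepting work?" column. */
    boolean acceptsWork() {
        return this == READY;
    }

    /** States reachable in one step, assuming the lifecycle only moves forward. */
    Set<LifecycleState> next() {
        switch (this) {
            case STARTING: return EnumSet.of(READY);
            case READY:    return EnumSet.of(DRAINING);
            case DRAINING: return EnumSet.of(STOPPED);
            default:       return EnumSet.noneOf(LifecycleState.class); // STOPPED is terminal
        }
    }

    boolean canTransitionTo(LifecycleState target) {
        return next().contains(target);
    }
}
```

Modelling the transitions explicitly makes it easy to reject illegal moves, such as restarting a STOPPED ensemble, in one place.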
Starting and Stopping¶
Long-running mode requires a dashboard for WebSocket connectivity. Configure one
via .webDashboard(...) before calling start():
// 1. Create the WebDashboard bound to the desired port
WebDashboard dashboard = WebDashboard.builder().port(7329).build();

// 2. Build the ensemble with the dashboard wired in
Ensemble kitchen = Ensemble.builder()
    .chatLanguageModel(model)
    .task(Task.of("Manage kitchen operations"))
    .shareTask("prepare-meal", mealTask)
    .shareTool("check-inventory", inventoryTool)
    .webDashboard(dashboard) // required; also starts the server
    .build();

// 3. Transition to READY state and register the shutdown hook
kitchen.start(7329); // port is advisory for error messages / logs

// ... ensemble runs until stopped ...

kitchen.stop(); // DRAINING -> STOPPED
Idempotency¶
- Calling start() on an already-started ensemble is a no-op.
- Calling stop() on an already-stopped or never-started ensemble is a no-op.
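One common way to get this kind of idempotency is a compare-and-set on the current state, so that only the first caller performs the transition. The sketch below illustrates the idea under that assumption; IdempotentService is a hypothetical stand-in and does not reflect how AgentEnsemble implements start() and stop().

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a long-running service; not AgentEnsemble's implementation.
class IdempotentService {
    enum State { NEW, READY, STOPPED }

    private final AtomicReference<State> state = new AtomicReference<>(State.NEW);

    /** Returns true only for the single call that actually performed the start. */
    boolean start() {
        // Succeeds only if the service has never been started; otherwise a no-op.
        return state.compareAndSet(State.NEW, State.READY);
    }

    /** Returns true only for the single call that actually performed the stop. */
    boolean stop() {
        // Succeeds only if the service is running; never-started or stopped is a no-op.
        return state.compareAndSet(State.READY, State.STOPPED);
    }

    State state() {
        return state.get();
    }
}
```

Because the check and the update happen atomically, concurrent callers cannot both run the start or stop logic.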
Graceful Shutdown¶
When stop() is called, the ensemble transitions to DRAINING, stops the WebSocket server
(if this ensemble owns the dashboard lifecycle), and then transitions to STOPPED.
A JVM shutdown hook is registered automatically, so SIGTERM triggers the same graceful shutdown.
The drainTimeout field can already be configured, but it is not yet enforced: a future
implementation will use it to wait for in-flight tasks to complete before stopping.
Ensemble kitchen = Ensemble.builder()
    .chatLanguageModel(model)
    .task(Task.of("Manage kitchen operations"))
    .drainTimeout(Duration.ofMinutes(2)) // Configurable; default: 5 minutes
    .build();
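To illustrate what honoring a drain timeout could look like, here is a minimal sketch built on a plain ExecutorService and a JVM shutdown hook. It is not the library's current or planned implementation; DrainingWorker and its methods are hypothetical names, and the thread pool stands in for whatever executes in-flight tasks.

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; the worker pool and drain logic are illustrative only.
class DrainingWorker {
    private final ExecutorService pool = Executors.newFixedThreadPool(2);
    private final Duration drainTimeout;

    DrainingWorker(Duration drainTimeout) {
        this.drainTimeout = drainTimeout;
    }

    /** Throws RejectedExecutionException once stop() has been called. */
    void submit(Runnable work) {
        pool.submit(work);
    }

    /** Rejects new work, then waits up to drainTimeout. Returns true if all work finished. */
    boolean stop() {
        pool.shutdown(); // DRAINING: no new submissions, in-flight tasks keep running
        try {
            if (pool.awaitTermination(drainTimeout.toMillis(), TimeUnit.MILLISECONDS)) {
                return true; // STOPPED with nothing left in flight
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
        }
        pool.shutdownNow(); // timeout elapsed or interrupted: cancel remaining tasks
        return false;
    }

    /** Registers a JVM shutdown hook so that SIGTERM runs the same stop() path. */
    void registerShutdownHook() {
        Runtime.getRuntime().addShutdownHook(new Thread(this::stop, "drain-hook"));
    }
}
```

The boolean result lets a caller log whether the drain completed cleanly or was cut off by the timeout.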
Sharing Tasks and Tools¶
Long-running ensembles can share capabilities with the network:
Share a Task¶
A shared task is a full task that other ensembles can delegate work to:
Task mealTask = Task.builder()
    .description("Prepare a meal as specified")
    .expectedOutput("Confirmation with preparation details and timing")
    .build();

Ensemble.builder()
    .chatLanguageModel(model)
    .task(Task.of("Manage kitchen operations"))
    .shareTask("prepare-meal", mealTask)
    .build();
Share a Tool¶
A shared tool is a single tool that other ensembles' agents can invoke remotely:
Ensemble.builder()
    .chatLanguageModel(model)
    .task(Task.of("Manage kitchen operations"))
    .shareTool("check-inventory", inventoryTool)
    .shareTool("dietary-check", allergyCheckTool)
    .build();
Validation¶
- Shared capability names must be unique within an ensemble.
- Names must not be null or blank.
- Task/tool references must not be null.
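The three rules above can be sketched as a small registry that checks each capability as it is added. This is an illustration only: CapabilityRegistry is a hypothetical class, the exception types are a guess, and it assumes tasks and tools share a single namespace within an ensemble.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical registry illustrating the validation rules; not the library's class.
class CapabilityRegistry {
    private final Map<String, Object> capabilities = new LinkedHashMap<>();

    void share(String name, Object taskOrTool) {
        // Rule: names must not be null or blank.
        if (name == null || name.isBlank()) {
            throw new IllegalArgumentException("capability name must not be null or blank");
        }
        // Rule: task/tool references must not be null.
        Objects.requireNonNull(taskOrTool, "task/tool reference must not be null");
        // Rule: names must be unique (assumes one namespace for tasks and tools).
        if (capabilities.putIfAbsent(name, taskOrTool) != null) {
            throw new IllegalArgumentException("duplicate capability name: " + name);
        }
    }

    int size() {
        return capabilities.size();
    }
}
```

Failing fast at registration time means a misconfigured ensemble is caught before it starts accepting connections.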
Capability Handshake¶
When a client connects to a long-running ensemble via WebSocket, the server sends a
hello message that includes the ensemble's shared capabilities. Because
HelloMessage uses @JsonInclude(NON_NULL), null fields are omitted from the wire payload:
{
  "type": "hello",
  "ensembleId": "run-abc123",
  "sharedCapabilities": [
    {"name": "prepare-meal", "description": "Prepare a meal as specified", "type": "TASK"},
    {"name": "check-inventory", "description": "Check ingredient availability", "type": "TOOL"}
  ]
}
This is backward compatible with v2.x clients because MessageSerializer configures
Jackson with FAIL_ON_UNKNOWN_PROPERTIES = false, so older clients simply ignore the new
sharedCapabilities field.
K8s Health and Lifecycle Endpoints¶
Long-running ensembles expose HTTP endpoints for Kubernetes health probes and lifecycle management:
| Endpoint | Method | Purpose |
|---|---|---|
| /api/health/live | GET | Liveness probe -- returns 200 when the process is alive |
| /api/health/ready | GET | Readiness probe -- returns 200 only in READY state; 503 otherwise |
| /api/lifecycle/drain | POST | Triggers transition to DRAINING state |
| /api/status | GET | Extended status including lifecycleState field |
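The probe semantics in the table can be sketched with the JDK's built-in com.sun.net.httpserver.HttpServer. This is a hedged illustration of the status-code mapping only, not the server AgentEnsemble actually runs; HealthProbes and its methods are hypothetical, and the lifecycle state is passed in as a plain string.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.function.Supplier;

// Hypothetical probe handlers; the real endpoints are provided by the library.
class HealthProbes {
    /** Liveness: 200 whenever the process is able to answer at all. */
    static int livenessStatus() {
        return 200;
    }

    /** Readiness: 200 only in READY; 503 in every other state. */
    static int readinessStatus(String lifecycleState) {
        return "READY".equals(lifecycleState) ? 200 : 503;
    }

    /** Starts a minimal HTTP server exposing both probes on the given port. */
    static HttpServer serve(int port, Supplier<String> state) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/api/health/live", exchange -> {
            exchange.sendResponseHeaders(livenessStatus(), -1); // -1: no response body
            exchange.close();
        });
        server.createContext("/api/health/ready", exchange -> {
            exchange.sendResponseHeaders(readinessStatus(state.get()), -1);
            exchange.close();
        });
        server.start();
        return server;
    }
}
```

Returning 503 while DRAINING is what lets Kubernetes remove the pod from Service endpoints before it shuts down, while the liveness probe keeps passing so the pod is not restarted mid-drain.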
Kubernetes deployment example¶
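apiVersion: apps/v1
kind: Deployment
metadata:
  name: kitchen
spec:
  replicas: 2
  # apps/v1 requires a selector that matches the pod template labels;
  # the label value "kitchen" is illustrative.
  selector:
    matchLabels:
      app: kitchen
  template:
    metadata:
      labels:
        app: kitchen
    spec:
      terminationGracePeriodSeconds: 300 # Match drainTimeout (default: 5 minutes)
      containers:
        - name: kitchen
          image: hotel/kitchen-ensemble:latest
          ports:
            - containerPort: 7329
          livenessProbe:
            httpGet:
              path: /api/health/live
              port: 7329
          readinessProbe:
            httpGet:
              path: /api/health/ready
              port: 7329
          lifecycle:
            preStop:
              # An httpGet hook sends GET, but the drain endpoint expects POST,
              # so this uses exec instead (assumes curl is available in the image).
              exec:
                command: ["curl", "-sf", "-X", "POST", "http://localhost:7329/api/lifecycle/drain"]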
Set terminationGracePeriodSeconds to match the ensemble's drainTimeout so that
Kubernetes waits long enough for in-flight work to complete.
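Since drainTimeout is a java.time.Duration and terminationGracePeriodSeconds is an integer number of seconds, the conversion is a one-liner. The helper below is a hypothetical convenience for keeping the two values in sync, not part of the library.

```java
import java.time.Duration;

// Hypothetical helper for deriving the Kubernetes setting from a drain timeout.
class GracePeriod {
    /** Converts a drain timeout to whole seconds for terminationGracePeriodSeconds. */
    static long terminationGracePeriodSeconds(Duration drainTimeout) {
        return drainTimeout.getSeconds();
    }

    public static void main(String[] args) {
        // Default drainTimeout of 5 minutes
        System.out.println(terminationGracePeriodSeconds(Duration.ofMinutes(5))); // prints 300
        // The 2-minute timeout configured in the Graceful Shutdown example
        System.out.println(terminationGracePeriodSeconds(Duration.ofMinutes(2))); // prints 120
    }
}
```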
Consuming Shared Capabilities¶
Other ensembles can use shared tasks and tools via NetworkTask and NetworkTool:
NetworkConfig config = NetworkConfig.builder()
    .ensemble("kitchen", "ws://kitchen:7329/ws")
    .build();

try (NetworkClientRegistry registry = new NetworkClientRegistry(config)) {
    EnsembleOutput result = Ensemble.builder()
        .chatLanguageModel(model)
        .task(Task.builder()
            .description("Handle room service request")
            .tools(
                NetworkTask.from("kitchen", "prepare-meal", registry),
                NetworkTool.from("kitchen", "check-inventory", registry))
            .build())
        .build()
        .run();
}
See the Cross-Ensemble Delegation guide for details.