Failure model
MVX Metrics separates production failures from metrics-infrastructure failures.
A business operation may fail because of domain logic.
A recorder may fail because its lifecycle, queue, dispatcher, or hooks failed.
A runtime may fail because its thread, event loop, recorder registry, startup, or shutdown failed.
These are different failure areas and they should not be mixed.
Main failure boundaries
The main boundaries are:
production component
|
v
recorder handoff
|
v
recorder processing
|
v
runtime environment
Each area owns its own failure model.
production component
|
v
business errors

recorder
|
v
event-processing and metric-dispatch errors

runtime
|
v
thread, event loop, recorder registry, startup, shutdown errors
This separation lets application code decide how strongly metrics failures should affect business operations.
Production-side failures
Production code owns business behavior.
For example:
def save_document(self, document_id: str, content: str) -> None:
    # Business validation: this error belongs to the production component.
    if not document_id:
        raise ValueError("document_id must not be empty")
That error belongs to the production component.
It is not a metrics failure.
The metric event may record that the operation failed, but the source of the failure is still business logic.
business method fails
|
v
method emits FAILURE metric event
|
v
original business error is raised
Metrics observe the failure. They do not own it.
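The flow above can be sketched as follows. This is a minimal illustration, not MVX Metrics code: `FakeRecorder`, `DocumentStore`, and the dict-shaped event are hypothetical stand-ins, and only the `register_event(event=...)` call shape is taken from this document.

```python
class FakeRecorder:
    """Hypothetical stand-in for a recorder; it only collects events."""

    def __init__(self) -> None:
        self.events: list = []

    def register_event(self, event: dict) -> None:
        self.events.append(event)


class DocumentStore:
    """Hypothetical production component."""

    def __init__(self, recorder: FakeRecorder) -> None:
        self._metrics_recorder = recorder

    def save_document(self, document_id: str, content: str) -> None:
        try:
            if not document_id:
                raise ValueError("document_id must not be empty")
            # ... real persistence would happen here ...
        except ValueError:
            # Metrics observe the failure ...
            self._metrics_recorder.register_event(
                event={"operation": "save_document", "status": "FAILURE"}
            )
            # ... but the original business error is still raised.
            raise
        self._metrics_recorder.register_event(
            event={"operation": "save_document", "status": "SUCCESS"}
        )
```

The bare `raise` re-raises the original `ValueError` unchanged, so callers see the business error and not a metrics-related one.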
Recorder handoff failures
Production code emits metric events through the recorder:
self._metrics_recorder.register_event(event=event)
This call can fail if the recorder cannot accept the event.
Typical reasons include:
invalid recorder state;
queue overflow when overflow policy is configured to raise;
recorder already failed earlier;
invalid event object.
A production component may decide to suppress recorder errors:
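def _send_metric_event(self, event: MetricEvent) -> None:
    # No recorder configured: metrics are disabled for this component.
    if self._metrics_recorder is None:
        return
    try:
        self._metrics_recorder.register_event(event=event)
    except Exception:
        # Deliberately suppressed: a metrics failure must not fail the operation.
        pass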
This pattern treats metrics as optional observability.
A metrics emission failure alone does not fail the business operation.
This is a production integration decision.
A different application may choose to let recorder errors propagate.
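One way to make the suppress-or-propagate choice explicit is a per-component flag. The sketch below assumes a hypothetical `strict_metrics` option and a hypothetical `BrokenRecorder`; neither is part of MVX Metrics.

```python
class BrokenRecorder:
    """Hypothetical recorder that always rejects events."""

    def register_event(self, event: object) -> None:
        raise RuntimeError("recorder cannot accept the event")


class Component:
    """Hypothetical production component with a configurable metrics policy."""

    def __init__(self, recorder, strict_metrics: bool = False) -> None:
        self._metrics_recorder = recorder
        self._strict_metrics = strict_metrics

    def _send_metric_event(self, event: object) -> None:
        if self._metrics_recorder is None:
            return
        try:
            self._metrics_recorder.register_event(event=event)
        except Exception:
            if self._strict_metrics:
                raise  # strict service: surface the recorder error
            # lenient default: metrics stay optional observability

    def do_work(self) -> str:
        self._send_metric_event({"operation": "do_work"})
        return "done"
```

With the default, `Component(BrokenRecorder()).do_work()` completes normally; with `strict_metrics=True`, the same call raises the recorder's error.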
Recorder lifecycle failures
AsyncioMetricsRecorder has its own lifecycle.
At a high level, recorder operations are valid only in specific states.
For example:
start
|
v
valid from VIRGIN or STOPPED

stop
|
v
valid from RUNNING
If an operation is requested in the wrong state, the recorder reports an invalid-state error.
This prevents lifecycle misuse from being silently ignored.
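The state rules above can be modelled as a small state machine. This is a sketch only: the state names VIRGIN, RUNNING, and STOPPED come from this document, while `RecorderState`, `InvalidStateError`, and `LifecycleSketch` are hypothetical names chosen for illustration.

```python
from enum import Enum, auto


class RecorderState(Enum):
    VIRGIN = auto()
    RUNNING = auto()
    STOPPED = auto()


class InvalidStateError(Exception):
    """Hypothetical error type for lifecycle misuse."""


class LifecycleSketch:
    def __init__(self) -> None:
        self.state = RecorderState.VIRGIN

    def start(self) -> None:
        # start is valid only from VIRGIN or STOPPED
        if self.state not in (RecorderState.VIRGIN, RecorderState.STOPPED):
            raise InvalidStateError(f"cannot start from {self.state.name}")
        self.state = RecorderState.RUNNING

    def stop(self) -> None:
        # stop is valid only from RUNNING
        if self.state is not RecorderState.RUNNING:
            raise InvalidStateError(f"cannot stop from {self.state.name}")
        self.state = RecorderState.STOPPED
```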
Event loop availability
AsyncioMetricsRecorder is an asyncio-based component.
When used directly, it must be created in an environment where a running asyncio event loop is available.
If no running loop exists, recorder construction fails with a loop-unavailable error.
This is one of the reasons MetricsRuntime exists.
Runtime-created recorders are created inside the runtime-owned event loop, so regular application code does not need to provide that loop manually.
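The loop requirement can be illustrated with the standard library alone. `asyncio.get_running_loop()` raises `RuntimeError` when no loop is running; whether AsyncioMetricsRecorder uses that call internally is an assumption, and `LoopUnavailableError` and `LoopBoundSketch` are hypothetical names.

```python
import asyncio


class LoopUnavailableError(Exception):
    """Hypothetical name for the loop-unavailable error."""


class LoopBoundSketch:
    def __init__(self) -> None:
        try:
            # Raises RuntimeError when called outside a running event loop.
            self._loop = asyncio.get_running_loop()
        except RuntimeError as exc:
            raise LoopUnavailableError("no running asyncio event loop") from exc


async def create_inside_loop() -> LoopBoundSketch:
    # Inside a coroutine a running loop exists, so construction succeeds.
    return LoopBoundSketch()
```

Calling `LoopBoundSketch()` from plain synchronous code fails, while `asyncio.run(create_inside_loop())` succeeds.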
Queue overflow
The recorder has an internal processing buffer and a pending-event limit.
If events arrive faster than they can be processed, the pending-event limit can be reached.
At that point, overflow behavior depends on recorder configuration:
pending-event limit reached
|
+--> RAISE_ERROR
|    |
|    v
|    raise queue overflow error
|
+--> DROP
     |
     v
     drop new event
This is an important architectural choice.
RAISE_ERROR makes pressure visible to the caller.
DROP keeps the production handoff non-blocking, at the cost of losing events under pressure.
The choice belongs to recorder configuration and application policy.
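A minimal sketch of the two policies, assuming a simple in-memory buffer. The policy names RAISE_ERROR and DROP come from this document; `OverflowPolicy`, `QueueOverflowError`, `PendingBuffer`, and the `dropped` counter are hypothetical.

```python
from collections import deque
from enum import Enum, auto


class OverflowPolicy(Enum):
    RAISE_ERROR = auto()
    DROP = auto()


class QueueOverflowError(Exception):
    """Hypothetical error type for a full pending-event buffer."""


class PendingBuffer:
    def __init__(self, limit: int, policy: OverflowPolicy) -> None:
        self._limit = limit
        self._policy = policy
        self._pending: deque = deque()
        self.dropped = 0

    def register_event(self, event: object) -> bool:
        """Return True if the event was accepted, False if it was dropped."""
        if len(self._pending) >= self._limit:
            if self._policy is OverflowPolicy.RAISE_ERROR:
                raise QueueOverflowError("pending-event limit reached")
            self.dropped += 1  # DROP: the new event is discarded
            return False
        self._pending.append(event)
        return True
```

Counting dropped events is one possible way to keep lossy behavior observable; it is a design choice of this sketch, not a documented feature.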
Hook failures
Recorder hooks are extension points.
The recorder has startup and shutdown hooks:
_on_starting()
_on_stopped()
If _on_starting() fails, recorder startup fails.
If _on_stopped() fails during normal shutdown, recorder shutdown fails.
This keeps extension failures visible.
A subclass that opens a backend connection during startup, for example, should not silently enter running state if that backend setup fails.
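The startup-hook behavior might look like the sketch below. Only the hook name `_on_starting()` comes from this document; `RecorderSketch`, `BackendRecorder`, the string states, and the choice to leave the state unchanged after a failed hook are assumptions made for illustration.

```python
class RecorderSketch:
    """Hypothetical base class with a startup extension point."""

    def __init__(self) -> None:
        self.state = "VIRGIN"

    def _on_starting(self) -> None:
        """Extension point, called before the recorder enters RUNNING."""

    def start(self) -> None:
        self._on_starting()  # an exception here aborts startup
        self.state = "RUNNING"  # reached only if the hook succeeded


class BackendRecorder(RecorderSketch):
    """Hypothetical subclass whose backend setup fails."""

    def _on_starting(self) -> None:
        raise ConnectionError("backend unavailable")
```

Because the hook runs before the state change, a failing `BackendRecorder.start()` raises and never reports a running state.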
Dispatcher failures
The dispatcher is the recorder-side task that processes accepted events and sends them to registered metrics.
The simplified path is:
event from queue
|
v
dispatch to registered metrics
|
v
metric.handle_event(event)
If dispatching raises an error, the recorder treats it as a metrics-infrastructure failure.
The recorder can move into failure state and store the error.
If logging is configured, dispatcher failures are logged as recorder infrastructure diagnostics.
The business operation that originally emitted the event has already crossed the recorder boundary.
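The sketch below illustrates why a dispatch error does not reach the emitting business code: the handoff only enqueues, and the failure surfaces later in the dispatcher task. `DispatcherSketch`, `FailingMetric`, and the string states are hypothetical; only `handle_event(event)` and `register_event(event=...)` follow names used in this document.

```python
import asyncio


class FailingMetric:
    """Hypothetical metric that cannot process any event."""

    def handle_event(self, event: object) -> None:
        raise ValueError("metric cannot interpret event")


class DispatcherSketch:
    def __init__(self, metrics: list) -> None:
        self._metrics = metrics
        self._queue: asyncio.Queue = asyncio.Queue()
        self.state = "RUNNING"
        self.error = None

    def register_event(self, event: object) -> None:
        self._queue.put_nowait(event)  # handoff returns immediately

    async def run(self) -> None:
        while self.state == "RUNNING":
            event = await self._queue.get()
            try:
                for metric in self._metrics:
                    metric.handle_event(event)
            except Exception as exc:
                # Infrastructure failure: store it and leave RUNNING.
                self.state = "FAILED"
                self.error = exc


async def demo() -> DispatcherSketch:
    recorder = DispatcherSketch([FailingMetric()])
    task = asyncio.create_task(recorder.run())
    recorder.register_event(event={"name": "x"})  # does not raise
    await task  # the dispatcher loop ends after the failure
    return recorder
```

Running `asyncio.run(demo())` returns a recorder whose stored error is the `ValueError` raised by the metric, while the `register_event` call itself completed without error.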
Unexpected dispatcher cancellation
Dispatcher cancellation is expected during normal recorder shutdown.
Unexpected dispatcher cancellation is different.
If the dispatcher is cancelled while the recorder is not stopping, the recorder treats that as an error condition.
This protects the recorder from silently losing its processing task.
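Distinguishing expected from unexpected cancellation can be done with a flag checked inside the dispatcher task. This is a sketch under assumptions: `CancellationSketch`, its `stopping` flag, and the stored `error` attribute are hypothetical and do not describe the actual recorder internals.

```python
import asyncio


class CancellationSketch:
    def __init__(self) -> None:
        self.stopping = False
        self.error = None

    async def _dispatch(self) -> None:
        try:
            await asyncio.Event().wait()  # stands in for waiting on events
        except asyncio.CancelledError as exc:
            if not self.stopping:
                # Nobody asked the recorder to stop: record the loss.
                self.error = exc
            raise


async def cancel_dispatcher(stopping: bool) -> CancellationSketch:
    recorder = CancellationSketch()
    task = asyncio.create_task(recorder._dispatch())
    await asyncio.sleep(0)  # let the dispatcher start waiting
    recorder.stopping = stopping
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return recorder
```

With `stopping=True` the cancellation leaves `error` unset; with `stopping=False` the same cancellation is recorded as an error.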
Cleanup failures
If dispatcher failure requires cleanup, cleanup itself may also fail.
Cleanup failures are not business failures.
They are recorder infrastructure failures.
When a log context is available, cleanup-related failures are logged as diagnostic events.
The goal is to make recorder-side problems visible without confusing them with domain operation errors.
Runtime startup failures
MetricsRuntime owns a dedicated thread and an asyncio event loop inside that thread.
Startup can fail if the runtime cannot reach a running state.
At a high level:
runtime.start()
|
v
create thread
|
v
create event loop
|
v
signal loop readiness
|
v
RUNNING
If that sequence fails, the runtime reports a startup error.
After startup failure, the runtime enters failure state.
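The startup sequence can be approximated with `threading` and `asyncio` from the standard library. This is a sketch, not the MetricsRuntime implementation: `RuntimeSketch`, `RuntimeStartupError`, the string states, and the `timeout` parameter are all hypothetical.

```python
import asyncio
import threading


class RuntimeStartupError(Exception):
    """Hypothetical error type for a failed startup."""


class RuntimeSketch:
    def __init__(self) -> None:
        self.state = "VIRGIN"
        self._ready = threading.Event()
        self._loop = None
        self._thread = None

    def _run_loop(self) -> None:
        self._loop = asyncio.new_event_loop()
        # The callback runs once the loop is actually running.
        self._loop.call_soon(self._ready.set)
        try:
            self._loop.run_forever()
        finally:
            self._loop.close()

    def start(self, timeout: float = 5.0) -> None:
        self._thread = threading.Thread(target=self._run_loop, daemon=True)
        self._thread.start()
        if not self._ready.wait(timeout):
            self.state = "FAILED"
            raise RuntimeStartupError("event loop did not become ready")
        self.state = "RUNNING"

    def shutdown(self) -> None:
        # Stop the loop from the calling thread, then wait for the thread.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self.state = "CLOSED"
```

Signalling readiness from inside the loop, instead of right after thread creation, means `start()` returns only once the loop can accept work.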
Runtime invalid state
Runtime operations are valid only in specific states.
For example, recorder creation requires a running runtime.
If application code tries to create a recorder before the runtime is started, after it is closed, or while it is in failure state, the runtime reports an invalid-state error.
This keeps runtime lifecycle explicit.
Recorder registry failures
MetricsRuntime owns a recorder registry.
Registry operations have their own failures.
For example:
creating a recorder with an id that already exists;
getting a recorder that is not registered;
stopping a recorder that cannot be stopped;
recorder startup failure during create_recorder().
These are runtime-recorder management failures.
They are different from metric interpretation failures and different from business operation failures.
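A registry with these failure modes might look like the sketch below. Only the method name `create_recorder()` appears in this document; its parameters here, `get_recorder()`, the error types, and the rule that a recorder is registered only after successful startup are assumptions.

```python
class DuplicateRecorderError(Exception):
    """Hypothetical error: the recorder id is already registered."""


class RecorderNotFoundError(Exception):
    """Hypothetical error: no recorder is registered under the id."""


class RegistrySketch:
    def __init__(self) -> None:
        self._recorders: dict = {}

    def create_recorder(self, recorder_id: str, factory) -> object:
        if recorder_id in self._recorders:
            raise DuplicateRecorderError(recorder_id)
        # A startup failure inside the factory propagates to the caller.
        recorder = factory()
        # Registered only after creation and startup succeeded.
        self._recorders[recorder_id] = recorder
        return recorder

    def get_recorder(self, recorder_id: str) -> object:
        try:
            return self._recorders[recorder_id]
        except KeyError:
            raise RecorderNotFoundError(recorder_id) from None
```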
Runtime shutdown failures
Runtime shutdown is responsible for stopping and removing runtime-owned recorders, stopping the event loop, joining the runtime thread, and clearing runtime references.
The simplified flow is:
runtime.shutdown()
|
v
stop/remove all recorders
|
v
stop event loop
|
v
join runtime thread
|
v
clear runtime references
If any recorder fails to stop, the runtime collects those failures and reports a shutdown failure.
This keeps shutdown failure visible at the runtime boundary.
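Collecting failures instead of stopping at the first one can be sketched as follows. `RuntimeShutdownError`, `stop_all_recorders`, and the two recorder stubs are hypothetical; the sketch covers only the stop/remove step of the flow above.

```python
class RuntimeShutdownError(Exception):
    """Hypothetical error carrying every recorder failure seen during shutdown."""

    def __init__(self, failures: dict) -> None:
        super().__init__(f"{len(failures)} recorder(s) failed to stop")
        self.failures = failures


class GoodRecorder:
    def __init__(self) -> None:
        self.stopped = False

    def stop(self) -> None:
        self.stopped = True


class BadRecorder:
    def stop(self) -> None:
        raise RuntimeError("recorder cannot be stopped")


def stop_all_recorders(recorders: dict) -> None:
    failures = {}
    for recorder_id in list(recorders):
        recorder = recorders.pop(recorder_id)  # remove from the registry
        try:
            recorder.stop()
        except Exception as exc:
            failures[recorder_id] = exc  # keep going; report at the end
    if failures:
        raise RuntimeShutdownError(failures)
```

One failing recorder does not prevent the others from being stopped, and the caller still receives a single error that names every failure.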
Logging failure diagnostics
Recorder and runtime failures can be logged through MVX Logger when a log context is configured.
This logging is diagnostic.
It does not turn metric events into log events.
It reports infrastructure problems such as:
recorder dispatch error
recorder cleanup failure
runtime lifecycle operation failure
recorder management failure
Logging is therefore part of the failure visibility model, not part of metric aggregation.
Failure ownership
The ownership model is:
business operation failed
|
v
production component owns the error

metric event handoff failed
|
v
recorder-facing integration owns the decision to suppress or propagate

recorder lifecycle or dispatcher failed
|
v
recorder owns the error

runtime startup, registry, or shutdown failed
|
v
runtime owns the error
This keeps failure handling localized.
Why this matters
Metrics should improve observability without making the business path fragile by default.
At the same time, metrics infrastructure should not hide its own failures.
The architecture therefore separates two decisions:
Should a production operation fail because metrics failed?
|
v
production integration decision

Should recorder/runtime failures be visible and diagnosable?
|
v
yes, through domain errors, states, and logging
This gives applications control.
A small script may suppress recorder errors.
A strict service may surface them.
A test may assert them.
The metrics core provides explicit failure boundaries for all of these choices.
Summary
MVX Metrics has separate failure boundaries for production code, recorders, and runtime.
Production code owns business errors.
Recorders own event-processing, lifecycle, queue, dispatcher, and hook errors.
MetricsRuntime owns thread, event loop, recorder registry, startup, shutdown, and recorder-management errors.
This separation keeps production behavior, metrics processing, and runtime management understandable and testable as separate parts of the system.