Monitoring

Network Gateway Monitoring

We recommend installing a Grafana Dashboard to help monitor your node and gateway.

The rest of this article describes the gateway logging configuration and the metrics exposed by the gateway in more detail.

Logs

Logging configuration follows the ASP.NET paradigms.

In particular, both the log levels and the logger can be set in the app configuration.

By default, a simple one-line console logger is used in development, and a JSON logger is used in production. These can be configured further in the app configuration, following the ASP.NET guidance. For example, this is done in the deployment folder to optimize log readability in Docker.

An example configuration for the systemd console logger is given below.

{
    "Logging": {
        "Console": {
            "FormatterName": "systemd",
            "FormatterOptions": {
                "IncludeScopes": true,
                "UseUtcTimestamp": true,
                "TimestampFormat": "yyyy-MM-ddTHH\\:mm\\:ss.fff\\Z "
            }
        }
    }
}

Health Checks

The Data Aggregator and Gateway API have a health check configured on their main ASP.NET URL: /health will return a 200 status if all health checks pass, or a 500 if one or more health checks fail.

This can be integrated with Kubernetes or other health checking systems.
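As a rough illustration of such an integration, the Python sketch below polls /health and maps the response to a coarse state. The function names, the three state labels and the timeout are assumptions made for this example; only the /health path and the 200 / 500 status codes come from the description above.

```python
import urllib.error
import urllib.request


def classify_status(status_code):
    """Map an HTTP status code from /health to a coarse state label."""
    # 200 means every health check passed; anything else is treated as failing.
    return "healthy" if status_code == 200 else "unhealthy"


def check_health(base_url, timeout=5.0):
    """Query <base_url>/health and return "healthy", "unhealthy" or "unreachable"."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses, such as the 500 on failure.
        return classify_status(err.code)
    except OSError:
        # Connection refused, DNS failure or timeout: the service did not answer.
        return "unreachable"
```

For example, `check_health("http://localhost:8080")` would query a service whose main API listens on port 8080. That port is a placeholder, so substitute the URL your deployment uses.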

The Data Aggregator has a health check for database connectivity, and a custom health check which passes if the service either started up recently (within 10 seconds) or extended the ledger in the last 20 seconds (this can be configured with the Monitoring.UnhealthyCommitmentGapSeconds parameter). The Gateway API has a health check for each of its database connections.
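The rule behind the custom Data Aggregator check can be sketched in a few lines of Python. This is an illustration of the logic as described, not the service's actual implementation: the function name, the argument names and the inclusive boundaries are all assumptions.

```python
def aggregator_is_healthy(now, started_at, last_ledger_extension_at,
                          startup_grace_seconds=10.0,
                          unhealthy_commitment_gap_seconds=20.0):
    """Return True if the service started recently or extended the ledger recently.

    All times are seconds on the same clock (for example, Unix timestamps).
    last_ledger_extension_at may be None if no extension has happened yet.
    """
    # Condition 1: the service is still inside its start-up grace period.
    recently_started = (now - started_at) <= startup_grace_seconds
    if last_ledger_extension_at is None:
        return recently_started
    # Condition 2: the ledger was extended within the configured gap.
    gap = now - last_ledger_extension_at
    recently_extended = gap <= unhealthy_commitment_gap_seconds
    return recently_started or recently_extended
```

Raising `unhealthy_commitment_gap_seconds` in this sketch mirrors raising Monitoring.UnhealthyCommitmentGapSeconds: longer gaps between ledger extensions are tolerated before the check fails.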

The health check endpoint will come up after the service loads. For the Data Aggregator, the migrations run before the health check is up, so we recommend running slow migrations separately; see releasing for more information.

Prometheus Metrics

The Network Gateway services export metrics in Prometheus format via metrics endpoints, to be scraped by Prometheus.

The default endpoints are:

Data Aggregator - http://localhost:1234
Gateway API - http://localhost:1235

These can be changed with the PrometheusMetricsPort configuration variable.

Metric Types

Metrics fall into a number of groupings, separated out by distinct prefixes.

These are metrics provided by libraries:

dotnet_* - Metrics about the runtime (e.g. threadpool, known allocated memory)
process_* - Metrics about the process (e.g. process threads, process memory)
http_request_* and http_requests_* - Metrics about controller actions
httpclient_* - Metrics about requests that the service makes to upstream services (i.e. the full nodes)
aspnetcore_* - Metrics related to ASP.NET Core (e.g. health check status)

There are also custom metrics, all prefixed with ng_ (for Network Gateway):

ng_aggregator_* - metrics about aggregator status
ng_node_fetch_* - metrics about fetching data from a node
ng_ledger_sync_* - metrics about syncing the ledger from full nodes
ng_ledger_commit_* - metrics about committing the agreed ledger to the database
ng_node_ledger_* - metrics about the ledger / state of the full node(s) (with node label)
ng_node_mempool_* - metrics about the full node mempool(s) (or the combination of them)
ng_db_mempool_* - metrics about the MempoolTransactions in the database
ng_construction_transaction_* - metrics relating to construction, submission or resubmission
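To see which of these groupings a running service actually exposes, the scraped text can be bucketed by prefix. The Python sketch below takes the text of a metrics response and groups the metric names under the prefixes listed above. The function name and the "other" bucket are assumptions made for this example, and the parsing is deliberately simplified (it reads only the metric name at the start of each sample line).

```python
import re
from collections import defaultdict

# Prefixes taken from the lists above.
KNOWN_PREFIXES = (
    "dotnet_", "process_", "http_request_", "http_requests_",
    "httpclient_", "aspnetcore_",
    "ng_aggregator_", "ng_node_fetch_", "ng_ledger_sync_",
    "ng_ledger_commit_", "ng_node_ledger_", "ng_node_mempool_",
    "ng_db_mempool_", "ng_construction_transaction_",
)

# A metric name is the leading run of identifier characters on a sample line.
_NAME_PATTERN = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def group_metric_names(exposition_text):
    """Return {prefix: sorted metric names} for Prometheus-format text."""
    groups = defaultdict(set)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP / TYPE comments
        match = _NAME_PATTERN.match(line)
        if match is None:
            continue
        name = match.group(1)
        matching = [p for p in KNOWN_PREFIXES if name.startswith(p)]
        # Prefer the longest matching prefix; unmatched names go to "other".
        prefix = max(matching, key=len) if matching else "other"
        groups[prefix].add(name)
    return {prefix: sorted(names) for prefix, names in groups.items()}
```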

Each service also exposes a /metrics endpoint, on a separate port from the health check and main APIs. This port can be changed with the PrometheusMetricsPort configuration variable, and defaults to 1234 for the Data Aggregator and 1235 for the Gateway API.

Many custom metrics are available, and these can be used to build a comprehensive dashboard. Each metric should include a description explaining how to interpret it.

The metrics are documented inline; the descriptions can be seen by opening the /metrics endpoint in your browser, or in Prometheus.
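The inline descriptions can also be read programmatically. The Python sketch below fetches the /metrics text and collects the `# HELP` lines of the Prometheus text format into a dictionary. The function names and the `host` and `timeout` parameters are assumptions made for this example; the default ports and the /metrics path are the ones given above.

```python
import urllib.request


def fetch_metrics_text(port, host="localhost", timeout=5.0):
    """Download the raw Prometheus-format text from http://<host>:<port>/metrics."""
    url = "http://%s:%d/metrics" % (host, port)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_help_lines(exposition_text):
    """Return {metric name: description} built from '# HELP <name> <text>' lines."""
    descriptions = {}
    for line in exposition_text.splitlines():
        # Split into at most four parts so the description keeps its spaces.
        parts = line.strip().split(" ", 3)
        if len(parts) >= 3 and parts[0] == "#" and parts[1] == "HELP":
            descriptions[parts[2]] = parts[3] if len(parts) == 4 else ""
    return descriptions
```

For example, `parse_help_lines(fetch_metrics_text(1234))` would list the descriptions exposed by a Data Aggregator running locally on its default metrics port.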

Alerting

Alerting should align with your monitoring requirements. The thresholds below may need adjusting for your use case.

Some suggested alerts are below:

| Importance | Explanation | Possible Alerting Criteria |
| --- | --- | --- |
| High | MoreThanOnePrimary - More than one primary data aggregator (this can cause high levels of errors with both Data Aggregators trying to do the same thing) | `sum(ng_aggregator_is_primary) != 1` |
| High | HighTimeSinceLastLedgerCommit - The DB ledger hasn’t been updated in the last minute | `time() - ng_ledger_commit_last_commit_timestamp_seconds{container="data-aggregator"} > 60` |
| Medium | Resubmission Queue Backlog - This might indicate that resubmissions are delayed | `ng_db_mempool_transactions_needing_resubmission_total{container="data-aggregator"} > 100` |
| Medium | FailingDataAggregatorHealthChecks | `sum(aspnetcore_healthcheck_status{container="data-aggregator"}) >= 1` |
| Medium | FailingGatewayAPIHealthChecks | `sum(aspnetcore_healthcheck_status{container="gateway-api"}) >= 1` |
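To make the first three criteria concrete, the Python sketch below evaluates them against a set of already-scraped values. It is a simplified illustration rather than a substitute for Prometheus alerting rules: the function name and the input shape are assumptions, and the `container` label filters are ignored.

```python
def firing_alerts(samples, now):
    """Return the names of the alerts whose criteria hold.

    samples maps a metric name to a list of values, one per scraped series.
    now is the current Unix time in seconds.
    """
    firing = []

    # sum(ng_aggregator_is_primary) != 1
    # As in PromQL, no series at all produces no result, so nothing fires.
    # Note the expression also holds when the sum is 0 (no primary).
    primaries = samples.get("ng_aggregator_is_primary", [])
    if primaries and sum(primaries) != 1:
        firing.append("MoreThanOnePrimary")

    # time() - ng_ledger_commit_last_commit_timestamp_seconds > 60
    commits = samples.get("ng_ledger_commit_last_commit_timestamp_seconds", [])
    if any(now - timestamp > 60 for timestamp in commits):
        firing.append("HighTimeSinceLastLedgerCommit")

    # ng_db_mempool_transactions_needing_resubmission_total > 100
    backlog = samples.get(
        "ng_db_mempool_transactions_needing_resubmission_total", [])
    if any(value > 100 for value in backlog):
        firing.append("ResubmissionQueueBacklog")

    return firing
```

Adjusting the literals 60 and 100 in this sketch corresponds to tuning the thresholds in the table for your use case.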

Depending on your use cases, you may also wish to configure alerting on transaction submission or resubmission errors, possibly making use of some of the following metrics:

ng_construction_transaction_submission_request_count, ng_construction_transaction_submission_success_count and ng_construction_transaction_submission_error_count
ng_construction_transaction_resubmission_attempt_count, ng_construction_transaction_resubmission_success_count and ng_construction_transaction_resubmission_error_count
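One way to turn these counters into an alerting signal is an error ratio over a time window. The Python sketch below computes the ratio from two successive scrapes of the submission counters. It assumes the metrics are monotonically increasing counters, which is not confirmed here, and the function name and input shape are hypothetical.

```python
REQUESTS = "ng_construction_transaction_submission_request_count"
ERRORS = "ng_construction_transaction_submission_error_count"


def submission_error_ratio(previous, current):
    """Return the fraction of submissions that failed between two scrapes.

    previous and current map metric names to counter values.
    Returns None when the ratio cannot be computed for the window.
    """
    requests = current[REQUESTS] - previous[REQUESTS]
    errors = current[ERRORS] - previous[ERRORS]
    if requests <= 0 or errors < 0:
        # No new submissions, or a counter went backwards (assumed restart).
        return None
    return errors / requests
```

A threshold on this ratio (for example, alerting above a few percent) is a judgment call for your deployment; the same calculation applies to the resubmission counters by swapping the metric names.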