diff --git a/CHANGELOG.md b/CHANGELOG.md
index 792429c..6daeee7 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -99,6 +99,55 @@ cd pithead && cp config.json.template config.json # set your Monero + Tari pay
   so paste a Tor-reachable URL (hosted `hc-ping.com`, or a self-hosted onion/public instance). Fails
   silently when offline / Tor down. The URL is the on/off switch and is stored as a secret in the
   owner-only `.env`. See [`docs/monitoring.md`](docs/monitoring.md) (#79).
+- **Telegram operator bot — push alerts + on-demand status** (#121, #45): the dashboard can push a
+  high-value set of operational alerts to Telegram — a **🚀 "Pithead online"** heartbeat on start,
+  **node down / recovered**, **worker offline / back online**, **new worker joined / left**, **sync
+  finished**, **data disk filling up**, **dashboard DB write failing**, **no PPLNS share while
+  donating to XvB** (raffle wins skipped), **XvB registration rejected / failing**, **hashrate too
+  low for the chosen XvB tier**, **a node exposed on clearnet** during initial sync, and **a new
+  release being available** — and answer status commands on demand: **`/status`**, **`/info`**
+  (version + update availability, Monero DB mode, P2Pool sidechain, and Tor-only/clearnet privacy
+  posture), **`/hashrate`**, **`/workers`**, **`/sync`**, **`/system`**, **`/pool`**, **`/xvb`**,
+  **`/earnings`**, and **`/help`**. It also pushes a **📅 once-a-day retrospective** at a configurable local time
+  (`telegram.daily_summary_time`, default **08:00**) — the last 24h across the fleet: an incident
+  roll-up (what went wrong during the day, or an all-clear), 24h hashrate with the P2Pool/XvB split,
+  shares found in the day, an estimated daily-earnings figure, and a per-machine 24h breakdown. The Telegram bot appears in the dashboard's
+  **network-egress panel** (#170) as a Tor-routed path alongside Healthchecks/XvB/update-check. All
+  traffic is **routed over Tor** (the same bridge SOCKS as Healthchecks/XvB), so the bot never
+  exposes the host IP to Telegram. Off by default; enable it with a `telegram` block in `config.json` (`enabled`,
+  `bot_token`, `chat_id`, per-event `events` toggles, and a `commands.enabled` switch for the
+  interactive half). Every alert is **debounced** so a momentary blip won't ping you and you get one
+  message per real transition — and each is built by *reusing* what the dashboard already computes:
+  worker offline/joined/left keys off the same per-rig **DOWN** status the UI shows, and the disk /
+  DB alerts cross the same thresholds as the dashboard's own low-disk and DB-health badges. Commands
+  **long-poll** (`getUpdates`) so they need no inbound port and ride the same Tor egress as the
+  alerts, are **read-only** (they never change the stack), and only the configured `chat_id` is
+  answered — every other update is ignored. The `bot_token` is treated as a secret (owner-only
+  `.env`, never logged), and both sends and polling **fail silently** on a Tor-only / offline host.
+  Messages are prefixed with the dashboard hostname so multiple stacks can share one chat. Full
+  walkthrough — creating a bot, finding your chat id, the command list, and the "one chat, two bots"
+  pattern for sharing a chat with the Healthchecks.io monitor (#79) — in
+  [`docs/telegram.md`](docs/telegram.md).
+- **Host & performance warning badges + alerts** (#104): the top bar now surfaces the persistent
+  host conditions `setup` warns about, derived from **live** metrics (so they self-correct): **⚠
+  HugePages off** (RandomX capped until reserved), **⚠ Low RAM** (under 16 GB — Tari can OOM during
+  sync), and **⚠ No AVX2** (slow RandomX). The first two also push a Telegram alert (`hugepages`,
+  `low_ram`) the first time they're seen — unlike the transient edge alerts, a stable bad state
+  fires on first detection, and HugePages clears with a recovery ping once a reboot applies them.
+  AVX2 is **badge-only** by design: a fixed hardware fact with nothing to act on at runtime doesn't
+  warrant a push. The bot's **`/status`** reply now ends with any active warning/error badges (the
+  same catalog the top bar draws) or an explicit "✅ No warnings."
+- **Hashrate-drop detector — chart markers + `hashrate_loss` alert** (#99): the dashboard now flags a
+  **sustained, significant fall** in total fleet hashrate — a rig gone dark, a network cut, a stalled
+  miner — separately from the existing "too low for your XvB tier" warning. It tracks a slow moving
+  average as the "normal" level (frozen while degraded so an outage can't quietly redefine normal),
+  and fires once the total stays below **`dashboard.hashrate_drop_threshold`** percent of that
+  baseline for **`dashboard.hashrate_drop_minutes`** (defaults: **50%** for **10 min**), with a
+  matching recovery edge. Each edge drops a **diamond marker on the hashrate chart** (amber for the
+  drop, green for the recovery; hover for the size) that is **persisted**, so an overnight drop is
+  still visible in the morning, and — when Telegram is on — pushes a **`hashrate_loss`** alert and
+  counts toward the daily incident roll-up. Both knobs are documented in
+  [`docs/configuration.md`](docs/configuration.md); the alert in [`docs/telegram.md`](docs/telegram.md).
 - **Optional clearnet initial sync (#183).** A default-off, per-component opt-in
   (`monero.clearnet_initial_sync` / `tari.clearnet_initial_sync`) that lets a node do its **one-time
   initial block download over clearnet** — much faster than over bandwidth-capped Tor circuits, which
diff --git a/README.md b/README.md
index 5b4ba61..b56f2a3 100644
--- a/README.md
+++ b/README.md
@@ -37,6 +37,12 @@ a Tor daemon. The `pithead` script renders config, provisions Tor, and drives do
   address in the miner config; the stack routes the hashrate.
 - 📊 **Live dashboard.** Hashrate, the P2Pool/XvB split, the PPLNS window, and per-worker updates,
   served over HTTPS on your LAN.
+- 📟 **Telegram operator bot.** Opt-in alerts for a downed node, a worker that dropped off, sync
+  finishing, low disk, a clearnet leak, or a sustained hashrate drop — plus a daily digest and
+  read-only commands (`/status`, `/hashrate`, `/workers`, `/earnings`). Routed over Tor. See the
+  [Telegram guide](docs/telegram.md).
+- 🔔 **Dead-man's switch.** An optional [Healthchecks.io](https://healthchecks.io/) ping tells you
+  when the whole box goes dark — the one failure a monitor running *on* that box can never report.
 - 🚀 **Interactive setup.** `pithead setup` checks dependencies, writes config, provisions Tor, and
   (on Linux) tunes HugePages for RandomX. It prompts before any GRUB change, then offers to start.
 - 🔒 **Hardened defaults.** Non-root containers, SHA256-verified binaries, pinned image digests,
diff --git a/build/dashboard/mining_dashboard/collector/system.py b/build/dashboard/mining_dashboard/collector/system.py
index ef5947d..b015ff6 100644
--- a/build/dashboard/mining_dashboard/collector/system.py
+++ b/build/dashboard/mining_dashboard/collector/system.py
@@ -6,6 +6,26 @@ BYTES_IN_GB = 1024**3
 
 
 _last_cpu_times = None
+_avx2_supported = None  # cached: the CPU flag can't change while the process runs
+
+
+def get_cpu_avx2():
+    """Whether the CPU advertises AVX2 (#104). RandomX runs far slower without it, so setup warns on
+    it — surface the same persistent fact as a live badge. Reads /proc/cpuinfo (host CPU flags are
+    visible inside the container); cached, since the flag is fixed for the life of the process.
+    Returns True/False, or None when it can't be determined (non-Linux / unreadable)."""
+    global _avx2_supported
+    if _avx2_supported is not None:
+        return _avx2_supported
+    try:
+        with open("/proc/cpuinfo") as f:
+            for line in f:
+                if line.startswith("flags"):
+                    _avx2_supported = "avx2" in line.split()
+                    return _avx2_supported
+    except OSError:
+        pass
+    return None  # unknown — don't cache, and callers treat None as "can't judge" (no badge/alert)
 
 
 def get_disk_usage():
diff --git a/build/dashboard/mining_dashboard/config/config.py b/build/dashboard/mining_dashboard/config/config.py
index 5563c99..b002a2b 100644
--- a/build/dashboard/mining_dashboard/config/config.py
+++ b/build/dashboard/mining_dashboard/config/config.py
@@ -16,6 +16,10 @@ DISK_WARN_PERCENT = 85
 DISK_CRITICAL_PERCENT = 95
 
+# Minimum host RAM (GB) below which the dashboard flags a low-RAM badge/alert (#104). Mirrors the
+# setup/doctor pre-flight threshold; a code-level default (not a config.json knob).
+LOW_RAM_GB = int(float(os.environ.get("LOW_RAM_GB", 16)))
+
 # --- Data Source File Paths ---
 # File paths for JSON metrics generated by local collectors
 STRATUM_STATS_PATH = f"{BASE_STATS_DIR}/local/stratum"
@@ -219,6 +223,84 @@
 # instance; a LAN-only self-hosted address is unreachable through Tor).
 HEALTHCHECKS_PING_URL = os.environ.get("HEALTHCHECKS_PING_URL", "").strip()
 
+# --- Operator alerts: Telegram (Issue #121) ---
+# Notifications-only Telegram pusher: a thin notifier that pushes a small, high-value set of
+# operational edges (node down/recovered, worker offline/back, sync finished) to one chat.
+# Disabled by default — with TELEGRAM_ENABLED unset/false the stack runs with no Telegram
+# config and never sends or errors. The interactive bot / command interface is a separate
+# feature (#45); this is the notifications-only split.
+#
+# `bot_token` is a secret: the pithead CLI renders it into the owner-only .env (like the node
+# RPC password), and the notifier never writes it to a log line. On a Tor-only / no-clearnet
+# host the Telegram API is unreachable and sends fail silently (consistent with #59).
+TELEGRAM_ENABLED = os.environ.get("TELEGRAM_ENABLED", "false").strip().lower() == "true"
+TELEGRAM_BOT_TOKEN = os.environ.get("TELEGRAM_BOT_TOKEN", "").strip()
+TELEGRAM_CHAT_ID = os.environ.get("TELEGRAM_CHAT_ID", "").strip()
+
+# Interactive command interface (#45), separate opt-in from the alerts above. When on, the
+# dashboard long-polls Telegram (getUpdates — outbound only, no inbound port) and answers
+# read-only status commands from the configured chat_id. Off by default; the alerter works
+# without it. See telegram_commands.py / docs/telegram.md.
+TELEGRAM_COMMANDS_ENABLED = (
+    os.environ.get("TELEGRAM_COMMANDS_ENABLED", "false").strip().lower() == "true"
+)
+
+
+def _telegram_event_enabled(name, default=True):
+    """Read one per-event toggle from TELEGRAM_EVENT_ (rendered from config.json's
+    telegram.events by pithead). Any toggle left unset defaults to on, so enabling Telegram
+    turns on the full set and an operator only has to opt *out* of the noisy ones."""
+    raw = os.environ.get(f"TELEGRAM_EVENT_{name.upper()}")
+    if raw is None or raw.strip() == "":
+        return default
+    return raw.strip().lower() == "true"
+
+
+# Per-event delivery toggles. Keys here are the canonical event names used throughout the
+# alerter (AlertService.EVT_*) and must match the config.json telegram.events block.
+TELEGRAM_EVENTS = {
+    "node_down": _telegram_event_enabled("node_down"),
+    "node_recovered": _telegram_event_enabled("node_recovered"),
+    "worker_offline": _telegram_event_enabled("worker_offline"),
+    "worker_recovered": _telegram_event_enabled("worker_recovered"),
+    "worker_joined": _telegram_event_enabled("worker_joined"),
+    "worker_left": _telegram_event_enabled("worker_left"),
+    "sync_finished": _telegram_event_enabled("sync_finished"),
+    "disk_space": _telegram_event_enabled("disk_space"),
+    "db_unhealthy": _telegram_event_enabled("db_unhealthy"),
+    "xvb_no_share": _telegram_event_enabled("xvb_no_share"),
+    "clearnet_exposed": _telegram_event_enabled("clearnet_exposed"),
+    "xvb_registration": _telegram_event_enabled("xvb_registration"),
+    "new_release": _telegram_event_enabled("new_release"),
+    "stack_online": _telegram_event_enabled("stack_online"),
+    "daily_summary": _telegram_event_enabled("daily_summary"),
+    "hashrate_low": _telegram_event_enabled("hashrate_low"),
+    "hashrate_loss": _telegram_event_enabled("hashrate_loss"),
+    "hugepages": _telegram_event_enabled("hugepages"),
+    "low_ram": _telegram_event_enabled("low_ram"),
+}
+# ponytail: daily_summary is a scheduled push, not an edge — it lives in the events dict only so it
+# gets a per-event on/off toggle like the rest; its time is TELEGRAM_DAILY_SUMMARY_TIME below.
+
+# Local time (HH:MM, 24-hour) to push the once-daily status digest, when the daily_summary event is
+# on. Uses the dashboard container's timezone (dashboard.timezone), so "08:00" means 8am wherever
+# the box is. Rendered from config.json telegram.daily_summary_time.
+TELEGRAM_DAILY_SUMMARY_TIME = os.environ.get("TELEGRAM_DAILY_SUMMARY_TIME", "08:00").strip()
+
+# Hashrate-degradation detector (Issue #99). Flags a sustained drop in total hashrate below
+# HASHRATE_DROP_THRESHOLD_PCT of its trailing baseline for HASHRATE_DROP_MINUTES minutes — surfaced
+# as a chart event marker (always on) and, when telegram.events.hashrate_loss is on, an alert.
+# Rendered from config.json dashboard.hashrate_drop_threshold / dashboard.hashrate_drop_minutes.
+HASHRATE_DROP_THRESHOLD_PCT = int(float(os.environ.get("HASHRATE_DROP_THRESHOLD_PCT", 50)))
+HASHRATE_DROP_MINUTES = int(float(os.environ.get("HASHRATE_DROP_MINUTES", 10)))
+
+# Worker offline/online debounce (Issue #121). A worker must be unseen this long before it's
+# reported OFFLINE, and seen continuously this long before "back online" — so a brief miner
+# reconnect doesn't spam the chat. Workers flap more than nodes (rig reboots, Wi-Fi blips),
+# so the window is wider than the node debounce above.
+WORKER_OFFLINE_AFTER_SEC = int(os.environ.get("WORKER_OFFLINE_AFTER_SEC", 300))
+WORKER_RECOVERY_AFTER_SEC = int(os.environ.get("WORKER_RECOVERY_AFTER_SEC", 120))
+
 # --- Monero Configuration ---
 # Used to determine if the node is local (Docker) or remote
 MONERO_NODE_HOST = os.environ.get("MONERO_NODE_HOST", "172.28.0.26")
@@ -317,10 +399,6 @@
 # --- Data Retention Policies ---
 HISTORY_RETENTION_SEC = 30 * 24 * 3600  # 30 Days
-# Retention for the known_workers persistence layer removed in #144. No live consumer in the current
-# tree; kept for the deferred Telegram worker-presence monitor (#121), which reuses it as its
-# retention default — consult that work before removing.
-WORKER_RETENTION_SEC = 7 * 24 * 3600  # 7 Days
 # How long an offline worker lingers in the live "Workers Alive" table before it falls off (#182).
 # Operates on the live proxy-sourced list. A reconnect re-adds the worker. 1h keeps a
 # just-disconnected rig visible (shown as DOWN) but clears ghosts.
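Reviewer sketch (not part of the patch): the `HASHRATE_DROP_THRESHOLD_PCT` / `HASHRATE_DROP_MINUTES` knobs above describe a "below X percent of a slow baseline for N minutes" rule. The detector itself is not shown in this part of the diff, so the class below is only a hypothetical illustration of that rule. `DropDetectorSketch`, its `alpha` smoothing factor, and the immediate (undebounced) recovery edge are assumptions, not the project's `DegradationMonitor`.

```python
class DropDetectorSketch:
    """Hypothetical illustration only; not the project's DegradationMonitor."""

    def __init__(self, threshold_pct=50, hold_minutes=10, alpha=0.05):
        self.threshold = threshold_pct / 100.0  # fraction of baseline that counts as a drop
        self.hold_sec = hold_minutes * 60  # how long the dip must persist before firing
        self.alpha = alpha  # assumed smoothing factor for the slow baseline
        self.baseline = None  # slow moving average of the healthy rate
        self.below_since = None  # timestamp at which the current dip started
        self.degraded = False

    def update(self, total_hashrate, now):
        """Feed one sample; return "drop", "recovered", or None."""
        if self.baseline is None:
            self.baseline = float(total_hashrate)  # seed silently on the first sample
            return None
        if total_hashrate < self.baseline * self.threshold:
            # Dipping: leave the baseline frozen so an outage cannot redefine "normal".
            if self.below_since is None:
                self.below_since = now
            if not self.degraded and now - self.below_since >= self.hold_sec:
                self.degraded = True
                return "drop"
            return None
        # At or above the threshold: clear the dip timer.
        self.below_since = None
        if self.degraded:
            self.degraded = False
            return "recovered"
        # Healthy: let the baseline follow the live rate slowly.
        self.baseline += self.alpha * (total_hashrate - self.baseline)
        return None


detector = DropDetectorSketch(threshold_pct=50, hold_minutes=10)
# (seconds, total hashrate) samples: healthy, then a 60% fall that lasts past 10 minutes.
samples = [(0, 1000), (60, 1000), (120, 400), (600, 400), (720, 400), (780, 400), (840, 900)]
edges = [detector.update(rate, now) for now, rate in samples]
print(edges)  # [None, None, None, None, 'drop', None, 'recovered']
```

Passing `now` in explicitly, rather than reading the clock inside `update`, keeps the sketch deterministic to test, in the same spirit as the `now=None` parameter that `AlertService.evaluate` accepts later in this diff.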
diff --git a/build/dashboard/mining_dashboard/helper/utils.py b/build/dashboard/mining_dashboard/helper/utils.py
index 45f251e..3df105d 100644
--- a/build/dashboard/mining_dashboard/helper/utils.py
+++ b/build/dashboard/mining_dashboard/helper/utils.py
@@ -36,6 +36,17 @@ def parse_hashrate(val_str, unit_str=None):
     return 0.0
 
 
+def effective_hashrate(worker):
+    """The single figure a worker contributes to the live headline total.
+
+    Prefers the 10-minute average (the ``h15`` field — legacy name, it's the proxy's 10m rate),
+    falling back to the 1-minute rate (``h60`` then ``h10``) when a rig hasn't accumulated 10
+    minutes yet, so a freshly-connected worker reads its real live rate instead of 0. Defined once
+    here so the aggregate total and every per-worker display use the *same* value and can't drift.
+    """
+    return worker.get("h15", 0) or worker.get("h60", 0) or worker.get("h10", 0) or 0
+
+
 def format_hashrate(hashrate):
     """
     Formats a raw hashrate value into a human-readable string with appropriate units.
diff --git a/build/dashboard/mining_dashboard/main.py b/build/dashboard/mining_dashboard/main.py
index e1956a4..27fc0cb 100644
--- a/build/dashboard/mining_dashboard/main.py
+++ b/build/dashboard/mining_dashboard/main.py
@@ -14,6 +14,7 @@
 from mining_dashboard.service.algo_service import AlgoService
 from mining_dashboard.service.data_service import DataService
 from mining_dashboard.service.storage_service import StateManager
+from mining_dashboard.service.telegram_commands import TelegramCommandBot
 from mining_dashboard.web.server import create_app
 
 logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
@@ -33,17 +34,24 @@ def build_app() -> web.Application:
     xvb_client = XvbClient(wallet_address=MONERO_WALLET_ADDRESS)
     data_service = DataService(state_manager, proxy_client, xvb_client)
     algo_service = AlgoService(state_manager, proxy_client, data_service)
+    # On-demand Telegram command interface (#45). Reads the snapshot data_service already collects;
+    # a no-op unless telegram.enabled + telegram.commands.enabled + bot_token + chat_id are set.
+    telegram_bot = TelegramCommandBot(data_service)
 
     async def start_background_tasks(app):
         """Initializes background services upon web application startup."""
         app["data_task"] = asyncio.create_task(data_service.run())
         app["algo_task"] = asyncio.create_task(algo_service.run())
+        app["telegram_task"] = asyncio.create_task(telegram_bot.run())
 
     async def cleanup_background_tasks(app):
         """Stops background tasks and closes resources on shutdown."""
         app["data_task"].cancel()
         app["algo_task"].cancel()
-        await asyncio.gather(app["data_task"], app["algo_task"], return_exceptions=True)
+        app["telegram_task"].cancel()
+        await asyncio.gather(
+            app["data_task"], app["algo_task"], app["telegram_task"], return_exceptions=True
+        )
 
         if "state_manager" in app:
             app["state_manager"].close()
diff --git a/build/dashboard/mining_dashboard/service/alert_service.py b/build/dashboard/mining_dashboard/service/alert_service.py
new file mode 100644
index 0000000..7152d3e
--- /dev/null
+++ b/build/dashboard/mining_dashboard/service/alert_service.py
@@ -0,0 +1,570 @@
+import asyncio
+import logging
+import time
+
+from mining_dashboard.config.config import (
+    DISK_CRITICAL_PERCENT,
+    DISK_WARN_PERCENT,
+    HOST_IP,
+    TELEGRAM_BOT_TOKEN,
+    TELEGRAM_CHAT_ID,
+    TELEGRAM_DAILY_SUMMARY_TIME,
+    TELEGRAM_ENABLED,
+    TELEGRAM_EVENTS,
+)
+from mining_dashboard.service.telegram_notifier import TelegramNotifier
+from mining_dashboard.service.worker_presence import WorkerPresenceMonitor
+
+logger = logging.getLogger("AlertService")
+
+
+def build_default_notifier():
+    """Construct the Telegram notifier from the process config (Issue #121)."""
+    return TelegramNotifier(
+        enabled=TELEGRAM_ENABLED,
+        bot_token=TELEGRAM_BOT_TOKEN,
+        chat_id=TELEGRAM_CHAT_ID,
+        events=TELEGRAM_EVENTS,
+    )
+
+
+def _parse_hhmm(value):
+    """Parse a 'HH:MM' 24-hour string to minutes-since-midnight, or None if malformed (which
+    disables the daily digest rather than guessing a time)."""
+    try:
+        hh, mm = (value or "").strip().split(":")
+        h, m = int(hh), int(mm)
+        if 0 <= h < 24 and 0 <= m < 60:
+            return h * 60 + m
+    except (ValueError, AttributeError):
+        pass
+    return None
+
+
+class AlertService:
+    """
+    Turns the data loop's per-cycle signals into a small set of debounced operator alerts and
+    pushes them over Telegram (Issue #121). Notifications-only — no interactive bot (#45).
+
+    It *consumes* signals the loop already computes rather than re-collecting anything:
+
+    - **node down / recovered** — transitions of ``NodeHealthMonitor``'s debounced ``down``
+      flag per node (#31). Tari is only alerted when it's treated as required; a non-blocking
+      Tari going down isn't operator-critical (we keep mining Monero), matching the
+      worker-rejection rule.
+    - **sync finished** — the sync gate's ``miner_released`` latch flipping open once (#35).
+    - **worker offline / back online / joined / left** — a debounced :class:`WorkerPresenceMonitor`
+      over the live worker rows (offline keys off the same DOWN status the dashboard shows; joined /
+      left track fleet membership).
+    - **disk filling / critical** — the data disk crossing the same ``DISK_WARN_PERCENT`` /
+      ``DISK_CRITICAL_PERCENT`` thresholds the dashboard's low-disk badge uses (#138): a full disk
+      corrupts monerod's DB mid-write.
+    - **DB write failing** — ``StateManager.is_db_healthy`` flipping false (#131): the dashboard
+      keeps serving but history/shares/stats stop persisting.
+
+    Edge state is seeded silently on the first observation (``None`` baselines), so a dashboard
+    restart can't replay a stale transition as a fresh alert. The exception is the persistent
+    host-perf advisories (HugePages not reserved, low RAM — #104): a stable bad state never
+    "transitions", so those fire on first observation instead of seeding silently.
+
+    :meth:`evaluate` is pure (folds signals into the alert list, no I/O) so it's fully
+    unit-testable; :meth:`process` calls it and dispatches each message off-thread so a slow or
+    blocked Telegram send never stalls the data loop.
+    """
+
+    # Event keys — must match config.json's telegram.events toggles and TELEGRAM_EVENTS.
+    EVT_NODE_DOWN = "node_down"
+    EVT_NODE_RECOVERED = "node_recovered"
+    EVT_WORKER_OFFLINE = "worker_offline"
+    EVT_WORKER_RECOVERED = "worker_recovered"
+    EVT_WORKER_JOINED = "worker_joined"
+    EVT_WORKER_LEFT = "worker_left"
+    EVT_SYNC_FINISHED = "sync_finished"
+    EVT_DISK_SPACE = "disk_space"
+    EVT_DB_UNHEALTHY = "db_unhealthy"
+    EVT_XVB_NO_SHARE = "xvb_no_share"
+    EVT_CLEARNET_EXPOSED = "clearnet_exposed"
+    EVT_XVB_REGISTRATION = "xvb_registration"
+    EVT_NEW_RELEASE = "new_release"
+    EVT_STACK_ONLINE = "stack_online"
+    EVT_DAILY_SUMMARY = "daily_summary"
+    EVT_HASHRATE_LOW = "hashrate_low"
+    EVT_HASHRATE_LOSS = "hashrate_loss"
+    EVT_HUGEPAGES = "hugepages"
+    EVT_LOW_RAM = "low_ram"
+
+    # WorkerPresenceMonitor edge -> (event key, message template).
+    _WORKER_EDGES = {
+        "offline": (EVT_WORKER_OFFLINE, "\U0001f534 ⛏️ Worker offline: {name}"),
+        "recovered": (EVT_WORKER_RECOVERED, "\U0001f7e2 ⛏️ Worker back online: {name}"),
+        "joined": (EVT_WORKER_JOINED, "\U0001f389 New worker joined: {name}"),
+        "left": (EVT_WORKER_LEFT, "\U0001f44b Worker left: {name}"),
+    }
+
+    def __init__(
+        self,
+        notifier=None,
+        worker_monitor=None,
+        host_label=HOST_IP,
+        daily_time=TELEGRAM_DAILY_SUMMARY_TIME,
+    ):
+        self.notifier = notifier if notifier is not None else build_default_notifier()
+        self.workers = worker_monitor if worker_monitor is not None else WorkerPresenceMonitor()
+        # Once-daily digest: target local minute-of-day (HH:MM → h*60+m), and the day we last sent
+        # (so it fires once per day). A malformed time disables it.
+        self._daily_target_min = _parse_hhmm(daily_time)
+        self._daily_last = None
+        self._daily_seeded = False
+        # "Unknown Host" is config.py's placeholder when HOST_IP isn't set — don't prefix with it.
+        self.host_label = "" if host_label in (None, "", "Unknown Host") else host_label
+        # None = "not yet observed": the first cycle seeds the baseline without emitting.
+        self._prev_monero_down = None
+        self._prev_tari_down = None
+        self._prev_released = None
+        self._prev_disk_level = None
+        self._prev_db_healthy = None
+        self._prev_xvb_has_share = None
+        self._prev_clearnet_active = None
+        self._prev_xvb_reg = None
+        self._prev_update_available = None
+        self._prev_hashrate_low = None
+        # Persistent host-perf advisories (#104): unlike the transient edges above, these fire on the
+        # FIRST observation of the problem (a stable low-RAM box would never "transition"), so their
+        # baseline is "no problem" (False) rather than None — a problem present on the first cycle is
+        # a real edge and alerts once.
+        self._prev_hugepages_problem = False
+        self._prev_low_ram = False
+        # Tally of problem-state transitions since the last daily digest drained it (#342). Keyed by
+        # event, counted at the exact edge so recoveries / steady state don't inflate it.
+        self._incidents = {}
+        # One-shot "stack is online" ping, sent on the first cycle after the dashboard starts.
+        self._announced_online = False
+
+    @property
+    def enabled(self):
+        return self.notifier.enabled
+
+    def evaluate(
+        self,
+        *,
+        monero_down,
+        tari_down,
+        tari_required,
+        miner_released,
+        workers,
+        workers_expected,
+        disk_percent=0,
+        db_healthy=True,
+        xvb_enabled=False,
+        shares_in_window=0,
+        clearnet_active=False,
+        xvb_registration_state="",
+        update_available=False,
+        low_hr_warning=False,
+        hugepages_reserved=True,
+        low_ram=False,
+        now=None,
+    ):
+        """Pure: fold this cycle's signals into the list of ``(event_key, text)`` to send,
+        filtered to the events the operator left enabled."""
+        alerts = []
+
+        # --- Stack online (one-shot on the first cycle after the dashboard starts) ---
+        if not self._announced_online:
+            self._announced_online = True
+            alerts.append(
+                (
+                    self.EVT_STACK_ONLINE,
+                    self._fmt("\U0001f680 Pithead is online — dashboard up and monitoring."),
+                )
+            )
+
+        # --- Node down / recovered (consume NodeHealthMonitor edges) ---
+        alerts += self._node_edges("Monero", monero_down, "_prev_monero_down")
+        if tari_required:
+            alerts += self._node_edges("Tari", tari_down, "_prev_tari_down")
+        else:
+            # Keep the baseline current while Tari is non-blocking, so flipping it back to
+            # required later doesn't fire a stale edge from a state we never alerted on.
+            self._prev_tari_down = tari_down
+
+        # --- Sync finished (one-shot when the gate first opens) ---
+        if self._prev_released is None:
+            self._prev_released = miner_released
+        elif miner_released and not self._prev_released:
+            alerts.append(
+                (
+                    self.EVT_SYNC_FINISHED,
+                    self._fmt("✅ Node ready — required chain(s) synced; mining has started."),
+                )
+            )
+        self._prev_released = miner_released
+
+        # --- Worker offline / recovered / joined / left (debounced off the DOWN status) ---
+        # Driven by each rig's status in the same worker rows the dashboard shows (DOWN = offline).
+        # Only meaningful while workers are actually expected: when the proxy is intentionally
+        # stopped (initial sync hold, or node-down failover) their absence is by design, so we
+        # reset the tracker instead of aging every rig into a false "offline".
+        if workers_expected:
+            for name, event in self.workers.update(workers, now=now):
+                evt, template = self._WORKER_EDGES[event]
+                if event == "offline":
+                    self._record_incident(self.EVT_WORKER_OFFLINE)
+                alerts.append((evt, self._fmt(template.format(name=name))))
+        else:
+            self.workers.reset()
+
+        # --- Host health: data disk filling up, dashboard DB write failing ---
+        alerts += self._disk_edges(disk_percent)
+        alerts += self._db_edges(db_healthy)
+
+        # --- Revenue / privacy: XvB PPLNS-share gate, clearnet-sync exposure ---
+        alerts += self._xvb_share_edges(xvb_enabled, shares_in_window)
+        alerts += self._clearnet_edges(clearnet_active)
+
+        # --- XvB auto-registration health, and a new Pithead release being available ---
+        alerts += self._registration_edges(xvb_enabled, xvb_registration_state)
+        alerts += self._release_edges(update_available)
+        alerts += self._hashrate_low_edges(low_hr_warning)
+
+        # --- Persistent host-perf advisories (#104): HugePages not reserved, low RAM ---
+        alerts += self._advisory_edge(
+            not hugepages_reserved,
+            "_prev_hugepages_problem",
+            self.EVT_HUGEPAGES,
+            "\U0001f7e0 \U0001f9e0 HugePages not reserved — RandomX hashrate is capped. Apply "
+            "setup's tuning (or edit GRUB) and reboot.",
+            recovery_text="\U0001f7e2 \U0001f9e0 HugePages now reserved — RandomX is unthrottled.",
+        )
+        alerts += self._advisory_edge(
+            low_ram,
+            "_prev_low_ram",
+            self.EVT_LOW_RAM,
+            "\U0001f7e0 \U0001f4be Low RAM for this stack — syncing is memory-heavy (Tari can OOM). "
+            "Add RAM for a stable node.",
+        )
+
+        return [(evt, text) for evt, text in alerts if self.notifier.event_enabled(evt)]
+
+    def _node_edges(self, label, down, attr):
+        prev = getattr(self, attr)
+        setattr(self, attr, down)
+        if prev is None or down == prev:
+            return []
+        if down:
+            self._record_incident(self.EVT_NODE_DOWN)
+            return [
+                (
+                    self.EVT_NODE_DOWN,
+                    self._fmt(
+                        f"\U0001f534 ⛓️ {label} node is DOWN — workers failing over to backup pools."
+                    ),
+                )
+            ]
+        return [
+            (
+                self.EVT_NODE_RECOVERED,
+                self._fmt(f"\U0001f7e2 ⛓️ {label} node recovered — workers readmitted."),
+            )
+        ]
+
+    def _disk_edges(self, disk_percent):
+        """Alert on the data disk crossing the dashboard's own warn/critical thresholds (#138)."""
+        level = (
+            "critical"
+            if disk_percent >= DISK_CRITICAL_PERCENT
+            else "warn"
+            if disk_percent >= DISK_WARN_PERCENT
+            else "ok"
+        )
+        prev = self._prev_disk_level
+        self._prev_disk_level = level
+        if prev is None or level == prev:
+            return []
+        pct = f"{disk_percent:.0f}%"
+        if level in ("critical", "warn"):
+            self._record_incident(self.EVT_DISK_SPACE)
+        if level == "critical":
+            return [
+                (
+                    self.EVT_DISK_SPACE,
+                    self._fmt(
+                        f"\U0001f534 \U0001f4be Data disk almost full ({pct}) — free space now; a "
+                        "full disk can corrupt the Monero database."
+                    ),
+                )
+            ]
+        if level == "warn":
+            return [
+                (
+                    self.EVT_DISK_SPACE,
+                    self._fmt(f"\U0001f7e0 \U0001f4be Data disk filling up ({pct})."),
+                )
+            ]
+        return [
+            (
+                self.EVT_DISK_SPACE,
+                self._fmt(f"\U0001f7e2 \U0001f4be Data disk back to healthy ({pct})."),
+            )
+        ]
+
+    def _db_edges(self, db_healthy):
+        """Alert when the dashboard can no longer persist to its SQLite DB (#131)."""
+        prev = self._prev_db_healthy
+        self._prev_db_healthy = db_healthy
+        if prev is None or db_healthy == prev:
+            return []
+        if not db_healthy:
+            self._record_incident(self.EVT_DB_UNHEALTHY)
+            return [
+                (
+                    self.EVT_DB_UNHEALTHY,
+                    self._fmt(
+                        "\U0001f534 \U0001f5c4️ Dashboard DB write failing — hashrate history, shares "
+                        "and stats won't persist. Check disk space + permissions on the data dir."
+                    ),
+                )
+            ]
+        return [
+            (
+                self.EVT_DB_UNHEALTHY,
+                self._fmt("\U0001f7e2 \U0001f5c4️ Dashboard DB writes recovered."),
+            )
+        ]
+
+    def _xvb_share_edges(self, xvb_enabled, shares_in_window):
+        """Alert on losing / regaining the PPLNS share XvB needs to bank a raffle win (#158).
+
+        Only meaningful while XvB is on. A donating rig with **no** share in the PPLNS window has
+        its wins skipped (and accrues a fail) regardless of tier — a make-or-break, revenue-costing
+        state worth a ping."""
+        if not xvb_enabled:
+            # No XvB → the share gate doesn't apply; drop the baseline so turning XvB back on later
+            # doesn't replay a stale edge.
+            self._prev_xvb_has_share = None
+            return []
+        has_share = shares_in_window > 0
+        prev = self._prev_xvb_has_share
+        self._prev_xvb_has_share = has_share
+        if prev is None or has_share == prev:
+            return []
+        if not has_share:
+            self._record_incident(self.EVT_XVB_NO_SHARE)
+            return [
+                (
+                    self.EVT_XVB_NO_SHARE,
+                    self._fmt(
+                        "⚠️ \U0001f3b0 No PPLNS share — XvB raffle wins are skipped until you land "
+                        "one (donations are wasted meanwhile)."
+                    ),
+                )
+            ]
+        return [
+            (
+                self.EVT_XVB_NO_SHARE,
+                self._fmt(
+                    "\U0001f7e2 \U0001f3b0 PPLNS share restored — XvB raffle wins count again."
+                ),
+            )
+        ]
+
+    def _clearnet_edges(self, clearnet_active):
+        """Alert while a node is doing its initial sync over CLEARNET (#183): the host IP is exposed
+        to that chain's P2P network until it finishes (it reverts to Tor automatically, #234)."""
+        prev = self._prev_clearnet_active
+        self._prev_clearnet_active = clearnet_active
+        if prev is None or clearnet_active == prev:
+            return []
+        if clearnet_active:
+            self._record_incident(self.EVT_CLEARNET_EXPOSED)
+            return [
+                (
+                    self.EVT_CLEARNET_EXPOSED,
+                    self._fmt(
+                        "⚠️ \U0001f310 Clearnet initial sync ACTIVE — this host's IP is exposed to the "
+                        "chain's P2P network until it finishes syncing (reverts to Tor automatically)."
+                    ),
+                )
+            ]
+        return [
+            (
+                self.EVT_CLEARNET_EXPOSED,
+                self._fmt(
+                    "\U0001f7e2 \U0001f9c5 Back on Tor-only — clearnet sync finished, host IP no "
+                    "longer exposed."
+                ),
+            )
+        ]
+
+    def _registration_edges(self, xvb_enabled, state):
+        """Alert on XvB auto-registration going bad / recovering (#263). ``state`` is one of
+        ``""`` / ``registered`` / ``invalid`` (wallet rejected — permanent) / ``failing``."""
+        if not xvb_enabled:
+            self._prev_xvb_reg = None
+            return []
+        prev = self._prev_xvb_reg
+        self._prev_xvb_reg = state
+        if prev is None or state == prev:
+            return []
+        if state in ("invalid", "failing"):
+            self._record_incident(self.EVT_XVB_REGISTRATION)
+        if state == "invalid":
+            return [
+                (
+                    self.EVT_XVB_REGISTRATION,
+                    self._fmt(
+                        "\U0001f534 \U0001f3b0 XvB wallet rejected — auto-registration failed "
+                        "(check the payout address); raffle wins won't count."
+ ), + ) + ] + if state == "failing": + return [ + ( + self.EVT_XVB_REGISTRATION, + self._fmt("⚠️ \U0001f3b0 XvB auto-registration failing — retrying."), + ) + ] + if state == "registered" and prev in ("invalid", "failing"): + return [ + ( + self.EVT_XVB_REGISTRATION, + self._fmt( + "\U0001f7e2 \U0001f3b0 XvB registration recovered — you're in the raffle." + ), + ) + ] + return [] + + def _release_edges(self, update_available): + """One-shot ping when a newer Pithead release becomes available (#224).""" + prev = self._prev_update_available + self._prev_update_available = bool(update_available) + if prev is None or not update_available or update_available == prev: + return [] + return [ + ( + self.EVT_NEW_RELEASE, + self._fmt( + "\U0001f195 A new Pithead release is available — see the dashboard header." + ), + ) + ] + + def _hashrate_low_edges(self, low_hr_warning): + """Alert when a manually-chosen XvB tier can't be sustained by the current hashrate (#158), + and when it recovers. Edge-only (fires on the transition, not every cycle).""" + prev = self._prev_hashrate_low + self._prev_hashrate_low = bool(low_hr_warning) + if prev is None or bool(low_hr_warning) == prev: + return [] + if low_hr_warning: + self._record_incident(self.EVT_HASHRATE_LOW) + return [ + ( + self.EVT_HASHRATE_LOW, + self._fmt( + "⚠️ \U0001f4c9 Hashrate too low for the chosen XvB tier — it can't be " + "sustained; lower the tier or add hashrate." + ), + ) + ] + return [ + ( + self.EVT_HASHRATE_LOW, + self._fmt("\U0001f7e2 \U0001f4c8 Hashrate back above the chosen XvB tier."), + ) + ] + + def _advisory_edge(self, problem, attr, event, problem_text, recovery_text=None): + """Persistent host-perf advisory (#104): fires once when ``problem`` is first observed true + (including on the first cycle — a stable bad state must still alert, unlike the seed-silent + transient edges), stays quiet while it persists, and — if ``recovery_text`` is given — fires + once when it clears.
These are static host facts, not transient incidents, so they aren't + tallied in the daily incident log.""" + prev = getattr(self, attr) + setattr(self, attr, problem) + if problem == prev: + return [] + if problem: + return [(event, self._fmt(problem_text))] + return [(event, self._fmt(recovery_text))] if recovery_text else [] + + def _record_incident(self, key): + """Tally one problem-state transition for the daily incident log (#342).""" + self._incidents[key] = self._incidents.get(key, 0) + 1 + + def drain_incidents(self): + """Return the incidents tallied since the last drain and reset the counter. Called by the + daily digest so the count spans ~the last day (since the previous digest).""" + incidents, self._incidents = self._incidents, {} + return incidents + + def _fmt(self, text): + return f"[{self.host_label}] {text}" if self.host_label else text + + async def process(self, **signals): + """Evaluate this cycle's signals and dispatch any alerts. No-op (and cheap) when the + notifier is disabled. Each send runs off-thread so a slow Telegram call can't stall + the data loop. Returns the alerts that were dispatched (handy for tests/logging).""" + if not self.notifier.enabled: + return [] + try: + alerts = self.evaluate(**signals) + except Exception as exc: # never let an alerting bug break the data loop + logger.debug("Alert evaluation failed (%s)", type(exc).__name__) + return [] + for _evt, text in alerts: + await asyncio.to_thread(self.notifier.send, text) + return alerts + + async def degradation_alert(self, kind, drop_frac): + """Push a hashrate-loss / recovery alert for a :class:`DegradationMonitor` edge (#99). The + detector owns the debounce + thresholds; this only formats and sends (and records the loss + as an incident for the daily log). 
No-op when the event is toggled off.""" + if kind == "loss": + self._record_incident(self.EVT_HASHRATE_LOSS) + if not self.notifier.event_enabled(self.EVT_HASHRATE_LOSS): + return None + if kind == "loss": + text = self._fmt( + f"⚠️ \U0001f4c9 Hashrate dropped ~{drop_frac * 100:.0f}% — possible outage or a rig " + "gone dark." + ) + else: + text = self._fmt("\U0001f7e2 \U0001f4c8 Hashrate recovered.") + await asyncio.to_thread(self.notifier.send, text) + return text + + async def maybe_daily_summary(self, now, summary_provider): + """Push a once-daily status digest at the configured local time. + + ``summary_provider()`` builds the digest text and is called **only when a send is actually + due**, so it isn't run every cycle. No-op when the ``daily_summary`` event is off, the time + is malformed, or the digest has already gone out today. On a startup that's already past + today's time it waits for tomorrow rather than firing a stale digest immediately. Returns the + text sent (handy for tests), else ``None``. + """ + if self._daily_target_min is None or not self.notifier.event_enabled( + self.EVT_DAILY_SUMMARY + ): + return None + lt = time.localtime(now) + today = (lt.tm_year, lt.tm_yday) + now_min = lt.tm_hour * 60 + lt.tm_min + if not self._daily_seeded: + self._daily_seeded = True + # Started after today's send time → don't replay it now; wait for tomorrow.
+ if now_min >= self._daily_target_min: + self._daily_last = today + if self._daily_last == today or now_min < self._daily_target_min: + return None + self._daily_last = today + try: + text = summary_provider() + except Exception as exc: # a bad summary build must not wedge the loop + logger.debug("Daily summary build failed (%s)", type(exc).__name__) + return None + if text: + await asyncio.to_thread(self.notifier.send, text) + return text diff --git a/build/dashboard/mining_dashboard/service/data_service.py b/build/dashboard/mining_dashboard/service/data_service.py index d626536..c8a3ea7 100644 --- a/build/dashboard/mining_dashboard/service/data_service.py +++ b/build/dashboard/mining_dashboard/service/data_service.py @@ -21,6 +21,7 @@ get_tari_stats, ) from mining_dashboard.collector.system import ( + get_cpu_avx2, get_cpu_usage, get_disk_usage, get_hugepages_status, @@ -32,6 +33,10 @@ CLEARNET_STATE_DIR, ENABLE_XVB, GITHUB_RELEASES_API, + HASHRATE_DROP_MINUTES, + HASHRATE_DROP_THRESHOLD_PCT, + HOST_IP, + LOW_RAM_GB, MONERO_CLEARNET_SYNC, REJECT_WORKERS_CONTAINER, SYNC_GATE_CONTAINERS, @@ -45,12 +50,18 @@ ) from mining_dashboard.helper.utils import ( DEFAULT_PPLNS_WINDOW, + effective_hashrate, + format_hashrate, pplns_block_time, shares_in_pplns_window, ) +from mining_dashboard.service.alert_service import AlertService from mining_dashboard.service.clearnet_sync import ClearnetSyncSupervisor +from mining_dashboard.service.degradation import DegradationMonitor from mining_dashboard.service.healthchecks import HealthchecksClient +from mining_dashboard.service.metrics import build_metrics from mining_dashboard.service.node_health import NodeHealthMonitor +from mining_dashboard.service.telegram_commands import format_daily_summary from mining_dashboard.service.update_checker import GitHubReleaseClient, UpdateChecker logger = logging.getLogger("DataService") @@ -241,12 +252,7 @@ def _aggregate_hashrate(workers): total_h10 = 0 for w in workers: if w.get("status") == 
"online": - w_hr = w.get("h15", 0) - if w_hr == 0: - w_hr = w.get("h60", 0) - if w_hr == 0: - w_hr = w.get("h10", 0) - total_hr += w_hr + total_hr += effective_hashrate(w) total_h10 += w.get("h10", 0) return total_hr, total_h10 @@ -409,6 +415,19 @@ def __init__(self, state_manager, proxy_client, xvb_client): ) # Per-chain "currently exposed on clearnet" flags, surfaced in the snapshot for the UI/banner. self.clearnet_sync_state = {"monero": False, "tari": False, "active": False} + + # Notifications-only Telegram alerter (Issue #121). Consumes the loop's existing edges + # (node down/recovered, sync gate open) plus a debounced per-worker presence tracker. + # Disabled unless telegram.enabled + bot_token + chat_id are configured, so this is a + # cheap no-op for the default stack. + self.alert_service = AlertService() + # Hashrate-degradation detector (Issue #99): flags a sustained total-hashrate drop and its + # recovery. Runs every cycle (cheap, self-contained EMA baseline) so it can mark the chart + # even with Telegram off; a loss also drives a hashrate_loss alert. + self.degradation = DegradationMonitor( + threshold_frac=HASHRATE_DROP_THRESHOLD_PCT / 100, + sustained_sec=HASHRATE_DROP_MINUTES * 60, + ) # True while we've stopped the proxy to reject workers. Persisted in the snapshot so # a dashboard restart mid-outage still readmits workers once the node recovers. self.workers_rejected = False @@ -771,8 +790,96 @@ async def run(self): if self.miner_released: await self._apply_worker_rejection(monero_down, tari_down) - # Fetch fresh shares list to populate UI + # 5. Operator alerts (Issues #121/#45): push debounced node/worker/sync/host + # edges to Telegram. Consumes the flags computed above; worker presence is only + # tracked while the proxy is actually serving (miner released and not rejected) β€” + # its intentional absence otherwise must not read as offline. Disk usage is read + # once here and reused in the snapshot below. 
No-op unless Telegram is configured; + # never raises. + disk_usage = get_disk_usage() + # Host-perf snapshot (#104), read once and reused for both the alerts and the + # system panel below. Cheap /proc reads. + hugepages = get_hugepages_status() + memory = get_memory_usage() + avx2 = get_cpu_avx2() + db_healthy = self.state_manager.is_db_healthy() + # Fetch fresh shares list (also used to populate the UI below) so the PPLNS-share + # gate the XvB alert watches is computed from the same figure the dashboard shows. shares_list = await asyncio.to_thread(self.state_manager.get_shares) + pool_local = p2pool_stats.get("pool", {}) + pool_type = p2pool_stats.get("p2p", {}).get("type", "Main") + shares_in_window = shares_in_pplns_window( + shares_list, + pool_local.get("pplns_window", DEFAULT_PPLNS_WINDOW), + pplns_block_time(pool_type), + ) + # Build the domain metrics once per cycle for the alerter — but only when the + # bot is actually on, so the default (Telegram-off) stack pays nothing. Reused + # for the hashrate-low edge and the daily digest. + alert_metrics = ( + build_metrics(self.latest_data, self.state_manager) + if self.alert_service.enabled + else None + ) + await self.alert_service.process( + monero_down=monero_down, + tari_down=tari_down, + tari_required=TARI_REQUIRED, + miner_released=self.miner_released, + # The same worker rows the dashboard shows; the monitor reads each rig's + # status (DOWN = offline) so alerts line up with the on-screen state.
+ workers=final_workers, + workers_expected=self.miner_released and not self.workers_rejected, + disk_percent=(disk_usage or {}).get("percent", 0) or 0, + db_healthy=db_healthy, + xvb_enabled=ENABLE_XVB, + shares_in_window=shares_in_window, + clearnet_active=bool(self.clearnet_sync_state.get("active")), + xvb_registration_state=(self.state_manager.get_xvb_stats() or {}).get( + "registration_state", "" + ), + # From the previous cycle's snapshot (the update check writes it below); a + # one-cycle lag is fine for a one-shot "new release" ping. + update_available=bool( + (self.latest_data.get("update") or {}).get("available") + ), + low_hr_warning=bool(alert_metrics and alert_metrics.low_hr_warning), + # Persistent host-perf conditions (#104). HugePages "Disabled" = not + # reserved (recoverable via reboot); low_ram compares live total to the + # threshold. avx2 is badge-only (no alert), so it isn't passed here. + hugepages_reserved=(hugepages[0] != "Disabled"), + low_ram=(0 < (memory.get("total_gb") or 0) < LOW_RAM_GB), + ) + # Once-daily status digest, reusing the metrics built above (only when the bot + # is on, which is also the only time maybe_daily_summary would send). + await self.alert_service.maybe_daily_summary( + time.time(), + # bind this cycle's metrics (the provider runs within this iteration); drain + # the day's incident tally into the digest (#342). + lambda m=alert_metrics: format_daily_summary( + m, + self.latest_data, + HOST_IP, + incidents=self.alert_service.drain_incidents(), + ), + ) + # 6. Degradation detector (#99): a sustained total-hashrate drop / recovery is + # persisted as a chart event marker and pushed as a hashrate_loss alert. 
+ deg_edge = self.degradation.update(total_hr) + if deg_edge: + kind, drop_frac, _baseline, current = deg_edge + if kind == "loss": + ev_type = "hashrate_loss" + detail = ( + f"Hashrate βˆ’{drop_frac * 100:.0f}% ({format_hashrate(current)})" + ) + else: + ev_type = "hashrate_recovered" + detail = f"Hashrate recovered ({format_hashrate(current)})" + await asyncio.to_thread( + self.state_manager.add_event, time.time(), ev_type, detail + ) + await self.alert_service.degradation_alert(kind, drop_frac) self.latest_data.update( { @@ -793,9 +900,10 @@ async def run(self): "miner_held": self.miner_held, "clearnet_sync": self.clearnet_sync_state, "system": { - "disk": get_disk_usage(), - "hugepages": get_hugepages_status(), - "memory": get_memory_usage(), + "disk": disk_usage, + "hugepages": hugepages, + "memory": memory, + "avx2": avx2, "load": get_load_average(), "cpu_percent": get_cpu_usage(), }, diff --git a/build/dashboard/mining_dashboard/service/degradation.py b/build/dashboard/mining_dashboard/service/degradation.py new file mode 100644 index 0000000..afd66ef --- /dev/null +++ b/build/dashboard/mining_dashboard/service/degradation.py @@ -0,0 +1,87 @@ +"""Hashrate-degradation detector (Issue #99). + +Flags a **sustained significant drop** in total effective hashrate β€” an outage or a rig going +dark β€” and its recovery, as debounced edges. One detector, consumed by two sinks: an event marker +on the dashboard chart, and a Telegram alert (#121). Defining it here once keeps the thresholds and +debounce in a single place rather than duplicated per sink. + +Design: + +- **Self-contained baseline.** The "normal" level is a slow exponential moving average of the total + hashrate kept in-process β€” no per-cycle DB read, so the detector can run every loop even when + Telegram is off (the chart marker is a passive dashboard feature). The baseline is **frozen while + degraded**, so a drop that persists doesn't drag the baseline down and mask itself. 
+- **Debounce / hysteresis.** A drop must stay below ``threshold_frac`` of the baseline for + ``sustained_sec`` before a ``loss`` edge fires, and climb back above ``recovery_frac`` for + ``recovery_sec`` before ``recovered`` — so a brief blip doesn't mark the chart or ping you. +- **Cold-start safe.** Until the baseline exceeds ``min_baseline`` (a tiny/just-started fleet), no + edges fire — a stack that hasn't ramped yet can't be "degraded". + +``update(current, now)`` returns ``None`` or a ``(kind, drop_frac, baseline, current)`` tuple, +``kind`` in ``{"loss", "recovered"}``. +""" + +import time + + +class DegradationMonitor: + def __init__( + self, + threshold_frac=0.5, + sustained_sec=600, + recovery_frac=0.8, + recovery_sec=120, + min_baseline=500, + ema_alpha=0.01, + clock=time.monotonic, + ): + self.threshold_frac = threshold_frac + self.sustained_sec = sustained_sec + self.recovery_frac = recovery_frac + self.recovery_sec = recovery_sec + self.min_baseline = min_baseline + self.ema_alpha = ema_alpha + self._clock = clock + self._baseline = None # EMA of total hashrate; the "normal" level + self._degraded = False + self._below_since = None + self._above_since = None + + def update(self, current, now=None): + now = self._clock() if now is None else now + current = current or 0 + # Update the baseline only while healthy, so a sustained drop can't erode it and hide itself. + if self._baseline is None: + self._baseline = current + elif not self._degraded: + self._baseline = (1 - self.ema_alpha) * self._baseline + self.ema_alpha * current + baseline = self._baseline + + if baseline < self.min_baseline: + # Not enough hashrate to judge (cold start / tiny fleet) — no false alarms.
+ self._below_since = self._above_since = None + return None + + drop_frac = max(0.0, 1 - current / baseline) if baseline else 0.0 + + if not self._degraded: + if current < self.threshold_frac * baseline: + if self._below_since is None: + self._below_since = now + if now - self._below_since >= self.sustained_sec: + self._degraded = True + self._below_since = None + return ("loss", drop_frac, baseline, current) + else: + self._below_since = None + else: + if current >= self.recovery_frac * baseline: + if self._above_since is None: + self._above_since = now + if now - self._above_since >= self.recovery_sec: + self._degraded = False + self._above_since = None + return ("recovered", drop_frac, baseline, current) + else: + self._above_since = None + return None diff --git a/build/dashboard/mining_dashboard/service/egress.py b/build/dashboard/mining_dashboard/service/egress.py index b410ba5..838bc92 100644 --- a/build/dashboard/mining_dashboard/service/egress.py +++ b/build/dashboard/mining_dashboard/service/egress.py @@ -9,8 +9,9 @@ * The **#270 egress firewall** (``DOCKER-USER``, fail-closed) DROPs non-Tor egress from the *container* subnet β€” so a container's clearnet route can't actually leave while it's on. * It does **not** cover the **host-networked dashboard** (``network_mode: host``), whose own egress - (XvB stats fetch, update check) bypasses ``DOCKER-USER`` entirely. Those rely solely on their - SOCKS config β€” a clearnet route there is a real leak regardless of the firewall. + (XvB stats fetch, update check, Healthchecks ping, Telegram bot) bypasses ``DOCKER-USER`` entirely. + Those rely solely on their SOCKS config β€” a clearnet route there is a real leak regardless of the + firewall. (All four are Tor-routed by default, so none leak.) So a connection is a *leak* only when its route is clearnet AND it isn't neutralised by a backstop. 
""" @@ -39,6 +40,7 @@ def compute_egress_posture( tari_clearnet_sync, remote_monero, healthchecks_enabled, + telegram_enabled, ): """Pure derivation of the egress posture from config knobs. Returns ``{components, summary}``.""" xvb = _xvb_route(xvb_enabled, xvb_tor) @@ -98,6 +100,8 @@ def compute_egress_posture( {"to": "update check (github)", "route": TOR}, # socks5h, #224 # Healthchecks.io dead-man's-switch ping β€” always over Tor when a URL is set (#79). {"to": "Healthchecks.io ping", "route": TOR if healthchecks_enabled else INACTIVE}, + # Telegram bot (alerts + command long-poll) β€” always over Tor when on (#121/#340). + {"to": "Telegram bot", "route": TOR if telegram_enabled else INACTIVE}, ], }, { @@ -150,6 +154,7 @@ def egress_posture_from_config(): tari_clearnet_sync=config.TARI_CLEARNET_SYNC, remote_monero=config.MONERO_NODE_HOST != config.LOCAL_MONERO_HOST, healthchecks_enabled=bool(config.HEALTHCHECKS_PING_URL), + telegram_enabled=config.TELEGRAM_ENABLED, ) @@ -201,6 +206,7 @@ def compute_topology( tari_clearnet_sync, remote_monero, healthchecks_enabled, + telegram_enabled, ): """Pure derivation of the stack topology. Returns ``{nodes, edges, summary}``. @@ -217,6 +223,7 @@ def compute_topology( tari_clearnet_sync=tari_clearnet_sync, remote_monero=remote_monero, healthchecks_enabled=healthchecks_enabled, + telegram_enabled=telegram_enabled, ) xvb = _xvb_route(xvb_enabled, xvb_tor) sidechain = CLEARNET if p2pool_clearnet else TOR @@ -242,6 +249,14 @@ def compute_topology( "Healthchecks ping", "egress", ), + # Telegram bot (alerts + command long-poll) β€” always over Tor when on (#121/#340). + _edge( + "dashboard", + "tor", + TOR if telegram_enabled else INACTIVE, + "Telegram bot", + "egress", + ), # The Tor hub to the network: SOCKS egress for every daemon + onion-service ingress. _edge("tor", "internet", TOR, "SOCKS + onion circuits", "p2p"), # Internal mesh (hidden until expanded). 
@@ -284,4 +299,5 @@ def topology_from_config(): tari_clearnet_sync=config.TARI_CLEARNET_SYNC, remote_monero=config.MONERO_NODE_HOST != config.LOCAL_MONERO_HOST, healthchecks_enabled=bool(config.HEALTHCHECKS_PING_URL), + telegram_enabled=config.TELEGRAM_ENABLED, ) diff --git a/build/dashboard/mining_dashboard/service/storage_service.py b/build/dashboard/mining_dashboard/service/storage_service.py index c5e6599..33af2c4 100644 --- a/build/dashboard/mining_dashboard/service/storage_service.py +++ b/build/dashboard/mining_dashboard/service/storage_service.py @@ -43,6 +43,7 @@ def __init__(self, db_path: str = None): self.state = { "hashrate_history": deque(), "shares": [], + "events": [], # degradation / recovery markers for the chart (#99) "xvb": { "total_donated_time": 0.0, "current_mode": "P2POOL", @@ -119,6 +120,9 @@ def _create_tables(self): self._conn.execute( "CREATE TABLE IF NOT EXISTS shares (ts REAL PRIMARY KEY, difficulty REAL)" ) + # Degradation / recovery event markers for the chart (#99). No PK β€” two events can share a + # timestamp; type is "hashrate_loss"|"hashrate_recovered"|... and detail is the tooltip text. + self._conn.execute("CREATE TABLE IF NOT EXISTS events (ts REAL, type TEXT, detail TEXT)") def _create_indexes(self): """Creates indexes. Called after migrations so the indexed columns are guaranteed to @@ -234,6 +238,17 @@ def load(self): ) self.state["shares"] = [dict(row) for row in cursor.fetchall()] + # 4. Load chart events (#99) β€” the events table is additive, so guard against a + # pre-migration DB that predates it. + try: + cursor.execute( + "SELECT ts, type, detail FROM events WHERE ts > ? 
ORDER BY ts ASC", + (history_cutoff,), + ) + self.state["events"] = [dict(row) for row in cursor.fetchall()] + except sqlite3.Error: + self.state["events"] = [] + self.logger.info(f"State successfully loaded from {self.db_path}") except sqlite3.Error as e: self.logger.error(f"DB Load Error: {e}") @@ -366,6 +381,37 @@ def get_shares(self) -> list[dict[str, Any]]: with self._lock: return list(self.state.get("shares", [])) + def add_event(self, ts: float, event_type: str, detail: str = ""): + """Record a chart event marker (#99) β€” a degradation/recovery point the chart draws and the + history window prunes, mirroring shares. Persisted so it survives a dashboard restart.""" + with self._lock: + self.state.setdefault("events", []).append( + {"ts": ts, "type": event_type, "detail": detail} + ) + cutoff = time.time() - HISTORY_RETENTION_SEC + self.state["events"] = [e for e in self.state["events"] if e["ts"] >= cutoff] + try: + with self._db_lock: + if not self._conn: + return + with self._conn: + self._conn.execute( + "INSERT INTO events (ts, type, detail) VALUES (?, ?, ?)", + (ts, event_type, detail), + ) + if random.random() < 0.05: # noqa: S311 β€” pruning sampler, not a security context + self._conn.execute( + "DELETE FROM events WHERE ts < ?", + (time.time() - HISTORY_RETENTION_SEC,), + ) + except sqlite3.Error as e: + self._db_error("Event Insert Error", e) + + def get_events(self) -> list[dict[str, Any]]: + """Returns a copy of the chart events (#99).""" + with self._lock: + return list(self.state.get("events", [])) + def get_xvb_stats(self) -> dict[str, Any]: """Returns the current XvB mining statistics dictionary.""" with self._lock: diff --git a/build/dashboard/mining_dashboard/service/telegram_commands.py b/build/dashboard/mining_dashboard/service/telegram_commands.py new file mode 100644 index 0000000..454f76c --- /dev/null +++ b/build/dashboard/mining_dashboard/service/telegram_commands.py @@ -0,0 +1,613 @@ +import asyncio +import logging +import time + 
+import requests + +from mining_dashboard.config.config import ( + HOST_IP, + TELEGRAM_BOT_TOKEN, + TELEGRAM_CHAT_ID, + TELEGRAM_COMMANDS_ENABLED, + TELEGRAM_ENABLED, + TOR_SOCKS_PROXY, +) +from mining_dashboard.helper.utils import effective_hashrate, format_duration, format_hashrate +from mining_dashboard.service.earnings import xmr_per_hs_day +from mining_dashboard.service.egress import egress_posture_from_config +from mining_dashboard.service.metrics import build_metrics +from mining_dashboard.service.telegram_notifier import TELEGRAM_API_BASE +from mining_dashboard.version import resolve_version +from mining_dashboard.web.views import build_badges + +logger = logging.getLogger("TelegramCommands") + +# Seconds handed to getUpdates: Telegram holds the request open until an update arrives or this +# elapses, so the bot makes ~one request per interval while idle (long-poll, not busy-poll). +LONG_POLL_SECONDS = 25 +# Quiet retry after a failed poll — a Tor-only / offline host can't reach api.telegram.org, so a +# persistently-blocked bot backs off instead of hot-looping (and never spams ERROR; #59 discipline). +POLL_ERROR_BACKOFF_SECONDS = 15 + +# The commands the bot answers. All are read-only status queries — the bot can never change the +# stack (start/stop/apply live on the CLI), so a leaked chat can at worst read status, not act.
+COMMANDS = ( + "status", + "info", + "hashrate", + "workers", + "sync", + "system", + "pool", + "xvb", + "earnings", + "help", +) + +HELP_TEXT = ( + "Pithead bot — commands:\n" + "/status — stack health at a glance\n" + "/info — version, updates, DB mode, privacy posture\n" + "/hashrate — total + per-worker hashrate\n" + "/workers — each rig's online/offline state\n" + "/sync — Monero + Tari node sync progress\n" + "/system — host disk, RAM, CPU, HugePages\n" + "/pool — P2Pool sidechain + Monero network\n" + "/xvb — XvB mode, tier, and raffle eligibility\n" + "/earnings — estimated P2Pool XMR per day\n" + "/help — this message" +) + + +def _prefix(host_label): + """Hostname tag so replies from several stacks sharing one chat stay distinguishable. + 'Unknown Host' is config.py's placeholder when HOST_IP is unset — drop it, don't print it.""" + if host_label in (None, "", "Unknown Host"): + return "" + return f"[{host_label}] " + + +def parse_command(text): + """Extract the command word from a message, or ``None`` if it isn't a slash command. + + Returns the bare command (lowercased, with any ``@botname`` suffix stripped — Telegram appends + it in groups, e.g. ``/status@PitheadBot``). An unrecognized slash command comes back as + ``"unknown"`` so the caller can nudge with the help text; plain chatter returns ``None`` and is + ignored, so the bot never talks over a group it happens to share.
+ """ + if not text: + return None + text = text.strip() + if not text.startswith("/"): + return None + word = text.split(maxsplit=1)[0] + cmd = word[1:].split("@", 1)[0].lower() + if not cmd: + return None + return cmd if cmd in COMMANDS else "unknown" + + +def _node_state(sync): + """One-glance node health from a :class:`~mining_dashboard.service.metrics.SyncMetric`.""" + if sync.down: + return "\U0001f534 down" + if sync.done: + return "\U0001f7e2 synced" + return f"⏳ syncing {sync.percent:.1f}%" + + +def _human_count(n): + """Compact SI-suffixed number for large figures like network difficulty (380_000_000_000 → + '380.00 G'). Small values pass through as a plain integer.""" + n = float(n or 0) + for unit in ("", "K", "M", "G", "T", "P"): + if abs(n) < 1000: + return f"{n:.2f} {unit}".strip() if unit else f"{int(n)}" + n /= 1000 + return f"{n:.2f} E" + + +def format_status(metrics, mining_active, host_label="", warnings=None, merge_mining=None): + """Overall stack health — the answer to '/status'. Pure: folds a :class:`Metrics` (plus the + mining-active flag the loop derives from the sync gate, any active warning/error badges, and the + Tari merge-mine link state) into text; no I/O.
``merge_mining`` is the gRPC-connected flag — a + distinct signal from a synced Tari node (the link can be down while the node is up), or ``None`` + to omit the line (Tari not in play).""" + lines = [ + f"{_prefix(host_label)}\U0001f4ca Pithead status", + f"Monero node: {_node_state(metrics.monero)}", + f"Tari node: {_node_state(metrics.tari)}", + ] + if merge_mining is not None: + lines.append(f"Merge-mining: {'🟢 Tari linked' if merge_mining else '⏸ Tari not linked'}") + if metrics.global_syncing: + lines.append("Mining: ⏳ holding — chain(s) syncing") + elif mining_active: + lines.append(f"Mining: \U0001f7e2 active ({metrics.mode})") + else: + lines.append("Mining: \U0001f534 not mining") + lines.append(f"Workers: {metrics.workers_online}/{metrics.workers_total} online") + lines.append(f"Hashrate: {format_hashrate(metrics.total_h15)} (10m avg)") + lines.append(f"PPLNS shares: {metrics.shares_in_window} in window") + # Surface the same warning/error badges the dashboard's top bar shows (#104), so /status is a + # one-glance "anything wrong?" — or an explicit all-clear. + if warnings: + lines.append("") + lines.append("⚠️ Warnings:") + lines.extend(f"• {w}" for w in warnings) + else: + lines.append("") + lines.append("✅ No warnings.") + return "\n".join(lines) + + +def format_info(version, update, metrics, egress_summary, host_label=""): + """The 'about this stack' card — the answer to '/info'. Folds the build version, whether an + upgrade is available, the Monero DB mode, the P2Pool sidechain, and the privacy (egress) posture + into one glance.
Static-ish facts, kept out of /status (which is live health).""" + lines = [f"{_prefix(host_label)}\U0001f4df Pithead info"] + + ver = (version or {}).get("text", "unknown") + lines.append(f"Version: {ver}{' (dev build)' if (version or {}).get('dev') else ''}") + + update = update or {} + if update.get("available") and update.get("latest"): + lines.append(f"Updates: \U0001f195 {update['latest']} available — ./pithead upgrade") + else: + lines.append("Updates: ✅ Up to date") + + mode = metrics.monero_mode + lines.append(f"Monero DB: {mode}" if mode in ("Pruned", "Full") else "Monero DB: unknown") + lines.append(f"Sidechain: P2Pool {metrics.pool_type}") + + egress_summary = egress_summary or {} + if egress_summary.get("all_tor", True): + lines.append("Egress: \U0001f9c5 Tor-only") + else: + lines.append(f"Egress: ⚠️ {egress_summary.get('label', 'clearnet exposure')}") + return "\n".join(lines) + + +def status_warnings(data, metrics, db_healthy): + """The active warning/error badges for /status: every ``bad`` badge plus the ``⚠``-flagged + ``warn`` badges (which the informational states — 'Syncing…', 'Miner held' — deliberately lack), + reusing :func:`build_badges` so this never drifts from the dashboard's own top bar. The leading + ``⚠`` is stripped since the section already has one header.""" + out = [] + for b in build_badges(data, metrics, "", db_healthy=db_healthy): + if b["variant"] == "bad" or b["text"].startswith("⚠"): + out.append(b["text"].removeprefix("⚠ ")) + return out + + +def format_hashrate_reply(metrics, workers, host_label=""): + """Total + per-online-worker hashrate — the answer to '/hashrate'. + + Both the total and each per-worker figure use the same :func:`effective_hashrate` (10m average, + 1m fallback for a rig without 10m history yet), so the per-worker lines add up to the total — + a just-connected worker reads its real live rate, not 0.
+ """ + lines = [ + f"{_prefix(host_label)}⚡ Hashrate", + f"Total: {format_hashrate(metrics.total_h15)} (10m avg)", + ] + online = [w for w in workers if w.get("status") == "online"] + if not online: + lines.append("No workers online.") + for w in sorted(online, key=effective_hashrate, reverse=True): + lines.append(f"• {w.get('name', '?')}: {format_hashrate(effective_hashrate(w))}") + return "\n".join(lines) + + +def format_workers(workers, host_label=""): + """Per-worker online/offline roll-call — the answer to '/workers'. Offline first-sighted + workers are those xmrig-proxy still lists with a dead connection.""" + if not workers: + return f"{_prefix(host_label)}\U0001f477 Workers\nNo workers connected." + lines = [f"{_prefix(host_label)}\U0001f477 Workers"] + # Online first, then by name — the offline ones are what an operator scans for. + for w in sorted(workers, key=lambda w: (w.get("status") != "online", w.get("name", ""))): + if w.get("status") == "online": + up = w.get("uptime") or 0 + tail = f" · up {format_duration(up)}" if up else "" + lines.append( + f"\U0001f7e2 {w.get('name', '?')} — {format_hashrate(effective_hashrate(w))}{tail}" + ) + else: + lines.append(f"\U0001f534 {w.get('name', '?')} — offline") + return "\n".join(lines) + + +def _sync_line(name, sync): + if sync.down: + return f"{name}: \U0001f534 node down" + if sync.done: + return f"{name}: \U0001f7e2 synced" + if sync.has_target: + return f"{name}: ⏳ {sync.percent:.1f}% ({sync.current:,}/{sync.target:,})" + return f"{name}: ⏳ syncing {sync.percent:.1f}%" + + +def format_sync(metrics, host_label=""): + """Monero + Tari sync progress — the answer to '/sync'.""" + return "\n".join( + [ + f"{_prefix(host_label)}\U0001f504 Sync status", + _sync_line("Monero", metrics.monero), + _sync_line("Tari", metrics.tari), + ] + ) + + +def format_system(system, host_label=""): + """Host resource usage — the answer to '/system'.
Reads the ``system`` snapshot the dashboard + already collects (disk / RAM / CPU / load / HugePages).""" + disk = system.get("disk", {}) + mem = system.get("memory", {}) + hp_status, _hp_class, hp_value = system.get("hugepages", ["Unknown", "", "0/0"]) + return "\n".join( + [ + f"{_prefix(host_label)}\U0001f5a5️ System", + f"Disk: {disk.get('used_gb', 0):.1f}/{disk.get('total_gb', 0):.1f} GB " + f"({disk.get('percent_str', '0%')})", + f"RAM: {mem.get('used_gb', 0):.1f}/{mem.get('total_gb', 0):.1f} GB " + f"({mem.get('percent_str', '0%')})", + f"CPU: {system.get('cpu_percent', '0%')} · load {system.get('load', 'n/a')}", + f"HugePages: {hp_status} ({hp_value})", + ] + ) + + +def format_pool(metrics, data=None, host_label=""): + """P2Pool sidechain + Monero network figures — the answer to '/pool'. Enriched with the share + submission health and best share the proxy tracks, and the node's found blocks (#82).""" + data = data or {} + lines = [ + f"{_prefix(host_label)}\U0001f30a Pool & network", + f"Sidechain: P2Pool {metrics.pool_type}", + f"Pool hashrate: {format_hashrate(metrics.pool_hashrate)}", + ] + blocks = (data.get("pool", {}) or {}).get("pool", {}).get("blocks_found") + if blocks: + lines.append(f"Blocks found: {blocks:,}") + lines.append( + f"Network: height {metrics.network_height:,} · diff {_human_count(metrics.network_difficulty)}" + ) + lines.append( + f"PPLNS shares: {metrics.shares_in_window} in window ({metrics.pplns_window} blocks)" + ) + # Current share effort — a luck indicator (<100% = finding shares faster than average). + stratum = data.get("stratum", {}) or {} + if "current_effort" in stratum: + lines.append(f"Effort: {stratum['current_effort']:.1f}%") + # Share submission health from the xmrig-proxy /summary (#82): accepted/rejected + best found.
+ summary = data.get("proxy_summary", {}) or {} + accepted = summary.get("accepted", 0) or 0 + rejected = summary.get("rejected", 0) or 0 + if accepted or rejected: + total = accepted + rejected + reject_pct = (rejected / total * 100) if total else 0.0 + lines.append(f"Shares to pool: {accepted:,} ✓ / {rejected:,} ✗ ({reject_pct:.2f}% rejects)") + best = summary.get("best", 0) or 0 + if best: + lines.append(f"Best share: \U0001f48e {int(best):,}") + return "\n".join(lines) + + +def format_xvb(metrics, host_label=""): + """XvB mode / tier / raffle eligibility — the answer to '/xvb'.""" + prefix = _prefix(host_label) + if not metrics.xvb_enabled: + return f"{prefix}\U0001f3b0 XvB is disabled." + lines = [ + f"{prefix}\U0001f3b0 XvB", + f"Mode: {metrics.mode}", + f"Current tier: {metrics.current_tier}", + f"Target tier: {metrics.target_tier}", + f"Routed to XvB: {format_hashrate(metrics.xvb_routed_1h)} (1h)", + # Credited averages are what XvB itself measures — the figures that actually set your tier + # (routed above is what we send; credited is what counts). Showing both explains the tier. + f"Credited by XvB: {format_hashrate(metrics.xvb_1h)} (1h) · " + f"{format_hashrate(metrics.xvb_24h)} (24h)", + ] + # The share half of raffle eligibility (#158): no PPLNS share means XvB wins are skipped. + if metrics.shares_in_window > 0: + lines.append("PPLNS share: \U0001f7e2 held (raffle-eligible)") + else: + lines.append("PPLNS share: ⚠ none — XvB wins skipped") + if metrics.xvb_stale: + lines.append("⚠ XvB stats are stale — showing last-known values.") + return "\n".join(lines) + + +def format_earnings(metrics, network, host_label=""): + """Estimated P2Pool XMR earnings — the answer to '/earnings'.
Reuses the same rate the dashboard + calculator uses (``xmr_per_hs_day``) applied to the displayed P2Pool 1h-average hashrate; Tari + merge-mining earnings are a separate thing and not included (#12).""" + reward_atomic = (network or {}).get("reward", 0) or 0 + coeff_day = xmr_per_hs_day(reward_atomic, metrics.network_difficulty) + if coeff_day <= 0: + return f"{_prefix(host_label)}\U0001f4b0 Earnings estimate unavailable (waiting on network data)." + daily_1h = coeff_day * metrics.p2pool_1h + lines = [ + f"{_prefix(host_label)}\U0001f4b0 Estimated P2Pool earnings", + f"1h avg {format_hashrate(metrics.p2pool_1h)} → ~{daily_1h:.6f} XMR/day", + ] + # The 24h average smooths the variance a 1h window carries, so it's the steadier projection — + # shown (and used for the 30-day figure) only once there's a day of history to average. + if metrics.p2pool_24h > 0: + daily_24h = coeff_day * metrics.p2pool_24h + lines.append( + f"24h avg {format_hashrate(metrics.p2pool_24h)} → ~{daily_24h:.6f} XMR/day " + f"· ~{daily_24h * 30:.5f} XMR/30d" + ) + else: + lines.append(f"~{daily_1h * 30:.5f} XMR/30d") + lines.append("Estimate only — excludes XvB-donated hashrate and Tari merge-mining.") + return "\n".join(lines) + + +# Friendly labels for the daily incident log (#342), keyed by AlertService event. +_INCIDENT_LABELS = { + "node_down": "node down", + "worker_offline": "worker offline", + "disk_space": "disk warning", + "db_unhealthy": "DB write fail", + "xvb_no_share": "XvB no-share", + "xvb_registration": "XvB registration", + "clearnet_exposed": "clearnet exposure", + "hashrate_low": "hashrate low", + "hashrate_loss": "hashrate drop", +} + + +def _incident_line(incidents): + """One-line roll-up of the day's problems, or an all-clear.
``incidents`` is a {event: count} + dict (from ``AlertService.drain_incidents``); ``None`` means the caller didn't track any.""" + if incidents is None: + return None + if not incidents: + return "\U0001f7e2 No incidents in the last 24h" + parts = [ + f"{n}× {_INCIDENT_LABELS.get(k, k)}" + for k, n in sorted(incidents.items(), key=lambda kv: (-kv[1], kv[0])) + ] + return "\U0001f6a8 Incidents (24h): " + " · ".join(parts) + + +def format_daily_summary(metrics, data, host_label="", now=None, incidents=None): + """The once-a-day retrospective pushed by the alerter — **what happened across the fleet over + the last 24h**, not a live snapshot. Reuses the same domain values the dashboard shows. + + Consistency by construction: the fleet 24h figure is the sum of each rig's 24h average, and the + XvB split is that total apportioned by the day's routing fraction — so the per-rig lines add up + to the headline and P2Pool + XvB equals it. ``now`` is injectable for tests; it stamps the + message and bounds the 24h share count.
+ """ + now = time.time() if now is None else now + stamp = time.strftime("%Y-%m-%d %H:%M", time.localtime(now)) + online = [w for w in data.get("workers", []) if w.get("status") == "online"] + fleet_24h = sum(w.get("h24h", 0) or 0 for w in online) + shares_24h = sum(1 for s in data.get("shares", []) if s.get("ts", 0) >= now - 86400) + + lines = [f"{_prefix(host_label)}\U0001f4c5 Daily summary — {stamp}"] + incident_line = _incident_line(incidents) + if incident_line: + lines.append(incident_line) + lines.append(f"⚡ 24h hashrate: {format_hashrate(fleet_24h)}") + if metrics.xvb_enabled: + routed = (metrics.p2pool_24h or 0) + (metrics.xvb_routed_24h or 0) + xvb_frac = (metrics.xvb_routed_24h or 0) / routed if routed else 0 + xvb_hr = fleet_24h * xvb_frac + lines.append( + f" \U0001f535 P2Pool {format_hashrate(fleet_24h - xvb_hr)} · " + f"\U0001f3b2 XvB {format_hashrate(xvb_hr)} ({xvb_frac * 100:.0f}% to XvB)" + ) + lines.append(f"\U0001f3b0 XvB tier: {metrics.current_tier}") + lines.append(f"\U0001f3af Shares (24h): {shares_24h}") + + reward = (data.get("network", {}) or {}).get("reward", 0) or 0 + coeff = xmr_per_hs_day(reward, metrics.network_difficulty) + if coeff > 0: + lines.append( + f"\U0001f4b0 Est. earnings: ~{coeff * (metrics.p2pool_24h or 0):.6f} XMR/day (P2Pool)" + ) + + lines.append(f"\U0001f477 Miners: {metrics.workers_online}/{metrics.workers_total} online") + for w in sorted(online, key=lambda w: w.get("h24h", 0) or 0, reverse=True): + lines.append(f" • {w.get('name', '?')}: {format_hashrate(w.get('h24h', 0))}") + + disk = (data.get("system", {}) or {}).get("disk", {}) or {} + lines.append(f"\U0001f4be Disk: {disk.get('percent_str', 'n/a')} used") + return "\n".join(lines) + + +class TelegramCommandBot: + """ + On-demand Telegram command interface (Issue #45) — the interactive half of the operator bot.
+ + Answers a small set of **read-only** status commands (``/status``, ``/hashrate``, ``/workers``, + ``/sync``, ``/help``) from the data the dashboard already collects, so it never re-implements + collection β€” it reuses :func:`build_metrics`, the same domain layer the web UI renders, so a + Telegram reply and the dashboard can never disagree. + + Discipline (mirrors :class:`TelegramNotifier`): + + - **Off by default, opt-in.** Enabled only when Telegram is on *and* ``telegram.commands.enabled`` + is set *and* both ``bot_token`` and ``chat_id`` are present. Otherwise :meth:`run` returns + immediately, so the background task is a cheap no-op for the default stack. + - **Long-poll, no inbound port.** Uses ``getUpdates`` (outbound only) over the same egress the + notifier uses β€” a webhook would need a public inbound endpoint the Tor-first appliance can't + offer. Nothing is exposed. + - **Single-chat access control.** Only the configured ``chat_id`` is answered; every other update + is dropped silently, so an unknown chat gets no reply and can't use the bot as a probe oracle. + - **Read-only.** No command mutates the stack (lifecycle stays on the CLI), so a compromised chat + can at worst read status. + - **Fail silent, never leaks the token.** Network errors (offline / Tor-only host) are swallowed + at debug and the poll backs off; the ``bot_token`` only ever appears in the request URL and is + never written to a log line. + - **No stale replay.** On startup the backlog is skipped (offset primed past pending updates), so + a command sent while the dashboard was down isn't executed minutes later on restart. 
+ """ + + def __init__( + self, + data_service, + *, + enabled=None, + bot_token=TELEGRAM_BOT_TOKEN, + chat_id=TELEGRAM_CHAT_ID, + host_label=HOST_IP, + api_base=TELEGRAM_API_BASE, + long_poll=LONG_POLL_SECONDS, + tor_proxy=TOR_SOCKS_PROXY, + ): + self.data_service = data_service + self._token = (bot_token or "").strip() + # chat_id may be a negative group id (e.g. -1001234567890); keep it a string for exact + # equality against the id Telegram sends back. + self.chat_id = str(chat_id or "").strip() + self.host_label = host_label + self._api_base = api_base.rstrip("/") + self.long_poll = long_poll + # Route getUpdates + replies over the bridge Tor SOCKS proxy, so polling Telegram never + # exposes the host IP (same discipline as the notifier / Healthchecks pinger). + self._proxies = {"http": tor_proxy, "https": tor_proxy} if tor_proxy else None + if enabled is None: + enabled = bool(TELEGRAM_ENABLED and TELEGRAM_COMMANDS_ENABLED) + self.enabled = bool(enabled and self._token and self.chat_id) + self._offset = None + + def reply_for(self, text): + """Map an incoming message to a reply string, or ``None`` to stay silent. + + Reads the latest snapshot and runs the shared :func:`build_metrics` (a couple of quick local + SQLite reads); the caller runs this off-thread so a slow read can't stall the poll loop. + """ + cmd = parse_command(text) + if cmd is None: + return None + if cmd == "help": + return f"{_prefix(self.host_label)}{HELP_TEXT}" + if cmd == "unknown": + return f"{_prefix(self.host_label)}Unknown command.\n{HELP_TEXT}" + + data = self.data_service.latest_data or {} + # /system reads the raw snapshot only β€” no need to build the full metrics. 
+ if cmd == "system": + return format_system(data.get("system", {}), self.host_label) + + metrics = build_metrics(data, self.data_service.state_manager) + if cmd == "status": + mining = bool(data.get("miner_released") and not data.get("workers_rejected")) + warnings = status_warnings( + data, metrics, self.data_service.state_manager.is_db_healthy() + ) + # Merge-mine link = the Tari gRPC being READY (not merely the node being up) — the same + # rule the dashboard's ✔ uses (#313). Omitted until Tari has been polled at all. + tari = data.get("tari") + merge = (bool(tari.get("connected")) and bool(tari.get("active"))) if tari else None + return format_status( + metrics, mining, self.host_label, warnings=warnings, merge_mining=merge + ) + if cmd == "info": + return format_info( + resolve_version(), + data.get("update"), + metrics, + egress_posture_from_config()["summary"], + self.host_label, + ) + if cmd == "hashrate": + return format_hashrate_reply(metrics, data.get("workers", []), self.host_label) + if cmd == "workers": + return format_workers(data.get("workers", []), self.host_label) + if cmd == "sync": + return format_sync(metrics, self.host_label) + if cmd == "pool": + return format_pool(metrics, data, self.host_label) + if cmd == "xvb": + return format_xvb(metrics, self.host_label) + if cmd == "earnings": + return format_earnings(metrics, data.get("network", {}), self.host_label) + return None + + async def run(self): + """Long-poll for commands until cancelled. A no-op when disabled. + + The network calls use ``requests`` (so they ride the same Tor SOCKS proxy as the notifier) + and run off the event loop via :func:`asyncio.to_thread`, so a 25s long-poll never blocks it.
+ """ + if not self.enabled: + return + logger.info("Telegram command interface enabled β€” polling for commands (over Tor).") + await asyncio.to_thread(self._prime_offset) + while True: + try: + updates = await asyncio.to_thread(self._get_updates, self.long_poll) + except asyncio.CancelledError: + raise + except Exception as exc: + logger.debug("Telegram getUpdates failed (%s)", type(exc).__name__) + await asyncio.sleep(POLL_ERROR_BACKOFF_SECONDS) + continue + for update in updates: + self._offset = update.get("update_id", 0) + 1 + await self._handle_update(update) + + def _prime_offset(self): + """Advance the offset past any pending backlog without acting on it, so a command queued + while the dashboard was down isn't run on startup.""" + try: + updates = self._get_updates(0) + if updates: + self._offset = updates[-1].get("update_id", 0) + 1 + except Exception as exc: + logger.debug("Telegram offset prime skipped (%s)", type(exc).__name__) + + def _get_updates(self, poll_timeout): + """Blocking ``getUpdates`` over Tor. Called via ``to_thread`` from the loop.""" + params = {"timeout": poll_timeout, "allowed_updates": '["message"]'} + if self._offset is not None: + params["offset"] = self._offset + url = f"{self._api_base}/bot{self._token}/getUpdates" + # The read timeout must outlast Telegram's long-poll hold, or requests aborts the request + # the server is legitimately keeping open; (connect, read) tuple. + resp = requests.get( + url, params=params, timeout=(10, poll_timeout + 10), proxies=self._proxies + ) + resp.raise_for_status() + payload = resp.json() + if not payload.get("ok"): + return [] + return payload.get("result", []) + + async def _handle_update(self, update): + message = update.get("message") or {} + chat = message.get("chat") or {} + # Access control: only the configured chat may drive the bot. Anything else is dropped + # silently β€” no reply, so an unknown chat can't even confirm the bot exists. 
+ if str(chat.get("id")) != self.chat_id: + return + reply = await asyncio.to_thread(self._safe_reply_for, message.get("text", "")) + if reply: + await asyncio.to_thread(self._send, reply) + + def _safe_reply_for(self, text): + """Never let a formatting/read bug kill the poll loop β€” a broken command just goes quiet.""" + try: + return self.reply_for(text) + except Exception as exc: + logger.debug("Telegram command handling failed (%s)", type(exc).__name__) + return None + + def _send(self, text): + """Blocking reply over Tor. Called via ``to_thread``.""" + url = f"{self._api_base}/bot{self._token}/sendMessage" + payload = {"chat_id": self.chat_id, "text": text, "disable_web_page_preview": True} + try: + resp = requests.post(url, json=payload, timeout=10, proxies=self._proxies) + resp.raise_for_status() + except Exception as exc: + # Log only the exception type β€” a requests error can embed the token-bearing URL. + logger.debug("Telegram reply failed (%s)", type(exc).__name__) diff --git a/build/dashboard/mining_dashboard/service/telegram_notifier.py b/build/dashboard/mining_dashboard/service/telegram_notifier.py new file mode 100644 index 0000000..510a972 --- /dev/null +++ b/build/dashboard/mining_dashboard/service/telegram_notifier.py @@ -0,0 +1,96 @@ +import logging + +import requests + +from mining_dashboard.config.config import TOR_SOCKS_PROXY + +logger = logging.getLogger("TelegramNotifier") + +# Telegram Bot API base. Overridable in tests so we never touch the network. +TELEGRAM_API_BASE = "https://api.telegram.org" + + +class TelegramNotifier: + """ + Thin, fire-and-forget Telegram push notifier (Issue #121). + + Pushes short operational alerts to a single chat via the Bot API ``sendMessage`` + endpoint. Deliberately minimal β€” there is no interactive bot, no commands, no polling + (that's #45); this is the notifications-only half. 
+ + Discipline (mirrors the rest of the stack): + + - **Disabled by default.** ``enabled`` is only true when explicitly switched on *and* + both ``bot_token`` and ``chat_id`` are present. A missing/half-filled config leaves it + off, so :meth:`send` is a silent no-op rather than an error on every cycle. + - **Per-event toggles.** ``events`` gates which alert kinds are delivered, so an operator + can enable Telegram and still silence the ones they find noisy. + - **Always over Tor.** Sends ride the bridge Tor SOCKS proxy (``socks5h``, so the DNS lookup + goes through Tor too), so Telegram sees a Tor exit, not the host IP β€” never a clearnet beacon, + matching the Healthchecks.io pinger (#79) and the XvB fetch (#163). + - **Fail silent.** Any network error (offline host, Tor down, Telegram unreachable / blocking a + Tor exit) is swallowed and logged at debug β€” an alerter must never crash the data loop or spam + ERROR for the very condition it exists to report (#59 discipline). + - **Never logs the token.** ``bot_token`` is a secret; it only ever appears in the request + URL and is never written to a log line β€” not even inside an exception message (which for + ``requests`` would otherwise include the full URL). + """ + + def __init__( + self, + enabled=False, + bot_token="", + chat_id="", + events=None, + timeout=10, + api_base=TELEGRAM_API_BASE, + tor_proxy=TOR_SOCKS_PROXY, + ): + self.bot_token = (bot_token or "").strip() + # chat_id may be a negative integer (Telegram group ids look like -1001234567890); + # keep it as a string so render/transport never reformat it. + self.chat_id = str(chat_id or "").strip() + self.events = dict(events or {}) + self.timeout = timeout + self._api_base = api_base.rstrip("/") + # Route over the bridge Tor SOCKS proxy so the host IP is never exposed to Telegram. + # tor_proxy is a test seam; the default wires the configured proxy. 
+ self._proxies = {"http": tor_proxy, "https": tor_proxy} if tor_proxy else None + self.enabled = bool(enabled and self.bot_token and self.chat_id) + + if enabled and not self.enabled: + # Switched on but unusable β€” tell the operator once, without leaking the token. + logger.warning( + "Telegram alerts enabled but bot_token/chat_id are missing β€” alerts stay off." + ) + + def event_enabled(self, event): + """True only when the notifier is usable *and* this event kind is toggled on.""" + return self.enabled and bool(self.events.get(event, False)) + + def send(self, text): + """Push one message. Returns True on a successful 2xx send, False otherwise + (including when disabled). Never raises.""" + if not self.enabled: + return False + + url = f"{self._api_base}/bot{self.bot_token}/sendMessage" + try: + resp = requests.post( + url, + json={ + "chat_id": self.chat_id, + "text": text, + "disable_web_page_preview": True, + }, + timeout=self.timeout, + proxies=self._proxies, + ) + resp.raise_for_status() + return True + except requests.RequestException as exc: + # Log only the exception *type*: a requests error message can embed the full URL, + # which contains the bot token. Telegram being unreachable on a private/Tor-only + # host is expected, so this stays at debug to avoid log noise. + logger.debug("Telegram send failed (%s)", type(exc).__name__) + return False diff --git a/build/dashboard/mining_dashboard/service/worker_presence.py b/build/dashboard/mining_dashboard/service/worker_presence.py new file mode 100644 index 0000000..10ddd46 --- /dev/null +++ b/build/dashboard/mining_dashboard/service/worker_presence.py @@ -0,0 +1,119 @@ +import time + +from mining_dashboard.config.config import ( + WORKER_OFFLINE_AFTER_SEC, + WORKER_RECOVERY_AFTER_SEC, +) + + +class WorkerPresenceMonitor: + """ + Per-worker, flap-protected offline/online tracker (Issue #121). + + The alerter needs a stable "rig-3 went offline / rig-3 is back" signal. 
It's driven off the + **same worker rows the dashboard shows** β€” each rig's ``status`` (``"online"`` when xmrig-proxy + reports live connections, else the ``DOWN`` state the UI renders). This is the per-worker + analogue of ``NodeHealthMonitor`` (#31), multiplexed over many workers keyed by name, so a + Telegram "offline" alert lines up with the rig showing **DOWN** on screen: + + - **DOWN drives offline.** A rig is offline-pending while it's shown DOWN (listed by the proxy + but not connected). Once it's been DOWN continuously for ``offline_after`` it's declared + OFFLINE; once it's been back online for ``recovery_after`` the OFFLINE clears β€” so a single + dropped poll or a brief reconnect can't spam recoveredβ†’offline. + - **Silent baseline.** A rig's first sighting registers its current state with no edge β€” a + brand-new rig is not a "recovery", and a rig already DOWN at dashboard start is not a fresh + "offline" (a restart must not replay a stale transition). + - **Joined / left the table.** A rig the proxy reports for the first time is a ``joined`` edge; + one it stops listing entirely (aged off the worker table, #182) is a ``left`` edge β€” the + fleet-membership signal, distinct from a known rig going DOWN and back. Both are suppressed + until the monitor is *primed*: the very first cycle (and the first after a :meth:`reset`) + baselines whatever's already connected silently, so a dashboard restart doesn't replay every + rig as a fresh "joined", nor a readmission after a failover. + + :meth:`update` takes this cycle's worker rows (dicts with ``name`` + ``status``) and returns a + list of ``(name, event)`` edges, ``event`` in ``{"offline", "recovered", "joined", "left"}``. + + Clock defaults to wall-clock ``time.time``; injectable for deterministic tests. 
+ """ + + def __init__( + self, + offline_after=WORKER_OFFLINE_AFTER_SEC, + recovery_after=WORKER_RECOVERY_AFTER_SEC, + clock=time.time, + ): + self.offline_after = offline_after + self.recovery_after = recovery_after + self._clock = clock + # name -> {state, online_since, down_since} + # state : "online" | "offline" (the debounced, edge-emitting state) + # online_since : when the current continuous-online streak began (None while DOWN) + # down_since : when the current continuous-DOWN streak began (None while online) + self._workers = {} + # False until the first cycle has baselined the currently-connected rigs, so joined/left + # edges don't fire for the startup (or post-reset) roster. + self._primed = False + + def update(self, workers, now=None): + """Feed this cycle's worker rows; return the list of debounced transitions.""" + now = self._clock() if now is None else now + online = {w.get("name") for w in workers if w.get("status") == "online"} + present = {w.get("name") for w in workers} + primed = self._primed + edges = [] + + # Rigs the proxy no longer lists at all have fallen off the worker table β€” the dashboard no + # longer shows them, so forget them (a later return re-baselines) and, once primed, report + # them as having LEFT the fleet. + for name in list(self._workers): + if name not in present: + del self._workers[name] + if primed: + edges.append((name, "left")) + + for name in present: + if name in self._workers: + self._step(name, name in online, now, edges) + else: + # First sighting β€” baseline to the rig's current state; once primed, a genuinely + # new rig is a JOINED edge. 
+ is_online = name in online + self._workers[name] = { + "state": "online" if is_online else "offline", + "online_since": now if is_online else None, + "down_since": None if is_online else now, + } + if primed: + edges.append((name, "joined")) + + self._primed = True + return edges + + def _step(self, name, is_online, now, edges): + """Debounce a *known* rig's online/DOWN status into offline/recovered edges.""" + w = self._workers[name] + if is_online: + w["down_since"] = None + if w["online_since"] is None: + w["online_since"] = now + if w["state"] == "offline" and (now - w["online_since"]) >= self.recovery_after: + w["state"] = "online" + edges.append((name, "recovered")) + else: + w["online_since"] = None + if w["down_since"] is None: + w["down_since"] = now + if w["state"] == "online" and (now - w["down_since"]) >= self.offline_after: + w["state"] = "offline" + edges.append((name, "offline")) + + def reset(self): + """Drop all per-worker state. + + Called when the proxy is *intentionally* stopped β€” during the initial sync hold (#35) + or node-down worker failover (#31) β€” so the expected absence of workers doesn't age + into false "offline" alerts, and re-admission re-baselines every worker silently (no + spurious "recovered", and no "joined"/"left" for the readmitted roster). 
+ """ + self._workers.clear() + self._primed = False diff --git a/build/dashboard/mining_dashboard/web/static/chart.mjs b/build/dashboard/mining_dashboard/web/static/chart.mjs index e030243..cf638b4 100644 --- a/build/dashboard/mining_dashboard/web/static/chart.mjs +++ b/build/dashboard/mining_dashboard/web/static/chart.mjs @@ -76,12 +76,20 @@ function paletteColors() { accent, purple, shares: v("--bad", "#da3633"), + evtLoss: v("--warn", "#d29922"), // degradation event marker (#99) + evtOk: v("--ok", "#3fb950"), // recovery event marker grid: v("--border", "#30363d"), ticks: v("--text-muted", "#8b949e"), band: withAlpha(accent, "26"), // drag-to-zoom selection band (≈ 0.15 alpha) }; } +// Per-point colour for the degradation event markers (#99): green for a recovery, warn/red for a +// loss. Returns one colour per point so a single dataset can show both. +export function eventColors(events, c) { + return (events || []).map((e) => (e.kind === "hashrate_recovered" ? c.evtOk : c.evtLoss)); +} + // Area-fill gradient stops (Issue #145): strong near the line, fading toward the axis, so a flat // series reads as a solid mass instead of a thin strip. Line a touch thicker than the default so // the top edge pops against the fill. @@ -214,6 +222,20 @@ export class ChartCard extends Component { pointHitRadius: 100, showLine: false, }, + // Degradation/recovery markers (#99) on their own hidden axis, just below the share rug. + // A diamond per event, red for a loss and green for a recovery; tooltip carries the label.
+ { + label: "Events", + data: d.events || [], + yAxisID: "events", + pointStyle: "rectRot", + pointRadius: 7, + pointHoverRadius: 10, + pointHitRadius: 100, + showLine: false, + pointBackgroundColor: eventColors(d.events, c), + pointBorderColor: eventColors(d.events, c), + }, ], }, options: { @@ -232,6 +254,7 @@ export class ChartCard extends Component { label(context) { if (context.dataset.label === "Shares") return self.shareCounts[context.dataIndex] + " Shares"; + if (context.dataset.label === "Events") return context.raw.label; let label = context.dataset.label || ""; if (label) label += ": "; if (context.parsed.y !== null) label += context.parsed.y + " H/s"; @@ -276,6 +299,8 @@ export class ChartCard extends Component { // Hidden 0–1 axis the Shares scatter rides on; markers pin near the top (0.93, // set server-side) so they never affect the hashrate y-range (Issue #145). shares: { type: "linear", display: false, min: 0, max: 1 }, + // Hidden 0–1 axis the degradation event markers ride on (#99), pinned near the top.
+ events: { type: "linear", display: false, min: 0, max: 1 }, }, }, }); @@ -310,6 +335,9 @@ export class ChartCard extends Component { ds[2].borderColor = c.shares; ds[2].backgroundColor = c.shares; ds[2].pointRadius = d.shares.map((s) => s.r); + ds[3].data = d.events || []; + ds[3].pointBackgroundColor = eventColors(d.events, c); + ds[3].pointBorderColor = eventColors(d.events, c); this.chart.options.scales.y.grid.color = c.grid; this.chart.options.scales.y.ticks.color = c.ticks; this.applyVisibility(); diff --git a/build/dashboard/mining_dashboard/web/views.py b/build/dashboard/mining_dashboard/web/views.py index d86ce80..08ff6ce 100644 --- a/build/dashboard/mining_dashboard/web/views.py +++ b/build/dashboard/mining_dashboard/web/views.py @@ -23,6 +23,7 @@ HASHRATE_WINDOW_COLUMNS, HASHRATE_WINDOWS, HOST_IP, + LOW_RAM_GB, UPDATE_INTERVAL, ) from mining_dashboard.helper.utils import ( @@ -151,7 +152,9 @@ def build_raffle_eligibility(metrics): # -------------------------------------------------------------------------------------- -def build_chart(history, shares, range_arg, window=None, avg_window=DEFAULT_HASHRATE_WINDOW): +def build_chart( + history, shares, range_arg, window=None, avg_window=DEFAULT_HASHRATE_WINDOW, events=None +): """Build the Chart.js datasets from history. Each point carries its real timestamp as the x value (epoch ms) so a linear time axis spaces points to scale; runs of missing samples (outages) are split by a ``null`` break so the line doesn't connect across the gap. 
@@ -188,6 +191,7 @@ def build_chart(history, shares, range_arg, window=None, avg_window=DEFAULT_HASH "p2pool": p2pool, "xvb": xvb, "shares": _share_points(filtered_history, filtered_shares), + "events": _event_points(_filter_events(events or [], range_arg, window)), "tension": _chart_tension(duration_s), } @@ -214,6 +218,21 @@ def _filter_range(history, shares, range_arg, window=None): ) +def _filter_events(events, range_arg, window=None): + """Restrict degradation events (#99) to the selected window β€” same bounds as ``_filter_range``, + but for the ``ts``-keyed events list.""" + if window is not None: + lo, hi = window + return [e for e in events if lo <= e["ts"] <= hi] + if range_arg == "all": + return events + secs = _RANGE_SECONDS.get(range_arg, 0) + if secs <= 0: + return events + cutoff = time.time() - secs + return [e for e in events if e["ts"] >= cutoff] + + def _window_duration(filtered_history, range_arg, window): """Seconds the chart currently spans β€” drives adaptive resolution/smoothing. From the window if zoomed, else the preset length, else (``all``/unknown) the actual data extent.""" @@ -339,6 +358,26 @@ def _share_points(filtered_history, filtered_shares): return points +# Event markers ride just below the share rug on their own hidden 0–1 axis (#99), so a "something +# went wrong" marker sits at the event's real time without touching the hashrate y-range. +_EVENT_MARKER_Y = 0.82 + + +def _event_points(filtered_events): + """Sparse degradation/recovery markers (#99): one point per event at its timestamp, carrying the + tooltip ``label`` and ``kind`` (e.g. 
``hashrate_loss`` vs ``hashrate_recovered``) so the client + can colour a loss vs a recovery.""" + return [ + { + "x": int(e["ts"] * 1000), + "y": _EVENT_MARKER_Y, + "kind": e.get("type", ""), + "label": e.get("detail") or e.get("type", "event"), + } + for e in filtered_events + ] + + # -------------------------------------------------------------------------------------- # Section builders: Metrics (+ passthrough) -> display data. # -------------------------------------------------------------------------------------- @@ -796,6 +835,36 @@ def build_badges(data, metrics, mode_variant, db_healthy=True): } ) + # Persistent host/performance conditions (#104), derived from live metrics so they self-correct + # (HugePages appear after a reboot, etc.). These mirror the thresholds setup/doctor pre-flight on. + system = data.get("system", {}) or {} + hp_status = (system.get("hugepages") or ["Unknown"])[0] + if hp_status == "Disabled": + badges.append( + { + "text": "⚠ HugePages off", + "variant": "warn", + "title": "HugePages aren't reserved β€” RandomX hashrate is capped until they are. Run setup's tuning (or edit GRUB) and reboot to apply.", + } + ) + ram_total = (system.get("memory") or {}).get("total_gb", 0) or 0 + if 0 < ram_total < LOW_RAM_GB: + badges.append( + { + "text": f"⚠ Low RAM ({ram_total:.0f} GB)", + "variant": "warn", + "title": f"Under {LOW_RAM_GB} GB of RAM β€” syncing (Tari especially) is memory-heavy and can OOM. Add RAM for a stable node.", + } + ) + if system.get("avx2") is False: + badges.append( + { + "text": "⚠ No AVX2", + "variant": "warn", + "title": "This CPU lacks AVX2 β€” RandomX mining will be significantly slower. 
A hardware limit; nothing to change at runtime.", + } + ) + return badges @@ -933,7 +1002,14 @@ def build_state(data, state_mgr, range_arg, window=None, avg_window=DEFAULT_HASH "proxy_summary": build_proxy_summary(data), "egress": egress, "topology": topology, - "chart": build_chart(history, data.get("shares", []), range_arg, window, avg_window), + "chart": build_chart( + history, + data.get("shares", []), + range_arg, + window, + avg_window, + events=state_mgr.get_events(), + ), } diff --git a/build/dashboard/tests/collector/test_system.py b/build/dashboard/tests/collector/test_system.py index 9e54b43..a4b526a 100644 --- a/build/dashboard/tests/collector/test_system.py +++ b/build/dashboard/tests/collector/test_system.py @@ -79,3 +79,30 @@ def test_allocated_when_unused(self): def test_unknown_when_missing(self): with patch("builtins.open", _fake_open("SomethingElse: 1\n")): assert system.get_hugepages_status() == ("Unknown", "status-warn", "0/0") + + +class TestCpuAvx2: + def setup_method(self): + system._avx2_supported = None # clear the process-lifetime cache between cases + + def test_flag_present(self): + cpuinfo = "processor\t: 0\nflags\t\t: fpu vme avx avx2 sse4_2\n" + with patch("builtins.open", _fake_open(cpuinfo)): + assert system.get_cpu_avx2() is True + + def test_flag_absent(self): + cpuinfo = "processor\t: 0\nflags\t\t: fpu vme avx sse4_2\n" # avx but not avx2 + with patch("builtins.open", _fake_open(cpuinfo)): + assert system.get_cpu_avx2() is False + + def test_unreadable_is_unknown(self): + with patch("builtins.open", side_effect=OSError): + assert system.get_cpu_avx2() is None + + def test_result_is_cached(self): + cpuinfo = "flags\t\t: avx2\n" + with patch("builtins.open", _fake_open(cpuinfo)): + assert system.get_cpu_avx2() is True + # Second call must not re-read (open would now raise) — the cached value stands.
+ with patch("builtins.open", side_effect=AssertionError("should not re-read")): + assert system.get_cpu_avx2() is True diff --git a/build/dashboard/tests/frontend/chart.test.mjs b/build/dashboard/tests/frontend/chart.test.mjs index fccff6c..806b609 100644 --- a/build/dashboard/tests/frontend/chart.test.mjs +++ b/build/dashboard/tests/frontend/chart.test.mjs @@ -9,7 +9,7 @@ import { test } from 'node:test'; import assert from 'node:assert/strict'; -import { withAlpha, padYAxis } from '../../mining_dashboard/web/static/chart.mjs'; +import { withAlpha, padYAxis, eventColors } from '../../mining_dashboard/web/static/chart.mjs'; test('withAlpha: appends an 8-bit alpha to a #rrggbb hex', () => { assert.equal(withAlpha('#58a6ff', '26'), '#58a6ff26'); @@ -52,3 +52,17 @@ test('padYAxis: no-op when the range is non-finite (all series hidden / no data) padYAxis(s); assert.ok(Number.isNaN(s.min) && Number.isNaN(s.max)); }); + +test('eventColors: maps recovery to ok, everything else to loss (#99)', () => { + const c = { evtOk: '#3fb950', evtLoss: '#d29922' }; + const events = [ + { kind: 'hashrate_loss' }, + { kind: 'hashrate_recovered' }, + { kind: '' }, + ]; + assert.deepEqual(eventColors(events, c), [c.evtLoss, c.evtOk, c.evtLoss]); +}); + +test('eventColors: tolerates a missing events list', () => { + assert.deepEqual(eventColors(undefined, { evtOk: 'g', evtLoss: 'r' }), []); +}); diff --git a/build/dashboard/tests/service/test_alert_service.py b/build/dashboard/tests/service/test_alert_service.py new file mode 100644 index 0000000..7a809d8 --- /dev/null +++ b/build/dashboard/tests/service/test_alert_service.py @@ -0,0 +1,569 @@ +from types import SimpleNamespace + +import mining_dashboard.service.alert_service as alert_mod +from mining_dashboard.config.config import TELEGRAM_EVENTS +from mining_dashboard.service.alert_service import AlertService +from mining_dashboard.service.worker_presence import WorkerPresenceMonitor + + +def 
test_every_alert_event_has_a_config_toggle(): + # The canonical event set (AlertService.EVT_*) must line up 1:1 with the per-event toggles in + # config.py TELEGRAM_EVENTS — so adding an alert but forgetting its toggle (or vice versa) fails + # here instead of silently shipping an un-toggleable / dead event. The config-surface side + # (config.reference.json, docker-compose.yml, pithead render) is guarded in tests/stack/run.sh. + evt_values = {v for k, v in vars(AlertService).items() if k.startswith("EVT_")} + assert evt_values == set(TELEGRAM_EVENTS) + + +class _FakeNotifier: + """Stand-in transport: records sends, lets tests gate which events are 'enabled'.""" + + def __init__(self, enabled=True, allow=None): + self.enabled = enabled + self._allow = allow # None => every event allowed + self.sent = [] + + def event_enabled(self, event): + if not self.enabled: + return False + return True if self._allow is None else event in self._allow + + def send(self, text): + self.sent.append(text) + return True + + +def _svc(notifier=None, announce_online=True, **kw): + notifier = notifier if notifier is not None else _FakeNotifier() + kw.setdefault("worker_monitor", WorkerPresenceMonitor(offline_after=300, recovery_after=120)) + kw.setdefault("host_label", "") + svc = AlertService(notifier=notifier, **kw) + # The one-shot "stack online" ping fires on the first evaluate; mark it already sent so it + # doesn't perturb the per-signal tests. TestStackOnline opts out to exercise it.
+ if announce_online: + svc._announced_online = True + return svc + + +def _ev( + svc, + *, + monero_down=False, + tari_down=False, + tari_required=True, + miner_released=True, + workers=(), + workers_expected=False, + disk_percent=0, + db_healthy=True, + xvb_enabled=False, + shares_in_window=0, + clearnet_active=False, + xvb_registration_state="", + update_available=False, + low_hr_warning=False, + hugepages_reserved=True, + low_ram=False, + now=0, +): + return svc.evaluate( + monero_down=monero_down, + tari_down=tari_down, + tari_required=tari_required, + miner_released=miner_released, + workers=list(workers), + workers_expected=workers_expected, + disk_percent=disk_percent, + db_healthy=db_healthy, + xvb_enabled=xvb_enabled, + shares_in_window=shares_in_window, + clearnet_active=clearnet_active, + xvb_registration_state=xvb_registration_state, + update_available=update_available, + low_hr_warning=low_hr_warning, + hugepages_reserved=hugepages_reserved, + low_ram=low_ram, + now=now, + ) + + +def _keys(alerts): + return [k for k, _ in alerts] + + +def _on(*names): + """Worker rows the proxy reports online.""" + return [{"name": n, "status": "online"} for n in names] + + +def _down(*names): + """Worker rows still listed but disconnected — the DOWN state the dashboard shows.""" + return [{"name": n, "status": "offline"} for n in names] + + +class TestNodeEdges: + def test_first_cycle_seeds_baseline_silently(self): + svc = _svc() + # Already-down at startup must not replay as a fresh alert (restart semantics).
+ assert _ev(svc, monero_down=True) == [] + + def test_down_then_recovered(self): + svc = _svc() + _ev(svc, monero_down=False) # seed + assert _keys(_ev(svc, monero_down=True)) == [AlertService.EVT_NODE_DOWN] + assert _ev(svc, monero_down=True) == [] # no repeat while still down + assert _keys(_ev(svc, monero_down=False)) == [AlertService.EVT_NODE_RECOVERED] + + def test_node_text_names_the_chain(self): + svc = _svc() + _ev(svc, monero_down=False) + _, text = _ev(svc, monero_down=True)[0] + assert "Monero" in text + + +class TestTariGating: + def test_non_blocking_tari_does_not_alert(self): + svc = _svc() + _ev(svc, tari_down=False, tari_required=False) + assert _ev(svc, tari_down=True, tari_required=False) == [] + + def test_no_stale_edge_when_tari_becomes_required(self): + # Tari went down while non-blocking (no alert). Re-marking it required must not then + # replay a down edge for a state we never alerted on. + svc = _svc() + _ev(svc, tari_down=False, tari_required=False) + _ev(svc, tari_down=True, tari_required=False) # silently tracked + assert _ev(svc, tari_down=True, tari_required=True) == [] + # ...but a genuine recovery from there still fires. + assert _keys(_ev(svc, tari_down=False, tari_required=True)) == [ + AlertService.EVT_NODE_RECOVERED + ] + + def test_required_tari_alerts(self): + svc = _svc() + _ev(svc, tari_down=False, tari_required=True) + _, text = _ev(svc, tari_down=True, tari_required=True)[0] + assert "Tari" in text + + +class TestSyncFinished: + def test_fires_once_when_gate_opens(self): + svc = _svc() + _ev(svc, miner_released=False) # seed: still syncing + assert _keys(_ev(svc, miner_released=True)) == [AlertService.EVT_SYNC_FINISHED] + assert _ev(svc, miner_released=True) == [] # one-shot + + def test_no_alert_on_restart_after_sync(self): + svc = _svc() + # First observation is already-released (restart after sync) -> baseline, no alert. 
+ assert _ev(svc, miner_released=True) == [] + + +class TestWorkerEdges: + def test_offline_then_recovered(self): + # Offline is driven by the DOWN status the dashboard shows, not by the rig vanishing. + svc = _svc() + assert _ev(svc, workers=_on("rig-1"), workers_expected=True, now=0) == [] + assert _ev(svc, workers=_down("rig-1"), workers_expected=True, now=0) == [] + assert _keys(_ev(svc, workers=_down("rig-1"), workers_expected=True, now=300)) == [ + AlertService.EVT_WORKER_OFFLINE + ] + _ev(svc, workers=_on("rig-1"), workers_expected=True, now=300) + assert _keys(_ev(svc, workers=_on("rig-1"), workers_expected=True, now=420)) == [ + AlertService.EVT_WORKER_RECOVERED + ] + + def test_not_expected_resets_and_silences(self): + svc = _svc() + _ev(svc, workers=_on("rig-1"), workers_expected=True, now=0) + _ev(svc, workers=_down("rig-1"), workers_expected=True, now=0) + _ev(svc, workers=_down("rig-1"), workers_expected=True, now=300) # rig-1 now offline + # Proxy intentionally stopped (sync hold / failover): reset, no alert. + assert _ev(svc, workers=[], workers_expected=False, now=330) == [] + # Re-admission re-baselines silently — no spurious "recovered".
+ assert _ev(svc, workers=_on("rig-1"), workers_expected=True, now=360) == [] + + +class TestWorkerMembership: + def test_joined_after_baseline(self): + svc = _svc() + _ev(svc, workers=_on("rig-1"), workers_expected=True) # prime + assert _keys(_ev(svc, workers=_on("rig-1", "rig-2"), workers_expected=True)) == [ + AlertService.EVT_WORKER_JOINED + ] + + def test_left_when_rig_drops_off_the_table(self): + svc = _svc() + _ev(svc, workers=_on("rig-1", "rig-2"), workers_expected=True) # prime + assert _keys(_ev(svc, workers=_on("rig-1"), workers_expected=True)) == [ + AlertService.EVT_WORKER_LEFT + ] + + +class TestDiskEdges: + def test_warn_then_critical_then_recover(self): + svc = _svc() + assert _ev(svc, disk_percent=40) == [] # seed silently + assert _keys(_ev(svc, disk_percent=88)) == [AlertService.EVT_DISK_SPACE] # -> warn + assert _ev(svc, disk_percent=90) == [] # still warn, no repeat + assert _keys(_ev(svc, disk_percent=97)) == [AlertService.EVT_DISK_SPACE] # -> critical + _, text = _ev(svc, disk_percent=40)[0] # -> recovered + assert "healthy" in text + + def test_seed_high_does_not_replay(self): + svc = _svc() + # Already-full at startup must not fire (restart semantics). 
+ assert _ev(svc, disk_percent=99) == [] + + +class TestDbEdges: + def test_unhealthy_then_recovered(self): + svc = _svc() + assert _ev(svc, db_healthy=True) == [] # seed + assert _keys(_ev(svc, db_healthy=False)) == [AlertService.EVT_DB_UNHEALTHY] + assert _ev(svc, db_healthy=False) == [] # no repeat + _, text = _ev(svc, db_healthy=True)[0] + assert "recovered" in text + + +class TestXvbShareEdges: + def test_no_share_then_restored(self): + svc = _svc() + assert _ev(svc, xvb_enabled=True, shares_in_window=3) == [] # seed: has a share + assert _keys(_ev(svc, xvb_enabled=True, shares_in_window=0)) == [ + AlertService.EVT_XVB_NO_SHARE + ] + assert _ev(svc, xvb_enabled=True, shares_in_window=0) == [] # no repeat + _, text = _ev(svc, xvb_enabled=True, shares_in_window=1)[0] # restored + assert "restored" in text + + def test_silent_while_xvb_disabled(self): + svc = _svc() + # XvB off → the share gate doesn't apply, even with zero shares. + assert _ev(svc, xvb_enabled=False, shares_in_window=0) == [] + # Turning XvB on re-seeds silently (no stale replay), then alerts on a real loss.
+ assert _ev(svc, xvb_enabled=True, shares_in_window=2) == [] + assert _keys(_ev(svc, xvb_enabled=True, shares_in_window=0)) == [ + AlertService.EVT_XVB_NO_SHARE + ] + + +class TestClearnetEdges: + def test_exposed_then_reverted(self): + svc = _svc() + assert _ev(svc, clearnet_active=False) == [] # seed + assert _keys(_ev(svc, clearnet_active=True)) == [AlertService.EVT_CLEARNET_EXPOSED] + assert _ev(svc, clearnet_active=True) == [] # no repeat + _, text = _ev(svc, clearnet_active=False)[0] + assert "Tor-only" in text + + +class TestStackOnline: + def test_online_fires_once_on_first_cycle(self): + svc = _svc(announce_online=False) + assert _keys(_ev(svc)) == [AlertService.EVT_STACK_ONLINE] + assert _ev(svc) == [] # one-shot — not on later cycles + + def test_online_text_is_friendly(self): + svc = _svc(announce_online=False) + _, text = _ev(svc)[0] + assert "online" in text.lower() + + +class TestXvbRegistration: + def test_invalid_then_recovered(self): + svc = _svc() + assert _ev(svc, xvb_enabled=True, xvb_registration_state="registered") == [] # seed + assert _keys(_ev(svc, xvb_enabled=True, xvb_registration_state="invalid")) == [ + AlertService.EVT_XVB_REGISTRATION + ] + assert _keys(_ev(svc, xvb_enabled=True, xvb_registration_state="registered")) == [ + AlertService.EVT_XVB_REGISTRATION + ] + + def test_failing_alerts(self): + svc = _svc() + _ev(svc, xvb_enabled=True, xvb_registration_state="registered") + _, text = _ev(svc, xvb_enabled=True, xvb_registration_state="failing")[0] + assert "failing" in text.lower() + + def test_silent_while_disabled(self): + svc = _svc() + assert _ev(svc, xvb_enabled=False, xvb_registration_state="invalid") == [] + + def test_benign_transition_is_silent(self): + # A change that isn't into invalid/failing (nor recovering from one) doesn't alert.
+ svc = _svc() + _ev(svc, xvb_enabled=True, xvb_registration_state="registered") # seed + assert _ev(svc, xvb_enabled=True, xvb_registration_state="") == [] + + +class TestNewRelease: + def test_fires_once_on_rising_edge(self): + svc = _svc() + assert _ev(svc, update_available=False) == [] # seed + assert _keys(_ev(svc, update_available=True)) == [AlertService.EVT_NEW_RELEASE] + assert _ev(svc, update_available=True) == [] # no repeat while still available + + +class TestHashrateLow: + def test_warns_then_recovers(self): + svc = _svc() + assert _ev(svc, low_hr_warning=False) == [] # seed + assert _keys(_ev(svc, low_hr_warning=True)) == [AlertService.EVT_HASHRATE_LOW] + assert _ev(svc, low_hr_warning=True) == [] # no repeat + _, text = _ev(svc, low_hr_warning=False)[0] + assert "back above" in text + + +class TestIncidentLog: + def test_tallies_problems_and_drains(self): + svc = _svc() + _ev(svc, monero_down=False) # seed + _ev(svc, monero_down=True) # +node_down + _ev(svc, db_healthy=True) # seed + _ev(svc, db_healthy=False) # +db_unhealthy + _ev(svc, disk_percent=50) # seed + _ev(svc, disk_percent=97) # +disk_space (critical) + assert svc.drain_incidents() == { + "node_down": 1, + "db_unhealthy": 1, + "disk_space": 1, + } + assert svc.drain_incidents() == {} # drained → reset + + def test_recoveries_are_not_incidents(self): + svc = _svc() + _ev(svc, monero_down=False) + _ev(svc, monero_down=True) # +1 + _ev(svc, monero_down=False) # recovery — not counted + assert svc.drain_incidents() == {"node_down": 1} + + def test_worker_offline_counts_once(self): + svc = _svc() + _ev(svc, workers=_on("r"), workers_expected=True, now=0) # prime + _ev(svc, workers=_down("r"), workers_expected=True, now=0) # DOWN streak + _ev(svc, workers=_down("r"), workers_expected=True, now=300) # offline → incident + _ev(svc, workers=_on("r"), workers_expected=True, now=300) # back online + _ev(svc, workers=_on("r"), workers_expected=True, now=420) # recovered — not counted + assert
svc.drain_incidents() == {"worker_offline": 1} + + +class TestEventFiltering: + def test_disabled_events_are_dropped(self): + svc = _svc(notifier=_FakeNotifier(allow={AlertService.EVT_NODE_DOWN})) + _ev(svc, workers=_on("rig-1"), workers_expected=True, now=0) + _ev(svc, workers=_down("rig-1"), workers_expected=True, now=0) + # worker_offline is computed but filtered out because it's not in the allow-set. + assert _ev(svc, workers=_down("rig-1"), workers_expected=True, now=300) == [] + + +class TestHostLabel: + def test_prefixes_when_set(self): + svc = _svc(host_label="box.lan") + _ev(svc, monero_down=False) + _, text = _ev(svc, monero_down=True)[0] + assert text.startswith("[box.lan] ") + + def test_placeholder_host_is_not_prefixed(self): + svc = _svc(host_label="Unknown Host") + _ev(svc, monero_down=False) + _, text = _ev(svc, monero_down=True)[0] + assert not text.startswith("[") + + +class TestProcess: + async def test_disabled_notifier_is_noop(self): + notifier = _FakeNotifier(enabled=False) + svc = _svc(notifier=notifier) + out = await svc.process( + monero_down=True, + tari_down=False, + tari_required=True, + miner_released=True, + workers=[], + workers_expected=False, + ) + assert out == [] + assert notifier.sent == [] + + async def test_enabled_notifier_dispatches(self): + notifier = _FakeNotifier() + svc = _svc(notifier=notifier) + # seed + await svc.process( + monero_down=False, + tari_down=False, + tari_required=True, + miner_released=True, + workers=[], + workers_expected=False, + ) + out = await svc.process( + monero_down=True, + tari_down=False, + tari_required=True, + miner_released=True, + workers=[], + workers_expected=False, + ) + assert _keys(out) == [AlertService.EVT_NODE_DOWN] + assert len(notifier.sent) == 1 and "DOWN" in notifier.sent[0] + + async def test_process_swallows_evaluate_error(self, monkeypatch): + # A bug in evaluate() must never break the data loop — process() catches, logs, returns [].
+ svc = _svc(notifier=_FakeNotifier()) + + def boom(**_kw): + raise RuntimeError("kaboom") + + monkeypatch.setattr(svc, "evaluate", boom) + out = await svc.process( + monero_down=True, + tari_down=False, + tari_required=True, + miner_released=True, + workers=[], + workers_expected=False, + ) + assert out == [] + + +def _fake_localtime(hour, minute, yday=100, year=2026): + """A time.localtime stand-in with just the fields maybe_daily_summary reads.""" + return lambda _now: SimpleNamespace(tm_year=year, tm_yday=yday, tm_hour=hour, tm_min=minute) + + +def _daily_svc(daily_time="08:00", notifier=None): + notifier = notifier if notifier is not None else _FakeNotifier() + return AlertService( + notifier=notifier, + worker_monitor=WorkerPresenceMonitor(), + host_label="", + daily_time=daily_time, + ) + + +class TestDailySummary: + async def test_fires_at_target_once_per_day(self, monkeypatch): + n = _FakeNotifier() + svc = _daily_svc(notifier=n) + prov = lambda: "digest" # noqa: E731 + # Before the target time → nothing. + monkeypatch.setattr(alert_mod.time, "localtime", _fake_localtime(7, 59)) + assert await svc.maybe_daily_summary(0, prov) is None + # At the target → fires once. + monkeypatch.setattr(alert_mod.time, "localtime", _fake_localtime(8, 0)) + assert await svc.maybe_daily_summary(0, prov) == "digest" + assert n.sent == ["digest"] + # Later the same day → no repeat. + monkeypatch.setattr(alert_mod.time, "localtime", _fake_localtime(9, 30)) + assert await svc.maybe_daily_summary(0, prov) is None + # Next day at the target → fires again. + monkeypatch.setattr(alert_mod.time, "localtime", _fake_localtime(8, 0, yday=101)) + assert await svc.maybe_daily_summary(0, prov) == "digest" + assert n.sent == ["digest", "digest"] + + async def test_late_start_waits_for_next_day(self, monkeypatch): + svc = _daily_svc() + prov = lambda: "digest" # noqa: E731 + # First observation is already past 08:00 → don't replay today.
+ monkeypatch.setattr(alert_mod.time, "localtime", _fake_localtime(10, 0)) + assert await svc.maybe_daily_summary(0, prov) is None + monkeypatch.setattr(alert_mod.time, "localtime", _fake_localtime(23, 0)) + assert await svc.maybe_daily_summary(0, prov) is None + # Next day at 08:00 → fires. + monkeypatch.setattr(alert_mod.time, "localtime", _fake_localtime(8, 0, yday=101)) + assert await svc.maybe_daily_summary(0, prov) == "digest" + + async def test_malformed_time_disables(self, monkeypatch): + svc = _daily_svc(daily_time="not-a-time") + monkeypatch.setattr(alert_mod.time, "localtime", _fake_localtime(8, 0)) + assert await svc.maybe_daily_summary(0, lambda: "x") is None + + async def test_gated_off_by_event_toggle(self, monkeypatch): + # daily_summary not in the allow-set → the notifier reports it disabled. + svc = _daily_svc(notifier=_FakeNotifier(allow=set())) + monkeypatch.setattr(alert_mod.time, "localtime", _fake_localtime(8, 0)) + assert await svc.maybe_daily_summary(0, lambda: "x") is None + + async def test_provider_error_is_swallowed_and_marks_day_done(self, monkeypatch): + n = _FakeNotifier() + svc = _daily_svc(notifier=n) + + def boom(): + raise RuntimeError("bad build") + + monkeypatch.setattr(alert_mod.time, "localtime", _fake_localtime(8, 0)) + assert await svc.maybe_daily_summary(0, boom) is None + assert n.sent == [] + # Marked done for today even though the build failed → no retry storm. + monkeypatch.setattr(alert_mod.time, "localtime", _fake_localtime(8, 5)) + assert await svc.maybe_daily_summary(0, lambda: "digest") is None + + +class TestHostAdvisories: + """Persistent host-perf advisories (#104): unlike the transient edges, these fire on the FIRST + observation of the problem (a stable bad box would never 'transition'), stay quiet while it + persists, and — for HugePages — clear when fixed.
They are not tallied as daily incidents.""" + + def test_hugepages_not_reserved_fires_once_then_recovers(self): + svc = _svc() + # First cycle already bad → fires (not seed-silent). + assert _keys(_ev(svc, hugepages_reserved=False)) == [AlertService.EVT_HUGEPAGES] + # Persists → silent. + assert _keys(_ev(svc, hugepages_reserved=False)) == [] + # Reboot applied HugePages → one recovery edge. + assert _keys(_ev(svc, hugepages_reserved=True)) == [AlertService.EVT_HUGEPAGES] + assert _keys(_ev(svc, hugepages_reserved=True)) == [] + + def test_healthy_hugepages_never_fires(self): + svc = _svc() + assert _keys(_ev(svc, hugepages_reserved=True)) == [] + assert _keys(_ev(svc, hugepages_reserved=True)) == [] + + def test_low_ram_fires_once_no_recovery(self): + svc = _svc() + assert _keys(_ev(svc, low_ram=True)) == [AlertService.EVT_LOW_RAM] + assert _keys(_ev(svc, low_ram=True)) == [] # persists, silent + # RAM "recovering" (unlikely at runtime) is silent — no false good-news ping. + assert _keys(_ev(svc, low_ram=False)) == [] + + def test_advisories_not_counted_as_incidents(self): + # Static host facts shouldn't inflate the daily incident roll-up (#342). + svc = _svc() + _ev(svc, hugepages_reserved=False, low_ram=True) + assert svc.drain_incidents() == {} + + def test_gated_off_by_toggle(self): + svc = _svc(notifier=_FakeNotifier(allow={AlertService.EVT_NODE_DOWN})) + assert _keys(_ev(svc, hugepages_reserved=False, low_ram=True)) == [] + + +class TestDegradationAlert: + """The #99 hashrate-loss / recovery push.
The DegradationMonitor owns the debounce; this only + formats + sends, tallies the loss as an incident, and honours the event toggle.""" + + async def test_loss_sends_and_records_incident(self): + n = _FakeNotifier() + svc = _svc(notifier=n) + text = await svc.degradation_alert("loss", 0.62) + assert "62%" in text and "dropped" in text.lower() + assert n.sent == [text] + assert svc.drain_incidents() == {AlertService.EVT_HASHRATE_LOSS: 1} + + async def test_recovery_sends_no_incident(self): + n = _FakeNotifier() + svc = _svc(notifier=n) + text = await svc.degradation_alert("recovered", 0.0) + assert "recovered" in text.lower() + assert n.sent == [text] + assert svc.drain_incidents() == {} # recovery is not an incident + + async def test_gated_off_still_records_loss(self): + # Toggle off suppresses the message but the incident is still tallied for the daily log. + n = _FakeNotifier(allow=set()) + svc = _svc(notifier=n) + assert await svc.degradation_alert("loss", 0.5) is None + assert n.sent == [] + assert svc.drain_incidents() == {AlertService.EVT_HASHRATE_LOSS: 1} diff --git a/build/dashboard/tests/service/test_data_service.py b/build/dashboard/tests/service/test_data_service.py index 34b7eec..07ef080 100644 --- a/build/dashboard/tests/service/test_data_service.py +++ b/build/dashboard/tests/service/test_data_service.py @@ -688,6 +688,7 @@ async def test_single_iteration_aggregates(self): patch.object(ds_mod, "get_memory_usage", return_value={}), patch.object(ds_mod, "get_load_average", return_value="0"), patch.object(ds_mod, "get_cpu_usage", return_value="0%"), + patch.object(ds_mod, "get_cpu_avx2", return_value=True), patch("asyncio.sleep", AsyncMock(side_effect=StopAsyncIteration)), ): with pytest.raises(StopAsyncIteration): @@ -708,6 +709,56 @@ async def test_single_iteration_aggregates(self): sm.update_history.assert_called() sm.save_snapshot.assert_called() + async def test_degradation_edge_records_event_and_alerts(self): + # #99 wiring: when the detector 
reports an edge, the loop persists a chart marker and pushes + # the hashrate_loss alert. Stub the detector so a single iteration produces a deterministic + # edge (the debounce itself is unit-tested in test_degradation.py). + svc, sm, proxy = _make_service() + proxy.get_workers.return_value = {"workers": []} + proxy.get_summary.return_value = {"results": {}} + svc.degradation = MagicMock() + svc.degradation.update.return_value = ("loss", 0.6, 1000.0, 400.0) + svc.alert_service.degradation_alert = AsyncMock() + + worker_client = MagicMock() + worker_client.get_stats = AsyncMock(return_value={}) + tari_client = MagicMock() + tari_client.get_sync_status = AsyncMock(return_value={"is_syncing": False}) + tari_client.close = AsyncMock() + + with ( + patch.object(ds_mod, "ClientSession", _FakeClientSession), + patch.object(ds_mod, "XMRigWorkerClient", return_value=worker_client), + patch.object(ds_mod, "TariClient", return_value=tari_client), + patch.object(ds_mod, "get_stratum_stats", return_value={}), + patch.object(ds_mod, "get_network_stats", return_value={"height": 100}), + patch.object( + ds_mod, "get_tari_stats", return_value={"active": True, "status": "OK", "height": 3} + ), + patch.object( + ds_mod, + "get_p2pool_stats", + return_value={"pool": {"last_share_time": 0, "difficulty": 0}}, + ), + patch.object( + ds_mod, + "get_monero_sync_status", + AsyncMock(return_value={"is_syncing": False, "percent": 100}), + ), + patch.object(ds_mod, "get_disk_usage", return_value={}), + patch.object(ds_mod, "get_hugepages_status", return_value=("Enabled", "ok", "1/2")), + patch.object(ds_mod, "get_memory_usage", return_value={}), + patch.object(ds_mod, "get_load_average", return_value="0"), + patch.object(ds_mod, "get_cpu_usage", return_value="0%"), + patch.object(ds_mod, "get_cpu_avx2", return_value=True), + patch("asyncio.sleep", AsyncMock(side_effect=StopAsyncIteration)), + ): + with pytest.raises(StopAsyncIteration): + await svc.run() + + assert 
sm.add_event.call_args.args[1] == "hashrate_loss" + svc.alert_service.degradation_alert.assert_awaited_once_with("loss", 0.6) + async def test_run_holds_miner_while_syncing(self): # A syncing Monero node → gate holds p2pool + xmrig-proxy and #31's failover stays # dormant (no workers to fail over before we've even started mining). @@ -756,6 +807,7 @@ async def test_run_holds_miner_while_syncing(self): patch.object(ds_mod, "get_memory_usage", return_value={}), patch.object(ds_mod, "get_load_average", return_value="0"), patch.object(ds_mod, "get_cpu_usage", return_value="0%"), + patch.object(ds_mod, "get_cpu_avx2", return_value=True), patch("asyncio.sleep", AsyncMock(side_effect=StopAsyncIteration)), ): with pytest.raises(StopAsyncIteration): @@ -768,6 +820,85 @@ async def test_run_holds_miner_while_syncing(self): assert svc.miner_released is False assert svc.latest_data["miner_held"] is True + async def test_run_wires_computed_signals_into_the_alerter(self): + # Wiring guard: the unit tests prove each signal → the right alert in isolation; this proves + # the data loop actually hands the alerter the full contract each cycle (so a dropped/renamed + # kwarg, or forgetting the daily-summary call, fails here rather than silently going dark). + svc, sm, proxy = _make_service() + sm.is_db_healthy.return_value = True + proxy.get_workers.return_value = {"workers": []} + svc._apply_worker_rejection = AsyncMock() + svc.alert_service = MagicMock() + # Disabled → the loop skips the per-cycle build_metrics (a MagicMock state_manager can't feed + # it); process()/maybe_daily_summary are still called every cycle regardless.
+ svc.alert_service.enabled = False + svc.alert_service.process = AsyncMock() + svc.alert_service.maybe_daily_summary = AsyncMock() + + worker_client = MagicMock() + worker_client.get_stats = AsyncMock(return_value={}) + tari_client = MagicMock() + tari_client.get_sync_status = AsyncMock( + return_value={"is_syncing": False, "reachable": True} + ) + tari_client.close = AsyncMock() + + with ( + patch.object(ds_mod, "ClientSession", _FakeClientSession), + patch.object(ds_mod, "XMRigWorkerClient", return_value=worker_client), + patch.object(ds_mod, "TariClient", return_value=tari_client), + patch.object(ds_mod, "get_stratum_stats", return_value={}), + patch.object(ds_mod, "get_network_stats", return_value={"height": 100}), + patch.object(ds_mod, "get_tari_stats", return_value={"active": True, "height": 3}), + patch.object( + ds_mod, + "get_p2pool_stats", + return_value={"pool": {"last_share_time": 0, "difficulty": 0}}, + ), + patch.object( + ds_mod, + "get_monero_sync_status", + AsyncMock(return_value={"is_syncing": False, "reachable": True}), + ), + patch.object(ds_mod, "get_disk_usage", return_value={"percent": 42}), + patch.object(ds_mod, "get_hugepages_status", return_value=("Enabled", "ok", "1/2")), + patch.object(ds_mod, "get_memory_usage", return_value={}), + patch.object(ds_mod, "get_load_average", return_value="0"), + patch.object(ds_mod, "get_cpu_usage", return_value="0%"), + patch.object(ds_mod, "get_cpu_avx2", return_value=True), + patch("asyncio.sleep", AsyncMock(side_effect=StopAsyncIteration)), + ): + with pytest.raises(StopAsyncIteration): + await svc.run() + + svc.alert_service.process.assert_awaited_once() + kw = svc.alert_service.process.await_args.kwargs + # The full signal contract the AlertService.evaluate() unit tests rely on. 
+ assert set(kw) >= { + "monero_down", + "tari_down", + "tari_required", + "miner_released", + "workers", + "workers_expected", + "disk_percent", + "db_healthy", + "xvb_enabled", + "shares_in_window", + "clearnet_active", + "xvb_registration_state", + "update_available", + "low_hr_warning", + "hugepages_reserved", + "low_ram", + } + # ...sourced from the real computed values, not placeholders. + assert kw["db_healthy"] is True # from state_manager.is_db_healthy() + assert kw["disk_percent"] == 42 # from get_disk_usage() + assert isinstance(kw["workers"], list) + # The once-daily digest is wired in too. + svc.alert_service.maybe_daily_summary.assert_awaited_once() + async def test_run_releases_despite_height_override(self): # Both nodes are synced per their RPC/gRPC, but p2pool is held so its stats file is # empty → get_network_stats height 0 trips the UI "syncing" override. The gate must @@ -809,6 +940,7 @@ async def test_run_releases_despite_height_override(self): patch.object(ds_mod, "get_memory_usage", return_value={}), patch.object(ds_mod, "get_load_average", return_value="0"), patch.object(ds_mod, "get_cpu_usage", return_value="0%"), + patch.object(ds_mod, "get_cpu_avx2", return_value=True), patch("asyncio.sleep", AsyncMock(side_effect=StopAsyncIteration)), ): with pytest.raises(StopAsyncIteration): @@ -868,6 +1000,7 @@ async def test_run_nonblocking_tari_releases_and_stays_operational(self): patch.object(ds_mod, "get_memory_usage", return_value={}), patch.object(ds_mod, "get_load_average", return_value="0"), patch.object(ds_mod, "get_cpu_usage", return_value="0%"), + patch.object(ds_mod, "get_cpu_avx2", return_value=True), patch("asyncio.sleep", AsyncMock(side_effect=StopAsyncIteration)), ): with pytest.raises(StopAsyncIteration): @@ -907,6 +1040,7 @@ async def _run_one_iteration(self, svc, monero_sync, tari_sync): patch.object(ds_mod, "get_memory_usage", return_value={}), patch.object(ds_mod, "get_load_average", return_value="0"), patch.object(ds_mod,
"get_cpu_usage", return_value="0%"), + patch.object(ds_mod, "get_cpu_avx2", return_value=True), patch("asyncio.sleep", AsyncMock(side_effect=StopAsyncIteration)), ): with pytest.raises(StopAsyncIteration): @@ -1034,6 +1168,7 @@ async def test_run_holds_when_tari_required_and_only_monero_synced(self): patch.object(ds_mod, "get_memory_usage", return_value={}), patch.object(ds_mod, "get_load_average", return_value="0"), patch.object(ds_mod, "get_cpu_usage", return_value="0%"), + patch.object(ds_mod, "get_cpu_avx2", return_value=True), patch("asyncio.sleep", AsyncMock(side_effect=StopAsyncIteration)), ): with pytest.raises(StopAsyncIteration): diff --git a/build/dashboard/tests/service/test_degradation.py b/build/dashboard/tests/service/test_degradation.py new file mode 100644 index 0000000..e618190 --- /dev/null +++ b/build/dashboard/tests/service/test_degradation.py @@ -0,0 +1,81 @@ +"""Unit tests for the hashrate-degradation detector (Issue #99). + +Pure logic, driven by an injectable clock β€” no timers, no sleeps. Each test walks the monitor +through a scripted (value, time) sequence and asserts which edges fire. +""" + +from mining_dashboard.service.degradation import DegradationMonitor + + +def _mon(**kw): + # Small thresholds so tests read clearly: baseline established above 500, 50% drop for 600s. + kw.setdefault("min_baseline", 500) + return DegradationMonitor(**kw) + + +def test_steady_state_never_fires(): + m = _mon() + for t in range(0, 6000, 60): + assert m.update(1000, now=t) is None + + +def test_cold_start_below_min_baseline_is_silent(): + m = _mon(min_baseline=500) + # A tiny fleet that never crosses min_baseline can't be "degraded", even at zero hashrate. 
+ for t in range(0, 3000, 60): + assert m.update(100, now=t) is None + assert m.update(0, now=t + 30) is None + + +def test_sustained_drop_fires_loss_once(): + m = _mon(sustained_sec=600, threshold_frac=0.5) + m.update(1000, now=0) # seed baseline + # Drop to zero and hold; nothing until the debounce window elapses. + assert m.update(0, now=60) is None + assert m.update(0, now=600) is None + edge = m.update(0, now=660) # 600s below threshold since t=60 + assert edge is not None + kind, drop_frac, baseline, current = edge + assert kind == "loss" + assert drop_frac > 0.9 + assert current == 0 + # Still down — no repeat. + assert m.update(0, now=1300) is None + + +def test_brief_blip_does_not_fire(): + m = _mon(sustained_sec=600, threshold_frac=0.5) + m.update(1000, now=0) + assert m.update(0, now=60) is None # blip starts + assert m.update(1000, now=120) is None # recovers well before 600s + # Baseline intact, clock reset — a later full window from scratch is needed to fire. + assert m.update(0, now=180) is None + assert m.update(0, now=700) is None + assert m.update(0, now=781) is not None # 600s since t=180 + + +def test_recovery_fires_after_hold(): + m = _mon(sustained_sec=600, recovery_frac=0.8, recovery_sec=120) + m.update(1000, now=0) + m.update(0, now=60) + assert m.update(0, now=660)[0] == "loss" + # Climb back above 80% of baseline and hold recovery_sec. + assert m.update(1000, now=720) is None + edge = m.update(1000, now=840) # 120s above recovery threshold + assert edge is not None and edge[0] == "recovered" + + +def test_baseline_frozen_while_degraded(): + # A sustained drop must not erode the baseline (which would mask the outage and prevent recovery + # detection). Once degraded, the baseline is pinned — a long outage doesn't drag it down.
+ m = _mon(sustained_sec=600, recovery_frac=0.8, recovery_sec=120) + m.update(1000, now=0) + m.update(0, now=60) + assert m.update(0, now=660)[0] == "loss" + frozen = m._baseline + # Long stretch near zero — an unfrozen EMA would drift the baseline down over dozens of samples. + for t in range(700, 6000, 60): + m.update(10, now=t) + assert m._baseline == frozen # pinned to the pre-drop level + m.update(1000, now=6060) + assert m.update(1000, now=6200)[0] == "recovered" diff --git a/build/dashboard/tests/service/test_egress.py b/build/dashboard/tests/service/test_egress.py index 69b62ad..7b38f12 100644 --- a/build/dashboard/tests/service/test_egress.py +++ b/build/dashboard/tests/service/test_egress.py @@ -23,6 +23,7 @@ "tari_clearnet_sync": False, "remote_monero": False, "healthchecks_enabled": False, + "telegram_enabled": False, } @@ -93,6 +94,14 @@ def test_healthchecks_ping_is_tor_when_configured_inactive_otherwise(): assert on["summary"]["leaks"] == 0 +def test_telegram_bot_is_tor_when_enabled_inactive_otherwise(): + # Enabling Telegram adds a dashboard Tor egress (#121/#340); off → inactive, never a leak.
+ assert _conn(_posture(telegram_enabled=False), "dashboard", "Telegram")["route"] == INACTIVE + on = _posture(telegram_enabled=True, firewall=True) + assert _conn(on, "dashboard", "Telegram")["route"] == TOR + assert on["summary"]["leaks"] == 0 # Tor-routed, so never a leak + + def test_remote_monerod_rpc_is_clearnet(): assert _conn(_posture(remote_monero=False), "p2pool", "monerod RPC")["route"] != CLEARNET assert _conn(_posture(remote_monero=True), "p2pool", "monerod RPC")["route"] == CLEARNET @@ -226,6 +235,7 @@ def test_tari_clearnet_sync_surfaces_in_egress_and_topology(): "tari_clearnet_sync", "remote_monero", "healthchecks_enabled", + "telegram_enabled", ) diff --git a/build/dashboard/tests/service/test_storage_service.py b/build/dashboard/tests/service/test_storage_service.py index 54dc62e..dd967b0 100644 --- a/build/dashboard/tests/service/test_storage_service.py +++ b/build/dashboard/tests/service/test_storage_service.py @@ -395,3 +395,52 @@ def test_old_history_pruned_from_db_when_cleanup_fires(self, state_manager, monk (time.time() - HISTORY_RETENTION_SEC,), ).fetchone()[0] assert remaining == 0, "expired DB rows are pruned" + + +class TestChartEvents: + """Degradation/recovery markers for the chart (#99): in-memory tally, disk persistence, and + tolerance of a pre-migration DB with no events table.""" + + def test_add_and_get_roundtrip(self, state_manager): + t0 = time.time() + state_manager.add_event(t0, "loss", "-62%") + state_manager.add_event(t0 + 100, "recovered", "") + evs = state_manager.get_events() + assert [e["type"] for e in evs] == ["loss", "recovered"] + assert evs[0] == {"ts": t0, "type": "loss", "detail": "-62%"} + # returns a copy — mutating it doesn't corrupt stored state + evs.clear() + assert len(state_manager.get_events()) == 2 + + def test_old_events_pruned_from_memory(self, state_manager): + state_manager.add_event(1.0, "loss", "ancient") # ts well before the retention cutoff + state_manager.add_event(time.time(), "recovered",
"fresh") + details = [e["detail"] for e in state_manager.get_events()] + assert details == ["fresh"] + + def test_events_survive_reload(self, tmp_path): + db = str(tmp_path / "events.db") + sm = StateManager(db_path=db) + sm.add_event(time.time(), "loss", "-50%") + sm.close() + sm2 = StateManager(db_path=db) + try: + evs = sm2.get_events() + assert len(evs) == 1 and evs[0]["type"] == "loss" + finally: + sm2.close() + + def test_load_tolerates_missing_events_table(self, tmp_path): + # Upgrade path: a DB written by a pre-#99 build has no events table. Opening it must not + # crash and must report no events. (StateManager creates the table on open, so load() then + # finds it empty; the sqlite3.Error guard in load() is defence-in-depth for that ordering.) + db = str(tmp_path / "legacy.db") + conn = sqlite3.connect(db) + conn.execute("CREATE TABLE state (key TEXT PRIMARY KEY, value TEXT)") + conn.commit() + conn.close() + sm = StateManager(db_path=db) + try: + assert sm.get_events() == [] + finally: + sm.close() diff --git a/build/dashboard/tests/service/test_telegram_commands.py b/build/dashboard/tests/service/test_telegram_commands.py new file mode 100644 index 0000000..8c57263 --- /dev/null +++ b/build/dashboard/tests/service/test_telegram_commands.py @@ -0,0 +1,713 @@ +"""Unit tests for the on-demand Telegram command interface (Issue #45). + +Covers command parsing, the pure reply formatters (fed hand-built Metrics), reply routing +(``build_metrics`` stubbed so no DB is touched), single-chat access control, and the +enabled/disabled gating. No network β€” the transport is stubbed throughout. 
+""" + +import asyncio +from dataclasses import replace +from types import SimpleNamespace + +import pytest + +from mining_dashboard.service import telegram_commands as tc +from mining_dashboard.service.metrics import Metrics, SyncMetric + +_SYNCED = SyncMetric( + percent=100, current=10, target=10, remaining=0, has_target=True, done=True, down=False +) +_DOWN = SyncMetric( + percent=0, current=0, target=0, remaining=0, has_target=False, done=False, down=True +) +_SYNCING = SyncMetric( + percent=42.5, current=850, target=2000, remaining=1150, has_target=True, done=False, down=False +) + +_BASE = Metrics( + total_h15=10500.0, + p2pool_1h=8000.0, + p2pool_24h=8100.0, + xvb_1h=2100.0, + xvb_24h=2300.0, + xvb_routed_1h=2000.0, + xvb_routed_24h=2050.0, + stratum_h15=10300.0, + stratum_h1h=10400.0, + stratum_h24h=10200.0, + mode="P2POOL", + xvb_enabled=True, + current_tier="Donor", + target_tier="Donor", + target_threshold=1000.0, + target_sustainable=True, + low_hr_warning=False, + xvb_fail_count=0, + xvb_last_update=0, + workers_online=2, + workers_total=3, + shares_in_window=5, + pplns_window=2160, + block_time=10, + pool_type="Mini", + pool_hashrate=120_000_000.0, + pool_difficulty=250_000_000.0, + network_difficulty=380_000_000_000.0, + network_height=3210001, + global_syncing=False, + monero=_SYNCED, + tari=_SYNCED, + monero_mode="Unknown", + tari_mining=True, +) + + +def _metrics(**over): + return replace(_BASE, **over) + + +# --- parse_command ------------------------------------------------------------------------ + + +@pytest.mark.parametrize( + "text,expected", + [ + ("/status", "status"), + ("/info", "info"), + (" /sync ", "sync"), + ("/HASHRATE", "hashrate"), + ("/system", "system"), + ("/pool", "pool"), + ("/xvb", "xvb"), + ("/earnings", "earnings"), + ("/status@PitheadBot", "status"), # group @mention suffix stripped + ("/workers now please", "workers"), # only the first word matters + ("/help", "help"), + ("/frobnicate", "unknown"), # a slash command we 
don't answer + ("hello there", None), # plain chatter is ignored + ("", None), + (None, None), + ("/", None), + ], +) +def test_parse_command(text, expected): + assert tc.parse_command(text) == expected + + +# --- formatters --------------------------------------------------------------------------- + + +def test_status_active(): + out = tc.format_status(_metrics(), mining_active=True) + assert "Monero node: 🟢 synced" in out + assert "Mining: 🟢 active (P2POOL)" in out + assert "Workers: 2/3 online" in out + assert "10.50 kH/s" in out + assert "PPLNS shares: 5 in window" in out + + +def test_status_syncing_beats_mining_flag(): + # While the whole stack is syncing, the reply says "holding", never "active". + out = tc.format_status(_metrics(global_syncing=True), mining_active=True) + assert "holding" in out + assert "active" not in out + + +def test_status_node_down_and_not_mining(): + out = tc.format_status(_metrics(monero=_DOWN), mining_active=False) + assert "Monero node: 🔴 down" in out + assert "Mining: 🔴 not mining" in out + + +def test_hashrate_lists_online_workers_desc(): + workers = [ + {"name": "rig-a", "status": "online", "h15": 3000}, + {"name": "rig-b", "status": "online", "h15": 7000}, + {"name": "rig-c", "status": "offline", "h15": 0}, + ] + out = tc.format_hashrate_reply(_metrics(), workers) + # Highest first, offline excluded. + assert out.index("rig-b") < out.index("rig-a") + assert "rig-c" not in out + + +def test_hashrate_no_online_workers(): + out = tc.format_hashrate_reply(_metrics(), [{"name": "x", "status": "offline"}]) + assert "No workers online." in out + + +def test_hashrate_uses_effective_rate_for_fresh_worker(): + # A just-connected rig has no 10m (h15) history yet but is mining — it must show its live 1m + # rate (the same value the total counts), never 0.00. (This was the reported inconsistency.)
+ workers = [{"name": "fresh", "status": "online", "h15": 0, "h60": 42000, "h10": 42000}] + out = tc.format_hashrate_reply(_metrics(), workers) + assert "42.00 kH/s" in out + assert "0.00 H/s" not in out + + +def test_workers_hashrate_uses_effective_rate(): + workers = [{"name": "fresh", "status": "online", "h15": 0, "h60": 5000, "h10": 5000}] + assert "5.00 kH/s" in tc.format_workers(workers) + + +def test_workers_online_first_with_offline_flagged(): + workers = [ + {"name": "off-1", "status": "offline", "h15": 0}, + {"name": "on-1", "status": "online", "h15": 5000, "uptime": 3661}, + ] + out = tc.format_workers(workers) + lines = out.splitlines() + assert "🟢 on-1" in lines[1] and "up 1h 1m" in lines[1] # online first, uptime shown + assert "🔴 off-1 — offline" in lines[2] + + +def test_workers_empty(): + assert "No workers connected." in tc.format_workers([]) + + +def test_status_node_syncing_percent(): + # _node_state's "syncing %" branch (not down, not done). + out = tc.format_status(_metrics(monero=_SYNCING), mining_active=True) + assert "Monero node: ⏳ syncing 42.5%" in out + + +def test_sync_line_variants(): + out = tc.format_sync(_metrics(monero=_SYNCING, tari=_DOWN)) + assert "Monero: ⏳ 42.5% (850/2,000)" in out + assert "Tari: 🔴 node down" in out + + +def test_sync_line_no_target(): + # A chain that's syncing but hasn't discovered a target height yet.
+ no_target = SyncMetric( + percent=12.0, current=0, target=0, remaining=0, has_target=False, done=False, down=False + ) + assert "Monero: ⏳ syncing 12.0%" in tc.format_sync(_metrics(monero=no_target)) + + +def test_system_reads_snapshot(): + system = { + "disk": {"used_gb": 120.4, "total_gb": 500.0, "percent_str": "24%"}, + "memory": {"used_gb": 3.2, "total_gb": 16.0, "percent_str": "20%"}, + "cpu_percent": "12.5%", + "load": "0.50 0.40 0.30", + "hugepages": ["Enabled", "status-ok", "3072/3072"], + } + out = tc.format_system(system) + assert "Disk: 120.4/500.0 GB (24%)" in out + assert "RAM: 3.2/16.0 GB (20%)" in out + assert "CPU: 12.5%" in out + assert "HugePages: Enabled (3072/3072)" in out + + +@pytest.mark.parametrize( + "n,expected", + [ + (0, "0"), + (42, "42"), + (999, "999"), + (1500, "1.50 K"), + (380e9, "380.00 G"), + (2.5e12, "2.50 T"), + (3e18, "3.00 E"), # beyond peta — the fallback branch + ], +) +def test_human_count(n, expected): + assert tc._human_count(n) == expected + + +def test_pool_reads_metrics(): + out = tc.format_pool( + _metrics(pool_type="Mini", network_height=3210001, network_difficulty=380e9) + ) + assert "P2Pool Mini" in out + assert "height 3,210,001" in out + assert "diff 380.00 G" in out + assert "5 in window" in out # shares_in_window from _BASE + + +def test_pool_share_health_and_best_when_present(): + # Proxy /summary + found blocks enrich /pool (#82): acceptance rate, best share, blocks. + data = { + "pool": {"pool": {"blocks_found": 3}}, + "proxy_summary": {"accepted": 125_000, "rejected": 40, "best": 2_345_678}, + } + out = tc.format_pool(_metrics(), data) + assert "Blocks found: 3" in out + assert "125,000 ✓ / 40 ✗ (0.03% rejects)" in out + assert "Best share: 💎 2,345,678" in out + + +def test_pool_omits_share_lines_before_first_poll(): + # No proxy data yet (fresh start) → no zeroed share/best/blocks lines, just the core figures.
+ out = tc.format_pool(_metrics(), {}) + assert "Shares to pool" not in out + assert "Best share" not in out + assert "Blocks found" not in out + assert "Effort" not in out # no stratum data → no effort line + + +def test_pool_effort_when_stratum_present(): + # Effort is a luck indicator; shown only once stratum has been polled (the key is present). + out = tc.format_pool(_metrics(), {"stratum": {"current_effort": 87.3}}) + assert "Effort: 87.3%" in out + # Effort right after a block can legitimately be 0.0 — still shown (key present), not hidden. + assert "Effort: 0.0%" in tc.format_pool(_metrics(), {"stratum": {"current_effort": 0.0}}) + + +def test_xvb_enabled_with_share(): + out = tc.format_xvb(_metrics(xvb_enabled=True, shares_in_window=5, xvb_1h=2100, xvb_24h=2300)) + assert "Current tier: Donor" in out + assert "raffle-eligible" in out + # Credited averages (what XvB measures → sets the tier) are shown alongside routed. + assert "Credited by XvB: 2.10 kH/s (1h) · 2.30 kH/s (24h)" in out + + +def test_xvb_stale_warns(): + out = tc.format_xvb(_metrics(xvb_enabled=True, shares_in_window=5, xvb_stale=True)) + assert "stale" in out + assert "stale" not in tc.format_xvb(_metrics(xvb_enabled=True, shares_in_window=5)) + + +def test_xvb_no_share_warns(): + out = tc.format_xvb(_metrics(xvb_enabled=True, shares_in_window=0)) + assert "wins skipped" in out + + +def test_xvb_disabled(): + assert "disabled" in tc.format_xvb(_metrics(xvb_enabled=False)) + + +def test_status_merge_mining_line(): + linked = tc.format_status(_metrics(), True, merge_mining=True) + assert "Merge-mining: 🟢 Tari linked" in linked + down = tc.format_status(_metrics(), True, merge_mining=False) + assert "Merge-mining: ⏸ Tari not linked" in down + # None (Tari not yet polled / not in play) omits the line entirely. + assert "Merge-mining" not in tc.format_status(_metrics(), True) + + +def test_earnings_estimate(): + # network reward present + a real difficulty → a positive daily figure.
+ out = tc.format_earnings( + _metrics(p2pool_1h=8000.0, p2pool_24h=8100.0), {"reward": 600_000_000_000} + ) + assert "1h avg" in out and "XMR/day" in out + # The 24h average is shown once available and drives the steadier 30d projection. + assert "24h avg" in out and "XMR/30d" in out + + +def test_earnings_falls_back_to_1h_30d_without_24h_history(): + # A fresh node with no 24h average yet still gets a 30d figure (from the 1h rate). + out = tc.format_earnings( + _metrics(p2pool_1h=8000.0, p2pool_24h=0.0), {"reward": 600_000_000_000} + ) + assert "24h avg" not in out + assert "XMR/30d" in out + + +def test_earnings_unavailable_without_network_data(): + out = tc.format_earnings(_metrics(), {}) # no reward → coeff 0 + assert "unavailable" in out + + +def test_daily_summary_is_a_24h_retrospective(): + now = 1_000_000 + data = { + "workers": [ + {"name": "miner-0", "status": "online", "h24h": 30000}, + {"name": "miner-1", "status": "online", "h24h": 20000}, + {"name": "old", "status": "offline", "h24h": 0}, + ], + # 2 shares within 24h, 1 older. + "shares": [{"ts": now - 100}, {"ts": now - 90000}, {"ts": now - 200}], + "system": {"disk": {"percent_str": "42%"}}, + "network": {"reward": 600_000_000_000}, + } + out = tc.format_daily_summary( + _metrics( + xvb_enabled=True, + p2pool_24h=40000, + xvb_routed_24h=10000, + current_tier="Donor", + workers_online=2, + workers_total=3, + ), + data, + now=now, + ) + assert "Daily summary — " in out # date+time stamped + assert "24h hashrate: 50.00 kH/s" in out # sum of per-rig h24h (30k + 20k) + assert "20% to XvB" in out # 10k / (40k + 10k) + assert "P2Pool 40.00 kH/s" in out and "XvB 10.00 kH/s" in out # apportioned, sums to fleet + assert "XvB tier: Donor" in out + assert "Shares (24h): 2" in out + assert "Est. earnings" in out + assert "miner-0: 30.00 kH/s" in out + assert "old" not in out # offline rig excluded + assert "Disk: 42% used" in out + # The retrospective drops live-status lines like node sync.
+ assert "synced" not in out.lower() + + +def test_daily_summary_without_xvb_omits_split(): + data = {"workers": [{"name": "m", "status": "online", "h24h": 5000}], "shares": []} + out = tc.format_daily_summary(_metrics(xvb_enabled=False), data, now=0) + assert "24h hashrate: 5.00 kH/s" in out + assert "to XvB" not in out + + +def test_daily_summary_incident_log(): + m, data = _metrics(xvb_enabled=False), {"workers": [], "shares": []} + # Incidents present → a roll-up line, highest count first. + out = tc.format_daily_summary(m, data, now=0, incidents={"worker_offline": 3, "node_down": 1}) + assert "Incidents (24h): 3× worker offline · 1× node down" in out + # Empty tally → an explicit all-clear. + assert "No incidents in the last 24h" in tc.format_daily_summary(m, data, now=0, incidents={}) + # Not tracked (None) → no incident line at all. + none = tc.format_daily_summary(m, data, now=0, incidents=None) + assert "Incidents" not in none and "No incidents" not in none + + +def test_host_label_prefix(): + assert tc.format_sync(_metrics(), host_label="rig-box").startswith("[rig-box] ") + # The placeholder is never printed.
+ assert not tc.format_sync(_metrics(), host_label="Unknown Host").startswith("[") + + +# --- reply_for routing -------------------------------------------------------------------- + + +def _bot(monkeypatch, latest_data=None, db_healthy=True, **over): + monkeypatch.setattr(tc, "build_metrics", lambda data, sm: _metrics(**over)) + sm = SimpleNamespace(is_db_healthy=lambda: db_healthy) + ds = SimpleNamespace(latest_data=latest_data or {}, state_manager=sm) + return tc.TelegramCommandBot(ds, enabled=True, bot_token="tok", chat_id="42", host_label="") + + +def test_reply_for_help_and_unknown_need_no_metrics(): + ds = SimpleNamespace(latest_data={}, state_manager=object()) + bot = tc.TelegramCommandBot(ds, enabled=True, bot_token="t", chat_id="1", host_label="") + assert "/status" in bot.reply_for("/help") + assert "Unknown command" in bot.reply_for("/nope") + assert bot.reply_for("just chatting") is None + + +def test_reply_for_status_uses_mining_flag(monkeypatch): + bot = _bot(monkeypatch, latest_data={"miner_released": True, "workers_rejected": False}) + assert "🟢 active" in bot.reply_for("/status") + # Rejected workers (node-down failover) reads as not mining even when released. + bot2 = _bot(monkeypatch, latest_data={"miner_released": True, "workers_rejected": True}) + assert "🔴 not mining" in bot2.reply_for("/status") + + +def test_reply_for_status_merge_mining_from_tari_snapshot(monkeypatch): + # gRPC linked = connected AND active (the #313 rule) → the "linked" line. + bot = _bot(monkeypatch, latest_data={"tari": {"connected": True, "active": True}}) + assert "Merge-mining: 🟢 Tari linked" in bot.reply_for("/status") + # Node up but gRPC not ready (the exact gap that hid #313) → "not linked".
+ bot2 = _bot(monkeypatch, latest_data={"tari": {"connected": False, "active": True}}) + assert "Merge-mining: ⏸ Tari not linked" in bot2.reply_for("/status") + + +def test_reply_for_pool_reads_share_snapshot(monkeypatch): + data = {"proxy_summary": {"accepted": 999, "rejected": 1, "best": 555}} + bot = _bot(monkeypatch, latest_data=data, pool_type="Mini") + out = bot.reply_for("/pool") + assert "Best share: 💎 555" in out and "999 ✓ / 1 ✗" in out + + +def test_reply_for_workers_reads_snapshot(monkeypatch): + workers = [{"name": "z", "status": "online", "h15": 1000}] + bot = _bot(monkeypatch, latest_data={"workers": workers}) + assert "z" in bot.reply_for("/workers") + + +def test_reply_for_system_reads_snapshot_without_metrics(): + # /system reads only the raw snapshot — build_metrics must not be needed (left unstubbed). + ds = SimpleNamespace(latest_data={"system": {"cpu_percent": "9%"}}, state_manager=None) + bot = tc.TelegramCommandBot(ds, enabled=True, bot_token="t", chat_id="1", host_label="") + assert "CPU: 9%" in bot.reply_for("/system") + + +def test_reply_for_pool_and_xvb(monkeypatch): + bot = _bot(monkeypatch, latest_data={}, pool_type="Nano") + assert "P2Pool Nano" in bot.reply_for("/pool") + assert "XvB" in bot.reply_for("/xvb") + + +def test_reply_for_earnings(monkeypatch): + bot = _bot(monkeypatch, latest_data={"network": {"reward": 600_000_000_000}}, p2pool_1h=8000.0) + assert "XMR/day" in bot.reply_for("/earnings") + + +def test_reply_for_hashrate_and_sync(monkeypatch): + workers = [{"name": "z", "status": "online", "h15": 1000}] + bot = _bot(monkeypatch, latest_data={"workers": workers}) + assert "Hashrate" in bot.reply_for("/hashrate") + assert "Sync status" in bot.reply_for("/sync") + + +def test_safe_reply_for_swallows_errors(monkeypatch): + # A formatting/read bug in reply_for must never kill the poll loop — it just goes quiet.
+ ds = SimpleNamespace(latest_data={}, state_manager=object()) + bot = tc.TelegramCommandBot(ds, enabled=True, bot_token="t", chat_id="1") + + def boom(_text): + raise RuntimeError("kaboom") + + monkeypatch.setattr(bot, "reply_for", boom) + assert bot._safe_reply_for("/status") is None + + +# --- enabled gating ----------------------------------------------------------------------- + + +def test_disabled_without_token_or_chat(): + ds = SimpleNamespace(latest_data={}, state_manager=object()) + assert not tc.TelegramCommandBot(ds, enabled=True, bot_token="", chat_id="1").enabled + assert not tc.TelegramCommandBot(ds, enabled=True, bot_token="t", chat_id="").enabled + assert not tc.TelegramCommandBot(ds, enabled=False, bot_token="t", chat_id="1").enabled + assert tc.TelegramCommandBot(ds, enabled=True, bot_token="t", chat_id="1").enabled + + +async def test_run_is_noop_when_disabled(): + ds = SimpleNamespace(latest_data={}, state_manager=object()) + bot = tc.TelegramCommandBot(ds, enabled=False, bot_token="", chat_id="") + # Returns immediately without touching the network — no session, no poll. + await bot.run() + + +# --- access control ----------------------------------------------------------------------- + + +async def test_handle_update_ignores_foreign_chat(monkeypatch): + bot = _bot(monkeypatch) + sent = [] + monkeypatch.setattr(bot, "_send", sent.append) # _send is sync now (run via to_thread) + # chat_id 999 != configured 42 → dropped, nothing sent.
+ await bot._handle_update({"message": {"chat": {"id": 999}, "text": "/help"}}) + assert sent == [] + + +async def test_handle_update_replies_to_configured_chat(monkeypatch): + bot = _bot(monkeypatch) + sent = [] + monkeypatch.setattr(bot, "_send", sent.append) + await bot._handle_update({"message": {"chat": {"id": 42}, "text": "/help"}}) + assert len(sent) == 1 and "/status" in sent[0] + + +# --- transport (stubbed requests, over Tor) ----------------------------------------------- + + +class _Resp: + """Minimal stand-in for a requests.Response.""" + + def __init__(self, payload=None, raise_status=False): + self._payload = payload or {} + self._raise = raise_status + + def raise_for_status(self): + if self._raise: + raise RuntimeError("http error") + + def json(self): + return self._payload + + +def _make_bot(tor_proxy="socks5h://tor:9050"): + ds = SimpleNamespace(latest_data={}, state_manager=object()) + return tc.TelegramCommandBot( + ds, enabled=True, bot_token="tok", chat_id="42", tor_proxy=tor_proxy + ) + + +def test_get_updates_parses_results_over_tor(monkeypatch): + bot = _make_bot() + bot._offset = 7 + seen = {} + + def fake_get(url, params=None, timeout=None, proxies=None): + seen.update(url=url, params=params, proxies=proxies) + return _Resp({"ok": True, "result": [{"update_id": 8}]}) + + monkeypatch.setattr(tc.requests, "get", fake_get) + assert bot._get_updates(0) == [{"update_id": 8}] + assert "bottok" in seen["url"] and seen["params"]["offset"] == 7 # token + offset forwarded + assert seen["proxies"] == {"http": "socks5h://tor:9050", "https": "socks5h://tor:9050"} + + +def test_get_updates_not_ok_returns_empty(monkeypatch): + bot = _make_bot() + monkeypatch.setattr(tc.requests, "get", lambda *a, **k: _Resp({"ok": False})) + assert bot._get_updates(0) == [] + + +def test_prime_offset_skips_backlog(monkeypatch): + bot = _make_bot() + monkeypatch.setattr( + tc.requests, + "get", + lambda *a, **k: _Resp({"ok": True, "result": [{"update_id": 3}, 
{"update_id": 9}]}), + ) + bot._prime_offset() + assert bot._offset == 10 # past the last pending update + + +def test_prime_offset_swallows_error(monkeypatch): + bot = _make_bot() + + def boom(*a, **k): + raise OSError("offline") + + monkeypatch.setattr(tc.requests, "get", boom) + bot._prime_offset() # must not raise + assert bot._offset is None + + +def test_send_posts_over_tor(monkeypatch): + bot = _make_bot() + seen = {} + + def fake_post(url, json=None, timeout=None, proxies=None): + seen.update(url=url, body=json, proxies=proxies) + return _Resp({"ok": True}) + + monkeypatch.setattr(tc.requests, "post", fake_post) + bot._send("hi") + assert ( + "bottok" in seen["url"] and seen["body"]["chat_id"] == "42" and seen["body"]["text"] == "hi" + ) + assert seen["proxies"]["https"] == "socks5h://tor:9050" + + +def test_send_swallows_network_error(monkeypatch): + bot = _make_bot() + monkeypatch.setattr(tc.requests, "post", lambda *a, **k: _Resp(raise_status=True)) + bot._send("hi") # must not raise + + +async def test_run_processes_update_then_honours_cancel(monkeypatch): + bot = _make_bot() + monkeypatch.setattr(bot, "_prime_offset", lambda: None) + handled = [] + + async def _fake_handle(update): + handled.append(update) + + calls = {"n": 0} + + def _fake_get(poll_timeout): + calls["n"] += 1 + if calls["n"] == 1: + return [{"update_id": 1}] + raise asyncio.CancelledError + + monkeypatch.setattr(bot, "_handle_update", _fake_handle) + monkeypatch.setattr(bot, "_get_updates", _fake_get) + with pytest.raises(asyncio.CancelledError): + await bot.run() + assert handled == [{"update_id": 1}] and bot._offset == 2 + + +async def test_run_backs_off_on_poll_error(monkeypatch): + bot = _make_bot() + monkeypatch.setattr(bot, "_prime_offset", lambda: None) + slept = [] + + async def _sleep(secs): + slept.append(secs) + raise asyncio.CancelledError # break out after the first backoff + + def _boom(poll_timeout): + raise OSError("telegram unreachable") + + 
monkeypatch.setattr(tc.asyncio, "sleep", _sleep) + monkeypatch.setattr(bot, "_get_updates", _boom) + with pytest.raises(asyncio.CancelledError): + await bot.run() + assert slept == [tc.POLL_ERROR_BACKOFF_SECONDS] + + +class TestStatusWarnings: + """/status surfaces the same warning/error badges as the dashboard top bar (#104), reusing + build_badges so the two never drift; informational states ('Syncing…') are excluded.""" + + def test_bad_and_flagged_warn_badges_included_stripped(self): + # Low RAM (⚠ warn) + DB failing (bad) both surface; the leading ⚠ is stripped for the list. + warnings = tc.status_warnings( + {"system": {"memory": {"total_gb": 8}}}, _metrics(), db_healthy=False + ) + assert "Low RAM (8 GB)" in warnings + assert "DB write failing" in warnings + assert not any(w.startswith("⚠") for w in warnings) + + def test_informational_states_excluded(self): + # 'Syncing…' / 'Miner held' are warn-variant but informational (no ⚠) — not warnings. + warnings = tc.status_warnings( + {"miner_held": True}, _metrics(global_syncing=True), db_healthy=True + ) + assert warnings == [] + + def test_healthy_is_empty(self): + assert tc.status_warnings({}, _metrics(), db_healthy=True) == [] + + def test_format_status_lists_warnings(self): + text = tc.format_status( + _metrics(), True, warnings=["Low RAM (8 GB)", "HugePages not reserved"] + ) + assert "⚠️ Warnings:" in text + assert "• Low RAM (8 GB)" in text + assert "• HugePages not reserved" in text + + def test_format_status_all_clear(self): + text = tc.format_status(_metrics(), True, warnings=[]) + assert "✅ No warnings." in text + + +class TestInfo: + """/info — the 'about this stack' card: build version, update availability, Monero DB mode, + P2Pool sidechain, and privacy (egress) posture.
All facts the stack already computes.""" + + def test_release_up_to_date_pruned_tor(self): + out = tc.format_info( + {"text": "v1.1.0", "dev": False}, + {"available": False}, + _metrics(monero_mode="Pruned", pool_type="Mini"), + {"all_tor": True}, + ) + assert "Version: v1.1.0" in out and "(dev build)" not in out + assert "✅ Up to date" in out + assert "Monero DB: Pruned" in out + assert "Sidechain: P2Pool Mini" in out + assert "🧅 Tor-only" in out + + def test_dev_update_available_full_clearnet(self): + out = tc.format_info( + {"text": "dev · main @ abc1234", "dev": True}, + {"available": True, "latest": "v1.2.0"}, + _metrics(monero_mode="Full"), + {"all_tor": False, "label": "2 clearnet egress path(s) exposing your IP"}, + ) + assert "(dev build)" in out + assert "🆕 v1.2.0 available" in out + assert "Monero DB: Full" in out + assert "⚠️ 2 clearnet egress path(s)" in out + + def test_unknown_db_mode_and_missing_update(self): + # monero_mode "Unknown" (remote/early) → no false Pruned/Full; update None → up to date.
+ out = tc.format_info( + {"text": "v1.1.0"}, None, _metrics(monero_mode="Unknown"), {"all_tor": True} + ) + assert "Monero DB: unknown" in out + assert "βœ… Up to date" in out + + def test_reply_for_info_routes(self, monkeypatch): + monkeypatch.setattr(tc, "resolve_version", lambda: {"text": "v1.1.0", "dev": False}) + monkeypatch.setattr( + tc, "egress_posture_from_config", lambda: {"summary": {"all_tor": True}} + ) + bot = _bot(monkeypatch, latest_data={"update": {"available": False}}, monero_mode="Pruned") + out = bot.reply_for("/info") + assert "πŸ“Ÿ Pithead info" in out and "Version: v1.1.0" in out and "πŸ§… Tor-only" in out diff --git a/build/dashboard/tests/service/test_telegram_notifier.py b/build/dashboard/tests/service/test_telegram_notifier.py new file mode 100644 index 0000000..8072c08 --- /dev/null +++ b/build/dashboard/tests/service/test_telegram_notifier.py @@ -0,0 +1,95 @@ +from unittest.mock import MagicMock, patch + +import requests + +import mining_dashboard.service.telegram_notifier as tg_mod +from mining_dashboard.service.telegram_notifier import TelegramNotifier + +EVENTS = {"node_down": True, "node_recovered": False} + + +def _enabled(**kw): + opts = dict(enabled=True, bot_token="TOKEN", chat_id="123", events=EVENTS) + opts.update(kw) + return TelegramNotifier(**opts) + + +class TestEnabledGating: + def test_disabled_by_default(self): + assert TelegramNotifier().enabled is False + + def test_enabled_requires_token_and_chat(self): + assert TelegramNotifier(enabled=True, bot_token="", chat_id="123").enabled is False + assert TelegramNotifier(enabled=True, bot_token="t", chat_id="").enabled is False + assert TelegramNotifier(enabled=True, bot_token="t", chat_id="123").enabled is True + + def test_enabled_flag_off_disables_even_with_creds(self): + assert TelegramNotifier(enabled=False, bot_token="t", chat_id="123").enabled is False + + def test_event_enabled_respects_toggle_and_enabled(self): + n = _enabled() + assert n.event_enabled("node_down") 
is True + assert n.event_enabled("node_recovered") is False # toggled off + assert n.event_enabled("worker_offline") is False # absent -> off + # A disabled notifier reports every event as off. + assert TelegramNotifier(events={"node_down": True}).event_enabled("node_down") is False + + +class TestSend: + def test_send_noop_when_disabled(self): + with patch.object(tg_mod.requests, "post") as post: + assert TelegramNotifier().send("hi") is False + post.assert_not_called() + + def test_send_posts_to_bot_api(self): + n = _enabled(api_base="https://tg.test") + resp = MagicMock() + resp.raise_for_status = MagicMock() + with patch.object(tg_mod.requests, "post", return_value=resp) as post: + assert n.send("node down") is True + url = post.call_args.args[0] + assert url == "https://tg.test/botTOKEN/sendMessage" + body = post.call_args.kwargs["json"] + assert body["chat_id"] == "123" + assert body["text"] == "node down" + + def test_send_routes_over_tor(self): + # The bot dial must ride the Tor SOCKS proxy, never leak the host IP to Telegram. + n = _enabled(tor_proxy="socks5h://tor:9050") + resp = MagicMock() + resp.raise_for_status = MagicMock() + with patch.object(tg_mod.requests, "post", return_value=resp) as post: + n.send("x") + assert post.call_args.kwargs["proxies"] == { + "http": "socks5h://tor:9050", + "https": "socks5h://tor:9050", + } + + def test_send_swallows_network_error(self): + n = _enabled() + with patch.object( + tg_mod.requests, "post", side_effect=requests.RequestException("offline") + ): + assert n.send("x") is False + + def test_send_swallows_http_error(self): + n = _enabled() + resp = MagicMock() + resp.raise_for_status.side_effect = requests.HTTPError("401") + with patch.object(tg_mod.requests, "post", return_value=resp): + assert n.send("x") is False + + +class TestTokenSecrecy: + def test_token_never_logged_on_failure(self, caplog): + # A requests error message embeds the request URL (with the token). 
Assert the token + # never reaches the logs even when a send fails. + n = _enabled(bot_token="SUPERSECRETTOKEN") + boom = requests.RequestException( + "HTTPSConnectionPool failed for url: " + "https://api.telegram.org/botSUPERSECRETTOKEN/sendMessage" + ) + with caplog.at_level("DEBUG"): + with patch.object(tg_mod.requests, "post", side_effect=boom): + n.send("x") + assert "SUPERSECRETTOKEN" not in caplog.text diff --git a/build/dashboard/tests/service/test_worker_presence.py b/build/dashboard/tests/service/test_worker_presence.py new file mode 100644 index 0000000..e4ac43e --- /dev/null +++ b/build/dashboard/tests/service/test_worker_presence.py @@ -0,0 +1,185 @@ +from mining_dashboard.service.worker_presence import WorkerPresenceMonitor + + +class _Clock: + """Manually advanced clock for deterministic debounce tests.""" + + def __init__(self): + self.t = 1000.0 + + def __call__(self): + return self.t + + def advance(self, secs): + self.t += secs + + +def _monitor(offline_after=300, recovery_after=120): + clock = _Clock() + m = WorkerPresenceMonitor( + offline_after=offline_after, recovery_after=recovery_after, clock=clock + ) + return m, clock + + +def _on(*names): + """Worker rows the proxy reports online.""" + return [{"name": n, "status": "online"} for n in names] + + +def _down(*names): + """Worker rows still listed by the proxy but disconnected β€” the DOWN state the UI shows.""" + return [{"name": n, "status": "offline"} for n in names] + + +class TestBaseline: + def test_first_sighting_online_is_silent(self): + # A brand-new worker is baselined ONLINE with no edge β€” it's not a "recovery". + m, _ = _monitor() + assert m.update(_on("rig-1")) == [] + + def test_first_sighting_down_is_silent(self): + # A rig already DOWN at startup baselines OFFLINE silently β€” a restart must not replay it. 
+ m, _ = _monitor() + assert m.update(_down("rig-1")) == [] + assert m._workers["rig-1"]["state"] == "offline" + + def test_steady_online_emits_nothing(self): + m, clock = _monitor() + m.update(_on("rig-1")) + for _ in range(5): + clock.advance(30) + assert m.update(_on("rig-1")) == [] + + +class TestOfflineDebounce: + def test_not_offline_before_threshold(self): + m, clock = _monitor() + m.update(_on("rig-1")) # baseline online + clock.advance(30) + assert m.update(_down("rig-1")) == [] # DOWN streak starts here (within debounce) + clock.advance(269) + assert m.update(_down("rig-1")) == [] # 269s DOWN β€” still under the 300s threshold + + def test_offline_after_threshold(self): + m, clock = _monitor() + m.update(_on("rig-1")) + m.update(_down("rig-1")) # DOWN streak starts here + clock.advance(300) + assert m.update(_down("rig-1")) == [("rig-1", "offline")] + + def test_offline_emitted_once(self): + m, clock = _monitor() + m.update(_on("rig-1")) + m.update(_down("rig-1")) + clock.advance(300) + assert m.update(_down("rig-1")) == [("rig-1", "offline")] + clock.advance(300) + assert m.update(_down("rig-1")) == [] # already offline β€” no repeat + + def test_brief_down_does_not_trip(self): + m, clock = _monitor() + m.update(_on("rig-1")) + clock.advance(60) + assert m.update(_down("rig-1")) == [] # DOWN 60s + clock.advance(30) + assert m.update(_on("rig-1")) == [] # back well before 300s + + def test_vanishing_from_table_is_left_not_offline(self): + # A rig the proxy stops listing entirely (fell off the worker table) is reported as having + # LEFT β€” never aged to "offline", which is reserved for the DOWN-but-still-listed state. 
+ m, clock = _monitor() + m.update(_on("rig-1")) # prime + baseline + clock.advance(600) + assert m.update([]) == [("rig-1", "left")] + assert "rig-1" not in m._workers + + +class TestRecoveryHysteresis: + def _take_offline(self, m, clock): + m.update(_on("rig-1")) + m.update(_down("rig-1")) + clock.advance(300) + assert m.update(_down("rig-1")) == [("rig-1", "offline")] + + def test_recovered_only_after_stable_window(self): + m, clock = _monitor() + self._take_offline(m, clock) + # Reappears online, but "back online" holds until it's been present for recovery_after. + assert m.update(_on("rig-1")) == [] + clock.advance(119) + assert m.update(_on("rig-1")) == [] + clock.advance(1) + assert m.update(_on("rig-1")) == [("rig-1", "recovered")] + + def test_flap_during_recovery_does_not_emit(self): + # A one-cycle reconnect during an outage must not produce a recoveredβ†’offline spam. + m, clock = _monitor() + self._take_offline(m, clock) + clock.advance(30) + assert m.update(_on("rig-1")) == [] # blink online (still offline) + clock.advance(30) + assert m.update(_down("rig-1")) == [] # blink DOWN β€” no recovered, no re-offline + clock.advance(30) + assert m.update(_down("rig-1")) == [] + + +class TestMultipleWorkers: + def test_independent_per_worker_state(self): + m, clock = _monitor() + m.update(_on("rig-1", "rig-2")) + # rig-2 stays online; rig-1 goes DOWN. + m.update(_on("rig-2") + _down("rig-1")) + clock.advance(300) + assert m.update(_on("rig-2") + _down("rig-1")) == [("rig-1", "offline")] + + +class TestReset: + def test_reset_clears_state_and_rebaselines_silently(self): + m, clock = _monitor() + m.update(_on("rig-1")) + m.update(_down("rig-1")) + clock.advance(300) + assert m.update(_down("rig-1")) == [("rig-1", "offline")] + m.reset() + # After a reset (e.g. proxy intentionally stopped), the worker re-appears as a fresh + # baseline β€” no "recovered" edge. 
+ assert m.update(_on("rig-1")) == [] + + +class TestFalloff: + def test_worker_forgotten_when_it_leaves_the_table(self): + m, clock = _monitor() + m.update(_on("rig-1")) + m.update(_down("rig-1")) + clock.advance(300) + m.update(_down("rig-1")) # offline emitted + # The lifecycle eventually drops the ghost from the worker table (#182) → LEFT edge. + assert m.update([]) == [("rig-1", "left")] + assert "rig-1" not in m._workers + # Returning after that counts as a fresh JOIN. + assert m.update(_on("rig-1")) == [("rig-1", "joined")] + + +class TestJoinLeave: + def test_first_cycle_baselines_silently(self): + # The startup roster is baselined without joined edges — a restart isn't a fleet change. + m, _ = _monitor() + assert m.update(_on("rig-1", "rig-2")) == [] + + def test_new_worker_after_prime_joins(self): + m, _ = _monitor() + m.update(_on("rig-1")) # prime + assert m.update(_on("rig-1", "rig-2")) == [("rig-2", "joined")] + + def test_worker_leaving_emits_left(self): + m, _ = _monitor() + m.update(_on("rig-1", "rig-2")) # prime + assert m.update(_on("rig-1")) == [("rig-2", "left")] + + def test_reset_reprimes_without_joins(self): + m, _ = _monitor() + m.update(_on("rig-1")) # prime + m.reset() # e.g. proxy stopped for a failover — clears the prime flag + # Readmission re-baselines the whole roster silently — no "joined" spam. 
+ assert m.update(_on("rig-1", "rig-2")) == [] diff --git a/build/dashboard/tests/web/test_views.py b/build/dashboard/tests/web/test_views.py index 8d47798..d0e94ea 100644 --- a/build/dashboard/tests/web/test_views.py +++ b/build/dashboard/tests/web/test_views.py @@ -246,7 +246,13 @@ def test_unknown_range_keeps_everything(self): assert len(build_chart(history, [], "bogus")["p2pool"]) == 3 def test_empty_history(self): - assert build_chart([], [], "all") == {"p2pool": [], "xvb": [], "shares": [], "tension": 0.0} + assert build_chart([], [], "all") == { + "p2pool": [], + "xvb": [], + "shares": [], + "events": [], + "tension": 0.0, + } # --- Issue #47: custom zoom window + duration-adaptive resolution/smoothing --------- @@ -622,6 +628,46 @@ def test_no_disk_badge_when_missing(self): out = build_badges({}, _metrics(), "ok") assert not any("Disk" in b["text"] for b in out) + # --- Host-perf badges (#104): AVX2 / HugePages / low RAM, from live metrics ------------- + def test_hugepages_disabled_badge(self): + out = build_badges( + {"system": {"hugepages": ["Disabled", "status-bad", "0/0"]}}, _metrics(), "ok" + ) + assert any(b["variant"] == "warn" and "HugePages off" in b["text"] for b in out) + + def test_no_hugepages_badge_when_reserved(self): + for status in ("Allocated", "Enabled", "Unknown"): # only "Disabled" is a problem + out = build_badges({"system": {"hugepages": [status, "", "1/2"]}}, _metrics(), "ok") + assert not any("HugePages" in b["text"] for b in out), status + + def test_low_ram_badge(self): + out = build_badges({"system": {"memory": {"total_gb": 8}}}, _metrics(), "ok") + assert any(b["variant"] == "warn" and "Low RAM (8 GB)" in b["text"] for b in out) + + def test_no_low_ram_badge_at_or_above_threshold_or_unknown(self): + assert not any( + "Low RAM" in b["text"] + for b in build_badges({"system": {"memory": {"total_gb": 16}}}, _metrics(), "ok") + ) + # total 0 = couldn't read /proc/meminfo (not "0 GB of RAM") β€” no false badge. 
+ assert not any( + "Low RAM" in b["text"] + for b in build_badges({"system": {"memory": {"total_gb": 0}}}, _metrics(), "ok") + ) + + def test_avx2_missing_badge(self): + out = build_badges({"system": {"avx2": False}}, _metrics(), "ok") + assert any(b["variant"] == "warn" and "No AVX2" in b["text"] for b in out) + + def test_no_avx2_badge_when_present_or_unknown(self): + assert not any( + "AVX2" in b["text"] for b in build_badges({"system": {"avx2": True}}, _metrics(), "ok") + ) + # None = couldn't determine (non-Linux / unreadable) β€” stay silent, don't cry wolf. + assert not any( + "AVX2" in b["text"] for b in build_badges({"system": {"avx2": None}}, _metrics(), "ok") + ) + # --- System (presentation thresholds) ------------------------------------------------- @@ -1313,3 +1359,41 @@ def test_na_when_xvb_off(self): "eligible": False, "label": "N/A (XvB off)", } + + +class TestChartEvents: + """Degradation/recovery markers (#99) flow through build_chart's new `events` kwarg: shaped as + xy points on the hidden 0-1 event axis, carrying kind+label, and window-filtered like history.""" + + def _hist(self, now): + return [{"timestamp": now, "v": 800, "v_p2pool": 800, "v_xvb": 0, "t": "a"}] + + def test_absent_events_default_to_empty(self): + now = time.time() + assert build_chart(self._hist(now), [], "all")["events"] == [] + + def test_event_point_shape(self): + now = time.time() + events = [{"ts": now, "type": "loss", "detail": "-62%"}] + pt = build_chart(self._hist(now), [], "all", events=events)["events"] + assert pt == [ + {"x": int(now * 1000), "y": views._EVENT_MARKER_Y, "kind": "loss", "label": "-62%"} + ] + + def test_label_falls_back_to_type(self): + now = time.time() + events = [{"ts": now, "type": "recovered", "detail": ""}] + assert build_chart(self._hist(now), [], "all", events=events)["events"][0]["label"] == ( + "recovered" + ) + + def test_events_filtered_by_range(self): + now = time.time() + events = [ + {"ts": now - 7200, "type": "loss", "detail": 
"old"}, # 2h ago + {"ts": now - 60, "type": "recovered", "detail": "recent"}, + ] + labels = [ + e["label"] for e in build_chart(self._hist(now), [], "1h", events=events)["events"] + ] + assert labels == ["recent"] # the 2h-old marker is outside the 1h window diff --git a/config.reference.json b/config.reference.json index 672e72d..ff49e70 100644 --- a/config.reference.json +++ b/config.reference.json @@ -59,6 +59,8 @@ "data_dir": "auto", "tari_required": true, "check_for_updates": true, + "hashrate_drop_threshold": 50, + "hashrate_drop_minutes": 10, "auth": { "username": "admin", "password": "" @@ -78,5 +80,36 @@ "healthchecks": { "ping_url": "" + }, + + "telegram": { + "enabled": false, + "bot_token": "", + "chat_id": "", + "events": { + "node_down": true, + "node_recovered": true, + "worker_offline": true, + "worker_recovered": true, + "worker_joined": true, + "worker_left": true, + "sync_finished": true, + "disk_space": true, + "db_unhealthy": true, + "xvb_no_share": true, + "clearnet_exposed": true, + "xvb_registration": true, + "new_release": true, + "stack_online": true, + "daily_summary": true, + "hashrate_low": true, + "hashrate_loss": true, + "hugepages": true, + "low_ram": true + }, + "daily_summary_time": "08:00", + "commands": { + "enabled": false + } } } diff --git a/docker-compose.yml b/docker-compose.yml index 84421e7..a39d7a5 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -423,6 +423,39 @@ services: # Healthchecks.io alerts externally β€” the one failure mode an in-stack notifier can't # report. The ping always rides the bridge Tor SOCKS (TOR_SOCKS_PROXY). See docs/monitoring.md. - HEALTHCHECKS_PING_URL=${HEALTHCHECKS_PING_URL:-} + # --- Operator alerts: Telegram (Issues #121, #45) --- + # Notifications-only push alerter plus an optional on-demand command interface; both + # disabled by default. bot_token is a secret sourced from the owner-only .env and is never + # logged by the dashboard. 
On a Tor-only / no-clearnet host the Telegram API is unreachable + # and both sends and command polling fail silently. Per-event toggles default on; the + # command interface is opt-in via TELEGRAM_COMMANDS_ENABLED. See docs/telegram.md. + - TELEGRAM_ENABLED=${TELEGRAM_ENABLED:-false} + - TELEGRAM_BOT_TOKEN=${TELEGRAM_BOT_TOKEN:-} + - TELEGRAM_CHAT_ID=${TELEGRAM_CHAT_ID:-} + - TELEGRAM_EVENT_NODE_DOWN=${TELEGRAM_EVENT_NODE_DOWN:-true} + - TELEGRAM_EVENT_NODE_RECOVERED=${TELEGRAM_EVENT_NODE_RECOVERED:-true} + - TELEGRAM_EVENT_WORKER_OFFLINE=${TELEGRAM_EVENT_WORKER_OFFLINE:-true} + - TELEGRAM_EVENT_WORKER_RECOVERED=${TELEGRAM_EVENT_WORKER_RECOVERED:-true} + - TELEGRAM_EVENT_WORKER_JOINED=${TELEGRAM_EVENT_WORKER_JOINED:-true} + - TELEGRAM_EVENT_WORKER_LEFT=${TELEGRAM_EVENT_WORKER_LEFT:-true} + - TELEGRAM_EVENT_SYNC_FINISHED=${TELEGRAM_EVENT_SYNC_FINISHED:-true} + - TELEGRAM_EVENT_DISK_SPACE=${TELEGRAM_EVENT_DISK_SPACE:-true} + - TELEGRAM_EVENT_DB_UNHEALTHY=${TELEGRAM_EVENT_DB_UNHEALTHY:-true} + - TELEGRAM_EVENT_XVB_NO_SHARE=${TELEGRAM_EVENT_XVB_NO_SHARE:-true} + - TELEGRAM_EVENT_CLEARNET_EXPOSED=${TELEGRAM_EVENT_CLEARNET_EXPOSED:-true} + - TELEGRAM_EVENT_XVB_REGISTRATION=${TELEGRAM_EVENT_XVB_REGISTRATION:-true} + - TELEGRAM_EVENT_NEW_RELEASE=${TELEGRAM_EVENT_NEW_RELEASE:-true} + - TELEGRAM_EVENT_STACK_ONLINE=${TELEGRAM_EVENT_STACK_ONLINE:-true} + - TELEGRAM_EVENT_DAILY_SUMMARY=${TELEGRAM_EVENT_DAILY_SUMMARY:-true} + - TELEGRAM_EVENT_HASHRATE_LOW=${TELEGRAM_EVENT_HASHRATE_LOW:-true} + - TELEGRAM_EVENT_HASHRATE_LOSS=${TELEGRAM_EVENT_HASHRATE_LOSS:-true} + - TELEGRAM_EVENT_HUGEPAGES=${TELEGRAM_EVENT_HUGEPAGES:-true} + - TELEGRAM_EVENT_LOW_RAM=${TELEGRAM_EVENT_LOW_RAM:-true} + - TELEGRAM_DAILY_SUMMARY_TIME=${TELEGRAM_DAILY_SUMMARY_TIME:-08:00} + # Hashrate-degradation detector (#99). 
+ - HASHRATE_DROP_THRESHOLD_PCT=${HASHRATE_DROP_THRESHOLD_PCT:-50} + - HASHRATE_DROP_MINUTES=${HASHRATE_DROP_MINUTES:-10} + - TELEGRAM_COMMANDS_ENABLED=${TELEGRAM_COMMANDS_ENABLED:-false} # --- Docker Socket Proxy (read-only) --- # Read-only window onto the Docker API for the dashboard's container stats/logs. diff --git a/docs/README.md b/docs/README.md index 961f338..9be9d7c 100644 --- a/docs/README.md +++ b/docs/README.md @@ -14,6 +14,7 @@ stack. The other guides cover individual topics once you're running. | [Configuration](configuration.md) | Every `config.json` key and default, applying changes safely, reusing an existing node via data directories, and connecting to a remote Monero node. | | [The Dashboard](dashboard.md) | Sync Mode, the live operational view, and how to read every panel. | | [Monitoring & Alerting](monitoring.md) | Optional Healthchecks.io dead-man's switch β€” get alerted when your host goes down (power loss, crash), even when it can't tell you itself. | +| [Telegram Bot](telegram.md) | Push operator alerts (node down/recovered, worker offline/back, sync finished) to Telegram and query stack status on demand (`/status`, `/hashrate`, `/workers`, `/sync`) β€” creating a bot, finding your chat id, per-event toggles, and the command list. | | [Connecting Miners](workers.md) | Pointing any existing rig at the stack, plus [RigForge](https://github.com/p2pool-starter-stack/rigforge) for setting up new miners. | | [Architecture](architecture.md) | The nine services, how they fit together, the privacy model, and the algorithmic XvB switching engine. | | [Privacy & Network Egress](privacy.md) | Every connection the stack makes off-box: what's Tor-routed, what's clearnet today, and how to harden each path. 
| diff --git a/docs/architecture.md b/docs/architecture.md index 263b53d..37d480b 100644 --- a/docs/architecture.md +++ b/docs/architecture.md @@ -28,9 +28,14 @@ flowchart TB %% ── External actors ── You(["πŸ‘€ You Β· Browser"]) Workers(["⛏️ XMRig Workers"]) - XvB(["🎲 XMRvsBeast Pool"]) Net(["🌐 Tor Network / Internet"]) + %% ── External services the dashboard calls out to (each labeled with its route) ── + Telegram(["✈️ Telegram
alerts + commands"]) + HC(["🩺 Healthchecks.io
dead-man's switch"]) + XvB(["🎲 XMRvsBeast
pool + stats"]) + GitHub(["🐙 GitHub
release check"]) + subgraph stack ["🐳 Pithead"] direction TB @@ -52,27 +57,40 @@ flowchart TB Caddy --> Dashboard Workers ==>|"Stratum 3333"| Proxy + %% Dashboard internal control + monitoring (never leaves the box) Dashboard -.->|controls| Proxy Dashboard -.->|monitors| DockerProxy Dashboard -.->|"reads stats & sync"| core + %% ── Dashboard egress β€” every outbound call is routed through Tor (🟒), so none leak the host IP ── + Dashboard ==>|"🚨 alerts + commands Β· 🟒 Tor"| Tor + Dashboard ==>|"🩺 liveness ping Β· 🟒 Tor"| Tor + Dashboard ==>|"πŸ“ˆ XvB stats Β· 🟒 Tor"| Tor + Dashboard ==>|"πŸ†• update check Β· 🟒 Tor"| Tor + Proxy ==>|hashrate| P2Pool - Proxy ==>|hashrate| XvB + Proxy ==>|"hashrate Β· 🟒 Tor"| Tor P2Pool <-->|"RPC / ZMQ"| Monerod P2Pool -->|merge-mine| Tari - Monerod <-->|tx broadcast| Tor - Tari <-->|P2P| Tor - P2Pool <-->|P2P| Tor + Monerod <-->|"tx + P2P Β· 🟒 Tor"| Tor + Tari <-->|"P2P Β· 🟒 Tor"| Tor + P2Pool <-->|"P2P Β· 🟒 Tor"| Tor Tor <--> Net + %% Tor exit reaches each external service + Net -.-> Telegram + Net -.-> HC + Net -.-> XvB + Net -.-> GitHub + classDef ext fill:#1e293b,stroke:#64748b,color:#e2e8f0; classDef ctrl fill:#1d4ed8,stroke:#93c5fd,color:#eff6ff; classDef priv fill:#6d28d9,stroke:#c4b5fd,color:#f5f3ff; classDef mine fill:#047857,stroke:#6ee7b7,color:#ecfdf5; - class You,Workers,XvB,Net ext; + class You,Workers,Net,Telegram,HC,XvB,GitHub ext; class Caddy,Dashboard ctrl; class Tor,DockerProxy priv; class Proxy,P2Pool,Monerod,Tari mine; @@ -81,11 +99,20 @@ flowchart TB style core stroke:#10b981,stroke-width:1px,stroke-dasharray:5 4; ``` -Reading the diagram: thick arrows carry mining hashrate and inbound connections, dotted arrows are -the dashboard's control and monitoring, and solid arrows are internal service data and anonymized -network traffic. Node colors group services by role: 🟦 control plane (Caddy, Dashboard), πŸŸͺ privacy -and isolation (Tor, Docker socket proxy), and 🟩 the mining core. 
In remote-node mode the bundled 🟠 -Monero node isn't started, and P2Pool talks to your external node instead. +Reading the diagram: thick arrows carry inbound connections and every path that **leaves the box** β€” +each egress edge is tagged with its route, and **🟒 Tor** means it exits through the Tor daemon (a Tor +exit IP, never your host's). Dotted arrows are the dashboard's internal control and monitoring, which +never leave the machine. The dashboard makes four outbound calls β€” the **Telegram** bot (alerts + +commands), the **Healthchecks.io** liveness ping, the **XvB** stats fetch, and the **GitHub** release +check β€” and all four are Tor-routed, so enabling any of them never reveals where your stack runs. Node +colors group services by role: 🟦 control plane (Caddy, Dashboard), πŸŸͺ privacy and isolation (Tor, +Docker socket proxies), and 🟩 the mining core. In remote-node mode the bundled 🟠 Monero node isn't +started, and P2Pool talks to your external node instead. + +> The one exception is **optional clearnet initial sync** (`monero.clearnet_initial_sync` / +> `tari.clearnet_initial_sync`, default **off**): while active, that node's P2P leaves Tor to sync +> faster and its IP is exposed until it finishes, after which it reverts to Tor automatically (#234). +> The Telegram bot alerts you the whole time it's exposed. See [Privacy](privacy.md). ## Privacy by design diff --git a/docs/configuration.md b/docs/configuration.md index 05dec1b..6adb5ab 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -95,6 +95,8 @@ plain HTTP, edit `config.json` and run `./pithead apply`. | `dashboard.timezone` | `auto` | Timezone for the dashboard's timestamps and charts. `auto` = the host machine's timezone (auto-detected, falling back to `Etc/UTC`); set an IANA name (e.g. `America/Chicago`) to override. | | `dashboard.data_dir` | `auto` | Where the dashboard's database lives. `auto` = `./data/dashboard`. 
| | `dashboard.check_for_updates` | `true` _(on)_ | The dashboard periodically asks GitHub whether a newer Pithead release exists and, if so, shows a header badge linking to it (e.g. "New release v1.4.0 available"). Notify-only: it never updates anything; you upgrade with `./pithead upgrade` on your own terms. On by default because the check is routed over Tor (the same bridge SOCKS as the XvB fetch, `socks5h` so the DNS lookup goes through Tor too), so GitHub sees a Tor exit, not your IP. It's cached (hourly) and fails silently offline. Set to `false` to opt out entirely. See [Privacy › Runtime egress](privacy.md#runtime-egress). | +| `dashboard.hashrate_drop_threshold` | `50` | Percent below the recent normal that counts as a hashrate drop for the `hashrate_loss` alert and its chart marker. `50` = fire when total fleet hashrate falls to half its baseline. Lower it to catch smaller dips; raise it to flag only near-total outages. | +| `dashboard.hashrate_drop_minutes` | `10` | How many minutes the hashrate must stay below the threshold before the drop is reported — the debounce that keeps a brief blip from pinging you. | | `dashboard.tari_required` | `true` | How much a Tari problem holds up the rest of the stack. Monero is required to mine, so its behavior isn't configurable: a monerod outage always rejects workers (stops `xmrig-proxy` so miners fail over to their backup pools), and the miner is always held until monerod finishes syncing. Tari is only needed for merge mining, so this one flag decides how much it blocks. `true` (default): a Tari outage also rejects workers, the miner waits for Tari's initial sync too, and a Tari-only (re)sync shows the full-screen Sync view. `false` (non-blocking): keep mining Monero through a Tari outage, start mining as soon as Monero is synced (Tari finishes in the background), and keep the normal dashboard, with a `Tari syncing` indicator, instead of the takeover screen. 
| | `network.subnet` | `172.28.0.0/24` | The private Docker bridge the stack's containers run on. Change it only if install fails with `Pool overlaps with other one on this address space`, i.e. your host already uses `172.28.0.0/24` for another Docker network or interface. Must be a free `X.Y.Z.0/24` block (e.g. `"172.30.0.0/24"`); the services keep their fixed host octets (`.25`–`.31`) within it, so the structured addressing the dashboard and the worker SSRF guard rely on is preserved. | | `network.tor_egress_firewall` | `true` _(on)_ | Privacy-relevant, default on. Enforces "behind Tor" fail-closed: at `up`/`apply`, `pithead` installs host firewall rules (Docker's `DOCKER-USER` chain) that drop any direct clearnet dial from the mining containers (monerod/p2pool/tari/xmrig-proxy). Only the Tor container reaches the internet, so a misconfigured or buggy daemon can't leak your IP. Needs root (like the GRUB/HugePages steps); removed at `down`. Set `false` to skip it and rely on per-app Tor config only (e.g. a host where you manage egress yourself, or where `iptables` isn't available). Full detail: [Privacy β€Ί Enforced fail-closed](privacy.md#enforced-fail-closed-not-just-configured-270). | @@ -102,6 +104,12 @@ plain HTTP, edit `config.json` and run `./pithead apply`. | `workers.api_token` | `""` | The single shared Bearer token used when `workers.api_auth` is `token`. Ignored otherwise. | | `workers.api_port` | `8080` | TCP port the worker xmrig API listens on. Change only if your miners expose the API on a non-standard port. | | `healthchecks.ping_url` | _(blank)_ | The full ping URL from Healthchecks.io (e.g. `https://hc-ping.com/`) β€” the optional [dead-man's switch](monitoring.md) that alerts you when your host stops responding. **Setting it turns the monitor on; blank keeps it off.** Always pinged over Tor (every 60s), so it must be Tor-reachable (see [Monitoring β€Ί Privacy note](monitoring.md#privacy-note)). 
Treated as a secret β€” stored in the owner-only `.env`. | +| `telegram.enabled` | `false` | Push operational alerts (node down/recovered, worker offline/back, sync finished) to Telegram. Off by default. Requires `bot_token` + `chat_id` to actually send. Full walkthrough: [Telegram Bot](telegram.md). | +| `telegram.bot_token` | `""` | Your BotFather bot token. A secret β€” stored owner-only in `.env`, git-ignored, and never logged. Get one from [@BotFather](https://t.me/BotFather). | +| `telegram.chat_id` | `""` | Where alerts are sent and the only chat the command interface answers. A Telegram group id (negative, e.g. `-1001234567890`) or a personal chat id. See [how to find it](telegram.md#3-find-your-chat-id). | +| `telegram.events.*` | all `true` | Per-event toggles: `stack_online`, `node_down`, `node_recovered`, `worker_offline`, `worker_recovered`, `worker_joined`, `worker_left`, `sync_finished`, `disk_space`, `db_unhealthy`, `xvb_no_share`, `xvb_registration`, `clearnet_exposed`, `new_release`, `daily_summary`, `hashrate_low`, `hashrate_loss`, `hugepages`, `low_ram`. Each defaults to on once Telegram is enabled; set one `false` to silence just that alert. Full list: [Telegram Bot](telegram.md#choosing-which-alerts-you-get). | +| `telegram.daily_summary_time` | `08:00` | Local time (24-hour `HH:MM`) to push the once-a-day status digest, when the `daily_summary` event is on. Uses the dashboard's timezone (`dashboard.timezone`). A malformed value disables the digest. | +| `telegram.commands.enabled` | `false` | Turn on the interactive command interface β€” the bot answers `/status`, `/hashrate`, `/workers`, `/sync`, and `/help` from the configured `chat_id` (every other chat is ignored). Off by default; alerts work without it. Long-polls over Tor, so it needs no inbound port. See [Telegram β€Ί Commands](telegram.md#commands). 
| --- diff --git a/docs/dashboard.md b/docs/dashboard.md index a34b7d3..4a8e3f2 100644 --- a/docs/dashboard.md +++ b/docs/dashboard.md @@ -89,6 +89,21 @@ to the version badge, linking to the GitHub release. It never updates anything; your IP. Turn it off with `dashboard.check_for_updates: false` (see [Configuration](configuration.md#configuration-reference)). +### Host & performance warnings + +The top bar also surfaces the persistent host conditions that `setup` warns about, derived from +**live** metrics so they self-correct rather than going stale: + +| Badge | Means | Fix | +|---|---|---| +| `⚠ HugePages off` | HugePages aren't reserved β€” RandomX hashrate is capped. | Run setup's tuning (or edit GRUB) and reboot; the badge clears once they're reserved. | +| `⚠ Low RAM (N GB)` | Under 16 GB of RAM β€” syncing is memory-heavy and Tari can OOM. | Add RAM for a stable node. | +| `⚠ No AVX2` | The CPU lacks AVX2, so RandomX mining is much slower. | A hardware limit; nothing to change at runtime. | + +The first two also push a Telegram alert (`hugepages`, `low_ram`) when first detected, if the bot is +on; AVX2 is badge-only (see [Telegram Bot](telegram.md#choosing-which-alerts-you-get)). All active +warning badges are echoed in the bot's `/status` reply. + ### Hero band A strip of headline KPIs sits below the top bar: @@ -130,6 +145,12 @@ progress until it catches up and merge mining resumes. A time-series chart of hashrate with selectable ranges (1h / 24h / 1w / 1mo) that switch without reloading. Shaded bands show the P2Pool/XvB split over time. +Diamond markers along the top flag **hashrate events** (#99): an amber one where total hashrate +dropped sharply and stayed down (an outage or a rig gone dark), a green one where it recovered. +Hover for the size of the drop. They mark the same transitions as the `hashrate_loss` Telegram +alert and survive a dashboard restart, so a drop that happened overnight is still on the chart in the +morning. 
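The drop-and-recover behaviour behind these markers can be sketched as a small debounced detector. This is a hypothetical illustration, not Pithead's actual implementation: the `DropDetector` class, its `update` method, and the idea of passing in a precomputed `baseline` are assumptions made for the example. Only the two settings it mirrors (`dashboard.hashrate_drop_threshold`, default `50`, and `dashboard.hashrate_drop_minutes`, default `10`) come from the configuration reference.

```python
from typing import Optional


class DropDetector:
    """Hypothetical debounced hashrate-drop detector (illustration only)."""

    def __init__(self, threshold_pct: float = 50.0, hold_minutes: float = 10.0):
        self.threshold_pct = threshold_pct      # percent below baseline that counts as a drop
        self.hold_seconds = hold_minutes * 60   # how long it must stay low before reporting
        self._low_since: Optional[float] = None  # start of the current low streak, if any
        self._in_loss = False                    # True once a "loss" has been reported

    def update(self, now: float, hashrate: float, baseline: float) -> Optional[str]:
        """Return "loss" or "recovered" on a transition, otherwise None."""
        floor = baseline * (1 - self.threshold_pct / 100.0)
        if baseline > 0 and hashrate <= floor:
            if self._low_since is None:
                self._low_since = now  # the low streak starts on this sample
            if not self._in_loss and now - self._low_since >= self.hold_seconds:
                self._in_loss = True
                return "loss"  # reported once per outage
            return None
        # Back above the floor: clear the streak and report recovery if one is due.
        self._low_since = None
        if self._in_loss:
            self._in_loss = False
            return "recovered"
        return None
```

Feeding it one sample per polling cycle yields at most one `"loss"` and one `"recovered"` per outage, which matches the one-marker-per-transition behaviour described above. How the baseline ("recent normal") is computed is deliberately left out, since the source does not specify it.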
+ An **Avg** control picks the hashrate-averaging window the chart plots: `1 Min` / `10 Min` / `1 Hr` / `12 Hr` / `24 Hr` (the native windows xmrig-proxy reports). It is independent of the Range control: the range sets how much *time* the x-axis spans; the averaging window sets how *smooth* each diff --git a/docs/monitoring.md b/docs/monitoring.md index e1533d3..6ac0971 100644 --- a/docs/monitoring.md +++ b/docs/monitoring.md @@ -40,9 +40,10 @@ pings and nothing is logged. ### 2. Choose where alerts go On the check's **Integrations** tab, point it at however you want to be notified — **email**, -**Telegram**, Slack, Discord, a webhook, and more. If you already use Telegram for other -alerts, you can route Healthchecks.io to the **same** Telegram chat, so host-down alerts and -in-stack events land in one place. +**Telegram**, Slack, Discord, a webhook, and more. If you already run the [Telegram +bot](telegram.md), route Healthchecks.io to the **same** Telegram group, so host-down alerts and +in-stack events land in one place — step-by-step in +[Telegram › Adding Healthchecks.io to the same group](telegram.md#adding-healthchecksio-to-the-same-group). ### 3. Paste the ping URL into `config.json` diff --git a/docs/telegram.md b/docs/telegram.md new file mode 100644 index 0000000..1ebc0f6 --- /dev/null +++ b/docs/telegram.md @@ -0,0 +1,321 @@ +# Telegram Bot + +Pithead can push a small set of **operational alerts** to Telegram, so you find out the moment +something needs attention — without sitting on the dashboard. It can also answer a few **read-only +status commands** on demand, so you can check on the stack from your phone. Both are **off by +default**; this guide takes you from nothing to a working alert in about five minutes, then adds +commands if you want them. + +> **What this is — and isn't.** Alerts are a one-way push (a pager).
Commands are **read-only** — +> `/status`, `/hashrate`, and friends report what the dashboard already knows; **nothing controls +> the stack over Telegram** (start/stop/`apply` stay on the CLI). So the worst a leaked chat can do +> is *read* your status, never change anything. + +--- + +## What you'll get + +When enabled, the stack sends a short message on each of these events. Every alert is +**debounced** — a momentary blip won't ping you, and you get **one** message per real +transition, not a stream: + +| Alert | When it fires | +|---|---| +| 🔴 **Node down** | Your Monero (or Tari) node has been unreachable long enough to be considered down — the stack has stopped serving your rigs so they **fail over to their backup pools**. | +| 🟢 **Node recovered** | The node is back and stable; the stack has readmitted your rigs. | +| 🔴 **Worker offline** | A rig stopped hashing and hasn't been seen for a few minutes (a reboot, a dropped connection, a dead miner) — it's showing **DOWN** on the dashboard. | +| 🟢 **Worker back online** | A rig that had gone offline is hashing again. | +| 🟢 **New worker joined** | A rig the stack hasn't seen before connected — a new miner joined the fleet. | +| ⚪ **Worker left** | A rig dropped off the dashboard entirely (removed from the worker list, not just DOWN). | +| ✅ **Sync finished** | The initial blockchain sync completed and mining has started — handy on first run, when the sync can take hours. | +| 🟠 **Disk filling up** | The data disk crossed the warn/critical threshold — a full disk can corrupt the Monero database, so free space before it runs out. | +| 🔴 **DB write failing** | The dashboard can no longer write to its database; hashrate history, shares, and stats will be lost on restart until it's fixed (usually disk space or permissions).
| +| ⚠ **No PPLNS share (XvB)** | You're donating to XvB but hold no share in the PPLNS window, so raffle wins are **skipped** — donations are wasted until you land one. Only fires when XvB is enabled. | +| ⚠ **Clearnet sync active** | A node is doing its initial sync over **clearnet**, so this host's IP is exposed to that chain's P2P network until it finishes (it reverts to Tor automatically). | +| 🎰 **XvB registration** | XvB auto-registration was rejected (bad payout address) or is failing — raffle wins won't count until it recovers. Only fires when XvB is enabled. | +| 📉 **Hashrate low for tier** | You picked a fixed XvB donation tier your hashrate can't sustain — lower the tier or add hashrate. Fires on the transition and clears when it recovers. | +| ⚠️ **Hashrate drop** | Total fleet hashrate fell sharply below its recent normal and **stayed down** — a rig gone dark, a network cut, or a stalled miner. The dashboard also drops a marker on the hashrate chart at the moment it happened. Fires once on the drop and again on recovery. Thresholds are tunable (`dashboard.hashrate_drop_threshold`, `dashboard.hashrate_drop_minutes`). | +| 🧠 **HugePages not reserved** | RandomX runs capped until HugePages are reserved. Fires when the dashboard first sees them missing and again once a reboot applies them — so you know the tuning took. | +| 💾 **Low RAM** | This host has less RAM than the stack wants (syncing is memory-heavy; Tari can OOM). Sent once when first detected — a heads-up that instability may be under-provisioning, not a bug. | +| 🆕 **New release** | A newer Pithead release is available (the same signal as the dashboard header badge). | +| 🚀 **Pithead online** | Sent once when the dashboard starts — a heartbeat that the stack is up (and confirms the bot works after setup).
| +| 📅 **Daily summary** | A once-a-day retrospective of the last 24h across your whole fleet — date/time, an **incident roll-up** (what went wrong during the day, or an all-clear), **24h hashrate** with the **P2Pool / XvB split**, **shares found in the day**, an **estimated daily earnings** figure, and a **per-machine 24h breakdown** — pushed at a set local time (**08:00** by default; `telegram.daily_summary_time`). | + +Every message is prefixed with your dashboard hostname (e.g. `[rig-box.lan]`), so if you point +more than one stack at the same chat you can tell them apart. + +Each of these can be **turned off individually** — see [Choosing which alerts you get](#choosing-which-alerts-you-get). + +--- + +## Setup + +You need two things: a **bot token** (the credential Pithead uses to send) and a **chat id** (where +the messages go). Both come from Telegram, in a few taps. + +### 1. Create a bot and get its token + +1. In Telegram, open a chat with **[@BotFather](https://t.me/BotFather)** (the official bot for + making bots). +2. Send `/newbot` and follow the prompts — pick a name and a username (the username must end in + `bot`, e.g. `my_pithead_bot`). +3. BotFather replies with a **token** that looks like: + + ``` + 123456789:AAExampleExampleExampleExampleExample + ``` + + This is your `bot_token`. **Treat it like a password** — anyone with it can post as your bot. + +### 2. Pick where alerts go (a group is recommended) + +You can send alerts straight to your own Telegram account, but a **dedicated group** is the +cleaner choice: it keeps alerts out of your personal chats, lets you mute them with one tap, and +lets you add other operators (or a second alert source — see +[One chat, two bots](#one-chat-two-bots)). + +1. Create a new Telegram **group** (e.g. "Pithead alerts"). +2. Add your bot to it: open the group → **Add members** → search for your bot's username.
+ +> If you'd rather have alerts come as a normal direct message instead, skip the group and just +> **send your bot a `/start`** message — that's enough for it to be allowed to message you back. + +### 3. Find your chat id + +The easiest way: + +1. Add **[@userinfobot](https://t.me/userinfobot)** to the same group (or message it directly for a + 1-to-1 chat). It immediately replies with the chat's **id**. +2. Note the number. **Group ids are negative** and often long, e.g. `-1001234567890`. A direct + chat id is a positive number, e.g. `987654321`. +3. You can remove `@userinfobot` afterwards. + +> **Manual alternative** (no third-party bot): send any message in the group, then open +> `https://api.telegram.org/bot/getUpdates` in a browser and read the `chat.id` field +> from the JSON. (You may need to send the message *after* adding your bot for it to show up.) + +### 4. Put it in `config.json` + +Add a `telegram` block to your `config.json`. The minimum to switch it on: + +```json +{ + "monero": { "wallet_address": "your_monero_wallet_address" }, + "tari": { "wallet_address": "your_tari_wallet_address" }, + + "telegram": { + "enabled": true, + "bot_token": "123456789:AAExampleExampleExampleExampleExample", + "chat_id": "-1001234567890" + } +} +``` + +`chat_id` can be written as a string (recommended, since group ids are long and negative) or a +number — both work. + +### 5. Apply + +```bash +./pithead apply +``` + +`apply` re-renders the stack and restarts the dashboard with the new settings. On the next health +cycle, alerting is live. To confirm it works end-to-end, you can stop a rig (or briefly stop a +node) and wait for the offline/down alert — remember the debounce means it's a few minutes, not +instant, by design. + +--- + +## Choosing which alerts you get + +Every event is on by default once Telegram is enabled.
To silence one, add it to a `telegram.events` +block and set it to `false` — any event you don't list stays on: + +```json +"telegram": { + "enabled": true, + "bot_token": "…", + "chat_id": "…", + "events": { + "worker_offline": false, + "worker_recovered": false + } +} +``` + +| Event key | Default | Alert | +|---|---|---| +| `node_down` | `true` | Monero/Tari node went down | +| `node_recovered` | `true` | …and came back | +| `worker_offline` | `true` | A rig went DOWN | +| `worker_recovered` | `true` | …and came back | +| `worker_joined` | `true` | A new rig joined the fleet | +| `worker_left` | `true` | A rig dropped off the dashboard entirely | +| `sync_finished` | `true` | Initial sync done, mining started | +| `disk_space` | `true` | Data disk filling up / critical / recovered | +| `db_unhealthy` | `true` | Dashboard database writes failing / recovered | +| `xvb_no_share` | `true` | XvB on but no PPLNS share (wins skipped) / restored | +| `clearnet_exposed` | `true` | A node is syncing over clearnet (IP exposed) / back on Tor | +| `xvb_registration` | `true` | XvB auto-registration rejected / failing / recovered | +| `new_release` | `true` | A newer Pithead release is available | +| `stack_online` | `true` | One-shot "dashboard is up" heartbeat on start | +| `daily_summary` | `true` | Once-a-day status roll-up (time set by `telegram.daily_summary_time`, default `08:00`) | +| `hashrate_low` | `true` | Hashrate can't sustain the chosen XvB tier / recovered | +| `hashrate_loss` | `true` | Total hashrate dropped sharply and stayed down (outage / rig dark) / recovered | +| `hugepages` | `true` | HugePages not reserved (RandomX capped) / reserved after a reboot | +| `low_ram` | `true` | Host has less RAM than the stack wants (one-shot heads-up) | + +Run `./pithead apply` after editing.
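
As an editorial illustration (not part of the committed change), the opt-out rule above can be sketched in Python. This is a minimal sketch, not Pithead's actual code: the helper name `event_enabled` is hypothetical, and the dict is assumed to mirror the `telegram` block of `config.json`.

```python
def event_enabled(config: dict, event: str) -> bool:
    """Hypothetical helper: would this alert be sent for the given config?"""
    telegram = config.get("telegram") or {}
    # Alerting stays off unless `enabled` is true AND both credentials are set.
    if not (telegram.get("enabled") and telegram.get("bot_token") and telegram.get("chat_id")):
        return False
    # An event missing from `events` defaults to on; only an explicit false silences it.
    toggle = (telegram.get("events") or {}).get(event)
    return True if toggle is None else bool(toggle)


config = {
    "telegram": {
        "enabled": True,
        "bot_token": "123456789:AAExample",
        "chat_id": "-1001234567890",
        "events": {"worker_offline": False},
    }
}
print(event_enabled(config, "worker_offline"))  # explicitly silenced
print(event_enabled(config, "node_down"))       # not listed, so it stays on
```

Treating a missing key as "on" is what lets an operator list only the alerts they want to silence.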
+ +> The dashboard also shows an **AVX2-missing** badge when the CPU lacks AVX2, but it has **no +> alert** — it's a fixed hardware fact with nothing to do at runtime, so it stays a badge (and shows +> in `/status`) rather than a push you can't act on. + +> **Tari note.** A node-down/recovered alert fires for **Tari only when Tari is treated as +> required** (`dashboard.tari_required: true`, the default). If you've made Tari non-blocking, a +> Tari outage doesn't stop your Monero mining, so it isn't alerted as a node-down — matching how +> the rest of the stack treats a non-blocking Tari. Monero is always alerted. + +--- + +## Commands + +Beyond alerts, the bot can answer **status queries on demand** — ask it how things are and it +replies with what the dashboard already knows. This is a **separate opt-in** from alerts: turn it on +by adding a `commands` block. + +```json +"telegram": { + "enabled": true, + "bot_token": "…", + "chat_id": "…", + "commands": { "enabled": true } +} +``` + +Run `./pithead apply` after editing. The commands: + +| Command | Reply | +|---|---| +| `/status` | One-glance health: each node up/down/syncing, the **Tari merge-mine link** (gRPC connected — distinct from the node being synced), whether mining is active, workers online, total hashrate, PPLNS shares in window — followed by any active **warning/error badges** (the same ones the dashboard's top bar shows), or an explicit "✅ No warnings." | +| `/info` | About this stack: the running **version** (and whether a newer release is available), the Monero **DB mode** (pruned / full), the P2Pool **sidechain** (Mini / Main), and the **privacy posture** (Tor-only, or how many clearnet paths are exposed). | +| `/hashrate` | Total hashrate plus a per-rig breakdown of everything currently online. | +| `/workers` | Every rig's online/offline state, with uptime for the ones that are up. | +| `/sync` | Monero and Tari sync progress (percent and block height).
| +| `/system` | Host resources: disk, RAM, CPU + load, and HugePages. | +| `/pool` | P2Pool sidechain type, pool hashrate, Monero network height + difficulty, PPLNS shares in window, current **effort** (luck indicator), **sidechain blocks found**, **share acceptance** (accepted/rejected + reject %), and the **best share** difficulty found. | +| `/xvb` | XvB mode, current and target tier, hashrate **routed** to XvB, the **credited** 1h/24h averages XvB measures (what sets your tier), raffle eligibility (PPLNS share), and a stale-data warning if the XvB feed is behind. | +| `/earnings` | Estimated P2Pool XMR per day/month, from both your **1h** and (once available) steadier **24h** average hashrate (P2Pool only — excludes XvB-donated hashrate and Tari). | +| `/help` | The command list. | + +The numbers come from the **same source as the dashboard**, so a reply and the web view always +agree. In a group, address the bot directly if you like — `/status@your_pithead_bot` works too. + +**Only the configured `chat_id` is answered.** A message from any other chat is ignored with no +reply, so the bot can't be used by anyone you haven't put in that chat. The bot **long-polls** +Telegram (`getUpdates`) rather than exposing a webhook, so it needs **no inbound port** and rides +the same outbound path as the alerts — nothing about your host is exposed to receive commands. + +> **Tor-only host.** Like alerts, commands reach `api.telegram.org` over Tor. If Tor is down +> (or Telegram is blocking the exit) the poll just fails silently and the bot answers nothing — see +> [Privacy and secrets](#privacy-and-secrets). + +--- + +## One chat, two bots + +Pithead's companion **Healthchecks.io** monitor (a "dead-man's switch" that detects the whole host +going dark from *outside* the stack) can deliver its alerts to Telegram too. The two are +complementary: + +- **This (in-stack) alerter** reports everything the host can tell you **while it's alive** — a + node down, a rig offline, sync finished.
+- **Healthchecks.io** reports the case this one can't: the **whole host is dead** (power cut, + kernel panic, network gone) and therefore can't send anything itself. + +The clean setup is to point **both at the same Telegram group** — one place for every alert. They +necessarily use **two different bots**: + +- This alerter uses **your own BotFather bot** (`bot_token` above) posting to your `chat_id`. +- Healthchecks.io has **its own** Telegram integration bot that you authorize into the chat on the + Healthchecks.io side — you never paste a token into Healthchecks.io. + +So the thing you share is the **chat**, not the token: create the group, add **both** bots to it, +and use that group's id here. Each source labels its own messages, so you can always tell which is +which. (Keeping them in separate chats is fine too — only useful if you want to mute or route them +differently.) + +### Adding Healthchecks.io to the same group + +Once your Pithead bot is posting to the group, add Healthchecks.io's bot to it as well. You do +**not** paste any token into Healthchecks.io — you authorize its bot from their side: + +1. In Telegram, **add [@HealthchecksBot](https://t.me/HealthchecksBot) to your alerts group** (the + same group the Pithead bot posts to). It joins as a member with no access to group messages. +2. In the group, send **`/start@HealthchecksBot`**. Use the **`@HealthchecksBot`** suffix, not a + bare `/start` — your Pithead bot is already in the group, so a plain `/start` is ambiguous and + won't reach the right bot. +3. HealthchecksBot replies with a **confirmation link**. Tap it — Healthchecks.io opens in your + browser. +4. **Select the project** your ping URL belongs to and click **"Connect Telegram"**. Done — host-down + alerts now land in the same group as your Pithead bot's alerts. + +Full walkthrough on their site: ****.
For the +rest of the Healthchecks.io setup (creating the check, the ping URL, `config.json`), see +[Monitoring & Alerting](monitoring.md). + +--- + +## Privacy and secrets + +- **The bot token is a secret.** Pithead stores it in `.env`, which is created **owner-only** + (`chmod 600`) and is **git-ignored**, exactly like the Monero node RPC password. The dashboard + **never writes the token to a log line** — not even inside an error message. +- **Always over Tor.** Both the alert sends and the command long-poll reach `api.telegram.org` + **through the bundled Tor SOCKS proxy** (`socks5h`, so the DNS lookup goes through Tor too) — the + same routing as the Healthchecks.io pinger and the XvB fetch. Telegram sees a **Tor exit, not your + host IP**, so enabling the bot doesn't expose where your stack runs. If Tor is momentarily down (or + Telegram is blocking that exit), sends and polls **fail silently** — no errors, no log spam, the + rest of the stack is unaffected — and resume on their own. + +--- + +## Tuning the debounce (advanced) + +The defaults err on the side of **not** crying wolf. If you want faster (or quieter) worker alerts, +override these environment variables for the dashboard container — both are in seconds: + +| Variable | Default | Meaning | +|---|---|---| +| `WORKER_OFFLINE_AFTER_SEC` | `300` | A rig must be unseen this long before "offline" fires. | +| `WORKER_RECOVERY_AFTER_SEC` | `120` | A rig must be back this long before "back online" fires. | + +Node-down timing is shared with the existing failover logic (`NODE_DOWN_AFTER_SEC` / +`NODE_RECOVERY_AFTER_SEC`). These are advanced knobs; most operators never touch them. + +The **hashrate-drop** alert has its own two `config.json` knobs (not env vars): +`dashboard.hashrate_drop_threshold` (percent below the recent normal that counts as a drop, default +`50`) and `dashboard.hashrate_drop_minutes` (how long it must stay down before firing, default `10`).
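
As an editorial illustration (not part of the committed change), the way the two hashrate-drop knobs interact can be sketched as a small debounced detector. This is a hedged sketch under stated assumptions, not the dashboard's implementation: the class `DropDetector` and its method are hypothetical, the caller is assumed to supply the "recent normal" baseline, and reusing the same hold window for recovery is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DropDetector:
    """Hypothetical debounced hashrate-drop detector (illustration only)."""
    threshold_pct: float = 50.0  # mirrors dashboard.hashrate_drop_threshold
    hold_minutes: float = 10.0   # mirrors dashboard.hashrate_drop_minutes
    degraded: bool = False
    _since: Optional[float] = None  # when the pending transition started

    def update(self, now_min: float, hashrate: float, baseline: float) -> Optional[str]:
        """Feed one sample; return "loss" or "recovered" once per real transition."""
        if baseline <= 0:
            return None  # no usable baseline yet, so stay silent
        cutoff = baseline * (1 - self.threshold_pct / 100)
        # A sample "disagrees" when it points at the opposite of the current state.
        disagrees = (hashrate < cutoff) != self.degraded
        if not disagrees:
            self._since = None  # a blip ended, cancel the pending transition
            return None
        if self._since is None:
            self._since = now_min  # start timing the candidate transition
        if now_min - self._since < self.hold_minutes:
            return None  # not sustained long enough yet
        self.degraded = not self.degraded
        self._since = None
        return "loss" if self.degraded else "recovered"


detector = DropDetector()
for minute, rate in [(0, 1000), (1, 400), (5, 400), (11, 400), (12, 400)]:
    event = detector.update(minute, rate, baseline=1000)
    if event:
        print(minute, event)
```

With the defaults, a fleet that normally runs at 1000 H/s has to stay under 500 H/s for ten minutes before the single "loss" event is emitted; a shorter dip resets the timer and sends nothing.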
+ +--- + +## Troubleshooting + +| Symptom | Likely cause / fix | +|---|---| +| No messages at all | Confirm `telegram.enabled` is `true` **and** both `bot_token` and `chat_id` are set — a missing one keeps alerting **off**. Did you run `./pithead apply`? | +| Still nothing | Make sure the bot has been **added to the group** (or that you sent it `/start` for a direct chat). A bot can't message a chat it isn't in. | +| `chat_id` looks wrong | Group ids are **negative** and long (`-100…`). Re-check with `@userinfobot`. | +| Works for "down" but not a specific alert | Check `telegram.events` — that event may be toggled `false`. | +| Alerts work but commands don't | Commands are a **separate** switch: set `telegram.commands.enabled` to `true` and `./pithead apply`. | +| Bot ignores my commands | It only answers the configured `chat_id`. Send from that exact chat, and check the id with `@userinfobot`. | +| No messages, Tor issues | Telegram is reached **over Tor**; if Tor is down or Telegram is blocking the exit, sends/polls fail silently and resume on their own. See [Privacy and secrets](#privacy-and-secrets). | + +--- + +## See also + +- [Configuration](configuration.md) — every `config.json` key, including the `telegram.*` block. +- [The Dashboard](dashboard.md) — the live view these alerts complement. +- [Operations & Maintenance](operations.md) — `apply`, upgrades, and troubleshooting. diff --git a/docs/test-inventory.md b/docs/test-inventory.md index 727faa5..2392304 100644 --- a/docs/test-inventory.md +++ b/docs/test-inventory.md @@ -4,7 +4,7 @@ _Generated by `make test-inventory` ([`tests/inventory.sh`](../tests/inventory.s edit by hand** — re-run the target to refresh.
See [Testing Strategy](testing-strategy.md) for how the tiers fit together._ -**Totals:** 630 dashboard unit tests · 12 contract tests · 64 frontend +**Totals:** 795 dashboard unit tests · 12 contract tests · 66 frontend tests · 52 `pithead` shell sections · 18 harness self-test sections · 9 live config scenarios (17 axis values) · 8 mini-stack scenarios. @@ -14,8 +14,8 @@ tests · 52 `pithead` shell sections · 18 harness self-test sections · | Tier | Suite | Cases | |---|---|---| -| 1 — Unit | dashboard pytest | 630 | -| 1 — Unit | frontend (node --test) | 64 | +| 1 — Unit | dashboard pytest | 795 | +| 1 — Unit | frontend (node --test) | 66 | | 1 — Unit | `pithead` shell suite | 52 sections | | 1 — Unit | compose interpolation + hardening (#90) | 1 | | 2 — Contract | fake-daemon clients | 12 | @@ -27,7 +27,7 @@ tests · 52 `pithead` shell sections · 18 harness self-test sections · ## Tier 1 — Unit & component -### Dashboard (pytest) — 630 tests +### Dashboard (pytest) — 795 tests #### tests/client/test_docker_control.py — 6 - test_tcp_scheme_rewritten_to_http @@ -146,7 +146,7 @@ tests · 52 `pithead` shell sections · 18 harness self-test sections · - test_malformed_json_returns_empty - test_valid_json -#### tests/collector/test_system.py — 11 +#### tests/collector/test_system.py — 15 - test_normal - test_error_returns_zeros - test_parses_meminfo @@ -158,6 +158,10 @@ tests · 52 `pithead` shell sections · 18 harness self-test sections · - test_enabled_when_used - test_allocated_when_unused - test_unknown_when_missing +- test_flag_present +- test_flag_absent +- test_unreadable_is_unknown +- test_result_is_cached #### tests/config/test_config.py — 11 - test_defaults_load @@ -210,6 +214,57 @@ tests · 52 `pithead` shell sections · 18 harness self-test sections · - test_none_when_no_route - test_socket_is_closed_even_on_error +#### tests/service/test_alert_service.py — 49 +- test_every_alert_event_has_a_config_toggle +-
test_first_cycle_seeds_baseline_silently +- test_down_then_recovered +- test_node_text_names_the_chain +- test_non_blocking_tari_does_not_alert +- test_no_stale_edge_when_tari_becomes_required +- test_required_tari_alerts +- test_fires_once_when_gate_opens +- test_no_alert_on_restart_after_sync +- test_offline_then_recovered +- test_not_expected_resets_and_silences +- test_joined_after_baseline +- test_left_when_rig_drops_off_the_table +- test_warn_then_critical_then_recover +- test_seed_high_does_not_replay +- test_unhealthy_then_recovered +- test_no_share_then_restored +- test_silent_while_xvb_disabled +- test_exposed_then_reverted +- test_online_fires_once_on_first_cycle +- test_online_text_is_friendly +- test_invalid_then_recovered +- test_failing_alerts +- test_silent_while_disabled +- test_benign_transition_is_silent +- test_fires_once_on_rising_edge +- test_warns_then_recovers +- test_tallies_problems_and_drains +- test_recoveries_are_not_incidents +- test_worker_offline_counts_once +- test_disabled_events_are_dropped +- test_prefixes_when_set +- test_placeholder_host_is_not_prefixed +- test_disabled_notifier_is_noop +- test_enabled_notifier_dispatches +- test_process_swallows_evaluate_error +- test_fires_at_target_once_per_day +- test_late_start_waits_for_next_day +- test_malformed_time_disables +- test_gated_off_by_event_toggle +- test_provider_error_is_swallowed_and_marks_day_done +- test_hugepages_not_reserved_fires_once_then_recovers +- test_healthy_hugepages_never_fires +- test_low_ram_fires_once_no_recovery +- test_advisories_not_counted_as_incidents +- test_gated_off_by_toggle +- test_loss_sends_and_records_incident +- test_recovery_sends_no_incident +- test_gated_off_still_records_loss + #### tests/service/test_algo_service.py — 38 - test_xvb_disabled_forces_p2pool - test_zero_shares_forces_p2pool @@ -263,7 +318,7 @@ tests · 52 `pithead` shell sections · 18 harness self-test sections · - test_per_chain_independent -
test_marker_write_failure_does_not_restart -#### tests/service/test_data_service.py — 88 +#### tests/service/test_data_service.py — 90 - test_first_poll_baselines_without_backfill - test_delta_records_the_difference - test_no_change_records_nothing @@ -326,7 +381,9 @@ tests · 52 `pithead` shell sections · 18 harness self-test sections · - test_partial_start_failure_keeps_latch_closed - test_rehold_stops_quietly_after_first_cycle - test_single_iteration_aggregates +- test_degradation_edge_records_event_and_alerts - test_run_holds_miner_while_syncing +- test_run_wires_computed_signals_into_the_alerter - test_run_releases_despite_height_override - test_run_nonblocking_tari_releases_and_stays_operational - test_healthchecks_pinged_when_healthy @@ -353,19 +410,28 @@ tests · 52 `pithead` shell sections · 18 harness self-test sections · - test_real_worker_is_probed_internal_neighbour_is_not - test_malicious_name_is_never_used_as_a_host +#### tests/service/test_degradation.py — 6 +- test_steady_state_never_fires +- test_cold_start_below_min_baseline_is_silent +- test_sustained_drop_fires_loss_once +- test_brief_blip_does_not_fire +- test_recovery_fires_after_hold +- test_baseline_frozen_while_degraded + #### tests/service/test_earnings.py — 4 - test_matches_closed_form - test_worked_field_example - test_linear_in_inputs - test_missing_or_bad_inputs_are_zero -#### tests/service/test_egress.py — 26 +#### tests/service/test_egress.py — 27 - test_safe_config_is_all_tor - test_p2pool_clearnet_blocked_by_firewall_is_not_a_leak - test_p2pool_clearnet_without_firewall_is_a_leak - test_host_networked_dashboard_leaks_despite_firewall - test_xvb_disabled_routes_are_inactive - test_healthchecks_ping_is_tor_when_configured_inactive_otherwise +- test_telegram_bot_is_tor_when_enabled_inactive_otherwise - test_remote_monerod_rpc_is_clearnet - test_clearnet_initial_sync_surfaces_only_when_enabled - test_monerod_p2p_always_tor @@ -473,7 +539,7 @@ tests · 52 `pithead`
shell sections · 18 harness self-test sections · - test_routed_fraction_in_unit_interval - test_max_donation_fraction_within_reserve_bounds -#### tests/service/test_storage_service.py — 31 +#### tests/service/test_storage_service.py — 35 - test_get_tiers - test_default_xvb_stats - test_partial_updates @@ -505,6 +571,86 @@ tests · 52 `pithead` shell sections · 18 harness self-test sections · - test_orphaned_workers_table_dropped_on_upgrade - test_history_older_than_retention_pruned_from_memory - test_old_history_pruned_from_db_when_cleanup_fires +- test_add_and_get_roundtrip +- test_old_events_pruned_from_memory +- test_events_survive_reload +- test_load_tolerates_missing_events_table + +#### tests/service/test_telegram_commands.py — 62 +- test_parse_command +- test_status_active +- test_status_syncing_beats_mining_flag +- test_status_node_down_and_not_mining +- test_hashrate_lists_online_workers_desc +- test_hashrate_no_online_workers +- test_hashrate_uses_effective_rate_for_fresh_worker +- test_workers_hashrate_uses_effective_rate +- test_workers_online_first_with_offline_flagged +- test_workers_empty +- test_status_node_syncing_percent +- test_sync_line_variants +- test_sync_line_no_target +- test_system_reads_snapshot +- test_human_count +- test_pool_reads_metrics +- test_pool_share_health_and_best_when_present +- test_pool_omits_share_lines_before_first_poll +- test_pool_effort_when_stratum_present +- test_xvb_enabled_with_share +- test_xvb_stale_warns +- test_xvb_no_share_warns +- test_xvb_disabled +- test_status_merge_mining_line +- test_earnings_estimate +- test_earnings_falls_back_to_1h_30d_without_24h_history +- test_earnings_unavailable_without_network_data +- test_daily_summary_is_a_24h_retrospective +- test_daily_summary_without_xvb_omits_split +- test_daily_summary_incident_log +- test_host_label_prefix +- test_reply_for_help_and_unknown_need_no_metrics +- test_reply_for_status_uses_mining_flag +-
test_reply_for_status_merge_mining_from_tari_snapshot +- test_reply_for_pool_reads_share_snapshot +- test_reply_for_workers_reads_snapshot +- test_reply_for_system_reads_snapshot_without_metrics +- test_reply_for_pool_and_xvb +- test_reply_for_earnings +- test_reply_for_hashrate_and_sync +- test_safe_reply_for_swallows_errors +- test_disabled_without_token_or_chat +- test_run_is_noop_when_disabled +- test_handle_update_ignores_foreign_chat +- test_handle_update_replies_to_configured_chat +- test_get_updates_parses_results_over_tor +- test_get_updates_not_ok_returns_empty +- test_prime_offset_skips_backlog +- test_prime_offset_swallows_error +- test_send_posts_over_tor +- test_send_swallows_network_error +- test_run_processes_update_then_honours_cancel +- test_run_backs_off_on_poll_error +- test_bad_and_flagged_warn_badges_included_stripped +- test_informational_states_excluded +- test_healthy_is_empty +- test_format_status_lists_warnings +- test_format_status_all_clear +- test_release_up_to_date_pruned_tor +- test_dev_update_available_full_clearnet +- test_unknown_db_mode_and_missing_update +- test_reply_for_info_routes + +#### tests/service/test_telegram_notifier.py — 10 +- test_disabled_by_default +- test_enabled_requires_token_and_chat +- test_enabled_flag_off_disables_even_with_creds +- test_event_enabled_respects_toggle_and_enabled +- test_send_noop_when_disabled +- test_send_posts_to_bot_api +- test_send_routes_over_tor +- test_send_swallows_network_error +- test_send_swallows_http_error +- test_token_never_logged_on_failure #### tests/service/test_update_checker.py — 16 - test_accepts_plain_and_v_prefixed @@ -524,6 +670,25 @@ tests · 52 `pithead` shell sections · 18 harness self-test sections · - test_failed_fetch_keeps_previous_result - test_up_to_date_yields_none +#### tests/service/test_worker_presence.py — 17 +- test_first_sighting_online_is_silent +- test_first_sighting_down_is_silent +- test_steady_online_emits_nothing +-
test_not_offline_before_threshold +- test_offline_after_threshold +- test_offline_emitted_once +- test_brief_down_does_not_trip +- test_vanishing_from_table_is_left_not_offline +- test_recovered_only_after_stable_window +- test_flap_during_recovery_does_not_emit +- test_independent_per_worker_state +- test_reset_clears_state_and_rebaselines_silently +- test_worker_forgotten_when_it_leaves_the_table +- test_first_cycle_baselines_silently +- test_new_worker_after_prime_joins +- test_worker_leaving_emits_left +- test_reset_reprimes_without_joins + #### tests/sim/test_donation_model.py — 10 - test_holds_tier_without_overshoot - test_no_windup_from_cold_start @@ -580,7 +745,7 @@ tests · 52 `pithead` shell sections · 18 harness self-test sections · - test_css_lets_hostname_wrap - test_host_at_separator_styled_and_rendered -#### tests/web/test_views.py — 133 +#### tests/web/test_views.py — 143 - test_point_shape_is_xy_with_epoch_ms - test_legacy_rows_attributed_to_p2pool - test_range_filtering @@ -645,6 +810,12 @@ tests · 52 `pithead` shell sections · 18 harness self-test sections · - test_disk_badge_warn - test_no_disk_badge_when_ample - test_no_disk_badge_when_missing +- test_hugepages_disabled_badge +- test_no_hugepages_badge_when_reserved +- test_low_ram_badge +- test_no_low_ram_badge_at_or_above_threshold_or_unknown +- test_avx2_missing_badge +- test_no_avx2_badge_when_present_or_unknown - test_high_usage_levels_and_fill - test_warning_fill_between_70_and_90 - test_unparseable_cpu_is_ok @@ -714,14 +885,20 @@ tests · 52 `pithead` shell sections · 18 harness self-test sections · - test_no_when_below_tier_even_with_a_share - test_no_when_in_tier_but_no_share - test_na_when_xvb_off +- test_absent_events_default_to_empty +- test_event_point_shape +- test_label_falls_back_to_type +- test_events_filtered_by_range -### Frontend logic (node --test) — 64 tests +### Frontend logic (node --test) — 66 tests - withAlpha: appends an 8-bit alpha to a #rrggbb hex -
withAlpha: non-#rrggbb values pass through opaque (a palette change cannot break fills) - padYAxis: pads the range and clamps the floor to zero - padYAxis: the floor is clamped to zero, never negative - padYAxis: the magnitude floor applies when the span is flat - padYAxis: no-op when the range is non-finite (all series hidden / no data) +- eventColors: maps recovery to ok, everything else to loss (#99) +- eventColors: tolerates a missing events list - App without state shows the right connection message - App always renders the theme switcher, even before the first load - operational App shows a disconnected banner when not connected @@ -995,5 +1172,5 @@ tests Β· 52 `pithead` shell sections Β· 18 harness self-test sections Β· --- -_Grand total: **793** enumerated cases/sections across the four tiers (plus the live +_Grand total: **960** enumerated cases/sections across the four tiers (plus the live lifecycle and fault-injection phases, which are exercised on a real server)._ diff --git a/pithead b/pithead index d323ada..7092c1b 100755 --- a/pithead +++ b/pithead @@ -2080,6 +2080,50 @@ render_env() { worker_api_auth=$(jq -r '.workers.api_auth // "none"' "$CONFIG_FILE") worker_api_token=$(jq -r '.workers.api_token // ""' "$CONFIG_FILE") + # Telegram operator bot (#121 alerts, #45 commands). Disabled by default. bot_token is a + # secret: it lives only in this owner-only .env (chmod 600 below) and the dashboard never logs + # it. Per-event toggles default to on, so enabling Telegram turns on the full set and an + # operator only opts *out* of the noisy ones. The interactive command interface is a separate + # opt-in (telegram.commands.enabled, default false). A blank chat_id/bot_token keeps everything + # off even if enabled=true (the dashboard guards that too). See docs/telegram.md. 
+ local tg_enabled tg_token tg_chat tg_commands
+ tg_enabled=$(jq -r 'if .telegram.enabled != null then .telegram.enabled | tostring else "false" end' "$CONFIG_FILE")
+ tg_token=$(jq -r '.telegram.bot_token // empty' "$CONFIG_FILE")
+ tg_chat=$(jq -r '.telegram.chat_id // empty' "$CONFIG_FILE")
+ tg_commands=$(jq -r 'if .telegram.commands.enabled != null then .telegram.commands.enabled | tostring else "false" end' "$CONFIG_FILE")
+ # One toggle per event, defaulting to true when the key is absent.
+ tg_event() { jq -r --arg k "$1" 'if .telegram.events[$k] != null then .telegram.events[$k] | tostring else "true" end' "$CONFIG_FILE"; }
+ local tg_ev_node_down tg_ev_node_recovered tg_ev_worker_offline tg_ev_worker_recovered
+ local tg_ev_worker_joined tg_ev_worker_left tg_ev_sync_finished tg_ev_disk_space tg_ev_db_unhealthy
+ local tg_ev_xvb_no_share tg_ev_clearnet_exposed tg_ev_xvb_registration tg_ev_new_release tg_ev_stack_online
+ local tg_ev_daily_summary tg_summary_time tg_ev_hashrate_low tg_ev_hashrate_loss
+ local tg_ev_hugepages tg_ev_low_ram
+ local hr_drop_threshold hr_drop_minutes
+ tg_ev_node_down=$(tg_event node_down)
+ tg_ev_node_recovered=$(tg_event node_recovered)
+ tg_ev_worker_offline=$(tg_event worker_offline)
+ tg_ev_worker_recovered=$(tg_event worker_recovered)
+ tg_ev_worker_joined=$(tg_event worker_joined)
+ tg_ev_worker_left=$(tg_event worker_left)
+ tg_ev_sync_finished=$(tg_event sync_finished)
+ tg_ev_disk_space=$(tg_event disk_space)
+ tg_ev_db_unhealthy=$(tg_event db_unhealthy)
+ tg_ev_xvb_no_share=$(tg_event xvb_no_share)
+ tg_ev_clearnet_exposed=$(tg_event clearnet_exposed)
+ tg_ev_xvb_registration=$(tg_event xvb_registration)
+ tg_ev_new_release=$(tg_event new_release)
+ tg_ev_stack_online=$(tg_event stack_online)
+ tg_ev_daily_summary=$(tg_event daily_summary)
+ tg_ev_hashrate_low=$(tg_event hashrate_low)
+ tg_ev_hashrate_loss=$(tg_event hashrate_loss)
+ tg_ev_hugepages=$(tg_event hugepages)
+ tg_ev_low_ram=$(tg_event low_ram)
+ # Degradation detector (#99): drop-below-% and sustained-minutes; defaults 50 / 10.
+ hr_drop_threshold=$(jq -r '.dashboard.hashrate_drop_threshold // 50' "$CONFIG_FILE")
+ hr_drop_minutes=$(jq -r '.dashboard.hashrate_drop_minutes // 10' "$CONFIG_FILE")
+ # Local time (HH:MM) for the daily digest; default 08:00.
+ tg_summary_time=$(jq -r '.telegram.daily_summary_time // "08:00"' "$CONFIG_FILE")
+
 # Tari memory cap (#55). Tari officially needs only a few GB (min 4 GB host, 8 GB+ recommended),
 # but its memory grows unbounded over time — one 32 GB host was seen at ~11 GB while staying
 # healthy. Uncapped, that growth can OOM the whole host on small machines. So the cap is a SAFETY
@@ -2157,6 +2201,32 @@ TARI_REQUIRED=$tari_required
 DASHBOARD_CHECK_UPDATES=$check_for_updates
 TARI_MEM_LIMIT=$tari_mem_limit
 HEALTHCHECKS_PING_URL=$hc_ping_url
+TELEGRAM_ENABLED=$tg_enabled
+TELEGRAM_BOT_TOKEN=$tg_token
+TELEGRAM_CHAT_ID=$tg_chat
+TELEGRAM_COMMANDS_ENABLED=$tg_commands
+TELEGRAM_EVENT_NODE_DOWN=$tg_ev_node_down
+TELEGRAM_EVENT_NODE_RECOVERED=$tg_ev_node_recovered
+TELEGRAM_EVENT_WORKER_OFFLINE=$tg_ev_worker_offline
+TELEGRAM_EVENT_WORKER_RECOVERED=$tg_ev_worker_recovered
+TELEGRAM_EVENT_WORKER_JOINED=$tg_ev_worker_joined
+TELEGRAM_EVENT_WORKER_LEFT=$tg_ev_worker_left
+TELEGRAM_EVENT_SYNC_FINISHED=$tg_ev_sync_finished
+TELEGRAM_EVENT_DISK_SPACE=$tg_ev_disk_space
+TELEGRAM_EVENT_DB_UNHEALTHY=$tg_ev_db_unhealthy
+TELEGRAM_EVENT_XVB_NO_SHARE=$tg_ev_xvb_no_share
+TELEGRAM_EVENT_CLEARNET_EXPOSED=$tg_ev_clearnet_exposed
+TELEGRAM_EVENT_XVB_REGISTRATION=$tg_ev_xvb_registration
+TELEGRAM_EVENT_NEW_RELEASE=$tg_ev_new_release
+TELEGRAM_EVENT_STACK_ONLINE=$tg_ev_stack_online
+TELEGRAM_EVENT_DAILY_SUMMARY=$tg_ev_daily_summary
+TELEGRAM_EVENT_HASHRATE_LOW=$tg_ev_hashrate_low
+TELEGRAM_EVENT_HASHRATE_LOSS=$tg_ev_hashrate_loss
+TELEGRAM_EVENT_HUGEPAGES=$tg_ev_hugepages
+TELEGRAM_EVENT_LOW_RAM=$tg_ev_low_ram
+HASHRATE_DROP_THRESHOLD_PCT=$hr_drop_threshold
+HASHRATE_DROP_MINUTES=$hr_drop_minutes
+TELEGRAM_DAILY_SUMMARY_TIME=$tg_summary_time
 MONERO_MEM_LIMIT=$monero_mem_limit
 P2POOL_URL=${NETWORK_PREFIX}.28:3333
 NETWORK_SUBNET=$NETWORK_SUBNET
@@ -2670,6 +2740,25 @@ describe_change() {
 msg="Healthchecks.io ping URL updated (#79) — the dashboard container is recreated."
 fi
 ;;
+ TELEGRAM_ENABLED)
+ msg="Telegram operator bot → $([ "$new" == "true" ] && echo on || echo off) (#121) — the dashboard container is recreated."
+ ;;
+ TELEGRAM_BOT_TOKEN)
+ # Secret — never echo the token value into the change preview / logs.
+ msg="Telegram bot token updated (#121) — the dashboard container is recreated."
+ ;;
+ TELEGRAM_CHAT_ID)
+ msg="Telegram chat id: $old → $new (#121)."
+ ;;
+ TELEGRAM_COMMANDS_ENABLED)
+ msg="Telegram command interface → $([ "$new" == "true" ] && echo on || echo off) (#45) — the bot $([ "$new" == "true" ] && echo "now answers" || echo "no longer answers") /status, /hashrate, /workers, /sync from the configured chat; the dashboard container is recreated."
+ ;;
+ TELEGRAM_EVENT_*)
+ msg="Telegram alert toggle ($key): $old → $new (#121)."
+ ;;
+ TELEGRAM_DAILY_SUMMARY_TIME)
+ msg="Telegram daily summary time: $old → $new (local time; #121)."
+ ;;
 MONERO_CLEARNET_SYNC)
 flag=DEST
 if [ "$new" == "true" ]; then
diff --git a/tests/stack/run.sh b/tests/stack/run.sh
index e738a88..ae29b7b 100755
--- a/tests/stack/run.sh
+++ b/tests/stack/run.sh
@@ -259,6 +259,16 @@ case "$(run_sourced "$SANDBOX" describe_change HEALTHCHECKS_PING_URL "" https://
 *SECRET* | *OLD* | *NEW*) bad "hc ping_url not printed" "leaked the ping URL into the preview" ;;
 *) ok "hc ping_url not printed" ;;
 esac
+# Telegram (#121): toggles/events are a brief dashboard restart (INFO); the bot token is a secret,
+# so its change line must NOT echo the old/new value.
+assert_contains "telegram enable is INFO" "$(run_sourced "$SANDBOX" describe_change TELEGRAM_ENABLED false true)" "INFO"
+assert_contains "telegram event is INFO" "$(run_sourced "$SANDBOX" describe_change TELEGRAM_EVENT_NODE_DOWN true false)" "INFO"
+tg_tok_msg="$(run_sourced "$SANDBOX" describe_change TELEGRAM_BOT_TOKEN oldsecret newsecret)"
+assert_contains "telegram token change noted" "$tg_tok_msg" "Telegram bot token updated"
+case "$tg_tok_msg" in
+*oldsecret* | *newsecret*) bad "telegram token value not leaked in preview" "leaked: $tg_tok_msg" ;;
+*) ok "telegram token value not leaked in preview" ;;
+esac
 assert_contains "monero mem is INFO" "$(run_sourced "$SANDBOX" describe_change MONERO_MEM_LIMIT 4g 6g)" "INFO"
 assert_contains "monero mem recreate note" "$(run_sourced "$SANDBOX" describe_change MONERO_MEM_LIMIT 4g 6g)" "monerod container is recreated"
 # Clearnet initial sync (#183): enabling OR disabling is DEST (the daemon is recreated), and enabling
@@ -1505,6 +1515,68 @@ printf '{ "monero": {"mode":"local","wallet_address":"%s","node_username":"u","n
 out="$(cd "$V" && DOCKER_LOG="$DOCKER_LOG" PATH="$V/bin:$PATH" ./pithead apply -y 2>&1)"
 assert_eq "check_for_updates opt-out propagated false" "$(run_sourced "$V" env_get_file "$V/.env" DASHBOARD_CHECK_UPDATES)" "false"
+# Telegram defaults (#121): no telegram block => disabled, per-event toggles default on.
+seed_env
+printf '{ "monero": {"mode":"local","wallet_address":"%s","node_username":"u","node_password":"p"}, "tari":{"wallet_address":"T"}, "p2pool":{"pool":"mini"}, "dashboard":{"secure":false,"host":"box.lan"} }\n' "$WALLET" >"$V/config.json"
+out="$(cd "$V" && DOCKER_LOG="$DOCKER_LOG" PATH="$V/bin:$PATH" ./pithead apply -y 2>&1)"
+assert_eq "telegram disabled by default" "$(run_sourced "$V" env_get_file "$V/.env" TELEGRAM_ENABLED)" "false"
+assert_eq "telegram event defaults on" "$(run_sourced "$V" env_get_file "$V/.env" TELEGRAM_EVENT_NODE_DOWN)" "true"
+
+# Telegram enabled: token/chat_id + per-event toggles propagate from config.json into .env.
+seed_env
+printf '{ "monero": {"mode":"local","wallet_address":"%s","node_username":"u","node_password":"p"}, "tari":{"wallet_address":"T"}, "p2pool":{"pool":"mini"}, "dashboard":{"secure":false,"host":"box.lan"}, "telegram":{"enabled":true,"bot_token":"BOTSECRET","chat_id":"-100123","events":{"worker_offline":false}} }\n' "$WALLET" >"$V/config.json"
+out="$(cd "$V" && DOCKER_LOG="$DOCKER_LOG" PATH="$V/bin:$PATH" ./pithead apply -y 2>&1)"
+assert_eq "telegram enabled propagated" "$(run_sourced "$V" env_get_file "$V/.env" TELEGRAM_ENABLED)" "true"
+assert_eq "telegram token propagated" "$(run_sourced "$V" env_get_file "$V/.env" TELEGRAM_BOT_TOKEN)" "BOTSECRET"
+assert_eq "telegram chat_id propagated" "$(run_sourced "$V" env_get_file "$V/.env" TELEGRAM_CHAT_ID)" "-100123"
+assert_eq "telegram per-event override off" "$(run_sourced "$V" env_get_file "$V/.env" TELEGRAM_EVENT_WORKER_OFFLINE)" "false"
+assert_eq "telegram unset event stays on" "$(run_sourced "$V" env_get_file "$V/.env" TELEGRAM_EVENT_NODE_DOWN)" "true"
+# The bot token is a secret: the apply preview must not print it.
+case "$out" in
+*BOTSECRET*) bad "telegram token not printed by apply" "leaked in: $out" ;;
+*) ok "telegram token not printed by apply" ;;
+esac
+
+# Interactive command interface (#45): off by default, opt-in via telegram.commands.enabled.
+seed_env
+printf '{ "monero": {"mode":"local","wallet_address":"%s","node_username":"u","node_password":"p"}, "tari":{"wallet_address":"T"}, "p2pool":{"pool":"mini"}, "dashboard":{"secure":false,"host":"box.lan"} }\n' "$WALLET" >"$V/config.json"
+out="$(cd "$V" && DOCKER_LOG="$DOCKER_LOG" PATH="$V/bin:$PATH" ./pithead apply -y 2>&1)"
+assert_eq "telegram commands off by default" "$(run_sourced "$V" env_get_file "$V/.env" TELEGRAM_COMMANDS_ENABLED)" "false"
+seed_env
+printf '{ "monero": {"mode":"local","wallet_address":"%s","node_username":"u","node_password":"p"}, "tari":{"wallet_address":"T"}, "p2pool":{"pool":"mini"}, "dashboard":{"secure":false,"host":"box.lan"}, "telegram":{"enabled":true,"bot_token":"BOTSECRET","chat_id":"-100123","commands":{"enabled":true}} }\n' "$WALLET" >"$V/config.json"
+out="$(cd "$V" && DOCKER_LOG="$DOCKER_LOG" PATH="$V/bin:$PATH" ./pithead apply -y 2>&1)"
+assert_eq "telegram commands opt-in propagated" "$(run_sourced "$V" env_get_file "$V/.env" TELEGRAM_COMMANDS_ENABLED)" "true"
+
+# Daily-summary time (#121): defaults to 08:00; an explicit telegram.daily_summary_time propagates.
+assert_eq "daily summary time defaults to 08:00" "$(run_sourced "$V" env_get_file "$V/.env" TELEGRAM_DAILY_SUMMARY_TIME)" "08:00"
+seed_env
+printf '{ "monero": {"mode":"local","wallet_address":"%s","node_username":"u","node_password":"p"}, "tari":{"wallet_address":"T"}, "p2pool":{"pool":"mini"}, "dashboard":{"secure":false,"host":"box.lan"}, "telegram":{"enabled":true,"bot_token":"BOTSECRET","chat_id":"-100123","daily_summary_time":"21:30"} }\n' "$WALLET" >"$V/config.json"
+out="$(cd "$V" && DOCKER_LOG="$DOCKER_LOG" PATH="$V/bin:$PATH" ./pithead apply -y 2>&1)"
+assert_eq "daily summary time propagated" "$(run_sourced "$V" env_get_file "$V/.env" TELEGRAM_DAILY_SUMMARY_TIME)" "21:30"
+
+# Hashrate-loss detector knobs (#99): default 50% over 10 min; explicit dashboard overrides propagate.
+assert_eq "hashrate drop threshold default 50" "$(run_sourced "$V" env_get_file "$V/.env" HASHRATE_DROP_THRESHOLD_PCT)" "50"
+assert_eq "hashrate drop minutes default 10" "$(run_sourced "$V" env_get_file "$V/.env" HASHRATE_DROP_MINUTES)" "10"
+seed_env
+printf '{ "monero": {"mode":"local","wallet_address":"%s","node_username":"u","node_password":"p"}, "tari":{"wallet_address":"T"}, "p2pool":{"pool":"mini"}, "dashboard":{"secure":false,"host":"box.lan","hashrate_drop_threshold":40,"hashrate_drop_minutes":5} }\n' "$WALLET" >"$V/config.json"
+out="$(cd "$V" && DOCKER_LOG="$DOCKER_LOG" PATH="$V/bin:$PATH" ./pithead apply -y 2>&1)"
+assert_eq "hashrate drop threshold override propagated" "$(run_sourced "$V" env_get_file "$V/.env" HASHRATE_DROP_THRESHOLD_PCT)" "40"
+assert_eq "hashrate drop minutes override propagated" "$(run_sourced "$V" env_get_file "$V/.env" HASHRATE_DROP_MINUTES)" "5"
+
+# Event-set consistency (#121/#45): every telegram.events.* key in config.reference.json must be
+# rendered by pithead into .env AND declared in docker-compose.yml — so adding an alert event in one
+# surface but forgetting another fails here. (The Python side — AlertService.EVT_* vs config.py's
+# TELEGRAM_EVENTS — is guarded by a dashboard unit test.) The .env above has all events at their
+# default (no events overrides in that config), so each should render "true".
+compose_text="$(cat "$ROOT/docker-compose.yml")"
+while IFS= read -r ev; do
+ up=$(printf '%s' "$ev" | tr '[:lower:]' '[:upper:]')
+ assert_eq "telegram event '$ev' rendered to .env" \
+ "$(run_sourced "$V" env_get_file "$V/.env" "TELEGRAM_EVENT_$up")" "true"
+ assert_contains "telegram event '$ev' declared in docker-compose.yml" \
+ "$compose_text" "TELEGRAM_EVENT_$up="
+done < <(jq -r '.telegram.events | keys[]' "$ROOT/config.reference.json")
+
 # An explicit tari.mem_limit is passed through verbatim (overriding the "auto" host-RAM scaling).
 seed_env
 printf '{ "monero": {"mode":"local","wallet_address":"%s","node_username":"u","node_password":"p"}, "tari":{"wallet_address":"T","mem_limit":"3072m"}, "p2pool":{"pool":"mini"}, "dashboard":{"secure":false,"host":"box.lan"} }\n' "$WALLET" >"$V/config.json"