feat: implement telemetry system for application usage tracking
- Added telemetry utility to capture application events and metrics.
- Integrated PostHog for event tracking with distinct user identification.
- Implemented telemetry initialization, event capturing, and shutdown procedures.

feat: add UV environment setup for Python management
- Created utilities to manage Python installation and configuration.
- Implemented network optimization checks for Python installation mirrors.
- Added functions to set up managed Python environments with error handling.

feat: enhance host API communication with token management
- Introduced host API token retrieval and management for secure requests.
- Updated host API fetch functions to include token in headers.
- Added support for creating event sources with authentication.

test: add comprehensive tests for gateway protocol and startup helpers
- Implemented unit tests for gateway protocol helpers, event dispatching, and state management.
- Added tests for startup recovery strategies and process policies.
- Ensured coverage for connection monitoring and restart governance logic.
509
docs/ClawX-Gateway-Launcher-Migration-Execution-Plan.md
Normal file
@@ -0,0 +1,509 @@
# ClawX Gateway Launcher Migration Execution Plan

## 1. Goals

`zn-ai` has completed the first wave of the Gateway startup-chain migration, which focused on:

- launch context assembly
- launcher core and Windows startup compatibility
- startup orchestrator
- startup stderr diagnostics
- basic port and orphan-process governance in the supervisor

The next wave migrates only the `ClawX` Gateway lifecycle and self-healing loop, strictly limited to the following four capability groups:

- `connection-monitor`
- `restart-controller` / `restart-governor`
- `reload-policy`
- `doctor repair`

Explicitly out of scope for this round:

- Changes to the main Chat pipeline
- Changes to the Skills installation pipeline
- Refactoring of Providers/Channels unrelated to the Gateway lifecycle
- Changes to the UI visual layer

## 2. Current Findings

`zn-ai` and `ClawX` are now close in Gateway startup capability, but they still differ in full Gateway lifecycle capability.

`zn-ai` already has:

- `prepareGatewayLaunchContext(...)`
- `launchGatewayProcess(...)`
- `runGatewayStartupSequence(...)`
- startup stderr classification and caching
- `waitForPortFree(...)`
- orphaned listener process detection and cleanup
- Windows `node-runtime` startup strategy

`zn-ai` still lacks:

- Self-healing driven by heartbeat monitoring and health checks
- A restart defer mechanism while starting or reconnecting
- restart cooldown governor
- Gateway reload policy parsing and application
- OpenClaw doctor repair
- Python readiness warmup
- The manager lifecycle wiring that goes with these capabilities

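## 3. Wave 2 Migration Goals

Once the migration is complete, the `zn-ai` Gateway must exhibit at least the following behaviors:

1. After the Gateway connection is established, ping/pong and message heartbeats are monitored reliably.
2. When the connection goes dead or a health check fails, the manager self-heals according to the governance policy instead of staying down indefinitely after a single disconnect.
3. Externally triggered restart requests during the `starting` / `reconnecting` phases do not interrupt an in-flight startup; they are deferred until an appropriate moment.
4. Consecutive restarts are constrained by the cooldown governor to avoid flapping and port contention.
5. The Gateway reload policy can be read from `~/.openclaw/openclaw.json` and determines whether to use `reload`, `restart`, `hybrid`, or `off`.
6. When the startup-orchestrator determines that the configuration is corrupted, it can trigger doctor repair and then retry startup once.
7. A real Gateway smoke test on Windows can distinguish:
   - slow startup that eventually becomes ready
   - successful repair after config corruption
   - retries stopping after a failed repair
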
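Goal 3 (deferring a restart that arrives mid-startup) can be illustrated with a small sketch. Everything here is hypothetical: the state names, the `DeferredRestartController` class, and the choice to drop a deferred restart when the Gateway ends up stopped are assumptions for illustration, not the ClawX implementation.

```typescript
// Hypothetical lifecycle states; the real zn-ai / ClawX state names may differ.
type LifecycleState = "stopped" | "starting" | "running" | "reconnecting";

type RestartDecision = "execute" | "defer";

// Hypothetical helper: remembers a restart request that arrives while a
// startup is in flight and replays it once the lifecycle settles.
class DeferredRestartController {
  private pending = false;

  // Decide what to do with a restart request in the given state.
  request(state: LifecycleState): RestartDecision {
    if (state === "starting" || state === "reconnecting") {
      this.pending = true; // remember it, do not interrupt the in-flight start
      return "defer";
    }
    return "execute";
  }

  // Call on every state transition. Returns true when a previously
  // deferred restart should be executed now.
  flush(state: LifecycleState): boolean {
    if (!this.pending) return false;
    if (state === "starting" || state === "reconnecting") return false;
    this.pending = false;
    // Assumption of this sketch: only a running Gateway needs the restart;
    // if it settled in "stopped", the deferred request is dropped.
    return state === "running";
  }
}
```

The manager would call `request()` when an external restart arrives and `flush()` from its state-transition hook, so the deferred restart runs at most once.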
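## 4. Recommended Number of Sub-agents

The recommendation is `4` development sub-agents plus `1` lead coordinator agent.

Rationale:

- The dependency closures of these four capability groups are not fully independent.
- `manager.ts` is a hotspot file and must not be modified by multiple workers at once.
- `doctor repair` additionally touches `supervisor.ts` and several utility modules, so it is best handled as a separate unit.
- Testing and real-world regression need independent ownership, so that implementers cannot sidestep risks themselves.

Splitting into more than `4` development sub-agents is not recommended, because:

- Write conflicts on `manager.ts`, `startup-orchestrator.ts`, and `supervisor.ts` would increase noticeably.
- The goal of this round is to close the lifecycle loop, not to carry out a broad modular refactor.
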
## 5. Division of Work

### Lead Coordinator Agent

Responsibilities:

- Freeze the Wave 2 migration boundary: Gateway lifecycle and self-healing capabilities only.
- Freeze the shared contracts:
  - lifecycle state
  - reconnect / deferred restart policy
  - reload policy result
  - doctor repair hook
- Own the merge order, conflict coordination, regression checklist, and final wrap-up.

The coordinator does not write large amounts of code directly; the focus is on interfaces and sequencing.

### SA-1 Lifecycle Primitives

Scope of responsibility:

- `zn-ai/electron/gateway/connection-monitor.ts`
- `zn-ai/electron/gateway/process-policy.ts`
- `zn-ai/electron/gateway/lifecycle-controller.ts`
- `zn-ai/electron/gateway/restart-controller.ts`
- `zn-ai/electron/gateway/restart-governor.ts`
- If necessary, a minimal `state` helper file may be added, but manager wiring is out of scope

Goals:

- Port ClawX's heartbeat monitoring, health-check timer, restart defer rules, reconnect rules, and restart cooldown rules
- Keep the implementation as independent as possible so the manager can integrate it later

Constraints:

- Do not modify `manager.ts`
- Do not modify UI / API / Chat / Skills

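As an illustration of the heartbeat-monitoring primitive SA-1 is asked to port, here is a minimal sketch of consecutive-miss tracking. The `HeartbeatTracker` class, its method names, and the miss threshold are hypothetical; the actual ClawX `connection-monitor` may be structured differently. Timers are deliberately left out so the logic stays independent of the manager.

```typescript
// Hypothetical heartbeat tracker: counts consecutive unanswered pings and
// reports when a threshold is reached. Names and semantics are assumptions.
class HeartbeatTracker {
  private misses = 0;
  private awaitingPong = false;
  private readonly maxMisses: number;

  constructor(maxMisses: number) {
    this.maxMisses = maxMisses;
  }

  // Call on each ping interval tick, just before sending the next ping.
  // Returns true when the connection should be handed to the recovery path.
  onPingTick(): boolean {
    if (this.awaitingPong) {
      this.misses += 1; // the previous ping was never answered
    }
    this.awaitingPong = true;
    return this.misses >= this.maxMisses;
  }

  // Call when a pong or any inbound message arrives: the peer is alive.
  onAlive(): void {
    this.awaitingPong = false;
    this.misses = 0;
  }

  get consecutiveMisses(): number {
    return this.misses;
  }
}
```

Treating any inbound message as a liveness signal, not only a pong, matches the document's requirement that heartbeat state recovers when either a message or a pong arrives.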
### SA-2 Repair & Supervisor

Scope of responsibility:

- `zn-ai/electron/gateway/supervisor.ts`
- `zn-ai/electron/gateway/startup-orchestrator.ts`
- Add minimal utilities if they are missing:
  - `zn-ai/electron/utils/uv-env.ts`
  - `zn-ai/electron/utils/uv-setup.ts`
  - `zn-ai/electron/utils/env-path.ts`

Goals:

- Port `warmupManagedPythonReadiness()`
- Port `runOpenClawDoctorRepair()`
- Wire doctor repair into `runGatewayStartupSequence(...)`
- Preserve Windows compatibility for process tree termination and PATH injection

Constraints:

- Do not modify `manager.ts`
- Introduce only the minimal dependencies needed for doctor repair; do not spread into other Python/uv functionality

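The "repair once, then retry" rule that SA-2 wires into the startup sequence can be sketched as a pure decision function. The failure categories, the `nextRecoveryAction` name, and the `maxAttempts` budget are all hypothetical; how `runGatewayStartupSequence(...)` actually classifies failures is not described in this document.

```typescript
// Hypothetical classification of a failed startup attempt.
type StartupFailure = "config-invalid" | "transient" | "fatal";

type RecoveryAction = "run-doctor-repair" | "retry" | "give-up";

interface RecoveryState {
  attempts: number;         // startup attempts made so far
  repairAttempted: boolean; // doctor repair already ran in this sequence
}

// Decide the next step after a failed startup attempt.
// `maxAttempts` is an assumed budget, not a value taken from the source.
function nextRecoveryAction(
  failure: StartupFailure,
  state: RecoveryState,
  maxAttempts: number,
): RecoveryAction {
  if (failure === "fatal") return "give-up";
  if (failure === "config-invalid") {
    // Repair at most once: a second config failure means repair did not help,
    // so stop instead of looping forever.
    return state.repairAttempted ? "give-up" : "run-doctor-repair";
  }
  return state.attempts < maxAttempts ? "retry" : "give-up";
}
```

Keeping the decision separate from process spawning makes the "no infinite retry after a failed repair" rule straightforward to unit test.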
### SA-3 Manager Lifecycle Integration

Scope of responsibility:

- `zn-ai/electron/gateway/manager.ts`
- Minor changes where necessary:
  - `zn-ai/electron/gateway/ws-client.ts`
  - `zn-ai/electron/gateway/types.ts`

Goals:

- Integrate the lifecycle / reconnect / restart governance delivered by SA-1
- Integrate the doctor repair and startup recovery hook delivered by SA-2
- Add the following to `zn-ai`:
  - `startHealthCheck()`
  - `startPing()`
  - `scheduleReconnect()`
  - `debouncedRestart()`
  - `reload()`
  - `debouncedReload()`
  - reload policy refresh

Constraints:

- Touch only Gateway lifecycle logic
- Do not change the existing chat payload structure
- Do not use this as an opportunity to refactor unrelated runtime broadcast

Note:

`manager.ts` is owned exclusively by SA-3. Other sub-agents do not modify this file directly, to avoid conflicts.

### SA-4 Verification & Regression

Scope of responsibility:

- `zn-ai/tests/*gateway*`
- Gateway-related smoke scripts and documentation
- Minimal test fixtures where necessary

Goals:

- Add unit tests for the new lifecycle/reload/repair behaviors
- Add real Windows smoke-test steps
- Produce a failure-localization matrix covering:
  - heartbeat miss
  - deferred restart
  - governor suppress
  - reload policy mode branches
  - doctor repair success/failure

Constraints:

- Do not modify product logic, except for the minimal seams needed to make tests injectable

## 6. Implementation Order

### Wave 2A

In parallel:

- SA-1 Lifecycle Primitives
- SA-2 Repair & Supervisor

Freeze the output contracts:

- `GatewayConnectionMonitor`
- `GatewayRestartController`
- `GatewayRestartGovernor`
- `GatewayReloadPolicy`
- `runOpenClawDoctorRepair()`
- `warmupManagedPythonReadiness()`

### Wave 2B

Proceeds once the Wave 2A contracts are frozen:

- SA-3 Manager Lifecycle Integration

This step only does wiring and behavior consolidation; it does not reach back to change the module boundaries of SA-1/SA-2.

### Wave 2C

Finally:

- SA-4 Verification & Regression
- Lead Coordinator Agent consolidates acceptance

## 7. Merge Order

1. `process-policy.ts` / `lifecycle-controller.ts` / `connection-monitor.ts`
2. `restart-controller.ts` / `restart-governor.ts`
3. `reload-policy.ts`
4. doctor repair and Python warmup in `supervisor.ts`
5. `startup-orchestrator.ts` repair hook
6. `manager.ts` lifecycle wiring
7. tests / smoke / documentation wrap-up

## 8. Key Dependencies and Caveats

### 8.1 manager.ts is a single hotspot

Many of this round's capabilities ultimately land in `manager.ts`. Therefore:

- Only SA-3 may write to `manager.ts`
- SA-1 / SA-2 only provide integrable modules and stable interfaces

### 8.2 doctor repair is not a standalone feature

`ClawX`'s doctor repair depends on:

- OpenClaw entry/runtime path
- bundled `bin` PATH injection
- `uv` mirror environment
- Python readiness check

`zn-ai` does not yet have complete counterparts for these modules, so the port must bring in the "minimal dependency closure"; copying `runOpenClawDoctorRepair()` alone is not enough.

### 8.3 reload policy covers only Gateway process strategy

In this round, `reload-policy` only decides behavior at the Gateway process level:

- `off`
- `reload`
- `restart`
- `hybrid`

It does not extend to UI or config-editor logic.

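A minimal sketch of parsing the four modes out of the config text follows. The key path `gateway.reload.mode` and the `hybrid` fallback are assumptions made for this example; the document does not state where the mode lives inside `~/.openclaw/openclaw.json` or what the default is. The function takes the raw file contents as a string so that file access stays with the caller.

```typescript
type ReloadMode = "off" | "reload" | "restart" | "hybrid";

const RELOAD_MODES: readonly ReloadMode[] = ["off", "reload", "restart", "hybrid"];

function isReloadMode(value: unknown): value is ReloadMode {
  return typeof value === "string" && (RELOAD_MODES as readonly string[]).indexOf(value) !== -1;
}

// Parse the reload mode out of raw config text.
// ASSUMPTION: the mode lives at `gateway.reload.mode`. The real key path in
// openclaw.json is not given in this document; the fallback is also assumed.
function parseReloadMode(rawJson: string, fallback: ReloadMode = "hybrid"): ReloadMode {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch {
    return fallback; // unreadable config: keep the default behavior
  }
  const gateway = (parsed as { gateway?: unknown } | null)?.gateway;
  const reload = (gateway as { reload?: unknown } | null | undefined)?.reload;
  const mode = (reload as { mode?: unknown } | null | undefined)?.mode;
  return isReloadMode(mode) ? mode : fallback;
}
```

Unknown or missing values fall back instead of throwing, so a malformed policy never blocks Gateway startup in this sketch.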
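### 8.4 Windows smoke tests must keep a long wait budget

Earlier real-world diagnostics confirmed that OpenClaw can take more than 100 seconds to become ready on Windows. Therefore:

- The ready wait budget must not be rolled back to the old value
- Smoke tests must distinguish "slow start" from "true failure"
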
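One way to make the "slow start" versus "true failure" distinction explicit in a smoke script is a small classifier like the one below. Both thresholds are illustrative assumptions: the document only states that readiness on Windows can exceed 100 seconds, not what the actual budget is. The type and function names are hypothetical as well.

```typescript
type StartupOutcome = "ready" | "slow-ready" | "still-waiting" | "failed";

interface StartupProbe {
  elapsedMs: number;      // time since the process was spawned
  ready: boolean;         // readiness signal observed
  processExited: boolean; // child process has exited
}

// ASSUMPTION: both values are placeholders for illustration only.
const SLOW_START_THRESHOLD_MS = 30000;
const READY_WAIT_BUDGET_MS = 180000;

function classifyStartup(probe: StartupProbe): StartupOutcome {
  if (probe.ready) {
    return probe.elapsedMs > SLOW_START_THRESHOLD_MS ? "slow-ready" : "ready";
  }
  // Exited before becoming ready: a real failure, regardless of elapsed time.
  if (probe.processExited) return "failed";
  // Still running: only fail once the full wait budget is spent.
  return probe.elapsedMs >= READY_WAIT_BUDGET_MS ? "failed" : "still-waiting";
}
```

With these placeholder values, a process that is still alive at 110 seconds is reported as `still-waiting` rather than `failed`, which is the behavior the long wait budget is meant to protect.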
## 9. Acceptance Criteria

The following scenarios must be covered:

- After connecting, the Gateway sustains ping/pong and restores heartbeat state when a message/pong arrives
- When consecutive heartbeat misses reach the threshold, a controlled recovery is triggered
- `restart()` is deferred during `starting` / `reconnecting` rather than interrupting an in-flight startup
- The restart governor suppresses repeated restarts within the cooldown
- `reload-policy` can be read from `~/.openclaw/openclaw.json` and applied
- `reload()` falls back to `restart()` on failure
- Startup runs doctor repair when it detects config corruption
- After a successful doctor repair, the Gateway continues starting
- After a failed doctor repair, the error is diagnosable and retries do not loop forever
- In real Windows startup regression, `exited before becoming ready` no longer appears frequently without diagnostic information

Recommended minimum verification:

- `pnpm typecheck`
- Gateway lifecycle unit tests
- One real Gateway smoke test on a local Windows machine
- One repair regression after config corruption

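The cooldown-suppression criterion can be exercised against a sketch like the following. `CooldownGovernor` and its `decide()` signature are hypothetical and are not the `GatewayRestartGovernor` contract; the clock is injected as a plain number so a unit test needs no timers.

```typescript
interface GovernorDecision {
  allow: boolean;
  retryAfterMs: number; // 0 when the restart is allowed
}

// Hypothetical cooldown governor: allows a restart only if the previous
// allowed restart is at least `cooldownMs` in the past.
class CooldownGovernor {
  private lastRestartAt: number | null = null;
  private readonly cooldownMs: number;

  constructor(cooldownMs: number) {
    this.cooldownMs = cooldownMs;
  }

  // `now` is a millisecond timestamp supplied by the caller.
  decide(now: number): GovernorDecision {
    if (this.lastRestartAt !== null) {
      const elapsed = now - this.lastRestartAt;
      if (elapsed < this.cooldownMs) {
        // Suppressed: tell the caller how long to wait before asking again.
        return { allow: false, retryAfterMs: this.cooldownMs - elapsed };
      }
    }
    this.lastRestartAt = now;
    return { allow: true, retryAfterMs: 0 };
  }
}
```

A suppressed request does not reset the cooldown window in this sketch; whether the real governor extends the window on suppressed requests is not stated in the document.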
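## 10. Current Execution Status

The recommended order for moving forward is:

1. First finalize the Wave 2 sub-agent assignments and freeze ownership
2. Implement SA-1 and SA-2 in parallel
3. SA-3 integrates into `manager.ts` with exclusive ownership
4. Finally, SA-4 owns real-world regression and the failure matrix
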
## 11. Wave 3 Scope

After `Wave 2A / 2B` are complete, the Gateway gap between `zn-ai` and `ClawX` narrows mainly to the following group:

- `state.ts`
- `protocol.ts`
- `event-dispatch.ts`
- diagnostics in `manager.ts`
- `gatewayReady fallback` in `manager.ts`

The goal of this group is not to extend lifecycle governance further, but to bring the Gateway's state model, protocol compatibility, event dispatch, and diagnostic observability close to `ClawX`.

Explicitly out of scope for this round:

- Changes to Chat business semantics
- A new diagnostics page in the UI
- Migration of the telemetry upload system
- Refactoring of Skills / Channels / Providers unrelated to this round

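## 12. Recommended Number of Sub-agents for Wave 3

The minimum viable setup is `3` development sub-agents plus `1` lead coordinator agent.

The recommended setup is `4` development sub-agents plus `1` lead coordinator agent.

Proceeding with `4` development sub-agents is recommended because:

- `manager.ts` is still a single hotspot file and must have a single owner.
- `protocol.ts` / `event-dispatch.ts` and `state.ts` / diagnostics have different dependency closures, so they are suited to parallel work.
- diagnostics / `gatewayReady fallback` need independent verification and should not be left entirely to implementer self-testing.

If resources are limited, this can be reduced to `3` development sub-agents by either:

- Folding verification back into the Lead Coordinator Agent
- Or assigning `protocol/event-dispatch` and `state/diagnostics primitives` to the same sub-agent
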
## 13. Wave 3 Division of Work

### Lead Coordinator Agent

Responsibilities:

- Freeze the Wave 3 scope: state layer, protocol layer, event dispatch layer, and manager diagnostics/fallback only.
- Freeze the shared contracts:
  - GatewayStatus state structure
  - diagnostics snapshot structure
  - protocol / notification type guards
  - `gateway.ready` fallback behavior
- Own the merge order, conflict coordination, and final acceptance.

### SA-1 Protocol & Dispatch

Scope of responsibility:

- `zn-ai/electron/gateway/protocol.ts`
- `zn-ai/electron/gateway/event-dispatch.ts`
- Minor changes where necessary:
  - `zn-ai/electron/gateway/types.ts`

Goals:

- Port `ClawX`'s JSON-RPC type guards and protocol type definitions
- Port the dispatch logic for protocol events and JSON-RPC notifications
- Make the `zn-ai` Gateway manager handle more than the current OpenClaw event frame, giving it a protocol fallback surface close to that of `ClawX`

Constraints:

- Do not modify `manager.ts`
- Do not modify the Chat store / renderer event consumption logic

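For orientation, the following sketch shows what JSON-RPC 2.0 type guards for requests, notifications, and responses typically look like. It follows the public JSON-RPC 2.0 message shapes; the interface and function names are assumptions, and the guards in `ClawX`'s `protocol.ts` may be stricter or looser.

```typescript
type JsonRpcId = string | number | null;

interface JsonRpcRequest { jsonrpc: "2.0"; id: JsonRpcId; method: string; params?: unknown }
interface JsonRpcNotification { jsonrpc: "2.0"; method: string; params?: unknown }
interface JsonRpcResponse {
  jsonrpc: "2.0";
  id: JsonRpcId;
  result?: unknown;
  error?: { code: number; message: string; data?: unknown };
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function isJsonRpcId(value: unknown): value is JsonRpcId {
  return value === null || typeof value === "string" || typeof value === "number";
}

// A request carries both a method and an id.
function isJsonRpcRequest(msg: unknown): msg is JsonRpcRequest {
  return isRecord(msg) && msg["jsonrpc"] === "2.0" && typeof msg["method"] === "string"
    && "id" in msg && isJsonRpcId(msg["id"]);
}

// A notification carries a method but no id, so no response is expected.
function isJsonRpcNotification(msg: unknown): msg is JsonRpcNotification {
  return isRecord(msg) && msg["jsonrpc"] === "2.0" && typeof msg["method"] === "string"
    && !("id" in msg);
}

// A response carries an id plus either a result or an error, and no method.
function isJsonRpcResponse(msg: unknown): msg is JsonRpcResponse {
  return isRecord(msg) && msg["jsonrpc"] === "2.0" && !("method" in msg)
    && "id" in msg && isJsonRpcId(msg["id"]) && ("result" in msg || "error" in msg);
}
```

Because the three guards are mutually exclusive, a dispatcher can test them in any order and route frames that match none of them to the existing OpenClaw event-frame handling.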
### SA-2 State & Diagnostics Primitives

Scope of responsibility:

- `zn-ai/electron/gateway/state.ts`
- `zn-ai/electron/gateway/diagnostics.ts`
- If necessary, a minimal diagnostics types file may be added

Goals:

- Port `GatewayStateController`
- Define a unified implementation of `getStatus()` / `isConnected()` / the state transition hook
- Provide a stable data structure for manager diagnostics:
  - `lastAliveAt`
  - `lastRpcSuccessAt`
  - `lastRpcFailureAt`
  - `lastRpcFailureMethod`
  - `lastHeartbeatTimeoutAt`
  - `lastSocketCloseAt`
  - `lastSocketCloseCode`
  - `consecutiveHeartbeatMisses`
  - `consecutiveRpcFailures`

Constraints:

- Do not modify `manager.ts`
- Do not add a UI diagnostics entry point

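The diagnostics fields listed above could be held in a snapshot such as the one sketched here. The field names come from this document; the field types (millisecond timestamps or `null`), the `DiagnosticsSnapshot` name, and the update helpers are assumptions for illustration.

```typescript
// Field names are from the plan; the types are assumed.
interface DiagnosticsSnapshot {
  lastAliveAt: number | null;
  lastRpcSuccessAt: number | null;
  lastRpcFailureAt: number | null;
  lastRpcFailureMethod: string | null;
  lastHeartbeatTimeoutAt: number | null;
  lastSocketCloseAt: number | null;
  lastSocketCloseCode: number | null;
  consecutiveHeartbeatMisses: number;
  consecutiveRpcFailures: number;
}

function createDiagnostics(): DiagnosticsSnapshot {
  return {
    lastAliveAt: null, lastRpcSuccessAt: null, lastRpcFailureAt: null,
    lastRpcFailureMethod: null, lastHeartbeatTimeoutAt: null,
    lastSocketCloseAt: null, lastSocketCloseCode: null,
    consecutiveHeartbeatMisses: 0, consecutiveRpcFailures: 0,
  };
}

// Hypothetical pure update helpers: each returns a new snapshot so callers
// can hand out copies without exposing mutable state.
function recordAlive(s: DiagnosticsSnapshot, now: number): DiagnosticsSnapshot {
  return { ...s, lastAliveAt: now, consecutiveHeartbeatMisses: 0 };
}

function recordRpcSuccess(s: DiagnosticsSnapshot, now: number): DiagnosticsSnapshot {
  return { ...s, lastRpcSuccessAt: now, consecutiveRpcFailures: 0 };
}

function recordRpcFailure(s: DiagnosticsSnapshot, method: string, now: number): DiagnosticsSnapshot {
  return {
    ...s,
    lastRpcFailureAt: now,
    lastRpcFailureMethod: method,
    consecutiveRpcFailures: s.consecutiveRpcFailures + 1,
  };
}

function recordHeartbeatTimeout(s: DiagnosticsSnapshot, now: number): DiagnosticsSnapshot {
  return { ...s, lastHeartbeatTimeoutAt: now, consecutiveHeartbeatMisses: s.consecutiveHeartbeatMisses + 1 };
}

function recordSocketClose(s: DiagnosticsSnapshot, code: number, now: number): DiagnosticsSnapshot {
  return { ...s, lastSocketCloseAt: now, lastSocketCloseCode: code };
}
```

In this sketch a success resets the matching consecutive counter while the last-failure timestamp is kept, so the snapshot still shows when the most recent failure happened.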
### SA-3 Manager State & Protocol Integration

Scope of responsibility:

- `zn-ai/electron/gateway/manager.ts`
- Minor changes where necessary:
  - `zn-ai/electron/gateway/ws-client.ts`

Goals:

- Integrate SA-1's protocol / event-dispatch
- Integrate SA-2's state controller and diagnostics snapshot
- Add the following to `zn-ai`:
  - `getDiagnostics()`
  - `gateway.ready` event handling
  - `gatewayReady fallback` timer
  - richer state transition handling
  - Recording of RPC success / failure / socket close / heartbeat timeout

Constraints:

- `manager.ts` is owned exclusively by SA-3
- Do not use this as an opportunity to change the chat payload structure
- Do not migrate telemetry upload logic unless it becomes a required dependency

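The interaction between `gateway.ready` event handling and the fallback timer can be sketched as below. The `ReadyFallback` class and the injected `Schedule` function are hypothetical; they are used here so the logic can be tested without real timers, and the actual timeout value is not given in this document.

```typescript
type Cancel = () => void;
// Injected scheduler: runs `fn` after `delayMs` and returns a cancel handle.
type Schedule = (fn: () => void, delayMs: number) => Cancel;
type ReadySource = "event" | "fallback";

// Hypothetical helper: marks the Gateway ready exactly once, either when the
// `gateway.ready` event arrives or when the fallback timer fires first.
class ReadyFallback {
  private ready = false;
  private cancel: Cancel | null = null;
  private readonly schedule: Schedule;
  private readonly onReady: (source: ReadySource) => void;

  constructor(schedule: Schedule, onReady: (source: ReadySource) => void) {
    this.schedule = schedule;
    this.onReady = onReady;
  }

  // Call once the connection is established and readiness is pending.
  arm(timeoutMs: number): void {
    this.disarm();
    this.ready = false;
    this.cancel = this.schedule(() => this.markReady("fallback"), timeoutMs);
  }

  // Call when a `gateway.ready` event is received.
  onGatewayReadyEvent(): void {
    this.markReady("event");
  }

  // Call on disconnect or stop so a stale timer cannot fire later.
  disarm(): void {
    if (this.cancel !== null) {
      this.cancel();
      this.cancel = null;
    }
  }

  private markReady(source: ReadySource): void {
    if (this.ready) return; // never report ready twice per arm()
    this.ready = true;
    this.disarm();
    this.onReady(source);
  }
}
```

In production the scheduler would likely be a thin wrapper such as `(fn, ms) => { const t = setTimeout(fn, ms); return () => clearTimeout(t); }`, while tests can pass a fake that fires on demand.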
### SA-4 Verification & Regression

Scope of responsibility:

- `zn-ai/tests/*gateway*`
- Gateway-related smoke checklist and documentation

Goals:

- Add tests for the following capabilities:
  - protocol type guards
  - event dispatch mapping
  - state transition callbacks
  - diagnostics snapshot updates
  - `gateway.ready` fallback behavior
- Add a regression checklist confirming that the new protocol fallback does not break the existing `chat:*` event chain

Constraints:

- Do not modify product logic, except for the minimal seams needed to make tests injectable

## 14. Wave 3 Implementation Order

### Wave 3A

In parallel:

- SA-1 Protocol & Dispatch
- SA-2 State & Diagnostics Primitives

Freeze the output contracts:

- `protocol.ts`
- `event-dispatch.ts`
- `GatewayStateController`
- diagnostics snapshot structure

### Wave 3B

Proceeds once the Wave 3A contracts are frozen:

- SA-3 Manager State & Protocol Integration

This step must keep `manager.ts` single-owner; parallel writes are not allowed.

### Wave 3C

Finally:

- SA-4 Verification & Regression
- Lead Coordinator Agent consolidates acceptance

## 15. Wave 3 Merge Order

1. `protocol.ts`
2. `event-dispatch.ts`
3. `state.ts`
4. diagnostics types / helper
5. `manager.ts` wiring
6. tests / smoke / documentation wrap-up

## 16. Wave 3 Acceptance Criteria

The following scenarios must be covered:

- `zn-ai` has JSON-RPC request/response/notification type guards close to those of `ClawX`
- Protocol events and JSON-RPC notifications go through a unified dispatch layer
- Manager state updates no longer rely entirely on hand-written branches; they are funneled through the state controller
- diagnostics record the most recent alive/RPC/socket/heartbeat timestamps and failure reasons
- Receiving `gateway.ready` updates the Gateway ready state
- If `gateway.ready` is not received, the fallback timer sets ready after the timeout
- The existing `chat:delta` / `chat:final` / `chat:error` / `chat:aborted` chain does not regress

Recommended minimum verification:

- `pnpm typecheck`
- Gateway protocol / diagnostics unit tests
- One real Gateway startup with a `gateway.ready` fallback regression

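To make the "funneled through the state controller" criterion concrete, here is a minimal sketch of a controller exposing `getStatus()`, `isConnected()`, and a transition hook. The class name, the state names, and the rule that `gatewayReady` is cleared whenever the state leaves `running` are assumptions; the real `GatewayStateController` contract is frozen in Wave 3A and may differ.

```typescript
// Hypothetical state names and status shape.
type GatewayLifecycle = "stopped" | "starting" | "running" | "reconnecting" | "error";

interface StatusSketch {
  state: GatewayLifecycle;
  gatewayReady: boolean;
}

type TransitionListener = (next: StatusSketch, prev: StatusSketch) => void;

class StateControllerSketch {
  private status: StatusSketch = { state: "stopped", gatewayReady: false };
  private readonly listeners: TransitionListener[] = [];

  onTransition(listener: TransitionListener): void {
    this.listeners.push(listener);
  }

  // Return a copy so callers cannot mutate internal state.
  getStatus(): StatusSketch {
    return { ...this.status };
  }

  isConnected(): boolean {
    return this.status.state === "running";
  }

  // Single entry point for every status change made by the manager.
  update(patch: Partial<StatusSketch>): void {
    const prev = this.status;
    const next: StatusSketch = { ...prev, ...patch };
    // Assumption: ready only makes sense while running.
    if (next.state !== "running") next.gatewayReady = false;
    const changed = prev.state !== next.state || prev.gatewayReady !== next.gatewayReady;
    this.status = next;
    if (changed) {
      for (const listener of this.listeners) listener({ ...next }, { ...prev });
    }
  }
}
```

Listeners fire only on an actual change, which keeps repeated identical updates from producing duplicate status broadcasts.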