<fix>[storage]: Add a task to promptly check,update the latest status of ps and host #3112
base: zsv_4.10.28
… of ps and host
After a Ceph failure, a host disconnects from the primary storage (PS) and may go offline. If the host-storage link is not restored promptly after the failure, healthy hosts can be wrongly filtered out. The management node (MN) therefore needs to actively detect these disconnected states. This change adds a scheduled task in PrimaryStorageManagerImpl that periodically checks and updates the connection status between each host and PS.
GlobalConfigImpact
Resolves: ZSTAC-64050
Resolves: ZSV-8248
Change-Id: I76676a656d6a61796e6a657161757a6a64706568
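Walkthrough
This change adds a periodic host connection status check to the primary storage system. It promotes the host status check message classes from the Ceph package to the generic primary storage package, extends the primary storage type to support a status check flag, and implements a background task in the primary storage manager that runs at a configured interval.
Changes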
Sequence diagram
sequenceDiagram
participant MGR as PrimaryStorageManager
participant SCHED as Scheduler
participant PS as PrimaryStorage<br/>(supports status check)
participant HOST as DisconnectedHosts
participant CB as CloudBus
participant REPLY as CheckReply
MGR->>SCHED: managementNodeReady()<br/>start periodic task
activate SCHED
SCHED->>MGR: trigger check every<br/>refresh_interval seconds
deactivate SCHED
loop each cycle
MGR->>MGR: build list of<br/>disconnected hosts
MGR->>PS: check storage type<br/>isSupportCheckHostStatus()
alt storage supports status check
MGR->>HOST: create check messages<br/>for disconnected hosts
MGR->>CB: send in batches, in order<br/>CheckHostStorageConnectionMsg
activate CB
CB->>REPLY: receive status reply
deactivate CB
MGR->>MGR: log result
else storage does not support it
MGR->>MGR: skip this storage
end
end
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~20 minutes
Areas requiring attention:
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (4)
storage/src/main/java/org/zstack/storage/primary/PrimaryStorageGlobalConfig.java (1)
59-60: Consider adding a default value and a description. The new global config PRIMARY_STORAGE_HOST_STATUS_REFRESH_INTERVAL has no @GlobalConfigDef annotation. Adding one would keep it consistent with the other configs and make it easier to maintain:
🔎 Suggested improvement
@GlobalConfigValidation(numberGreaterThan = 0)
+@GlobalConfigDef(defaultValue = "300", type = Integer.class, description = "The interval in seconds for refreshing primary storage host status")
public static GlobalConfig PRIMARY_STORAGE_HOST_STATUS_REFRESH_INTERVAL = new GlobalConfig(CATEGORY, "primarystorage.host.status.refresh.interval");
storage/src/main/java/org/zstack/storage/primary/PrimaryStorageManagerImpl.java (3)
1165-1168: Consider a more accurate method name. Rename startPeriodTasks to startPeriodicTasks, which states the "periodic tasks" meaning more precisely and matches the PeriodicTask class used in startRefreshPrimaryStorageHostStatusTask.
🔎 Suggested rename
-private void startPeriodTasks() {
+private void startPeriodicTasks() {
    PrimaryStorageGlobalConfig.PRIMARY_STORAGE_HOST_STATUS_REFRESH_INTERVAL.installUpdateExtension((oldConfig, newConfig) -> startRefreshPrimaryStorageHostStatusTask());
    startRefreshPrimaryStorageHostStatusTask();
}
Also update the call at line 1343:
-startPeriodTasks();
+startPeriodicTasks();
1192-1202: Consider making the host grouping logic more robust. The current code builds the grouping map directly after querying the disconnected host references. Adding an empty check and a log line would make it more robust:
🔎 Suggested improvement
@Override
public void run() {
    Map<String, List<String>> disconnectedHostsByPsUuid = new HashMap<>();
    List<PrimaryStorageHostRefVO> refs = Q.New(PrimaryStorageHostRefVO.class)
            .eq(PrimaryStorageHostRefVO_.status, PrimaryStorageHostStatus.Disconnected)
            .list();
-
-    refs.forEach(ref -> {
-        disconnectedHostsByPsUuid.computeIfAbsent(ref.getPrimaryStorageUuid(), key -> new ArrayList<>()).add(ref.getHostUuid());
-    });
+
+    if (refs == null || refs.isEmpty()) {
+        logger.debug("No disconnected hosts found, skipping host status refresh");
+        return;
+    }
+
+    refs.forEach(ref -> {
+        disconnectedHostsByPsUuid.computeIfAbsent(ref.getPrimaryStorageUuid(), key -> new ArrayList<>()).add(ref.getHostUuid());
+    });
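The grouping step discussed in this comment can be tried out in isolation. The following is a minimal sketch using plain JDK collections only; HostRef and DisconnectedHostGrouping are hypothetical stand-ins introduced for illustration (they are not ZStack classes), and the database query is replaced by an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for PrimaryStorageHostRefVO; only the two fields used here.
final class HostRef {
    final String primaryStorageUuid;
    final String hostUuid;

    HostRef(String primaryStorageUuid, String hostUuid) {
        this.primaryStorageUuid = primaryStorageUuid;
        this.hostUuid = hostUuid;
    }
}

final class DisconnectedHostGrouping {
    // Groups host UUIDs under the UUID of the primary storage they are disconnected from.
    static Map<String, List<String>> groupByPrimaryStorage(List<HostRef> refs) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        if (refs == null || refs.isEmpty()) {
            return grouped; // early return: nothing to check in this cycle
        }
        for (HostRef ref : refs) {
            grouped.computeIfAbsent(ref.primaryStorageUuid, key -> new ArrayList<>())
                   .add(ref.hostUuid);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<HostRef> refs = Arrays.asList(
                new HostRef("ps-1", "host-a"),
                new HostRef("ps-2", "host-b"),
                new HostRef("ps-1", "host-c"));
        System.out.println(groupByPrimaryStorage(refs));
    }
}
```

Returning an empty map for an empty input lets the caller skip the rest of the cycle with a single isEmpty() check, which is the early-return shape the suggestion aims for.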
1217-1228: Make the error log more accurate. The error log at line 1223 prints only msg.getHostUuids().get(0), but a message may carry several host UUIDs. Printing the full list would help troubleshooting.
🔎 Suggested improvement
new While<>(msgs).step((msg, comp) -> {
    bus.send(msg, new CloudBusCallBack(comp) {
        @Override
        public void run(MessageReply reply) {
            if (!reply.isSuccess()) {
-                logger.error(String.format("Failed to check host storage connection for primary storage %s and host %s due to: %s",
-                        msg.getPrimaryStorageUuid(), msg.getHostUuids().get(0), reply.getError()));
+                logger.error(String.format("Failed to check host storage connection for primary storage[uuid:%s] and hosts%s due to: %s",
+                        msg.getPrimaryStorageUuid(), msg.getHostUuids(), reply.getError()));
            }
            comp.done();
        }
    });
}, 10).run(new NopeWhileDoneCompletion());
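To see what the suggested format string produces when a message carries several hosts, here is a small standalone sketch. CheckFailureMessage is a hypothetical helper written for this illustration; only String.format from the JDK is used, and the error is passed as a plain string rather than a ZStack error object.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; it only demonstrates the suggested format string.
final class CheckFailureMessage {
    static String format(String primaryStorageUuid, List<String> hostUuids, String error) {
        // %s on a List renders every element, e.g. [host-a, host-b],
        // so no host is dropped from the log line.
        return String.format(
                "Failed to check host storage connection for primary storage[uuid:%s] and hosts%s due to: %s",
                primaryStorageUuid, hostUuids, error);
    }

    public static void main(String[] args) {
        System.out.println(format("ps-1", Arrays.asList("host-a", "host-b"), "timeout"));
    }
}
```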
📜 Review details
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
conf/globalConfig/primaryStorage.xml is excluded by !**/*.xml
📒 Files selected for processing (7)
header/src/main/java/org/zstack/header/storage/primary/PrimaryStorageType.java (2 hunks)
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephKvmExtension.java (1 hunks)
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageFactory.java (1 hunks)
storage/src/main/java/org/zstack/storage/primary/CheckHostStorageConnectionMsg.java (1 hunks)
storage/src/main/java/org/zstack/storage/primary/CheckHostStorageConnectionReply.java (1 hunks)
storage/src/main/java/org/zstack/storage/primary/PrimaryStorageGlobalConfig.java (1 hunks)
storage/src/main/java/org/zstack/storage/primary/PrimaryStorageManagerImpl.java (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.java
⚙️ CodeRabbit configuration file
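**/*.java: ## 1. API design requirements
- API naming:
- API names must be unique; duplicates are not allowed.
- API message classes must extend APIMessage; their return classes must extend APIReply or APIEvent and be marked with @RestResponse in the comments.
- API messages must carry the @RestRequest annotation and satisfy the following rules:
path:
- Use the plural form for resources.
- When the path references a message class variable, use the {variableName} format.
- HTTP method mapping:
- Query operations → HttpMethod.GET
- Update operations → HttpMethod.PUT
- Create operations → HttpMethod.POST
- Delete operations → HttpMethod.DELETE
- API classes must implement the __example__ method so that API documentation can be generated, and the corresponding Groovy API Template and API Markdown files must be generated.
2. Naming and formatting conventions
Class names:
- Use UpperCamelCase.
- Special cases:
- VO/AO/EO type classes are exempt.
- Abstract classes use an Abstract or Base prefix/suffix.
- Exception classes should end with Exception.
- Test classes must end with Test or Case.
Method names, parameter names, member variables, and local variables:
- Use lowerCamelCase.
Constant names:
- All uppercase, with words separated by underscores.
- Names must be clear; avoid vague or inaccurate names.
Package names:
- All lowercase and dot-separated; each part should be an English word with a natural meaning (see the structure of the Spring framework).
Naming details:
- Avoid members or local variables with the same name in parent and child classes or within the same code block, to prevent confusion.
- Abbreviations:
- Unnecessary abbreviations such as AbsSchedulerJob, condi, and Fu are not allowed. Use full words to improve readability.
3. Writing self-explanatory code
Expressing intent:
- Avoid boolean parameters that make the meaning unclear. For example:
- For stopAgent(boolean ignoreError), split it into separate functions (such as stopAgentIgnoreError()) or use an enum to express the operation type.
- Names should express intent with full word combinations and reflect the data type or purpose (for example, put the type word at the end of constant and variable names).
Comments:
- Code should be as self-explanatory as possible; explanations shorter than two lines can be written directly in the code.
- Longer comments must be proofread carefully and updated together with the code so that they stay correct.
- Interface methods should not carry redundant modifiers (such as public) and must have valid Javadoc comments.
4. Flow control and structure
Use of if...else:
- Minimize if...else structures. Suggestions:
- Limit nesting to at most two levels, and do not use an else branch in the inner level.
- Return early: end the handling logic inside a condition as soon as possible, or extract it into a separate method.
- Use Java Streams or lambda expressions instead of lengthy loops and conditionals.
Conditions:
- An if condition should not be too long or complex; extract it into a descriptive boolean variable when necessary.
Block length:
- A single if block should not exceed one screen, for readability and later maintenance.
5. Exception handling and logging
- Principles for catching exceptions:
- For RuntimeExceptions that can be avoided with a pre-check (such as NullPointerException and IndexOutOfBoundsException), do not handle them with try-catch.
- Catch exceptions only for truly unexpected situations; do not use exception logic as normal flow control.
- When necessary, rethrow the exception so that upper-level handlers can turn it into a user-friendly error message.
- Use try-with-resources to manage resources, make sure resources are closed correctly in finally blocks, and avoid returning values from finally.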
...
Files:
header/src/main/java/org/zstack/header/storage/primary/PrimaryStorageType.java
storage/src/main/java/org/zstack/storage/primary/PrimaryStorageGlobalConfig.java
storage/src/main/java/org/zstack/storage/primary/PrimaryStorageManagerImpl.java
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephKvmExtension.java
storage/src/main/java/org/zstack/storage/primary/CheckHostStorageConnectionMsg.java
storage/src/main/java/org/zstack/storage/primary/CheckHostStorageConnectionReply.java
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageFactory.java
🧠 Learnings (1)
📚 Learning: 2025-10-20T11:27:25.928Z
Learnt from: MatheMatrix
Repo: MatheMatrix/zstack PR: 2763
File: sdk/src/main/java/org/zstack/sdk/UpdateHostKernelInterfaceAction.java:31-31
Timestamp: 2025-10-20T11:27:25.928Z
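Learning: The UpdateHostKernelInterface API (`sdk/src/main/java/org/zstack/sdk/UpdateHostKernelInterfaceAction.java`) was not yet in actual use at the time of PR #2763, so changes to its parameter constraints (such as the name field changing from required to optional) have no breaking impact.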
Applied to files:
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephKvmExtension.java
🧬 Code graph analysis (1)
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephKvmExtension.java (1)
storage/src/main/java/org/zstack/storage/primary/CheckHostStorageConnectionMsg.java (1)
CheckHostStorageConnectionMsg(8-28)
🔇 Additional comments (8)
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephPrimaryStorageFactory.java (1)
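98-101: LGTM! Enables the primary storage host status check. The code correctly enables the host status check for the Ceph primary storage type. It sits in the initialization block, consistent with the other type feature settings.
header/src/main/java/org/zstack/header/storage/primary/PrimaryStorageType.java (1)
27-28: LGTM! Adds the flag for primary storage host status check support. The new supportCheckHostStatus field and its accessor methods meet the following requirements:
- They follow the lowerCamelCase naming convention
- The default value false is reasonable (a conservative strategy)
- They match the pattern of the other boolean flags in the class
Also applies to: 206-212
plugin/ceph/src/main/java/org/zstack/storage/ceph/primary/CephKvmExtension.java (1)
28-28: LGTM! The import update is correct. The import has been updated to reflect the package change of CheckHostStorageConnectionMsg (moved from org.zstack.storage.ceph.primary to org.zstack.storage.primary). The existing usage of the class at lines 155-161 remains valid.
storage/src/main/java/org/zstack/storage/primary/CheckHostStorageConnectionMsg.java (1)
1-28: LGTM! The package move of the message class is reasonable. Moving CheckHostStorageConnectionMsg from the Ceph-specific package to the generic primary storage package is the right design decision because:
- The feature now applies to every primary storage type that supports host status checks
- It improves code reusability and modularity
- The class structure follows the message pattern conventions
storage/src/main/java/org/zstack/storage/primary/CheckHostStorageConnectionReply.java (1)
1-6: LGTM! The package move of the reply class is consistent with the message class. The package move of CheckHostStorageConnectionReply matches that of the corresponding message class CheckHostStorageConnectionMsg, as expected.
storage/src/main/java/org/zstack/storage/primary/PrimaryStorageManagerImpl.java (3)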
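12-12: LGTM! The imports and field declaration are correct. The new imports and field declaration meet the requirements:
- While is used for batched asynchronous operations
- NopeWhileDoneCompletion is used as the completion callback
- The refreshPrimaryStorageHostStatusTask field is clearly named, and its Future type makes the task easy to cancel
Also applies to: 39-39, 114-115
1170-1189: The periodic task initialization logic is correct. The task initialization is well implemented:
- It uses synchronized to ensure thread safety
- It correctly cancels the old task to avoid resource leaks
- It reads the interval from the global config, which supports dynamic updates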
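The cancel-then-reschedule pattern described in these three points can be sketched with the JDK scheduler alone. This is an illustration under stated assumptions, not the PR's code: RefreshTaskScheduler is a hypothetical class, and java.util.concurrent.ScheduledExecutorService stands in for whatever periodic-task facility the project uses.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: restarts a periodic task whenever the interval changes.
final class RefreshTaskScheduler {
    // Daemon thread so an idle scheduler never keeps the JVM alive.
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "ps-host-status-refresh");
                t.setDaemon(true);
                return t;
            });
    private ScheduledFuture<?> current;

    // synchronized: concurrent config updates must not leave two tasks running.
    synchronized ScheduledFuture<?> restart(Runnable task, long intervalSeconds) {
        if (intervalSeconds <= 0) {
            throw new IllegalArgumentException("interval must be greater than 0");
        }
        if (current != null) {
            current.cancel(false); // let an in-flight run finish, then stop repeating
        }
        current = executor.scheduleAtFixedRate(task, intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
        return current;
    }

    void shutdown() {
        executor.shutdownNow();
    }

    public static void main(String[] args) {
        RefreshTaskScheduler scheduler = new RefreshTaskScheduler();
        ScheduledFuture<?> first = scheduler.restart(() -> { }, 300);
        ScheduledFuture<?> second = scheduler.restart(() -> { }, 60); // simulated config update
        System.out.println("first cancelled: " + first.isCancelled());
        System.out.println("second cancelled: " + second.isCancelled());
        scheduler.shutdown();
    }
}
```

In this sketch a config-change listener would simply call restart() again with the new interval, so the old schedule is cancelled before the new one begins.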
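1343-1343: LGTM! The periodic task is started at the right time. Starting the periodic task in managementNodeReady() is the right design, because it ensures that:
- Periodic checks start only after the management node is fully ready
- They start after resource initialization (such as the garbage collection task)
- The startup order follows best practice for the system startup flow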
List<CheckHostStorageConnectionMsg> msgs = new ArrayList<>();
for (Map.Entry<String, List<String>> entry : disconnectedHostsByPsUuid.entrySet()) {
    PrimaryStorageVO storageVO = dbf.findByUuid(entry.getKey(), PrimaryStorageVO.class);
    if (!PrimaryStorageType.valueOf(storageVO.getType()).isSupportCheckHostStatus()){
        continue;
    }
    CheckHostStorageConnectionMsg msg = new CheckHostStorageConnectionMsg();
    msg.setPrimaryStorageUuid(entry.getKey());
    msg.setHostUuids(entry.getValue());
    bus.makeTargetServiceIdByResourceUuid(msg, PrimaryStorageConstant.SERVICE_ID, entry.getKey());
    msgs.add(msg);
}
Fix a potential null pointer exception.
dbf.findByUuid at line 1206 may return null, but the code calls storageVO.getType() without a null check, which would cause a NullPointerException.
🔎 Suggested fix
List<CheckHostStorageConnectionMsg> msgs = new ArrayList<>();
for (Map.Entry<String, List<String>> entry : disconnectedHostsByPsUuid.entrySet()) {
PrimaryStorageVO storageVO = dbf.findByUuid(entry.getKey(), PrimaryStorageVO.class);
+ if (storageVO == null) {
+ logger.warn(String.format("Primary storage[uuid:%s] not found, skipping host status check", entry.getKey()));
+ continue;
+ }
if (!PrimaryStorageType.valueOf(storageVO.getType()).isSupportCheckHostStatus()){
continue;
}
CheckHostStorageConnectionMsg msg = new CheckHostStorageConnectionMsg();
msg.setPrimaryStorageUuid(entry.getKey());
msg.setHostUuids(entry.getValue());
bus.makeTargetServiceIdByResourceUuid(msg, PrimaryStorageConstant.SERVICE_ID, entry.getKey());
msgs.add(msg);
}
🤖 Prompt for AI Agents
In
storage/src/main/java/org/zstack/storage/primary/PrimaryStorageManagerImpl.java
around lines 1204 to 1215, dbf.findByUuid(...) can return null so calling
storageVO.getType() may throw a NullPointerException; add a null-check after
findByUuid and skip the entry (or log a warning) if storageVO is null before
accessing its type, then continue with the existing logic for non-null
storageVOs.
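The skip-on-null behaviour requested above, combined with the opt-in type flag, can be sketched without any ZStack dependencies. In the hypothetical CheckTargetSelection class below, a plain Map lookup stands in for dbf.findByUuid (a missing key plays the role of a null result), and a Set of type names stands in for isSupportCheckHostStatus(); all names and sample values are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the selection step; not the PR's implementation.
final class CheckTargetSelection {
    static List<String> selectCheckable(Collection<String> primaryStorageUuids,
                                        Map<String, String> typeByStorageUuid,
                                        Set<String> typesSupportingCheck) {
        List<String> checkable = new ArrayList<>();
        for (String uuid : primaryStorageUuids) {
            String type = typeByStorageUuid.get(uuid);
            if (type == null) {
                // The storage may have been deleted between the query and the lookup.
                System.err.printf("Primary storage[uuid:%s] not found, skipping host status check%n", uuid);
                continue;
            }
            if (!typesSupportingCheck.contains(type)) {
                continue; // support is opt-in per storage type
            }
            checkable.add(uuid);
        }
        return checkable;
    }

    public static void main(String[] args) {
        Map<String, String> types = new HashMap<>();
        types.put("ps-1", "Ceph");
        types.put("ps-3", "LocalStorage");
        Set<String> supporting = new HashSet<>(Arrays.asList("Ceph"));
        System.out.println(selectCheckable(Arrays.asList("ps-1", "ps-gone", "ps-3"), types, supporting));
    }
}
```

Checking for the missing record before reading its type keeps one stale reference from aborting the whole cycle for every other primary storage.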
sync from gitlab !8925