Revert "sched/group: move task group into task_tcb_s to improve perfo… #17564

wangzhi16 · 2025-12-18T13:00:43Z

This reverts commit 29e50ff.

Summary

Placing the main thread and the gourd in the same memory block, and allocating and freeing memory simultaneously, presents the following two problems：

When the main thread creates a child thread and performs a detach operation, there is a possibility that the main thread may have exited, but the main thread's TCB may not have been released.
This could potentially cause the main thread's TCB to be double-freed. The core contradiction in this problem lies in binding the main thread's TCB and the group together. When releasing the main thread's TCB, an additional check is needed to ensure the main thread was the last to leave the group. If this check and the free operation are atomically guaranteed, the logic is sound, and double freeing won't occur. However, this atomicity cannot be completely guaranteed. If other free operations cause a block, problems still arise. For example:

The main thread exits first, executes group_leave to remove itself from the group, executes another free operation (such as nxtask_joindestroy) causing a block, and switches to a child thread.

When the child thread executes group_leave, it discovers it was the last thread to exit, releases the group (the child thread assumes the main thread has already performed realloc, so it only releases the group structure, actually releasing the main thread's TCB along with it), releases its own TCB, and exits.

When switching back to the main thread, during release, it checks that there are no other threads in the group, assumes it was the last thread to exit (both the main thread and the child thread believe they were the last to leave the group, which is the root cause of the problem), and then executes the free operation, resulting in double freeing.

Impact

None

Testing

here are two patches that can reproduce the problem mentioned above.

master...wangzhi16:nuttx:master
apache/nuttx-apps@master...wangzhi16:nuttx-apps:dev

Since this is an intermittent issue, blocking during the free process will free up the CPU. Therefore, we use usleep to create a scenario that is guaranteed to occur.

Before revert:

Configuration that needs to be opened:

CONFIG_SCHED_HAVE_PARENT=y
CONFIG_CANCELLATION_POINTS=y

test on qemu-armv7a:

build:
tools/configure.sh qemu-armv7a:nsh; make -j16
result:

The crash mentioned above was caused by the execution flow described below.

thread_parent ------------------------------------thread_child
detach()
group_leave()
nxsig_cleanup()
free()---->block!
---------------------------------------------------group_leave()
---------------------------------------------------check no member in group, release thread_parent.
---------------------------------------------------exit()
release_tcb()--->double free!!!

after revert:

test on qemu-armv7a:

build:
tools/configure.sh qemu-armv7a:nsh; make -j16
result:

…rmance" This reverts commit 29e50ff. reason: Placing the main thread and the gourd in the same memory block, and allocating and freeing memory simultaneously, presents the following two problems： When the main thread creates a child thread and performs a detach operation, there is a possibility that the main thread may have exited, but the main thread's TCB (Transaction Control Block) may not have been released. This could potentially cause the main thread's TCB to be double-freed. The core contradiction in this problem lies in binding the main thread's TCB (Trust Container Registry) and the group together. When releasing the main thread's TCB, an additional check is needed to ensure the main thread was the last to leave the group. If this check and the free operation are atomically guaranteed, the logic is sound, and double freeing won't occur. However, this atomicity cannot be completely guaranteed. If other free operations cause a block, problems still arise. Signed-off-by: wangzhi16 <wangzhi16@xiaomi.com>

anchao · 2025-12-18T14:38:50Z

please hold this PR I have a better solution to this issue, which avoids fragmented allocation of group requests while also ensuring the release of TCBs on the main thread.

anchao · 2025-12-18T17:24:23Z

@wangzhi16 please review this change:
anchao@68ce1b8

wangzhi16 · 2025-12-19T03:43:37Z

@wangzhi16 please review this change: anchao@68ce1b8

@anchao Hi, there may be used after free.

I understand that your patch modification still binds the group and task tcb together, but the free action only appears in group leave, unlike before when it also appeared in release_tcb.

However, this might cause used after free, because group_leave is also called in the file nuttx/sched/task/task_exithook.c:454, and the tcb will still be operated on after that, even though you have already released the tcb.

Moreover, this patch does not solve the first problem I mentioned.

When the main thread creates a child thread and performs a detach operation, there is a possibility that the main thread may have exited, but the main thread's TCB may not have been released.

By the way, if the main thread neither executes join nor detach, and exits before the child thread, a situation will occur where the main thread exits but the tcb is not released.

please check again, thanks.

anchao · 2025-12-19T05:29:45Z

@wangzhi16 Thanks for your excellent analysis, I'll get back to you tonight.

Regarding the resource leak issue you mentioned this morning, it seems to be a bug in the message queue's reference counter maintenance. I will resolve it tonight.

nsh> free
      total       used       free    maxused    maxfree  nused  nfree name
   67108864    1803456   65305408    2501008   65084784     69      3 Umem
nsh> echo leak > /proc/memdump
Memdump leak
   PID        Size Overhead    Sequence            Address Backtrace
    61         328      192         262     0x7042c2bee000 0x0000000040076f11 0x000000004002849f 0x000000004002935a 0x000000004002551e 0x000000004004a8f2 0x000000004004ad07 0x00000000400a8e32 0x00000000400a9116 0x00000000400a926f 0x000000004009c64c 0x00000000400b6a4f 0x00000000400bb07f 0x000000004002b482 0x000000004014f098 
    61         328      192         263     0x7042c2bee148 0x0000000040076f11 0x000000004002849f 0x000000004002935a 0x000000004002551e 0x000000004004a8f2 0x000000004004ad07 0x00000000400a8e32 0x00000000400a9116 0x00000000400a926f 0x000000004009c64c 0x00000000400b6a4f 0x00000000400bb07f 0x000000004002b482 0x000000004014f098 
  Total Blks  Total Size
           2         656
nsh> poweroff

$ addr2line -fe nuttx 0x7042c2bee148 0x0000000040076f11 0x000000004002849f 0x000000004002935a 0x000000004002551e 0x000000004004a8f2 0x000000004004ad07 0x00000000400a8e32 0x00000000400a9116 0x00000000400a926f 0x000000004009c64c 0x00000000400b6a4f 0x00000000400bb07f 0x000000004002b482 0x000000004014f098
??
??:0
sched_backtrace
/home/archer/code/nuttx/n8/nuttx/sched/sched/sched_backtrace.c:104
mm_malloc
/home/archer/code/nuttx/n8/nuttx/mm/mm_heap/mm_malloc.c:335 (discriminator 5)
mm_zalloc
/home/archer/code/nuttx/n8/nuttx/mm/mm_heap/mm_zalloc.c:47
zalloc
/home/archer/code/nuttx/n8/nuttx/mm/umm_heap/umm_zalloc.c:72
inode_alloc
/home/archer/code/nuttx/n8/nuttx/fs/inode/fs_inodereserve.c:87
inode_reserve
/home/archer/code/nuttx/n8/nuttx/fs/inode/fs_inodereserve.c:230
file_mq_vopen
/home/archer/code/nuttx/n8/nuttx/fs/mqueue/mq_open.c:290
nxmq_vopen
/home/archer/code/nuttx/n8/nuttx/fs/mqueue/mq_open.c:359
mq_open
/home/archer/code/nuttx/n8/nuttx/fs/mqueue/mq_open.c:505
receiver_thread
/home/archer/code/nuttx/n8/apps/testing/ostest/mqueue.c:174
pthread_startup
/home/archer/code/nuttx/n8/nuttx/libs/libc/pthread/pthread_create.c:61
pthread_start
/home/archer/code/nuttx/n8/nuttx/sched/pthread/pthread_create.c:142
pre_start
/home/archer/code/nuttx/n8/nuttx/arch/sim/src/sim/sim_initialstate.c:54
??
??:0

acassis · 2025-12-20T12:46:40Z

@anchao do you have a new fix? The proposal anchao@68ce1b8 seems a little bit awkward using tcb and group offset operations, maybe you can create are struct type where the first member is a "struct task_group_s" and the second is a "struct tcb_s", then you do a casting and point to the right struct. Please don't take it as a critique, this is just a suggestion to make the code more readable.

Should we wait for a proper fix instead of merging this PR here?

xiaoxiang781216 · 2025-12-21T08:16:22Z

@anchao do you have a new fix? The proposal anchao@68ce1b8 seems a little bit awkward using tcb and group offset operations, maybe you can create are struct type where the first member is a "struct task_group_s" and the second is a "struct tcb_s", then you do a casting and point to the right struct. Please don't take it as a critique, this is just a suggestion to make the code more readable.

It's wrong to allocate tcb_s and task_group_s in one kmm_malloc from design point since each has the different lile cycle, this type of optimization:

make the code ugly and hard to maintain
waste memory after main thread exit, but other thread keep running

Should we wait for a proper fix instead of merging this PR here?

it's better to revert the optimization directly.

anchao · 2025-12-21T12:17:18Z

Moreover, this patch does not solve the first problem I mentioned.

When the main thread creates a child thread and performs a detach operation, there is a possibility that the main thread may have exited, but the main thread's TCB may not have been released.

By the way, if the main thread neither executes join nor detach, and exits before the child thread, a situation will occur where the main thread exits but the tcb is not released.

please check again, thanks.

@wangzhi16

I have fixed the use after free issue; please review it again. sched/task: Relocate the group ahead of the tcb to ensure the group drop logic #17578
Memleak is not caused by my commit. You'll find that the problem persists even after reverting my commit. It's due to Xiaomi's changes to the message queue.

anchao · 2025-12-21T12:19:51Z

@anchao do you have a new fix? The proposal anchao@68ce1b8 seems a little bit awkward using tcb and group offset operations, maybe you can create are struct type where the first member is a "struct task_group_s" and the second is a "struct tcb_s", then you do a casting and point to the right struct. Please don't take it as a critique, this is just a suggestion to make the code more readable.

It's wrong to allocate tcb_s and task_group_s in one kmm_malloc from design point since each has the different lile cycle, this type of optimization:

make the code ugly and hard to maintain

waste memory after main thread exit, but other thread keep running

Should we wait for a proper fix instead of merging this PR here?

it's better to revert the optimization directly.

The new commit will remove redundant checks during task exit. sched/task: Relocate the group ahead of the tcb to ensure the group drop logic #17578
If you revert this commit, then please provide a method that allows you to request tasks via TCB without accessing the allocator.

wangzhi16 requested review from Donny9, GUIDINGLI, anchao, gustavonihei, jerpelea, masayuki2009, pussuw, xiaoxiang781216 and yf13 as code owners December 18, 2025 13:00

github-actions bot added Area: Memory Management Memory Management issues Area: OS Components OS Components issues Size: M The size of the change in this PR is medium labels Dec 18, 2025

wangzhi16 force-pushed the dev branch from f4bb04d to 8cac62f Compare December 18, 2025 13:09

wangzhi16 force-pushed the dev branch from 8cac62f to 2341698 Compare December 18, 2025 13:12

xiaoxiang781216 linked an issue Dec 18, 2025 that may be closed by this pull request

[BUG] Premature exit of a process will cause the group to be released ahead of schedule. #17418

Closed

1 task

xiaoxiang781216 approved these changes Dec 18, 2025

View reviewed changes

acassis requested review from linguini1 and raiden00pl December 18, 2025 18:18

acassis approved these changes Dec 18, 2025

View reviewed changes

jerpelea approved these changes Dec 19, 2025

View reviewed changes

acassis merged commit 31e562b into apache:master Dec 21, 2025
71 of 114 checks passed

anchao mentioned this pull request Dec 22, 2025

[BUG] Premature exit of a process will cause the group to be released ahead of schedule. #17418

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Revert "sched/group: move task group into task_tcb_s to improve perfo… #17564

Revert "sched/group: move task group into task_tcb_s to improve perfo… #17564

Uh oh!

wangzhi16 commented Dec 18, 2025 •

edited

Loading

Uh oh!

anchao commented Dec 18, 2025

Uh oh!

anchao commented Dec 18, 2025

Uh oh!

wangzhi16 commented Dec 19, 2025 •

edited

Loading

Uh oh!

anchao commented Dec 19, 2025

Uh oh!

acassis commented Dec 20, 2025

Uh oh!

xiaoxiang781216 commented Dec 21, 2025 •

edited

Loading

Uh oh!

Uh oh!

anchao commented Dec 21, 2025

Uh oh!

anchao commented Dec 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Revert "sched/group: move task group into task_tcb_s to improve perfo… #17564

Revert "sched/group: move task group into task_tcb_s to improve perfo… #17564

Uh oh!

Conversation

wangzhi16 commented Dec 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Impact

Testing

Before revert:

test on qemu-armv7a:

after revert:

test on qemu-armv7a:

Uh oh!

anchao commented Dec 18, 2025

Uh oh!

anchao commented Dec 18, 2025

Uh oh!

wangzhi16 commented Dec 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

anchao commented Dec 19, 2025

Uh oh!

acassis commented Dec 20, 2025

Uh oh!

xiaoxiang781216 commented Dec 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

anchao commented Dec 21, 2025

Uh oh!

anchao commented Dec 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

wangzhi16 commented Dec 18, 2025 •

edited

Loading

wangzhi16 commented Dec 19, 2025 •

edited

Loading

xiaoxiang781216 commented Dec 21, 2025 •

edited

Loading