Skip to content

Conversation

@wangzhi16
Copy link
Contributor

@wangzhi16 wangzhi16 commented Dec 18, 2025

This reverts commit 29e50ff.

Summary

Placing the main thread and the gourd in the same memory block, and allocating and freeing memory simultaneously, presents the following two problems:

  1. When the main thread creates a child thread and performs a detach operation, there is a possibility that the main thread may have exited, but the main thread's TCB may not have been released.

  2. This could potentially cause the main thread's TCB to be double-freed. The core contradiction in this problem lies in binding the main thread's TCB and the group together. When releasing the main thread's TCB, an additional check is needed to ensure the main thread was the last to leave the group. If this check and the free operation are atomically guaranteed, the logic is sound, and double freeing won't occur. However, this atomicity cannot be completely guaranteed. If other free operations cause a block, problems still arise. For example:

  1. The main thread exits first, executes group_leave to remove itself from the group, executes another free operation (such as nxtask_joindestroy) causing a block, and switches to a child thread.
  2. When the child thread executes group_leave, it discovers it was the last thread to exit, releases the group (the child thread assumes the main thread has already performed realloc, so it only releases the group structure, actually releasing the main thread's TCB along with it), releases its own TCB, and exits.
  3. When switching back to the main thread, during release, it checks that there are no other threads in the group, assumes it was the last thread to exit (both the main thread and the child thread believe they were the last to leave the group, which is the root cause of the problem), and then executes the free operation, resulting in double freeing.

Impact

None

Testing

here are two patches that can reproduce the problem mentioned above.

master...wangzhi16:nuttx:master
apache/nuttx-apps@master...wangzhi16:nuttx-apps:dev

Since this is an intermittent issue, blocking during the free process will free up the CPU. Therefore, we use usleep to create a scenario that is guaranteed to occur.

Before revert:

Configuration that needs to be opened:

CONFIG_SCHED_HAVE_PARENT=y
CONFIG_CANCELLATION_POINTS=y

test on qemu-armv7a:

build:
tools/configure.sh qemu-armv7a:nsh; make -j16
result:
image

The crash mentioned above was caused by the execution flow described below.

thread_parent ------------------------------------thread_child
detach()
group_leave()
nxsig_cleanup()
free()---->block!
---------------------------------------------------group_leave()
---------------------------------------------------check no member in group, release thread_parent.
---------------------------------------------------exit()
release_tcb()--->double free!!!

after revert:

test on qemu-armv7a:

build:
tools/configure.sh qemu-armv7a:nsh; make -j16
result:
image

@github-actions github-actions bot added Area: Memory Management Memory Management issues Area: OS Components OS Components issues Size: M The size of the change in this PR is medium labels Dec 18, 2025
…rmance"

This reverts commit 29e50ff.

reason:
Placing the main thread and the gourd in the same memory block, and allocating and freeing memory simultaneously, presents the following two problems:

When the main thread creates a child thread and performs a detach operation, there is a possibility that the main thread may have exited, but the main thread's TCB (Transaction Control Block) may not have been released.

This could potentially cause the main thread's TCB to be double-freed. The core contradiction in this problem lies in binding the main thread's TCB (Trust Container Registry) and the group together. When releasing the main thread's TCB, an additional check is needed to ensure the main thread was the last to leave the group. If this check and the free operation are atomically guaranteed, the logic is sound, and double freeing won't occur. However, this atomicity cannot be completely guaranteed. If other free operations cause a block, problems still arise.

Signed-off-by: wangzhi16 <wangzhi16@xiaomi.com>
@anchao
Copy link
Contributor

anchao commented Dec 18, 2025

please hold this PR I have a better solution to this issue, which avoids fragmented allocation of group requests while also ensuring the release of TCBs on the main thread.

@anchao
Copy link
Contributor

anchao commented Dec 18, 2025

@wangzhi16 please review this change:
anchao@68ce1b8

@wangzhi16
Copy link
Contributor Author

wangzhi16 commented Dec 19, 2025

@wangzhi16 please review this change: anchao@68ce1b8

@anchao Hi, there may be used after free.

I understand that your patch modification still binds the group and task tcb together, but the free action only appears in group leave, unlike before when it also appeared in release_tcb.

image

However, this might cause used after free, because group_leave is also called in the file nuttx/sched/task/task_exithook.c:454, and the tcb will still be operated on after that, even though you have already released the tcb.

Moreover, this patch does not solve the first problem I mentioned.

When the main thread creates a child thread and performs a detach operation, there is a possibility that the main thread may have exited, but the main thread's TCB may not have been released.

By the way, if the main thread neither executes join nor detach, and exits before the child thread, a situation will occur where the main thread exits but the tcb is not released.

please check again, thanks.

@anchao
Copy link
Contributor

anchao commented Dec 19, 2025

@wangzhi16 Thanks for your excellent analysis, I'll get back to you tonight.

Regarding the resource leak issue you mentioned this morning, it seems to be a bug in the message queue's reference counter maintenance. I will resolve it tonight.

nsh> free
      total       used       free    maxused    maxfree  nused  nfree name
   67108864    1803456   65305408    2501008   65084784     69      3 Umem
nsh> echo leak > /proc/memdump
Memdump leak
   PID        Size Overhead    Sequence            Address Backtrace
    61         328      192         262     0x7042c2bee000 0x0000000040076f11 0x000000004002849f 0x000000004002935a 0x000000004002551e 0x000000004004a8f2 0x000000004004ad07 0x00000000400a8e32 0x00000000400a9116 0x00000000400a926f 0x000000004009c64c 0x00000000400b6a4f 0x00000000400bb07f 0x000000004002b482 0x000000004014f098 
    61         328      192         263     0x7042c2bee148 0x0000000040076f11 0x000000004002849f 0x000000004002935a 0x000000004002551e 0x000000004004a8f2 0x000000004004ad07 0x00000000400a8e32 0x00000000400a9116 0x00000000400a926f 0x000000004009c64c 0x00000000400b6a4f 0x00000000400bb07f 0x000000004002b482 0x000000004014f098 
  Total Blks  Total Size
           2         656
nsh> poweroff

$ addr2line -fe nuttx 0x7042c2bee148 0x0000000040076f11 0x000000004002849f 0x000000004002935a 0x000000004002551e 0x000000004004a8f2 0x000000004004ad07 0x00000000400a8e32 0x00000000400a9116 0x00000000400a926f 0x000000004009c64c 0x00000000400b6a4f 0x00000000400bb07f 0x000000004002b482 0x000000004014f098
??
??:0
sched_backtrace
/home/archer/code/nuttx/n8/nuttx/sched/sched/sched_backtrace.c:104
mm_malloc
/home/archer/code/nuttx/n8/nuttx/mm/mm_heap/mm_malloc.c:335 (discriminator 5)
mm_zalloc
/home/archer/code/nuttx/n8/nuttx/mm/mm_heap/mm_zalloc.c:47
zalloc
/home/archer/code/nuttx/n8/nuttx/mm/umm_heap/umm_zalloc.c:72
inode_alloc
/home/archer/code/nuttx/n8/nuttx/fs/inode/fs_inodereserve.c:87
inode_reserve
/home/archer/code/nuttx/n8/nuttx/fs/inode/fs_inodereserve.c:230
file_mq_vopen
/home/archer/code/nuttx/n8/nuttx/fs/mqueue/mq_open.c:290
nxmq_vopen
/home/archer/code/nuttx/n8/nuttx/fs/mqueue/mq_open.c:359
mq_open
/home/archer/code/nuttx/n8/nuttx/fs/mqueue/mq_open.c:505
receiver_thread
/home/archer/code/nuttx/n8/apps/testing/ostest/mqueue.c:174
pthread_startup
/home/archer/code/nuttx/n8/nuttx/libs/libc/pthread/pthread_create.c:61
pthread_start
/home/archer/code/nuttx/n8/nuttx/sched/pthread/pthread_create.c:142
pre_start
/home/archer/code/nuttx/n8/nuttx/arch/sim/src/sim/sim_initialstate.c:54
??
??:0

@acassis
Copy link
Contributor

acassis commented Dec 20, 2025

@anchao do you have a new fix? The proposal anchao@68ce1b8 seems a little bit awkward using tcb and group offset operations, maybe you can create are struct type where the first member is a "struct task_group_s" and the second is a "struct tcb_s", then you do a casting and point to the right struct. Please don't take it as a critique, this is just a suggestion to make the code more readable.

Should we wait for a proper fix instead of merging this PR here?

@xiaoxiang781216
Copy link
Contributor

xiaoxiang781216 commented Dec 21, 2025

@anchao do you have a new fix? The proposal anchao@68ce1b8 seems a little bit awkward using tcb and group offset operations, maybe you can create are struct type where the first member is a "struct task_group_s" and the second is a "struct tcb_s", then you do a casting and point to the right struct. Please don't take it as a critique, this is just a suggestion to make the code more readable.

It's wrong to allocate tcb_s and task_group_s in one kmm_malloc from design point since each has the different lile cycle, this type of optimization:

  1. make the code ugly and hard to maintain
  2. waste memory after main thread exit, but other thread keep running

Should we wait for a proper fix instead of merging this PR here?

it's better to revert the optimization directly.

@acassis acassis merged commit 31e562b into apache:master Dec 21, 2025
71 of 114 checks passed
@anchao
Copy link
Contributor

anchao commented Dec 21, 2025

Moreover, this patch does not solve the first problem I mentioned.

When the main thread creates a child thread and performs a detach operation, there is a possibility that the main thread may have exited, but the main thread's TCB may not have been released.

By the way, if the main thread neither executes join nor detach, and exits before the child thread, a situation will occur where the main thread exits but the tcb is not released.

please check again, thanks.

@wangzhi16

  1. I have fixed the use after free issue; please review it again. sched/task: Relocate the group ahead of the tcb to ensure the group drop logic #17578
  2. Memleak is not caused by my commit. You'll find that the problem persists even after reverting my commit. It's due to Xiaomi's changes to the message queue.

@anchao
Copy link
Contributor

anchao commented Dec 21, 2025

@anchao do you have a new fix? The proposal anchao@68ce1b8 seems a little bit awkward using tcb and group offset operations, maybe you can create are struct type where the first member is a "struct task_group_s" and the second is a "struct tcb_s", then you do a casting and point to the right struct. Please don't take it as a critique, this is just a suggestion to make the code more readable.

It's wrong to allocate tcb_s and task_group_s in one kmm_malloc from design point since each has the different lile cycle, this type of optimization:

  1. make the code ugly and hard to maintain
  2. waste memory after main thread exit, but other thread keep running

Should we wait for a proper fix instead of merging this PR here?

it's better to revert the optimization directly.

  1. The new commit will remove redundant checks during task exit. sched/task: Relocate the group ahead of the tcb to ensure the group drop logic #17578
  2. If you revert this commit, then please provide a method that allows you to request tasks via TCB without accessing the allocator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Area: Memory Management Memory Management issues Area: OS Components OS Components issues Size: M The size of the change in this PR is medium

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Premature exit of a process will cause the group to be released ahead of schedule.

5 participants