
Conversation

@arihant2math (Collaborator) commented Dec 25, 2025

Exposes better APIs for collecting and querying tasks across CPUs. Scheduling is round-robin for now; eventually we'd want work stealing from other cores and a quick way to get the current queue length for each core. Also, this forces timer interrupts and extends them to non-main cores.

Currently fixes the bug where background tasks force a hang.

Includes improvements to the EEVDF implementation which fix many bugs.

@arihant2math arihant2math force-pushed the multicore-sched branch 2 times, most recently from b620a40 to a63ce3b Compare December 25, 2025 08:39
@arihant2math arihant2math changed the title schedule in round-robin fashion Schedule onto multiple cores in round-robin fashion Dec 25, 2025
@arihant2math arihant2math marked this pull request as draft December 25, 2025 09:00
src/sched/mod.rs Outdated
}

// TODO: arbitrary cap.
pub static SCHED_STATES: [SyncUnsafeCell<Option<SchedState>>; 128] =
Owner

I presume you’ve ditched the per_cpu! macro here since it precludes access to state held by other cores?

I’ve been thinking about this for a while and wondering what the best approach is. I can see two options:

  • What you’ve done here: i.e., a static array which allows arbitrary access from any other core. The main downside I see is that everything must be locked, since it can be accessed by another core. I see that you’ve used SyncUnsafeCell; in that case, I think functions such as get_sched_state_by_id and with_cpu_sched_state are inherently unsafe, as they permit lock-free access to SchedState across CPUs.

  • Accessing/modifying CPU state via the CPUMessenger. In this approach, we fire off an IPI to the target CPU, which performs the read/modify locally (lock-free) and returns the result.

I think the second option is a good compromise. Updating or accessing the state of another CPU isn’t done nearly as often as accessing the current CPU’s state. With the first option, we’d need locking for every scheduler state access or update (even something as frequent as retrieving the current task). Given that, my vote would be to try the CPUMessenger approach. I’m happy to give that a go if you’d like.
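
If we do stick with the static array in the meantime, I think the cross-CPU accessors should at least be spelled `unsafe` with an explicit contract. A rough sketch (the signature is guessed from the names above, not what's in the tree):

```rust
/// # Safety
///
/// The caller must guarantee that CPU `cpu_id` is not concurrently mutating
/// its `SchedState` (e.g. the access is mediated by the owning CPU, or that
/// CPU is known to be parked). Otherwise this is a data race.
pub unsafe fn get_sched_state_by_id(cpu_id: usize) -> Option<&'static SchedState> {
    // SAFETY: upheld by the caller per the contract above.
    unsafe { (*SCHED_STATES[cpu_id].get()).as_ref() }
}
```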

Owner

For an example of this, I've just pushed this branch which pushes tasks onto other CPUs via IPIs with the CPUMessenger. When I run usertest, I'm getting deadlock at the moment:

bash-5.3# ./bin/usertest 
Running userspace tests ...
[   35.35031303] interrupts::cpu_messenger: Putting task on CPU 1
Testing sync syscall ... OK

I suspect this is the child process that finished on CPU1 and the waker setting the task on CPU0 back to runnable, but CPU0 is stuck in the idle task. I need to implement something such that a Waker, when signalling to another CPU, causes a reschedule IPI message.

Collaborator Author

Yes. I've intentionally left out cross-CPU access beyond task insertion. To my knowledge get_sched_state_by_id et al. should be safe, since I'm never overwriting any item in the array, and the interior mutability visible to other cores is limited to run_queue, which is protected by a SpinLock.

A better approach could be to Arc the runqueue and just share that.

On the topic of using messengers, how fast are they?

Collaborator Author

I think the optimal cross-CPU task balancing method is work-stealing rather than work-giving; how would that work with messengers? It seems like you'd need to query for the number of tasks first, then request a steal.

Owner

> Yes. I've intentionally left out cross-CPU access beyond task insertion. To my knowledge get_sched_state_by_id et al. should be safe, since I'm never overwriting any item in the array, and the interior mutability visible to other cores is limited to run_queue, which is protected by a SpinLock.

I disagree. I see a code path where the procfs module can call find_task_by_descriptor. That call can originate from any task on any CPU, which means it will very likely result in cross-CPU data access (i.e. accessing the run_queue of another CPU). That code iterates through all active slots and walks the run_queues until it finds the task.

As far as I can see, this is highly unsafe and will lead to data races. SchedState itself is not protected by a mutex, nor are any of its fields.

If we can guarantee that the only core accessing a given piece of data is the CPU that owns it, then we can safely access it lock-free. That is precisely the guarantee provided by per_cpu!: there is no way to access data that does not belong to the current CPU. In this case, however, that guarantee does not hold. You need a different mechanism here—this is exactly the problem the CPUMessenger is trying to solve.

> A better approach could be to Arc the runqueue and just share that.

Possibly. I did consider wrapping all of this state in a large Arc, but that effectively means locking everything. Any on-CPU operation would then require taking a lock, which I think introduces too much overhead for the scheduler fast path.

> On the topic of using messengers, how fast are they?

IPIs are fast. You can hit additional latency if interrupts are disabled on the target CPU, but Linux uses them extensively in practice. For example, on my machine:

$ cat /proc/interrupts

[...]
 IWI:     103089     111648      51802      50290    3480843      50846      37640     101369   IRQ work interrupts
 RTR:          0          0          0          0          0          0          0          0   APIC ICR read retries
 RES:      53258      64925      39159      37986     143873      61857      38363      49915   Rescheduling interrupts
 CAL:    1434334    1127025     974039     838055     844695     993721     932218     876858   Function call interrupts
 TLB:     684503     742799     728184     629270     646185     829325     783279     722283   TLB shootdowns
 TRM:          0          0          0          0          0          0          0          0   Thermal event interrupts
 THR:          0          0          0          0          0          0          0          0   Threshold APIC interrupts
 DFR:          0          0          0          0          0          0          0          0   Deferred Error APIC interrupts
 MCE:          0          0          0          0          0          0          0          0   Machine check exceptions
 MCP:         66         65         65         65         65         65         65         65   Machine check polls
 ERR:          0

These are the various IPIs Linux uses, and as you can see, they are heavily exercised.

I've just pushed some additional code to the branch I mentioned earlier; this resolves the deadlock that I had and allows me to get through nearly all of usertest without any kind of preemption. I'm just using IPIs to wake up other CPUs when needed.

> I think the optimal cross-CPU task balancing method is work-stealing rather than work-giving; how would that work with messengers? It seems like you'd need to query for the number of tasks first, then request a steal.

That’s a fair point, and I agree that work-stealing is generally the better default for schedulers.

With messengers, the model shifts slightly compared to shared-memory designs, but it still maps cleanly. The key idea is that ownership never changes implicitly: the stealing CPU never directly inspects or mutates another CPU’s run queue. Instead, it asks the owning CPU to do the inspection and transfer on its behalf. Concretely, the stealing CPU raises a message to a target CPU with the necessary parameters, and the target CPU decides what (if any) tasks it can push back.
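
As a rough sketch of that flow (only `message_cpu` and `Message::PutTask` exist in this PR; `StealRequest` and `steal_candidates` are hypothetical):

```rust
// On the under-loaded CPU: ask a victim to hand some work back.
fn request_steal(victim: usize) {
    // Hypothetical message variant carrying the requester's CPU id.
    message_cpu(victim, Message::StealRequest { requester: ArchImpl::id() })
        .expect("Failed to send steal request");
}

// On the victim CPU, inside its CPUMessenger handler: it inspects its own
// run queue locally (lock-free, since it owns that state) and pushes any
// surplus tasks back through the existing PutTask path.
fn handle_steal_request(requester: usize) {
    for task in steal_candidates() {
        message_cpu(requester, Message::PutTask(task))
            .expect("Failed to push stolen task");
    }
}
```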

Collaborator Author

Alright, I think we can worry about the work-stealing implementation later; I'll switch what I have currently to use messengers.

@arihant2math arihant2math force-pushed the multicore-sched branch 2 times, most recently from 5c890be to a25bb55 Compare December 25, 2025 21:04
@arihant2math arihant2math marked this pull request as ready for review December 25, 2025 21:04
@hexagonal-sun (Owner) commented Dec 25, 2025

I've just tried to run the usertest and it looks like you're getting the same problem as I am on my cpu-messenger-task-migration branch:

Testing rust threads ... OK
[    5.5169154] moss: Kernel panicked at src/arch/arm64/memory/fault.rs:121:17: Page fault defered error, SIGBUS on process

I need to investigate what's going on there.

@arihant2math arihant2math force-pushed the multicore-sched branch 2 times, most recently from 1e3cf44 to bc729dc Compare December 26, 2025 05:47
@hexagonal-sun (Owner) left a comment

Looks good, some minor comments - nothing major. Other than that, happy to merge.

It might be worth creating an issue for per-CPU driver initialisation logic for per-CPU devices (the armv8 timer being the primary one). That would decouple that driver from the GIC code.

}
}

// TODO: arbitrary cap.
Owner

Redundant comment now.

sched_state.advance_vclock(now_inst);

let next_task = sched_state.find_next_runnable_task();
// if previous_task.tid != next_task.tid {
Owner

I presume this is useful debug code?

It'd be a shame to remove such code when it's useful for debugging. Perhaps we should make use of log's release_max_level_info feature flag or similar, so that useful code isn't removed.
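
For example (the exact message and `Debug` formatting here are illustrative), with the `release_max_level_info` feature enabled for the `log` crate in Cargo.toml, the check could stay in the tree as a `trace!`:

```rust
// With `log = { version = "0.4", features = ["release_max_level_info"] }`,
// trace!/debug! calls are compiled out of release builds entirely, so this
// costs nothing outside debug builds.
if previous_task.tid != next_task.tid {
    log::trace!("sched: switching {:?} -> {:?}", previous_task.tid, next_task.tid);
}
```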

src/sched/mod.rs Outdated

pub fn insert_task_cross_cpu(task: Arc<Task>) {
    let cpu = get_next_cpu();
    message_cpu(cpu.value(), Message::PutTask(task)).expect("Failed to send task to CPU");
Owner

Could slightly optimise here to use insert_task if we're inserting to the current CPU.
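
Something along these lines (assuming `get_next_cpu()`'s value and `ArchImpl::id()` are directly comparable):

```rust
pub fn insert_task_cross_cpu(task: Arc<Task>) {
    let cpu = get_next_cpu();
    if cpu.value() == ArchImpl::id() {
        // Local fast path: no need to round-trip through an IPI.
        insert_task(task);
    } else {
        message_cpu(cpu.value(), Message::PutTask(task))
            .expect("Failed to send task to CPU");
    }
}
```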

.activate();

*init_task.state.lock_save_irq() = TaskState::Running;
SCHED_STATE.borrow_mut().running_task = Some(idle_task.clone());
Owner

Don't we want init to be the first task executed?

@arihant2math (Collaborator Author)

FYI still debugging some weird crashes with usertest. Also there's a lot of intermittent hanging.

@hexagonal-sun (Owner)

I found 50e5b0b helped a lot with the intermittent hanging. Note that you also need the prior commit to the interrupt manager, which helps prevent a deadlock there.

There is no need to store a separate inner struct with the interrupt
manager; refactor that.

Also, reduce the amount of locking when servicing an interrupt.
Currently we keep the whole interrupt manager locked while servicing an
interrupt; it should be kept unlocked while the ISR is called.
@hexagonal-sun (Owner) commented Dec 26, 2025

I’m seeing some lock-ups due to the interrupt manager not being able to handle interrupts reentrantly. That should be fixed if you cherry-pick 6ae2375.

I'm also seeing a lot of lock-ups in the allocator:

#0  spinning_top::spinlock::{impl#1}::lock (self=0xffff800001213170 <moss::arch::arm64::memory::HEAP_ALLOCATOR>) at src/spinlock.rs:57
#1  0xffff800000164f94 in lock_api::mutex::Mutex<spinning_top::spinlock::RawSpinlock, linked_list_allocator::Heap>::lock<spinning_top::spinlock::RawSpinlock, linked_list_allocator::Heap> (self=0xffff800001213170 <moss::arch::arm64::memory::HEAP_ALLOCATOR>)
    at /home/matthew/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/lock_api-0.4.14/src/mutex.rs:213
#2  0xffff800000166144 in linked_list_allocator::{impl#4}::alloc (self=0xffff800001213170 <moss::arch::arm64::memory::HEAP_ALLOCATOR>, layout=...)
    at src/lib.rs:336
#3  0xffff800000151088 in moss::arch::arm64::memory::_::__rust_alloc (size=24, align=8) at src/arch/arm64/memory/mod.rs:21
#4  0xffff800000119dc8 in alloc::alloc::alloc (layout=...)
    at /home/matthew/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:95
#5  alloc::alloc::Global::alloc_impl (self=0x1, layout=..., zeroed=false)
    at /home/matthew/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:190
#6  0xffff800000119f04 in alloc::alloc::{impl#1}::allocate (layout=...)
    at /home/matthew/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:251
#7  alloc::alloc::exchange_malloc (size=24, align=8)
    at /home/matthew/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:352
#8  0xffff8000000e4790 in alloc::boxed::{impl#0}::new<moss::drivers::interrupts::arm_gic_v3::ArmGicV3InterruptContext> (x=...)
    at /home/matthew/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:265
#9  moss::drivers::interrupts::arm_gic_v3::{impl#5}::read_active_interrupt (self=0xffffb00000000ce0) at src/drivers/interrupts/arm_gic_v3.rs:395
#10 0xffff800000066360 in moss::interrupts::InterruptManager::get_active_handler (self=0xffffb00000000d08) at src/interrupts/mod.rs:152
#11 0xffff800000066140 in moss::interrupts::InterruptManager::handle_interrupt (self=0xffffb00000000d08) at src/interrupts/mod.rs:161
#12 0xffff8000000e0a2c in moss::arch::arm64::exceptions::el1_irq_spx (state=0xffffb8000001a8d0) at src/arch/arm64/exceptions/mod.rs:125
#13 0xffffe000000002f4 in ?? ()

This makes total sense: the spinlock used in the allocator does not disable interrupts. Therefore, when an ISR is running, the heap spinlock may already be held. If an allocation occurs during the ISR, this will deadlock.

The `LockedHeap` type provided by `linked_list_allocator` doesn't disable
interrupts when modifying the heap. This can cause an allocation to
deadlock when it occurs during an ISR.

Fix this by wrapping the `Heap` in our interrupt-aware spinlock.
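
A minimal sketch of that wrapper, assuming our interrupt-aware `SpinLock` exposes `lock_save_irq()` as it does elsewhere (the `IrqSafeHeap` name is illustrative):

```rust
use core::alloc::{GlobalAlloc, Layout};
use core::ptr::NonNull;
use linked_list_allocator::Heap;

// Illustrative replacement for linked_list_allocator's LockedHeap, whose
// internal spinlock does not disable interrupts.
pub struct IrqSafeHeap(SpinLock<Heap>);

unsafe impl GlobalAlloc for IrqSafeHeap {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // lock_save_irq() masks interrupts for the critical section, so an
        // ISR can never run while this context holds the heap lock.
        self.0
            .lock_save_irq()
            .allocate_first_fit(layout)
            .map(|p| p.as_ptr())
            .unwrap_or(core::ptr::null_mut())
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        if let Some(nn) = NonNull::new(ptr) {
            self.0.lock_save_irq().deallocate(nn, layout);
        }
    }
}
```
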
@hexagonal-sun (Owner) commented Dec 26, 2025

Okay, I've just pushed fixes for the heap allocator and the interrupt manager to this PR which should help. I now get a successful boot to bash pretty much every time, which means I think we've fixed the interrupt locking issues. I'm now hitting:

[    0.717370] drivers::interrupts::arm_gic_v3: GICv3: CPU interface for core 2 enabled.
[    0.718020] arch::arm64::boot::secondary: CPU 2 online.
[    0.719852] drivers::interrupts::arm_gic_v3: GICv3: Redistributor for core 3 (MPIDR=0x80000003) is awake.
[    0.720558] drivers::interrupts::arm_gic_v3: GICv3: CPU interface for core 3 enabled.
[    0.721818] arch::arm64::boot::secondary: CPU 3 online.
[    0.725946] interrupts::cpu_messenger: CPU 1 recieved ping from CPU 0
bash-5.3# [    1.1196409] moss: Kernel panicked at src/sched/mod.rs:335:13: assertion `left == right` failed
  left: Sleeping
 right: Runnable

Which looks like a scheduler-related issue. This was an incorrect assertion check which I added to the sched code. I've pushed a commit to remove it.

When switching tasks, we may well be switching away from a task which is
going to `Sleep`. Therefore the check

```rust
 debug_assert_eq!(*prev_task.state.lock_save_irq(), TaskState::Runnable);
```

is incorrect.
@arihant2math (Collaborator Author)

Seems like I can spawn 3 `cat /dev/zero &`s, and then it starts to slow down. This is fine for now.

However usertest still doesn't run to completion. I think there might be a bug with respect to cross-CPU task completion.

@arihant2math (Collaborator Author)

Also, interrupts firing while the scheduler-state borrow is held is turning out to be a problem (i.e. insert_task is running and then the timer interrupt strikes). This seems to cause hang-ups a lot.

/// Nothing, but the CPU context will be set to the next runnable task. See
/// `userspace_return` for how this is invoked.
fn schedule() {
    if SCHED_STATE.try_borrow_mut().is_none() {
Collaborator Author

This is my "fix" for that, but it just returns. Sometimes this happens instantly, causing the idle task to be executed ad nauseam.
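
One way to avoid the collision entirely would be to mask interrupts for the duration of any borrow taken outside interrupt context, so the timer ISR can never observe the borrow as held. A sketch, with the irq save/restore helpers as hypothetical stand-ins for whatever ArchImpl exposes:

```rust
// Hypothetical helper: take the scheduler-state borrow with interrupts
// masked, so the timer interrupt cannot fire while the borrow is live.
fn with_sched_state<R>(f: impl FnOnce(&mut SchedState) -> R) -> R {
    let flags = ArchImpl::irq_save_disable(); // hypothetical
    let ret = f(&mut *SCHED_STATE.borrow_mut());
    ArchImpl::irq_restore(flags);             // hypothetical
    ret
}
```

insert_task and friends would then go through such a helper, while schedule() itself (already entered with interrupts off) keeps using borrow_mut() directly.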

This prevents a bug where `sys_exit` calls `exit_group` for the thread's
process, even when there are still active threads.
The `timeout` parameter is only used for `_WAIT` futex ops. For other
ops, the `timeout` parameter is permitted to be an undefined value. The
current implementation would then try to `copy_from_user` using the
garbage pointer and fault, causing a missed wake-up and deadlocking the
calling process.

Fix this by only accessing the timeout parameter for `_WAIT` futex ops,
where the parameter's value must be valid.
In an SMP environment, two threads sharing an address space may trigger
a page fault on the same address simultaneously. Previously, the loser
of this race would receive an `AlreadyMapped` error from the page table
mapper, causing the kernel to treat a valid execution flow as an error.

This patch modifies `handle_demand_fault` to gracefully handle these
spurious faults by:

1. Accepting `AlreadyMapped` as a successful resolution. If another CPU
has already mapped the page while we were waiting for the lock
(or performing I/O), we consider the fault handled.

2. Fixing a memory leak in the race path. We now only `leak()` the
allocated `ClaimedPage` (surrendering ownership to the page tables) if
the mapping actually succeeds. If we lose the race, the `ClaimedPage` is
allowed to go out of scope, causing the `Drop` impl to return the unused
physical frame to the allocator.

3. Applying this logic to both the anonymous mapping path and the
deferred file-backed path.
@hexagonal-sun (Owner)

Fixed a bunch of issues. The thread exit logic was utterly wrong - we were sending a SIGCHLD to the parent for every thread exit; I'm surprised threading code worked prior to this PR!

Now that we've introduced proper SMP, the fault handling code didn't handle page-fault collisions on two different CPUs. It treated an AlreadyMapped condition as a hard error; see 709845c for details.

I think we're getting closer to this being workable now.

@arihant2math (Collaborator Author)

Still very non-deterministic on my end. Usertest does pass sometimes. I suspect there might be some sort of race condition somewhere. From a black-box perspective it seems the algorithm is fine, so I suspect it's something else.

@arihant2math (Collaborator Author) commented Dec 28, 2025

Ah ok, I see that preempts are being scheduled but not executed (sometimes).

@arihant2math (Collaborator Author) commented Dec 28, 2025

Ok, I see the problem: we should disable interrupts when modifying the wait queue, because it might be locked during a timer interrupt.

Or something is just holding the wakeup_q and never dropping it.

Never mind, I see the problem:
Essentially SYS_TIMER seems to be running on a random CPU. However, sched_yield only yields the current CPU's scheduler. So if the yield was meant for a different CPU, it'll never go through.

Indeed, modifying WakeupEvent::preempt to include the caller's CPU and adding the following patch:

                    WakeupKind::Preempt(cpu_id) => {
                        if cpu_id == ArchImpl::id() {
                            crate::sched::sched_yield();
                        } else {
                            log::warn!(
                                "SysTimer: received preemption event for CPU {}, but running on CPU {}",
                                cpu_id,
                                ArchImpl::id()
                            );
                        }
                    }
                }

Shows:

[    2.2638996] drivers::timer: SysTimer: received preemption event for CPU 1, but running on CPU 0

right before a hang.

Ideally we'd have some sort of per-CPU handling (either the wakeup queue being per-CPU or the timer interrupts being handled per-CPU). But for now I'll just send an IPI, as sketched below.
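
A sketch of that stop-gap, reusing the patch above (`Message::Reschedule` is a hypothetical variant; only `PutTask` exists in the messenger today):

```rust
WakeupKind::Preempt(cpu_id) => {
    if cpu_id == ArchImpl::id() {
        crate::sched::sched_yield();
    } else {
        // Forward the preemption to the CPU it was actually meant for.
        message_cpu(cpu_id, Message::Reschedule)
            .expect("Failed to send reschedule IPI");
    }
}
```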

Still running into the timer interrupt not firing (I think it's still being deadlocked somehow).

Ah, it's a cyclic lock acquire (sched_yield -> schedule -> schedule_preempt).

@arihant2math (Collaborator Author)

Ok, fixed the hanging bug with 8a09fe0. Essentially I moved to interrupting the idle task every 4 ms.

However I'm seeing a speed regression of ~10x when running usertest. I'd like to resolve that before merging.

@hexagonal-sun (Owner) commented Dec 28, 2025

Do you think we need a commit like 50e5b0b to force other CPUs out of the idle task if a waker changes the state of the task? I realise that the preempt code would eventually do this with a tick interrupt, but perhaps it would reduce latency?

@arihant2math (Collaborator Author)

> Do you think we need a commit like 50e5b0b to force other CPUs out of the idle task if a waker changes the state of the task? I realise that the preempt code would eventually do this with a tick interrupt, but perhaps it would reduce latency?

This would help. I think we need a good way of ensuring that a child can quickly wake up its parent.

@arihant2math (Collaborator Author) commented Dec 28, 2025

Removing cross-CPU scheduling by forcing everything onto a single CPU leads to usertest taking 550 ms; master only takes 356 ms.

I'm looking at some optimizations, like keeping threads on the same CPU unless the local queue is too long.

I'm also thinking that we can improve on the way IPIs are being used by 1) batching new-task sends and 2) using a lock-free ArrayQueue to push new tasks to a remote CPU, as sketched below.
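
A sketch of idea 2 using crossbeam's `ArrayQueue` as a bounded, lock-free per-CPU inbox (the `RemoteInbox` type and the doorbell message are illustrative, not in the tree):

```rust
use crossbeam_queue::ArrayQueue;

pub struct RemoteInbox {
    tasks: ArrayQueue<Arc<Task>>,
}

impl RemoteInbox {
    /// Push a task towards another CPU without taking its run-queue lock;
    /// only ring the doorbell IPI if the inbox was previously empty, which
    /// batches sends naturally (idea 1).
    pub fn push(&self, cpu: usize, task: Arc<Task>) {
        let was_empty = self.tasks.is_empty();
        if let Err(task) = self.tasks.push(task) {
            // Inbox full: fall back to the existing PutTask IPI path.
            let _ = message_cpu(cpu, Message::PutTask(task));
            return;
        }
        if was_empty {
            // Hypothetical doorbell so the target drains its inbox promptly.
            let _ = message_cpu(cpu, Message::Reschedule);
        }
    }
}
```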

I think the IPI stuff might be far-fetched for now, but I would be interested in somehow beating down that single-CPU number.

@hexagonal-sun (Owner)

I'll have a look today. I think the first thing would be to make the timer code per-CPU. We need a per-CPU wakeup_q. That should allow preemption events to be scheduled for this CPU. This will also allow sleep()s to be inserted on this CPU's work_q. I suspect this will help a lot.

IPIs should really be used for slow-path code. Ideally we should minimise their use for the scheduling fast-path as much as possible.
