Schedule onto multiple cores in round-robin fashion #91
Conversation
Force-pushed b620a40 to a63ce3b
src/sched/mod.rs (Outdated)
```rust
}

// TODO: arbitrary cap.
pub static SCHED_STATES: [SyncUnsafeCell<Option<SchedState>>; 128] =
```
I presume you've ditched the `per_cpu!` macro here since it precludes access to state held by other cores?

I've been thinking about this for a while and wondering what the best approach is. I can see two options:

1. What you've done here: i.e., a static array which allows arbitrary access from any other core. The main downside I see is that everything must be locked, since it can be accessed by another core. I see that you've used `SyncUnsafeCell`; in that case, I think functions such as `get_sched_state_by_id` and `with_cpu_sched_state` are inherently unsafe, as they permit lock-free access to `SchedState` across CPUs.
2. Accessing/modifying CPU state via the `CPUMessenger`. In this approach, we fire off an IPI to the target CPU, which performs the read/modify locally (lock-free) and returns the result.
I think the second option is a good compromise. Updating or accessing the state of another CPU isn’t done nearly as often as accessing the current CPU’s state. With the first option, we’d need locking for every scheduler state access or update (even something as frequent as retrieving the current task). Given that, my vote would be to try the CPUMessenger approach. I’m happy to give that a go if you’d like.
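To make option 2 concrete, here's a minimal sketch of the shape I have in mind; every name here (`Message`, `message_cpu`, `SCHED_STATE`, `Task`) is a placeholder, not an existing API:

```rust
use alloc::sync::Arc;
use core::sync::atomic::{AtomicUsize, Ordering};

// Placeholder message type for the CPUMessenger.
enum Message {
    PutTask(Arc<Task>),
    QueryQueueLen { reply: Arc<AtomicUsize> },
}

// Requesting CPU: fire an IPI, then wait for the owner to publish a reply.
fn queue_len_of(cpu: usize) -> usize {
    let reply = Arc::new(AtomicUsize::new(usize::MAX)); // MAX = "no reply yet"
    message_cpu(cpu, Message::QueryQueueLen { reply: reply.clone() });
    while reply.load(Ordering::Acquire) == usize::MAX {
        core::hint::spin_loop();
    }
    reply.load(Ordering::Acquire)
}

// Target CPU, inside its IPI handler: all state access is local, lock-free.
fn handle_message(msg: Message) {
    match msg {
        Message::PutTask(task) => SCHED_STATE.borrow_mut().insert(task),
        Message::QueryQueueLen { reply } => {
            let len = SCHED_STATE.borrow().run_queue_len();
            reply.store(len, Ordering::Release);
        }
    }
}
```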
For an example of this, I've just pushed this branch, which pushes tasks onto other CPUs via IPIs with the `CPUMessenger`. When I run usertest, I'm getting a deadlock at the moment:

```
bash-5.3# ./bin/usertest
Running userspace tests ...
[   35.35031303] interrupts::cpu_messenger: Putting task on CPU 1
Testing sync syscall ... OK
```

I suspect this is the child process finishing on CPU1 and the waker setting the task on CPU0 back to runnable, while CPU0 is stuck in the idle task. I need to implement something such that a `Waker`, when signalling to another CPU, sends a reschedule IPI.
Yes. I've intentionally left out cross-CPU access beyond task insertion. To my knowledge, `get_sched_state_by_id` et al. should be safe, since I'm never overwriting any item in the array, and the interior mutability visible to other threads is limited to `run_queue`, which is protected by a `SpinLock`.
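As a rough sketch of that layout (field names beyond `run_queue` are illustrative):

```rust
use alloc::{collections::VecDeque, sync::Arc};

// Only the run queue is reachable from other CPUs, and it carries its own
// lock; everything else is touched exclusively by the owning CPU.
pub struct SchedState {
    /// Shared: other CPUs may insert tasks here, so it is lock-protected.
    pub run_queue: SpinLock<VecDeque<Arc<Task>>>,
    /// Owner-only: accessed lock-free by the CPU that owns this slot.
    running_task: Option<Arc<Task>>,
    vclock: u64,
}
```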
A better approach could be to Arc the runqueue and just share that.
On the topic of using messengers, how fast are they?
I think the optimal cross-CPU task-balancing method is work-stealing rather than work-giving. How would that work with messengers? It seems like you'd need to query for the number of tasks first, then request a steal.
> Yes. I've intentionally left out cross-cpu access beyond task insertion. To my knowledge `get_sched_state_by_id` et al. should be safe since I'm never overwriting any item in the array, and the interior mutability of other threads is limited to `run_queue`, which is protected by a `SpinLock`.
I disagree. I see a code path where the procfs module can call find_task_by_descriptor. That call can originate from any task on any CPU, which means it will very likely result in cross-CPU data access (i.e. accessing the run_queue of another CPU). That code iterates through all active slots and walks the run_queues until it finds the task.
As far as I can see, this is highly unsafe and will lead to data races. SchedState itself is not protected by a mutex, nor are any of its fields.
If we can guarantee that the only core accessing a given piece of data is the CPU that owns it, then we can safely access it lock-free. That is precisely the guarantee provided by per_cpu!: there is no way to access data that does not belong to the current CPU. In this case, however, that guarantee does not hold. You need a different mechanism here—this is exactly the problem the CPUMessenger is trying to solve.
> A better approach could be to Arc the runqueue and just share that.
Possibly. I did consider wrapping all of this state in a large Arc, but that effectively means locking everything. Any on-CPU operation would then require taking a lock, which I think introduces too much overhead for the scheduler fast path.
> On the topic of using messengers, how fast are they?
IPIs are fast. You can hit additional latency if interrupts are disabled on the target CPU, but Linux uses them extensively in practice. For example, on my machine:
```
$ cat /proc/interrupts
[...]
IWI:   103089  111648   51802   50290 3480843   50846   37640  101369  IRQ work interrupts
RTR:        0       0       0       0       0       0       0       0  APIC ICR read retries
RES:    53258   64925   39159   37986  143873   61857   38363   49915  Rescheduling interrupts
CAL:  1434334 1127025  974039  838055  844695  993721  932218  876858  Function call interrupts
TLB:   684503  742799  728184  629270  646185  829325  783279  722283  TLB shootdowns
TRM:        0       0       0       0       0       0       0       0  Thermal event interrupts
THR:        0       0       0       0       0       0       0       0  Threshold APIC interrupts
DFR:        0       0       0       0       0       0       0       0  Deferred Error APIC interrupts
MCE:        0       0       0       0       0       0       0       0  Machine check exceptions
MCP:       66      65      65      65      65      65      65      65  Machine check polls
ERR:        0
```
These are the various IPIs Linux uses, and as you can see, they are heavily exercised.
I've just pushed some additional code to the branch I mentioned earlier; this resolves the deadlock I had and allows me to get through nearly all of usertest without any kind of preemption. I'm just using IPIs to wake up other CPUs when needed.
> I think the optimal cross-CPU task balancing method is work-stealing rather than work-giving, how would that work with messengers? It seems like you'd need to query for the number of tasks first, then request a steal.
That’s a fair point, and I agree that work-stealing is generally the better default for schedulers.
With messengers, the model shifts slightly compared to shared-memory designs, but it still maps cleanly. The key idea is that ownership never changes implicitly: the stealing CPU never directly inspects or mutates another CPU’s run queue. Instead, it asks the owning CPU to do the inspection and transfer on its behalf. Concretely, the stealing CPU raises a message to a target CPU with the necessary parameters, and the target CPU decides what (if any) tasks it can push back.
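As a rough sketch under those assumptions (all names hypothetical), a steal can be a single request message rather than separate query and steal steps, so the count can't go stale in between; the owner alone decides what it hands back:

```rust
use alloc::{sync::Arc, vec::Vec};

// Hypothetical messenger variants for work-stealing.
enum Message {
    StealRequest { from_cpu: usize, max_tasks: usize },
    StolenTasks(Vec<Arc<Task>>),
}

// On the owning CPU, inside its IPI handler: only the owner ever touches
// its run queue, so no cross-CPU locking is needed.
fn handle_steal(from_cpu: usize, max_tasks: usize) {
    let mut state = SCHED_STATE.borrow_mut();
    // Keep at least one task for ourselves; give away at most max_tasks.
    let spare = state.run_queue_len().saturating_sub(1).min(max_tasks);
    let tasks: Vec<Arc<Task>> = (0..spare)
        .filter_map(|_| state.pop_back_task())
        .collect();
    if !tasks.is_empty() {
        // Push the donated tasks back to the requester.
        message_cpu(from_cpu, Message::StolenTasks(tasks)).ok();
    }
}
```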
Alright, I think we can worry about the work-stealing implementation later; I'll switch what I have currently to use messengers.
Force-pushed 5c890be to a25bb55
I've just tried to run the usertest and it looks like you're getting the same problem as I am on my branch. I need to investigate what's going on there.
Force-pushed 1e3cf44 to bc729dc
Force-pushed bc729dc to 427aae4
hexagonal-sun left a comment
Looks good, some minor comments - nothing major. Other than that, happy to merge.
It might be worth creating an issue for per-CPU driver initialisation logic for per-CPU devices (the armv8 timer being the primary one). That decouples that driver from the gic code.
```rust
    }
}

// TODO: arbitrary cap.
```
Redundant comment now.
| sched_state.advance_vclock(now_inst); | ||
|
|
||
| let next_task = sched_state.find_next_runnable_task(); | ||
| // if previous_task.tid != next_task.tid { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I presume this is useful debug code?
It'd be a shame to remove such code when it's useful for debugging. Perhaps we should make use of log's release_max_level_info feature flags or similar, so that useful code isn't removed.
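For instance, a minimal sketch of that idea using the `log` crate's `release_max_level_info` feature (variable names taken from the diff above):

```rust
// With `log = { version = "0.4", features = ["release_max_level_info"] }`
// in Cargo.toml, debug!/trace! calls compile to nothing in release builds,
// so debugging aids like this can stay in the tree at zero runtime cost:
if previous_task.tid != next_task.tid {
    log::debug!(
        "switching task {:?} -> {:?}",
        previous_task.tid,
        next_task.tid
    );
}
```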
src/sched/mod.rs (Outdated)
```rust
pub fn insert_task_cross_cpu(task: Arc<Task>) {
    let cpu = get_next_cpu();
    message_cpu(cpu.value(), Message::PutTask(task)).expect("Failed to send task to CPU");
```
Could slightly optimise here to use insert_task if we're inserting to the current CPU.
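Something like this sketch of the suggested fast path (assuming `ArchImpl::id()` gives the current CPU id, as used elsewhere in this thread):

```rust
// Sketch: skip the IPI round-trip when the chosen CPU is the current one.
pub fn insert_task_cross_cpu(task: Arc<Task>) {
    let cpu = get_next_cpu();
    if cpu.value() == ArchImpl::id() {
        // Local insert: no message needed.
        insert_task(task);
    } else {
        message_cpu(cpu.value(), Message::PutTask(task))
            .expect("Failed to send task to CPU");
    }
}
```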
```rust
    .activate();

*init_task.state.lock_save_irq() = TaskState::Running;
SCHED_STATE.borrow_mut().running_task = Some(idle_task.clone());
```
Don't we want init to be the first task executed?
FYI still debugging some weird crashes with usertest. Also there's a lot of intermittent hanging.
I found 50e5b0b helped a lot with intermittent hanging. Note that you also need the prior commit to the interrupt manager, which helps prevent a deadlock there.
There is no need to store a separate inner struct within the interrupt manager; refactor that. Also, reduce the amount of locking when servicing an interrupt: currently we keep the whole interrupt manager locked while servicing, but it should be unlocked while the ISR is called.
I'm seeing some lock-ups due to the interrupt manager not being able to handle interrupts reentrantly. That should be fixed if you cherry-pick 6ae2375. I'm also seeing a lot of lock-ups in the allocator. This makes total sense: the spinlock used in the allocator does not disable interrupts, so when an ISR is running, the heap spinlock may already be held. If an allocation occurs during the ISR, this will deadlock.
The `LockedHeap` type provided by `linked_list` doesn't disable interrupts when modifying the heap, so allocations can deadlock when they occur during an ISR. Fix this by wrapping the `Heap` in our interrupt-aware spinlock.
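A minimal sketch of that fix, assuming an interrupt-disabling `SpinLockIrq` (illustrative name) and the `Heap` type from `linked_list_allocator`:

```rust
use core::alloc::{GlobalAlloc, Layout};
use core::ptr::NonNull;
use linked_list_allocator::Heap;

// Wrap the heap in a spinlock that masks interrupts while held, so an ISR
// can never observe (or spin on) a heap lock taken by the code it
// interrupted.
struct IrqSafeHeap(SpinLockIrq<Heap>);

unsafe impl GlobalAlloc for IrqSafeHeap {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // lock_save_irq() disables interrupts for the critical section.
        self.0
            .lock_save_irq()
            .allocate_first_fit(layout)
            .map_or(core::ptr::null_mut(), NonNull::as_ptr)
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        if let Some(ptr) = NonNull::new(ptr) {
            self.0.lock_save_irq().deallocate(ptr, layout);
        }
    }
}
```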
Force-pushed 70ea99c to c9d0ef4
Okay, I've just pushed fixes for the heap allocator and the interrupt manager to this PR, which should help. I now get a successful boot to bash pretty much every time, which means I think we've fixed the interrupt locking issues. I'm now hitting:
When switching tasks, we may well be switching away from a task which is going to `Sleep`. Therefore the check

```rust
debug_assert_eq!(*prev_task.state.lock_save_irq(), TaskState::Runnable);
```

is incorrect.
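A sketch of a relaxed check (assuming a `TaskState::Sleeping` variant; adjust to the real state names):

```rust
// The outgoing task may legitimately be Runnable (preempted) or Sleeping
// (blocked), so only assert that it isn't in a nonsensical state.
let prev_state = *prev_task.state.lock_save_irq();
debug_assert!(
    matches!(prev_state, TaskState::Runnable | TaskState::Sleeping),
    "unexpected state for outgoing task: {:?}",
    prev_state
);
```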
Seems like I can spawn 3 […]. However, usertest still doesn't run to completion. I think there might be a bug with respect to cross-CPU task completion.
Also, interrupts being scheduled on top of times when the borrow is already acquired is turning out to be a problem (i.e. […]).
```rust
/// Nothing, but the CPU context will be set to the next runnable task. See
/// `userspace_return` for how this is invoked.
fn schedule() {
    if SCHED_STATE.try_borrow_mut().is_none() {
```
This is my "fix" for that, but it just returns. Sometimes this happens instantly, causing the idle task to be executed ad nauseam.
Force-pushed b109f47 to c6b44cb
This prevents a bug where `sys_exit` calls `exit_group` for the thread's process, even when there are still active threads.
Force-pushed c6b44cb to c52b786
The `timeout` parameter is only used for `_WAIT` futex ops. For other ops, the `timeout` parameter is permitted to be an undefined value. The current implementation would then try to `copy_from_user` using the garbage pointer and fault, causing a missed wake-up and deadlocking the calling process. Fix this by only accessing the `timeout` parameter for `_WAIT` futex ops, where the parameter's value must be valid.
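A sketch of the described guard (the `FutexOp` variants and `copy_from_user` helper are illustrative, not the exact kernel API):

```rust
// Only dereference `timeout` for the ops that define it; for everything
// else the register may hold garbage, so we must not treat it as a pointer.
let timeout = match op {
    FutexOp::Wait | FutexOp::WaitBitset => {
        if timeout_ptr.is_null() {
            None
        } else {
            // _WAIT ops require a valid pointer (or NULL), so this read is OK.
            Some(copy_from_user::<Timespec>(timeout_ptr)?)
        }
    }
    // _WAKE, _REQUEUE, etc. reuse this argument with other meanings
    // (or leave it undefined), so never touch it here.
    _ => None,
};
```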
In an SMP environment, two threads sharing an address space may trigger a page fault on the same address simultaneously. Previously, the loser of this race would receive an `AlreadyMapped` error from the page table mapper, causing the kernel to treat a valid execution flow as an error. This patch modifies `handle_demand_fault` to gracefully handle these spurious faults by:

1. Accepting `AlreadyMapped` as a successful resolution. If another CPU has already mapped the page while we were waiting for the lock (or performing I/O), we consider the fault handled.
2. Fixing a memory leak in the race path. We now only `leak()` the allocated `ClaimedPage` (surrendering ownership to the page tables) if the mapping actually succeeds. If we lose the race, the `ClaimedPage` is allowed to go out of scope, causing the `Drop` impl to return the unused physical frame to the allocator.
3. Applying this logic to both the anonymous mapping path and the deferred file-backed path.
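A sketch of the control flow described above (the mapper API, `MapError::AlreadyMapped`, and `ClaimedPage` methods are assumptions based on this message):

```rust
// Tail of a demand-fault handler: map the freshly allocated frame,
// tolerating the case where another CPU beat us to it.
let claimed: ClaimedPage = alloc_page()?;
match mapper.map(fault_addr, claimed.frame(), flags) {
    Ok(()) => {
        // We won the race: the page tables now own the frame.
        claimed.leak();
        Ok(())
    }
    Err(MapError::AlreadyMapped) => {
        // Another thread resolved the fault first. Treat as success;
        // `claimed` drops here, returning the frame to the allocator.
        Ok(())
    }
    Err(e) => Err(e),
}
```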
Fixed a bunch of issues. The thread exit logic was utterly wrong: we were sending a SIGCHLD to the parent for every thread exit; I'm surprised threading code worked prior to this PR! Now that we've introduced proper SMP, the fault handling code didn't handle page-fault collisions on two different CPUs; it treated an `AlreadyMapped` result as an error. I think we're getting closer to this being workable now.
Still being very non-deterministic from my end. Usertest does pass sometimes. I suspect there might be some sort of race condition somewhere. From a black-box perspective it seems the algorithm is fine, so I suspect it's something else.
Ah ok, I see that preempts are being scheduled but not executed (sometimes).
Or something is just holding the `wakeup_q` and never dropping it.

Nevermind, I see the problem. Indeed, modifying the handler to:

```rust
WakeupKind::Preempt(cpu_id) => {
    if cpu_id == ArchImpl::id() {
        crate::sched::sched_yield();
    } else {
        log::warn!(
            "SysTimer: received preemption event for CPU {}, but running on CPU {}",
            cpu_id,
            ArchImpl::id()
        );
    }
}
```

shows that warning right before a hang: a preempt event for one CPU is being handled by another. Ideally we'd have some sort of per-CPU handling (either the wakeup queue being per-CPU or the timer interrupts being handled per-CPU), but for now I'll just send an IPI.

Still running into the timer interrupt not firing (I think it's still being deadlocked somehow). Ah, it's a cyclic lock acquire (`sched_yield` -> `schedule` -> `schedule_preempt`).
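For the interim "just send an IPI" fix, something along these lines (sketch only; `Message::Reschedule` is a hypothetical messenger variant):

```rust
WakeupKind::Preempt(cpu_id) => {
    if cpu_id == ArchImpl::id() {
        crate::sched::sched_yield();
    } else {
        // Forward the preempt to the CPU it was meant for instead of
        // just warning: fire a reschedule IPI at the target.
        message_cpu(cpu_id, Message::Reschedule)
            .expect("failed to send reschedule IPI");
    }
}
```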
Force-pushed 9697eea to e1dadca
Ok, fixed the hanging bug with 8a09fe0. Essentially I've moved to interrupting the idle task every 4 ms. However, I'm seeing a speed regression of ~10x when running usertests. I'd like to resolve that before merging.
Do you think we need a commit like 50e5b0b to force other CPUs out of the idle task if a waker changes the state of the task? I realise that the preempt code would eventually do this with a tick interrupt, but perhaps it may reduce latency?

This would help. I think we need a good way of ensuring that a child can quickly wake up its parent.
Removing cross-CPU scheduling by forcing everything onto a single CPU leads to usertest taking 550 ms. I'm looking at some optimizations like keeping threads on the same CPU unless the local queue is too long. I'm also thinking that we can improve the way IPIs are used by 1) batching new-task sends and 2) using a lock-free […]. I think the IPI stuff might be far-fetched for now, but I would be interested in somehow beating down that single-CPU number.
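As a sketch of the "stay local unless the queue is too long" heuristic (the threshold and helper names are made up for illustration):

```rust
// Hypothetical placement policy: prefer the current CPU for cache
// locality, and only spill to the round-robin target when the local
// queue grows past a threshold.
const LOCAL_QUEUE_LIMIT: usize = 4; // arbitrary illustrative value

pub fn place_task(task: Arc<Task>) {
    if local_queue_len() < LOCAL_QUEUE_LIMIT {
        insert_task(task); // stay local: no IPI, warm caches
    } else {
        insert_task_cross_cpu(task); // round-robin to another core
    }
}
```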
I'll have a look today. I think the first thing would be to make the timer code per-CPU; we need a per-CPU wakeup queue. IPIs should really be used for slow-path code; ideally we should minimise their use on the scheduling fast path as much as possible.
Exposes better APIs for collecting tasks and querying tasks across CPUs. Schedules in a round-robin order for now; eventually we'd want work-stealing from other cores and a quick way to get the current queue length for each core. This also forces timer interrupts and extends them to non-main cores.
Currently fixes the bug where background tasks force a hang.
Includes improvements to the EEVDF implementation which fix many bugs.