
Rare server crash (parallel inter-task dependencies + other conditions) #7

@gavento

Description

The Rain server panics while a task becomes ready here. The relevant part of the log seems to be the following:

```
...
DEBUG 2018-03-17T15:31:49Z: librain::server::scheduler: Scheduler: New ready task (1,23092)
... [many New ready task info lines, various IDs]
DEBUG 2018-03-17T15:31:49Z: librain::server::scheduler: Scheduler: New ready task (1,23092)
thread 'main' panicked at 'assertion failed: r', src/server/scheduler.rs:148:17
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
DEBUG 2018-03-17T15:31:49Z: tokio_reactor: loop process - 1 events, 0.000s
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::_print
             at libstd/sys_common/backtrace.rs:71
   2: std::panicking::default_hook::{{closure}}
             at libstd/sys_common/backtrace.rs:59
             at libstd/panicking.rs:207
   3: std::panicking::default_hook
             at libstd/panicking.rs:223
   4: std::panicking::rust_panic_with_hook
             at libstd/panicking.rs:402
   5: std::panicking::begin_panic
   6: librain::server::scheduler::ReactiveScheduler::schedule
   7: librain::server::state::State::run_scheduler
   8: librain::server::state::<impl librain::common::wrapped::WrappedRcRefCell<librain::server::state::State>>::turn
   9: rain::main
  10: std::rt::lang_start::{{closure}}
  11: std::panicking::try::do_call
             at libstd/rt.rs:59
             at libstd/panicking.rs:306
  12: __rust_maybe_catch_panic
             at libpanic_unwind/lib.rs:102
  13: std::rt::lang_start_internal
             at libstd/panicking.rs:285
             at libstd/panic.rs:361
             at libstd/rt.rs:58
  14: main
  15: __libc_start_main
  16: _start
DEBUG 2018-03-17T15:31:49Z: tokio_reactor: loop process - 1 events, 0.000s
DEBUG 2018-03-17T15:31:49Z: tokio_reactor::background: shutting background reactor down NOW
...
```
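Note that the same task, `(1,23092)`, is reported as a new ready task twice just before the panic. `assertion failed: r` at `src/server/scheduler.rs:148` reads like an assertion on a boolean result. A minimal sketch of that failure mode, assuming (this is a guess, not Rain's actual code) the scheduler asserts on the return value of a set insert:

```rust
use std::collections::HashSet;

// Hypothetical reconstruction of the failing pattern. `TaskId` and
// `on_new_ready_task` are illustrative names, not Rain's API.
type TaskId = (u32, u32);

struct Scheduler {
    ready: HashSet<TaskId>,
}

impl Scheduler {
    fn on_new_ready_task(&mut self, id: TaskId) {
        // `HashSet::insert` returns false when the value is already present.
        let r = self.ready.insert(id);
        // If the same task is reported ready twice, this fails exactly like
        // `assertion failed: r`.
        assert!(r);
    }
}

fn main() {
    let mut s = Scheduler { ready: HashSet::new() };
    s.on_new_ready_task((1, 23092)); // first report: fine
    s.on_new_ready_task((1, 23092)); // duplicate report: panics
}
```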

However, a small test with multiple identical inputs passes, even with subsequent submits. The benchmark fails only with more than 500 tasks per layer; see the attached benchmark. It was run as `python3 scalebench.py net -l 256 -w 1024 -s 0`, and the error happens around layer 10 (a sketch of one possible mechanism is below).
The debug checks with `RAIN_DEBUG_MODE=1` do not find any consistency problems.

scalebench.py.txt
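For intuition on why parallel inter-task dependencies could produce a duplicate ready report: if waiting counts are kept per unique dependency but completion is propagated per edge, a task listing the same input twice gets announced twice. This is only a guess at the mechanism (the names and structure below are invented, not Rain's), and since the small identical-inputs test passes, additional conditions are clearly involved:

```rust
use std::collections::{HashMap, HashSet};

fn main() {
    // Task 2 lists task 1 as an input twice (a parallel inter-task dependency).
    let edges: Vec<(u32, u32)> = vec![(1, 2), (1, 2)];

    // Waiting count per *unique* dependency: task 2 waits on {1}, so count = 1.
    let mut waiting: HashMap<u32, usize> = HashMap::new();
    let mut seen: HashSet<(u32, u32)> = HashSet::new();
    for &e in &edges {
        if seen.insert(e) {
            *waiting.entry(e.1).or_insert(0) += 1;
        }
    }

    // Task 1 finishes; a per-*edge* walk decrements its consumers.
    for &(src, dst) in &edges {
        if src == 1 {
            let w = waiting.get_mut(&dst).unwrap();
            *w = w.saturating_sub(1);
            if *w == 0 {
                // Fires for BOTH edges: task 2 is announced ready twice,
                // matching the duplicated "New ready task" log lines.
                println!("New ready task {}", dst);
            }
        }
    }
}
```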

Metadata

Labels

bug (Something isn't working), server
