Proto commits in PKUHPC/CraneSched

These are the commits in which the Protocol Buffers files changed (only the last 100 relevant commits are shown):

Commit:f9ecbdc
Author:Junlin Li
Committer:GitHub

feat: Requeue (#875)

* feat(requeue): add requeue support and follow-up fixes
* fix(requeue): recover pending array child safely

The documentation is generated from this commit.

Commit:e3b775f
Author:junlinli
Committer:junlinli

feat(requeue): add requeue support and follow-up fixes

The documentation is generated from this commit.

Commit:24c4a46
Author:huerni
Committer:GitHub

feat: Add [account/user-partition] resource limit (#529)

* feat: account partition limit - keep only proto and AccountManager changes on top of latest master
* feat: Add Suspend and Resume job requests and update TerminateStepRequest with terminate source
* style: Improve code formatting and comments in proto files for consistency
* feat: Update AccountManager and related components to support job-based resource limits and modify account operations
* Refactor resource management in AccountMetaContainer
  - Introduced MetaResourceStat to track per-user and per-account resource usage across QoS and partition dimensions.
  - Updated method names for clarity, changing "Qos" to "Meta" in resource management functions to reflect the new structure.
  - Replaced old resource checking and allocation methods with new implementations that utilize the MetaResourceStat structure.
  - Adjusted JobScheduler to use the new resource management methods, ensuring compatibility with the updated resource allocation logic.
* feat: Refactor partition limit checks in AccountMetaContainer for improved clarity and efficiency
* feat: Add IsUnlimitedTres_ check for resource limits in AccountMetaContainer
* refactor: Remove commented-out section for clarity in AccountMetaContainer
* feat: Implement partition resource limit handling in MongodbClient and update gRPC service methods
* fix: Add missing newline at end of file in proto definitions
* refactor: Update resource limit handling in AccountManager and DbClient for improved clarity and efficiency
* refactor: Simplify resource limit checks in AccountMetaContainer and enhance partition limit handling
* style: auto format with clang-format.
* refactor: Enhance CheckTres_ method to include prefix for clearer error messages
* refactor: Enhance resource allocation checks in TryMallocMetaSubmitResource method
* refactor: Update error codes and messages for partition resource limits in PublicDefs and AccountMetaContainer
* refactor: Add detailed resource limit pending reasons for QoS and Partition in documentation
* refactor: Improve error handling and resource limit checks in AccountManager and AccountMetaContainer
* refactor: Rename error code for missing partition entry in pending reasons and AccountMetaContainer
* style: auto format with clang-format.
* refactor: Enhance documentation for resource management functions and add static assertion for error code consistency
* docs: Add Resource Limit Configuration Guide and update deployment index
  - Introduced a new guide for configuring QoS and partition-level resource limits for accounts and users.
  - Updated the deployment index to include a link to the new Resource Limit Configuration Guide.
  - Added the Resource Limit Guide to the navigation in mkdocs.yaml for better accessibility.
* feat: Add partition resource limits and mapping for accounts and users
* merge master
* feat: Add account/user partition limits and update error string array size
* feat: Enhance resource submission checks by incorporating user context and account mapping
* feat: Update partition submit limit checks to incorporate QoS parameters
* style: auto format with clang-format.
* feat: Update resource limit checks to use QoS GRES validation
* fix
* feat: Enhance resource management by adding count parameter to submission functions
* refactor: Update error string array to use dynamic size based on ErrCode_ARRAYSIZE
* style: auto format with clang-format.
* feat: Update account and user modification operations to join value lists as strings
* feat: Enhance resource tracking by adding account-level partition resource mapping
* feat: Improve partition limit checks with detailed logging for user and account submissions
* style: auto format with clang-format.
* feat: Add partition resource limit options and enhance error handling in account management
* style: auto format with clang-format.
* fix

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Commit:49a8dd3
Author:NamelessOIer
Committer:GitHub

feat: optimize single-node step launch (#906)

Commit:bb0e1da
Author:NamelessOIer
Committer:NamelessOIer

feat: optimize single-node step launch

Commit:e91d192
Author:Yongkun Li
Committer:GitHub

feat: async launching tasks and make image pulling timeout configurable (#903)

* feat: async launching
* feat: Add sibling tasks aborting
* fix: Add comment
* fix: Fix compilation in GCC15 & Clang20+
* fix: workaround for 10s image pulling
* feat: Add image pulling timeout knot and fix docs
* fix: Remove noise in supervisor init
* fix: typo

Commit:6571808
Author:RileyWen
Committer:RileyWen

proto: align cattach stream definitions

Commit:13ce39b
Author:RileyWen

Merge remote-tracking branch 'origin/master' into feat/metrics

# Conflicts:
#	CMakePresets.json
#	protos/PublicDefs.proto
#	src/CraneCtld/CraneCtld.cpp
#	src/CraneCtld/CtldPublicDefs.cpp
#	src/CraneCtld/CtldPublicDefs.h
#	src/CraneCtld/JobScheduler.cpp
#	src/CraneCtld/JobScheduler.h
#	src/CraneCtld/RpcService/CtldGrpcServer.cpp
#	src/Craned/Supervisor/CMakeLists.txt
#	src/Craned/Supervisor/TaskManager.h
#	src/Utilities/CMakeLists.txt

Commit:33be70b
Author:RileyWen

ctld: add queue state summary query

Commit:f9b6cfb
Author:github-actions[bot]
Committer:huerni

style: auto format with clang-format.

Commit:7f0d382
Author:huerni
Committer:huerni

feat: Add account/user partition limits and update error string array size

Commit:106ace6
Author:github-actions[bot]
Committer:huerni

style: auto format with clang-format.

Commit:c68cc36
Author:github-actions[bot]
Committer:huerni

style: auto format with clang-format.

Commit:6f98ca7
Author:huerni
Committer:huerni

merge master

Commit:0629cd2
Author:huerni
Committer:huerni

fix: Add missing newline at end of file in proto definitions

Commit:3e94037
Author:huerni
Committer:huerni

style: Improve code formatting and comments in proto files for consistency

Commit:1e80672
Author:huerni
Committer:huerni

feat: Add Suspend and Resume job requests and update TerminateStepRequest with terminate source

Commit:955a1b9
Author:huerni
Committer:huerni

feat: account partition limit - keep only proto and AccountManager changes on top of latest master

Commit:5cf9733
Author:huerni
Committer:GitHub

feat: cattach (#621) * todo * feat: add m_cfored_stream_proxy_map_ * feat: io forward when cfored restart * feat: cattach recive data * feat: output data size * refactor * merge master * feat: maxReconnect * refactor cforedclient reconnect * fix: Fix reconnection failure caused by leftover events from the previous stream in m_cq_ * feat: cattach connect step * style: auto format with clang-format. * refactor: rename TaskScheduler refs to JobScheduler and remove obsolete task methods - Rename `TaskScheduler::QueryStepAndNodeRegex` to `JobScheduler::QueryStepAndNodeRegex` and update internal map references from `m_running_task_map_` to `m_running_job_map_` - Replace hardcoded step ID `1` with `kPrimaryStepId` constant for clarity - Remove obsolete `TaskScheduler::QueryTaskUseId` method and `JOB_META_REQUEST` handler that duplicated existing functionality - Update `CtldGrpcServer` to use `g_job_scheduler` instead of `g_task_scheduler` for step queries - Replace `WriteTaskMetaReply` with `WriteStepMetaReply` to better reflect step-level semantics - Fix log message in `CforedStream` to correctly say `JOB_REQUEST` and use `job->JobId()` instead of `task->TaskId()` * feat(Craned): add cfored reconnection logic in AsyncSendRecvThread Refactor `AsyncSendRecvThread_` to support automatic reconnection to cfored when the connection is lost. The state machine loop is now wrapped in an outer retry loop that checks `m_wait_reconn_` and `m_reconnect_attempts_` before re-establishing the gRPC stream. Key changes: - Add outer reconnect loop with exponential backoff up to `kMaxReconnectIntervalSec` seconds between attempts - Enforce a `kMaxReconnectAttempts` limit; on exhaustion, mark all forwarded tasks as stopped, terminate the step via `TerminateStepAsync`, and exit the thread - Move per-session variables (stream, context, write_pending, etc.) 
inside the reconnect loop so they are re-initialized on each attempt - On stream failure, transition to a reconnecting state instead of immediately terminating, allowing the outer loop to retry * fix(Craned): improve CforedClient reconnect and output drain logic - Track dequeued bytes for TASK_OUTPUT to correctly maintain `m_output_queue_bytes_` counter when items are consumed - Replace `goto exited` with early `return` on reconnect to preserve queued output data; only set `m_output_drained_` on normal exit - Add pre-write reconnect check to avoid issuing writes after a reconnect signal is detected mid-wait - Implement exponential backoff with an upper bound of `kMaxReconnectIntervalSec` before each reconnect attempt - Check gRPC channel state after backoff and skip reconnect if the channel is in TRANSIENT_FAILURE or SHUTDOWN; proceed when IDLE, CONNECTING, or READY - Improve log messages to distinguish normal vs. reconnect exits and report attempt counts and wait intervals clearly - Fix minor typo: "Markd" → "Marked" in error log * fix(Craned): set output_drained flag on all thread exit paths Ensure `m_output_drained_` is stored as `true` in both reconnect exit branches of `CleanOutputQueueAndWriteToStreamThread_`, not just on normal exit. Previously, early returns during reconnect skipped setting the flag, which could leave waiters blocked indefinitely. * fix(CforedClient): delay reconnect sleep until channel state is checked Move `sleep_for` after the channel state check so reconnection is only delayed when the channel is not ready, avoiding unnecessary waits when the channel recovers quickly. Also simplify the state condition from checking specific failure states to `!= GRPC_CHANNEL_READY` for broader coverage. 
* feat(proto): add task_id to StreamStepIO for targeted I/O routing Add `task_id` field to `TaskOutputReq` to identify which task produced the output, and an optional `task_id` to `TaskInputReq` to allow stdin delivery to a specific task instead of broadcasting to all tasks. Update `CforedClient` to capture and pass `task_id` when forwarding task stdout via both pipe and TTY handles, aligning the implementation with the updated proto contract. Also normalize trailing comment alignment in Crane.proto by removing extra spaces used for column-alignment. * fix: reduce kMaxOutputQueueBytes from 1MB to 1KB Decrease the maximum output queue size constant from 1 * 1024 * 1024 (1MB) to 1024 bytes (1KB) to limit memory usage for task output buffering. Also add a missing blank line in CforedClient.cpp for readability. * style: auto format with clang-format. * fix(JobScheduler): add null pointer checks in QueryStepAndNodeRegex Add null checks for `job` and `primary_step` pointers in `QueryStepAndNodeRegex` to prevent potential null pointer dereferences. Log debug messages when either pointer is null and return false early to handle these edge cases gracefully. * fix(Supervisor): move set_task_id call into correct stdout branch scope The `payload->set_task_id(req.task_id)` call was placed after the if-else block, outside the scope where `payload` was defined. This caused `payload` to be inaccessible at that point. Move the call inside the `if (req.is_stdout)` branch where the `payload` variable is properly in scope. * fix: increase max output queue size from 1KB to 10MB The `kMaxOutputQueueBytes` constant was set to only 1024 bytes (1 KB), which is likely too small for real-world job output buffering. Increase it to 10 MB to prevent premature queue overflow and data loss when handling larger output streams. * style: auto format with clang-format. 
* fix(JobScheduler): correct nodelist assignment in QueryStepAndNodeRegex method fix(CtldGrpcServer): improve error handling and logging for stream disconnection * fix(CforedClient): track bytes for TASK_OUTPUT only on stdout and improve logging on disconnection * fix(CforedClient): update termination cause for cfored connection failure * style: auto format with clang-format. * fix: reorder fields in StepMetaReq and improve comments in related methods * feat: add cattach command documentation in English and Chinese * fix(CtldGrpcServer): update error message for cfored connection issues to indicate waiting for reconnection * fix(CforedClient): improve reconnect logic and enhance comments for clarity * merge master * feat: enhance StepToCtld with craned_task_map and update related methods for task management * feat: add CattachStepInfo message and update StepMetaReply to include step_info * feat: add NodeTasks message and craned_task_map to CattachStepInfo for enhanced task management * refactor: remove craned_task_map from StepMetaReply and update CattachStepInfo for task routing * style: auto format with clang-format. * feat: add validation for interactive job types in CforedStream method * feat: add validation for cfored_name in CforedStream method --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
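The reconnect behaviour described in the cattach commit above (exponential backoff bounded by `kMaxReconnectIntervalSec`, with a hard limit of `kMaxReconnectAttempts`) can be sketched roughly as follows. This is only an illustration, not the code from the commit: the constant values, the one-second starting delay, and the helper name `NextReconnectDelay` are assumptions, since the commit message names the constants but does not give their values.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <optional>

// Assumed values for illustration; the commit message does not state them.
constexpr int kMaxReconnectAttempts = 8;
constexpr std::chrono::seconds kMaxReconnectIntervalSec{30};

// Hypothetical helper: returns how long to wait before reconnect attempt
// `attempt` (1-based). The delay doubles from one second and is capped at
// kMaxReconnectIntervalSec. Returns std::nullopt once the attempt budget is
// exhausted, at which point the caller would stop retrying.
std::optional<std::chrono::seconds> NextReconnectDelay(int attempt) {
  assert(attempt >= 1);
  if (attempt > kMaxReconnectAttempts) return std::nullopt;
  // Clamp the shift so the doubling cannot overflow for large attempt counts.
  const int shift = std::min(attempt - 1, 20);
  const std::chrono::seconds raw{1LL << shift};
  return std::min(raw, kMaxReconnectIntervalSec);
}
```

A caller would sleep for the returned duration only when the channel is not yet ready, which matches the later fix in the same commit that moves the sleep after the channel state check.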

Commit:ad5c17b
Author:Zhang Yanwen
Committer:GitHub

feat: add array (#865)

* feat: add array proto definitions and public defs
* feat: add array manager
* feat: integrate array scheduling
* feat: persist and query array jobs
* feat: wire array jobs through RPC/craned and update docs
* fix: update embedded db test path after Database/ refactor
* fix: address review comments and bug fixes
* fix parent
* format
* accept comments
* format
* change for to range
* resolve condition
* accept comments
* fix
* accept comments
* refact to range view
* accept comments
* delete deadcode
* fix
* fix bugs

---------

Co-authored-by: crane-dev <crane-dev@local>

Commit:eb58fea
Author:huerni
Committer:GitHub

feat: Add CpuTopology message and integrate CPU socket configuration (#898)

* feat: Add CpuTopology message and integrate CPU socket configuration
* refactor: Rename CpuTopology to NodeTopoInfo and update related references
* style: auto format with clang-format.

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Commit:e299218
Author:huerni
Committer:GitHub

feat: pmix v2 (#864) * feat: version1 feat: add pmix util feat: pmix test feat feat refactor feat: pmix server running mpi feat: PmixServer solo mpi refactor feat: pmix statue and coll(coll ring and coll tree) test feat: multi node info set feat: fence ring feat: pmix direct modex refactor feat: dmodex cb refactor: pmixcolltree feat: tree fence refactor grpc refactor fix: address nil feat: dmodex cb test feat: fence cb ring feat feat feat: tree fence async refactor feat: found pmix refactor refactor fix: reply ok clang refactor fix some bugs refactor grpc request fix refactor coll shared_from_this fix: ring fence refactor fix: not return fix: childrn_hosts refactor refactor fix: use status fix: ERROR in file gds_ds12_lock_pthread.c at line 168 fix: grpc shutdown fix: cbatch multi process test mpirun feat: use WITH_PMIX path feat: print pmix message feat: add env * feat: pmix * feat * feat * refactor pmix * feat: add env config * feat * feat: refactor PmixGrpcServer and PmixGrpcClient * feat: pmix ucx server * feat: pmix ucx client * fix * fix * refactor: pmix v3 * refactor pmix and ucx cmake * refactor * refactor m_dmodex_mgr_ * refactor pmix Coll * refactor * refactor pmix job info * refactor * feat: add pmix env config * feat: when no pmix * fix: pmix conn port * fix * fix: tree send data * feat: WaitAllStubReady add timeout * fix: cli dir permision denied * refactor * feat: add log * fix * fix: env configure * fix: ucx port * merge master * fix * feat: add task_id * fix: fix pmix info set with ntasks * fix * fix: ppn * fix: pmix ucx conn * fix(pmix): restore UCX/gRPC conditional selection and add license headers * fix(pmix): fix integer underflow bugs and improve error handler lifecycle * fix(pmix): add RPC trace logging and translate comments to English * refactor(pmix): replace global PMIx singleton with TaskManager ownership - Move `g_pmix_server` global into `TaskManager` as `m_pmix_server_` - Add `TaskManager::ReceivePmixPort()` to encapsulate PMIx port 
registration logic with proper error handling and structured logging - Delegate `ReceivePmixPort` gRPC handler to `g_task_mgr` instead of accessing global PMIx state directly in `SupervisorServer` - Inject `PmixClient` and `CranedClient` into `PmixCollTree` via constructor to eliminate dependency on a global singleton - Add forward declarations in `PmixCollTree.h` to avoid unnecessary full header includes This improves encapsulation, makes the PMIx lifecycle explicit under TaskManager, and removes hidden global dependencies from PMIx collection tree components. * refactor(pmix): replace raw timeout with std::chrono and add collective abort on timeout - Change `m_timeout_` from `uint64_t` to `std::chrono::seconds` for type-safe duration handling - Update timer start call to use `std::chrono::seconds` directly instead of wrapping - Add `IsTimedOut()` and pure virtual `AbortOnTimeout()` to `Coll` base class - Implement `AbortOnTimeout()` in `PmixCollRing` to invoke pending callbacks with `PMIX_ERR_TIMEOUT` and reset state - Replace `time_t m_ts_` with `std::chrono::steady_clock::time_point` for monotonic, type-safe timestamping - Hook `CleanupTimeoutColls()` into the periodic cleanup timer alongside existing dmodex cleanup * merge master * style: auto format with clang-format. 
* fix(pmix): fix data races, use-after-free, and broadcast result reporting - Add `std::atomic<bool> broadcast_all_ok` to correctly track partial failures during PMIx port fanout; previously `response->set_ok` always returned `true` regardless of individual craned errors - Capture `craned_id` by value and snapshot `pair.second` into `pmix_ports_snapshot` before spawning detached threads to eliminate dangling-reference UB when the enclosing scope exits before threads finish - Call `g_thread_pool->wait()` before `g_server.reset()` in Supervisor shutdown to prevent use-after-free caused by the detached `Server::Wait()` task still running inside a destroyed gRPC Server object - Add a second `g_thread_pool->wait()` after client resets to drain tasks spawned during teardown (e.g. `PodInstance::Cleanup` file removal) - Replace `m_env_.emplace(k, v)` with `insert_or_assign` in `ProcInstance::Spawn` so that PMIx env vars properly overwrite existing keys instead of silently being dropped on duplicates - Add drain logic in `PmixServer` destructor to flush in-flight collectives and pending dmodex requests before resetting `m_pmix_client_`, preventing callbacks from dereferencing a freed pointer * style: auto format with clang-format. * fix(pmix): guard static member definition with HAVE_PMIX preprocessor Wrap `PmixServer::s_instance_` static member definition with `#ifdef HAVE_PMIX` to match the existing conditional compilation pattern, preventing linker errors when PMIx support is disabled. 
* fix(pmix): make PMIx/UCX optional and improve error handling - Make PMIx and UCX dependencies optional in CMake to allow builds without MPI support - Add deadline timeout to ReceivePmixPort RPC to prevent hangs - Improve MPI type validation with descriptive error messages for unsupported MPI types in CforedStream - Initialize PMIx-related environment variables in ProcInstance - Fix PmixServer init error path using goto-based cleanup to properly tear down UVW loop, PMIx state, and connections on failure, and defer singleton registration until full init * refactor(ctld): move PMIx port fanout outside map lock to avoid lock contention Previously, the RPC fanout to all craneds was performed inside the `lazy_emplace_l` callback, which held the map lock throughout all network calls. This could cause unnecessary lock contention and potential deadlocks under slow or failing network conditions. This change takes only a snapshot of the completed port map inside the locked callback using `std::optional<PmixPortsMetaMapValue>`, then performs the actual multi-threaded RPC broadcast after `lazy_emplace_l` returns and the map lock is released. This ensures the map lock is never held during potentially slow or failing network operations. 
* feat * refactor(supervisor): move PMIx server ownership to StepInstance - Remove `InitPmixPreFork()` from TaskManager and `GetPmixServer()` accessor; PMIx server is now owned by `m_step_` (StepInstance) and accessed via `m_step_.pmix_server` and `m_parent_step_inst_->pmix_server` - Simplify non-daemon step init by removing the pre-fork PMIx check, setting status directly to `StepStatus::Starting` - Fix shutdown sequence: run `g_server->Wait()` directly on the calling thread instead of detaching it to the thread pool, then reset server before waiting on task manager to avoid use-after-free UB * refactor(supervisor): remove unused InitPmixPreFork method and fix EOF - Remove `InitPmixPreFork()` declaration from `TaskManager.h` as it is no longer needed - Add missing newline at end of `Supervisor.cpp` to fix EOF formatting * docs: add PMIx Support page to navigation and translations - Add "PMIx Support" entry under deployment/configuration in nav - Add Chinese translation for "PMIx Support" as "PMIx 使用指南" * refactor(cmake): enhance PMIx library finding logic and improve error handling in callbacks * feat(pmix): add PMIx error code and improve error handling in initialization * fix(pmix): ensure proper memory management in JobSet_ method by freeing PMIX info * feat(pmix): enhance callback handling by adding per-context callbacks in CollRingCtx * fix(pmix): prevent data corruption by ensuring ring_buf is cleared only for the correct context sequence * fix(pmix): improve activity timestamp management in ResetCollRing_ method * fix(pmix): correct child node assignment in ReverseTreeDirectChildren function * fix(pmix): reset collection ring on contribution failure in PmixCollRing and ensure tree progress on failure in PmixCollTree * fix(pmix): replace void* casts with pmix_release_cbfunc_t in PmixLibModexInvoke and update status parameter types to pmix_status_t in relevant methods * fix(pmix): prevent duplicate craned_id entries in EmplacePmixStub and update channel count 
logging * feat(pmix): implement TLS support for gRPC communication and update related methods * fix(pmix): add UCP_OP_ATTR_FLAG_NO_IMM_CMPL to request parameters in RegisterReceivesForType_ method * fix(pmix): add PMIx installation prefix option in CMake and improve MPI type validation in CtldGrpcServer * fix(pmix): clean up PMIx port broadcasting logic and improve thread handling in CtldGrpcServer * fix(pmix): improve peer count handling and error messages in PmixCollInit method * fix(pmix): enhance PMIx handling by adding IsPmix checks and cleaning up related logic * fix(pmix): refactor to use PmixStepInfo instead of PmixJobInfo for improved step handling * style: auto format with clang-format. * fix(pmix): update PMIx user guide for clarity and formatting improvements * fix(pmix): enhance PMIx user guide with additional clarity and formatting improvements * fix(crun): add MPI integration options and related environment variables to documentation * fix(pmix): add pre-submit checks for PMIx support and improve error handling * fix(header): update QoS limit error messages for clarity and consistency * style: auto format with clang-format. * fix(reverse-tree): handle edge case for tree width and node count to prevent underflow * fix(pmix): add previous craned ID handling in PmixCollRing for improved message validation * refactor(pmix): remove unused upward wait states from CollTreeState and related functions * fix(pmix): update StrToCollType to handle lowercase string inputs for FENCE_TREE and FENCE_RING * fix(pmix): prevent blocking during PMIx_server_finalize by handling abort scenarios in collectives * fix(pmix): update minimum version requirements for OpenPMIx and Open MPI in user guide * fix(pmix): update copyright year to 2026 in PmixDModex.cpp * style: auto format with clang-format. --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
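The PMIx commit above describes moving the port fanout outside the map lock: a snapshot of the completed port map is taken into a `std::optional` while the lock is held, and the slow broadcast runs only after the lock is released. The sketch below illustrates that pattern under stated assumptions. It substitutes a plain `std::mutex` and `std::unordered_map` for the `lazy_emplace_l` callback mentioned in the commit, and `PortRegistry`, `PortMap`, and `Report` are hypothetical names, not identifiers from the repository.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in: craned id -> PMIx port reported for one step.
using PortMap = std::unordered_map<std::string, int>;
using SendFn = std::function<void(const std::string&, const PortMap&)>;

class PortRegistry {
 public:
  explicit PortRegistry(std::size_t expected) : m_expected_(expected) {}

  // Records one port. The lock covers only the map update and, once every
  // craned has reported, the copy of the completed map. `send` (the slow
  // network call) runs after the lock is released.
  // Returns true if this call triggered the broadcast.
  bool Report(const std::string& craned_id, int port, const SendFn& send) {
    std::optional<PortMap> snapshot;
    {
      std::lock_guard<std::mutex> guard(m_mtx_);
      m_ports_.insert_or_assign(craned_id, port);
      if (!m_done_ && m_ports_.size() == m_expected_) {
        snapshot = m_ports_;  // copy while protected by the lock
        m_done_ = true;       // broadcast at most once
      }
    }
    if (!snapshot) return false;
    // No lock is held here, so a slow or failing peer cannot block writers.
    for (const auto& entry : *snapshot) send(entry.first, *snapshot);
    return true;
  }

 private:
  std::mutex m_mtx_;
  PortMap m_ports_;
  std::size_t m_expected_;
  bool m_done_ = false;
};
```

The trade-off is one extra copy of the map per step, in exchange for never holding the lock across network calls.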

Commit:5ddfaf9
Author:huerni

feat: Enhance CPU topology configuration and detection logic

Commit:45a1c37
Author:NamelessOIer
Committer:GitHub

feat: preempt v2 (#881) * preempt_v2: basic data structures * preempt_v2: scheduler data structures (Step 4.1) Add data structures required by the v2 preempt scheduler without wiring any behaviour yet: - RnJobInScheduler gains a qos field so TryPreempt_ can filter running candidates against a preempter's QoS preempt list. - PdJobInScheduler gains end_time, preempted_jobs (variant of pending / running snapshots), and an is_scheduled() helper. - NodeState gains a per-QoS inverted index qos_job_map. Cancelling jobs are intentionally kept in this map so a new preempter can inherit their slot via the cancelling-first ordering in TryPreempt_. - SchedulerAlgo gains cancelling_set_<job_id_t> for action-queue de-duplication and a PreemptSegTree class (lazy-propagation segment tree on ResourceInNodeV3, used per-job during preempt decisions). No behavioural change in this step: the new fields are populated by later steps. * preempt_v2: reversible resource-map update API (Step 4.2) Turn the resource-time-map write path into a symmetric update: - NodeState::SubtractResourceInNode -> UpdateResourceInNode(..., bool is_release). Subtraction path unchanged; is_release=true flips sign so a previously scheduled window can be released back into the map. - IUpdateNodeCostPolicy::UpdateCost / MinCpuTimeRatioFirst::UpdateCost gain is_release. MinCpuTimeRatioFirst now decrements cost on release instead of the v1 double-accounting bug (Q2 fix). - INodeSelector::SubtractResource -> UpdateResource, plumbing is_release through to the per-node calls. - LocalScheduler::UpdateNodeSelector is split into UpdateNodeSelectorWithScheduledJob (subtract + register in qos_job_map) and UpdateNodeSelectorWithPreemptedJob (release + mirror unregister). Both gate the qos_job_map mutation on is_scheduled() so the two sides stay symmetric; callers must invoke the preempted path before setting reason = "Preempted" on pending preemption, which will be enforced when the main loop is rewritten in Step 4.4. 
Only the existing call site in the main loop is migrated (UpdateNodeSelector -> UpdateNodeSelectorWithScheduledJob) and the PdJobInScheduler::end_time is eagerly populated so the new API has the window it needs. No preempt logic is wired up yet — the new is_release branch and UpdateNodeSelectorWithPreemptedJob are unused until Step 4.3 / 4.4. * preempt_v2: three-stage scheduling with TryPreempt_ (Step 4.3) Split CalculateRunningNodesAndStartTime_ into three stages so preempt decisions have a natural place to plug in without disturbing the existing immediate-start / backfill logic. - CalculateRunningNodesAndStartTime_ becomes a 10-line driver: GetNodesAndTrySchedule_ (stage 1) -> fail if candidates < node_num -> PreemptType != NONE && TryPreempt_ (stage 2) -> Backfill_ (stage 3) PreemptType gating lives here so TryPreempt_ itself only deals with candidate selection. - GetNodesAndTrySchedule_ is the existing top-K logic, but its failure path now fills the caller's nodes_to_sched* instead of calling the earliest-start selector directly. - Backfill_ is what used to be the tail of CalculateRunningNodesAnd- StartTime_ — a straight EarliestStartSubsetSelector call consuming stage 1's provisional allocation. - TryPreempt_ is new: * Collect candidates from each node's qos_job_map, intersecting with qos_preempt_map[job->qos]. * Sort cancelling > pending > running, tie-breaking by qos_priority / job priority (pending) or start_time (running). * Build a PreemptSegTree per candidate node over [now, now+time_limit) with target = stage 1's provisional allocation, initialised from time_avail_res_map. * Forward scan Add candidates until node_num trees are satisfied, then reverse-prune to the minimum set; the survivors go into job->preempted_jobs. * Narrow job->craned_ids / allocated_res / craned_id_to_task_num to the chosen N nodes and set start_time = now. 
The caller in NodeSelect passes an empty qos_preempt_map and cancelling_set for now, so TryPreempt_ short-circuits on the map lookup and the three-stage driver reduces to the old "immediate -> backfill" flow. Step 4.4 will wire in real inputs. * preempt_v2: wire preempt inputs into NodeSelect (Step 4.4) Complete the NodeSelect side of preemption by feeding real preempt inputs into TryPreempt_ and dispatching cancel actions to the async cancel queue. - Phase A: build qos_preempt_map for this round. Keys are populated from pending_jobs; values are filled from AccountManager only when PreemptType == QOS, so at most one QoS-map lookup per distinct pending QoS. - Phase C: refresh cancelling_set_ against the round's running snapshot. Entries whose preempted job has already left the running map are dropped; surviving entries get the RnJob snapshot's end_time mutated to now+1s so the prediction layer sees an imminent release (works for partition and reservation paths since both read end_time after mutation). - Phase B: register running jobs in each NodeState's qos_job_map so TryPreempt_ has candidates. Cancelling jobs are intentionally kept in the map so a new preempter can inherit their slot via the cancelling-first ordering in TryPreempt_. - Main loop: drop the Step 4.3 kEmpty* placeholders and pass the real qos_preempt_map / cancelling_set_ into CalculateRunningNodesAndStartTime_. When TryPreempt_ succeeds, walk preempted_jobs: * For each entry, call UpdateNodeSelectorWithPreemptedJob first (the qos_job_map erase is gated on is_scheduled() — see the contract documented on Step 4.2). * Then mutate reason to "Preempted" (pending) or de-dupe through cancelling_set_ and hand the job to the async cancel queue (running). The dispatch is logged and leaves a TODO(preempt) marker for REQUEUE / SUSPEND modes. Running jobs are cancelled via CancelPendingOrRunningJob with operator_uid=0 and an empty filter_ids entry (= all steps). 
Sending the cancel from inside the main loop lets the RPC travel in parallel with subsequent scheduling work, so by the time ScheduleThread_ reaches post-validation the preempted job is usually gone. This step is self-consistent at compile time but unsafe to run in isolation: without the post-validation gate (Step 4.5) a preempter can be launched in the same round its victim is still physically running. The next commit adds that gate. * preempt_v2: post-validation for preempted running jobs (Step 4.5) Close the v2 preempt implementation by gating the Malloc / SetStatus commit on the preempted running jobs having actually left m_running_job_map_. NodeSelect produces optimistic decisions: when a preempter chooses to cancel a running Y, its seg tree pretends Y is released immediately and its allocated_res and craned_ids are filled as if that were true. That's safe for in-round accounting because UpdateNodeSelectorWith- PreemptedJob only touches the scheduler's own time_avail_res_map; it doesn't touch meta_container. The cost is that we must not hand the preempter off to execution while Y is still physically holding its cgroup / cpuset on the craned — we'd double-book the resources until Y's cancel propagates. Add one extra check at the tail of the existing post-validation block in ScheduleThread_: after the ResReduceEvent and reservation checks pass but before the license / QoS resource malloc, scan the snapshot's preempted_jobs and see whether any RnJobInScheduler* entry is still in m_running_job_map_ (we already hold that lock here). If so, set reason = "Waiting for Preemption" and skip this snapshot — cancelling_set_ still remembers Y, so next round TryPreempt_ will rediscover it via the cancelling-first ordering without re-sending a cancel, and the preempter will get another chance. Placing the check before the Malloc paths avoids the need to unwind license or QoS state when the snapshot is rejected. 
Together with Steps 4.1–4.4 this completes the v2 preempt feature: QoS-based CANCEL-mode preemption with strict priority ordering, cancelling-first candidate selection, async cancel dispatch inside NodeSelect, and post-validation guarding physical correctness. * preempt_v2: replace CancelPendingOrRunningJob with direct cancel enqueue Replace the heavyweight CancelPendingOrRunningJob RPC path with a lightweight direct-enqueue approach for preempt-initiated cancels: - New CancelRunningJobByIdElem variant in the cancel queue: carries a batch of job_ids + terminate_source. Expansion to per-step/per-craned entries is deferred to CleanCancelJobQueueCb_ which acquires the running lock on its own thread, avoiding lock contention in the NodeSelect hot path. - New EnqueuePreemptCancel() method: enqueues the batch element and triggers m_clean_cancel_job_queue_handle_ directly, bypassing the batch-threshold / 500ms-timer gate in CancelJobAsyncCb_ for minimum latency. This lets the cancel RPC travel in parallel with scheduling of subsequent jobs in the same round. - NodeSelect main loop collects preempted running job ids into a vector, then calls EnqueuePreemptCancel once per scheduling decision instead of constructing a CancelJobRequest per victim. Benefits over the previous approach: * Zero lock contention in NodeSelect (running lock deferred) * Batch all cancel requests from one TryPreempt_ into a single queue element + single async handle trigger * Immediate dispatch skips the 500ms timer gate * style: clean up comments, rename cancelling→preempting, clang-format - Remove verbose design-doc references, step numbers, and redundant explanatory comments from JobScheduler.h/cpp. Keep TODOs and essential constraints (e.g. is_scheduled() call-order gate). - Rename cancelling_set_ → m_preempting_set_ (and all related variables/lambdas) to follow the m_ naming convention and use a mode-neutral name that also covers future REQUEUE/SUSPEND. 
- Delete qos_job_map's design-rationale comment (type is self-evident). - Run clang-format on all files modified across the preempt branch. * fix: clamp running end_time at NodeSelect entry and use now for preempt release - Normalize running job end_times to >= now+1s at NodeSelect entry so overdue jobs don't produce negative-duration artifacts in the prediction layer. Phase B no longer needs a local clamp variable. - In UpdateNodeSelectorWithPreemptedJob, use now (not rn->start_time) as the release start: time_avail_res_map begins at now, passing a past timestamp caused iterator underflow (--begin() UB → segfault). * docs: add docs * fix * Validate QoS preempt graph * Fix preempt selector update for drained nodes * update config
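The post-validation gate described in this commit message (Step 4.5) can be sketched roughly as follows. This is a minimal illustration, not the CraneSched code: `ScheduleDecision`, `RunningJob`, and `PassesPreemptPostValidation` are hypothetical names, and the caller is assumed to already hold the lock that guards the running-job map.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the scheduler's types; only the shape matters.
using JobId = std::uint32_t;
struct RunningJob {};  // placeholder for a running-job record

struct ScheduleDecision {
  JobId preempter_id;
  std::vector<JobId> preempted_running_ids;  // victims picked during NodeSelect
  std::string pending_reason;
};

// Returns true when the decision may be committed. If any victim is still
// present in the running map, the decision is deferred to a later round and
// the pending reason is updated. Nothing has been allocated yet at this
// point, so a rejected decision needs no unwinding.
bool PassesPreemptPostValidation(
    ScheduleDecision& decision,
    const std::unordered_map<JobId, RunningJob>& running_jobs) {
  for (JobId victim : decision.preempted_running_ids) {
    if (running_jobs.count(victim) != 0) {
      decision.pending_reason = "Waiting for Preemption";
      return false;
    }
  }
  return true;
}
```

Running the check before any license or QoS allocation mirrors the ordering argument in the message: a deferred preempter leaves no state to roll back.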

Commit:8b030cd
Author:huerni

feat: Add hwloc support for hardware topology detection and configuration

Commit:99efb2a
Author:RileyWen
Committer:RileyWen

Merge branch 'master' into feat/metrics
Resolve conflicts in tracing instrumentation vs master refactoring:
- Supervisor.proto: keep both thread_pool_size and tracing fields
- CtldPublicDefs.h: keep both job_traceparents and pending_append_steps
- JobScheduler.cpp: adapt tracing to fire-and-forget RPC pattern, capture traceparents in lambda before steps are moved
- JobManager.cpp: keep job/free trace span with async queue path
- StepInstance.cpp: keep supervisor_ready trace span
- TaskManager.cpp: keep execute/finish trace spans, use new StepStatusChangeAsync signature (Completing status + final_status)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Commit:05c4d92
Author:huerni
Committer:GitHub

docs: Add detailed error codes for RPC methods in Crane.proto (#891)
* docs: Add detailed error codes for RPC methods in Crane.proto
* docs: Update error codes in Crane.proto for clarity and completeness
* docs: Update error codes in Crane.proto for accuracy and completeness

Commit:4f62121
Author:Junlin Li
Committer:GitHub

perf: Improve performance (#894) * Optimize perf Optimize perf Optimize perf style: auto format with clang-format. Log supervisor execution details in StepInstance Add logging for supervisor execution in StepInstance. style: auto format with clang-format. perf: defer supervisor cgroup migration to after initialization Move cgroup migration from the child process (before execvp) to the parent process (after receiving SupervisorReady). Previously the supervisor process was migrated into the job's CPU-limited cgroup before execvp, causing the entire supervisor initialization to run under the job's CPU quota (e.g. 0.1 CPU), inflating startup from ~38ms to ~350ms. Now craned migrates the supervisor via MigrateProcIn(child_pid) after the supervisor signals ready, so initialization runs without CPU throttling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Reorganize Supervisor initialization sequence Reorder initialization of g_server and g_craned_client to ensure SupervisorServer is created after sending SupervisorReady. style: auto format with clang-format. docs: Add examples of ccon perf: Optimize status change pipeline and fix FreeJobs race condition - Batch AppendSteps in CleanJobStatusChangeQueueCb_ to reduce per-job embedded DB overhead from ~160ms to amortized ~3ms (batch of 50+) - Remove latch-based RPC waits in status change loop (fire-and-forget) - Fix FreeJobs/FreeSteps race: step was removed from m_job_map_ before ShutdownSupervisor but not yet in completing maps, causing supervisor's Completing+final_status to be lost and daemon step to timeout. Refactored to ConcurrentQueue+AsyncHandle pattern so completing maps are populated before ShutdownSupervisor fires. 
- Increase kMaxStatusWaitRetryCount to 50 (10s) for pending_terminal_status Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> refactor: Remove perf instrumentation and dead code - Remove StatusChange loop timing/counting instrumentation from CleanJobStatusChangeQueueCb_ - Remove CleanUpJobAndStepsAsync (replaced by EvCleanFreeJobsQueueCb_ and EvCleanFreeStepsQueueCb_ in previous commit) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: Make thread pool size configurable for all components Add ThreadPoolSize config option to CraneCtld, Craned, and Supervisor sections. Default 0 means auto (hardware_concurrency). Also adds SchedulerRpcThreadPoolSize for ctld scheduler RPC dispatch pool. Removes leftover perf instrumentation from StepInstance.cpp and Supervisor.cpp. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * style: auto format with clang-format. * feat: Make StatusChange batch size and flush timeout configurable Add StatusChangeFlushTimeoutMs and StatusChangeBatchNum to CraneCtld config section, replacing hardcoded constants in JobScheduler. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: Add performance tuning guide Add bilingual (en/zh) documentation for performance-related configuration parameters including thread pool sizes and StatusChange batching settings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * style: auto format with clang-format. * fix: Remove stale RPCs after AppendSteps failure and sync FreeJobs cleanup - Strip failed jobs from craned_step_alloc_map when batch AppendSteps fails, preventing AllocSteps RPCs for already-failed jobs. - FreeJobs now checks m_is_ending_now_ and synchronously removes jobs from job_map before returning, ensuring callers see a clean state. - Event loop callback only handles completing maps and ShutdownSupervisor. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * style: auto format with clang-format. --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
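The ThreadPoolSize rule stated in this commit (0 means auto, taken from hardware_concurrency) might look like the sketch below. `ResolveThreadPoolSize` is a hypothetical helper name. The fallback to 1 is an added assumption, since `std::thread::hardware_concurrency()` is allowed to return 0 when the value cannot be determined.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>

// Hypothetical helper: maps a configured pool size to an effective one.
// configured == 0 is treated as "auto", following the commit message.
std::uint32_t ResolveThreadPoolSize(std::uint32_t configured) {
  if (configured > 0) return configured;
  // hardware_concurrency() may return 0 if the core count is unknown,
  // so clamp to at least one worker (an assumption of this sketch).
  return std::max(1u, std::thread::hardware_concurrency());
}
```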

Commit:fd1e8ac
Author:junlinli

batch

Commit:9fc1d48
Author:junlinli
Committer:junlinli

add config Signed-off-by: junlinli <xiafeng.li@foxmail.com>

Commit:eaec750
Author:junlinli
Committer:junlinli

Optimize latency

Commit:e2224ce
Author:Junlin Li
Committer:GitHub

refactor: Add completing status (#866) * fix Signed-off-by: junlinli <xiafeng.li@foxmail.com> style: auto format with clang-format. fix crun Signed-off-by: junlinli <xiafeng.li@foxmail.com> fix refactor: Completing status Signed-off-by: junlinli <xiafeng.li@foxmail.com> * fix Signed-off-by: junlinli <xiafeng.li@foxmail.com> * style: auto format with clang-format. * refactor: enhance step retry mechanism with state tracking * style: auto format with clang-format. * fix * style: auto format with clang-format. * fix * style: auto format with clang-format. --------- Signed-off-by: junlinli <xiafeng.li@foxmail.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Commit:522f7a2
Author:RileyWen

Merge branch 'master' into feat/metrics

Commit:64e90f9
Author:RileyWen

feat: enhanced tracing with gap-filling spans, Perfetto native export, and system-level analysis
Tracing instrumentation:
- Add step_type attribute (daemon/primary/common) to spawn and execute spans
- Add RPC debug spans: job/rpc_execute, step/rpc_receive, step/queue_dispatch, job/status_change for network latency visibility
- Add job/pending ManualSpan (independent trace) for scheduling wait measurement
- Add step/schedule span in StepScheduleThread_ for crun re-scheduling visibility
- Add system context: pending_queue_depth, running_job_count on scheduling/cycle
- Fix nesting violations: remove parallel worker pool spans that overlap on same lifecycle track (job/alloc_job_rpc, step/alloc_rpc, job/release)
- Fix job/pending: independent trace instead of submit child (outlives parent)
- Pass spawn span context to Supervisor (step/execute becomes child of spawn)
- Add ManualSpan.CreateChild() and friend declaration for cross-class access
- Proto: add job_traceparents field to ExecuteStepsRequest
Visualization tooling (query_trace.py):
- Add --perfetto export to Perfetto native proto format (.pftrace) with proper nested slices, flow arrows, debug annotations, and sorted BEGIN/END events
- Add --system mode for aggregate performance analysis (P50/P95/P99 per phase, per-node RPC latency, step_type breakdown)
- Add --system --chrome and --system --perfetto for system-level visualization
- Fix Chrome Trace start_time: compute from _time - duration_us (InfluxDB _time is end time, not start time)
- Fix Chrome Trace lane assignment: step_type labels, greedy no-overlap allocation
Documentation and testing:
- Rewrite docs/zh/features/tracing.md: add visualization, performance analysis, stress test, and troubleshooting workflow sections
- Add test/Trace/stress_test.sh for automated load testing with analysis
- Add .venv to .gitignore
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
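The Chrome Trace start_time fix above comes down to one subtraction, because the stored timestamp marks the end of the span. The sketch below only illustrates that arithmetic. The real tool is query_trace.py, and `SpanRow` and `SpanStartUs` are hypothetical names that assume both values are in microseconds.

```cpp
#include <cstdint>

// Hypothetical row shape: the stored timestamp is the span's END time.
struct SpanRow {
  std::int64_t end_time_us;  // corresponds to the stored _time field
  std::int64_t duration_us;  // span duration in microseconds
};

// Start time is recovered by stepping back from the end by the duration.
std::int64_t SpanStartUs(const SpanRow& row) {
  return row.end_time_us - row.duration_us;
}
```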

Commit:af8ed1c
Author:NamelessOIer
Committer:GitHub

feat: ResourceV3 (#867)
* feat: ResourceV3
* fix
* style: auto format with clang-format.
* fix
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Commit:eb6a20b
Author:Zhang Yanwen
Committer:GitHub

feat: add suspend and resume (#632)
* feat: add job suspend/resume functionality rebased on master
Port the suspend/resume feature from feat/suspend branch with corrected task/step/job naming conventions from master's refactor (#860).
Changes include:
- Proto: Add SuspendJobsRequest/Reply, ResumeJobsRequest/Reply messages; Add Suspended(=10) to JobStatus enum; Add suspend_time to RuntimeAttrOfJob; Add Suspend/Resume RPCs to Craned and Supervisor services
- CraneCtld: Add SuspendRunningJobs/ResumeSuspendedJobs to JobScheduler; Add suspend_time tracking in JobInCtld; Add CranedStub suspend/resume RPCs; Add Suspend/Resume cases in ModifyJob handler; Handle Suspended status in EmbeddedDbClient recovery
- Craned Core: Add SuspendJobs/ResumeJobs gRPC handlers in CranedServer; Add GetAllocatedJobSteps(job_id) and GetSupervisorStub() to JobManager; Add SuspendJob/ResumeJob in SupervisorStub; Handle Suspended status during recovery sync in CtldClient
- Craned Supervisor: Add SuspendJob/ResumeJob handlers in SupervisorServer; Add SuspendJobAsync/ResumeJobAsync in TaskManager with cgroup freezer
- CgroupManager: Add FreezeCgroupByPath/ThawCgroupByPath and child variants supporting both cgroup v1 (freezer.state) and v2 (cgroup.freeze)
- String.h: Add "Suspended" to StepStatusToString array
* fix
* change to job freeze
* delete old codes
* fix bug
* fix
* fix error code
* fix
* fix
* format
* fix
* fix
* fix
* fix
* fix bugs
* fix
* fix cgroup
* fix
* FIX
* fix
* fix
* fix
* format
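The two freezer interfaces named in this commit differ in both file name and accepted value: cgroup v1 takes FROZEN or THAWED in freezer.state, while cgroup v2 takes 1 or 0 in cgroup.freeze. A small sketch of that mapping follows. `CgroupVersion`, `FreezerWrite`, and `MakeFreezerWrite` are hypothetical names, not the CgroupManager API, and the write to the cgroup filesystem is left out.

```cpp
#include <string>

enum class CgroupVersion { kV1, kV2 };

// What to write, and to which control file inside the cgroup directory.
struct FreezerWrite {
  std::string file;
  std::string value;
};

// Hypothetical helper: picks the control file and value for freeze or thaw.
FreezerWrite MakeFreezerWrite(CgroupVersion version, bool freeze) {
  if (version == CgroupVersion::kV1) {
    // cgroup v1 freezer controller: state names are written as text.
    return {"freezer.state", freeze ? "FROZEN" : "THAWED"};
  }
  // cgroup v2: a single boolean-like file on every non-root cgroup.
  return {"cgroup.freeze", freeze ? "1" : "0"};
}
```

Keeping the mapping separate from the file write makes the version difference easy to test without touching a real cgroup hierarchy.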

Commit:05d4d01
Author:Yongkun Li
Committer:GitHub

fix: Fix pod starting and add skipping subuid/gids (#869)
* feat: Add UidShift for managed SubID
* fix: Fix pod not starting
* refactor: task/step exiting path
* style: auto format with clang-format.
* feat: Fix docs label and pod failure
* feat: Optimize subId management
* fix: Rebased and fix invalid status DEADLINE
* refactor: Converge finished step status into a helper func
* fix: Fix calloc timeout
* style: auto format with clang-format.
* fix: Fix sticky state
* fix: Fix pod failure status
* style: auto format with clang-format.
* fix: Fix dup pod launching just in case
* fix: Fix duplicated OOM init=false
* refactor: Fix TIMEOUT/DEADLINE status missing on calloc
* style: auto format with clang-format.
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Commit:0cf8261
Author:edragain
Committer:GitHub

feat: add cbatch --deadline (#630)
* feat: add cbatch --deadline (squash commits)
* fix format

Commit:4251f51
Author:Junlin Li
Committer:GitHub

feat: Batch input/output/error option with file pattern (#747) * fix: input/output/err for crun cbatch style: auto format with clang-format. fix refactor: improve task management and exit status handling in CforedClient (#861) fix Signed-off-by: junlinli <xiafeng.li@foxmail.com> fix: input/output/err for single task fix: step lost Signed-off-by: junlinli <xiafeng.li@foxmail.com> feat: Crun input/output/err Signed-off-by: junlinli <xiafeng.li@foxmail.com> refactor Signed-off-by: junlinli <xiafeng.li@foxmail.com> style: auto format with clang-format. feat: Batch input/output/error option with file pattern Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: submit time,start time, supervisor coredump Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> Signed-off-by: junlinli <xiafeng.li@foxmail.com> feat: Step with tasks. Signed-off-by: junlinli <xiafeng.li@foxmail.com> fix: Cancel step not found in craned Signed-off-by: junlinli <xiafeng.li@foxmail.com> fix: Sync bugfix from fix/conn Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: compile Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: AI comments Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Comments Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> style: auto format with clang-format. bugfix X11 multi conn Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: X11 multi conn Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> WIP: X11 Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: X11 forwarding Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Subprocess cg Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Only init cfored for ia task Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Exit reason Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Deadlock Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> style: auto format with clang-format. 
fix Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: tasks step exit code Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Share mem for tasks in a same step on node Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> feat: Support task execution Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Cgroup recovery Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> refactor: Use std::chrono_literals Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> refactor Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> refactor: PAM Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> move user cgroup of daemon step to Supervisor Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> Update TaskScheduler.cpp feat: Step cgroup recovery Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> refactor: Supervisor step status. Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> refactor: Step instance Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> feat: Add CPU binding functionality for cgroup v1 and v2 Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> * fix Signed-off-by: junlinli <xiafeng.li@foxmail.com> --------- Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> Signed-off-by: junlinli <xiafeng.li@foxmail.com>

Commit:41a3e4f
Author:github-actions[bot]

style: auto format with clang-format.

Commit:430d8ca
Author:RileyWen

Merge branch 'master' into feat/metrics

Commit:ec282b9
Author:RileyWen

feat: implement flame-graph-level distributed tracing instrumentation
Add 25 trace spans across CraneCtld, Craned, and Supervisor to enable full flame-graph visualization of job lifecycle, scheduling cycles, and status change processing.
Key changes:
Infrastructure:
- W3C traceparent cross-process propagation (CraneCtld → Craned → Supervisor)
- SpanStatus enum and service_name in SpanInfo proto
- New tracing macros: CRANE_TRACE_SCOPE_FROM_REMOTE, CRANE_TRACE_CHILD_NAMED
- TracerManager cleanup: remove unused APIs, safe Shutdown with ForceFlush
Instrumentation (3 independent trace dimensions):
- Job lifecycle trace (per-job, cross 3 services): alloc → prolog → supervisor_spawn → execute → finish → end → release → commit
- Scheduling cycle trace (per-cycle): node_select, resource_validate, db_persist, rpc_alloc_jobs, rpc_alloc_steps
- Status change trace (per-batch): rpc_fanout, db_commit
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
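For readers unfamiliar with the W3C traceparent header mentioned above, here is a minimal sketch of formatting and parsing its version-00 layout: `00-<32 hex trace id>-<16 hex parent id>-<2 hex flags>`. It is not the CraneSched implementation. `TraceParent`, `FormatTraceparent`, and `ParseTraceparent` are hypothetical names, and only version 00 is handled (C++17 for `std::optional`).

```cpp
#include <optional>
#include <string>

// Hypothetical holder for the three variable fields of a traceparent header.
struct TraceParent {
  std::string trace_id;  // 32 lowercase hex characters
  std::string span_id;   // 16 lowercase hex characters (the parent id)
  std::string flags;     // 2 lowercase hex characters
};

static bool IsLowerHex(const std::string& s) {
  for (char c : s) {
    const bool ok = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f');
    if (!ok) return false;
  }
  return true;
}

std::string FormatTraceparent(const TraceParent& tp) {
  return "00-" + tp.trace_id + "-" + tp.span_id + "-" + tp.flags;
}

std::optional<TraceParent> ParseTraceparent(const std::string& header) {
  // Layout: "00-" (3) + trace id (32) + "-" + span id (16) + "-" + flags (2).
  if (header.size() != 55) return std::nullopt;
  if (header.compare(0, 3, "00-") != 0) return std::nullopt;
  if (header[35] != '-' || header[52] != '-') return std::nullopt;

  TraceParent tp{header.substr(3, 32), header.substr(36, 16),
                 header.substr(53, 2)};
  if (!IsLowerHex(tp.trace_id) || !IsLowerHex(tp.span_id) ||
      !IsLowerHex(tp.flags)) {
    return std::nullopt;
  }
  // All-zero trace ids and span ids are invalid per the specification.
  if (tp.trace_id == std::string(32, '0')) return std::nullopt;
  if (tp.span_id == std::string(16, '0')) return std::nullopt;
  return tp;
}
```

A receiving service would parse the incoming header, start its own span as a child of `span_id`, and forward a new header carrying the same `trace_id`.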

Commit:0596d94
Author:huerni
Committer:GitHub

feat: add slurm env (#800)
* feat: add slurm env
* feat: add EnableSlurmCompatibleEnv
* style: auto format with clang-format.
* fix: Fix GTIDS
* feat: todo
* style: auto format with clang-format.
* feat: add slurm env when real task
* style: auto format with clang-format.
* fix(CtldPublicDefs): fix node_index increment in GetStepToD
Previously, `node_index` was only incremented when a node was found in `craned_task_map`, causing incorrect node index assignments when some nodes were not present in the map. Restructured the logic to always increment `node_index` for every node regardless of map lookup result, ensuring correct index-to-node mapping. Also changed post-increment to pre-increment for minor optimization.
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Nativu5 <44155313+Nativu5@users.noreply.github.com>
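The node_index fix described in this commit can be illustrated with a small sketch. It is not the GetStepToD code: `AssignNodeIndices` and its parameters are hypothetical. The point is that the counter advances for every node in the list, whether or not that node appears in the task map.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical helper: records the position of each node that has tasks.
// The index reflects the node's position in the full node list.
std::unordered_map<std::string, std::size_t> AssignNodeIndices(
    const std::vector<std::string>& nodes,
    const std::unordered_set<std::string>& nodes_with_tasks) {
  std::unordered_map<std::string, std::size_t> result;
  std::size_t node_index = 0;
  for (const auto& node : nodes) {
    if (nodes_with_tasks.count(node) != 0) {
      result[node] = node_index;
    }
    // Advance for every node. Incrementing only inside the branch above
    // would shift the index of every node that follows a skipped one.
    ++node_index;
  }
  return result;
}
```

With nodes n0, n1, n2 and tasks only on n0 and n2, n2 keeps index 2. The pre-fix behaviour described in the message would have given it index 1.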

Commit:50b3e5c
Author:NamelessOIer
Committer:GitHub

refactor: job/step/task misuse (#860)
* fix: job/step/task misuse
* bug fix
* style: auto format with clang-format.
* fix
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
* fix
* fix: x11 task to step
* fix
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

Commit:a24dc83
Author:huerni
Committer:GitHub

feat: Add Qos tres fields and Qos tres limit (#528) * feat: version 1 feat: qos jobs per user limit feat: account submit jobs limit src/CraneCtld/ fix: judge result fix: modify and add fix std::move refactor fix: mutex style: auto format with clang-format. fix delete user meta refactor modify field array str style: auto format with clang-format. feat: When a user/account object is deleted, resources need to be reset. fix: qos is invalid when CheckQosResource refactor: if_contains feat: not delete user when has task feat: task account_chain feat: refactor kUnlimitedQos refactor style: auto format with clang-format. refactor fix: task recover account_chain empty fix: recover task qos resource malloc style: auto format with clang-format. fix: qos resource device map add logs fix: recover running task double malloc feat: malloc use allocated_res_view refactor feat refactor: delete user when has task style: auto format with clang-format. refactor style: auto format with clang-format. fix: restore running queue user task num fix feat: CheckQosResource CpuCount rebase fix style: auto format with clang-format. feat: CheckQosResource use allocated_res_view refactor qos assert refactor todo merge master merge master fix modify update QosToString update kCraneErrStrArr delete header refactor feat: qos submit jobs per user limit refactor modify field array str style: auto format with clang-format. feat: not delete user when has task feat: task account_chain feat: refactor kUnlimitedQos refactor refactor feat feat: tres limit feat: refactor tres limit and qos tres modify feat: add qos flags DenyOnLimit feat: add ERR_MAX_TRES_PER_USER_BEYOND and ERR_MAX_TRES_PER_ACCOUNT_BEYOND feat: Per Qos limit fix: CheckTres_ refactor feat: wall time feat: Compatible with max-cpu-per-user fix: 1. grpwall 2. mem bytes 3. CraneErrStr feat: deleteQosMeta refactor fix: 1. MallocQosResourceToRecoveredPendingTask and MallocQosResourceToRecoveredRunningTask 2. 
maxwall merge dev/qos_jobs_limit fix: recover running task malloc feat: malloc use allocated_res_view merge dev/qos_jobs_limit refactor feat: add task reason fix fix some bugs feat: ERR_CPUS_PER_TASK_BEYOND refactor feat rebase refactor fix: max wall feat: add modify qos refactor * merge master * refactor and merge master * refactor * refactor * refactor * feat: refactor code * style: auto format with clang-format. * refactor * fix CheckQosResource_ * style: auto format with clang-format. * refactor * refactor code name and flag set * style: auto format with clang-format. * feat: refactor max memory * style: auto format with clang-format. * fix: cpu mongo read and unlimited * fix: flags * fix: max wall * style: auto format with clang-format. * feat: fix unlimited qos cpu default value * feat: add flagset assert * style: auto format with clang-format. * feat: max_cpus_per_user use ERR_CPUS_PER_TASK_BEYOND * feat: merge master --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Commit:5764a60
Author:RileyWen

Merge branch 'master' into feat/metrics

Commit:bf9c016
Author:Junlin Li
Committer:GitHub

feat: X11 multi connection and multi task support for step (#727) * fix: Cancel when configuring style: auto format with clang-format. fix mem limit debug Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> debug Signed-off-by: junlinli <xiafeng.li@foxmail.com> fix: DB cpu type cast Signed-off-by: junlinli <xiafeng.li@foxmail.com> fix: X11 local socket close handling Signed-off-by: junlinli <xiafeng.li@foxmail.com> bugfix fix: resource check fix: Cgroup leak when step failure Signed-off-by: junlinli <xiafeng.li@foxmail.com> style: auto format with clang-format. fix: crun primary step resource allocation feat: ntasks (#793) fix: x11 fd closed twice. Signed-off-by: junlinli <xiafeng.li@foxmail.com> refactor: Copy InteractiveMeta Signed-off-by: junlinli <xiafeng.li@foxmail.com> style: auto format with clang-format. fix: Sync bugfix from fix/conn Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: compile Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: AI comments Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Comments Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> style: auto format with clang-format. bugfix X11 multi conn Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: X11 multi conn Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> WIP: X11 Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: X11 forwarding Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Subprocess cg Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Only init cfored for ia task Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Exit reason Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Deadlock Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> style: auto format with clang-format. 
fix Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: tasks step exit code Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Share mem for tasks in a same step on node Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> feat: Support task execution Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Cgroup recovery Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> refactor: Use std::chrono_literals Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> refactor Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> refactor: PAM Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> move user cgroup of daemon step to Supervisor Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> Update TaskScheduler.cpp feat: Step cgroup recovery Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> refactor: Supervisor step status. Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> refactor: Step instance Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> feat: Add CPU binding functionality for cgroup v1 and v2 Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> style: auto format with clang-format. fix: Cancel step not found in craned Signed-off-by: junlinli <xiafeng.li@foxmail.com> feat: Step with tasks. Signed-off-by: junlinli <xiafeng.li@foxmail.com> fix: compile Signed-off-by: junlinli <xiafeng.li@foxmail.com> fix: daemon step lost Signed-off-by: junlinli <xiafeng.li@foxmail.com> fix: cancel configuring job Signed-off-by: junlinli <xiafeng.li@foxmail.com> * fix Signed-off-by: junlinli <xiafeng.li@foxmail.com> * bugfix * fix: req res view filed lost Signed-off-by: junlinli <xiafeng.li@foxmail.com> * bug fix * rename task/node/total_res_view * style: auto format with clang-format. * bug fix * fix: Coredump when Completing Signed-off-by: junlinli <xiafeng.li@foxmail.com> * fix: Coredump when Completing Signed-off-by: junlinli <xiafeng.li@foxmail.com> * fix: Reset sigchld handler after fork Signed-off-by: junlinli <xiafeng.li@foxmail.com> * fix: X11 auth pwd not init. 
Signed-off-by: junlinli <xiafeng.li@foxmail.com> * fix --------- Signed-off-by: junlinli <xiafeng.li@foxmail.com> Co-authored-by: NamelessOIer <70872016+NamelessOIer@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Commit:a004021
Author:RileyWen

refactor: redesign OpenTelemetry tracing infrastructure
- Replace CRANE_ENABLE_TEST with CRANE_ENABLE_TRACING throughout
- Add ScopedSpan RAII class with macro system (Tracing.h) for zero-boilerplate span creation with compile-time and runtime gating
- Add TracingConfig to config structs, controlled by config.yaml
- Remove g_tracer globals; TracerManager singleton handles tracer access
- Refactor CraneSpanExporter (renamed from TracePluginExporter): simplify constructor, extract helpers, move implementation to .cpp
- Migrate all 10 tracing call sites from verbose ifdef blocks to single-line CRANE_TRACE_POINT_ATTR / CRANE_TRACE_SCOPE_NAMED macros
- Add tracing_enabled field to Supervisor proto for config propagation
- Use CMake INTERFACE library for crane_tracer when tracing is disabled
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
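The idea of an RAII span with both compile-time and runtime gating can be sketched as below. This is an illustration of the pattern only, not the contents of Tracing.h. Every name here (`SKETCH_ENABLE_TRACING`, `SKETCH_TRACE_SCOPE`, `SpanSink`, `TracingEnabledAtRuntime`, and this `ScopedSpan`) is hypothetical, and the in-memory sink stands in for a real exporter. The sketch is not thread-safe.

```cpp
#include <chrono>
#include <string>
#include <utility>
#include <vector>

// Compile-time gate (hypothetical macro name); define as 0 to compile spans out.
#ifndef SKETCH_ENABLE_TRACING
#define SKETCH_ENABLE_TRACING 1
#endif

struct FinishedSpan {
  std::string name;
  std::chrono::steady_clock::duration duration;
};

// In-memory sink standing in for a real span exporter.
inline std::vector<FinishedSpan>& SpanSink() {
  static std::vector<FinishedSpan> sink;
  return sink;
}

// Runtime gate, e.g. populated from a configuration file at startup.
inline bool& TracingEnabledAtRuntime() {
  static bool enabled = true;
  return enabled;
}

// RAII span: starts timing on construction, records itself on destruction.
class ScopedSpan {
 public:
  explicit ScopedSpan(std::string name)
      : name_(std::move(name)),
        active_(TracingEnabledAtRuntime()),
        start_(std::chrono::steady_clock::now()) {}

  ~ScopedSpan() {
    if (active_) {
      SpanSink().push_back({name_, std::chrono::steady_clock::now() - start_});
    }
  }

  ScopedSpan(const ScopedSpan&) = delete;
  ScopedSpan& operator=(const ScopedSpan&) = delete;

 private:
  std::string name_;
  bool active_;
  std::chrono::steady_clock::time_point start_;
};

#if SKETCH_ENABLE_TRACING
#define SKETCH_TRACE_SCOPE(name) ScopedSpan sketch_scoped_span_{name}
#else
#define SKETCH_TRACE_SCOPE(name) ((void)0)
#endif
```

A call site then needs a single line such as `SKETCH_TRACE_SCOPE("node_select");` at the top of a scope, which is the kind of reduction from ifdef blocks that the commit describes.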

Commit:065de92
Author:zhansan114514
Committer:RileyWen

insert data into influxdb

Commit:5dfe12d
Author:NamelessOIer

rename task/node/total_res_view

Commit:de93be6
Author:Li Junlin
Committer:junlinli

fix: Cancel when configuring style: auto format with clang-format. fix mem limit debug Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> debug Signed-off-by: junlinli <xiafeng.li@foxmail.com> fix: DB cpu type cast Signed-off-by: junlinli <xiafeng.li@foxmail.com> fix: X11 local socket close handling Signed-off-by: junlinli <xiafeng.li@foxmail.com> bugfix fix: resource check fix: Cgroup leak when step failure Signed-off-by: junlinli <xiafeng.li@foxmail.com> style: auto format with clang-format. fix: crun primary step resource allocation feat: ntasks (#793) fix: x11 fd closed twice. Signed-off-by: junlinli <xiafeng.li@foxmail.com> refactor: Copy InteractiveMeta Signed-off-by: junlinli <xiafeng.li@foxmail.com> style: auto format with clang-format. fix: Sync bugfix from fix/conn Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: compile Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: AI comments Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Comments Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> style: auto format with clang-format. bugfix X11 multi conn Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: X11 multi conn Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> WIP: X11 Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: X11 forwarding Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Subprocess cg Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Only init cfored for ia task Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Exit reason Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Deadlock Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> style: auto format with clang-format. 
fix Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: tasks step exit code Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Share mem for tasks in a same step on node Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> feat: Support task execution Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Cgroup recovery Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> refactor: Use std::chrono_literals Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> refactor Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> refactor: PAM Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> move user cgroup of daemon step to Supervisor Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> Update TaskScheduler.cpp feat: Step cgroup recovery Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> refactor: Supervisor step status. Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> refactor: Step instance Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> feat: Add CPU binding functionality for cgroup v1 and v2 Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> style: auto format with clang-format. fix: Cancel step not found in craned Signed-off-by: junlinli <xiafeng.li@foxmail.com> feat: Step with tasks. Signed-off-by: junlinli <xiafeng.li@foxmail.com> fix: compile Signed-off-by: junlinli <xiafeng.li@foxmail.com> fix: daemon step lost Signed-off-by: junlinli <xiafeng.li@foxmail.com> fix: cancel configuring job Signed-off-by: junlinli <xiafeng.li@foxmail.com>

Commit:818faac
Author:edragain
Committer:GitHub

feat: cinfo --list-reasons/-R (#817)
* cinfo -R/--list-reasons
* enable single node mode
* change style
* remove unused header
* fix format

Commit:6ccbbe4
Author:edragain
Committer:GitHub

feat: add field SubmitNode to ccontrol show job (#835)
* feat: submit_node
* add db support
* modify name
* fix format
* add field to jobfetchrecord
* fix format

Commit:1054aaf
Author:RileyWen

fix: resolve rebase conflicts and fix pre-existing bugs in dev/io
Rebase fixes:
- Remove duplicate LicenseInfo message in PublicDefs.proto
- Remove duplicate struct/member declarations in TaskManager.h
- Fix PodInstance constructor missing task_id parameter
- Replace IsDaemonStep() with IsDaemon(), add missing IsBatch()
- Remove duplicate io_meta copy in CtldPublicDefs.cpp
Pre-existing bug fixes:
- Fix InteractiveMeta modified by value instead of reference
- Fix double on_finish() call in CforedClient pipe handlers
- Fix input_file_pattern assigned to parsed_output_file_pattern
- Fix std::expected::error() used as boolean (UB)
- Add missing continue after step cgroup handling
- Remove duplicate ERROR+WARN logging in SupervisorStub
Minor fixes:
- Fix stale "Daemon step" log message to "Primary step"
- Remove unused variable in String.h
- Move chmod after BuildAndStart in CranedServer
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Commit:3a81557
Author:RileyWen
Committer:RileyWen

feat: Batch I/O, step tasks, and related fixes (squashed dev/io)

Commit:e706d8c
Author:Riley W
Committer:GitHub

feat: reset task id counter (#843) * feat: add ResetNextTaskId RPC for resetting task ID counters Admin-only RPC that resets task ID and task DB ID counters in the embedded database without requiring a ctld restart. Used by the test framework to reset state between test cases. - Proto: ResetNextTaskIdRequest/Reply with next_task_id and next_task_db_id fields (0 = don't change, >0 = reset to value) - EmbeddedDbClient: ResetNextTaskId() resets counters in both DB and memory (also resets step counters when task ID is reset) - RPC handler: admin auth via CheckUidIsAdmin, server readiness check Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: add purge/reset RPCs for all metadata types Add bulk cleanup capabilities for test framework integration: - PurgeAllAccounts: post-order subtree deletion, skip ROOT - PurgeAllQos: skip UNLIMITED default QoS - PurgeAllWckeys: skip root user's wckeys - PurgeAllLicenseResources: clear all license/resource entries - PurgeAllReservations: delete all reservations - ResetAllPartitionAcls: reload defaults from config or clear all - ResetNextStepDbId: reset step ID counters independently Each triggered via "delete ALL --force" CLI pattern or dedicated reset RPC. Enables skip-ctld-restart test optimization (275/281 pass). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: block account all and reset cert all (#834) * feat: add PurgeTaskHistory RPC for embedded DB cleanup Add RPC to purge all task/step history records from embedded DB (fixed_db, variable_db, step_fixed_db, step_var_db) without deleting files. This replaces the unsafe wipe_embedded() approach which unlinks files while ctld holds open fds — causing data loss when ctld restarts (new process can't find unlinked files). Frontend command: ccontrol reset task-history Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * style: auto format with clang-format. 
* Cherry-pick "fix: Cancel when configuring" from dev/x11_task (#845) * Initial plan * fix: Cancel when configuring --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: junlinli <xiafeng.li@foxmail.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: huerni <47264950+huerni@users.noreply.github.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com> Co-authored-by: junlinli <xiafeng.li@foxmail.com>

Commit:3a9cc7d
Author:Yongkun Li
Committer:GitHub

feat: Fix GRES config reading and add CDI-related support (#842) * fix: fix GRES config reading and docs * chores: Add CMakeUserPresets.json * feat: Add CDI for GRES * docs: Add docs for CDI * fix: Add exiting when device parsing failed * fix: Fix corner case when parsing

Commit:956be58
Author:Junlin Li
Committer:GitHub

feat: Creport (#792) * refactor: Hour init aggregation Signed-off-by: junlinli <xiafeng.li@foxmail.com> fix: AppendToAccUsageTable for recovered job Signed-off-by: junlinli <xiafeng.li@foxmail.com> chore: Remove not used code Signed-off-by: junlinli <xiafeng.li@foxmail.com> chore: Update wipe data script Signed-off-by: junlinli <xiafeng.li@foxmail.com> refactor: Remove not used config Signed-off-by: junlinli <xiafeng.li@foxmail.com> doc doc Signed-off-by: junlinli <xiafeng.li@foxmail.com> refactor Signed-off-by: junlinli <xiafeng.li@foxmail.com> doc: Remove some comments Signed-off-by: junlinli <xiafeng.li@foxmail.com> refactor: Job aggregation Signed-off-by: junlinli <xiafeng.li@foxmail.com> refactor: Job aggregation Signed-off-by: junlinli <xiafeng.li@foxmail.com> doc: Remove creport photos. Signed-off-by: junlinli <xiafeng.li@foxmail.com> Update ceff.md doc: Update doc format Signed-off-by: junlinli <xiafeng.li@foxmail.com> fix: Update success time only when success Signed-off-by: junlinli <xiafeng.li@foxmail.com> refactor: Job summary by gid Signed-off-by: junlinli <xiafeng.li@foxmail.com> refactor: Job summary date Signed-off-by: junlinli <xiafeng.li@foxmail.com> refactor: Job summary query Signed-off-by: junlinli <xiafeng.li@foxmail.com> refactor: Mongodb Job aggregation Signed-off-by: junlinli <xiafeng.li@foxmail.com> refactor: View value or Signed-off-by: junlinli <xiafeng.li@foxmail.com> fix: doc Signed-off-by: junlinli <xiafeng.li@foxmail.com> style: auto format with clang-format. refactor: Rewrite with chrono fix comments style: auto format with clang-format. 
fix comments opt active agg add creport cmd help opt code fix name opt rolltype rebase change query add nodenamelist add time opt query use mongodb func use one table change code style add jobsize cmd add wckeyprocess add hour timer add cpu_alloc_level use thread pool change stream size change stream out use P/C thread bulk opt fix log fix cgroup date change cplugin test Signed-off-by: junlinli <xiafeng.li@foxmail.com> * refactor Signed-off-by: junlinli <xiafeng.li@foxmail.com> * refactor Signed-off-by: junlinli <xiafeng.li@foxmail.com> * fix: creport job size query Signed-off-by: junlinli <xiafeng.li@foxmail.com> * fix: transaction for acc usage recovery and marking. Signed-off-by: junlinli <xiafeng.li@foxmail.com> * fix: Remove cluster field, fix aggregate time not set. Signed-off-by: junlinli <xiafeng.li@foxmail.com> * fix: comments Signed-off-by: junlinli <xiafeng.li@foxmail.com> * doc Signed-off-by: junlinli <xiafeng.li@foxmail.com> * chore: typo Signed-off-by: junlinli <xiafeng.li@foxmail.com> * fix: Remove not used function Signed-off-by: junlinli <xiafeng.li@foxmail.com> * fix: Data type, Ctld destruct order Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> * fix Signed-off-by: junlinli <xiafeng.li@foxmail.com> * style: auto format with clang-format. --------- Signed-off-by: junlinli <xiafeng.li@foxmail.com> Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> Co-authored-by: db <1301189887@qq.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Commit:4dcb7f9
Author:huerni
Committer:GitHub

feat: Add QOS fields: MaxJobsPerUser, MaxSubmitJobsPerUser, MaxJobsPerAccount, MaxSubmitJobsPerAccount (#499) * feat: qos submit jobs per user limit * feat: qos jobs per user limit * feat: account submit jobs limit * src/CraneCtld/ * fix: judge result * fix: modify and add * fix std::move * refactor * fix: mutex * style: auto format with clang-format. * fix delete user meta * refactor modify field array str * style: auto format with clang-format. * feat: When a user/account object is deleted, resources need to be reset. * fix: qos is invalid when CheckQosResource * refactor: if_contains * feat: not delete user when has task * feat: task account_chain * feat: refactor kUnlimitedQos * refactor * style: auto format with clang-format. * refactor * fix: task recover account_chain empty * fix: recover task qos resource malloc * style: auto format with clang-format. * fix: qos resource device map * add logs * fix: recover running task double malloc * feat: malloc use allocated_res_view * refactor * feat * refactor: delete user when has task * style: auto format with clang-format. * refactor * style: auto format with clang-format. * fix: restore running queue user task num * fix * feat: CheckQosResource CpuCount * rebase fix * style: auto format with clang-format. * feat: CheckQosResource use allocated_res_view * refactor qos assert * refactor * todo * merge master * merge master * fix modify * update QosToString * update kCraneErrStrArr * delete header * refactor * style: auto format with clang-format. * refactor PdJobInScheduler * style: auto format with clang-format. * fix * fix lic * fix * style: auto format with clang-format. --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Commit:984a92c
Author:edragain
Committer:GitHub

feat: add dns config to annotations (#811) * add job_id hostname uid to cni plugin's annotations * add dns to config * adjust format * add docs * fix: Fix wrong YAML option reading * refactor: Refactor annotations * chores: Amend documentations * feat: Add FQDN and refactor * fix: Fix dns server field in db * fix: Fix docs example on DNS config --------- Co-authored-by: Nativu5 <44155313+Nativu5@users.noreply.github.com>

Commit:4d4628e
Author:huerni
Committer:GitHub

fix: wckey some bug (#816) * fix: delete wckey add force and cbatch no wckey * fix: cacct add wckey * fix: cacct *wckey * style: auto format with clang-format. * feat: embed add using_default_wckey * style: auto format with clang-format. --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Commit:1e9cd7d
Author:huerni
Committer:GitHub

feat: cbatch signal (#801) * add cbatch signal * fix comments * refactor signal * style: auto format with clang-format. * fix * style: auto format with clang-format. * refactor * feat: default not sig bash * style: auto format with clang-format. --------- Co-authored-by: db <1301189887@qq.com> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Commit:b72fb9d
Author:huerni
Committer:GitHub

feat: craned health check (#635) * feat: HealthCheck config read * feat: healthCheck when craned start * feat: health check * feat: randomHealthCheck * feat: health check with nodeState * feat: node state with reason * refactor * feat: add m_health_check_thread_ * style: auto format with clang-format. * fix * feat * feat: add NONDRAINED_IDLE and refactor CheckNodeState_ * style: auto format with clang-format. * feat: multi node state * feat: add ResReduceEvent * refactor node health check, not SendHealthCheckResult * feat: START_ONLY * refactor * style: auto format with clang-format. * refactor * refactor * style: auto format with clang-format. * refactor and fix long time sleep * style: auto format with clang-format. * fix NeedHealthCheck_ * feat: update doc * feat: not health check when ping failed * fix * refactor * fix compare_exchange_strong * refactor 4 * feat * merge master * fix joinable and refactor * style: auto format with clang-format. --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Commit:a797efa
Author:huerni
Committer:GitHub

feat: prolog and epilog (#687) * feat: prolog epilog config * feat: prologctld and epilogctld * feat: add ParseLogHookPaths * feat: prolog and epilog * feat: taskprolog and taskepilog * feat: run prolog with node configure state * refactor * feat: prolog and epilog * refactor * feat: set node drain when prolog and epilog failed * feat: add flags and RunPrologOrEpiLog * refactor * feat: add flag judge * feat: ctld add prolog flags * feat: serial flag * feat: prolog flag NoHold * refactor * feat * feat: flag runinjob * refactor prologctld * refactor taskprolog * feat: task prolog output get * feat: fix timeout * fix: nohold * feat: contain flag * fix dead lock * refactor * merge master * feat: refactor * merge master * refactor * refactor prolog * fix dead lock * fix out of range * refactor * style: auto format with clang-format. * feat: add AddResReduceEvents * refactor config * refactor and fix close fd * feat: add prolog_epilog_guide.md * style: auto format with clang-format. * refactor * fix: use std::atomic<bool> is_prolog_run * feat: refactor config * refactor * refactor * feat: add step task prolog * refactor * feat: add step task prolog * refactor * fix dead lock * refactor * refactor * refactor * refactor doc * style: auto format with clang-format. * feat * refactor * refactor: use STDOUT_FILENO * refactor: use ParsePrologEpilogHookPaths * refactor * style: auto format with clang-format. * refactor * merge master * refactor flag * style: auto format with clang-format. * refactor * refactor * refactor * refactor * style: auto format with clang-format. * fix: close fd * refactor * refactor * refactor * merge master and refactor doc * update doc * refactor * fix * refactor log * refactor * style: auto format with clang-format. * refactor * fix: uv__signal_handler assert failed * style: auto format with clang-format. * fix prolog when RUNINJOB * fix: epilog script_lock * style: auto format with clang-format. 
* fix * fix serial * style: auto format with clang-format. * refactor --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Commit:d342002
Author:Yongkun Li
Committer:GitHub

feat: Add subId managed mode (#802) * feat(container): Add SubId managed mode * fix: Fix docs and default offset * fix: Fix unmanaged mode * refactor: simplify api * fix: fix overflow checking * docs: Add a subid part

Commit:5da8bcf
Author:huerni
Committer:GitHub

feat: remote license (#733) * feat: licenses fix: setfields fix: The task that exceeds the total number of licenses should generate an error. * refactor * update gitignore * refactor * feat: add licenses or * refactor * fix scheduler malloc license * refactor * feat: add/modify/delete licenses * refactor * feat * refactor proto * feat: query resource * feat: allocated * feat: db operator * feat: RemoveRemoteLicense * refactor * feat: use rich err * fix * refactor last update * feat: add type modify * refactor modify to one request * refactor * feat: multi add step * feat: add license_guide * feat: add flag * feat: add flag and license_guide.md * feat: add AllLicenseResourcesAbsolute * feat: add license guide * feat: cacctmgr guide add resource * feat: wipe_data script * refactor * style: auto format with clang-format. * refactor * refactor * refactor * refactor * fix: dbclient * feat: add reserved last_deficit last_consumed in monitor * style: auto format with clang-format. * refactor * fix last_deficit and refactor * refactor * refactor * style: auto format with clang-format. * refactor * refactor * style: auto format with clang-format. * refactor * feat: update doc * refactor * style: auto format with clang-format. * fix: mutable_licenses use elem.get_int64() * fix: uint32_t -> int64_t * style: auto format with clang-format. * refactor * style: auto format with clang-format. * refactor * style: auto format with clang-format. * fix ERR_RESOURCE_ALREADY_EXIST * style: auto format with clang-format. * refactor * fix allocated > 100% * style: auto format with clang-format. --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Commit:dbc2d14
Author:huerni
Committer:huerni

refactor

Commit:028ffa0
Author:huerni
Committer:huerni

refactor

Commit:1a03e99
Author:github-actions[bot]
Committer:huerni

style: auto format with clang-format.

Commit:582abf4
Author:huerni
Committer:huerni

refactor

Commit:38fefd3
Author:huerni
Committer:huerni

refactor

Commit:963731c
Author:huerni
Committer:huerni

feat: use rich err

Commit:1846b36
Author:huerni
Committer:huerni

feat: add/modify/delete licenses

Commit:537667a
Author:huerni
Committer:huerni

feat: add flag

Commit:9dac228
Author:huerni
Committer:huerni

feat

Commit:25732ad
Author:huerni
Committer:huerni

refactor proto

Commit:2ce64bb
Author:huerni
Committer:huerni

feat: add flag and license_guide.md

Commit:caae0c0
Author:huerni
Committer:huerni

feat: query resource

Commit:d6c8b73
Author:huerni
Committer:huerni

refactor last update

Commit:b4620c5
Author:huerni
Committer:huerni

feat: add type modify

Commit:5155662
Author:huerni
Committer:huerni

refactor modify to one request

Commit:6edadc8
Author:huerni
Committer:huerni

feat: multi add step

Commit:6fb5fac
Author:edragain
Committer:GitHub

fix: duplicate modification message output (#715) * fix issue * refactor modify * adjust res_usert * adjust modify_user() * refactor modifyqos * refactor modifyaccount remove dbclient fn * adjust * remove account_map user_map qos_map * adjust code style * adjust code style(2) * fix rebase * fix format * delete useless fn * fix modify account * fix modifyuser * fix modify user bug * fix invalid username no description * fix empty description * fix test bug * fix delete qos failed * fix empty description with errcode ACCESS_ACCOUNT_DENIED * replace USER_EMPTY_PARTITION description * adjust log style

Commit:a1320fd
Author:Yongkun Li
Committer:GitHub

feat: Tighten mount permission check and add docs for container (#794) * feat: Add mount permission check * chores: Update absl dep * fix: Fix systemd installation path inconsistency and docs * chores: Remove outdated scripts * feat: Allow customization of bindfs mnt path * fix: Fix wrong annotations * docs: Fix wrong claims * feat: Add container docs * feat: Revise Chinese doc * style: auto format with clang-format. * fix: Fix docs' English translations * fix: Fix installation path and docs * feat: Add supplementary groups for checking * docs: Revise deployment guide * fix: Docs format --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Commit:2a666ee
Author:huerni
Committer:GitHub

feat: sol lua (#783) * feat: lua register function * feat: lua pool * feat: job submit * feat: register log_funcs * feat: env get and set * feat: job submit and job modify * feat: job info * feat: jobreq and jobrec * feat: jobreq field * feat: next step * feat: use taskInCtld * feat: partition rec * refactor * refactor * feat: use crane::grpc::PartitionInfo * feat: use crane::grpc::ReservationInfo * refactor * add reason * feat: add job modify lua check * feat * feat: use c++ device map * feat: add ifdef HAVA_LUA * refactor * feat: add lua env * refactor * refactor * style: auto format with clang-format. * refactor lua pool * refactor * fix ExecuteLuaScript return * refactor TimeStr2Mins * refactor * refactor * refactor * feat: add lua_guide.md * feat: add lua_guide.md * refactor and use std::string * refactor * refactor * feat: use lambda and concept * feat: job lazy get * feat: resv lazy get * feat: resv lazy get * refactor doc * feat: SetJobReqFieldCb_ * feat: refactor doc * refactor * refactor lua env to sol2 * feat: RegisterGlobalFunctions_ * feat: crane.jobs * feat: crane.reservations * feat: ResourceView * feat: fix and add doc * refactor * refactor * refactor * style: auto format with clang-format. * refactor * refactor and fix lua modify * feat: update doc * style: auto format with clang-format. * refactor * refactor * refactor * style: auto format with clang-format. * refactor TimeStr2Mins --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Commit:6a4f067
Author:huerni
Committer:huerni

add reason

Commit:8954e01
Author:huerni
Committer:huerni

feat: job submit and job modify

Commit:8e69853
Author:huerni
Committer:huerni

feat: register log_funcs

Commit:32f00ad
Author:NamelessOIer
Committer:GitHub

feat: dependency (#742) * feat: dependency * update docs * bugfix

Commit:b61b651
Author:Yongkun Li
Committer:GitHub

feat: Add multi-step CRI support for inter-node container job (#735) * feat: Add multi-step pod and container support (squashed) * refactor: Optimize logic and fix timelimit * fix: Fix launching on multiple nodes * fix: Fix race condition in daemon step * feat: Add multiple exec/attach support * feat: Add errcode for attach * style: auto format with clang-format. * fix: Fix log message and lint errors * fix: Fix redundant declaration * feat: Rename log file name to support multiple nodes' logs * fix: Fix multiple move() * fix: Fix empty mount dir style: auto format with clang-format. * fix: Fix redundant code * feat: Add first bindfs impl * feat: Add bindfs config block * feat: Add bindfs workaround for idmapped mounts * feat: Add BindFsMeta validation * fix: Fix typos and add macro fallback * fix: Make bindfs optional --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Commit:eea02f9
Author:1daidai1
Committer:GitHub

Add cbatch/calloc/crun --mem-per-cpu (#501) * fix comments style: auto format with clang-format. test Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Taskid not set when try malloc qos Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> add cbatch/calloc/crun --mem-per-cpu * fix: Wckey default handling, zero cpu/mem req handling. Signed-off-by: junlinli <xiafeng.li@foxmail.com> --------- Signed-off-by: junlinli <xiafeng.li@foxmail.com> Co-authored-by: junlinli <xiafeng.li@foxmail.com> Co-authored-by: Junlin Li <70465472+L-Xiafeng@users.noreply.github.com>

Commit:df3a38d
Author:1daidai1
Committer:GitHub

fix: proto index (#600) Signed-off-by: junlinli <xiafeng.li@foxmail.com> fix comments del exclude change log add log rebase Reapply "feat: Add job env (#495)" (#573) This reverts commit 268d2602bdb7b0fa8665748036b493f5d5bb3edf. add env

Commit:5187844
Author:1daidai1
Committer:GitHub

Add job wckey (#558) * add wckey * add new wckey * add wckey para in config * opt * fix comments * style: auto format with clang-format. * rebase * rebase * fix crun no err message * opt updatemongodb * fix comments * rebase * fix comments * fix comments * del valid_wckey * style: auto format with clang-format. * add mkdocs * add wckeyvaild * del cluster * rebase * fix comments * style: auto format with clang-format. * del must * fix comments * check permission * rebase * fix comments --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Commit:5b4e575
Author:huerni
Committer:GitHub

feat: Add license monitor (#706) * feat: licenses fix: setfields fix: The task that exceeds the total number of licenses should generate an error. * refactor * refactor * update gitignore * refactor * feat: licenses fix: setfields fix: The task that exceeds the total number of licenses should generate an error. * feat: license monitor * merge master * refactor * feat: only update change license * style: auto format with clang-format. --------- Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Commit:274c2da
Author:db
Committer:db

add cbatch signal

Commit:3e8e55b
Author:huerni
Committer:huerni

add reason

Commit:7835a09
Author:huerni
Committer:huerni

feat: job submit and job modify

Commit:2975e3e
Author:huerni
Committer:huerni

feat: register log_funcs

Commit:4cc83e0
Author:Junlin Li
Committer:GitHub

feat: Step query/cancel (#644) * fix: craned id field unexpectedly removed Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Query running job with finished step style: auto format with clang-format. refactor: Fast return if no job info Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Compile error Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> feat: step schedule fix: resource error fix: Step script Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> refactor: crun step res field, inherit job when not specified Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> style: auto format with clang-format. fix: Remove pd_reason in Common step Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: cwd and cmd_line not set Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Set running node by execnode Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> Refactor error handling in TaskScheduler fix: step end time Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Compile error Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: AI comments Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Cancel job fix: Cancel pd job fix: Cancel not existed job fix: job type not set fix: Cancel all pending step when primary exit Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> refactor: Improve cancel performance Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: crun step submit Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> style: auto format with clang-format. refactor: Interactive cb Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Wrong res view returned. 
Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> WIP Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> add ResourceView::operator=(const crane::grpc::ResourceView& Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> feat: Submit step Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> WIP Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Duplicate code when rebase Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> feat: Fetch job step info and remove unused FetchStepRecords method Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> update db Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> feat: Query step info and cancel step fix: StepInfo elapsed_time not set Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> feat: Query step info from mongodb Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: AI comments Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> feat: Step cancel Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> fix: Remove db entry if job finish Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> feat: Cancel steps Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> WIP Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> * fix: Step gres multiplied by ntasks-per-node Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> * fix: submit time,start time, supervisor coredump Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> * fix: submit time Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> * fix: Running job elapsed time Signed-off-by: Li Junlin <xiafeng.li@foxmail.com> --------- Signed-off-by: Li Junlin <xiafeng.li@foxmail.com>

Commit:1c7728f
Author:huerni

refactor