pytabkit.bench.scheduling package

Submodules

pytabkit.bench.scheduling.execution module

class pytabkit.bench.scheduling.execution.NodeManager

Bases: object

start()
terminate()

class pytabkit.bench.scheduling.execution.RayJobManager

Bases: NodeManager

__init__(max_n_threads=None, available_cpu_ram_multiplier=1.0, available_gpu_ram_multiplier=1.0, **ray_kwargs)
Parameters:
  • max_n_threads (int | None)

  • available_cpu_ram_multiplier (float)

  • available_gpu_ram_multiplier (float)

get_resource_manager()
Return type:

ResourceManager

pop_finished_job_infos(timeout_s=-1.0)
Parameters:

timeout_s (float)

Return type:

List[JobInfo]

start()
Return type:

None

submit_job(job_info)
Parameters:

job_info (JobInfo)

Return type:

None

terminate()
Return type:

None
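
The methods above suggest a submit-then-poll workflow. The following is a minimal sketch of that pattern using a hypothetical in-process stand-in (ToyJobManager, built on concurrent.futures) rather than the real Ray-backed class. The method names mirror the ones documented here, but the bodies, the simplified submit_job signature, and the meaning given to a negative timeout_s are assumptions.

```python
import concurrent.futures as cf
from typing import Callable, Dict, List, Optional


class ToyJobManager:
    """Hypothetical stand-in that mimics the submit/poll interface; not pytabkit code."""

    def __init__(self, max_n_threads: int = 2) -> None:
        self._max_n_threads = max_n_threads
        self._pool: Optional[cf.ThreadPoolExecutor] = None
        self._futures: Dict[cf.Future, int] = {}

    def start(self) -> None:
        self._pool = cf.ThreadPoolExecutor(max_workers=self._max_n_threads)

    def submit_job(self, job_id: int, fn: Callable[[], object]) -> None:
        # The documented method takes a JobInfo; a plain callable keeps the sketch small.
        self._futures[self._pool.submit(fn)] = job_id

    def pop_finished_job_infos(self, timeout_s: float = -1.0) -> List[int]:
        # Assumption: a negative timeout means "return immediately".
        timeout = 0 if timeout_s < 0 else timeout_s
        done, _ = cf.wait(list(self._futures), timeout=timeout,
                          return_when=cf.FIRST_COMPLETED)
        # Remove finished futures from the bookkeeping and report their job ids.
        return [self._futures.pop(f) for f in done]

    def terminate(self) -> None:
        self._pool.shutdown(wait=True)


manager = ToyJobManager()
manager.start()
for job_id in range(4):
    manager.submit_job(job_id, lambda: sum(range(1000)))

finished: List[int] = []
while len(finished) < 4:
    # Poll until every submitted job has been reported once.
    finished.extend(manager.pop_finished_job_infos(timeout_s=1.0))
manager.terminate()
```

Each job id is reported exactly once because finished futures are popped from the internal dictionary as they are returned.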

pytabkit.bench.scheduling.execution.get_gpu_rams_gb(use_reserved=True)
Returns:

  • gpu_rams_gb: total GPU memory per visible device (GB)

  • gpu_rams_fixed_gb: GPU memory used by this process per visible device (GB), measured in one of two ways:

      • reserved (default): bytes reserved by the torch caching allocator (often a better match for “process used”)

      • allocated: bytes held by live tensors only

Parameters:

use_reserved (bool)
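
To illustrate the difference between the reserved and allocated measurements, here is a small sketch built on public torch.cuda calls. The function name gpu_memory_gb is hypothetical, the division by 1e9 is an assumption about what "GB" means here, and the code returns empty lists when PyTorch or CUDA is unavailable; it is not the library's implementation.

```python
from typing import List, Tuple


def gpu_memory_gb(use_reserved: bool = True) -> Tuple[List[float], List[float]]:
    """Illustrative sketch: per-device total memory and this process's usage."""
    try:
        import torch
    except ImportError:
        return [], []  # PyTorch not installed: nothing to report
    if not torch.cuda.is_available():
        return [], []  # no visible CUDA device

    totals: List[float] = []
    used: List[float] = []
    for i in range(torch.cuda.device_count()):
        totals.append(torch.cuda.get_device_properties(i).total_memory / 1e9)
        if use_reserved:
            # Bytes held by the caching allocator, including cached free blocks.
            n_bytes = torch.cuda.memory_reserved(i)
        else:
            # Bytes occupied by live tensors only.
            n_bytes = torch.cuda.memory_allocated(i)
        used.append(n_bytes / 1e9)
    return totals, used


totals_gb, used_gb = gpu_memory_gb(use_reserved=True)
```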

pytabkit.bench.scheduling.execution.measure_node_resources(node_id)

Measures the available resources.

Parameters:

node_id (int) – Node ID that will be used to identify the node in the returned NodeResources.

Returns:

A tuple of two NodeResources objects. The first contains the total available resources; the second contains the resources that a single process (with PyTorch GPU usage) consumes while idle.

Return type:

Tuple[NodeResources, NodeResources]

pytabkit.bench.scheduling.execution.node_runner(feedback_queue, job_queue, node_id)
Parameters:

node_id (int)

pytabkit.bench.scheduling.jobs module

class pytabkit.bench.scheduling.jobs.AbstractJob

Bases: object

Abstract base class for jobs that can be scheduled using schedulers in schedulers.py.

get_desc()
Returns:

A description that can be logged, e.g., when the job starts and when it finishes.

Return type:

str

get_group()
Returns:

A “group name” string. All jobs with the same group name share a common time factor that is adjusted on the fly during scheduling, based on jobs that have already completed.

Return type:

str
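
One way to picture a per-group time factor is as a running correction between estimated and measured run times. The sketch below is an assumption about how such a factor could be maintained (a plain mean of actual/estimated ratios); the class GroupTimeFactors and its methods are hypothetical and the scheduler's actual update rule may differ.

```python
from collections import defaultdict
from typing import Dict, List


class GroupTimeFactors:
    """Hypothetical helper: one multiplicative time correction per job group."""

    def __init__(self) -> None:
        self._ratios: Dict[str, List[float]] = defaultdict(list)

    def record(self, group: str, estimated_s: float, actual_s: float) -> None:
        # Store how much longer (or shorter) the job took than estimated.
        self._ratios[group].append(actual_s / estimated_s)

    def factor(self, group: str) -> float:
        # Groups without any completed job keep a neutral factor of 1.0.
        ratios = self._ratios.get(group)
        return sum(ratios) / len(ratios) if ratios else 1.0

    def corrected_estimate(self, group: str, estimated_s: float) -> float:
        return self.factor(group) * estimated_s


factors = GroupTimeFactors()
factors.record("XGB", estimated_s=10.0, actual_s=20.0)
factors.record("XGB", estimated_s=10.0, actual_s=40.0)
```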

get_required_resources()
Returns:

The resources requested by this job.

Return type:

RequiredResources
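
As a rough illustration of the interface, the sketch below implements the three documented methods on a toy job. ToyRequiredResources and SleepJob are hypothetical stand-ins written so the example runs without pytabkit; a real job would subclass AbstractJob and return the library's RequiredResources, whose fields are not documented in this section.

```python
from dataclasses import dataclass


@dataclass
class ToyRequiredResources:
    """Hypothetical stand-in for RequiredResources; the real fields may differ."""
    time_s: float
    n_threads: int
    cpu_ram_gb: float


class SleepJob:
    """Toy job exposing the three methods documented for AbstractJob."""

    def __init__(self, name: str, group: str, time_s: float) -> None:
        self._name = name
        self._group = group
        self._time_s = time_s

    def get_group(self) -> str:
        # Jobs sharing this string would share one time factor.
        return self._group

    def get_desc(self) -> str:
        # Short human-readable text for log lines.
        return f"{self._group}/{self._name}"

    def get_required_resources(self) -> ToyRequiredResources:
        return ToyRequiredResources(time_s=self._time_s, n_threads=1, cpu_ram_gb=0.5)


job = SleepJob(name="fold-0", group="XGB", time_s=12.0)
```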

class pytabkit.bench.scheduling.jobs.JobResult

Bases: object

Helper class to store information about a job that has been run.

__init__(job_id, time_s, oom_cpu=False, oom_gpu=False, finished_normally=True, exception_msg=None)
Parameters:
  • job_id (int) – Job id.

  • time_s (float) – Time in seconds that the job ran for.

  • oom_cpu (bool) – Whether an out-of-memory error occurred on the CPU.

  • oom_gpu (bool) – Whether an out-of-memory error occurred on the GPU.

  • finished_normally (bool) – Whether the job ran normally, so that its time and RAM values are representative of a typical run. For example, if the job ran faster because the results were already partially precomputed, it should not count towards the time estimation. If an exception occurred, finished_normally should be False.

  • exception_msg (str | None) – Exception message (if there was any).

set_max_cpu_ram_gb(value)

Set the maximum RAM usage of the job.

Parameters:

value (float) – Maximum RAM usage in GiB.

Return type:

None

class pytabkit.bench.scheduling.jobs.JobRunner

Bases: object

Helper class that runs an AbstractJob, catches exceptions, measures time and RAM usage, and returns its result.

__init__(job, job_id, assigned_resources)
Parameters:
  • job (AbstractJob) – The job to be run.

  • job_id (int) – An ID that will be returned at the end so that the job can be identified.

  • assigned_resources (NodeResources) – Assigned resources to run the job.
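
The run-catch-measure behaviour described for JobRunner can be sketched with the standard library alone. ToyJobResult and run_and_measure are hypothetical names; the fields loosely follow the JobResult parameters above. Note that tracemalloc only tracks allocations made through Python's allocator, so the peak it reports is an approximation and not necessarily what the real runner measures.

```python
import time
import tracemalloc
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToyJobResult:
    """Hypothetical result record, loosely following the JobResult parameters."""
    job_id: int
    time_s: float
    oom_cpu: bool = False
    finished_normally: bool = True
    exception_msg: Optional[str] = None
    max_cpu_ram_gb: float = 0.0


def run_and_measure(job_id: int, fn: Callable[[], object]) -> ToyJobResult:
    tracemalloc.start()
    start = time.perf_counter()
    oom_cpu, exception_msg = False, None
    try:
        fn()
    except MemoryError as e:
        # Treat a Python MemoryError as a CPU out-of-memory event.
        oom_cpu, exception_msg = True, repr(e)
    except Exception as e:
        exception_msg = repr(e)
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return ToyJobResult(
        job_id=job_id,
        time_s=elapsed,
        oom_cpu=oom_cpu,
        finished_normally=exception_msg is None,
        exception_msg=exception_msg,
        max_cpu_ram_gb=peak_bytes / 2**30,  # bytes to GiB
    )


ok = run_and_measure(1, lambda: [0] * 1000)
bad = run_and_measure(2, lambda: 1 / 0)
```

Catching the exception inside the runner means a failing job still yields a result object, which lets the caller record the failure instead of crashing.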

pytabkit.bench.scheduling.resource_manager module

class pytabkit.bench.scheduling.resource_manager.JobInfo

Bases: object

__init__(job, job_id, start_time=None, assigned_resources=None, job_result=None)
get_status()
Return type:

JobStatus

is_failed()
is_finished()
is_remaining()
is_running()
is_succeed()
set_finished(job_result)
Parameters:

job_result (JobResult)

set_started(assigned_resources)
Parameters:

assigned_resources (NodeResources)

class pytabkit.bench.scheduling.resource_manager.JobStatus

Bases: Enum

An enumeration.

FAILED = 3
REMAINING = 0
RUNNING = 1
SUCCEEDED = 2
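
The four states map naturally onto a job's lifecycle. The sketch below re-declares the enum with the values listed above and adds a hypothetical derive_status helper; how JobInfo.get_status() actually decides between the states is not documented here, so the mapping is an assumption.

```python
from enum import Enum
from typing import Optional


class JobStatus(Enum):
    REMAINING = 0
    RUNNING = 1
    SUCCEEDED = 2
    FAILED = 3


def derive_status(started: bool, succeeded: Optional[bool]) -> JobStatus:
    """Hypothetical mapping; `succeeded` is None while no result exists yet."""
    if succeeded is not None:
        return JobStatus.SUCCEEDED if succeeded else JobStatus.FAILED
    return JobStatus.RUNNING if started else JobStatus.REMAINING


status_not_started = derive_status(started=False, succeeded=None)
status_running = derive_status(started=True, succeeded=None)
status_failed = derive_status(started=True, succeeded=False)
```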

class pytabkit.bench.scheduling.resource_manager.ResourceManager

Bases: object

Keeps track of running jobs and available resources.

__init__(total_resources, fixed_resources)
get_fixed_resources()
get_free_resources()
get_total_resources()
job_finished(job_result)
Parameters:

job_result (JobResult)

Return type:

JobInfo

job_started(job_info)
Parameters:

job_info (JobInfo)

pytabkit.bench.scheduling.resources module

class pytabkit.bench.scheduling.resources.NodeResources

Bases: object

Represents available/used/free resources on a compute node.

__init__(node_id, n_threads, cpu_ram_gb, gpu_usages, gpu_rams_gb, physical_core_usages)
Parameters:
  • node_id (int)

  • n_threads (float)

  • cpu_ram_gb (float)

  • gpu_usages (ndarray)

  • gpu_rams_gb (ndarray)

  • physical_core_usages (ndarray)

get_cpu_ram_gb()
Return type:

float

get_gpu_rams_gb()
Return type:

ndarray

get_gpu_usages()
Return type:

ndarray

get_interface_resources()
Return type:

InterfaceResources

get_n_physical_cores()
Return type:

int

get_n_threads()
Return type:

int

get_physical_core_usages()
Return type:

ndarray

get_resource_vector()
Return type:

ndarray

get_total_gpu_ram_gb()
Return type:

float

get_total_gpu_usage()
Return type:

float

get_used_gpu_ids()
Return type:

ndarray

get_used_physical_cores()
Return type:

ndarray

set_cpu_ram_gb(cpu_ram_gb)
Parameters:

cpu_ram_gb (float)

Return type:

None

set_gpu_rams_gb(gpu_rams_gb)
Parameters:

gpu_rams_gb (ndarray)

Return type:

None

set_n_threads(n_threads)
Parameters:

n_threads (int)

try_assign(required_resources, fixed_resources)
Return type:

NodeResources | None

static zeros_like(node_resources)
Parameters:

node_resources (NodeResources)

Return type:

NodeResources
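
The NodeResources | None return type of try_assign suggests an "assign if it fits, otherwise return nothing" contract. The sketch below shows that contract on a simplified, hypothetical ToyNode record with a first-fit GPU choice; it ignores the fixed_resources argument and physical core tracking, so it should not be read as the library's algorithm.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyNode:
    """Hypothetical, simplified resource record for one node."""
    n_threads: float
    cpu_ram_gb: float
    gpu_rams_gb: List[float]


def try_assign(free: ToyNode, n_threads: float, cpu_ram_gb: float,
               gpu_ram_gb: float) -> Optional[ToyNode]:
    """Return the resources to assign, or None if the request does not fit."""
    if n_threads > free.n_threads or cpu_ram_gb > free.cpu_ram_gb:
        return None
    assigned_gpu = [0.0] * len(free.gpu_rams_gb)
    if gpu_ram_gb > 0:
        # First fit: take the first GPU with enough free memory.
        fitting = [i for i, ram in enumerate(free.gpu_rams_gb) if ram >= gpu_ram_gb]
        if not fitting:
            return None
        assigned_gpu[fitting[0]] = gpu_ram_gb
    return ToyNode(n_threads, cpu_ram_gb, assigned_gpu)


free = ToyNode(n_threads=4, cpu_ram_gb=16.0, gpu_rams_gb=[2.0, 8.0])
assigned = try_assign(free, n_threads=2, cpu_ram_gb=4.0, gpu_ram_gb=4.0)
too_many_threads = try_assign(free, n_threads=8, cpu_ram_gb=4.0, gpu_ram_gb=0.0)
```

Returning None instead of raising keeps the caller's loop simple: a scheduler can try the next candidate job without exception handling.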

class pytabkit.bench.scheduling.resources.SystemResources

Bases: object

System resources, consisting of NodeResources for each node.

__init__(resources)
Parameters:

resources (List[NodeResources])

get_cpu_ram_gb()
get_gpu_ram_gb()
get_gpu_usage()
get_n_threads()
get_num_gpus()
get_resource_vector()

pytabkit.bench.scheduling.schedulers module

class pytabkit.bench.scheduling.schedulers.BaseJobScheduler

Bases: object

Base scheduler class where the logic for selecting which jobs should be run next still has to be implemented. Contains functionality for printing intermediate states and the main loop in run().

__init__(job_manager)
Parameters:

job_manager (RayJobManager)

add_jobs(jobs)
Parameters:

jobs (List[AbstractJob])

run()

class pytabkit.bench.scheduling.schedulers.CustomJobScheduler

Bases: BaseJobScheduler

More complicated scheduler with different heuristics for which jobs to submit first (based on which resources it thinks are scarce, estimated time, which methods have not been run yet, etc.). This scheduler can be slow for a large number of jobs (say 10,000 or more).

class pytabkit.bench.scheduling.schedulers.SimpleJobScheduler

Bases: BaseJobScheduler

Simple scheduler. Submits the jobs with the largest estimated time first. If a job doesn’t fit, jobs with a somewhat smaller estimated time can be submitted instead. In the beginning, the scheduler ensures that at least three jobs from each group are run (e.g. 3x XGB, 3x LGBM, 3x MLP).
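
The selection rule described above can be sketched as a single function. Everything here is hypothetical: pick_next_job, the (group, estimated time) tuple used for jobs, the fits callback, and in particular the time_slack threshold that decides how much smaller a fallback job may be. The description does not state the actual threshold.

```python
from typing import Callable, Dict, List, Optional, Tuple

Job = Tuple[str, float]  # (group name, estimated time in seconds)


def pick_next_job(pending: List[Job],
                  n_run_per_group: Dict[str, int],
                  fits: Callable[[Job], bool],
                  min_per_group: int = 3,
                  time_slack: float = 0.5) -> Optional[Job]:
    """Pick the next job to submit, or None if nothing suitable fits."""
    # Warm-up phase: prefer groups that have not yet had min_per_group jobs run.
    warmup = [j for j in pending if n_run_per_group.get(j[0], 0) < min_per_group]
    candidates = warmup if warmup else pending
    if not candidates:
        return None
    longest = max(t for _, t in candidates)
    # Longest estimated time first; fall back to shorter jobs only within the slack.
    for job in sorted(candidates, key=lambda j: -j[1]):
        if job[1] < time_slack * longest:
            break
        if fits(job):
            return job
    return None


pending = [("XGB", 100.0), ("XGB", 90.0), ("MLP", 10.0)]
all_warm = {"XGB": 3, "MLP": 3}

# The 100 s job does not fit, so the 90 s job is chosen instead.
choice = pick_next_job(pending, all_warm, fits=lambda j: j[1] < 95)
# Only the 10 s job would fit, but it is too short relative to the longest job.
no_choice = pick_next_job(pending, all_warm, fits=lambda j: j[1] < 50)
# MLP has not been run yet, so it is preferred during warm-up.
warmup_choice = pick_next_job(pending, {"XGB": 3, "MLP": 0}, fits=lambda j: True)
```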

pytabkit.bench.scheduling.schedulers.format_date_s(time_s)
Parameters:

time_s (float)

Return type:

str

pytabkit.bench.scheduling.schedulers.format_length_s(duration)
Parameters:

duration (float)

Return type:

str

Module contents