API Reference¶

TrainerClient¶

class kubeflow.trainer.TrainerClient(backend_config: KubernetesBackendConfig | LocalProcessBackendConfig | ContainerBackendConfig | None = None)[source]¶

Bases: object

__init__(backend_config: KubernetesBackendConfig | LocalProcessBackendConfig | ContainerBackendConfig | None = None)[source]¶

Initialize a Kubeflow Trainer client.

Parameters:

backend_config (KubernetesBackendConfig | LocalProcessBackendConfig | ContainerBackendConfig | None) – Backend configuration: a KubernetesBackendConfig, LocalProcessBackendConfig, or ContainerBackendConfig instance. If None, a default KubernetesBackendConfig is used.

Raises:

ValueError – Invalid backend configuration.
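As a minimal sketch of choosing a backend, the helper below builds a client from a short backend label. The labels ("kubernetes", "local", "container"), the helper name, and the "ml-team" namespace are hypothetical. The configuration fields used (namespace, cleanup_venv, container_runtime) are the ones listed under Backend Configurations, and the values passed are illustrative only.

```python
def make_client(kind: str = "kubernetes"):
    """Return a TrainerClient for a hypothetical backend label."""
    if kind not in ("kubernetes", "local", "container"):
        raise ValueError(f"unknown backend kind: {kind!r}")

    # Imported lazily so this module still loads where the SDK is not installed.
    from kubeflow.trainer import (
        ContainerBackendConfig,
        KubernetesBackendConfig,
        LocalProcessBackendConfig,
        TrainerClient,
    )

    if kind == "kubernetes":
        # "ml-team" is a placeholder namespace, not a value from this reference.
        config = KubernetesBackendConfig(namespace="ml-team")
    elif kind == "local":
        config = LocalProcessBackendConfig(cleanup_venv=True)
    else:
        config = ContainerBackendConfig(container_runtime="docker")

    # Per the reference, an invalid configuration raises ValueError here.
    return TrainerClient(backend_config=config)
```

Calling `make_client("local")` requires the SDK to be installed. An unknown label fails fast before any import is attempted.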

list_runtimes() → list[Runtime][source]¶

List the available runtimes.

Returns:

A list of available training runtimes. If no runtimes exist, an empty list is returned.

Raises:

TimeoutError – Timeout to list runtimes.
RuntimeError – Failed to list runtimes.

get_runtime(name: str) → Runtime[source]¶

Get the runtime object.

Parameters:

name (str) – Name of the runtime.

Returns:

A runtime object.

Raises:

TimeoutError – Timeout to get a runtime.
RuntimeError – Failed to get a runtime.

get_runtime_packages(runtime: Runtime)[source]¶

Print the installed Python packages for the given runtime. If the runtime has GPUs, it also prints the GPUs available on a single training node.

Parameters:

runtime (Runtime) – Reference to one of the existing runtimes.

Raises:

ValueError – Input arguments are invalid.
RuntimeError – Failed to get Runtime.
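The three runtime calls above can be combined roughly as follows. This is a sketch, not SDK code: the helper names are hypothetical, `client` is expected to be a TrainerClient, and the `name` attribute on Runtime is an assumption that this reference does not document.

```python
def runtime_names(client) -> list[str]:
    """Return a display name for every runtime the client can see."""
    names = []
    for runtime in client.list_runtimes():
        # Assumption: Runtime exposes `name`; fall back to repr() if it does not.
        names.append(getattr(runtime, "name", repr(runtime)))
    return names


def print_runtime_packages(client, name: str) -> None:
    """Look up a runtime by name and print its installed packages."""
    runtime = client.get_runtime(name)
    # get_runtime_packages() prints its report and returns nothing.
    client.get_runtime_packages(runtime)
```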

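train(runtime: str | Runtime | None = None, initializer: Initializer | None = None, trainer: CustomTrainer | CustomTrainerContainer | BuiltinTrainer | None = None, options: list | None = None) → str[source]¶

Create a TrainJob. You can configure the TrainJob using one of these trainers: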

CustomTrainer: Runs training with a user-defined function that fully encapsulates the training process.
CustomTrainerContainer: Runs training with a user-defined image that fully encapsulates the training process.
BuiltinTrainer: Uses a predefined trainer with built-in post-training logic, requiring only parameter configuration.

Parameters:

runtime (str | Runtime | None) – Optional reference to one of the existing runtimes. It can accept the runtime name or Runtime object from the get_runtime() API. Defaults to the torch-distributed runtime if not provided.
initializer (Initializer | None) – Optional configuration for the dataset and model initializers.
trainer (CustomTrainer | CustomTrainerContainer | BuiltinTrainer | None) – Optional configuration for a CustomTrainer, CustomTrainerContainer, or BuiltinTrainer. If not specified, the TrainJob will use the runtime’s default values.
options (list | None) – Optional list of configuration options to apply to the TrainJob. Options can be imported from kubeflow.trainer.options.

Returns:

The unique name of the TrainJob that has been generated.

Raises:

ValueError – Input arguments are invalid.
TimeoutError – Timeout to create TrainJobs.
RuntimeError – Failed to create TrainJobs.
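A hedged sketch of submitting a TrainJob with the parameters described above. It assumes the method is exposed as `train` on the client. The wrapper name `submit_job` is hypothetical, and `client` is expected to be a TrainerClient.

```python
def submit_job(client, trainer=None, runtime="torch-distributed") -> str:
    """Create a TrainJob and return its generated name."""
    # Assumption: the client method is called `train`. `runtime` may be a
    # runtime name or a Runtime object; `trainer` may be None to use the
    # runtime's default values.
    job_name = client.train(runtime=runtime, trainer=trainer)
    print(f"created TrainJob {job_name}")
    return job_name
```

With the SDK installed, a call might look like `submit_job(TrainerClient(), CustomTrainer(func=my_train_fn))`, where `my_train_fn` is your own training function.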

list_jobs(runtime: Runtime | None = None) → list[TrainJob][source]¶

List the created TrainJobs. If a runtime is specified, only TrainJobs associated with that runtime are returned.

Parameters:

runtime (Runtime | None) – Reference to one of the existing runtimes.

Returns:

List of created TrainJobs. If no TrainJobs exist, an empty list is returned.

Raises:

TimeoutError – Timeout to list TrainJobs.
RuntimeError – Failed to list TrainJobs.
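Because list_jobs() takes a Runtime object rather than a name, filtering by runtime name needs a lookup first. A small sketch, in which the helper name is hypothetical and the `name` attribute on TrainJob is an assumption:

```python
def job_names_for_runtime(client, runtime_name: str) -> list[str]:
    """Return the names of TrainJobs associated with one runtime."""
    runtime = client.get_runtime(runtime_name)
    jobs = client.list_jobs(runtime=runtime)
    # Assumption: TrainJob exposes `name`; fall back to repr() otherwise.
    return [getattr(job, "name", repr(job)) for job in jobs]
```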

get_job(name: str) → TrainJob[source]¶

Get the TrainJob object.

Parameters:

name (str) – Name of the TrainJob.

Returns:

A TrainJob object.

Raises:

TimeoutError – Timeout to get a TrainJob.
RuntimeError – Failed to get a TrainJob.

get_job_logs(name: str, step: str = 'node-0', follow: bool | None = False) → Iterator[str][source]¶

Get logs from a specific step of a TrainJob.

You can watch the logs in real time as follows:

```python
from kubeflow.trainer import TrainerClient

for logline in TrainerClient().get_job_logs(name="s8d44aa4fb6d", follow=True):
    print(logline)
```

Parameters:

name (str) – Name of the TrainJob.
step (str) – Step of the TrainJob to collect logs from, like dataset-initializer or node-0.
follow (bool | None) – Whether to stream logs in realtime as they are produced.

Returns:

Iterator of log lines.

Raises:

TimeoutError – Timeout to get a TrainJob.
RuntimeError – Failed to get a TrainJob.
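For logs that have already been produced (follow left at its default of False), the steps can be collected one by one. The sketch below is illustrative: the helper name is hypothetical, and treating a RuntimeError as "no logs for this step" is a design choice made here, not documented SDK behavior.

```python
def collect_logs(client, name: str, steps=("dataset-initializer", "node-0")) -> dict:
    """Return {step: [log lines]} for each requested step of a TrainJob."""
    logs = {}
    for step in steps:
        try:
            # list() drains the iterator, so errors raised lazily are caught too.
            logs[step] = list(client.get_job_logs(name=name, step=step))
        except RuntimeError:
            # Choice made for this sketch: record the step as empty and continue.
            logs[step] = []
    return logs
```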

get_job_events(name: str) → list[Event][source]¶

Get events for a TrainJob.

This provides additional clarity about the state of the TrainJob when logs alone are not sufficient. Events include information about pod state changes, errors, and other significant occurrences.

Parameters:

name (str) – Name of the TrainJob.

Returns:

A list of Event objects associated with the TrainJob.

Raises:

TimeoutError – Timeout to get TrainJob events.
RuntimeError – Failed to get TrainJob events.

wait_for_job_status(name: str, status: set[str] = {'Complete'}, timeout: int = 600, polling_interval: int = 2, callbacks: list[Callable[[TrainJob], None]] | None = None) → TrainJob[source]¶

Wait for a TrainJob to reach a desired status.

Parameters:

name (str) – Name of the TrainJob.
status (set[str]) – Expected statuses. Must be a subset of Created, Running, Complete, and Failed statuses.
timeout (int) – Maximum number of seconds to wait for the TrainJob to reach one of the expected statuses.
polling_interval (int) – The polling interval in seconds to check TrainJob status.
callbacks (list[Callable[[TrainJob], None]] | None) – Optional list of callback functions to be invoked after each polling interval. Each callback should accept a single argument: the TrainJob object.

Returns:

The TrainJob object once it has reached one of the expected statuses.

Raises:

ValueError – The input values are incorrect.
RuntimeError – Failed to get the TrainJob, or the TrainJob reached the Failed status unexpectedly.
TimeoutError – Timeout to wait for TrainJob status.
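One way to use the callbacks parameter is to record status transitions while waiting, and to print the job's events if the wait fails. This is a sketch under assumptions: the helper name is hypothetical, and the `status` attribute on TrainJob is assumed rather than documented here.

```python
def wait_with_progress(client, name: str, timeout: int = 600):
    """Wait for completion; return (job, list of distinct statuses observed)."""
    seen = []

    def record(job) -> None:
        # Assumption: TrainJob exposes `status`; None is recorded if it does not.
        status = getattr(job, "status", None)
        if not seen or seen[-1] != status:
            seen.append(status)
            print(f"{name}: {status}")

    try:
        job = client.wait_for_job_status(
            name=name,
            status={"Complete"},
            timeout=timeout,
            polling_interval=5,
            callbacks=[record],
        )
    except (RuntimeError, TimeoutError):
        # Events may explain a failure when logs alone are not sufficient.
        for event in client.get_job_events(name=name):
            print(event)
        raise
    return job, seen
```

The callback only records a status when it differs from the previous one, so a long Running phase produces a single entry.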

delete_job(name: str)[source]¶

Delete the TrainJob.

Parameters:

name (str) – Name of the TrainJob.

Raises:

TimeoutError – Timeout to delete TrainJob.
RuntimeError – Failed to delete TrainJob.
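To make sure a TrainJob is removed even when later steps fail, delete_job() can be wrapped in a context manager. A minimal sketch, in which `job_cleanup` is a hypothetical helper and not part of the SDK:

```python
from contextlib import contextmanager


@contextmanager
def job_cleanup(client, name: str):
    """Yield the job name and delete the TrainJob when the block exits."""
    try:
        yield name
    finally:
        # Runs on normal exit and on exceptions raised inside the block.
        client.delete_job(name=name)
```

Typical use is `with job_cleanup(client, job_name): ...`, with waiting and log collection placed inside the block.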

Trainers¶

class kubeflow.trainer.CustomTrainer(func: Callable, func_args: dict | None = None, image: str | None = None, packages_to_install: list[str] | None = None, pip_index_urls: list[str] = <factory>, num_nodes: int | None = None, resources_per_node: dict | None = None, env: dict[str, str] | None = None) → None[source]¶

Bases: object

Custom Trainer configuration. Configure the self-contained function that encapsulates the entire model training process.

Parameters:

func (Callable) – The function that encapsulates the entire model training process.
func_args (Optional[dict]) – The arguments to pass to the function.
image (Optional[str]) – The optional container image to use in TrainJob.
packages_to_install (Optional[list[str]]) – A list of Python packages to install before running the function.
pip_index_urls (list[str]) – The PyPI URLs from which to install Python packages. The first URL will be the index-url, and remaining ones are extra-index-urls.
num_nodes (Optional[int]) – The number of nodes to use for training.
resources_per_node (Optional[dict]) – The computing resources to allocate per node, for example `resources_per_node = {"gpu": 4, "cpu": 5, "memory": "10G"}`. If your compute supports fractional GPUs (e.g. multi-instance GPU), you can request a slice instead; for example, `resources_per_node = {"mig-1g.5gb": 1}` requests one GPU slice with 5 GB of memory.
env (Optional[dict[str, str]]) – The environment variables to set in the training nodes.
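The ordering rule for pip_index_urls can be illustrated with pip's own `--index-url` and `--extra-index-url` flags. The function below only illustrates that mapping. It is not the SDK's implementation, and its name is hypothetical.

```python
def pip_index_args(pip_index_urls: list[str]) -> list[str]:
    """Map an ordered URL list to pip index flags."""
    if not pip_index_urls:
        return []
    # The first URL becomes the primary index.
    args = ["--index-url", pip_index_urls[0]]
    # Every remaining URL becomes an extra index, in order.
    for url in pip_index_urls[1:]:
        args += ["--extra-index-url", url]
    return args
```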

func: Callable¶

func_args: dict | None = None¶

image: str | None = None¶

packages_to_install: list[str] | None = None¶

pip_index_urls: list[str]¶

num_nodes: int | None = None¶

resources_per_node: dict | None = None¶

env: dict[str, str] | None = None¶
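A sketch of assembling a CustomTrainer from the fields above. The training function, the package name, and the environment variable are placeholders. How func_args reaches the function (keyword unpacking or a single dict) is not stated in this reference, so the function signature below is an assumption.

```python
def train_fn(learning_rate: float = 1e-3, epochs: int = 1) -> None:
    """Placeholder training loop; a real one would build and fit a model."""
    for epoch in range(epochs):
        print(f"epoch {epoch}: lr={learning_rate}")


# Keyword arguments matching the documented CustomTrainer fields.
trainer_kwargs = {
    "func": train_fn,
    # Assumption: these keys match train_fn's parameter names.
    "func_args": {"learning_rate": 1e-3, "epochs": 2},
    "packages_to_install": ["numpy"],  # placeholder package
    "num_nodes": 2,
    "resources_per_node": {"gpu": 4, "cpu": 5, "memory": "10G"},
    "env": {"LOG_LEVEL": "info"},  # placeholder variable
}


def build_trainer():
    """Construct the trainer; requires the Kubeflow SDK to be installed."""
    from kubeflow.trainer import CustomTrainer

    return CustomTrainer(**trainer_kwargs)
```

Keeping the arguments in a plain dictionary lets you inspect or test the configuration without importing the SDK.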

class kubeflow.trainer.CustomTrainerContainer(image: str, num_nodes: int | None = None, resources_per_node: dict | None = None, env: dict[str, str] | None = None) → None[source]¶

Bases: object

Custom Trainer Container configuration. Configure the container image that encapsulates the entire model training process.

Parameters:

image (str) – The container image that encapsulates the entire model training process.
num_nodes (Optional[int]) – The number of nodes to use for training.
resources_per_node (Optional[dict]) – The computing resources to allocate per node, for example `resources_per_node = {"gpu": 4, "cpu": 5, "memory": "10G"}`. If your compute supports fractional GPUs (e.g. multi-instance GPU), you can request a slice instead; for example, `resources_per_node = {"mig-1g.5gb": 1}` requests one GPU slice with 5 GB of memory.
env (Optional[dict[str, str]]) – The environment variables to set in the training nodes.

image: str¶

num_nodes: int | None = None¶

resources_per_node: dict | None = None¶

env: dict[str, str] | None = None¶

class kubeflow.trainer.BuiltinTrainer(config: TorchTuneConfig) → None[source]¶

Bases: object

Builtin Trainer configuration. Configure the builtin trainer that already includes the fine-tuning logic, requiring only parameter adjustments.

Parameters:

config (TorchTuneConfig) – The configuration for the builtin trainer.

config: TorchTuneConfig¶

Backend Configurations¶

class kubeflow.trainer.KubernetesBackendConfig(**data: Any) → None[source]¶

Bases: BaseModel

namespace: str | None¶

config_file: str | None¶

context: str | None¶

client_configuration: Configuration | None¶

class Config[source]¶

Bases: object

arbitrary_types_allowed = True¶

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True}¶

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class kubeflow.trainer.LocalProcessBackendConfig(**data: Any) → None[source]¶

Bases: BaseModel

cleanup_venv: bool¶

model_config: ClassVar[ConfigDict] = {}¶

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class kubeflow.trainer.ContainerBackendConfig(**data: Any) → None[source]¶

Bases: BaseModel

pull_policy: str¶

auto_remove: bool¶

container_host: str | None¶

container_runtime: Literal['docker', 'podman'] | None¶

runtime_source: TrainingRuntimeSource¶

dataset_initializer_image: str¶

model_initializer_image: str¶

initializer_timeout: int¶

model_config: ClassVar[ConfigDict] = {}¶

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.