API Reference
TrainerClient
- class kubeflow.trainer.TrainerClient(backend_config: KubernetesBackendConfig | LocalProcessBackendConfig | ContainerBackendConfig | None = None)
Bases: object
- __init__(backend_config: KubernetesBackendConfig | LocalProcessBackendConfig | ContainerBackendConfig | None = None)
Initialize a Kubeflow Trainer client.
- Parameters:
backend_config (KubernetesBackendConfig | LocalProcessBackendConfig | ContainerBackendConfig | None) – Backend configuration: one of KubernetesBackendConfig, LocalProcessBackendConfig, or ContainerBackendConfig, or None to use the default backend configuration. Defaults to KubernetesBackendConfig.
- Raises:
ValueError – Invalid backend configuration.
- list_runtimes() → list[Runtime]
List the available runtimes.
- Returns:
A list of available training runtimes. If no runtimes exist, an empty list is returned.
- Raises:
TimeoutError – Timeout to list runtimes.
RuntimeError – Failed to list runtimes.
- get_runtime(name: str) → Runtime
Get the runtime object.
- Parameters:
name (str) – Name of the runtime.
- Returns:
A runtime object.
- Raises:
TimeoutError – Timeout to get a runtime.
RuntimeError – Failed to get a runtime.
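It can be useful to check that a runtime is listed before referring to it by name. The following is a minimal sketch, not part of the SDK: `find_runtime` is a hypothetical helper built only on `list_runtimes()`, and it assumes each `Runtime` exposes a `name` attribute. `_StubClient` and `_StubRuntime` are stand-ins so the snippet runs without a cluster; in practice you would pass a `TrainerClient()` instance.

```python
from dataclasses import dataclass


def find_runtime(client, name: str):
    """Return the listed runtime called `name`, or None if it is not found.

    Assumes each runtime object has a `name` attribute.
    """
    try:
        runtimes = client.list_runtimes()
    except (TimeoutError, RuntimeError) as exc:
        # Both exceptions are documented for list_runtimes().
        print(f"could not list runtimes: {exc}")
        return None
    return next((rt for rt in runtimes if rt.name == name), None)


# --- Stand-ins for illustration only; not part of the SDK. ---
@dataclass
class _StubRuntime:
    name: str


class _StubClient:
    def list_runtimes(self):
        return [_StubRuntime("torch-distributed"), _StubRuntime("example-runtime")]


found = find_runtime(_StubClient(), "torch-distributed")
missing = find_runtime(_StubClient(), "does-not-exist")
```

Returning None instead of raising keeps the "runtime not listed" case separate from the documented failure modes of `get_runtime()`.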
- get_runtime_packages(runtime: Runtime)
Print the installed Python packages for the given runtime. If the runtime has GPUs, it also prints the available GPUs on the single training node.
- Parameters:
runtime (Runtime) – Reference to one of the existing runtimes.
- Raises:
ValueError – Input arguments are invalid.
RuntimeError – Failed to get Runtime.
- train(runtime: str | Runtime | None = None, initializer: Initializer | None = None, trainer: CustomTrainer | CustomTrainerContainer | BuiltinTrainer | None = None, options: list | None = None) → str
Create a TrainJob. You can configure the TrainJob using one of these trainers:
- CustomTrainer: Runs training with a user-defined function that fully encapsulates the training process.
- CustomTrainerContainer: Runs training with a user-defined image that fully encapsulates the training process.
- BuiltinTrainer: Uses a predefined trainer with built-in post-training logic, requiring only parameter configuration.
- Parameters:
runtime (str | Runtime | None) – Optional reference to one of the existing runtimes. Accepts either the runtime name or a Runtime object returned by get_runtime(). Defaults to the torch-distributed runtime if not provided.
initializer (Initializer | None) – Optional configuration for the dataset and model initializers.
trainer (CustomTrainer | CustomTrainerContainer | BuiltinTrainer | None) – Optional configuration for a CustomTrainer, CustomTrainerContainer, or BuiltinTrainer. If not specified, the TrainJob will use the runtime’s default values.
options (list | None) – Optional list of configuration options to apply to the TrainJob. Options can be imported from kubeflow.trainer.options.
- Returns:
The unique name of the TrainJob that has been generated.
- Raises:
ValueError – Input arguments are invalid.
TimeoutError – Timeout to create TrainJobs.
RuntimeError – Failed to create TrainJobs.
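As an illustration of the CustomTrainer path, here is a hedged sketch of submitting a self-contained training function. The toy `train_fn`, the node count, and the resource values are assumptions chosen for the example, and `submit` is a hypothetical wrapper, not an SDK function. The SDK import is done inside `submit` so the snippet can be loaded and the training function tried locally without the SDK or a cluster.

```python
def train_fn() -> float:
    """Toy training loop: fit w in y = w * x by gradient descent.

    The function is self-contained (no references to outer-scope names),
    as the CustomTrainer description requires.
    """
    xs = [1.0, 2.0, 3.0]
    ys = [2.0, 4.0, 6.0]
    w = 0.0
    for _ in range(50):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= 0.1 * grad
    print(f"fitted w={w:.3f}")
    return w


def submit() -> str:
    """Create a TrainJob from train_fn and return its generated name."""
    # Imported lazily so this file loads without the SDK installed.
    from kubeflow.trainer import CustomTrainer, TrainerClient

    client = TrainerClient()
    return client.train(
        # runtime is omitted, so the documented default runtime is used.
        trainer=CustomTrainer(
            func=train_fn,
            num_nodes=2,  # illustrative value
            resources_per_node={"cpu": 2, "memory": "4G"},  # illustrative values
        ),
    )


# Sanity-check the training function locally before submitting it.
local_result = train_fn()
```

Calling `submit()` against a configured backend would return the TrainJob name, which can then be passed to `get_job()`, `get_job_logs()`, or `wait_for_job_status()`.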
- list_jobs(runtime: Runtime | None = None) → list[TrainJob]
List the created TrainJobs. If a runtime is specified, only TrainJobs associated with that runtime are returned.
- Parameters:
runtime (Runtime | None) – Reference to one of the existing runtimes.
- Returns:
List of created TrainJobs. If no TrainJobs exist, an empty list is returned.
- Raises:
TimeoutError – Timeout to list TrainJobs.
RuntimeError – Failed to list TrainJobs.
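The runtime filter takes a Runtime object rather than a name, so a name has to be resolved with `get_runtime()` first. The sketch below shows one way to do that; `job_names` is a hypothetical helper, and it assumes each `TrainJob` exposes a `name` attribute. The stub classes are stand-ins so the snippet runs without a cluster.

```python
from dataclasses import dataclass


def job_names(client, runtime_name=None):
    """Return sorted TrainJob names, optionally limited to one runtime.

    Assumes each TrainJob object has a `name` attribute.
    """
    # list_jobs() expects a Runtime object, so resolve the name first.
    runtime = client.get_runtime(runtime_name) if runtime_name else None
    return sorted(job.name for job in client.list_jobs(runtime=runtime))


# --- Stand-ins for illustration only; not part of the SDK. ---
@dataclass
class _StubRuntime:
    name: str


@dataclass
class _StubJob:
    name: str
    runtime_name: str


class _StubClient:
    _jobs = [
        _StubJob("job-c", "torch-distributed"),
        _StubJob("job-a", "torch-distributed"),
        _StubJob("job-b", "example-runtime"),
    ]

    def get_runtime(self, name):
        return _StubRuntime(name)

    def list_jobs(self, runtime=None):
        if runtime is None:
            return list(self._jobs)
        return [job for job in self._jobs if job.runtime_name == runtime.name]


all_jobs = job_names(_StubClient())
torch_jobs = job_names(_StubClient(), "torch-distributed")
```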
- get_job(name: str) → TrainJob
Get the TrainJob object.
- Parameters:
name (str) – Name of the TrainJob.
- Returns:
A TrainJob object.
- Raises:
TimeoutError – Timeout to get a TrainJob.
RuntimeError – Failed to get a TrainJob.
- get_job_logs(name: str, step: str = 'node-0', follow: bool | None = False) → Iterator[str]
Get logs from a specific step of a TrainJob.
You can watch the logs in real time as follows:
```python
from kubeflow.trainer import TrainerClient

for logline in TrainerClient().get_job_logs(name="s8d44aa4fb6d", follow=True):
    print(logline)
```
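- Parameters:
name (str) – Name of the TrainJob.
step (str) – Step of the TrainJob to get logs from. Defaults to 'node-0'.
follow (bool | None) – Whether to stream the logs in real time. Defaults to False.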
- Returns:
Iterator of log lines.
- Raises:
TimeoutError – Timeout to get a TrainJob.
RuntimeError – Failed to get a TrainJob.
- get_job_events(name: str) → list[Event]
Get events for a TrainJob.
This provides additional clarity about the state of the TrainJob when logs alone are not sufficient. Events include information about pod state changes, errors, and other significant occurrences.
- Parameters:
name (str) – Name of the TrainJob.
- Returns:
A list of Event objects associated with the TrainJob.
- Raises:
TimeoutError – Timeout to get TrainJob events.
RuntimeError – Failed to get TrainJob events.
- wait_for_job_status(name: str, status: set[str] = {'Complete'}, timeout: int = 600, polling_interval: int = 2, callbacks: list[Callable[[TrainJob], None]] | None = None) → TrainJob
Wait for a TrainJob to reach a desired status.
- Parameters:
name (str) – Name of the TrainJob.
status (set[str]) – Expected statuses. Must be a subset of Created, Running, Complete, and Failed statuses.
timeout (int) – Maximum number of seconds to wait for the TrainJob to reach one of the expected statuses.
polling_interval (int) – The polling interval in seconds to check TrainJob status.
callbacks (list[Callable[[TrainJob], None]] | None) – Optional list of callback functions to be invoked after each polling interval. Each callback should accept a single argument: the TrainJob object.
- Returns:
A TrainJob object that reaches the desired status.
- Raises:
ValueError – The input values are incorrect.
RuntimeError – Failed to get TrainJob or TrainJob reaches unexpected Failed status.
TimeoutError – Timeout to wait for TrainJob status.
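The callbacks and the documented exceptions can be combined into a small wrapper. This is a sketch under assumptions: `wait_until_done` is a hypothetical helper, `job.status` is an assumed attribute of TrainJob, and the "TimedOut" label is this example's own marker rather than a TrainJob status. `_StubClient` replays a scripted status sequence so the snippet runs without a cluster.

```python
from dataclasses import dataclass


def wait_until_done(client, job_name: str, timeout: int = 600):
    """Wait for completion; return (outcome, distinct statuses observed)."""
    seen = []

    def record_status(job):
        # Called after each poll; `job.status` is an assumed attribute.
        if not seen or seen[-1] != job.status:
            seen.append(job.status)

    try:
        job = client.wait_for_job_status(
            name=job_name,
            status={"Complete"},
            timeout=timeout,
            polling_interval=2,
            callbacks=[record_status],
        )
        return job.status, seen
    except TimeoutError:
        return "TimedOut", seen  # example-only label, not a TrainJob status
    except RuntimeError:
        # Documented for lookup failures and for an unexpected Failed status.
        return "Failed", seen


# --- Stand-ins for illustration only; not part of the SDK. ---
@dataclass
class _StubJob:
    name: str
    status: str


class _StubClient:
    def __init__(self, statuses):
        self._statuses = statuses

    def wait_for_job_status(self, name, status, timeout, polling_interval, callbacks=None):
        for current in self._statuses:
            job = _StubJob(name, current)
            for callback in callbacks or []:
                callback(job)
            if current in status:
                return job
            if current == "Failed":
                raise RuntimeError(f"TrainJob {name} failed")
        raise TimeoutError(f"timed out waiting for {name}")


ok = wait_until_done(_StubClient(["Created", "Running", "Running", "Complete"]), "job-a")
failed = wait_until_done(_StubClient(["Created", "Failed"]), "job-b")
stuck = wait_until_done(_StubClient(["Created", "Running"]), "job-c")
```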
- delete_job(name: str)
Delete the TrainJob.
- Parameters:
name (str) – Name of the TrainJob.
- Raises:
TimeoutError – Timeout to delete TrainJob.
RuntimeError – Failed to delete TrainJob.
Trainers
- class kubeflow.trainer.CustomTrainer(func: Callable, func_args: dict | None = None, image: str | None = None, packages_to_install: list[str] | None = None, pip_index_urls: list[str] = <factory>, num_nodes: int | None = None, resources_per_node: dict | None = None, env: dict[str, str] | None = None) → None
Bases: object
Custom Trainer configuration. Configure the self-contained function that encapsulates the entire model training process.
- Parameters:
func (Callable) – The function that encapsulates the entire model training process.
func_args (Optional[dict]) – The arguments to pass to the function.
image (Optional[str]) – The optional container image to use in TrainJob.
packages_to_install (Optional[list[str]]) – A list of Python packages to install before running the function.
pip_index_urls (list[str]) – The PyPI URLs from which to install Python packages. The first URL will be the index-url, and remaining ones are extra-index-urls.
num_nodes (Optional[int]) – The number of nodes to use for training.
resources_per_node (Optional[dict]) – The computing resources to allocate per node.
```python
resources_per_node = {"gpu": 4, "cpu": 5, "memory": "10G"}
```
If your compute supports fractional GPUs (e.g. multi-instance GPU), you can set the resources as follows (request 1 GPU slice of 5 GB):
```python
resources_per_node = {"mig-1g.5gb": 1}
```
env (Optional[dict[str, str]]) – The environment variables to set in the training nodes.
- class kubeflow.trainer.CustomTrainerContainer(image: str, num_nodes: int | None = None, resources_per_node: dict | None = None, env: dict[str, str] | None = None) → None
Bases: object
Custom Trainer Container configuration. Configure the container image that encapsulates the entire model training process.
- Parameters:
image (str) – The container image that encapsulates the entire model training process.
num_nodes (Optional[int]) – The number of nodes to use for training.
resources_per_node (Optional[dict]) – The computing resources to allocate per node.
```python
resources_per_node = {"gpu": 4, "cpu": 5, "memory": "10G"}
```
If your compute supports fractional GPUs (e.g. multi-instance GPU), you can set the resources as follows (request 1 GPU slice of 5 GB):
```python
resources_per_node = {"mig-1g.5gb": 1}
```
env (Optional[dict[str, str]]) – The environment variables to set in the training nodes.
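The parameters above can be assembled as plain keyword arguments before constructing the trainer. The sketch below is illustrative only: `container_trainer_kwargs` is a hypothetical helper, and the image name and environment values are made up. It builds the arguments as a dict so it runs without the SDK installed.

```python
def container_trainer_kwargs(image: str, num_nodes: int = 1, gpus: int = 0, env=None) -> dict:
    """Build keyword arguments matching the CustomTrainerContainer signature."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    kwargs = {"image": image, "num_nodes": num_nodes}
    if gpus:
        # Only request GPUs when asked, so CPU-only runs stay minimal.
        kwargs["resources_per_node"] = {"gpu": gpus}
    if env:
        kwargs["env"] = dict(env)
    return kwargs


kwargs = container_trainer_kwargs(
    "example.com/team/train:latest",  # hypothetical image
    num_nodes=2,
    gpus=4,
    env={"LOG_LEVEL": "info"},
)
# With the SDK installed, this could be passed on as:
#   from kubeflow.trainer import CustomTrainerContainer
#   trainer = CustomTrainerContainer(**kwargs)
```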
- class kubeflow.trainer.BuiltinTrainer(config: TorchTuneConfig) → None
Bases: object
Builtin Trainer configuration. Configure the builtin trainer that already includes the fine-tuning logic, requiring only parameter adjustments.
- Parameters:
config (TorchTuneConfig) – The configuration for the builtin trainer.
- config: TorchTuneConfig