Ray
Install
Install ray on each node:
$ sudo pip install -U "ray[default]"
Run
Mannually
Start a head node:
$ ray start --head --port=6379 --dashboard-host=0.0.0.0
Enable usage stats collection? This prompt will auto-proceed in 10 seconds to avoid blocking cluster startup. Confirm [Y/n]:
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
Local node IP: 10.225.4.51
--------------------
Ray runtime started.
--------------------
Next steps
To add another node to this Ray cluster, run
ray start --address='10.225.4.51:6379'
To connect to this Ray cluster:
import ray
ray.init()
To submit a Ray job using the Ray Jobs CLI:
RAY_API_SERVER_ADDRESS='http://10.225.4.51:8265' ray job submit --working-dir . -- python my_script.py
See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html
for more information on submitting Ray jobs to the Ray cluster.
To terminate the Ray runtime, run
ray stop
To view the status of the cluster, use
ray status
To monitor and debug Ray, view the dashboard at
10.225.4.51:8265
If connection to the dashboard fails, check your firewall settings and network configuration.
Start Worker nodes:
$ ray start --address='10.225.4.51:6379'
Local node IP: 10.225.4.53
--------------------
Ray runtime started.
--------------------
To terminate the Ray runtime, run
ray stop
Check ray cluster status:
======== Autoscaler status: 2026-04-03 15:16:01.279233 ========
Node status
---------------------------------------------------------------
Active:
1 node_bae165fb38933adb4299b4c856363101c36aa14a2953eeeba83737e7
1 node_40a4218cfae734c57821c51ee9927a9d8f3a922840fe4e8b76157a15
1 node_196edc38eeb48c518cb064395c84b7af2a5e56ce723b51be4342d368
1 node_d394e88293678c958f1eea4d97dceb93f7dbb6ab6118be74d243df6a
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Total Usage:
0.0/64.0 CPU
0B/79.43GiB memory
0B/34.04GiB object_store_memory
From request_resources:
(none)
Pending Demands:
(no resource demands)
Show all nodes:
$ ray list nodes
======== List: 2026-04-03 15:19:40.524765 ========
Stats:
------------------------------
Total: 4
Table:
------------------------------
⋮
Stop ray on the head node:
$ ray stop
Stopped all 6 Ray processes.
Stop a worker:
$ ray stop
Stopped all 2 Ray processes.
Note
Must stop workers first before stopping the head.
Using config file
A more practical way is using a cluster config file, like this:
cluster_name: las
provider:
type: local
head_ip: 10.225.4.51
worker_ips:
- 10.225.4.52
- 10.225.4.53
- 10.225.4.54
auth:
ssh_user: ubuntu
ssh_private_key: ~/.ssh/id_ed25519
head_start_ray_commands:
- ray stop
- ray start --head --port=6379 --dashboard-host=0.0.0.0 --autoscaling-config=~/workspace/ray-cluster.yaml
worker_start_ray_commands:
- ray stop
- ray start --address=$RAY_HEAD_IP:6379
Then bootstrap a ray cluster up:
$ ray up ray-cluster.yaml
Cluster: las
Checking Local environment settings
2026-04-03 17:10:58,735 INFO node_provider.py:53 -- ClusterState: Loaded cluster state: ['10.225.4.52', '10.225.4.53', '10.225.4.54', '10.225.4.51']
No head node found. Launching a new cluster. Confirm [y/N]: y
⋮
Useful commands:
To terminate the cluster:
ray down /home/ubuntu/workspace/ray-cluster.yaml
To retrieve the IP address of the cluster head:
ray get-head-ip /home/ubuntu/workspace/ray-cluster.yaml
To port-forward the cluster's Ray Dashboard to the local machine:
ray dashboard /home/ubuntu/workspace/ray-cluster.yaml
To submit a job to the cluster, port-forward the Ray Dashboard in another terminal and run:
ray job submit --address http://localhost:<dashboard-port> --working-dir . -- python my_script.py
To connect to a terminal on the cluster head for debugging:
ray attach /home/ubuntu/workspace/ray-cluster.yaml
To monitor autoscaling:
ray exec /home/ubuntu/workspace/ray-cluster.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
Check status:
$ ray status
======== Autoscaler status: 2026-04-03 17:12:01.809482 ========
Node status
---------------------------------------------------------------
Active:
(no active nodes)
Idle:
3 local.cluster.node
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Total Usage:
0.0/48.0 CPU
0B/59.14GiB memory
0B/25.34GiB object_store_memory
From request_resources:
(none)
Pending Demands:
(no resource demands)
Important
There is a max_workers config, which is default to the number of worker_ips (and cannot be set to a larger number), and a worker is run on the head node, so there is always a host left without any worker running on. Don’t know why it is designed.
Teardown the cluster:
$ ray down ray-cluster.yaml
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Destroying cluster. Confirm [y/N]: y
⋮
Requested 3 nodes to shut down. [interval=1s]
0 nodes remaining after 5 second(s).
No nodes remaining.
Usage
Run a command
If you are on a node of ray cluster (head or worker), you can run ray app directly by:
$ python count_hosts.py
This cluster consists of
3 nodes in total
48.0 CPU resources in total
Tasks executed
20 tasks on las0
14 tasks on las1
14 tasks on las2
The source code of count_hosts.py is public on GitHub.
Submit a job
Or you can run it by sumbit a job:
$ ray job submit --working-dir . -- python count_hosts.py
Job submission server address: http://10.225.4.51:8265
2026-04-03 18:42:22,308 INFO dashboard_sdk.py:359 -- Uploading package gcs://_ray_pkg_1cd54066d1e984c7.zip.
2026-04-03 18:42:22,309 INFO packaging.py:691 -- Creating a file package for local module '.'.
-------------------------------------------------------
Job 'raysubmit_62j7dPtmX6UfzDnx' submitted successfully
-------------------------------------------------------
Next steps
Query the logs of the job:
ray job logs raysubmit_62j7dPtmX6UfzDnx
Query the status of the job:
ray job status raysubmit_62j7dPtmX6UfzDnx
Request the job to be stopped:
ray job stop raysubmit_62j7dPtmX6UfzDnx
Tailing logs until the job exits (disable with --no-wait):
2026-04-03 18:42:22,328 INFO job_manager.py:587 -- Runtime env is setting up.
Running entrypoint for job raysubmit_62j7dPtmX6UfzDnx: python count_hosts.py
2026-04-03 18:42:24,001 INFO worker.py:1669 -- Using address 10.225.4.51:6379 set in the environment variable RAY_ADDRESS
2026-04-03 18:42:24,009 INFO worker.py:1810 -- Connecting to existing Ray cluster at address: 10.225.4.51:6379...
2026-04-03 18:42:24,084 INFO worker.py:2004 -- Connected to Ray cluster. View the dashboard at 10.225.4.51:8265
This cluster consists of
3 nodes in total
48.0 CPU resources in total
Tasks executed
19 tasks on las1
19 tasks on las0
10 tasks on las2
------------------------------------------
Job 'raysubmit_62j7dPtmX6UfzDnx' succeeded
------------------------------------------
Note
In this way, the files under the working directory is packaged and uploaded. So don’t put irrelevant files in it.
Submit a job asynchronously:
$ ray job submit --no-wait --working-dir . -- python count_hosts.py
Job submission server address: http://10.225.4.51:8265
2026-04-03 18:43:05,441 INFO dashboard_sdk.py:411 -- Package gcs://_ray_pkg_1cd54066d1e984c7.zip already exists, skipping upload.
⋮
The package is now cached, so it will not being uploaded again.
Job management
List all jobs:
$ ray list jobs
======== List: 2026-04-03 18:56:24.144025 ========
Stats:
------------------------------
Total: 2
Table:
------------------------------
JOB_ID SUBMISSION_ID ENTRYPOINT TYPE STATUS MESSAGE ERROR_TYPE DRIVER_INFO
0 01000000 python count_hosts.py DRIVER SUCCEEDED id: '01000000'
node_ip_address: 10.225.4.51
pid: '1139146'
1 03000000 raysubmit_2npNAssCnHczsCpv python count_hosts.py SUBMISSION SUCCEEDED Job finished successfully. id: '03000000'
node_ip_address: 10.225.4.51
pid: '1140228'
Delete a job:
$ ray job delete 03000000
Job submission server address: http://10.225.4.51:8265
Job '03000000' deleted successfully
If the job is in RUNNING status, you need to stop it first:
$ ray job stop 04000000
Job submission server address: http://10.225.4.51:8265
Attempting to stop job '04000000'
Waiting for job '04000000' to exit (disable with --no-wait):
Job has not exited yet. Status: RUNNING
Job has not exited yet. Status: RUNNING
Job has not exited yet. Status: RUNNING
Job '04000000' was stopped