Ray

Install

Install Ray on each node:

$ sudo pip install -U "ray[default]"

Run

Manually

Start a head node:

$ ray start --head --port=6379 --dashboard-host=0.0.0.0
Enable usage stats collection? This prompt will auto-proceed in 10 seconds to avoid blocking cluster startup. Confirm [Y/n]: 
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 10.225.4.51

--------------------
Ray runtime started.
--------------------

Next steps
  To add another node to this Ray cluster, run
    ray start --address='10.225.4.51:6379'
  
  To connect to this Ray cluster:
    import ray
    ray.init()
  
  To submit a Ray job using the Ray Jobs CLI:
    RAY_API_SERVER_ADDRESS='http://10.225.4.51:8265' ray job submit --working-dir . -- python my_script.py
  
  See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html 
  for more information on submitting Ray jobs to the Ray cluster.
  
  To terminate the Ray runtime, run
    ray stop
  
  To view the status of the cluster, use
    ray status
  
  To monitor and debug Ray, view the dashboard at 
    10.225.4.51:8265
  
  If connection to the dashboard fails, check your firewall settings and network configuration.

Start the worker nodes:

$ ray start --address='10.225.4.51:6379'
Local node IP: 10.225.4.53

--------------------
Ray runtime started.
--------------------

To terminate the Ray runtime, run
  ray stop

Check the Ray cluster status:

$ ray status
======== Autoscaler status: 2026-04-03 15:16:01.279233 ========
Node status
---------------------------------------------------------------
Active:
 1 node_bae165fb38933adb4299b4c856363101c36aa14a2953eeeba83737e7
 1 node_40a4218cfae734c57821c51ee9927a9d8f3a922840fe4e8b76157a15
 1 node_196edc38eeb48c518cb064395c84b7af2a5e56ce723b51be4342d368
 1 node_d394e88293678c958f1eea4d97dceb93f7dbb6ab6118be74d243df6a
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0.0/64.0 CPU
 0B/79.43GiB memory
 0B/34.04GiB object_store_memory

From request_resources:
 (none)
Pending Demands:
 (no resource demands)
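
The "Total Usage" block above can be read by a script, for example to check that all expected CPUs have joined. This is a minimal sketch that assumes the text layout shown above; parse_total_usage is a hypothetical helper, not part of Ray, and the `ray status` output format is not a stable interface.

```python
import re


def parse_total_usage(status_text):
    """Return {resource: (used, total)} from the 'Total Usage' block.

    Assumes the layout printed by `ray status` in the transcript above;
    this helper is illustrative and not part of Ray.
    """
    usage = {}
    in_block = False
    for line in status_text.splitlines():
        stripped = line.strip()
        if stripped == "Total Usage:":
            in_block = True
            continue
        if in_block:
            # Each entry looks like "<used>/<total> <resource name>".
            match = re.match(r"^(\S+)/(\S+)\s+(\S+)$", stripped)
            if match is None:
                break  # a blank line or the next heading ends the block
            used, total, name = match.groups()
            usage[name] = (used, total)
    return usage


# Excerpt copied from the transcript above.
SAMPLE = """Resources
---------------------------------------------------------------
Total Usage:
 0.0/64.0 CPU
 0B/79.43GiB memory
 0B/34.04GiB object_store_memory

From request_resources:
 (none)
"""

usage = parse_total_usage(SAMPLE)
print(usage["CPU"])
```

The values are kept as strings because memory figures carry units such as GiB.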

Show all nodes:

$ ray list nodes

======== List: 2026-04-03 15:19:40.524765 ========
Stats:
------------------------------
Total: 4

Table:
------------------------------
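
The same node information is available from Python through ray.nodes(). A minimal sketch, assuming it runs on a node of the cluster; summarize_nodes is a hypothetical helper, and only the "Alive", "NodeManagerAddress" and "Resources" keys of each node entry are used.

```python
def summarize_nodes(nodes):
    """Count alive nodes and sum their CPUs from ray.nodes()-style dicts."""
    alive = [n for n in nodes if n.get("Alive")]
    total_cpus = sum(n.get("Resources", {}).get("CPU", 0) for n in alive)
    addresses = sorted(n["NodeManagerAddress"] for n in alive)
    return len(alive), total_cpus, addresses


def main():
    # Imported here so the helper above can be used without Ray installed.
    import ray

    ray.init(address="auto")  # attach to the cluster already running here
    count, cpus, addresses = summarize_nodes(ray.nodes())
    print(f"{count} alive nodes, {cpus} CPUs: {', '.join(addresses)}")
```

Call main() on the head node or on any worker.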

Stop ray on the head node:

$ ray stop
Stopped all 6 Ray processes.

Stop a worker:

$ ray stop
Stopped all 2 Ray processes.

Note

Stop the workers before stopping the head node.

Using config file

A more practical way is to use a cluster config file, like this:

cluster_name: las

provider:
  type: local
  head_ip: 10.225.4.51
  worker_ips:
    - 10.225.4.52
    - 10.225.4.53
    - 10.225.4.54

auth:
  ssh_user: ubuntu
  ssh_private_key: ~/.ssh/id_ed25519

head_start_ray_commands:
  - ray stop
  - ray start --head --port=6379 --dashboard-host=0.0.0.0 --autoscaling-config=~/workspace/ray-cluster.yaml

worker_start_ray_commands:
  - ray stop
  - ray start --address=$RAY_HEAD_IP:6379
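
If the host list changes often, the file can be generated instead of edited by hand. The following is only a sketch: render_cluster_config and its parameters are hypothetical names, and it simply reproduces the layout of the config above with plain string formatting.

```python
def render_cluster_config(cluster_name, head_ip, worker_ips, ssh_user,
                          ssh_key, config_path, port=6379):
    """Render a local-provider cluster config with the layout shown above.

    Hypothetical helper for illustration; it does no validation.
    """
    worker_lines = "\n".join(f"    - {ip}" for ip in worker_ips)
    return f"""cluster_name: {cluster_name}

provider:
  type: local
  head_ip: {head_ip}
  worker_ips:
{worker_lines}

auth:
  ssh_user: {ssh_user}
  ssh_private_key: {ssh_key}

head_start_ray_commands:
  - ray stop
  - ray start --head --port={port} --dashboard-host=0.0.0.0 --autoscaling-config={config_path}

worker_start_ray_commands:
  - ray stop
  - ray start --address=$RAY_HEAD_IP:{port}
"""


# Values taken from the config above.
text = render_cluster_config(
    cluster_name="las",
    head_ip="10.225.4.51",
    worker_ips=["10.225.4.52", "10.225.4.53", "10.225.4.54"],
    ssh_user="ubuntu",
    ssh_key="~/.ssh/id_ed25519",
    config_path="~/workspace/ray-cluster.yaml",
)
print(text)
```

Redirect the printed text to ray-cluster.yaml to get the same file as above.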

Then bring the Ray cluster up:

$ ray up ray-cluster.yaml
Cluster: las

Checking Local environment settings
2026-04-03 17:10:58,735 INFO node_provider.py:53 -- ClusterState: Loaded cluster state: ['10.225.4.52', '10.225.4.53', '10.225.4.54', '10.225.4.51']
No head node found. Launching a new cluster. Confirm [y/N]: y

Useful commands:
  To terminate the cluster:
    ray down /home/ubuntu/workspace/ray-cluster.yaml
  
  To retrieve the IP address of the cluster head:
    ray get-head-ip /home/ubuntu/workspace/ray-cluster.yaml
  
  To port-forward the cluster's Ray Dashboard to the local machine:
    ray dashboard /home/ubuntu/workspace/ray-cluster.yaml
  
  To submit a job to the cluster, port-forward the Ray Dashboard in another terminal and run:
    ray job submit --address http://localhost:<dashboard-port> --working-dir . -- python my_script.py
  
  To connect to a terminal on the cluster head for debugging:
    ray attach /home/ubuntu/workspace/ray-cluster.yaml
  
  To monitor autoscaling:
    ray exec /home/ubuntu/workspace/ray-cluster.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'

Check status:

$ ray status
======== Autoscaler status: 2026-04-03 17:12:01.809482 ========
Node status
---------------------------------------------------------------
Active:
 (no active nodes)
Idle:
 3 local.cluster.node
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0.0/48.0 CPU
 0B/59.14GiB memory
 0B/25.34GiB object_store_memory

From request_resources:
 (none)
Pending Demands:
 (no resource demands)

Important

The max_workers setting defaults to the number of worker_ips and cannot be set to a larger number. Because a worker also runs on the head node, one host is always left without a worker running on it. I don’t know why it is designed this way.

Tear down the cluster:

$ ray down ray-cluster.yaml 
Loaded cached provider configuration
If you experience issues with the cloud provider, try re-running the command with --no-config-cache.
Destroying cluster. Confirm [y/N]: y

Requested 3 nodes to shut down. [interval=1s]
0 nodes remaining after 5 second(s).
No nodes remaining.

Usage

Run a command

If you are on a node of the Ray cluster (head or worker), you can run a Ray application directly:

$ python count_hosts.py
This cluster consists of
    3 nodes in total
    48.0 CPU resources in total

Tasks executed
    20 tasks on las0
    14 tasks on las1
    14 tasks on las2

The source code of count_hosts.py is public on GitHub.
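
For readers without the repository at hand, a script producing this kind of output could look roughly like the sketch below. It is not the author's actual code: the function names are made up, and the task count of 48 is an assumption based on the totals in the output above.

```python
import socket
import time
from collections import Counter


def format_report(num_nodes, num_cpus, hostnames):
    """Build a report in the shape of the output shown above."""
    lines = [
        "This cluster consists of",
        f"    {num_nodes} nodes in total",
        f"    {num_cpus} CPU resources in total",
        "",
        "Tasks executed",
    ]
    for host, count in Counter(hostnames).most_common():
        lines.append(f"    {count} tasks on {host}")
    return "\n".join(lines)


def main(num_tasks=48):
    # Imported here so format_report can be used without Ray installed.
    import ray

    ray.init()  # connects to the cluster running on this node

    @ray.remote
    def where_am_i():
        time.sleep(0.01)  # keep the task busy briefly so work spreads out
        return socket.gethostname()

    hostnames = ray.get([where_am_i.remote() for _ in range(num_tasks)])
    print(format_report(len(ray.nodes()),
                        ray.cluster_resources()["CPU"],
                        hostnames))
```

Call main() at the end of the file to run it as a script. The split between hosts varies from run to run, as the two outputs in this section show.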

Submit a job

Alternatively, submit it as a job:

$ ray job submit --working-dir . -- python count_hosts.py
Job submission server address: http://10.225.4.51:8265
2026-04-03 18:42:22,308 INFO dashboard_sdk.py:359 -- Uploading package gcs://_ray_pkg_1cd54066d1e984c7.zip.
2026-04-03 18:42:22,309 INFO packaging.py:691 -- Creating a file package for local module '.'.

-------------------------------------------------------
Job 'raysubmit_62j7dPtmX6UfzDnx' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_62j7dPtmX6UfzDnx
  Query the status of the job:
    ray job status raysubmit_62j7dPtmX6UfzDnx
  Request the job to be stopped:
    ray job stop raysubmit_62j7dPtmX6UfzDnx

Tailing logs until the job exits (disable with --no-wait):
2026-04-03 18:42:22,328 INFO job_manager.py:587 -- Runtime env is setting up.
Running entrypoint for job raysubmit_62j7dPtmX6UfzDnx: python count_hosts.py
2026-04-03 18:42:24,001 INFO worker.py:1669 -- Using address 10.225.4.51:6379 set in the environment variable RAY_ADDRESS
2026-04-03 18:42:24,009 INFO worker.py:1810 -- Connecting to existing Ray cluster at address: 10.225.4.51:6379...
2026-04-03 18:42:24,084 INFO worker.py:2004 -- Connected to Ray cluster. View the dashboard at 10.225.4.51:8265 
This cluster consists of
    3 nodes in total
    48.0 CPU resources in total

Tasks executed
    19 tasks on las1
    19 tasks on las0
    10 tasks on las2

------------------------------------------
Job 'raysubmit_62j7dPtmX6UfzDnx' succeeded
------------------------------------------

Note

With this approach, the files under the working directory are packaged and uploaded, so don’t put irrelevant files in it.
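
Before submitting, it can help to see how much the working directory holds. A small sketch using only the standard library; working_dir_summary is a hypothetical helper that counts every regular file and does not model any exclusion rules Ray may apply when packaging.

```python
import os


def working_dir_summary(root):
    """Return (file_count, total_bytes, (largest_path, largest_size))."""
    file_count = 0
    total_bytes = 0
    largest = ("", 0)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip broken symlinks and special files
            size = os.path.getsize(path)
            file_count += 1
            total_bytes += size
            if size > largest[1]:
                largest = (os.path.relpath(path, root), size)
    return file_count, total_bytes, largest


# Example: count, size, largest = working_dir_summary(".")
```

A large total, or an unexpected largest file, is a hint to move the script into a cleaner directory first.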

Submit a job asynchronously:

$ ray job submit --no-wait --working-dir . -- python count_hosts.py
Job submission server address: http://10.225.4.51:8265
2026-04-03 18:43:05,441 INFO dashboard_sdk.py:411 -- Package gcs://_ray_pkg_1cd54066d1e984c7.zip already exists, skipping upload.

The package is now cached, so it is not uploaded again.
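
Jobs can also be submitted from Python with JobSubmissionClient from ray.job_submission. The sketch below assumes the dashboard address from the transcript above; submit_and_wait is a hypothetical wrapper that takes the client as an argument and polls until the job reaches a final state.

```python
import time

# Final job states; a job in any other state is still pending or running.
TERMINAL_STATES = ("SUCCEEDED", "FAILED", "STOPPED")


def submit_and_wait(client, entrypoint, working_dir=".", poll_seconds=1.0):
    """Submit a job and block until it finishes. Hypothetical helper."""
    submission_id = client.submit_job(
        entrypoint=entrypoint,
        runtime_env={"working_dir": working_dir},
    )
    while True:
        status = client.get_job_status(submission_id)
        # Ray returns an enum; fall back to the raw value for plain strings.
        status = getattr(status, "value", status)
        if status in TERMINAL_STATES:
            return submission_id, status
        time.sleep(poll_seconds)


def main():
    # Imported here so submit_and_wait can be exercised without Ray.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient("http://10.225.4.51:8265")
    submission_id, status = submit_and_wait(client, "python count_hosts.py")
    print(submission_id, status)
    print(client.get_job_logs(submission_id))
```

Returning right after submit_job instead of polling corresponds to the --no-wait form.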

Job management

List all jobs:

$ ray list jobs

======== List: 2026-04-03 18:56:24.144025 ========
Stats:
------------------------------
Total: 2

Table:
------------------------------
      JOB_ID  SUBMISSION_ID               ENTRYPOINT              TYPE        STATUS     MESSAGE                     ERROR_TYPE    DRIVER_INFO
 0  01000000                              python count_hosts.py   DRIVER      SUCCEEDED                                            id: '01000000'
                                                                                                                                   node_ip_address: 10.225.4.51
                                                                                                                                   pid: '1139146'
 1  03000000  raysubmit_2npNAssCnHczsCpv  python count_hosts.py   SUBMISSION  SUCCEEDED  Job finished successfully.                id: '03000000'
                                                                                                                                   node_ip_address: 10.225.4.51
                                                                                                                                   pid: '1140228'

Delete a job:

$ ray job delete 03000000
Job submission server address: http://10.225.4.51:8265
Job '03000000' deleted successfully

If the job is in RUNNING status, you need to stop it first:

$ ray job stop 04000000
Job submission server address: http://10.225.4.51:8265
Attempting to stop job '04000000'
Waiting for job '04000000' to exit (disable with --no-wait):
Job has not exited yet. Status: RUNNING
Job has not exited yet. Status: RUNNING
Job has not exited yet. Status: RUNNING
Job '04000000' was stopped
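
The stop-then-delete sequence can be scripted with the same Python client. This is a sketch under assumptions: stop_and_delete is a hypothetical helper, the client is expected to be a JobSubmissionClient (or anything with the same get_job_status, stop_job and delete_job methods), and the identifier passed in is assumed to be the submission ID, such as raysubmit_2npNAssCnHczsCpv.

```python
import time

# Final job states; a job in any other state must be stopped before deletion.
TERMINAL_STATES = ("SUCCEEDED", "FAILED", "STOPPED")


def _status_name(status):
    # Ray returns an enum; fall back to the raw value for plain strings.
    return getattr(status, "value", status)


def stop_and_delete(client, submission_id, poll_seconds=1.0,
                    timeout_seconds=60.0):
    """Stop the job if it is still active, wait for it to exit, delete it."""
    deadline = time.monotonic() + timeout_seconds
    if _status_name(client.get_job_status(submission_id)) not in TERMINAL_STATES:
        client.stop_job(submission_id)
        while _status_name(client.get_job_status(submission_id)) not in TERMINAL_STATES:
            if time.monotonic() > deadline:
                raise TimeoutError(f"job {submission_id} did not exit in time")
            time.sleep(poll_seconds)
    return client.delete_job(submission_id)
```

The timeout keeps the loop from waiting forever if the job ignores the stop request.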