Acknowledgement

Thanks to Weiwei for first introducing OSG to me and for constructive suggestions on this tutorial article.

Some code in this article was written with help from ChatGPT. Thanks!


This article describes the whole process of running jobs on OSG. More specifically, it takes running a Python project with a PyTorch environment as an example. All local commands here are for macOS. In short, we will introduce:

  • How to sign-up and access OSG
  • How to set up the necessary environment on OSG
  • How to prepare, submit, run and monitor jobs on OSG
  • How to collect results from OSG to local

Due to its specific design, OSG is particularly powerful for tasks that can be performed in parallel. For example, in statistics, people often need to run simulations to validate their methods. Due to randomness, the simulation procedure usually needs to be repeated many times. In this case, OSG offers great convenience since it can handle hundreds of simulation replications simultaneously. Moreover, each replication can itself be run in parallel on multiple CPUs.

Imagine we submit 1000 jobs to OSG and each job runs with 4 CPUs; that is, 4000 CPUs will work on our tasks at the same time.


Introduction to the Open Science Grid platform

The Open Science Grid (OSG) consists of unused computing and storage resources at universities and national labs spanning the United States. Participants can submit their tasks to OSG through the Open Science Pool. However, each job should fit on a single server; see Page to learn the appropriate job size for running smoothly on OSG.

Sign up for OSG account

If you are a researcher affiliated with a US Academic Institution, just go to Link to sign up for OSG. In short, you need to fill in your reason for applying, and then someone from the OSG team will set up a meeting with you to give a more detailed introduction to OSG and create an account for you.

Access to OSG from local

Once you have met with an OSG team member, you will have an account set up to access OSG. There are two steps to connect your local computer to OSG:

  1. Create an SSH key; see Page for a simple way to create one. By default, the SSH key is stored in the hidden .ssh folder. You can run the command below in the terminal to open the .ssh folder:
cd ~/.ssh

Then, the simple command below lists all SSH keys stored on your local computer.

ls

To view a specific public key, use the command below, replacing name with the actual file name.

cat name.pub

Lastly, we need to add this public key to our web profile on the OSG Page; see more details about this process on Page.

  2. Set up two-factor authentication; see Page for more details. (I used Google Authenticator as the second factor.)

Finally, we are ready to access OSG from our local machine. Type the command below in the terminal, replacing UnixUsername with your own UnixUsername shown in your Profile on Page, and replacing loginnode with the login node assigned to you:

ssh UnixUsername@loginnode

Then, you need to enter the passphrase you defined when the SSH key was created, and complete the two-factor authentication.

Set up the environment

Once we are able to access OSG, the next step is to run our jobs. Since OSG provides computation resources collected from different places, we need to specify the environment in which our jobs will run. For example, we shall indicate which packages we require. There are some existing environments provided by OSG via containers; see Page. Unfortunately, these basic environments may not satisfy our needs, so we need to set up our own environment.

In this article, we introduce a method to create the desired environment by building an image from a Dockerfile. First, please install Docker from Page. Then, we can check whether the installation succeeded by running the command below in the terminal:

docker --version

Then, we need to write a Dockerfile (a plain-text file named Dockerfile) to state which packages we need. A Dockerfile template looks like the one below:

# Use an official Python runtime as a parent image
FROM --platform=linux/amd64 python:3.8-slim

# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=off \
    PIP_DISABLE_PIP_VERSION_CHECK=on \
    PIP_DEFAULT_TIMEOUT=100

# Install system dependencies
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        git \
        build-essential \
        libglib2.0-0 \
        libsm6 \
        libxext6 \
        libxrender-dev \
        wget \
    && rm -rf /var/lib/apt/lists/*

# Install PyTorch and other packages
RUN python3 -m pip install torch joblib pandas scikit-learn numpy


# Set the working directory in the container
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

The first line, FROM --platform=linux/amd64 python:3.8-slim, is important: it indicates that we require an environment built for the amd64 architecture. (Without this line, a Mac with an ARM-based M chip would build for a different architecture.)

The line RUN python3 -m pip install torch joblib pandas scikit-learn numpy gives the list of needed packages. Here, joblib is a package used to run code in parallel with multiple CPUs. You can append additional packages to this list as needed.
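
As a rough illustration (a minimal sketch with made-up names, not part of the toy example later in this article), joblib's Parallel and delayed can spread replications over the CPUs requested for a job:

from joblib import Parallel, delayed

def one_replication(seed):
    # Placeholder for a single simulation replication
    return seed ** 2

# Run 100 replications across 4 CPUs; n_jobs should match the number of
# CPUs requested in the submit file (request_cpus)
results = Parallel(n_jobs=4)(delayed(one_replication)(s) for s in range(100))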

How to check that these libraries are available in the built image will be explained later.

Subsequently, we can build the container image, which will later serve as our environment, with the command below. (Note: you need to be in the folder that contains the Dockerfile; navigate there with cd.)

docker build -t namespace/repository_name .

For namespace, you can use your own Docker account name. For repository_name, you can give any distinguishable name. Here, I am creating an environment for the CalibrationPI project, so I use the command docker build -t user/calibrationpi . (replace user with your own user name).

After the build, you should be able to see the image ID with the command below:

docker image list

Then, we need to transfer the Docker image to OSG. First, let’s save this image as a .tar file with the command below:

docker save ImageID -o filename.tar

Replace ImageID with the image ID you found with docker image list. You can specify any file name.

Since the image file is usually larger than 1 GB, we need to transfer it to the Open Science Data Federation (OSDF); the OSDF location for each login node can be found on this Page. For example, my login group is ap21 and my UnixUsername is user, so the location we should transfer the image to is /ospool/ap21/data/user. We can accomplish the transfer with the command below:

scp calibrationpi.tar user@ap21.uc.osg-htc.org:/ospool/ap21/data/user

Remember to replace user in the above command with your own UnixUsername and to use your own login node.

Then, we need to log in to OSG and go to the location /ospool/ap21/data/user where we put the Docker image. Using the command below, we can convert the Docker image to a SIF image:

apptainer build filename.sif docker-archive://filename.tar

Here, filename.tar is the .tar file saved before. After building the SIF image, you can see the .sif file in this folder with the ls command. This .sif file will be specified when we write the submit file on OSG.

As the last step, we shall check whether the packages were installed successfully. Open filename.sif with the command:

apptainer shell filename.sif

Then, open python by command:

python

Check if torch was installed successfully by command:

import torch
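
Optionally, you can also confirm that the other packages listed in the Dockerfile are present by printing their versions inside the same Python session, for example:

# Confirm the packages from the Dockerfile are importable and show versions
import joblib
import numpy
import pandas
import sklearn
import torch

print(torch.__version__, joblib.__version__, numpy.__version__,
      pandas.__version__, sklearn.__version__)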

After the check, use exit() to exit Python and exit again to leave the image.

Prepare, submit, run and monitor jobs on OSG

In this part, we use a simple simulation task to illustrate how to prepare and run jobs on OSG.

Let’s say we want to simulate 100 samples from each normal distribution $N(\mu_i,1)$ for $i = 1,\ldots, K$. Then, we can use the sample mean of each simulated data set to approximate $\mu_i$. However, this sample mean is only a single point estimate. We can redo the simulation procedure $M$ times and use the empirical distribution of the $M$ observed sample means to approximate the sampling distribution of the sample mean. In other words, we hope to simulate 100 data points $M$ times for each $\mu_i,~ i=1,\ldots, K$.
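
In symbols, the $m$-th replication for mean $\mu_i$ draws $X_{i,1}^{(m)},\ldots,X_{i,100}^{(m)} \overset{\text{i.i.d.}}{\sim} N(\mu_i,1)$ and records the sample mean

$$\hat{\mu}_i^{(m)} = \frac{1}{100}\sum_{j=1}^{100} X_{i,j}^{(m)}, \qquad m = 1,\ldots,M,$$

and the empirical distribution of $\hat{\mu}_i^{(1)},\ldots,\hat{\mu}_i^{(M)}$ then approximates the sampling distribution of the sample mean.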

In practice, we may face a much more complicated simulation that needs to be replicated many times with different parameters. The logic for such a more challenging task is similar to this toy example.

We first prepare a main.py file, which is the main body that solves the aforementioned simulation problem.

import os
import sys

import torch

def main(i, ii):

    # Create a folder for this mean value (exist_ok avoids an error if it already exists)
    new_folder_name = './example_mean' + f'{i}'
    os.makedirs(new_folder_name, exist_ok=True)

    # Set the seed to the replication index to make this experiment reproducible
    torch.manual_seed(int(ii))

    # Simulate 100 samples from N(i, 1)
    Single_sim = float(i) + 1 * torch.randn(100)

    torch.save(Single_sim, new_folder_name + '/Simulation_Rep' + f'{ii}' + '.pth')

if __name__ == "__main__":
    # Check if both arguments are passed
    if len(sys.argv) > 2:
        mean = sys.argv[1]       # Get the mean value from input
        iteration = sys.argv[2]  # Get the iteration number from input
        main(mean, iteration)
    else:
        print("No mean value or iteration number provided.")

main(i, ii) takes two inputs: $i$ is the mean value, and $ii$ is the replication index.

torch.manual_seed(int(ii)) sets the seed to the replication index to make this experiment reproducible.

new_folder_name = './example_mean' + f'{i}' creates a folder for each mean value to store all simulations.

torch.save(Single_sim, new_folder_name + '/Simulation_Rep'+ f'{ii}' + '.pth') saves each simulation with its corresponding mean value to the specific path.

sys.argv[1] and sys.argv[2] read two inputs given by our submission.

For a more complicated project, the main function may depend on other functions. We can write these auxiliary functions in a separate file and import them into main.py with a line like from Other_functions import *.
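
For illustration only, a hypothetical Other_functions.py (the function name and body below are made up, not part of this example) might look like this:

# Other_functions.py -- hypothetical auxiliary functions imported by main.py
import torch

def summarize(sim):
    # Return the sample mean and standard deviation of one simulated data set
    return sim.mean(), sim.std()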

Then, we need to prepare a shell script, named shell.sh here to match the executable line in the submit file below, that tells the OSG server to execute main.py.

#!/bin/bash

# Assign passed arguments to variables
# var1 stands for the mean
# var2 stands for the replication index
var1="$1"
var2="$2"

# Use the variables to run your Python script
python3 main.py "$var1" "$var2"

Lastly, we need to prepare a submission file to submit all simulations with different means to OSG. Let’s name the file Submission.sub.

+SingularityImage = "osdf:///ospool/logincode/data/UnixUsername/filename.sif"

executable = shell.sh
arguments = $(mean) $(inde)

transfer_input_files =  main.py, Other_functions.py

transfer_output_files = example_mean$(mean)

log = example_mean_logs/example_mean$(mean)_$(inde).log
error = example_mean_errors/example_mean$(mean)_$(inde).err
output = example_mean_outputs/example_mean$(mean)_$(inde).out

+JobDurationCategory = "Long"

request_cpus = 1
request_memory = 1GB
request_disk = 4GB

requirements = (Microarch >= "x86_64-v3")

queue mean inde from Example.txt

+SingularityImage = "osdf:///ospool/logincode/data/UnixUsername/filename.sif" indicates which environment (i.e., .sif image) we want to use. You can replace logincode, UnixUsername and filename.sif with your own information.

transfer_input_files = main.py, Other_functions.py indicates we need main.py and other possible auxiliary functions. If you do not have Other_functions.py, just remove it.

transfer_output_files = example_mean$(mean) indicates the final results we hope OSG collects from all servers and returns to us. Here, we hope to return the folders created in main.py.

log = example_mean_logs/example_mean$(mean)_$(inde).log collects all running information for each job.

error = example_mean_errors/example_mean$(mean)_$(inde).err collects all running errors for each job.

output = example_mean_outputs/example_mean$(mean)_$(inde).out collects all running outputs for each job.

+JobDurationCategory = "Long" indicates the possible running time of each job; see Page for more introduction.

request_cpus = 1 requests one CPU for each single simulation; request_memory = 1GB requests 1GB of memory; request_disk = 4GB requests 4GB of disk.

requirements = (Microarch >= "x86_64-v3") restricts the CPU microarchitecture of the servers our jobs can run on.

queue mean inde from Example.txt reads all mean values and replication indices from Example.txt. If we take $\mu_1 = 1, \mu_2 = 2$ (so $K = 2$) and $M = 2$, then Example.txt should have the format below:

1 0
1 1
2 0
2 1
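
Writing Example.txt by hand is fine for four jobs; for larger $K$ and $M$, a small optional helper script (a sketch, not part of the original workflow) can generate it:

# generate_example.py: write one line "mean replication_index" per job
K, M = 2, 2  # number of mean values and replications per mean; adjust as needed
with open('Example.txt', 'w') as f:
    for mean in range(1, K + 1):
        for rep in range(M):
            f.write(f'{mean} {rep}\n')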

These are all the files we need to prepare. Then, we can transfer them to OSG, and we are ready to run the jobs.

We can put all files in a single folder Example_blog and then use the command below in our local terminal to transfer them to the OSG platform (make sure the terminal is in the directory that contains Example_blog):

scp -r Example_blog UnixUsername@loginnode:/home/UnixUsername

Open another local terminal and connect to OSG with the command ssh UnixUsername@loginnode. Then, we can see the folder Example_blog with the command ls. Use cd Example_blog to open the folder; using ls again, we can see all files. However, we still need to create the folders that store the .log, .err, and .out files generated during the run. This step can be accomplished with the command:

mkdir -p example_mean_logs example_mean_errors example_mean_outputs

Finally, we can use the command below to submit our four jobs to OSG:

condor_submit Submission.sub

Then, with the command condor_q, we can see the status of our submitted jobs. You can also consider other commands, e.g., condor_watch_q or watch condor_q. Sometimes a job will be held due to various issues; if you are sure your submission is correct, you can simply run condor_release with the job ID to resubmit the held jobs. Once all jobs are finished, we can go back to the home directory with cd .., compress the whole folder, and then transfer the compressed file to our local machine:

tar -czvf Example_blog.tar.gz Example_blog

Then, run the command below in a local terminal to fetch the compressed file from OSG:

scp -r UnixUsername@loginnode:/home/UnixUsername/Example_blog.tar.gz /local_path

Please replace UnixUsername, loginnode and local_path with your choices.

Others

Once we have all .pth files on our local system, we can read them and store them in a list in a simple way.

import glob

import torch

# Point this at the folder that contains the .pth files
directory = '/local_path'

# Get a list of .pth files in the directory
pth_files = glob.glob(f'{directory}/*.pth')

# Load every simulation result into a list
All_sim_res = [torch.load(pth_file) for pth_file in pth_files]

Remember to replace '/local_path' with your own path in which all .pth files are stored.
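
As a follow-up (a minimal sketch, assuming All_sim_res holds the replications for a single mean value), the empirical sampling distribution of the sample mean can then be summarized like this:

import torch

# Sample mean of each replication; their spread approximates the
# sampling distribution of the sample mean for that mean value
sample_means = torch.stack([sim.mean() for sim in All_sim_res])
print(sample_means.mean().item(), sample_means.std().item())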


This is my first tech blog. Thank you for reading! Let me know if you have any questions or comments! 💫

Update

Some issues I ran into when using OSG:

  • Python 3.8 may not be a perfect choice in the above Dockerfile. For example, the cde package for Python does not work properly with Python 3.8.
  • If we want to add packages to an existing image, we can use the command docker run -it ImageID /bin/bash in a local terminal, install the packages, and then commit the change.