Acknowledgement
Thanks to Weiwei for first introducing me to OSG and for the constructive suggestions on this tutorial article.
Some code in this article was written with help from ChatGPT. Thanks!
This article describes the whole process of running jobs on OSG. More specifically, it takes running a Python project with a PyTorch environment as an example. All commands here are for the macOS system. In short, we will introduce:
- How to sign-up and access OSG
- How to set up the necessary environment on OSG
- How to prepare, submit, run and monitor jobs on OSG
- How to collect results from OSG to local
Due to its specific design, OSG is quite powerful for tasks that can be performed in parallel. For example, in statistics, people often need to run simulations to support their methods. Due to randomness, the simulation procedure usually needs to be repeated many times. In this case, OSG offers great convenience, since it can handle hundreds of simulation replications at the same time. Moreover, each replication can itself run in parallel on multiple CPUs.
Imagine we submit 1000 jobs to OSG and each job runs with 4 CPUs; that is, 4000 CPUs work on our tasks simultaneously.
Introduction to the Open Science Grid platform
The Open Science Grid (OSG) consists of unused computing and storage resources at universities and national labs spanning the United States. Participants can submit their tasks to OSG through the Open Science Pool. However, each job should fit on a single server; see Page to learn the appropriate job size for running smoothly on OSG.
Sign up for OSG account
If you are a researcher affiliated with a US academic institution, just go to Link to sign up for OSG. In short, you need to fill in the reason for applying, and then someone from the OSG team will set up a meeting with you to give a more detailed introduction to OSG and create an account for you.
Access OSG from local
Once you have met with an OSG team member, you will have an account set up to access OSG. There are two steps to connect your local computer to OSG:
- Create an SSH key; see Page for a simple way to create one. Assuming the folder that stores the SSH key is .ssh, you can run the following command in the terminal to open it:
cd ~/.ssh
Then, the simple command below lists all SSH keys stored on your local computer.
ls
To view the info of a specific SSH key, use the command below, replacing name with the actual file name.
cat name.pub
Lastly, we need to add this SSH key info to our web profile on the OSG Page; see more details about this process on Page.
- Add two-factor authentication; see Page for more details. (I used Google Authenticator as the second factor.)
Finally, we are ready to access OSG from our local machine. Type the command below in the terminal, replacing UnixUsername with your own Unix username (shown under Profile on Page) and loginnode with the login node assigned to you:
ssh UnixUsername@loginnode
Then, you need to enter the passphrase you defined when the SSH key was created and pass the second-factor authentication.
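Optionally, to avoid typing the full address every time, you can add an entry to your local ~/.ssh/config file (the host alias osg and the key file name name here are hypothetical; replace them with your own):
Host osg
    HostName loginnode
    User UnixUsername
    IdentityFile ~/.ssh/name
After that, ssh osg is enough to log in.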
Set up the environment
Once we are able to access OSG, the next step is to run our jobs. Since OSG provides computing resources collected from different places, we need to specify the environment in which our jobs will run. For example, we shall indicate which packages we require. There are some existing environments provided by OSG using containers; see Page. Unfortunately, these basic environments may not satisfy our needs. Thus, we need to set up our own environment.
In this article, we introduce a method to create the desired environment by building an image from a Dockerfile. First, please install Docker from Page. Then, we can check whether the installation succeeded by running the command below in the terminal:
docker --version
Then, we need to write a Dockerfile (a plain-text file named Dockerfile) stating which packages we need. A Dockerfile template looks like the one below:
# Use an official Python runtime as a parent image
FROM --platform=linux/amd64 python:3.8-slim
# Set environment variables
ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=off \
    PIP_DISABLE_PIP_VERSION_CHECK=on \
    PIP_DEFAULT_TIMEOUT=100
# Install system dependencies
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        git \
        build-essential \
        libglib2.0-0 \
        libsm6 \
        libxext6 \
        libxrender-dev \
        wget \
    && rm -rf /var/lib/apt/lists/*
# Install PyTorch and other packages
RUN python3 -m pip install torch joblib pandas scikit-learn numpy
# Set the working directory in the container
WORKDIR /app
# Copy the current directory contents into the container at /app
COPY . /app
The first line, FROM --platform=linux/amd64 python:3.8-slim, is important: it indicates that we require an environment built for the amd64 architecture. (Without this line, a Mac would build for another architecture due to its ARM-based M chip.)
The line RUN python3 -m pip install torch joblib pandas scikit-learn numpy gives the list of needed packages. Here, joblib is a package used to run code in parallel with multiple CPUs; a small sketch of its usage follows below. You can add additional necessary packages to this list. How to check the installed libraries in the image will be explained later.
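As a minimal sketch of how joblib runs code in parallel (the function below is purely illustrative and not part of this project):
from joblib import Parallel, delayed

def one_task(k):
    # placeholder for one expensive unit of work
    return k * k

# run the tasks on 4 CPUs, matching the number of CPUs requested per job
results = Parallel(n_jobs=4)(delayed(one_task)(k) for k in range(8))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]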
Subsequently, we can build the container image, which will later serve as the environment, with the command below. (Note: you need to be in the folder that contains the Dockerfile; navigate there with cd.)
docker build -t namespace/repository_name .
For namespace, you can use your own Docker account name. For repository_name, you can give any distinguishable name. At the moment, I am creating an environment for the CalibrationPI project, so I will use a command like docker build -t user/calibrationpi . (replace user with your own user name).
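Before moving on, we can optionally sanity-check the freshly built image locally (the image name here follows the example above):
docker run --rm user/calibrationpi python3 -c "import torch; print(torch.__version__)"
If the image was built correctly, this prints the installed PyTorch version.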
After the build, you should be able to see the image ID with the command below:
docker image list
Then, we need to transfer the Docker image to OSG. First, let's save the image to a .tar file with the command below:
docker save ImageID -o filename.tar
Replace ImageID with the image ID you found with docker image list, and specify any filename you like.
Since the image file is usually larger than 1 GB, we need to transfer it to the Open Science Data Federation (OSDF); the OSDF location for each login node can be found on this Page. For example, my login node is ap21 and my Unix username is user. Thus, the location to which we should transfer the image is /ospool/ap21/data/user. We can accomplish the transfer with the command below:
scp calibrationpi.tar user@ap21.uc.osg-htc.org:/ospool/ap21/data/user
Remember to replace user in the above command with your own Unix username and use your own login node.
Then, we need to log in to OSG and go to the location /ospool/ap21/data/user where we put the Docker image. Using the command below, we can convert the Docker image into a SIF image:
apptainer build filename.sif docker-archive://filename.tar
Here, filename.tar is the .tar file saved before. After building the SIF image, you can see the .sif file in this folder with the ls command. This .sif file will be specified when we write the submit file on OSG.
As the last step, we shall check whether the packages were installed successfully. Open the .sif file with the command below (my image here is named pytorch-2.1.1.sif):
apptainer shell pytorch-2.1.1.sif
Then, open Python with the command:
python
Check whether torch was installed successfully with the command:
import torch
After the check, use exit() to exit Python and exit to exit the image shell.
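Alternatively, the same check can be done in one line without opening an interactive shell (assuming your image is named filename.sif):
apptainer exec filename.sif python3 -c "import torch; print(torch.__version__)"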
Prepare, submit, run and monitor jobs on OSG
In this part, we use a simple simulation task to illustrate how to prepare and run jobs on OSG.
Let's say we want to simulate 100 samples from each normal distribution $N(\mu_i, 1)$ for $i = 1, \ldots, K$. Then, we can use the sample mean of each simulated dataset to approximate $\mu_i$. However, this sample mean is only a single point estimate. We can redo the simulation procedure $M$ times and use the empirical distribution of these $M$ observed sample means to approximate the sampling distribution of the sample mean. In other words, we hope to simulate 100 data points $M$ times for each $\mu_i$, $i = 1, \ldots, K$.
In practice, we may face a much more complicated simulation that needs to be replicated many times with different parameters. The logic for achieving such a more challenging task is similar to this toy example.
We first prepare a main.py file, which is the main body that solves the aforementioned simulation problem.
import os
import sys

import torch

def main(i, ii):
    # Create a folder for this mean value (skip the error if it already exists)
    new_folder_name = './example_mean' + f'{i}'
    os.makedirs(new_folder_name, exist_ok=True)
    # Seed with the replication index to make the experiment reproducible
    torch.manual_seed(ii)
    # Draw 100 samples from N(i, 1): mean i, standard deviation 1
    Single_sim = float(i) + 1 * torch.randn(100)
    torch.save(Single_sim, new_folder_name + '/Simulation_Rep' + f'{ii}' + '.pth')

if __name__ == "__main__":
    # Check if both arguments are passed
    if len(sys.argv) > 2:
        mean = sys.argv[1]            # the mean value, kept as a string for folder naming
        iteration = int(sys.argv[2])  # the replication index
        main(mean, iteration)
    else:
        print("No mean value or replication index provided.")
main(i, ii) takes two inputs: $i$ represents the mean, and $ii$ represents the $ii$-th replication. torch.manual_seed(ii) sets the seed equal to the replication index to make the experiment reproducible. new_folder_name = './example_mean' + f'{i}' creates a folder for each mean value to store all of its simulations. torch.save(Single_sim, new_folder_name + '/Simulation_Rep' + f'{ii}' + '.pth') saves each simulation, with its corresponding mean value and replication index, to the specific path. sys.argv[1] and sys.argv[2] read the two inputs given by our submission. Note that command-line arguments arrive as strings, so the replication index is converted with int() before being used as a seed, and the mean is converted with float() inside main.
For a more complicated project, the main function may depend on other functions. We can write these auxiliary functions separately and import them into main.py with a command like from Other_functions import *.
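We can test main.py locally with a single pair of arguments before submitting anything, e.g.:
python3 main.py 1 0
This should create the file ./example_mean1/Simulation_Rep0.pth.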
Then, we need to prepare a .sh file (named shell.sh below, matching the submit file) that tells the server on OSG to execute main.py.
#!/bin/bash
# Assign passed arguments to variables
# var1 stands for the mean
# var2 stands for the replication index
var1="$1"
var2="$2"
# Use the variables to run your Python script
python3 main.py "$var1" "$var2"
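It is also good practice to make the script executable before submission:
chmod +x shell.sh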
Lastly, we need to prepare a submission file to submit all simulations with different means to OSG. Let's name the file Submission.sub:
+SingularityImage = "osdf:///ospool/login code/data/UnixUsername/filename.sif"
executable = shell.sh
arguments = $(mean) $(inde)
transfer_input_files = main.py, Other_functions.py
transfer_output_files = example_mean$(mean)
log = example_mean_logs/example_mean$(mean)_$(inde).log
error = example_mean_errors/example_mean$(mean)_$(inde).err
output = example_mean_outputs/example_mean$(mean)_$(inde).out
+JobDurationCategory = "Long"
request_cpus = 1
request_memory = 1GB
request_disk = 4GB
requirements = (Microarch >= "x86_64-v3")
queue mean inde from Example.txt
+SingularityImage = "osdf:///ospool/logincode/data/UnixUsername/filename.sif"
indicates which environment (i.e., .sif
image) we want to use. You can replcae logincode
, UnixUsername
and filename.sif
with your information.
transfer_input_files = main.py, Other_functions.py indicates that we need main.py and other possible auxiliary files. If you do not have Other_functions.py, just remove it.
transfer_output_files = example_mean$(mean) indicates the final results we hope OSG collects from all servers and returns to us. Here, we hope to get back all the folders created in main.py.
log = example_mean_logs/example_mean$(mean)_$(inde).log collects the running information for each job.
error = example_mean_errors/example_mean$(mean)_$(inde).err collects the running errors for each job.
output = example_mean_outputs/example_mean$(mean)_$(inde).out collects the running outputs for each job.
+JobDurationCategory = "Long" indicates the expected running time of each job; see Page for more details.
request_cpus = 1 requests one CPU for each single simulation; request_memory = 1GB requests 1 GB of memory; request_disk = 4GB requests 4 GB of disk.
requirements = (Microarch >= "x86_64-v3") restricts the CPU microarchitecture of the servers we want to use.
queue mean inde from Example.txt extracts all mean values and replication indices from Example.txt. If we take $\mu_1 = 1, \mu_2 = 2$ (so $K = 2$) and take $M = 2$, then Example.txt shall have the format below:
1 0
1 1
2 0
2 1
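For larger $K$ and $M$, Example.txt can be generated with a short script rather than typed by hand; a minimal sketch:
# generate Example.txt: one line per (mean, replication index) pair
K, M = 2, 2
with open('Example.txt', 'w') as f:
    for mean in range(1, K + 1):
        for inde in range(M):
            f.write(f'{mean} {inde}\n')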
These are all the files we need to prepare. Then, we can transfer all of them to OSG, and we are ready to run.
We can put all the files in a single folder Example_blog and then use the command below in our local terminal to transfer them to the OSG platform (make sure the terminal's working directory contains this folder):
scp -r Example_blog UnixUsername@loginnode:/home/UnixUsername
Open another terminal locally and connect to OSG with the command ssh UnixUsername@loginnode. Then, we can see the folder Example_blog with the command ls. Use cd Example_blog to open the folder. Using ls again, we can see all the files. However, we still need to create the folders that will store the .log, .err and .out files generated during the run (HTCondor does not create these directories automatically). This step can be accomplished with the command:
mkdir -p example_mean_logs example_mean_errors example_mean_outputs
Finally, we can use the command below to submit our four jobs to OSG:
condor_submit Submission.sub
Then, with the command condor_q, we can see the status of our submitted jobs. You can also consider other commands, e.g., condor_watch_q or watch condor_q. Sometimes a job will be held due to various complicated issues. If you are sure your submission is accurate, you can simply run condor_release with the job ID to resubmit all held jobs. Once all jobs have finished, we can compress the whole folder and then transfer the compressed folder to our local machine:
tar -czvf Example_blog.tar.gz Example_blog
Then, run the command below in a local terminal to get the compressed folder from OSG:
scp UnixUsername@loginnode:/home/UnixUsername/Example_blog.tar.gz /local_path
Please replace UnixUsername, loginnode and local_path with your own choices.
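After the download finishes, the archive can be unpacked locally with:
tar -xzvf Example_blog.tar.gz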
Others
Once we have all the .pth files on our local system, we can read them and store them in a list in a simple way.
import glob

import torch

directory = '/local_path'
# Get a list of .pth files in the directory
pth_files = glob.glob(f'{directory}/*.pth')
n_res = len(pth_files)
All_sim_res = [None] * n_res
# Load every simulation result into the list
for i, pth_file in enumerate(pth_files):
    All_sim_res[i] = torch.load(pth_file)
Remember to replace /local_path with your own path where all the .pth files are stored.
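With all replications loaded, we can, for example, compute the sample mean of each replication to inspect its empirical sampling distribution (a small sketch following the code above):
import torch

# each element of All_sim_res is a tensor of 100 simulated samples
sample_means = torch.stack([sim.mean() for sim in All_sim_res])
print(sample_means.mean().item(), sample_means.std().item())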
This is my first tech blog. Thank you for reading! Let me know if you have any questions or comments! 💫
Update
Some issues I met while using OSG:
- Python 3.8 may not be a perfect choice in the above Dockerfile. For example, the cde package for Python does not work properly with Python 3.8.
- If we want to add additional packages to the existing image, we can use the command docker run -it ImageID /bin/bash in a local terminal, install the packages, and then commit the change; see the sketch below.
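A minimal sketch of that workflow (the container name tmp_env and the tag v2 are hypothetical; replace ImageID with your image ID):
# start an interactive shell in the existing image
docker run -it --name tmp_env ImageID /bin/bash
# ...inside the container: pip install the packages you need, then exit...
# commit the modified container as a new image
docker commit tmp_env user/calibrationpi:v2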