
Cloud Computing for Climate Sciences


Our overall goal is to explore the capability of cloud computing platforms for efficient scientific data processing (see, for example, http://www.nimbusproject.org/). During the summer of 2010, we plan to conduct a few small-scale tests to understand how the various components fit together. To this end, we plan the following step-wise action items, documenting each process as instructions for others. This work was done by Daren Hasenkamp, summer 2010.

Project plan

  1. Create a Virtual Machine for submission to an Infrastructure-as-a-service cloud. This initial VM will carry with it a simple job (such as generating a few random numbers) and report its progress to a remote web service host. Document the process as instructions for others.
  2. Submit the job to a cloud service at ALCF. Verify that the job is completed as expected.
  3. Create a VM to retrieve (stage-in) data from LBNL/NERSC using GridFTP, compute a checksum of the file, write the file name, time stamp and checksum into another file, and copy the new file to LBNL using GridFTP.
  4. Climate data analysis: Create a VM to include all necessary software to analyze climate data. Climate data needs to be staged from LBNL/NERSC using GridFTP, be analyzed, and the results need to be copied (staged-out) to LBNL using GridFTP.
  5. Find a way to distinguish the cloud jobs (VMs) from each other so that data assignment can be specific to each job.


TUTORIALS FOR BUILDING VMS

All tutorials use the same Ubuntu 10.04 Server image available at http://uec-images.ubuntu.com/releases/10.04/beta2/ubuntu-10.04-beta2-server-uec-i386.tar.gz

Many of the commands require root access. These tutorials do not include "sudo" in front of such commands; if you get permission errors running any of them, run them with sudo or as root.

Much of the customization process involves creating "chroots," which are processes running with a different root directory than the system-wide root directory. For a chroot to work, the VM image you work with needs to have the same word length as your machine: if you're on a 32-bit machine, you need to use a 32-bit image.
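
If you're unsure, you can check your machine's word length with uname, and (once an image is mounted as described in the tutorials below) check the image's word length by inspecting one of its binaries:

$ uname -m

$ file <mount point>/bin/ls

(uname -m reports i686 for a 32-bit machine and x86_64 for a 64-bit machine; file reports whether the image's binaries are ELF 32-bit or ELF 64-bit. The exact output wording varies between systems.)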

Euca2ools

Euca2ools is a command line interface for interacting with Eucalyptus clouds. Euca2ools is available through the aptitude package manager:

$ apt-get install euca2ools

For other operating systems, follow the instructions here: http://wiki.magellancloud.org/index.php/Eucalyptus_Tools

A listing of useful Euca2ools commands can be found here: http://wiki.magellancloud.org/index.php/List_of_Eucalyptus_Commands

Basics of creating/scripting a custom non-interactive VM

The easiest way to create a custom VM is to get a prebuilt image, modify it locally (using a chroot if necessary), and upload it to the cloud. This example tutorial will show you how to create a VM that will use wget to fetch a remote file. (Once you have this complete, you can replace the "wget" with whatever sort of work you need your VM to do.)

Note that basic configuration tasks like this do not require a chroot; however, more complicated tasks (such as using aptitude package manager) will.

1. Get a vm image you would like to use. I placed this in /tmp.

$ cd /tmp

$ wget http://uec-images.ubuntu.com/releases/10.04/beta2/ubuntu-10.04-beta2-server-uec-i386.tar.gz

$ tar zxvf ubuntu-10.04-beta2-server-uec-i386.tar.gz

I will refer to the .img file produced by this command as <image file>. Depending on the image you use, you might also end up with kvm and/or xen kernels/ramdisks, but we are not interested in these.

2. Mount the vm as a loop device.

$ mkdir <mount point>

$ losetup /dev/loop0 <image file>

$ mount /dev/loop0 <mount point>

(for us, <image file>=="/tmp/lucid-server-uec-i386.img")

3. Now that your vm image is mounted, you can add a startup script to it. There are many ways to do this; the following method worked well for me.

Use a text editor to open "<mount point>/etc/rc.local". Note that the rc.local you're editing is contained in the vm image, which you mounted to <mount point>. rc.local is a shell script; you can do whatever you like with it. Whatever script you add here will be run whenever your VM boots. So, to have your VM run wget at startup, add the following line to rc.local:

wget www.google.com

You might also want to shut your VM down after the job is completed. To do this, you can add a "halt" or "shutdown" command to the end of the rc.local script. (You can also shut down by using a popen from within whatever control script you end up implementing--eg, I used a multi-threaded python control script that executes "popen('halt')" when it's done.)
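
Putting these pieces together, the body of rc.local for this tutorial might look roughly like the following (these are the file's contents, not shell prompts; keep any existing "exit 0" line at the end of the file):

# fetch a remote file; replace this with your VM's real job
wget www.google.com
# optionally power the VM off once the work is finished
halt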

3.5. At this point, if you wanted to install other software on the VM, you might want to chroot into your mounted VM. We don't need to do this for this tutorial, but if you needed to, now would be the time. The command would be

$ chroot <mount point>

4. Once you have saved rc.local, unmount the image and free the loop device.

$ umount /dev/loop0

$ losetup -d /dev/loop0

5. Now you can use euca2ools to bundle and upload your image. (euca2ools is a complex piece of software; this guide assumes you have euca2ools installed and sourced [using ". ~/.euca/eucarc" on bash or "source ~/.euca/eucarc" on (t)csh].)

$ euca-bundle-image -i <image file> --kernel <kernel> --ramdisk <ramdisk>

(For the image in this tutorial: $ euca-bundle-image -i <image file> --kernel eki-27CD1D49 --ramdisk eri-02AE1CBC)

(You can view available kernels and ramdisks with the command euca-describe-images; replace <kernel> and <ramdisk> with the id of the kernel and ramdisk you wish to use. On Magellan at Argonne, for the 32-bit Ubuntu 10.04 image we're using, the proper kernel is eki-27CD1D49 and the proper ramdisk is eri-02AE1CBC. If you were using a 64-bit Ubuntu 10.04 image, the proper kernel would be eki-44811D5A and the proper ramdisk would be eri-1F621CD3.)

$ euca-upload-bundle -b <bucket> -m <path to manifest generated by previous command>

(You can choose whatever bucket you like; if it doesn't exist it will be created for you. -m <manifest> specifies the manifest file, which was created when you used the euca-bundle-image command.)

$ euca-register <bucket>/<manifest>

(<bucket> and <manifest> are the bucket and manifest you used. The output of this command will be the ID of your VM image.)

6. (Optional) You might wish to make your VM image private.

$ euca-modify-image-attribute <your image id> -l -r all

7. Now you can launch your VM. The shell script you placed in rc.local will be executed when the VM launches.

$ euca-run-instances -k <your username>-keys -n 1 -z Magellan2 -g default <your image ID>

Check the running instance with

$ euca-describe-instances

and remember to shut it down with

$ euca-terminate-instances <instance ID>

Configuring a VM to run globus

1. Get an image to work with. I'm placing mine in /tmp; it doesn't matter where you put yours. Again, make sure the word length of the image is the same as your computer's word length.

$ cd /tmp

$ wget http://uec-images.ubuntu.com/releases/10.04/beta2/ubuntu-10.04-beta2-server-uec-i386.tar.gz

2. Unpack your image.

$ tar zxvf ubuntu-10.04-beta2-server-uec-i386.tar.gz

Depending on the image you're working with, this produces a few files. We're only interested in the .img file produced--for this tutorial, lucid-server-uec-i386.img. The other files are kernels and ramdisks, which you might be interested in, but as of the writing of this tutorial Magellan at ALCF does not allow regular users to upload their own kernels or ramdisks.

I will refer to the image file created by this command as <image file>.

3. Mount the image and set up a chroot:

(Mounting the image:)

$ mkdir <mount point>

$ losetup /dev/loop0 <image file>

$ mount /dev/loop0 <mount point>

(Enable network access from within chroot:)

$ mount --bind /etc/resolv.conf <mount point>/etc/resolv.conf

4. Now chroot into the mounted vm.

$ chroot <mount point>

If everything is working properly, you should see a root prompt. From here on I will use a '#' prompt to indicate commands entered from within the chroot, and a '$' prompt to indicate commands entered from outside of the chroot.

5. I used aptitude to install globus. Installing from source is much more difficult; use aptitude if you can.

First, add the following line to your VM's /etc/apt/sources.list:

deb http://mirrors.kernel.org/ubuntu lucid main universe

Now run:

# apt-get update

# apt-get install globus-gass-copy-progs globus-proxy-utils

6. Now you need to get your credentials onto the VM. This guide assumes you have a .p12 file from the DOE, and that you have registered it with whatever hosts you will be using gsiftp to connect to. (I got my creds from the DOE and registered them with NIM.)

(From outside the chroot)

$ cp <credential>.p12 <mount point>/tmp

(From inside the chroot)

# mkdir /root/.globus

# openssl pkcs12 -in /tmp/<credential>.p12 -clcerts -nokeys -out /root/.globus/usercert.pem

# openssl pkcs12 -in /tmp/<credential>.p12 -nocerts -out /root/.globus/userkey.pem

# chmod o= /root/.globus/userkey.pem

# chmod g= /root/.globus/userkey.pem

# rm /tmp/<credential>.p12 (Not necessary, just good practice.)

7. Now we need to configure our trusted certificates. What you do here depends on what servers you will be connecting to. At the end of this step, you need to have a directory (in the chroot) /etc/grid-security/certificates that contains the proper files for each certificate you trust. I did this by copying /etc/grid-security/certificates from dm.lbl.gov to my local machine and then running the following commands:

(From outside the chroot)

$ mkdir <mount point>/etc/grid-security

$ cp -r <path to certs>/certificates <mount point>/etc/grid-security/certificates

8. Now we will create a proxy using grid-proxy-init. We need to do this now (and not in a script to be run at VM boot time) because it requires keyboard interaction. We will also have to change the default proxy file (and the environment variable that goes with it) because the default location is /tmp, which will be wiped when the VM boots on a cloud.

(From within chroot)

# grid-proxy-init -out /root/grid_proxy -valid 24:00

This creates a grid proxy valid for 24 hours. If you need more time you can alter the argument to -valid; the format is hh:mm where "hh" is hours and "mm" is minutes. I placed the proxy in a non-default location; we will have to set environment variables to get GridFTP/gsiftp to work with this proxy location. (See next step.)
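
To confirm the proxy was written where you expect and check how long it remains valid, you can use grid-proxy-info (installed with globus-proxy-utils), pointing it at the non-default location:

# grid-proxy-info -file /root/grid_proxy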

9. Insert the following line in /etc/rc.local (inside your vm/chroot, not your local machine):

export X509_USER_PROXY=/root/grid_proxy

10. Now you are ready to script your VM to do whatever it needs to do. You can place whatever commands you need your VM to execute in /etc/rc.local after the export line you added in step 9. For example, you could use gsiftp to fetch the file /tmp/test.data from the server data1.lbl.gov and save it as /root/nersc.test by adding the following line:

globus-url-copy gsiftp://data1.lbl.gov//tmp/test.data file:////root/nersc.test
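
Taken together, the end of the VM's /etc/rc.local might look roughly like this (these are file contents, not shell prompts; the final halt is optional and only makes sense for a non-interactive VM):

export X509_USER_PROXY=/root/grid_proxy
globus-url-copy gsiftp://data1.lbl.gov//tmp/test.data file:////root/nersc.test
halt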

11. After you've made whatever other customizations you would like (installing other programs, migrating data, whatever), it's time to unmount everything.

(from within the chroot)

# exit

(from outside the chroot)

$ umount <mount point>/etc/resolv.conf

$ umount <mount point>

$ losetup -d /dev/loop0

12. Now we can bundle the image and upload it to Magellan.

$ cd /tmp

$ euca-bundle-image -i <image file> --kernel eki-27CD1D49 --ramdisk eri-02AE1CBC

(Note: The kernel/ramdisk pair provided is for the Ubuntu 10.04 image I've used in all tutorials. If you're using a different image, you might need a different kernel/ramdisk pair; the Magellan admins will upload whatever kernels/ramdisks you need them to. If you're using a 64-bit Ubuntu image, you should try kernel eki-44811D5A and ramdisk eri-1F621CD3.)

$ euca-upload-bundle -b <image bucket> -m <manifest created by previous command>

$ euca-register <image bucket>/<manifest file>

13. Now you can run your VM on the magellan cloud.

$ euca-run-instances -k <username>-keys -n 1 -z Magellan2 -g default <your VM id>

If you do run an instance on the cloud, make sure to shut it down:

$ euca-terminate-instances <instance ID>

You can figure out the instance IDs of all running instances belonging to you with

$ euca-describe-instances

14. Now, if you like, you could log on to your VM to verify that everything worked correctly. If you added the line from step 10, you'll be looking for the presence of /root/nersc.test. The first command only needs to be run once per user/machine (i.e., if you've done it already and you're on a machine with the same IP address, you don't need to run it again).

$ euca-authorize -P tcp -p 22 -s <your local machine's IP>/32 default

$ ssh -i ~/.ssh/Magellan.key root@<allocated IP>

How to configure TSTORMS for a VM

This tutorial involves creating a chroot of your VM image. I will use a $ prompt to denote commands entered outside of the chroot, and # to denote commands entered inside the chroot.

1. Mount your vm image.

$ mkdir <mount point>

$ losetup /dev/loop0 <your image>

$ mount /dev/loop0 <mount point>

$ mount --bind /etc/resolv.conf <mount point>/etc/resolv.conf

2. Put tstorms.tar.gz in a directory inside the mounted image (this tutorial will use /usr/local) and unpack it.

$ cd <mount point>/usr/local

$ wget http://vis.lbl.gov/~romano/climate/tstorms.tar.gz

$ tar zxvf tstorms.tar.gz

3. Install a fortran compiler:

$ chroot <mount point>

# apt-get install gfortran

4. Install the netCDF libraries.

# wget ftp://ftp.unidata.ucar.edu/pub/netcdf/netcdf.tar.Z

# tar zxvf netcdf.tar.Z

# cd netcdf-4.1.1

# ./configure --prefix=/path/to/install --disable-netcdf-4

(I used --prefix=/usr/local)

# make check install

5. Install the "nco" package, which contains ncks, which is needed by tstorms:

# apt-get install nco

6. We also need to install the c shell:

# apt-get install csh

7. We're almost ready to compile the tstorms code. We need to edit a couple of lines in two configuration files. (These directions are from Raquel Romano's instructions at http://vis.lbl.gov/~romano/climate/tropicalstorms.html.)

In tstorms/source/tstorms/path_names_for_modules:

-Add full path to fortran compiler (use "which gfortran" to find this):

set DEFAULT_COMPILE = 'full-path-to-f90-compiler -c -g'

set FINAL_COMMAND = 'full-path-to-f90-compiler *.o -o $1.exe $LINK_OPTS'

-Add full path to tropical storms directory:

set ANAL_PATH = full-path-to-local-copy-of-tstorms/source/tstorms

-Add full path to netCDF library:

set LINK_OPTS = '-Lfull-path-to-netcdf-lib-directory -lnetcdf'

On my image, the libnetcdf directory was /usr/local/lib.
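
As a concrete (hypothetical but typical) example, with gfortran at /usr/bin/gfortran, tstorms unpacked under /usr/local as in step 2, and the netCDF library in /usr/local/lib, the edited lines in this file would look roughly like:

set DEFAULT_COMPILE = '/usr/bin/gfortran -c -g'

set FINAL_COMMAND = '/usr/bin/gfortran *.o -o $1.exe $LINK_OPTS'

set ANAL_PATH = /usr/local/tstorms/source/tstorms

set LINK_OPTS = '-L/usr/local/lib -lnetcdf'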

In tstorms/source/trajectory/path_names_for_modules:

-Add full path to fortran compiler:

set DEFAULT_COMPILE = 'full-path-to-f90-compiler -c -g'

set FINAL_COMMAND = 'full-path-to-f90-compiler *.o -o $1.exe'

-Add full path to tropical storms directory:

set ANAL_PATH = full-path-to-local-copy-of-tstorms/source/trajectory

Finally, we have to make a small edit to the trajectory module. In tstorms/source/trajectory/ts_tools.f90, change the "landmask" definition in *3* places:

landmask = 'full-path-to-local-copy-of-tstorms/source/trajectory/landsea.map'

Now (in the same file) change the "cmask" definition in *1* place:

cmask = 'full-path-to-local-copy-of-tstorms/source/trajectory/imask_2'

The trajectory modifications are not actually necessary if you don't intend to compile/run trajectory code on your VM, but I did it anyway.

8. Now we're ready to compile. From tstorms/source/tstorms, execute the commands

# rm *.o

# rm .use*

# ./compile_it tstorms_drive

And from tstorms/source/trajectory, execute

# rm *.o

# rm .use*

# ./compile_it trajectory

Running TSTORMS on a VM

First, install TSTORMS using the previous tutorial.

In order to run TSTORMS, several environment variables must be set. We will use /etc/rc.local to set these environment variables at launch.

1. Add the path to the shared NCO NetCDF library (usually libnco.so) and the path to the Fortran compiler libraries (for gfortran, the filenames will contain "libgfortran") to LD_LIBRARY_PATH. For me, the NCO library was in /usr/lib/nco/ and the Fortran libraries were in /usr/lib/, so I added the following line to /etc/rc.local:

export LD_LIBRARY_PATH=/usr/lib/nco/:/usr/lib/

This will be slightly different for you depending where your NCO and gfortran libraries are located.

2. Add the paths to the TSTORMS executables to PATH. In rc.local, add

export PATH=$PATH:<path to tstorms>/source/tstorms/:<path to tstorms>/source/trajectory/

Now you can execute TSTORMS from any directory with the command

# <path to tstorms>/scripts/run_tstorms_general <path to NetCDF input file> <path to output file>

(Note: the output file will be created if it does not exist.)
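
Putting the pieces together, the environment setup and a single TSTORMS run could be scripted in /etc/rc.local roughly as follows (file contents, not prompts; the library and tstorms paths follow the examples above, and the input/output paths are placeholders that will differ for your data):

export LD_LIBRARY_PATH=/usr/lib/nco/:/usr/lib/
export PATH=$PATH:/usr/local/tstorms/source/tstorms/:/usr/local/tstorms/source/trajectory/
/usr/local/tstorms/scripts/run_tstorms_general /tmp/input.nc /tmp/tstorms_output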

Running Instances on Eucalyptus

Imagine you have on your local filesystem a virtual machine image located at <image>. This tutorial will show you how to upload, register, manage, and run it. Note that this tutorial is a repeat of information provided in previous tutorials. I include it only for convenience.

1. Now bundle your image and upload it to Magellan. The --kernel and --ramdisk options are not necessary unless your image requires a special kernel/ramdisk pair to run. If it does, you'll need to ask the administrators to add the kernel; once they upload it, you can use "euca-describe-images" to figure out the kernel/ramdisk IDs, which you can use in the following command:

$ euca-bundle-image -i <image> --kernel <kernel ID> --ramdisk <ramdisk ID>

$ euca-upload-bundle -b <image bucket> -m <manifest produced by previous command>

$ euca-register <image bucket>/<manifest>

The output of euca-register is the image ID of your image.

2. Now you can run your VM on the magellan cloud.

$ euca-run-instances -k <username>-keys -n 1 -z Magellan2 -g default <machine ID>

where -n specifies how many instances to run, -z specifies which cloud to run on (Magellan2 for us), -g is the security group (you might at some point create your own), and <machine ID> is the ID of whatever image you wish to run.

3. Make sure to shut your instance down:

$ euca-terminate-instances i-<instance ID>

You can also shut instances down by executing a "halt" or "shutdown" command from within the VM. This is useful if you want to create non-interactive VMs that shut themselves down upon job completion.

4. You can figure out the instance id of all running instances belonging to you with

$ euca-describe-instances

5. If you like, you can ssh into your VM.

$ euca-authorize -P tcp -p 22 -s <your local machine's IP>/32 default

$ ssh -i ~/.ssh/Magellan.key root@<allocated IP>

This assumes the VM was allocated an external IP by the eucalyptus scheduler. If it wasn't, run

$ euca-allocate-address

which outputs an address for you to use. Then run

$ euca-associate-address -i <instance ID> <address>

where <instance ID> is the ID of the instance you wish to associate the address with, and <address> is the address returned by "euca-allocate-address".

Technically, you are supposed to deallocate your address using

$ euca-disassociate-address <address>; euca-release-address <address>

However, Eucalyptus seems to do this for you if you don't. Still, it's probably better if you do. You'll also probably want to configure network settings.

Configuring Eucalyptus network settings

All VMs run by Eucalyptus run under a "security group", which is basically a bundle of network rules. By default, VMs on Eucalyptus will refuse all network traffic; if you want to use the network, you need to open up whatever ports, protocols, and IPs you want to use. The method is to create a security group, modify the group's rules using "euca-authorize", and run VMs under this security group. These commands only need to be run once per security group; i.e., once you add a rule, it stays until you revoke it. (See the listing of Eucalyptus commands on the Magellan wiki for the revocation syntax.)

To add a security group:

$ euca-add-group -d <description> group-name

To open up protocols/ports/IPs for VMs run under a given security group:

$ euca-authorize [-P protocol] [-p port-range] [-s source-subnet] security-group

So, for example, to authorize TCP for all ports and source IPs for the security group "my-group":

$ euca-authorize -P tcp -p 0-65535 -s 0.0.0.0/0 my-group

To authorize UDP on port 52000 from a single IP address:

$ euca-authorize -P udp -p 52000 -s a.b.c.d/32 my-group
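
Putting these pieces together, a complete (hypothetical) workflow for a new group that only needs to accept SSH might look like:

$ euca-add-group -d "ssh only" my-group

$ euca-authorize -P tcp -p 22 -s 0.0.0.0/0 my-group

$ euca-run-instances -k <username>-keys -n 1 -z Magellan2 -g my-group <image ID>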

CLIMATE VM USAGE

Configuring input files: The climate VM contains a newline-separated list of remote URLs in the file /root/inputs.txt. Modify these at your leisure.

Configuring output directory: The output directory is provided as an argument to the Python control script, which is invoked by rc.local. To change the output directory, edit the third argument of the "python /root/ClimateVMCoord.py" invocation in /etc/rc.local. You need to give it a valid server and an existing directory (it will not create the directory for you).

Network settings: The climate VM uses a couple of ports and protocols. You need to enable UDP ports 64747 and 64748 using the euca-authorize command (see "Configuring Eucalyptus network settings"). If you want to SSH to the VM, you'll also need to open TCP port 22.
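
Assuming the climate VM runs under the default security group, the corresponding euca-authorize commands would look roughly like this (opening the UDP ports to all sources for simplicity):

$ euca-authorize -P udp -p 64747-64748 -s 0.0.0.0/0 default

$ euca-authorize -P tcp -p 22 -s <your local machine's IP>/32 default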

GridFTP: In order for the VM to use GridFTP (i.e., gsiftp via globus-url-copy), you need to put your grid credentials onto the VM and generate a proxy. Follow steps 6-8 in the "Configuring a VM to run globus" tutorial. Step 7 is only necessary if the trusted certificates on the VM have expired.

Trusted certificates: The trusted certificates on the VM (in /etc/grid-security/certificates) expire periodically (I'm not sure how long they last). If you see the VM fail, get its output (see "debugging a vm") and check whether you see any output that has to do with CRLs; if so, the trusted certificates have expired and you need to follow instruction 7 in the "Configuring a VM to run Globus" tutorial.

Control Script

I wrote a control script in python to implement leader election and data coordination between VM instances.

The script should be run at start time using /etc/rc.local. Usage is as follows:

python path/to/ClimateVMCoord.py <port> <path to input file list> <URL of output directory> <path to tstorms>

<port> defines the ports the control script uses: it will use UDP <port> and TCP <port+1>. <path to input file list> is the path to a newline-separated list of URLs to input files. <URL of output directory> is the URL of the remote directory you would like the results placed in. (Note: Your VM needs to be able to use GridFTP to copy files to and from both the server hosting the input files and the server hosting the output directory. See "Configuring a VM to run globus".)

So, if the control script is at /root/ClimateVMCoord.py, you want to use ports 64747/64748, have an input list at /root/inputs.txt, want to stage your results out to the directory /global/u1/d/<user>/cyclones/ on dtn02.nersc.gov, and have tstorms in /usr/local/tstorms/, then you should place the following line in /etc/rc.local:

python /root/ClimateVMCoord.py 64747 /root/inputs.txt dtn02.nersc.gov//global/u1/d/<user>/cyclones /usr/local/tstorms/

This control script runs TSTORMS as a child process, so you'll need to have the proper environment variables set. See the "Running TSTORMS on a VM" tutorial. It also uses globus tools--see "Configuring a VM to run globus".
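
Combining this with the environment variables from the earlier tutorials, a complete /etc/rc.local for the climate VM might end up looking roughly like this (file contents, not shell prompts; the library and tstorms paths follow the earlier examples and may differ on your image):

export X509_USER_PROXY=/root/grid_proxy
export LD_LIBRARY_PATH=/usr/lib/nco/:/usr/lib/
export PATH=$PATH:/usr/local/tstorms/source/tstorms/:/usr/local/tstorms/source/trajectory/
python /root/ClimateVMCoord.py 64747 /root/inputs.txt dtn02.nersc.gov//global/u1/d/<user>/cyclones /usr/local/tstorms/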

INPUT FILE LIST

The control script takes as input a newline-separated list of URLs to input files. For example, if you have some NetCDF files in a repository on dtn01.nersc.gov, the input file list might look like this:

dtn01.nersc.gov//project/projectdirs/clim100/analysis/d26-amip.cam2.h2.1979-01-05-00000.nc

dtn01.nersc.gov//project/projectdirs/clim100/analysis/d26-amip.cam2.h2.1979-01-10-00000.nc

dtn01.nersc.gov//project/projectdirs/clim100/analysis/d26-amip.cam2.h2.1979-01-15-00000.nc

dtn01.nersc.gov//project/projectdirs/clim100/analysis/d26-amip.cam2.h2.1979-01-20-00000.nc

Debugging a VM

Debugging can be a pain because the turnaround time from completing a draft of the code to getting output from it can be upwards of half an hour. Moreover, it can be rather difficult to actually get the output. I've been using a few methods for this.

The quick and dirty way to debug is to throw print statements into your code, run the VM, and get the instance ID of a running instance. Then run "euca-get-console-output <instance ID>", which gives you the first 64k of output (from stdout and stderr) from the specified instance. Unfortunately, I've found "the first 64k of output" to be woefully inadequate, especially since over half of it is taken up by the output of the boot process. Also, the command is really unreliable--it sometimes takes 15 minutes to get the output, and sometimes it never arrives at all.

So, instead, I've been writing all my output to both a file and stdout. (Having it go to stdout in addition to the file is useful so that euca-get-console-output continues to be useful.) I close and reopen the file periodically so that I can log onto my VM (using SSH; see "Running instances on Eucalyptus" and "configuring Eucalyptus network settings") and view the file. This is the most useful debugging method I have found.
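
If you'd rather not manage the log file from inside the control script, a simpler (if cruder) alternative is to tee the script's output from rc.local, so it reaches both the console (where euca-get-console-output can see it) and a file you can read over SSH:

python /root/ClimateVMCoord.py 64747 /root/inputs.txt <output URL> /usr/local/tstorms/ 2>&1 | tee /root/vm.log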