[Preface]
MPI (Message Passing Interface)
MPI is not formally a protocol, but in practice it has the standing of one (a de facto standard). It is used mainly for communication between parallel programs on distributed-memory systems. MPI itself is a library of functions that can be called from Fortran and C programs; its main advantages are good performance and good portability.
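As a concrete illustration of what "a library that can be called from C" means (this program is not part of the original setup; the file name hello.c is only an example), a minimal MPI program looks like this:

/* hello.c - minimal MPI example: every process reports its rank */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                /* start the MPI runtime           */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* number of processes in the job  */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* rank (id) of this process       */
    printf("hello from rank %d of %d\n", rank, size);
    MPI_Finalize();                        /* shut the runtime down cleanly   */
    return 0;
}

Section 6 below shows one way to compile and run such a program across the cluster once it is booted.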
Cluster
There are two common cluster architectures today. One is the Web/Internet Cluster System, in which the data is spread over several hosts, that is, a number of machines jointly provide a single service. The other is the parallel-computing cluster (Parallel Algorithms Cluster System): a single computational job is handed to all of the CPUs in the cluster, which work on it simultaneously. Because the combined computing power of many CPUs is used, the computation finishes much faster.
The LAM/MPI Cluster System built in this document is of the latter kind. Because of the limits of the test environment and of my own experience, some explanations may be incomplete; if you have questions, please write to me and I will do my best to improve this document. Thank you!
[Software and Platform]
Server \ FreeBSD 5.3-STABLE
IP:172.18.5.247
Hostname: center.the9.com
Client \ FreeBSD 5.3-RELEASE
IP:172.18.5.80
Hostname: node1.the9.com
apache_1.3.29 \ all of the software below is installed from the Ports Collection
php4-4.3.10
php4-gd-4.3.10
php4-extensions-1.0
lam-6.5.9
ganglia-monitor-core-2.5.6
ganglia-webfrontend-2.5.5
[Goal]
Build a LAM/MPI Cluster System based on FreeBSD 5.3.
[Installation and Configuration]
1. Basic /etc/hosts configuration on every node \ if the internal network has its own DNS server, configuring /etc/resolv.conf is all that is needed.
center.the9.com
#more /etc/hosts
172.18.5.247 center.the9.com
172.18.5.80 node1.the9.com
node1.the9.com
#more /etc/hosts
172.18.5.247 center.the9.com
172.18.5.80 node1.the9.com
2. Setting up the Apache + PHP server
center.the9.com
#cd /usr/ports/www/apache13-modssl
#make install clean \ install Apache
#cd /usr/ports/lang/php4-extensions
#make install clean \ install PHP; be sure to select the GD library here
#vi /usr/local/etc/apache/httpd.conf \ add the following directives
AddType application/x-httpd-php .php
AddType application/x-httpd-php-source .phps
3. Setting up the NFS server and client
NFS Server(center.the9.com)
#vi /etc/rc.conf \ add the following parameters
nfs_server_enable="YES"
nfs_server_flags="-u -t -n 4 -h 172.18.5.247"
mountd_enable="YES"
mountd_flags="-r -l"
rpcbind_enable="YES"
rpcbind_flags="-l -h 172.18.5.247"
#vi /etc/exports \ configure the exported NFS directory (-maproot=0:0 maps remote root to local root; -network/-mask restricts access to the 172.18.5.0/24 network)
/cluster -maproot=0:0 -network 172.18.5.0 -mask 255.255.255.0
#/etc/rc.d/rpcbind start
#/etc/rc.d/mountd start
#/etc/rc.d/nfsd start \ start the NFS server
NFS Client(node1.the9.com)
#vi /etc/rc.conf \ add the following parameter
nfs_client_enable="YES"
#vi /etc/fstab \ add the following entry
172.18.5.247:/cluster /cluster nfs rw 0 0
#mount /cluster \ mount the /cluster directory
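A quick sanity check for the NFS setup (a suggestion, not part of the original steps; it assumes the /cluster mount point already exists as a directory on both machines, created with mkdir /cluster if necessary):
#showmount -e 172.18.5.247 \ run on the client; the export list should contain /cluster
#df -h /cluster \ run on the client after mounting; should show 172.18.5.247:/cluster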
4. Setting up the LAM/MPI Cluster System
Step 1: Basic installation
center.the9.com
#cd /usr/ports/net/lam
#make install clean \ install LAM
#cd /usr/ports/sysutils/ganglia-monitor-core
#make install clean \ install the Monitor Core needed by the Cluster System
#cd /usr/ports/sysutils/ganglia-webfrontend
#make install clean \ install the web GUI for the Monitor Core above
node1.the9.com
#cd /usr/ports/net/lam
#make install clean \ install LAM
#cd /usr/ports/sysutils/ganglia-monitor-core
#make install clean \ install the Monitor Core needed by the Cluster System
Step 2: Configuration
center.the9.com
#cd /usr/local/etc/
#cp gmond.conf.sample gmond.conf
#cp gmetad.conf.sample gmetad.conf
#vi gmond.conf \ modify the name and mcast_if parameters (mcast_if must point at this host's actual network interface; lnc0 is the interface in this test environment)
# The name of the cluster this node is a part of
# default: "unspecified"
name "BSDCluster"
# The multicast interface for gmond to send/receive data on
# default: the kernel decides based on routing configuration
mcast_if lnc0
#vi gmetad.conf \ modify the data_source parameter
# data_source "my cluster" 10 localhost my.machine.edu:8649 1.2.3.5:8655
# data_source "my grid" 50 1.3.4.7:8655 grid.org:8651 grid-backup.org:8651
# data_source "another source" 1.3.4.7:8655 1.3.4.8
data_source "BSDCluster" 10 center.the9.com:8649 node1.the9.com:8649
#vi /usr/local/etc/lam-bhost.def \ add the hostname of every node
center.the9.com
node1.the9.com
node1.the9.com \ Basically, each newly added node is configured in exactly the same way as center.the9.com above.
node2.the9.com
nodeX.the9.com ........
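One prerequisite the steps above do not show, but which the lamboot output in section 6 relies on: the account that runs lamboot on center.the9.com must be able to reach every node over ssh without a password prompt. A minimal sketch using key-based authentication (adjust to your own security policy; the file name center.pub is arbitrary, and ~/.ssh is assumed to exist on the node already):
$ssh-keygen -t rsa \ on center.the9.com; accept the defaults and use an empty passphrase
$scp ~/.ssh/id_rsa.pub node1.the9.com:center.pub
$ssh node1.the9.com "cat center.pub >> ~/.ssh/authorized_keys" \ repeat for every node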
5. Configuring the Monitor Web GUI
center.the9.com
#vi /usr/local/etc/apache/httpd.conf \ add the following directives to publish the Cluster Monitor web pages
Alias /ganglia/ "/usr/local/www/ganglia/"
<Directory "/usr/local/www/ganglia">
Options Indexes FollowSymLinks MultiViews
AllowOverride None
Order allow,deny
Allow from all
</Directory>
#vi /etc/rc.conf \ add the following parameters
apache_enable="YES"
apache_flags="-DSSL"
apache_pidfile="/var/run/httpd.pid"
#/usr/local/etc/rc.d/apache.sh start \ start Apache
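If Apache refuses to start, a quick way to look for typos in the edited httpd.conf is the configtest target of the apachectl script installed by the port (a suggestion, not part of the original steps):
#apachectl configtest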
6. Starting the Cluster System, checking, and testing
center.the9.com node1.the9.com nodeX.the9.com etc....
#/usr/local/etc/rc.d/gmetad.sh start
#/usr/local/etc/rc.d/gmond.sh start \ start the Monitor Core on each cluster node (strictly speaking, gmetad, the aggregating daemon, is only needed on center.the9.com, where gmetad.conf was configured)
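To check that gmond on a node is answering before looking at the web front end, you can connect to its XML port (8649 by default, the same port used in the data_source line above); it should dump the node's state as XML and then close the connection:
#telnet localhost 8649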
center.the9.com
$lamboot -dv \ start the LAM daemon on every node
LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
lamboot: boot schema file: /usr/local/etc/lam-bhost.def
lamboot: opening hostfile /usr/local/etc/lam-bhost.def
lamboot: found the following hosts:
lamboot: n0 center.the9.com
lamboot: n1 node1.the9.com
lamboot: resolved hosts:
lamboot: n0 center.the9.com --> 172.18.5.247
lamboot: n1 node1.the9.com --> 172.18.5.80
lamboot: found 2 host node(s)
lamboot: origin node is 0 (center.the9.com)
Executing hboot on n0 (center.the9.com - 1 CPU)...
lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H 172.18.5.247 -P 53433 -n 0 -o 0 ""
hboot: process schema = "/usr/local/etc/lam-conf.lam"
hboot: found /usr/local/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/local/bin/lamd
[1] 28338 lamd -H 172.18.5.247 -P 53433 -n 0 -o 0 -d
hboot: attempting to execute
Executing hboot on n1 (node1.the9.com - 1 CPU)...
lamboot: attempting to execute "/usr/bin/ssh node1.the9.com -n echo $SHELL"
lamboot: got remote shell /bin/sh
lamboot: attempting to execute "/usr/bin/ssh node1.the9.com -n (. ./.profile; hboot -t -c lam-conf.lam -d -v -s -I "-H 172.18.5.247 -P 53433 -n 1 -o 0 " )"
hboot: process schema = "/usr/local/etc/lam-conf.lam"
hboot: found /usr/local/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/local/bin/lamd
[1] 43110 lamd -H 172.18.5.247 -P 53433 -n 1 -o 0 -d
topology done
lamboot completed successfully
$lamhalt -dv \ stop the LAM daemons on every node
LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
Shutting down LAM
lamhalt: sending HALT to n1 (node1.the9.com)
lamhalt: waiting for HALT ACKs from remote LAM daemons
lamhalt: received HALT ACK from n1 (node1.the9.com)
lamhalt: sending final HALT to n0 (center.the9.com)
lamhalt: local LAM daemon halted
LAM halted
$lamnodes \ show information about each node
$lamexec N echo "hello" \ run a command on every node to check that they all respond
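With the LAM daemons running, the hello.c program from the preface can serve as an end-to-end MPI test. A sketch, assuming hello.c was saved in the NFS-shared /cluster directory so that every node sees the same binary:
$cd /cluster
$mpicc hello.c -o hello \ compile with LAM's compiler wrapper
$mpirun -np 2 /cluster/hello \ start two processes, one per CPU in this two-node cluster
Each process should print its own "hello from rank ..." line.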
center.the9.com
#ps ax
28338 ?? I 0:00.04 /usr/local/bin/lamd -H 172.18.5.247 -P 53433 -n 0 -o 0 -d
node1.the9.com
#ps ax
43110 ?? S 0:00.05 /usr/local/bin/lamd -H 172.18.5.247 -P 53433 -n 1 -o 0 -d
Cluster Monitor WEB GUI
http://center.the9.com/ganglia/ \ use this to view the cluster statistics; it is quite intuitive, with graphs generated as images by RRDTool. :)
CPUs Total: 2
Hosts up: 2
Hosts down: 0
Avg Load (15, 5, 1m):
1%, 4%, 0%
Localtime:
2004-12-31 10:50
Total CPUs: 2
Total Memory: 0.2 GB
Total Disk: 8.0 GB
Most Full Disk: 61.2% Used \ the machines in this test environment are rather low-end, please bear with me. :)
[References]
http://lam-mpi.org/ \ LAM/MPI home page
http://www.beowulf.org/ \ Beowulf FAQ
http://www.lasg.ac.cn/cgi-bin/forum/topic.cgi?forum=4&topic=2247 \ MPI Cluster with RH9
http://lists.freebsd.org/mailman/listinfo/freebsd-cluster \ freebsd-cluster mailing list