Posted by 杰克工作室 on 2023-2-28 14:59

RabbitMQ Operations: A Field Note

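<p>This was a rather painful operations job: a system I did not know at all, an environment I knew nothing about, configuration I had never seen. I was completely in the dark. Why? The original owner had left, the person they handed over to was on leave, the next person in line was also on leave, and the one after that had only an incomplete document. I was the handover after that, and I did not even get the document. Worse, this was production, and as everyone knows, production must not break. Let me award myself the title of &ldquo;appointed in the hour of peril&rdquo;.</p>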

<p>&nbsp;</p>

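<p>Enough complaining; here is a short description of the job. The system was a RabbitMQ cluster spread over 3 server nodes, A, B and C. Each server ran 2 instances: a1/a2, b1/b2 and c1/c2. Instances a1, b1 and c1 formed one cluster, rabbit cluster1, with a1 as the disc node; a2, b2 and c2 formed a second cluster, rabbit cluster2, with c2 as the disc node.</p>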

<p>&nbsp;</p>

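<p>Because I knew nothing about RabbitMQ clustering, I stepped into a pile of pitfalls and hit a pile of errors that should never have happened. Call it gaining experience in the newbie village. Following the usual pattern, I will first go through the correct way to start things up and then describe the errors I ran into.</p>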

<p>&nbsp;</p>

<p>1. Starting up correctly</p>

<p>1.1 Starting the nodes</p>

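<p>I could no longer find my own shell history for the startup, so this part is excerpted from the official documentation. There are 3 server nodes with the hostnames rabbit1, rabbit2 and rabbit3.</p>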

<p>&nbsp;</p>

<p>Start RabbitMQ on each of the 3 nodes:</p>

<p>rabbit1$ rabbitmq-server -detached</p>

<p>rabbit2$ rabbitmq-server -detached</p>

<p>rabbit3$ rabbitmq-server -detached</p>

<p>The command rabbitmq-server -detached starts the RabbitMQ server in the background.</p>

<p>&nbsp;</p>

<p>Then check the cluster status on each node:</p>

<p>&nbsp;</p>

<p>rabbit1$ rabbitmqctl cluster_status</p>

<p>Cluster status of node rabbit@rabbit1 ...</p>

<p>[{nodes,[{disc,[rabbit@rabbit1]}]},{running_nodes,[rabbit@rabbit1]}]</p>

<p>...done.</p>

<p>&nbsp;</p>

<p>rabbit2$ rabbitmqctl cluster_status</p>

<p>Cluster status of node rabbit@rabbit2 ...</p>

<p>[{nodes,[{disc,[rabbit@rabbit2]}]},{running_nodes,[rabbit@rabbit2]}]</p>

<p>...done.</p>

<p>&nbsp;</p>

<p>rabbit3$ rabbitmqctl cluster_status</p>

<p>Cluster status of node rabbit@rabbit3 ...</p>

<p>[{nodes,[{disc,[rabbit@rabbit3]}]},{running_nodes,[rabbit@rabbit3]}]</p>

<p>...done.</p>

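<p>The node name of a RabbitMQ server is rabbit@shorthostname, so the 3 nodes above are rabbit@rabbit1, rabbit@rabbit2 and rabbit@rabbit3. Note that when the broker is started with the rabbitmq-server script the short hostname part is normally lowercase, whereas starting it with rabbitmq-server.bat on Windows gives an uppercase name such as rabbit@RABBIT1.</p>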
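<p>As a small illustration of the naming rule, the sketch below derives the expected node name from a host name. The function name default_node_name is made up for this example, and the prefix is assumed to be the default rabbit unless another one is passed in.</p>

```shell
# Derive the default RabbitMQ node name (prefix@shorthostname) for a host.
# $1 = host name, possibly fully qualified
# $2 = node-name prefix (optional; assumed to default to "rabbit")
default_node_name() {
    # ${1%%.*} removes everything from the first dot onward,
    # leaving only the short host name.
    printf '%s@%s\n' "${2:-rabbit}" "${1%%.*}"
}
```

<p>On a real host it could be called as default_node_name "$(hostname)" and the result compared with the first line of the rabbitmqctl cluster_status output.</p>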

<p>&nbsp;</p>

<p>1.2 Creating the cluster</p>

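<p>Here rabbit@rabbit2 and rabbit@rabbit3 are joined to rabbit@rabbit1. Note that a plain join_cluster, as used in the commands here, adds the node as a disc node; for rabbit@rabbit1 to be the only disc node and the other two to be RAM nodes, they have to join with the --ram option.</p>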

<p>&nbsp;</p>

<p>First join rabbit@rabbit2 to rabbit@rabbit1:</p>

<p>&nbsp;</p>

<p>rabbit2$ rabbitmqctl stop_app</p>

<p>Stopping node rabbit@rabbit2 ...done.</p>

<p>&nbsp;</p>

<p>rabbit2$ rabbitmqctl join_cluster rabbit@rabbit1</p>

<p>Clustering node rabbit@rabbit2 with [rabbit@rabbit1] ...done.</p>

<p>&nbsp;</p>

<p>rabbit2$ rabbitmqctl start_app</p>

<p>Starting node rabbit@rabbit2 ...done.</p>

<p>If no error is reported, rabbit@rabbit2 has joined rabbit@rabbit1. The cluster status can be checked with rabbitmqctl cluster_status:</p>

<p>&nbsp;</p>

<p>rabbit1$ rabbitmqctl cluster_status</p>

<p>Cluster status of node rabbit@rabbit1 ...</p>

<p>[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2]}]},</p>

<p>&nbsp;{running_nodes,[rabbit@rabbit2,rabbit@rabbit1]}]</p>

<p>...done.</p>

<p>&nbsp;</p>

<p>rabbit2$ rabbitmqctl cluster_status</p>

<p>Cluster status of node rabbit@rabbit2 ...</p>

<p>[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2]}]},</p>

<p>&nbsp;{running_nodes,[rabbit@rabbit1,rabbit@rabbit2]}]</p>

<p>...done.</p>

<p>Passing --ram to join_cluster makes the node join the cluster as a RAM node. Repeat the same steps on rabbit@rabbit3; they are not spelled out again here.</p>
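<p>The join sequence can be captured in a small dry-run helper, sketched here under the assumption that the commands are run on the node that is joining. The name join_cluster_plan is hypothetical; it only prints the three rabbitmqctl commands so they can be reviewed before anything touches a live node.</p>

```shell
# Print (do not run) the commands that join the local node to a cluster.
# $1 = a node that is already in the cluster, e.g. rabbit@rabbit1
# $2 = "ram" to join as a RAM node; anything else (or nothing) joins as disc
join_cluster_plan() (
    # The body runs in a subshell, so join_flag does not leak to the caller.
    join_flag=""
    if [ "${2:-disc}" = "ram" ]; then
        join_flag="--ram "
    fi
    echo "rabbitmqctl stop_app"
    echo "rabbitmqctl join_cluster ${join_flag}$1"
    echo "rabbitmqctl start_app"
)
```

<p>For example, join_cluster_plan rabbit@rabbit1 ram prints the plan for a RAM node. Once the output looks right it can be piped to sh -e, which stops at the first failing command.</p>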

<p>&nbsp;</p>

<p>1.3 Breaking up the cluster</p>

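<p>To take a node out of a cluster, run reset on the node you want to remove. The reset also tells every other node in the cluster to stop treating this node as a member.</p>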

<p>&nbsp;</p>

<p>rabbit3$ rabbitmqctl stop_app</p>

<p>Stopping node rabbit@rabbit3 ...done.</p>

<p>&nbsp;</p>

<p>rabbit3$ rabbitmqctl reset</p>

<p>Resetting node rabbit@rabbit3 ...done.</p>

<p>&nbsp;</p>

<p>rabbit3$ rabbitmqctl start_app</p>

<p>Starting node rabbit@rabbit3 ...done.</p>

<p>Check the cluster status on each node:</p>

<p>&nbsp;</p>

<p>rabbit1$ rabbitmqctl cluster_status</p>

<p>Cluster status of node rabbit@rabbit1 ...</p>

<p>[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2]}]},</p>

<p>&nbsp;{running_nodes,[rabbit@rabbit2,rabbit@rabbit1]}]</p>

<p>...done.</p>

<p>&nbsp;</p>

<p>rabbit2$ rabbitmqctl cluster_status</p>

<p>Cluster status of node rabbit@rabbit2 ...</p>

<p>[{nodes,[{disc,[rabbit@rabbit1,rabbit@rabbit2]}]},</p>

<p>&nbsp;{running_nodes,[rabbit@rabbit1,rabbit@rabbit2]}]</p>

<p>...done.</p>

<p>&nbsp;</p>

<p>rabbit3$ rabbitmqctl cluster_status</p>

<p>Cluster status of node rabbit@rabbit3 ...</p>

<p>[{nodes,[{disc,[rabbit@rabbit3]}]},{running_nodes,[rabbit@rabbit3]}]</p>

<p>...done.</p>

<p>A node can also be removed from another node with forget_cluster_node. It is not covered here; one of the errors below makes use of it.</p>
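<p>Checking by eye whether a node has really left the cluster is error-prone, so here is a minimal sketch of a check that reads cluster_status output on stdin. The helper name status_mentions_node is invented for this example, and it assumes node names consist only of letters, digits, and the characters @ _ . and -.</p>

```shell
# Succeed if the text on stdin mentions the node name given as $1.
status_mentions_node() {
    # Turn every character that cannot be part of a node name into a
    # newline, then look for a line that equals the name exactly. This
    # keeps rabbit@rabbit1 from matching inside rabbit@rabbit10.
    tr -c 'A-Za-z0-9@_.-' '\n' | grep -Fxq -- "$1"
}
```

<p>After the reset above, rabbitmqctl cluster_status | status_mentions_node rabbit@rabbit3 should fail when run on rabbit1 or rabbit2 and succeed on rabbit3 itself.</p>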

<p>&nbsp;</p>

<p>2. A few errors</p>

<p>2.1 Running multiple instances on one machine</p>

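<p>RabbitMQ allows several instances to run on one machine. What cost me the most time on this job was not knowing that two clusters were deployed across these 3 servers. When ps -ef|grep rabbit showed two instances, I naively assumed one of them had failed to shut down cleanly, so I killed both. For the record, here is how to start more than one instance on a single machine.</p>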

<p>&nbsp;</p>

<p>$ RABBITMQ_NODE_PORT=5673 RABBITMQ_SERVER_START_ARGS=&quot;-rabbitmq_management listener [{port,15673}]&quot; RABBITMQ_NODENAME=hare rabbitmq-server -detached</p>

<p>$ rabbitmqctl -n hare stop_app</p>

<p>$ rabbitmqctl -n hare join_cluster rabbit@rabbit1</p>

<p>$ rabbitmqctl -n hare start_app</p>

<p>Every rabbitmqctl command needs -n to say which instance it applies to. This is the part to remember.</p>
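<p>Before killing anything, it helps to list which node names are actually running on the machine. The sketch below does that from a process listing. It assumes the Erlang VM command line carries the node name after a -sname or -name argument, which is how the instances I have seen are started but may differ on other installations; list_erlang_nodes is a made-up helper name.</p>

```shell
# Read a process listing on stdin and print each Erlang node name once.
list_erlang_nodes() {
    awk '{
        # Walk the fields of each line; the node name is the field
        # that follows -sname (short names) or -name (long names).
        for (i = 1; i < NF; i++)
            if ($i == "-sname" || $i == "-name")
                print $(i + 1)
    }' | sort -u
}
```

<p>A possible invocation is ps -ef | grep '[b]eam' | list_erlang_nodes. Two different names in the output mean two separate instances, not one instance that failed to stop. The command epmd -names gives a similar list together with the port of each node.</p>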

<p>&nbsp;</p>

<p>2.2 Bad cookie in table definition rabbit_durable_queue</p>

<p>I can no longer find the exact error text, so here is a very similar one found via Google.</p>

<p>&nbsp;</p>

<p>rabbitmqctl cluster rabbit@vmrabbita</p>

<p>Clustering node rabbit@vmrabbitb with ...</p>

<p>Error: {unable_to_join_cluster,</p>

<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; ,</p>

<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; {merge_schema_failed,</p>

<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &quot;Bad cookie in table definition rabbit_durable_queue: rabbit@vmrabbitb = {cstruct,rabbit_durable_queue,set,[],,[],0,read_write,[],[],false,amqqueue,,[],[],{{1266,16863,365586},rabbit@vmrabbitb},{{2,0},[]}}, rabbit@vmrabbita = {cstruct,rabbit_durable_queue,set,[],,[],0,read_write,[],[],false,amqqueue,,[],[],{{1266,14484,94839},rabbit@vmrabbita},{{4,0},{rabbit@vmrabbitc,{1266,16151,144423}}}}\n&quot;}}</p>

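<p>The key part is &ldquo;Bad cookie in table definition rabbit_durable_queue&rdquo;. Nodes recognize each other through the Erlang cookie, which is stored in $HOME/.erlang.cookie. If for some reason the cookies on the cluster's servers differ, the nodes cannot recognize each other and all sorts of errors appear. The most common is the &ldquo;Bad cookie ...&rdquo; error above; others include &ldquo;Connection attempt from disallowed node&rdquo; and &ldquo;Could not auto-cluster&rdquo;.</p>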
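<p>One way to check for a cookie mismatch without printing the secret itself is to compare checksums. This is only a sketch: the helper names cookie_fingerprint and same_cookie are invented here, and the path of the cookie file depends on which user runs RabbitMQ on each server.</p>

```shell
# Print a checksum and byte count for a cookie file, not its contents.
cookie_fingerprint() {
    # cksum reading from stdin prints "<CRC> <size>" with no file name.
    cksum < "$1"
}

# Succeed if two cookie files have the same fingerprint.
same_cookie() {
    [ "$(cookie_fingerprint "$1")" = "$(cookie_fingerprint "$2")" ]
}
```

<p>Across servers, the same idea works by running cksum on the cookie file on every host, for instance over ssh, and comparing the printed values. Any host whose value differs is the one to fix.</p>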

<p>&nbsp;</p>

<p>2.3 already_member</p>

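<p>This one was a rather dumb problem, a hole I dug for myself and had to fill myself. The nodes of the cluster could not talk to each other, so I deleted the directories under var/lib/rabbitmq/mnesia on one RAM node and then ran reset. When I then ran join_cluster, it failed with:</p>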

<p>&nbsp;</p>

<p>Error: {ok,already_member}</p>

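<p>The cause becomes clear after some thought: the current node no longer holds any information about the cluster it wants to join, but the cluster still holds a record of this node, and the two sides do not agree. So when the node asks to join, the cluster plays it safe and in effect says: you are already a member, do not join again. The log shows:</p>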

<p>&nbsp;</p>

<p>=INFO REPORT==== 25-Jul-2016::20:11:10 ===</p>

<p>Error description:</p>

<p>&nbsp; &nbsp;{could_not_start,rabbitmq_management,</p>

<p>&nbsp; &nbsp; &nbsp; &nbsp;{{shutdown,</p>

<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; {failed_to_start_child,rabbit_mgmt_sup,</p>

<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; {&#39;EXIT&#39;,</p>

<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; {{shutdown,</p>

<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;[{{already_started,&lt;23251.1658.0&gt;},</p>

<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;{child,undefined,rabbit_mgmt_db,</p>

<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;{rabbit_mgmt_db,start_link,[]},</p>

<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;permanent,4294967295,worker,</p>

<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;}}]},</p>

<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;{gen_server2,call,</p>

<p>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;[&lt;0.618.0&gt;,{init,&lt;0.616.0&gt;},infinity]}}}}},</p>

<p>&nbsp; &nbsp; &nbsp; &nbsp; {rabbit_mgmt_app,start,]}}}</p>

<p>&nbsp;</p>

<p>Log files (may contain more information):</p>

<p>&nbsp; &nbsp;./../var/log/rabbitmq/hare.log</p>

<p>&nbsp; &nbsp;./../var/log/rabbitmq/hare-sasl.log</p>

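<p>The cluster already has a record of this node, so it refuses to let it join a second time. In that case, removing the node's record from the cluster and joining again should be just like a brand-new node joining, shouldn't it?</p>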

<p>&nbsp;</p>

<p>rabbitmqctl -n hare forget_cluster_node hare@rabbit1</p>

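<p>After this the cluster no longer holds any record of hare@rabbit1, and join_cluster can be run again. Note that forget_cluster_node has to be executed against a node that is still a member of the cluster, not against the node being removed.</p>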
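<p>Put together, the recovery can be written down as a plan before it is run. The sketch below only prints the commands; rejoin_plan is a hypothetical helper, and it assumes each line is executed on the server where the node named after -n lives.</p>

```shell
# Print (do not run) the steps to clear a stale membership and rejoin.
# $1 = the stale node that gets {ok,already_member}, e.g. hare@rabbit1
# $2 = a healthy node that is still in the cluster, e.g. hare@rabbit2
rejoin_plan() {
    # Step 1: on the healthy member, drop the stale record.
    echo "rabbitmqctl -n $2 forget_cluster_node $1"
    # Steps 2 to 4: on the stale node, join again like a new node.
    echo "rabbitmqctl -n $1 stop_app"
    echo "rabbitmqctl -n $1 join_cluster $2"
    echo "rabbitmqctl -n $1 start_app"
}
```

<p>For the case above, rejoin_plan hare@rabbit1 hare@rabbit2 lists the four steps in order. Adding --ram to the join_cluster line would bring the node back as a RAM node.</p>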

<p>&nbsp;</p>

<p>2.4 Never delete the files under var/lib/rabbitmq/mnesia on a disc node</p>

<p>This directory holds the files in which a node records cluster information. Once they are deleted, all kinds of errors follow:</p>

<p>&nbsp;</p>

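<p>On a disc node, all Exchanges, Queues, Users and other configuration defined in the cluster are lost.</p>

<p>On a RAM node, the information for connecting to the cluster is lost, and rejoining the cluster fails.</p>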

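<p>Because there were two clusters, I ran into both. This was production. Imagine how I felt when, just as I was congratulating myself on getting the cluster started again, I discovered that every Exchange and Queue was gone. Luckily there was a pre-production environment: I exported its definitions and imported them into production.</p>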
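<p>A regular export of the definitions would have made this far less scary. The sketch below is one possible way to do it, assuming the management plugin is enabled and reachable on its HTTP port (15672 by default, 15673 for the second instance started earlier). The function names and the RABBIT_USER and RABBIT_PASS variables are invented for this example.</p>

```shell
# Build the management API URL that returns all definitions as JSON.
# $1 = host, $2 = management port (optional, defaults to 15672)
definitions_url() {
    printf 'http://%s:%s/api/definitions\n' "$1" "${2:-15672}"
}

# Download the definitions of one instance into a file.
# $1 = host, $2 = output file, $3 = management port (optional)
# Credentials are read from RABBIT_USER and RABBIT_PASS (hypothetical names).
backup_definitions() {
    curl -fsS \
        -u "${RABBIT_USER:?set RABBIT_USER}:${RABBIT_PASS:?set RABBIT_PASS}" \
        -o "$2" \
        "$(definitions_url "$1" "${3:-15672}")"
}
```

<p>Something like backup_definitions rabbit1 cluster1-definitions.json could then run from cron. The resulting file can be imported again through the management UI.</p>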

<p>&nbsp;</p>

<p>Sigh. It feels like having survived a disaster...</p>

<p>Original article: https://www.howardliu.cn/rabbitmq-operation/</p>