Discussion:
[Linux-HA] pacemaker with heartbeat on Debian Wheezy reproducibly reboots the node when put into maintenance mode because of a /usr/lib/heartbeat/crmd crash
Thomas Glanzmann
2013-06-06 09:11:19 UTC
Permalink
Hello,
over the last couple of days, I set up an active/passive NFS server and
iSCSI storage using drbd, pacemaker, heartbeat, LIO and the NFS kernel
server. While testing the cluster I often set it to unmanaged using:

crm configure property maintenance-mode=true

Sometimes when I did that, both nodes, or just the standby node, suicided
because /usr/lib/heartbeat/crmd crashed. I can reproduce the problem
easily. It even happened to me with a two-node cluster that had no
resources at all. If you need more information, drop me an e-mail.
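
For completeness, the exact sequence I use to enter and leave maintenance
mode (the crm_mon call just confirms that the resources show up as
unmanaged):

crm configure property maintenance-mode=true
crm_mon -1          # resources should now be flagged as unmanaged
crm configure property maintenance-mode=false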

Highlights of the log:

Jun 6 10:17:37 astorage1 crmd: [2947]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_INTEGRATION [ input=I_FAIL cause=C_FSA_INTERNAL origin=get_lrm_resource ]
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: crm_abort: abort_transition_graph: Triggered assert at te_utils.c:339 : transition_graph != NULL
Jun 6 10:17:37 astorage1 heartbeat: [2863]: WARN: Managed /usr/lib/heartbeat/crmd process 2947 killed by signal 11 [SIGSEGV - Segmentation violation].
Jun 6 10:17:37 astorage1 ccm: [2942]: info: client (pid=2947) removed from ccm
Jun 6 10:17:37 astorage1 heartbeat: [2863]: ERROR: Managed /usr/lib/heartbeat/crmd process 2947 dumped core
Jun 6 10:17:37 astorage1 heartbeat: [2863]: EMERG: Rebooting system. Reason: /usr/lib/heartbeat/crmd

See the log:

Jun 6 10:17:22 astorage1 crmd: [2947]: info: do_election_count_vote: Election 4 (owner: 56adf229-a1a7-4484-8f18-742ddce19db8) lost: vote from astorage2 (Uptime)
Jun 6 10:17:22 astorage1 crmd: [2947]: notice: do_state_transition: State transition S_NOT_DC -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL origin=do_election_count_vote ]
Jun 6 10:17:27 astorage1 crmd: [2947]: info: update_dc: Set DC to astorage2 (3.0.6)
Jun 6 10:17:28 astorage1 cib: [2943]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=astorage2/crmd/210, version=0.9.18): ok (rc=0)
Jun 6 10:17:28 astorage1 attrd: [2946]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Jun 6 10:17:28 astorage1 crmd: [2947]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Jun 6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd3:0 (10000)
Jun 6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd10:0 (10000)
Jun 6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd8:0 (10000)
Jun 6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd6:0 (10000)
Jun 6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd5:0 (10000)
Jun 6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd9:0 (10000)
Jun 6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Jun 6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd4:0 (10000)
Jun 6 10:17:30 astorage1 lrmd: [2944]: info: cancel_op: operation monitor[35] on astorage2-fencing for client 2947, its parameters: hostname=[astorage2] userid=[ADMIN] CRM_meta_timeout=[20000] CRM_meta_name=[monitor] passwd=[ADMIN] crm_feature_set=[3.0.6] ipaddr=[10.10.30.22] CRM_meta_interval=[60000] cancelled
Jun 6 10:17:30 astorage1 crmd: [2947]: info: process_lrm_event: LRM operation astorage2-fencing_monitor_60000 (call=35, status=1, cib-update=0, confirmed=true) Cancelled
Jun 6 10:17:30 astorage1 lrmd: [2944]: info: cancel_op: operation monitor[36] on drbd10:0 for client 2947, its parameters: drbd_resource=[r10] CRM_meta_role=[Slave] CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_demote_resource=[ ] CRM_meta_notify_inactive_resource=[drbd10:0 ] CRM_meta_notify_promote_uname=[ ] CRM_meta_timeout=[20000] CRM_meta_notify_master_uname=[astorage2 ] CRM_meta_name=[monitor] CRM_meta_notify_start_resource=[drbd10:0 ] CRM_meta_notify_start_uname=[astorage1 ] crm_feature_set=[3.0.6] CRM_meta_notify=[true] CRM_meta_notify_promote_resour cancelled
Jun 6 10:17:30 astorage1 crmd: [2947]: info: process_lrm_event: LRM operation drbd10:0_monitor_31000 (call=36, status=1, cib-update=0, confirmed=true) Cancelled
Jun 6 10:17:30 astorage1 lrmd: [2944]: info: cancel_op: operation monitor[37] on drbd3:0 for client 2947, its parameters: drbd_resource=[r3] CRM_meta_role=[Slave] CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_demote_resource=[ ] CRM_meta_notify_inactive_resource=[drbd3:0 ] CRM_meta_notify_promote_uname=[ ] CRM_meta_timeout=[20000] CRM_meta_notify_master_uname=[astorage2 ] CRM_meta_name=[monitor] CRM_meta_notify_start_resource=[drbd3:0 ] CRM_meta_notify_start_uname=[astorage1 ] crm_feature_set=[3.0.6] CRM_meta_notify=[true] CRM_meta_notify_promote_resource=[ cancelled
Jun 6 10:17:30 astorage1 crmd: [2947]: info: process_lrm_event: LRM operation drbd3:0_monitor_31000 (call=37, status=1, cib-update=0, confirmed=true) Cancelled
Jun 6 10:17:30 astorage1 lrmd: [2944]: info: cancel_op: operation monitor[38] on drbd4:0 for client 2947, its parameters: drbd_resource=[r4] CRM_meta_role=[Slave] CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_demote_resource=[ ] CRM_meta_notify_inactive_resource=[drbd4:0 ] CRM_meta_notify_promote_uname=[ ] CRM_meta_timeout=[20000] CRM_meta_notify_master_uname=[astorage2 ] CRM_meta_name=[monitor] CRM_meta_notify_start_resource=[drbd4:0 ] CRM_meta_notify_start_uname=[astorage1 ] crm_feature_set=[3.0.6] CRM_meta_notify=[true] CRM_meta_notify_promote_resource=[ cancelled
Jun 6 10:17:30 astorage1 crmd: [2947]: info: process_lrm_event: LRM operation drbd4:0_monitor_31000 (call=38, status=1, cib-update=0, confirmed=true) Cancelled
Jun 6 10:17:30 astorage1 lrmd: [2944]: info: cancel_op: operation monitor[39] on drbd5:0 for client 2947, its parameters: drbd_resource=[r5] CRM_meta_role=[Slave] CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_demote_resource=[ ] CRM_meta_notify_inactive_resource=[drbd5:0 ] CRM_meta_notify_promote_uname=[ ] CRM_meta_timeout=[20000] CRM_meta_notify_master_uname=[astorage2 ] CRM_meta_name=[monitor] CRM_meta_notify_start_resource=[drbd5:0 ] CRM_meta_notify_start_uname=[astorage1 ] crm_feature_set=[3.0.6] CRM_meta_notify=[true] CRM_meta_notify_promote_resource=[ cancelled
Jun 6 10:17:30 astorage1 crmd: [2947]: info: process_lrm_event: LRM operation drbd5:0_monitor_31000 (call=39, status=1, cib-update=0, confirmed=true) Cancelled
Jun 6 10:17:30 astorage1 lrmd: [2944]: info: cancel_op: operation monitor[40] on drbd6:0 for client 2947, its parameters: drbd_resource=[r6] CRM_meta_role=[Slave] CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_demote_resource=[ ] CRM_meta_notify_inactive_resource=[drbd6:0 ] CRM_meta_notify_promote_uname=[ ] CRM_meta_timeout=[20000] CRM_meta_notify_master_uname=[astorage2 ] CRM_meta_name=[monitor] CRM_meta_notify_start_resource=[drbd6:0 ] CRM_meta_notify_start_uname=[astorage1 ] crm_feature_set=[3.0.6] CRM_meta_notify=[true] CRM_meta_notify_promote_resource=[ cancelled
Jun 6 10:17:30 astorage1 crmd: [2947]: info: process_lrm_event: LRM operation drbd6:0_monitor_31000 (call=40, status=1, cib-update=0, confirmed=true) Cancelled
Jun 6 10:17:30 astorage1 lrmd: [2944]: info: cancel_op: operation monitor[41] on drbd8:0 for client 2947, its parameters: drbd_resource=[r8] CRM_meta_role=[Slave] CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_demote_resource=[ ] CRM_meta_notify_inactive_resource=[drbd8:0 ] CRM_meta_notify_promote_uname=[ ] CRM_meta_timeout=[20000] CRM_meta_notify_master_uname=[astorage2 ] CRM_meta_name=[monitor] CRM_meta_notify_start_resource=[drbd8:0 ] CRM_meta_notify_start_uname=[astorage1 ] crm_feature_set=[3.0.6] CRM_meta_notify=[true] CRM_meta_notify_promote_resource=[ cancelled
Jun 6 10:17:30 astorage1 crmd: [2947]: info: process_lrm_event: LRM operation drbd8:0_monitor_31000 (call=41, status=1, cib-update=0, confirmed=true) Cancelled
Jun 6 10:17:30 astorage1 lrmd: [2944]: info: cancel_op: operation monitor[42] on drbd9:0 for client 2947, its parameters: drbd_resource=[r9] CRM_meta_role=[Slave] CRM_meta_notify_stop_resource=[ ] CRM_meta_notify_demote_resource=[ ] CRM_meta_notify_inactive_resource=[drbd9:0 ] CRM_meta_notify_promote_uname=[ ] CRM_meta_timeout=[20000] CRM_meta_notify_master_uname=[astorage2 ] CRM_meta_name=[monitor] CRM_meta_notify_start_resource=[drbd9:0 ] CRM_meta_notify_start_uname=[astorage1 ] crm_feature_set=[3.0.6] CRM_meta_notify=[true] CRM_meta_notify_promote_resource=[ cancelled
Jun 6 10:17:30 astorage1 crmd: [2947]: info: process_lrm_event: LRM operation drbd9:0_monitor_31000 (call=42, status=1, cib-update=0, confirmed=true) Cancelled
Jun 6 10:17:31 astorage1 crmd: [2947]: notice: crmd_client_status_callback: Status update: Client astorage2/crmd now has status [offline] (DC=false)
Jun 6 10:17:31 astorage1 crmd: [2947]: info: crm_update_peer_proc: astorage2.crmd is now offline
Jun 6 10:17:31 astorage1 crmd: [2947]: notice: crmd_peer_update: Status update: Client astorage2/crmd now has status [offline] (DC=astorage2)
Jun 6 10:17:31 astorage1 crmd: [2947]: info: crmd_peer_update: Got client status callback - our DC is dead
Jun 6 10:17:31 astorage1 crmd: [2947]: notice: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_CRMD_STATUS_CALLBACK origin=crmd_peer_update ]
Jun 6 10:17:31 astorage1 crmd: [2947]: notice: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
Jun 6 10:17:31 astorage1 crmd: [2947]: info: do_te_control: Registering TE UUID: 3fa38a9f-5ebc-4a48-bc80-1c95cc6655bc
Jun 6 10:17:31 astorage1 crmd: [2947]: info: set_graph_functions: Setting custom graph functions
Jun 6 10:17:31 astorage1 crmd: [2947]: info: start_subsystem: Starting sub-system "pengine"
Jun 6 10:17:31 astorage1 pengine: [5812]: info: Invoked: /usr/lib/pacemaker/pengine
Jun 6 10:17:31 astorage1 cib: [2943]: info: cib_process_shutdown_req: Shutdown REQ from astorage2
Jun 6 10:17:31 astorage1 cib: [2943]: info: cib_process_request: Operation complete: op cib_shutdown_req for section 'all' (origin=astorage2/astorage2/(null), version=0.9.54): ok (rc=0)
Jun 6 10:17:32 astorage1 cib: [2943]: info: cib_client_status_callback: Status update: Client astorage2/cib now has status [leave]
Jun 6 10:17:32 astorage1 cib: [2943]: info: crm_update_peer_proc: astorage2.cib is now offline
Jun 6 10:17:32 astorage1 cib: [2943]: info: mem_handle_event: Got an event OC_EV_MS_NOT_PRIMARY from ccm
Jun 6 10:17:32 astorage1 cib: [2943]: info: mem_handle_event: instance=12, nodes=2, new=2, lost=0, n_idx=0, new_idx=0, old_idx=4
Jun 6 10:17:32 astorage1 cib: [2943]: info: cib_ccm_msg_callback: Processing CCM event=NOT PRIMARY (id=12)
Jun 6 10:17:35 astorage1 crmd: [2947]: info: do_dc_takeover: Taking over DC status for this partition
Jun 6 10:17:35 astorage1 cib: [2943]: info: cib_process_readwrite: We are now in R/W mode
Jun 6 10:17:35 astorage1 cib: [2943]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/56, version=0.9.55): ok (rc=0)
Jun 6 10:17:35 astorage1 cib: [2943]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/57, version=0.9.56): ok (rc=0)
Jun 6 10:17:35 astorage1 cib: [2943]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/59, version=0.9.57): ok (rc=0)
Jun 6 10:17:35 astorage1 crmd: [2947]: info: join_make_offer: Making join offers based on membership 12
Jun 6 10:17:35 astorage1 crmd: [2947]: info: join_make_offer: Peer process on astorage2 is not active (yet?): 00000002 2
Jun 6 10:17:35 astorage1 crmd: [2947]: info: do_dc_join_offer_all: join-1: Waiting on 1 outstanding join acks
Jun 6 10:17:35 astorage1 crmd: [2947]: info: mem_handle_event: Got an event OC_EV_MS_NOT_PRIMARY from ccm
Jun 6 10:17:35 astorage1 crmd: [2947]: info: mem_handle_event: instance=12, nodes=2, new=2, lost=0, n_idx=0, new_idx=0, old_idx=4
Jun 6 10:17:35 astorage1 crmd: [2947]: info: crmd_ccm_msg_callback: Quorum lost after event=NOT PRIMARY (id=12)
Jun 6 10:17:35 astorage1 cib: [2943]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/61, version=0.9.58): ok (rc=0)
Jun 6 10:17:35 astorage1 crmd: [2947]: info: update_dc: Set DC to astorage1 (3.0.6)
Jun 6 10:17:35 astorage1 crmd: [2947]: notice: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Jun 6 10:17:35 astorage1 crmd: [2947]: info: do_dc_join_finalize: join-1: Syncing the CIB from astorage1 to the rest of the cluster
Jun 6 10:17:35 astorage1 cib: [2943]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/64, version=0.9.58): ok (rc=0)
Jun 6 10:17:35 astorage1 cib: [2943]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/65, version=0.9.59): ok (rc=0)
Jun 6 10:17:36 astorage1 crmd: [2947]: info: do_dc_join_ack: join-1: Updating node state to member for astorage1
Jun 6 10:17:36 astorage1 crmd: [2947]: info: erase_status_tag: Deleting xpath: //node_state[@uname='astorage1']/lrm
Jun 6 10:17:36 astorage1 cib: [2943]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='astorage1']/lrm (origin=local/crmd/66, version=0.9.60): ok (rc=0)
Jun 6 10:17:36 astorage1 crmd: [2947]: notice: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
Jun 6 10:17:36 astorage1 crmd: [2947]: info: populate_cib_nodes_ha: Requesting the list of configured nodes
Jun 6 10:17:36 astorage1 attrd: [2946]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Jun 6 10:17:36 astorage1 crmd: [2947]: info: abort_transition_graph: do_te_invoke:162 - Triggered transition abort (complete=1) : Peer Cancelled
Jun 6 10:17:36 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd3:0 (10000)
Jun 6 10:17:36 astorage1 cib: [2943]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/68, version=0.9.62): ok (rc=0)
Jun 6 10:17:37 astorage1 cib: [2943]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/70, version=0.9.64): ok (rc=0)
Jun 6 10:17:37 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd10:0 (10000)
Jun 6 10:17:37 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd8:0 (10000)
Jun 6 10:17:37 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd6:0 (10000)
Jun 6 10:17:37 astorage1 pengine: [5812]: notice: can_be_master: Forcing unmanaged master drbd10:1 to remain promoted on astorage2
Jun 6 10:17:37 astorage1 pengine: [5812]: notice: can_be_master: Forcing unmanaged master drbd3:1 to remain promoted on astorage2
Jun 6 10:17:37 astorage1 pengine: [5812]: notice: can_be_master: Forcing unmanaged master drbd4:1 to remain promoted on astorage2
Jun 6 10:17:37 astorage1 pengine: [5812]: notice: can_be_master: Forcing unmanaged master drbd5:1 to remain promoted on astorage2
Jun 6 10:17:37 astorage1 pengine: [5812]: notice: can_be_master: Forcing unmanaged master drbd6:1 to remain promoted on astorage2
Jun 6 10:17:37 astorage1 pengine: [5812]: notice: can_be_master: Forcing unmanaged master drbd8:1 to remain promoted on astorage2
Jun 6 10:17:37 astorage1 pengine: [5812]: notice: can_be_master: Forcing unmanaged master drbd9:1 to remain promoted on astorage2
Jun 6 10:17:37 astorage1 pengine: [5812]: notice: stage6: Delaying fencing operations until there are resources to manage
Jun 6 10:17:37 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd5:0 (10000)
Jun 6 10:17:37 astorage1 crmd: [2947]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Jun 6 10:17:37 astorage1 crmd: [2947]: info: do_te_invoke: Processing graph 0 (ref=pe_calc-dc-1370506657-30) derived from /var/lib/pengine/pe-input-496.bz2
Jun 6 10:17:37 astorage1 crmd: [2947]: info: te_rsc_command: Initiating action 4: cancel astorage2-fencing_monitor_60000 on astorage1 (local)
Jun 6 10:17:37 astorage1 crmd: [2947]: info: cancel_op: No pending op found for astorage2-fencing:35
Jun 6 10:17:37 astorage1 lrmd: [2944]: info: on_msg_cancel_op: no operation with id 35
Jun 6 10:17:37 astorage1 crmd: [2947]: info: te_rsc_command: Initiating action 2: cancel drbd10:0_monitor_31000 on astorage1 (local)
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_add_rsc(870): failed to send a addrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: get_lrm_resource: Could not add resource drbd10:0 to LRM
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: do_lrm_invoke: Invalid resource definition
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <create_request_adv origin="te_rsc_command" t="crmd" version="3.0.6" subt="request" reference="lrm_invoke-tengine-1370506657-33" crm_task="lrm_invoke" crm_sys_to="lrmd" crm_sys_from="tengine" crm_host_to="astorage1" >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <crm_xml >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <rsc_op id="2" operation="cancel" operation_key="drbd10:0_monitor_31000" on_node="astorage1" on_node_uuid="76bbbf07-3d2d-476d-b758-2a7a4577f162" transition-key="2:0:0:3fa38a9f-5ebc-4a48-bc80-1c95cc6655bc" >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <primitive id="drbd10:0" long-id="ma-ms-drbd10:drbd10:0" class="ocf" provider="linbit" type="drbd" />
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <attributes CRM_meta_call_id="36" CRM_meta_clone="0" CRM_meta_clone_max="2" CRM_meta_clone_node_max="1" CRM_meta_globally_unique="false" CRM_meta_interval="31000" CRM_meta_master_max="1" CRM_meta_master_node_max="1" CRM_meta_name="monitor" CRM_meta_notify="true" CRM_meta_operation="monitor" CRM_meta_role="Slave" CRM_meta_timeout="20000" crm_feature_set="3.0.6" drbd_resource="r10" />
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </rsc_op>
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </crm_xml>
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </create_request_adv>
Jun 6 10:17:37 astorage1 crmd: [2947]: info: te_rsc_command: Initiating action 5: cancel drbd3:0_monitor_31000 on astorage1 (local)
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_add_rsc(870): failed to send a addrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: get_lrm_resource: Could not add resource drbd3:0 to LRM
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: do_lrm_invoke: Invalid resource definition
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <create_request_adv origin="te_rsc_command" t="crmd" version="3.0.6" subt="request" reference="lrm_invoke-tengine-1370506657-34" crm_task="lrm_invoke" crm_sys_to="lrmd" crm_sys_from="tengine" crm_host_to="astorage1" >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <crm_xml >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <rsc_op id="5" operation="cancel" operation_key="drbd3:0_monitor_31000" on_node="astorage1" on_node_uuid="76bbbf07-3d2d-476d-b758-2a7a4577f162" transition-key="5:0:0:3fa38a9f-5ebc-4a48-bc80-1c95cc6655bc" >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <primitive id="drbd3:0" long-id="ma-ms-drbd3:drbd3:0" class="ocf" provider="linbit" type="drbd" />
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <attributes CRM_meta_call_id="37" CRM_meta_clone="0" CRM_meta_clone_max="2" CRM_meta_clone_node_max="1" CRM_meta_globally_unique="false" CRM_meta_interval="31000" CRM_meta_master_max="1" CRM_meta_master_node_max="1" CRM_meta_name="monitor" CRM_meta_notify="true" CRM_meta_operation="monitor" CRM_meta_role="Slave" CRM_meta_timeout="20000" crm_feature_set="3.0.6" drbd_resource="r3" />
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </rsc_op>
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </crm_xml>
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </create_request_adv>
Jun 6 10:17:37 astorage1 crmd: [2947]: info: te_rsc_command: Initiating action 3: cancel drbd4:0_monitor_31000 on astorage1 (local)
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_add_rsc(870): failed to send a addrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: get_lrm_resource: Could not add resource drbd4:0 to LRM
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: do_lrm_invoke: Invalid resource definition
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <create_request_adv origin="te_rsc_command" t="crmd" version="3.0.6" subt="request" reference="lrm_invoke-tengine-1370506657-35" crm_task="lrm_invoke" crm_sys_to="lrmd" crm_sys_from="tengine" crm_host_to="astorage1" >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <crm_xml >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <rsc_op id="3" operation="cancel" operation_key="drbd4:0_monitor_31000" on_node="astorage1" on_node_uuid="76bbbf07-3d2d-476d-b758-2a7a4577f162" transition-key="3:0:0:3fa38a9f-5ebc-4a48-bc80-1c95cc6655bc" >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <primitive id="drbd4:0" long-id="ma-ms-drbd4:drbd4:0" class="ocf" provider="linbit" type="drbd" />
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <attributes CRM_meta_call_id="38" CRM_meta_clone="0" CRM_meta_clone_max="2" CRM_meta_clone_node_max="1" CRM_meta_globally_unique="false" CRM_meta_interval="31000" CRM_meta_master_max="1" CRM_meta_master_node_max="1" CRM_meta_name="monitor" CRM_meta_notify="true" CRM_meta_operation="monitor" CRM_meta_role="Slave" CRM_meta_timeout="20000" crm_feature_set="3.0.6" drbd_resource="r4" />
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </rsc_op>
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </crm_xml>
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </create_request_adv>
Jun 6 10:17:37 astorage1 crmd: [2947]: info: te_rsc_command: Initiating action 6: cancel drbd5:0_monitor_31000 on astorage1 (local)
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_add_rsc(870): failed to send a addrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: get_lrm_resource: Could not add resource drbd5:0 to LRM
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: do_lrm_invoke: Invalid resource definition
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <create_request_adv origin="te_rsc_command" t="crmd" version="3.0.6" subt="request" reference="lrm_invoke-tengine-1370506657-36" crm_task="lrm_invoke" crm_sys_to="lrmd" crm_sys_from="tengine" crm_host_to="astorage1" >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <crm_xml >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <rsc_op id="6" operation="cancel" operation_key="drbd5:0_monitor_31000" on_node="astorage1" on_node_uuid="76bbbf07-3d2d-476d-b758-2a7a4577f162" transition-key="6:0:0:3fa38a9f-5ebc-4a48-bc80-1c95cc6655bc" >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <primitive id="drbd5:0" long-id="ma-ms-drbd5:drbd5:0" class="ocf" provider="linbit" type="drbd" />
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <attributes CRM_meta_call_id="39" CRM_meta_clone="0" CRM_meta_clone_max="2" CRM_meta_clone_node_max="1" CRM_meta_globally_unique="false" CRM_meta_interval="31000" CRM_meta_master_max="1" CRM_meta_master_node_max="1" CRM_meta_name="monitor" CRM_meta_notify="true" CRM_meta_operation="monitor" CRM_meta_role="Slave" CRM_meta_timeout="20000" crm_feature_set="3.0.6" drbd_resource="r5" />
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </rsc_op>
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </crm_xml>
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </create_request_adv>
Jun 6 10:17:37 astorage1 crmd: [2947]: info: te_rsc_command: Initiating action 7: cancel drbd6:0_monitor_31000 on astorage1 (local)
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_add_rsc(870): failed to send a addrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: get_lrm_resource: Could not add resource drbd6:0 to LRM
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: do_lrm_invoke: Invalid resource definition
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <create_request_adv origin="te_rsc_command" t="crmd" version="3.0.6" subt="request" reference="lrm_invoke-tengine-1370506657-37" crm_task="lrm_invoke" crm_sys_to="lrmd" crm_sys_from="tengine" crm_host_to="astorage1" >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <crm_xml >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <rsc_op id="7" operation="cancel" operation_key="drbd6:0_monitor_31000" on_node="astorage1" on_node_uuid="76bbbf07-3d2d-476d-b758-2a7a4577f162" transition-key="7:0:0:3fa38a9f-5ebc-4a48-bc80-1c95cc6655bc" >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <primitive id="drbd6:0" long-id="ma-ms-drbd6:drbd6:0" class="ocf" provider="linbit" type="drbd" />
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <attributes CRM_meta_call_id="40" CRM_meta_clone="0" CRM_meta_clone_max="2" CRM_meta_clone_node_max="1" CRM_meta_globally_unique="false" CRM_meta_interval="31000" CRM_meta_master_max="1" CRM_meta_master_node_max="1" CRM_meta_name="monitor" CRM_meta_notify="true" CRM_meta_operation="monitor" CRM_meta_role="Slave" CRM_meta_timeout="20000" crm_feature_set="3.0.6" drbd_resource="r6" />
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </rsc_op>
Jun 6 10:17:37 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd9:0 (10000)
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </crm_xml>
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </create_request_adv>
Jun 6 10:17:37 astorage1 crmd: [2947]: info: te_rsc_command: Initiating action 1: cancel drbd8:0_monitor_31000 on astorage1 (local)
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_add_rsc(870): failed to send a addrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: get_lrm_resource: Could not add resource drbd8:0 to LRM
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: do_lrm_invoke: Invalid resource definition
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <create_request_adv origin="te_rsc_command" t="crmd" version="3.0.6" subt="request" reference="lrm_invoke-tengine-1370506657-38" crm_task="lrm_invoke" crm_sys_to="lrmd" crm_sys_from="tengine" crm_host_to="astorage1" >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <crm_xml >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <rsc_op id="1" operation="cancel" operation_key="drbd8:0_monitor_31000" on_node="astorage1" on_node_uuid="76bbbf07-3d2d-476d-b758-2a7a4577f162" transition-key="1:0:0:3fa38a9f-5ebc-4a48-bc80-1c95cc6655bc" >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <primitive id="drbd8:0" long-id="ma-ms-drbd8:drbd8:0" class="ocf" provider="linbit" type="drbd" />
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <attributes CRM_meta_call_id="41" CRM_meta_clone="0" CRM_meta_clone_max="2" CRM_meta_clone_node_max="1" CRM_meta_globally_unique="false" CRM_meta_interval="31000" CRM_meta_master_max="1" CRM_meta_master_node_max="1" CRM_meta_name="monitor" CRM_meta_notify="true" CRM_meta_operation="monitor" CRM_meta_role="Slave" CRM_meta_timeout="20000" crm_feature_set="3.0.6" drbd_resource="r8" />
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </rsc_op>
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </crm_xml>
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </create_request_adv>
Jun 6 10:17:37 astorage1 crmd: [2947]: info: te_rsc_command: Initiating action 8: cancel drbd9:0_monitor_31000 on astorage1 (local)
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_add_rsc(870): failed to send a addrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: get_lrm_resource: Could not add resource drbd9:0 to LRM
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: do_lrm_invoke: Invalid resource definition
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <create_request_adv origin="te_rsc_command" t="crmd" version="3.0.6" subt="request" reference="lrm_invoke-tengine-1370506657-39" crm_task="lrm_invoke" crm_sys_to="lrmd" crm_sys_from="tengine" crm_host_to="astorage1" >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <crm_xml >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <rsc_op id="8" operation="cancel" operation_key="drbd9:0_monitor_31000" on_node="astorage1" on_node_uuid="76bbbf07-3d2d-476d-b758-2a7a4577f162" transition-key="8:0:0:3fa38a9f-5ebc-4a48-bc80-1c95cc6655bc" >
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <primitive id="drbd9:0" long-id="ma-ms-drbd9:drbd9:0" class="ocf" provider="linbit" type="drbd" />
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input <attributes CRM_meta_call_id="42" CRM_meta_clone="0" CRM_meta_clone_max="2" CRM_meta_clone_node_max="1" CRM_meta_globally_unique="false" CRM_meta_interval="31000" CRM_meta_master_max="1" CRM_meta_master_node_max="1" CRM_meta_name="monitor" CRM_meta_notify="true" CRM_meta_operation="monitor" CRM_meta_role="Slave" CRM_meta_timeout="20000" crm_feature_set="3.0.6" drbd_resource="r9" />
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </rsc_op>
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </crm_xml>
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_lrm_invoke: bad input </create_request_adv>
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_log: FSA: Input I_FAIL from get_lrm_resource() received in state S_TRANSITION_ENGINE
Jun 6 10:17:37 astorage1 crmd: [2947]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_FAIL cause=C_FSA_INTERNAL origin=get_lrm_resource ]
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: destroy_action: Cancelling timer for action 4 (src=73)
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: destroy_action: Cancelling timer for action 2 (src=74)
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: destroy_action: Cancelling timer for action 5 (src=75)
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: destroy_action: Cancelling timer for action 3 (src=76)
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: destroy_action: Cancelling timer for action 6 (src=77)
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: destroy_action: Cancelling timer for action 7 (src=78)
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: destroy_action: Cancelling timer for action 1 (src=79)
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: destroy_action: Cancelling timer for action 8 (src=80)
Jun 6 10:17:37 astorage1 crmd: [2947]: info: do_te_control: Transitioner is now inactive
Jun 6 10:17:37 astorage1 crmd: [2947]: WARN: do_log: FSA: Input I_FAIL from get_lrm_resource() received in state S_POLICY_ENGINE
Jun 6 10:17:37 astorage1 crmd: [2947]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_INTEGRATION [ input=I_FAIL cause=C_FSA_INTERNAL origin=get_lrm_resource ]
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: crm_abort: abort_transition_graph: Triggered assert at te_utils.c:339 : transition_graph != NULL
Jun 6 10:17:37 astorage1 heartbeat: [2863]: WARN: Managed /usr/lib/heartbeat/crmd process 2947 killed by signal 11 [SIGSEGV - Segmentation violation].
Jun 6 10:17:37 astorage1 ccm: [2942]: info: client (pid=2947) removed from ccm
Jun 6 10:17:37 astorage1 heartbeat: [2863]: ERROR: Managed /usr/lib/heartbeat/crmd process 2947 dumped core
Jun 6 10:17:37 astorage1 heartbeat: [2863]: EMERG: Rebooting system. Reason: /usr/lib/heartbeat/crmd
Jun 6 10:17:37 astorage1 cib: [2943]: WARN: send_ipc_message: IPC Channel to 2947 is not connected
Jun 6 10:17:37 astorage1 cib: [2943]: WARN: cib_notify_client: Notification of client 2947/d4332be4-1b1f-42e7-8d6a-4dc79e5a7e07 failed
Jun 6 10:17:37 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Jun 6 10:17:37 astorage1 cib: [2943]: WARN: send_ipc_message: IPC Channel to 2947 is not connected
Jun 6 10:17:37 astorage1 cib: [2943]: WARN: cib_notify_client: Notification of client 2947/d4332be4-1b1f-42e7-8d6a-4dc79e5a7e07 failed
Jun 6 10:17:37 astorage1 cib: [2943]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='astorage1']//lrm_resource[@id='astorage2-fencing']/lrm_rsc_op[@id='astorage2-fencing_monitor_60000' and @call-id='35'] (origin=local/crmd/72, version=0.9.70): ok (rc=0)
Jun 6 10:17:37 astorage1 cib: [2943]: WARN: send_ipc_message: IPC Channel to 2947 is not connected
Jun 6 10:17:37 astorage1 cib: [2943]: WARN: send_via_callback_channel: Delivery of reply to client 2947/d4332be4-1b1f-42e7-8d6a-4dc79e5a7e07 failed
Jun 6 10:17:37 astorage1 cib: [2943]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
Jun 6 10:17:37 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd4:0 (10000)
Jun 6 10:17:37 astorage1 pengine: [5812]: notice: process_pe_message: Transition 0: PEngine Input stored in: /var/lib/pengine/pe-input-496.bz2

root@astorage1:/var/lib/heartbeat/cores/hacluster# ls -al
total 2024
drwx------ 2 hacluster root 4096 Jun 6 10:17 .
drwxr-xr-x 5 root root 4096 Jun 5 16:50 ..
-rw------- 1 hacluster haclient 2187264 Jun 6 10:17 core
root@astorage1:/var/lib/heartbeat/cores/hacluster# file core
core: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/lib/heartbeat/crmd'
root@astorage1:/var/lib/heartbeat/cores/hacluster# gdb /usr/lib/heartbeat/crmd core
GNU gdb (GDB) 7.4.1-debian
Copyright (C) 2012 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/lib/heartbeat/crmd...(no debugging symbols found)...done.
[New LWP 2947]

warning: Can't read pathname for load map: Input/output error.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/lib/heartbeat/crmd'.
Program terminated with signal 11, Segmentation fault.
#0 0x0000000000416fd3 in ?? ()
(gdb) bt
#0 0x0000000000416fd3 in ?? ()
#1 0x0000000000406ef4 in ?? ()
#2 0x0000000000407a54 in ?? ()
#3 0x0000000000410a67 in ?? ()
#4 0x00007fd976db4355 in g_main_context_dispatch () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#5 0x00007fd976db4688 in ?? () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#6 0x00007fd976db4a82 in g_main_loop_run () from /lib/x86_64-linux-gnu/libglib-2.0.so.0
#7 0x0000000000405763 in ?? ()
#8 0x00007fd97789fead in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#9 0x0000000000405589 in ?? ()
#10 0x00007fff1b1d1d28 in ?? ()
#11 0x000000000000001c in ?? ()
#12 0x0000000000000001 in ?? ()
#13 0x00007fff1b1d2aa0 in ?? ()
#14 0x0000000000000000 in ?? ()
(gdb)

Please let me know whether this is a known bug and whether I should file a
bug report against Debian Wheezy.

Cheers,
Thomas
Andrew Beekhof
2013-06-07 01:41:21 UTC
Permalink
Post by Thomas Glanzmann
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: crm_abort: abort_transition_graph: Triggered assert at te_utils.c:339 : transition_graph != NULL
This is the cause of the coredump.
What version of pacemaker is this?

Installing pacemaker's debug symbols would also make the stack trace more useful.
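Something along these lines, assuming your distribution ships a debug
package at all:

apt-get install pacemaker-dbg     # package name is a guess, adjust for your release
gdb /usr/lib/heartbeat/crmd /var/lib/heartbeat/cores/hacluster/core
(gdb) bt full                     # backtrace with symbols and local variables
(gdb) thread apply all bt         # in case other threads are involved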
Thomas Glanzmann
2013-06-07 07:45:25 UTC
Permalink
Hello Andrew,
Post by Andrew Beekhof
Post by Thomas Glanzmann
Jun 6 10:17:37 astorage1 crmd: [2947]: ERROR: crm_abort: abort_transition_graph: Triggered assert at te_utils.c:339 : transition_graph != NULL
This is the cause of the coredump.
What version of pacemaker is this?
1.1.7-1
Post by Andrew Beekhof
Installing pacemaker's debug symbols would also make the stack trace more useful.
I'll do that and will get back to you.

I tried to reproduce the issue in my lab by installing two Debian Wheezy
VMs and reconstructing the network and HA config, but was unable to do
so. What puzzles me is that the issue showed up multiple times (at least
three times) on the production system.

Rolf,
could you please run 'apt-get install pacemaker-dev' and see if the
backtrace reveals a little bit more?

Cheers,
Thomas
Thomas Glanzmann
2013-06-07 12:50:41 UTC
Permalink
Hello Andrew,
Post by Andrew Beekhof
Installing pacemaker's debug symbols would also make the stack trace more useful.
we tried to install heartbeat-dev to see more, but it contains no
debugging symbols. I also tried to reproduce the issue with a 64-bit
Debian Wheezy (I had used 32-bit before) but was not able to.
However, in the near future I'll set up six more Linux-HA clusters
using Debian Wheezy and will report back if the issue happens to me
again. On the system where I can reproduce the problem, I won't do
any more experiments, because it is about to go into production and,
except for the maintenance part, everything works perfectly fine.

Cheers,
Thomas
Ferenc Wagner
2013-06-07 21:10:48 UTC
Permalink
Post by Thomas Glanzmann
Post by Andrew Beekhof
Installing pacemaker's debug symbols would also make the stack trace more useful.
we tried to install heartbeat-dev to see more, but there are no
debugging symbols available.
You'd probably need the pacemaker-dbg package, which is not present for
version 1.1.7-1 in wheezy. However, it is present for version
1.1.7-2 in sid. While the single changelog entry only notes the addition
of the debug package, the different compiler and library versions might
bring other differences to the table. Your other option is recompiling
the 1.1.7-2 source on wheezy.
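
Roughly like this, assuming deb-src entries for sid in your sources.list:

apt-get update
apt-get build-dep pacemaker
apt-get source pacemaker=1.1.7-2    # fetch the sid source package
cd pacemaker-1.1.7
dpkg-buildpackage -us -uc -b        # unsigned, binary-only build
dpkg -i ../pacemaker_1.1.7-2_*.deb ../pacemaker-dbg_1.1.7-2_*.deb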
--
Regards,
Feri.
Thomas Glanzmann
2013-08-04 17:11:21 UTC
Permalink
Hello Andrew,
I just got another crash when putting a node into unmanaged mode, and this
time it hit me hard:

- Both nodes suicided or STONITHed each other.
- One out of four md devices was detected on both nodes after the reset.
- Half of the config was gone.

Could you help me get to the bottom of this?

This was on Debian Wheezy.

Cheers,
Thomas
Andrew Beekhof
2013-08-05 03:05:19 UTC
Permalink
Post by Thomas Glanzmann
Hello Andrew,
I just got another crash when putting a node into unmanaged mode, and this
time it hit me hard:
- Both nodes suicided or STONITHed each other.
- One out of four md devices was detected on both nodes after the reset.
- Half of the config was gone.
Could you help me get to the bottom of this?
You will need to run crm_report and email us the resulting tarball.
This will include the version of the software you're running and log files (both system and cluster) - without which we can't do anything.
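For example, something like this (adjust the window to bracket the crash;
use -l if your logs live somewhere non-standard):

crm_report -f "2013-08-04 18:30" -t "2013-08-04 19:15"
# or, with an explicit log file:
crm_report -l /var/log/syslog -f "2013-08-04 18:30" -t "2013-08-04 19:15"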
Post by Thomas Glanzmann
This was on Debian Wheezy.
Cheers,
Thomas
Thomas Glanzmann
2013-08-05 16:29:45 UTC
Permalink
Hello Andrew,
Post by Andrew Beekhof
You will need to run crm_report and email us the resulting tarball.
This will include the version of the software you're running and log
files (both system and cluster) - without which we can't do anything.
Find the files here:

I manually packaged it because the crm_report output was empty. If I forgot
something, please let me know. I included the daemon syslog output from
both nodes (taken from the central syslog server), the crm file, the ha.cf
(which is the same on both nodes) and the /var/lib/heartbeat directory,
which seems to keep all files from the first node.

The reason for the crash in unmanaged mode seems to be the same as
before:

Aug 4 18:50:27 apache-03 crmd: [29398]: ERROR: crm_abort: abort_transition_graph: Triggered assert at te_utils.c:339 : transition_graph != NULL

Probably I should update it.

But as for why the config got lost, I have no idea what went wrong here.
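
One thing I can still check: pacemaker keeps numbered backups of the CIB
next to the live copy, so maybe an older one still holds the lost
configuration (the '42' below is just a placeholder for whatever numbers
are present):

ls -lt /var/lib/heartbeat/crm/cib-*.raw
CIB_file=/var/lib/heartbeat/crm/cib-42.raw cibadmin --query | less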

https://thomas.glanzmann.de/tmp/linux_ha_crash.2013-08-05.tar.gz

Cheers,
Thomas
Andrew Beekhof
2013-08-06 21:11:20 UTC
Permalink
Post by Thomas Glanzmann
Hello Andrew,
Post by Andrew Beekhof
You will need to run crm_report and email us the resulting tarball.
This will include the version of the software you're running and log
files (both system and cluster) - without which we can't do anything.
I manually packaged it because crm_report output was empty.
I can try and fix that if you re-run with -x and paste the output.
Post by Thomas Glanzmann
If I forgot something, please let me know. I included the daemon syslog
output from both nodes (taken from the central syslog server), the crm
file, the ha.cf (which is the same on both nodes) and the
/var/lib/heartbeat directory, which seems to keep all files from the
first node.
I can't do anything with the core file I'm afraid.
I don't run debian at all, let alone that particular version with the same binaries, libraries and symbols as you.
Without those, the core file is meaningless (which is why crm_report generates backtraces).
Post by Thomas Glanzmann
The reason for the crash in unmanaged mode seems to be the same as
Aug 4 18:50:27 apache-03 crmd: [29398]: ERROR: crm_abort: abort_transition_graph: Triggered assert at te_utils.c:339 : transition_graph != NULL
That shouldn't have resulted in a crash.

I see a lot of this though:

Aug 4 18:50:27 apache-03 crmd: [29398]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Aug 4 18:50:27 apache-03 crmd: [29398]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Aug 4 18:50:27 apache-03 crmd: [29398]: ERROR: lrm_add_rsc(870): failed to send a addrsc message to lrmd via ch_cmd channel.
Aug 4 18:50:27 apache-03 crmd: [29398]: ERROR: lrm_get_rsc(666): failed to send a getrsc message to lrmd via ch_cmd channel.
Aug 4 18:50:27 apache-03 crmd: [29398]: ERROR: get_lrm_resource: Could not add resource nfs-common to LRM

Which looks more concerning.

I would _really_ recommend upgrading to something a little more recent.
And it might be time to get off heartbeat while you're at it.
Post by Thomas Glanzmann
Probably I should update it.
But why the config got lost, I have no idea what went wrong here.
https://thomas.glanzmann.de/tmp/linux_ha_crash.2013-08-05.tar.gz
Cheers,
Thomas
Thomas Glanzmann
2013-08-07 07:42:31 UTC
Permalink
Hello Andrew,
Post by Andrew Beekhof
I can try and fix that if you re-run with -x and paste the output.
(apache-03) [~] crm_report -l /var/adm/syslog/2013/08/05 -f "2013-08-04 18:30:00" -t "2013-08-04 19:15" -x
+ shift
+ true
+ [ ! -z ]
+ break
+ [ x != x ]
+ [ x1375633800 != x ]
+ masterlog=
+ [ -z ]
+ log WARNING: The tarball produced by this program may contain
+ printf %-10s WARNING: The tarball produced by this program may contain\n apache-03:
apache-03: WARNING: The tarball produced by this program may contain
+ log sensitive information such as passwords.
+ printf %-10s sensitive information such as passwords.\n apache-03:
apache-03: sensitive information such as passwords.
+ log
+ printf %-10s \n apache-03:
apache-03:
+ log We will attempt to remove such information if you use the
+ printf %-10s We will attempt to remove such information if you use the\n apache-03:
apache-03: We will attempt to remove such information if you use the
+ log -p option. For example: -p "pass.*" -p "user.*"
+ printf %-10s -p option. For example: -p "pass.*" -p "user.*"\n apache-03:
apache-03: -p option. For example: -p "pass.*" -p "user.*"
+ log
+ printf %-10s \n apache-03:
apache-03:
+ log However, doing this may reduce the ability for the recipients
+ printf %-10s However, doing this may reduce the ability for the recipients\n apache-03:
apache-03: However, doing this may reduce the ability for the recipients
+ log to diagnose issues and generally provide assistance.
+ printf %-10s to diagnose issues and generally provide assistance.\n apache-03:
apache-03: to diagnose issues and generally provide assistance.
+ log
+ printf %-10s \n apache-03:
apache-03:
+ log IT IS YOUR RESPONSIBILITY TO PROTECT SENSITIVE DATA FROM EXPOSURE
+ printf %-10s IT IS YOUR RESPONSIBILITY TO PROTECT SENSITIVE DATA FROM EXPOSURE\n apache-03:
apache-03: IT IS YOUR RESPONSIBILITY TO PROTECT SENSITIVE DATA FROM EXPOSURE
+ log
+ printf %-10s \n apache-03:
apache-03:
+ [ -z ]
+ getnodes any
+ [ -z any ]
+ cluster=any
+ [ -z ]
+ HA_STATE_DIR=/var/lib/heartbeat
+ find_cluster_cf any
+ warning Unknown cluster type: any
+ log WARN: Unknown cluster type: any
+ printf %-10s WARN: Unknown cluster type: any\n apache-03:
apache-03: WARN: Unknown cluster type: any
+ cluster_cf=
+ ps -ef
+ egrep -qs [c]ib
+ debug Querying CIB for nodes
+ [ 0 -gt 0 ]
+ cibadmin -Ql -o nodes
+ awk
/type="normal"/ {
for( i=1; i<=NF; i++ )
if( $i~/^uname=/ ) {
sub("uname=.","",$i);
sub("\".*","",$i);
print $i;
next;
}
}

+ tr \n
+ nodes=apache-03 apache-04
+ log Calculated node list: apache-03 apache-04
+ printf %-10s Calculated node list: apache-03 apache-04 \n apache-03:
apache-03: Calculated node list: apache-03 apache-04
+ [ -z apache-03 apache-04 ]
+ echo apache-03 apache-04
+ grep -qs apache-03
+ debug We are a cluster node
+ [ 0 -gt 0 ]
+ [ -z 1375636500 ]
+ date +%a-%d-%b-%Y
+ label=pcmk-Wed-07-Aug-2013
+ time2str 1375633800
+ perl -e use POSIX; print strftime('%x %X',localtime(1375633800));
+ time2str 1375636500
+ perl -e use POSIX; print strftime('%x %X',localtime(1375636500));
+ log Collecting data from apache-03 apache-04 (08/04/13 18:30:00 to 08/04/13 19:15:00)
+ printf %-10s Collecting data from apache-03 apache-04 (08/04/13 18:30:00 to 08/04/13 19:15:00)\n apache-03:
apache-03: Collecting data from apache-03 apache-04 (08/04/13 18:30:00 to 08/04/13 19:15:00)
+ collect_data pcmk-Wed-07-Aug-2013 1375633800 1375636500
+ label=pcmk-Wed-07-Aug-2013
+ expr 1375633800 - 10
+ start=1375633790
+ expr 1375636500 + 10
+ end=1375636510
+ masterlog=
+ [ x != x ]
+ l_base=/home/tg/pcmk-Wed-07-Aug-2013
+ r_base=pcmk-Wed-07-Aug-2013
+ [ -e /home/tg/pcmk-Wed-07-Aug-2013 ]
+ mkdir -p /home/tg/pcmk-Wed-07-Aug-2013
+ [ x != x ]
+ cat
+ [ apache-03 = apache-03 ]
+ cat
+ cat /home/tg/pcmk-Wed-07-Aug-2013/.env /usr/share/pacemaker/report.common /usr/share/pacemaker/report.collector
+ bash /home/tg/pcmk-Wed-07-Aug-2013/collector
apache-03: ERROR: Could not determine the location of your cluster logs, try specifying --logfile /some/path
+ cat
+ [ apache-03 = apache-04 ]
+ cat /home/tg/pcmk-Wed-07-Aug-2013/.env /usr/share/pacemaker/report.common /usr/share/pacemaker/report.collector
+ ssh+ -l root -T apache-04 -- mkdir -p pcmk-Wed-07-Aug-2013; cat > pcmk-Wed-07-Aug-2013/collector; bash pcmk-Wed-07-Aug-2013/collectorcd
/home/tg/pcmk-Wed-07-Aug-2013
+ tar xf -
apache-04: ERROR: Could not determine the location of your cluster logs, try specifying --logfile /some/path
tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors
+ analyze /home/tg/pcmk-Wed-07-Aug-2013
+ flist=hostcache members.txt cib.xml crm_mon.txt logd.cf sysinfo.txt
+ printf Diff hostcache...
+ ls /home/tg/pcmk-Wed-07-Aug-2013/*/hostcache
+ echo no /home/tg/pcmk-Wed-07-Aug-2013/*/hostcache :/
+ continue
+ printf Diff members.txt...
+ ls /home/tg/pcmk-Wed-07-Aug-2013/*/members.txt
+ echo no /home/tg/pcmk-Wed-07-Aug-2013/*/members.txt :/
+ continue
+ printf Diff cib.xml...
+ ls /home/tg/pcmk-Wed-07-Aug-2013/*/cib.xml
+ echo no /home/tg/pcmk-Wed-07-Aug-2013/*/cib.xml :/
+ continue
+ printf Diff crm_mon.txt...
+ ls /home/tg/pcmk-Wed-07-Aug-2013/*/crm_mon.txt
+ echo no /home/tg/pcmk-Wed-07-Aug-2013/*/crm_mon.txt :/
+ continue
+ printf Diff logd.cf...
+ ls /home/tg/pcmk-Wed-07-Aug-2013/*/logd.cf
+ echo no /home/tg/pcmk-Wed-07-Aug-2013/*/logd.cf :/
+ continue
+ printf Diff sysinfo.txt...
+ ls /home/tg/pcmk-Wed-07-Aug-2013/*/sysinfo.txt
+ echo no /home/tg/pcmk-Wed-07-Aug-2013/*/sysinfo.txt :/
+ continue
+ [ -f /home/tg/pcmk-Wed-07-Aug-2013/cluster-log.txt ]
+ cat /home/tg/pcmk-Wed-07-Aug-2013/apache-03/analysis.txt
cat: /home/tg/pcmk-Wed-07-Aug-2013/apache-03/analysis.txt: No such file or directory
+ [ -s /home/tg/pcmk-Wed-07-Aug-2013/apache-03/events.txt ]
+ [ -s /home/tg/pcmk-Wed-07-Aug-2013/cluster-log.txt ]
+ cat /home/tg/pcmk-Wed-07-Aug-2013/apache-04/analysis.txt
cat: /home/tg/pcmk-Wed-07-Aug-2013/apache-04/analysis.txt: No such file or directory
+ [ -s /home/tg/pcmk-Wed-07-Aug-2013/apache-04/events.txt ]
+ [ -s /home/tg/pcmk-Wed-07-Aug-2013/cluster-log.txt ]
+ log
+ printf %-10s \n apache-03:
apache-03:
+ [ 1 = 1 ]
+ shrink /home/tg/pcmk-Wed-07-Aug-2013
+ olddir=/home/tg
+ dirname /home/tg/pcmk-Wed-07-Aug-2013
+ dir=/home/tg
+ basename /home/tg/pcmk-Wed-07-Aug-2013
+ base=pcmk-Wed-07-Aug-2013
+ target=/home/tg/pcmk-Wed-07-Aug-2013.tar
+ tar_options=cf
+ pickfirst bzip2 gzip false
+ which bzip2
+ echo bzip2
+ return 0
+ variant=bzip2
+ tar_options=jcf
+ target=/home/tg/pcmk-Wed-07-Aug-2013.tar.bz2
+ [ -e /home/tg/pcmk-Wed-07-Aug-2013.tar.bz2 ]
+ cd /home/tg
+ tar jcf /home/tg/pcmk-Wed-07-Aug-2013.tar.bz2 pcmk-Wed-07-Aug-2013
+ cd /home/tg
+ echo /home/tg/pcmk-Wed-07-Aug-2013.tar.bz2
+ fname=/home/tg/pcmk-Wed-07-Aug-2013.tar.bz2
+ rm -rf /home/tg/pcmk-Wed-07-Aug-2013
+ log Collected results are available in /home/tg/pcmk-Wed-07-Aug-2013.tar.bz2
+ printf %-10s Collected results are available in /home/tg/pcmk-Wed-07-Aug-2013.tar.bz2\n apache-03:
apache-03: Collected results are available in /home/tg/pcmk-Wed-07-Aug-2013.tar.bz2
+ log
+ printf %-10s \n apache-03:
apache-03:
+ log Please create a bug entry at
+ printf %-10s Please create a bug entry at\n apache-03:
apache-03: Please create a bug entry at
+ log http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
+ printf %-10s http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker\n apache-03:
apache-03: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
+ log Include a description of your problem and attach this tarball
+ printf %-10s Include a description of your problem and attach this tarball\n apache-03:
apache-03: Include a description of your problem and attach this tarball
+ log
+ printf %-10s \n apache-03:
apache-03:
+ log Thank you for taking time to create this report.
+ printf %-10s Thank you for taking time to create this report.\n apache-03:
apache-03: Thank you for taking time to create this report.
+ log
+ printf %-10s \n apache-03:

Resulting file is here:
https://thomas.glanzmann.de/tmp/pcmk-Wed-07-Aug-2013.tar.bz2
Post by Andrew Beekhof
I can't do anything with the core file I'm afraid. I don't run debian
at all, let alone that particular version with the same binaries,
libraries and symbols as you. Without those, the core file is
meaningless (which is why crm_report generates backtraces).
I see. I also think that Debian does not package the debug symbols, so
the core files are really useless. Please point me to the right
packages if I'm wrong.
Post by Andrew Beekhof
That shouldn't have resulted in a crash.
It does. I also tried to reproduce it on a 32-bit system; that time the
cluster at least rebooted both nodes at the same time but did not lose the
config, and crmd just reported an error and did not dump core.
Post by Andrew Beekhof
I would _really_ recommend upgrading to something a little more
recent. And it might be time to get off heartbeat while you're at it.
Just to be absolutely sure: I should upgrade to the most recent pacemaker
release and use corosync as the communication layer?

I tried corosync a few years back and was annoyed because back then it
could not handle more than two heartbeat links between the nodes;
however, I see that it now can, and at the moment I don't need more anyway.
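
For the record, the redundant-ring setup I would aim for looks roughly
like this in corosync 1.x syntax (addresses and ports are placeholders
for my two links):

totem {
    version: 2
    rrp_mode: passive          # use both rings, tolerate failure of either
    interface {
        ringnumber: 0
        bindnetaddr: 10.10.30.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 192.168.30.0
        mcastaddr: 226.94.1.2
        mcastport: 5407
    }
}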

Does anyone have Debian packages that can be used in production, or should
I package it myself?

Does anyone have a how-to guide on using the peer outdater with corosync?

One last question about maintenance mode: I want to use maintenance mode
to change the configuration without affecting production, check that the
monitors succeed, take the system out of maintenance mode and then try the
failover. I have already verified that the resource agents work correctly.
Is that a valid use of maintenance mode, or should I always test my setup
on a lab system and only then put it into the production system?
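
In other words, roughly this workflow (crm_verify is just a pre-flight
check of the live CIB):

crm configure property maintenance-mode=true
# ... change the configuration ...
crm_verify -LV                     # complain about CIB errors before re-enabling
crm configure property maintenance-mode=false
# then test the failover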

Cheers,
Thomas
Andrew Beekhof
2013-08-07 22:56:29 UTC
Permalink
Post by Thomas Glanzmann
Hello Andrew,
Post by Andrew Beekhof
I can try and fix that if you re-run with -x and paste the output.
(apache-03) [~] crm_report -l /var/adm/syslog/2013/08/05 -f "2013-08-04 18:30:00" -t "2013-08-04 19:15" -x
+ shift
+ true
+ [ ! -z ]
+ break
+ [ x != x ]
+ [ x1375633800 != x ]
+ masterlog=
+ [ -z ]
+ log WARNING: The tarball produced by this program may contain
apache-03: WARNING: The tarball produced by this program may contain
+ log sensitive information such as passwords.
apache-03: sensitive information such as passwords.
+ log
+ log We will attempt to remove such information if you use the
apache-03: We will attempt to remove such information if you use the
+ log -p option. For example: -p "pass.*" -p "user.*"
apache-03: -p option. For example: -p "pass.*" -p "user.*"
+ log
+ log However, doing this may reduce the ability for the recipients
apache-03: However, doing this may reduce the ability for the recipients
+ log to diagnose issues and generally provide assistance.
apache-03: to diagnose issues and generally provide assistance.
+ log
+ log IT IS YOUR RESPONSIBILITY TO PROTECT SENSITIVE DATA FROM EXPOSURE
apache-03: IT IS YOUR RESPONSIBILITY TO PROTECT SENSITIVE DATA FROM EXPOSURE
+ log
+ [ -z ]
+ getnodes any
+ [ -z any ]
+ cluster=any
+ [ -z ]
+ HA_STATE_DIR=/var/lib/heartbeat
+ find_cluster_cf any
+ warning Unknown cluster type: any
+ log WARN: Unknown cluster type: any
apache-03: WARN: Unknown cluster type: any
+ cluster_cf=
+ ps -ef
+ egrep -qs [c]ib
+ debug Querying CIB for nodes
+ [ 0 -gt 0 ]
+ cibadmin -Ql -o nodes
+ awk
/type="normal"/ {
for( i=1; i<=NF; i++ )
if( $i~/^uname=/ ) {
sub("uname=.","",$i);
sub("\".*","",$i);
print $i;
next;
}
}
+ tr \n
+ nodes=apache-03 apache-04
+ log Calculated node list: apache-03 apache-04
apache-03: Calculated node list: apache-03 apache-04
+ [ -z apache-03 apache-04 ]
+ echo apache-03 apache-04
+ grep -qs apache-03
+ debug We are a cluster node
+ [ 0 -gt 0 ]
+ [ -z 1375636500 ]
+ date +%a-%d-%b-%Y
+ label=pcmk-Wed-07-Aug-2013
+ time2str 1375633800
+ perl -e use POSIX; print strftime('%x %X',localtime(1375633800));
+ time2str 1375636500
+ perl -e use POSIX; print strftime('%x %X',localtime(1375636500));
+ log Collecting data from apache-03 apache-04 (08/04/13 18:30:00 to 08/04/13 19:15:00)
apache-03: Collecting data from apache-03 apache-04 (08/04/13 18:30:00 to 08/04/13 19:15:00)
+ collect_data pcmk-Wed-07-Aug-2013 1375633800 1375636500
+ label=pcmk-Wed-07-Aug-2013
+ expr 1375633800 - 10
+ start=1375633790
+ expr 1375636500 + 10
+ end=1375636510
+ masterlog=
+ [ x != x ]
+ l_base=/home/tg/pcmk-Wed-07-Aug-2013
+ r_base=pcmk-Wed-07-Aug-2013
+ [ -e /home/tg/pcmk-Wed-07-Aug-2013 ]
+ mkdir -p /home/tg/pcmk-Wed-07-Aug-2013
+ [ x != x ]
+ cat
+ [ apache-03 = apache-03 ]
+ cat
+ cat /home/tg/pcmk-Wed-07-Aug-2013/.env /usr/share/pacemaker/report.common /usr/share/pacemaker/report.collector
+ bash /home/tg/pcmk-Wed-07-Aug-2013/collector
apache-03: ERROR: Could not determine the location of your cluster logs, try specifying --logfile /some/path
+ cat
+ [ apache-03 = apache-04 ]
+ cat /home/tg/pcmk-Wed-07-Aug-2013/.env /usr/share/pacemaker/report.common /usr/share/pacemaker/report.collector
+ ssh -l root -T apache-04 -- mkdir -p pcmk-Wed-07-Aug-2013; cat > pcmk-Wed-07-Aug-2013/collector; bash pcmk-Wed-07-Aug-2013/collector
+ cd /home/tg/pcmk-Wed-07-Aug-2013
+ tar xf -
apache-04: ERROR: Could not determine the location of your cluster logs, try specifying --logfile /some/path
tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors
+ analyze /home/tg/pcmk-Wed-07-Aug-2013
+ flist=hostcache members.txt cib.xml crm_mon.txt logd.cf sysinfo.txt
+ printf Diff hostcache...
+ ls /home/tg/pcmk-Wed-07-Aug-2013/*/hostcache
+ echo no /home/tg/pcmk-Wed-07-Aug-2013/*/hostcache :/
+ continue
+ printf Diff members.txt...
+ ls /home/tg/pcmk-Wed-07-Aug-2013/*/members.txt
+ echo no /home/tg/pcmk-Wed-07-Aug-2013/*/members.txt :/
+ continue
+ printf Diff cib.xml...
+ ls /home/tg/pcmk-Wed-07-Aug-2013/*/cib.xml
+ echo no /home/tg/pcmk-Wed-07-Aug-2013/*/cib.xml :/
+ continue
+ printf Diff crm_mon.txt...
+ ls /home/tg/pcmk-Wed-07-Aug-2013/*/crm_mon.txt
+ echo no /home/tg/pcmk-Wed-07-Aug-2013/*/crm_mon.txt :/
+ continue
+ printf Diff logd.cf...
+ ls /home/tg/pcmk-Wed-07-Aug-2013/*/logd.cf
+ echo no /home/tg/pcmk-Wed-07-Aug-2013/*/logd.cf :/
+ continue
+ printf Diff sysinfo.txt...
+ ls /home/tg/pcmk-Wed-07-Aug-2013/*/sysinfo.txt
+ echo no /home/tg/pcmk-Wed-07-Aug-2013/*/sysinfo.txt :/
+ continue
+ [ -f /home/tg/pcmk-Wed-07-Aug-2013/cluster-log.txt ]
+ cat /home/tg/pcmk-Wed-07-Aug-2013/apache-03/analysis.txt
cat: /home/tg/pcmk-Wed-07-Aug-2013/apache-03/analysis.txt: No such file or directory
+ [ -s /home/tg/pcmk-Wed-07-Aug-2013/apache-03/events.txt ]
+ [ -s /home/tg/pcmk-Wed-07-Aug-2013/cluster-log.txt ]
+ cat /home/tg/pcmk-Wed-07-Aug-2013/apache-04/analysis.txt
cat: /home/tg/pcmk-Wed-07-Aug-2013/apache-04/analysis.txt: No such file or directory
+ [ -s /home/tg/pcmk-Wed-07-Aug-2013/apache-04/events.txt ]
+ [ -s /home/tg/pcmk-Wed-07-Aug-2013/cluster-log.txt ]
+ log
+ [ 1 = 1 ]
+ shrink /home/tg/pcmk-Wed-07-Aug-2013
+ olddir=/home/tg
+ dirname /home/tg/pcmk-Wed-07-Aug-2013
+ dir=/home/tg
+ basename /home/tg/pcmk-Wed-07-Aug-2013
+ base=pcmk-Wed-07-Aug-2013
+ target=/home/tg/pcmk-Wed-07-Aug-2013.tar
+ tar_options=cf
+ pickfirst bzip2 gzip false
+ which bzip2
+ echo bzip2
+ return 0
+ variant=bzip2
+ tar_options=jcf
+ target=/home/tg/pcmk-Wed-07-Aug-2013.tar.bz2
+ [ -e /home/tg/pcmk-Wed-07-Aug-2013.tar.bz2 ]
+ cd /home/tg
+ tar jcf /home/tg/pcmk-Wed-07-Aug-2013.tar.bz2 pcmk-Wed-07-Aug-2013
+ cd /home/tg
+ echo /home/tg/pcmk-Wed-07-Aug-2013.tar.bz2
+ fname=/home/tg/pcmk-Wed-07-Aug-2013.tar.bz2
+ rm -rf /home/tg/pcmk-Wed-07-Aug-2013
+ log Collected results are available in /home/tg/pcmk-Wed-07-Aug-2013.tar.bz2
apache-03: Collected results are available in /home/tg/pcmk-Wed-07-Aug-2013.tar.bz2
+ log
+ log Please create a bug entry at
apache-03: Please create a bug entry at
+ log http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
apache-03: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
+ log Include a description of your problem and attach this tarball
apache-03: Include a description of your problem and attach this tarball
+ log
+ log Thank you for taking time to create this report.
apache-03: Thank you for taking time to create this report.
+ log
It really helps to read the output of the commands you're running:

Did you not see these messages the first time?

apache-03: WARN: Unknown cluster type: any
apache-03: ERROR: Could not determine the location of your cluster logs, try specifying --logfile /some/path
apache-04: ERROR: Could not determine the location of your cluster logs, try specifying --logfile /some/path

Try adding -H and --logfile {somevalue} next time.
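Something along these lines, reusing the same time window (assuming -H
forces the heartbeat cluster type, and with --logfile pointed at
whatever file each node actually logs to):

    crm_report -H --logfile /var/adm/syslog/2013/08/05 \
        -f "2013-08-04 18:30:00" -t "2013-08-04 19:15"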
Post by Thomas Glanzmann
https://thomas.glanzmann.de/tmp/pcmk-Wed-07-Aug-2013.tar.bz2
Post by Andrew Beekhof
I can't do anything with the core file I'm afraid. I don't run debian
at all, let alone that particular version with the same binaries,
libraries and symbols as you. Without those, the core file is
meaningless (which is why crm_report generates backtraces).
I see; I also think that Debian does not package the debug symbols, so
the core files are effectively useless. Please point me to the right
packages if I'm wrong.
I have no experience with debian.
Post by Thomas Glanzmann
Post by Andrew Beekhof
That shouldn't have resulted in a crash.
It does. I also tried to reproduce it on a 32-bit system: the system
at least rebooted both nodes at the same time but did not lose the
config, and this time crm just reported an error and did not dump core.
Post by Andrew Beekhof
I would _really_ recommend upgrading to something a little more
recent. And it might be time to get off heartbeat while you're at it.
Just to be absolutely sure: I should upgrade to the most recent
pacemaker release and use corosync as the communication layer?
An updated pacemaker is the important part.
Whether you switch to corosync too is up to you.

Pacemaker+heartbeat is by far the least tested combination.
Post by Thomas Glanzmann
I tried corosync a few years back and was annoyed because back then it
could not handle more than two heartbeat links between the nodes;
however, I saw that it now can, and at the moment I don't need more anyway.
Does anyone have Debian packages that can be used in production, or
should I package it myself?
Best to poke the debian maintainers
Post by Thomas Glanzmann
Does anyone have a howto for using the peer outdater with corosync?
I'm sure linbit has one somewhere
Post by Thomas Glanzmann
One last question about maintenance mode: I want to use maintenance
mode to change the configuration without affecting production. See that
the monitors take the system out of maintenance mode and then try the
failover. I have already verified that the resource agents work
correctly. Is that a valid use of maintenance mode, or should I always
test my setup on a lab system and only then put it into the production
system?
Do you mean "See that the monitors _work, then_ take the system out of maintenance mode..."?
If so, then yes.
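So the intended sequence is something like this (crm shell syntax; a
sketch, not a prescribed procedure):

    crm configure property maintenance-mode=true
    # ... change the configuration, verify the services are healthy ...
    crm configure property maintenance-mode=false
    crm_mon -1    # one-shot status: nothing unexpected should restart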
Post by Thomas Glanzmann
Cheers,
Thomas
_______________________________________________
Linux-HA mailing list
Linux-HA at lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Thomas Glanzmann
2013-08-08 05:49:27 UTC
Permalink
Hello Andrew,
Post by Andrew Beekhof
Did you not see these messages the first time?
apache-03: WARN: Unknown cluster type: any
apache-03: ERROR: Could not determine the location of your cluster logs, try specifying --logfile /some/path
apache-04: ERROR: Could not determine the location of your cluster logs, try specifying --logfile /some/path
Try adding -H and --logfile {somevalue} next time.
I'll do that and report back.
Post by Andrew Beekhof
An updated pacemaker is the important part. Whether you switch to
corosync too is up to you.
I'll do that.
Post by Andrew Beekhof
Pacemaker+heartbeat is by far the least tested combination.
What is the best tested combination? Pacemaker and corosync? Any
specific version, or should I go with the latest release of both?
Post by Andrew Beekhof
Best to poke the debian maintainers
I'll do that as well.
Post by Andrew Beekhof
Do you mean "See that the monitors _work, then_ take the system out of
maintance mode..."? If so, then yes.
Yes, that is what I want to do. :-)

Cheers,
Thomas
Andrew Beekhof
2013-08-09 02:38:25 UTC
Permalink
Post by Thomas Glanzmann
Hello Andrew,
Post by Andrew Beekhof
Did you not see these messages the first time?
apache-03: WARN: Unknown cluster type: any
apache-03: ERROR: Could not determine the location of your cluster logs, try specifying --logfile /some/path
apache-04: ERROR: Could not determine the location of your cluster logs, try specifying --logfile /some/path
Try adding -H and --logfile {somevalue} next time.
I'll do that and report back.
Post by Andrew Beekhof
An updated pacemaker is the important part. Whether you switch to
corosync too is up to you.
I'll do that.
Post by Andrew Beekhof
Pacemaker+heartbeat is by far the least tested combination.
What is the best tested combination? Pacemaker and corosync? Any
specific version, or should I go with the latest release of both?
SUSE tests corosync 1.x (with the plugin) and is starting to dabble with corosync 2.x
I, and Red Hat, test corosync 1.x (with cman) and corosync 2.x
Post by Thomas Glanzmann
Post by Andrew Beekhof
Best to poke the debian maintainers
I'll do that as well.
Post by Andrew Beekhof
Do you mean "See that the monitors _work, then_ take the system out of
maintance mode..."? If so, then yes.
Yes, that is what I want to do. :-)
Cheers,
Thomas
_______________________________________________
Linux-HA mailing list
Linux-HA at lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Ulrich Windl
2013-08-05 07:00:01 UTC
Permalink
Post by Thomas Glanzmann
Thomas Glanzmann <thomas at glanzmann.de> wrote on 04.08.2013 at 19:11 in
Hello Andrew,
I just got another crash when putting a node into unmanaged mode, this
Hi!

Did it happen when you put the cluster into maintenance-mode, or did it happen after someone fiddled with the resources manually? Or did it happen when you turned maintenance-mode off again?

Maybe syslog has something more to say...

Regards,
Ulrich
Post by Thomas Glanzmann
- Both nodes suicided or STONITHed each other
- One out of four md devices was detected on both nodes after the
reset.
- Half of the config was gone. Could you help me get to the
bottom of this?
This was on Debian Wheezy.
Cheers,
Thomas
_______________________________________________
Linux-HA mailing list
Linux-HA at lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Thomas Glanzmann
2013-08-05 17:03:45 UTC
Permalink
Hello Ulrich,
Post by Ulrich Windl
Did it happen when you put the cluster into maintenance-mode, or did
it happen after someone fiddled with the resources manually? Or did it
happen when you turned maintenance-mode off again?
I did not remember, but I checked the log files, and yes, I did a config
change (I removed apache_loadbalancer from group apache). That is probably
the reason I could not reproduce it in my lab environment: I never tried
to fiddle with the config afterwards. Probably the way to reproduce it is:
put the cluster into maintenance mode, then change something in the config,
and it crashes; but I have to verify that in my lab. I'll do that right
now and report back.
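The lab sketch, in crm shell syntax, with the group and primitive ids
taken from the diff below:

    crm configure property maintenance-mode=true
    crm configure edit apache   # remove apache_loadbalancer from the group
    # on the production pair, crmd dumped core at roughly this point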

...
Aug 4 18:49:18 apache-03 cib: [29394]: info: cib:diff: + <configuration >
Aug 4 18:49:18 apache-03 cib: [29394]: info: cib:diff: + <crm_config >
Aug 4 18:49:18 apache-03 cib: [29394]: info: cib:diff: + <cluster_property_set id="cib-bootstrap-options" >
Aug 4 18:49:18 apache-03 cib: [29394]: info: cib:diff: + <nvpair id="cib-bootstrap-options-maintenance-mode" name="maintenance-mode" value="true" __crm_diff_marker__="added:top" />
Aug 4 18:49:18 apache-03 cib: [29394]: info: cib:diff: + </cluster_property_set>
Aug 4 18:49:18 apache-03 cib: [29394]: info: cib:diff: + </crm_config>
Aug 4 18:49:18 apache-03 cib: [29394]: info: cib:diff: + </configuration>
Aug 4 18:49:18 apache-03 cib: [29394]: info: cib:diff: + </cib>
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - <cib admin_epoch="0" epoch="94" num_updates="100" >
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - <configuration >
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - <resources >
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - <group id="apache" >
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - <primitive class="ocf" id="apache_loadbalancer" provider="heartbeat" type="apachetg" __crm_diff_marker__="removed:top" >
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - <operations >
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - <op id="apache_loadbalancer-monitor-60s" interval="60s" name="monitor" />
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - </operations>
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - </primitive>
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - </group>
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - </resources>
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - </configuration>
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - </cib>
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + <cib epoch="95" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.6" update-origin="apache-03" update-client="cibadmin" cib-last-written="Sun Aug 4 18:49:18 2013" have-quorum="1" dc-uuid="61e8f424-b538-4352-b3fe-955ca853e5fb" >
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + <configuration >
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + <resources >
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + <primitive class="ocf" id="apache_loadbalancer" provider="heartbeat" type="apachetg" __crm_diff_marker__="added:top" >
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + <operations >
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + <op id="apache_loadbalancer-monitor-60s" interval="60s" name="monitor" />
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + </operations>
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + </primitive>
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + </resources>
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + </configuration>
Aug 4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + </cib>
...
Aug 4 18:50:27 apache-03 heartbeat: [29380]: ERROR: Managed /usr/lib/heartbeat/crmd process 29398 dumped core

Complete syslog is my other e-mail I just sent to Alan, if you want to
check it.

Cheers,
Thomas
Ulrich Windl
2013-08-06 07:24:43 UTC
Permalink
Post by Thomas Glanzmann
Post by Ulrich Windl
Thomas Glanzmann <thomas at glanzmann.de> wrote on 05.08.2013 at 19:03 in
Hello Ulrich,
Post by Ulrich Windl
Did it happen when you put the cluster into maintenance-mode, or did
it happen after someone fiddled with the resources manually? Or did it
happen when you turned maintenance-mode off again?
I did not remember, but I checked the log files, and yes, I did a config
change (I removed apache_loadbalancer from group apache). That is probably
the reason I could not reproduce it in my lab environment: I never tried
to fiddle with the config afterwards. Probably the way to reproduce it is:
put the cluster into maintenance mode, then change something in the config,
and it crashes; but I have to verify that in my lab. I'll do that right
now and report back.
Hi!

I think it's a common misconception that you can modify cluster resources while in maintenance mode: the cluster seems to expect, when you turn maintenance mode off, the state it had when maintenance mode was turned on. Maybe it's like an airplane's autopilot: you can turn it off, fly the plane the way the autopilot would have done, and when you turn the autopilot on again the flight path continues; however, if you change direction while the autopilot is off, big confusion may arise when you turn it on again... ;-)


Am I right?

Still this doesn't explain where your configuration or log files went...

Regards,
Ulrich
Post by Thomas Glanzmann
...
Aug 4 18:50:27 apache-03 heartbeat: [29380]: ERROR: Managed /usr/lib/heartbeat/crmd process 29398 dumped core
Complete syslog is my other e-mail I just sent to Alan, if you want to
check it.
Cheers,
Thomas
_______________________________________________
Linux-HA mailing list
Linux-HA at lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Andrew Beekhof
2013-08-06 20:10:28 UTC
Permalink
Post by Ulrich Windl
Post by Thomas Glanzmann
Post by Ulrich Windl
Thomas Glanzmann <thomas at glanzmann.de> wrote on 05.08.2013 at 19:03 in
Hello Ulrich,
Post by Ulrich Windl
Did it happen when you put the cluster into maintenance-mode, or did
it happen after someone fiddled with the resources manually? Or did it
happen when you turned maintenance-mode off again?
I did not remember, but I checked the log files, and yes, I did a config
change (I removed apache_loadbalancer from group apache). That is probably
the reason I could not reproduce it in my lab environment: I never tried
to fiddle with the config afterwards. Probably the way to reproduce it is:
put the cluster into maintenance mode, then change something in the config,
and it crashes; but I have to verify that in my lab. I'll do that right
now and report back.
Hi!
No, you _should_ be able to. If that's not the case, it's a bug.
Post by Ulrich Windl
The cluster seems to expect, when you turn maintenance mode off, the state it had when maintenance mode was turned on. Maybe it's like an airplane's autopilot: you can turn it off, fly the plane the way the autopilot would have done, and when you turn the autopilot on again the flight path continues; however, if you change direction while the autopilot is off, big confusion may arise when you turn it on again... ;-)
Am I right?
No, more likely its just not a well tested scenario.
Post by Ulrich Windl
Still this doesn't explain where your configuration or log files went...
Regards,
Ulrich
Post by Thomas Glanzmann
...
Aug 4 18:50:27 apache-03 heartbeat: [29380]: ERROR: Managed /usr/lib/heartbeat/crmd process 29398 dumped core
Complete syslog is my other e-mail I just sent to Alan, if you want to
check it.
Cheers,
Thomas
_______________________________________________
Linux-HA mailing list
Linux-HA at lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Ulrich Windl
2013-08-07 06:06:18 UTC
Permalink
Post by Andrew Beekhof
Post by Ulrich Windl
Andrew Beekhof <andrew at beekhof.net> wrote on 06.08.2013 at 22:10 in message
Post by Ulrich Windl
Thomas Glanzmann <thomas at glanzmann.de> wrote on 05.08.2013 at 19:03 in
Hello Ulrich,
Post by Ulrich Windl
Did it happen when you put the cluster into maintenance-mode, or did
it happen after someone fiddled with the resources manually? Or did it
happen when you turned maintenance-mode off again?
I did not remember, but I checked the log files, and yes, I did a config
change (I removed apache_loadbalancer from group apache). That is probably
the reason I could not reproduce it in my lab environment: I never tried
to fiddle with the config afterwards. Probably the way to reproduce it is:
put the cluster into maintenance mode, then change something in the config,
and it crashes; but I have to verify that in my lab. I'll do that right
now and report back.
Hi!
I think it's a common misconception that you can modify cluster resources
No, you _should_ be able to. If that's not the case, it's a bug.
So the end of maintenance mode starts with a "re-probe"? Even if it does, until the end of the re-probe all resources should be considered to be in an unclean state.
I had some quite bad experience with maintenance mode...
Post by Andrew Beekhof
Post by Ulrich Windl
The cluster seems to expect, when you turn maintenance mode off, the
state it had when maintenance mode was turned on. Maybe it's like an
airplane's autopilot: you can turn it off, fly the plane the way the
autopilot would have done, and when you turn the autopilot on again the
flight path continues; however, if you change direction while the
autopilot is off, big confusion may arise when you turn it on again... ;-)
Post by Ulrich Windl
Am I right?
No, more likely its just not a well tested scenario.
Post by Ulrich Windl
Still this doesn't explain where your configuration or log files went...
Regards,
Ulrich
...
Aug 4 18:50:27 apache-03 heartbeat: [29380]: ERROR: Managed /usr/lib/heartbeat/crmd process 29398 dumped core
Complete syslog is my other e-mail I just sent to Alan, if you want to
check it.
Cheers,
Thomas
_______________________________________________
Linux-HA mailing list
Linux-HA at lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
Andrew Beekhof
2013-08-07 06:52:29 UTC
Permalink
Post by Ulrich Windl
Post by Andrew Beekhof
Post by Ulrich Windl
Andrew Beekhof <andrew at beekhof.net> wrote on 06.08.2013 at 22:10 in message
Post by Ulrich Windl
Thomas Glanzmann <thomas at glanzmann.de> wrote on 05.08.2013 at 19:03 in
Hello Ulrich,
Post by Ulrich Windl
Did it happen when you put the cluster into maintenance-mode, or did
it happen after someone fiddled with the resources manually? Or did it
happen when you turned maintenance-mode off again?
I did not remember, but I checked the log files, and yes, I did a config
change (I removed apache_loadbalancer from group apache). That is probably
the reason I could not reproduce it in my lab environment: I never tried
to fiddle with the config afterwards. Probably the way to reproduce it is:
put the cluster into maintenance mode, then change something in the config,
and it crashes; but I have to verify that in my lab. I'll do that right
now and report back.
Hi!
I think it's a common misconception that you can modify cluster resources
No, you _should_ be able to. If that's not the case, it's a bug.
So the end of maintenance mode starts with a "re-probe"?
No, but it doesn't need to.
The policy engine already knows if the resource definitions changed and the recurring monitor ops will find out if any are not running.
Post by Ulrich Windl
Even if it does, until the end of the re-probe all resources should be considered to be in an unclean state.
They should probably be considered unclean the moment you turn maintenance mode on.
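(A reprobe can also be requested by hand before leaving maintenance
mode, if one wants the cluster's view refreshed explicitly; a sketch:

    crm_resource --reprobe    # ask all nodes to re-detect resource state
    crm_mon -1                # confirm before maintenance-mode=false
)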
Post by Ulrich Windl
I had some quite bad experience with maintenance mode...
Post by Andrew Beekhof
Post by Ulrich Windl
The cluster seems to expect, when you turn maintenance mode off, the
state it had when maintenance mode was turned on. Maybe it's like an
airplane's autopilot: you can turn it off, fly the plane the way the
autopilot would have done, and when you turn the autopilot on again the
flight path continues; however, if you change direction while the
autopilot is off, big confusion may arise when you turn it on again... ;-)
Post by Ulrich Windl
Am I right?
No, more likely its just not a well tested scenario.
Post by Ulrich Windl
Still this doesn't explain where your configuration or log files went...
Regards,
Ulrich
...
Aug 4 18:50:27 apache-03 heartbeat: [29380]: ERROR: Managed /usr/lib/heartbeat/crmd process 29398 dumped core
Complete syslog is my other e-mail I just sent to Alan, if you want to
check it.
Cheers,
Thomas
_______________________________________________
Linux-HA mailing list
Linux-HA at lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems