SAP Knowledge Base Article - Preview

2624991 - MPX Coordinator node hangs during restart recovery processing, post secondary node file system full hang - SAP IQ

Symptom

This issue occurs only under a specific set of conditions and sequence of events, which lead to the Coordinator node (CN) hanging on restart:

  • A multiplex secondary writer node hangs after a filesystem becomes full.

  • The multiplex is shut down and the filesystem-full condition is resolved. However, on restarting the MPX, the CN appears to hang during MPX recovery processing, at this message in the iqmsg file:
    st_database::CompleteMpxRecovery() - change RecoveryState from RECOVERED to RECOVERING
  • Secondary nodes are unable to communicate with the CN and report "failed to communicate over INC" errors.

  • Dropping all secondary nodes, reverting to simplex, and then rebuilding the multiplex does not resolve the problem: the CN hangs after the first secondary node is added to the MPX.

  • Pstacks taken of the hung CN process show little or no movement. They are likely to show the start-database thread waiting on a POSIX lock during MPX recovery, possibly during global transaction restore processing, similar to this:
    ----------------- lwp# nnnn / thread# nnnn --------------------
     fffffd7ffd59257a fcntl (2c, 7, fffffd7fe830f720)
     fffffd7ffd5786be _lockf () + be
     fffffd7ffd580bdd lockf () + 7d
     fffffd7fe75617e3 _posix_lockf () + 2b
     fffffd7fe75561a9 __1cSc_cfg_inifile_baseElock6M_C_ () + 1d
     fffffd7fe755616c __1cSc_cfg_inifile_baseEopen6M_C_ () + 184
     fffffd7fe750cd44 __1cRc_strm_port_layerHconnect6MpknMc_conn_parms_nQa_strm_conn_type_pcIpvrknPc_encrypt_level_ppnUc_strm_conn_notifier_ppnMc_strm_pconn__nRan_int_strm_error__ () + c4c
     fffffd7fe7509e8b __1cRc_strm_tran_layerHconnect6MpknMc_conn_parms_nQa_strm_conn_type_pcIpvppnUc_strm_conn_notifier_ppnMc_strm_tconn__nRan_int_strm_error__ () + e3
     fffffd7fe7500c93 __1cRc_strm_sess_layerHconnect6MpnMc_conn_parms_nQa_strm_conn_type_pcIpvppnMc_strm_sconn__nRan_int_strm_error__ () + 313
     fffffd7fe74f8f19 __1cLStrmConnect6FpnMc_conn_parms_pnKa_strm_env_nQa_strm_conn_type_ppnLa_strm_conn_pcIpkcpv_C_ () + 49
     fffffd7fe74cdb80 i_cs_EngineConnect () + 3bc
     fffffd7fe74d3310 CmdSeqEngineConnect () + 3c
     fffffd7fe74d14f4 __1cbCi_cs_StringConnectNoRedirect6FppnNa_cmdseq_conn_pnTa_cmdseq_error_info_pnWa_cmdseq_strconn_param_IpnWc_cmdseq_redirect_info__nPa_cmdseq_return__ () + 458
     fffffd7fe74d2868 i_cs_StringConnect () + 310
     fffffd7fe74d0fd2 CmdSeqStringConnect () + 2ae
     fffffd7fe74be402 db_string_connect () + 176
     fffffd7fe50416d8 __1cRinc_MpxConnectionHConnect6Mpkc_i_ () + 2dc
     fffffd7fe502b8c1 __1cOinc_ControllerOGetGTxnRecConn6MrpnRinc_MpxConnection_I_i_ () + 14d
     fffffd7fe503e9b5 __1cSinc_CommandHandlerXCallRemoteRPCForGTxnRec6MIIrnNinc_MpxRPCMsg_2pnSs_bufman_errorInfo__i_ () + 179
     fffffd7fe5067f8c __1cTinc_gTxnRecEndpointICallProc6MIIrnNinc_MpxRPCMsg_2IpnMinc_rpcStats_pnMinc_connPool_pnSs_bufman_errorInfo__i_ () + b0
     fffffd7fe5068380 __1cTinc_gTxnRecEndpointTGetGlobalTxnCxtData6MIrpkCrX_i_ () + 70
     fffffd7fe4cefd08 __1cLst_databaseZDoRestoreGlobalTxnCxtData6MpnOgtrWorkUnitRow_IpnZst_gTxnCxtRestoreWorkIter__v_ () + 51c
     fffffd7fe4cf1613 __1cLst_databaseXRestoreGlobalTxnCxtData6M_v_ () + 4ef
     fffffd7fe4cef6ee __1cLst_databaseTCompleteMpxRecovery6M_i_ () + 292
     fffffd7fe4cfe783 __1cIst_iqctlTCompleteMpxRecovery6MpnKUIDatabase__I_ () + 1b
     fffffd7fe364bc0b UIQCtl_CompleteMpxRecovery () + 1b
     fffffd7fe4d07ed0 __1cUst_SAIQdDInterfaceInfoMcallFunction6M_v_ () + 1c
     fffffd7fe4d8d417 __1cQst_SAIQdDInterfaceJRunIQdDFunc6MpnUst_SAIQdDInterfaceInfo__v_ () + 2b3
     fffffd7fe4d084e5 __1cQst_SAIQdDInterfaceHExecute6MipvpF11_I1I_v_ () + 259
     fffffd7fe4010216 __1cUsaint_iqthresholdctlTCompleteMpxRecovery6MpnJIDatabase__I_ () + 5e
     fffffd7ffdef7b26 __1cPInitIQdDMultiplex6FpnIDatabase_pnGWorker__nPan_errmap_index__ () + 10e
     fffffd7ffdefa71e __1cNStartDatabase6FpnSa_single_db_config_pnURQdDStartDatabaseParam_II_nPan_errmap_index__ () + 1cda
     fffffd7ffde1f5b9 __1cPRQdDStartDatabaseKdo_request6M_v_ () + 14d
     fffffd7ffe1fb096 __1cKRQdDBaseItemHdo_work6MpnGWorker__I_ () + 22
     fffffd7ffe2577c1 __1cMRequestQdDueueLworker_body6F_v_ () + 215
     fffffd7ffe1fafe1 __1cMrequest_task6Fpv_v_ () + 85
     fffffd7ffe259cd9 __1cIUnixTaskIpre_body6Fpv_1_ () + 119
     fffffd7ffd58daeb _thr_setup () + 5b
     fffffd7ffd58dd20 _lwp_start ()
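
When several pstack samples have to be compared, the check described above can be scripted. The following is a minimal Python sketch, not an SAP-provided tool. It assumes the pstack output uses the same layout as the trace above (a dashed "lwp# / thread#" header followed by one frame per line), and the function names parse_pstack and threads_blocked_in_recovery_lock are hypothetical helpers introduced here for illustration. It reports threads whose innermost frames are in lockf while CompleteMpxRecovery appears further up the same stack.

```python
import re

# Header line printed before each thread's frames, e.g.
# "----------------- lwp# 12 / thread# 12 --------------------"
THREAD_HEADER = re.compile(
    r"^-+\s*lwp#\s*(\S+)\s*/\s*thread#\s*(\S+)\s*-+\s*$"
)
# Frame line: a hex address followed by the symbol name.
FRAME = re.compile(r"^\s*[0-9a-fA-F]+\s+(\S+)")


def parse_pstack(text):
    """Return {thread label: [symbol names, innermost frame first]}."""
    threads = {}
    current = None
    for line in text.splitlines():
        header = THREAD_HEADER.match(line.strip())
        if header:
            current = "lwp#%s/thread#%s" % (header.group(1), header.group(2))
            threads[current] = []
            continue
        frame = FRAME.match(line)
        if frame and current is not None:
            threads[current].append(frame.group(1))
    return threads


def threads_blocked_in_recovery_lock(text, depth=4):
    """List threads that sit in lockf within the top `depth` frames
    and also have CompleteMpxRecovery somewhere on the stack.

    The depth of 4 is an assumption taken from the sample trace
    (fcntl, _lockf, lockf, _posix_lockf); adjust it if needed.
    """
    hits = []
    for label, frames in parse_pstack(text).items():
        in_lock = any(
            sym == "lockf" or sym.endswith("_lockf") for sym in frames[:depth]
        )
        in_recovery = any("CompleteMpxRecovery" in sym for sym in frames)
        if in_lock and in_recovery:
            hits.append(label)
    return hits
```

To use it, save the output of each pstack run to a file, read the file into a string, and pass it to threads_blocked_in_recovery_lock. If the same thread is reported in every sample taken a few minutes apart, that is consistent with the "little or no movement" symptom described above.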

Environment

  • SAP IQ 16.0 on UNIX platforms

  • SAP IQ 16.1 on UNIX platforms

Product

SAP IQ 16.0 ; SAP IQ 16.1

Keywords

sybase, IQ16, hang, stuck, failure, recovery, TLV, GTR, resilience, 'long time', kill, crash, 'file system', abort, unresponsive, root, reboot, CO, simplex, hung, semaphore, semafore, tmp, temp , KBA , BC-SYB-IQ , Sybase IQ , Problem
