Diameter at a Glance Telecom Authentication Process

Transport Failure Detection

Here I try to explain transport failure detection in general terms. How a transport failure is detected depends on the deployment of the network, so I will explain it with the help of an example.




Example

Suppose there are two nodes, Node-1 and Node-2. A peer connection is already established between them and they are exchanging messages on that connection. Now Node-1 sends a message MESSAGE-X to Node-2 and does not receive a response to it. How long should Node-1 wait for the response (say 10 ms)? Should Node-1 retry (say yes)? How many times should Node-1 retry (say 2 times)? All of these things are deployment specific.

After all the deployment-specific conditions have been exhausted, Node-1 checks whether the network connection is broken. To do this, Node-1 sends a DWR message to Node-2. If it does not receive the DWA within a specific period of time, it retries the DWR, up to 3 attempts in total (including the first DWR). If no DWA is received for any of the DWRs, Node-1 treats the situation as a connection failure and sends the other messages to the secondary peer.
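
The watchdog check described above can be sketched roughly as follows. This is only an illustration, not code from any real Diameter stack: `send_dwr` and `wait_for_dwa` are hypothetical callbacks that a stack would have to supply, and the timeout and attempt count are deployment-specific values I have assumed.

```python
from typing import Callable


def transport_alive(send_dwr: Callable[[], None],
                    wait_for_dwa: Callable[[float], bool],
                    dwa_timeout_s: float = 1.0,
                    max_attempts: int = 3) -> bool:
    """Probe the peer with DWR, up to max_attempts times in total.

    send_dwr     -- hypothetical callback that transmits one DWR
    wait_for_dwa -- hypothetical callback that blocks for up to the given
                    number of seconds and returns True if a DWA arrived
    Returns True as soon as any DWR is answered, False if every attempt
    times out (the caller then treats the connection as failed).
    """
    for _ in range(max_attempts):
        send_dwr()
        if wait_for_dwa(dwa_timeout_s):
            return True
    return False
```

If `transport_alive` returns False, Node-1 would start the failover towards the secondary peer.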

If Node-1 receives a DWA carrying an error, that does not mean a connection failure, because Node-1 has received the DWA over the very transport connection it was checking. A DWA with an error, such as DIAMETER_TOO_BUSY or any other error, merely informs Node-1 of the status of Node-2.
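
To make the distinction concrete, here is a small sketch that separates "is the transport up?" from "is the peer healthy?". The function name and the tuple it returns are my own invention for illustration; the two Result-Code values are the ones defined by the Diameter base protocol.

```python
from typing import Optional, Tuple

DIAMETER_SUCCESS = 2001
DIAMETER_TOO_BUSY = 3004


def interpret_watchdog(result_code: Optional[int]) -> Tuple[bool, bool]:
    """Return (transport_up, peer_healthy) for one watchdog probe.

    result_code is the Result-Code of the received DWA, or None if no DWA
    arrived before the timeout. Any DWA, even one carrying an error,
    proves that the transport connection is still working.
    """
    if result_code is None:
        return (False, False)
    # Result-Code values in the 2xxx range indicate success.
    return (True, 2000 <= result_code < 3000)
```

So `interpret_watchdog(DIAMETER_TOO_BUSY)` reports the transport as up even though the peer itself is not healthy, and only a missing DWA counts towards a connection failure.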

Failover

The process of detecting a transport connection failure with a peer and forwarding all pending messages to the secondary peer node (alternate node) is known as failover.
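
A minimal sketch of the forwarding step, assuming pending requests are kept in a queue of plain dictionaries with a `flags` field (my own simplification, not how any particular stack stores them). Requests re-sent after a failover carry the T ("potentially retransmitted") bit in the command flags, so the sketch sets it.

```python
from collections import deque
from typing import Callable, Deque, Dict

T_FLAG = 0x10  # "potentially retransmitted" bit in the Diameter command flags


def fail_over(pending: Deque[Dict],
              send_to_secondary: Callable[[Dict], None]) -> int:
    """Forward every pending request to the secondary peer, oldest first.

    pending           -- queue of unanswered requests (hypothetical dicts)
    send_to_secondary -- hypothetical callback that transmits one message
    Returns the number of messages that were moved.
    """
    moved = 0
    while pending:
        msg = pending.popleft()
        msg["flags"] |= T_FLAG  # mark as a possible duplicate
        send_to_secondary(msg)
        moved += 1
    return moved
```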

AVP Structure of DWR and DWA
Device-Watchdog-Request
<DWR> ::= < Diameter Header: 280, REQ >
                { Origin-Host }
                { Origin-Realm }
                [ Origin-State-Id ]

Device-Watchdog-Answer
<DWA> ::= < Diameter Header: 280 >
                { Result-Code }
                { Origin-Host }
                { Origin-Realm }
                [ Error-Message ]
              * [ Failed-AVP ]
                [ Original-State-Id ]   (read as [ Origin-State-Id ])
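
As a rough illustration of how the DWR layout above ends up on the wire, here is a sketch that packs the 20-byte Diameter header followed by the AVPs. The function names are mine; the AVP codes (264 Origin-Host, 296 Origin-Realm, 278 Origin-State-Id), command code 280 and the M flag are from the base protocol. It is a simplified encoder with no vendor-specific AVPs and no validation.

```python
import struct


def avp(code: int, data: bytes, flags: int = 0x40) -> bytes:
    """Encode one AVP: code (4), flags (1), length (3), data, zero padding.

    The length field covers header plus data but not the padding that
    aligns the AVP to a 4-byte boundary. 0x40 is the M (mandatory) flag.
    """
    length = 8 + len(data)
    header = struct.pack("!IB", code, flags) + length.to_bytes(3, "big")
    padding = b"\x00" * (-len(data) % 4)
    return header + data + padding


def build_dwr(origin_host: str, origin_realm: str,
              hop_by_hop: int, end_to_end: int,
              origin_state_id=None) -> bytes:
    """Build a Device-Watchdog-Request (command 280, R bit set)."""
    body = avp(264, origin_host.encode("ascii"))
    body += avp(296, origin_realm.encode("ascii"))
    if origin_state_id is not None:          # optional AVP
        body += avp(278, struct.pack("!I", origin_state_id))
    length = 20 + len(body)                  # header is always 20 bytes
    header = (bytes([1]) + length.to_bytes(3, "big")      # version, length
              + bytes([0x80]) + (280).to_bytes(3, "big")  # flags, command
              + struct.pack("!III", 0, hop_by_hop, end_to_end))  # app id 0
    return header + body
```

For example, `build_dwr("node1.example.com", "example.com", 1, 2)` uses made-up host and realm names; a capture of the result can be checked in Wireshark/tshark.
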
AVP Description

Failed-AVP: a grouped AVP that provides debugging information when a request is rejected or an error occurs during processing, for example when an AVP is not supported.

Error-Message: provides the error in human-readable form.



Original-State-Id: a misprint in the RFC. It is actually Origin-State-Id.

Origin-State-Id: used to infer the state of the session/connection between two nodes. Whenever a node's state changes because of a break or disconnection in the session or transport, for instance after a reboot, the rebooted node increases the value so that the other node becomes aware that the peer's state has changed and that all previous sessions are no longer valid. Origin-State-Id is stored in non-volatile memory on all nodes.




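Every time the session fails or the node is rebooted, the Origin-State-Id is monotonically increased. Each communicating node stores the last value it received from its peer, so that a higher value can be recognised as a state change. Note that answers are matched to their requests by the Hop-by-Hop Identifier in the Diameter header, not by this AVP.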
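
The restart detection that Origin-State-Id allows can be sketched like this. The class and method names are hypothetical, and I have assumed that a missing AVP or a value of 0 means "make no inference", which is how the value 0 is treated in the comments below.

```python
from typing import Dict, Optional


class PeerStateTracker:
    """Remember the highest Origin-State-Id seen from each peer."""

    def __init__(self) -> None:
        self._last_seen: Dict[str, int] = {}

    def observe(self, origin_host: str,
                origin_state_id: Optional[int]) -> bool:
        """Return True if the peer appears to have lost its state.

        An absent AVP (None) or the value 0 is ignored. A value higher
        than the one stored for this peer means its earlier sessions
        should be treated as gone.
        """
        if not origin_state_id:
            return False
        previous = self._last_seen.get(origin_host)
        self._last_seen[origin_host] = max(origin_state_id, previous or 0)
        return previous is not None and origin_state_id > previous
```

When `observe` returns True, the node can free the resources tied to the old sessions of that peer.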



Your comments, suggestions and questions are always welcome. I will try to clarify doubts to the best of my knowledge, so feel free to ask questions.

44 comments:

  1. Hi Vinay,

    Thanks for this article. I've a query though.

    Please let me know: when no Origin-State-Id is sent in the DWR, what Origin-State-Id value should we expect in the DWA message?

    I'm facing an issue where an Origin-State-Id with invalid AVP bits is received in the DWA when no Origin-State-Id is sent in the DWR. The error is shown below:

    #### <> <> <> <1322126427209>
    180.20.100.90
    origin.com
    N/A
    2001


    Regards,
    Rishi

    Replies
    1. Hi Rishi,

      If there is no Origin-State-Id in DWR then there should not be any Origin-State-Id in DWA.

      Thanks for your query.
      Happy to help you again.
      Team-Diameter

  2. Hi Vinay,
    Thank you for the article.
    Let's take peer 1 configured to send a DWR every 30 seconds if no traffic is detected.
    Peer 2 is configured the same way.
    I'd like to verify something:

    At t0 peer 1 sends DWR
    at t0+30 peer2 sends DWR
    at T0+60 peer1 sends DWR

    Do you think the DWR is considered traffic, so that in this case peer1, on receiving the DWR at T0+30, would wait another 30 seconds before sending its second DWR, that is, at T0+60?

    Thank you
    Nicolas.

    Replies
    1. Hi Nicolas

      A DWR message exchange happens when there is no traffic between two nodes for a given period of time (i.e. suppose we have configured 30 secs as the DWR time; if no message is exchanged between the two nodes for 30 secs, a DWR will be triggered).

      Hence under load there will not be any case where no message is exchanged for such a long time (i.e. the time configured for DWR, generally 2-5 secs). Therefore DWR is not part of the load.

      Under load the system will be busy with messages, therefore DWR will not occur.


      Thanks for your query.
      Happy to help you again.
      Team-Diameter

  3. Hello Vinay,

    Thanks for the nice article. Let's say there is an x-request message and the device is waiting for the y-answer message. How long will the device wait for the answer? Is it application specific or session specific (i.e. does it depend on the particular session, say an IP-CAN session for Gx)?

    Replies
    1. Hi Moumita Barman,

      It should wait until it times out.

      The operator shall configure a time (generally in milliseconds) at the client node for how long the client should wait for a reply from the server. If the client receives the answer from the server after that time frame, it shall discard the answer, because as soon as the request has timed out, the session id corresponding to the request message is no longer valid.


      Thanks for your query.
      Happy to help you again.
      Team-Diameter

  5. Hi Vinay,

    Does the watchdog timer need to be enabled separately, or are DWR/DWA triggered by default?

    Replies
    1. Hi Kamal,

      It depends on the Diameter stack, and on how the stack vendor provides it. Generally there is a provision to change the default time-span value of the DWR/DWA messages.

      The standard says two nodes shall check whether the link is up or not.

  6. Hi,

    For the first DWR I got a DWA message, and immediately after getting the DWA the client sends a second DWR; after that I get the error SCTP : ABORT : User Initiated Abort. Is the issue with the DWR timer value or with the association?

    Replies
    1. Hi Bharath

      DWR/DWA messages are used to check whether the SCTP/TCP link between two nodes is up or not (specifically the TCP link, because TCP has no mechanism for checking the health of the link).

      There is no association between the DWR time and the SCTP abort. If no message is exchanged for a certain period of time (the DWR time), the node shall send a DWR to check whether the link and the other node are up or not.

      Thanks for your query.
      Happy to help you again.
      Team-Diameter

  7. Hi,

    What if Node-1 does not send the DWR? It only sends the CER and receives the CEA, and that is all.

  8. I have a problem: Node-1 does not send the DWR. Does anyone know what is happening? Node-1 sends the CER and receives the CEA, but that's all; the connection is not established.

    Replies
    1. Hi Cruz,

      This issue happens because of one of the following reasons.

      1) The CEA doesn't come with DIAMETER_SUCCESS, or there is no common application.

      Kindly check the CEA, or post the trace using tshark; the following link should help you.
      http://diameter-protocol.blogspot.in/2013/04/capture-diameter-messages-without-wire.html

      2) Two peer nodes of NODE-1 or NODE-2 have the same DIAMETER Identity.
      In this case the connection toggles: basically the node drops the earlier connection, then the earlier connection retries and the node drops the new connection.

      Kindly check the DIAMETER Identity of each node.

      3) (Unusual case) The node receives some other message before the CEA; it then sometimes goes into an unknown state.


      If you could share some more details it would be easier for everyone to solve it. Sometimes these issues are implementation specific.

      Thanks for your query.
      Happy to help you again.
      Team-Diameter

  9. A question on transport failure detection in Diameter.
    Say I have a Diameter peer connection established and my watchdog timer is 30 seconds.
    Now suppose I do an ifconfig down on the IP interface over which the peer connection is established.
    How long will it take my local Diameter layer to detect that the IP interface has gone down? Will this be immediate, or will it have to go through the watchdog procedure?

    thanks,
    Vijaya

    Replies
    1. Hi VV,

      I consider the following cases.

      1) LOAD condition: Under load, the watchdog request does not come into the picture. As stated in the article, the watchdog happens only when there is no message exchange between peers for 30 seconds (the watchdog time). But the system is heavily loaded; therefore, in this case, detection of the transport connection failure would be immediate.


      2) LEAN hour condition: If there is no message exchange between the nodes for 30 seconds, the failure would only be detected with the DWR message, i.e. either the DWR won't be initiated by the stack or the DWR would time out because the DWA won't be received in the expected time. So the detection time would be 30 secs + timeout secs.


      Regards
      Ajay

  10. If Origin-State-Id is sent in CER with value 0, is it mandatory to send the Origin-State-Id set to value 0 in the CEA message?

    Replies
    1. Hi Vijay,


      Origin-State-Id set to Zero shall be inferred as Origin-State-Id not present in request.

  11. Hi Vinay, I've a couple of questions re: transport failure

    Lets say as per your example we have Node 1 and Node 2 connected and exchanging messages.

    If I understand RFC 3539 correctly, the Tw timer is reset (with jitter) for every Answer message. So, as you say, in the busy hour the DWR is never sent.

    So lets say Node 1 has sent a CCR request to Node 2 and response-timeout (10ms in your example expires) Node 1 looks to see if it should retry ('Yes' & twice as per your example) so we would see two more attempts completed before Node 1 stops retrying, the request. Each retry would reset the Tw timer.

    Couple of things I need some help with
    - I'm not sure I understand why after 3 failures (as per local config) the DWR would be initiated? Assume this is because Tw is reset on Answers and not requests so although there may be more requests sent the lack of answers means that Tw will expire
    - How does the Credit Control Tx timer overlay onto the base response-timeout i.e. if Tx was 5ms and we set the Credit Control application to Terminate no further attempts are made, does this override the base config?
    - Lean hour vs busy hour: RFC 3539 suggests that in a busy hour it may take 2Tw to fail over. I assume this is because only a DWR/DWA failure can be used to infer that the peer is down?

    Kind regards Jim





    Replies
    1. Hi Jim

      We hope we have understood your point of view correctly and are not deviating from your point.

      If the DWA for a DWR is not received in the given time (the time-out), it implies that there is a transport layer failure between the two adjacent nodes, called peers.

      In a strict implementation of RFC 6733, a CCA not being received does not imply a transport failure between peers, because there can be a case in which an intermediate node is present between the CCR client and the CCR server. For the CCR client, the peer is the intermediate node.

      The following link can help you.
      http://diameter-protocol.blogspot.in/2013/08/diameter-connection-establishment.html

      Our team has also inserted an image on this blog explaining DWR.

      Thanks for your query.

      Happy to help you again.
      Team-Diameter

    2. Many thanks much appreciated
      Kind regards
      Jim

  12. I want to understand how the DWR exchange is different from the SCTP HEARTBEAT mechanism. A Diameter node using SCTP as its transport layer will anyhow detect transport failures using the HEARTBEAT messages exchanged between the two SCTP nodes, so why is there still a need to exchange DWR/DWA messages to detect transport failures?

    Replies
    1. Yes Vijay

      You are right.
      If we are using TCP, there is no heartbeat mechanism in TCP. A DIAMETER node can use either transport; that is why DWR is there in the DIAMETER implementation.

    2. Thank you Ethan for the clarification. Does this mean that a Diameter node using SCTP as transport layer need/should not use DWR/DWA messages? May I know if this is documented anywhere in the RFC?

    3. @ Ethan

      Your clarification is correct.

      @ Vijay

      DWR is a proactive solution to detect transport failure. There is no reference document saying that a node using SCTP should not implement it.

      A node acting as a server MUST support both TCP and SCTP connections. A client can use TCP or SCTP.

  13. I have a query regarding the Failed-AVP AVP content to be encoded whenever a diameter node returns DIAMETER_MISSING_AVP error. RFC describes the following:
    7.1.5. Permanent Failures
    DIAMETER_MISSING_AVP 5005
    The request did not contain an AVP that is required by the Command
    Code definition. If this value is sent in the Result-Code AVP, a
    Failed-AVP AVP SHOULD be included in the message. The Failed-AVP
    AVP MUST contain an example of the missing AVP complete with the
    Vendor-Id if applicable. The value field of the missing AVP
    should be of correct minimum length and contain zeroes.

    7.5. Failed-AVP AVP
    ……
    A Diameter message SHOULD contain one Failed-AVP AVP, containing the
    entire AVP that could not be processed successfully. If the failure
    reason is omission of a required AVP, an AVP with the missing AVP
    code, the missing Vendor-Id, and a zero-filled payload of the minimum
    required length for the omitted AVP will be added.

    I am confused about the value to be encoded as defined in the above two sections(one section says as it should be filled with zeros and other section says it should be a zero-filled payload??).
    May I know what is the expected result? Is it that the Value field be left empty or encode the value field with the value "00" which is one byte and append the padding bytes?

    Replies
    1. Failed-AVP is a grouped AVP.
      It is implied that the data field of the missing AVP shall be filled with zeros up to the minimum length.
      <Failed-AVP> ::= < AVP Header: 279 >
      1* {Missing-AVP Header: - - - [Data]} Data shall be filled with zeros

      Thanks for your query.

      Happy to help you again.
      Team-Diameter

    2. Ok, can you confirm if the following encoding is correct, for example for "Origin-Realm" AVP this would look like as below:
      + Failed-AVP
      ::= < AVP Header: 279 >
      ::= Origin-Realm
      AVP Code: 296
      AVP Flags: 0x40
      AVP Length: 8

      ---> Data field is empty

    3. Wireshark/tshark is the tool to check format.


      Thanks for your query.

      Happy to help you again.
      Team-Diameter

  14. Hello,

    In the example above, if there is an underlying transport link failure between Node-1 and Node-2, but Node-2 has not been seen as suspect Diameter peer by Node-1 because Tw has not expired between Node-1 and Node-2; also DWR/DWA process has not taken place to conclude that Node-2 is suspect and there is a transport link failure.

    Questions:

    1) I believe in Node-1 Tx timer keeps expiring and it will keep sending CCR to Node-2 setting T-bit at re-transmission each time, until the number of configurable re-transmission times is reached by Node-1?

    2) If during this time window, Tw expires and Node-1 starts to send DWR towards Node-2; and Node-1 has not exhausted the number of its configurable re-transmission times for CCR; can CCR and DWR be sent by Node-1 towards Node-2 simultaneously?

    Thanks.

    Sam

  16. Hi all ,
    can anyone help with this?

    1) Have you ever used the seagull tool as a client for pumping the Sy call flow?
    When I am using seagull as a client, as per my requirement I need to put in a timeout. During that time a DWR message is received from the server by the seagull client and seagull responds with a DWA; after that a subsequent DWR message is sent from the server but seagull never sends a DWA.

    Has anyone faced this problem? Kindly provide a solution for this.

    2) Actually, when no traffic is exchanged between two nodes within 30 min, DWR and DWA will be initiated. Is this time configurable on both server and client?

    Is point 2 applicable to the 3GPP standards? Can we configure the time for DWR and DWA on both the client and server side?

    Please correct me if I am wrong.

    Thanks in advance

  17. Hi Team-Diameter,

    I have two questions.

    1. If already a connection is established to diameter server. and if we try to open second connection to diameter server using same client identity. How will server react?

    2. If 'new Origin-State-Id > older Origin-State-Id' in CER, will the server clear any old socket with same diameter client (if any, and where server is using watchdog mechanism to figure out the connection state, but watchdog timer still has not expired).

    Replies
    1. Hi Devesh,

      The implementation of what we suggest could vary between different vendors' DIAMETER stacks; here we explain what RFC 6733 says.

      1) If a DIAMETER server receives CER message again on established connection with same DIAMETER identity, then server would respond to second CER with CEA and establish the diameter connection on the basis of second CER, It shall disconnect First connection created by first CER. Because in this scenario Server would assume that client might have been rebooted and sending a fresh request to create DIAMETER connection with same DIAMETER IDENTITY. As we know CER is the first message exchanged to establish a DIAMETER connection.

      We have observed the following behaviour with different vendor stacks in the context of the above explanation; do share if you have observed anything new.

      a) The stack would not allow another node with the same DIAMETER IDENTITY to connect.
      b) The Diameter connection fluctuates between two clients, because the second client breaks the connection of the first by sending a CER with the same identity, and the first client, retrying its broken connection, breaks the connection created by the second client.


      2) working on it.

      we hope our suggestions would help you,

      Thanks for your query.
      Happy to help you again.
      Team-Diameter

    2. Hi Devesh,

      If a Diameter entity receives, new Origin-State-Id higher than previous, it is an indication that all previous sessions don't exist now. Resources associated with previous sessions can be freed.

      Thanks for your query.
      Happy to help you again
      Team-Diameter



    3. This is the same scenario I’m facing with one of my node.
      In case of failover with node-1, it will try to establish the connection with node-2(CER) with the same host name, but node-2 is not probably accepting the connection & sessions are dropping.
      The answer you posted based on RFC, can you please give the exact reference for that? (RFC/section?)

      “If a DIAMETER server receives CER message again on established connection with same DIAMETER identity, then server would respond to second CER with CEA and establish the diameter connection on the basis of second CER, It shall disconnect First connection created by first CER. Because in this scenario Server would assume that client might have been rebooted and sending a fresh request to create DIAMETER connection with same DIAMETER IDENTITY. As we know CER is the first message exchanged to establish a DIAMETER connection.”


  18. Hi Team,

    Actually I am getting the "DIAMETER_LOGOUT" error.

    Could anyone please let me know what the reason would be?

    Regards,
    Harish

    Replies
    1. Hi Harish,

      As far as we understand the scenario, you are working on a session-based application and the client has logged out (signed out); therefore the client sends an STR (Session-Termination-Request) to the server with the reason in the Termination-Cause AVP, i.e. the user has logged out, indicating to the server that it should close the session.

      Thanks for your query.

      Happy to help you again.
      Team-Diameter

  19. HI Team ,

    I have a scenario where Node A sent a capabilities exchange request and Node B sent a capabilities exchange answer with the DIAMETER_SUCCESS result code. Then after 29.79 sec Node B initiates a watchdog request, but Node A doesn't send any response to the watchdog request.
    Also, after 30.28 sec Node A initiates the SCTP abort with the error code user-initiated ABORT.

    User Initiated Abort (12)

    Cause of error
    --------------

    This error cause MAY be included in ABORT chunks which are send
    because of an upper layer request. The upper layer can specify
    an Upper Layer Abort Reason which is transported by SCTP
    transparently and MAY be delivered to the upper layer protocol
    at the peer.

    now questions :)
    1. Why did Node A send the SCTP abort (user-initiated)? Is it because the upper layer, i.e. Diameter, didn't receive the watchdog request, so Diameter requested SCTP to initiate the abort?
    2. What can be the reason for Diameter requesting SCTP to initiate an SCTP abort (is it a transport layer failure detected by Diameter)?
    3. After a successful capabilities exchange request and answer, which node will initiate the watchdog request if there is no Diameter traffic?

    Thanks in advance .
    Regards
    Victor

    Replies
    1. HI Team,

      I was really expecting an answer on this. It would be a great help to me if I received some comments.

      Regards
      Victor

    2. Hi Victor

      Sorry for delayed response.

      3) It is immaterial which node first initiates the DWR. DWR is used to check the status of the transport.

      Here we see a strange thing: why do you have the DWR time set so long, at 29.79 seconds?

      As we know, DIAMETER is an application layer protocol that runs over a transport layer protocol (TCP or SCTP), so we first need to check whether the transport is working or not.

      So kindly tell us which SCTP messages were exchanged during the 29.79 seconds.
      Check whether SCTP heartbeat messages are exchanged or not.
      Kindly try to reduce the DWR time to some milliseconds.

      Kindly revert

      Thanks for your query.

      Happy to help you again.
      Team-Diameter



  20. HI Team,

    Thanks for your reply. I agree with you that there is some problem with the transport layer.
    Yes, there is an SCTP heartbeat message sent from Node A which Node B didn't respond to.

    SCTP message exchanged between two nodes are

    node A Node B
    init-------------------------->
    < ------------------------init_ack
    cookie_echo--------------->
    <------------------------cookie_ack
    after this diameter established
    CER-------------------------------->
    <----------------------------CEA

    SCTP heartbeat ------------->
    <-----------------------DWR
    SCTP abort ------------------>

    So from the above, as Node B didn't respond to the SCTP heartbeat message, Node A sends the SCTP abort message.
    But just one last question :) Why didn't Node A respond to the DWR? Is it because of the transport layer, that is, Node A didn't receive the DWR message, and could the same be the reason Node B didn't respond to the heartbeat message?

    Am I right? Kindly let me know your views too.

    Thanks and regards
    Victor
