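While working on a customer escalation recently, I spent a fair amount of time fixing a server-side problem and adding more detailed logging. Once the original problem was fixed, a new one appeared in the customer's environment: no response was returned to the client. The server threw the following error: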
java.io.IOException: java.net.SocketException: Connection reset
    at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118)
    at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at weblogic.utils.io.ChunkedOutputStream.writeTo(ChunkedOutputStream.java:284)
    at weblogic.servlet.internal.ServletOutputStreamImpl.writeHeader(ServletOutputStreamImpl.java:170)
    at weblogic.servlet.internal.ResponseHeaders.writeHeaders(ResponseHeaders.java:498)
    at weblogic.servlet.internal.ServletResponseImpl.writeHeaders(ServletResponseImpl.java:1315)
    at weblogic.servlet.internal.ServletOutputStreamImpl.sendHeaders(ServletOutputStreamImpl.java:284)
    at weblogic.servlet.internal.ChunkOutput.flush(ChunkOutput.java:433)
    at weblogic.servlet.internal.CharsetChunkOutput.flush(CharsetChunkOutput.java:298)
    at weblogic.servlet.internal.ChunkOutput$2.checkForFlush(ChunkOutput.java:657)
    at weblogic.servlet.internal.CharsetChunkOutput.write(CharsetChunkOutput.java:200)
    at weblogic.servlet.internal.ChunkOutputWrapper.write(ChunkOutputWrapper.java:148)
    at weblogic.servlet.internal.ServletOutputStreamImpl.write(ServletOutputStreamImpl.java:151)
    at org.apache.axis.utils.ByteArray.writeTo(ByteArray.java:375)
    at org.apache.axis.SOAPPart.writeTo(SOAPPart.java:265)
    at org.apache.axis.Message.writeTo(Message.java:539)
    at org.apache.axis.transport.http.AxisServlet.sendResponse(AxisServlet.java:902)
    at org.apache.axis.transport.http.AxisServlet.doPost(AxisServlet.java:777)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:751)
    at org.apache.axis.transport.http.AxisServletBase.service(AxisServletBase.java:374)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:844)
    at weblogic.servlet.internal.StubSecurityHelper$ServletServiceAction.run(StubSecurityHelper.java:242)
    at weblogic.servlet.internal.StubSecurityHelper$ServletServiceAction.run(StubSecurityHelper.java:216)
    at weblogic.servlet.internal.StubSecurityHelper.invokeServlet(StubSecurityHelper.java:132)
    at weblogic.servlet.internal.ServletStubImpl.execute(ServletStubImpl.java:338)
    at weblogic.servlet.internal.ServletStubImpl.execute(ServletStubImpl.java:221)
    at weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction.wrapRun(WebAppServletContext.java:3284)
    at weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction.run(WebAppServletContext.java:3254)
    at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:321)
    at weblogic.security.service.SecurityManager.runAs(SecurityManager.java:120)
    at weblogic.servlet.provider.WlsSubjectHandle.run(WlsSubjectHandle.java:57)
    at weblogic.servlet.internal.WebAppServletContext.doSecuredExecute(WebAppServletContext.java:2163)
    at weblogic.servlet.internal.WebAppServletContext.securedExecute(WebAppServletContext.java:2089)
    at weblogic.servlet.internal.WebAppServletContext.execute(WebAppServletContext.java:2074)
    at weblogic.servlet.internal.ServletRequestImpl.run(ServletRequestImpl.java:1513)
    at weblogic.servlet.provider.ContainerSupportProviderImpl$WlsRequestExecutor.run(ContainerSupportProviderImpl.java:254)
    at weblogic.work.ExecuteThread.execute(ExecuteThread.java:256)
    at weblogic.work.ExecuteThread.run(ExecuteThread.java:221)
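From the Fiddler capture and the information logged on the server, we could determine that the server failed while returning the response: the connection was reset/closed while the response was being written. I went through some reference material: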
http://stackoverflow.com/questions/62929/java-net-socketexception-connection-reset
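In this situation we can be confident that the connection was killed by something external, so I wondered whether the customer was using a cluster and load-balancing settings. A test against a single server without a load balancer showed no problem. With a single node behind the load balancer, the problem kept reproducing. That confirmed the load balancer configuration was at fault. The customer uses BIG-IP, so I looked up the relevant material online: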
https://support.f5.com/kb/en-us/solutions/public/7000/600/sol7606.html
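This article explains the relevant timeout settings well. The default value is 300 seconds, which explains why any request that takes longer than 5 minutes fails.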
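The timeout reasoning can be written down as a small check. This is only a minimal sketch: the class and method names (`IdleTimeoutCheck`, `wouldBeReset`) are made up for illustration, and it assumes the connection stays completely silent while the server is processing the request. How BIG-IP treats a connection that is idle for exactly the timeout value is not covered here, so the strict comparison is an assumption.

```java
import java.time.Duration;

class IdleTimeoutCheck {
    // Default idle timeout mentioned in the text (300 seconds).
    static final Duration DEFAULT_IDLE_TIMEOUT = Duration.ofSeconds(300);

    /**
     * Returns true if a connection that stays silent for `silence`
     * exceeds the given idle timeout. Hypothetical helper; the strict
     * "greater than" boundary is an assumption.
     */
    static boolean wouldBeReset(Duration silence, Duration idleTimeout) {
        return silence.compareTo(idleTimeout) > 0;
    }

    public static void main(String[] args) {
        // A 4-minute request stays under the 300-second default.
        System.out.println(wouldBeReset(Duration.ofMinutes(4), DEFAULT_IDLE_TIMEOUT)); // false
        // A 6-minute request exceeds it, matching the failures seen above.
        System.out.println(wouldBeReset(Duration.ofMinutes(6), DEFAULT_IDLE_TIMEOUT)); // true
    }
}
```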
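We therefore asked the customer to change the TCP protocol timeout to 1 hour and test again. The problem was resolved. Note that the HTTP timeout setting is based on the TCP one.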
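For this kind of retest, a deliberately slow endpoint can make the behavior easy to reproduce, both directly against the server and through the load balancer. The following is a sketch under assumptions, not the setup used in this case: it relies on the JDK's built-in `com.sun.net.httpserver.HttpServer` rather than WebLogic, and the class name `SlowEndpoint`, the `/slow` path, the `delay` query parameter, and port 8080 are all made up for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

class SlowEndpoint {
    /** Parses "delay=<seconds>" from a query string; returns 0 if absent, negative, or malformed. */
    static long parseDelaySeconds(String query) {
        if (query == null) {
            return 0;
        }
        for (String pair : query.split("&")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2 && kv[0].equals("delay")) {
                try {
                    return Math.max(0, Long.parseLong(kv[1]));
                } catch (NumberFormatException e) {
                    return 0;
                }
            }
        }
        return 0;
    }

    /** Starts a server whose /slow handler sends nothing until the delay has elapsed. */
    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/slow", exchange -> {
            long delay = parseDelaySeconds(exchange.getRequestURI().getQuery());
            try {
                // Keep the connection silent, like a long-running request.
                Thread.sleep(delay * 1000L);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            byte[] body = ("done after " + delay + "s\n").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        start(8080); // hypothetical port
        System.out.println("Request /slow?delay=310 directly and through the load balancer");
    }
}
```

Requesting `/slow?delay=310` once against the server directly and once through the load balancer should show whether an intermediate device drops connections that stay idle for more than 5 minutes.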
The following BIG-IP objects have idle time-out values:
Protocol profiles
OneConnect profile
SNATs
NATs