Bug #10646

closed

"SocketException(message='Socket closed'" error at the end of generation with 500 nodes

Added by Nicolas CHARLES almost 7 years ago. Updated over 5 years ago.

Status:
Released
Priority:
N/A
Category:
Web - Config management
Target version:
Severity:
Critical - prevents main use of Rudder | no workaround | data loss | security
UX impact:
User visibility:
Infrequent - complex configurations | third party integrations
Effort required:
Priority:
54
Name check:
Fix check:
Regression:

Description

On a fairly large install (2000 nodes, with 500 nodes accepted at once), I got the following error at the end of generation:

⇨ Can not save (modify) 'cn=Nodes Configuration,ou=Rudder,cn=rudder-configuration': An error occurred while attempting to send the LDAP message to server localhost:389: SocketException(message='Socket closed', trace='socketWrite(SocketOutputStream.java:118) / write(SocketOutputStream.java:155) / write(BufferedOutputStream.java:122) / write(ByteStringBuffer.java:1664) / writeTo(ASN1Buffer.java:851) / sendMessage(LDAPConnectionInternals.java:525) / sendMessage(LDAPConnection.java:4249) / processAsync(ModifyRequest.java:745) / process(ModifyRequest.java:650) / modify(LDAPConnection.java:2745) / $anonfun$modifyAction$3(LDAPConnection.scala:541) / apply(Exception.scala:224) / $anonfun$modifyAction$1(LDAPConnection.scala:541) / applyMod(LDAPConnection.scala:489) / $anonfun$applyModify$2(LDAPConnection.scala:552) / save(LDAPConnection.scala:609) / $anonfun$save$8(NodeConfigurationCacheRepository.scala:451) / flatMap(Box.scala:808) / $anonfun$save$4(NodeConfigurationCacheRepository.scala:447) / flatMap(Box.scala:808) / $anonfun$save$3(NodeConfigurationCacheRepository.scala:446) / $anonfun$flatMap$1(LDAPConnectionProvider.scala:117) / withCon(LDAPConnectionProvider.scala:154) / withCon$(LDAPConnectionProvider.scala:143) / withCon(LDAPConnectionProvider.scala:385) / flatMap(LDAPConnectionProvider.scala:116) / flatMap$(LDAPConnectionProvider.scala:115) / flatMap(LDAPConnectionProvider.scala:385) / save(NodeConfigurationCacheRepository.scala:445) / cacheNodeConfiguration(NodeConfigurationServiceImpl.scala:72) / $anonfun$writeNodeConfigurations$4(DeploymentService.scala:1010) / flatMap(Box.scala:808) / writeNodeConfigurations(DeploymentService.scala:1004) / writeNodeConfigurations$(DeploymentService.scala:992) / writeNodeConfigurations(DeploymentService.scala:480) / $anonfun$deploy$71(DeploymentService.scala:232) / flatMap(Box.scala:808) / $anonfun$deploy$65(DeploymentService.scala:221) / flatMap(Box.scala:808) / $anonfun$deploy$61(DeploymentService.scala:215) / flatMap(Box.scala:808) / $anonfun$deploy$58(DeploymentService.scala:214) / flatMap(Box.scala:808) / $anonfun$deploy$54(DeploymentService.scala:209) / flatMap(Box.scala:808) / $anonfun$deploy$50(DeploymentService.scala:204) / flatMap(Box.scala:808) / $anonfun$deploy$47(DeploymentService.scala:197) / flatMap(Box.scala:808) / $anonfun$deploy$43(DeploymentService.scala:188) / flatMap(Box.scala:808) / $anonfun$deploy$38(DeploymentService.scala:174) / flatMap(Box.scala:808) / $anonfun$deploy$36(DeploymentService.scala:173) / flatMap(Box.scala:808) / $anonfun$deploy$34(DeploymentService.scala:172) / flatMap(Box.scala:808) / $anonfun$deploy$33(DeploymentService.scala:171) / flatMap(Box.scala:808) / $anonfun$deploy$31(DeploymentService.scala:170) / flatMap(Box.scala:808) / $anonfun$deploy$28(DeploymentService.scala:167) / flatMap(Box.scala:808) / $anonfun$deploy$24(DeploymentService.scala:164) / flatMap(Box.scala:808) / $anonfun$deploy$20(DeploymentService.scala:161) / flatMap(Box.scala:808) / $anonfun$deploy$16(DeploymentService.scala:158) / flatMap(Box.scala:808) / $anonfun$deploy$12(DeploymentService.scala:155) / flatMap(Box.scala:808) / $anonfun$deploy$8(DeploymentService.scala:152) / flatMap(Box.scala:808) / $anonfun$deploy$5(DeploymentService.scala:147) / flatMap(Box.scala:808) / $anonfun$deploy$2(DeploymentService.scala:142) / flatMap(Box.scala:808) / deploy(DeploymentService.scala:141) / deploy$(DeploymentService.scala:127) / deploy(DeploymentService.scala:480) / applyOrElse(AsyncDeploymentAgent.scala:350) / 
apply(AbstractPartialFunction.scala:34) / execTranslate(LiftActor.scala:504) / execTranslate$(LiftActor.scala:496) / execTranslate(AsyncDeploymentAgent.scala:342) / proc2(LiftActor.scala:305) / $anonfun$processMailbox$1(LiftActor.scala:227) / apply(JFunction0$mcV$sp.java:12) / around(LiftActor.scala:241) / around$(LiftActor.scala:240) / around(AsyncDeploymentAgent.scala:342) / processMailbox(LiftActor.scala:227) / $anonfun$$bang$3(LiftActor.scala:190) / run(LiftActor.scala:76) / runWorker(ThreadPoolExecutor.java:1142) / run(ThreadPoolExecutor.java:617) / run(Thread.java:745)', revision=24201)

There is no error on the LDAP side in the logs.


Related issues: 2 (0 open, 2 closed)

Related to Rudder - Bug #10645: Proxy error when accepting 500 inventories at the same time (Rejected)
Related to Rudder - Bug #13256: LDAP IO error on generation with a lot of nodes (Released, Vincent MEMBRÉ)
Actions #1

Updated by Nicolas CHARLES almost 7 years ago

  • Related to Bug #10645: Proxy error when accepting 500 inventories at the same time added
Actions #2

Updated by Jonathan CLARKE almost 7 years ago

  • Subject changed from error at the end of generation, after accepting 500 nodes to Error at the end of generation after accepting 500 nodes
  • Severity set to Critical - prevents main use of Rudder | no workaround | data loss | security
  • User visibility changed from Infrequent - complex configurations | third party integrations to Operational - other Techniques | Technique editor | Rudder settings
  • Priority changed from 0 to 60
Actions #3

Updated by Jonathan CLARKE almost 7 years ago

  • User visibility changed from Operational - other Techniques | Technique editor | Rudder settings to Infrequent - complex configurations | third party integrations
  • Priority changed from 60 to 49

Assuming this only happens when 500 nodes are accepted simultaneously, this is an infrequent situation. If it also happens with fewer nodes, say <=100, we'll need to increase the visibility field again.

Actions #4

Updated by Nicolas CHARLES almost 7 years ago

  • Severity deleted (Critical - prevents main use of Rudder | no workaround | data loss | security)
  • Priority changed from 49 to 0

The slapd logs say:

Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: => access_allowed: read access to "nodeId=154ec2c4-a5a7-4c5c-b49a-75d2f56d1c68,ou=Nodes,ou=Accepted Inventories,ou=Inventories,cn=rudder-configuration" "osName" requested
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: <= root access granted
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: => access_allowed: read access granted by manage(=mwrscxd)
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: => access_allowed: result not in cache (osFullName)
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: => access_allowed: read access to "nodeId=154ec2c4-a5a7-4c5c-b49a-75d2f56d1c68,ou=Nodes,ou=Accepted Inventories,ou=Inventories,cn=rudder-configuration" "osFullName" requested
Apr 24 17:01:39 dhcp240 rudder-slapd[10927]: <= root access granted
Apr 24 17:02:09 dhcp240 rudder-slapd[10927]: dnMatch 1024

but the Rudder webapp log says:

[2017-04-24 17:01:43] DEBUG application - [Store Agent Run Times] Scheduled task finished in 595 ms (started at 2017/04/24 17:01:42, finished at 2017/04/24 17:01:43)
[2017-04-24 17:01:45] DEBUG com.normation.rudder.services.policies.PromiseGenerationServiceImpl - Policy generation completed in 242543 ms
[2017-04-24 17:01:45] ERROR com.normation.rudder.batch.AsyncDeploymentAgent$DeployerAgent - Error when updating policy, reason Cannot write configuration node <- Can not save (modify) 'cn=Nodes Configuration,ou=Rudder,cn=rudder-configuration': An error occurred while attempting to send the LDAP message to server localhost:389:  SocketException(message='Socket closed', trace='socketWrite(SocketOutputStream.java:118) / write(SocketOutputStream.java:155)
Actions #5

Updated by Nicolas CHARLES almost 7 years ago

  • Severity set to Critical - prevents main use of Rudder | no workaround | data loss | security
  • Priority changed from 0 to 49
Actions #6

Updated by François ARMAND almost 7 years ago

It is a problem occurring in the underlying LDAP library (UnboundID).

So the connection, for some reason, cannot be established because the socket is closed, and so the request never reaches the LDAP server, which is why the OpenLDAP logs are empty at that point in time.

It happens when we try to acquire a connection in NodeConfigurationCacheRepository.scala, at line 445:

  def save(caches: Set[NodeConfigurationHash]): Box[Set[NodeId]] = {
    val updatedIds = caches.map(_.id)
    for {
      ldap          <- ldapCon  // <========= HERE
      existingEntry <- ldap.get(rudderDit.NODE_CONFIGS.dn)

This code is completely standard; we use it over and over, everywhere. If it were reproducible, I would say that something went wrong with the LDAP connection pool before or during the deployment, and that for some reason the fault happens around the same time (based on the number of open connections?). Or perhaps something during the deployment, just before that action, corrupts the connection pool, perhaps bad data.

I'm afraid I can't make any further assumptions at this point.

So, some questions:

  • does restarting Rudder solve the problem?
  • does cleaning the cache? (but I really believe the problem is not linked to NodeHash, but to something happening before)

The next step is to set the UnboundID logging to debug or trace level and see what the logs tell us.
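For reference, a minimal sketch of one way to do that, assuming the UnboundID SDK's com.unboundid.util.Debug helper (setEnabled/getLogger) and its java.util.logging output; how this would be plugged into Rudder's own logging configuration is not shown:

import java.util.logging.{ConsoleHandler, Level}
import com.unboundid.util.Debug

object EnableUnboundidDebug {
  def main(args: Array[String]): Unit = {
    // Global SDK debug switch; the SDK then emits its traces through java.util.logging.
    Debug.setEnabled(true)
    // Raise the level of the SDK's logger and attach a console handler so the
    // traces are actually printed somewhere visible.
    val sdkLogger = Debug.getLogger()
    sdkLogger.setLevel(Level.FINEST)
    val handler = new ConsoleHandler()
    handler.setLevel(Level.FINEST)
    sdkLogger.addHandler(handler)
    // ... then run the LDAP operations to trace (e.g. the save() call above).
  }
}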

Actions #7

Updated by Vincent MEMBRÉ almost 7 years ago

  • Target version changed from 4.1.2 to 4.1.3
Actions #8

Updated by Vincent MEMBRÉ almost 7 years ago

  • Target version changed from 4.1.3 to 4.1.4
Actions #9

Updated by Vincent MEMBRÉ almost 7 years ago

  • Target version changed from 4.1.4 to 4.1.5
Actions #10

Updated by Alexis Mousset almost 7 years ago

  • Target version changed from 4.1.5 to 4.1.6
  • Priority changed from 49 to 48
Actions #11

Updated by Benoît PECCATTE almost 7 years ago

  • Priority changed from 48 to 64
Actions #12

Updated by Benoît PECCATTE almost 7 years ago

  • Assignee set to François ARMAND

Fixing #10645 will probably solve this one

Actions #13

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 4.1.6 to 4.1.7
  • Priority changed from 64 to 62
Actions #14

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 4.1.7 to 4.1.8
Actions #15

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 4.1.8 to 4.1.9
  • Priority changed from 62 to 60
Actions #16

Updated by Benoît PECCATTE over 6 years ago

  • Priority changed from 60 to 59
Actions #17

Updated by Vincent MEMBRÉ over 6 years ago

  • Target version changed from 4.1.9 to 4.1.10
  • Priority changed from 59 to 58
Actions #18

Updated by Vincent MEMBRÉ about 6 years ago

  • Target version changed from 4.1.10 to 4.1.11
  • Priority changed from 58 to 56
Actions #19

Updated by Vincent MEMBRÉ about 6 years ago

  • Target version changed from 4.1.11 to 4.1.12
  • Priority changed from 56 to 54
Actions #20

Updated by Vincent MEMBRÉ almost 6 years ago

  • Target version changed from 4.1.12 to 4.1.13
Actions #21

Updated by Vincent MEMBRÉ almost 6 years ago

  • Target version changed from 4.1.13 to 4.1.14
Actions #22

Updated by François ARMAND over 5 years ago

One of our users is experiencing the same problem on Rudder 4.1.7.

The server is quite loaded and has > 10k nodes.

Actions #23

Updated by François ARMAND over 5 years ago

I'm able to reproduce it by faking a huge Nodes Configuration entry, by changing this method in NodeConfigurationCacheRepository:

  private[this] def toLdap(nodeConfigs: Set[NodeConfigurationHash]): LDAPEntry = {
    val cc = if(nodeConfigs.nonEmpty) {
      val c = nodeConfigs.head
      for(i <- 0 to 10000) yield c.copy(
        id         = NodeId(c.id.value + i.toString),
        policyHash = c.policyHash ++ (for(j <- 1 to 100) yield PolicyHash(Cf3PolicyDraftId(RuleId("test--dir-" + j), DirectiveId("test-rule-" + j)), j))
      )
    } else {
      Set[NodeConfigurationHash]()
    }
    val caches = (cc ++ nodeConfigs).map { x => write(x) }
    val entry  = rudderDit.NODE_CONFIGS.model
    entry +=! (A_NODE_CONFIG, caches.toSeq:_*)
    entry
  }

Actions #24

Updated by François ARMAND over 5 years ago

I was able to minimize it with this program:

object NodesConfigurationErrorMain {

  val rwLdap =
    new RWPooledSimpleAuthConnectionProvider(
      host = "localhost",
      port = 1389,
      authDn = "cn=Manager,cn=rudder-configuration",
      authPw = "secret",
      poolSize = 2
  )

  def str(size: Int) = new String( Array.fill[Char](size)('x'))

  def main(args: Array[String]): Unit = {
    val dn = new DN("cn=Nodes Configuration,ou=Rudder,cn=rudder-configuration")
    val entry = LDAPEntry(dn)

    entry += new Attribute("objectClass", "top", "nodeConfigurations")
    entry += new Attribute("cn", "Nodes Configuration")

//here, with 454, we get:
//Failure(Can not save (add) 'cn=Nodes Configuration,ou=Rudder,cn=rudder-configuration': An error occurred while attempting to send the LDAP message to server localhost:1389:  SocketException(message='Socket closed', trace='socketWrite(SocketOutputStream.java:118) / write(SocketOutputStream.java:155) / write(BufferedOutputStream.java:122) / write(ByteStringBuffer.java:1664) ....
    val values = for(i <- 0 until 167) yield i + "-"+str(100000) ++ str(453)
    entry += new Attribute("nodeConfig", values:_*)
    println(for {
      ldap <- rwLdap
      _ <- ldap.delete(dn)
      _ <- ldap.save(entry)
    } yield {
      "done" 
    })

  }
}
Actions #25

Updated by François ARMAND over 5 years ago

To be clearer: in the previous code snippets, it is really the total size of the entry that is the problem. I can decrease the size of each value and add more of them, or increase it and add fewer.

Most likely we hit an int overflow somewhere, probably in the ASN.1 encoding or in the actual write of the data to the socket.
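To make the "total size" point concrete, here is a rough back-of-envelope sketch using the numbers from the reproducer above (this only estimates the raw attribute value bytes, not the exact ASN.1-encoded request size):

object EntrySizeEstimate {
  def main(args: Array[String]): Unit = {
    // Numbers taken from the reproducer above: 167 values of about 100000 + 453
    // characters each (the short "<index>-" prefix is ignored here).
    val valueCount    = 167
    val bytesPerValue = 100000L + 453L
    val approxTotal   = valueCount * bytesPerValue
    val approxMiB     = approxTotal.toDouble / (1024 * 1024)
    println(s"approximate total attribute payload: $approxTotal bytes (about ${math.round(approxMiB)} MiB)")
  }
}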

Actions #27

Updated by François ARMAND over 5 years ago

  • Status changed from New to In progress
Actions #28

Updated by François ARMAND over 5 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from François ARMAND to Nicolas CHARLES
  • Pull Request set to https://github.com/Normation/rudder-packages/pull/1626
Actions #29

Updated by François ARMAND over 5 years ago

Actually, the problem is simply that OpenLDAP limits the size of the requests it accepts, to avoid DoS from people sending huge, useless requests.

The fix is to increase "sockbuf_max_incoming_auth" in slapd.conf:

# Specify the maximum incoming LDAP PDU size for authenticated sessions. The default is 4194303.
# This is too little for us, we have big entries (like Nodes Configuration)
sockbuf_max_incoming_auth 1000000000
Actions #30

Updated by Nicolas CHARLES over 5 years ago

  • Status changed from Pending technical review to In progress

I'm taking over this issue!

Actions #31

Updated by Nicolas CHARLES over 5 years ago

  • Status changed from In progress to Pending technical review
  • Assignee changed from Nicolas CHARLES to François ARMAND
  • Pull Request changed from https://github.com/Normation/rudder-packages/pull/1626 to https://github.com/Normation/rudder-packages/pull/1627
Actions #32

Updated by Rudder Quality Assistant over 5 years ago

  • Assignee changed from François ARMAND to Nicolas CHARLES
Actions #33

Updated by François ARMAND over 5 years ago

  • Status changed from Pending technical review to Pending release
Actions #34

Updated by François ARMAND over 5 years ago

So, to sum up the problem:

- OpenLDAP has (like all LDAP servers) a protection against denial-of-service attacks that limits the size of the requests it processes. This is to prevent an attacker from sending big, useless requests with hard-to-decode ASN.1 and making the server burn CPU on them.
- that protection works per "PDU" (protocol data unit), so it is not a direct mapping to request size
- but it explains why we sometimes get the LDAP "Socket closed" error: when too much data comes in, OpenLDAP simply closes the socket to protect itself.
- you only get the following message with OpenLDAP log level "conns trace" (yes, that's why we never saw it: the required log level is extremely deep); a sketch of how to enable it is given after the log below

Jul 23 10:03:19 server rudder-slapd[31057]: connection_get(10): got connid=1002
Jul 23 10:03:19 server rudder-slapd[31057]: connection_read(10): checking for input on id=1002
Jul 23 10:03:19 server rudder-slapd[31057]: ber_get_next on fd 10 failed errno=34 (Numerical result out of range)
Jul 23 10:03:19 server rudder-slapd[31057]: connection_read(10): input error=-2 id=1002, closing.
Jul 23 10:03:19 server rudder-slapd[31057]: connection_closing: readying conn=1002 sd=10 for close
Jul 23 10:03:19 server rudder-slapd[31057]: connection_close: conn=1002 sd=10
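For reference, a minimal sketch of how that log level could be set, assuming a slapd.conf-based setup like rudder-slapd ("conns" and "trace" are standard slapd loglevel keywords):

# Enable connection-level and trace logging (very verbose, for debugging only)
loglevel conns trace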
Actions #35

Updated by François ARMAND over 5 years ago

  • Subject changed from Error at the end of generation after accepting 500 nodes to "SocketException(message='Socket closed'" error at the end of generation with 500 nodes
Actions #36

Updated by Benoît PECCATTE over 5 years ago

  • Status changed from Pending release to Released

This bug has been fixed in Rudder 4.1.14 and 4.3.4 which were released today.

Actions #37

Updated by François ARMAND over 5 years ago

  • Related to Bug #13256: LDAP IO error on generation with a lot of nodes added