Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: upstream nodes was not updated correctly by service discover #9805

Closed
ZhangShangyu opened this issue Jul 9, 2023 · 10 comments
Closed
Assignees
Labels
bug Something isn't working don't close Do not close this issue.

Comments

@ZhangShangyu
Copy link
Contributor

ZhangShangyu commented Jul 9, 2023

Current Behavior

Bug description:

We have implement a service discover plugin to meet this requirement:

  1. get service id from host subdomain, then set upstream to the ip of this service id
  2. the host subdomain is dynamic so we cannot enumerate all host and config a detail route for it

for example:
a.myhost.com should route to ip A
b.myhost.com should route to ip B
c.myhost.com should route to ip C
and so on

the service discover plugin demo is like this:

demo_discover.lua

local demo_nodes_tab = {
  a =  {host = "10.0.0.1", port = 80 },
  b =  {host = "10.0.0.2", port = 80 }
  ...
}

function _M.nodes(service_name, discovery_args)
      local host = ngx.var.host
      -- host example: a.myhost.com
      local service_id = host:match("([^.]+).myhost.com")
      return demo_nodes_tab[service_id]
end

then create a service and route for it

service:
{
 "id": "demo_service"
 "name": "demo_service",
  "upstream": {
    "discovery_type": "demo_discover",
    "service_name": "demo_service"
    "pass_host": "pass",
    "scheme": "http",
    "type": "roundrobin"
  },
}

route:
{
  "host": "*.myhost.com",
  "name": "demo_route",
  "service_id": "demo_service"
  "uri": "/*"
}

do some requests to a.myhost.com and b.myhost.com,
magic happened, the upstream ip is not correct! seems like ABA problem

  • some case request a.myhost.com get b.myhost.com ip
  • some case request b.myhost.com get a.myhost.com ip

root cause:

1.upstream_conf is from cache
in the apisix source code
https://github.com/apache/apisix/blob/release/3.4/apisix/init.lua#L631C24-L631C43
https://github.com/apache/apisix/blob/release/3.4/apisix/plugin.lua#L634
if route.value.service_id then will call plugin.merge_service_route
plugin.merge_service_route will use a lrucache by service_conf key and verison

2. cached upstream_conf is used in set_upstream
https://github.com/apache/apisix/blob/release/3.4/apisix/init.lua#L478
https://github.com/apache/apisix/blob/release/3.4/apisix/init.lua#L478

3. compare cached upstream_conf original_node and service discover new nodes
https://github.com/apache/apisix/blob/release/3.4/apisix/upstream.lua#L277
https://github.com/apache/apisix/blob/release/3.4/apisix/utils/upstream.lua#L37

4. update upstream_conf by compare result
https://github.com/apache/apisix/blob/release/3.4/apisix/upstream.lua#L285
https://github.com/apache/apisix/blob/release/3.4/apisix/upstream.lua#L336
if original_nodes is same as new nodes, it will fill_node_info to old cached upstream_conf, everything is ok

if original_node is not the same as new_nodes, update nodes to new nodes, then clone a new_up_conf,
after this it will fill_node_info to new_up_conf
the bug cause is here, the old cached upstream_conf.original_nodes is not updated by fill_node_info, because new_up_conf is another address by clone

for example

  1. request a.myhost.com firstly, then nodes in cache is ip A, and original_nodes is ip A
  2. then request b.myhost.com, compare new discover ip B with original_nodes ip A is not same, then update nodes to ip B, but the cached original_nodes is not updated, still ip A
  3. request a.myhost.com again compare new discover ip A with original_nodes ip A is the same! the nodes ip will not be update to new discover ip A, so the get a wrong upstream nodes ip B!

solution

call fill_node_info before clone to new_up_conf, so the cached up_conf original_node will update

Expected Behavior

No response

Error Logs

No response

Steps to Reproduce

  1. implement the demo_discover, add init.lua and schema.lua to apisix/discover/demo_discover/ directory

init.lua

local core = require("apisix.core")
local ngx = ngx
local log = core.log

local _M = {
    version = 0.1,
}

local demo_nodes_tab = {
    a = { host = "127.0.0.1", port = 1111 },
    b = { host = "127.0.0.1", port = 2222 }
}

function _M.nodes(service_name, discovery_args)
    local host = ngx.var.host
    -- host example: a.myhost.com b.myhost.com

    local service_id = host:match("([^.]+).myhost.com")
    local demo_node = demo_nodes_tab[service_id]

    local nodes = core.table.new(1, 0)
    core.table.insert(nodes, {
        host = demo_node.host,
        port = tonumber(demo_node.port),
        weight = 100,
    })

    return nodes
end

function _M.init_worker()
    log.warn("demo_discover init")
end

function _M.dump_data()
    return { config = 'demo_discover' }
end

return _M

schema.lua

return {
    type = "object",
    properties = {
        enable = { type = "boolean" },
    }
}
image
  1. config the config.yaml and apisix.yaml

config.yaml config nginx work to 1 and enable demo_discover plugin

nginx_config:
  worker_processes: 1
discovery:
  demo_discover:
    enable: true

apisix.yaml config route and service

routes:
  - id: demo_route
    name: demo_route
    uri: /*
    hosts:
      - '*.myhost.com'
    service_id: demo_service

services:
  - id: demo_service
    name: demo_service
    upstream:
      discovery_type: 'demo_discover'
      service_name: 'demo_service'
      pass_host: 'pass'
      schema: 'http'
      type: 'roundrobin'
  1. do 5 times request with this order
curl 'http://localhost:9080' -H "host: a.myhost.com"
curl 'http://localhost:9080' -H "host: a.myhost.com"
curl 'http://localhost:9080' -H "host: b.myhost.com"
curl 'http://localhost:9080' -H "host: b.myhost.com"
curl 'http://localhost:9080' -H "host: a.myhost.com"
  1. check the access log
172.17.0.1 - - [10/Aug/2023:16:12:49 +0000] a.myhost.com "GET / HTTP/1.1" 502 229 0.002 "-" "curl/7.86.0" 127.0.0.1:1111 502 0.003 "http://a.myhost.com"
172.17.0.1 - - [10/Aug/2023:16:12:51 +0000] a.myhost.com "GET / HTTP/1.1" 502 229 0.001 "-" "curl/7.86.0" 127.0.0.1:1111 502 0.001 "http://a.myhost.com"
172.17.0.1 - - [10/Aug/2023:16:12:55 +0000] b.myhost.com "GET / HTTP/1.1" 502 229 0.001 "-" "curl/7.86.0" 127.0.0.1:2222 502 0.001 "http://b.myhost.com"
172.17.0.1 - - [10/Aug/2023:16:12:56 +0000] b.myhost.com "GET / HTTP/1.1" 502 229 0.001 "-" "curl/7.86.0" 127.0.0.1:2222 502 0.001 "http://b.myhost.com"
172.17.0.1 - - [10/Aug/2023:16:13:00 +0000] a.myhost.com "GET / HTTP/1.1" 502 229 0.006 "-" "curl/7.86.0" 127.0.0.1:2222 502 0.006 "http://a.myhost.com"

you can see the fifth time access log a.myhost.com get wrong upstream 127.0.0.1:2222

Environment

  • APISIX version 2.15.3
@Sn0rt
Copy link
Contributor

Sn0rt commented Jul 20, 2023

thx your PR

@shreemaan-abhishek
Copy link
Contributor

Could you please share your whole demo_discover.lua? I am finding it difficult to setup the custom discovery logic. It would be even better if you could provide me with steps to setup the discovery logic.

@ZhangShangyu
Copy link
Contributor Author

Could you please share your whole demo_discover.lua? I am finding it difficult to setup the custom discovery logic. It would be even better if you could provide me with steps to setup the discovery logic.

ok let me give the detail later

@ZhangShangyu
Copy link
Contributor Author

Could you please share your whole demo_discover.lua? I am finding it difficult to setup the custom discovery logic. It would be even better if you could provide me with steps to setup the discovery logic.

@shreemaan-abhishek hi, i add the detail in Steps to Reproduce, can have a try

@shreemaan-abhishek
Copy link
Contributor

shreemaan-abhishek commented Aug 13, 2023

Thanks, @ZhangShangyu, I was able to reproduce this. But I am unsure how to provide a more straightforward way to test this case.

Can you think of a way to simplify the testing process for PR #9806?

However, the way you have used the service discovery integration is the recommended way to use it

@Sn0rt Sn0rt removed their assignment Aug 24, 2023
@Revolyssup Revolyssup self-assigned this Oct 25, 2023
@shreemaan-abhishek shreemaan-abhishek added the don't close Do not close this issue. label Oct 30, 2023
@sheharyaar
Copy link
Contributor

I will be taking up this task and the PR, @Revolyssup , can you please assign me this ?

@monkeyDluffy6017
Copy link
Contributor

@ZhangShangyu Good job!

@sheharyaar
Copy link
Contributor

sheharyaar commented Dec 28, 2023

I would like to add to the explanation.

  1. When the nodes are different on comparison, the new_nodes are assigned to up_conf.nodes using clone : https://github.com/apache/apisix/blob/release/3.4/apisix/upstream.lua#L277-L285. Here the original_nodes is not changed.

  2. The code then calls set_directly with this updated config (where original nodes has not changed): https://github.com/apache/apisix/blob/release/3.4/apisix/upstream.lua#L61-L79

  3. Here is the set_directly() : https://github.com/apache/apisix/blob/release/3.4/apisix/upstream.lua#L61-L79. This sets the conf in the context. This version is used by merge_service_route to update the cache. Hence, the original_nodes is not updated till now.

  4. After these steps happen, the function calls fill_node_info, but since the context version does not include the updated original_nodes, it is not updated. It is just updated locally in this function call.

Is my explanation correct?? CC : @monkeyDluffy6017 @Sn0rt @shreemaan-abhishek

@sheharyaar
Copy link
Contributor

@shreemaan-abhishek , can this be closed??

@monkeyDluffy6017
Copy link
Contributor

considered resolved

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working don't close Do not close this issue.
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

7 participants