Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Post Fast reboot static route configured is missing #6225

Closed

Conversation

sudhanshukumar22
Copy link
Contributor

- Why I did it
Post Fast reboot static route configured is missing.

Log:
Prior To fast reboot:
2020-11-03 07:48:28,726 T0000: INFO [D1-H3] FCMD: show ip route
2020-11-03 07:48:28,828 T0000: INFO [D1-H3] Codes: K - kernel route, C - connected, S - static, R - RIP,
2020-11-03 07:48:28,828 T0000: INFO [D1-H3] O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
2020-11-03 07:48:28,828 T0000: INFO [D1-H3] T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
2020-11-03 07:48:28,828 T0000: INFO [D1-H3] F - PBR, f - OpenFabric,
2020-11-03 07:48:28,828 T0000: INFO [D1-H3] > - selected route, * - FIB route, q - queued route, r - rejected route
2020-11-03 07:48:28,829 T0000: INFO [D1-H3]
2020-11-03 07:48:28,829 T0000: INFO [D1-H3] K>* 0.0.0.0/0 [0/202] via 10.59.128.1, eth0, 00:02:04
2020-11-03 07:48:28,829 T0000: INFO [D1-H3] C>* 10.10.10.0/24 is directly connected, Ethernet120, 00:01:34
2020-11-03 07:48:28,829 T0000: INFO [D1-H3] C>* 10.59.128.0/20 is directly connected, eth0, 00:02:04
2020-11-03 07:48:28,829 T0000: INFO [D1-H3] S>* 20.20.20.0/24 [1/0] via 10.10.10.2, Ethernet120, 00:01:23
2020-11-03 07:48:28,829 T0000: INFO [D1-H3] C>* 192.168.12.0/24 is directly connected, Ethernet121, 00:00:29
2020-11-03 07:48:28,829 T0000: INFO [D1-H3] sonic#

Post Fast Reboot:
2020-11-03 07:51:40,228 T0000: INFO [D1-H3] FCMD: show ip route
2020-11-03 07:51:40,329 T0000: INFO [D1-H3] Codes: K - kernel route, C - connected, S - static, R - RIP,
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] F - PBR, f - OpenFabric,
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] > - selected route, * - FIB route, q - queued route, r - rejected route
2020-11-03 07:51:40,330 T0000: INFO [D1-H3]
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] K>* 0.0.0.0/0 [0/202] via 10.59.128.1, eth0, 00:02:08
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] C>* 10.10.10.0/24 is directly connected, Ethernet120, 00:00:58
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] C>* 10.59.128.0/20 is directly connected, eth0, 00:02:08
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] C>* 192.168.12.0/24 is directly connected, Ethernet121, 00:00:58
2020-11-03 07:51:40,331 T0000: INFO [D1-H3] sonic#
- How I did it
Allow zebra socket to come up first before static daemon can start. This is because if static daemon comes up before or at the same time as zebra, zebra misses the static route add message from static daemon. This causes static route to be not present in zebra. This is easily seen during reboot (after config save in FRR).
- How to verify it

- Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

@sudhanshukumar22 sudhanshukumar22 force-pushed the docker_frr_changes branch 2 times, most recently from 855e573 to 7ba63c0 Compare December 16, 2020 17:30
@ben-gale
Copy link
Collaborator

@pavel-shirshov , @lguohan - This one has been inactive for a week - how do we move towards merge pls?

@sudhanshukumar22
Copy link
Contributor Author

Hi all,
There are no NTP changes in my defect. can anyone suggest why NTP testcase is failing here ?

@ben-gale
Copy link
Collaborator

retest vsimage please

@lguohan lguohan requested a review from shi-su January 20, 2021 22:36
@lguohan
Copy link
Collaborator

lguohan commented Jan 20, 2021

does it also happen on the normal reboot? from the description it seems it can happen on normal reboot as well.

@lguohan
Copy link
Collaborator

lguohan commented Jan 20, 2021

@shi-su , we have solve a race condition between zebra and staticd in #6478 , @sudhanshukumar22, can you look at that pr and see if we can the same approach to solve the problem.

@sudhanshukumar22
Copy link
Contributor Author

does it also happen on the normal reboot? from the description it seems it can happen on normal reboot as well.

@lguohan , yes, it is a basic testcase and will happen with normal reboot as well. here, we have to create the static route, save the FRR config and reboot. After reboot, the static route present in FRR config will not be replayed.

@lguohan
Copy link
Collaborator

lguohan commented Jan 21, 2021

can you look the other people and use the same approach

-t /usr/share/sonic/templates/unisolate.j2,/usr/sbin/bgp-unisolate \
"

CFGGEN_PARAMS_UNIFIED=" \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is unrelated change, should be move to a different pr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I will move it to another PR. This change was done to support split mode along with separate mode. The default is still separate mode.

rm -f /etc/frr/bgpd.conf /etc/frr/zebra.conf /etc/frr/staticd.conf
elif [ "$CONFIG_TYPE" == "split" ]; then
echo "service integrated-vtysh-config" > /etc/frr/vtysh.conf
sonic-cfggen $CFGGEN_PARAMS_SPLIT
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated changes. can you pull to a different pr?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -39,6 +39,17 @@ stderr_logfile=syslog
dependent_startup=true
dependent_startup_wait_for=rsyslogd:running

[program:zsocket]
command=/usr/bin/zsocket.sh
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shi-su , i think the problem you discovered is a generic problem. i think it is better that you pull you script as a delicated script and wait for zebra socket available, and then both bgpd, staticd or other daemons shoud wait for that script to exit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. I will rewrite my script as a separate one to wait for zebra to be available and then let the daemons wait for it to exit similar to the code changes here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I drafted a PR to pull the script #6519. It took some ideas in this PR, thanks to @sudhanshukumar22. Maybe we can stick to that script. There are some other changes in this PR, maybe they can be rebased onto #6519.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ben-gale to comment

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will review that code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since @shi-su has done the changes for wait script, I will remove this. I will put my changes on top of them.

@lguohan
Copy link
Collaborator

lguohan commented Jan 21, 2021

@sudhanshukumar22, can you take a look at #6519 ?

@@ -0,0 +1,13 @@
#!/usr/bin/env bash
secs=30
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need to wait this long?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had put max wait. It will not matter since the script will exit as soon as zebra socket is available.

@@ -18,7 +18,8 @@ RUN apt-get update && \
iproute2 \
libjson-c3 \
logrotate \
libunwind8
libunwind8 \
net-tools
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is net-tools still necessary here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one doubt. I am deleting the /etc/sonic/config_db.json and rebooting the system. After reboot, I expect that config db and vtysh config should be empty. But I see that we still have BGP entries in redis DB. e.g.",
router bgp 65100
neighbor 10.0.0.1 remote-as 65200"
Can you suggest where from these entries are coming after reboot. It is expected that when we delete config_db.json and rebooting, the redis DB should also become empty.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you check if /etc/sonic/minigraph.xml exists? I think if config_db.json does not exist, it will try to recover the configuration from minigraph.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have checked this file. It is not present in /etc/sonic, but at some other place. But even after deleting the file and deleting /etc/sonic/config_db.json and reboot, I still see that all BGP configurations are retained. I don't know from where this config is getting replayed. I am using default mode (separated).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything (e.g., config_db.json, minigraph.xml) in /host/old_config?

@@ -49,6 +50,7 @@ RUN apt-get clean -y && \

COPY ["frr", "/usr/share/sonic/templates"]
COPY ["docker_init.sh", "/usr/bin/"]
COPY ["zsocket.sh", "/usr/bin/"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is already added a few lines later.

@@ -29,7 +29,7 @@ stderr_logfile=syslog
dependent_startup=true

[program:zebra]
command=/usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M fpm -M snmp
command=/usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -K 300 -r -M fpm -M snmp --v6-rr-semantics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please help clarify why these changes are necessary on this topic? What will happen without these changes?

@@ -25,7 +25,7 @@
def do_work():
""" Main function """
frr = FRR(["bgpd", "zebra", "staticd"])
frr.wait_for_daemons(seconds=20)
frr.wait_for_daemons(seconds=60)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to wait for this long? Is there a scenario that the 20-second is not enough?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants