Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved support for p2p tunnels #1025

Merged
merged 4 commits into from
Mar 17, 2020
Merged

Improved support for p2p tunnels #1025

merged 4 commits into from
Mar 17, 2020

Conversation

bandaru-viswanath
Copy link
Contributor

@bandaru-viswanath bandaru-viswanath commented Dec 15, 2019

As of SAI version 1.5 the “tunnel” has a p2mp connotation. It holds the VTEP SIP whereas there is no DIP. The DIP is specified as part of the FDB entry or as part of Next Hop entry.

It is proposed to add the DIP as part of the sai_tunnel_attr_t structure as an optional parameter.

Details are available in the documentation markdown file, which is part of the PR.

As of SAI version 1.5 the “tunnel” has a p2mp connotation. It holds the VTEP SIP whereas there is no DIP.  The DIP is specified as part of the FDB entry or as part of Next Hop entry.

It is proposed to add the DIP as part of the sai_tunnel_attr_t structure as an optional parameter.

The following are the motivations for introducing a DIP in the tunnel structure and to model it as a p2p entity. 

1: To support Head End Replication as the method for handling BUM traffic. 
2: To support per remote IP and per remote IP+VNI  Tx and Rx counters. 
3: To support the notion of operational status per remote IP. 
4: To flush FDB per remote IP. 
5: Learning enable/disable per tunnel. 

Head End Replication
--------------------

All the remote members as well as the local members are part of a broadcast domain. 

In a generic scenario the following cases are applicable. 
Forwarding from local to remote members, remote to local members, local to local members should be allowed. 
Forwarding from remote to remote members should not be allowed. 
In a DCI case forwarding from the DCI tunnels to intra-DC tunnels and vice versa should be allowed. 
Forwarding from one remote to another remote need to be controlled by per member configuration like it being hub/spoke or a generic split horizon group.  
The remote end points can be of any encapsulation like VXLAN, MPLS, L2GRE etc and a single broadcast domain could have local members as well as remote members of different encapsulations.

To handle all these scenarios it is proposed to model the remote members on similar lines and as a point to point entity. 

For a dot1q bridge, 
A tunnel is created for each remote passing the DIP as the newly introduced attribute. 
A bridge_port is created with the following attributes.
Bridge port type as SAI_BRIDGE_PORT_TYPE_TUNNEL
SAI_BRIDGE_PORT_ATTR_TUNNEL_ID as the tunnel created above.
SAI_BRIDGE_PORT_ATTR_BRIDGE_ID as the default dot1q bridge. 
SAI_BRIDGE_PORT_ATTR_ISOLATION_GROUP can be set appropriately depending on the application. 
vlan_member objects are created with bridge port attribute as the above created entity. 

For a dot1d bridge, the steps are similar except that the bridge port is created for each dot1d bridge, using the same tunnel created. There are no vlan_members created as the bridgeports serve that purpose.  

Per Remote IP Tx and Rx Counters
--------------------------------

By modeling the tunnel as a p2p entity the sai_get_tunnel_stats_ext_fn can be used to fetch the per remote IP stats. 

Per VNI Per Remote IP Tx and Rx counters

In a dot1d bridge case the bridgeport counters can also be used to fetch per remote VNI, per remote IP counters. 

Per Remote IP Operational Status
-----------------------------

With a p2p modelling it is possible to associate the notion of an operational status. The operational status could be based on the underlay IP reachability to the remote IP, other device specific constraints or resource availability considerations. 

Per Remote IP DB Flush
----------------------

With a p2p modelling it is possible to flush the FDB entries associated with a remote IP. The SAI_FDB_FLUSH_ATTR_BRIDGE_PORT_ID can be re-used for tunnel bridgeports when modelled as p2p.

The flush might happen, 
as a result of operational status going down as above. 
It could be admin initiated. 
Could happen due to a VXLAN BFD session going down. 

Per Remote IP Learning enable/disable
-------------------------------------
 
One more benefit is to enable learning enable/disable per tunnel bridge port. Learning will need to be enabled for static tunnels whereas disabled for EVPN tunnels.
@marian-pritsak
Copy link
Contributor

Can you please move the description to an .md file for inline comments?

@bandaru-viswanath
Copy link
Contributor Author

Can you please move the description to an .md file for inline comments?

Done. The markdown is now part of the PR.

@@ -0,0 +1,61 @@
# Introduction
As of SAI version 1.5 the “tunnel” has a p2mp connotation. It holds the VTEP SIP whereas there is no DIP. The DIP is specified as part of the FDB entry or as part of Next Hop entry.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How can P2P tunnel model be applied to Next Hop entry?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SAI nexthop can refer to the p2p tunnel id, tunnel vni and MAC just as it does today for p2mp.
However it is not the expectation to replace the p2mp tunnel. It is expected that nexthop programming and FDB continue to use the p2mp version.


In a dot1d bridge case the bridgeport counters can also be used to fetch per remote VNI, per remote IP counters.

# Per Remote IP Operational Status
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description seems too vague. If there is not enough resources, SAI should return error on creation. What attribute is proposed to be used for that?

Copy link

@srj102 srj102 Dec 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mechanism to declare an operational status as up or down is vendor dependent. Hence seems a little vague.
It represents the data forwarding capability towards the remote IP over VXLAN.
For resource failures which are part of the tunnel create it is possible to return the appropriate error code. However
there could be resource failures beyond tunnel like the underlay IP which again is vendor specific which needs to be
exposed in a vendor neutral way.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reflects the h/w programming/fwding status of the corresponding tunnel. Hence, the SAI will indicate whether this specific tunnel is programmed (and traffic can be forwarded) in the hardware.


The following are the motivations for introducing a DIP in the tunnel structure and to model it as a p2p entity.

- Support Head End Replication as the method for handling BUM traffic.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an API to configure the flood vector for BUM traffic:

// add tunnel endpoints 1.1.1.2, 1.1.1.3, 1.1.1.4 to flood vector
// vlan, tunnel, tunnel_bridge_port objects are pre-created

sai_object_id_t l2mc_group;
sai_object_id_t member2, member3, member4;

sai_attribute_t attr;
vector<sai_attribute_t> attrs;

// create L2MC group

sai_l2mc_group_api->cerate_l2mc_group(&l2mc_group, g_switch_id, attrs.size(), attrs.data());

// create L2MC group members

attr.id = SAI_L2MC_GROUP_MEMBER_ATTR_L2MC_GROUP_ID;
attr.value.oid = l2mc_group;
attrs.push_back(attr);

attr.id = SAI_L2MC_GROUP_MEMBER_ATTR_L2MC_OUTPUT_ID;
attr.value.oid = tunnel_bridge_port;
attrs.push_back(attr);

attr.id = SAI_L2MC_GROUP_MEMBER_ATTR_L2MC_ENDPOINT_IP;
attr.value.ipaddr.addr = 0x01010102;
attrs.push_back(attr);

sai_l2mc_group_api->cerate_l2mc_group_member(&member2, g_switch_id, attrs.size(), attrs.data());

attrs.clear();

attr.id = SAI_L2MC_GROUP_MEMBER_ATTR_L2MC_GROUP_ID;
attr.value.oid = l2mc_group;
attrs.push_back(attr);

attr.id = SAI_L2MC_GROUP_MEMBER_ATTR_L2MC_OUTPUT_ID;
attr.value.oid = tunnel_bridge_port;
attrs.push_back(attr);

attr.id = SAI_L2MC_GROUP_MEMBER_ATTR_L2MC_ENDPOINT_IP;
attr.value.ipaddr.addr = 0x01010103;
attrs.push_back(attr);

sai_l2mc_group_api->cerate_l2mc_group_member(&member3, g_switch_id, attrs.size(), attrs.data());

attrs.clear();

attr.id = SAI_L2MC_GROUP_MEMBER_ATTR_L2MC_GROUP_ID;
attr.value.oid = l2mc_group;
attrs.push_back(attr);

attr.id = SAI_L2MC_GROUP_MEMBER_ATTR_L2MC_OUTPUT_ID;
attr.value.oid = tunnel_bridge_port;
attrs.push_back(attr);

attr.id = SAI_L2MC_GROUP_MEMBER_ATTR_L2MC_ENDPOINT_IP;
attr.value.ipaddr.addr = 0x01010104;
attrs.push_back(attr);

sai_l2mc_group_api->cerate_l2mc_group_member(&member4, g_switch_id, attrs.size(), attrs.data());

// set the L2MC group on a vlan

attr.id = SAI_VLAN_ATTR_UNKNOWN_UNICAST_FLOOD_CONTROL_TYPE;
attr.value.oid = SAI_VLAN_FLOOD_CONTROL_TYPE_COMBINED;

sai_vlan_api->set_vlan_attribute(vlan, &attr);

attr.id = SAI_VLAN_ATTR_UNKNOWN_UNICAST_FLOOD_GROUP;
attr.value.oid = l2mc_group;

sai_vlan_api->set_vlan_attribute(vlan, &attr);

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Is the output id per remote ip or 1 per source ip ?
  2. How would we achieve split horizon with this approach ?
  3. The API assumes a separate flood group for vxlan side members and a separate one for access side members.. How would a device with single flooding group handle the following
    cases without burning up unnecessary flood groups ?
  • addition of first tunnel member after the access members have been added
  • removal of last tunnel member after the access members have been added.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can also be done per remote IP
For the sake of conserving L2MC groups, the "read-modify-write" operation is also an option

/*****************************************************
 * Get flood vector on VLAN
 *****************************************************/

attr.id = SAI_VLAN_ATTR_UNKNOWN_UNICAST_FLOOD_GROUP;

sai_vlan_api->get_vlan_attribute(vlan, &attr);

sai_object_it l2mc_group = attr.value.oid;

/*****************************************************
 * create L2MC group members
 *****************************************************/
 
sai_object_id_t member2, member3, member4;


attr.id = SAI_L2MC_GROUP_MEMBER_ATTR_L2MC_GROUP_ID;
attr.value.oid = l2mc_group;
attrs.push_back(attr);

attr.id = SAI_L2MC_GROUP_MEMBER_ATTR_L2MC_OUTPUT_ID;
attr.value.oid = bridge_port2;
attrs.push_back(attr);

attr.id = SAI_L2MC_GROUP_MEMBER_ATTR_L2MC_ENDPOINT_IP;
attr.value.ipaddr.addr = 1.1.1.2;
attrs.push_back(attr);

sai_l2mc_group_api->cerate_l2mc_group_member(&member2, g_switch_id, attrs.size(), attrs.data());

attr.id = SAI_L2MC_GROUP_MEMBER_ATTR_L2MC_GROUP_ID;
attr.value.oid = l2mc_group;
attrs.push_back(attr);

attr.id = SAI_L2MC_GROUP_MEMBER_ATTR_L2MC_OUTPUT_ID;
attr.value.oid = bridge_port3;
attrs.push_back(attr);

attr.id = SAI_L2MC_GROUP_MEMBER_ATTR_L2MC_ENDPOINT_IP;
attr.value.ipaddr.addr = 1.1.1.3;
attrs.push_back(attr);

sai_l2mc_group_api->cerate_l2mc_group_member(&member3, g_switch_id, attrs.size(), attrs.data());

attr.id = SAI_L2MC_GROUP_MEMBER_ATTR_L2MC_GROUP_ID;
attr.value.oid = l2mc_group;
attrs.push_back(attr);

attr.id = SAI_L2MC_GROUP_MEMBER_ATTR_L2MC_OUTPUT_ID;
attr.value.oid = bridge_port4;
attrs.push_back(attr);

attr.id = SAI_L2MC_GROUP_MEMBER_ATTR_L2MC_ENDPOINT_IP;
attr.value.ipaddr.addr = 1.1.1.4;
attrs.push_back(attr);

sai_l2mc_group_api->cerate_l2mc_group_member(&member4, g_switch_id, attrs.size(), attrs.data());

Comment on lines +9 to +12
- Support per remote IP and per remote IP+VNI Tx and Rx counters.
- Support the notion of operational status per remote IP.
- Support flushing FDB per remote IP.
- Support learning enable/disable per tunnel.
Copy link
Contributor

@marian-pritsak marian-pritsak Jan 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these 4 items above, a P2P tunnel can be created with the current API as well. By creating multiple tunnels with the same SRC_IP, same functionality will be achieved.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please elaborate with examples ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see below

sai_attribute_t attr;
vector<sai_attribute_t> attrs;

/*****************************************************
 * Create tunnel for each endpoint
 *****************************************************/

sai_object_id_t tunnel2, tunnel3, tunnel4;

attr.id = SAI_TUNNEL_ATTR_TYPE;
attr.value.s32 = SAI_TUNNEL_TYPE_VXLAN;
attrs.push_back(attr);

attr.id = SAI_TUNNEL_ATTR_ENCAP_SRC_IP;
attr.value.ipaddr = 1.1.1.1;
attrs.push_back(attr);

sai_tunnel_api->create_tunnel(&tunnel2, gSwitchId, attrs.size(), attrs.data());
sai_tunnel_api->create_tunnel(&tunnel3, gSwitchId, attrs.size(), attrs.data());
sai_tunnel_api->create_tunnel(&tunnel4, gSwitchId, attrs.size(), attrs.data());

/*****************************************************
 * Create isolation group
 *****************************************************/

sai_object_id_t isolation_group;
 
attr.id = SAI_ISOLATION_GROUP_ATTR_TYPE;
attr.value.s32 = SAI_ISOLATION_GROUP_TYPE_BRIDGE_PORT;
attrs.push_back(attr);

sai_isolation_api->create_isolation_group(&isolation_group, gSwitchId, attrs.size(), attrs.data());

/*****************************************************
 * Create bridge ports for each tunnel
 *****************************************************/
 
sai_object_id_t bridge_port2, bridge_port3, bridge_port4;
 
attr.id = SAI_BRIDGE_PORT_ATTR_TYPE;
attr.value.s32 = SAI_BRIDGE_PORT_TYPE_TUNNEL;
attrs.push_back(attr);

attr.id = SAI_BRIDGE_PORT_ATTR_ADMIN_STATE;
attr.value.booldata = true;
bpt_attrs.push_back(attr);

attr.id = SAI_BRIDGE_PORT_ATTR_FDB_LEARNING_MODE;
attr.value.s32 = SAI_BRIDGE_PORT_FDB_LEARNING_MODE_ENABLE;
attrs.push_back(attr);

attr.id = SAI_BRIDGE_PORT_ATTR_ISOLATION_GROUP;
attr.value.oid = isolation_group;
attrs.push_back(attr);

attr.id = SAI_BRIDGE_PORT_ATTR_TUNNEL_ID;
attr.value.oid = tunnel2;
attrs.push_back(attr);

sai_bridge_api->create_bridge_port(&bridge_port2, gSwitchId, attrs.size(), attrs.data());

attrs.pop_back()
attr.id = SAI_BRIDGE_PORT_ATTR_TUNNEL_ID;
attr.value.oid = tunnel3;
attrs.push_back(attr);

sai_bridge_api->create_bridge_port(&bridge_port3, gSwitchId, attrs.size(), attrs.data());

attrs.pop_back()
attr.id = SAI_BRIDGE_PORT_ATTR_TUNNEL_ID;
attr.value.oid = tunnel4;
attrs.push_back(attr);

sai_bridge_api->create_bridge_port(&bridge_port4, gSwitchId, attrs.size(), attrs.data());

/*****************************************************
 * Add bridge ports to isolation_group
 *****************************************************/
 
sai_object_id_t isolation_group_member2, isolation_group_member3, isolation_group_member4;
 
attr.id = SAI_ISOLATION_GROUP_MEMBER_ATTR_GROUP_ID;
attr.value.oid = isolation_group;
attrs.push_back(attr);

attr.id = SAI_ISOLATION_GROUP_MEMBER_ATTR_ISOLATION_OBJECT;
attr.value.oid = bridge_port2;
attrs.push_back(attr);

sai_isolation_api->create_isolation_group_member(&isolation_group_member2, gSwitchId, attrs.size(), attrs.data());

attrs.pop_back()
attr.id = SAI_ISOLATION_GROUP_MEMBER_ATTR_ISOLATION_OBJECT;
attr.value.oid = bridge_port3;
attrs.push_back(attr);

sai_isolation_api->create_isolation_group_member(&isolation_group_member3, gSwitchId, attrs.size(), attrs.data());

attrs.pop_back()
attr.id = SAI_ISOLATION_GROUP_MEMBER_ATTR_ISOLATION_OBJECT;
attr.value.oid = bridge_port4;
attrs.push_back(attr);

sai_isolation_api->create_isolation_group_member(&isolation_group_member4, gSwitchId, attrs.size(), attrs.data());

/*****************************************************
 * Get counters for a remote IP
 *****************************************************/
 
sai_tunnel_stat_t stat_ids[] = { SAI_TUNNEL_STAT_IN_PACKETS, SAI_TUNNEL_STAT_OUT_PACKETS }
uint64_t stats[2];

sai_tunnel_api->get_tunnel_stats(tunnel2, 2, stat_ids, stats);

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The get attribute call will return an oid which is not created by an application. In the case of SONiC this will mean that there has to be a VID created in syncd as part of the get call specifically for this get attribute call. This just complicates the application in addition to book keeping associated with the L2MC group.
As has been elaborated in the requirements it is clear that the L2 VXLAN applications will need to go with a P2P model as it is a superset and handles all the requirements. It would be better to use P2MP model for L3 and P2P instead of allowing for both the models.
The need to use an L2MC group to specify the flood vector is driven by the fact that we support a P2MP model for L2VXLAN. If we agree upon using the P2P model for L2VXLAN then the need to specify the multicast flood vector no longer arises.

The changes suggested in this PR and the use of VLAN_MEMBER_PORT for dot1q and bridge port for dot1d bridges will suffice to specify the flood vector.

If there is concern in adding DESTIP and P2P/P2MP flags to the tunnel object as the tunnel terminator has these already then allowing for a new bridge port type of type tunnel_terminator would also achieve the goal of specifying the flood vector without having to explicitly use the L2MC group.

@lguohan lguohan merged commit 4e42a02 into opencomputeproject:master Mar 17, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants