Ravi Kiran Eticala · Jun 5, 2020
This article is divided into three parts that discuss the Linux internals of the Ethernet bridge:
- Bridge kernel module
- Adding an interface to the bridge
- Life of a packet inside the bridge
Bridge Kernel Module
In the Linux kernel, the bridge is implemented as a kernel module “bridge”.
$ lsmod | grep bridge
The bridge module is not yet inserted, so lsmod doesn’t show anything about the bridge module.
brctl, bridge, and ip are the utilities to manage the bridge on a Linux system.
$ brctl
Usage: brctl [commands]
commands:
addbr <bridge> add bridge
delbr <bridge> delete bridge
addif <bridge> <device> add interface to bridge
delif <bridge> <device> delete interface from bridge
hairpin <bridge> <port> {on|off} turn hairpin on/off
...
Creating a new bridge interface
# brctl addbr br0
This creates a new network interface br0. ip link or brctl show will list the newly created bridge interface.
# ip link show dev br0
3: br0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 5e:cd:d7:1c:52:6e brd ff:ff:ff:ff:ff:ff
# brctl show br0
bridge name     bridge id               STP enabled     interfaces
br0             8000.000000000000       no
Now let’s look at the lsmod output again.
# lsmod | grep bridge
bridge 135168 0
stp 16384 1 bridge
llc 16384 2 bridge,stp
Now we see that the bridge module and other related modules are inserted. But what inserted these modules automatically? Let us look into it.
$ lsmod | grep bridge
bridge 135168 0
stp 16384 1 bridge
llc 16384 2 bridge,stp
$ rmmod bridge
$ lsmod | grep bridge
$ strace -e trace=socket,ioctl brctl addbr br0
socket(AF_UNIX, SOCK_STREAM, 0) = 3
bridge name bridge id STP enabled interfaces
ioctl(3, SIOCBRADDBR, "br0") = 0
+++ exited with 0 +++
$ lsmod | grep bridge
bridge 135168 0
stp 16384 1 bridge
llc 16384 2 bridge,stp
Here I removed the kernel module “bridge” with rmmod and ran strace to see which system call triggers the insertion of the bridge module. It looks like a socket ioctl with SIOCBRADDBR causes the modules to be inserted into the kernel.
static long sock_ioctl(struct file *file, unsigned cmd, unsigned long arg)
{
    ...
    switch (cmd) {
    case FIOSETOWN:
    ...
    case SIOCGIFBR:
    case SIOCSIFBR:
    case SIOCBRADDBR:
    case SIOCBRDELBR:
        err = -ENOPKG;
        if (!br_ioctl_hook)
            request_module("bridge");

        mutex_lock(&br_ioctl_mutex);
        if (br_ioctl_hook)
            err = br_ioctl_hook(net, cmd, argp);
        mutex_unlock(&br_ioctl_mutex);
        break;
    case SIOCGIFVLAN:
    ...
    return err;
}
A socket ioctl with any of the commands SIOCGIFBR, SIOCSIFBR, SIOCBRADDBR, or SIOCBRDELBR inserts the “bridge” module into the kernel if br_ioctl_hook is NULL. request_module is a macro for __request_module, which takes care of inserting the module.
After __request_module completes, br_ioctl_hook(net, cmd, argp) is called to create the new bridge interface. So br_ioctl_hook must have been set up while the module was being loaded. Now let us look into the bridge kernel module init function to see what is initialized and how br_ioctl_hook is set up.
static int __init br_init(void)
{
    ...
    err = stp_proto_register(&br_stp_proto);
    ...
    err = br_fdb_init();
    ...
    err = register_pernet_subsys(&br_net_ops);
    ...
    err = br_nf_core_init();
    ...
    err = br_netlink_init();
    ...
    brioctl_set(br_ioctl_deviceless_stub);
    ...
}

module_init(br_init)
...
MODULE_ALIAS_RTNL_LINK("bridge");

static const struct stp_proto br_stp_proto = {
    .rcv = br_stp_rcv,
};
br_init is the bridge kernel module init function. It calls stp_proto_register with br_stp_proto, whose rcv callback (br_stp_rcv) handles BPDU (Bridge Protocol Data Unit) frames, which carry STP (Spanning Tree Protocol) information.
int __init br_fdb_init(void)
{
    br_fdb_cache = kmem_cache_create("bridge_fdb_cache",
                                     sizeof(struct net_bridge_fdb_entry),
                                     0, SLAB_HWCACHE_ALIGN, NULL);
    ...
    return 0;
}
br_fdb_init allocates a cache for the entries of the bridge forwarding database. struct net_bridge_fdb_entry is the important data structure here: it maps a struct net_bridge_fdb_key (MAC address and VLAN ID) to a bridge port. We will see more about this structure later in the article.
static struct pernet_operations br_net_ops = {
    .exit = br_net_exit,
};

static void __net_exit br_net_exit(struct net *net)
{
    ...
    for_each_netdev(net, dev)
        if (dev->priv_flags & IFF_EBRIDGE)
            br_dev_delete(dev, &list);
    ...
}
register_pernet_subsys registers a network namespace subsystem. Here br_net_ops has only an exit function defined, which deletes all bridge devices in a network namespace when that namespace exits.
br_nf_core_init initializes the firewall core for the Ethernet bridge. br_netlink_init initializes the routing Netlink address family and link operations.
void brioctl_set(int (*hook) (struct net *, unsigned int, void __user *))
{
    mutex_lock(&br_ioctl_mutex);
    br_ioctl_hook = hook;
    mutex_unlock(&br_ioctl_mutex);
}
EXPORT_SYMBOL(brioctl_set);
brioctl_set assigns the br_ioctl_deviceless_stub function to br_ioctl_hook, which we saw earlier in sock_ioctl.
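To make the load-on-demand idea concrete, here is a small user-space sketch of the same pattern: a hook pointer that starts out NULL, a "module load" that installs it, and a dispatcher that loads the module only on first use. This is not kernel code. Every name in it (fake_sock_ioctl, fake_request_module, hook_set, fake_stub, the CMD_* values) is a hypothetical stand-in that I made up for illustration, and the locking is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; none of these are kernel symbols. */
typedef int (*ioctl_hook_t)(unsigned int cmd);

enum { CMD_ADD_BRIDGE = 1, CMD_DEL_BRIDGE = 2 };

static ioctl_hook_t hook;   /* plays the role of br_ioctl_hook: NULL until "loaded" */
static int module_loads;    /* how many times the fake module was loaded */

/* Plays the role of br_ioctl_deviceless_stub. */
static int fake_stub(unsigned int cmd)
{
    switch (cmd) {
    case CMD_ADD_BRIDGE:
    case CMD_DEL_BRIDGE:
        return 0;
    }
    return -EOPNOTSUPP;
}

/* Plays the role of brioctl_set(). */
static void hook_set(ioctl_hook_t h)
{
    hook = h;
}

/* Plays the role of request_module("bridge"): loading runs the module's
 * init code, and the init code is what installs the hook. */
static void fake_request_module(void)
{
    assert(hook == NULL);   /* only reached while the hook is unset */
    module_loads++;
    hook_set(fake_stub);
}

/* Plays the role of the bridge cases in sock_ioctl(). */
static int fake_sock_ioctl(unsigned int cmd)
{
    int err = -ENOPKG;

    if (!hook)
        fake_request_module();
    if (hook)
        err = hook(cmd);
    return err;
}
```

Calling fake_sock_ioctl(CMD_ADD_BRIDGE) twice loads the fake module only once; the second call finds the hook already set and goes straight to the stub, which is the behaviour we observed with lsmod after the first brctl addbr.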
int br_ioctl_deviceless_stub(struct net *net, unsigned int cmd, void __user *uarg)
{
    switch (cmd) {
    ...
    case SIOCBRADDBR:
    case SIOCBRDELBR:
    {
        ...
        if (cmd == SIOCBRADDBR)
            return br_add_bridge(net, buf);

        return br_del_bridge(net, buf);
    }
    }
    return -EOPNOTSUPP;
}
So a socket ioctl with cmd SIOCBRADDBR calls br_ioctl_deviceless_stub, which in turn calls br_add_bridge with the struct net * and a buffer that contains the name of the interface to be created.
br_add_bridge creates a struct net_device (a core structure of the network driver layer). A struct net_device is created for both physical and virtual interfaces. For a NIC (network interface card), the device driver that manages the NIC creates the struct net_device, which then gets added to the kernel’s list of network devices. In the bridge case, the struct net_device of the bridge is initialized by br_dev_setup, which is called from br_add_bridge.
void br_dev_setup(struct net_device *dev)
{
    struct net_bridge *br = netdev_priv(dev);

    eth_hw_addr_random(dev);
    ether_setup(dev);

    dev->netdev_ops = &br_netdev_ops;
    dev->needs_free_netdev = true;
    dev->ethtool_ops = &br_ethtool_ops;
    ...
}
eth_hw_addr_random generates a random Ethernet (MAC) address and assigns it to dev->dev_addr.
br_netdev_ops is of type struct net_device_ops, which contains the operations that can be performed on the net_device.
br_ethtool_ops is of type struct ethtool_ops, which contains optional device operations. The ethtool utility calls these operations to get and set the network device configuration.
$ ethtool -i br0
driver: bridge
version: 2.3
firmware-version: N/A
expansion-rom-version:
bus-info: N/A
supports-statistics: no
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: no
Adding interface to the bridge
Next, let us see how Linux handles adding interfaces to a bridge. To do this we will first create two pairs of veth (virtual Ethernet device) interfaces.
$ ip link add veth10 type veth peer name veth20
$ ip link add veth30 type veth peer name veth40
An interface can be added to a bridge with either the iproute2 or the brctl utility. iproute2 uses a netlink socket and brctl uses ioctl to add an interface to a bridge. Either way, both utilities end up calling br_add_if in the bridge module.
$ brctl show br0
bridge name     bridge id               STP enabled     interfaces
br0             8000.000000000000       no
$ ip link set veth10 master br0
$ ip link set veth30 master br0
$ brctl show br0
bridge name     bridge id               STP enabled     interfaces
br0             8000.2669427cd774       no              veth10
                                                        veth30
Let us look into the internals of what happens when the interface is added to the bridge br0.
int br_add_if(struct net_bridge *br, struct net_device *dev,
              struct netlink_ext_ack *extack)
{
    ...
    p = new_nbp(br, dev);
    ...
    err = kobject_init_and_add(&p->kobj, &brport_ktype, &(dev->dev.kobj),
                               SYSFS_BRIDGE_PORT_ATTR);
    ...
    err = netdev_rx_handler_register(dev, br_handle_frame, p);
    ...
}
Some of the important initializations done in br_add_if are:
- Creating a bridge port from the bridge and the net_device
- Setting up a sysfs entry
- Registering a bridge handler for receiving packets on the net_device
There are three main structures in br_add_if: struct net_bridge, struct net_device, and struct net_bridge_port. net_bridge is the bridge to which the net_device interface is being added. net_bridge_port is the new bridge port created by calling new_nbp. The net_bridge_port contains a kobject, which is initialized and added under the net_device kobject by calling kobject_init_and_add. The macro SYSFS_BRIDGE_PORT_ATTR expands to the string "brport". We can check this addition under sysfs.
# ls -la /sys/class/net/veth10/brport/
total 0
drwxr-xr-x 2 root root 0 Mar 22 02:05 .
drwxr-xr-x 6 root root 0 Mar 22 02:04 ..
-rw-r--r-- 1 root root 4096 Mar 23 23:31 bpdu_guard
lrwxrwxrwx 1 root root 0 Mar 22 02:10 bridge -> ../../br0
-r--r--r-- 1 root root 4096 Mar 23 23:31 change_ack
-r--r--r-- 1 root root 4096 Mar 23 23:31 config_pending
-r--r--r-- 1 root root 4096 Mar 23 23:31 designated_bridge
-r--r--r-- 1 root root 4096 Mar 23 23:31 designated_cost
...
br_handle_frame is a callback registered on the net_device interface, so that every packet received on this interface is handled by the bridge code.
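The effect of registering a receive handler can be modelled with a toy user-space sketch: a device carries one optional callback, and the receive path hands frames to it before falling back to normal processing. All of the toy_* names below are hypothetical and heavily simplified; the real netdev_rx_handler_register works on struct net_device and struct sk_buff, but it similarly refuses a second handler on the same device with -EBUSY.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for net_device and sk_buff. */
struct toy_frame { int len; };
struct toy_dev;
typedef int (*toy_rx_handler_t)(struct toy_dev *dev, struct toy_frame *f);

struct toy_dev {
    toy_rx_handler_t rx_handler;  /* NULL means normal stack processing */
    void *rx_handler_data;        /* the bridge port in the real code */
    int handled;                  /* frames consumed by the handler */
    int passed_up;                /* frames left to the normal stack */
};

/* Models netdev_rx_handler_register(): at most one handler per device. */
static int toy_rx_handler_register(struct toy_dev *dev, toy_rx_handler_t h,
                                   void *data)
{
    assert(dev != NULL && h != NULL);
    if (dev->rx_handler)
        return -EBUSY;
    dev->rx_handler = h;
    dev->rx_handler_data = data;
    return 0;
}

/* Plays the role of br_handle_frame: consumes every frame it is given. */
static int toy_bridge_handle_frame(struct toy_dev *dev, struct toy_frame *f)
{
    (void)f;
    dev->handled++;
    return 1;   /* non-zero: the frame was consumed */
}

/* Models the receive path: the handler, if any, sees the frame first. */
static void toy_receive(struct toy_dev *dev, struct toy_frame *f)
{
    if (dev->rx_handler && dev->rx_handler(dev, f))
        return;
    dev->passed_up++;
}
```

Before registration, toy_receive counts the frame in passed_up; after registering toy_bridge_handle_frame, the same call counts it in handled instead. That is the switch that happens to veth10 once it becomes a port of br0.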
Life of a packet inside the bridge
# brctl show br0
bridge name     bridge id               STP enabled     interfaces
br0             8000.865ee85c4139       no              veth10
                                                        veth30
Let us create two network namespaces and move one end of each veth pair into a namespace.
# ip netns add ns1
# ip netns add ns2
# ip link set veth20 netns ns1
# ip link set veth40 netns ns2
Now bring up the bridge and the interfaces, both on the host and inside the namespaces.
# ip link set br0 up
# ip link set veth10 up
# ip link set veth30 up
# ip netns exec ns1 ip link set veth20 up
# ip netns exec ns2 ip link set veth40 up
Assign IP addresses to the interfaces in the namespaces ns1 and ns2.
# ip netns exec ns1 ip addr add dev veth20 192.168.56.1/24
# ip netns exec ns2 ip addr add dev veth40 192.168.56.2/24
With the interfaces and IP addresses set up, we can check whether there is connectivity.
$ ip netns exec ns1 ping -c 1 192.168.56.2
PING 192.168.56.2 (192.168.56.2) 56(84) bytes of data.

--- 192.168.56.2 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
Here I ran ping from the network namespace ns1 to reach the interface in the network namespace ns2, but I don’t see an ICMP response. Running tcpdump on br0 shows that the ICMP request reached the bridge but there is no ICMP response.
$ tcpdump -qnni br0
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on br0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:04:47.431771 IP 192.168.56.1 > 192.168.56.2: ICMP echo request, id 1573, seq 1, length 64
15:04:52.561271 ARP, Request who-has 192.168.56.2 tell 192.168.56.1, length 28
15:04:52.561315 ARP, Reply 192.168.56.2 is-at 76:35:3c:05:68:ea, length 28
So it seems that the bridge is not forwarding the packet to the other port. Let us run ftrace on the bridge code to see what is happening.
$ cd /sys/kernel/debug/tracing
$ echo 'br*' > set_ftrace_filter
$ echo function_graph > current_tracer
$ echo 1 > tracing_on ; ip netns exec ns1 ping -c 1 192.168.56.2 ; echo 0 > tracing_on
Here I set the ftrace filter to br* so that only bridge functions are traced. The tracing output shows that the last bridge function to handle the packet is br_nf_forward_ip.
1) | br_handle_frame [bridge]() {
1) | br_nf_pre_routing [br_netfilter]() {
...
1) | br_forward [bridge]() {
1) 0.149 us | br_allowed_egress [bridge]();
1) 0.141 us | br_handle_vlan [bridge]();
1) | br_nf_forward_ip [br_netfilter]() {
1) 0.198 us | br_validate_ipv4.isra.30 [br_netfilter]();
1) 0.147 us | brnf_get_logical_dev.isra.27 [br_netfilter]();
1) 7.276 us | }
...
1) + 44.181 us | }
Looking at the br_nf_forward_ip code, the packet is passed to the NF_INET_FORWARD hook. If the packet had been forwarded, we would have seen br_nf_forward_finish in the ftrace output. So the packet was dropped in the FORWARD chain.
static unsigned int br_nf_forward_ip(void *priv,
                                     struct sk_buff *skb,
                                     const struct nf_hook_state *state)
{
    ...
    NF_HOOK(pf, NF_INET_FORWARD, state->net, NULL, skb,
            brnf_get_logical_dev(skb, state->in),
            parent, br_nf_forward_finish);
    return NF_STOLEN;
}
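Why a dropped packet never reaches br_nf_forward_finish becomes clearer with a toy model of the hook-then-continue pattern behind NF_HOOK. The sketch below is a user-space simplification with hypothetical toy_* names: it runs a list of hook functions and calls the continuation only if every hook accepts. The real macro has more verdicts and arguments than this.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy verdicts; the real netfilter has more than two. */
enum toy_verdict { TOY_DROP = 0, TOY_ACCEPT = 1 };

struct toy_pkt { int finished; };

typedef enum toy_verdict (*toy_hook_fn)(const struct toy_pkt *pkt);
typedef int (*toy_okfn)(struct toy_pkt *pkt);

/* Runs the hooks in order; continues with okfn only if all of them accept. */
static int toy_nf_hook(toy_hook_fn *hooks, size_t n, struct toy_pkt *pkt,
                       toy_okfn okfn)
{
    size_t i;

    assert(pkt != NULL && okfn != NULL);
    for (i = 0; i < n; i++)
        if (hooks[i](pkt) == TOY_DROP)
            return -1;      /* dropped: okfn is never called */
    return okfn(pkt);
}

/* Two trivial rules, standing in for whatever the FORWARD chain contains. */
static enum toy_verdict accept_all(const struct toy_pkt *p)
{
    (void)p;
    return TOY_ACCEPT;
}

static enum toy_verdict drop_all(const struct toy_pkt *p)
{
    (void)p;
    return TOY_DROP;
}

/* Plays the role of br_nf_forward_finish. */
static int toy_forward_finish(struct toy_pkt *pkt)
{
    pkt->finished = 1;
    return 0;
}
```

With a hook list that ends in drop_all, toy_forward_finish never runs and pkt.finished stays 0, which matches the first trace where br_nf_forward_finish is missing.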
So there are two options: either add an iptables rule to allow these packets, or stop the bridge from passing packets to iptables altogether. For the sake of simplicity, we will do the latter. There is a sysctl for this.
$ sysctl net.bridge.bridge-nf-call-iptables
net.bridge.bridge-nf-call-iptables = 1
$ sysctl -w net.bridge.bridge-nf-call-iptables=0
net.bridge.bridge-nf-call-iptables = 0
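The same setting is also visible as a file under /proc/sys, with the dots of the sysctl name turned into slashes. Here is a small sketch that builds that path and reads the value; sysctl_path and sysctl_read_int are hypothetical helper names of my own. The simple dot-to-slash mapping is an assumption that holds for this name but not for names that embed an interface name containing a dot.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: maps "net.bridge.x" to "/proc/sys/net/bridge/x".
 * Returns 0 on success, -1 if out is too small.
 */
static int sysctl_path(const char *name, char *out, size_t outlen)
{
    static const char prefix[] = "/proc/sys/";
    size_t i, plen = sizeof(prefix) - 1, nlen;

    assert(name != NULL && out != NULL);
    nlen = strlen(name);
    if (plen + nlen + 1 > outlen)
        return -1;
    memcpy(out, prefix, plen);
    for (i = 0; i < nlen; i++)
        out[plen + i] = (name[i] == '.') ? '/' : name[i];
    out[plen + nlen] = '\0';
    return 0;
}

/*
 * Hypothetical helper: reads an integer sysctl into *value.
 * Returns 0 on success, -1 if the file is missing or unreadable
 * (for example when the br_netfilter module is not loaded).
 */
static int sysctl_read_int(const char *name, int *value)
{
    char path[256];
    FILE *f;
    int ok;

    assert(value != NULL);
    if (sysctl_path(name, path, sizeof(path)) < 0)
        return -1;
    f = fopen(path, "r");
    if (!f)
        return -1;
    ok = (fscanf(f, "%d", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

After the sysctl -w command above, sysctl_read_int("net.bridge.bridge-nf-call-iptables", &v) should succeed and leave 0 in v.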
We will run the same ping command from network namespace again and check if there is an ICMP response.
$ echo 1 > tracing_on ; ip netns exec ns1 ping -c 1 192.168.56.2 ; echo 0 > tracing_on
PING 192.168.56.2 (192.168.56.2) 56(84) bytes of data.
64 bytes from 192.168.56.2: icmp_seq=1 ttl=64 time=0.065 ms

--- 192.168.56.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.065/0.065/0.065/0.000 ms
Now we see the response to the ICMP request. The ftrace output for this request shows the function graph of the bridge code. Let us look more closely at these functions.
0) | br_handle_frame [bridge]() {
0) 0.203 us | br_nf_pre_routing [br_netfilter]();
0) | br_handle_frame_finish [bridge]() {
0) 0.192 us | br_allowed_ingress [bridge]();
0) 0.962 us | br_fdb_update [bridge]();
0) 0.291 us | br_fdb_find_rcu [bridge]();
0) | br_forward [bridge]() {
0) 0.187 us | br_allowed_egress [bridge]();
0) 0.186 us | br_handle_vlan [bridge]();
0) 0.186 us | br_nf_forward_ip [br_netfilter]();
0) 0.184 us | br_nf_forward_arp [br_netfilter]();
0) | br_forward_finish [bridge]() {
0) 0.713 us | br_nf_post_routing [br_netfilter]();
0) 1.049 us | br_dev_queue_push_xmit [bridge]();
0) 2.391 us | }
0) 4.364 us | }
0) 6.833 us | }
0) 7.761 us | }
br_dev_queue_push_xmit is the last call; it queues the packet for transmission on the destination interface.
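The middle of this trace (br_fdb_update, br_fdb_find_rcu, br_forward) is the classic learning-bridge logic: remember which port the source address was seen on, then look up the destination. The following user-space sketch models that with a small fixed-size table keyed by MAC address and VLAN ID. All toy_* names are hypothetical, and a linear array stands in for the kernel’s real forwarding database.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_FDB_SIZE 16
#define TOY_FLOOD    (-1)   /* destination unknown: send to all other ports */

/* Hypothetical reduced versions of the fdb key and entry. */
struct toy_fdb_key   { uint8_t mac[6]; uint16_t vlan; };
struct toy_fdb_entry { struct toy_fdb_key key; int port; int used; };
struct toy_bridge    { struct toy_fdb_entry fdb[TOY_FDB_SIZE]; };

/* Lookup by (MAC, VLAN), the role br_fdb_find_rcu plays in the trace. */
static struct toy_fdb_entry *toy_fdb_find(struct toy_bridge *br,
                                          const uint8_t *mac, uint16_t vlan)
{
    int i;

    for (i = 0; i < TOY_FDB_SIZE; i++) {
        struct toy_fdb_entry *e = &br->fdb[i];

        if (e->used && e->key.vlan == vlan && memcmp(e->key.mac, mac, 6) == 0)
            return e;
    }
    return NULL;
}

/* Learning step, the role br_fdb_update plays in the trace. */
static void toy_fdb_update(struct toy_bridge *br, const uint8_t *mac,
                           uint16_t vlan, int port)
{
    struct toy_fdb_entry *e = toy_fdb_find(br, mac, vlan);
    int i;

    if (e) {
        e->port = port;     /* the address may have moved to another port */
        return;
    }
    for (i = 0; i < TOY_FDB_SIZE; i++) {
        e = &br->fdb[i];
        if (!e->used) {
            memcpy(e->key.mac, mac, 6);
            e->key.vlan = vlan;
            e->port = port;
            e->used = 1;
            return;
        }
    }
    /* Table full: this toy simply skips learning. */
}

/* Learn the source, then return the egress port or TOY_FLOOD. */
static int toy_handle_frame(struct toy_bridge *br, int in_port,
                            const uint8_t *src, const uint8_t *dst,
                            uint16_t vlan)
{
    struct toy_fdb_entry *e;

    assert(br != NULL && src != NULL && dst != NULL);
    toy_fdb_update(br, src, vlan, in_port);
    e = toy_fdb_find(br, dst, vlan);
    return e ? e->port : TOY_FLOOD;
}
```

In this model the first frame from a new host is flooded because its destination is still unknown, the reply is sent only to the port where the first host was learned, and the same MAC address on a different VLAN counts as a separate entry.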