diff options
Diffstat (limited to 'Documentation/networking')
28 files changed, 1105 insertions, 567 deletions
diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX index df27a1a50..415154a48 100644 --- a/Documentation/networking/00-INDEX +++ b/Documentation/networking/00-INDEX @@ -44,6 +44,8 @@ can.txt - documentation on CAN protocol family. cdc_mbim.txt - 3G/LTE USB modem (Mobile Broadband Interface Model) +checksum-offloads.txt + - Explanation of checksum offloads; LCO, RCO cops.txt - info on the COPS LocalTalk Linux driver cs89x0.txt diff --git a/Documentation/networking/altera_tse.txt b/Documentation/networking/altera_tse.txt index 3f24df8c6..50b8589d1 100644 --- a/Documentation/networking/altera_tse.txt +++ b/Documentation/networking/altera_tse.txt @@ -6,7 +6,7 @@ This is the driver for the Altera Triple-Speed Ethernet (TSE) controllers using the SGDMA and MSGDMA soft DMA IP components. The driver uses the platform bus to obtain component resources. The designs used to test this driver were built for a Cyclone(R) V SOC FPGA board, a Cyclone(R) V FPGA board, -and tested with ARM and NIOS processor hosts seperately. The anticipated use +and tested with ARM and NIOS processor hosts separately. The anticipated use cases are simple communications between an embedded system and an external peer for status and simple configuration of the embedded system. @@ -65,14 +65,14 @@ Driver parameters can be also passed in command line by using: 4.1) Transmit process When the driver's transmit routine is called by the kernel, it sets up a transmit descriptor by calling the underlying DMA transmit routine (SGDMA or -MSGDMA), and initites a transmit operation. Once the transmit is complete, an +MSGDMA), and initiates a transmit operation. Once the transmit is complete, an interrupt is driven by the transmit DMA logic. The driver handles the transmit completion in the context of the interrupt handling chain by recycling resource required to send and track the requested transmit operation. 4.2) Receive process The driver will post receive buffers to the receive DMA logic during driver -intialization. Receive buffers may or may not be queued depending upon the +initialization. Receive buffers may or may not be queued depending upon the underlying DMA logic (MSGDMA is able queue receive buffers, SGDMA is not able to queue receive buffers to the SGDMA receive logic). When a packet is received, the DMA logic generates an interrupt. The driver handles a receive diff --git a/Documentation/networking/batman-adv.txt b/Documentation/networking/batman-adv.txt index ff23b755f..1b5e7a7f2 100644 --- a/Documentation/networking/batman-adv.txt +++ b/Documentation/networking/batman-adv.txt @@ -187,7 +187,7 @@ interfaces to the kernel module settings. For more information, please see the manpage (man batctl). -batctl is available on http://www.open-mesh.org/ +batctl is available on https://www.open-mesh.org/ CONTACT diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index 334b49ef0..57f52cdce 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -1880,8 +1880,8 @@ or more peers on the local network. The ARP monitor relies on the device driver itself to verify that traffic is flowing. In particular, the driver must keep up to -date the last receive time, dev->last_rx, and transmit start time, -dev->trans_start. If these are not updated by the driver, then the +date the last receive time, dev->last_rx. Drivers that use NETIF_F_LLTX +flag must also update netdev_queue->trans_start. If they do not, then the ARP monitor will immediately fail any slaves using that driver, and those slaves will stay down. If networking monitoring (tcpdump, etc) shows the ARP requests and replies on the network, then it may be that diff --git a/Documentation/networking/can.txt b/Documentation/networking/can.txt index 6ab619fcc..aa15b9ee2 100644 --- a/Documentation/networking/can.txt +++ b/Documentation/networking/can.txt @@ -31,6 +31,7 @@ This file contains 4.2.4 Broadcast Manager message sequence transmission 4.2.5 Broadcast Manager receive filter timers 4.2.6 Broadcast Manager multiplex message receive filter + 4.2.7 Broadcast Manager CAN FD support 4.3 connected transport protocols (SOCK_SEQPACKET) 4.4 unconnected transport protocols (SOCK_DGRAM) @@ -799,7 +800,7 @@ solution for a couple of reasons: } mytxmsg; (..) - mytxmsg.nframes = 4; + mytxmsg.msg_head.nframes = 4; (..) write(s, &mytxmsg, sizeof(mytxmsg)); @@ -852,6 +853,28 @@ solution for a couple of reasons: write(s, &msg, sizeof(msg)); + 4.2.7 Broadcast Manager CAN FD support + + The programming API of the CAN_BCM depends on struct can_frame which is + given as array directly behind the bcm_msg_head structure. To follow this + schema for the CAN FD frames a new flag 'CAN_FD_FRAME' in the bcm_msg_head + flags indicates that the concatenated CAN frame structures behind the + bcm_msg_head are defined as struct canfd_frame. + + struct { + struct bcm_msg_head msg_head; + struct canfd_frame frame[5]; + } msg; + + msg.msg_head.opcode = RX_SETUP; + msg.msg_head.can_id = 0x42; + msg.msg_head.flags = CAN_FD_FRAME; + msg.msg_head.nframes = 5; + (..) + + When using CAN FD frames for multiplex filtering the MUX mask is still + expected in the first 64 bit of the struct canfd_frame data section. + 4.3 connected transport protocols (SOCK_SEQPACKET) 4.4 unconnected transport protocols (SOCK_DGRAM) @@ -1256,7 +1279,7 @@ solution for a couple of reasons: 7. SocketCAN resources ----------------------- - The Linux CAN / SocketCAN project ressources (project site / mailing list) + The Linux CAN / SocketCAN project resources (project site / mailing list) are referenced in the MAINTAINERS file in the Linux source tree. Search for CAN NETWORK [LAYERS|DRIVERS]. diff --git a/Documentation/networking/checksum-offloads.txt b/Documentation/networking/checksum-offloads.txt new file mode 100644 index 000000000..56e368612 --- /dev/null +++ b/Documentation/networking/checksum-offloads.txt @@ -0,0 +1,119 @@ +Checksum Offloads in the Linux Networking Stack + + +Introduction +============ + +This document describes a set of techniques in the Linux networking stack + to take advantage of checksum offload capabilities of various NICs. + +The following technologies are described: + * TX Checksum Offload + * LCO: Local Checksum Offload + * RCO: Remote Checksum Offload + +Things that should be documented here but aren't yet: + * RX Checksum Offload + * CHECKSUM_UNNECESSARY conversion + + +TX Checksum Offload +=================== + +The interface for offloading a transmit checksum to a device is explained + in detail in comments near the top of include/linux/skbuff.h. +In brief, it allows to request the device fill in a single ones-complement + checksum defined by the sk_buff fields skb->csum_start and + skb->csum_offset. The device should compute the 16-bit ones-complement + checksum (i.e. the 'IP-style' checksum) from csum_start to the end of the + packet, and fill in the result at (csum_start + csum_offset). +Because csum_offset cannot be negative, this ensures that the previous + value of the checksum field is included in the checksum computation, thus + it can be used to supply any needed corrections to the checksum (such as + the sum of the pseudo-header for UDP or TCP). +This interface only allows a single checksum to be offloaded. Where + encapsulation is used, the packet may have multiple checksum fields in + different header layers, and the rest will have to be handled by another + mechanism such as LCO or RCO. +No offloading of the IP header checksum is performed; it is always done in + software. This is OK because when we build the IP header, we obviously + have it in cache, so summing it isn't expensive. It's also rather short. +The requirements for GSO are more complicated, because when segmenting an + encapsulated packet both the inner and outer checksums may need to be + edited or recomputed for each resulting segment. See the skbuff.h comment + (section 'E') for more details. + +A driver declares its offload capabilities in netdev->hw_features; see + Documentation/networking/netdev-features for more. Note that a device + which only advertises NETIF_F_IP[V6]_CSUM must still obey the csum_start + and csum_offset given in the SKB; if it tries to deduce these itself in + hardware (as some NICs do) the driver should check that the values in the + SKB match those which the hardware will deduce, and if not, fall back to + checksumming in software instead (with skb_checksum_help or one of the + skb_csum_off_chk* functions as mentioned in include/linux/skbuff.h). This + is a pain, but that's what you get when hardware tries to be clever. + +The stack should, for the most part, assume that checksum offload is + supported by the underlying device. The only place that should check is + validate_xmit_skb(), and the functions it calls directly or indirectly. + That function compares the offload features requested by the SKB (which + may include other offloads besides TX Checksum Offload) and, if they are + not supported or enabled on the device (determined by netdev->features), + performs the corresponding offload in software. In the case of TX + Checksum Offload, that means calling skb_checksum_help(skb). + + +LCO: Local Checksum Offload +=========================== + +LCO is a technique for efficiently computing the outer checksum of an + encapsulated datagram when the inner checksum is due to be offloaded. +The ones-complement sum of a correctly checksummed TCP or UDP packet is + equal to the complement of the sum of the pseudo header, because everything + else gets 'cancelled out' by the checksum field. This is because the sum was + complemented before being written to the checksum field. +More generally, this holds in any case where the 'IP-style' ones complement + checksum is used, and thus any checksum that TX Checksum Offload supports. +That is, if we have set up TX Checksum Offload with a start/offset pair, we + know that after the device has filled in that checksum, the ones + complement sum from csum_start to the end of the packet will be equal to + the complement of whatever value we put in the checksum field beforehand. + This allows us to compute the outer checksum without looking at the payload: + we simply stop summing when we get to csum_start, then add the complement of + the 16-bit word at (csum_start + csum_offset). +Then, when the true inner checksum is filled in (either by hardware or by + skb_checksum_help()), the outer checksum will become correct by virtue of + the arithmetic. + +LCO is performed by the stack when constructing an outer UDP header for an + encapsulation such as VXLAN or GENEVE, in udp_set_csum(). Similarly for + the IPv6 equivalents, in udp6_set_csum(). +It is also performed when constructing an IPv4 GRE header, in + net/ipv4/ip_gre.c:build_header(). It is *not* currently performed when + constructing an IPv6 GRE header; the GRE checksum is computed over the + whole packet in net/ipv6/ip6_gre.c:ip6gre_xmit2(), but it should be + possible to use LCO here as IPv6 GRE still uses an IP-style checksum. +All of the LCO implementations use a helper function lco_csum(), in + include/linux/skbuff.h. + +LCO can safely be used for nested encapsulations; in this case, the outer + encapsulation layer will sum over both its own header and the 'middle' + header. This does mean that the 'middle' header will get summed multiple + times, but there doesn't seem to be a way to avoid that without incurring + bigger costs (e.g. in SKB bloat). + + +RCO: Remote Checksum Offload +============================ + +RCO is a technique for eliding the inner checksum of an encapsulated + datagram, allowing the outer checksum to be offloaded. It does, however, + involve a change to the encapsulation protocols, which the receiver must + also support. For this reason, it is disabled by default. +RCO is detailed in the following Internet-Drafts: +https://tools.ietf.org/html/draft-herbert-remotecsumoffload-00 +https://tools.ietf.org/html/draft-herbert-vxlan-rco-00 +In Linux, RCO is implemented individually in each encapsulation protocol, + and most tunnel types have flags controlling its use. For instance, VXLAN + has the flag VXLAN_F_REMCSUM_TX (per struct vxlan_rdst) to indicate that + RCO should be used when transmitting to a given remote destination. diff --git a/Documentation/networking/dsa/bcm_sf2.txt b/Documentation/networking/dsa/bcm_sf2.txt index d999d0c1c..eba3a2431 100644 --- a/Documentation/networking/dsa/bcm_sf2.txt +++ b/Documentation/networking/dsa/bcm_sf2.txt @@ -38,7 +38,7 @@ Implementation details ====================== The driver is located in drivers/net/dsa/bcm_sf2.c and is implemented as a DSA -driver; see Documentation/networking/dsa/dsa.txt for details on the subsytem +driver; see Documentation/networking/dsa/dsa.txt for details on the subsystem and what it provides. The SF2 switch is configured to enable a Broadcom specific 4-bytes switch tag diff --git a/Documentation/networking/dsa/dsa.txt b/Documentation/networking/dsa/dsa.txt index aa9c1f931..f20c884c0 100644 --- a/Documentation/networking/dsa/dsa.txt +++ b/Documentation/networking/dsa/dsa.txt @@ -334,7 +334,7 @@ more specifically with its VLAN filtering portion when configuring VLANs on top of per-port slave network devices. Since DSA primarily deals with MDIO-connected switches, although not exclusively, SWITCHDEV's prepare/abort/commit phases are often simplified into a prepare phase which -checks whether the operation is supporte by the DSA switch driver, and a commit +checks whether the operation is supported by the DSA switch driver, and a commit phase which applies the changes. As of today, the only SWITCHDEV objects supported by DSA are the FDB and VLAN @@ -369,8 +369,6 @@ does not allocate any driver private context space. Switch configuration -------------------- -- priv_size: additional size needed by the switch driver for its private context - - tag_protocol: this is to indicate what kind of tagging protocol is supported, should be a valid value from the dsa_tag_protocol enum @@ -416,11 +414,6 @@ PHY devices and link management to the switch port MDIO registers. If unavailable return a negative error code. -- poll_link: Function invoked by DSA to query the link state of the switch - builtin Ethernet PHYs, per port. This function is responsible for calling - netif_carrier_{on,off} when appropriate, and can be used to poll all ports in a - single call. Executes from workqueue context. - - adjust_link: Function invoked by the PHY library when a slave network device is attached to a PHY device. This function is responsible for appropriately configuring the switch port link parameters: speed, duplex, pause based on @@ -521,22 +514,19 @@ See Documentation/hwmon/sysfs-interface for details. Bridge layer ------------ -- port_join_bridge: bridge layer function invoked when a given switch port is +- port_bridge_join: bridge layer function invoked when a given switch port is added to a bridge, this function should be doing the necessary at the switch level to permit the joining port from being added to the relevant logical - domain for it to ingress/egress traffic with other members of the bridge. DSA - does nothing but calculate a bitmask of switch ports currently members of the - specified bridge being requested the join + domain for it to ingress/egress traffic with other members of the bridge. -- port_leave_bridge: bridge layer function invoked when a given switch port is +- port_bridge_leave: bridge layer function invoked when a given switch port is removed from a bridge, this function should be doing the necessary at the switch level to deny the leaving port from ingress/egress traffic from the remaining bridge members. When the port leaves the bridge, it should be aged out at the switch hardware for the switch to (re) learn MAC addresses behind - this port. DSA calculates the bitmask of ports still members of the bridge - being left + this port. -- port_stp_update: bridge layer function invoked when a given switch port STP +- port_stp_state_set: bridge layer function invoked when a given switch port STP state is computed by the bridge layer and should be propagated to switch hardware to forward/block/learn traffic. The switch driver is responsible for computing a STP state change based on current and asked parameters and perform @@ -545,11 +535,21 @@ Bridge layer Bridge VLAN filtering --------------------- -- port_pvid_get: bridge layer function invoked when a Port-based VLAN ID is - queried for the given switch port - -- port_pvid_set: bridge layer function invoked when a Port-based VLAN ID needs - to be configured on the given switch port +- port_vlan_filtering: bridge layer function invoked when the bridge gets + configured for turning on or off VLAN filtering. If nothing specific needs to + be done at the hardware level, this callback does not need to be implemented. + When VLAN filtering is turned on, the hardware must be programmed with + rejecting 802.1Q frames which have VLAN IDs outside of the programmed allowed + VLAN ID map/rules. If there is no PVID programmed into the switch port, + untagged frames must be rejected as well. When turned off the switch must + accept any 802.1Q frames irrespective of their VLAN ID, and untagged frames are + allowed. + +- port_vlan_prepare: bridge layer function invoked when the bridge prepares the + configuration of a VLAN on the given port. If the operation is not supported + by the hardware, this function should return -EOPNOTSUPP to inform the bridge + code to fallback to a software implementation. No hardware setup must be done + in this function. See port_vlan_add for this and details. - port_vlan_add: bridge layer function invoked when a VLAN is configured (tagged or untagged) for the given switch port @@ -557,8 +557,15 @@ Bridge VLAN filtering - port_vlan_del: bridge layer function invoked when a VLAN is removed from the given switch port -- vlan_getnext: bridge layer function invoked to query the next configured VLAN - in the switch, i.e. returns the bitmaps of members and untagged ports +- port_vlan_dump: bridge layer function invoked with a switchdev callback + function that the driver has to call for each VLAN the given port is a member + of. A switchdev object is used to carry the VID and bridge flags. + +- port_fdb_prepare: bridge layer function invoked when the bridge prepares the + installation of a Forwarding Database entry. If the operation is not + supported, this function should return -EOPNOTSUPP to inform the bridge code + to fallback to a software implementation. No hardware setup must be done in + this function. See port_fdb_add for this and details. - port_fdb_add: bridge layer function invoked when the bridge wants to install a Forwarding Database entry, the switch hardware should be programmed with the @@ -573,29 +580,13 @@ of DSA, would be the its port-based VLAN, used by the associated bridge device. the specified MAC address from the specified VLAN ID if it was mapped into this port forwarding database +- port_fdb_dump: bridge layer function invoked with a switchdev callback + function that the driver has to call for each MAC address known to be behind + the given port. A switchdev object is used to carry the VID and FDB info. + TODO ==== -The platform device problem ---------------------------- -DSA is currently implemented as a platform device driver which is far from ideal -as was discussed in this thread: - -http://permalink.gmane.org/gmane.linux.network/329848 - -This basically prevents the device driver model to be properly used and applied, -and support non-MDIO, non-MMIO Ethernet connected switches. - -Another problem with the platform device driver approach is that it prevents the -use of a modular switch drivers build due to a circular dependency, illustrated -here: - -http://comments.gmane.org/gmane.linux.network/345803 - -Attempts of reworking this has been done here: - -https://lwn.net/Articles/643149/ - Making SWITCHDEV and DSA converge towards an unified codebase ------------------------------------------------------------- diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index 96da119a4..683ada5ad 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -216,21 +216,21 @@ opcodes as defined in linux/filter.h stand for: jmp 6 Jump to label ja 6 Jump to label - jeq 7, 8 Jump on k == A - jneq 8 Jump on k != A - jne 8 Jump on k != A - jlt 8 Jump on k < A - jle 8 Jump on k <= A - jgt 7, 8 Jump on k > A - jge 7, 8 Jump on k >= A - jset 7, 8 Jump on k & A + jeq 7, 8 Jump on A == k + jneq 8 Jump on A != k + jne 8 Jump on A != k + jlt 8 Jump on A < k + jle 8 Jump on A <= k + jgt 7, 8 Jump on A > k + jge 7, 8 Jump on A >= k + jset 7, 8 Jump on A & k add 0, 4 A + <x> sub 0, 4 A - <x> mul 0, 4 A * <x> div 0, 4 A / <x> mod 0, 4 A % <x> - neg 0, 4 !A + neg !A and 0, 4 A & <x> or 0, 4 A | <x> xor 0, 4 A ^ <x> @@ -1095,6 +1095,87 @@ all use cases. See details of eBPF verifier in kernel/bpf/verifier.c +Direct packet access +-------------------- +In cls_bpf and act_bpf programs the verifier allows direct access to the packet +data via skb->data and skb->data_end pointers. +Ex: +1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */ +2: r3 = *(u32 *)(r1 +76) /* load skb->data */ +3: r5 = r3 +4: r5 += 14 +5: if r5 > r4 goto pc+16 +R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp +6: r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */ + +this 2byte load from the packet is safe to do, since the program author +did check 'if (skb->data + 14 > skb->data_end) goto err' at insn #5 which +means that in the fall-through case the register R3 (which points to skb->data) +has at least 14 directly accessible bytes. The verifier marks it +as R3=pkt(id=0,off=0,r=14). +id=0 means that no additional variables were added to the register. +off=0 means that no additional constants were added. +r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok. +Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points +to the packet data, but constant 14 was added to the register, so +it now points to 'skb->data + 14' and accessible range is [R5, R5 + 14 - 14) +which is zero bytes. + +More complex packet access may look like: + R0=imm1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp + 6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */ + 7: r4 = *(u8 *)(r3 +12) + 8: r4 *= 14 + 9: r3 = *(u32 *)(r1 +76) /* load skb->data */ +10: r3 += r4 +11: r2 = r1 +12: r2 <<= 48 +13: r2 >>= 48 +14: r3 += r2 +15: r2 = r3 +16: r2 += 8 +17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */ +18: if r2 > r1 goto pc+2 + R0=inv56 R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv52 R5=pkt(id=0,off=14,r=14) R10=fp +19: r1 = *(u8 *)(r3 +4) +The state of the register R3 is R3=pkt(id=2,off=0,r=8) +id=2 means that two 'r3 += rX' instructions were seen, so r3 points to some +offset within a packet and since the program author did +'if (r3 + 8 > r1) goto err' at insn #18, the safe range is [R3, R3 + 8). +The verifier only allows 'add' operation on packet registers. Any other +operation will set the register state to 'unknown_value' and it won't be +available for direct packet access. +Operation 'r3 += rX' may overflow and become less than original skb->data, +therefore the verifier has to prevent that. So it tracks the number of +upper zero bits in all 'uknown_value' registers, so when it sees +'r3 += rX' instruction and rX is more than 16-bit value, it will error as: +"cannot add integer value with N upper zero bits to ptr_to_packet" +Ex. after insn 'r4 = *(u8 *)(r3 +12)' (insn #7 above) the state of r4 is +R4=inv56 which means that upper 56 bits on the register are guaranteed +to be zero. After insn 'r4 *= 14' the state becomes R4=inv52, since +multiplying 8-bit value by constant 14 will keep upper 52 bits as zero. +Similarly 'r2 >>= 48' will make R2=inv48, since the shift is not sign +extending. This logic is implemented in evaluate_reg_alu() function. + +The end result is that bpf program author can access packet directly +using normal C code as: + void *data = (void *)(long)skb->data; + void *data_end = (void *)(long)skb->data_end; + struct eth_hdr *eth = data; + struct iphdr *iph = data + sizeof(*eth); + struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph); + + if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end) + return 0; + if (eth->h_proto != htons(ETH_P_IP)) + return 0; + if (iph->protocol != IPPROTO_UDP || iph->ihl != 5) + return 0; + if (udp->dest == 53 || udp->source == 9) + ...; +which makes such programs easier to write comparing to LD_ABS insn +and significantly faster. + eBPF maps --------- 'maps' is a generic storage of different types for sharing data between kernel @@ -1293,5 +1374,5 @@ to give potential BPF hackers or security auditors a better overview of the underlying architecture. Jay Schulist <jschlst@samba.org> -Daniel Borkmann <dborkman@redhat.com> -Alexei Starovoitov <ast@plumgrid.com> +Daniel Borkmann <daniel@iogearbox.net> +Alexei Starovoitov <ast@kernel.org> diff --git a/Documentation/networking/gen_stats.txt b/Documentation/networking/gen_stats.txt index 70e6275b7..179b18ce4 100644 --- a/Documentation/networking/gen_stats.txt +++ b/Documentation/networking/gen_stats.txt @@ -21,7 +21,7 @@ struct mystruct { ... }; -Update statistics: +Update statistics, in dequeue() methods only, (while owning qdisc->running) mystruct->tstats.packet++; mystruct->qstats.backlog += skb->pkt_len; @@ -33,7 +33,8 @@ my_dumping_routine(struct sk_buff *skb, ...) { struct gnet_dump dump; - if (gnet_stats_start_copy(skb, TCA_STATS2, &mystruct->lock, &dump) < 0) + if (gnet_stats_start_copy(skb, TCA_STATS2, &mystruct->lock, &dump, + TCA_PAD) < 0) goto rtattr_failure; if (gnet_stats_copy_basic(&dump, &mystruct->bstats) < 0 || @@ -56,7 +57,8 @@ existing TLV types. my_dumping_routine(struct sk_buff *skb, ...) { if (gnet_stats_start_copy_compat(skb, TCA_STATS2, TCA_STATS, - TCA_XSTATS, &mystruct->lock, &dump) < 0) + TCA_XSTATS, &mystruct->lock, &dump, + TCA_PAD) < 0) goto rtattr_failure; ... } diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 73b36d7c7..9ae929395 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -63,6 +63,16 @@ fwmark_reflect - BOOLEAN fwmark of the packet they are replying to. Default: 0 +fib_multipath_use_neigh - BOOLEAN + Use status of existing neighbor entry when determining nexthop for + multipath routes. If disabled, neighbor information is not used and + packets could be directed to a failed nexthop. Only valid for kernels + built with CONFIG_IP_ROUTE_MULTIPATH enabled. + Default: 0 (disabled) + Possible values: + 0 - disabled + 1 - enabled + route/max_size - INTEGER Maximum number of routes allowed in the kernel. Increase this when using large numbers of interfaces and/or routes. @@ -946,15 +956,20 @@ igmp_max_memberships - INTEGER The value 5459 assumes no IP header options, so in practice this number may be lower. - conf/interface/* changes special settings per interface (where - "interface" is the name of your network interface) - - conf/all/* is special, changes the settings for all interfaces +igmp_max_msf - INTEGER + Maximum number of addresses allowed in the source filter list for a + multicast group. + Default: 10 igmp_qrv - INTEGER - Controls the IGMP query robustness variable (see RFC2236 8.1). - Default: 2 (as specified by RFC2236 8.1) - Minimum: 1 (as specified by RFC6636 4.5) + Controls the IGMP query robustness variable (see RFC2236 8.1). + Default: 2 (as specified by RFC2236 8.1) + Minimum: 1 (as specified by RFC6636 4.5) + +conf/interface/* changes special settings per interface (where +"interface" is the name of your network interface) + +conf/all/* is special, changes the settings for all interfaces log_martians - BOOLEAN Log packets with impossible addresses to kernel log. @@ -1021,15 +1036,17 @@ proxy_arp_pvlan - BOOLEAN shared_media - BOOLEAN Send(router) or accept(host) RFC1620 shared media redirects. - Overrides ip_secure_redirects. + Overrides secure_redirects. shared_media for the interface will be enabled if at least one of conf/{all,interface}/shared_media is set to TRUE, it will be disabled otherwise default TRUE secure_redirects - BOOLEAN - Accept ICMP redirect messages only for gateways, - listed in default gateway list. + Accept ICMP redirect messages only to gateways listed in the + interface's current gateway list. Even if disabled, RFC1122 redirect + rules still apply. + Overridden by shared_media. secure_redirects for the interface will be enabled if at least one of conf/{all,interface}/secure_redirects is set to TRUE, it will be disabled otherwise @@ -1216,6 +1233,19 @@ promote_secondaries - BOOLEAN promote a corresponding secondary IP address instead of removing all the corresponding secondary IP addresses. +drop_unicast_in_l2_multicast - BOOLEAN + Drop any unicast IP packets that are received in link-layer + multicast (or broadcast) frames. + This behavior (for multicast) is actually a SHOULD in RFC + 1122, but is disabled by default for compatibility reasons. + Default: off (0) + +drop_gratuitous_arp - BOOLEAN + Drop all gratuitous ARP frames, for example if there's a known + good ARP proxy on the network and such frames need not be used + (or in the case of 802.11, must not be used to prevent attacks.) + Default: off (0) + tag - INTEGER Allows you to write a number, which can be used as required. @@ -1550,6 +1580,15 @@ temp_prefered_lft - INTEGER Preferred lifetime (in seconds) for temporary addresses. Default: 86400 (1 day) +keep_addr_on_down - INTEGER + Keep all IPv6 addresses on an interface down event. If set static + global addresses with no expiration time are not flushed. + >0 : enabled + 0 : system default + <0 : disabled + + Default: 0 (addresses are removed) + max_desync_factor - INTEGER Maximum value for DESYNC_FACTOR, which is a random value that ensures that clients don't synchronize with each @@ -1661,6 +1700,19 @@ stable_secret - IPv6 address By default the stable secret is unset. +drop_unicast_in_l2_multicast - BOOLEAN + Drop any unicast IPv6 packets that are received in link-layer + multicast (or broadcast) frames. + + By default this is turned off. + +drop_unsolicited_na - BOOLEAN + Drop all unsolicited neighbor advertisements, for example if there's + a known good NA proxy on the network and such frames need not be used + (or in the case of 802.11, must not be used to prevent attacks.) + + By default this is turned off. + icmp/*: ratelimit - INTEGER Limit the maximal rates for sending ICMPv6 packets. diff --git a/Documentation/networking/ipvlan.txt b/Documentation/networking/ipvlan.txt index cf996394e..14422f8fc 100644 --- a/Documentation/networking/ipvlan.txt +++ b/Documentation/networking/ipvlan.txt @@ -8,7 +8,7 @@ Initial Release: This is conceptually very similar to the macvlan driver with one major exception of using L3 for mux-ing /demux-ing among slaves. This property makes the master device share the L2 with it's slave devices. I have developed this -driver in conjuntion with network namespaces and not sure if there is use case +driver in conjunction with network namespaces and not sure if there is use case outside of it. @@ -42,7 +42,7 @@ out. In this mode the slaves will RX/TX multicast and broadcast (if applicable) as well. 4.2 L3 mode: - In this mode TX processing upto L3 happens on the stack instance attached + In this mode TX processing up to L3 happens on the stack instance attached to the slave device and packets are switched to the stack instance of the master device for the L2 processing and routing from that instance will be used before packets are queued on the outbound device. In this mode the slaves @@ -56,7 +56,7 @@ situations defines your use case then you can choose to use ipvlan - (a) The Linux host that is connected to the external switch / router has policy configured that allows only one mac per port. (b) No of virtual devices created on a master exceed the mac capacity and -puts the NIC in promiscous mode and degraded performance is a concern. +puts the NIC in promiscuous mode and degraded performance is a concern. (c) If the slave device is to be put into the hostile / untrusted network namespace where L2 on the slave could be changed / misused. diff --git a/Documentation/networking/kcm.txt b/Documentation/networking/kcm.txt new file mode 100644 index 000000000..3476ede5b --- /dev/null +++ b/Documentation/networking/kcm.txt @@ -0,0 +1,285 @@ +Kernel Connection Mulitplexor +----------------------------- + +Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based +interface over TCP for generic application protocols. With KCM an application +can efficiently send and receive application protocol messages over TCP using +datagram sockets. + +KCM implements an NxM multiplexor in the kernel as diagrammed below: + ++------------+ +------------+ +------------+ +------------+ +| KCM socket | | KCM socket | | KCM socket | | KCM socket | ++------------+ +------------+ +------------+ +------------+ + | | | | + +-----------+ | | +----------+ + | | | | + +----------------------------------+ + | Multiplexor | + +----------------------------------+ + | | | | | + +---------+ | | | ------------+ + | | | | | ++----------+ +----------+ +----------+ +----------+ +----------+ +| Psock | | Psock | | Psock | | Psock | | Psock | ++----------+ +----------+ +----------+ +----------+ +----------+ + | | | | | ++----------+ +----------+ +----------+ +----------+ +----------+ +| TCP sock | | TCP sock | | TCP sock | | TCP sock | | TCP sock | ++----------+ +----------+ +----------+ +----------+ +----------+ + +KCM sockets +----------- + +The KCM sockets provide the user interface to the muliplexor. All the KCM sockets +bound to a multiplexor are considered to have equivalent function, and I/O +operations in different sockets may be done in parallel without the need for +synchronization between threads in userspace. + +Multiplexor +----------- + +The multiplexor provides the message steering. In the transmit path, messages +written on a KCM socket are sent atomically on an appropriate TCP socket. +Similarly, in the receive path, messages are constructed on each TCP socket +(Psock) and complete messages are steered to a KCM socket. + +TCP sockets & Psocks +-------------------- + +TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated +for each bound TCP socket, this structure holds the state for constructing +messages on receive as well as other connection specific information for KCM. + +Connected mode semantics +------------------------ + +Each multiplexor assumes that all attached TCP connections are to the same +destination and can use the different connections for load balancing when +transmitting. The normal send and recv calls (include sendmmsg and recvmmsg) +can be used to send and receive messages from the KCM socket. + +Socket types +------------ + +KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types. + +Message delineation +------------------- + +Messages are sent over a TCP stream with some application protocol message +format that typically includes a header which frames the messages. The length +of a received message can be deduced from the application protocol header +(often just a simple length field). + +A TCP stream must be parsed to determine message boundaries. Berkeley Packet +Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a +BPF program must be specified. The program is called at the start of receiving +a new message and is given an skbuff that contains the bytes received so far. +It parses the message header and returns the length of the message. Given this +information, KCM will construct the message of the stated length and deliver it +to a KCM socket. + +TCP socket management +--------------------- + +When a TCP socket is attached to a KCM multiplexor data ready (POLLIN) and +write space available (POLLOUT) events are handled by the multiplexor. If there +is a state change (disconnection) or other error on a TCP socket, an error is +posted on the TCP socket so that a POLLERR event happens and KCM discontinues +using the socket. When the application gets the error notification for a +TCP socket, it should unattach the socket from KCM and then handle the error +condition (the typical response is to close the socket and create a new +connection if necessary). + +KCM limits the maximum receive message size to be the size of the receive +socket buffer on the attached TCP socket (the socket buffer size can be set by +SO_RCVBUF). If the length of a new message reported by the BPF program is +greater than this limit a corresponding error (EMSGSIZE) is posted on the TCP +socket. The BPF program may also enforce a maximum messages size and report an +error when it is exceeded. + +A timeout may be set for assembling messages on a receive socket. The timeout +value is taken from the receive timeout of the attached TCP socket (this is set +by SO_RCVTIMEO). If the timer expires before assembly is complete an error +(ETIMEDOUT) is posted on the socket. + +User interface +============== + +Creating a multiplexor +---------------------- + +A new multiplexor and initial KCM socket is created by a socket call: + + socket(AF_KCM, type, protocol) + + - type is either SOCK_DGRAM or SOCK_SEQPACKET + - protocol is KCMPROTO_CONNECTED + +Cloning KCM sockets +------------------- + +After the first KCM socket is created using the socket call as described +above, additional sockets for the multiplexor can be created by cloning +a KCM socket. This is accomplished by an ioctl on a KCM socket: + + /* From linux/kcm.h */ + struct kcm_clone { + int fd; + }; + + struct kcm_clone info; + + memset(&info, 0, sizeof(info)); + + err = ioctl(kcmfd, SIOCKCMCLONE, &info); + + if (!err) + newkcmfd = info.fd; + +Attach transport sockets +------------------------ + +Attaching of transport sockets to a multiplexor is performed by calling an +ioctl on a KCM socket for the multiplexor. e.g.: + + /* From linux/kcm.h */ + struct kcm_attach { + int fd; + int bpf_fd; + }; + + struct kcm_attach info; + + memset(&info, 0, sizeof(info)); + + info.fd = tcpfd; + info.bpf_fd = bpf_prog_fd; + + ioctl(kcmfd, SIOCKCMATTACH, &info); + +The kcm_attach structure contains: + fd: file descriptor for TCP socket being attached + bpf_prog_fd: file descriptor for compiled BPF program downloaded + +Unattach transport sockets +-------------------------- + +Unattaching a transport socket from a multiplexor is straightforward. An +"unattach" ioctl is done with the kcm_unattach structure as the argument: + + /* From linux/kcm.h */ + struct kcm_unattach { + int fd; + }; + + struct kcm_unattach info; + + memset(&info, 0, sizeof(info)); + + info.fd = cfd; + + ioctl(fd, SIOCKCMUNATTACH, &info); + +Disabling receive on KCM socket +------------------------------- + +A setsockopt is used to disable or enable receiving on a KCM socket. +When receive is disabled, any pending messages in the socket's +receive buffer are moved to other sockets. This feature is useful +if an application thread knows that it will be doing a lot of +work on a request and won't be able to service new messages for a +while. Example use: + + int val = 1; + + setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val)) + +BFP programs for message delineation +------------------------------------ + +BPF programs can be compiled using the BPF LLVM backend. For exmple, +the BPF program for parsing Thrift is: + + #include "bpf.h" /* for __sk_buff */ + #include "bpf_helpers.h" /* for load_word intrinsic */ + + SEC("socket_kcm") + int bpf_prog1(struct __sk_buff *skb) + { + return load_word(skb, 0) + 4; + } + + char _license[] SEC("license") = "GPL"; + +Use in applications +=================== + +KCM accelerates application layer protocols. Specifically, it allows +applications to use a message based interface for sending and receiving +messages. The kernel provides necessary assurances that messages are sent +and received atomically. This relieves much of the burden applications have +in mapping a message based protocol onto the TCP stream. KCM also make +application layer messages a unit of work in the kernel for the purposes of +steerng and scheduling, which in turn allows a simpler networking model in +multithreaded applications. + +Configurations +-------------- + +In an Nx1 configuration, KCM logically provides multiple socket handles +to the same TCP connection. This allows parallelism between in I/O +operations on the TCP socket (for instance copyin and copyout of data is +parallelized). In an application, a KCM socket can be opened for each +processing thread and inserted into the epoll (similar to how SO_REUSEPORT +is used to allow multiple listener sockets on the same port). + +In a MxN configuration, multiple connections are established to the +same destination. These are used for simple load balancing. + +Message batching +---------------- + +The primary purpose of KCM is load balancing between KCM sockets and hence +threads in a nominal use case. Perfect load balancing, that is steering +each received message to a different KCM socket or steering each sent +message to a different TCP socket, can negatively impact performance +since this doesn't allow for affinities to be established. Balancing +based on groups, or batches of messages, can be beneficial for performance. + +On transmit, there are three ways an application can batch (pipeline) +messages on a KCM socket. + 1) Send multiple messages in a single sendmmsg. + 2) Send a group of messages each with a sendmsg call, where all messages + except the last have MSG_BATCH in the flags of sendmsg call. + 3) Create "super message" composed of multiple messages and send this + with a single sendmsg. + +On receive, the KCM module attempts to queue messages received on the +same KCM socket during each TCP ready callback. The targeted KCM socket +changes at each receive ready callback on the KCM socket. The application +does not need to configure this. + +Error handling +-------------- + +An application should include a thread to monitor errors raised on +the TCP connection. Normally, this will be done by placing each +TCP socket attached to a KCM multiplexor in epoll set for POLLERR +event. If an error occurs on an attached TCP socket, KCM sets an EPIPE +on the socket thus waking up the application thread. When the application +sees the error (which may just be a disconnect) it should unattach the +socket from KCM and then close it. It is assumed that once an error is +posted on the TCP socket the data stream is unrecoverable (i.e. an error +may have occurred in in the middle of receiving a messssge). + +TCP connection monitoring +------------------------- + +In KCM there is no means to correlate a message to the TCP socket that +was used to send or receive the message (except in the case there is +only one attached TCP socket). However, the application does retain +an open file descriptor to the socket so it will be able to get statistics +from the socket which can be used in detecting issues (such as high +retransmissions on the socket). diff --git a/Documentation/networking/mac80211-injection.txt b/Documentation/networking/mac80211-injection.txt index 3a930072b..d58d78df9 100644 --- a/Documentation/networking/mac80211-injection.txt +++ b/Documentation/networking/mac80211-injection.txt @@ -28,6 +28,36 @@ radiotap headers and used to control injection: IEEE80211_RADIOTAP_F_TX_NOACK: frame should be sent without waiting for an ACK even if it is a unicast frame + * IEEE80211_RADIOTAP_RATE + + legacy rate for the transmission (only for devices without own rate control) + + * IEEE80211_RADIOTAP_MCS + + HT rate for the transmission (only for devices without own rate control). + Also some flags are parsed + + IEEE80211_RADIOTAP_MCS_SGI: use short guard interval + IEEE80211_RADIOTAP_MCS_BW_40: send in HT40 mode + + * IEEE80211_RADIOTAP_DATA_RETRIES + + number of retries when either IEEE80211_RADIOTAP_RATE or + IEEE80211_RADIOTAP_MCS was used + + * IEEE80211_RADIOTAP_VHT + + VHT mcs and number of streams used in the transmission (only for devices + without own rate control). Also other fields are parsed + + flags field + IEEE80211_RADIOTAP_VHT_FLAG_SGI: use short guard interval + + bandwidth field + 1: send using 40MHz channel width + 4: send using 80MHz channel width + 11: send using 160MHz channel width + The injection code can also skip all other currently defined radiotap fields facilitating replay of captured radiotap headers directly. diff --git a/Documentation/networking/netdev-features.txt b/Documentation/networking/netdev-features.txt index f310edec8..7413eb052 100644 --- a/Documentation/networking/netdev-features.txt +++ b/Documentation/networking/netdev-features.txt @@ -131,13 +131,11 @@ stack. Driver should not change behaviour based on them. * LLTX driver (deprecated for hardware drivers) -NETIF_F_LLTX should be set in drivers that implement their own locking in -transmit path or don't need locking at all (e.g. software tunnels). -In ndo_start_xmit, it is recommended to use a try_lock and return -NETDEV_TX_LOCKED when the spin lock fails. The locking should also properly -protect against other callbacks (the rules you need to find out). +NETIF_F_LLTX is meant to be used by drivers that don't need locking at all, +e.g. software tunnels. -Don't use it for new drivers. +This is also used in a few legacy drivers that implement their +own locking, don't use it for new (hardware) drivers. * netns-local device diff --git a/Documentation/networking/netdevices.txt b/Documentation/networking/netdevices.txt index 0b1cf6b2a..7fec2061a 100644 --- a/Documentation/networking/netdevices.txt +++ b/Documentation/networking/netdevices.txt @@ -69,10 +69,9 @@ ndo_start_xmit: When the driver sets NETIF_F_LLTX in dev->features this will be called without holding netif_tx_lock. In this case the driver - has to lock by itself when needed. It is recommended to use a try lock - for this and return NETDEV_TX_LOCKED when the spin lock fails. - The locking there should also properly protect against - set_rx_mode. Note that the use of NETIF_F_LLTX is deprecated. + has to lock by itself when needed. + The locking there should also properly protect against + set_rx_mode. WARNING: use of NETIF_F_LLTX is deprecated. Don't use it for new drivers. Context: Process with BHs disabled or BH (timer), @@ -83,8 +82,6 @@ ndo_start_xmit: o NETDEV_TX_BUSY Cannot transmit packet, try later Usually a bug, means queue start/stop flow control is broken in the driver. Note: the driver must NOT put the skb in its DMA ring. - o NETDEV_TX_LOCKED Locking failed, please retry quickly. - Only valid when NETIF_F_LLTX is set. ndo_tx_timeout: Synchronization: netif_tx_lock spinlock; all TX queues frozen. diff --git a/Documentation/networking/netlink_mmap.txt b/Documentation/networking/netlink_mmap.txt deleted file mode 100644 index 54f10478e..000000000 --- a/Documentation/networking/netlink_mmap.txt +++ /dev/null @@ -1,332 +0,0 @@ -This file documents how to use memory mapped I/O with netlink. - -Author: Patrick McHardy <kaber@trash.net> - -Overview --------- - -Memory mapped netlink I/O can be used to increase throughput and decrease -overhead of unicast receive and transmit operations. Some netlink subsystems -require high throughput, these are mainly the netfilter subsystems -nfnetlink_queue and nfnetlink_log, but it can also help speed up large -dump operations of f.i. the routing database. - -Memory mapped netlink I/O used two circular ring buffers for RX and TX which -are mapped into the processes address space. - -The RX ring is used by the kernel to directly construct netlink messages into -user-space memory without copying them as done with regular socket I/O, -additionally as long as the ring contains messages no recvmsg() or poll() -syscalls have to be issued by user-space to get more message. - -The TX ring is used to process messages directly from user-space memory, the -kernel processes all messages contained in the ring using a single sendmsg() -call. - -Usage overview --------------- - -In order to use memory mapped netlink I/O, user-space needs three main changes: - -- ring setup -- conversion of the RX path to get messages from the ring instead of recvmsg() -- conversion of the TX path to construct messages into the ring - -Ring setup is done using setsockopt() to provide the ring parameters to the -kernel, then a call to mmap() to map the ring into the processes address space: - -- setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, ¶ms, sizeof(params)); -- setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, ¶ms, sizeof(params)); -- ring = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0) - -Usage of either ring is optional, but even if only the RX ring is used the -mapping still needs to be writable in order to update the frame status after -processing. - -Conversion of the reception path involves calling poll() on the file -descriptor, once the socket is readable the frames from the ring are -processed in order until no more messages are available, as indicated by -a status word in the frame header. - -On kernel side, in order to make use of memory mapped I/O on receive, the -originating netlink subsystem needs to support memory mapped I/O, otherwise -it will use an allocated socket buffer as usual and the contents will be - copied to the ring on transmission, nullifying most of the performance gains. -Dumps of kernel databases automatically support memory mapped I/O. - -Conversion of the transmit path involves changing message construction to -use memory from the TX ring instead of (usually) a buffer declared on the -stack and setting up the frame header appropriately. Optionally poll() can -be used to wait for free frames in the TX ring. - -Structured and definitions for using memory mapped I/O are contained in -<linux/netlink.h>. - -RX and TX rings ----------------- - -Each ring contains a number of continuous memory blocks, containing frames of -fixed size dependent on the parameters used for ring setup. - -Ring: [ block 0 ] - [ frame 0 ] - [ frame 1 ] - [ block 1 ] - [ frame 2 ] - [ frame 3 ] - ... - [ block n ] - [ frame 2 * n ] - [ frame 2 * n + 1 ] - -The blocks are only visible to the kernel, from the point of view of user-space -the ring just contains the frames in a continuous memory zone. - -The ring parameters used for setting up the ring are defined as follows: - -struct nl_mmap_req { - unsigned int nm_block_size; - unsigned int nm_block_nr; - unsigned int nm_frame_size; - unsigned int nm_frame_nr; -}; - -Frames are grouped into blocks, where each block is a continuous region of memory -and holds nm_block_size / nm_frame_size frames. The total number of frames in -the ring is nm_frame_nr. The following invariants hold: - -- frames_per_block = nm_block_size / nm_frame_size - -- nm_frame_nr = frames_per_block * nm_block_nr - -Some parameters are constrained, specifically: - -- nm_block_size must be a multiple of the architectures memory page size. - The getpagesize() function can be used to get the page size. - -- nm_frame_size must be equal or larger to NL_MMAP_HDRLEN, IOW a frame must be - able to hold at least the frame header - -- nm_frame_size must be smaller or equal to nm_block_size - -- nm_frame_size must be a multiple of NL_MMAP_MSG_ALIGNMENT - -- nm_frame_nr must equal the actual number of frames as specified above. - -When the kernel can't allocate physically continuous memory for a ring block, -it will fall back to use physically discontinuous memory. This might affect -performance negatively, in order to avoid this the nm_frame_size parameter -should be chosen to be as small as possible for the required frame size and -the number of blocks should be increased instead. - -Ring frames ------------- - -Each frames contain a frame header, consisting of a synchronization word and some -meta-data, and the message itself. - -Frame: [ header message ] - -The frame header is defined as follows: - -struct nl_mmap_hdr { - unsigned int nm_status; - unsigned int nm_len; - __u32 nm_group; - /* credentials */ - __u32 nm_pid; - __u32 nm_uid; - __u32 nm_gid; -}; - -- nm_status is used for synchronizing processing between the kernel and user- - space and specifies ownership of the frame as well as the operation to perform - -- nm_len contains the length of the message contained in the data area - -- nm_group specified the destination multicast group of message - -- nm_pid, nm_uid and nm_gid contain the netlink pid, UID and GID of the sending - process. These values correspond to the data available using SOCK_PASSCRED in - the SCM_CREDENTIALS cmsg. - -The possible values in the status word are: - -- NL_MMAP_STATUS_UNUSED: - RX ring: frame belongs to the kernel and contains no message - for user-space. Approriate action is to invoke poll() - to wait for new messages. - - TX ring: frame belongs to user-space and can be used for - message construction. - -- NL_MMAP_STATUS_RESERVED: - RX ring only: frame is currently used by the kernel for message - construction and contains no valid message yet. - Appropriate action is to invoke poll() to wait for - new messages. - -- NL_MMAP_STATUS_VALID: - RX ring: frame contains a valid message. Approriate action is - to process the message and release the frame back to - the kernel by setting the status to - NL_MMAP_STATUS_UNUSED or queue the frame by setting the - status to NL_MMAP_STATUS_SKIP. - - TX ring: the frame contains a valid message from user-space to - be processed by the kernel. After completing processing - the kernel will release the frame back to user-space by - setting the status to NL_MMAP_STATUS_UNUSED. - -- NL_MMAP_STATUS_COPY: - RX ring only: a message is ready to be processed but could not be - stored in the ring, either because it exceeded the - frame size or because the originating subsystem does - not support memory mapped I/O. Appropriate action is - to invoke recvmsg() to receive the message and release - the frame back to the kernel by setting the status to - NL_MMAP_STATUS_UNUSED. - -- NL_MMAP_STATUS_SKIP: - RX ring only: user-space queued the message for later processing, but - processed some messages following it in the ring. The - kernel should skip this frame when looking for unused - frames. - -The data area of a frame begins at a offset of NL_MMAP_HDRLEN relative to the -frame header. - -TX limitations --------------- - -As of Jan 2015 the message is always copied from the ring frame to an -allocated buffer due to unresolved security concerns. -See commit 4682a0358639b29cf ("netlink: Always copy on mmap TX."). - -Example -------- - -Ring setup: - - unsigned int block_size = 16 * getpagesize(); - struct nl_mmap_req req = { - .nm_block_size = block_size, - .nm_block_nr = 64, - .nm_frame_size = 16384, - .nm_frame_nr = 64 * block_size / 16384, - }; - unsigned int ring_size; - void *rx_ring, *tx_ring; - - /* Configure ring parameters */ - if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0) - exit(1); - if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0) - exit(1) - - /* Calculate size of each individual ring */ - ring_size = req.nm_block_nr * req.nm_block_size; - - /* Map RX/TX rings. The TX ring is located after the RX ring */ - rx_ring = mmap(NULL, 2 * ring_size, PROT_READ | PROT_WRITE, - MAP_SHARED, fd, 0); - if ((long)rx_ring == -1L) - exit(1); - tx_ring = rx_ring + ring_size: - -Message reception: - -This example assumes some ring parameters of the ring setup are available. - - unsigned int frame_offset = 0; - struct nl_mmap_hdr *hdr; - struct nlmsghdr *nlh; - unsigned char buf[16384]; - ssize_t len; - - while (1) { - struct pollfd pfds[1]; - - pfds[0].fd = fd; - pfds[0].events = POLLIN | POLLERR; - pfds[0].revents = 0; - - if (poll(pfds, 1, -1) < 0 && errno != -EINTR) - exit(1); - - /* Check for errors. Error handling omitted */ - if (pfds[0].revents & POLLERR) - <handle error> - - /* If no new messages, poll again */ - if (!(pfds[0].revents & POLLIN)) - continue; - - /* Process all frames */ - while (1) { - /* Get next frame header */ - hdr = rx_ring + frame_offset; - - if (hdr->nm_status == NL_MMAP_STATUS_VALID) { - /* Regular memory mapped frame */ - nlh = (void *)hdr + NL_MMAP_HDRLEN; - len = hdr->nm_len; - - /* Release empty message immediately. May happen - * on error during message construction. - */ - if (len == 0) - goto release; - } else if (hdr->nm_status == NL_MMAP_STATUS_COPY) { - /* Frame queued to socket receive queue */ - len = recv(fd, buf, sizeof(buf), MSG_DONTWAIT); - if (len <= 0) - break; - nlh = buf; - } else - /* No more messages to process, continue polling */ - break; - - process_msg(nlh); -release: - /* Release frame back to the kernel */ - hdr->nm_status = NL_MMAP_STATUS_UNUSED; - - /* Advance frame offset to next frame */ - frame_offset = (frame_offset + frame_size) % ring_size; - } - } - -Message transmission: - -This example assumes some ring parameters of the ring setup are available. -A single message is constructed and transmitted, to send multiple messages -at once they would be constructed in consecutive frames before a final call -to sendto(). - - unsigned int frame_offset = 0; - struct nl_mmap_hdr *hdr; - struct nlmsghdr *nlh; - struct sockaddr_nl addr = { - .nl_family = AF_NETLINK, - }; - - hdr = tx_ring + frame_offset; - if (hdr->nm_status != NL_MMAP_STATUS_UNUSED) - /* No frame available. Use poll() to avoid. */ - exit(1); - - nlh = (void *)hdr + NL_MMAP_HDRLEN; - - /* Build message */ - build_message(nlh); - - /* Fill frame header: length and status need to be set */ - hdr->nm_len = nlh->nlmsg_len; - hdr->nm_status = NL_MMAP_STATUS_VALID; - - if (sendto(fd, NULL, 0, 0, &addr, sizeof(addr)) < 0) - exit(1); - - /* Advance frame offset to next frame */ - frame_offset = (frame_offset + frame_size) % ring_size; diff --git a/Documentation/networking/nf_conntrack-sysctl.txt b/Documentation/networking/nf_conntrack-sysctl.txt index f55599c62..4fb51d32f 100644 --- a/Documentation/networking/nf_conntrack-sysctl.txt +++ b/Documentation/networking/nf_conntrack-sysctl.txt @@ -7,12 +7,13 @@ nf_conntrack_acct - BOOLEAN Enable connection tracking flow accounting. 64-bit byte and packet counters per flow are added. -nf_conntrack_buckets - INTEGER (read-only) +nf_conntrack_buckets - INTEGER Size of hash table. If not specified as parameter during module loading, the default size is calculated by dividing total memory by 16384 to determine the number of buckets but the hash table will never have fewer than 32 and limited to 16384 buckets. For systems with more than 4GB of memory it will be 65536 buckets. + This sysctl is only writeable in the initial net namespace. nf_conntrack_checksum - BOOLEAN 0 - disabled diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt index e839e7efc..7ab9404a8 100644 --- a/Documentation/networking/phy.txt +++ b/Documentation/networking/phy.txt @@ -267,13 +267,23 @@ Writing a PHY driver config_intr: Enable or disable interrupts remove: Does any driver take-down ts_info: Queries about the HW timestamping status + match_phy_device: used for Clause 45 capable PHYs to match devices + in package and ensure they are compatible hwtstamp: Set the PHY HW timestamping configuration rxtstamp: Requests a receive timestamp at the PHY level for a 'skb' txtsamp: Requests a transmit timestamp at the PHY level for a 'skb' set_wol: Enable Wake-on-LAN at the PHY level get_wol: Get the Wake-on-LAN status at the PHY level + link_change_notify: called to inform the core is about to change the + link state, can be used to work around bogus PHY between state changes read_mmd_indirect: Read PHY MMD indirect register write_mmd_indirect: Write PHY MMD indirect register + module_info: Get the size and type of an EEPROM contained in an plug-in + module + module_eeprom: Get EEPROM information of a plug-in module + get_sset_count: Get number of strings sets that get_strings will count + get_strings: Get strings from requested objects (statistics) + get_stats: Get the extended statistics from the PHY device Of these, only config_aneg and read_status are required to be assigned by the driver code. The rest are optional. Also, it is diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt index f4be85e96..2c4e3354e 100644 --- a/Documentation/networking/pktgen.txt +++ b/Documentation/networking/pktgen.txt @@ -67,12 +67,12 @@ The two basic thread commands are: * add_device DEVICE@NAME -- adds a single device * rem_device_all -- remove all associated devices -When adding a device to a thread, a corrosponding procfile is created +When adding a device to a thread, a corresponding procfile is created which is used for configuring this device. Thus, device names need to be unique. To support adding the same device to multiple threads, which is useful -with multi queue NICs, a the device naming scheme is extended with "@": +with multi queue NICs, the device naming scheme is extended with "@": device@something The part after "@" can be anything, but it is custom to use the thread @@ -221,7 +221,7 @@ Sample scripts A collection of tutorial scripts and helpers for pktgen is in the samples/pktgen directory. The helper parameters.sh file support easy -and consistant parameter parsing across the sample scripts. +and consistent parameter parsing across the sample scripts. Usage example and help: ./pktgen_sample01_simple.sh -i eth4 -m 00:1B:21:3C:9D:F8 -d 192.168.8.2 diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt index e1a3d59bb..0235ae69a 100644 --- a/Documentation/networking/rds.txt +++ b/Documentation/networking/rds.txt @@ -19,9 +19,7 @@ to N*N if you use a connection-oriented socket transport like TCP. RDS is not Infiniband-specific; it was designed to support different transports. The current implementation used to support RDS over TCP as well -as IB. Work is in progress to support RDS over iWARP, and using DCE to -guarantee no dropped packets on Ethernet, it may be possible to use RDS over -UDP in the future. +as IB. The high-level semantics of RDS from the application's point of view are @@ -87,7 +85,8 @@ Socket Interface bind(fd, &sockaddr_in, ...) This binds the socket to a local IP address and port, and a - transport. + transport, if one has not already been selected via the + SO_RDS_TRANSPORT socket option sendmsg(fd, ...) Sends a message to the indicated recipient. The kernel will @@ -148,6 +147,20 @@ Socket Interface operation. In this case, it would use RDS_CANCEL_SENT_TO to nuke any pending messages. + setsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..) + getsockopt(fd, SOL_RDS, SO_RDS_TRANSPORT, (int *)&transport ..) + Set or read an integer defining the underlying + encapsulating transport to be used for RDS packets on the + socket. When setting the option, integer argument may be + one of RDS_TRANS_TCP or RDS_TRANS_IB. When retrieving the + value, RDS_TRANS_NONE will be returned on an unbound socket. + This socket option may only be set exactly once on the socket, + prior to binding it via the bind(2) system call. Attempts to + set SO_RDS_TRANSPORT on a socket for which the transport has + been previously attached explicitly (by SO_RDS_TRANSPORT) or + implicitly (via bind(2)) will return an error of EOPNOTSUPP. + An attempt to set SO_RDS_TRANSPPORT to RDS_TRANS_NONE will + always return EINVAL. RDMA for RDS ============ @@ -352,4 +365,59 @@ The recv path handle CMSGs return to application +Multipath RDS (mprds) +===================== + Mprds is multipathed-RDS, primarily intended for RDS-over-TCP + (though the concept can be extended to other transports). The classical + implementation of RDS-over-TCP is implemented by demultiplexing multiple + PF_RDS sockets between any 2 endpoints (where endpoint == [IP address, + port]) over a single TCP socket between the 2 IP addresses involved. This + has the limitation that it ends up funneling multiple RDS flows over a + single TCP flow, thus it is + (a) upper-bounded to the single-flow bandwidth, + (b) suffers from head-of-line blocking for all the RDS sockets. + + Better throughput (for a fixed small packet size, MTU) can be achieved + by having multiple TCP/IP flows per rds/tcp connection, i.e., multipathed + RDS (mprds). Each such TCP/IP flow constitutes a path for the rds/tcp + connection. RDS sockets will be attached to a path based on some hash + (e.g., of local address and RDS port number) and packets for that RDS + socket will be sent over the attached path using TCP to segment/reassemble + RDS datagrams on that path. + + Multipathed RDS is implemented by splitting the struct rds_connection into + a common (to all paths) part, and a per-path struct rds_conn_path. All + I/O workqs and reconnect threads are driven from the rds_conn_path. + Transports such as TCP that are multipath capable may then set up a + TPC socket per rds_conn_path, and this is managed by the transport via + the transport privatee cp_transport_data pointer. + + Transports announce themselves as multipath capable by setting the + t_mp_capable bit during registration with the rds core module. When the + transport is multipath-capable, rds_sendmsg() hashes outgoing traffic + across multiple paths. The outgoing hash is computed based on the + local address and port that the PF_RDS socket is bound to. + + Additionally, even if the transport is MP capable, we may be + peering with some node that does not support mprds, or supports + a different number of paths. As a result, the peering nodes need + to agree on the number of paths to be used for the connection. + This is done by sending out a control packet exchange before the + first data packet. The control packet exchange must have completed + prior to outgoing hash completion in rds_sendmsg() when the transport + is mutlipath capable. + + The control packet is an RDS ping packet (i.e., packet to rds dest + port 0) with the ping packet having a rds extension header option of + type RDS_EXTHDR_NPATHS, length 2 bytes, and the value is the + number of paths supported by the sender. The "probe" ping packet will + get sent from some reserved port, RDS_FLAG_PROBE_PORT (in <linux/rds.h>) + The receiver of a ping from RDS_FLAG_PROBE_PORT will thus immediately + be able to compute the min(sender_paths, rcvr_paths). The pong + sent in response to a probe-ping should contain the rcvr's npaths + when the rcvr is mprds-capable. + + If the rcvr is not mprds-capable, the exthdr in the ping will be + ignored. In this case the pong will not have any exthdrs, so the sender + of the probe-ping can default to single-path mprds. diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt index 16a924c48..70c926ae2 100644 --- a/Documentation/networking/rxrpc.txt +++ b/Documentation/networking/rxrpc.txt @@ -790,13 +790,12 @@ The kernel interface functions are as follows: Data messages can have their contents extracted with the usual bunch of socket buffer manipulation functions. A data message can be determined to be the last one in a sequence with rxrpc_kernel_is_data_last(). When a - data message has been used up, rxrpc_kernel_data_delivered() should be - called on it.. + data message has been used up, rxrpc_kernel_data_consumed() should be + called on it. - Non-data messages should be handled to rxrpc_kernel_free_skb() to dispose - of. It is possible to get extra refs on all types of message for later - freeing, but this may pin the state of a call until the message is finally - freed. + Messages should be handled to rxrpc_kernel_free_skb() to dispose of. It + is possible to get extra refs on all types of message for later freeing, + but this may pin the state of a call until the message is finally freed. (*) Accept an incoming call. @@ -821,12 +820,14 @@ The kernel interface functions are as follows: Other errors may be returned if the call had been aborted (-ECONNABORTED) or had timed out (-ETIME). - (*) Record the delivery of a data message and free it. + (*) Record the delivery of a data message. - void rxrpc_kernel_data_delivered(struct sk_buff *skb); + void rxrpc_kernel_data_consumed(struct rxrpc_call *call, + struct sk_buff *skb); - This is used to record a data message as having been delivered and to - update the ACK state for the call. The socket buffer will be freed. + This is used to record a data message as having been consumed and to + update the ACK state for the call. The message must still be passed to + rxrpc_kernel_free_skb() for disposal by the caller. (*) Free a message. diff --git a/Documentation/networking/segmentation-offloads.txt b/Documentation/networking/segmentation-offloads.txt new file mode 100644 index 000000000..f200467ad --- /dev/null +++ b/Documentation/networking/segmentation-offloads.txt @@ -0,0 +1,130 @@ +Segmentation Offloads in the Linux Networking Stack + +Introduction +============ + +This document describes a set of techniques in the Linux networking stack +to take advantage of segmentation offload capabilities of various NICs. + +The following technologies are described: + * TCP Segmentation Offload - TSO + * UDP Fragmentation Offload - UFO + * IPIP, SIT, GRE, and UDP Tunnel Offloads + * Generic Segmentation Offload - GSO + * Generic Receive Offload - GRO + * Partial Generic Segmentation Offload - GSO_PARTIAL + +TCP Segmentation Offload +======================== + +TCP segmentation allows a device to segment a single frame into multiple +frames with a data payload size specified in skb_shinfo()->gso_size. +When TCP segmentation requested the bit for either SKB_GSO_TCP or +SKB_GSO_TCP6 should be set in skb_shinfo()->gso_type and +skb_shinfo()->gso_size should be set to a non-zero value. + +TCP segmentation is dependent on support for the use of partial checksum +offload. For this reason TSO is normally disabled if the Tx checksum +offload for a given device is disabled. + +In order to support TCP segmentation offload it is necessary to populate +the network and transport header offsets of the skbuff so that the device +drivers will be able determine the offsets of the IP or IPv6 header and the +TCP header. In addition as CHECKSUM_PARTIAL is required csum_start should +also point to the TCP header of the packet. + +For IPv4 segmentation we support one of two types in terms of the IP ID. +The default behavior is to increment the IP ID with every segment. If the +GSO type SKB_GSO_TCP_FIXEDID is specified then we will not increment the IP +ID and all segments will use the same IP ID. If a device has +NETIF_F_TSO_MANGLEID set then the IP ID can be ignored when performing TSO +and we will either increment the IP ID for all frames, or leave it at a +static value based on driver preference. + +UDP Fragmentation Offload +========================= + +UDP fragmentation offload allows a device to fragment an oversized UDP +datagram into multiple IPv4 fragments. Many of the requirements for UDP +fragmentation offload are the same as TSO. However the IPv4 ID for +fragments should not increment as a single IPv4 datagram is fragmented. + +IPIP, SIT, GRE, UDP Tunnel, and Remote Checksum Offloads +======================================================== + +In addition to the offloads described above it is possible for a frame to +contain additional headers such as an outer tunnel. In order to account +for such instances an additional set of segmentation offload types were +introduced including SKB_GSO_IPIP, SKB_GSO_SIT, SKB_GSO_GRE, and +SKB_GSO_UDP_TUNNEL. These extra segmentation types are used to identify +cases where there are more than just 1 set of headers. For example in the +case of IPIP and SIT we should have the network and transport headers moved +from the standard list of headers to "inner" header offsets. + +Currently only two levels of headers are supported. The convention is to +refer to the tunnel headers as the outer headers, while the encapsulated +data is normally referred to as the inner headers. Below is the list of +calls to access the given headers: + +IPIP/SIT Tunnel: + Outer Inner +MAC skb_mac_header +Network skb_network_header skb_inner_network_header +Transport skb_transport_header + +UDP/GRE Tunnel: + Outer Inner +MAC skb_mac_header skb_inner_mac_header +Network skb_network_header skb_inner_network_header +Transport skb_transport_header skb_inner_transport_header + +In addition to the above tunnel types there are also SKB_GSO_GRE_CSUM and +SKB_GSO_UDP_TUNNEL_CSUM. These two additional tunnel types reflect the +fact that the outer header also requests to have a non-zero checksum +included in the outer header. + +Finally there is SKB_GSO_REMCSUM which indicates that a given tunnel header +has requested a remote checksum offload. In this case the inner headers +will be left with a partial checksum and only the outer header checksum +will be computed. + +Generic Segmentation Offload +============================ + +Generic segmentation offload is a pure software offload that is meant to +deal with cases where device drivers cannot perform the offloads described +above. What occurs in GSO is that a given skbuff will have its data broken +out over multiple skbuffs that have been resized to match the MSS provided +via skb_shinfo()->gso_size. + +Before enabling any hardware segmentation offload a corresponding software +offload is required in GSO. Otherwise it becomes possible for a frame to +be re-routed between devices and end up being unable to be transmitted. + +Generic Receive Offload +======================= + +Generic receive offload is the complement to GSO. Ideally any frame +assembled by GRO should be segmented to create an identical sequence of +frames using GSO, and any sequence of frames segmented by GSO should be +able to be reassembled back to the original by GRO. The only exception to +this is IPv4 ID in the case that the DF bit is set for a given IP header. +If the value of the IPv4 ID is not sequentially incrementing it will be +altered so that it is when a frame assembled via GRO is segmented via GSO. + +Partial Generic Segmentation Offload +==================================== + +Partial generic segmentation offload is a hybrid between TSO and GSO. What +it effectively does is take advantage of certain traits of TCP and tunnels +so that instead of having to rewrite the packet headers for each segment +only the inner-most transport header and possibly the outer-most network +header need to be updated. This allows devices that do not support tunnel +offloads or tunnel offloads with checksum to still make use of segmentation. + +With the partial offload what occurs is that all headers excluding the +inner transport header are updated such that they will contain the correct +values for if the header was simply duplicated. The one exception to this +is the outer IPv4 ID field. It is up to the device drivers to guarantee +that the IPv4 ID field is incremented in the case that a given header does +not have the DF bit set. diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt index d64a14714..e226f8925 100644 --- a/Documentation/networking/stmmac.txt +++ b/Documentation/networking/stmmac.txt @@ -1,6 +1,6 @@ STMicroelectronics 10/100/1000 Synopsys Ethernet driver -Copyright (C) 2007-2014 STMicroelectronics Ltd +Copyright (C) 2007-2015 STMicroelectronics Ltd Author: Giuseppe Cavallaro <peppe.cavallaro@st.com> This is the driver for the MAC 10/100/1000 on-chip Ethernet controllers @@ -138,6 +138,8 @@ struct plat_stmmacenet_data { int (*init)(struct platform_device *pdev, void *priv); void (*exit)(struct platform_device *pdev, void *priv); void *bsp_priv; + int has_gmac4; + bool tso_en; }; Where: @@ -181,6 +183,8 @@ Where: registers. init/exit callbacks should not use or modify platform data. o bsp_priv: another private pointer. + o has_gmac4: uses GMAC4 core. + o tso_en: Enables TSO (TCP Segmentation Offload) feature. For MDIO bus The we have: @@ -278,6 +282,14 @@ Please see the following document: o stmmac_ethtool.c: to implement the ethtool support; o stmmac.h: private driver structure; o common.h: common definitions and VFTs; + o mmc_core.c/mmc.h: Management MAC Counters; + o stmmac_hwtstamp.c: HW timestamp support for PTP; + o stmmac_ptp.c: PTP 1588 clock; + o stmmac_pcs.h: Physical Coding Sublayer common implementation; + o dwmac-<XXX>.c: these are for the platform glue-logic file; e.g. dwmac-sti.c + for STMicroelectronics SoCs. + +- GMAC 3.x o descs.h: descriptor structure definitions; o dwmac1000_core.c: dwmac GiGa core functions; o dwmac1000_dma.c: dma functions for the GMAC chip; @@ -289,11 +301,32 @@ Please see the following document: o enh_desc.c: functions for handling enhanced descriptors; o norm_desc.c: functions for handling normal descriptors; o chain_mode.c/ring_mode.c:: functions to manage RING/CHAINED modes; - o mmc_core.c/mmc.h: Management MAC Counters; - o stmmac_hwtstamp.c: HW timestamp support for PTP; - o stmmac_ptp.c: PTP 1588 clock; - o dwmac-<XXX>.c: these are for the platform glue-logic file; e.g. dwmac-sti.c - for STMicroelectronics SoCs. + +- GMAC4.x generation + o dwmac4_core.c: dwmac GMAC4.x core functions; + o dwmac4_desc.c: functions for handling GMAC4.x descriptors; + o dwmac4_descs.h: descriptor definitions; + o dwmac4_dma.c: dma functions for the GMAC4.x chip; + o dwmac4_dma.h: dma definitions for the GMAC4.x chip; + o dwmac4.h: core definitions for the GMAC4.x chip; + o dwmac4_lib.c: generic GMAC4.x functions; + +4.12) TSO support (GMAC4.x) + +TSO (Tcp Segmentation Offload) feature is supported by GMAC 4.x chip family. +When a packet is sent through TCP protocol, the TCP stack ensures that +the SKB provided to the low level driver (stmmac in our case) matches with +the maximum frame len (IP header + TCP header + payload <= 1500 bytes (for +MTU set to 1500)). It means that if an application using TCP want to send a +packet which will have a length (after adding headers) > 1514 the packet +will be split in several TCP packets: The data payload is split and headers +(TCP/IP ..) are added. It is done by software. + +When TSO is enabled, the TCP stack doesn't care about the maximum frame +length and provide SKB packet to stmmac as it is. The GMAC IP will have to +perform the segmentation by it self to match with maximum frame length. + +This feature can be enabled in device tree through "snps,tso" entry. 5) Debug Information diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.txt index fad63136e..31c391158 100644 --- a/Documentation/networking/switchdev.txt +++ b/Documentation/networking/switchdev.txt @@ -89,6 +89,18 @@ Typically, the management port is not participating in offloaded data plane and is loaded with a different driver, such as a NIC driver, on the management port device. +Switch ID +^^^^^^^^^ + +The switchdev driver must implement the switchdev op switchdev_port_attr_get +for SWITCHDEV_ATTR_ID_PORT_PARENT_ID for each port netdev, returning the same +physical ID for each port of a switch. The ID must be unique between switches +on the same system. The ID does not need to be unique between switches on +different systems. + +The switch ID is used to locate ports on a switch and to know if aggregated +ports belong to the same switch. + Port Netdev Naming ^^^^^^^^^^^^^^^^^^ @@ -104,25 +116,13 @@ external configuration. For example, if a physical 40G port is split logically into 4 10G ports, resulting in 4 port netdevs, the device can give a unique name for each port using port PHYS name. The udev rule would be: -SUBSYSTEM=="net", ACTION=="add", DRIVER="<driver>", ATTR{phys_port_name}!="", \ - NAME="$attr{phys_port_name}" +SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \ + ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}" Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0 would be sub-port 0 on port 1 on switch 1. -Switch ID -^^^^^^^^^ - -The switchdev driver must implement the switchdev op switchdev_port_attr_get -for SWITCHDEV_ATTR_ID_PORT_PARENT_ID for each port netdev, returning the same -physical ID for each port of a switch. The ID must be unique between switches -on the same system. The ID does not need to be unique between switches on -different systems. - -The switch ID is used to locate ports on a switch and to know if aggregated -ports belong to the same switch. - Port Features ^^^^^^^^^^^^^ @@ -386,7 +386,7 @@ used. First phase is to "prepare" anything needed, including various checks, memory allocation, etc. The goal is to handle the stuff that is not unlikely to fail here. The second phase is to "commit" the actual changes. -Switchdev provides an inftrastructure for sharing items (for example memory +Switchdev provides an infrastructure for sharing items (for example memory allocations) between the two phases. The object created by a driver in "prepare" phase and it is queued up by: diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt index a977339fb..671cccf0d 100644 --- a/Documentation/networking/timestamping.txt +++ b/Documentation/networking/timestamping.txt @@ -44,11 +44,17 @@ timeval of SO_TIMESTAMP (ms). Supports multiple types of timestamp requests. As a result, this socket option takes a bitmap of flags, not a boolean. In - err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, (void *) val, &val); + err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, (void *) val, + sizeof(val)); val is an integer with any of the following bits set. Setting other bit returns EINVAL and does not change the current state. +The socket option configures timestamp generation for individual +sk_buffs (1.3.1), timestamp reporting to the socket's error +queue (1.3.2) and options (1.3.3). Timestamp generation can also +be enabled for individual sendmsg calls using cmsg (1.3.4). + 1.3.1 Timestamp Generation @@ -71,13 +77,16 @@ SOF_TIMESTAMPING_RX_SOFTWARE: kernel receive stack. SOF_TIMESTAMPING_TX_HARDWARE: - Request tx timestamps generated by the network adapter. + Request tx timestamps generated by the network adapter. This flag + can be enabled via both socket options and control messages. SOF_TIMESTAMPING_TX_SOFTWARE: Request tx timestamps when data leaves the kernel. These timestamps are generated in the device driver as close as possible, but always prior to, passing the packet to the network interface. Hence, they require driver support and may not be available for all devices. + This flag can be enabled via both socket options and control messages. + SOF_TIMESTAMPING_TX_SCHED: Request tx timestamps prior to entering the packet scheduler. Kernel @@ -90,7 +99,8 @@ SOF_TIMESTAMPING_TX_SCHED: machines with virtual devices where a transmitted packet travels through multiple devices and, hence, multiple packet schedulers, a timestamp is generated at each layer. This allows for fine - grained measurement of queuing delay. + grained measurement of queuing delay. This flag can be enabled + via both socket options and control messages. SOF_TIMESTAMPING_TX_ACK: Request tx timestamps when all data in the send buffer has been @@ -99,6 +109,7 @@ SOF_TIMESTAMPING_TX_ACK: over-report measurement, because the timestamp is generated when all data up to and including the buffer at send() was acknowledged: the cumulative acknowledgment. The mechanism ignores SACK and FACK. + This flag can be enabled via both socket options and control messages. 1.3.2 Timestamp Reporting @@ -183,6 +194,37 @@ having access to the contents of the original packet, so cannot be combined with SOF_TIMESTAMPING_OPT_TSONLY. +1.3.4. Enabling timestamps via control messages + +In addition to socket options, timestamp generation can be requested +per write via cmsg, only for SOF_TIMESTAMPING_TX_* (see Section 1.3.1). +Using this feature, applications can sample timestamps per sendmsg() +without paying the overhead of enabling and disabling timestamps via +setsockopt: + + struct msghdr *msg; + ... + cmsg = CMSG_FIRSTHDR(msg); + cmsg->cmsg_level = SOL_SOCKET; + cmsg->cmsg_type = SO_TIMESTAMPING; + cmsg->cmsg_len = CMSG_LEN(sizeof(__u32)); + *((__u32 *) CMSG_DATA(cmsg)) = SOF_TIMESTAMPING_TX_SCHED | + SOF_TIMESTAMPING_TX_SOFTWARE | + SOF_TIMESTAMPING_TX_ACK; + err = sendmsg(fd, msg, 0); + +The SOF_TIMESTAMPING_TX_* flags set via cmsg will override +the SOF_TIMESTAMPING_TX_* flags set via setsockopt. + +Moreover, applications must still enable timestamp reporting via +setsockopt to receive timestamps: + + __u32 val = SOF_TIMESTAMPING_SOFTWARE | + SOF_TIMESTAMPING_OPT_ID /* or any other flag */; + err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, (void *) val, + sizeof(val)); + + 1.4 Bytestream Timestamps The SO_TIMESTAMPING interface supports timestamping of bytes in a diff --git a/Documentation/networking/vrf.txt b/Documentation/networking/vrf.txt index d52aa10cf..755dab856 100644 --- a/Documentation/networking/vrf.txt +++ b/Documentation/networking/vrf.txt @@ -15,9 +15,9 @@ the use of higher priority ip rules (Policy Based Routing, PBR) to take precedence over the VRF device rules directing specific traffic as desired. In addition, VRF devices allow VRFs to be nested within namespaces. For -example network namespaces provide separation of network interfaces at L1 -(Layer 1 separation), VLANs on the interfaces within a namespace provide -L2 separation and then VRF devices provide L3 separation. +example network namespaces provide separation of network interfaces at the +device layer, VLANs on the interfaces within a namespace provide L2 separation +and then VRF devices provide L3 separation. Design ------ @@ -37,21 +37,22 @@ are then enslaved to a VRF device: +------+ +------+ Packets received on an enslaved device and are switched to the VRF device -using an rx_handler which gives the impression that packets flow through -the VRF device. Similarly on egress routing rules are used to send packets -to the VRF device driver before getting sent out the actual interface. This -allows tcpdump on a VRF device to capture all packets into and out of the -VRF as a whole.[1] Similiarly, netfilter [2] and tc rules can be applied -using the VRF device to specify rules that apply to the VRF domain as a whole. +in the IPv4 and IPv6 processing stacks giving the impression that packets +flow through the VRF device. Similarly on egress routing rules are used to +send packets to the VRF device driver before getting sent out the actual +interface. This allows tcpdump on a VRF device to capture all packets into +and out of the VRF as a whole.[1] Similarly, netfilter[2] and tc rules can be +applied using the VRF device to specify rules that apply to the VRF domain +as a whole. [1] Packets in the forwarded state do not flow through the device, so those packets are not seen by tcpdump. Will revisit this limitation in a future release. -[2] Iptables on ingress is limited to NF_INET_PRE_ROUTING only with skb->dev - set to real ingress device and egress is limited to NF_INET_POST_ROUTING. - Will revisit this limitation in a future release. - +[2] Iptables on ingress supports PREROUTING with skb->dev set to the real + ingress device and both INPUT and PREROUTING rules with skb->dev set to + the VRF device. For egress POSTROUTING and OUTPUT rules can be written + using either the VRF device or real egress device. Setup ----- @@ -59,23 +60,33 @@ Setup e.g, ip link add vrf-blue type vrf table 10 ip link set dev vrf-blue up -2. Rules are added that send lookups to the associated FIB table when the - iif or oif is the VRF device. e.g., +2. An l3mdev FIB rule directs lookups to the table associated with the device. + A single l3mdev rule is sufficient for all VRFs. The VRF device adds the + l3mdev rule for IPv4 and IPv6 when the first device is created with a + default preference of 1000. Users may delete the rule if desired and add + with a different priority or install per-VRF rules. + + Prior to the v4.8 kernel iif and oif rules are needed for each VRF device: ip ru add oif vrf-blue table 10 ip ru add iif vrf-blue table 10 - Set the default route for the table (and hence default route for the VRF). - e.g, ip route add table 10 prohibit default +3. Set the default route for the table (and hence default route for the VRF). + ip route add table 10 unreachable default -3. Enslave L3 interfaces to a VRF device. - e.g, ip link set dev eth1 master vrf-blue +4. Enslave L3 interfaces to a VRF device. + ip link set dev eth1 master vrf-blue Local and connected routes for enslaved devices are automatically moved to the table associated with VRF device. Any additional routes depending on - the enslaved device will need to be reinserted following the enslavement. + the enslaved device are dropped and will need to be reinserted to the VRF + FIB table following the enslavement. + + The IPv6 sysctl option keep_addr_on_down can be enabled to keep IPv6 global + addresses as VRF enslavement changes. + sysctl -w net.ipv6.conf.all.keep_addr_on_down=1 -4. Additional VRF routes are added to associated table. - e.g., ip route add table 10 ... +5. Additional VRF routes are added to associated table. + ip route add table 10 ... Applications @@ -87,39 +98,34 @@ VRF device: or to specify the output device using cmsg and IP_PKTINFO. +TCP services running in the default VRF context (ie., not bound to any VRF +device) can work across all VRF domains by enabling the tcp_l3mdev_accept +sysctl option: + sysctl -w net.ipv4.tcp_l3mdev_accept=1 -Limitations ------------ -Index of original ingress interface is not available via cmsg. Will address -soon. +netfilter rules on the VRF device can be used to limit access to services +running in the default VRF context as well. + +The default VRF does not have limited scope with respect to port bindings. +That is, if a process does a wildcard bind to a port in the default VRF it +owns the port across all VRF domains within the network namespace. ################################################################################ Using iproute2 for VRFs ======================= -VRF devices do *not* have to start with 'vrf-'. That is a convention used here -for emphasis of the device type, similar to use of 'br' in bridge names. +iproute2 supports the vrf keyword as of v4.7. For backwards compatibility this +section lists both commands where appropriate -- with the vrf keyword and the +older form without it. 1. Create a VRF To instantiate a VRF device and associate it with a table: $ ip link add dev NAME type vrf table ID - Remember to add the ip rules as well: - $ ip ru add oif NAME table 10 - $ ip ru add iif NAME table 10 - $ ip -6 ru add oif NAME table 10 - $ ip -6 ru add iif NAME table 10 - - Without the rules route lookups are not directed to the table. - - For example: - $ ip link add dev vrf-blue type vrf table 10 - $ ip ru add pref 200 oif vrf-blue table 10 - $ ip ru add pref 200 iif vrf-blue table 10 - $ ip -6 ru add pref 200 oif vrf-blue table 10 - $ ip -6 ru add pref 200 iif vrf-blue table 10 - + As of v4.8 the kernel supports the l3mdev FIB rule where a single rule + covers all VRFs. The l3mdev rule is created for IPv4 and IPv6 on first + device create. 2. List VRFs @@ -129,16 +135,16 @@ for emphasis of the device type, similar to use of 'br' in bridge names. For example: $ ip -d link show type vrf - 11: vrf-mgmt: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 + 11: mgmt: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 link/ether 72:b3:ba:91:e2:24 brd ff:ff:ff:ff:ff:ff promiscuity 0 vrf table 1 addrgenmode eui64 - 12: vrf-red: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 + 12: red: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 link/ether b6:6f:6e:f6:da:73 brd ff:ff:ff:ff:ff:ff promiscuity 0 vrf table 10 addrgenmode eui64 - 13: vrf-blue: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 + 13: blue: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 link/ether 36:62:e8:7d:bb:8c brd ff:ff:ff:ff:ff:ff promiscuity 0 vrf table 66 addrgenmode eui64 - 14: vrf-green: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 + 14: green: <NOARP,MASTER,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000 link/ether e6:28:b8:63:70:bb brd ff:ff:ff:ff:ff:ff promiscuity 0 vrf table 81 addrgenmode eui64 @@ -146,43 +152,44 @@ for emphasis of the device type, similar to use of 'br' in bridge names. Or in brief output: $ ip -br link show type vrf - vrf-mgmt UP 72:b3:ba:91:e2:24 <NOARP,MASTER,UP,LOWER_UP> - vrf-red UP b6:6f:6e:f6:da:73 <NOARP,MASTER,UP,LOWER_UP> - vrf-blue UP 36:62:e8:7d:bb:8c <NOARP,MASTER,UP,LOWER_UP> - vrf-green UP e6:28:b8:63:70:bb <NOARP,MASTER,UP,LOWER_UP> + mgmt UP 72:b3:ba:91:e2:24 <NOARP,MASTER,UP,LOWER_UP> + red UP b6:6f:6e:f6:da:73 <NOARP,MASTER,UP,LOWER_UP> + blue UP 36:62:e8:7d:bb:8c <NOARP,MASTER,UP,LOWER_UP> + green UP e6:28:b8:63:70:bb <NOARP,MASTER,UP,LOWER_UP> 3. Assign a Network Interface to a VRF Network interfaces are assigned to a VRF by enslaving the netdevice to a VRF device: - $ ip link set dev NAME master VRF-NAME + $ ip link set dev NAME master NAME On enslavement connected and local routes are automatically moved to the table associated with the VRF device. For example: - $ ip link set dev eth0 master vrf-mgmt + $ ip link set dev eth0 master mgmt 4. Show Devices Assigned to a VRF To show devices that have been assigned to a specific VRF add the master option to the ip command: - $ ip link show master VRF-NAME + $ ip link show vrf NAME + $ ip link show master NAME For example: - $ ip link show master vrf-red - 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vrf-red state UP mode DEFAULT group default qlen 1000 + $ ip link show vrf red + 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000 link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff - 4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vrf-red state UP mode DEFAULT group default qlen 1000 + 4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP mode DEFAULT group default qlen 1000 link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff - 7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master vrf-red state DOWN mode DEFAULT group default qlen 1000 + 7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master red state DOWN mode DEFAULT group default qlen 1000 link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff Or using the brief output: - $ ip -br link show master vrf-red + $ ip -br link show vrf red eth1 UP 02:00:00:00:02:02 <BROADCAST,MULTICAST,UP,LOWER_UP> eth2 UP 02:00:00:00:02:03 <BROADCAST,MULTICAST,UP,LOWER_UP> eth5 DOWN 02:00:00:00:02:06 <BROADCAST,MULTICAST> @@ -192,26 +199,28 @@ for emphasis of the device type, similar to use of 'br' in bridge names. To list neighbor entries associated with devices enslaved to a VRF device add the master option to the ip command: - $ ip [-6] neigh show master VRF-NAME + $ ip [-6] neigh show vrf NAME + $ ip [-6] neigh show master NAME For example: - $ ip neigh show master vrf-red + $ ip neigh show vrf red 10.2.1.254 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE 10.2.2.254 dev eth2 lladdr 5e:54:01:6a:ee:80 REACHABLE - $ ip -6 neigh show master vrf-red - 2002:1::64 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE + $ ip -6 neigh show vrf red + 2002:1::64 dev eth1 lladdr a6:d9:c7:4f:06:23 REACHABLE 6. Show Addresses for a VRF To show addresses for interfaces associated with a VRF add the master option to the ip command: - $ ip addr show master VRF-NAME + $ ip addr show vrf NAME + $ ip addr show master NAME For example: - $ ip addr show master vrf-red - 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vrf-red state UP group default qlen 1000 + $ ip addr show vrf red + 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 link/ether 02:00:00:00:02:02 brd ff:ff:ff:ff:ff:ff inet 10.2.1.2/24 brd 10.2.1.255 scope global eth1 valid_lft forever preferred_lft forever @@ -219,7 +228,7 @@ for emphasis of the device type, similar to use of 'br' in bridge names. valid_lft forever preferred_lft forever inet6 fe80::ff:fe00:202/64 scope link valid_lft forever preferred_lft forever - 4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master vrf-red state UP group default qlen 1000 + 4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master red state UP group default qlen 1000 link/ether 02:00:00:00:02:03 brd ff:ff:ff:ff:ff:ff inet 10.2.2.2/24 brd 10.2.2.255 scope global eth2 valid_lft forever preferred_lft forever @@ -227,11 +236,11 @@ for emphasis of the device type, similar to use of 'br' in bridge names. valid_lft forever preferred_lft forever inet6 fe80::ff:fe00:203/64 scope link valid_lft forever preferred_lft forever - 7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master vrf-red state DOWN group default qlen 1000 + 7: eth5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop master red state DOWN group default qlen 1000 link/ether 02:00:00:00:02:06 brd ff:ff:ff:ff:ff:ff Or in brief format: - $ ip -br addr show master vrf-red + $ ip -br addr show vrf red eth1 UP 10.2.1.2/24 2002:1::2/120 fe80::ff:fe00:202/64 eth2 UP 10.2.2.2/24 2002:2::2/120 fe80::ff:fe00:203/64 eth5 DOWN @@ -241,10 +250,11 @@ for emphasis of the device type, similar to use of 'br' in bridge names. To show routes for a VRF use the ip command to display the table associated with the VRF device: + $ ip [-6] route show vrf NAME $ ip [-6] route show table ID For example: - $ ip route show table vrf-red + $ ip route show vrf red prohibit default broadcast 10.2.1.0 dev eth1 proto kernel scope link src 10.2.1.2 10.2.1.0/24 dev eth1 proto kernel scope link src 10.2.1.2 @@ -255,7 +265,7 @@ for emphasis of the device type, similar to use of 'br' in bridge names. local 10.2.2.2 dev eth2 proto kernel scope host src 10.2.2.2 broadcast 10.2.2.255 dev eth2 proto kernel scope link src 10.2.2.2 - $ ip -6 route show table vrf-red + $ ip -6 route show vrf red local 2002:1:: dev lo proto none metric 0 pref medium local 2002:1::2 dev lo proto none metric 0 pref medium 2002:1::/120 dev eth1 proto kernel metric 256 pref medium @@ -268,23 +278,24 @@ for emphasis of the device type, similar to use of 'br' in bridge names. local fe80::ff:fe00:203 dev lo proto none metric 0 pref medium fe80::/64 dev eth1 proto kernel metric 256 pref medium fe80::/64 dev eth2 proto kernel metric 256 pref medium - ff00::/8 dev vrf-red metric 256 pref medium + ff00::/8 dev red metric 256 pref medium ff00::/8 dev eth1 metric 256 pref medium ff00::/8 dev eth2 metric 256 pref medium 8. Route Lookup for a VRF - A test route lookup can be done for a VRF by adding the oif option to ip: - $ ip [-6] route get oif VRF-NAME ADDRESS + A test route lookup can be done for a VRF: + $ ip [-6] route get vrf NAME ADDRESS + $ ip [-6] route get oif NAME ADDRESS For example: - $ ip route get 10.2.1.40 oif vrf-red - 10.2.1.40 dev eth1 table vrf-red src 10.2.1.2 + $ ip route get 10.2.1.40 vrf red + 10.2.1.40 dev eth1 table red src 10.2.1.2 cache - $ ip -6 route get 2002:1::32 oif vrf-red - 2002:1::32 from :: dev eth1 table vrf-red proto kernel src 2002:1::2 metric 256 pref medium + $ ip -6 route get 2002:1::32 vrf red + 2002:1::32 from :: dev eth1 table red proto kernel src 2002:1::2 metric 256 pref medium 9. Removing Network Interface from a VRF @@ -303,46 +314,40 @@ for emphasis of the device type, similar to use of 'br' in bridge names. Commands used in this example: -cat >> /etc/iproute2/rt_tables <<EOF -1 vrf-mgmt -10 vrf-red -66 vrf-blue -81 vrf-green +cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF +1 mgmt +10 red +66 blue +81 green EOF function vrf_create { VRF=$1 TBID=$2 - # create VRF device - ip link add vrf-${VRF} type vrf table ${TBID} - # add rules that direct lookups to vrf table - ip ru add pref 200 oif vrf-${VRF} table ${TBID} - ip ru add pref 200 iif vrf-${VRF} table ${TBID} - ip -6 ru add pref 200 oif vrf-${VRF} table ${TBID} - ip -6 ru add pref 200 iif vrf-${VRF} table ${TBID} + # create VRF device + ip link add ${VRF} type vrf table ${TBID} if [ "${VRF}" != "mgmt" ]; then - ip route add table ${TBID} prohibit default + ip route add table ${TBID} unreachable default fi - ip link set dev vrf-${VRF} up - ip link set dev vrf-${VRF} state up + ip link set dev ${VRF} up } vrf_create mgmt 1 -ip link set dev eth0 master vrf-mgmt +ip link set dev eth0 master mgmt vrf_create red 10 -ip link set dev eth1 master vrf-red -ip link set dev eth2 master vrf-red -ip link set dev eth5 master vrf-red +ip link set dev eth1 master red +ip link set dev eth2 master red +ip link set dev eth5 master red vrf_create blue 66 -ip link set dev eth3 master vrf-blue +ip link set dev eth3 master blue vrf_create green 81 -ip link set dev eth4 master vrf-green +ip link set dev eth4 master green Interface addresses from /etc/network/interfaces: diff --git a/Documentation/networking/xfrm_sync.txt b/Documentation/networking/xfrm_sync.txt index d7aac9ded..8d88e0f2e 100644 --- a/Documentation/networking/xfrm_sync.txt +++ b/Documentation/networking/xfrm_sync.txt @@ -4,7 +4,7 @@ Krisztian <hidden@balabit.hu> and others and additional patches from Jamal <hadi@cyberus.ca>. The end goal for syncing is to be able to insert attributes + generate -events so that the an SA can be safely moved from one machine to another +events so that the SA can be safely moved from one machine to another for HA purposes. The idea is to synchronize the SA so that the takeover machine can do the processing of the SA as accurate as possible if it has access to it. @@ -13,7 +13,7 @@ We already have the ability to generate SA add/del/upd events. These patches add ability to sync and have accurate lifetime byte (to ensure proper decay of SAs) and replay counters to avoid replay attacks with as minimal loss at failover time. -This way a backup stays as closely uptodate as an active member. +This way a backup stays as closely up-to-date as an active member. Because the above items change for every packet the SA receives, it is possible for a lot of the events to be generated. @@ -163,7 +163,7 @@ If you have an SA that is getting hit by traffic in bursts such that there is a period where the timer threshold expires with no packets seen, then an odd behavior is seen as follows: The first packet arrival after a timer expiry will trigger a timeout -aevent; i.e we dont wait for a timeout period or a packet threshold +event; i.e we don't wait for a timeout period or a packet threshold to be reached. This is done for simplicity and efficiency reasons. -JHS |