summaryrefslogtreecommitdiff
path: root/Documentation/networking
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/networking')
-rw-r--r--Documentation/networking/00-INDEX2
-rw-r--r--Documentation/networking/altera_tse.txt6
-rw-r--r--Documentation/networking/batman-adv.txt2
-rw-r--r--Documentation/networking/checksum-offloads.txt119
-rw-r--r--Documentation/networking/dsa/dsa.txt22
-rw-r--r--Documentation/networking/ip-sysctl.txt54
-rw-r--r--Documentation/networking/ipvlan.txt6
-rw-r--r--Documentation/networking/kcm.txt285
-rw-r--r--Documentation/networking/mac80211-injection.txt17
-rw-r--r--Documentation/networking/netlink_mmap.txt332
-rw-r--r--Documentation/networking/phy.txt10
-rw-r--r--Documentation/networking/pktgen.txt6
-rw-r--r--Documentation/networking/rds.txt4
-rw-r--r--Documentation/networking/switchdev.txt2
-rw-r--r--Documentation/networking/vrf.txt2
-rw-r--r--Documentation/networking/xfrm_sync.txt6
16 files changed, 503 insertions, 372 deletions
diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX
index df27a1a50..415154a48 100644
--- a/Documentation/networking/00-INDEX
+++ b/Documentation/networking/00-INDEX
@@ -44,6 +44,8 @@ can.txt
- documentation on CAN protocol family.
cdc_mbim.txt
- 3G/LTE USB modem (Mobile Broadband Interface Model)
+checksum-offloads.txt
+ - Explanation of checksum offloads; LCO, RCO
cops.txt
- info on the COPS LocalTalk Linux driver
cs89x0.txt
diff --git a/Documentation/networking/altera_tse.txt b/Documentation/networking/altera_tse.txt
index 3f24df8c6..50b8589d1 100644
--- a/Documentation/networking/altera_tse.txt
+++ b/Documentation/networking/altera_tse.txt
@@ -6,7 +6,7 @@ This is the driver for the Altera Triple-Speed Ethernet (TSE) controllers
using the SGDMA and MSGDMA soft DMA IP components. The driver uses the
platform bus to obtain component resources. The designs used to test this
driver were built for a Cyclone(R) V SOC FPGA board, a Cyclone(R) V FPGA board,
-and tested with ARM and NIOS processor hosts seperately. The anticipated use
+and tested with ARM and NIOS processor hosts separately. The anticipated use
cases are simple communications between an embedded system and an external peer
for status and simple configuration of the embedded system.
@@ -65,14 +65,14 @@ Driver parameters can be also passed in command line by using:
4.1) Transmit process
When the driver's transmit routine is called by the kernel, it sets up a
transmit descriptor by calling the underlying DMA transmit routine (SGDMA or
-MSGDMA), and initites a transmit operation. Once the transmit is complete, an
+MSGDMA), and initiates a transmit operation. Once the transmit is complete, an
interrupt is driven by the transmit DMA logic. The driver handles the transmit
completion in the context of the interrupt handling chain by recycling
resource required to send and track the requested transmit operation.
4.2) Receive process
The driver will post receive buffers to the receive DMA logic during driver
-intialization. Receive buffers may or may not be queued depending upon the
+initialization. Receive buffers may or may not be queued depending upon the
underlying DMA logic (MSGDMA is able queue receive buffers, SGDMA is not able
to queue receive buffers to the SGDMA receive logic). When a packet is
received, the DMA logic generates an interrupt. The driver handles a receive
diff --git a/Documentation/networking/batman-adv.txt b/Documentation/networking/batman-adv.txt
index ff23b755f..1b5e7a7f2 100644
--- a/Documentation/networking/batman-adv.txt
+++ b/Documentation/networking/batman-adv.txt
@@ -187,7 +187,7 @@ interfaces to the kernel module settings.
For more information, please see the manpage (man batctl).
-batctl is available on http://www.open-mesh.org/
+batctl is available on https://www.open-mesh.org/
CONTACT
diff --git a/Documentation/networking/checksum-offloads.txt b/Documentation/networking/checksum-offloads.txt
new file mode 100644
index 000000000..56e368612
--- /dev/null
+++ b/Documentation/networking/checksum-offloads.txt
@@ -0,0 +1,119 @@
+Checksum Offloads in the Linux Networking Stack
+
+
+Introduction
+============
+
+This document describes a set of techniques in the Linux networking stack
+ to take advantage of checksum offload capabilities of various NICs.
+
+The following technologies are described:
+ * TX Checksum Offload
+ * LCO: Local Checksum Offload
+ * RCO: Remote Checksum Offload
+
+Things that should be documented here but aren't yet:
+ * RX Checksum Offload
+ * CHECKSUM_UNNECESSARY conversion
+
+
+TX Checksum Offload
+===================
+
+The interface for offloading a transmit checksum to a device is explained
+ in detail in comments near the top of include/linux/skbuff.h.
+In brief, it allows to request the device fill in a single ones-complement
+ checksum defined by the sk_buff fields skb->csum_start and
+ skb->csum_offset. The device should compute the 16-bit ones-complement
+ checksum (i.e. the 'IP-style' checksum) from csum_start to the end of the
+ packet, and fill in the result at (csum_start + csum_offset).
+Because csum_offset cannot be negative, this ensures that the previous
+ value of the checksum field is included in the checksum computation, thus
+ it can be used to supply any needed corrections to the checksum (such as
+ the sum of the pseudo-header for UDP or TCP).
+This interface only allows a single checksum to be offloaded. Where
+ encapsulation is used, the packet may have multiple checksum fields in
+ different header layers, and the rest will have to be handled by another
+ mechanism such as LCO or RCO.
+No offloading of the IP header checksum is performed; it is always done in
+ software. This is OK because when we build the IP header, we obviously
+ have it in cache, so summing it isn't expensive. It's also rather short.
+The requirements for GSO are more complicated, because when segmenting an
+ encapsulated packet both the inner and outer checksums may need to be
+ edited or recomputed for each resulting segment. See the skbuff.h comment
+ (section 'E') for more details.
+
+A driver declares its offload capabilities in netdev->hw_features; see
+ Documentation/networking/netdev-features for more. Note that a device
+ which only advertises NETIF_F_IP[V6]_CSUM must still obey the csum_start
+ and csum_offset given in the SKB; if it tries to deduce these itself in
+ hardware (as some NICs do) the driver should check that the values in the
+ SKB match those which the hardware will deduce, and if not, fall back to
+ checksumming in software instead (with skb_checksum_help or one of the
+ skb_csum_off_chk* functions as mentioned in include/linux/skbuff.h). This
+ is a pain, but that's what you get when hardware tries to be clever.
+
+The stack should, for the most part, assume that checksum offload is
+ supported by the underlying device. The only place that should check is
+ validate_xmit_skb(), and the functions it calls directly or indirectly.
+ That function compares the offload features requested by the SKB (which
+ may include other offloads besides TX Checksum Offload) and, if they are
+ not supported or enabled on the device (determined by netdev->features),
+ performs the corresponding offload in software. In the case of TX
+ Checksum Offload, that means calling skb_checksum_help(skb).
+
+
+LCO: Local Checksum Offload
+===========================
+
+LCO is a technique for efficiently computing the outer checksum of an
+ encapsulated datagram when the inner checksum is due to be offloaded.
+The ones-complement sum of a correctly checksummed TCP or UDP packet is
+ equal to the complement of the sum of the pseudo header, because everything
+ else gets 'cancelled out' by the checksum field. This is because the sum was
+ complemented before being written to the checksum field.
+More generally, this holds in any case where the 'IP-style' ones complement
+ checksum is used, and thus any checksum that TX Checksum Offload supports.
+That is, if we have set up TX Checksum Offload with a start/offset pair, we
+ know that after the device has filled in that checksum, the ones
+ complement sum from csum_start to the end of the packet will be equal to
+ the complement of whatever value we put in the checksum field beforehand.
+ This allows us to compute the outer checksum without looking at the payload:
+ we simply stop summing when we get to csum_start, then add the complement of
+ the 16-bit word at (csum_start + csum_offset).
+Then, when the true inner checksum is filled in (either by hardware or by
+ skb_checksum_help()), the outer checksum will become correct by virtue of
+ the arithmetic.
+
+LCO is performed by the stack when constructing an outer UDP header for an
+ encapsulation such as VXLAN or GENEVE, in udp_set_csum(). Similarly for
+ the IPv6 equivalents, in udp6_set_csum().
+It is also performed when constructing an IPv4 GRE header, in
+ net/ipv4/ip_gre.c:build_header(). It is *not* currently performed when
+ constructing an IPv6 GRE header; the GRE checksum is computed over the
+ whole packet in net/ipv6/ip6_gre.c:ip6gre_xmit2(), but it should be
+ possible to use LCO here as IPv6 GRE still uses an IP-style checksum.
+All of the LCO implementations use a helper function lco_csum(), in
+ include/linux/skbuff.h.
+
+LCO can safely be used for nested encapsulations; in this case, the outer
+ encapsulation layer will sum over both its own header and the 'middle'
+ header. This does mean that the 'middle' header will get summed multiple
+ times, but there doesn't seem to be a way to avoid that without incurring
+ bigger costs (e.g. in SKB bloat).
+
+
+RCO: Remote Checksum Offload
+============================
+
+RCO is a technique for eliding the inner checksum of an encapsulated
+ datagram, allowing the outer checksum to be offloaded. It does, however,
+ involve a change to the encapsulation protocols, which the receiver must
+ also support. For this reason, it is disabled by default.
+RCO is detailed in the following Internet-Drafts:
+https://tools.ietf.org/html/draft-herbert-remotecsumoffload-00
+https://tools.ietf.org/html/draft-herbert-vxlan-rco-00
+In Linux, RCO is implemented individually in each encapsulation protocol,
+ and most tunnel types have flags controlling its use. For instance, VXLAN
+ has the flag VXLAN_F_REMCSUM_TX (per struct vxlan_rdst) to indicate that
+ RCO should be used when transmitting to a given remote destination.
diff --git a/Documentation/networking/dsa/dsa.txt b/Documentation/networking/dsa/dsa.txt
index aa9c1f931..3b196c304 100644
--- a/Documentation/networking/dsa/dsa.txt
+++ b/Documentation/networking/dsa/dsa.txt
@@ -521,20 +521,17 @@ See Documentation/hwmon/sysfs-interface for details.
Bridge layer
------------
-- port_join_bridge: bridge layer function invoked when a given switch port is
+- port_bridge_join: bridge layer function invoked when a given switch port is
added to a bridge, this function should be doing the necessary at the switch
level to permit the joining port from being added to the relevant logical
- domain for it to ingress/egress traffic with other members of the bridge. DSA
- does nothing but calculate a bitmask of switch ports currently members of the
- specified bridge being requested the join
+ domain for it to ingress/egress traffic with other members of the bridge.
-- port_leave_bridge: bridge layer function invoked when a given switch port is
+- port_bridge_leave: bridge layer function invoked when a given switch port is
removed from a bridge, this function should be doing the necessary at the
switch level to deny the leaving port from ingress/egress traffic from the
remaining bridge members. When the port leaves the bridge, it should be aged
out at the switch hardware for the switch to (re) learn MAC addresses behind
- this port. DSA calculates the bitmask of ports still members of the bridge
- being left
+ this port.
- port_stp_update: bridge layer function invoked when a given switch port STP
state is computed by the bridge layer and should be propagated to switch
@@ -545,20 +542,15 @@ Bridge layer
Bridge VLAN filtering
---------------------
-- port_pvid_get: bridge layer function invoked when a Port-based VLAN ID is
- queried for the given switch port
-
-- port_pvid_set: bridge layer function invoked when a Port-based VLAN ID needs
- to be configured on the given switch port
-
- port_vlan_add: bridge layer function invoked when a VLAN is configured
(tagged or untagged) for the given switch port
- port_vlan_del: bridge layer function invoked when a VLAN is removed from the
given switch port
-- vlan_getnext: bridge layer function invoked to query the next configured VLAN
- in the switch, i.e. returns the bitmaps of members and untagged ports
+- port_vlan_dump: bridge layer function invoked with a switchdev callback
+ function that the driver has to call for each VLAN the given port is a member
+ of. A switchdev object is used to carry the VID and bridge flags.
- port_fdb_add: bridge layer function invoked when the bridge wants to install a
Forwarding Database entry, the switch hardware should be programmed with the
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 73b36d7c7..b183e2b60 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -946,15 +946,20 @@ igmp_max_memberships - INTEGER
The value 5459 assumes no IP header options, so in practice
this number may be lower.
- conf/interface/* changes special settings per interface (where
- "interface" is the name of your network interface)
-
- conf/all/* is special, changes the settings for all interfaces
+igmp_max_msf - INTEGER
+ Maximum number of addresses allowed in the source filter list for a
+ multicast group.
+ Default: 10
igmp_qrv - INTEGER
- Controls the IGMP query robustness variable (see RFC2236 8.1).
- Default: 2 (as specified by RFC2236 8.1)
- Minimum: 1 (as specified by RFC6636 4.5)
+ Controls the IGMP query robustness variable (see RFC2236 8.1).
+ Default: 2 (as specified by RFC2236 8.1)
+ Minimum: 1 (as specified by RFC6636 4.5)
+
+conf/interface/* changes special settings per interface (where
+"interface" is the name of your network interface)
+
+conf/all/* is special, changes the settings for all interfaces
log_martians - BOOLEAN
Log packets with impossible addresses to kernel log.
@@ -1216,6 +1221,19 @@ promote_secondaries - BOOLEAN
promote a corresponding secondary IP address instead of
removing all the corresponding secondary IP addresses.
+drop_unicast_in_l2_multicast - BOOLEAN
+ Drop any unicast IP packets that are received in link-layer
+ multicast (or broadcast) frames.
+ This behavior (for multicast) is actually a SHOULD in RFC
+ 1122, but is disabled by default for compatibility reasons.
+ Default: off (0)
+
+drop_gratuitous_arp - BOOLEAN
+ Drop all gratuitous ARP frames, for example if there's a known
+ good ARP proxy on the network and such frames need not be used
+ (or in the case of 802.11, must not be used to prevent attacks.)
+ Default: off (0)
+
tag - INTEGER
Allows you to write a number, which can be used as required.
@@ -1550,6 +1568,15 @@ temp_prefered_lft - INTEGER
Preferred lifetime (in seconds) for temporary addresses.
Default: 86400 (1 day)
+keep_addr_on_down - INTEGER
+ Keep all IPv6 addresses on an interface down event. If set static
+ global addresses with no expiration time are not flushed.
+ >0 : enabled
+ 0 : system default
+ <0 : disabled
+
+ Default: 0 (addresses are removed)
+
max_desync_factor - INTEGER
Maximum value for DESYNC_FACTOR, which is a random value
that ensures that clients don't synchronize with each
@@ -1661,6 +1688,19 @@ stable_secret - IPv6 address
By default the stable secret is unset.
+drop_unicast_in_l2_multicast - BOOLEAN
+ Drop any unicast IPv6 packets that are received in link-layer
+ multicast (or broadcast) frames.
+
+ By default this is turned off.
+
+drop_unsolicited_na - BOOLEAN
+ Drop all unsolicited neighbor advertisements, for example if there's
+ a known good NA proxy on the network and such frames need not be used
+ (or in the case of 802.11, must not be used to prevent attacks.)
+
+ By default this is turned off.
+
icmp/*:
ratelimit - INTEGER
Limit the maximal rates for sending ICMPv6 packets.
diff --git a/Documentation/networking/ipvlan.txt b/Documentation/networking/ipvlan.txt
index cf996394e..14422f8fc 100644
--- a/Documentation/networking/ipvlan.txt
+++ b/Documentation/networking/ipvlan.txt
@@ -8,7 +8,7 @@ Initial Release:
This is conceptually very similar to the macvlan driver with one major
exception of using L3 for mux-ing /demux-ing among slaves. This property makes
the master device share the L2 with it's slave devices. I have developed this
-driver in conjuntion with network namespaces and not sure if there is use case
+driver in conjunction with network namespaces and not sure if there is use case
outside of it.
@@ -42,7 +42,7 @@ out. In this mode the slaves will RX/TX multicast and broadcast (if applicable)
as well.
4.2 L3 mode:
- In this mode TX processing upto L3 happens on the stack instance attached
+ In this mode TX processing up to L3 happens on the stack instance attached
to the slave device and packets are switched to the stack instance of the
master device for the L2 processing and routing from that instance will be
used before packets are queued on the outbound device. In this mode the slaves
@@ -56,7 +56,7 @@ situations defines your use case then you can choose to use ipvlan -
(a) The Linux host that is connected to the external switch / router has
policy configured that allows only one mac per port.
(b) No of virtual devices created on a master exceed the mac capacity and
-puts the NIC in promiscous mode and degraded performance is a concern.
+puts the NIC in promiscuous mode and degraded performance is a concern.
(c) If the slave device is to be put into the hostile / untrusted network
namespace where L2 on the slave could be changed / misused.
diff --git a/Documentation/networking/kcm.txt b/Documentation/networking/kcm.txt
new file mode 100644
index 000000000..3476ede5b
--- /dev/null
+++ b/Documentation/networking/kcm.txt
@@ -0,0 +1,285 @@
+Kernel Connection Mulitplexor
+-----------------------------
+
+Kernel Connection Multiplexor (KCM) is a mechanism that provides a message based
+interface over TCP for generic application protocols. With KCM an application
+can efficiently send and receive application protocol messages over TCP using
+datagram sockets.
+
+KCM implements an NxM multiplexor in the kernel as diagrammed below:
+
++------------+ +------------+ +------------+ +------------+
+| KCM socket | | KCM socket | | KCM socket | | KCM socket |
++------------+ +------------+ +------------+ +------------+
+ | | | |
+ +-----------+ | | +----------+
+ | | | |
+ +----------------------------------+
+ | Multiplexor |
+ +----------------------------------+
+ | | | | |
+ +---------+ | | | ------------+
+ | | | | |
++----------+ +----------+ +----------+ +----------+ +----------+
+| Psock | | Psock | | Psock | | Psock | | Psock |
++----------+ +----------+ +----------+ +----------+ +----------+
+ | | | | |
++----------+ +----------+ +----------+ +----------+ +----------+
+| TCP sock | | TCP sock | | TCP sock | | TCP sock | | TCP sock |
++----------+ +----------+ +----------+ +----------+ +----------+
+
+KCM sockets
+-----------
+
+The KCM sockets provide the user interface to the muliplexor. All the KCM sockets
+bound to a multiplexor are considered to have equivalent function, and I/O
+operations in different sockets may be done in parallel without the need for
+synchronization between threads in userspace.
+
+Multiplexor
+-----------
+
+The multiplexor provides the message steering. In the transmit path, messages
+written on a KCM socket are sent atomically on an appropriate TCP socket.
+Similarly, in the receive path, messages are constructed on each TCP socket
+(Psock) and complete messages are steered to a KCM socket.
+
+TCP sockets & Psocks
+--------------------
+
+TCP sockets may be bound to a KCM multiplexor. A Psock structure is allocated
+for each bound TCP socket, this structure holds the state for constructing
+messages on receive as well as other connection specific information for KCM.
+
+Connected mode semantics
+------------------------
+
+Each multiplexor assumes that all attached TCP connections are to the same
+destination and can use the different connections for load balancing when
+transmitting. The normal send and recv calls (include sendmmsg and recvmmsg)
+can be used to send and receive messages from the KCM socket.
+
+Socket types
+------------
+
+KCM supports SOCK_DGRAM and SOCK_SEQPACKET socket types.
+
+Message delineation
+-------------------
+
+Messages are sent over a TCP stream with some application protocol message
+format that typically includes a header which frames the messages. The length
+of a received message can be deduced from the application protocol header
+(often just a simple length field).
+
+A TCP stream must be parsed to determine message boundaries. Berkeley Packet
+Filter (BPF) is used for this. When attaching a TCP socket to a multiplexor a
+BPF program must be specified. The program is called at the start of receiving
+a new message and is given an skbuff that contains the bytes received so far.
+It parses the message header and returns the length of the message. Given this
+information, KCM will construct the message of the stated length and deliver it
+to a KCM socket.
+
+TCP socket management
+---------------------
+
+When a TCP socket is attached to a KCM multiplexor data ready (POLLIN) and
+write space available (POLLOUT) events are handled by the multiplexor. If there
+is a state change (disconnection) or other error on a TCP socket, an error is
+posted on the TCP socket so that a POLLERR event happens and KCM discontinues
+using the socket. When the application gets the error notification for a
+TCP socket, it should unattach the socket from KCM and then handle the error
+condition (the typical response is to close the socket and create a new
+connection if necessary).
+
+KCM limits the maximum receive message size to be the size of the receive
+socket buffer on the attached TCP socket (the socket buffer size can be set by
+SO_RCVBUF). If the length of a new message reported by the BPF program is
+greater than this limit a corresponding error (EMSGSIZE) is posted on the TCP
+socket. The BPF program may also enforce a maximum messages size and report an
+error when it is exceeded.
+
+A timeout may be set for assembling messages on a receive socket. The timeout
+value is taken from the receive timeout of the attached TCP socket (this is set
+by SO_RCVTIMEO). If the timer expires before assembly is complete an error
+(ETIMEDOUT) is posted on the socket.
+
+User interface
+==============
+
+Creating a multiplexor
+----------------------
+
+A new multiplexor and initial KCM socket is created by a socket call:
+
+ socket(AF_KCM, type, protocol)
+
+ - type is either SOCK_DGRAM or SOCK_SEQPACKET
+ - protocol is KCMPROTO_CONNECTED
+
+Cloning KCM sockets
+-------------------
+
+After the first KCM socket is created using the socket call as described
+above, additional sockets for the multiplexor can be created by cloning
+a KCM socket. This is accomplished by an ioctl on a KCM socket:
+
+ /* From linux/kcm.h */
+ struct kcm_clone {
+ int fd;
+ };
+
+ struct kcm_clone info;
+
+ memset(&info, 0, sizeof(info));
+
+ err = ioctl(kcmfd, SIOCKCMCLONE, &info);
+
+ if (!err)
+ newkcmfd = info.fd;
+
+Attach transport sockets
+------------------------
+
+Attaching of transport sockets to a multiplexor is performed by calling an
+ioctl on a KCM socket for the multiplexor. e.g.:
+
+ /* From linux/kcm.h */
+ struct kcm_attach {
+ int fd;
+ int bpf_fd;
+ };
+
+ struct kcm_attach info;
+
+ memset(&info, 0, sizeof(info));
+
+ info.fd = tcpfd;
+ info.bpf_fd = bpf_prog_fd;
+
+ ioctl(kcmfd, SIOCKCMATTACH, &info);
+
+The kcm_attach structure contains:
+ fd: file descriptor for TCP socket being attached
+ bpf_prog_fd: file descriptor for compiled BPF program downloaded
+
+Unattach transport sockets
+--------------------------
+
+Unattaching a transport socket from a multiplexor is straightforward. An
+"unattach" ioctl is done with the kcm_unattach structure as the argument:
+
+ /* From linux/kcm.h */
+ struct kcm_unattach {
+ int fd;
+ };
+
+ struct kcm_unattach info;
+
+ memset(&info, 0, sizeof(info));
+
+ info.fd = cfd;
+
+ ioctl(fd, SIOCKCMUNATTACH, &info);
+
+Disabling receive on KCM socket
+-------------------------------
+
+A setsockopt is used to disable or enable receiving on a KCM socket.
+When receive is disabled, any pending messages in the socket's
+receive buffer are moved to other sockets. This feature is useful
+if an application thread knows that it will be doing a lot of
+work on a request and won't be able to service new messages for a
+while. Example use:
+
+ int val = 1;
+
+ setsockopt(kcmfd, SOL_KCM, KCM_RECV_DISABLE, &val, sizeof(val))
+
+BFP programs for message delineation
+------------------------------------
+
+BPF programs can be compiled using the BPF LLVM backend. For exmple,
+the BPF program for parsing Thrift is:
+
+ #include "bpf.h" /* for __sk_buff */
+ #include "bpf_helpers.h" /* for load_word intrinsic */
+
+ SEC("socket_kcm")
+ int bpf_prog1(struct __sk_buff *skb)
+ {
+ return load_word(skb, 0) + 4;
+ }
+
+ char _license[] SEC("license") = "GPL";
+
+Use in applications
+===================
+
+KCM accelerates application layer protocols. Specifically, it allows
+applications to use a message based interface for sending and receiving
+messages. The kernel provides necessary assurances that messages are sent
+and received atomically. This relieves much of the burden applications have
+in mapping a message based protocol onto the TCP stream. KCM also make
+application layer messages a unit of work in the kernel for the purposes of
+steerng and scheduling, which in turn allows a simpler networking model in
+multithreaded applications.
+
+Configurations
+--------------
+
+In an Nx1 configuration, KCM logically provides multiple socket handles
+to the same TCP connection. This allows parallelism between in I/O
+operations on the TCP socket (for instance copyin and copyout of data is
+parallelized). In an application, a KCM socket can be opened for each
+processing thread and inserted into the epoll (similar to how SO_REUSEPORT
+is used to allow multiple listener sockets on the same port).
+
+In a MxN configuration, multiple connections are established to the
+same destination. These are used for simple load balancing.
+
+Message batching
+----------------
+
+The primary purpose of KCM is load balancing between KCM sockets and hence
+threads in a nominal use case. Perfect load balancing, that is steering
+each received message to a different KCM socket or steering each sent
+message to a different TCP socket, can negatively impact performance
+since this doesn't allow for affinities to be established. Balancing
+based on groups, or batches of messages, can be beneficial for performance.
+
+On transmit, there are three ways an application can batch (pipeline)
+messages on a KCM socket.
+ 1) Send multiple messages in a single sendmmsg.
+ 2) Send a group of messages each with a sendmsg call, where all messages
+ except the last have MSG_BATCH in the flags of sendmsg call.
+ 3) Create "super message" composed of multiple messages and send this
+ with a single sendmsg.
+
+On receive, the KCM module attempts to queue messages received on the
+same KCM socket during each TCP ready callback. The targeted KCM socket
+changes at each receive ready callback on the KCM socket. The application
+does not need to configure this.
+
+Error handling
+--------------
+
+An application should include a thread to monitor errors raised on
+the TCP connection. Normally, this will be done by placing each
+TCP socket attached to a KCM multiplexor in epoll set for POLLERR
+event. If an error occurs on an attached TCP socket, KCM sets an EPIPE
+on the socket thus waking up the application thread. When the application
+sees the error (which may just be a disconnect) it should unattach the
+socket from KCM and then close it. It is assumed that once an error is
+posted on the TCP socket the data stream is unrecoverable (i.e. an error
+may have occurred in in the middle of receiving a messssge).
+
+TCP connection monitoring
+-------------------------
+
+In KCM there is no means to correlate a message to the TCP socket that
+was used to send or receive the message (except in the case there is
+only one attached TCP socket). However, the application does retain
+an open file descriptor to the socket so it will be able to get statistics
+from the socket which can be used in detecting issues (such as high
+retransmissions on the socket).
diff --git a/Documentation/networking/mac80211-injection.txt b/Documentation/networking/mac80211-injection.txt
index 3a930072b..ec8f934c2 100644
--- a/Documentation/networking/mac80211-injection.txt
+++ b/Documentation/networking/mac80211-injection.txt
@@ -28,6 +28,23 @@ radiotap headers and used to control injection:
IEEE80211_RADIOTAP_F_TX_NOACK: frame should be sent without waiting for
an ACK even if it is a unicast frame
+ * IEEE80211_RADIOTAP_RATE
+
+ legacy rate for the transmission (only for devices without own rate control)
+
+ * IEEE80211_RADIOTAP_MCS
+
+ HT rate for the transmission (only for devices without own rate control).
+ Also some flags are parsed
+
+ IEEE80211_TX_RC_SHORT_GI: use short guard interval
+ IEEE80211_TX_RC_40_MHZ_WIDTH: send in HT40 mode
+
+ * IEEE80211_RADIOTAP_DATA_RETRIES
+
+ number of retries when either IEEE80211_RADIOTAP_RATE or
+ IEEE80211_RADIOTAP_MCS was used
+
The injection code can also skip all other currently defined radiotap fields
facilitating replay of captured radiotap headers directly.
diff --git a/Documentation/networking/netlink_mmap.txt b/Documentation/networking/netlink_mmap.txt
deleted file mode 100644
index 54f10478e..000000000
--- a/Documentation/networking/netlink_mmap.txt
+++ /dev/null
@@ -1,332 +0,0 @@
-This file documents how to use memory mapped I/O with netlink.
-
-Author: Patrick McHardy <kaber@trash.net>
-
-Overview
---------
-
-Memory mapped netlink I/O can be used to increase throughput and decrease
-overhead of unicast receive and transmit operations. Some netlink subsystems
-require high throughput, these are mainly the netfilter subsystems
-nfnetlink_queue and nfnetlink_log, but it can also help speed up large
-dump operations of f.i. the routing database.
-
-Memory mapped netlink I/O used two circular ring buffers for RX and TX which
-are mapped into the processes address space.
-
-The RX ring is used by the kernel to directly construct netlink messages into
-user-space memory without copying them as done with regular socket I/O,
-additionally as long as the ring contains messages no recvmsg() or poll()
-syscalls have to be issued by user-space to get more message.
-
-The TX ring is used to process messages directly from user-space memory, the
-kernel processes all messages contained in the ring using a single sendmsg()
-call.
-
-Usage overview
---------------
-
-In order to use memory mapped netlink I/O, user-space needs three main changes:
-
-- ring setup
-- conversion of the RX path to get messages from the ring instead of recvmsg()
-- conversion of the TX path to construct messages into the ring
-
-Ring setup is done using setsockopt() to provide the ring parameters to the
-kernel, then a call to mmap() to map the ring into the processes address space:
-
-- setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &params, sizeof(params));
-- setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &params, sizeof(params));
-- ring = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0)
-
-Usage of either ring is optional, but even if only the RX ring is used the
-mapping still needs to be writable in order to update the frame status after
-processing.
-
-Conversion of the reception path involves calling poll() on the file
-descriptor, once the socket is readable the frames from the ring are
-processed in order until no more messages are available, as indicated by
-a status word in the frame header.
-
-On kernel side, in order to make use of memory mapped I/O on receive, the
-originating netlink subsystem needs to support memory mapped I/O, otherwise
-it will use an allocated socket buffer as usual and the contents will be
- copied to the ring on transmission, nullifying most of the performance gains.
-Dumps of kernel databases automatically support memory mapped I/O.
-
-Conversion of the transmit path involves changing message construction to
-use memory from the TX ring instead of (usually) a buffer declared on the
-stack and setting up the frame header appropriately. Optionally poll() can
-be used to wait for free frames in the TX ring.
-
-Structured and definitions for using memory mapped I/O are contained in
-<linux/netlink.h>.
-
-RX and TX rings
-----------------
-
-Each ring contains a number of continuous memory blocks, containing frames of
-fixed size dependent on the parameters used for ring setup.
-
-Ring: [ block 0 ]
- [ frame 0 ]
- [ frame 1 ]
- [ block 1 ]
- [ frame 2 ]
- [ frame 3 ]
- ...
- [ block n ]
- [ frame 2 * n ]
- [ frame 2 * n + 1 ]
-
-The blocks are only visible to the kernel, from the point of view of user-space
-the ring just contains the frames in a continuous memory zone.
-
-The ring parameters used for setting up the ring are defined as follows:
-
-struct nl_mmap_req {
- unsigned int nm_block_size;
- unsigned int nm_block_nr;
- unsigned int nm_frame_size;
- unsigned int nm_frame_nr;
-};
-
-Frames are grouped into blocks, where each block is a continuous region of memory
-and holds nm_block_size / nm_frame_size frames. The total number of frames in
-the ring is nm_frame_nr. The following invariants hold:
-
-- frames_per_block = nm_block_size / nm_frame_size
-
-- nm_frame_nr = frames_per_block * nm_block_nr
-
-Some parameters are constrained, specifically:
-
-- nm_block_size must be a multiple of the architectures memory page size.
- The getpagesize() function can be used to get the page size.
-
-- nm_frame_size must be equal or larger to NL_MMAP_HDRLEN, IOW a frame must be
- able to hold at least the frame header
-
-- nm_frame_size must be smaller or equal to nm_block_size
-
-- nm_frame_size must be a multiple of NL_MMAP_MSG_ALIGNMENT
-
-- nm_frame_nr must equal the actual number of frames as specified above.
-
-When the kernel can't allocate physically continuous memory for a ring block,
-it will fall back to use physically discontinuous memory. This might affect
-performance negatively, in order to avoid this the nm_frame_size parameter
-should be chosen to be as small as possible for the required frame size and
-the number of blocks should be increased instead.
-
-Ring frames
-------------
-
-Each frames contain a frame header, consisting of a synchronization word and some
-meta-data, and the message itself.
-
-Frame: [ header message ]
-
-The frame header is defined as follows:
-
-struct nl_mmap_hdr {
- unsigned int nm_status;
- unsigned int nm_len;
- __u32 nm_group;
- /* credentials */
- __u32 nm_pid;
- __u32 nm_uid;
- __u32 nm_gid;
-};
-
-- nm_status is used for synchronizing processing between the kernel and user-
- space and specifies ownership of the frame as well as the operation to perform
-
-- nm_len contains the length of the message contained in the data area
-
-- nm_group specified the destination multicast group of message
-
-- nm_pid, nm_uid and nm_gid contain the netlink pid, UID and GID of the sending
- process. These values correspond to the data available using SOCK_PASSCRED in
- the SCM_CREDENTIALS cmsg.
-
-The possible values in the status word are:
-
-- NL_MMAP_STATUS_UNUSED:
- RX ring: frame belongs to the kernel and contains no message
- for user-space. Approriate action is to invoke poll()
- to wait for new messages.
-
- TX ring: frame belongs to user-space and can be used for
- message construction.
-
-- NL_MMAP_STATUS_RESERVED:
- RX ring only: frame is currently used by the kernel for message
- construction and contains no valid message yet.
- Appropriate action is to invoke poll() to wait for
- new messages.
-
-- NL_MMAP_STATUS_VALID:
- RX ring: frame contains a valid message. Approriate action is
- to process the message and release the frame back to
- the kernel by setting the status to
- NL_MMAP_STATUS_UNUSED or queue the frame by setting the
- status to NL_MMAP_STATUS_SKIP.
-
- TX ring: the frame contains a valid message from user-space to
- be processed by the kernel. After completing processing
- the kernel will release the frame back to user-space by
- setting the status to NL_MMAP_STATUS_UNUSED.
-
-- NL_MMAP_STATUS_COPY:
- RX ring only: a message is ready to be processed but could not be
- stored in the ring, either because it exceeded the
- frame size or because the originating subsystem does
- not support memory mapped I/O. Appropriate action is
- to invoke recvmsg() to receive the message and release
- the frame back to the kernel by setting the status to
- NL_MMAP_STATUS_UNUSED.
-
-- NL_MMAP_STATUS_SKIP:
- RX ring only: user-space queued the message for later processing, but
- processed some messages following it in the ring. The
- kernel should skip this frame when looking for unused
- frames.
-
-The data area of a frame begins at a offset of NL_MMAP_HDRLEN relative to the
-frame header.
-
-TX limitations
---------------
-
-As of Jan 2015 the message is always copied from the ring frame to an
-allocated buffer due to unresolved security concerns.
-See commit 4682a0358639b29cf ("netlink: Always copy on mmap TX.").
-
-Example
--------
-
-Ring setup:
-
- unsigned int block_size = 16 * getpagesize();
- struct nl_mmap_req req = {
- .nm_block_size = block_size,
- .nm_block_nr = 64,
- .nm_frame_size = 16384,
- .nm_frame_nr = 64 * block_size / 16384,
- };
- unsigned int ring_size;
- void *rx_ring, *tx_ring;
-
- /* Configure ring parameters */
- if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
- exit(1);
- if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
- exit(1)
-
- /* Calculate size of each individual ring */
- ring_size = req.nm_block_nr * req.nm_block_size;
-
- /* Map RX/TX rings. The TX ring is located after the RX ring */
- rx_ring = mmap(NULL, 2 * ring_size, PROT_READ | PROT_WRITE,
- MAP_SHARED, fd, 0);
- if ((long)rx_ring == -1L)
- exit(1);
- tx_ring = rx_ring + ring_size:
-
-Message reception:
-
-This example assumes some ring parameters of the ring setup are available.
-
- unsigned int frame_offset = 0;
- struct nl_mmap_hdr *hdr;
- struct nlmsghdr *nlh;
- unsigned char buf[16384];
- ssize_t len;
-
- while (1) {
- struct pollfd pfds[1];
-
- pfds[0].fd = fd;
- pfds[0].events = POLLIN | POLLERR;
- pfds[0].revents = 0;
-
- if (poll(pfds, 1, -1) < 0 && errno != -EINTR)
- exit(1);
-
- /* Check for errors. Error handling omitted */
- if (pfds[0].revents & POLLERR)
- <handle error>
-
- /* If no new messages, poll again */
- if (!(pfds[0].revents & POLLIN))
- continue;
-
- /* Process all frames */
- while (1) {
- /* Get next frame header */
- hdr = rx_ring + frame_offset;
-
- if (hdr->nm_status == NL_MMAP_STATUS_VALID) {
- /* Regular memory mapped frame */
- nlh = (void *)hdr + NL_MMAP_HDRLEN;
- len = hdr->nm_len;
-
- /* Release empty message immediately. May happen
- * on error during message construction.
- */
- if (len == 0)
- goto release;
- } else if (hdr->nm_status == NL_MMAP_STATUS_COPY) {
- /* Frame queued to socket receive queue */
- len = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
- if (len <= 0)
- break;
- nlh = buf;
- } else
- /* No more messages to process, continue polling */
- break;
-
- process_msg(nlh);
-release:
- /* Release frame back to the kernel */
- hdr->nm_status = NL_MMAP_STATUS_UNUSED;
-
- /* Advance frame offset to next frame */
- frame_offset = (frame_offset + frame_size) % ring_size;
- }
- }
-
-Message transmission:
-
-This example assumes some ring parameters of the ring setup are available.
-A single message is constructed and transmitted, to send multiple messages
-at once they would be constructed in consecutive frames before a final call
-to sendto().
-
- unsigned int frame_offset = 0;
- struct nl_mmap_hdr *hdr;
- struct nlmsghdr *nlh;
- struct sockaddr_nl addr = {
- .nl_family = AF_NETLINK,
- };
-
- hdr = tx_ring + frame_offset;
- if (hdr->nm_status != NL_MMAP_STATUS_UNUSED)
- /* No frame available. Use poll() to avoid. */
- exit(1);
-
- nlh = (void *)hdr + NL_MMAP_HDRLEN;
-
- /* Build message */
- build_message(nlh);
-
- /* Fill frame header: length and status need to be set */
- hdr->nm_len = nlh->nlmsg_len;
- hdr->nm_status = NL_MMAP_STATUS_VALID;
-
- if (sendto(fd, NULL, 0, 0, &addr, sizeof(addr)) < 0)
- exit(1);
-
- /* Advance frame offset to next frame */
- frame_offset = (frame_offset + frame_size) % ring_size;
diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt
index e839e7efc..7ab9404a8 100644
--- a/Documentation/networking/phy.txt
+++ b/Documentation/networking/phy.txt
@@ -267,13 +267,23 @@ Writing a PHY driver
config_intr: Enable or disable interrupts
remove: Does any driver take-down
ts_info: Queries about the HW timestamping status
+ match_phy_device: used for Clause 45 capable PHYs to match devices
+ in package and ensure they are compatible
hwtstamp: Set the PHY HW timestamping configuration
rxtstamp: Requests a receive timestamp at the PHY level for a 'skb'
txtsamp: Requests a transmit timestamp at the PHY level for a 'skb'
set_wol: Enable Wake-on-LAN at the PHY level
get_wol: Get the Wake-on-LAN status at the PHY level
+ link_change_notify: called to inform the core is about to change the
+ link state, can be used to work around bogus PHY between state changes
read_mmd_indirect: Read PHY MMD indirect register
write_mmd_indirect: Write PHY MMD indirect register
+ module_info: Get the size and type of an EEPROM contained in an plug-in
+ module
+ module_eeprom: Get EEPROM information of a plug-in module
+ get_sset_count: Get number of strings sets that get_strings will count
+ get_strings: Get strings from requested objects (statistics)
+ get_stats: Get the extended statistics from the PHY device
Of these, only config_aneg and read_status are required to be
assigned by the driver code. The rest are optional. Also, it is
diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt
index f4be85e96..2c4e3354e 100644
--- a/Documentation/networking/pktgen.txt
+++ b/Documentation/networking/pktgen.txt
@@ -67,12 +67,12 @@ The two basic thread commands are:
* add_device DEVICE@NAME -- adds a single device
* rem_device_all -- remove all associated devices
-When adding a device to a thread, a corrosponding procfile is created
+When adding a device to a thread, a corresponding procfile is created
which is used for configuring this device. Thus, device names need to
be unique.
To support adding the same device to multiple threads, which is useful
-with multi queue NICs, a the device naming scheme is extended with "@":
+with multi queue NICs, the device naming scheme is extended with "@":
device@something
The part after "@" can be anything, but it is custom to use the thread
@@ -221,7 +221,7 @@ Sample scripts
A collection of tutorial scripts and helpers for pktgen is in the
samples/pktgen directory. The helper parameters.sh file support easy
-and consistant parameter parsing across the sample scripts.
+and consistent parameter parsing across the sample scripts.
Usage example and help:
./pktgen_sample01_simple.sh -i eth4 -m 00:1B:21:3C:9D:F8 -d 192.168.8.2
diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
index e1a3d59bb..9d219d856 100644
--- a/Documentation/networking/rds.txt
+++ b/Documentation/networking/rds.txt
@@ -19,9 +19,7 @@ to N*N if you use a connection-oriented socket transport like TCP.
RDS is not Infiniband-specific; it was designed to support different
transports. The current implementation used to support RDS over TCP as well
-as IB. Work is in progress to support RDS over iWARP, and using DCE to
-guarantee no dropped packets on Ethernet, it may be possible to use RDS over
-UDP in the future.
+as IB.
The high-level semantics of RDS from the application's point of view are
diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.txt
index fad63136e..2f6591296 100644
--- a/Documentation/networking/switchdev.txt
+++ b/Documentation/networking/switchdev.txt
@@ -386,7 +386,7 @@ used. First phase is to "prepare" anything needed, including various checks,
memory allocation, etc. The goal is to handle the stuff that is not unlikely
to fail here. The second phase is to "commit" the actual changes.
-Switchdev provides an inftrastructure for sharing items (for example memory
+Switchdev provides an infrastructure for sharing items (for example memory
allocations) between the two phases.
The object created by a driver in "prepare" phase and it is queued up by:
diff --git a/Documentation/networking/vrf.txt b/Documentation/networking/vrf.txt
index d52aa10cf..5da679c57 100644
--- a/Documentation/networking/vrf.txt
+++ b/Documentation/networking/vrf.txt
@@ -41,7 +41,7 @@ using an rx_handler which gives the impression that packets flow through
the VRF device. Similarly on egress routing rules are used to send packets
to the VRF device driver before getting sent out the actual interface. This
allows tcpdump on a VRF device to capture all packets into and out of the
-VRF as a whole.[1] Similiarly, netfilter [2] and tc rules can be applied
+VRF as a whole.[1] Similarly, netfilter [2] and tc rules can be applied
using the VRF device to specify rules that apply to the VRF domain as a whole.
[1] Packets in the forwarded state do not flow through the device, so those
diff --git a/Documentation/networking/xfrm_sync.txt b/Documentation/networking/xfrm_sync.txt
index d7aac9ded..8d88e0f2e 100644
--- a/Documentation/networking/xfrm_sync.txt
+++ b/Documentation/networking/xfrm_sync.txt
@@ -4,7 +4,7 @@ Krisztian <hidden@balabit.hu> and others and additional patches
from Jamal <hadi@cyberus.ca>.
The end goal for syncing is to be able to insert attributes + generate
-events so that the an SA can be safely moved from one machine to another
+events so that the SA can be safely moved from one machine to another
for HA purposes.
The idea is to synchronize the SA so that the takeover machine can do
the processing of the SA as accurate as possible if it has access to it.
@@ -13,7 +13,7 @@ We already have the ability to generate SA add/del/upd events.
These patches add ability to sync and have accurate lifetime byte (to
ensure proper decay of SAs) and replay counters to avoid replay attacks
with as minimal loss at failover time.
-This way a backup stays as closely uptodate as an active member.
+This way a backup stays as closely up-to-date as an active member.
Because the above items change for every packet the SA receives,
it is possible for a lot of the events to be generated.
@@ -163,7 +163,7 @@ If you have an SA that is getting hit by traffic in bursts such that
there is a period where the timer threshold expires with no packets
seen, then an odd behavior is seen as follows:
The first packet arrival after a timer expiry will trigger a timeout
-aevent; i.e we dont wait for a timeout period or a packet threshold
+event; i.e we don't wait for a timeout period or a packet threshold
to be reached. This is done for simplicity and efficiency reasons.
-JHS