summaryrefslogtreecommitdiff
path: root/Documentation/networking
diff options
context:
space:
mode:
authorAndré Fabian Silva Delgado <emulatorman@parabola.nu>2015-08-05 17:04:01 -0300
committerAndré Fabian Silva Delgado <emulatorman@parabola.nu>2015-08-05 17:04:01 -0300
commit57f0f512b273f60d52568b8c6b77e17f5636edc0 (patch)
tree5e910f0e82173f4ef4f51111366a3f1299037a7b /Documentation/networking
Initial import
Diffstat (limited to 'Documentation/networking')
-rw-r--r--Documentation/networking/00-INDEX238
-rw-r--r--Documentation/networking/3c509.txt213
-rw-r--r--Documentation/networking/6pack.txt175
-rw-r--r--Documentation/networking/LICENSE.qla3xxx46
-rw-r--r--Documentation/networking/LICENSE.qlcnic288
-rw-r--r--Documentation/networking/LICENSE.qlge288
-rw-r--r--Documentation/networking/Makefile1
-rw-r--r--Documentation/networking/PLIP.txt215
-rw-r--r--Documentation/networking/README.ipw2100293
-rw-r--r--Documentation/networking/README.ipw2200472
-rw-r--r--Documentation/networking/README.sb1000207
-rw-r--r--Documentation/networking/alias.txt40
-rw-r--r--Documentation/networking/altera_tse.txt263
-rw-r--r--Documentation/networking/arcnet-hardware.txt3133
-rw-r--r--Documentation/networking/arcnet.txt556
-rw-r--r--Documentation/networking/atm.txt8
-rw-r--r--Documentation/networking/ax25.txt10
-rw-r--r--Documentation/networking/batman-adv.txt202
-rw-r--r--Documentation/networking/baycom.txt158
-rw-r--r--Documentation/networking/bonding.txt2743
-rw-r--r--Documentation/networking/bridge.txt15
-rw-r--r--Documentation/networking/caif/Linux-CAIF.txt175
-rw-r--r--Documentation/networking/caif/README109
-rw-r--r--Documentation/networking/caif/spi_porting.txt208
-rw-r--r--Documentation/networking/can.txt1216
-rw-r--r--Documentation/networking/cdc_mbim.txt339
-rw-r--r--Documentation/networking/cops.txt63
-rw-r--r--Documentation/networking/cs89x0.txt624
-rw-r--r--Documentation/networking/cxacru-cf.py48
-rw-r--r--Documentation/networking/cxacru.txt100
-rw-r--r--Documentation/networking/cxgb.txt352
-rw-r--r--Documentation/networking/dccp.txt207
-rw-r--r--Documentation/networking/dctcp.txt43
-rw-r--r--Documentation/networking/de4x5.txt178
-rw-r--r--Documentation/networking/decnet.txt232
-rw-r--r--Documentation/networking/dl2k.txt282
-rw-r--r--Documentation/networking/dm9000.txt167
-rw-r--r--Documentation/networking/dmfe.txt66
-rw-r--r--Documentation/networking/dns_resolver.txt157
-rw-r--r--Documentation/networking/driver.txt93
-rw-r--r--Documentation/networking/e100.txt197
-rw-r--r--Documentation/networking/e1000.txt461
-rw-r--r--Documentation/networking/e1000e.txt312
-rw-r--r--Documentation/networking/eql.txt528
-rw-r--r--Documentation/networking/fib_trie.txt145
-rw-r--r--Documentation/networking/filter.txt1297
-rw-r--r--Documentation/networking/fore200e.txt39
-rw-r--r--Documentation/networking/framerelay.txt39
-rw-r--r--Documentation/networking/gen_stats.txt117
-rw-r--r--Documentation/networking/generic-hdlc.txt132
-rw-r--r--Documentation/networking/generic_netlink.txt3
-rw-r--r--Documentation/networking/gianfar.txt42
-rw-r--r--Documentation/networking/i40e.txt118
-rw-r--r--Documentation/networking/i40evf.txt47
-rw-r--r--Documentation/networking/ieee802154.txt151
-rw-r--r--Documentation/networking/igb.txt129
-rw-r--r--Documentation/networking/igbvf.txt80
-rw-r--r--Documentation/networking/ip-sysctl.txt1858
-rw-r--r--Documentation/networking/ip_dynaddr.txt29
-rw-r--r--Documentation/networking/ipddp.txt73
-rw-r--r--Documentation/networking/iphase.txt158
-rw-r--r--Documentation/networking/ipsec.txt38
-rw-r--r--Documentation/networking/ipv6.txt72
-rw-r--r--Documentation/networking/ipvlan.txt107
-rw-r--r--Documentation/networking/ipvs-sysctl.txt232
-rw-r--r--Documentation/networking/irda.txt10
-rw-r--r--Documentation/networking/ixgb.txt433
-rw-r--r--Documentation/networking/ixgbe.txt349
-rw-r--r--Documentation/networking/ixgbevf.txt52
-rw-r--r--Documentation/networking/l2tp.txt348
-rw-r--r--Documentation/networking/lapb-module.txt263
-rw-r--r--Documentation/networking/ltpc.txt131
-rw-r--r--Documentation/networking/mac80211-auth-assoc-deauth.txt95
-rw-r--r--Documentation/networking/mac80211-injection.txt67
-rw-r--r--Documentation/networking/mac80211_hwsim/README68
-rw-r--r--Documentation/networking/mac80211_hwsim/hostapd.conf11
-rw-r--r--Documentation/networking/mac80211_hwsim/wpa_supplicant.conf10
-rw-r--r--Documentation/networking/mpls-sysctl.txt29
-rw-r--r--Documentation/networking/multiqueue.txt79
-rw-r--r--Documentation/networking/netconsole.txt177
-rw-r--r--Documentation/networking/netdev-FAQ.txt224
-rw-r--r--Documentation/networking/netdev-features.txt167
-rw-r--r--Documentation/networking/netdevices.txt107
-rw-r--r--Documentation/networking/netif-msg.txt79
-rw-r--r--Documentation/networking/netlink_mmap.txt332
-rw-r--r--Documentation/networking/nf_conntrack-sysctl.txt177
-rw-r--r--Documentation/networking/nfc.txt128
-rw-r--r--Documentation/networking/openvswitch.txt248
-rw-r--r--Documentation/networking/operstates.txt162
-rw-r--r--Documentation/networking/packet_mmap.txt1068
-rw-r--r--Documentation/networking/phonet.txt214
-rw-r--r--Documentation/networking/phy.txt339
-rw-r--r--Documentation/networking/pktgen.txt323
-rw-r--r--Documentation/networking/policy-routing.txt150
-rw-r--r--Documentation/networking/ppp_generic.txt432
-rw-r--r--Documentation/networking/proc_net_tcp.txt48
-rw-r--r--Documentation/networking/radiotap-headers.txt152
-rw-r--r--Documentation/networking/ray_cs.txt150
-rw-r--r--Documentation/networking/rds.txt355
-rw-r--r--Documentation/networking/regulatory.txt214
-rw-r--r--Documentation/networking/rxrpc.txt947
-rw-r--r--Documentation/networking/s2io.txt141
-rw-r--r--Documentation/networking/scaling.txt445
-rw-r--r--Documentation/networking/sctp.txt35
-rw-r--r--Documentation/networking/secid.txt14
-rw-r--r--Documentation/networking/skfp.txt220
-rw-r--r--Documentation/networking/smc9.txt42
-rw-r--r--Documentation/networking/spider_net.txt204
-rw-r--r--Documentation/networking/stmmac.txt369
-rw-r--r--Documentation/networking/switchdev.txt59
-rw-r--r--Documentation/networking/tc-actions-env-rules.txt30
-rw-r--r--Documentation/networking/tcp-thin.txt47
-rw-r--r--Documentation/networking/tcp.txt106
-rw-r--r--Documentation/networking/team.txt2
-rw-r--r--Documentation/networking/timestamping.txt457
-rw-r--r--Documentation/networking/timestamping/.gitignore3
-rw-r--r--Documentation/networking/timestamping/Makefile14
-rw-r--r--Documentation/networking/timestamping/hwtstamp_config.c134
-rw-r--r--Documentation/networking/timestamping/timestamping.c528
-rw-r--r--Documentation/networking/timestamping/txtimestamp.c549
-rw-r--r--Documentation/networking/tlan.txt117
-rw-r--r--Documentation/networking/tproxy.txt84
-rw-r--r--Documentation/networking/tuntap.txt227
-rw-r--r--Documentation/networking/udplite.txt278
-rw-r--r--Documentation/networking/vortex.txt448
-rw-r--r--Documentation/networking/vxge.txt93
-rw-r--r--Documentation/networking/vxlan.txt47
-rw-r--r--Documentation/networking/x25-iface.txt123
-rw-r--r--Documentation/networking/x25.txt44
-rw-r--r--Documentation/networking/xfrm_proc.txt74
-rw-r--r--Documentation/networking/xfrm_sync.txt169
-rw-r--r--Documentation/networking/xfrm_sysctl.txt4
-rw-r--r--Documentation/networking/z8530drv.txt657
133 files changed, 34529 insertions, 0 deletions
diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX
new file mode 100644
index 000000000..df27a1a50
--- /dev/null
+++ b/Documentation/networking/00-INDEX
@@ -0,0 +1,238 @@
+00-INDEX
+ - this file
+3c509.txt
+ - information on the 3Com Etherlink III Series Ethernet cards.
+6pack.txt
+ - info on the 6pack protocol, an alternative to KISS for AX.25
+LICENSE.qla3xxx
+ - GPLv2 for QLogic Linux Networking HBA Driver
+LICENSE.qlge
+ - GPLv2 for QLogic Linux qlge NIC Driver
+LICENSE.qlcnic
+ - GPLv2 for QLogic Linux qlcnic NIC Driver
+Makefile
+ - Makefile for docsrc.
+PLIP.txt
+ - PLIP: The Parallel Line Internet Protocol device driver
+README.ipw2100
+ - README for the Intel PRO/Wireless 2100 driver.
+README.ipw2200
+ - README for the Intel PRO/Wireless 2915ABG and 2200BG driver.
+README.sb1000
+ - info on General Instrument/NextLevel SURFboard1000 cable modem.
+alias.txt
+ - info on using alias network devices.
+altera_tse.txt
+ - Altera Triple-Speed Ethernet controller.
+arcnet-hardware.txt
+ - tons of info on ARCnet, hubs, jumper settings for ARCnet cards, etc.
+arcnet.txt
+ - info on the using the ARCnet driver itself.
+atm.txt
+ - info on where to get ATM programs and support for Linux.
+ax25.txt
+ - info on using AX.25 and NET/ROM code for Linux
+batman-adv.txt
+ - B.A.T.M.A.N routing protocol on top of layer 2 Ethernet Frames.
+baycom.txt
+ - info on the driver for Baycom style amateur radio modems
+bonding.txt
+ - Linux Ethernet Bonding Driver HOWTO: link aggregation in Linux.
+bridge.txt
+ - where to get user space programs for ethernet bridging with Linux.
+can.txt
+ - documentation on CAN protocol family.
+cdc_mbim.txt
+ - 3G/LTE USB modem (Mobile Broadband Interface Model)
+cops.txt
+ - info on the COPS LocalTalk Linux driver
+cs89x0.txt
+ - the Crystal LAN (CS8900/20-based) Ethernet ISA adapter driver
+cxacru.txt
+ - Conexant AccessRunner USB ADSL Modem
+cxacru-cf.py
+ - Conexant AccessRunner USB ADSL Modem configuration file parser
+cxgb.txt
+ - Release Notes for the Chelsio N210 Linux device driver.
+dccp.txt
+ - the Datagram Congestion Control Protocol (DCCP) (RFC 4340..42).
+dctcp.txt
+ - DataCenter TCP congestion control
+de4x5.txt
+ - the Digital EtherWORKS DE4?? and DE5?? PCI Ethernet driver
+decnet.txt
+ - info on using the DECnet networking layer in Linux.
+dl2k.txt
+ - README for D-Link DL2000-based Gigabit Ethernet Adapters (dl2k.ko).
+dm9000.txt
+ - README for the Simtec DM9000 Network driver.
+dmfe.txt
+ - info on the Davicom DM9102(A)/DM9132/DM9801 fast ethernet driver.
+dns_resolver.txt
+ - The DNS resolver module allows kernel servies to make DNS queries.
+driver.txt
+ - Softnet driver issues.
+e100.txt
+ - info on Intel's EtherExpress PRO/100 line of 10/100 boards
+e1000.txt
+ - info on Intel's E1000 line of gigabit ethernet boards
+e1000e.txt
+ - README for the Intel Gigabit Ethernet Driver (e1000e).
+eql.txt
+ - serial IP load balancing
+fib_trie.txt
+ - Level Compressed Trie (LC-trie) notes: a structure for routing.
+filter.txt
+ - Linux Socket Filtering
+fore200e.txt
+ - FORE Systems PCA-200E/SBA-200E ATM NIC driver info.
+framerelay.txt
+ - info on using Frame Relay/Data Link Connection Identifier (DLCI).
+gen_stats.txt
+ - Generic networking statistics for netlink users.
+generic-hdlc.txt
+ - The generic High Level Data Link Control (HDLC) layer.
+generic_netlink.txt
+ - info on Generic Netlink
+gianfar.txt
+ - Gianfar Ethernet Driver.
+i40e.txt
+ - README for the Intel Ethernet Controller XL710 Driver (i40e).
+i40evf.txt
+ - Short note on the Driver for the Intel(R) XL710 X710 Virtual Function
+ieee802154.txt
+ - Linux IEEE 802.15.4 implementation, API and drivers
+igb.txt
+ - README for the Intel Gigabit Ethernet Driver (igb).
+igbvf.txt
+ - README for the Intel Gigabit Ethernet Driver (igbvf).
+ip-sysctl.txt
+ - /proc/sys/net/ipv4/* variables
+ip_dynaddr.txt
+ - IP dynamic address hack e.g. for auto-dialup links
+ipddp.txt
+ - AppleTalk-IP Decapsulation and AppleTalk-IP Encapsulation
+iphase.txt
+ - Interphase PCI ATM (i)Chip IA Linux driver info.
+ipsec.txt
+ - Note on not compressing IPSec payload and resulting failed policy check.
+ipv6.txt
+ - Options to the ipv6 kernel module.
+ipvs-sysctl.txt
+ - Per-inode explanation of the /proc/sys/net/ipv4/vs interface.
+irda.txt
+ - where to get IrDA (infrared) utilities and info for Linux.
+ixgb.txt
+ - README for the Intel 10 Gigabit Ethernet Driver (ixgb).
+ixgbe.txt
+ - README for the Intel 10 Gigabit Ethernet Driver (ixgbe).
+ixgbevf.txt
+ - README for the Intel Virtual Function (VF) Driver (ixgbevf).
+l2tp.txt
+ - User guide to the L2TP tunnel protocol.
+lapb-module.txt
+ - programming information of the LAPB module.
+ltpc.txt
+ - the Apple or Farallon LocalTalk PC card driver
+mac80211-auth-assoc-deauth.txt
+ - authentication and association / deauth-disassoc with max80211
+mac80211-injection.txt
+ - HOWTO use packet injection with mac80211
+multiqueue.txt
+ - HOWTO for multiqueue network device support.
+netconsole.txt
+ - The network console module netconsole.ko: configuration and notes.
+netdev-FAQ.txt
+ - FAQ describing how to submit net changes to netdev mailing list.
+netdev-features.txt
+ - Network interface features API description.
+netdevices.txt
+ - info on network device driver functions exported to the kernel.
+netif-msg.txt
+ - Design of the network interface message level setting (NETIF_MSG_*).
+netlink_mmap.txt
+ - memory mapped I/O with netlink
+nf_conntrack-sysctl.txt
+ - list of netfilter-sysctl knobs.
+nfc.txt
+ - The Linux Near Field Communication (NFS) subsystem.
+openvswitch.txt
+ - Open vSwitch developer documentation.
+operstates.txt
+ - Overview of network interface operational states.
+packet_mmap.txt
+ - User guide to memory mapped packet socket rings (PACKET_[RT]X_RING).
+phonet.txt
+ - The Phonet packet protocol used in Nokia cellular modems.
+phy.txt
+ - The PHY abstraction layer.
+pktgen.txt
+ - User guide to the kernel packet generator (pktgen.ko).
+policy-routing.txt
+ - IP policy-based routing
+ppp_generic.txt
+ - Information about the generic PPP driver.
+proc_net_tcp.txt
+ - Per inode overview of the /proc/net/tcp and /proc/net/tcp6 interfaces.
+radiotap-headers.txt
+ - Background on radiotap headers.
+ray_cs.txt
+ - Raylink Wireless LAN card driver info.
+rds.txt
+ - Background on the reliable, ordered datagram delivery method RDS.
+regulatory.txt
+ - Overview of the Linux wireless regulatory infrastructure.
+rxrpc.txt
+ - Guide to the RxRPC protocol.
+s2io.txt
+ - Release notes for Neterion Xframe I/II 10GbE driver.
+scaling.txt
+ - Explanation of network scaling techniques: RSS, RPS, RFS, aRFS, XPS.
+sctp.txt
+ - Notes on the Linux kernel implementation of the SCTP protocol.
+secid.txt
+ - Explanation of the secid member in flow structures.
+skfp.txt
+ - SysKonnect FDDI (SK-5xxx, Compaq Netelligent) driver info.
+smc9.txt
+ - the driver for SMC's 9000 series of Ethernet cards
+spider_net.txt
+ - README for the Spidernet Driver (as found in PS3 / Cell BE).
+stmmac.txt
+ - README for the STMicro Synopsys Ethernet driver.
+tc-actions-env-rules.txt
+ - rules for traffic control (tc) actions.
+timestamping.txt
+ - overview of network packet timestamping variants.
+tcp.txt
+ - short blurb on how TCP output takes place.
+tcp-thin.txt
+ - kernel tuning options for low rate 'thin' TCP streams.
+team.txt
+ - pointer to information for ethernet teaming devices.
+tlan.txt
+ - ThunderLAN (Compaq Netelligent 10/100, Olicom OC-2xxx) driver info.
+tproxy.txt
+ - Transparent proxy support user guide.
+tuntap.txt
+ - TUN/TAP device driver, allowing user space Rx/Tx of packets.
+udplite.txt
+ - UDP-Lite protocol (RFC 3828) introduction.
+vortex.txt
+ - info on using 3Com Vortex (3c590, 3c592, 3c595, 3c597) Ethernet cards.
+vxge.txt
+ - README for the Neterion X3100 PCIe Server Adapter.
+vxlan.txt
+ - Virtual extensible LAN overview
+x25.txt
+ - general info on X.25 development.
+x25-iface.txt
+ - description of the X.25 Packet Layer to LAPB device interface.
+xfrm_proc.txt
+ - description of the statistics package for XFRM.
+xfrm_sync.txt
+ - sync patches for XFRM enable migration of an SA between hosts.
+xfrm_sysctl.txt
+ - description of the XFRM configuration options.
+z8530drv.txt
+ - info about Linux driver for Z8530 based HDLC cards for AX.25
diff --git a/Documentation/networking/3c509.txt b/Documentation/networking/3c509.txt
new file mode 100644
index 000000000..fbf722e15
--- /dev/null
+++ b/Documentation/networking/3c509.txt
@@ -0,0 +1,213 @@
+Linux and the 3Com EtherLink III Series Ethercards (driver v1.18c and higher)
+----------------------------------------------------------------------------
+
+This file contains the instructions and caveats for v1.18c and higher versions
+of the 3c509 driver. You should not use the driver without reading this file.
+
+release 1.0
+28 February 2002
+Current maintainer (corrections to):
+ David Ruggiero <jdr@farfalle.com>
+
+----------------------------------------------------------------------------
+
+(0) Introduction
+
+The following are notes and information on using the 3Com EtherLink III series
+ethercards in Linux. These cards are commonly known by the most widely-used
+card's 3Com model number, 3c509. They are all 10mb/s ISA-bus cards and shouldn't
+be (but sometimes are) confused with the similarly-numbered PCI-bus "3c905"
+(aka "Vortex" or "Boomerang") series. Kernel support for the 3c509 family is
+provided by the module 3c509.c, which has code to support all of the following
+models:
+
+ 3c509 (original ISA card)
+ 3c509B (later revision of the ISA card; supports full-duplex)
+ 3c589 (PCMCIA)
+ 3c589B (later revision of the 3c589; supports full-duplex)
+ 3c579 (EISA)
+
+Large portions of this documentation were heavily borrowed from the guide
+written the original author of the 3c509 driver, Donald Becker. The master
+copy of that document, which contains notes on older versions of the driver,
+currently resides on Scyld web server: http://www.scyld.com/.
+
+
+(1) Special Driver Features
+
+Overriding card settings
+
+The driver allows boot- or load-time overriding of the card's detected IOADDR,
+IRQ, and transceiver settings, although this capability shouldn't generally be
+needed except to enable full-duplex mode (see below). An example of the syntax
+for LILO parameters for doing this:
+
+ ether=10,0x310,3,0x3c509,eth0
+
+This configures the first found 3c509 card for IRQ 10, base I/O 0x310, and
+transceiver type 3 (10base2). The flag "0x3c509" must be set to avoid conflicts
+with other card types when overriding the I/O address. When the driver is
+loaded as a module, only the IRQ may be overridden. For example,
+setting two cards to IRQ10 and IRQ11 is done by using the irq module
+option:
+
+ options 3c509 irq=10,11
+
+
+(2) Full-duplex mode
+
+The v1.18c driver added support for the 3c509B's full-duplex capabilities.
+In order to enable and successfully use full-duplex mode, three conditions
+must be met:
+
+(a) You must have a Etherlink III card model whose hardware supports full-
+duplex operations. Currently, the only members of the 3c509 family that are
+positively known to support full-duplex are the 3c509B (ISA bus) and 3c589B
+(PCMCIA) cards. Cards without the "B" model designation do *not* support
+full-duplex mode; these include the original 3c509 (no "B"), the original
+3c589, the 3c529 (MCA bus), and the 3c579 (EISA bus).
+
+(b) You must be using your card's 10baseT transceiver (i.e., the RJ-45
+connector), not its AUI (thick-net) or 10base2 (thin-net/coax) interfaces.
+AUI and 10base2 network cabling is physically incapable of full-duplex
+operation.
+
+(c) Most importantly, your 3c509B must be connected to a link partner that is
+itself full-duplex capable. This is almost certainly one of two things: a full-
+duplex-capable Ethernet switch (*not* a hub), or a full-duplex-capable NIC on
+another system that's connected directly to the 3c509B via a crossover cable.
+
+Full-duplex mode can be enabled using 'ethtool'.
+
+/////Extremely important caution concerning full-duplex mode/////
+Understand that the 3c509B's hardware's full-duplex support is much more
+limited than that provide by more modern network interface cards. Although
+at the physical layer of the network it fully supports full-duplex operation,
+the card was designed before the current Ethernet auto-negotiation (N-way)
+spec was written. This means that the 3c509B family ***cannot and will not
+auto-negotiate a full-duplex connection with its link partner under any
+circumstances, no matter how it is initialized***. If the full-duplex mode
+of the 3c509B is enabled, its link partner will very likely need to be
+independently _forced_ into full-duplex mode as well; otherwise various nasty
+failures will occur - at the very least, you'll see massive numbers of packet
+collisions. This is one of very rare circumstances where disabling auto-
+negotiation and forcing the duplex mode of a network interface card or switch
+would ever be necessary or desirable.
+
+
+(3) Available Transceiver Types
+
+For versions of the driver v1.18c and above, the available transceiver types are:
+
+0 transceiver type from EEPROM config (normally 10baseT); force half-duplex
+1 AUI (thick-net / DB15 connector)
+2 (undefined)
+3 10base2 (thin-net == coax / BNC connector)
+4 10baseT (RJ-45 connector); force half-duplex mode
+8 transceiver type and duplex mode taken from card's EEPROM config settings
+12 10baseT (RJ-45 connector); force full-duplex mode
+
+Prior to driver version 1.18c, only transceiver codes 0-4 were supported. Note
+that the new transceiver codes 8 and 12 are the *only* ones that will enable
+full-duplex mode, no matter what the card's detected EEPROM settings might be.
+This insured that merely upgrading the driver from an earlier version would
+never automatically enable full-duplex mode in an existing installation;
+it must always be explicitly enabled via one of these code in order to be
+activated.
+
+The transceiver type can be changed using 'ethtool'.
+
+
+(4a) Interpretation of error messages and common problems
+
+Error Messages
+
+eth0: Infinite loop in interrupt, status 2011.
+These are "mostly harmless" message indicating that the driver had too much
+work during that interrupt cycle. With a status of 0x2011 you are receiving
+packets faster than they can be removed from the card. This should be rare
+or impossible in normal operation. Possible causes of this error report are:
+
+ - a "green" mode enabled that slows the processor down when there is no
+ keyboard activity.
+
+ - some other device or device driver hogging the bus or disabling interrupts.
+ Check /proc/interrupts for excessive interrupt counts. The timer tick
+ interrupt should always be incrementing faster than the others.
+
+No received packets
+If a 3c509, 3c562 or 3c589 can successfully transmit packets, but never
+receives packets (as reported by /proc/net/dev or 'ifconfig') you likely
+have an interrupt line problem. Check /proc/interrupts to verify that the
+card is actually generating interrupts. If the interrupt count is not
+increasing you likely have a physical conflict with two devices trying to
+use the same ISA IRQ line. The common conflict is with a sound card on IRQ10
+or IRQ5, and the easiest solution is to move the 3c509 to a different
+interrupt line. If the device is receiving packets but 'ping' doesn't work,
+you have a routing problem.
+
+Tx Carrier Errors Reported in /proc/net/dev
+If an EtherLink III appears to transmit packets, but the "Tx carrier errors"
+field in /proc/net/dev increments as quickly as the Tx packet count, you
+likely have an unterminated network or the incorrect media transceiver selected.
+
+3c509B card is not detected on machines with an ISA PnP BIOS.
+While the updated driver works with most PnP BIOS programs, it does not work
+with all. This can be fixed by disabling PnP support using the 3Com-supplied
+setup program.
+
+3c509 card is not detected on overclocked machines
+Increase the delay time in id_read_eeprom() from the current value, 500,
+to an absurdly high value, such as 5000.
+
+
+(4b) Decoding Status and Error Messages
+
+The bits in the main status register are:
+
+value description
+0x01 Interrupt latch
+0x02 Tx overrun, or Rx underrun
+0x04 Tx complete
+0x08 Tx FIFO room available
+0x10 A complete Rx packet has arrived
+0x20 A Rx packet has started to arrive
+0x40 The driver has requested an interrupt
+0x80 Statistics counter nearly full
+
+The bits in the transmit (Tx) status word are:
+
+value description
+0x02 Out-of-window collision.
+0x04 Status stack overflow (normally impossible).
+0x08 16 collisions.
+0x10 Tx underrun (not enough PCI bus bandwidth).
+0x20 Tx jabber.
+0x40 Tx interrupt requested.
+0x80 Status is valid (this should always be set).
+
+
+When a transmit error occurs the driver produces a status message such as
+
+ eth0: Transmit error, Tx status register 82
+
+The two values typically seen here are:
+
+0x82
+Out of window collision. This typically occurs when some other Ethernet
+host is incorrectly set to full duplex on a half duplex network.
+
+0x88
+16 collisions. This typically occurs when the network is exceptionally busy
+or when another host doesn't correctly back off after a collision. If this
+error is mixed with 0x82 errors it is the result of a host incorrectly set
+to full duplex (see above).
+
+Both of these errors are the result of network problems that should be
+corrected. They do not represent driver malfunction.
+
+
+(5) Revision history (this file)
+
+28Feb02 v1.0 DR New; major portions based on Becker original 3c509 docs
+
diff --git a/Documentation/networking/6pack.txt b/Documentation/networking/6pack.txt
new file mode 100644
index 000000000..8f339428f
--- /dev/null
+++ b/Documentation/networking/6pack.txt
@@ -0,0 +1,175 @@
+This is the 6pack-mini-HOWTO, written by
+
+Andreas Könsgen DG3KQ
+Internet: ajk@comnets.uni-bremen.de
+AMPR-net: dg3kq@db0pra.ampr.org
+AX.25: dg3kq@db0ach.#nrw.deu.eu
+
+Last update: April 7, 1998
+
+1. What is 6pack, and what are the advantages to KISS?
+
+6pack is a transmission protocol for data exchange between the PC and
+the TNC over a serial line. It can be used as an alternative to KISS.
+
+6pack has two major advantages:
+- The PC is given full control over the radio
+ channel. Special control data is exchanged between the PC and the TNC so
+ that the PC knows at any time if the TNC is receiving data, if a TNC
+ buffer underrun or overrun has occurred, if the PTT is
+ set and so on. This control data is processed at a higher priority than
+ normal data, so a data stream can be interrupted at any time to issue an
+ important event. This helps to improve the channel access and timing
+ algorithms as everything is computed in the PC. It would even be possible
+ to experiment with something completely different from the known CSMA and
+ DAMA channel access methods.
+ This kind of real-time control is especially important to supply several
+ TNCs that are connected between each other and the PC by a daisy chain
+ (however, this feature is not supported yet by the Linux 6pack driver).
+
+- Each packet transferred over the serial line is supplied with a checksum,
+ so it is easy to detect errors due to problems on the serial line.
+ Received packets that are corrupt are not passed on to the AX.25 layer.
+ Damaged packets that the TNC has received from the PC are not transmitted.
+
+More details about 6pack are described in the file 6pack.ps that is located
+in the doc directory of the AX.25 utilities package.
+
+2. Who has developed the 6pack protocol?
+
+The 6pack protocol has been developed by Ekki Plicht DF4OR, Henning Rech
+DF9IC and Gunter Jost DK7WJ. A driver for 6pack, written by Gunter Jost and
+Matthias Welwarsky DG2FEF, comes along with the PC version of FlexNet.
+They have also written a firmware for TNCs to perform the 6pack
+protocol (see section 4 below).
+
+3. Where can I get the latest version of 6pack for LinuX?
+
+At the moment, the 6pack stuff can obtained via anonymous ftp from
+db0bm.automation.fh-aachen.de. In the directory /incoming/dg3kq,
+there is a file named 6pack.tgz.
+
+4. Preparing the TNC for 6pack operation
+
+To be able to use 6pack, a special firmware for the TNC is needed. The EPROM
+of a newly bought TNC does not contain 6pack, so you will have to
+program an EPROM yourself. The image file for 6pack EPROMs should be
+available on any packet radio box where PC/FlexNet can be found. The name of
+the file is 6pack.bin. This file is copyrighted and maintained by the FlexNet
+team. It can be used under the terms of the license that comes along
+with PC/FlexNet. Please do not ask me about the internals of this file as I
+don't know anything about it. I used a textual description of the 6pack
+protocol to program the Linux driver.
+
+TNCs contain a 64kByte EPROM, the lower half of which is used for
+the firmware/KISS. The upper half is either empty or is sometimes
+programmed with software called TAPR. In the latter case, the TNC
+is supplied with a DIP switch so you can easily change between the
+two systems. When programming a new EPROM, one of the systems is replaced
+by 6pack. It is useful to replace TAPR, as this software is rarely used
+nowadays. If your TNC is not equipped with the switch mentioned above, you
+can build in one yourself that switches over the highest address pin
+of the EPROM between HIGH and LOW level. After having inserted the new EPROM
+and switched to 6pack, apply power to the TNC for a first test. The connect
+and the status LED are lit for about a second if the firmware initialises
+the TNC correctly.
+
+5. Building and installing the 6pack driver
+
+The driver has been tested with kernel version 2.1.90. Use with older
+kernels may lead to a compilation error because the interface to a kernel
+function has been changed in the 2.1.8x kernels.
+
+How to turn on 6pack support:
+
+- In the linux kernel configuration program, select the code maturity level
+ options menu and turn on the prompting for development drivers.
+
+- Select the amateur radio support menu and turn on the serial port 6pack
+ driver.
+
+- Compile and install the kernel and the modules.
+
+To use the driver, the kissattach program delivered with the AX.25 utilities
+has to be modified.
+
+- Do a cd to the directory that holds the kissattach sources. Edit the
+ kissattach.c file. At the top, insert the following lines:
+
+ #ifndef N_6PACK
+ #define N_6PACK (N_AX25+1)
+ #endif
+
+ Then find the line
+
+ int disc = N_AX25;
+
+ and replace N_AX25 by N_6PACK.
+
+- Recompile kissattach. Rename it to spattach to avoid confusions.
+
+Installing the driver:
+
+- Do an insmod 6pack. Look at your /var/log/messages file to check if the
+ module has printed its initialization message.
+
+- Do a spattach as you would launch kissattach when starting a KISS port.
+ Check if the kernel prints the message '6pack: TNC found'.
+
+- From here, everything should work as if you were setting up a KISS port.
+ The only difference is that the network device that represents
+ the 6pack port is called sp instead of sl or ax. So, sp0 would be the
+ first 6pack port.
+
+Although the driver has been tested on various platforms, I still declare it
+ALPHA. BE CAREFUL! Sync your disks before insmoding the 6pack module
+and spattaching. Watch out if your computer behaves strangely. Read section
+6 of this file about known problems.
+
+Note that the connect and status LEDs of the TNC are controlled in a
+different way than they are when the TNC is used with PC/FlexNet. When using
+FlexNet, the connect LED is on if there is a connection; the status LED is
+on if there is data in the buffer of the PC's AX.25 engine that has to be
+transmitted. Under Linux, the 6pack layer is beyond the AX.25 layer,
+so the 6pack driver doesn't know anything about connects or data that
+has not yet been transmitted. Therefore the LEDs are controlled
+as they are in KISS mode: The connect LED is turned on if data is transferred
+from the PC to the TNC over the serial line, the status LED if data is
+sent to the PC.
+
+6. Known problems
+
+When testing the driver with 2.0.3x kernels and
+operating with data rates on the radio channel of 9600 Baud or higher,
+the driver may, on certain systems, sometimes print the message '6pack:
+bad checksum', which is due to data loss if the other station sends two
+or more subsequent packets. I have been told that this is due to a problem
+with the serial driver of 2.0.3x kernels. I don't know yet if the problem
+still exists with 2.1.x kernels, as I have heard that the serial driver
+code has been changed with 2.1.x.
+
+When shutting down the sp interface with ifconfig, the kernel crashes if
+there is still an AX.25 connection left over which an IP connection was
+running, even if that IP connection is already closed. The problem does not
+occur when there is a bare AX.25 connection still running. I don't know if
+this is a problem of the 6pack driver or something else in the kernel.
+
+The driver has been tested as a module, not yet as a kernel-builtin driver.
+
+The 6pack protocol supports daisy-chaining of TNCs in a token ring, which is
+connected to one serial port of the PC. This feature is not implemented
+and at least at the moment I won't be able to do it because I do not have
+the opportunity to build a TNC daisy-chain and test it.
+
+Some of the comments in the source code are inaccurate. They are left from
+the SLIP/KISS driver, from which the 6pack driver has been derived.
+I haven't modified or removed them yet -- sorry! The code itself needs
+some cleaning and optimizing. This will be done in a later release.
+
+If you encounter a bug or if you have a question or suggestion concerning the
+driver, feel free to mail me, using the addresses given at the beginning of
+this file.
+
+Have fun!
+
+Andreas
diff --git a/Documentation/networking/LICENSE.qla3xxx b/Documentation/networking/LICENSE.qla3xxx
new file mode 100644
index 000000000..2f2077e34
--- /dev/null
+++ b/Documentation/networking/LICENSE.qla3xxx
@@ -0,0 +1,46 @@
+Copyright (c) 2003-2006 QLogic Corporation
+QLogic Linux Networking HBA Driver
+
+This program includes a device driver for Linux 2.6 that may be
+distributed with QLogic hardware specific firmware binary file.
+You may modify and redistribute the device driver code under the
+GNU General Public License as published by the Free Software
+Foundation (version 2 or a later version).
+
+You may redistribute the hardware specific firmware binary file
+under the following terms:
+
+ 1. Redistribution of source code (only if applicable),
+ must retain the above copyright notice, this list of
+ conditions and the following disclaimer.
+
+ 2. Redistribution in binary form must reproduce the above
+ copyright notice, this list of conditions and the
+ following disclaimer in the documentation and/or other
+ materials provided with the distribution.
+
+ 3. The name of QLogic Corporation may not be used to
+ endorse or promote products derived from this software
+ without specific prior written permission
+
+REGARDLESS OF WHAT LICENSING MECHANISM IS USED OR APPLICABLE,
+THIS PROGRAM IS PROVIDED BY QLOGIC CORPORATION "AS IS'' AND ANY
+EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR
+BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
+TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
+ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGE.
+
+USER ACKNOWLEDGES AND AGREES THAT USE OF THIS PROGRAM WILL NOT
+CREATE OR GIVE GROUNDS FOR A LICENSE BY IMPLICATION, ESTOPPEL, OR
+OTHERWISE IN ANY INTELLECTUAL PROPERTY RIGHTS (PATENT, COPYRIGHT,
+TRADE SECRET, MASK WORK, OR OTHER PROPRIETARY RIGHT) EMBODIED IN
+ANY OTHER QLOGIC HARDWARE OR SOFTWARE EITHER SOLELY OR IN
+COMBINATION WITH THIS PROGRAM.
+
diff --git a/Documentation/networking/LICENSE.qlcnic b/Documentation/networking/LICENSE.qlcnic
new file mode 100644
index 000000000..2ae3b6498
--- /dev/null
+++ b/Documentation/networking/LICENSE.qlcnic
@@ -0,0 +1,288 @@
+Copyright (c) 2009-2013 QLogic Corporation
+QLogic Linux qlcnic NIC Driver
+
+You may modify and redistribute the device driver code under the
+GNU General Public License (a copy of which is attached hereto as
+Exhibit A) published by the Free Software Foundation (version 2).
+
+
+EXHIBIT A
+
+ GNU GENERAL PUBLIC LICENSE
+ Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.
+ 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+ Preamble
+
+ The licenses for most software are designed to take away your
+freedom to share and change it. By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change free
+software--to make sure the software is free for all its users. This
+General Public License applies to most of the Free Software
+Foundation's software and to any other program whose authors commit to
+using it. (Some other Free Software Foundation software is covered by
+the GNU Lesser General Public License instead.) You can apply it to
+your programs, too.
+
+ When we speak of free software, we are referring to freedom, not
+price. Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+this service if you wish), that you receive source code or can get it
+if you want it, that you can change the software or use pieces of it
+in new free programs; and that you know you can do these things.
+
+ To protect your rights, we need to make restrictions that forbid
+anyone to deny you these rights or to ask you to surrender the rights.
+These restrictions translate to certain responsibilities for you if you
+distribute copies of the software, or if you modify it.
+
+ For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must give the recipients all the rights that
+you have. You must make sure that they, too, receive or can get the
+source code. And you must show them these terms so they know their
+rights.
+
+ We protect your rights with two steps: (1) copyright the software, and
+(2) offer you this license which gives you legal permission to copy,
+distribute and/or modify the software.
+
+ Also, for each author's protection and ours, we want to make certain
+that everyone understands that there is no warranty for this free
+software. If the software is modified by someone else and passed on, we
+want its recipients to know that what they have is not the original, so
+that any problems introduced by others will not reflect on the original
+authors' reputations.
+
+ Finally, any free program is threatened constantly by software
+patents. We wish to avoid the danger that redistributors of a free
+program will individually obtain patent licenses, in effect making the
+program proprietary. To prevent this, we have made it clear that any
+patent must be licensed for everyone's free use or not licensed at all.
+
+ The precise terms and conditions for copying, distribution and
+modification follow.
+
+ GNU GENERAL PUBLIC LICENSE
+ TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+ 0. This License applies to any program or other work which contains
+a notice placed by the copyright holder saying it may be distributed
+under the terms of this General Public License. The "Program", below,
+refers to any such program or work, and a "work based on the Program"
+means either the Program or any derivative work under copyright law:
+that is to say, a work containing the Program or a portion of it,
+either verbatim or with modifications and/or translated into another
+language. (Hereinafter, translation is included without limitation in
+the term "modification".) Each licensee is addressed as "you".
+
+Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope. The act of
+running the Program is not restricted, and the output from the Program
+is covered only if its contents constitute a work based on the
+Program (independent of having been made by running the Program).
+Whether that is true depends on what the Program does.
+
+ 1. You may copy and distribute verbatim copies of the Program's
+source code as you receive it, in any medium, provided that you
+conspicuously and appropriately publish on each copy an appropriate
+copyright notice and disclaimer of warranty; keep intact all the
+notices that refer to this License and to the absence of any warranty;
+and give any other recipients of the Program a copy of this License
+along with the Program.
+
+You may charge a fee for the physical act of transferring a copy, and
+you may at your option offer warranty protection in exchange for a fee.
+
+ 2. You may modify your copy or copies of the Program or any portion
+of it, thus forming a work based on the Program, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+ a) You must cause the modified files to carry prominent notices
+ stating that you changed the files and the date of any change.
+
+ b) You must cause any work that you distribute or publish, that in
+ whole or in part contains or is derived from the Program or any
+ part thereof, to be licensed as a whole at no charge to all third
+ parties under the terms of this License.
+
+ c) If the modified program normally reads commands interactively
+ when run, you must cause it, when started running for such
+ interactive use in the most ordinary way, to print or display an
+ announcement including an appropriate copyright notice and a
+ notice that there is no warranty (or else, saying that you provide
+ a warranty) and that users may redistribute the program under
+ these conditions, and telling the user how to view a copy of this
+ License. (Exception: if the Program itself is interactive but
+ does not normally print such an announcement, your work based on
+ the Program is not required to print an announcement.)
+
+These requirements apply to the modified work as a whole. If
+identifiable sections of that work are not derived from the Program,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works. But when you
+distribute the same sections as part of a whole which is a work based
+on the Program, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Program.
+
+In addition, mere aggregation of another work not based on the Program
+with the Program (or with a work based on the Program) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+ 3. You may copy and distribute the Program (or a work based on it,
+under Section 2) in object code or executable form under the terms of
+Sections 1 and 2 above provided that you also do one of the following:
+
+ a) Accompany it with the complete corresponding machine-readable
+ source code, which must be distributed under the terms of Sections
+ 1 and 2 above on a medium customarily used for software interchange; or,
+
+ b) Accompany it with a written offer, valid for at least three
+ years, to give any third party, for a charge no more than your
+ cost of physically performing source distribution, a complete
+ machine-readable copy of the corresponding source code, to be
+ distributed under the terms of Sections 1 and 2 above on a medium
+ customarily used for software interchange; or,
+
+ c) Accompany it with the information you received as to the offer
+ to distribute corresponding source code. (This alternative is
+ allowed only for noncommercial distribution and only if you
+ received the program in object code or executable form with such
+ an offer, in accord with Subsection b above.)
+
+The source code for a work means the preferred form of the work for
+making modifications to it. For an executable work, complete source
+code means all the source code for all modules it contains, plus any
+associated interface definition files, plus the scripts used to
+control compilation and installation of the executable. However, as a
+special exception, the source code distributed need not include
+anything that is normally distributed (in either source or binary
+form) with the major components (compiler, kernel, and so on) of the
+operating system on which the executable runs, unless that component
+itself accompanies the executable.
+
+If distribution of executable or object code is made by offering
+access to copy from a designated place, then offering equivalent
+access to copy the source code from the same place counts as
+distribution of the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+ 4. You may not copy, modify, sublicense, or distribute the Program
+except as expressly provided under this License. Any attempt
+otherwise to copy, modify, sublicense or distribute the Program is
+void, and will automatically terminate your rights under this License.
+However, parties who have received copies, or rights, from you under
+this License will not have their licenses terminated so long as such
+parties remain in full compliance.
+
+ 5. You are not required to accept this License, since you have not
+signed it. However, nothing else grants you permission to modify or
+distribute the Program or its derivative works. These actions are
+prohibited by law if you do not accept this License. Therefore, by
+modifying or distributing the Program (or any work based on the
+Program), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Program or works based on it.
+
+ 6. Each time you redistribute the Program (or any work based on the
+Program), the recipient automatically receives a license from the
+original licensor to copy, distribute or modify the Program subject to
+these terms and conditions. You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties to
+this License.
+
+ 7. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License. If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Program at all. For example, if a patent
+license would not permit royalty-free redistribution of the Program by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Program.
+
+If any portion of this section is held invalid or unenforceable under
+any particular circumstance, the balance of the section is intended to
+apply and the section as a whole is intended to apply in other
+circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system, which is
+implemented by public license practices. Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+ 8. If the distribution and/or use of the Program is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Program under this License
+may add an explicit geographical distribution limitation excluding
+those countries, so that distribution is permitted only in or among
+countries not thus excluded. In such case, this License incorporates
+the limitation as if written in the body of this License.
+
+ 9. The Free Software Foundation may publish revised and/or new versions
+of the General Public License from time to time. Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+Each version is given a distinguishing version number. If the Program
+specifies a version number of this License which applies to it and "any
+later version", you have the option of following the terms and conditions
+either of that version or of any later version published by the Free
+Software Foundation. If the Program does not specify a version number of
+this License, you may choose any version ever published by the Free Software
+Foundation.
+
+ 10. If you wish to incorporate parts of the Program into other free
+programs whose distribution conditions are different, write to the author
+to ask for permission. For software which is copyrighted by the Free
+Software Foundation, write to the Free Software Foundation; we sometimes
+make exceptions for this. Our decision will be guided by the two goals
+of preserving the free status of all derivatives of our free software and
+of promoting the sharing and reuse of software generally.
+
+ NO WARRANTY
+
+ 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
+FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
+OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
+PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
+OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
+TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
+PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
+REPAIR OR CORRECTION.
+
+ 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
+REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
+INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
+TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
+YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
+PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGES.
diff --git a/Documentation/networking/LICENSE.qlge b/Documentation/networking/LICENSE.qlge
new file mode 100644
index 000000000..ce64e4d15
--- /dev/null
+++ b/Documentation/networking/LICENSE.qlge
@@ -0,0 +1,288 @@
+Copyright (c) 2003-2011 QLogic Corporation
+QLogic Linux qlge NIC Driver
+
+You may modify and redistribute the device driver code under the
+GNU General Public License (a copy of which is attached hereto as
+Exhibit A) published by the Free Software Foundation (version 2).
+
+
+EXHIBIT A
+
+ GNU GENERAL PUBLIC LICENSE
+ Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.
+ 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+ Preamble
+
+ The licenses for most software are designed to take away your
+freedom to share and change it. By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change free
+software--to make sure the software is free for all its users. This
+General Public License applies to most of the Free Software
+Foundation's software and to any other program whose authors commit to
+using it. (Some other Free Software Foundation software is covered by
+the GNU Lesser General Public License instead.) You can apply it to
+your programs, too.
+
+ When we speak of free software, we are referring to freedom, not
+price. Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+this service if you wish), that you receive source code or can get it
+if you want it, that you can change the software or use pieces of it
+in new free programs; and that you know you can do these things.
+
+ To protect your rights, we need to make restrictions that forbid
+anyone to deny you these rights or to ask you to surrender the rights.
+These restrictions translate to certain responsibilities for you if you
+distribute copies of the software, or if you modify it.
+
+ For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must give the recipients all the rights that
+you have. You must make sure that they, too, receive or can get the
+source code. And you must show them these terms so they know their
+rights.
+
+ We protect your rights with two steps: (1) copyright the software, and
+(2) offer you this license which gives you legal permission to copy,
+distribute and/or modify the software.
+
+ Also, for each author's protection and ours, we want to make certain
+that everyone understands that there is no warranty for this free
+software. If the software is modified by someone else and passed on, we
+want its recipients to know that what they have is not the original, so
+that any problems introduced by others will not reflect on the original
+authors' reputations.
+
+ Finally, any free program is threatened constantly by software
+patents. We wish to avoid the danger that redistributors of a free
+program will individually obtain patent licenses, in effect making the
+program proprietary. To prevent this, we have made it clear that any
+patent must be licensed for everyone's free use or not licensed at all.
+
+ The precise terms and conditions for copying, distribution and
+modification follow.
+
+ GNU GENERAL PUBLIC LICENSE
+ TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+ 0. This License applies to any program or other work which contains
+a notice placed by the copyright holder saying it may be distributed
+under the terms of this General Public License. The "Program", below,
+refers to any such program or work, and a "work based on the Program"
+means either the Program or any derivative work under copyright law:
+that is to say, a work containing the Program or a portion of it,
+either verbatim or with modifications and/or translated into another
+language. (Hereinafter, translation is included without limitation in
+the term "modification".) Each licensee is addressed as "you".
+
+Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope. The act of
+running the Program is not restricted, and the output from the Program
+is covered only if its contents constitute a work based on the
+Program (independent of having been made by running the Program).
+Whether that is true depends on what the Program does.
+
+ 1. You may copy and distribute verbatim copies of the Program's
+source code as you receive it, in any medium, provided that you
+conspicuously and appropriately publish on each copy an appropriate
+copyright notice and disclaimer of warranty; keep intact all the
+notices that refer to this License and to the absence of any warranty;
+and give any other recipients of the Program a copy of this License
+along with the Program.
+
+You may charge a fee for the physical act of transferring a copy, and
+you may at your option offer warranty protection in exchange for a fee.
+
+ 2. You may modify your copy or copies of the Program or any portion
+of it, thus forming a work based on the Program, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+ a) You must cause the modified files to carry prominent notices
+ stating that you changed the files and the date of any change.
+
+ b) You must cause any work that you distribute or publish, that in
+ whole or in part contains or is derived from the Program or any
+ part thereof, to be licensed as a whole at no charge to all third
+ parties under the terms of this License.
+
+ c) If the modified program normally reads commands interactively
+ when run, you must cause it, when started running for such
+ interactive use in the most ordinary way, to print or display an
+ announcement including an appropriate copyright notice and a
+ notice that there is no warranty (or else, saying that you provide
+ a warranty) and that users may redistribute the program under
+ these conditions, and telling the user how to view a copy of this
+ License. (Exception: if the Program itself is interactive but
+ does not normally print such an announcement, your work based on
+ the Program is not required to print an announcement.)
+
+These requirements apply to the modified work as a whole. If
+identifiable sections of that work are not derived from the Program,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works. But when you
+distribute the same sections as part of a whole which is a work based
+on the Program, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Program.
+
+In addition, mere aggregation of another work not based on the Program
+with the Program (or with a work based on the Program) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+ 3. You may copy and distribute the Program (or a work based on it,
+under Section 2) in object code or executable form under the terms of
+Sections 1 and 2 above provided that you also do one of the following:
+
+ a) Accompany it with the complete corresponding machine-readable
+ source code, which must be distributed under the terms of Sections
+ 1 and 2 above on a medium customarily used for software interchange; or,
+
+ b) Accompany it with a written offer, valid for at least three
+ years, to give any third party, for a charge no more than your
+ cost of physically performing source distribution, a complete
+ machine-readable copy of the corresponding source code, to be
+ distributed under the terms of Sections 1 and 2 above on a medium
+ customarily used for software interchange; or,
+
+ c) Accompany it with the information you received as to the offer
+ to distribute corresponding source code. (This alternative is
+ allowed only for noncommercial distribution and only if you
+ received the program in object code or executable form with such
+ an offer, in accord with Subsection b above.)
+
+The source code for a work means the preferred form of the work for
+making modifications to it. For an executable work, complete source
+code means all the source code for all modules it contains, plus any
+associated interface definition files, plus the scripts used to
+control compilation and installation of the executable. However, as a
+special exception, the source code distributed need not include
+anything that is normally distributed (in either source or binary
+form) with the major components (compiler, kernel, and so on) of the
+operating system on which the executable runs, unless that component
+itself accompanies the executable.
+
+If distribution of executable or object code is made by offering
+access to copy from a designated place, then offering equivalent
+access to copy the source code from the same place counts as
+distribution of the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+ 4. You may not copy, modify, sublicense, or distribute the Program
+except as expressly provided under this License. Any attempt
+otherwise to copy, modify, sublicense or distribute the Program is
+void, and will automatically terminate your rights under this License.
+However, parties who have received copies, or rights, from you under
+this License will not have their licenses terminated so long as such
+parties remain in full compliance.
+
+ 5. You are not required to accept this License, since you have not
+signed it. However, nothing else grants you permission to modify or
+distribute the Program or its derivative works. These actions are
+prohibited by law if you do not accept this License. Therefore, by
+modifying or distributing the Program (or any work based on the
+Program), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Program or works based on it.
+
+ 6. Each time you redistribute the Program (or any work based on the
+Program), the recipient automatically receives a license from the
+original licensor to copy, distribute or modify the Program subject to
+these terms and conditions. You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties to
+this License.
+
+ 7. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License. If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Program at all. For example, if a patent
+license would not permit royalty-free redistribution of the Program by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Program.
+
+If any portion of this section is held invalid or unenforceable under
+any particular circumstance, the balance of the section is intended to
+apply and the section as a whole is intended to apply in other
+circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system, which is
+implemented by public license practices. Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+ 8. If the distribution and/or use of the Program is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Program under this License
+may add an explicit geographical distribution limitation excluding
+those countries, so that distribution is permitted only in or among
+countries not thus excluded. In such case, this License incorporates
+the limitation as if written in the body of this License.
+
+ 9. The Free Software Foundation may publish revised and/or new versions
+of the General Public License from time to time. Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+Each version is given a distinguishing version number. If the Program
+specifies a version number of this License which applies to it and "any
+later version", you have the option of following the terms and conditions
+either of that version or of any later version published by the Free
+Software Foundation. If the Program does not specify a version number of
+this License, you may choose any version ever published by the Free Software
+Foundation.
+
+ 10. If you wish to incorporate parts of the Program into other free
+programs whose distribution conditions are different, write to the author
+to ask for permission. For software which is copyrighted by the Free
+Software Foundation, write to the Free Software Foundation; we sometimes
+make exceptions for this. Our decision will be guided by the two goals
+of preserving the free status of all derivatives of our free software and
+of promoting the sharing and reuse of software generally.
+
+ NO WARRANTY
+
+ 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
+FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
+OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
+PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
+OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
+TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
+PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
+REPAIR OR CORRECTION.
+
+ 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
+REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
+INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
+TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
+YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
+PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGES.
diff --git a/Documentation/networking/Makefile b/Documentation/networking/Makefile
new file mode 100644
index 000000000..4c5d7c485
--- /dev/null
+++ b/Documentation/networking/Makefile
@@ -0,0 +1 @@
+subdir-y := timestamping
diff --git a/Documentation/networking/PLIP.txt b/Documentation/networking/PLIP.txt
new file mode 100644
index 000000000..ad7e3f7c3
--- /dev/null
+++ b/Documentation/networking/PLIP.txt
@@ -0,0 +1,215 @@
+PLIP: The Parallel Line Internet Protocol Device
+
+Donald Becker (becker@super.org)
+I.D.A. Supercomputing Research Center, Bowie MD 20715
+
+At some point T. Thorn will probably contribute text,
+Tommy Thorn (tthorn@daimi.aau.dk)
+
+PLIP Introduction
+-----------------
+
+This document describes the parallel port packet pusher for Net/LGX.
+This device interface allows a point-to-point connection between two
+parallel ports to appear as a IP network interface.
+
+What is PLIP?
+=============
+
+PLIP is Parallel Line IP, that is, the transportation of IP packages
+over a parallel port. In the case of a PC, the obvious choice is the
+printer port. PLIP is a non-standard, but [can use] uses the standard
+LapLink null-printer cable [can also work in turbo mode, with a PLIP
+cable]. [The protocol used to pack IP packages, is a simple one
+initiated by Crynwr.]
+
+Advantages of PLIP
+==================
+
+It's cheap, it's available everywhere, and it's easy.
+
+The PLIP cable is all that's needed to connect two Linux boxes, and it
+can be built for very few bucks.
+
+Connecting two Linux boxes takes only a second's decision and a few
+minutes' work, no need to search for a [supported] netcard. This might
+even be especially important in the case of notebooks, where netcards
+are not easily available.
+
+Not requiring a netcard also means that apart from connecting the
+cables, everything else is software configuration [which in principle
+could be made very easy.]
+
+Disadvantages of PLIP
+=====================
+
+Doesn't work over a modem, like SLIP and PPP. Limited range, 15 m.
+Can only be used to connect three (?) Linux boxes. Doesn't connect to
+an existing Ethernet. Isn't standard (not even de facto standard, like
+SLIP).
+
+Performance
+===========
+
+PLIP easily outperforms Ethernet cards....(ups, I was dreaming, but
+it *is* getting late. EOB)
+
+PLIP driver details
+-------------------
+
+The Linux PLIP driver is an implementation of the original Crynwr protocol,
+that uses the parallel port subsystem of the kernel in order to properly
+share parallel ports between PLIP and other services.
+
+IRQs and trigger timeouts
+=========================
+
+When a parallel port used for a PLIP driver has an IRQ configured to it, the
+PLIP driver is signaled whenever data is sent to it via the cable, such that
+when no data is available, the driver isn't being used.
+
+However, on some machines it is hard, if not impossible, to configure an IRQ
+to a certain parallel port, mainly because it is used by some other device.
+On these machines, the PLIP driver can be used in IRQ-less mode, where
+the PLIP driver would constantly poll the parallel port for data waiting,
+and if such data is available, process it. This mode is less efficient than
+the IRQ mode, because the driver has to check the parallel port many times
+per second, even when no data at all is sent. Some rough measurements
+indicate that there isn't a noticeable performance drop when using IRQ-less
+mode as compared to IRQ mode as far as the data transfer speed is involved.
+There is a performance drop on the machine hosting the driver.
+
+When the PLIP driver is used in IRQ mode, the timeout used for triggering a
+data transfer (the maximal time the PLIP driver would allow the other side
+before announcing a timeout, when trying to handshake a transfer of some
+data) is, by default, 500usec. As IRQ delivery is more or less immediate,
+this timeout is quite sufficient.
+
+When in IRQ-less mode, the PLIP driver polls the parallel port HZ times
+per second (where HZ is typically 100 on most platforms, and 1024 on an
+Alpha, as of this writing). Between two such polls, there are 10^6/HZ usecs.
+On an i386, for example, 10^6/100 = 10000usec. It is easy to see that it is
+quite possible for the trigger timeout to expire between two such polls, as
+the timeout is only 500usec long. As a result, it is required to change the
+trigger timeout on the *other* side of a PLIP connection, to about
+10^6/HZ usecs. If both sides of a PLIP connection are used in IRQ-less mode,
+this timeout is required on both sides.
+
+It appears that in practice, the trigger timeout can be shorter than in the
+above calculation. It isn't an important issue, unless the wire is faulty,
+in which case a long timeout would stall the machine when, for whatever
+reason, bits are dropped.
+
+A utility that can perform this change in Linux is plipconfig, which is part
+of the net-tools package (its location can be found in the
+Documentation/Changes file). An example command would be
+'plipconfig plipX trigger 10000', where plipX is the appropriate
+PLIP device.
+
+PLIP hardware interconnection
+-----------------------------
+
+PLIP uses several different data transfer methods. The first (and the
+only one implemented in the early version of the code) uses a standard
+printer "null" cable to transfer data four bits at a time using
+data bit outputs connected to status bit inputs.
+
+The second data transfer method relies on both machines having
+bi-directional parallel ports, rather than output-only ``printer''
+ports. This allows byte-wide transfers and avoids reconstructing
+nibbles into bytes, leading to much faster transfers.
+
+Parallel Transfer Mode 0 Cable
+==============================
+
+The cable for the first transfer mode is a standard
+printer "null" cable which transfers data four bits at a time using
+data bit outputs of the first port (machine T) connected to the
+status bit inputs of the second port (machine R). There are five
+status inputs, and they are used as four data inputs and a clock (data
+strobe) input, arranged so that the data input bits appear as contiguous
+bits with standard status register implementation.
+
+A cable that implements this protocol is available commercially as a
+"Null Printer" or "Turbo Laplink" cable. It can be constructed with
+two DB-25 male connectors symmetrically connected as follows:
+
+ STROBE output 1*
+ D0->ERROR 2 - 15 15 - 2
+ D1->SLCT 3 - 13 13 - 3
+ D2->PAPOUT 4 - 12 12 - 4
+ D3->ACK 5 - 10 10 - 5
+ D4->BUSY 6 - 11 11 - 6
+ D5,D6,D7 are 7*, 8*, 9*
+ AUTOFD output 14*
+ INIT output 16*
+ SLCTIN 17 - 17
+ extra grounds are 18*,19*,20*,21*,22*,23*,24*
+ GROUND 25 - 25
+* Do not connect these pins on either end
+
+If the cable you are using has a metallic shield it should be
+connected to the metallic DB-25 shell at one end only.
+
+Parallel Transfer Mode 1
+========================
+
+The second data transfer method relies on both machines having
+bi-directional parallel ports, rather than output-only ``printer''
+ports. This allows byte-wide transfers, and avoids reconstructing
+nibbles into bytes. This cable should not be used on unidirectional
+``printer'' (as opposed to ``parallel'') ports or when the machine
+isn't configured for PLIP, as it will result in output driver
+conflicts and the (unlikely) possibility of damage.
+
+The cable for this transfer mode should be constructed as follows:
+
+ STROBE->BUSY 1 - 11
+ D0->D0 2 - 2
+ D1->D1 3 - 3
+ D2->D2 4 - 4
+ D3->D3 5 - 5
+ D4->D4 6 - 6
+ D5->D5 7 - 7
+ D6->D6 8 - 8
+ D7->D7 9 - 9
+ INIT -> ACK 16 - 10
+ AUTOFD->PAPOUT 14 - 12
+ SLCT->SLCTIN 13 - 17
+ GND->ERROR 18 - 15
+ extra grounds are 19*,20*,21*,22*,23*,24*
+ GROUND 25 - 25
+* Do not connect these pins on either end
+
+Once again, if the cable you are using has a metallic shield it should
+be connected to the metallic DB-25 shell at one end only.
+
+PLIP Mode 0 transfer protocol
+=============================
+
+The PLIP driver is compatible with the "Crynwr" parallel port transfer
+standard in Mode 0. That standard specifies the following protocol:
+
+ send header nibble '0x8'
+ count-low octet
+ count-high octet
+ ... data octets
+ checksum octet
+
+Each octet is sent as
+ <wait for rx. '0x1?'> <send 0x10+(octet&0x0F)>
+ <wait for rx. '0x0?'> <send 0x00+((octet>>4)&0x0F)>
+
+To start a transfer the transmitting machine outputs a nibble 0x08.
+That raises the ACK line, triggering an interrupt in the receiving
+machine. The receiving machine disables interrupts and raises its own ACK
+line.
+
+Restated:
+
+(OUT is bit 0-4, OUT.j is bit j from OUT. IN likewise)
+Send_Byte:
+ OUT := low nibble, OUT.4 := 1
+ WAIT FOR IN.4 = 1
+ OUT := high nibble, OUT.4 := 0
+ WAIT FOR IN.4 = 0
diff --git a/Documentation/networking/README.ipw2100 b/Documentation/networking/README.ipw2100
new file mode 100644
index 000000000..6f85e1d06
--- /dev/null
+++ b/Documentation/networking/README.ipw2100
@@ -0,0 +1,293 @@
+
+Intel(R) PRO/Wireless 2100 Driver for Linux in support of:
+
+Intel(R) PRO/Wireless 2100 Network Connection
+
+Copyright (C) 2003-2006, Intel Corporation
+
+README.ipw2100
+
+Version: git-1.1.5
+Date : January 25, 2006
+
+Index
+-----------------------------------------------
+0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER
+1. Introduction
+2. Release git-1.1.5 Current Features
+3. Command Line Parameters
+4. Sysfs Helper Files
+5. Radio Kill Switch
+6. Dynamic Firmware
+7. Power Management
+8. Support
+9. License
+
+
+0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER
+-----------------------------------------------
+
+Important Notice FOR ALL USERS OR DISTRIBUTORS!!!!
+
+Intel wireless LAN adapters are engineered, manufactured, tested, and
+quality checked to ensure that they meet all necessary local and
+governmental regulatory agency requirements for the regions that they
+are designated and/or marked to ship into. Since wireless LANs are
+generally unlicensed devices that share spectrum with radars,
+satellites, and other licensed and unlicensed devices, it is sometimes
+necessary to dynamically detect, avoid, and limit usage to avoid
+interference with these devices. In many instances Intel is required to
+provide test data to prove regional and local compliance to regional and
+governmental regulations before certification or approval to use the
+product is granted. Intel's wireless LAN's EEPROM, firmware, and
+software driver are designed to carefully control parameters that affect
+radio operation and to ensure electromagnetic compliance (EMC). These
+parameters include, without limitation, RF power, spectrum usage,
+channel scanning, and human exposure.
+
+For these reasons Intel cannot permit any manipulation by third parties
+of the software provided in binary format with the wireless WLAN
+adapters (e.g., the EEPROM and firmware). Furthermore, if you use any
+patches, utilities, or code with the Intel wireless LAN adapters that
+have been manipulated by an unauthorized party (i.e., patches,
+utilities, or code (including open source code modifications) which have
+not been validated by Intel), (i) you will be solely responsible for
+ensuring the regulatory compliance of the products, (ii) Intel will bear
+no liability, under any theory of liability for any issues associated
+with the modified products, including without limitation, claims under
+the warranty and/or issues arising from regulatory non-compliance, and
+(iii) Intel will not provide or be required to assist in providing
+support to any third parties for such modified products.
+
+Note: Many regulatory agencies consider Wireless LAN adapters to be
+modules, and accordingly, condition system-level regulatory approval
+upon receipt and review of test data documenting that the antennas and
+system configuration do not cause the EMC and radio operation to be
+non-compliant.
+
+The drivers available for download from SourceForge are provided as a
+part of a development project. Conformance to local regulatory
+requirements is the responsibility of the individual developer. As
+such, if you are interested in deploying or shipping a driver as part of
+solution intended to be used for purposes other than development, please
+obtain a tested driver from Intel Customer Support at:
+
+http://www.intel.com/support/wireless/sb/CS-006408.htm
+
+1. Introduction
+-----------------------------------------------
+
+This document provides a brief overview of the features supported by the
+IPW2100 driver project. The main project website, where the latest
+development version of the driver can be found, is:
+
+ http://ipw2100.sourceforge.net
+
+There you can find the not only the latest releases, but also information about
+potential fixes and patches, as well as links to the development mailing list
+for the driver project.
+
+
+2. Release git-1.1.5 Current Supported Features
+-----------------------------------------------
+- Managed (BSS) and Ad-Hoc (IBSS)
+- WEP (shared key and open)
+- Wireless Tools support
+- 802.1x (tested with XSupplicant 1.0.1)
+
+Enabled (but not supported) features:
+- Monitor/RFMon mode
+- WPA/WPA2
+
+The distinction between officially supported and enabled is a reflection
+on the amount of validation and interoperability testing that has been
+performed on a given feature.
+
+
+3. Command Line Parameters
+-----------------------------------------------
+
+If the driver is built as a module, the following optional parameters are used
+by entering them on the command line with the modprobe command using this
+syntax:
+
+ modprobe ipw2100 [<option>=<VAL1><,VAL2>...]
+
+For example, to disable the radio on driver loading, enter:
+
+ modprobe ipw2100 disable=1
+
+The ipw2100 driver supports the following module parameters:
+
+Name Value Example:
+debug 0x0-0xffffffff debug=1024
+mode 0,1,2 mode=1 /* AdHoc */
+channel int channel=3 /* Only valid in AdHoc or Monitor */
+associate boolean associate=0 /* Do NOT auto associate */
+disable boolean disable=1 /* Do not power the HW */
+
+
+4. Sysfs Helper Files
+---------------------------
+-----------------------------------------------
+
+There are several ways to control the behavior of the driver. Many of the
+general capabilities are exposed through the Wireless Tools (iwconfig). There
+are a few capabilities that are exposed through entries in the Linux Sysfs.
+
+
+----- Driver Level ------
+For the driver level files, look in /sys/bus/pci/drivers/ipw2100/
+
+ debug_level
+
+ This controls the same global as the 'debug' module parameter. For
+ information on the various debugging levels available, run the 'dvals'
+ script found in the driver source directory.
+
+ NOTE: 'debug_level' is only enabled if CONFIG_IPW2100_DEBUG is turn
+ on.
+
+----- Device Level ------
+For the device level files look in
+
+ /sys/bus/pci/drivers/ipw2100/{PCI-ID}/
+
+For example:
+ /sys/bus/pci/drivers/ipw2100/0000:02:01.0
+
+For the device level files, see /sys/bus/pci/drivers/ipw2100:
+
+ rf_kill
+ read -
+ 0 = RF kill not enabled (radio on)
+ 1 = SW based RF kill active (radio off)
+ 2 = HW based RF kill active (radio off)
+ 3 = Both HW and SW RF kill active (radio off)
+ write -
+ 0 = If SW based RF kill active, turn the radio back on
+ 1 = If radio is on, activate SW based RF kill
+
+ NOTE: If you enable the SW based RF kill and then toggle the HW
+ based RF kill from ON -> OFF -> ON, the radio will NOT come back on
+
+
+5. Radio Kill Switch
+-----------------------------------------------
+Most laptops provide the ability for the user to physically disable the radio.
+Some vendors have implemented this as a physical switch that requires no
+software to turn the radio off and on. On other laptops, however, the switch
+is controlled through a button being pressed and a software driver then making
+calls to turn the radio off and on. This is referred to as a "software based
+RF kill switch"
+
+See the Sysfs helper file 'rf_kill' for determining the state of the RF switch
+on your system.
+
+
+6. Dynamic Firmware
+-----------------------------------------------
+As the firmware is licensed under a restricted use license, it can not be
+included within the kernel sources. To enable the IPW2100 you will need a
+firmware image to load into the wireless NIC's processors.
+
+You can obtain these images from <http://ipw2100.sf.net/firmware.php>.
+
+See INSTALL for instructions on installing the firmware.
+
+
+7. Power Management
+-----------------------------------------------
+The IPW2100 supports the configuration of the Power Save Protocol
+through a private wireless extension interface. The IPW2100 supports
+the following different modes:
+
+ off No power management. Radio is always on.
+ on Automatic power management
+ 1-5 Different levels of power management. The higher the
+ number the greater the power savings, but with an impact to
+ packet latencies.
+
+Power management works by powering down the radio after a certain
+interval of time has passed where no packets are passed through the
+radio. Once powered down, the radio remains in that state for a given
+period of time. For higher power savings, the interval between last
+packet processed to sleep is shorter and the sleep period is longer.
+
+When the radio is asleep, the access point sending data to the station
+must buffer packets at the AP until the station wakes up and requests
+any buffered packets. If you have an AP that does not correctly support
+the PSP protocol you may experience packet loss or very poor performance
+while power management is enabled. If this is the case, you will need
+to try and find a firmware update for your AP, or disable power
+management (via `iwconfig eth1 power off`)
+
+To configure the power level on the IPW2100 you use a combination of
+iwconfig and iwpriv. iwconfig is used to turn power management on, off,
+and set it to auto.
+
+ iwconfig eth1 power off Disables radio power down
+ iwconfig eth1 power on Enables radio power management to
+ last set level (defaults to AUTO)
+ iwpriv eth1 set_power 0 Sets power level to AUTO and enables
+ power management if not previously
+ enabled.
+ iwpriv eth1 set_power 1-5 Set the power level as specified,
+ enabling power management if not
+ previously enabled.
+
+You can view the current power level setting via:
+
+ iwpriv eth1 get_power
+
+It will return the current period or timeout that is configured as a string
+in the form of xxxx/yyyy (z) where xxxx is the timeout interval (amount of
+time after packet processing), yyyy is the period to sleep (amount of time to
+wait before powering the radio and querying the access point for buffered
+packets), and z is the 'power level'. If power management is turned off the
+xxxx/yyyy will be replaced with 'off' -- the level reported will be the active
+level if `iwconfig eth1 power on` is invoked.
+
+
+8. Support
+-----------------------------------------------
+
+For general development information and support,
+go to:
+
+ http://ipw2100.sf.net/
+
+The ipw2100 1.1.0 driver and firmware can be downloaded from:
+
+ http://support.intel.com
+
+For installation support on the ipw2100 1.1.0 driver on Linux kernels
+2.6.8 or greater, email support is available from:
+
+ http://supportmail.intel.com
+
+9. License
+-----------------------------------------------
+
+ Copyright(c) 2003 - 2006 Intel Corporation. All rights reserved.
+
+ This program is free software; you can redistribute it and/or modify it
+ under the terms of the GNU General Public License (version 2) as
+ published by the Free Software Foundation.
+
+ This program is distributed in the hope that it will be useful, but WITHOUT
+ ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ more details.
+
+ You should have received a copy of the GNU General Public License along with
+ this program; if not, write to the Free Software Foundation, Inc., 59
+ Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+
+ The full GNU General Public License is included in this distribution in the
+ file called LICENSE.
+
+ License Contact Information:
+ James P. Ketrenos <ipw2100-admin@linux.intel.com>
+ Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497
+
diff --git a/Documentation/networking/README.ipw2200 b/Documentation/networking/README.ipw2200
new file mode 100644
index 000000000..b7658bed4
--- /dev/null
+++ b/Documentation/networking/README.ipw2200
@@ -0,0 +1,472 @@
+
+Intel(R) PRO/Wireless 2915ABG Driver for Linux in support of:
+
+Intel(R) PRO/Wireless 2200BG Network Connection
+Intel(R) PRO/Wireless 2915ABG Network Connection
+
+Note: The Intel(R) PRO/Wireless 2915ABG Driver for Linux and Intel(R)
+PRO/Wireless 2200BG Driver for Linux is a unified driver that works on
+both hardware adapters listed above. In this document the Intel(R)
+PRO/Wireless 2915ABG Driver for Linux will be used to reference the
+unified driver.
+
+Copyright (C) 2004-2006, Intel Corporation
+
+README.ipw2200
+
+Version: 1.1.2
+Date : March 30, 2006
+
+
+Index
+-----------------------------------------------
+0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER
+1. Introduction
+1.1. Overview of features
+1.2. Module parameters
+1.3. Wireless Extension Private Methods
+1.4. Sysfs Helper Files
+1.5. Supported channels
+2. Ad-Hoc Networking
+3. Interacting with Wireless Tools
+3.1. iwconfig mode
+3.2. iwconfig sens
+4. About the Version Numbers
+5. Firmware installation
+6. Support
+7. License
+
+
+0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER
+-----------------------------------------------
+
+Important Notice FOR ALL USERS OR DISTRIBUTORS!!!!
+
+Intel wireless LAN adapters are engineered, manufactured, tested, and
+quality checked to ensure that they meet all necessary local and
+governmental regulatory agency requirements for the regions that they
+are designated and/or marked to ship into. Since wireless LANs are
+generally unlicensed devices that share spectrum with radars,
+satellites, and other licensed and unlicensed devices, it is sometimes
+necessary to dynamically detect, avoid, and limit usage to avoid
+interference with these devices. In many instances Intel is required to
+provide test data to prove regional and local compliance to regional and
+governmental regulations before certification or approval to use the
+product is granted. Intel's wireless LAN's EEPROM, firmware, and
+software driver are designed to carefully control parameters that affect
+radio operation and to ensure electromagnetic compliance (EMC). These
+parameters include, without limitation, RF power, spectrum usage,
+channel scanning, and human exposure.
+
+For these reasons Intel cannot permit any manipulation by third parties
+of the software provided in binary format with the wireless WLAN
+adapters (e.g., the EEPROM and firmware). Furthermore, if you use any
+patches, utilities, or code with the Intel wireless LAN adapters that
+have been manipulated by an unauthorized party (i.e., patches,
+utilities, or code (including open source code modifications) which have
+not been validated by Intel), (i) you will be solely responsible for
+ensuring the regulatory compliance of the products, (ii) Intel will bear
+no liability, under any theory of liability for any issues associated
+with the modified products, including without limitation, claims under
+the warranty and/or issues arising from regulatory non-compliance, and
+(iii) Intel will not provide or be required to assist in providing
+support to any third parties for such modified products.
+
+Note: Many regulatory agencies consider Wireless LAN adapters to be
+modules, and accordingly, condition system-level regulatory approval
+upon receipt and review of test data documenting that the antennas and
+system configuration do not cause the EMC and radio operation to be
+non-compliant.
+
+The drivers available for download from SourceForge are provided as a
+part of a development project. Conformance to local regulatory
+requirements is the responsibility of the individual developer. As
+such, if you are interested in deploying or shipping a driver as part of
+solution intended to be used for purposes other than development, please
+obtain a tested driver from Intel Customer Support at:
+
+http://support.intel.com
+
+
+1. Introduction
+-----------------------------------------------
+The following sections attempt to provide a brief introduction to using
+the Intel(R) PRO/Wireless 2915ABG Driver for Linux.
+
+This document is not meant to be a comprehensive manual on
+understanding or using wireless technologies, but should be sufficient
+to get you moving without wires on Linux.
+
+For information on building and installing the driver, see the INSTALL
+file.
+
+
+1.1. Overview of Features
+-----------------------------------------------
+The current release (1.1.2) supports the following features:
+
++ BSS mode (Infrastructure, Managed)
++ IBSS mode (Ad-Hoc)
++ WEP (OPEN and SHARED KEY mode)
++ 802.1x EAP via wpa_supplicant and xsupplicant
++ Wireless Extension support
++ Full B and G rate support (2200 and 2915)
++ Full A rate support (2915 only)
++ Transmit power control
++ S state support (ACPI suspend/resume)
+
+The following features are currently enabled, but not officially
+supported:
+
++ WPA
++ long/short preamble support
++ Monitor mode (aka RFMon)
+
+The distinction between officially supported and enabled is a reflection
+on the amount of validation and interoperability testing that has been
+performed on a given feature.
+
+
+
+1.2. Command Line Parameters
+-----------------------------------------------
+
+Like many modules used in the Linux kernel, the Intel(R) PRO/Wireless
+2915ABG Driver for Linux allows configuration options to be provided
+as module parameters. The most common way to specify a module parameter
+is via the command line.
+
+The general form is:
+
+% modprobe ipw2200 parameter=value
+
+Where the supported parameter are:
+
+ associate
+ Set to 0 to disable the auto scan-and-associate functionality of the
+ driver. If disabled, the driver will not attempt to scan
+ for and associate to a network until it has been configured with
+ one or more properties for the target network, for example configuring
+ the network SSID. Default is 0 (do not auto-associate)
+
+ Example: % modprobe ipw2200 associate=0
+
+ auto_create
+ Set to 0 to disable the auto creation of an Ad-Hoc network
+ matching the channel and network name parameters provided.
+ Default is 1.
+
+ channel
+ channel number for association. The normal method for setting
+ the channel would be to use the standard wireless tools
+ (i.e. `iwconfig eth1 channel 10`), but it is useful sometimes
+ to set this while debugging. Channel 0 means 'ANY'
+
+ debug
+ If using a debug build, this is used to control the amount of debug
+ info is logged. See the 'dvals' and 'load' script for more info on
+ how to use this (the dvals and load scripts are provided as part
+ of the ipw2200 development snapshot releases available from the
+ SourceForge project at http://ipw2200.sf.net)
+
+ led
+ Can be used to turn on experimental LED code.
+ 0 = Off, 1 = On. Default is 1.
+
+ mode
+ Can be used to set the default mode of the adapter.
+ 0 = Managed, 1 = Ad-Hoc, 2 = Monitor
+
+
+1.3. Wireless Extension Private Methods
+-----------------------------------------------
+
+As an interface designed to handle generic hardware, there are certain
+capabilities not exposed through the normal Wireless Tool interface. As
+such, a provision is provided for a driver to declare custom, or
+private, methods. The Intel(R) PRO/Wireless 2915ABG Driver for Linux
+defines several of these to configure various settings.
+
+The general form of using the private wireless methods is:
+
+ % iwpriv $IFNAME method parameters
+
+Where $IFNAME is the interface name the device is registered with
+(typically eth1, customized via one of the various network interface
+name managers, such as ifrename)
+
+The supported private methods are:
+
+ get_mode
+ Can be used to report out which IEEE mode the driver is
+ configured to support. Example:
+
+ % iwpriv eth1 get_mode
+ eth1 get_mode:802.11bg (6)
+
+ set_mode
+ Can be used to configure which IEEE mode the driver will
+ support.
+
+ Usage:
+ % iwpriv eth1 set_mode {mode}
+ Where {mode} is a number in the range 1-7:
+ 1 802.11a (2915 only)
+ 2 802.11b
+ 3 802.11ab (2915 only)
+ 4 802.11g
+ 5 802.11ag (2915 only)
+ 6 802.11bg
+ 7 802.11abg (2915 only)
+
+ get_preamble
+ Can be used to report configuration of preamble length.
+
+ set_preamble
+ Can be used to set the configuration of preamble length:
+
+ Usage:
+ % iwpriv eth1 set_preamble {mode}
+ Where {mode} is one of:
+ 1 Long preamble only
+ 0 Auto (long or short based on connection)
+
+
+1.4. Sysfs Helper Files:
+-----------------------------------------------
+
+The Linux kernel provides a pseudo file system that can be used to
+access various components of the operating system. The Intel(R)
+PRO/Wireless 2915ABG Driver for Linux exposes several configuration
+parameters through this mechanism.
+
+An entry in the sysfs can support reading and/or writing. You can
+typically query the contents of a sysfs entry through the use of cat,
+and can set the contents via echo. For example:
+
+% cat /sys/bus/pci/drivers/ipw2200/debug_level
+
+Will report the current debug level of the driver's logging subsystem
+(only available if CONFIG_IPW2200_DEBUG was configured when the driver
+was built).
+
+You can set the debug level via:
+
+% echo $VALUE > /sys/bus/pci/drivers/ipw2200/debug_level
+
+Where $VALUE would be a number in the case of this sysfs entry. The
+input to sysfs files does not have to be a number. For example, the
+firmware loader used by hotplug utilizes sysfs entries for transferring
+the firmware image from user space into the driver.
+
+The Intel(R) PRO/Wireless 2915ABG Driver for Linux exposes sysfs entries
+at two levels -- driver level, which apply to all instances of the driver
+(in the event that there are more than one device installed) and device
+level, which applies only to the single specific instance.
+
+
+1.4.1 Driver Level Sysfs Helper Files
+-----------------------------------------------
+
+For the driver level files, look in /sys/bus/pci/drivers/ipw2200/
+
+ debug_level
+
+ This controls the same global as the 'debug' module parameter
+
+
+
+1.4.2 Device Level Sysfs Helper Files
+-----------------------------------------------
+
+For the device level files, look in
+
+ /sys/bus/pci/drivers/ipw2200/{PCI-ID}/
+
+For example:
+ /sys/bus/pci/drivers/ipw2200/0000:02:01.0
+
+For the device level files, see /sys/bus/pci/drivers/ipw2200:
+
+ rf_kill
+ read -
+ 0 = RF kill not enabled (radio on)
+ 1 = SW based RF kill active (radio off)
+ 2 = HW based RF kill active (radio off)
+ 3 = Both HW and SW RF kill active (radio off)
+ write -
+ 0 = If SW based RF kill active, turn the radio back on
+ 1 = If radio is on, activate SW based RF kill
+
+ NOTE: If you enable the SW based RF kill and then toggle the HW
+ based RF kill from ON -> OFF -> ON, the radio will NOT come back on
+
+ ucode
+ read-only access to the ucode version number
+
+ led
+ read -
+ 0 = LED code disabled
+ 1 = LED code enabled
+ write -
+ 0 = Disable LED code
+ 1 = Enable LED code
+
+ NOTE: The LED code has been reported to hang some systems when
+ running ifconfig and is therefore disabled by default.
+
+
+1.5. Supported channels
+-----------------------------------------------
+
+Upon loading the Intel(R) PRO/Wireless 2915ABG Driver for Linux, a
+message stating the detected geography code and the number of 802.11
+channels supported by the card will be displayed in the log.
+
+The geography code corresponds to a regulatory domain as shown in the
+table below.
+
+ Supported channels
+Code Geography 802.11bg 802.11a
+
+--- Restricted 11 0
+ZZF Custom US/Canada 11 8
+ZZD Rest of World 13 0
+ZZA Custom USA & Europe & High 11 13
+ZZB Custom NA & Europe 11 13
+ZZC Custom Japan 11 4
+ZZM Custom 11 0
+ZZE Europe 13 19
+ZZJ Custom Japan 14 4
+ZZR Rest of World 14 0
+ZZH High Band 13 4
+ZZG Custom Europe 13 4
+ZZK Europe 13 24
+ZZL Europe 11 13
+
+
+2. Ad-Hoc Networking
+-----------------------------------------------
+
+When using a device in an Ad-Hoc network, it is useful to understand the
+sequence and requirements for the driver to be able to create, join, or
+merge networks.
+
+The following attempts to provide enough information so that you can
+have a consistent experience while using the driver as a member of an
+Ad-Hoc network.
+
+2.1. Joining an Ad-Hoc Network
+-----------------------------------------------
+
+The easiest way to get onto an Ad-Hoc network is to join one that
+already exists.
+
+2.2. Creating an Ad-Hoc Network
+-----------------------------------------------
+
+An Ad-Hoc networks is created using the syntax of the Wireless tool.
+
+For Example:
+iwconfig eth1 mode ad-hoc essid testing channel 2
+
+2.3. Merging Ad-Hoc Networks
+-----------------------------------------------
+
+
+3. Interaction with Wireless Tools
+-----------------------------------------------
+
+3.1 iwconfig mode
+-----------------------------------------------
+
+When configuring the mode of the adapter, all run-time configured parameters
+are reset to the value used when the module was loaded. This includes
+channels, rates, ESSID, etc.
+
+3.2 iwconfig sens
+-----------------------------------------------
+
+The 'iwconfig ethX sens XX' command will not set the signal sensitivity
+threshold, as described in iwconfig documentation, but rather the number
+of consecutive missed beacons that will trigger handover, i.e. roaming
+to another access point. At the same time, it will set the disassociation
+threshold to 3 times the given value.
+
+
+4. About the Version Numbers
+-----------------------------------------------
+
+Due to the nature of open source development projects, there are
+frequently changes being incorporated that have not gone through
+a complete validation process. These changes are incorporated into
+development snapshot releases.
+
+Releases are numbered with a three level scheme:
+
+ major.minor.development
+
+Any version where the 'development' portion is 0 (for example
+1.0.0, 1.1.0, etc.) indicates a stable version that will be made
+available for kernel inclusion.
+
+Any version where the 'development' portion is not a 0 (for
+example 1.0.1, 1.1.5, etc.) indicates a development version that is
+being made available for testing and cutting edge users. The stability
+and functionality of the development releases are not know. We make
+efforts to try and keep all snapshots reasonably stable, but due to the
+frequency of their release, and the desire to get those releases
+available as quickly as possible, unknown anomalies should be expected.
+
+The major version number will be incremented when significant changes
+are made to the driver. Currently, there are no major changes planned.
+
+5. Firmware installation
+----------------------------------------------
+
+The driver requires a firmware image, download it and extract the
+files under /lib/firmware (or wherever your hotplug's firmware.agent
+will look for firmware files)
+
+The firmware can be downloaded from the following URL:
+
+ http://ipw2200.sf.net/
+
+
+6. Support
+-----------------------------------------------
+
+For direct support of the 1.0.0 version, you can contact
+http://supportmail.intel.com, or you can use the open source project
+support.
+
+For general information and support, go to:
+
+ http://ipw2200.sf.net/
+
+
+7. License
+-----------------------------------------------
+
+ Copyright(c) 2003 - 2006 Intel Corporation. All rights reserved.
+
+ This program is free software; you can redistribute it and/or modify it
+ under the terms of the GNU General Public License version 2 as
+ published by the Free Software Foundation.
+
+ This program is distributed in the hope that it will be useful, but WITHOUT
+ ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+ more details.
+
+ You should have received a copy of the GNU General Public License along with
+ this program; if not, write to the Free Software Foundation, Inc., 59
+ Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+
+ The full GNU General Public License is included in this distribution in the
+ file called LICENSE.
+
+ Contact Information:
+ James P. Ketrenos <ipw2100-admin@linux.intel.com>
+ Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497
+
diff --git a/Documentation/networking/README.sb1000 b/Documentation/networking/README.sb1000
new file mode 100644
index 000000000..f92c2aac5
--- /dev/null
+++ b/Documentation/networking/README.sb1000
@@ -0,0 +1,207 @@
+sb1000 is a module network device driver for the General Instrument (also known
+as NextLevel) SURFboard1000 internal cable modem board. This is an ISA card
+which is used by a number of cable TV companies to provide cable modem access.
+It's a one-way downstream-only cable modem, meaning that your upstream net link
+is provided by your regular phone modem.
+
+This driver was written by Franco Venturi <fventuri@mediaone.net>. He deserves
+a great deal of thanks for this wonderful piece of code!
+
+-----------------------------------------------------------------------------
+
+Support for this device is now a part of the standard Linux kernel. The
+driver source code file is drivers/net/sb1000.c. In addition to this
+you will need:
+
+1.) The "cmconfig" program. This is a utility which supplements "ifconfig"
+to configure the cable modem and network interface (usually called "cm0");
+and
+
+2.) Several PPP scripts which live in /etc/ppp to make connecting via your
+cable modem easy.
+
+ These utilities can be obtained from:
+
+ http://www.jacksonville.net/~fventuri/
+
+ in Franco's original source code distribution .tar.gz file. Support for
+ the sb1000 driver can be found at:
+
+ http://web.archive.org/web/*/http://home.adelphia.net/~siglercm/sb1000.html
+ http://web.archive.org/web/*/http://linuxpower.cx/~cable/
+
+ along with these utilities.
+
+3.) The standard isapnp tools. These are necessary to configure your SB1000
+card at boot time (or afterwards by hand) since it's a PnP card.
+
+ If you don't have these installed as a standard part of your Linux
+ distribution, you can find them at:
+
+ http://www.roestock.demon.co.uk/isapnptools/
+
+ or check your Linux distribution binary CD or their web site. For help with
+ isapnp, pnpdump, or /etc/isapnp.conf, go to:
+
+ http://www.roestock.demon.co.uk/isapnptools/isapnpfaq.html
+
+-----------------------------------------------------------------------------
+
+To make the SB1000 card work, follow these steps:
+
+1.) Run `make config', or `make menuconfig', or `make xconfig', whichever
+you prefer, in the top kernel tree directory to set up your kernel
+configuration. Make sure to say "Y" to "Prompt for development drivers"
+and to say "M" to the sb1000 driver. Also say "Y" or "M" to all the standard
+networking questions to get TCP/IP and PPP networking support.
+
+2.) *BEFORE* you build the kernel, edit drivers/net/sb1000.c. Make sure
+to redefine the value of READ_DATA_PORT to match the I/O address used
+by isapnp to access your PnP cards. This is the value of READPORT in
+/etc/isapnp.conf or given by the output of pnpdump.
+
+3.) Build and install the kernel and modules as usual.
+
+4.) Boot your new kernel following the usual procedures.
+
+5.) Set up to configure the new SB1000 PnP card by capturing the output
+of "pnpdump" to a file and editing this file to set the correct I/O ports,
+IRQ, and DMA settings for all your PnP cards. Make sure none of the settings
+conflict with one another. Then test this configuration by running the
+"isapnp" command with your new config file as the input. Check for
+errors and fix as necessary. (As an aside, I use I/O ports 0x110 and
+0x310 and IRQ 11 for my SB1000 card and these work well for me. YMMV.)
+Then save the finished config file as /etc/isapnp.conf for proper configuration
+on subsequent reboots.
+
+6.) Download the original file sb1000-1.1.2.tar.gz from Franco's site or one of
+the others referenced above. As root, unpack it into a temporary directory and
+do a `make cmconfig' and then `install -c cmconfig /usr/local/sbin'. Don't do
+`make install' because it expects to find all the utilities built and ready for
+installation, not just cmconfig.
+
+7.) As root, copy all the files under the ppp/ subdirectory in Franco's
+tar file into /etc/ppp, being careful not to overwrite any files that are
+already in there. Then modify ppp@gi-on to set the correct login name,
+phone number, and frequency for the cable modem. Also edit pap-secrets
+to specify your login name and password and any site-specific information
+you need.
+
+8.) Be sure to modify /etc/ppp/firewall to use ipchains instead of
+the older ipfwadm commands from the 2.0.x kernels. There's a neat utility to
+convert ipfwadm commands to ipchains commands:
+
+ http://users.dhp.com/~whisper/ipfwadm2ipchains/
+
+You may also wish to modify the firewall script to implement a different
+firewalling scheme.
+
+9.) Start the PPP connection via the script /etc/ppp/ppp@gi-on. You must be
+root to do this. It's better to use a utility like sudo to execute
+frequently used commands like this with root permissions if possible. If you
+connect successfully the cable modem interface will come up and you'll see a
+driver message like this at the console:
+
+ cm0: sb1000 at (0x110,0x310), csn 1, S/N 0x2a0d16d8, IRQ 11.
+ sb1000.c:v1.1.2 6/01/98 (fventuri@mediaone.net)
+
+The "ifconfig" command should show two new interfaces, ppp0 and cm0.
+The command "cmconfig cm0" will give you information about the cable modem
+interface.
+
+10.) Try pinging a site via `ping -c 5 www.yahoo.com', for example. You should
+see packets received.
+
+11.) If you can't get site names (like www.yahoo.com) to resolve into
+IP addresses (like 204.71.200.67), be sure your /etc/resolv.conf file
+has no syntax errors and has the right nameserver IP addresses in it.
+If this doesn't help, try something like `ping -c 5 204.71.200.67' to
+see if the networking is running but the DNS resolution is where the
+problem lies.
+
+12.) If you still have problems, go to the support web sites mentioned above
+and read the information and documentation there.
+
+-----------------------------------------------------------------------------
+
+Common problems:
+
+1.) Packets go out on the ppp0 interface but don't come back on the cm0
+interface. It looks like I'm connected but I can't even ping any
+numerical IP addresses. (This happens predominantly on Debian systems due
+to a default boot-time configuration script.)
+
+Solution -- As root `echo 0 > /proc/sys/net/ipv4/conf/cm0/rp_filter' so it
+can share the same IP address as the ppp0 interface. Note that this
+command should probably be added to the /etc/ppp/cablemodem script
+*right*between* the "/sbin/ifconfig" and "/sbin/cmconfig" commands.
+You may need to do this to /proc/sys/net/ipv4/conf/ppp0/rp_filter as well.
+If you do this to /proc/sys/net/ipv4/conf/default/rp_filter on each reboot
+(in rc.local or some such) then any interfaces can share the same IP
+addresses.
+
+2.) I get "unresolved symbol" error messages on executing `insmod sb1000.o'.
+
+Solution -- You probably have a non-matching kernel source tree and
+/usr/include/linux and /usr/include/asm header files. Make sure you
+install the correct versions of the header files in these two directories.
+Then rebuild and reinstall the kernel.
+
+3.) When isapnp runs it reports an error, and my SB1000 card isn't working.
+
+Solution -- There's a problem with later versions of isapnp using the "(CHECK)"
+option in the lines that allocate the two I/O addresses for the SB1000 card.
+This first popped up on RH 6.0. Delete "(CHECK)" for the SB1000 I/O addresses.
+Make sure they don't conflict with any other pieces of hardware first! Then
+rerun isapnp and go from there.
+
+4.) I can't execute the /etc/ppp/ppp@gi-on file.
+
+Solution -- As root do `chmod ug+x /etc/ppp/ppp@gi-on'.
+
+5.) The firewall script isn't working (with 2.2.x and higher kernels).
+
+Solution -- Use the ipfwadm2ipchains script referenced above to convert the
+/etc/ppp/firewall script from the deprecated ipfwadm commands to ipchains.
+
+6.) I'm getting *tons* of firewall deny messages in the /var/kern.log,
+/var/messages, and/or /var/syslog files, and they're filling up my /var
+partition!!!
+
+Solution -- First, tell your ISP that you're receiving DoS (Denial of Service)
+and/or portscanning (UDP connection attempts) attacks! Look over the deny
+messages to figure out what the attack is and where it's coming from. Next,
+edit /etc/ppp/cablemodem and make sure the ",nobroadcast" option is turned on
+to the "cmconfig" command (uncomment that line). If you're not receiving these
+denied packets on your broadcast interface (IP address xxx.yyy.zzz.255
+typically), then someone is attacking your machine in particular. Be careful
+out there....
+
+7.) Everything seems to work fine but my computer locks up after a while
+(and typically during a lengthy download through the cable modem)!
+
+Solution -- You may need to add a short delay in the driver to 'slow down' the
+SURFboard because your PC might not be able to keep up with the transfer rate
+of the SB1000. To do this, it's probably best to download Franco's
+sb1000-1.1.2.tar.gz archive and build and install sb1000.o manually. You'll
+want to edit the 'Makefile' and look for the 'SB1000_DELAY'
+define. Uncomment those 'CFLAGS' lines (and comment out the default ones)
+and try setting the delay to something like 60 microseconds with:
+'-DSB1000_DELAY=60'. Then do `make' and as root `make install' and try
+it out. If it still doesn't work or you like playing with the driver, you may
+try other numbers. Remember though that the higher the delay, the slower the
+driver (which slows down the rest of the PC too when it is actively
+used). Thanks to Ed Daiga for this tip!
+
+-----------------------------------------------------------------------------
+
+Credits: This README came from Franco Venturi's original README file which is
+still supplied with his driver .tar.gz archive. I and all other sb1000 users
+owe Franco a tremendous "Thank you!" Additional thanks goes to Carl Patten
+and Ralph Bonnell who are now managing the Linux SB1000 web site, and to
+the SB1000 users who reported and helped debug the common problems listed
+above.
+
+
+ Clemmitt Sigler
+ csigler@vt.edu
diff --git a/Documentation/networking/alias.txt b/Documentation/networking/alias.txt
new file mode 100644
index 000000000..85046f53f
--- /dev/null
+++ b/Documentation/networking/alias.txt
@@ -0,0 +1,40 @@
+
+IP-Aliasing:
+============
+
+IP-aliases are an obsolete way to manage multiple IP-addresses/masks
+per interface. Newer tools such as iproute2 support multiple
+address/prefixes per interface, but aliases are still supported
+for backwards compatibility.
+
+An alias is formed by adding a colon and a string when running ifconfig.
+This string is usually numeric, but this is not a must.
+
+o Alias creation.
+ Alias creation is done by 'magic' interface naming: eg. to create a
+ 200.1.1.1 alias for eth0 ...
+
+ # ifconfig eth0:0 200.1.1.1 etc,etc....
+ ~~ -> request alias #0 creation (if not yet exists) for eth0
+
+ The corresponding route is also set up by this command.
+ Please note: The route always points to the base interface.
+
+
+o Alias deletion.
+ The alias is removed by shutting the alias down:
+
+ # ifconfig eth0:0 down
+ ~~~~~~~~~~ -> will delete alias
+
+
+o Alias (re-)configuring
+
+ Aliases are not real devices, but programs should be able to configure and
+ refer to them as usual (ifconfig, route, etc).
+
+
+o Relationship with main device
+
+ If the base device is shut down the added aliases will be deleted
+ too.
diff --git a/Documentation/networking/altera_tse.txt b/Documentation/networking/altera_tse.txt
new file mode 100644
index 000000000..3f24df8c6
--- /dev/null
+++ b/Documentation/networking/altera_tse.txt
@@ -0,0 +1,263 @@
+ Altera Triple-Speed Ethernet MAC driver
+
+Copyright (C) 2008-2014 Altera Corporation
+
+This is the driver for the Altera Triple-Speed Ethernet (TSE) controllers
+using the SGDMA and MSGDMA soft DMA IP components. The driver uses the
+platform bus to obtain component resources. The designs used to test this
+driver were built for a Cyclone(R) V SOC FPGA board, a Cyclone(R) V FPGA board,
+and tested with ARM and NIOS processor hosts seperately. The anticipated use
+cases are simple communications between an embedded system and an external peer
+for status and simple configuration of the embedded system.
+
+For more information visit www.altera.com and www.rocketboards.org. Support
+forums for the driver may be found on www.rocketboards.org, and a design used
+to test this driver may be found there as well. Support is also available from
+the maintainer of this driver, found in MAINTAINERS.
+
+The Triple-Speed Ethernet, SGDMA, and MSGDMA components are all soft IP
+components that can be assembled and built into an FPGA using the Altera
+Quartus toolchain. Quartus 13.1 and 14.0 were used to build the design that
+this driver was tested against. The sopc2dts tool is used to create the
+device tree for the driver, and may be found at rocketboards.org.
+
+The driver probe function examines the device tree and determines if the
+Triple-Speed Ethernet instance is using an SGDMA or MSGDMA component. The
+probe function then installs the appropriate set of DMA routines to
+initialize, setup transmits, receives, and interrupt handling primitives for
+the respective configurations.
+
+The SGDMA component is to be deprecated in the near future (over the next 1-2
+years as of this writing in early 2014) in favor of the MSGDMA component.
+SGDMA support is included for existing designs and reference in case a
+developer wishes to support their own soft DMA logic and driver support. Any
+new designs should not use the SGDMA.
+
+The SGDMA supports only a single transmit or receive operation at a time, and
+therefore will not perform as well compared to the MSGDMA soft IP. Please
+visit www.altera.com for known, documented SGDMA errata.
+
+Scatter-gather DMA is not supported by the SGDMA or MSGDMA at this time.
+Scatter-gather DMA will be added to a future maintenance update to this
+driver.
+
+Jumbo frames are not supported at this time.
+
+The driver limits PHY operations to 10/100Mbps, and has not yet been fully
+tested for 1Gbps. This support will be added in a future maintenance update.
+
+1) Kernel Configuration
+The kernel configuration option is ALTERA_TSE:
+ Device Drivers ---> Network device support ---> Ethernet driver support --->
+ Altera Triple-Speed Ethernet MAC support (ALTERA_TSE)
+
+2) Driver parameters list:
+ debug: message level (0: no output, 16: all);
+ dma_rx_num: Number of descriptors in the RX list (default is 64);
+ dma_tx_num: Number of descriptors in the TX list (default is 64).
+
+3) Command line options
+Driver parameters can be also passed in command line by using:
+ altera_tse=dma_rx_num:128,dma_tx_num:512
+
+4) Driver information and notes
+
+4.1) Transmit process
+When the driver's transmit routine is called by the kernel, it sets up a
+transmit descriptor by calling the underlying DMA transmit routine (SGDMA or
+MSGDMA), and initites a transmit operation. Once the transmit is complete, an
+interrupt is driven by the transmit DMA logic. The driver handles the transmit
+completion in the context of the interrupt handling chain by recycling
+resource required to send and track the requested transmit operation.
+
+4.2) Receive process
+The driver will post receive buffers to the receive DMA logic during driver
+intialization. Receive buffers may or may not be queued depending upon the
+underlying DMA logic (MSGDMA is able queue receive buffers, SGDMA is not able
+to queue receive buffers to the SGDMA receive logic). When a packet is
+received, the DMA logic generates an interrupt. The driver handles a receive
+interrupt by obtaining the DMA receive logic status, reaping receive
+completions until no more receive completions are available.
+
+4.3) Interrupt Mitigation
+The driver is able to mitigate the number of its DMA interrupts
+using NAPI for receive operations. Interrupt mitigation is not yet supported
+for transmit operations, but will be added in a future maintenance release.
+
+4.4) Ethtool support
+Ethtool is supported. Driver statistics and internal errors can be taken using:
+ethtool -S ethX command. It is possible to dump registers etc.
+
+4.5) PHY Support
+The driver is compatible with PAL to work with PHY and GPHY devices.
+
+4.7) List of source files:
+ o Kconfig
+ o Makefile
+ o altera_tse_main.c: main network device driver
+ o altera_tse_ethtool.c: ethtool support
+ o altera_tse.h: private driver structure and common definitions
+ o altera_msgdma.h: MSGDMA implementation function definitions
+ o altera_sgdma.h: SGDMA implementation function definitions
+ o altera_msgdma.c: MSGDMA implementation
+ o altera_sgdma.c: SGDMA implementation
+ o altera_sgdmahw.h: SGDMA register and descriptor definitions
+ o altera_msgdmahw.h: MSGDMA register and descriptor definitions
+ o altera_utils.c: Driver utility functions
+ o altera_utils.h: Driver utility function definitions
+
+5) Debug Information
+
+The driver exports debug information such as internal statistics,
+debug information, MAC and DMA registers etc.
+
+A user may use the ethtool support to get statistics:
+e.g. using: ethtool -S ethX (that shows the statistics counters)
+or sees the MAC registers: e.g. using: ethtool -d ethX
+
+The developer can also use the "debug" module parameter to get
+further debug information.
+
+6) Statistics Support
+
+The controller and driver support a mix of IEEE standard defined statistics,
+RFC defined statistics, and driver or Altera defined statistics. The four
+specifications containing the standard definitions for these statistics are
+as follows:
+
+ o IEEE 802.3-2012 - IEEE Standard for Ethernet.
+ o RFC 2863 found at http://www.rfc-editor.org/rfc/rfc2863.txt.
+ o RFC 2819 found at http://www.rfc-editor.org/rfc/rfc2819.txt.
+ o Altera Triple Speed Ethernet User Guide, found at http://www.altera.com
+
+The statistics supported by the TSE and the device driver are as follows:
+
+"tx_packets" is equivalent to aFramesTransmittedOK defined in IEEE 802.3-2012,
+Section 5.2.2.1.2. This statistics is the count of frames that are successfully
+transmitted.
+
+"rx_packets" is equivalent to aFramesReceivedOK defined in IEEE 802.3-2012,
+Section 5.2.2.1.5. This statistic is the count of frames that are successfully
+received. This count does not include any error packets such as CRC errors,
+length errors, or alignment errors.
+
+"rx_crc_errors" is equivalent to aFrameCheckSequenceErrors defined in IEEE
+802.3-2012, Section 5.2.2.1.6. This statistic is the count of frames that are
+an integral number of bytes in length and do not pass the CRC test as the frame
+is received.
+
+"rx_align_errors" is equivalent to aAlignmentErrors defined in IEEE 802.3-2012,
+Section 5.2.2.1.7. This statistic is the count of frames that are not an
+integral number of bytes in length and do not pass the CRC test as the frame is
+received.
+
+"tx_bytes" is equivalent to aOctetsTransmittedOK defined in IEEE 802.3-2012,
+Section 5.2.2.1.8. This statistic is the count of data and pad bytes
+successfully transmitted from the interface.
+
+"rx_bytes" is equivalent to aOctetsReceivedOK defined in IEEE 802.3-2012,
+Section 5.2.2.1.14. This statistic is the count of data and pad bytes
+successfully received by the controller.
+
+"tx_pause" is equivalent to aPAUSEMACCtrlFramesTransmitted defined in IEEE
+802.3-2012, Section 30.3.4.2. This statistic is a count of PAUSE frames
+transmitted from the network controller.
+
+"rx_pause" is equivalent to aPAUSEMACCtrlFramesReceived defined in IEEE
+802.3-2012, Section 30.3.4.3. This statistic is a count of PAUSE frames
+received by the network controller.
+
+"rx_errors" is equivalent to ifInErrors defined in RFC 2863. This statistic is
+a count of the number of packets received containing errors that prevented the
+packet from being delivered to a higher level protocol.
+
+"tx_errors" is equivalent to ifOutErrors defined in RFC 2863. This statistic
+is a count of the number of packets that could not be transmitted due to errors.
+
+"rx_unicast" is equivalent to ifInUcastPkts defined in RFC 2863. This
+statistic is a count of the number of packets received that were not addressed
+to the broadcast address or a multicast group.
+
+"rx_multicast" is equivalent to ifInMulticastPkts defined in RFC 2863. This
+statistic is a count of the number of packets received that were addressed to
+a multicast address group.
+
+"rx_broadcast" is equivalent to ifInBroadcastPkts defined in RFC 2863. This
+statistic is a count of the number of packets received that were addressed to
+the broadcast address.
+
+"tx_discards" is equivalent to ifOutDiscards defined in RFC 2863. This
+statistic is the number of outbound packets not transmitted even though an
+error was not detected. An example of a reason this might occur is to free up
+internal buffer space.
+
+"tx_unicast" is equivalent to ifOutUcastPkts defined in RFC 2863. This
+statistic counts the number of packets transmitted that were not addressed to
+a multicast group or broadcast address.
+
+"tx_multicast" is equivalent to ifOutMulticastPkts defined in RFC 2863. This
+statistic counts the number of packets transmitted that were addressed to a
+multicast group.
+
+"tx_broadcast" is equivalent to ifOutBroadcastPkts defined in RFC 2863. This
+statistic counts the number of packets transmitted that were addressed to a
+broadcast address.
+
+"ether_drops" is equivalent to etherStatsDropEvents defined in RFC 2819.
+This statistic counts the number of packets dropped due to lack of internal
+controller resources.
+
+"rx_total_bytes" is equivalent to etherStatsOctets defined in RFC 2819.
+This statistic counts the total number of bytes received by the controller,
+including error and discarded packets.
+
+"rx_total_packets" is equivalent to etherStatsPkts defined in RFC 2819.
+This statistic counts the total number of packets received by the controller,
+including error, discarded, unicast, multicast, and broadcast packets.
+
+"rx_undersize" is equivalent to etherStatsUndersizePkts defined in RFC 2819.
+This statistic counts the number of correctly formed packets received less
+than 64 bytes long.
+
+"rx_oversize" is equivalent to etherStatsOversizePkts defined in RFC 2819.
+This statistic counts the number of correctly formed packets greater than 1518
+bytes long.
+
+"rx_64_bytes" is equivalent to etherStatsPkts64Octets defined in RFC 2819.
+This statistic counts the total number of packets received that were 64 octets
+in length.
+
+"rx_65_127_bytes" is equivalent to etherStatsPkts65to127Octets defined in RFC
+2819. This statistic counts the total number of packets received that were
+between 65 and 127 octets in length inclusive.
+
+"rx_128_255_bytes" is equivalent to etherStatsPkts128to255Octets defined in
+RFC 2819. This statistic is the total number of packets received that were
+between 128 and 255 octets in length inclusive.
+
+"rx_256_511_bytes" is equivalent to etherStatsPkts256to511Octets defined in
+RFC 2819. This statistic is the total number of packets received that were
+between 256 and 511 octets in length inclusive.
+
+"rx_512_1023_bytes" is equivalent to etherStatsPkts512to1023Octets defined in
+RFC 2819. This statistic is the total number of packets received that were
+between 512 and 1023 octets in length inclusive.
+
+"rx_1024_1518_bytes" is equivalent to etherStatsPkts1024to1518Octets define
+in RFC 2819. This statistic is the total number of packets received that were
+between 1024 and 1518 octets in length inclusive.
+
+"rx_gte_1519_bytes" is a statistic defined specific to the behavior of the
+Altera TSE. This statistics counts the number of received good and errored
+frames between the length of 1519 and the maximum frame length configured
+in the frm_length register. See the Altera TSE User Guide for More details.
+
+"rx_jabbers" is equivalent to etherStatsJabbers defined in RFC 2819. This
+statistic is the total number of packets received that were longer than 1518
+octets, and had either a bad CRC with an integral number of octets (CRC Error)
+or a bad CRC with a non-integral number of octets (Alignment Error).
+
+"rx_runts" is equivalent to etherStatsFragments defined in RFC 2819. This
+statistic is the total number of packets received that were less than 64 octets
+in length and had either a bad CRC with an integral number of octets (CRC
+error) or a bad CRC with a non-integral number of octets (Alignment Error).
diff --git a/Documentation/networking/arcnet-hardware.txt b/Documentation/networking/arcnet-hardware.txt
new file mode 100644
index 000000000..731de4115
--- /dev/null
+++ b/Documentation/networking/arcnet-hardware.txt
@@ -0,0 +1,3133 @@
+
+-----------------------------------------------------------------------------
+1) This file is a supplement to arcnet.txt. Please read that for general
+ driver configuration help.
+-----------------------------------------------------------------------------
+2) This file is no longer Linux-specific. It should probably be moved out of
+ the kernel sources. Ideas?
+-----------------------------------------------------------------------------
+
+Because so many people (myself included) seem to have obtained ARCnet cards
+without manuals, this file contains a quick introduction to ARCnet hardware,
+some cabling tips, and a listing of all jumper settings I can find. Please
+e-mail apenwarr@worldvisions.ca with any settings for your particular card,
+or any other information you have!
+
+
+INTRODUCTION TO ARCNET
+----------------------
+
+ARCnet is a network type which works in a way similar to popular Ethernet
+networks but which is also different in some very important ways.
+
+First of all, you can get ARCnet cards in at least two speeds: 2.5 Mbps
+(slower than Ethernet) and 100 Mbps (faster than normal Ethernet). In fact,
+there are others as well, but these are less common. The different hardware
+types, as far as I'm aware, are not compatible and so you cannot wire a
+100 Mbps card to a 2.5 Mbps card, and so on. From what I hear, my driver does
+work with 100 Mbps cards, but I haven't been able to verify this myself,
+since I only have the 2.5 Mbps variety. It is probably not going to saturate
+your 100 Mbps card. Stop complaining. :)
+
+You also cannot connect an ARCnet card to any kind of Ethernet card and
+expect it to work.
+
+There are two "types" of ARCnet - STAR topology and BUS topology. This
+refers to how the cards are meant to be wired together. According to most
+available documentation, you can only connect STAR cards to STAR cards and
+BUS cards to BUS cards. That makes sense, right? Well, it's not quite
+true; see below under "Cabling."
+
+Once you get past these little stumbling blocks, ARCnet is actually quite a
+well-designed standard. It uses something called "modified token passing"
+which makes it completely incompatible with so-called "Token Ring" cards,
+but which makes transfers much more reliable than Ethernet does. In fact,
+ARCnet will guarantee that a packet arrives safely at the destination, and
+even if it can't possibly be delivered properly (ie. because of a cable
+break, or because the destination computer does not exist) it will at least
+tell the sender about it.
+
+Because of the carefully defined action of the "token", it will always make
+a pass around the "ring" within a maximum length of time. This makes it
+useful for realtime networks.
+
+In addition, all known ARCnet cards have an (almost) identical programming
+interface. This means that with one ARCnet driver you can support any
+card, whereas with Ethernet each manufacturer uses what is sometimes a
+completely different programming interface, leading to a lot of different,
+sometimes very similar, Ethernet drivers. Of course, always using the same
+programming interface also means that when high-performance hardware
+facilities like PCI bus mastering DMA appear, it's hard to take advantage of
+them. Let's not go into that.
+
+One thing that makes ARCnet cards difficult to program for, however, is the
+limit on their packet sizes; standard ARCnet can only send packets that are
+up to 508 bytes in length. This is smaller than the Internet "bare minimum"
+of 576 bytes, let alone the Ethernet MTU of 1500. To compensate, an extra
+level of encapsulation is defined by RFC1201, which I call "packet
+splitting," that allows "virtual packets" to grow as large as 64K each,
+although they are generally kept down to the Ethernet-style 1500 bytes.
+
+For more information on the advantages and disadvantages (mostly the
+advantages) of ARCnet networks, you might try the "ARCnet Trade Association"
+WWW page:
+ http://www.arcnet.com
+
+
+CABLING ARCNET NETWORKS
+-----------------------
+
+This section was rewritten by
+ Vojtech Pavlik <vojtech@suse.cz>
+using information from several people, including:
+ Avery Pennraun <apenwarr@worldvisions.ca>
+ Stephen A. Wood <saw@hallc1.cebaf.gov>
+ John Paul Morrison <jmorriso@bogomips.ee.ubc.ca>
+ Joachim Koenig <jojo@repas.de>
+and Avery touched it up a bit, at Vojtech's request.
+
+ARCnet (the classic 2.5 Mbps version) can be connected by two different
+types of cabling: coax and twisted pair. The other ARCnet-type networks
+(100 Mbps TCNS and 320 kbps - 32 Mbps ARCnet Plus) use different types of
+cabling (Type1, Fiber, C1, C4, C5).
+
+For a coax network, you "should" use 93 Ohm RG-62 cable. But other cables
+also work fine, because ARCnet is a very stable network. I personally use 75
+Ohm TV antenna cable.
+
+Cards for coax cabling are shipped in two different variants: for BUS and
+STAR network topologies. They are mostly the same. The only difference
+lies in the hybrid chip installed. BUS cards use high impedance output,
+while STAR use low impedance. Low impedance card (STAR) is electrically
+equal to a high impedance one with a terminator installed.
+
+Usually, the ARCnet networks are built up from STAR cards and hubs. There
+are two types of hubs - active and passive. Passive hubs are small boxes
+with four BNC connectors containing four 47 Ohm resistors:
+
+ | | wires
+ R + junction
+-R-+-R- R 47 Ohm resistors
+ R
+ |
+
+The shielding is connected together. Active hubs are much more complicated;
+they are powered and contain electronics to amplify the signal and send it
+to other segments of the net. They usually have eight connectors. Active
+hubs come in two variants - dumb and smart. The dumb variant just
+amplifies, but the smart one decodes to digital and encodes back all packets
+coming through. This is much better if you have several hubs in the net,
+since many dumb active hubs may worsen the signal quality.
+
+And now to the cabling. What you can connect together:
+
+1. A card to a card. This is the simplest way of creating a 2-computer
+ network.
+
+2. A card to a passive hub. Remember that all unused connectors on the hub
+ must be properly terminated with 93 Ohm (or something else if you don't
+ have the right ones) terminators.
+ (Avery's note: oops, I didn't know that. Mine (TV cable) works
+ anyway, though.)
+
+3. A card to an active hub. Here is no need to terminate the unused
+ connectors except some kind of aesthetic feeling. But, there may not be
+ more than eleven active hubs between any two computers. That of course
+ doesn't limit the number of active hubs on the network.
+
+4. An active hub to another.
+
+5. An active hub to passive hub.
+
+Remember that you cannot connect two passive hubs together. The power loss
+implied by such a connection is too high for the net to operate reliably.
+
+An example of a typical ARCnet network:
+
+ R S - STAR type card
+ S------H--------A-------S R - Terminator
+ | | H - Hub
+ | | A - Active hub
+ | S----H----S
+ S |
+ |
+ S
+
+The BUS topology is very similar to the one used by Ethernet. The only
+difference is in cable and terminators: they should be 93 Ohm. Ethernet
+uses 50 Ohm impedance. You use T connectors to put the computers on a single
+line of cable, the bus. You have to put terminators at both ends of the
+cable. A typical BUS ARCnet network looks like:
+
+ RT----T------T------T------T------TR
+ B B B B B B
+
+ B - BUS type card
+ R - Terminator
+ T - T connector
+
+But that is not all! The two types can be connected together. According to
+the official documentation the only way of connecting them is using an active
+hub:
+
+ A------T------T------TR
+ | B B B
+ S---H---S
+ |
+ S
+
+The official docs also state that you can use STAR cards at the ends of
+BUS network in place of a BUS card and a terminator:
+
+ S------T------T------S
+ B B
+
+But, according to my own experiments, you can simply hang a BUS type card
+anywhere in middle of a cable in a STAR topology network. And more - you
+can use the bus card in place of any star card if you use a terminator. Then
+you can build very complicated networks fulfilling all your needs! An
+example:
+
+ S
+ |
+ RT------T-------T------H------S
+ B B B |
+ | R
+ S------A------T-------T-------A-------H------TR
+ | B B | | B
+ | S BT |
+ | | | S----A-----S
+ S------H---A----S | |
+ | | S------T----H---S |
+ S S B R S
+
+A basically different cabling scheme is used with Twisted Pair cabling. Each
+of the TP cards has two RJ (phone-cord style) connectors. The cards are
+then daisy-chained together using a cable connecting every two neighboring
+cards. The ends are terminated with RJ 93 Ohm terminators which plug into
+the empty connectors of cards on the ends of the chain. An example:
+
+ ___________ ___________
+ _R_|_ _|_|_ _|_R_
+ | | | | | |
+ |Card | |Card | |Card |
+ |_____| |_____| |_____|
+
+
+There are also hubs for the TP topology. There is nothing difficult
+involved in using them; you just connect a TP chain to a hub on any end or
+even at both. This way you can create almost any network configuration.
+The maximum of 11 hubs between any two computers on the net applies here as
+well. An example:
+
+ RP-------P--------P--------H-----P------P-----PR
+ |
+ RP-----H--------P--------H-----P------PR
+ | |
+ PR PR
+
+ R - RJ Terminator
+ P - TP Card
+ H - TP Hub
+
+Like any network, ARCnet has a limited cable length. These are the maximum
+cable lengths between two active ends (an active end being an active hub or
+a STAR card).
+
+ RG-62 93 Ohm up to 650 m
+ RG-59/U 75 Ohm up to 457 m
+ RG-11/U 75 Ohm up to 533 m
+ IBM Type 1 150 Ohm up to 200 m
+ IBM Type 3 100 Ohm up to 100 m
+
+The maximum length of all cables connected to a passive hub is limited to 65
+meters for RG-62 cabling; less for others. You can see that using passive
+hubs in a large network is a bad idea. The maximum length of a single "BUS
+Trunk" is about 300 meters for RG-62. The maximum distance between the two
+most distant points of the net is limited to 3000 meters. The maximum length
+of a TP cable between two cards/hubs is 650 meters.
+
+
+SETTING THE JUMPERS
+-------------------
+
+All ARCnet cards should have a total of four or five different settings:
+
+ - the I/O address: this is the "port" your ARCnet card is on. Probed
+ values in the Linux ARCnet driver are only from 0x200 through 0x3F0. (If
+ your card has additional ones, which is possible, please tell me.) This
+ should not be the same as any other device on your system. According to
+ a doc I got from Novell, MS Windows prefers values of 0x300 or more,
+ eating net connections on my system (at least) otherwise. My guess is
+ this may be because, if your card is at 0x2E0, probing for a serial port
+ at 0x2E8 will reset the card and probably mess things up royally.
+ - Avery's favourite: 0x300.
+
+ - the IRQ: on 8-bit cards, it might be 2 (9), 3, 4, 5, or 7.
+ on 16-bit cards, it might be 2 (9), 3, 4, 5, 7, or 10-15.
+
+ Make sure this is different from any other card on your system. Note
+ that IRQ2 is the same as IRQ9, as far as Linux is concerned. You can
+ "cat /proc/interrupts" for a somewhat complete list of which ones are in
+ use at any given time. Here is a list of common usages from Vojtech
+ Pavlik <vojtech@suse.cz>:
+ ("Not on bus" means there is no way for a card to generate this
+ interrupt)
+ IRQ 0 - Timer 0 (Not on bus)
+ IRQ 1 - Keyboard (Not on bus)
+ IRQ 2 - IRQ Controller 2 (Not on bus, nor does interrupt the CPU)
+ IRQ 3 - COM2
+ IRQ 4 - COM1
+ IRQ 5 - FREE (LPT2 if you have it; sometimes COM3; maybe PLIP)
+ IRQ 6 - Floppy disk controller
+ IRQ 7 - FREE (LPT1 if you don't use the polling driver; PLIP)
+ IRQ 8 - Realtime Clock Interrupt (Not on bus)
+ IRQ 9 - FREE (VGA vertical sync interrupt if enabled)
+ IRQ 10 - FREE
+ IRQ 11 - FREE
+ IRQ 12 - FREE
+ IRQ 13 - Numeric Coprocessor (Not on bus)
+ IRQ 14 - Fixed Disk Controller
+ IRQ 15 - FREE (Fixed Disk Controller 2 if you have it)
+
+ Note: IRQ 9 is used on some video cards for the "vertical retrace"
+ interrupt. This interrupt would have been handy for things like
+ video games, as it occurs exactly once per screen refresh, but
+ unfortunately IBM cancelled this feature starting with the original
+ VGA and thus many VGA/SVGA cards do not support it. For this
+ reason, no modern software uses this interrupt and it can almost
+ always be safely disabled, if your video card supports it at all.
+
+ If your card for some reason CANNOT disable this IRQ (usually there
+ is a jumper), one solution would be to clip the printed circuit
+ contact on the board: it's the fourth contact from the left on the
+ back side. I take no responsibility if you try this.
+
+ - Avery's favourite: IRQ2 (actually IRQ9). Watch that VGA, though.
+
+ - the memory address: Unlike most cards, ARCnets use "shared memory" for
+ copying buffers around. Make SURE it doesn't conflict with any other
+ used memory in your system!
+ A0000 - VGA graphics memory (ok if you don't have VGA)
+ B0000 - Monochrome text mode
+ C0000 \ One of these is your VGA BIOS - usually C0000.
+ E0000 /
+ F0000 - System BIOS
+
+ Anything less than 0xA0000 is, well, a BAD idea since it isn't above
+ 640k.
+ - Avery's favourite: 0xD0000
+
+ - the station address: Every ARCnet card has its own "unique" network
+ address from 0 to 255. Unlike Ethernet, you can set this address
+ yourself with a jumper or switch (or on some cards, with special
+ software). Since it's only 8 bits, you can only have 254 ARCnet cards
+ on a network. DON'T use 0 or 255, since these are reserved (although
+ neat stuff will probably happen if you DO use them). By the way, if you
+ haven't already guessed, don't set this the same as any other ARCnet on
+ your network!
+ - Avery's favourite: 3 and 4. Not that it matters.
+
+ - There may be ETS1 and ETS2 settings. These may or may not make a
+ difference on your card (many manuals call them "reserved"), but are
+ used to change the delays used when powering up a computer on the
+ network. This is only necessary when wiring VERY long range ARCnet
+ networks, on the order of 4km or so; in any case, the only real
+ requirement here is that all cards on the network with ETS1 and ETS2
+ jumpers have them in the same position. Chris Hindy <chrish@io.org>
+ sent in a chart with actual values for this:
+ ET1 ET2 Response Time Reconfiguration Time
+ --- --- ------------- --------------------
+ open open 74.7us 840us
+ open closed 283.4us 1680us
+ closed open 561.8us 1680us
+ closed closed 1118.6us 1680us
+
+ Make sure you set ETS1 and ETS2 to the SAME VALUE for all cards on your
+ network.
+
+Also, on many cards (not mine, though) there are red and green LED's.
+Vojtech Pavlik <vojtech@suse.cz> tells me this is what they mean:
+ GREEN RED Status
+ ----- --- ------
+ OFF OFF Power off
+ OFF Short flashes Cabling problems (broken cable or not
+ terminated)
+ OFF (short) ON Card init
+ ON ON Normal state - everything OK, nothing
+ happens
+ ON Long flashes Data transfer
+ ON OFF Never happens (maybe when wrong ID)
+
+
+The following is all the specific information people have sent me about
+their own particular ARCnet cards. It is officially a mess, and contains
+huge amounts of duplicated information. I have no time to fix it. If you
+want to, PLEASE DO! Just send me a 'diff -u' of all your changes.
+
+The model # is listed right above specifics for that card, so you should be
+able to use your text viewer's "search" function to find the entry you want.
+If you don't KNOW what kind of card you have, try looking through the
+various diagrams to see if you can tell.
+
+If your model isn't listed and/or has different settings, PLEASE PLEASE
+tell me. I had to figure mine out without the manual, and it WASN'T FUN!
+
+Even if your ARCnet model isn't listed, but has the same jumpers as another
+model that is, please e-mail me to say so.
+
+Cards Listed in this file (in this order, mostly):
+
+ Manufacturer Model # Bits
+ ------------ ------- ----
+ SMC PC100 8
+ SMC PC110 8
+ SMC PC120 8
+ SMC PC130 8
+ SMC PC270E 8
+ SMC PC500 16
+ SMC PC500Longboard 16
+ SMC PC550Longboard 16
+ SMC PC600 16
+ SMC PC710 8
+ SMC? LCS-8830(-T) 8/16
+ Puredata PDI507 8
+ CNet Tech CN120-Series 8
+ CNet Tech CN160-Series 16
+ Lantech? UM9065L chipset 8
+ Acer 5210-003 8
+ Datapoint? LAN-ARC-8 8
+ Topware TA-ARC/10 8
+ Thomas-Conrad 500-6242-0097 REV A 8
+ Waterloo? (C)1985 Waterloo Micro. 8
+ No Name -- 8/16
+ No Name Taiwan R.O.C? 8
+ No Name Model 9058 8
+ Tiara Tiara Lancard? 8
+
+
+** SMC = Standard Microsystems Corp.
+** CNet Tech = CNet Technology, Inc.
+
+
+Unclassified Stuff
+------------------
+ - Please send any other information you can find.
+
+ - And some other stuff (more info is welcome!):
+ From: root@ultraworld.xs4all.nl (Timo Hilbrink)
+ To: apenwarr@foxnet.net (Avery Pennarun)
+ Date: Wed, 26 Oct 1994 02:10:32 +0000 (GMT)
+ Reply-To: timoh@xs4all.nl
+
+ [...parts deleted...]
+
+ About the jumpers: On my PC130 there is one more jumper, located near the
+ cable-connector and it's for changing to star or bus topology;
+ closed: star - open: bus
+ On the PC500 are some more jumper-pins, one block labeled with RX,PDN,TXI
+ and another with ALE,LA17,LA18,LA19 these are undocumented..
+
+ [...more parts deleted...]
+
+ --- CUT ---
+
+
+** Standard Microsystems Corp (SMC) **
+PC100, PC110, PC120, PC130 (8-bit cards)
+PC500, PC600 (16-bit cards)
+---------------------------------
+ - mainly from Avery Pennarun <apenwarr@worldvisions.ca>. Values depicted
+ are from Avery's setup.
+ - special thanks to Timo Hilbrink <timoh@xs4all.nl> for noting that PC120,
+ 130, 500, and 600 all have the same switches as Avery's PC100.
+ PC500/600 have several extra, undocumented pins though. (?)
+ - PC110 settings were verified by Stephen A. Wood <saw@cebaf.gov>
+ - Also, the JP- and S-numbers probably don't match your card exactly. Try
+ to find jumpers/switches with the same number of settings - it's
+ probably more reliable.
+
+
+ JP5 [|] : : : :
+(IRQ Setting) IRQ2 IRQ3 IRQ4 IRQ5 IRQ7
+ Put exactly one jumper on exactly one set of pins.
+
+
+ 1 2 3 4 5 6 7 8 9 10
+ S1 /----------------------------------\
+(I/O and Memory | 1 1 * 0 0 0 0 * 1 1 0 1 |
+ addresses) \----------------------------------/
+ |--| |--------| |--------|
+ (a) (b) (m)
+
+ WARNING. It's very important when setting these which way
+ you're holding the card, and which way you think is '1'!
+
+ If you suspect that your settings are not being made
+ correctly, try reversing the direction or inverting the
+ switch positions.
+
+ a: The first digit of the I/O address.
+ Setting Value
+ ------- -----
+ 00 0
+ 01 1
+ 10 2
+ 11 3
+
+ b: The second digit of the I/O address.
+ Setting Value
+ ------- -----
+ 0000 0
+ 0001 1
+ 0010 2
+ ... ...
+ 1110 E
+ 1111 F
+
+ The I/O address is in the form ab0. For example, if
+ a is 0x2 and b is 0xE, the address will be 0x2E0.
+
+ DO NOT SET THIS LESS THAN 0x200!!!!!
+
+
+ m: The first digit of the memory address.
+ Setting Value
+ ------- -----
+ 0000 0
+ 0001 1
+ 0010 2
+ ... ...
+ 1110 E
+ 1111 F
+
+ The memory address is in the form m0000. For example, if
+ m is D, the address will be 0xD0000.
+
+ DO NOT SET THIS TO C0000, F0000, OR LESS THAN A0000!
+
+ 1 2 3 4 5 6 7 8
+ S2 /--------------------------\
+(Station Address) | 1 1 0 0 0 0 0 0 |
+ \--------------------------/
+
+ Setting Value
+ ------- -----
+ 00000000 00
+ 10000000 01
+ 01000000 02
+ ...
+ 01111111 FE
+ 11111111 FF
+
+ Note that this is binary with the digits reversed!
+
+ DO NOT SET THIS TO 0 OR 255 (0xFF)!
+
+
+*****************************************************************************
+
+** Standard Microsystems Corp (SMC) **
+PC130E/PC270E (8-bit cards)
+---------------------------
+ - from Juergen Seifert <seifert@htwm.de>
+
+
+STANDARD MICROSYSTEMS CORPORATION (SMC) ARCNET(R)-PC130E/PC270E
+===============================================================
+
+This description has been written by Juergen Seifert <seifert@htwm.de>
+using information from the following Original SMC Manual
+
+ "Configuration Guide for
+ ARCNET(R)-PC130E/PC270
+ Network Controller Boards
+ Pub. # 900.044A
+ June, 1989"
+
+ARCNET is a registered trademark of the Datapoint Corporation
+SMC is a registered trademark of the Standard Microsystems Corporation
+
+The PC130E is an enhanced version of the PC130 board, is equipped with a
+standard BNC female connector for connection to RG-62/U coax cable.
+Since this board is designed both for point-to-point connection in star
+networks and for connection to bus networks, it is downwardly compatible
+with all the other standard boards designed for coax networks (that is,
+the PC120, PC110 and PC100 star topology boards and the PC220, PC210 and
+PC200 bus topology boards).
+
+The PC270E is an enhanced version of the PC260 board, is equipped with two
+modular RJ11-type jacks for connection to twisted pair wiring.
+It can be used in a star or a daisy-chained network.
+
+
+ 8 7 6 5 4 3 2 1
+ ________________________________________________________________
+ | | S1 | |
+ | |_________________| |
+ | Offs|Base |I/O Addr |
+ | RAM Addr | ___|
+ | ___ ___ CR3 |___|
+ | | \/ | CR4 |___|
+ | | PROM | ___|
+ | | | N | | 8
+ | | SOCKET | o | | 7
+ | |________| d | | 6
+ | ___________________ e | | 5
+ | | | A | S | 4
+ | |oo| EXT2 | | d | 2 | 3
+ | |oo| EXT1 | SMC | d | | 2
+ | |oo| ROM | 90C63 | r |___| 1
+ | |oo| IRQ7 | | |o| _____|
+ | |oo| IRQ5 | | |o| | J1 |
+ | |oo| IRQ4 | | STAR |_____|
+ | |oo| IRQ3 | | | J2 |
+ | |oo| IRQ2 |___________________| |_____|
+ |___ ______________|
+ | |
+ |_____________________________________________|
+
+Legend:
+
+SMC 90C63 ARCNET Controller / Transceiver /Logic
+S1 1-3: I/O Base Address Select
+ 4-6: Memory Base Address Select
+ 7-8: RAM Offset Select
+S2 1-8: Node ID Select
+EXT Extended Timeout Select
+ROM ROM Enable Select
+STAR Selected - Star Topology (PC130E only)
+ Deselected - Bus Topology (PC130E only)
+CR3/CR4 Diagnostic LEDs
+J1 BNC RG62/U Connector (PC130E only)
+J1 6-position Telephone Jack (PC270E only)
+J2 6-position Telephone Jack (PC270E only)
+
+Setting one of the switches to Off/Open means "1", On/Closed means "0".
+
+
+Setting the Node ID
+-------------------
+
+The eight switches in group S2 are used to set the node ID.
+These switches work in a way similar to the PC100-series cards; see that
+entry for more information.
+
+
+Setting the I/O Base Address
+----------------------------
+
+The first three switches in switch group S1 are used to select one
+of eight possible I/O Base addresses using the following table
+
+
+ Switch | Hex I/O
+ 1 2 3 | Address
+ -------|--------
+ 0 0 0 | 260
+ 0 0 1 | 290
+ 0 1 0 | 2E0 (Manufacturer's default)
+ 0 1 1 | 2F0
+ 1 0 0 | 300
+ 1 0 1 | 350
+ 1 1 0 | 380
+ 1 1 1 | 3E0
+
+
+Setting the Base Memory (RAM) buffer Address
+--------------------------------------------
+
+The memory buffer requires 2K of a 16K block of RAM. The base of this
+16K block can be located in any of eight positions.
+Switches 4-6 of switch group S1 select the Base of the 16K block.
+Within that 16K address space, the buffer may be assigned any one of four
+positions, determined by the offset, switches 7 and 8 of group S1.
+
+ Switch | Hex RAM | Hex ROM
+ 4 5 6 7 8 | Address | Address *)
+ -----------|---------|-----------
+ 0 0 0 0 0 | C0000 | C2000
+ 0 0 0 0 1 | C0800 | C2000
+ 0 0 0 1 0 | C1000 | C2000
+ 0 0 0 1 1 | C1800 | C2000
+ | |
+ 0 0 1 0 0 | C4000 | C6000
+ 0 0 1 0 1 | C4800 | C6000
+ 0 0 1 1 0 | C5000 | C6000
+ 0 0 1 1 1 | C5800 | C6000
+ | |
+ 0 1 0 0 0 | CC000 | CE000
+ 0 1 0 0 1 | CC800 | CE000
+ 0 1 0 1 0 | CD000 | CE000
+ 0 1 0 1 1 | CD800 | CE000
+ | |
+ 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default)
+ 0 1 1 0 1 | D0800 | D2000
+ 0 1 1 1 0 | D1000 | D2000
+ 0 1 1 1 1 | D1800 | D2000
+ | |
+ 1 0 0 0 0 | D4000 | D6000
+ 1 0 0 0 1 | D4800 | D6000
+ 1 0 0 1 0 | D5000 | D6000
+ 1 0 0 1 1 | D5800 | D6000
+ | |
+ 1 0 1 0 0 | D8000 | DA000
+ 1 0 1 0 1 | D8800 | DA000
+ 1 0 1 1 0 | D9000 | DA000
+ 1 0 1 1 1 | D9800 | DA000
+ | |
+ 1 1 0 0 0 | DC000 | DE000
+ 1 1 0 0 1 | DC800 | DE000
+ 1 1 0 1 0 | DD000 | DE000
+ 1 1 0 1 1 | DD800 | DE000
+ | |
+ 1 1 1 0 0 | E0000 | E2000
+ 1 1 1 0 1 | E0800 | E2000
+ 1 1 1 1 0 | E1000 | E2000
+ 1 1 1 1 1 | E1800 | E2000
+
+*) To enable the 8K Boot PROM install the jumper ROM.
+ The default is jumper ROM not installed.
+
+
+Setting the Timeouts and Interrupt
+----------------------------------
+
+The jumpers labeled EXT1 and EXT2 are used to determine the timeout
+parameters. These two jumpers are normally left open.
+
+To select a hardware interrupt level set one (only one!) of the jumpers
+IRQ2, IRQ3, IRQ4, IRQ5, IRQ7. The Manufacturer's default is IRQ2.
+
+
+Configuring the PC130E for Star or Bus Topology
+-----------------------------------------------
+
+The single jumper labeled STAR is used to configure the PC130E board for
+star or bus topology.
+When the jumper is installed, the board may be used in a star network, when
+it is removed, the board can be used in a bus topology.
+
+
+Diagnostic LEDs
+---------------
+
+Two diagnostic LEDs are visible on the rear bracket of the board.
+The green LED monitors the network activity: the red one shows the
+board activity:
+
+ Green | Status Red | Status
+ -------|------------------- ---------|-------------------
+ on | normal activity flash/on | data transfer
+ blink | reconfiguration off | no data transfer;
+ off | defective board or | incorrect memory or
+ | node ID is zero | I/O address
+
+
+*****************************************************************************
+
+** Standard Microsystems Corp (SMC) **
+PC500/PC550 Longboard (16-bit cards)
+-------------------------------------
+ - from Juergen Seifert <seifert@htwm.de>
+
+
+STANDARD MICROSYSTEMS CORPORATION (SMC) ARCNET-PC500/PC550 Long Board
+=====================================================================
+
+Note: There is another Version of the PC500 called Short Version, which
+ is different in hard- and software! The most important differences
+ are:
+ - The long board has no Shared memory.
+ - On the long board the selection of the interrupt is done by binary
+ coded switch, on the short board directly by jumper.
+
+[Avery's note: pay special attention to that: the long board HAS NO SHARED
+MEMORY. This means the current Linux-ARCnet driver can't use these cards.
+I have obtained a PC500Longboard and will be doing some experiments on it in
+the future, but don't hold your breath. Thanks again to Juergen Seifert for
+his advice about this!]
+
+This description has been written by Juergen Seifert <seifert@htwm.de>
+using information from the following Original SMC Manual
+
+ "Configuration Guide for
+ SMC ARCNET-PC500/PC550
+ Series Network Controller Boards
+ Pub. # 900.033 Rev. A
+ November, 1989"
+
+ARCNET is a registered trademark of the Datapoint Corporation
+SMC is a registered trademark of the Standard Microsystems Corporation
+
+The PC500 is equipped with a standard BNC female connector for connection
+to RG-62/U coax cable.
+The board is designed both for point-to-point connection in star networks
+and for connection to bus networks.
+
+The PC550 is equipped with two modular RJ11-type jacks for connection
+to twisted pair wiring.
+It can be used in a star or a daisy-chained (BUS) network.
+
+ 1
+ 0 9 8 7 6 5 4 3 2 1 6 5 4 3 2 1
+ ____________________________________________________________________
+ < | SW1 | | SW2 | |
+ > |_____________________| |_____________| |
+ < IRQ |I/O Addr |
+ > ___|
+ < CR4 |___|
+ > CR3 |___|
+ < ___|
+ > N | | 8
+ < o | | 7
+ > d | S | 6
+ < e | W | 5
+ > A | 3 | 4
+ < d | | 3
+ > d | | 2
+ < r |___| 1
+ > |o| _____|
+ < |o| | J1 |
+ > 3 1 JP6 |_____|
+ < |o|o| JP2 | J2 |
+ > |o|o| |_____|
+ < 4 2__ ______________|
+ > | | |
+ <____| |_____________________________________________|
+
+Legend:
+
+SW1 1-6: I/O Base Address Select
+ 7-10: Interrupt Select
+SW2 1-6: Reserved for Future Use
+SW3 1-8: Node ID Select
+JP2 1-4: Extended Timeout Select
+JP6 Selected - Star Topology (PC500 only)
+ Deselected - Bus Topology (PC500 only)
+CR3 Green Monitors Network Activity
+CR4 Red Monitors Board Activity
+J1 BNC RG62/U Connector (PC500 only)
+J1 6-position Telephone Jack (PC550 only)
+J2 6-position Telephone Jack (PC550 only)
+
+Setting one of the switches to Off/Open means "1", On/Closed means "0".
+
+
+Setting the Node ID
+-------------------
+
+The eight switches in group SW3 are used to set the node ID. Each node
+attached to the network must have an unique node ID which must be
+different from 0.
+Switch 1 serves as the least significant bit (LSB).
+
+The node ID is the sum of the values of all switches set to "1"
+These values are:
+
+ Switch | Value
+ -------|-------
+ 1 | 1
+ 2 | 2
+ 3 | 4
+ 4 | 8
+ 5 | 16
+ 6 | 32
+ 7 | 64
+ 8 | 128
+
+Some Examples:
+
+ Switch | Hex | Decimal
+ 8 7 6 5 4 3 2 1 | Node ID | Node ID
+ ----------------|---------|---------
+ 0 0 0 0 0 0 0 0 | not allowed
+ 0 0 0 0 0 0 0 1 | 1 | 1
+ 0 0 0 0 0 0 1 0 | 2 | 2
+ 0 0 0 0 0 0 1 1 | 3 | 3
+ . . . | |
+ 0 1 0 1 0 1 0 1 | 55 | 85
+ . . . | |
+ 1 0 1 0 1 0 1 0 | AA | 170
+ . . . | |
+ 1 1 1 1 1 1 0 1 | FD | 253
+ 1 1 1 1 1 1 1 0 | FE | 254
+ 1 1 1 1 1 1 1 1 | FF | 255
+
+
+Setting the I/O Base Address
+----------------------------
+
+The first six switches in switch group SW1 are used to select one
+of 32 possible I/O Base addresses using the following table
+
+ Switch | Hex I/O
+ 6 5 4 3 2 1 | Address
+ -------------|--------
+ 0 1 0 0 0 0 | 200
+ 0 1 0 0 0 1 | 210
+ 0 1 0 0 1 0 | 220
+ 0 1 0 0 1 1 | 230
+ 0 1 0 1 0 0 | 240
+ 0 1 0 1 0 1 | 250
+ 0 1 0 1 1 0 | 260
+ 0 1 0 1 1 1 | 270
+ 0 1 1 0 0 0 | 280
+ 0 1 1 0 0 1 | 290
+ 0 1 1 0 1 0 | 2A0
+ 0 1 1 0 1 1 | 2B0
+ 0 1 1 1 0 0 | 2C0
+ 0 1 1 1 0 1 | 2D0
+ 0 1 1 1 1 0 | 2E0 (Manufacturer's default)
+ 0 1 1 1 1 1 | 2F0
+ 1 1 0 0 0 0 | 300
+ 1 1 0 0 0 1 | 310
+ 1 1 0 0 1 0 | 320
+ 1 1 0 0 1 1 | 330
+ 1 1 0 1 0 0 | 340
+ 1 1 0 1 0 1 | 350
+ 1 1 0 1 1 0 | 360
+ 1 1 0 1 1 1 | 370
+ 1 1 1 0 0 0 | 380
+ 1 1 1 0 0 1 | 390
+ 1 1 1 0 1 0 | 3A0
+ 1 1 1 0 1 1 | 3B0
+ 1 1 1 1 0 0 | 3C0
+ 1 1 1 1 0 1 | 3D0
+ 1 1 1 1 1 0 | 3E0
+ 1 1 1 1 1 1 | 3F0
+
+
+Setting the Interrupt
+---------------------
+
+Switches seven through ten of switch group SW1 are used to select the
+interrupt level. The interrupt level is binary coded, so selections
+from 0 to 15 would be possible, but only the following eight values will
+be supported: 3, 4, 5, 7, 9, 10, 11, 12.
+
+ Switch | IRQ
+ 10 9 8 7 |
+ ---------|--------
+ 0 0 1 1 | 3
+ 0 1 0 0 | 4
+ 0 1 0 1 | 5
+ 0 1 1 1 | 7
+ 1 0 0 1 | 9 (=2) (default)
+ 1 0 1 0 | 10
+ 1 0 1 1 | 11
+ 1 1 0 0 | 12
+
+
+Setting the Timeouts
+--------------------
+
+The two jumpers JP2 (1-4) are used to determine the timeout parameters.
+These two jumpers are normally left open.
+Refer to the COM9026 Data Sheet for alternate configurations.
+
+
+Configuring the PC500 for Star or Bus Topology
+----------------------------------------------
+
+The single jumper labeled JP6 is used to configure the PC500 board for
+star or bus topology.
+When the jumper is installed, the board may be used in a star network, when
+it is removed, the board can be used in a bus topology.
+
+
+Diagnostic LEDs
+---------------
+
+Two diagnostic LEDs are visible on the rear bracket of the board.
+The green LED monitors the network activity: the red one shows the
+board activity:
+
+ Green | Status Red | Status
+ -------|------------------- ---------|-------------------
+ on | normal activity flash/on | data transfer
+ blink | reconfiguration off | no data transfer;
+ off | defective board or | incorrect memory or
+ | node ID is zero | I/O address
+
+
+*****************************************************************************
+
+** SMC **
+PC710 (8-bit card)
+------------------
+ - from J.S. van Oosten <jvoosten@compiler.tdcnet.nl>
+
+Note: this data is gathered by experimenting and looking at info of other
+cards. However, I'm sure I got 99% of the settings right.
+
+The SMC710 card resembles the PC270 card, but is much more basic (i.e. no
+LEDs, RJ11 jacks, etc.) and 8 bit. Here's a little drawing:
+
+ _______________________________________
+ | +---------+ +---------+ |____
+ | | S2 | | S1 | |
+ | +---------+ +---------+ |
+ | |
+ | +===+ __ |
+ | | R | | | X-tal ###___
+ | | O | |__| ####__'|
+ | | M | || ###
+ | +===+ |
+ | |
+ | .. JP1 +----------+ |
+ | .. | big chip | |
+ | .. | 90C63 | |
+ | .. | | |
+ | .. +----------+ |
+ ------- -----------
+ |||||||||||||||||||||
+
+The row of jumpers at JP1 actually consists of 8 jumpers, (sometimes
+labelled) the same as on the PC270, from top to bottom: EXT2, EXT1, ROM,
+IRQ7, IRQ5, IRQ4, IRQ3, IRQ2 (gee, wonder what they would do? :-) )
+
+S1 and S2 perform the same function as on the PC270, only their numbers
+are swapped (S1 is the nodeaddress, S2 sets IO- and RAM-address).
+
+I know it works when connected to a PC110 type ARCnet board.
+
+
+*****************************************************************************
+
+** Possibly SMC **
+LCS-8830(-T) (8 and 16-bit cards)
+---------------------------------
+ - from Mathias Katzer <mkatzer@HRZ.Uni-Bielefeld.DE>
+ - Marek Michalkiewicz <marekm@i17linuxb.ists.pwr.wroc.pl> says the
+ LCS-8830 is slightly different from LCS-8830-T. These are 8 bit, BUS
+ only (the JP0 jumper is hardwired), and BNC only.
+
+This is a LCS-8830-T made by SMC, I think ('SMC' only appears on one PLCC,
+nowhere else, not even on the few Xeroxed sheets from the manual).
+
+SMC ARCnet Board Type LCS-8830-T
+
+ ------------------------------------
+ | |
+ | JP3 88 8 JP2 |
+ | ##### | \ |
+ | ##### ET1 ET2 ###|
+ | 8 ###|
+ | U3 SW 1 JP0 ###| Phone Jacks
+ | -- ###|
+ | | | |
+ | | | SW2 |
+ | | | |
+ | | | ##### |
+ | -- ##### #### BNC Connector
+ | ####
+ | 888888 JP1 |
+ | 234567 |
+ -- -------
+ |||||||||||||||||||||||||||
+ --------------------------
+
+
+SW1: DIP-Switches for Station Address
+SW2: DIP-Switches for Memory Base and I/O Base addresses
+
+JP0: If closed, internal termination on (default open)
+JP1: IRQ Jumpers
+JP2: Boot-ROM enabled if closed
+JP3: Jumpers for response timeout
+
+U3: Boot-ROM Socket
+
+
+ET1 ET2 Response Time Idle Time Reconfiguration Time
+
+ 78 86 840
+ X 285 316 1680
+ X 563 624 1680
+ X X 1130 1237 1680
+
+(X means closed jumper)
+
+(DIP-Switch downwards means "0")
+
+The station address is binary-coded with SW1.
+
+The I/O base address is coded with DIP-Switches 6,7 and 8 of SW2:
+
+Switches Base
+678 Address
+000 260-26f
+100 290-29f
+010 2e0-2ef
+110 2f0-2ff
+001 300-30f
+101 350-35f
+011 380-38f
+111 3e0-3ef
+
+
+DIP Switches 1-5 of SW2 encode the RAM and ROM Address Range:
+
+Switches RAM ROM
+12345 Address Range Address Range
+00000 C:0000-C:07ff C:2000-C:3fff
+10000 C:0800-C:0fff
+01000 C:1000-C:17ff
+11000 C:1800-C:1fff
+00100 C:4000-C:47ff C:6000-C:7fff
+10100 C:4800-C:4fff
+01100 C:5000-C:57ff
+11100 C:5800-C:5fff
+00010 C:C000-C:C7ff C:E000-C:ffff
+10010 C:C800-C:Cfff
+01010 C:D000-C:D7ff
+11010 C:D800-C:Dfff
+00110 D:0000-D:07ff D:2000-D:3fff
+10110 D:0800-D:0fff
+01110 D:1000-D:17ff
+11110 D:1800-D:1fff
+00001 D:4000-D:47ff D:6000-D:7fff
+10001 D:4800-D:4fff
+01001 D:5000-D:57ff
+11001 D:5800-D:5fff
+00101 D:8000-D:87ff D:A000-D:bfff
+10101 D:8800-D:8fff
+01101 D:9000-D:97ff
+11101 D:9800-D:9fff
+00011 D:C000-D:c7ff D:E000-D:ffff
+10011 D:C800-D:cfff
+01011 D:D000-D:d7ff
+11011 D:D800-D:dfff
+00111 E:0000-E:07ff E:2000-E:3fff
+10111 E:0800-E:0fff
+01111 E:1000-E:17ff
+11111 E:1800-E:1fff
+
+
+*****************************************************************************
+
+** PureData Corp **
+PDI507 (8-bit card)
+--------------------
+ - from Mark Rejhon <mdrejhon@magi.com> (slight modifications by Avery)
+ - Avery's note: I think PDI508 cards (but definitely NOT PDI508Plus cards)
+ are mostly the same as this. PDI508Plus cards appear to be mainly
+ software-configured.
+
+Jumpers:
+ There is a jumper array at the bottom of the card, near the edge
+ connector. This array is labelled J1. They control the IRQs and
+ something else. Put only one jumper on the IRQ pins.
+
+ ETS1, ETS2 are for timing on very long distance networks. See the
+ more general information near the top of this file.
+
+ There is a J2 jumper on two pins. A jumper should be put on them,
+ since it was already there when I got the card. I don't know what
+ this jumper is for though.
+
+ There is a two-jumper array for J3. I don't know what it is for,
+ but there were already two jumpers on it when I got the card. It's
+ a six pin grid in a two-by-three fashion. The jumpers were
+ configured as follows:
+
+ .-------.
+ o | o o |
+ :-------: ------> Accessible end of card with connectors
+ o | o o | in this direction ------->
+ `-------'
+
+Carl de Billy <CARL@carainfo.com> explains J3 and J4:
+
+ J3 Diagram:
+
+ .-------.
+ o | o o |
+ :-------: TWIST Technology
+ o | o o |
+ `-------'
+ .-------.
+ | o o | o
+ :-------: COAX Technology
+ | o o | o
+ `-------'
+
+ - If using coax cable in a bus topology the J4 jumper must be removed;
+ place it on one pin.
+
+ - If using bus topology with twisted pair wiring move the J3
+ jumpers so they connect the middle pin and the pins closest to the RJ11
+ Connectors. Also the J4 jumper must be removed; place it on one pin of
+ J4 jumper for storage.
+
+ - If using star topology with twisted pair wiring move the J3
+ jumpers so they connect the middle pin and the pins closest to the RJ11
+ connectors.
+
+
+DIP Switches:
+
+ The DIP switches accessible on the accessible end of the card while
+ it is installed, is used to set the ARCnet address. There are 8
+ switches. Use an address from 1 to 254.
+
+ Switch No.
+ 12345678 ARCnet address
+ -----------------------------------------
+ 00000000 FF (Don't use this!)
+ 00000001 FE
+ 00000010 FD
+ ....
+ 11111101 2
+ 11111110 1
+ 11111111 0 (Don't use this!)
+
+ There is another array of eight DIP switches at the top of the
+ card. There are five labelled MS0-MS4 which seem to control the
+ memory address, and another three labelled IO0-IO2 which seem to
+ control the base I/O address of the card.
+
+ This was difficult to test by trial and error, and the I/O addresses
+ are in a weird order. This was tested by setting the DIP switches,
+ rebooting the computer, and attempting to load ARCETHER at various
+ addresses (mostly between 0x200 and 0x400). The address that caused
+ the red transmit LED to blink, is the one that I thought works.
+
+ Also, the address 0x3D0 seem to have a special meaning, since the
+ ARCETHER packet driver loaded fine, but without the red LED
+ blinking. I don't know what 0x3D0 is for though. I recommend using
+ an address of 0x300 since Windows may not like addresses below
+ 0x300.
+
+ IO Switch No.
+ 210 I/O address
+ -------------------------------
+ 111 0x260
+ 110 0x290
+ 101 0x2E0
+ 100 0x2F0
+ 011 0x300
+ 010 0x350
+ 001 0x380
+ 000 0x3E0
+
+ The memory switches set a reserved address space of 0x1000 bytes
+ (0x100 segment units, or 4k). For example if I set an address of
+ 0xD000, it will use up addresses 0xD000 to 0xD100.
+
+ The memory switches were tested by booting using QEMM386 stealth,
+ and using LOADHI to see what address automatically became excluded
+ from the upper memory regions, and then attempting to load ARCETHER
+ using these addresses.
+
+ I recommend using an ARCnet memory address of 0xD000, and putting
+ the EMS page frame at 0xC000 while using QEMM stealth mode. That
+ way, you get contiguous high memory from 0xD100 almost all the way
+ the end of the megabyte.
+
+ Memory Switch 0 (MS0) didn't seem to work properly when set to OFF
+ on my card. It could be malfunctioning on my card. Experiment with
+ it ON first, and if it doesn't work, set it to OFF. (It may be a
+ modifier for the 0x200 bit?)
+
+ MS Switch No.
+ 43210 Memory address
+ --------------------------------
+ 00001 0xE100 (guessed - was not detected by QEMM)
+ 00011 0xE000 (guessed - was not detected by QEMM)
+ 00101 0xDD00
+ 00111 0xDC00
+ 01001 0xD900
+ 01011 0xD800
+ 01101 0xD500
+ 01111 0xD400
+ 10001 0xD100
+ 10011 0xD000
+ 10101 0xCD00
+ 10111 0xCC00
+ 11001 0xC900 (guessed - crashes tested system)
+ 11011 0xC800 (guessed - crashes tested system)
+ 11101 0xC500 (guessed - crashes tested system)
+ 11111 0xC400 (guessed - crashes tested system)
+
+
+*****************************************************************************
+
+** CNet Technology Inc. **
+120 Series (8-bit cards)
+------------------------
+ - from Juergen Seifert <seifert@htwm.de>
+
+
+CNET TECHNOLOGY INC. (CNet) ARCNET 120A SERIES
+==============================================
+
+This description has been written by Juergen Seifert <seifert@htwm.de>
+using information from the following Original CNet Manual
+
+ "ARCNET
+ USER'S MANUAL
+ for
+ CN120A
+ CN120AB
+ CN120TP
+ CN120ST
+ CN120SBT
+ P/N:12-01-0007
+ Revision 3.00"
+
+ARCNET is a registered trademark of the Datapoint Corporation
+
+P/N 120A ARCNET 8 bit XT/AT Star
+P/N 120AB ARCNET 8 bit XT/AT Bus
+P/N 120TP ARCNET 8 bit XT/AT Twisted Pair
+P/N 120ST ARCNET 8 bit XT/AT Star, Twisted Pair
+P/N 120SBT ARCNET 8 bit XT/AT Star, Bus, Twisted Pair
+
+ __________________________________________________________________
+ | |
+ | ___|
+ | LED |___|
+ | ___|
+ | N | | ID7
+ | o | | ID6
+ | d | S | ID5
+ | e | W | ID4
+ | ___________________ A | 2 | ID3
+ | | | d | | ID2
+ | | | 1 2 3 4 5 6 7 8 d | | ID1
+ | | | _________________ r |___| ID0
+ | | 90C65 || SW1 | ____|
+ | JP 8 7 | ||_________________| | |
+ | |o|o| JP1 | | | J2 |
+ | |o|o| |oo| | | JP 1 1 1 | |
+ | ______________ | | 0 1 2 |____|
+ | | PROM | |___________________| |o|o|o| _____|
+ | > SOCKET | JP 6 5 4 3 2 |o|o|o| | J1 |
+ | |______________| |o|o|o|o|o| |o|o|o| |_____|
+ |_____ |o|o|o|o|o| ______________|
+ | |
+ |_____________________________________________|
+
+Legend:
+
+90C65 ARCNET Probe
+S1 1-5: Base Memory Address Select
+ 6-8: Base I/O Address Select
+S2 1-8: Node ID Select (ID0-ID7)
+JP1 ROM Enable Select
+JP2 IRQ2
+JP3 IRQ3
+JP4 IRQ4
+JP5 IRQ5
+JP6 IRQ7
+JP7/JP8 ET1, ET2 Timeout Parameters
+JP10/JP11 Coax / Twisted Pair Select (CN120ST/SBT only)
+JP12 Terminator Select (CN120AB/ST/SBT only)
+J1 BNC RG62/U Connector (all except CN120TP)
+J2 Two 6-position Telephone Jack (CN120TP/ST/SBT only)
+
+Setting one of the switches to Off means "1", On means "0".
+
+
+Setting the Node ID
+-------------------
+
+The eight switches in SW2 are used to set the node ID. Each node attached
+to the network must have an unique node ID which must be different from 0.
+Switch 1 (ID0) serves as the least significant bit (LSB).
+
+The node ID is the sum of the values of all switches set to "1"
+These values are:
+
+ Switch | Label | Value
+ -------|-------|-------
+ 1 | ID0 | 1
+ 2 | ID1 | 2
+ 3 | ID2 | 4
+ 4 | ID3 | 8
+ 5 | ID4 | 16
+ 6 | ID5 | 32
+ 7 | ID6 | 64
+ 8 | ID7 | 128
+
+Some Examples:
+
+ Switch | Hex | Decimal
+ 8 7 6 5 4 3 2 1 | Node ID | Node ID
+ ----------------|---------|---------
+ 0 0 0 0 0 0 0 0 | not allowed
+ 0 0 0 0 0 0 0 1 | 1 | 1
+ 0 0 0 0 0 0 1 0 | 2 | 2
+ 0 0 0 0 0 0 1 1 | 3 | 3
+ . . . | |
+ 0 1 0 1 0 1 0 1 | 55 | 85
+ . . . | |
+ 1 0 1 0 1 0 1 0 | AA | 170
+ . . . | |
+ 1 1 1 1 1 1 0 1 | FD | 253
+ 1 1 1 1 1 1 1 0 | FE | 254
+ 1 1 1 1 1 1 1 1 | FF | 255
+
+
+Setting the I/O Base Address
+----------------------------
+
+The last three switches in switch block SW1 are used to select one
+of eight possible I/O Base addresses using the following table
+
+
+ Switch | Hex I/O
+ 6 7 8 | Address
+ ------------|--------
+ ON ON ON | 260
+ OFF ON ON | 290
+ ON OFF ON | 2E0 (Manufacturer's default)
+ OFF OFF ON | 2F0
+ ON ON OFF | 300
+ OFF ON OFF | 350
+ ON OFF OFF | 380
+ OFF OFF OFF | 3E0
+
+
+Setting the Base Memory (RAM) buffer Address
+--------------------------------------------
+
+The memory buffer (RAM) requires 2K. The base of this buffer can be
+located in any of eight positions. The address of the Boot Prom is
+memory base + 8K or memory base + 0x2000.
+Switches 1-5 of switch block SW1 select the Memory Base address.
+
+ Switch | Hex RAM | Hex ROM
+ 1 2 3 4 5 | Address | Address *)
+ --------------------|---------|-----------
+ ON ON ON ON ON | C0000 | C2000
+ ON ON OFF ON ON | C4000 | C6000
+ ON ON ON OFF ON | CC000 | CE000
+ ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default)
+ ON ON ON ON OFF | D4000 | D6000
+ ON ON OFF ON OFF | D8000 | DA000
+ ON ON ON OFF OFF | DC000 | DE000
+ ON ON OFF OFF OFF | E0000 | E2000
+
+*) To enable the Boot ROM install the jumper JP1
+
+Note: Since the switches 1 and 2 are always set to ON it may be possible
+ that they can be used to add an offset of 2K, 4K or 6K to the base
+ address, but this feature is not documented in the manual and I
+ haven't tested it yet.
+
+
+Setting the Interrupt Line
+--------------------------
+
+To select a hardware interrupt level install one (only one!) of the jumpers
+JP2, JP3, JP4, JP5, JP6. JP2 is the default.
+
+ Jumper | IRQ
+ -------|-----
+ 2 | 2
+ 3 | 3
+ 4 | 4
+ 5 | 5
+ 6 | 7
+
+
+Setting the Internal Terminator on CN120AB/TP/SBT
+--------------------------------------------------
+
+The jumper JP12 is used to enable the internal terminator.
+
+ -----
+ 0 | 0 |
+ ----- ON | | ON
+ | 0 | | 0 |
+ | | OFF ----- OFF
+ | 0 | 0
+ -----
+ Terminator Terminator
+ disabled enabled
+
+
+Selecting the Connector Type on CN120ST/SBT
+-------------------------------------------
+
+ JP10 JP11 JP10 JP11
+ ----- -----
+ 0 0 | 0 | | 0 |
+ ----- ----- | | | |
+ | 0 | | 0 | | 0 | | 0 |
+ | | | | ----- -----
+ | 0 | | 0 | 0 0
+ ----- -----
+ Coaxial Cable Twisted Pair Cable
+ (Default)
+
+
+Setting the Timeout Parameters
+------------------------------
+
+The jumpers labeled EXT1 and EXT2 are used to determine the timeout
+parameters. These two jumpers are normally left open.
+
+
+
+*****************************************************************************
+
+** CNet Technology Inc. **
+160 Series (16-bit cards)
+-------------------------
+ - from Juergen Seifert <seifert@htwm.de>
+
+CNET TECHNOLOGY INC. (CNet) ARCNET 160A SERIES
+==============================================
+
+This description has been written by Juergen Seifert <seifert@htwm.de>
+using information from the following Original CNet Manual
+
+ "ARCNET
+ USER'S MANUAL
+ for
+ CN160A
+ CN160AB
+ CN160TP
+ P/N:12-01-0006
+ Revision 3.00"
+
+ARCNET is a registered trademark of the Datapoint Corporation
+
+P/N 160A ARCNET 16 bit XT/AT Star
+P/N 160AB ARCNET 16 bit XT/AT Bus
+P/N 160TP ARCNET 16 bit XT/AT Twisted Pair
+
+ ___________________________________________________________________
+ < _________________________ ___|
+ > |oo| JP2 | | LED |___|
+ < |oo| JP1 | 9026 | LED |___|
+ > |_________________________| ___|
+ < N | | ID7
+ > 1 o | | ID6
+ < 1 2 3 4 5 6 7 8 9 0 d | S | ID5
+ > _______________ _____________________ e | W | ID4
+ < | PROM | | SW1 | A | 2 | ID3
+ > > SOCKET | |_____________________| d | | ID2
+ < |_______________| | IO-Base | MEM | d | | ID1
+ > r |___| ID0
+ < ____|
+ > | |
+ < | J1 |
+ > | |
+ < |____|
+ > 1 1 1 1 |
+ < 3 4 5 6 7 JP 8 9 0 1 2 3 |
+ > |o|o|o|o|o| |o|o|o|o|o|o| |
+ < |o|o|o|o|o| __ |o|o|o|o|o|o| ___________|
+ > | | |
+ <____________| |_______________________________________|
+
+Legend:
+
+9026 ARCNET Probe
+SW1 1-6: Base I/O Address Select
+ 7-10: Base Memory Address Select
+SW2 1-8: Node ID Select (ID0-ID7)
+JP1/JP2 ET1, ET2 Timeout Parameters
+JP3-JP13 Interrupt Select
+J1 BNC RG62/U Connector (CN160A/AB only)
+J1 Two 6-position Telephone Jack (CN160TP only)
+LED
+
+Setting one of the switches to Off means "1", On means "0".
+
+
+Setting the Node ID
+-------------------
+
+The eight switches in SW2 are used to set the node ID. Each node attached
+to the network must have an unique node ID which must be different from 0.
+Switch 1 (ID0) serves as the least significant bit (LSB).
+
+The node ID is the sum of the values of all switches set to "1"
+These values are:
+
+ Switch | Label | Value
+ -------|-------|-------
+ 1 | ID0 | 1
+ 2 | ID1 | 2
+ 3 | ID2 | 4
+ 4 | ID3 | 8
+ 5 | ID4 | 16
+ 6 | ID5 | 32
+ 7 | ID6 | 64
+ 8 | ID7 | 128
+
+Some Examples:
+
+ Switch | Hex | Decimal
+ 8 7 6 5 4 3 2 1 | Node ID | Node ID
+ ----------------|---------|---------
+ 0 0 0 0 0 0 0 0 | not allowed
+ 0 0 0 0 0 0 0 1 | 1 | 1
+ 0 0 0 0 0 0 1 0 | 2 | 2
+ 0 0 0 0 0 0 1 1 | 3 | 3
+ . . . | |
+ 0 1 0 1 0 1 0 1 | 55 | 85
+ . . . | |
+ 1 0 1 0 1 0 1 0 | AA | 170
+ . . . | |
+ 1 1 1 1 1 1 0 1 | FD | 253
+ 1 1 1 1 1 1 1 0 | FE | 254
+ 1 1 1 1 1 1 1 1 | FF | 255
+
+
+Setting the I/O Base Address
+----------------------------
+
+The first six switches in switch block SW1 are used to select the I/O Base
+address using the following table:
+
+ Switch | Hex I/O
+ 1 2 3 4 5 6 | Address
+ ------------------------|--------
+ OFF ON ON OFF OFF ON | 260
+ OFF ON OFF ON ON OFF | 290
+ OFF ON OFF OFF OFF ON | 2E0 (Manufacturer's default)
+ OFF ON OFF OFF OFF OFF | 2F0
+ OFF OFF ON ON ON ON | 300
+ OFF OFF ON OFF ON OFF | 350
+ OFF OFF OFF ON ON ON | 380
+ OFF OFF OFF OFF OFF ON | 3E0
+
+Note: Other IO-Base addresses seem to be selectable, but only the above
+ combinations are documented.
+
+
+Setting the Base Memory (RAM) buffer Address
+--------------------------------------------
+
+The switches 7-10 of switch block SW1 are used to select the Memory
+Base address of the RAM (2K) and the PROM.
+
+ Switch | Hex RAM | Hex ROM
+ 7 8 9 10 | Address | Address
+ ----------------|---------|-----------
+ OFF OFF ON ON | C0000 | C8000
+ OFF OFF ON OFF | D0000 | D8000 (Default)
+ OFF OFF OFF ON | E0000 | E8000
+
+Note: Other MEM-Base addresses seem to be selectable, but only the above
+ combinations are documented.
+
+
+Setting the Interrupt Line
+--------------------------
+
+To select a hardware interrupt level install one (only one!) of the jumpers
+JP3 through JP13 using the following table:
+
+ Jumper | IRQ
+ -------|-----------------
+ 3 | 14
+ 4 | 15
+ 5 | 12
+ 6 | 11
+ 7 | 10
+ 8 | 3
+ 9 | 4
+ 10 | 5
+ 11 | 6
+ 12 | 7
+ 13 | 2 (=9) Default!
+
+Note: - Do not use JP11=IRQ6, it may conflict with your Floppy Disk
+ Controller
+ - Use JP3=IRQ14 only, if you don't have an IDE-, MFM-, or RLL-
+ Hard Disk, it may conflict with their controllers
+
+
+Setting the Timeout Parameters
+------------------------------
+
+The jumpers labeled JP1 and JP2 are used to determine the timeout
+parameters. These two jumpers are normally left open.
+
+
+*****************************************************************************
+
+** Lantech **
+8-bit card, unknown model
+-------------------------
+ - from Vlad Lungu <vlungu@ugal.ro> - his e-mail address seemed broken at
+ the time I tried to reach him. Sorry Vlad, if you didn't get my reply.
+
+ ________________________________________________________________
+ | 1 8 |
+ | ___________ __|
+ | | SW1 | LED |__|
+ | |__________| |
+ | ___|
+ | _____________________ |S | 8
+ | | | |W |
+ | | | |2 |
+ | | | |__| 1
+ | | UM9065L | |o| JP4 ____|____
+ | | | |o| | CN |
+ | | | |________|
+ | | | |
+ | |___________________| |
+ | |
+ | |
+ | _____________ |
+ | | | |
+ | | PROM | |ooooo| JP6 |
+ | |____________| |ooooo| |
+ |_____________ _ _|
+ |____________________________________________| |__|
+
+
+UM9065L : ARCnet Controller
+
+SW 1 : Shared Memory Address and I/O Base
+
+ ON=0
+
+ 12345|Memory Address
+ -----|--------------
+ 00001| D4000
+ 00010| CC000
+ 00110| D0000
+ 01110| D1000
+ 01101| D9000
+ 10010| CC800
+ 10011| DC800
+ 11110| D1800
+
+It seems that the bits are considered in reverse order. Also, you must
+observe that some of those addresses are unusual and I didn't probe them; I
+used a memory dump in DOS to identify them. For the 00000 configuration and
+some others that I didn't write here the card seems to conflict with the
+video card (an S3 GENDAC). I leave the full decoding of those addresses to
+you.
+
+ 678| I/O Address
+ ---|------------
+ 000| 260
+ 001| failed probe
+ 010| 2E0
+ 011| 380
+ 100| 290
+ 101| 350
+ 110| failed probe
+ 111| 3E0
+
+SW 2 : Node ID (binary coded)
+
+JP 4 : Boot PROM enable CLOSE - enabled
+ OPEN - disabled
+
+JP 6 : IRQ set (ONLY ONE jumper on 1-5 for IRQ 2-6)
+
+
+*****************************************************************************
+
+** Acer **
+8-bit card, Model 5210-003
+--------------------------
+ - from Vojtech Pavlik <vojtech@suse.cz> using portions of the existing
+ arcnet-hardware file.
+
+This is a 90C26 based card. Its configuration seems similar to the SMC
+PC100, but has some additional jumpers I don't know the meaning of.
+
+ __
+ | |
+ ___________|__|_________________________
+ | | | |
+ | | BNC | |
+ | |______| ___|
+ | _____________________ |___
+ | | | |
+ | | Hybrid IC | |
+ | | | o|o J1 |
+ | |_____________________| 8|8 |
+ | 8|8 J5 |
+ | o|o |
+ | 8|8 |
+ |__ 8|8 |
+ (|__| LED o|o |
+ | 8|8 |
+ | 8|8 J15 |
+ | |
+ | _____ |
+ | | | _____ |
+ | | | | | ___|
+ | | | | | |
+ | _____ | ROM | | UFS | |
+ | | | | | | | |
+ | | | ___ | | | | |
+ | | | | | |__.__| |__.__| |
+ | | NCR | |XTL| _____ _____ |
+ | | | |___| | | | | |
+ | |90C26| | | | | |
+ | | | | RAM | | UFS | |
+ | | | J17 o|o | | | | |
+ | | | J16 o|o | | | | |
+ | |__.__| |__.__| |__.__| |
+ | ___ |
+ | | |8 |
+ | |SW2| |
+ | | | |
+ | |___|1 |
+ | ___ |
+ | | |10 J18 o|o |
+ | | | o|o |
+ | |SW1| o|o |
+ | | | J21 o|o |
+ | |___|1 |
+ | |
+ |____________________________________|
+
+
+Legend:
+
+90C26 ARCNET Chip
+XTL 20 MHz Crystal
+SW1 1-6 Base I/O Address Select
+ 7-10 Memory Address Select
+SW2 1-8 Node ID Select (ID0-ID7)
+J1-J5 IRQ Select
+J6-J21 Unknown (Probably extra timeouts & ROM enable ...)
+LED1 Activity LED
+BNC Coax connector (STAR ARCnet)
+RAM 2k of SRAM
+ROM Boot ROM socket
+UFS Unidentified Flying Sockets
+
+
+Setting the Node ID
+-------------------
+
+The eight switches in SW2 are used to set the node ID. Each node attached
+to the network must have an unique node ID which must not be 0.
+Switch 1 (ID0) serves as the least significant bit (LSB).
+
+Setting one of the switches to OFF means "1", ON means "0".
+
+The node ID is the sum of the values of all switches set to "1"
+These values are:
+
+ Switch | Value
+ -------|-------
+ 1 | 1
+ 2 | 2
+ 3 | 4
+ 4 | 8
+ 5 | 16
+ 6 | 32
+ 7 | 64
+ 8 | 128
+
+Don't set this to 0 or 255; these values are reserved.
+
+
+Setting the I/O Base Address
+----------------------------
+
+The switches 1 to 6 of switch block SW1 are used to select one
+of 32 possible I/O Base addresses using the following tables
+
+ | Hex
+ Switch | Value
+ -------|-------
+ 1 | 200
+ 2 | 100
+ 3 | 80
+ 4 | 40
+ 5 | 20
+ 6 | 10
+
+The I/O address is sum of all switches set to "1". Remember that
+the I/O address space bellow 0x200 is RESERVED for mainboard, so
+switch 1 should be ALWAYS SET TO OFF.
+
+
+Setting the Base Memory (RAM) buffer Address
+--------------------------------------------
+
+The memory buffer (RAM) requires 2K. The base of this buffer can be
+located in any of sixteen positions. However, the addresses below
+A0000 are likely to cause system hang because there's main RAM.
+
+Jumpers 7-10 of switch block SW1 select the Memory Base address.
+
+ Switch | Hex RAM
+ 7 8 9 10 | Address
+ ----------------|---------
+ OFF OFF OFF OFF | F0000 (conflicts with main BIOS)
+ OFF OFF OFF ON | E0000
+ OFF OFF ON OFF | D0000
+ OFF OFF ON ON | C0000 (conflicts with video BIOS)
+ OFF ON OFF OFF | B0000 (conflicts with mono video)
+ OFF ON OFF ON | A0000 (conflicts with graphics)
+
+
+Setting the Interrupt Line
+--------------------------
+
+Jumpers 1-5 of the jumper block J1 control the IRQ level. ON means
+shorted, OFF means open.
+
+ Jumper | IRQ
+ 1 2 3 4 5 |
+ ----------------------------
+ ON OFF OFF OFF OFF | 7
+ OFF ON OFF OFF OFF | 5
+ OFF OFF ON OFF OFF | 4
+ OFF OFF OFF ON OFF | 3
+ OFF OFF OFF OFF ON | 2
+
+
+Unknown jumpers & sockets
+-------------------------
+
+I know nothing about these. I just guess that J16&J17 are timeout
+jumpers and maybe one of J18-J21 selects ROM. Also J6-J10 and
+J11-J15 are connecting IRQ2-7 to some pins on the UFSs. I can't
+guess the purpose.
+
+
+*****************************************************************************
+
+** Datapoint? **
+LAN-ARC-8, an 8-bit card
+------------------------
+ - from Vojtech Pavlik <vojtech@suse.cz>
+
+This is another SMC 90C65-based ARCnet card. I couldn't identify the
+manufacturer, but it might be DataPoint, because the card has the
+original arcNet logo in its upper right corner.
+
+ _______________________________________________________
+ | _________ |
+ | | SW2 | ON arcNet |
+ | |_________| OFF ___|
+ | _____________ 1 ______ 8 | | 8
+ | | | SW1 | XTAL | ____________ | S |
+ | > RAM (2k) | |______|| | | W |
+ | |_____________| | H | | 3 |
+ | _________|_____ y | |___| 1
+ | _________ | | |b | |
+ | |_________| | | |r | |
+ | | SMC | |i | |
+ | | 90C65| |d | |
+ | _________ | | | | |
+ | | SW1 | ON | | |I | |
+ | |_________| OFF |_________|_____/C | _____|
+ | 1 8 | | | |___
+ | ______________ | | | BNC |___|
+ | | | |____________| |_____|
+ | > EPROM SOCKET | _____________ |
+ | |______________| |_____________| |
+ | ______________|
+ | |
+ |________________________________________|
+
+Legend:
+
+90C65 ARCNET Chip
+SW1 1-5: Base Memory Address Select
+ 6-8: Base I/O Address Select
+SW2 1-8: Node ID Select
+SW3 1-5: IRQ Select
+ 6-7: Extra Timeout
+ 8 : ROM Enable
+BNC Coax connector
+XTAL 20 MHz Crystal
+
+
+Setting the Node ID
+-------------------
+
+The eight switches in SW3 are used to set the node ID. Each node attached
+to the network must have an unique node ID which must not be 0.
+Switch 1 serves as the least significant bit (LSB).
+
+Setting one of the switches to Off means "1", On means "0".
+
+The node ID is the sum of the values of all switches set to "1"
+These values are:
+
+ Switch | Value
+ -------|-------
+ 1 | 1
+ 2 | 2
+ 3 | 4
+ 4 | 8
+ 5 | 16
+ 6 | 32
+ 7 | 64
+ 8 | 128
+
+
+Setting the I/O Base Address
+----------------------------
+
+The last three switches in switch block SW1 are used to select one
+of eight possible I/O Base addresses using the following table
+
+
+ Switch | Hex I/O
+ 6 7 8 | Address
+ ------------|--------
+ ON ON ON | 260
+ OFF ON ON | 290
+ ON OFF ON | 2E0 (Manufacturer's default)
+ OFF OFF ON | 2F0
+ ON ON OFF | 300
+ OFF ON OFF | 350
+ ON OFF OFF | 380
+ OFF OFF OFF | 3E0
+
+
+Setting the Base Memory (RAM) buffer Address
+--------------------------------------------
+
+The memory buffer (RAM) requires 2K. The base of this buffer can be
+located in any of eight positions. The address of the Boot Prom is
+memory base + 0x2000.
+Jumpers 3-5 of switch block SW1 select the Memory Base address.
+
+ Switch | Hex RAM | Hex ROM
+ 1 2 3 4 5 | Address | Address *)
+ --------------------|---------|-----------
+ ON ON ON ON ON | C0000 | C2000
+ ON ON OFF ON ON | C4000 | C6000
+ ON ON ON OFF ON | CC000 | CE000
+ ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default)
+ ON ON ON ON OFF | D4000 | D6000
+ ON ON OFF ON OFF | D8000 | DA000
+ ON ON ON OFF OFF | DC000 | DE000
+ ON ON OFF OFF OFF | E0000 | E2000
+
+*) To enable the Boot ROM set the switch 8 of switch block SW3 to position ON.
+
+The switches 1 and 2 probably add 0x0800 and 0x1000 to RAM base address.
+
+
+Setting the Interrupt Line
+--------------------------
+
+Switches 1-5 of the switch block SW3 control the IRQ level.
+
+ Jumper | IRQ
+ 1 2 3 4 5 |
+ ----------------------------
+ ON OFF OFF OFF OFF | 3
+ OFF ON OFF OFF OFF | 4
+ OFF OFF ON OFF OFF | 5
+ OFF OFF OFF ON OFF | 7
+ OFF OFF OFF OFF ON | 2
+
+
+Setting the Timeout Parameters
+------------------------------
+
+The switches 6-7 of the switch block SW3 are used to determine the timeout
+parameters. These two switches are normally left in the OFF position.
+
+
+*****************************************************************************
+
+** Topware **
+8-bit card, TA-ARC/10
+-------------------------
+ - from Vojtech Pavlik <vojtech@suse.cz>
+
+This is another very similar 90C65 card. Most of the switches and jumpers
+are the same as on other clones.
+
+ _____________________________________________________________________
+| ___________ | | ______ |
+| |SW2 NODE ID| | | | XTAL | |
+| |___________| | Hybrid IC | |______| |
+| ___________ | | __|
+| |SW1 MEM+I/O| |_________________________| LED1|__|)
+| |___________| 1 2 |
+| J3 |o|o| TIMEOUT ______|
+| ______________ |o|o| | |
+| | | ___________________ | RJ |
+| > EPROM SOCKET | | \ |------|
+|J2 |______________| | | | |
+||o| | | |______|
+||o| ROM ENABLE | SMC | _________ |
+| _____________ | 90C65 | |_________| _____|
+| | | | | | |___
+| > RAM (2k) | | | | BNC |___|
+| |_____________| | | |_____|
+| |____________________| |
+| ________ IRQ 2 3 4 5 7 ___________ |
+||________| |o|o|o|o|o| |___________| |
+|________ J1|o|o|o|o|o| ______________|
+ | |
+ |_____________________________________________|
+
+Legend:
+
+90C65 ARCNET Chip
+XTAL 20 MHz Crystal
+SW1 1-5 Base Memory Address Select
+ 6-8 Base I/O Address Select
+SW2 1-8 Node ID Select (ID0-ID7)
+J1 IRQ Select
+J2 ROM Enable
+J3 Extra Timeout
+LED1 Activity LED
+BNC Coax connector (BUS ARCnet)
+RJ Twisted Pair Connector (daisy chain)
+
+
+Setting the Node ID
+-------------------
+
+The eight switches in SW2 are used to set the node ID. Each node attached to
+the network must have an unique node ID which must not be 0. Switch 1 (ID0)
+serves as the least significant bit (LSB).
+
+Setting one of the switches to Off means "1", On means "0".
+
+The node ID is the sum of the values of all switches set to "1"
+These values are:
+
+ Switch | Label | Value
+ -------|-------|-------
+ 1 | ID0 | 1
+ 2 | ID1 | 2
+ 3 | ID2 | 4
+ 4 | ID3 | 8
+ 5 | ID4 | 16
+ 6 | ID5 | 32
+ 7 | ID6 | 64
+ 8 | ID7 | 128
+
+Setting the I/O Base Address
+----------------------------
+
+The last three switches in switch block SW1 are used to select one
+of eight possible I/O Base addresses using the following table:
+
+
+ Switch | Hex I/O
+ 6 7 8 | Address
+ ------------|--------
+ ON ON ON | 260 (Manufacturer's default)
+ OFF ON ON | 290
+ ON OFF ON | 2E0
+ OFF OFF ON | 2F0
+ ON ON OFF | 300
+ OFF ON OFF | 350
+ ON OFF OFF | 380
+ OFF OFF OFF | 3E0
+
+
+Setting the Base Memory (RAM) buffer Address
+--------------------------------------------
+
+The memory buffer (RAM) requires 2K. The base of this buffer can be
+located in any of eight positions. The address of the Boot Prom is
+memory base + 0x2000.
+Jumpers 3-5 of switch block SW1 select the Memory Base address.
+
+ Switch | Hex RAM | Hex ROM
+ 1 2 3 4 5 | Address | Address *)
+ --------------------|---------|-----------
+ ON ON ON ON ON | C0000 | C2000
+ ON ON OFF ON ON | C4000 | C6000 (Manufacturer's default)
+ ON ON ON OFF ON | CC000 | CE000
+ ON ON OFF OFF ON | D0000 | D2000
+ ON ON ON ON OFF | D4000 | D6000
+ ON ON OFF ON OFF | D8000 | DA000
+ ON ON ON OFF OFF | DC000 | DE000
+ ON ON OFF OFF OFF | E0000 | E2000
+
+*) To enable the Boot ROM short the jumper J2.
+
+The jumpers 1 and 2 probably add 0x0800 and 0x1000 to RAM address.
+
+
+Setting the Interrupt Line
+--------------------------
+
+Jumpers 1-5 of the jumper block J1 control the IRQ level. ON means
+shorted, OFF means open.
+
+ Jumper | IRQ
+ 1 2 3 4 5 |
+ ----------------------------
+ ON OFF OFF OFF OFF | 2
+ OFF ON OFF OFF OFF | 3
+ OFF OFF ON OFF OFF | 4
+ OFF OFF OFF ON OFF | 5
+ OFF OFF OFF OFF ON | 7
+
+
+Setting the Timeout Parameters
+------------------------------
+
+The jumpers J3 are used to set the timeout parameters. These two
+jumpers are normally left open.
+
+
+*****************************************************************************
+
+** Thomas-Conrad **
+Model #500-6242-0097 REV A (8-bit card)
+---------------------------------------
+ - from Lars Karlsson <100617.3473@compuserve.com>
+
+ ________________________________________________________
+ | ________ ________ |_____
+ | |........| |........| |
+ | |________| |________| ___|
+ | SW 3 SW 1 | |
+ | Base I/O Base Addr. Station | |
+ | address | |
+ | ______ switch | |
+ | | | | |
+ | | | |___|
+ | | | ______ |___._
+ | |______| |______| ____| BNC
+ | Jumper- _____| Connector
+ | Main chip block _ __| '
+ | | | | RJ Connector
+ | |_| | with 110 Ohm
+ | |__ Terminator
+ | ___________ __|
+ | |...........| | RJ-jack
+ | |...........| _____ | (unused)
+ | |___________| |_____| |__
+ | Boot PROM socket IRQ-jumpers |_ Diagnostic
+ |________ __ _| LED (red)
+ | | | | | | | | | | | | | | | | | | | | | |
+ | | | | | | | | | | | | | | | | | | | | |________|
+ |
+ |
+
+And here are the settings for some of the switches and jumpers on the cards.
+
+
+ I/O
+
+ 1 2 3 4 5 6 7 8
+
+2E0----- 0 0 0 1 0 0 0 1
+2F0----- 0 0 0 1 0 0 0 0
+300----- 0 0 0 0 1 1 1 1
+350----- 0 0 0 0 1 1 1 0
+
+"0" in the above example means switch is off "1" means that it is on.
+
+
+ ShMem address.
+
+ 1 2 3 4 5 6 7 8
+
+CX00--0 0 1 1 | | |
+DX00--0 0 1 0 |
+X000--------- 1 1 |
+X400--------- 1 0 |
+X800--------- 0 1 |
+XC00--------- 0 0
+ENHANCED----------- 1
+COMPATIBLE--------- 0
+
+
+ IRQ
+
+
+ 3 4 5 7 2
+ . . . . .
+ . . . . .
+
+
+There is a DIP-switch with 8 switches, used to set the shared memory address
+to be used. The first 6 switches set the address, the 7th doesn't have any
+function, and the 8th switch is used to select "compatible" or "enhanced".
+When I got my two cards, one of them had this switch set to "enhanced". That
+card didn't work at all, it wasn't even recognized by the driver. The other
+card had this switch set to "compatible" and it behaved absolutely normally. I
+guess that the switch on one of the cards, must have been changed accidentally
+when the card was taken out of its former host. The question remains
+unanswered, what is the purpose of the "enhanced" position?
+
+[Avery's note: "enhanced" probably either disables shared memory (use IO
+ports instead) or disables IO ports (use memory addresses instead). This
+varies by the type of card involved. I fail to see how either of these
+enhance anything. Send me more detailed information about this mode, or
+just use "compatible" mode instead.]
+
+
+*****************************************************************************
+
+** Waterloo Microsystems Inc. ?? **
+8-bit card (C) 1985
+-------------------
+ - from Robert Michael Best <rmb117@cs.usask.ca>
+
+[Avery's note: these don't work with my driver for some reason. These cards
+SEEM to have settings similar to the PDI508Plus, which is
+software-configured and doesn't work with my driver either. The "Waterloo
+chip" is a boot PROM, probably designed specifically for the University of
+Waterloo. If you have any further information about this card, please
+e-mail me.]
+
+The probe has not been able to detect the card on any of the J2 settings,
+and I tried them again with the "Waterloo" chip removed.
+
+ _____________________________________________________________________
+| \/ \/ ___ __ __ |
+| C4 C4 |^| | M || ^ ||^| |
+| -- -- |_| | 5 || || | C3 |
+| \/ \/ C10 |___|| ||_| |
+| C4 C4 _ _ | | ?? |
+| -- -- | \/ || | |
+| | || | |
+| | || C1 | |
+| | || | \/ _____|
+| | C6 || | C9 | |___
+| | || | -- | BNC |___|
+| | || | >C7| |_____|
+| | || | |
+| __ __ |____||_____| 1 2 3 6 |
+|| ^ | >C4| |o|o|o|o|o|o| J2 >C4| |
+|| | |o|o|o|o|o|o| |
+|| C2 | >C4| >C4| |
+|| | >C8| |
+|| | 2 3 4 5 6 7 IRQ >C4| |
+||_____| |o|o|o|o|o|o| J3 |
+|_______ |o|o|o|o|o|o| _______________|
+ | |
+ |_____________________________________________|
+
+C1 -- "COM9026
+ SMC 8638"
+ In a chip socket.
+
+C2 -- "@Copyright
+ Waterloo Microsystems Inc.
+ 1985"
+ In a chip Socket with info printed on a label covering a round window
+ showing the circuit inside. (The window indicates it is an EPROM chip.)
+
+C3 -- "COM9032
+ SMC 8643"
+ In a chip socket.
+
+C4 -- "74LS"
+ 9 total no sockets.
+
+M5 -- "50006-136
+ 20.000000 MHZ
+ MTQ-T1-S3
+ 0 M-TRON 86-40"
+ Metallic case with 4 pins, no socket.
+
+C6 -- "MOSTEK@TC8643
+ MK6116N-20
+ MALAYSIA"
+ No socket.
+
+C7 -- No stamp or label but in a 20 pin chip socket.
+
+C8 -- "PAL10L8CN
+ 8623"
+ In a 20 pin socket.
+
+C9 -- "PAl16R4A-2CN
+ 8641"
+ In a 20 pin socket.
+
+C10 -- "M8640
+ NMC
+ 9306N"
+ In an 8 pin socket.
+
+?? -- Some components on a smaller board and attached with 20 pins all
+ along the side closest to the BNC connector. The are coated in a dark
+ resin.
+
+On the board there are two jumper banks labeled J2 and J3. The
+manufacturer didn't put a J1 on the board. The two boards I have both
+came with a jumper box for each bank.
+
+J2 -- Numbered 1 2 3 4 5 6.
+ 4 and 5 are not stamped due to solder points.
+
+J3 -- IRQ 2 3 4 5 6 7
+
+The board itself has a maple leaf stamped just above the irq jumpers
+and "-2 46-86" beside C2. Between C1 and C6 "ASS 'Y 300163" and "@1986
+CORMAN CUSTOM ELECTRONICS CORP." stamped just below the BNC connector.
+Below that "MADE IN CANADA"
+
+
+*****************************************************************************
+
+** No Name **
+8-bit cards, 16-bit cards
+-------------------------
+ - from Juergen Seifert <seifert@htwm.de>
+
+NONAME 8-BIT ARCNET
+===================
+
+I have named this ARCnet card "NONAME", since there is no name of any
+manufacturer on the Installation manual nor on the shipping box. The only
+hint to the existence of a manufacturer at all is written in copper,
+it is "Made in Taiwan"
+
+This description has been written by Juergen Seifert <seifert@htwm.de>
+using information from the Original
+ "ARCnet Installation Manual"
+
+
+ ________________________________________________________________
+ | |STAR| BUS| T/P| |
+ | |____|____|____| |
+ | _____________________ |
+ | | | |
+ | | | |
+ | | | |
+ | | SMC | |
+ | | | |
+ | | COM90C65 | |
+ | | | |
+ | | | |
+ | |__________-__________| |
+ | _____|
+ | _______________ | CN |
+ | | PROM | |_____|
+ | > SOCKET | |
+ | |_______________| 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 |
+ | _______________ _______________ |
+ | |o|o|o|o|o|o|o|o| | SW1 || SW2 ||
+ | |o|o|o|o|o|o|o|o| |_______________||_______________||
+ |___ 2 3 4 5 7 E E R Node ID IOB__|__MEM____|
+ | \ IRQ / T T O |
+ |__________________1_2_M______________________|
+
+Legend:
+
+COM90C65: ARCnet Probe
+S1 1-8: Node ID Select
+S2 1-3: I/O Base Address Select
+ 4-6: Memory Base Address Select
+ 7-8: RAM Offset Select
+ET1, ET2 Extended Timeout Select
+ROM ROM Enable Select
+CN RG62 Coax Connector
+STAR| BUS | T/P Three fields for placing a sign (colored circle)
+ indicating the topology of the card
+
+Setting one of the switches to Off means "1", On means "0".
+
+
+Setting the Node ID
+-------------------
+
+The eight switches in group SW1 are used to set the node ID.
+Each node attached to the network must have an unique node ID which
+must be different from 0.
+Switch 8 serves as the least significant bit (LSB).
+
+The node ID is the sum of the values of all switches set to "1"
+These values are:
+
+ Switch | Value
+ -------|-------
+ 8 | 1
+ 7 | 2
+ 6 | 4
+ 5 | 8
+ 4 | 16
+ 3 | 32
+ 2 | 64
+ 1 | 128
+
+Some Examples:
+
+ Switch | Hex | Decimal
+ 1 2 3 4 5 6 7 8 | Node ID | Node ID
+ ----------------|---------|---------
+ 0 0 0 0 0 0 0 0 | not allowed
+ 0 0 0 0 0 0 0 1 | 1 | 1
+ 0 0 0 0 0 0 1 0 | 2 | 2
+ 0 0 0 0 0 0 1 1 | 3 | 3
+ . . . | |
+ 0 1 0 1 0 1 0 1 | 55 | 85
+ . . . | |
+ 1 0 1 0 1 0 1 0 | AA | 170
+ . . . | |
+ 1 1 1 1 1 1 0 1 | FD | 253
+ 1 1 1 1 1 1 1 0 | FE | 254
+ 1 1 1 1 1 1 1 1 | FF | 255
+
+
+Setting the I/O Base Address
+----------------------------
+
+The first three switches in switch group SW2 are used to select one
+of eight possible I/O Base addresses using the following table
+
+ Switch | Hex I/O
+ 1 2 3 | Address
+ ------------|--------
+ ON ON ON | 260
+ ON ON OFF | 290
+ ON OFF ON | 2E0 (Manufacturer's default)
+ ON OFF OFF | 2F0
+ OFF ON ON | 300
+ OFF ON OFF | 350
+ OFF OFF ON | 380
+ OFF OFF OFF | 3E0
+
+
+Setting the Base Memory (RAM) buffer Address
+--------------------------------------------
+
+The memory buffer requires 2K of a 16K block of RAM. The base of this
+16K block can be located in any of eight positions.
+Switches 4-6 of switch group SW2 select the Base of the 16K block.
+Within that 16K address space, the buffer may be assigned any one of four
+positions, determined by the offset, switches 7 and 8 of group SW2.
+
+ Switch | Hex RAM | Hex ROM
+ 4 5 6 7 8 | Address | Address *)
+ -----------|---------|-----------
+ 0 0 0 0 0 | C0000 | C2000
+ 0 0 0 0 1 | C0800 | C2000
+ 0 0 0 1 0 | C1000 | C2000
+ 0 0 0 1 1 | C1800 | C2000
+ | |
+ 0 0 1 0 0 | C4000 | C6000
+ 0 0 1 0 1 | C4800 | C6000
+ 0 0 1 1 0 | C5000 | C6000
+ 0 0 1 1 1 | C5800 | C6000
+ | |
+ 0 1 0 0 0 | CC000 | CE000
+ 0 1 0 0 1 | CC800 | CE000
+ 0 1 0 1 0 | CD000 | CE000
+ 0 1 0 1 1 | CD800 | CE000
+ | |
+ 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default)
+ 0 1 1 0 1 | D0800 | D2000
+ 0 1 1 1 0 | D1000 | D2000
+ 0 1 1 1 1 | D1800 | D2000
+ | |
+ 1 0 0 0 0 | D4000 | D6000
+ 1 0 0 0 1 | D4800 | D6000
+ 1 0 0 1 0 | D5000 | D6000
+ 1 0 0 1 1 | D5800 | D6000
+ | |
+ 1 0 1 0 0 | D8000 | DA000
+ 1 0 1 0 1 | D8800 | DA000
+ 1 0 1 1 0 | D9000 | DA000
+ 1 0 1 1 1 | D9800 | DA000
+ | |
+ 1 1 0 0 0 | DC000 | DE000
+ 1 1 0 0 1 | DC800 | DE000
+ 1 1 0 1 0 | DD000 | DE000
+ 1 1 0 1 1 | DD800 | DE000
+ | |
+ 1 1 1 0 0 | E0000 | E2000
+ 1 1 1 0 1 | E0800 | E2000
+ 1 1 1 1 0 | E1000 | E2000
+ 1 1 1 1 1 | E1800 | E2000
+
+*) To enable the 8K Boot PROM install the jumper ROM.
+ The default is jumper ROM not installed.
+
+
+Setting Interrupt Request Lines (IRQ)
+-------------------------------------
+
+To select a hardware interrupt level set one (only one!) of the jumpers
+IRQ2, IRQ3, IRQ4, IRQ5 or IRQ7. The manufacturer's default is IRQ2.
+
+
+Setting the Timeouts
+--------------------
+
+The two jumpers labeled ET1 and ET2 are used to determine the timeout
+parameters (response and reconfiguration time). Every node in a network
+must be set to the same timeout values.
+
+ ET1 ET2 | Response Time (us) | Reconfiguration Time (ms)
+ --------|--------------------|--------------------------
+ Off Off | 78 | 840 (Default)
+ Off On | 285 | 1680
+ On Off | 563 | 1680
+ On On | 1130 | 1680
+
+On means jumper installed, Off means jumper not installed
+
+
+NONAME 16-BIT ARCNET
+====================
+
+The manual of my 8-Bit NONAME ARCnet Card contains another description
+of a 16-Bit Coax / Twisted Pair Card. This description is incomplete,
+because there are missing two pages in the manual booklet. (The table
+of contents reports pages ... 2-9, 2-11, 2-12, 3-1, ... but inside
+the booklet there is a different way of counting ... 2-9, 2-10, A-1,
+(empty page), 3-1, ..., 3-18, A-1 (again), A-2)
+Also the picture of the board layout is not as good as the picture of
+8-Bit card, because there isn't any letter like "SW1" written to the
+picture.
+Should somebody have such a board, please feel free to complete this
+description or to send a mail to me!
+
+This description has been written by Juergen Seifert <seifert@htwm.de>
+using information from the Original
+ "ARCnet Installation Manual"
+
+
+ ___________________________________________________________________
+ < _________________ _________________ |
+ > | SW? || SW? | |
+ < |_________________||_________________| |
+ > ____________________ |
+ < | | |
+ > | | |
+ < | | |
+ > | | |
+ < | | |
+ > | | |
+ < | | |
+ > |____________________| |
+ < ____|
+ > ____________________ | |
+ < | | | J1 |
+ > | < | |
+ < |____________________| ? ? ? ? ? ? |____|
+ > |o|o|o|o|o|o| |
+ < |o|o|o|o|o|o| |
+ > |
+ < __ ___________|
+ > | | |
+ <____________| |_______________________________________|
+
+
+Setting one of the switches to Off means "1", On means "0".
+
+
+Setting the Node ID
+-------------------
+
+The eight switches in group SW2 are used to set the node ID.
+Each node attached to the network must have an unique node ID which
+must be different from 0.
+Switch 8 serves as the least significant bit (LSB).
+
+The node ID is the sum of the values of all switches set to "1"
+These values are:
+
+ Switch | Value
+ -------|-------
+ 8 | 1
+ 7 | 2
+ 6 | 4
+ 5 | 8
+ 4 | 16
+ 3 | 32
+ 2 | 64
+ 1 | 128
+
+Some Examples:
+
+ Switch | Hex | Decimal
+ 1 2 3 4 5 6 7 8 | Node ID | Node ID
+ ----------------|---------|---------
+ 0 0 0 0 0 0 0 0 | not allowed
+ 0 0 0 0 0 0 0 1 | 1 | 1
+ 0 0 0 0 0 0 1 0 | 2 | 2
+ 0 0 0 0 0 0 1 1 | 3 | 3
+ . . . | |
+ 0 1 0 1 0 1 0 1 | 55 | 85
+ . . . | |
+ 1 0 1 0 1 0 1 0 | AA | 170
+ . . . | |
+ 1 1 1 1 1 1 0 1 | FD | 253
+ 1 1 1 1 1 1 1 0 | FE | 254
+ 1 1 1 1 1 1 1 1 | FF | 255
+
+
+Setting the I/O Base Address
+----------------------------
+
+The first three switches in switch group SW1 are used to select one
+of eight possible I/O Base addresses using the following table
+
+ Switch | Hex I/O
+ 3 2 1 | Address
+ ------------|--------
+ ON ON ON | 260
+ ON ON OFF | 290
+ ON OFF ON | 2E0 (Manufacturer's default)
+ ON OFF OFF | 2F0
+ OFF ON ON | 300
+ OFF ON OFF | 350
+ OFF OFF ON | 380
+ OFF OFF OFF | 3E0
+
+
+Setting the Base Memory (RAM) buffer Address
+--------------------------------------------
+
+The memory buffer requires 2K of a 16K block of RAM. The base of this
+16K block can be located in any of eight positions.
+Switches 6-8 of switch group SW1 select the Base of the 16K block.
+Within that 16K address space, the buffer may be assigned any one of four
+positions, determined by the offset, switches 4 and 5 of group SW1.
+
+ Switch | Hex RAM | Hex ROM
+ 8 7 6 5 4 | Address | Address
+ -----------|---------|-----------
+ 0 0 0 0 0 | C0000 | C2000
+ 0 0 0 0 1 | C0800 | C2000
+ 0 0 0 1 0 | C1000 | C2000
+ 0 0 0 1 1 | C1800 | C2000
+ | |
+ 0 0 1 0 0 | C4000 | C6000
+ 0 0 1 0 1 | C4800 | C6000
+ 0 0 1 1 0 | C5000 | C6000
+ 0 0 1 1 1 | C5800 | C6000
+ | |
+ 0 1 0 0 0 | CC000 | CE000
+ 0 1 0 0 1 | CC800 | CE000
+ 0 1 0 1 0 | CD000 | CE000
+ 0 1 0 1 1 | CD800 | CE000
+ | |
+ 0 1 1 0 0 | D0000 | D2000 (Manufacturer's default)
+ 0 1 1 0 1 | D0800 | D2000
+ 0 1 1 1 0 | D1000 | D2000
+ 0 1 1 1 1 | D1800 | D2000
+ | |
+ 1 0 0 0 0 | D4000 | D6000
+ 1 0 0 0 1 | D4800 | D6000
+ 1 0 0 1 0 | D5000 | D6000
+ 1 0 0 1 1 | D5800 | D6000
+ | |
+ 1 0 1 0 0 | D8000 | DA000
+ 1 0 1 0 1 | D8800 | DA000
+ 1 0 1 1 0 | D9000 | DA000
+ 1 0 1 1 1 | D9800 | DA000
+ | |
+ 1 1 0 0 0 | DC000 | DE000
+ 1 1 0 0 1 | DC800 | DE000
+ 1 1 0 1 0 | DD000 | DE000
+ 1 1 0 1 1 | DD800 | DE000
+ | |
+ 1 1 1 0 0 | E0000 | E2000
+ 1 1 1 0 1 | E0800 | E2000
+ 1 1 1 1 0 | E1000 | E2000
+ 1 1 1 1 1 | E1800 | E2000
+
+
+Setting Interrupt Request Lines (IRQ)
+-------------------------------------
+
+??????????????????????????????????????
+
+
+Setting the Timeouts
+--------------------
+
+??????????????????????????????????????
+
+
+*****************************************************************************
+
+** No Name **
+8-bit cards ("Made in Taiwan R.O.C.")
+-----------
+ - from Vojtech Pavlik <vojtech@suse.cz>
+
+I have named this ARCnet card "NONAME", since I got only the card with
+no manual at all and the only text identifying the manufacturer is
+"MADE IN TAIWAN R.O.C" printed on the card.
+
+ ____________________________________________________________
+ | 1 2 3 4 5 6 7 8 |
+ | |o|o| JP1 o|o|o|o|o|o|o|o| ON |
+ | + o|o|o|o|o|o|o|o| ___|
+ | _____________ o|o|o|o|o|o|o|o| OFF _____ | | ID7
+ | | | SW1 | | | | ID6
+ | > RAM (2k) | ____________________ | H | | S | ID5
+ | |_____________| | || y | | W | ID4
+ | | || b | | 2 | ID3
+ | | || r | | | ID2
+ | | || i | | | ID1
+ | | 90C65 || d | |___| ID0
+ | SW3 | || | |
+ | |o|o|o|o|o|o|o|o| ON | || I | |
+ | |o|o|o|o|o|o|o|o| | || C | |
+ | |o|o|o|o|o|o|o|o| OFF |____________________|| | _____|
+ | 1 2 3 4 5 6 7 8 | | | |___
+ | ______________ | | | BNC |___|
+ | | | |_____| |_____|
+ | > EPROM SOCKET | |
+ | |______________| |
+ | ______________|
+ | |
+ |_____________________________________________|
+
+Legend:
+
+90C65 ARCNET Chip
+SW1 1-5: Base Memory Address Select
+ 6-8: Base I/O Address Select
+SW2 1-8: Node ID Select (ID0-ID7)
+SW3 1-5: IRQ Select
+ 6-7: Extra Timeout
+ 8 : ROM Enable
+JP1 Led connector
+BNC Coax connector
+
+Although the jumpers SW1 and SW3 are marked SW, not JP, they are jumpers, not
+switches.
+
+Setting the jumpers to ON means connecting the upper two pins, off the bottom
+two - or - in case of IRQ setting, connecting none of them at all.
+
+Setting the Node ID
+-------------------
+
+The eight switches in SW2 are used to set the node ID. Each node attached
+to the network must have an unique node ID which must not be 0.
+Switch 1 (ID0) serves as the least significant bit (LSB).
+
+Setting one of the switches to Off means "1", On means "0".
+
+The node ID is the sum of the values of all switches set to "1"
+These values are:
+
+ Switch | Label | Value
+ -------|-------|-------
+ 1 | ID0 | 1
+ 2 | ID1 | 2
+ 3 | ID2 | 4
+ 4 | ID3 | 8
+ 5 | ID4 | 16
+ 6 | ID5 | 32
+ 7 | ID6 | 64
+ 8 | ID7 | 128
+
+Some Examples:
+
+ Switch | Hex | Decimal
+ 8 7 6 5 4 3 2 1 | Node ID | Node ID
+ ----------------|---------|---------
+ 0 0 0 0 0 0 0 0 | not allowed
+ 0 0 0 0 0 0 0 1 | 1 | 1
+ 0 0 0 0 0 0 1 0 | 2 | 2
+ 0 0 0 0 0 0 1 1 | 3 | 3
+ . . . | |
+ 0 1 0 1 0 1 0 1 | 55 | 85
+ . . . | |
+ 1 0 1 0 1 0 1 0 | AA | 170
+ . . . | |
+ 1 1 1 1 1 1 0 1 | FD | 253
+ 1 1 1 1 1 1 1 0 | FE | 254
+ 1 1 1 1 1 1 1 1 | FF | 255
+
+
+Setting the I/O Base Address
+----------------------------
+
+The last three switches in switch block SW1 are used to select one
+of eight possible I/O Base addresses using the following table
+
+
+ Switch | Hex I/O
+ 6 7 8 | Address
+ ------------|--------
+ ON ON ON | 260
+ OFF ON ON | 290
+ ON OFF ON | 2E0 (Manufacturer's default)
+ OFF OFF ON | 2F0
+ ON ON OFF | 300
+ OFF ON OFF | 350
+ ON OFF OFF | 380
+ OFF OFF OFF | 3E0
+
+
+Setting the Base Memory (RAM) buffer Address
+--------------------------------------------
+
+The memory buffer (RAM) requires 2K. The base of this buffer can be
+located in any of eight positions. The address of the Boot Prom is
+memory base + 0x2000.
+Jumpers 3-5 of jumper block SW1 select the Memory Base address.
+
+ Switch | Hex RAM | Hex ROM
+ 1 2 3 4 5 | Address | Address *)
+ --------------------|---------|-----------
+ ON ON ON ON ON | C0000 | C2000
+ ON ON OFF ON ON | C4000 | C6000
+ ON ON ON OFF ON | CC000 | CE000
+ ON ON OFF OFF ON | D0000 | D2000 (Manufacturer's default)
+ ON ON ON ON OFF | D4000 | D6000
+ ON ON OFF ON OFF | D8000 | DA000
+ ON ON ON OFF OFF | DC000 | DE000
+ ON ON OFF OFF OFF | E0000 | E2000
+
+*) To enable the Boot ROM set the jumper 8 of jumper block SW3 to position ON.
+
+The jumpers 1 and 2 probably add 0x0800, 0x1000 and 0x1800 to RAM adders.
+
+Setting the Interrupt Line
+--------------------------
+
+Jumpers 1-5 of the jumper block SW3 control the IRQ level.
+
+ Jumper | IRQ
+ 1 2 3 4 5 |
+ ----------------------------
+ ON OFF OFF OFF OFF | 2
+ OFF ON OFF OFF OFF | 3
+ OFF OFF ON OFF OFF | 4
+ OFF OFF OFF ON OFF | 5
+ OFF OFF OFF OFF ON | 7
+
+
+Setting the Timeout Parameters
+------------------------------
+
+The jumpers 6-7 of the jumper block SW3 are used to determine the timeout
+parameters. These two jumpers are normally left in the OFF position.
+
+
+*****************************************************************************
+
+** No Name **
+(Generic Model 9058)
+--------------------
+ - from Andrew J. Kroll <ag784@freenet.buffalo.edu>
+ - Sorry this sat in my to-do box for so long, Andrew! (yikes - over a
+ year!)
+ _____
+ | <
+ | .---'
+ ________________________________________________________________ | |
+ | | SW2 | | |
+ | ___________ |_____________| | |
+ | | | 1 2 3 4 5 6 ___| |
+ | > 6116 RAM | _________ 8 | | |
+ | |___________| |20MHzXtal| 7 | | |
+ | |_________| __________ 6 | S | |
+ | 74LS373 | |- 5 | W | |
+ | _________ | E |- 4 | | |
+ | >_______| ______________|..... P |- 3 | 3 | |
+ | | | : O |- 2 | | |
+ | | | : X |- 1 |___| |
+ | ________________ | | : Y |- | |
+ | | SW1 | | SL90C65 | : |- | |
+ | |________________| | | : B |- | |
+ | 1 2 3 4 5 6 7 8 | | : O |- | |
+ | |_________o____|..../ A |- _______| |
+ | ____________________ | R |- | |------,
+ | | | | D |- | BNC | # |
+ | > 2764 PROM SOCKET | |__________|- |_______|------'
+ | |____________________| _________ | |
+ | >________| <- 74LS245 | |
+ | | |
+ |___ ______________| |
+ |H H H H H H H H H H H H H H H H H H H H H H H| | |
+ |U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U_U| | |
+ \|
+Legend:
+
+SL90C65 ARCNET Controller / Transceiver /Logic
+SW1 1-5: IRQ Select
+ 6: ET1
+ 7: ET2
+ 8: ROM ENABLE
+SW2 1-3: Memory Buffer/PROM Address
+ 3-6: I/O Address Map
+SW3 1-8: Node ID Select
+BNC BNC RG62/U Connection
+ *I* have had success using RG59B/U with *NO* terminators!
+ What gives?!
+
+SW1: Timeouts, Interrupt and ROM
+---------------------------------
+
+To select a hardware interrupt level set one (only one!) of the dip switches
+up (on) SW1...(switches 1-5)
+IRQ3, IRQ4, IRQ5, IRQ7, IRQ2. The Manufacturer's default is IRQ2.
+
+The switches on SW1 labeled EXT1 (switch 6) and EXT2 (switch 7)
+are used to determine the timeout parameters. These two dip switches
+are normally left off (down).
+
+ To enable the 8K Boot PROM position SW1 switch 8 on (UP) labeled ROM.
+ The default is jumper ROM not installed.
+
+
+Setting the I/O Base Address
+----------------------------
+
+The last three switches in switch group SW2 are used to select one
+of eight possible I/O Base addresses using the following table
+
+
+ Switch | Hex I/O
+ 4 5 6 | Address
+ -------|--------
+ 0 0 0 | 260
+ 0 0 1 | 290
+ 0 1 0 | 2E0 (Manufacturer's default)
+ 0 1 1 | 2F0
+ 1 0 0 | 300
+ 1 0 1 | 350
+ 1 1 0 | 380
+ 1 1 1 | 3E0
+
+
+Setting the Base Memory Address (RAM & ROM)
+-------------------------------------------
+
+The memory buffer requires 2K of a 16K block of RAM. The base of this
+16K block can be located in any of eight positions.
+Switches 1-3 of switch group SW2 select the Base of the 16K block.
+(0 = DOWN, 1 = UP)
+I could, however, only verify two settings...
+
+ Switch| Hex RAM | Hex ROM
+ 1 2 3 | Address | Address
+ ------|---------|-----------
+ 0 0 0 | E0000 | E2000
+ 0 0 1 | D0000 | D2000 (Manufacturer's default)
+ 0 1 0 | ????? | ?????
+ 0 1 1 | ????? | ?????
+ 1 0 0 | ????? | ?????
+ 1 0 1 | ????? | ?????
+ 1 1 0 | ????? | ?????
+ 1 1 1 | ????? | ?????
+
+
+Setting the Node ID
+-------------------
+
+The eight switches in group SW3 are used to set the node ID.
+Each node attached to the network must have an unique node ID which
+must be different from 0.
+Switch 1 serves as the least significant bit (LSB).
+switches in the DOWN position are OFF (0) and in the UP position are ON (1)
+
+The node ID is the sum of the values of all switches set to "1"
+These values are:
+ Switch | Value
+ -------|-------
+ 1 | 1
+ 2 | 2
+ 3 | 4
+ 4 | 8
+ 5 | 16
+ 6 | 32
+ 7 | 64
+ 8 | 128
+
+Some Examples:
+
+ Switch# | Hex | Decimal
+8 7 6 5 4 3 2 1 | Node ID | Node ID
+----------------|---------|---------
+0 0 0 0 0 0 0 0 | not allowed <-.
+0 0 0 0 0 0 0 1 | 1 | 1 |
+0 0 0 0 0 0 1 0 | 2 | 2 |
+0 0 0 0 0 0 1 1 | 3 | 3 |
+ . . . | | |
+0 1 0 1 0 1 0 1 | 55 | 85 |
+ . . . | | + Don't use 0 or 255!
+1 0 1 0 1 0 1 0 | AA | 170 |
+ . . . | | |
+1 1 1 1 1 1 0 1 | FD | 253 |
+1 1 1 1 1 1 1 0 | FE | 254 |
+1 1 1 1 1 1 1 1 | FF | 255 <-'
+
+
+*****************************************************************************
+
+** Tiara **
+(model unknown)
+-------------------------
+ - from Christoph Lameter <christoph@lameter.com>
+
+
+Here is information about my card as far as I could figure it out:
+----------------------------------------------- tiara
+Tiara LanCard of Tiara Computer Systems.
+
++----------------------------------------------+
+! ! Transmitter Unit ! !
+! +------------------+ -------
+! MEM Coax Connector
+! ROM 7654321 <- I/O -------
+! : : +--------+ !
+! : : ! 90C66LJ! +++
+! : : ! ! !D Switch to set
+! : : ! ! !I the Nodenumber
+! : : +--------+ !P
+! !++
+! 234567 <- IRQ !
++------------!!!!!!!!!!!!!!!!!!!!!!!!--------+
+ !!!!!!!!!!!!!!!!!!!!!!!!
+
+0 = Jumper Installed
+1 = Open
+
+Top Jumper line Bit 7 = ROM Enable 654=Memory location 321=I/O
+
+Settings for Memory Location (Top Jumper Line)
+456 Address selected
+000 C0000
+001 C4000
+010 CC000
+011 D0000
+100 D4000
+101 D8000
+110 DC000
+111 E0000
+
+Settings for I/O Address (Top Jumper Line)
+123 Port
+000 260
+001 290
+010 2E0
+011 2F0
+100 300
+101 350
+110 380
+111 3E0
+
+Settings for IRQ Selection (Lower Jumper Line)
+234567
+011111 IRQ 2
+101111 IRQ 3
+110111 IRQ 4
+111011 IRQ 5
+111110 IRQ 7
+
+*****************************************************************************
+
+
+Other Cards
+-----------
+
+I have no information on other models of ARCnet cards at the moment. Please
+send any and all info to:
+ apenwarr@worldvisions.ca
+
+Thanks.
diff --git a/Documentation/networking/arcnet.txt b/Documentation/networking/arcnet.txt
new file mode 100644
index 000000000..aff97f47c
--- /dev/null
+++ b/Documentation/networking/arcnet.txt
@@ -0,0 +1,556 @@
+----------------------------------------------------------------------------
+NOTE: See also arcnet-hardware.txt in this directory for jumper-setting
+and cabling information if you're like many of us and didn't happen to get a
+manual with your ARCnet card.
+----------------------------------------------------------------------------
+
+Since no one seems to listen to me otherwise, perhaps a poem will get your
+attention:
+ This driver's getting fat and beefy,
+ But my cat is still named Fifi.
+
+Hmm, I think I'm allowed to call that a poem, even though it's only two
+lines. Hey, I'm in Computer Science, not English. Give me a break.
+
+The point is: I REALLY REALLY REALLY REALLY REALLY want to hear from you if
+you test this and get it working. Or if you don't. Or anything.
+
+ARCnet 0.32 ALPHA first made it into the Linux kernel 1.1.80 - this was
+nice, but after that even FEWER people started writing to me because they
+didn't even have to install the patch. <sigh>
+
+Come on, be a sport! Send me a success report!
+
+(hey, that was even better than my original poem... this is getting bad!)
+
+
+--------
+WARNING:
+--------
+
+If you don't e-mail me about your success/failure soon, I may be forced to
+start SINGING. And we don't want that, do we?
+
+(You know, it might be argued that I'm pushing this point a little too much.
+If you think so, why not flame me in a quick little e-mail? Please also
+include the type of card(s) you're using, software, size of network, and
+whether it's working or not.)
+
+My e-mail address is: apenwarr@worldvisions.ca
+
+
+---------------------------------------------------------------------------
+
+
+These are the ARCnet drivers for Linux.
+
+
+This new release (2.91) has been put together by David Woodhouse
+<dwmw2@infradead.org>, in an attempt to tidy up the driver after adding support
+for yet another chipset. Now the generic support has been separated from the
+individual chipset drivers, and the source files aren't quite so packed with
+#ifdefs! I've changed this file a bit, but kept it in the first person from
+Avery, because I didn't want to completely rewrite it.
+
+The previous release resulted from many months of on-and-off effort from me
+(Avery Pennarun), many bug reports/fixes and suggestions from others, and in
+particular a lot of input and coding from Tomasz Motylewski. Starting with
+ARCnet 2.10 ALPHA, Tomasz's all-new-and-improved RFC1051 support has been
+included and seems to be working fine!
+
+
+Where do I discuss these drivers?
+---------------------------------
+
+Tomasz has been so kind as to set up a new and improved mailing list.
+Subscribe by sending a message with the BODY "subscribe linux-arcnet YOUR
+REAL NAME" to listserv@tichy.ch.uj.edu.pl. Then, to submit messages to the
+list, mail to linux-arcnet@tichy.ch.uj.edu.pl.
+
+There are archives of the mailing list at:
+ http://epistolary.org/mailman/listinfo.cgi/arcnet
+
+The people on linux-net@vger.kernel.org (now defunct, replaced by
+netdev@vger.kernel.org) have also been known to be very helpful, especially
+when we're talking about ALPHA Linux kernels that may or may not work right
+in the first place.
+
+
+Other Drivers and Info
+----------------------
+
+You can try my ARCNET page on the World Wide Web at:
+ http://www.qis.net/~jschmitz/arcnet/
+
+Also, SMC (one of the companies that makes ARCnet cards) has a WWW site you
+might be interested in, which includes several drivers for various cards
+including ARCnet. Try:
+ http://www.smc.com/
+
+Performance Technologies makes various network software that supports
+ARCnet:
+ http://www.perftech.com/ or ftp to ftp.perftech.com.
+
+Novell makes a networking stack for DOS which includes ARCnet drivers. Try
+FTPing to ftp.novell.com.
+
+You can get the Crynwr packet driver collection (including arcether.com, the
+one you'll want to use with ARCnet cards) from
+oak.oakland.edu:/simtel/msdos/pktdrvr. It won't work perfectly on a 386+
+without patches, though, and also doesn't like several cards. Fixed
+versions are available on my WWW page, or via e-mail if you don't have WWW
+access.
+
+
+Installing the Driver
+---------------------
+
+All you will need to do in order to install the driver is:
+ make config
+ (be sure to choose ARCnet in the network devices
+ and at least one chipset driver.)
+ make clean
+ make zImage
+
+If you obtained this ARCnet package as an upgrade to the ARCnet driver in
+your current kernel, you will need to first copy arcnet.c over the one in
+the linux/drivers/net directory.
+
+You will know the driver is installed properly if you get some ARCnet
+messages when you reboot into the new Linux kernel.
+
+There are four chipset options:
+
+ 1. Standard ARCnet COM90xx chipset.
+
+This is the normal ARCnet card, which you've probably got. This is the only
+chipset driver which will autoprobe if not told where the card is.
+It following options on the command line:
+ com90xx=[<io>[,<irq>[,<shmem>]]][,<name>] | <name>
+
+If you load the chipset support as a module, the options are:
+ io=<io> irq=<irq> shmem=<shmem> device=<name>
+
+To disable the autoprobe, just specify "com90xx=" on the kernel command line.
+To specify the name alone, but allow autoprobe, just put "com90xx=<name>"
+
+ 2. ARCnet COM20020 chipset.
+
+This is the new chipset from SMC with support for promiscuous mode (packet
+sniffing), extra diagnostic information, etc. Unfortunately, there is no
+sensible method of autoprobing for these cards. You must specify the I/O
+address on the kernel command line.
+The command line options are:
+ com20020=<io>[,<irq>[,<node_ID>[,backplane[,CKP[,timeout]]]]][,name]
+
+If you load the chipset support as a module, the options are:
+ io=<io> irq=<irq> node=<node_ID> backplane=<backplane> clock=<CKP>
+ timeout=<timeout> device=<name>
+
+The COM20020 chipset allows you to set the node ID in software, overriding the
+default which is still set in DIP switches on the card. If you don't have the
+COM20020 data sheets, and you don't know what the other three options refer
+to, then they won't interest you - forget them.
+
+ 3. ARCnet COM90xx chipset in IO-mapped mode.
+
+This will also work with the normal ARCnet cards, but doesn't use the shared
+memory. It performs less well than the above driver, but is provided in case
+you have a card which doesn't support shared memory, or (strangely) in case
+you have so many ARCnet cards in your machine that you run out of shmem slots.
+If you don't give the IO address on the kernel command line, then the driver
+will not find the card.
+The command line options are:
+ com90io=<io>[,<irq>][,<name>]
+
+If you load the chipset support as a module, the options are:
+ io=<io> irq=<irq> device=<name>
+
+ 4. ARCnet RIM I cards.
+
+These are COM90xx chips which are _completely_ memory mapped. The support for
+these is not tested. If you have one, please mail the author with a success
+report. All options must be specified, except the device name.
+Command line options:
+ arcrimi=<shmem>,<irq>,<node_ID>[,<name>]
+
+If you load the chipset support as a module, the options are:
+ shmem=<shmem> irq=<irq> node=<node_ID> device=<name>
+
+
+Loadable Module Support
+-----------------------
+
+Configure and rebuild Linux. When asked, answer 'm' to "Generic ARCnet
+support" and to support for your ARCnet chipset if you want to use the
+loadable module. You can also say 'y' to "Generic ARCnet support" and 'm'
+to the chipset support if you wish.
+
+ make config
+ make clean
+ make zImage
+ make modules
+
+If you're using a loadable module, you need to use insmod to load it, and
+you can specify various characteristics of your card on the command
+line. (In recent versions of the driver, autoprobing is much more reliable
+and works as a module, so most of this is now unnecessary.)
+
+For example:
+ cd /usr/src/linux/modules
+ insmod arcnet.o
+ insmod com90xx.o
+ insmod com20020.o io=0x2e0 device=eth1
+
+
+Using the Driver
+----------------
+
+If you build your kernel with ARCnet COM90xx support included, it should
+probe for your card automatically when you boot. If you use a different
+chipset driver complied into the kernel, you must give the necessary options
+on the kernel command line, as detailed above.
+
+Go read the NET-2-HOWTO and ETHERNET-HOWTO for Linux; they should be
+available where you picked up this driver. Think of your ARCnet as a
+souped-up (or down, as the case may be) Ethernet card.
+
+By the way, be sure to change all references from "eth0" to "arc0" in the
+HOWTOs. Remember that ARCnet isn't a "true" Ethernet, and the device name
+is DIFFERENT.
+
+
+Multiple Cards in One Computer
+------------------------------
+
+Linux has pretty good support for this now, but since I've been busy, the
+ARCnet driver has somewhat suffered in this respect. COM90xx support, if
+compiled into the kernel, will (try to) autodetect all the installed cards.
+
+If you have other cards, with support compiled into the kernel, then you can
+just repeat the options on the kernel command line, e.g.:
+LILO: linux com20020=0x2e0 com20020=0x380 com90io=0x260
+
+If you have the chipset support built as a loadable module, then you need to
+do something like this:
+ insmod -o arc0 com90xx
+ insmod -o arc1 com20020 io=0x2e0
+ insmod -o arc2 com90xx
+The ARCnet drivers will now sort out their names automatically.
+
+
+How do I get it to work with...?
+--------------------------------
+
+NFS: Should be fine linux->linux, just pretend you're using Ethernet cards.
+ oak.oakland.edu:/simtel/msdos/nfs has some nice DOS clients. There
+ is also a DOS-based NFS server called SOSS. It doesn't multitask
+ quite the way Linux does (actually, it doesn't multitask AT ALL) but
+ you never know what you might need.
+
+ With AmiTCP (and possibly others), you may need to set the following
+ options in your Amiga nfstab: MD 1024 MR 1024 MW 1024
+ (Thanks to Christian Gottschling <ferksy@indigo.tng.oche.de>
+ for this.)
+
+ Probably these refer to maximum NFS data/read/write block sizes. I
+ don't know why the defaults on the Amiga didn't work; write to me if
+ you know more.
+
+DOS: If you're using the freeware arcether.com, you might want to install
+ the driver patch from my web page. It helps with PC/TCP, and also
+ can get arcether to load if it timed out too quickly during
+ initialization. In fact, if you use it on a 386+ you REALLY need
+ the patch, really.
+
+Windows: See DOS :) Trumpet Winsock works fine with either the Novell or
+ Arcether client, assuming you remember to load winpkt of course.
+
+LAN Manager and Windows for Workgroups: These programs use protocols that
+ are incompatible with the Internet standard. They try to pretend
+ the cards are Ethernet, and confuse everyone else on the network.
+
+ However, v2.00 and higher of the Linux ARCnet driver supports this
+ protocol via the 'arc0e' device. See the section on "Multiprotocol
+ Support" for more information.
+
+ Using the freeware Samba server and clients for Linux, you can now
+ interface quite nicely with TCP/IP-based WfWg or Lan Manager
+ networks.
+
+Windows 95: Tools are included with Win95 that let you use either the LANMAN
+ style network drivers (NDIS) or Novell drivers (ODI) to handle your
+ ARCnet packets. If you use ODI, you'll need to use the 'arc0'
+ device with Linux. If you use NDIS, then try the 'arc0e' device.
+ See the "Multiprotocol Support" section below if you need arc0e,
+ you're completely insane, and/or you need to build some kind of
+ hybrid network that uses both encapsulation types.
+
+OS/2: I've been told it works under Warp Connect with an ARCnet driver from
+ SMC. You need to use the 'arc0e' interface for this. If you get
+ the SMC driver to work with the TCP/IP stuff included in the
+ "normal" Warp Bonus Pack, let me know.
+
+ ftp.microsoft.com also has a freeware "Lan Manager for OS/2" client
+ which should use the same protocol as WfWg does. I had no luck
+ installing it under Warp, however. Please mail me with any results.
+
+NetBSD/AmiTCP: These use an old version of the Internet standard ARCnet
+ protocol (RFC1051) which is compatible with the Linux driver v2.10
+ ALPHA and above using the arc0s device. (See "Multiprotocol ARCnet"
+ below.) ** Newer versions of NetBSD apparently support RFC1201.
+
+
+Using Multiprotocol ARCnet
+--------------------------
+
+The ARCnet driver v2.10 ALPHA supports three protocols, each on its own
+"virtual network device":
+
+ arc0 - RFC1201 protocol, the official Internet standard which just
+ happens to be 100% compatible with Novell's TRXNET driver.
+ Version 1.00 of the ARCnet driver supported _only_ this
+ protocol. arc0 is the fastest of the three protocols (for
+ whatever reason), and allows larger packets to be used
+ because it supports RFC1201 "packet splitting" operations.
+ Unless you have a specific need to use a different protocol,
+ I strongly suggest that you stick with this one.
+
+ arc0e - "Ethernet-Encapsulation" which sends packets over ARCnet
+ that are actually a lot like Ethernet packets, including the
+ 6-byte hardware addresses. This protocol is compatible with
+ Microsoft's NDIS ARCnet driver, like the one in WfWg and
+ LANMAN. Because the MTU of 493 is actually smaller than the
+ one "required" by TCP/IP (576), there is a chance that some
+ network operations will not function properly. The Linux
+ TCP/IP layer can compensate in most cases, however, by
+ automatically fragmenting the TCP/IP packets to make them
+ fit. arc0e also works slightly more slowly than arc0, for
+ reasons yet to be determined. (Probably it's the smaller
+ MTU that does it.)
+
+ arc0s - The "[s]imple" RFC1051 protocol is the "previous" Internet
+ standard that is completely incompatible with the new
+ standard. Some software today, however, continues to
+ support the old standard (and only the old standard)
+ including NetBSD and AmiTCP. RFC1051 also does not support
+ RFC1201's packet splitting, and the MTU of 507 is still
+ smaller than the Internet "requirement," so it's quite
+ possible that you may run into problems. It's also slower
+ than RFC1201 by about 25%, for the same reason as arc0e.
+
+ The arc0s support was contributed by Tomasz Motylewski
+ and modified somewhat by me. Bugs are probably my fault.
+
+You can choose not to compile arc0e and arc0s into the driver if you want -
+this will save you a bit of memory and avoid confusion when eg. trying to
+use the "NFS-root" stuff in recent Linux kernels.
+
+The arc0e and arc0s devices are created automatically when you first
+ifconfig the arc0 device. To actually use them, though, you need to also
+ifconfig the other virtual devices you need. There are a number of ways you
+can set up your network then:
+
+
+1. Single Protocol.
+
+ This is the simplest way to configure your network: use just one of the
+ two available protocols. As mentioned above, it's a good idea to use
+ only arc0 unless you have a good reason (like some other software, ie.
+ WfWg, that only works with arc0e).
+
+ If you need only arc0, then the following commands should get you going:
+ ifconfig arc0 MY.IP.ADD.RESS
+ route add MY.IP.ADD.RESS arc0
+ route add -net SUB.NET.ADD.RESS arc0
+ [add other local routes here]
+
+ If you need arc0e (and only arc0e), it's a little different:
+ ifconfig arc0 MY.IP.ADD.RESS
+ ifconfig arc0e MY.IP.ADD.RESS
+ route add MY.IP.ADD.RESS arc0e
+ route add -net SUB.NET.ADD.RESS arc0e
+
+ arc0s works much the same way as arc0e.
+
+
+2. More than one protocol on the same wire.
+
+ Now things start getting confusing. To even try it, you may need to be
+ partly crazy. Here's what *I* did. :) Note that I don't include arc0s in
+ my home network; I don't have any NetBSD or AmiTCP computers, so I only
+ use arc0s during limited testing.
+
+ I have three computers on my home network; two Linux boxes (which prefer
+ RFC1201 protocol, for reasons listed above), and one XT that can't run
+ Linux but runs the free Microsoft LANMAN Client instead.
+
+ Worse, one of the Linux computers (freedom) also has a modem and acts as
+ a router to my Internet provider. The other Linux box (insight) also has
+ its own IP address and needs to use freedom as its default gateway. The
+ XT (patience), however, does not have its own Internet IP address and so
+ I assigned it one on a "private subnet" (as defined by RFC1597).
+
+ To start with, take a simple network with just insight and freedom.
+ Insight needs to:
+ - talk to freedom via RFC1201 (arc0) protocol, because I like it
+ more and it's faster.
+ - use freedom as its Internet gateway.
+
+ That's pretty easy to do. Set up insight like this:
+ ifconfig arc0 insight
+ route add insight arc0
+ route add freedom arc0 /* I would use the subnet here (like I said
+ to to in "single protocol" above),
+ but the rest of the subnet
+ unfortunately lies across the PPP
+ link on freedom, which confuses
+ things. */
+ route add default gw freedom
+
+ And freedom gets configured like so:
+ ifconfig arc0 freedom
+ route add freedom arc0
+ route add insight arc0
+ /* and default gateway is configured by pppd */
+
+ Great, now insight talks to freedom directly on arc0, and sends packets
+ to the Internet through freedom. If you didn't know how to do the above,
+ you should probably stop reading this section now because it only gets
+ worse.
+
+ Now, how do I add patience into the network? It will be using LANMAN
+ Client, which means I need the arc0e device. It needs to be able to talk
+ to both insight and freedom, and also use freedom as a gateway to the
+ Internet. (Recall that patience has a "private IP address" which won't
+ work on the Internet; that's okay, I configured Linux IP masquerading on
+ freedom for this subnet).
+
+ So patience (necessarily; I don't have another IP number from my
+ provider) has an IP address on a different subnet than freedom and
+ insight, but needs to use freedom as an Internet gateway. Worse, most
+ DOS networking programs, including LANMAN, have braindead networking
+ schemes that rely completely on the netmask and a 'default gateway' to
+ determine how to route packets. This means that to get to freedom or
+ insight, patience WILL send through its default gateway, regardless of
+ the fact that both freedom and insight (courtesy of the arc0e device)
+ could understand a direct transmission.
+
+ I compensate by giving freedom an extra IP address - aliased 'gatekeeper'
+ - that is on my private subnet, the same subnet that patience is on. I
+ then define gatekeeper to be the default gateway for patience.
+
+ To configure freedom (in addition to the commands above):
+ ifconfig arc0e gatekeeper
+ route add gatekeeper arc0e
+ route add patience arc0e
+
+ This way, freedom will send all packets for patience through arc0e,
+ giving its IP address as gatekeeper (on the private subnet). When it
+ talks to insight or the Internet, it will use its "freedom" Internet IP
+ address.
+
+ You will notice that we haven't configured the arc0e device on insight.
+ This would work, but is not really necessary, and would require me to
+ assign insight another special IP number from my private subnet. Since
+ both insight and patience are using freedom as their default gateway, the
+ two can already talk to each other.
+
+ It's quite fortunate that I set things up like this the first time (cough
+ cough) because it's really handy when I boot insight into DOS. There, it
+ runs the Novell ODI protocol stack, which only works with RFC1201 ARCnet.
+ In this mode it would be impossible for insight to communicate directly
+ with patience, since the Novell stack is incompatible with Microsoft's
+ Ethernet-Encap. Without changing any settings on freedom or patience, I
+ simply set freedom as the default gateway for insight (now in DOS,
+ remember) and all the forwarding happens "automagically" between the two
+ hosts that would normally not be able to communicate at all.
+
+ For those who like diagrams, I have created two "virtual subnets" on the
+ same physical ARCnet wire. You can picture it like this:
+
+
+ [RFC1201 NETWORK] [ETHER-ENCAP NETWORK]
+ (registered Internet subnet) (RFC1597 private subnet)
+
+ (IP Masquerade)
+ /---------------\ * /---------------\
+ | | * | |
+ | +-Freedom-*-Gatekeeper-+ |
+ | | | * | |
+ \-------+-------/ | * \-------+-------/
+ | | |
+ Insight | Patience
+ (Internet)
+
+
+
+It works: what now?
+-------------------
+
+Send mail describing your setup, preferably including driver version, kernel
+version, ARCnet card model, CPU type, number of systems on your network, and
+list of software in use to me at the following address:
+ apenwarr@worldvisions.ca
+
+I do send (sometimes automated) replies to all messages I receive. My email
+can be weird (and also usually gets forwarded all over the place along the
+way to me), so if you don't get a reply within a reasonable time, please
+resend.
+
+
+It doesn't work: what now?
+--------------------------
+
+Do the same as above, but also include the output of the ifconfig and route
+commands, as well as any pertinent log entries (ie. anything that starts
+with "arcnet:" and has shown up since the last reboot) in your mail.
+
+If you want to try fixing it yourself (I strongly recommend that you mail me
+about the problem first, since it might already have been solved) you may
+want to try some of the debug levels available. For heavy testing on
+D_DURING or more, it would be a REALLY good idea to kill your klogd daemon
+first! D_DURING displays 4-5 lines for each packet sent or received. D_TX,
+D_RX, and D_SKB actually DISPLAY each packet as it is sent or received,
+which is obviously quite big.
+
+Starting with v2.40 ALPHA, the autoprobe routines have changed
+significantly. In particular, they won't tell you why the card was not
+found unless you turn on the D_INIT_REASONS debugging flag.
+
+Once the driver is running, you can run the arcdump shell script (available
+from me or in the full ARCnet package, if you have it) as root to list the
+contents of the arcnet buffers at any time. To make any sense at all out of
+this, you should grab the pertinent RFCs. (some are listed near the top of
+arcnet.c). arcdump assumes your card is at 0xD0000. If it isn't, edit the
+script.
+
+Buffers 0 and 1 are used for receiving, and Buffers 2 and 3 are for sending.
+Ping-pong buffers are implemented both ways.
+
+If your debug level includes D_DURING and you did NOT define SLOW_XMIT_COPY,
+the buffers are cleared to a constant value of 0x42 every time the card is
+reset (which should only happen when you do an ifconfig up, or when Linux
+decides that the driver is broken). During a transmit, unused parts of the
+buffer will be cleared to 0x42 as well. This is to make it easier to figure
+out which bytes are being used by a packet.
+
+You can change the debug level without recompiling the kernel by typing:
+ ifconfig arc0 down metric 1xxx
+ /etc/rc.d/rc.inet1
+where "xxx" is the debug level you want. For example, "metric 1015" would put
+you at debug level 15. Debug level 7 is currently the default.
+
+Note that the debug level is (starting with v1.90 ALPHA) a binary
+combination of different debug flags; so debug level 7 is really 1+2+4 or
+D_NORMAL+D_EXTRA+D_INIT. To include D_DURING, you would add 16 to this,
+resulting in debug level 23.
+
+If you don't understand that, you probably don't want to know anyway.
+E-mail me about your problem.
+
+
+I want to send money: what now?
+-------------------------------
+
+Go take a nap or something. You'll feel better in the morning.
diff --git a/Documentation/networking/atm.txt b/Documentation/networking/atm.txt
new file mode 100644
index 000000000..82921cee7
--- /dev/null
+++ b/Documentation/networking/atm.txt
@@ -0,0 +1,8 @@
+In order to use anything but the most primitive functions of ATM,
+several user-mode programs are required to assist the kernel. These
+programs and related material can be found via the ATM on Linux Web
+page at http://linux-atm.sourceforge.net/
+
+If you encounter problems with ATM, please report them on the ATM
+on Linux mailing list. Subscription information, archives, etc.,
+can be found on http://linux-atm.sourceforge.net/
diff --git a/Documentation/networking/ax25.txt b/Documentation/networking/ax25.txt
new file mode 100644
index 000000000..8257dbf9b
--- /dev/null
+++ b/Documentation/networking/ax25.txt
@@ -0,0 +1,10 @@
+To use the amateur radio protocols within Linux you will need to get a
+suitable copy of the AX.25 Utilities. More detailed information about
+AX.25, NET/ROM and ROSE, associated programs and and utilities can be
+found on http://www.linux-ax25.org.
+
+There is an active mailing list for discussing Linux amateur radio matters
+called linux-hams@vger.kernel.org. To subscribe to it, send a message to
+majordomo@vger.kernel.org with the words "subscribe linux-hams" in the body
+of the message, the subject field is ignored. You don't need to be
+subscribed to post but of course that means you might miss an answer.
diff --git a/Documentation/networking/batman-adv.txt b/Documentation/networking/batman-adv.txt
new file mode 100644
index 000000000..58e49042f
--- /dev/null
+++ b/Documentation/networking/batman-adv.txt
@@ -0,0 +1,202 @@
+BATMAN-ADV
+----------
+
+Batman advanced is a new approach to wireless networking which
+does no longer operate on the IP basis. Unlike the batman daemon,
+which exchanges information using UDP packets and sets routing
+tables, batman-advanced operates on ISO/OSI Layer 2 only and uses
+and routes (or better: bridges) Ethernet Frames. It emulates a
+virtual network switch of all nodes participating. Therefore all
+nodes appear to be link local, thus all higher operating proto-
+cols won't be affected by any changes within the network. You can
+run almost any protocol above batman advanced, prominent examples
+are: IPv4, IPv6, DHCP, IPX.
+
+Batman advanced was implemented as a Linux kernel driver to re-
+duce the overhead to a minimum. It does not depend on any (other)
+network driver, and can be used on wifi as well as ethernet lan,
+vpn, etc ... (anything with ethernet-style layer 2).
+
+
+CONFIGURATION
+-------------
+
+Load the batman-adv module into your kernel:
+
+# insmod batman-adv.ko
+
+The module is now waiting for activation. You must add some in-
+terfaces on which batman can operate. After loading the module
+batman advanced will scan your systems interfaces to search for
+compatible interfaces. Once found, it will create subfolders in
+the /sys directories of each supported interface, e.g.
+
+# ls /sys/class/net/eth0/batman_adv/
+# iface_status mesh_iface
+
+If an interface does not have the "batman_adv" subfolder it prob-
+ably is not supported. Not supported interfaces are: loopback,
+non-ethernet and batman's own interfaces.
+
+Note: After the module was loaded it will continuously watch for
+new interfaces to verify the compatibility. There is no need to
+reload the module if you plug your USB wifi adapter into your ma-
+chine after batman advanced was initially loaded.
+
+To activate a given interface simply write "bat0" into its
+"mesh_iface" file inside the batman_adv subfolder:
+
+# echo bat0 > /sys/class/net/eth0/batman_adv/mesh_iface
+
+Repeat this step for all interfaces you wish to add. Now batman
+starts using/broadcasting on this/these interface(s).
+
+By reading the "iface_status" file you can check its status:
+
+# cat /sys/class/net/eth0/batman_adv/iface_status
+# active
+
+To deactivate an interface you have to write "none" into its
+"mesh_iface" file:
+
+# echo none > /sys/class/net/eth0/batman_adv/mesh_iface
+
+
+All mesh wide settings can be found in batman's own interface
+folder:
+
+# ls /sys/class/net/bat0/mesh/
+#aggregated_ogms distributed_arp_table gw_sel_class orig_interval
+#ap_isolation fragmentation hop_penalty routing_algo
+#bonding gw_bandwidth isolation_mark vlan0
+#bridge_loop_avoidance gw_mode log_level
+
+There is a special folder for debugging information:
+
+# ls /sys/kernel/debug/batman_adv/bat0/
+# bla_backbone_table log transtable_global
+# bla_claim_table originators transtable_local
+# gateways socket
+
+Some of the files contain all sort of status information regard-
+ing the mesh network. For example, you can view the table of
+originators (mesh participants) with:
+
+# cat /sys/kernel/debug/batman_adv/bat0/originators
+
+Other files allow to change batman's behaviour to better fit your
+requirements. For instance, you can check the current originator
+interval (value in milliseconds which determines how often batman
+sends its broadcast packets):
+
+# cat /sys/class/net/bat0/mesh/orig_interval
+# 1000
+
+and also change its value:
+
+# echo 3000 > /sys/class/net/bat0/mesh/orig_interval
+
+In very mobile scenarios, you might want to adjust the originator
+interval to a lower value. This will make the mesh more respon-
+sive to topology changes, but will also increase the overhead.
+
+
+USAGE
+-----
+
+To make use of your newly created mesh, batman advanced provides
+a new interface "bat0" which you should use from this point on.
+All interfaces added to batman advanced are not relevant any
+longer because batman handles them for you. Basically, one "hands
+over" the data by using the batman interface and batman will make
+sure it reaches its destination.
+
+The "bat0" interface can be used like any other regular inter-
+face. It needs an IP address which can be either statically con-
+figured or dynamically (by using DHCP or similar services):
+
+# NodeA: ifconfig bat0 192.168.0.1
+# NodeB: ifconfig bat0 192.168.0.2
+# NodeB: ping 192.168.0.1
+
+Note: In order to avoid problems remove all IP addresses previ-
+ously assigned to interfaces now used by batman advanced, e.g.
+
+# ifconfig eth0 0.0.0.0
+
+
+LOGGING/DEBUGGING
+-----------------
+
+All error messages, warnings and information messages are sent to
+the kernel log. Depending on your operating system distribution
+this can be read in one of a number of ways. Try using the com-
+mands: dmesg, logread, or looking in the files /var/log/kern.log
+or /var/log/syslog. All batman-adv messages are prefixed with
+"batman-adv:" So to see just these messages try
+
+# dmesg | grep batman-adv
+
+When investigating problems with your mesh network it is some-
+times necessary to see more detail debug messages. This must be
+enabled when compiling the batman-adv module. When building bat-
+man-adv as part of kernel, use "make menuconfig" and enable the
+option "B.A.T.M.A.N. debugging".
+
+Those additional debug messages can be accessed using a special
+file in debugfs
+
+# cat /sys/kernel/debug/batman_adv/bat0/log
+
+The additional debug output is by default disabled. It can be en-
+abled during run time. Following log_levels are defined:
+
+0 - All debug output disabled
+1 - Enable messages related to routing / flooding / broadcasting
+2 - Enable messages related to route added / changed / deleted
+4 - Enable messages related to translation table operations
+8 - Enable messages related to bridge loop avoidance
+16 - Enable messaged related to DAT, ARP snooping and parsing
+31 - Enable all messages
+
+The debug output can be changed at runtime using the file
+/sys/class/net/bat0/mesh/log_level. e.g.
+
+# echo 6 > /sys/class/net/bat0/mesh/log_level
+
+will enable debug messages for when routes change.
+
+Counters for different types of packets entering and leaving the
+batman-adv module are available through ethtool:
+
+# ethtool --statistics bat0
+
+
+BATCTL
+------
+
+As batman advanced operates on layer 2 all hosts participating in
+the virtual switch are completely transparent for all protocols
+above layer 2. Therefore the common diagnosis tools do not work
+as expected. To overcome these problems batctl was created. At
+the moment the batctl contains ping, traceroute, tcpdump and
+interfaces to the kernel module settings.
+
+For more information, please see the manpage (man batctl).
+
+batctl is available on http://www.open-mesh.org/
+
+
+CONTACT
+-------
+
+Please send us comments, experiences, questions, anything :)
+
+IRC: #batman on irc.freenode.org
+Mailing-list: b.a.t.m.a.n@open-mesh.org (optional subscription
+ at https://lists.open-mesh.org/mm/listinfo/b.a.t.m.a.n)
+
+You can also contact the Authors:
+
+Marek Lindner <mareklindner@neomailbox.ch>
+Simon Wunderlich <sw@simonwunderlich.de>
diff --git a/Documentation/networking/baycom.txt b/Documentation/networking/baycom.txt
new file mode 100644
index 000000000..688f18fd4
--- /dev/null
+++ b/Documentation/networking/baycom.txt
@@ -0,0 +1,158 @@
+ LINUX DRIVERS FOR BAYCOM MODEMS
+
+ Thomas M. Sailer, HB9JNX/AE4WA, <sailer@ife.ee.ethz.ch>
+
+!!NEW!! (04/98) The drivers for the baycom modems have been split into
+separate drivers as they did not share any code, and the driver
+and device names have changed.
+
+This document describes the Linux Kernel Drivers for simple Baycom style
+amateur radio modems.
+
+The following drivers are available:
+
+baycom_ser_fdx:
+ This driver supports the SER12 modems either full or half duplex.
+ Its baud rate may be changed via the `baud' module parameter,
+ therefore it supports just about every bit bang modem on a
+ serial port. Its devices are called bcsf0 through bcsf3.
+ This is the recommended driver for SER12 type modems,
+ however if you have a broken UART clone that does not have working
+ delta status bits, you may try baycom_ser_hdx.
+
+baycom_ser_hdx:
+ This is an alternative driver for SER12 type modems.
+ It only supports half duplex, and only 1200 baud. Its devices
+ are called bcsh0 through bcsh3. Use this driver only if baycom_ser_fdx
+ does not work with your UART.
+
+baycom_par:
+ This driver supports the par96 and picpar modems.
+ Its devices are called bcp0 through bcp3.
+
+baycom_epp:
+ This driver supports the EPP modem.
+ Its devices are called bce0 through bce3.
+ This driver is work-in-progress.
+
+The following modems are supported:
+
+ser12: This is a very simple 1200 baud AFSK modem. The modem consists only
+ of a modulator/demodulator chip, usually a TI TCM3105. The computer
+ is responsible for regenerating the receiver bit clock, as well as
+ for handling the HDLC protocol. The modem connects to a serial port,
+ hence the name. Since the serial port is not used as an async serial
+ port, the kernel driver for serial ports cannot be used, and this
+ driver only supports standard serial hardware (8250, 16450, 16550)
+
+par96: This is a modem for 9600 baud FSK compatible to the G3RUH standard.
+ The modem does all the filtering and regenerates the receiver clock.
+ Data is transferred from and to the PC via a shift register.
+ The shift register is filled with 16 bits and an interrupt is signalled.
+ The PC then empties the shift register in a burst. This modem connects
+ to the parallel port, hence the name. The modem leaves the
+ implementation of the HDLC protocol and the scrambler polynomial to
+ the PC.
+
+picpar: This is a redesign of the par96 modem by Henning Rech, DF9IC. The modem
+ is protocol compatible to par96, but uses only three low power ICs
+ and can therefore be fed from the parallel port and does not require
+ an additional power supply. Furthermore, it incorporates a carrier
+ detect circuitry.
+
+EPP: This is a high-speed modem adaptor that connects to an enhanced parallel port.
+ Its target audience is users working over a high speed hub (76.8kbit/s).
+
+eppfpga: This is a redesign of the EPP adaptor.
+
+
+
+All of the above modems only support half duplex communications. However,
+the driver supports the KISS (see below) fullduplex command. It then simply
+starts to send as soon as there's a packet to transmit and does not care
+about DCD, i.e. it starts to send even if there's someone else on the channel.
+This command is required by some implementations of the DAMA channel
+access protocol.
+
+
+The Interface of the drivers
+
+Unlike previous drivers, these drivers are no longer character devices,
+but they are now true kernel network interfaces. Installation is therefore
+simple. Once installed, four interfaces named bc{sf,sh,p,e}[0-3] are available.
+sethdlc from the ax25 utilities may be used to set driver states etc.
+Users of userland AX.25 stacks may use the net2kiss utility (also available
+in the ax25 utilities package) to convert packets of a network interface
+to a KISS stream on a pseudo tty. There's also a patch available from
+me for WAMPES which allows attaching a kernel network interface directly.
+
+
+Configuring the driver
+
+Every time a driver is inserted into the kernel, it has to know which
+modems it should access at which ports. This can be done with the setbaycom
+utility. If you are only using one modem, you can also configure the
+driver from the insmod command line (or by means of an option line in
+/etc/modprobe.d/*.conf).
+
+Examples:
+ modprobe baycom_ser_fdx mode="ser12*" iobase=0x3f8 irq=4
+ sethdlc -i bcsf0 -p mode "ser12*" io 0x3f8 irq 4
+
+Both lines configure the first port to drive a ser12 modem at the first
+serial port (COM1 under DOS). The * in the mode parameter instructs the driver to use
+the software DCD algorithm (see below).
+
+ insmod baycom_par mode="picpar" iobase=0x378
+ sethdlc -i bcp0 -p mode "picpar" io 0x378
+
+Both lines configure the first port to drive a picpar modem at the
+first parallel port (LPT1 under DOS). (Note: picpar implies
+hardware DCD, par96 implies software DCD).
+
+The channel access parameters can be set with sethdlc -a or kissparms.
+Note that both utilities interpret the values slightly differently.
+
+
+Hardware DCD versus Software DCD
+
+To avoid collisions on the air, the driver must know when the channel is
+busy. This is the task of the DCD circuitry/software. The driver may either
+utilise a software DCD algorithm (options=1) or use a DCD signal from
+the hardware (options=0).
+
+ser12: if software DCD is utilised, the radio's squelch should always be
+ open. It is highly recommended to use the software DCD algorithm,
+ as it is much faster than most hardware squelch circuitry. The
+ disadvantage is a slightly higher load on the system.
+
+par96: the software DCD algorithm for this type of modem is rather poor.
+ The modem simply does not provide enough information to implement
+ a reasonable DCD algorithm in software. Therefore, if your radio
+ feeds the DCD input of the PAR96 modem, the use of the hardware
+ DCD circuitry is recommended.
+
+picpar: the picpar modem features a builtin DCD hardware, which is highly
+ recommended.
+
+
+
+Compatibility with the rest of the Linux kernel
+
+The serial driver and the baycom serial drivers compete
+for the same hardware resources. Of course only one driver can access a given
+interface at a time. The serial driver grabs all interfaces it can find at
+startup time. Therefore the baycom drivers subsequently won't be able to
+access a serial port. You might therefore find it necessary to release
+a port owned by the serial driver with 'setserial /dev/ttyS# uart none', where
+# is the number of the interface. The baycom drivers do not reserve any
+ports at startup, unless one is specified on the 'insmod' command line. Another
+method to solve the problem is to compile all drivers as modules and
+leave it to kmod to load the correct driver depending on the application.
+
+The parallel port drivers (baycom_par, baycom_epp) now use the parport subsystem
+to arbitrate the ports between different client drivers.
+
+vy 73s de
+Tom Sailer, sailer@ife.ee.ethz.ch
+hb9jnx @ hb9w.ampr.org
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
new file mode 100644
index 000000000..83bf4986b
--- /dev/null
+++ b/Documentation/networking/bonding.txt
@@ -0,0 +1,2743 @@
+
+ Linux Ethernet Bonding Driver HOWTO
+
+ Latest update: 27 April 2011
+
+Initial release : Thomas Davis <tadavis at lbl.gov>
+Corrections, HA extensions : 2000/10/03-15 :
+ - Willy Tarreau <willy at meta-x.org>
+ - Constantine Gavrilov <const-g at xpert.com>
+ - Chad N. Tindel <ctindel at ieee dot org>
+ - Janice Girouard <girouard at us dot ibm dot com>
+ - Jay Vosburgh <fubar at us dot ibm dot com>
+
+Reorganized and updated Feb 2005 by Jay Vosburgh
+Added Sysfs information: 2006/04/24
+ - Mitch Williams <mitch.a.williams at intel.com>
+
+Introduction
+============
+
+ The Linux bonding driver provides a method for aggregating
+multiple network interfaces into a single logical "bonded" interface.
+The behavior of the bonded interfaces depends upon the mode; generally
+speaking, modes provide either hot standby or load balancing services.
+Additionally, link integrity monitoring may be performed.
+
+ The bonding driver originally came from Donald Becker's
+beowulf patches for kernel 2.0. It has changed quite a bit since, and
+the original tools from extreme-linux and beowulf sites will not work
+with this version of the driver.
+
+ For new versions of the driver, updated userspace tools, and
+who to ask for help, please follow the links at the end of this file.
+
+Table of Contents
+=================
+
+1. Bonding Driver Installation
+
+2. Bonding Driver Options
+
+3. Configuring Bonding Devices
+3.1 Configuration with Sysconfig Support
+3.1.1 Using DHCP with Sysconfig
+3.1.2 Configuring Multiple Bonds with Sysconfig
+3.2 Configuration with Initscripts Support
+3.2.1 Using DHCP with Initscripts
+3.2.2 Configuring Multiple Bonds with Initscripts
+3.3 Configuring Bonding Manually with Ifenslave
+3.3.1 Configuring Multiple Bonds Manually
+3.4 Configuring Bonding Manually via Sysfs
+3.5 Configuration with Interfaces Support
+3.6 Overriding Configuration for Special Cases
+
+4. Querying Bonding Configuration
+4.1 Bonding Configuration
+4.2 Network Configuration
+
+5. Switch Configuration
+
+6. 802.1q VLAN Support
+
+7. Link Monitoring
+7.1 ARP Monitor Operation
+7.2 Configuring Multiple ARP Targets
+7.3 MII Monitor Operation
+
+8. Potential Trouble Sources
+8.1 Adventures in Routing
+8.2 Ethernet Device Renaming
+8.3 Painfully Slow Or No Failed Link Detection By Miimon
+
+9. SNMP agents
+
+10. Promiscuous mode
+
+11. Configuring Bonding for High Availability
+11.1 High Availability in a Single Switch Topology
+11.2 High Availability in a Multiple Switch Topology
+11.2.1 HA Bonding Mode Selection for Multiple Switch Topology
+11.2.2 HA Link Monitoring for Multiple Switch Topology
+
+12. Configuring Bonding for Maximum Throughput
+12.1 Maximum Throughput in a Single Switch Topology
+12.1.1 MT Bonding Mode Selection for Single Switch Topology
+12.1.2 MT Link Monitoring for Single Switch Topology
+12.2 Maximum Throughput in a Multiple Switch Topology
+12.2.1 MT Bonding Mode Selection for Multiple Switch Topology
+12.2.2 MT Link Monitoring for Multiple Switch Topology
+
+13. Switch Behavior Issues
+13.1 Link Establishment and Failover Delays
+13.2 Duplicated Incoming Packets
+
+14. Hardware Specific Considerations
+14.1 IBM BladeCenter
+
+15. Frequently Asked Questions
+
+16. Resources and Links
+
+
+1. Bonding Driver Installation
+==============================
+
+ Most popular distro kernels ship with the bonding driver
+already available as a module. If your distro does not, or you
+have need to compile bonding from source (e.g., configuring and
+installing a mainline kernel from kernel.org), you'll need to perform
+the following steps:
+
+1.1 Configure and build the kernel with bonding
+-----------------------------------------------
+
+ The current version of the bonding driver is available in the
+drivers/net/bonding subdirectory of the most recent kernel source
+(which is available on http://kernel.org). Most users "rolling their
+own" will want to use the most recent kernel from kernel.org.
+
+ Configure kernel with "make menuconfig" (or "make xconfig" or
+"make config"), then select "Bonding driver support" in the "Network
+device support" section. It is recommended that you configure the
+driver as module since it is currently the only way to pass parameters
+to the driver or configure more than one bonding device.
+
+ Build and install the new kernel and modules.
+
+1.2 Bonding Control Utility
+-------------------------------------
+
+ It is recommended to configure bonding via iproute2 (netlink)
+or sysfs, the old ifenslave control utility is obsolete.
+
+2. Bonding Driver Options
+=========================
+
+ Options for the bonding driver are supplied as parameters to the
+bonding module at load time, or are specified via sysfs.
+
+ Module options may be given as command line arguments to the
+insmod or modprobe command, but are usually specified in either the
+/etc/modrobe.d/*.conf configuration files, or in a distro-specific
+configuration file (some of which are detailed in the next section).
+
+ Details on bonding support for sysfs is provided in the
+"Configuring Bonding Manually via Sysfs" section, below.
+
+ The available bonding driver parameters are listed below. If a
+parameter is not specified the default value is used. When initially
+configuring a bond, it is recommended "tail -f /var/log/messages" be
+run in a separate window to watch for bonding driver error messages.
+
+ It is critical that either the miimon or arp_interval and
+arp_ip_target parameters be specified, otherwise serious network
+degradation will occur during link failures. Very few devices do not
+support at least miimon, so there is really no reason not to use it.
+
+ Options with textual values will accept either the text name
+or, for backwards compatibility, the option value. E.g.,
+"mode=802.3ad" and "mode=4" set the same mode.
+
+ The parameters are as follows:
+
+active_slave
+
+ Specifies the new active slave for modes that support it
+ (active-backup, balance-alb and balance-tlb). Possible values
+ are the name of any currently enslaved interface, or an empty
+ string. If a name is given, the slave and its link must be up in order
+ to be selected as the new active slave. If an empty string is
+ specified, the current active slave is cleared, and a new active
+ slave is selected automatically.
+
+ Note that this is only available through the sysfs interface. No module
+ parameter by this name exists.
+
+ The normal value of this option is the name of the currently
+ active slave, or the empty string if there is no active slave or
+ the current mode does not use an active slave.
+
+ad_select
+
+ Specifies the 802.3ad aggregation selection logic to use. The
+ possible values and their effects are:
+
+ stable or 0
+
+ The active aggregator is chosen by largest aggregate
+ bandwidth.
+
+ Reselection of the active aggregator occurs only when all
+ slaves of the active aggregator are down or the active
+ aggregator has no slaves.
+
+ This is the default value.
+
+ bandwidth or 1
+
+ The active aggregator is chosen by largest aggregate
+ bandwidth. Reselection occurs if:
+
+ - A slave is added to or removed from the bond
+
+ - Any slave's link state changes
+
+ - Any slave's 802.3ad association state changes
+
+ - The bond's administrative state changes to up
+
+ count or 2
+
+ The active aggregator is chosen by the largest number of
+ ports (slaves). Reselection occurs as described under the
+ "bandwidth" setting, above.
+
+ The bandwidth and count selection policies permit failover of
+ 802.3ad aggregations when partial failure of the active aggregator
+ occurs. This keeps the aggregator with the highest availability
+ (either in bandwidth or in number of ports) active at all times.
+
+ This option was added in bonding version 3.4.0.
+
+all_slaves_active
+
+ Specifies that duplicate frames (received on inactive ports) should be
+ dropped (0) or delivered (1).
+
+ Normally, bonding will drop duplicate frames (received on inactive
+ ports), which is desirable for most users. But there are some times
+ it is nice to allow duplicate frames to be delivered.
+
+ The default value is 0 (drop duplicate frames received on inactive
+ ports).
+
+arp_interval
+
+ Specifies the ARP link monitoring frequency in milliseconds.
+
+ The ARP monitor works by periodically checking the slave
+ devices to determine whether they have sent or received
+ traffic recently (the precise criteria depends upon the
+ bonding mode, and the state of the slave). Regular traffic is
+ generated via ARP probes issued for the addresses specified by
+ the arp_ip_target option.
+
+ This behavior can be modified by the arp_validate option,
+ below.
+
+ If ARP monitoring is used in an etherchannel compatible mode
+ (modes 0 and 2), the switch should be configured in a mode
+ that evenly distributes packets across all links. If the
+ switch is configured to distribute the packets in an XOR
+ fashion, all replies from the ARP targets will be received on
+ the same link which could cause the other team members to
+ fail. ARP monitoring should not be used in conjunction with
+ miimon. A value of 0 disables ARP monitoring. The default
+ value is 0.
+
+arp_ip_target
+
+ Specifies the IP addresses to use as ARP monitoring peers when
+ arp_interval is > 0. These are the targets of the ARP request
+ sent to determine the health of the link to the targets.
+ Specify these values in ddd.ddd.ddd.ddd format. Multiple IP
+ addresses must be separated by a comma. At least one IP
+ address must be given for ARP monitoring to function. The
+ maximum number of targets that can be specified is 16. The
+ default value is no IP addresses.
+
+arp_validate
+
+ Specifies whether or not ARP probes and replies should be
+ validated in any mode that supports arp monitoring, or whether
+ non-ARP traffic should be filtered (disregarded) for link
+ monitoring purposes.
+
+ Possible values are:
+
+ none or 0
+
+ No validation or filtering is performed.
+
+ active or 1
+
+ Validation is performed only for the active slave.
+
+ backup or 2
+
+ Validation is performed only for backup slaves.
+
+ all or 3
+
+ Validation is performed for all slaves.
+
+ filter or 4
+
+ Filtering is applied to all slaves. No validation is
+ performed.
+
+ filter_active or 5
+
+ Filtering is applied to all slaves, validation is performed
+ only for the active slave.
+
+ filter_backup or 6
+
+ Filtering is applied to all slaves, validation is performed
+ only for backup slaves.
+
+ Validation:
+
+ Enabling validation causes the ARP monitor to examine the incoming
+ ARP requests and replies, and only consider a slave to be up if it
+ is receiving the appropriate ARP traffic.
+
+ For an active slave, the validation checks ARP replies to confirm
+ that they were generated by an arp_ip_target. Since backup slaves
+ do not typically receive these replies, the validation performed
+ for backup slaves is on the broadcast ARP request sent out via the
+ active slave. It is possible that some switch or network
+ configurations may result in situations wherein the backup slaves
+ do not receive the ARP requests; in such a situation, validation
+ of backup slaves must be disabled.
+
+ The validation of ARP requests on backup slaves is mainly helping
+ bonding to decide which slaves are more likely to work in case of
+ the active slave failure, it doesn't really guarantee that the
+ backup slave will work if it's selected as the next active slave.
+
+ Validation is useful in network configurations in which multiple
+ bonding hosts are concurrently issuing ARPs to one or more targets
+ beyond a common switch. Should the link between the switch and
+ target fail (but not the switch itself), the probe traffic
+ generated by the multiple bonding instances will fool the standard
+ ARP monitor into considering the links as still up. Use of
+ validation can resolve this, as the ARP monitor will only consider
+ ARP requests and replies associated with its own instance of
+ bonding.
+
+ Filtering:
+
+ Enabling filtering causes the ARP monitor to only use incoming ARP
+ packets for link availability purposes. Arriving packets that are
+ not ARPs are delivered normally, but do not count when determining
+ if a slave is available.
+
+ Filtering operates by only considering the reception of ARP
+ packets (any ARP packet, regardless of source or destination) when
+ determining if a slave has received traffic for link availability
+ purposes.
+
+ Filtering is useful in network configurations in which significant
+ levels of third party broadcast traffic would fool the standard
+ ARP monitor into considering the links as still up. Use of
+ filtering can resolve this, as only ARP traffic is considered for
+ link availability purposes.
+
+ This option was added in bonding version 3.1.0.
+
+arp_all_targets
+
+ Specifies the quantity of arp_ip_targets that must be reachable
+ in order for the ARP monitor to consider a slave as being up.
+ This option affects only active-backup mode for slaves with
+ arp_validation enabled.
+
+ Possible values are:
+
+ any or 0
+
+ consider the slave up only when any of the arp_ip_targets
+ is reachable
+
+ all or 1
+
+ consider the slave up only when all of the arp_ip_targets
+ are reachable
+
+downdelay
+
+ Specifies the time, in milliseconds, to wait before disabling
+ a slave after a link failure has been detected. This option
+ is only valid for the miimon link monitor. The downdelay
+ value should be a multiple of the miimon value; if not, it
+ will be rounded down to the nearest multiple. The default
+ value is 0.
+
+fail_over_mac
+
+ Specifies whether active-backup mode should set all slaves to
+ the same MAC address at enslavement (the traditional
+ behavior), or, when enabled, perform special handling of the
+ bond's MAC address in accordance with the selected policy.
+
+ Possible values are:
+
+ none or 0
+
+ This setting disables fail_over_mac, and causes
+ bonding to set all slaves of an active-backup bond to
+ the same MAC address at enslavement time. This is the
+ default.
+
+ active or 1
+
+ The "active" fail_over_mac policy indicates that the
+ MAC address of the bond should always be the MAC
+ address of the currently active slave. The MAC
+ address of the slaves is not changed; instead, the MAC
+ address of the bond changes during a failover.
+
+ This policy is useful for devices that cannot ever
+ alter their MAC address, or for devices that refuse
+ incoming broadcasts with their own source MAC (which
+ interferes with the ARP monitor).
+
+ The down side of this policy is that every device on
+ the network must be updated via gratuitous ARP,
+ vs. just updating a switch or set of switches (which
+ often takes place for any traffic, not just ARP
+ traffic, if the switch snoops incoming traffic to
+ update its tables) for the traditional method. If the
+ gratuitous ARP is lost, communication may be
+ disrupted.
+
+ When this policy is used in conjunction with the mii
+ monitor, devices which assert link up prior to being
+ able to actually transmit and receive are particularly
+ susceptible to loss of the gratuitous ARP, and an
+ appropriate updelay setting may be required.
+
+ follow or 2
+
+ The "follow" fail_over_mac policy causes the MAC
+ address of the bond to be selected normally (normally
+ the MAC address of the first slave added to the bond).
+ However, the second and subsequent slaves are not set
+ to this MAC address while they are in a backup role; a
+ slave is programmed with the bond's MAC address at
+ failover time (and the formerly active slave receives
+ the newly active slave's MAC address).
+
+ This policy is useful for multiport devices that
+ either become confused or incur a performance penalty
+ when multiple ports are programmed with the same MAC
+ address.
+
+
+ The default policy is none, unless the first slave cannot
+ change its MAC address, in which case the active policy is
+ selected by default.
+
+ This option may be modified via sysfs only when no slaves are
+ present in the bond.
+
+ This option was added in bonding version 3.2.0. The "follow"
+ policy was added in bonding version 3.3.0.
+
+lacp_rate
+
+ Option specifying the rate in which we'll ask our link partner
+ to transmit LACPDU packets in 802.3ad mode. Possible values
+ are:
+
+ slow or 0
+ Request partner to transmit LACPDUs every 30 seconds
+
+ fast or 1
+ Request partner to transmit LACPDUs every 1 second
+
+ The default is slow.
+
+max_bonds
+
+ Specifies the number of bonding devices to create for this
+ instance of the bonding driver. E.g., if max_bonds is 3, and
+ the bonding driver is not already loaded, then bond0, bond1
+ and bond2 will be created. The default value is 1. Specifying
+ a value of 0 will load bonding, but will not create any devices.
+
+miimon
+
+ Specifies the MII link monitoring frequency in milliseconds.
+ This determines how often the link state of each slave is
+ inspected for link failures. A value of zero disables MII
+ link monitoring. A value of 100 is a good starting point.
+ The use_carrier option, below, affects how the link state is
+ determined. See the High Availability section for additional
+ information. The default value is 0.
+
+min_links
+
+ Specifies the minimum number of links that must be active before
+ asserting carrier. It is similar to the Cisco EtherChannel min-links
+ feature. This allows setting the minimum number of member ports that
+ must be up (link-up state) before marking the bond device as up
+ (carrier on). This is useful for situations where higher level services
+ such as clustering want to ensure a minimum number of low bandwidth
+ links are active before switchover. This option only affect 802.3ad
+ mode.
+
+ The default value is 0. This will cause carrier to be asserted (for
+ 802.3ad mode) whenever there is an active aggregator, regardless of the
+ number of available links in that aggregator. Note that, because an
+ aggregator cannot be active without at least one available link,
+ setting this option to 0 or to 1 has the exact same effect.
+
+mode
+
+ Specifies one of the bonding policies. The default is
+ balance-rr (round robin). Possible values are:
+
+ balance-rr or 0
+
+ Round-robin policy: Transmit packets in sequential
+ order from the first available slave through the
+ last. This mode provides load balancing and fault
+ tolerance.
+
+ active-backup or 1
+
+ Active-backup policy: Only one slave in the bond is
+ active. A different slave becomes active if, and only
+ if, the active slave fails. The bond's MAC address is
+ externally visible on only one port (network adapter)
+ to avoid confusing the switch.
+
+ In bonding version 2.6.2 or later, when a failover
+ occurs in active-backup mode, bonding will issue one
+ or more gratuitous ARPs on the newly active slave.
+ One gratuitous ARP is issued for the bonding master
+ interface and each VLAN interfaces configured above
+ it, provided that the interface has at least one IP
+ address configured. Gratuitous ARPs issued for VLAN
+ interfaces are tagged with the appropriate VLAN id.
+
+ This mode provides fault tolerance. The primary
+ option, documented below, affects the behavior of this
+ mode.
+
+ balance-xor or 2
+
+ XOR policy: Transmit based on the selected transmit
+ hash policy. The default policy is a simple [(source
+ MAC address XOR'd with destination MAC address XOR
+ packet type ID) modulo slave count]. Alternate transmit
+ policies may be selected via the xmit_hash_policy option,
+ described below.
+
+ This mode provides load balancing and fault tolerance.
+
+ broadcast or 3
+
+ Broadcast policy: transmits everything on all slave
+ interfaces. This mode provides fault tolerance.
+
+ 802.3ad or 4
+
+ IEEE 802.3ad Dynamic link aggregation. Creates
+ aggregation groups that share the same speed and
+ duplex settings. Utilizes all slaves in the active
+ aggregator according to the 802.3ad specification.
+
+ Slave selection for outgoing traffic is done according
+ to the transmit hash policy, which may be changed from
+ the default simple XOR policy via the xmit_hash_policy
+ option, documented below. Note that not all transmit
+ policies may be 802.3ad compliant, particularly in
+ regards to the packet mis-ordering requirements of
+ section 43.2.4 of the 802.3ad standard. Differing
+ peer implementations will have varying tolerances for
+ noncompliance.
+
+ Prerequisites:
+
+ 1. Ethtool support in the base drivers for retrieving
+ the speed and duplex of each slave.
+
+ 2. A switch that supports IEEE 802.3ad Dynamic link
+ aggregation.
+
+ Most switches will require some type of configuration
+ to enable 802.3ad mode.
+
+ balance-tlb or 5
+
+ Adaptive transmit load balancing: channel bonding that
+ does not require any special switch support.
+
+ In tlb_dynamic_lb=1 mode; the outgoing traffic is
+ distributed according to the current load (computed
+ relative to the speed) on each slave.
+
+ In tlb_dynamic_lb=0 mode; the load balancing based on
+ current load is disabled and the load is distributed
+ only using the hash distribution.
+
+ Incoming traffic is received by the current slave.
+ If the receiving slave fails, another slave takes over
+ the MAC address of the failed receiving slave.
+
+ Prerequisite:
+
+ Ethtool support in the base drivers for retrieving the
+ speed of each slave.
+
+ balance-alb or 6
+
+ Adaptive load balancing: includes balance-tlb plus
+ receive load balancing (rlb) for IPV4 traffic, and
+ does not require any special switch support. The
+ receive load balancing is achieved by ARP negotiation.
+ The bonding driver intercepts the ARP Replies sent by
+ the local system on their way out and overwrites the
+ source hardware address with the unique hardware
+ address of one of the slaves in the bond such that
+ different peers use different hardware addresses for
+ the server.
+
+ Receive traffic from connections created by the server
+ is also balanced. When the local system sends an ARP
+ Request the bonding driver copies and saves the peer's
+ IP information from the ARP packet. When the ARP
+ Reply arrives from the peer, its hardware address is
+ retrieved and the bonding driver initiates an ARP
+ reply to this peer assigning it to one of the slaves
+ in the bond. A problematic outcome of using ARP
+ negotiation for balancing is that each time that an
+ ARP request is broadcast it uses the hardware address
+ of the bond. Hence, peers learn the hardware address
+ of the bond and the balancing of receive traffic
+ collapses to the current slave. This is handled by
+ sending updates (ARP Replies) to all the peers with
+ their individually assigned hardware address such that
+ the traffic is redistributed. Receive traffic is also
+ redistributed when a new slave is added to the bond
+ and when an inactive slave is re-activated. The
+ receive load is distributed sequentially (round robin)
+ among the group of highest speed slaves in the bond.
+
+ When a link is reconnected or a new slave joins the
+ bond the receive traffic is redistributed among all
+ active slaves in the bond by initiating ARP Replies
+ with the selected MAC address to each of the
+ clients. The updelay parameter (detailed below) must
+ be set to a value equal or greater than the switch's
+ forwarding delay so that the ARP Replies sent to the
+ peers will not be blocked by the switch.
+
+ Prerequisites:
+
+ 1. Ethtool support in the base drivers for retrieving
+ the speed of each slave.
+
+ 2. Base driver support for setting the hardware
+ address of a device while it is open. This is
+ required so that there will always be one slave in the
+ team using the bond hardware address (the
+ curr_active_slave) while having a unique hardware
+ address for each slave in the bond. If the
+ curr_active_slave fails its hardware address is
+ swapped with the new curr_active_slave that was
+ chosen.
+
+num_grat_arp
+num_unsol_na
+
+ Specify the number of peer notifications (gratuitous ARPs and
+ unsolicited IPv6 Neighbor Advertisements) to be issued after a
+ failover event. As soon as the link is up on the new slave
+ (possibly immediately) a peer notification is sent on the
+ bonding device and each VLAN sub-device. This is repeated at
+ each link monitor interval (arp_interval or miimon, whichever
+ is active) if the number is greater than 1.
+
+ The valid range is 0 - 255; the default value is 1. These options
+ affect only the active-backup mode. These options were added for
+ bonding versions 3.3.0 and 3.4.0 respectively.
+
+ From Linux 3.0 and bonding version 3.7.1, these notifications
+ are generated by the ipv4 and ipv6 code and the numbers of
+ repetitions cannot be set independently.
+
+packets_per_slave
+
+ Specify the number of packets to transmit through a slave before
+ moving to the next one. When set to 0 then a slave is chosen at
+ random.
+
+ The valid range is 0 - 65535; the default value is 1. This option
+ has effect only in balance-rr mode.
+
+primary
+
+ A string (eth0, eth2, etc) specifying which slave is the
+ primary device. The specified device will always be the
+ active slave while it is available. Only when the primary is
+ off-line will alternate devices be used. This is useful when
+ one slave is preferred over another, e.g., when one slave has
+ higher throughput than another.
+
+ The primary option is only valid for active-backup(1),
+ balance-tlb (5) and balance-alb (6) mode.
+
+primary_reselect
+
+ Specifies the reselection policy for the primary slave. This
+ affects how the primary slave is chosen to become the active slave
+ when failure of the active slave or recovery of the primary slave
+ occurs. This option is designed to prevent flip-flopping between
+ the primary slave and other slaves. Possible values are:
+
+ always or 0 (default)
+
+ The primary slave becomes the active slave whenever it
+ comes back up.
+
+ better or 1
+
+ The primary slave becomes the active slave when it comes
+ back up, if the speed and duplex of the primary slave is
+ better than the speed and duplex of the current active
+ slave.
+
+ failure or 2
+
+ The primary slave becomes the active slave only if the
+ current active slave fails and the primary slave is up.
+
+ The primary_reselect setting is ignored in two cases:
+
+ If no slaves are active, the first slave to recover is
+ made the active slave.
+
+ When initially enslaved, the primary slave is always made
+ the active slave.
+
+ Changing the primary_reselect policy via sysfs will cause an
+ immediate selection of the best active slave according to the new
+ policy. This may or may not result in a change of the active
+ slave, depending upon the circumstances.
+
+ This option was added for bonding version 3.6.0.
+
+tlb_dynamic_lb
+
+ Specifies if dynamic shuffling of flows is enabled in tlb
+ mode. The value has no effect on any other modes.
+
+ The default behavior of tlb mode is to shuffle active flows across
+ slaves based on the load in that interval. This gives nice lb
+ characteristics but can cause packet reordering. If re-ordering is
+ a concern use this variable to disable flow shuffling and rely on
+ load balancing provided solely by the hash distribution.
+ xmit-hash-policy can be used to select the appropriate hashing for
+ the setup.
+
+ The sysfs entry can be used to change the setting per bond device
+ and the initial value is derived from the module parameter. The
+ sysfs entry is allowed to be changed only if the bond device is
+ down.
+
+ The default value is "1" that enables flow shuffling while value "0"
+ disables it. This option was added in bonding driver 3.7.1
+
+
+updelay
+
+ Specifies the time, in milliseconds, to wait before enabling a
+ slave after a link recovery has been detected. This option is
+ only valid for the miimon link monitor. The updelay value
+ should be a multiple of the miimon value; if not, it will be
+ rounded down to the nearest multiple. The default value is 0.
+
+use_carrier
+
+ Specifies whether or not miimon should use MII or ETHTOOL
+ ioctls vs. netif_carrier_ok() to determine the link
+ status. The MII or ETHTOOL ioctls are less efficient and
+ utilize a deprecated calling sequence within the kernel. The
+ netif_carrier_ok() relies on the device driver to maintain its
+ state with netif_carrier_on/off; at this writing, most, but
+ not all, device drivers support this facility.
+
+ If bonding insists that the link is up when it should not be,
+ it may be that your network device driver does not support
+ netif_carrier_on/off. The default state for netif_carrier is
+ "carrier on," so if a driver does not support netif_carrier,
+ it will appear as if the link is always up. In this case,
+ setting use_carrier to 0 will cause bonding to revert to the
+ MII / ETHTOOL ioctl method to determine the link state.
+
+ A value of 1 enables the use of netif_carrier_ok(), a value of
+ 0 will use the deprecated MII / ETHTOOL ioctls. The default
+ value is 1.
+
+xmit_hash_policy
+
+ Selects the transmit hash policy to use for slave selection in
+ balance-xor, 802.3ad, and tlb modes. Possible values are:
+
+ layer2
+
+ Uses XOR of hardware MAC addresses and packet type ID
+ field to generate the hash. The formula is
+
+ hash = source MAC XOR destination MAC XOR packet type ID
+ slave number = hash modulo slave count
+
+ This algorithm will place all traffic to a particular
+ network peer on the same slave.
+
+ This algorithm is 802.3ad compliant.
+
+ layer2+3
+
+ This policy uses a combination of layer2 and layer3
+ protocol information to generate the hash.
+
+ Uses XOR of hardware MAC addresses and IP addresses to
+ generate the hash. The formula is
+
+ hash = source MAC XOR destination MAC XOR packet type ID
+ hash = hash XOR source IP XOR destination IP
+ hash = hash XOR (hash RSHIFT 16)
+ hash = hash XOR (hash RSHIFT 8)
+ And then hash is reduced modulo slave count.
+
+ If the protocol is IPv6 then the source and destination
+ addresses are first hashed using ipv6_addr_hash.
+
+ This algorithm will place all traffic to a particular
+ network peer on the same slave. For non-IP traffic,
+ the formula is the same as for the layer2 transmit
+ hash policy.
+
+ This policy is intended to provide a more balanced
+ distribution of traffic than layer2 alone, especially
+ in environments where a layer3 gateway device is
+ required to reach most destinations.
+
+ This algorithm is 802.3ad compliant.
+
+ layer3+4
+
+ This policy uses upper layer protocol information,
+ when available, to generate the hash. This allows for
+ traffic to a particular network peer to span multiple
+ slaves, although a single connection will not span
+ multiple slaves.
+
+ The formula for unfragmented TCP and UDP packets is
+
+ hash = source port, destination port (as in the header)
+ hash = hash XOR source IP XOR destination IP
+ hash = hash XOR (hash RSHIFT 16)
+ hash = hash XOR (hash RSHIFT 8)
+ And then hash is reduced modulo slave count.
+
+ If the protocol is IPv6 then the source and destination
+ addresses are first hashed using ipv6_addr_hash.
+
+ For fragmented TCP or UDP packets and all other IPv4 and
+ IPv6 protocol traffic, the source and destination port
+ information is omitted. For non-IP traffic, the
+ formula is the same as for the layer2 transmit hash
+ policy.
+
+ This algorithm is not fully 802.3ad compliant. A
+ single TCP or UDP conversation containing both
+ fragmented and unfragmented packets will see packets
+ striped across two interfaces. This may result in out
+ of order delivery. Most traffic types will not meet
+ this criteria, as TCP rarely fragments traffic, and
+ most UDP traffic is not involved in extended
+ conversations. Other implementations of 802.3ad may
+ or may not tolerate this noncompliance.
+
+ encap2+3
+
+ This policy uses the same formula as layer2+3 but it
+ relies on skb_flow_dissect to obtain the header fields
+ which might result in the use of inner headers if an
+ encapsulation protocol is used. For example this will
+ improve the performance for tunnel users because the
+ packets will be distributed according to the encapsulated
+ flows.
+
+ encap3+4
+
+ This policy uses the same formula as layer3+4 but it
+ relies on skb_flow_dissect to obtain the header fields
+ which might result in the use of inner headers if an
+ encapsulation protocol is used. For example this will
+ improve the performance for tunnel users because the
+ packets will be distributed according to the encapsulated
+ flows.
+
+ The default value is layer2. This option was added in bonding
+ version 2.6.3. In earlier versions of bonding, this parameter
+ does not exist, and the layer2 policy is the only policy. The
+ layer2+3 value was added for bonding version 3.2.2.
+
+resend_igmp
+
+ Specifies the number of IGMP membership reports to be issued after
+ a failover event. One membership report is issued immediately after
+ the failover, subsequent packets are sent in each 200ms interval.
+
+ The valid range is 0 - 255; the default value is 1. A value of 0
+ prevents the IGMP membership report from being issued in response
+ to the failover event.
+
+ This option is useful for bonding modes balance-rr (0), active-backup
+ (1), balance-tlb (5) and balance-alb (6), in which a failover can
+ switch the IGMP traffic from one slave to another. Therefore a fresh
+ IGMP report must be issued to cause the switch to forward the incoming
+ IGMP traffic over the newly selected slave.
+
+ This option was added for bonding version 3.7.0.
+
+lp_interval
+
+ Specifies the number of seconds between instances where the bonding
+ driver sends learning packets to each slaves peer switch.
+
+ The valid range is 1 - 0x7fffffff; the default value is 1. This Option
+ has effect only in balance-tlb and balance-alb modes.
+
+3. Configuring Bonding Devices
+==============================
+
+ You can configure bonding using either your distro's network
+initialization scripts, or manually using either iproute2 or the
+sysfs interface. Distros generally use one of three packages for the
+network initialization scripts: initscripts, sysconfig or interfaces.
+Recent versions of these packages have support for bonding, while older
+versions do not.
+
+ We will first describe the options for configuring bonding for
+distros using versions of initscripts, sysconfig and interfaces with full
+or partial support for bonding, then provide information on enabling
+bonding without support from the network initialization scripts (i.e.,
+older versions of initscripts or sysconfig).
+
+ If you're unsure whether your distro uses sysconfig,
+initscripts or interfaces, or don't know if it's new enough, have no fear.
+Determining this is fairly straightforward.
+
+ First, look for a file called interfaces in /etc/network directory.
+If this file is present in your system, then your system use interfaces. See
+Configuration with Interfaces Support.
+
+ Else, issue the command:
+
+$ rpm -qf /sbin/ifup
+
+ It will respond with a line of text starting with either
+"initscripts" or "sysconfig," followed by some numbers. This is the
+package that provides your network initialization scripts.
+
+ Next, to determine if your installation supports bonding,
+issue the command:
+
+$ grep ifenslave /sbin/ifup
+
+ If this returns any matches, then your initscripts or
+sysconfig has support for bonding.
+
+3.1 Configuration with Sysconfig Support
+----------------------------------------
+
+ This section applies to distros using a version of sysconfig
+with bonding support, for example, SuSE Linux Enterprise Server 9.
+
+ SuSE SLES 9's networking configuration system does support
+bonding, however, at this writing, the YaST system configuration
+front end does not provide any means to work with bonding devices.
+Bonding devices can be managed by hand, however, as follows.
+
+ First, if they have not already been configured, configure the
+slave devices. On SLES 9, this is most easily done by running the
+yast2 sysconfig configuration utility. The goal is for to create an
+ifcfg-id file for each slave device. The simplest way to accomplish
+this is to configure the devices for DHCP (this is only to get the
+file ifcfg-id file created; see below for some issues with DHCP). The
+name of the configuration file for each device will be of the form:
+
+ifcfg-id-xx:xx:xx:xx:xx:xx
+
+ Where the "xx" portion will be replaced with the digits from
+the device's permanent MAC address.
+
+ Once the set of ifcfg-id-xx:xx:xx:xx:xx:xx files has been
+created, it is necessary to edit the configuration files for the slave
+devices (the MAC addresses correspond to those of the slave devices).
+Before editing, the file will contain multiple lines, and will look
+something like this:
+
+BOOTPROTO='dhcp'
+STARTMODE='on'
+USERCTL='no'
+UNIQUE='XNzu.WeZGOGF+4wE'
+_nm_name='bus-pci-0001:61:01.0'
+
+ Change the BOOTPROTO and STARTMODE lines to the following:
+
+BOOTPROTO='none'
+STARTMODE='off'
+
+ Do not alter the UNIQUE or _nm_name lines. Remove any other
+lines (USERCTL, etc).
+
+ Once the ifcfg-id-xx:xx:xx:xx:xx:xx files have been modified,
+it's time to create the configuration file for the bonding device
+itself. This file is named ifcfg-bondX, where X is the number of the
+bonding device to create, starting at 0. The first such file is
+ifcfg-bond0, the second is ifcfg-bond1, and so on. The sysconfig
+network configuration system will correctly start multiple instances
+of bonding.
+
+ The contents of the ifcfg-bondX file is as follows:
+
+BOOTPROTO="static"
+BROADCAST="10.0.2.255"
+IPADDR="10.0.2.10"
+NETMASK="255.255.0.0"
+NETWORK="10.0.2.0"
+REMOTE_IPADDR=""
+STARTMODE="onboot"
+BONDING_MASTER="yes"
+BONDING_MODULE_OPTS="mode=active-backup miimon=100"
+BONDING_SLAVE0="eth0"
+BONDING_SLAVE1="bus-pci-0000:06:08.1"
+
+ Replace the sample BROADCAST, IPADDR, NETMASK and NETWORK
+values with the appropriate values for your network.
+
+ The STARTMODE specifies when the device is brought online.
+The possible values are:
+
+ onboot: The device is started at boot time. If you're not
+ sure, this is probably what you want.
+
+ manual: The device is started only when ifup is called
+ manually. Bonding devices may be configured this
+ way if you do not wish them to start automatically
+ at boot for some reason.
+
+ hotplug: The device is started by a hotplug event. This is not
+ a valid choice for a bonding device.
+
+ off or ignore: The device configuration is ignored.
+
+ The line BONDING_MASTER='yes' indicates that the device is a
+bonding master device. The only useful value is "yes."
+
+ The contents of BONDING_MODULE_OPTS are supplied to the
+instance of the bonding module for this device. Specify the options
+for the bonding mode, link monitoring, and so on here. Do not include
+the max_bonds bonding parameter; this will confuse the configuration
+system if you have multiple bonding devices.
+
+ Finally, supply one BONDING_SLAVEn="slave device" for each
+slave. where "n" is an increasing value, one for each slave. The
+"slave device" is either an interface name, e.g., "eth0", or a device
+specifier for the network device. The interface name is easier to
+find, but the ethN names are subject to change at boot time if, e.g.,
+a device early in the sequence has failed. The device specifiers
+(bus-pci-0000:06:08.1 in the example above) specify the physical
+network device, and will not change unless the device's bus location
+changes (for example, it is moved from one PCI slot to another). The
+example above uses one of each type for demonstration purposes; most
+configurations will choose one or the other for all slave devices.
+
+ When all configuration files have been modified or created,
+networking must be restarted for the configuration changes to take
+effect. This can be accomplished via the following:
+
+# /etc/init.d/network restart
+
+ Note that the network control script (/sbin/ifdown) will
+remove the bonding module as part of the network shutdown processing,
+so it is not necessary to remove the module by hand if, e.g., the
+module parameters have changed.
+
+ Also, at this writing, YaST/YaST2 will not manage bonding
+devices (they do not show bonding interfaces on its list of network
+devices). It is necessary to edit the configuration file by hand to
+change the bonding configuration.
+
+ Additional general options and details of the ifcfg file
+format can be found in an example ifcfg template file:
+
+/etc/sysconfig/network/ifcfg.template
+
+ Note that the template does not document the various BONDING_
+settings described above, but does describe many of the other options.
+
+3.1.1 Using DHCP with Sysconfig
+-------------------------------
+
+ Under sysconfig, configuring a device with BOOTPROTO='dhcp'
+will cause it to query DHCP for its IP address information. At this
+writing, this does not function for bonding devices; the scripts
+attempt to obtain the device address from DHCP prior to adding any of
+the slave devices. Without active slaves, the DHCP requests are not
+sent to the network.
+
+3.1.2 Configuring Multiple Bonds with Sysconfig
+-----------------------------------------------
+
+ The sysconfig network initialization system is capable of
+handling multiple bonding devices. All that is necessary is for each
+bonding instance to have an appropriately configured ifcfg-bondX file
+(as described above). Do not specify the "max_bonds" parameter to any
+instance of bonding, as this will confuse sysconfig. If you require
+multiple bonding devices with identical parameters, create multiple
+ifcfg-bondX files.
+
+ Because the sysconfig scripts supply the bonding module
+options in the ifcfg-bondX file, it is not necessary to add them to
+the system /etc/modules.d/*.conf configuration files.
+
+3.2 Configuration with Initscripts Support
+------------------------------------------
+
+ This section applies to distros using a recent version of
+initscripts with bonding support, for example, Red Hat Enterprise Linux
+version 3 or later, Fedora, etc. On these systems, the network
+initialization scripts have knowledge of bonding, and can be configured to
+control bonding devices. Note that older versions of the initscripts
+package have lower levels of support for bonding; this will be noted where
+applicable.
+
+ These distros will not automatically load the network adapter
+driver unless the ethX device is configured with an IP address.
+Because of this constraint, users must manually configure a
+network-script file for all physical adapters that will be members of
+a bondX link. Network script files are located in the directory:
+
+/etc/sysconfig/network-scripts
+
+ The file name must be prefixed with "ifcfg-eth" and suffixed
+with the adapter's physical adapter number. For example, the script
+for eth0 would be named /etc/sysconfig/network-scripts/ifcfg-eth0.
+Place the following text in the file:
+
+DEVICE=eth0
+USERCTL=no
+ONBOOT=yes
+MASTER=bond0
+SLAVE=yes
+BOOTPROTO=none
+
+ The DEVICE= line will be different for every ethX device and
+must correspond with the name of the file, i.e., ifcfg-eth1 must have
+a device line of DEVICE=eth1. The setting of the MASTER= line will
+also depend on the final bonding interface name chosen for your bond.
+As with other network devices, these typically start at 0, and go up
+one for each device, i.e., the first bonding instance is bond0, the
+second is bond1, and so on.
+
+ Next, create a bond network script. The file name for this
+script will be /etc/sysconfig/network-scripts/ifcfg-bondX where X is
+the number of the bond. For bond0 the file is named "ifcfg-bond0",
+for bond1 it is named "ifcfg-bond1", and so on. Within that file,
+place the following text:
+
+DEVICE=bond0
+IPADDR=192.168.1.1
+NETMASK=255.255.255.0
+NETWORK=192.168.1.0
+BROADCAST=192.168.1.255
+ONBOOT=yes
+BOOTPROTO=none
+USERCTL=no
+
+ Be sure to change the networking specific lines (IPADDR,
+NETMASK, NETWORK and BROADCAST) to match your network configuration.
+
+ For later versions of initscripts, such as that found with Fedora
+7 (or later) and Red Hat Enterprise Linux version 5 (or later), it is possible,
+and, indeed, preferable, to specify the bonding options in the ifcfg-bond0
+file, e.g. a line of the format:
+
+BONDING_OPTS="mode=active-backup arp_interval=60 arp_ip_target=192.168.1.254"
+
+ will configure the bond with the specified options. The options
+specified in BONDING_OPTS are identical to the bonding module parameters
+except for the arp_ip_target field when using versions of initscripts older
+than and 8.57 (Fedora 8) and 8.45.19 (Red Hat Enterprise Linux 5.2). When
+using older versions each target should be included as a separate option and
+should be preceded by a '+' to indicate it should be added to the list of
+queried targets, e.g.,
+
+ arp_ip_target=+192.168.1.1 arp_ip_target=+192.168.1.2
+
+ is the proper syntax to specify multiple targets. When specifying
+options via BONDING_OPTS, it is not necessary to edit /etc/modprobe.d/*.conf.
+
+ For even older versions of initscripts that do not support
+BONDING_OPTS, it is necessary to edit /etc/modprobe.d/*.conf, depending upon
+your distro) to load the bonding module with your desired options when the
+bond0 interface is brought up. The following lines in /etc/modprobe.d/*.conf
+will load the bonding module, and select its options:
+
+alias bond0 bonding
+options bond0 mode=balance-alb miimon=100
+
+ Replace the sample parameters with the appropriate set of
+options for your configuration.
+
+ Finally run "/etc/rc.d/init.d/network restart" as root. This
+will restart the networking subsystem and your bond link should be now
+up and running.
+
+3.2.1 Using DHCP with Initscripts
+---------------------------------
+
+ Recent versions of initscripts (the versions supplied with Fedora
+Core 3 and Red Hat Enterprise Linux 4, or later versions, are reported to
+work) have support for assigning IP information to bonding devices via
+DHCP.
+
+ To configure bonding for DHCP, configure it as described
+above, except replace the line "BOOTPROTO=none" with "BOOTPROTO=dhcp"
+and add a line consisting of "TYPE=Bonding". Note that the TYPE value
+is case sensitive.
+
+3.2.2 Configuring Multiple Bonds with Initscripts
+-------------------------------------------------
+
+ Initscripts packages that are included with Fedora 7 and Red Hat
+Enterprise Linux 5 support multiple bonding interfaces by simply
+specifying the appropriate BONDING_OPTS= in ifcfg-bondX where X is the
+number of the bond. This support requires sysfs support in the kernel,
+and a bonding driver of version 3.0.0 or later. Other configurations may
+not support this method for specifying multiple bonding interfaces; for
+those instances, see the "Configuring Multiple Bonds Manually" section,
+below.
+
+3.3 Configuring Bonding Manually with iproute2
+-----------------------------------------------
+
+ This section applies to distros whose network initialization
+scripts (the sysconfig or initscripts package) do not have specific
+knowledge of bonding. One such distro is SuSE Linux Enterprise Server
+version 8.
+
+ The general method for these systems is to place the bonding
+module parameters into a config file in /etc/modprobe.d/ (as
+appropriate for the installed distro), then add modprobe and/or
+`ip link` commands to the system's global init script. The name of
+the global init script differs; for sysconfig, it is
+/etc/init.d/boot.local and for initscripts it is /etc/rc.d/rc.local.
+
+ For example, if you wanted to make a simple bond of two e100
+devices (presumed to be eth0 and eth1), and have it persist across
+reboots, edit the appropriate file (/etc/init.d/boot.local or
+/etc/rc.d/rc.local), and add the following:
+
+modprobe bonding mode=balance-alb miimon=100
+modprobe e100
+ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up
+ip link set eth0 master bond0
+ip link set eth1 master bond0
+
+ Replace the example bonding module parameters and bond0
+network configuration (IP address, netmask, etc) with the appropriate
+values for your configuration.
+
+ Unfortunately, this method will not provide support for the
+ifup and ifdown scripts on the bond devices. To reload the bonding
+configuration, it is necessary to run the initialization script, e.g.,
+
+# /etc/init.d/boot.local
+
+ or
+
+# /etc/rc.d/rc.local
+
+ It may be desirable in such a case to create a separate script
+which only initializes the bonding configuration, then call that
+separate script from within boot.local. This allows for bonding to be
+enabled without re-running the entire global init script.
+
+ To shut down the bonding devices, it is necessary to first
+mark the bonding device itself as being down, then remove the
+appropriate device driver modules. For our example above, you can do
+the following:
+
+# ifconfig bond0 down
+# rmmod bonding
+# rmmod e100
+
+ Again, for convenience, it may be desirable to create a script
+with these commands.
+
+
+3.3.1 Configuring Multiple Bonds Manually
+-----------------------------------------
+
+ This section contains information on configuring multiple
+bonding devices with differing options for those systems whose network
+initialization scripts lack support for configuring multiple bonds.
+
+ If you require multiple bonding devices, but all with the same
+options, you may wish to use the "max_bonds" module parameter,
+documented above.
+
+ To create multiple bonding devices with differing options, it is
+preferable to use bonding parameters exported by sysfs, documented in the
+section below.
+
+ For versions of bonding without sysfs support, the only means to
+provide multiple instances of bonding with differing options is to load
+the bonding driver multiple times. Note that current versions of the
+sysconfig network initialization scripts handle this automatically; if
+your distro uses these scripts, no special action is needed. See the
+section Configuring Bonding Devices, above, if you're not sure about your
+network initialization scripts.
+
+ To load multiple instances of the module, it is necessary to
+specify a different name for each instance (the module loading system
+requires that every loaded module, even multiple instances of the same
+module, have a unique name). This is accomplished by supplying multiple
+sets of bonding options in /etc/modprobe.d/*.conf, for example:
+
+alias bond0 bonding
+options bond0 -o bond0 mode=balance-rr miimon=100
+
+alias bond1 bonding
+options bond1 -o bond1 mode=balance-alb miimon=50
+
+ will load the bonding module two times. The first instance is
+named "bond0" and creates the bond0 device in balance-rr mode with an
+miimon of 100. The second instance is named "bond1" and creates the
+bond1 device in balance-alb mode with an miimon of 50.
+
+ In some circumstances (typically with older distributions),
+the above does not work, and the second bonding instance never sees
+its options. In that case, the second options line can be substituted
+as follows:
+
+install bond1 /sbin/modprobe --ignore-install bonding -o bond1 \
+ mode=balance-alb miimon=50
+
+ This may be repeated any number of times, specifying a new and
+unique name in place of bond1 for each subsequent instance.
+
+ It has been observed that some Red Hat supplied kernels are unable
+to rename modules at load time (the "-o bond1" part). Attempts to pass
+that option to modprobe will produce an "Operation not permitted" error.
+This has been reported on some Fedora Core kernels, and has been seen on
+RHEL 4 as well. On kernels exhibiting this problem, it will be impossible
+to configure multiple bonds with differing parameters (as they are older
+kernels, and also lack sysfs support).
+
+3.4 Configuring Bonding Manually via Sysfs
+------------------------------------------
+
+ Starting with version 3.0.0, Channel Bonding may be configured
+via the sysfs interface. This interface allows dynamic configuration
+of all bonds in the system without unloading the module. It also
+allows for adding and removing bonds at runtime. Ifenslave is no
+longer required, though it is still supported.
+
+ Use of the sysfs interface allows you to use multiple bonds
+with different configurations without having to reload the module.
+It also allows you to use multiple, differently configured bonds when
+bonding is compiled into the kernel.
+
+ You must have the sysfs filesystem mounted to configure
+bonding this way. The examples in this document assume that you
+are using the standard mount point for sysfs, e.g. /sys. If your
+sysfs filesystem is mounted elsewhere, you will need to adjust the
+example paths accordingly.
+
+Creating and Destroying Bonds
+-----------------------------
+To add a new bond foo:
+# echo +foo > /sys/class/net/bonding_masters
+
+To remove an existing bond bar:
+# echo -bar > /sys/class/net/bonding_masters
+
+To show all existing bonds:
+# cat /sys/class/net/bonding_masters
+
+NOTE: due to 4K size limitation of sysfs files, this list may be
+truncated if you have more than a few hundred bonds. This is unlikely
+to occur under normal operating conditions.
+
+Adding and Removing Slaves
+--------------------------
+ Interfaces may be enslaved to a bond using the file
+/sys/class/net/<bond>/bonding/slaves. The semantics for this file
+are the same as for the bonding_masters file.
+
+To enslave interface eth0 to bond bond0:
+# ifconfig bond0 up
+# echo +eth0 > /sys/class/net/bond0/bonding/slaves
+
+To free slave eth0 from bond bond0:
+# echo -eth0 > /sys/class/net/bond0/bonding/slaves
+
+ When an interface is enslaved to a bond, symlinks between the
+two are created in the sysfs filesystem. In this case, you would get
+/sys/class/net/bond0/slave_eth0 pointing to /sys/class/net/eth0, and
+/sys/class/net/eth0/master pointing to /sys/class/net/bond0.
+
+ This means that you can tell quickly whether or not an
+interface is enslaved by looking for the master symlink. Thus:
+# echo -eth0 > /sys/class/net/eth0/master/bonding/slaves
+will free eth0 from whatever bond it is enslaved to, regardless of
+the name of the bond interface.
+
+Changing a Bond's Configuration
+-------------------------------
+ Each bond may be configured individually by manipulating the
+files located in /sys/class/net/<bond name>/bonding
+
+ The names of these files correspond directly with the command-
+line parameters described elsewhere in this file, and, with the
+exception of arp_ip_target, they accept the same values. To see the
+current setting, simply cat the appropriate file.
+
+ A few examples will be given here; for specific usage
+guidelines for each parameter, see the appropriate section in this
+document.
+
+To configure bond0 for balance-alb mode:
+# ifconfig bond0 down
+# echo 6 > /sys/class/net/bond0/bonding/mode
+ - or -
+# echo balance-alb > /sys/class/net/bond0/bonding/mode
+ NOTE: The bond interface must be down before the mode can be
+changed.
+
+To enable MII monitoring on bond0 with a 1 second interval:
+# echo 1000 > /sys/class/net/bond0/bonding/miimon
+ NOTE: If ARP monitoring is enabled, it will disabled when MII
+monitoring is enabled, and vice-versa.
+
+To add ARP targets:
+# echo +192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target
+# echo +192.168.0.101 > /sys/class/net/bond0/bonding/arp_ip_target
+ NOTE: up to 16 target addresses may be specified.
+
+To remove an ARP target:
+# echo -192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target
+
+To configure the interval between learning packet transmits:
+# echo 12 > /sys/class/net/bond0/bonding/lp_interval
+ NOTE: the lp_inteval is the number of seconds between instances where
+the bonding driver sends learning packets to each slaves peer switch. The
+default interval is 1 second.
+
+Example Configuration
+---------------------
+ We begin with the same example that is shown in section 3.3,
+executed with sysfs, and without using ifenslave.
+
+ To make a simple bond of two e100 devices (presumed to be eth0
+and eth1), and have it persist across reboots, edit the appropriate
+file (/etc/init.d/boot.local or /etc/rc.d/rc.local), and add the
+following:
+
+modprobe bonding
+modprobe e100
+echo balance-alb > /sys/class/net/bond0/bonding/mode
+ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up
+echo 100 > /sys/class/net/bond0/bonding/miimon
+echo +eth0 > /sys/class/net/bond0/bonding/slaves
+echo +eth1 > /sys/class/net/bond0/bonding/slaves
+
+ To add a second bond, with two e1000 interfaces in
+active-backup mode, using ARP monitoring, add the following lines to
+your init script:
+
+modprobe e1000
+echo +bond1 > /sys/class/net/bonding_masters
+echo active-backup > /sys/class/net/bond1/bonding/mode
+ifconfig bond1 192.168.2.1 netmask 255.255.255.0 up
+echo +192.168.2.100 /sys/class/net/bond1/bonding/arp_ip_target
+echo 2000 > /sys/class/net/bond1/bonding/arp_interval
+echo +eth2 > /sys/class/net/bond1/bonding/slaves
+echo +eth3 > /sys/class/net/bond1/bonding/slaves
+
+3.5 Configuration with Interfaces Support
+-----------------------------------------
+
+ This section applies to distros which use /etc/network/interfaces file
+to describe network interface configuration, most notably Debian and it's
+derivatives.
+
+ The ifup and ifdown commands on Debian don't support bonding out of
+the box. The ifenslave-2.6 package should be installed to provide bonding
+support. Once installed, this package will provide bond-* options to be used
+into /etc/network/interfaces.
+
+ Note that ifenslave-2.6 package will load the bonding module and use
+the ifenslave command when appropriate.
+
+Example Configurations
+----------------------
+
+In /etc/network/interfaces, the following stanza will configure bond0, in
+active-backup mode, with eth0 and eth1 as slaves.
+
+auto bond0
+iface bond0 inet dhcp
+ bond-slaves eth0 eth1
+ bond-mode active-backup
+ bond-miimon 100
+ bond-primary eth0 eth1
+
+If the above configuration doesn't work, you might have a system using
+upstart for system startup. This is most notably true for recent
+Ubuntu versions. The following stanza in /etc/network/interfaces will
+produce the same result on those systems.
+
+auto bond0
+iface bond0 inet dhcp
+ bond-slaves none
+ bond-mode active-backup
+ bond-miimon 100
+
+auto eth0
+iface eth0 inet manual
+ bond-master bond0
+ bond-primary eth0 eth1
+
+auto eth1
+iface eth1 inet manual
+ bond-master bond0
+ bond-primary eth0 eth1
+
+For a full list of bond-* supported options in /etc/network/interfaces and some
+more advanced examples tailored to you particular distros, see the files in
+/usr/share/doc/ifenslave-2.6.
+
+3.6 Overriding Configuration for Special Cases
+----------------------------------------------
+
+When using the bonding driver, the physical port which transmits a frame is
+typically selected by the bonding driver, and is not relevant to the user or
+system administrator. The output port is simply selected using the policies of
+the selected bonding mode. On occasion however, it is helpful to direct certain
+classes of traffic to certain physical interfaces on output to implement
+slightly more complex policies. For example, to reach a web server over a
+bonded interface in which eth0 connects to a private network, while eth1
+connects via a public network, it may be desirous to bias the bond to send said
+traffic over eth0 first, using eth1 only as a fall back, while all other traffic
+can safely be sent over either interface. Such configurations may be achieved
+using the traffic control utilities inherent in linux.
+
+By default the bonding driver is multiqueue aware and 16 queues are created
+when the driver initializes (see Documentation/networking/multiqueue.txt
+for details). If more or less queues are desired the module parameter
+tx_queues can be used to change this value. There is no sysfs parameter
+available as the allocation is done at module init time.
+
+The output of the file /proc/net/bonding/bondX has changed so the output Queue
+ID is now printed for each slave:
+
+Bonding Mode: fault-tolerance (active-backup)
+Primary Slave: None
+Currently Active Slave: eth0
+MII Status: up
+MII Polling Interval (ms): 0
+Up Delay (ms): 0
+Down Delay (ms): 0
+
+Slave Interface: eth0
+MII Status: up
+Link Failure Count: 0
+Permanent HW addr: 00:1a:a0:12:8f:cb
+Slave queue ID: 0
+
+Slave Interface: eth1
+MII Status: up
+Link Failure Count: 0
+Permanent HW addr: 00:1a:a0:12:8f:cc
+Slave queue ID: 2
+
+The queue_id for a slave can be set using the command:
+
+# echo "eth1:2" > /sys/class/net/bond0/bonding/queue_id
+
+Any interface that needs a queue_id set should set it with multiple calls
+like the one above until proper priorities are set for all interfaces. On
+distributions that allow configuration via initscripts, multiple 'queue_id'
+arguments can be added to BONDING_OPTS to set all needed slave queues.
+
+These queue id's can be used in conjunction with the tc utility to configure
+a multiqueue qdisc and filters to bias certain traffic to transmit on certain
+slave devices. For instance, say we wanted, in the above configuration to
+force all traffic bound to 192.168.1.100 to use eth1 in the bond as its output
+device. The following commands would accomplish this:
+
+# tc qdisc add dev bond0 handle 1 root multiq
+
+# tc filter add dev bond0 protocol ip parent 1: prio 1 u32 match ip dst \
+ 192.168.1.100 action skbedit queue_mapping 2
+
+These commands tell the kernel to attach a multiqueue queue discipline to the
+bond0 interface and filter traffic enqueued to it, such that packets with a dst
+ip of 192.168.1.100 have their output queue mapping value overwritten to 2.
+This value is then passed into the driver, causing the normal output path
+selection policy to be overridden, selecting instead qid 2, which maps to eth1.
+
+Note that qid values begin at 1. Qid 0 is reserved to initiate to the driver
+that normal output policy selection should take place. One benefit to simply
+leaving the qid for a slave to 0 is the multiqueue awareness in the bonding
+driver that is now present. This awareness allows tc filters to be placed on
+slave devices as well as bond devices and the bonding driver will simply act as
+a pass-through for selecting output queues on the slave device rather than
+output port selection.
+
+This feature first appeared in bonding driver version 3.7.0 and support for
+output slave selection was limited to round-robin and active-backup modes.
+
+4 Querying Bonding Configuration
+=================================
+
+4.1 Bonding Configuration
+-------------------------
+
+ Each bonding device has a read-only file residing in the
+/proc/net/bonding directory. The file contents include information
+about the bonding configuration, options and state of each slave.
+
+ For example, the contents of /proc/net/bonding/bond0 after the
+driver is loaded with parameters of mode=0 and miimon=1000 is
+generally as follows:
+
+ Ethernet Channel Bonding Driver: 2.6.1 (October 29, 2004)
+ Bonding Mode: load balancing (round-robin)
+ Currently Active Slave: eth0
+ MII Status: up
+ MII Polling Interval (ms): 1000
+ Up Delay (ms): 0
+ Down Delay (ms): 0
+
+ Slave Interface: eth1
+ MII Status: up
+ Link Failure Count: 1
+
+ Slave Interface: eth0
+ MII Status: up
+ Link Failure Count: 1
+
+ The precise format and contents will change depending upon the
+bonding configuration, state, and version of the bonding driver.
+
+4.2 Network configuration
+-------------------------
+
+ The network configuration can be inspected using the ifconfig
+command. Bonding devices will have the MASTER flag set; Bonding slave
+devices will have the SLAVE flag set. The ifconfig output does not
+contain information on which slaves are associated with which masters.
+
+ In the example below, the bond0 interface is the master
+(MASTER) while eth0 and eth1 are slaves (SLAVE). Notice all slaves of
+bond0 have the same MAC address (HWaddr) as bond0 for all modes except
+TLB and ALB that require a unique MAC address for each slave.
+
+# /sbin/ifconfig
+bond0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
+ inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0
+ UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
+ RX packets:7224794 errors:0 dropped:0 overruns:0 frame:0
+ TX packets:3286647 errors:1 dropped:0 overruns:1 carrier:0
+ collisions:0 txqueuelen:0
+
+eth0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
+ UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
+ RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0
+ TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0
+ collisions:0 txqueuelen:100
+ Interrupt:10 Base address:0x1080
+
+eth1 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4
+ UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
+ RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0
+ TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0
+ collisions:0 txqueuelen:100
+ Interrupt:9 Base address:0x1400
+
+5. Switch Configuration
+=======================
+
+ For this section, "switch" refers to whatever system the
+bonded devices are directly connected to (i.e., where the other end of
+the cable plugs into). This may be an actual dedicated switch device,
+or it may be another regular system (e.g., another computer running
+Linux),
+
+ The active-backup, balance-tlb and balance-alb modes do not
+require any specific configuration of the switch.
+
+ The 802.3ad mode requires that the switch have the appropriate
+ports configured as an 802.3ad aggregation. The precise method used
+to configure this varies from switch to switch, but, for example, a
+Cisco 3550 series switch requires that the appropriate ports first be
+grouped together in a single etherchannel instance, then that
+etherchannel is set to mode "lacp" to enable 802.3ad (instead of
+standard EtherChannel).
+
+ The balance-rr, balance-xor and broadcast modes generally
+require that the switch have the appropriate ports grouped together.
+The nomenclature for such a group differs between switches, it may be
+called an "etherchannel" (as in the Cisco example, above), a "trunk
+group" or some other similar variation. For these modes, each switch
+will also have its own configuration options for the switch's transmit
+policy to the bond. Typical choices include XOR of either the MAC or
+IP addresses. The transmit policy of the two peers does not need to
+match. For these three modes, the bonding mode really selects a
+transmit policy for an EtherChannel group; all three will interoperate
+with another EtherChannel group.
+
+
+6. 802.1q VLAN Support
+======================
+
+ It is possible to configure VLAN devices over a bond interface
+using the 8021q driver. However, only packets coming from the 8021q
+driver and passing through bonding will be tagged by default. Self
+generated packets, for example, bonding's learning packets or ARP
+packets generated by either ALB mode or the ARP monitor mechanism, are
+tagged internally by bonding itself. As a result, bonding must
+"learn" the VLAN IDs configured above it, and use those IDs to tag
+self generated packets.
+
+ For reasons of simplicity, and to support the use of adapters
+that can do VLAN hardware acceleration offloading, the bonding
+interface declares itself as fully hardware offloading capable, it gets
+the add_vid/kill_vid notifications to gather the necessary
+information, and it propagates those actions to the slaves. In case
+of mixed adapter types, hardware accelerated tagged packets that
+should go through an adapter that is not offloading capable are
+"un-accelerated" by the bonding driver so the VLAN tag sits in the
+regular location.
+
+ VLAN interfaces *must* be added on top of a bonding interface
+only after enslaving at least one slave. The bonding interface has a
+hardware address of 00:00:00:00:00:00 until the first slave is added.
+If the VLAN interface is created prior to the first enslavement, it
+would pick up the all-zeroes hardware address. Once the first slave
+is attached to the bond, the bond device itself will pick up the
+slave's hardware address, which is then available for the VLAN device.
+
+ Also, be aware that a similar problem can occur if all slaves
+are released from a bond that still has one or more VLAN interfaces on
+top of it. When a new slave is added, the bonding interface will
+obtain its hardware address from the first slave, which might not
+match the hardware address of the VLAN interfaces (which was
+ultimately copied from an earlier slave).
+
+ There are two methods to insure that the VLAN device operates
+with the correct hardware address if all slaves are removed from a
+bond interface:
+
+ 1. Remove all VLAN interfaces then recreate them
+
+ 2. Set the bonding interface's hardware address so that it
+matches the hardware address of the VLAN interfaces.
+
+ Note that changing a VLAN interface's HW address would set the
+underlying device -- i.e. the bonding interface -- to promiscuous
+mode, which might not be what you want.
+
+
+7. Link Monitoring
+==================
+
+ The bonding driver at present supports two schemes for
+monitoring a slave device's link state: the ARP monitor and the MII
+monitor.
+
+ At the present time, due to implementation restrictions in the
+bonding driver itself, it is not possible to enable both ARP and MII
+monitoring simultaneously.
+
+7.1 ARP Monitor Operation
+-------------------------
+
+ The ARP monitor operates as its name suggests: it sends ARP
+queries to one or more designated peer systems on the network, and
+uses the response as an indication that the link is operating. This
+gives some assurance that traffic is actually flowing to and from one
+or more peers on the local network.
+
+ The ARP monitor relies on the device driver itself to verify
+that traffic is flowing. In particular, the driver must keep up to
+date the last receive time, dev->last_rx, and transmit start time,
+dev->trans_start. If these are not updated by the driver, then the
+ARP monitor will immediately fail any slaves using that driver, and
+those slaves will stay down. If networking monitoring (tcpdump, etc)
+shows the ARP requests and replies on the network, then it may be that
+your device driver is not updating last_rx and trans_start.
+
+7.2 Configuring Multiple ARP Targets
+------------------------------------
+
+ While ARP monitoring can be done with just one target, it can
+be useful in a High Availability setup to have several targets to
+monitor. In the case of just one target, the target itself may go
+down or have a problem making it unresponsive to ARP requests. Having
+an additional target (or several) increases the reliability of the ARP
+monitoring.
+
+ Multiple ARP targets must be separated by commas as follows:
+
+# example options for ARP monitoring with three targets
+alias bond0 bonding
+options bond0 arp_interval=60 arp_ip_target=192.168.0.1,192.168.0.3,192.168.0.9
+
+ For just a single target the options would resemble:
+
+# example options for ARP monitoring with one target
+alias bond0 bonding
+options bond0 arp_interval=60 arp_ip_target=192.168.0.100
+
+
+7.3 MII Monitor Operation
+-------------------------
+
+ The MII monitor monitors only the carrier state of the local
+network interface. It accomplishes this in one of three ways: by
+depending upon the device driver to maintain its carrier state, by
+querying the device's MII registers, or by making an ethtool query to
+the device.
+
+ If the use_carrier module parameter is 1 (the default value),
+then the MII monitor will rely on the driver for carrier state
+information (via the netif_carrier subsystem). As explained in the
+use_carrier parameter information, above, if the MII monitor fails to
+detect carrier loss on the device (e.g., when the cable is physically
+disconnected), it may be that the driver does not support
+netif_carrier.
+
+ If use_carrier is 0, then the MII monitor will first query the
+device's (via ioctl) MII registers and check the link state. If that
+request fails (not just that it returns carrier down), then the MII
+monitor will make an ethtool ETHOOL_GLINK request to attempt to obtain
+the same information. If both methods fail (i.e., the driver either
+does not support or had some error in processing both the MII register
+and ethtool requests), then the MII monitor will assume the link is
+up.
+
+8. Potential Sources of Trouble
+===============================
+
+8.1 Adventures in Routing
+-------------------------
+
+ When bonding is configured, it is important that the slave
+devices not have routes that supersede routes of the master (or,
+generally, not have routes at all). For example, suppose the bonding
+device bond0 has two slaves, eth0 and eth1, and the routing table is
+as follows:
+
+Kernel IP routing table
+Destination Gateway Genmask Flags MSS Window irtt Iface
+10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth0
+10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 eth1
+10.0.0.0 0.0.0.0 255.255.0.0 U 40 0 0 bond0
+127.0.0.0 0.0.0.0 255.0.0.0 U 40 0 0 lo
+
+ This routing configuration will likely still update the
+receive/transmit times in the driver (needed by the ARP monitor), but
+may bypass the bonding driver (because outgoing traffic to, in this
+case, another host on network 10 would use eth0 or eth1 before bond0).
+
+ The ARP monitor (and ARP itself) may become confused by this
+configuration, because ARP requests (generated by the ARP monitor)
+will be sent on one interface (bond0), but the corresponding reply
+will arrive on a different interface (eth0). This reply looks to ARP
+as an unsolicited ARP reply (because ARP matches replies on an
+interface basis), and is discarded. The MII monitor is not affected
+by the state of the routing table.
+
+ The solution here is simply to insure that slaves do not have
+routes of their own, and if for some reason they must, those routes do
+not supersede routes of their master. This should generally be the
+case, but unusual configurations or errant manual or automatic static
+route additions may cause trouble.
+
+8.2 Ethernet Device Renaming
+----------------------------
+
+ On systems with network configuration scripts that do not
+associate physical devices directly with network interface names (so
+that the same physical device always has the same "ethX" name), it may
+be necessary to add some special logic to config files in
+/etc/modprobe.d/.
+
+ For example, given a modules.conf containing the following:
+
+alias bond0 bonding
+options bond0 mode=some-mode miimon=50
+alias eth0 tg3
+alias eth1 tg3
+alias eth2 e1000
+alias eth3 e1000
+
+ If neither eth0 and eth1 are slaves to bond0, then when the
+bond0 interface comes up, the devices may end up reordered. This
+happens because bonding is loaded first, then its slave device's
+drivers are loaded next. Since no other drivers have been loaded,
+when the e1000 driver loads, it will receive eth0 and eth1 for its
+devices, but the bonding configuration tries to enslave eth2 and eth3
+(which may later be assigned to the tg3 devices).
+
+ Adding the following:
+
+add above bonding e1000 tg3
+
+ causes modprobe to load e1000 then tg3, in that order, when
+bonding is loaded. This command is fully documented in the
+modules.conf manual page.
+
+ On systems utilizing modprobe an equivalent problem can occur.
+In this case, the following can be added to config files in
+/etc/modprobe.d/ as:
+
+softdep bonding pre: tg3 e1000
+
+ This will load tg3 and e1000 modules before loading the bonding one.
+Full documentation on this can be found in the modprobe.d and modprobe
+manual pages.
+
+8.3. Painfully Slow Or No Failed Link Detection By Miimon
+---------------------------------------------------------
+
+ By default, bonding enables the use_carrier option, which
+instructs bonding to trust the driver to maintain carrier state.
+
+ As discussed in the options section, above, some drivers do
+not support the netif_carrier_on/_off link state tracking system.
+With use_carrier enabled, bonding will always see these links as up,
+regardless of their actual state.
+
+ Additionally, other drivers do support netif_carrier, but do
+not maintain it in real time, e.g., only polling the link state at
+some fixed interval. In this case, miimon will detect failures, but
+only after some long period of time has expired. If it appears that
+miimon is very slow in detecting link failures, try specifying
+use_carrier=0 to see if that improves the failure detection time. If
+it does, then it may be that the driver checks the carrier state at a
+fixed interval, but does not cache the MII register values (so the
+use_carrier=0 method of querying the registers directly works). If
+use_carrier=0 does not improve the failover, then the driver may cache
+the registers, or the problem may be elsewhere.
+
+ Also, remember that miimon only checks for the device's
+carrier state. It has no way to determine the state of devices on or
+beyond other ports of a switch, or if a switch is refusing to pass
+traffic while still maintaining carrier on.
+
+9. SNMP agents
+===============
+
+ If running SNMP agents, the bonding driver should be loaded
+before any network drivers participating in a bond. This requirement
+is due to the interface index (ipAdEntIfIndex) being associated to
+the first interface found with a given IP address. That is, there is
+only one ipAdEntIfIndex for each IP address. For example, if eth0 and
+eth1 are slaves of bond0 and the driver for eth0 is loaded before the
+bonding driver, the interface for the IP address will be associated
+with the eth0 interface. This configuration is shown below, the IP
+address 192.168.1.1 has an interface index of 2 which indexes to eth0
+in the ifDescr table (ifDescr.2).
+
+ interfaces.ifTable.ifEntry.ifDescr.1 = lo
+ interfaces.ifTable.ifEntry.ifDescr.2 = eth0
+ interfaces.ifTable.ifEntry.ifDescr.3 = eth1
+ interfaces.ifTable.ifEntry.ifDescr.4 = eth2
+ interfaces.ifTable.ifEntry.ifDescr.5 = eth3
+ interfaces.ifTable.ifEntry.ifDescr.6 = bond0
+ ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 5
+ ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2
+ ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 4
+ ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1
+
+ This problem is avoided by loading the bonding driver before
+any network drivers participating in a bond. Below is an example of
+loading the bonding driver first, the IP address 192.168.1.1 is
+correctly associated with ifDescr.2.
+
+ interfaces.ifTable.ifEntry.ifDescr.1 = lo
+ interfaces.ifTable.ifEntry.ifDescr.2 = bond0
+ interfaces.ifTable.ifEntry.ifDescr.3 = eth0
+ interfaces.ifTable.ifEntry.ifDescr.4 = eth1
+ interfaces.ifTable.ifEntry.ifDescr.5 = eth2
+ interfaces.ifTable.ifEntry.ifDescr.6 = eth3
+ ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.10.10.10 = 6
+ ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.192.168.1.1 = 2
+ ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.10.74.20.94 = 5
+ ip.ipAddrTable.ipAddrEntry.ipAdEntIfIndex.127.0.0.1 = 1
+
+ While some distributions may not report the interface name in
+ifDescr, the association between the IP address and IfIndex remains
+and SNMP functions such as Interface_Scan_Next will report that
+association.
+
+10. Promiscuous mode
+====================
+
+ When running network monitoring tools, e.g., tcpdump, it is
+common to enable promiscuous mode on the device, so that all traffic
+is seen (instead of seeing only traffic destined for the local host).
+The bonding driver handles promiscuous mode changes to the bonding
+master device (e.g., bond0), and propagates the setting to the slave
+devices.
+
+ For the balance-rr, balance-xor, broadcast, and 802.3ad modes,
+the promiscuous mode setting is propagated to all slaves.
+
+ For the active-backup, balance-tlb and balance-alb modes, the
+promiscuous mode setting is propagated only to the active slave.
+
+ For balance-tlb mode, the active slave is the slave currently
+receiving inbound traffic.
+
+ For balance-alb mode, the active slave is the slave used as a
+"primary." This slave is used for mode-specific control traffic, for
+sending to peers that are unassigned or if the load is unbalanced.
+
+ For the active-backup, balance-tlb and balance-alb modes, when
+the active slave changes (e.g., due to a link failure), the
+promiscuous setting will be propagated to the new active slave.
+
+11. Configuring Bonding for High Availability
+=============================================
+
+ High Availability refers to configurations that provide
+maximum network availability by having redundant or backup devices,
+links or switches between the host and the rest of the world. The
+goal is to provide the maximum availability of network connectivity
+(i.e., the network always works), even though other configurations
+could provide higher throughput.
+
+11.1 High Availability in a Single Switch Topology
+--------------------------------------------------
+
+ If two hosts (or a host and a single switch) are directly
+connected via multiple physical links, then there is no availability
+penalty to optimizing for maximum bandwidth. In this case, there is
+only one switch (or peer), so if it fails, there is no alternative
+access to fail over to. Additionally, the bonding load balance modes
+support link monitoring of their members, so if individual links fail,
+the load will be rebalanced across the remaining devices.
+
+ See Section 12, "Configuring Bonding for Maximum Throughput"
+for information on configuring bonding with one peer device.
+
+11.2 High Availability in a Multiple Switch Topology
+----------------------------------------------------
+
+ With multiple switches, the configuration of bonding and the
+network changes dramatically. In multiple switch topologies, there is
+a trade off between network availability and usable bandwidth.
+
+ Below is a sample network, configured to maximize the
+availability of the network:
+
+ | |
+ |port3 port3|
+ +-----+----+ +-----+----+
+ | |port2 ISL port2| |
+ | switch A +--------------------------+ switch B |
+ | | | |
+ +-----+----+ +-----++---+
+ |port1 port1|
+ | +-------+ |
+ +-------------+ host1 +---------------+
+ eth0 +-------+ eth1
+
+ In this configuration, there is a link between the two
+switches (ISL, or inter switch link), and multiple ports connecting to
+the outside world ("port3" on each switch). There is no technical
+reason that this could not be extended to a third switch.
+
+11.2.1 HA Bonding Mode Selection for Multiple Switch Topology
+-------------------------------------------------------------
+
+ In a topology such as the example above, the active-backup and
+broadcast modes are the only useful bonding modes when optimizing for
+availability; the other modes require all links to terminate on the
+same peer for them to behave rationally.
+
+active-backup: This is generally the preferred mode, particularly if
+ the switches have an ISL and play together well. If the
+ network configuration is such that one switch is specifically
+ a backup switch (e.g., has lower capacity, higher cost, etc),
+ then the primary option can be used to insure that the
+ preferred link is always used when it is available.
+
+broadcast: This mode is really a special purpose mode, and is suitable
+ only for very specific needs. For example, if the two
+ switches are not connected (no ISL), and the networks beyond
+ them are totally independent. In this case, if it is
+ necessary for some specific one-way traffic to reach both
+ independent networks, then the broadcast mode may be suitable.
+
+11.2.2 HA Link Monitoring Selection for Multiple Switch Topology
+----------------------------------------------------------------
+
+ The choice of link monitoring ultimately depends upon your
+switch. If the switch can reliably fail ports in response to other
+failures, then either the MII or ARP monitors should work. For
+example, in the above example, if the "port3" link fails at the remote
+end, the MII monitor has no direct means to detect this. The ARP
+monitor could be configured with a target at the remote end of port3,
+thus detecting that failure without switch support.
+
+ In general, however, in a multiple switch topology, the ARP
+monitor can provide a higher level of reliability in detecting end to
+end connectivity failures (which may be caused by the failure of any
+individual component to pass traffic for any reason). Additionally,
+the ARP monitor should be configured with multiple targets (at least
+one for each switch in the network). This will insure that,
+regardless of which switch is active, the ARP monitor has a suitable
+target to query.
+
+ Note, also, that of late many switches now support a functionality
+generally referred to as "trunk failover." This is a feature of the
+switch that causes the link state of a particular switch port to be set
+down (or up) when the state of another switch port goes down (or up).
+Its purpose is to propagate link failures from logically "exterior" ports
+to the logically "interior" ports that bonding is able to monitor via
+miimon. Availability and configuration for trunk failover varies by
+switch, but this can be a viable alternative to the ARP monitor when using
+suitable switches.
+
+12. Configuring Bonding for Maximum Throughput
+==============================================
+
+12.1 Maximizing Throughput in a Single Switch Topology
+------------------------------------------------------
+
+ In a single switch configuration, the best method to maximize
+throughput depends upon the application and network environment. The
+various load balancing modes each have strengths and weaknesses in
+different environments, as detailed below.
+
+ For this discussion, we will break down the topologies into
+two categories. Depending upon the destination of most traffic, we
+categorize them into either "gatewayed" or "local" configurations.
+
+ In a gatewayed configuration, the "switch" is acting primarily
+as a router, and the majority of traffic passes through this router to
+other networks. An example would be the following:
+
+
+ +----------+ +----------+
+ | |eth0 port1| | to other networks
+ | Host A +---------------------+ router +------------------->
+ | +---------------------+ | Hosts B and C are out
+ | |eth1 port2| | here somewhere
+ +----------+ +----------+
+
+ The router may be a dedicated router device, or another host
+acting as a gateway. For our discussion, the important point is that
+the majority of traffic from Host A will pass through the router to
+some other network before reaching its final destination.
+
+ In a gatewayed network configuration, although Host A may
+communicate with many other systems, all of its traffic will be sent
+and received via one other peer on the local network, the router.
+
+ Note that the case of two systems connected directly via
+multiple physical links is, for purposes of configuring bonding, the
+same as a gatewayed configuration. In that case, it happens that all
+traffic is destined for the "gateway" itself, not some other network
+beyond the gateway.
+
+ In a local configuration, the "switch" is acting primarily as
+a switch, and the majority of traffic passes through this switch to
+reach other stations on the same network. An example would be the
+following:
+
+ +----------+ +----------+ +--------+
+ | |eth0 port1| +-------+ Host B |
+ | Host A +------------+ switch |port3 +--------+
+ | +------------+ | +--------+
+ | |eth1 port2| +------------------+ Host C |
+ +----------+ +----------+port4 +--------+
+
+
+ Again, the switch may be a dedicated switch device, or another
+host acting as a gateway. For our discussion, the important point is
+that the majority of traffic from Host A is destined for other hosts
+on the same local network (Hosts B and C in the above example).
+
+ In summary, in a gatewayed configuration, traffic to and from
+the bonded device will be to the same MAC level peer on the network
+(the gateway itself, i.e., the router), regardless of its final
+destination. In a local configuration, traffic flows directly to and
+from the final destinations, thus, each destination (Host B, Host C)
+will be addressed directly by their individual MAC addresses.
+
+ This distinction between a gatewayed and a local network
+configuration is important because many of the load balancing modes
+available use the MAC addresses of the local network source and
+destination to make load balancing decisions. The behavior of each
+mode is described below.
+
+
+12.1.1 MT Bonding Mode Selection for Single Switch Topology
+-----------------------------------------------------------
+
+ This configuration is the easiest to set up and to understand,
+although you will have to decide which bonding mode best suits your
+needs. The trade offs for each mode are detailed below:
+
+balance-rr: This mode is the only mode that will permit a single
+ TCP/IP connection to stripe traffic across multiple
+ interfaces. It is therefore the only mode that will allow a
+ single TCP/IP stream to utilize more than one interface's
+ worth of throughput. This comes at a cost, however: the
+ striping generally results in peer systems receiving packets out
+ of order, causing TCP/IP's congestion control system to kick
+ in, often by retransmitting segments.
+
+ It is possible to adjust TCP/IP's congestion limits by
+ altering the net.ipv4.tcp_reordering sysctl parameter. The
+ usual default value is 3. But keep in mind TCP stack is able
+ to automatically increase this when it detects reorders.
+
+ Note that the fraction of packets that will be delivered out of
+ order is highly variable, and is unlikely to be zero. The level
+ of reordering depends upon a variety of factors, including the
+ networking interfaces, the switch, and the topology of the
+ configuration. Speaking in general terms, higher speed network
+ cards produce more reordering (due to factors such as packet
+ coalescing), and a "many to many" topology will reorder at a
+ higher rate than a "many slow to one fast" configuration.
+
+ Many switches do not support any modes that stripe traffic
+ (instead choosing a port based upon IP or MAC level addresses);
+ for those devices, traffic for a particular connection flowing
+ through the switch to a balance-rr bond will not utilize greater
+ than one interface's worth of bandwidth.
+
+ If you are utilizing protocols other than TCP/IP, UDP for
+ example, and your application can tolerate out of order
+ delivery, then this mode can allow for single stream datagram
+ performance that scales near linearly as interfaces are added
+ to the bond.
+
+ This mode requires the switch to have the appropriate ports
+ configured for "etherchannel" or "trunking."
+
+active-backup: There is not much advantage in this network topology to
+ the active-backup mode, as the inactive backup devices are all
+ connected to the same peer as the primary. In this case, a
+ load balancing mode (with link monitoring) will provide the
+ same level of network availability, but with increased
+ available bandwidth. On the plus side, active-backup mode
+ does not require any configuration of the switch, so it may
+ have value if the hardware available does not support any of
+ the load balance modes.
+
+balance-xor: This mode will limit traffic such that packets destined
+ for specific peers will always be sent over the same
+ interface. Since the destination is determined by the MAC
+ addresses involved, this mode works best in a "local" network
+ configuration (as described above), with destinations all on
+ the same local network. This mode is likely to be suboptimal
+ if all your traffic is passed through a single router (i.e., a
+ "gatewayed" network configuration, as described above).
+
+ As with balance-rr, the switch ports need to be configured for
+ "etherchannel" or "trunking."
+
+broadcast: Like active-backup, there is not much advantage to this
+ mode in this type of network topology.
+
+802.3ad: This mode can be a good choice for this type of network
+ topology. The 802.3ad mode is an IEEE standard, so all peers
+ that implement 802.3ad should interoperate well. The 802.3ad
+ protocol includes automatic configuration of the aggregates,
+ so minimal manual configuration of the switch is needed
+ (typically only to designate that some set of devices is
+ available for 802.3ad). The 802.3ad standard also mandates
+ that frames be delivered in order (within certain limits), so
+ in general single connections will not see misordering of
+ packets. The 802.3ad mode does have some drawbacks: the
+ standard mandates that all devices in the aggregate operate at
+ the same speed and duplex. Also, as with all bonding load
+ balance modes other than balance-rr, no single connection will
+ be able to utilize more than a single interface's worth of
+ bandwidth.
+
+ Additionally, the linux bonding 802.3ad implementation
+ distributes traffic by peer (using an XOR of MAC addresses
+ and packet type ID), so in a "gatewayed" configuration, all
+ outgoing traffic will generally use the same device. Incoming
+ traffic may also end up on a single device, but that is
+ dependent upon the balancing policy of the peer's 8023.ad
+ implementation. In a "local" configuration, traffic will be
+ distributed across the devices in the bond.
+
+ Finally, the 802.3ad mode mandates the use of the MII monitor,
+ therefore, the ARP monitor is not available in this mode.
+
+balance-tlb: The balance-tlb mode balances outgoing traffic by peer.
+ Since the balancing is done according to MAC address, in a
+ "gatewayed" configuration (as described above), this mode will
+ send all traffic across a single device. However, in a
+ "local" network configuration, this mode balances multiple
+ local network peers across devices in a vaguely intelligent
+ manner (not a simple XOR as in balance-xor or 802.3ad mode),
+ so that mathematically unlucky MAC addresses (i.e., ones that
+ XOR to the same value) will not all "bunch up" on a single
+ interface.
+
+ Unlike 802.3ad, interfaces may be of differing speeds, and no
+ special switch configuration is required. On the down side,
+ in this mode all incoming traffic arrives over a single
+ interface, this mode requires certain ethtool support in the
+ network device driver of the slave interfaces, and the ARP
+ monitor is not available.
+
+balance-alb: This mode is everything that balance-tlb is, and more.
+ It has all of the features (and restrictions) of balance-tlb,
+ and will also balance incoming traffic from local network
+ peers (as described in the Bonding Module Options section,
+ above).
+
+ The only additional down side to this mode is that the network
+ device driver must support changing the hardware address while
+ the device is open.
+
+12.1.2 MT Link Monitoring for Single Switch Topology
+----------------------------------------------------
+
+ The choice of link monitoring may largely depend upon which
+mode you choose to use. The more advanced load balancing modes do not
+support the use of the ARP monitor, and are thus restricted to using
+the MII monitor (which does not provide as high a level of end to end
+assurance as the ARP monitor).
+
+12.2 Maximum Throughput in a Multiple Switch Topology
+-----------------------------------------------------
+
+ Multiple switches may be utilized to optimize for throughput
+when they are configured in parallel as part of an isolated network
+between two or more systems, for example:
+
+ +-----------+
+ | Host A |
+ +-+---+---+-+
+ | | |
+ +--------+ | +---------+
+ | | |
+ +------+---+ +-----+----+ +-----+----+
+ | Switch A | | Switch B | | Switch C |
+ +------+---+ +-----+----+ +-----+----+
+ | | |
+ +--------+ | +---------+
+ | | |
+ +-+---+---+-+
+ | Host B |
+ +-----------+
+
+ In this configuration, the switches are isolated from one
+another. One reason to employ a topology such as this is for an
+isolated network with many hosts (a cluster configured for high
+performance, for example), using multiple smaller switches can be more
+cost effective than a single larger switch, e.g., on a network with 24
+hosts, three 24 port switches can be significantly less expensive than
+a single 72 port switch.
+
+ If access beyond the network is required, an individual host
+can be equipped with an additional network device connected to an
+external network; this host then additionally acts as a gateway.
+
+12.2.1 MT Bonding Mode Selection for Multiple Switch Topology
+-------------------------------------------------------------
+
+ In actual practice, the bonding mode typically employed in
+configurations of this type is balance-rr. Historically, in this
+network configuration, the usual caveats about out of order packet
+delivery are mitigated by the use of network adapters that do not do
+any kind of packet coalescing (via the use of NAPI, or because the
+device itself does not generate interrupts until some number of
+packets has arrived). When employed in this fashion, the balance-rr
+mode allows individual connections between two hosts to effectively
+utilize greater than one interface's bandwidth.
+
+12.2.2 MT Link Monitoring for Multiple Switch Topology
+------------------------------------------------------
+
+ Again, in actual practice, the MII monitor is most often used
+in this configuration, as performance is given preference over
+availability. The ARP monitor will function in this topology, but its
+advantages over the MII monitor are mitigated by the volume of probes
+needed as the number of systems involved grows (remember that each
+host in the network is configured with bonding).
+
+13. Switch Behavior Issues
+==========================
+
+13.1 Link Establishment and Failover Delays
+-------------------------------------------
+
+ Some switches exhibit undesirable behavior with regard to the
+timing of link up and down reporting by the switch.
+
+ First, when a link comes up, some switches may indicate that
+the link is up (carrier available), but not pass traffic over the
+interface for some period of time. This delay is typically due to
+some type of autonegotiation or routing protocol, but may also occur
+during switch initialization (e.g., during recovery after a switch
+failure). If you find this to be a problem, specify an appropriate
+value to the updelay bonding module option to delay the use of the
+relevant interface(s).
+
+ Second, some switches may "bounce" the link state one or more
+times while a link is changing state. This occurs most commonly while
+the switch is initializing. Again, an appropriate updelay value may
+help.
+
+ Note that when a bonding interface has no active links, the
+driver will immediately reuse the first link that goes up, even if the
+updelay parameter has been specified (the updelay is ignored in this
+case). If there are slave interfaces waiting for the updelay timeout
+to expire, the interface that first went into that state will be
+immediately reused. This reduces down time of the network if the
+value of updelay has been overestimated, and since this occurs only in
+cases with no connectivity, there is no additional penalty for
+ignoring the updelay.
+
+ In addition to the concerns about switch timings, if your
+switches take a long time to go into backup mode, it may be desirable
+to not activate a backup interface immediately after a link goes down.
+Failover may be delayed via the downdelay bonding module option.
+
+13.2 Duplicated Incoming Packets
+--------------------------------
+
+ NOTE: Starting with version 3.0.2, the bonding driver has logic to
+suppress duplicate packets, which should largely eliminate this problem.
+The following description is kept for reference.
+
+ It is not uncommon to observe a short burst of duplicated
+traffic when the bonding device is first used, or after it has been
+idle for some period of time. This is most easily observed by issuing
+a "ping" to some other host on the network, and noticing that the
+output from ping flags duplicates (typically one per slave).
+
+ For example, on a bond in active-backup mode with five slaves
+all connected to one switch, the output may appear as follows:
+
+# ping -n 10.0.4.2
+PING 10.0.4.2 (10.0.4.2) from 10.0.3.10 : 56(84) bytes of data.
+64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.7 ms
+64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
+64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
+64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
+64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=13.8 ms (DUP!)
+64 bytes from 10.0.4.2: icmp_seq=2 ttl=64 time=0.216 ms
+64 bytes from 10.0.4.2: icmp_seq=3 ttl=64 time=0.267 ms
+64 bytes from 10.0.4.2: icmp_seq=4 ttl=64 time=0.222 ms
+
+ This is not due to an error in the bonding driver, rather, it
+is a side effect of how many switches update their MAC forwarding
+tables. Initially, the switch does not associate the MAC address in
+the packet with a particular switch port, and so it may send the
+traffic to all ports until its MAC forwarding table is updated. Since
+the interfaces attached to the bond may occupy multiple ports on a
+single switch, when the switch (temporarily) floods the traffic to all
+ports, the bond device receives multiple copies of the same packet
+(one per slave device).
+
+ The duplicated packet behavior is switch dependent, some
+switches exhibit this, and some do not. On switches that display this
+behavior, it can be induced by clearing the MAC forwarding table (on
+most Cisco switches, the privileged command "clear mac address-table
+dynamic" will accomplish this).
+
+14. Hardware Specific Considerations
+====================================
+
+ This section contains additional information for configuring
+bonding on specific hardware platforms, or for interfacing bonding
+with particular switches or other devices.
+
+14.1 IBM BladeCenter
+--------------------
+
+ This applies to the JS20 and similar systems.
+
+ On the JS20 blades, the bonding driver supports only
+balance-rr, active-backup, balance-tlb and balance-alb modes. This is
+largely due to the network topology inside the BladeCenter, detailed
+below.
+
+JS20 network adapter information
+--------------------------------
+
+ All JS20s come with two Broadcom Gigabit Ethernet ports
+integrated on the planar (that's "motherboard" in IBM-speak). In the
+BladeCenter chassis, the eth0 port of all JS20 blades is hard wired to
+I/O Module #1; similarly, all eth1 ports are wired to I/O Module #2.
+An add-on Broadcom daughter card can be installed on a JS20 to provide
+two more Gigabit Ethernet ports. These ports, eth2 and eth3, are
+wired to I/O Modules 3 and 4, respectively.
+
+ Each I/O Module may contain either a switch or a passthrough
+module (which allows ports to be directly connected to an external
+switch). Some bonding modes require a specific BladeCenter internal
+network topology in order to function; these are detailed below.
+
+ Additional BladeCenter-specific networking information can be
+found in two IBM Redbooks (www.ibm.com/redbooks):
+
+"IBM eServer BladeCenter Networking Options"
+"IBM eServer BladeCenter Layer 2-7 Network Switching"
+
+BladeCenter networking configuration
+------------------------------------
+
+ Because a BladeCenter can be configured in a very large number
+of ways, this discussion will be confined to describing basic
+configurations.
+
+ Normally, Ethernet Switch Modules (ESMs) are used in I/O
+modules 1 and 2. In this configuration, the eth0 and eth1 ports of a
+JS20 will be connected to different internal switches (in the
+respective I/O modules).
+
+ A passthrough module (OPM or CPM, optical or copper,
+passthrough module) connects the I/O module directly to an external
+switch. By using PMs in I/O module #1 and #2, the eth0 and eth1
+interfaces of a JS20 can be redirected to the outside world and
+connected to a common external switch.
+
+ Depending upon the mix of ESMs and PMs, the network will
+appear to bonding as either a single switch topology (all PMs) or as a
+multiple switch topology (one or more ESMs, zero or more PMs). It is
+also possible to connect ESMs together, resulting in a configuration
+much like the example in "High Availability in a Multiple Switch
+Topology," above.
+
+Requirements for specific modes
+-------------------------------
+
+ The balance-rr mode requires the use of passthrough modules
+for devices in the bond, all connected to an common external switch.
+That switch must be configured for "etherchannel" or "trunking" on the
+appropriate ports, as is usual for balance-rr.
+
+ The balance-alb and balance-tlb modes will function with
+either switch modules or passthrough modules (or a mix). The only
+specific requirement for these modes is that all network interfaces
+must be able to reach all destinations for traffic sent over the
+bonding device (i.e., the network must converge at some point outside
+the BladeCenter).
+
+ The active-backup mode has no additional requirements.
+
+Link monitoring issues
+----------------------
+
+ When an Ethernet Switch Module is in place, only the ARP
+monitor will reliably detect link loss to an external switch. This is
+nothing unusual, but examination of the BladeCenter cabinet would
+suggest that the "external" network ports are the ethernet ports for
+the system, when it fact there is a switch between these "external"
+ports and the devices on the JS20 system itself. The MII monitor is
+only able to detect link failures between the ESM and the JS20 system.
+
+ When a passthrough module is in place, the MII monitor does
+detect failures to the "external" port, which is then directly
+connected to the JS20 system.
+
+Other concerns
+--------------
+
+ The Serial Over LAN (SoL) link is established over the primary
+ethernet (eth0) only, therefore, any loss of link to eth0 will result
+in losing your SoL connection. It will not fail over with other
+network traffic, as the SoL system is beyond the control of the
+bonding driver.
+
+ It may be desirable to disable spanning tree on the switch
+(either the internal Ethernet Switch Module, or an external switch) to
+avoid fail-over delay issues when using bonding.
+
+
+15. Frequently Asked Questions
+==============================
+
+1. Is it SMP safe?
+
+ Yes. The old 2.0.xx channel bonding patch was not SMP safe.
+The new driver was designed to be SMP safe from the start.
+
+2. What type of cards will work with it?
+
+ Any Ethernet type cards (you can even mix cards - a Intel
+EtherExpress PRO/100 and a 3com 3c905b, for example). For most modes,
+devices need not be of the same speed.
+
+ Starting with version 3.2.1, bonding also supports Infiniband
+slaves in active-backup mode.
+
+3. How many bonding devices can I have?
+
+ There is no limit.
+
+4. How many slaves can a bonding device have?
+
+ This is limited only by the number of network interfaces Linux
+supports and/or the number of network cards you can place in your
+system.
+
+5. What happens when a slave link dies?
+
+ If link monitoring is enabled, then the failing device will be
+disabled. The active-backup mode will fail over to a backup link, and
+other modes will ignore the failed link. The link will continue to be
+monitored, and should it recover, it will rejoin the bond (in whatever
+manner is appropriate for the mode). See the sections on High
+Availability and the documentation for each mode for additional
+information.
+
+ Link monitoring can be enabled via either the miimon or
+arp_interval parameters (described in the module parameters section,
+above). In general, miimon monitors the carrier state as sensed by
+the underlying network device, and the arp monitor (arp_interval)
+monitors connectivity to another host on the local network.
+
+ If no link monitoring is configured, the bonding driver will
+be unable to detect link failures, and will assume that all links are
+always available. This will likely result in lost packets, and a
+resulting degradation of performance. The precise performance loss
+depends upon the bonding mode and network configuration.
+
+6. Can bonding be used for High Availability?
+
+ Yes. See the section on High Availability for details.
+
+7. Which switches/systems does it work with?
+
+ The full answer to this depends upon the desired mode.
+
+ In the basic balance modes (balance-rr and balance-xor), it
+works with any system that supports etherchannel (also called
+trunking). Most managed switches currently available have such
+support, and many unmanaged switches as well.
+
+ The advanced balance modes (balance-tlb and balance-alb) do
+not have special switch requirements, but do need device drivers that
+support specific features (described in the appropriate section under
+module parameters, above).
+
+ In 802.3ad mode, it works with systems that support IEEE
+802.3ad Dynamic Link Aggregation. Most managed and many unmanaged
+switches currently available support 802.3ad.
+
+ The active-backup mode should work with any Layer-II switch.
+
+8. Where does a bonding device get its MAC address from?
+
+ When using slave devices that have fixed MAC addresses, or when
+the fail_over_mac option is enabled, the bonding device's MAC address is
+the MAC address of the active slave.
+
+ For other configurations, if not explicitly configured (with
+ifconfig or ip link), the MAC address of the bonding device is taken from
+its first slave device. This MAC address is then passed to all following
+slaves and remains persistent (even if the first slave is removed) until
+the bonding device is brought down or reconfigured.
+
+ If you wish to change the MAC address, you can set it with
+ifconfig or ip link:
+
+# ifconfig bond0 hw ether 00:11:22:33:44:55
+
+# ip link set bond0 address 66:77:88:99:aa:bb
+
+ The MAC address can be also changed by bringing down/up the
+device and then changing its slaves (or their order):
+
+# ifconfig bond0 down ; modprobe -r bonding
+# ifconfig bond0 .... up
+# ifenslave bond0 eth...
+
+ This method will automatically take the address from the next
+slave that is added.
+
+ To restore your slaves' MAC addresses, you need to detach them
+from the bond (`ifenslave -d bond0 eth0'). The bonding driver will
+then restore the MAC addresses that the slaves had before they were
+enslaved.
+
+16. Resources and Links
+=======================
+
+ The latest version of the bonding driver can be found in the latest
+version of the linux kernel, found on http://kernel.org
+
+ The latest version of this document can be found in the latest kernel
+source (named Documentation/networking/bonding.txt).
+
+ Discussions regarding the usage of the bonding driver take place on the
+bonding-devel mailing list, hosted at sourceforge.net. If you have questions or
+problems, post them to the list. The list address is:
+
+bonding-devel@lists.sourceforge.net
+
+ The administrative interface (to subscribe or unsubscribe) can
+be found at:
+
+https://lists.sourceforge.net/lists/listinfo/bonding-devel
+
+ Discussions regarding the development of the bonding driver take place
+on the main Linux network mailing list, hosted at vger.kernel.org. The list
+address is:
+
+netdev@vger.kernel.org
+
+ The administrative interface (to subscribe or unsubscribe) can
+be found at:
+
+http://vger.kernel.org/vger-lists.html#netdev
+
+Donald Becker's Ethernet Drivers and diag programs may be found at :
+ - http://web.archive.org/web/*/http://www.scyld.com/network/
+
+You will also find a lot of information regarding Ethernet, NWay, MII,
+etc. at www.scyld.com.
+
+-- END --
diff --git a/Documentation/networking/bridge.txt b/Documentation/networking/bridge.txt
new file mode 100644
index 000000000..a27cb6214
--- /dev/null
+++ b/Documentation/networking/bridge.txt
@@ -0,0 +1,15 @@
+In order to use the Ethernet bridging functionality, you'll need the
+userspace tools.
+
+Documentation for Linux bridging is on:
+ http://www.linuxfoundation.org/collaborate/workgroups/networking/bridge
+
+The bridge-utilities are maintained at:
+ git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/bridge-utils.git
+
+Additionally, the iproute2 utilities can be used to configure
+bridge devices.
+
+If you still have questions, don't hesitate to post to the mailing list
+(more info https://lists.linux-foundation.org/mailman/listinfo/bridge).
+
diff --git a/Documentation/networking/caif/Linux-CAIF.txt b/Documentation/networking/caif/Linux-CAIF.txt
new file mode 100644
index 000000000..0aa4bd381
--- /dev/null
+++ b/Documentation/networking/caif/Linux-CAIF.txt
@@ -0,0 +1,175 @@
+Linux CAIF
+===========
+copyright (C) ST-Ericsson AB 2010
+Author: Sjur Brendeland/ sjur.brandeland@stericsson.com
+License terms: GNU General Public License (GPL) version 2
+
+
+Introduction
+------------
+CAIF is a MUX protocol used by ST-Ericsson cellular modems for
+communication between Modem and host. The host processes can open virtual AT
+channels, initiate GPRS Data connections, Video channels and Utility Channels.
+The Utility Channels are general purpose pipes between modem and host.
+
+ST-Ericsson modems support a number of transports between modem
+and host. Currently, UART and Loopback are available for Linux.
+
+
+Architecture:
+------------
+The implementation of CAIF is divided into:
+* CAIF Socket Layer and GPRS IP Interface.
+* CAIF Core Protocol Implementation
+* CAIF Link Layer, implemented as NET devices.
+
+
+ RTNL
+ !
+ ! +------+ +------+
+ ! +------+! +------+!
+ ! ! IP !! !Socket!!
+ +-------> !interf!+ ! API !+ <- CAIF Client APIs
+ ! +------+ +------!
+ ! ! !
+ ! +-----------+
+ ! !
+ ! +------+ <- CAIF Core Protocol
+ ! ! CAIF !
+ ! ! Core !
+ ! +------+
+ ! +----------!---------+
+ ! ! ! !
+ ! +------+ +-----+ +------+
+ +--> ! HSI ! ! TTY ! ! USB ! <- Link Layer (Net Devices)
+ +------+ +-----+ +------+
+
+
+
+I M P L E M E N T A T I O N
+===========================
+
+
+CAIF Core Protocol Layer
+=========================================
+
+CAIF Core layer implements the CAIF protocol as defined by ST-Ericsson.
+It implements the CAIF protocol stack in a layered approach, where
+each layer described in the specification is implemented as a separate layer.
+The architecture is inspired by the design patterns "Protocol Layer" and
+"Protocol Packet".
+
+== CAIF structure ==
+The Core CAIF implementation contains:
+ - Simple implementation of CAIF.
+ - Layered architecture (a la Streams), each layer in the CAIF
+ specification is implemented in a separate c-file.
+ - Clients must call configuration function to add PHY layer.
+ - Clients must implement CAIF layer to consume/produce
+ CAIF payload with receive and transmit functions.
+ - Clients must call configuration function to add and connect the
+ Client layer.
+ - When receiving / transmitting CAIF Packets (cfpkt), ownership is passed
+ to the called function (except for framing layers' receive function)
+
+Layered Architecture
+--------------------
+The CAIF protocol can be divided into two parts: Support functions and Protocol
+Implementation. The support functions include:
+
+ - CFPKT CAIF Packet. Implementation of CAIF Protocol Packet. The
+ CAIF Packet has functions for creating, destroying and adding content
+ and for adding/extracting header and trailers to protocol packets.
+
+The CAIF Protocol implementation contains:
+
+ - CFCNFG CAIF Configuration layer. Configures the CAIF Protocol
+ Stack and provides a Client interface for adding Link-Layer and
+ Driver interfaces on top of the CAIF Stack.
+
+ - CFCTRL CAIF Control layer. Encodes and Decodes control messages
+ such as enumeration and channel setup. Also matches request and
+ response messages.
+
+ - CFSERVL General CAIF Service Layer functionality; handles flow
+ control and remote shutdown requests.
+
+ - CFVEI CAIF VEI layer. Handles CAIF AT Channels on VEI (Virtual
+ External Interface). This layer encodes/decodes VEI frames.
+
+ - CFDGML CAIF Datagram layer. Handles CAIF Datagram layer (IP
+ traffic), encodes/decodes Datagram frames.
+
+ - CFMUX CAIF Mux layer. Handles multiplexing between multiple
+ physical bearers and multiple channels such as VEI, Datagram, etc.
+ The MUX keeps track of the existing CAIF Channels and
+ Physical Instances and selects the appropriate instance based
+ on Channel-Id and Physical-ID.
+
+ - CFFRML CAIF Framing layer. Handles Framing i.e. Frame length
+ and frame checksum.
+
+ - CFSERL CAIF Serial layer. Handles concatenation/split of frames
+ into CAIF Frames with correct length.
+
+
+
+ +---------+
+ | Config |
+ | CFCNFG |
+ +---------+
+ !
+ +---------+ +---------+ +---------+
+ | AT | | Control | | Datagram|
+ | CFVEIL | | CFCTRL | | CFDGML |
+ +---------+ +---------+ +---------+
+ \_____________!______________/
+ !
+ +---------+
+ | MUX |
+ | |
+ +---------+
+ _____!_____
+ / \
+ +---------+ +---------+
+ | CFFRML | | CFFRML |
+ | Framing | | Framing |
+ +---------+ +---------+
+ ! !
+ +---------+ +---------+
+ | | | Serial |
+ | | | CFSERL |
+ +---------+ +---------+
+
+
+In this layered approach the following "rules" apply.
+ - All layers embed the same structure "struct cflayer"
+ - A layer does not depend on any other layer's private data.
+ - Layers are stacked by setting the pointers
+ layer->up , layer->dn
+ - In order to send data upwards, each layer should do
+ layer->up->receive(layer->up, packet);
+ - In order to send data downwards, each layer should do
+ layer->dn->transmit(layer->dn, packet);
+
+
+CAIF Socket and IP interface
+===========================
+
+The IP interface and CAIF socket API are implemented on top of the
+CAIF Core protocol. The IP Interface and CAIF socket have an instance of
+'struct cflayer', just like the CAIF Core protocol stack.
+Net device and Socket implement the 'receive()' function defined by
+'struct cflayer', just like the rest of the CAIF stack. In this way, transmit and
+receive of packets is handled as by the rest of the layers: the 'dn->transmit()'
+function is called in order to transmit data.
+
+Configuration of Link Layer
+---------------------------
+The Link Layer is implemented as Linux network devices (struct net_device).
+Payload handling and registration is done using standard Linux mechanisms.
+
+The CAIF Protocol relies on a loss-less link layer without implementing
+retransmission. This implies that packet drops must not happen.
+Therefore a flow-control mechanism is implemented where the physical
+interface can initiate flow stop for all CAIF Channels.
diff --git a/Documentation/networking/caif/README b/Documentation/networking/caif/README
new file mode 100644
index 000000000..757ccfaa1
--- /dev/null
+++ b/Documentation/networking/caif/README
@@ -0,0 +1,109 @@
+Copyright (C) ST-Ericsson AB 2010
+Author: Sjur Brendeland/ sjur.brandeland@stericsson.com
+License terms: GNU General Public License (GPL) version 2
+---------------------------------------------------------
+
+=== Start ===
+If you have compiled CAIF for modules do:
+
+$modprobe crc_ccitt
+$modprobe caif
+$modprobe caif_socket
+$modprobe chnl_net
+
+
+=== Preparing the setup with a STE modem ===
+
+If you are working on integration of CAIF you should make sure
+that the kernel is built with module support.
+
+There are some things that need to be tweaked to get the host TTY correctly
+set up to talk to the modem.
+Since the CAIF stack is running in the kernel and we want to use the existing
+TTY, we are installing our physical serial driver as a line discipline above
+the TTY device.
+
+To achieve this we need to install the N_CAIF ldisc from user space.
+The benefit is that we can hook up to any TTY.
+
+The use of Start-of-frame-extension (STX) must also be set as
+module parameter "ser_use_stx".
+
+Normally Frame Checksum is always used on UART, but this is also provided as a
+module parameter "ser_use_fcs".
+
+$ modprobe caif_serial ser_ttyname=/dev/ttyS0 ser_use_stx=yes
+$ ifconfig caif_ttyS0 up
+
+PLEASE NOTE: There is a limitation in Android shell.
+ It only accepts one argument to insmod/modprobe!
+
+=== Trouble shooting ===
+
+There are debugfs parameters provided for serial communication.
+/sys/kernel/debug/caif_serial/<tty-name>/
+
+* ser_state: Prints the bit-mask status where
+ - 0x02 means SENDING, this is a transient state.
+ - 0x10 means FLOW_OFF_SENT, i.e. the previous frame has not been sent
+ and is blocking further send operation. Flow OFF has been propagated
+ to all CAIF Channels using this TTY.
+
+* tty_status: Prints the bit-mask tty status information
+ - 0x01 - tty->warned is on.
+ - 0x02 - tty->low_latency is on.
+ - 0x04 - tty->packed is on.
+ - 0x08 - tty->flow_stopped is on.
+ - 0x10 - tty->hw_stopped is on.
+ - 0x20 - tty->stopped is on.
+
+* last_tx_msg: Binary blob Prints the last transmitted frame.
+ This can be printed with
+ $od --format=x1 /sys/kernel/debug/caif_serial/<tty>/last_rx_msg.
+ The first two tx messages sent look like this. Note: The initial
+ byte 02 is start of frame extension (STX) used for re-syncing
+ upon errors.
+
+ - Enumeration:
+ 0000000 02 05 00 00 03 01 d2 02
+ | | | | | |
+ STX(1) | | | |
+ Length(2)| | |
+ Control Channel(1)
+ Command:Enumeration(1)
+ Link-ID(1)
+ Checksum(2)
+ - Channel Setup:
+ 0000000 02 07 00 00 00 21 a1 00 48 df
+ | | | | | | | |
+ STX(1) | | | | | |
+ Length(2)| | | | |
+ Control Channel(1)
+ Command:Channel Setup(1)
+ Channel Type(1)
+ Priority and Link-ID(1)
+ Endpoint(1)
+ Checksum(2)
+
+* last_rx_msg: Prints the last transmitted frame.
+ The RX messages for LinkSetup look almost identical but they have the
+ bit 0x20 set in the command bit, and Channel Setup has added one byte
+ before Checksum containing Channel ID.
+ NOTE: Several CAIF Messages might be concatenated. The maximum debug
+ buffer size is 128 bytes.
+
+== Error Scenarios:
+- last_tx_msg contains channel setup message and last_rx_msg is empty ->
+ The host seems to be able to send over the UART, at least the CAIF ldisc get
+ notified that sending is completed.
+
+- last_tx_msg contains enumeration message and last_rx_msg is empty ->
+ The host is not able to send the message from UART, the tty has not been
+ able to complete the transmit operation.
+
+- if /sys/kernel/debug/caif_serial/<tty>/tty_status is non-zero there
+ might be problems transmitting over UART.
+ E.g. host and modem wiring is not correct you will typically see
+ tty_status = 0x10 (hw_stopped) and ser_state = 0x10 (FLOW_OFF_SENT).
+ You will probably see the enumeration message in last_tx_message
+ and empty last_rx_message.
diff --git a/Documentation/networking/caif/spi_porting.txt b/Documentation/networking/caif/spi_porting.txt
new file mode 100644
index 000000000..9efd0687d
--- /dev/null
+++ b/Documentation/networking/caif/spi_porting.txt
@@ -0,0 +1,208 @@
+- CAIF SPI porting -
+
+- CAIF SPI basics:
+
+Running CAIF over SPI needs some extra setup, owing to the nature of SPI.
+Two extra GPIOs have been added in order to negotiate the transfers
+ between the master and the slave. The minimum requirement for running
+CAIF over SPI is a SPI slave chip and two GPIOs (more details below).
+Please note that running as a slave implies that you need to keep up
+with the master clock. An overrun or underrun event is fatal.
+
+- CAIF SPI framework:
+
+To make porting as easy as possible, the CAIF SPI has been divided in
+two parts. The first part (called the interface part) deals with all
+generic functionality such as length framing, SPI frame negotiation
+and SPI frame delivery and transmission. The other part is the CAIF
+SPI slave device part, which is the module that you have to write if
+you want to run SPI CAIF on a new hardware. This part takes care of
+the physical hardware, both with regard to SPI and to GPIOs.
+
+- Implementing a CAIF SPI device:
+
+ - Functionality provided by the CAIF SPI slave device:
+
+ In order to implement a SPI device you will, as a minimum,
+ need to implement the following
+ functions:
+
+ int (*init_xfer) (struct cfspi_xfer * xfer, struct cfspi_dev *dev):
+
+ This function is called by the CAIF SPI interface to give
+ you a chance to set up your hardware to be ready to receive
+ a stream of data from the master. The xfer structure contains
+ both physical and logical addresses, as well as the total length
+ of the transfer in both directions.The dev parameter can be used
+ to map to different CAIF SPI slave devices.
+
+ void (*sig_xfer) (bool xfer, struct cfspi_dev *dev):
+
+ This function is called by the CAIF SPI interface when the output
+ (SPI_INT) GPIO needs to change state. The boolean value of the xfer
+ variable indicates whether the GPIO should be asserted (HIGH) or
+ deasserted (LOW). The dev parameter can be used to map to different CAIF
+ SPI slave devices.
+
+ - Functionality provided by the CAIF SPI interface:
+
+ void (*ss_cb) (bool assert, struct cfspi_ifc *ifc);
+
+ This function is called by the CAIF SPI slave device in order to
+ signal a change of state of the input GPIO (SS) to the interface.
+ Only active edges are mandatory to be reported.
+ This function can be called from IRQ context (recommended in order
+ not to introduce latency). The ifc parameter should be the pointer
+ returned from the platform probe function in the SPI device structure.
+
+ void (*xfer_done_cb) (struct cfspi_ifc *ifc);
+
+ This function is called by the CAIF SPI slave device in order to
+ report that a transfer is completed. This function should only be
+ called once both the transmission and the reception are completed.
+ This function can be called from IRQ context (recommended in order
+ not to introduce latency). The ifc parameter should be the pointer
+ returned from the platform probe function in the SPI device structure.
+
+ - Connecting the bits and pieces:
+
+ - Filling in the SPI slave device structure:
+
+ Connect the necessary callback functions.
+ Indicate clock speed (used to calculate toggle delays).
+ Chose a suitable name (helps debugging if you use several CAIF
+ SPI slave devices).
+ Assign your private data (can be used to map to your structure).
+
+ - Filling in the SPI slave platform device structure:
+ Add name of driver to connect to ("cfspi_sspi").
+ Assign the SPI slave device structure as platform data.
+
+- Padding:
+
+In order to optimize throughput, a number of SPI padding options are provided.
+Padding can be enabled independently for uplink and downlink transfers.
+Padding can be enabled for the head, the tail and for the total frame size.
+The padding needs to be correctly configured on both sides of the link.
+The padding can be changed via module parameters in cfspi_sspi.c or via
+the sysfs directory of the cfspi_sspi driver (before device registration).
+
+- CAIF SPI device template:
+
+/*
+ * Copyright (C) ST-Ericsson AB 2010
+ * Author: Daniel Martensson / Daniel.Martensson@stericsson.com
+ * License terms: GNU General Public License (GPL), version 2.
+ *
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/device.h>
+#include <linux/wait.h>
+#include <linux/interrupt.h>
+#include <linux/dma-mapping.h>
+#include <net/caif/caif_spi.h>
+
+MODULE_LICENSE("GPL");
+
+struct sspi_struct {
+ struct cfspi_dev sdev;
+ struct cfspi_xfer *xfer;
+};
+
+static struct sspi_struct slave;
+static struct platform_device slave_device;
+
+static irqreturn_t sspi_irq(int irq, void *arg)
+{
+ /* You only need to trigger on an edge to the active state of the
+ * SS signal. Once a edge is detected, the ss_cb() function should be
+ * called with the parameter assert set to true. It is OK
+ * (and even advised) to call the ss_cb() function in IRQ context in
+ * order not to add any delay. */
+
+ return IRQ_HANDLED;
+}
+
+static void sspi_complete(void *context)
+{
+ /* Normally the DMA or the SPI framework will call you back
+ * in something similar to this. The only thing you need to
+ * do is to call the xfer_done_cb() function, providing the pointer
+ * to the CAIF SPI interface. It is OK to call this function
+ * from IRQ context. */
+}
+
+static int sspi_init_xfer(struct cfspi_xfer *xfer, struct cfspi_dev *dev)
+{
+ /* Store transfer info. For a normal implementation you should
+ * set up your DMA here and make sure that you are ready to
+ * receive the data from the master SPI. */
+
+ struct sspi_struct *sspi = (struct sspi_struct *)dev->priv;
+
+ sspi->xfer = xfer;
+
+ return 0;
+}
+
+void sspi_sig_xfer(bool xfer, struct cfspi_dev *dev)
+{
+ /* If xfer is true then you should assert the SPI_INT to indicate to
+ * the master that you are ready to receive the data from the master
+ * SPI. If xfer is false then you should de-assert SPI_INT to indicate
+ * that the transfer is done.
+ */
+
+ struct sspi_struct *sspi = (struct sspi_struct *)dev->priv;
+}
+
+static void sspi_release(struct device *dev)
+{
+ /*
+ * Here you should release your SPI device resources.
+ */
+}
+
+static int __init sspi_init(void)
+{
+ /* Here you should initialize your SPI device by providing the
+ * necessary functions, clock speed, name and private data. Once
+ * done, you can register your device with the
+ * platform_device_register() function. This function will return
+ * with the CAIF SPI interface initialized. This is probably also
+ * the place where you should set up your GPIOs, interrupts and SPI
+ * resources. */
+
+ int res = 0;
+
+ /* Initialize slave device. */
+ slave.sdev.init_xfer = sspi_init_xfer;
+ slave.sdev.sig_xfer = sspi_sig_xfer;
+ slave.sdev.clk_mhz = 13;
+ slave.sdev.priv = &slave;
+ slave.sdev.name = "spi_sspi";
+ slave_device.dev.release = sspi_release;
+
+ /* Initialize platform device. */
+ slave_device.name = "cfspi_sspi";
+ slave_device.dev.platform_data = &slave.sdev;
+
+ /* Register platform device. */
+ res = platform_device_register(&slave_device);
+ if (res) {
+ printk(KERN_WARNING "sspi_init: failed to register dev.\n");
+ return -ENODEV;
+ }
+
+ return res;
+}
+
+static void __exit sspi_exit(void)
+{
+ platform_device_del(&slave_device);
+}
+
+module_init(sspi_init);
+module_exit(sspi_exit);
diff --git a/Documentation/networking/can.txt b/Documentation/networking/can.txt
new file mode 100644
index 000000000..5abad1e92
--- /dev/null
+++ b/Documentation/networking/can.txt
@@ -0,0 +1,1216 @@
+============================================================================
+
+can.txt
+
+Readme file for the Controller Area Network Protocol Family (aka SocketCAN)
+
+This file contains
+
+ 1 Overview / What is SocketCAN
+
+ 2 Motivation / Why using the socket API
+
+ 3 SocketCAN concept
+ 3.1 receive lists
+ 3.2 local loopback of sent frames
+ 3.3 network problem notifications
+
+ 4 How to use SocketCAN
+ 4.1 RAW protocol sockets with can_filters (SOCK_RAW)
+ 4.1.1 RAW socket option CAN_RAW_FILTER
+ 4.1.2 RAW socket option CAN_RAW_ERR_FILTER
+ 4.1.3 RAW socket option CAN_RAW_LOOPBACK
+ 4.1.4 RAW socket option CAN_RAW_RECV_OWN_MSGS
+ 4.1.5 RAW socket option CAN_RAW_FD_FRAMES
+ 4.1.6 RAW socket option CAN_RAW_JOIN_FILTERS
+ 4.1.7 RAW socket returned message flags
+ 4.2 Broadcast Manager protocol sockets (SOCK_DGRAM)
+ 4.2.1 Broadcast Manager operations
+ 4.2.2 Broadcast Manager message flags
+ 4.2.3 Broadcast Manager transmission timers
+ 4.2.4 Broadcast Manager message sequence transmission
+ 4.2.5 Broadcast Manager receive filter timers
+ 4.2.6 Broadcast Manager multiplex message receive filter
+ 4.3 connected transport protocols (SOCK_SEQPACKET)
+ 4.4 unconnected transport protocols (SOCK_DGRAM)
+
+ 5 SocketCAN core module
+ 5.1 can.ko module params
+ 5.2 procfs content
+ 5.3 writing own CAN protocol modules
+
+ 6 CAN network drivers
+ 6.1 general settings
+ 6.2 local loopback of sent frames
+ 6.3 CAN controller hardware filters
+ 6.4 The virtual CAN driver (vcan)
+ 6.5 The CAN network device driver interface
+ 6.5.1 Netlink interface to set/get devices properties
+ 6.5.2 Setting the CAN bit-timing
+ 6.5.3 Starting and stopping the CAN network device
+ 6.6 CAN FD (flexible data rate) driver support
+ 6.7 supported CAN hardware
+
+ 7 SocketCAN resources
+
+ 8 Credits
+
+============================================================================
+
+1. Overview / What is SocketCAN
+--------------------------------
+
+The socketcan package is an implementation of CAN protocols
+(Controller Area Network) for Linux. CAN is a networking technology
+which has widespread use in automation, embedded devices, and
+automotive fields. While there have been other CAN implementations
+for Linux based on character devices, SocketCAN uses the Berkeley
+socket API, the Linux network stack and implements the CAN device
+drivers as network interfaces. The CAN socket API has been designed
+as similar as possible to the TCP/IP protocols to allow programmers,
+familiar with network programming, to easily learn how to use CAN
+sockets.
+
+2. Motivation / Why using the socket API
+----------------------------------------
+
+There have been CAN implementations for Linux before SocketCAN so the
+question arises, why we have started another project. Most existing
+implementations come as a device driver for some CAN hardware, they
+are based on character devices and provide comparatively little
+functionality. Usually, there is only a hardware-specific device
+driver which provides a character device interface to send and
+receive raw CAN frames, directly to/from the controller hardware.
+Queueing of frames and higher-level transport protocols like ISO-TP
+have to be implemented in user space applications. Also, most
+character-device implementations support only one single process to
+open the device at a time, similar to a serial interface. Exchanging
+the CAN controller requires employment of another device driver and
+often the need for adaption of large parts of the application to the
+new driver's API.
+
+SocketCAN was designed to overcome all of these limitations. A new
+protocol family has been implemented which provides a socket interface
+to user space applications and which builds upon the Linux network
+layer, enabling use all of the provided queueing functionality. A device
+driver for CAN controller hardware registers itself with the Linux
+network layer as a network device, so that CAN frames from the
+controller can be passed up to the network layer and on to the CAN
+protocol family module and also vice-versa. Also, the protocol family
+module provides an API for transport protocol modules to register, so
+that any number of transport protocols can be loaded or unloaded
+dynamically. In fact, the can core module alone does not provide any
+protocol and cannot be used without loading at least one additional
+protocol module. Multiple sockets can be opened at the same time,
+on different or the same protocol module and they can listen/send
+frames on different or the same CAN IDs. Several sockets listening on
+the same interface for frames with the same CAN ID are all passed the
+same received matching CAN frames. An application wishing to
+communicate using a specific transport protocol, e.g. ISO-TP, just
+selects that protocol when opening the socket, and then can read and
+write application data byte streams, without having to deal with
+CAN-IDs, frames, etc.
+
+Similar functionality visible from user-space could be provided by a
+character device, too, but this would lead to a technically inelegant
+solution for a couple of reasons:
+
+* Intricate usage. Instead of passing a protocol argument to
+ socket(2) and using bind(2) to select a CAN interface and CAN ID, an
+ application would have to do all these operations using ioctl(2)s.
+
+* Code duplication. A character device cannot make use of the Linux
+ network queueing code, so all that code would have to be duplicated
+ for CAN networking.
+
+* Abstraction. In most existing character-device implementations, the
+ hardware-specific device driver for a CAN controller directly
+ provides the character device for the application to work with.
+ This is at least very unusual in Unix systems for both, char and
+ block devices. For example you don't have a character device for a
+ certain UART of a serial interface, a certain sound chip in your
+ computer, a SCSI or IDE controller providing access to your hard
+ disk or tape streamer device. Instead, you have abstraction layers
+ which provide a unified character or block device interface to the
+ application on the one hand, and a interface for hardware-specific
+ device drivers on the other hand. These abstractions are provided
+ by subsystems like the tty layer, the audio subsystem or the SCSI
+ and IDE subsystems for the devices mentioned above.
+
+ The easiest way to implement a CAN device driver is as a character
+ device without such a (complete) abstraction layer, as is done by most
+ existing drivers. The right way, however, would be to add such a
+ layer with all the functionality like registering for certain CAN
+ IDs, supporting several open file descriptors and (de)multiplexing
+ CAN frames between them, (sophisticated) queueing of CAN frames, and
+ providing an API for device drivers to register with. However, then
+ it would be no more difficult, or may be even easier, to use the
+ networking framework provided by the Linux kernel, and this is what
+ SocketCAN does.
+
+ The use of the networking framework of the Linux kernel is just the
+ natural and most appropriate way to implement CAN for Linux.
+
+3. SocketCAN concept
+---------------------
+
+ As described in chapter 2 it is the main goal of SocketCAN to
+ provide a socket interface to user space applications which builds
+ upon the Linux network layer. In contrast to the commonly known
+ TCP/IP and ethernet networking, the CAN bus is a broadcast-only(!)
+ medium that has no MAC-layer addressing like ethernet. The CAN-identifier
+ (can_id) is used for arbitration on the CAN-bus. Therefore the CAN-IDs
+ have to be chosen uniquely on the bus. When designing a CAN-ECU
+ network the CAN-IDs are mapped to be sent by a specific ECU.
+ For this reason a CAN-ID can be treated best as a kind of source address.
+
+ 3.1 receive lists
+
+ The network transparent access of multiple applications leads to the
+ problem that different applications may be interested in the same
+ CAN-IDs from the same CAN network interface. The SocketCAN core
+ module - which implements the protocol family CAN - provides several
+ high efficient receive lists for this reason. If e.g. a user space
+ application opens a CAN RAW socket, the raw protocol module itself
+ requests the (range of) CAN-IDs from the SocketCAN core that are
+ requested by the user. The subscription and unsubscription of
+ CAN-IDs can be done for specific CAN interfaces or for all(!) known
+ CAN interfaces with the can_rx_(un)register() functions provided to
+ CAN protocol modules by the SocketCAN core (see chapter 5).
+ To optimize the CPU usage at runtime the receive lists are split up
+ into several specific lists per device that match the requested
+ filter complexity for a given use-case.
+
+ 3.2 local loopback of sent frames
+
+ As known from other networking concepts the data exchanging
+ applications may run on the same or different nodes without any
+ change (except for the according addressing information):
+
+ ___ ___ ___ _______ ___
+ | _ | | _ | | _ | | _ _ | | _ |
+ ||A|| ||B|| ||C|| ||A| |B|| ||C||
+ |___| |___| |___| |_______| |___|
+ | | | | |
+ -----------------(1)- CAN bus -(2)---------------
+
+ To ensure that application A receives the same information in the
+ example (2) as it would receive in example (1) there is need for
+ some kind of local loopback of the sent CAN frames on the appropriate
+ node.
+
+ The Linux network devices (by default) just can handle the
+ transmission and reception of media dependent frames. Due to the
+ arbitration on the CAN bus the transmission of a low prio CAN-ID
+ may be delayed by the reception of a high prio CAN frame. To
+ reflect the correct* traffic on the node the loopback of the sent
+ data has to be performed right after a successful transmission. If
+ the CAN network interface is not capable of performing the loopback for
+ some reason the SocketCAN core can do this task as a fallback solution.
+ See chapter 6.2 for details (recommended).
+
+ The loopback functionality is enabled by default to reflect standard
+ networking behaviour for CAN applications. Due to some requests from
+ the RT-SocketCAN group the loopback optionally may be disabled for each
+ separate socket. See sockopts from the CAN RAW sockets in chapter 4.1.
+
+ * = you really like to have this when you're running analyser tools
+ like 'candump' or 'cansniffer' on the (same) node.
+
+ 3.3 network problem notifications
+
+ The use of the CAN bus may lead to several problems on the physical
+ and media access control layer. Detecting and logging of these lower
+ layer problems is a vital requirement for CAN users to identify
+ hardware issues on the physical transceiver layer as well as
+ arbitration problems and error frames caused by the different
+ ECUs. The occurrence of detected errors are important for diagnosis
+ and have to be logged together with the exact timestamp. For this
+ reason the CAN interface driver can generate so called Error Message
+ Frames that can optionally be passed to the user application in the
+ same way as other CAN frames. Whenever an error on the physical layer
+ or the MAC layer is detected (e.g. by the CAN controller) the driver
+ creates an appropriate error message frame. Error messages frames can
+ be requested by the user application using the common CAN filter
+ mechanisms. Inside this filter definition the (interested) type of
+ errors may be selected. The reception of error messages is disabled
+ by default. The format of the CAN error message frame is briefly
+ described in the Linux header file "include/uapi/linux/can/error.h".
+
+4. How to use SocketCAN
+------------------------
+
+ Like TCP/IP, you first need to open a socket for communicating over a
+ CAN network. Since SocketCAN implements a new protocol family, you
+ need to pass PF_CAN as the first argument to the socket(2) system
+ call. Currently, there are two CAN protocols to choose from, the raw
+ socket protocol and the broadcast manager (BCM). So to open a socket,
+ you would write
+
+ s = socket(PF_CAN, SOCK_RAW, CAN_RAW);
+
+ and
+
+ s = socket(PF_CAN, SOCK_DGRAM, CAN_BCM);
+
+ respectively. After the successful creation of the socket, you would
+ normally use the bind(2) system call to bind the socket to a CAN
+ interface (which is different from TCP/IP due to different addressing
+ - see chapter 3). After binding (CAN_RAW) or connecting (CAN_BCM)
+ the socket, you can read(2) and write(2) from/to the socket or use
+ send(2), sendto(2), sendmsg(2) and the recv* counterpart operations
+ on the socket as usual. There are also CAN specific socket options
+ described below.
+
+ The basic CAN frame structure and the sockaddr structure are defined
+ in include/linux/can.h:
+
+ struct can_frame {
+ canid_t can_id; /* 32 bit CAN_ID + EFF/RTR/ERR flags */
+ __u8 can_dlc; /* frame payload length in byte (0 .. 8) */
+ __u8 data[8] __attribute__((aligned(8)));
+ };
+
+ The alignment of the (linear) payload data[] to a 64bit boundary
+ allows the user to define their own structs and unions to easily access
+ the CAN payload. There is no given byteorder on the CAN bus by
+ default. A read(2) system call on a CAN_RAW socket transfers a
+ struct can_frame to the user space.
+
+ The sockaddr_can structure has an interface index like the
+ PF_PACKET socket, that also binds to a specific interface:
+
+ struct sockaddr_can {
+ sa_family_t can_family;
+ int can_ifindex;
+ union {
+ /* transport protocol class address info (e.g. ISOTP) */
+ struct { canid_t rx_id, tx_id; } tp;
+
+ /* reserved for future CAN protocols address information */
+ } can_addr;
+ };
+
+ To determine the interface index an appropriate ioctl() has to
+ be used (example for CAN_RAW sockets without error checking):
+
+ int s;
+ struct sockaddr_can addr;
+ struct ifreq ifr;
+
+ s = socket(PF_CAN, SOCK_RAW, CAN_RAW);
+
+ strcpy(ifr.ifr_name, "can0" );
+ ioctl(s, SIOCGIFINDEX, &ifr);
+
+ addr.can_family = AF_CAN;
+ addr.can_ifindex = ifr.ifr_ifindex;
+
+ bind(s, (struct sockaddr *)&addr, sizeof(addr));
+
+ (..)
+
+ To bind a socket to all(!) CAN interfaces the interface index must
+ be 0 (zero). In this case the socket receives CAN frames from every
+ enabled CAN interface. To determine the originating CAN interface
+ the system call recvfrom(2) may be used instead of read(2). To send
+ on a socket that is bound to 'any' interface sendto(2) is needed to
+ specify the outgoing interface.
+
+ Reading CAN frames from a bound CAN_RAW socket (see above) consists
+ of reading a struct can_frame:
+
+ struct can_frame frame;
+
+ nbytes = read(s, &frame, sizeof(struct can_frame));
+
+ if (nbytes < 0) {
+ perror("can raw socket read");
+ return 1;
+ }
+
+ /* paranoid check ... */
+ if (nbytes < sizeof(struct can_frame)) {
+ fprintf(stderr, "read: incomplete CAN frame\n");
+ return 1;
+ }
+
+ /* do something with the received CAN frame */
+
+ Writing CAN frames can be done similarly, with the write(2) system call:
+
+ nbytes = write(s, &frame, sizeof(struct can_frame));
+
+ When the CAN interface is bound to 'any' existing CAN interface
+ (addr.can_ifindex = 0) it is recommended to use recvfrom(2) if the
+ information about the originating CAN interface is needed:
+
+ struct sockaddr_can addr;
+ struct ifreq ifr;
+ socklen_t len = sizeof(addr);
+ struct can_frame frame;
+
+ nbytes = recvfrom(s, &frame, sizeof(struct can_frame),
+ 0, (struct sockaddr*)&addr, &len);
+
+ /* get interface name of the received CAN frame */
+ ifr.ifr_ifindex = addr.can_ifindex;
+ ioctl(s, SIOCGIFNAME, &ifr);
+ printf("Received a CAN frame from interface %s", ifr.ifr_name);
+
+ To write CAN frames on sockets bound to 'any' CAN interface the
+ outgoing interface has to be defined certainly.
+
+ strcpy(ifr.ifr_name, "can0");
+ ioctl(s, SIOCGIFINDEX, &ifr);
+ addr.can_ifindex = ifr.ifr_ifindex;
+ addr.can_family = AF_CAN;
+
+ nbytes = sendto(s, &frame, sizeof(struct can_frame),
+ 0, (struct sockaddr*)&addr, sizeof(addr));
+
+ Remark about CAN FD (flexible data rate) support:
+
+ Generally the handling of CAN FD is very similar to the formerly described
+ examples. The new CAN FD capable CAN controllers support two different
+ bitrates for the arbitration phase and the payload phase of the CAN FD frame
+ and up to 64 bytes of payload. This extended payload length breaks all the
+ kernel interfaces (ABI) which heavily rely on the CAN frame with fixed eight
+ bytes of payload (struct can_frame) like the CAN_RAW socket. Therefore e.g.
+ the CAN_RAW socket supports a new socket option CAN_RAW_FD_FRAMES that
+ switches the socket into a mode that allows the handling of CAN FD frames
+ and (legacy) CAN frames simultaneously (see section 4.1.5).
+
+ The struct canfd_frame is defined in include/linux/can.h:
+
+ struct canfd_frame {
+ canid_t can_id; /* 32 bit CAN_ID + EFF/RTR/ERR flags */
+ __u8 len; /* frame payload length in byte (0 .. 64) */
+ __u8 flags; /* additional flags for CAN FD */
+ __u8 __res0; /* reserved / padding */
+ __u8 __res1; /* reserved / padding */
+ __u8 data[64] __attribute__((aligned(8)));
+ };
+
+ The struct canfd_frame and the existing struct can_frame have the can_id,
+ the payload length and the payload data at the same offset inside their
+ structures. This allows to handle the different structures very similar.
+ When the content of a struct can_frame is copied into a struct canfd_frame
+ all structure elements can be used as-is - only the data[] becomes extended.
+
+ When introducing the struct canfd_frame it turned out that the data length
+ code (DLC) of the struct can_frame was used as a length information as the
+ length and the DLC has a 1:1 mapping in the range of 0 .. 8. To preserve
+ the easy handling of the length information the canfd_frame.len element
+ contains a plain length value from 0 .. 64. So both canfd_frame.len and
+ can_frame.can_dlc are equal and contain a length information and no DLC.
+ For details about the distinction of CAN and CAN FD capable devices and
+ the mapping to the bus-relevant data length code (DLC), see chapter 6.6.
+
+ The length of the two CAN(FD) frame structures define the maximum transfer
+ unit (MTU) of the CAN(FD) network interface and skbuff data length. Two
+ definitions are specified for CAN specific MTUs in include/linux/can.h :
+
+ #define CAN_MTU (sizeof(struct can_frame)) == 16 => 'legacy' CAN frame
+ #define CANFD_MTU (sizeof(struct canfd_frame)) == 72 => CAN FD frame
+
+ 4.1 RAW protocol sockets with can_filters (SOCK_RAW)
+
+ Using CAN_RAW sockets is extensively comparable to the commonly
+ known access to CAN character devices. To meet the new possibilities
+ provided by the multi user SocketCAN approach, some reasonable
+ defaults are set at RAW socket binding time:
+
+ - The filters are set to exactly one filter receiving everything
+ - The socket only receives valid data frames (=> no error message frames)
+ - The loopback of sent CAN frames is enabled (see chapter 3.2)
+ - The socket does not receive its own sent frames (in loopback mode)
+
+ These default settings may be changed before or after binding the socket.
+ To use the referenced definitions of the socket options for CAN_RAW
+ sockets, include <linux/can/raw.h>.
+
+ 4.1.1 RAW socket option CAN_RAW_FILTER
+
+ The reception of CAN frames using CAN_RAW sockets can be controlled
+ by defining 0 .. n filters with the CAN_RAW_FILTER socket option.
+
+ The CAN filter structure is defined in include/linux/can.h:
+
+ struct can_filter {
+ canid_t can_id;
+ canid_t can_mask;
+ };
+
+ A filter matches, when
+
+ <received_can_id> & mask == can_id & mask
+
+ which is analogous to known CAN controllers hardware filter semantics.
+ The filter can be inverted in this semantic, when the CAN_INV_FILTER
+ bit is set in can_id element of the can_filter structure. In
+ contrast to CAN controller hardware filters the user may set 0 .. n
+ receive filters for each open socket separately:
+
+ struct can_filter rfilter[2];
+
+ rfilter[0].can_id = 0x123;
+ rfilter[0].can_mask = CAN_SFF_MASK;
+ rfilter[1].can_id = 0x200;
+ rfilter[1].can_mask = 0x700;
+
+ setsockopt(s, SOL_CAN_RAW, CAN_RAW_FILTER, &rfilter, sizeof(rfilter));
+
+ To disable the reception of CAN frames on the selected CAN_RAW socket:
+
+ setsockopt(s, SOL_CAN_RAW, CAN_RAW_FILTER, NULL, 0);
+
+ To set the filters to zero filters is quite obsolete as to not read
+ data causes the raw socket to discard the received CAN frames. But
+ having this 'send only' use-case we may remove the receive list in the
+ Kernel to save a little (really a very little!) CPU usage.
+
+ 4.1.1.1 CAN filter usage optimisation
+
+ The CAN filters are processed in per-device filter lists at CAN frame
+ reception time. To reduce the number of checks that need to be performed
+ while walking through the filter lists the CAN core provides an optimized
+ filter handling when the filter subscription focusses on a single CAN ID.
+
+ For the possible 2048 SFF CAN identifiers the identifier is used as an index
+ to access the corresponding subscription list without any further checks.
+ For the 2^29 possible EFF CAN identifiers a 10 bit XOR folding is used as
+ hash function to retrieve the EFF table index.
+
+ To benefit from the optimized filters for single CAN identifiers the
+ CAN_SFF_MASK or CAN_EFF_MASK have to be set into can_filter.mask together
+ with set CAN_EFF_FLAG and CAN_RTR_FLAG bits. A set CAN_EFF_FLAG bit in the
+ can_filter.mask makes clear that it matters whether a SFF or EFF CAN ID is
+ subscribed. E.g. in the example from above
+
+ rfilter[0].can_id = 0x123;
+ rfilter[0].can_mask = CAN_SFF_MASK;
+
+ both SFF frames with CAN ID 0x123 and EFF frames with 0xXXXXX123 can pass.
+
+ To filter for only 0x123 (SFF) and 0x12345678 (EFF) CAN identifiers the
+ filter has to be defined in this way to benefit from the optimized filters:
+
+ struct can_filter rfilter[2];
+
+ rfilter[0].can_id = 0x123;
+ rfilter[0].can_mask = (CAN_EFF_FLAG | CAN_RTR_FLAG | CAN_SFF_MASK);
+ rfilter[1].can_id = 0x12345678 | CAN_EFF_FLAG;
+ rfilter[1].can_mask = (CAN_EFF_FLAG | CAN_RTR_FLAG | CAN_EFF_MASK);
+
+ setsockopt(s, SOL_CAN_RAW, CAN_RAW_FILTER, &rfilter, sizeof(rfilter));
+
+ 4.1.2 RAW socket option CAN_RAW_ERR_FILTER
+
+ As described in chapter 3.4 the CAN interface driver can generate so
+ called Error Message Frames that can optionally be passed to the user
+ application in the same way as other CAN frames. The possible
+ errors are divided into different error classes that may be filtered
+ using the appropriate error mask. To register for every possible
+ error condition CAN_ERR_MASK can be used as value for the error mask.
+ The values for the error mask are defined in linux/can/error.h .
+
+ can_err_mask_t err_mask = ( CAN_ERR_TX_TIMEOUT | CAN_ERR_BUSOFF );
+
+ setsockopt(s, SOL_CAN_RAW, CAN_RAW_ERR_FILTER,
+ &err_mask, sizeof(err_mask));
+
+ 4.1.3 RAW socket option CAN_RAW_LOOPBACK
+
+ To meet multi user needs the local loopback is enabled by default
+ (see chapter 3.2 for details). But in some embedded use-cases
+ (e.g. when only one application uses the CAN bus) this loopback
+ functionality can be disabled (separately for each socket):
+
+ int loopback = 0; /* 0 = disabled, 1 = enabled (default) */
+
+ setsockopt(s, SOL_CAN_RAW, CAN_RAW_LOOPBACK, &loopback, sizeof(loopback));
+
+ 4.1.4 RAW socket option CAN_RAW_RECV_OWN_MSGS
+
+ When the local loopback is enabled, all the sent CAN frames are
+ looped back to the open CAN sockets that registered for the CAN
+ frames' CAN-ID on this given interface to meet the multi user
+ needs. The reception of the CAN frames on the same socket that was
+ sending the CAN frame is assumed to be unwanted and therefore
+ disabled by default. This default behaviour may be changed on
+ demand:
+
+ int recv_own_msgs = 1; /* 0 = disabled (default), 1 = enabled */
+
+ setsockopt(s, SOL_CAN_RAW, CAN_RAW_RECV_OWN_MSGS,
+ &recv_own_msgs, sizeof(recv_own_msgs));
+
+ 4.1.5 RAW socket option CAN_RAW_FD_FRAMES
+
+ CAN FD support in CAN_RAW sockets can be enabled with a new socket option
+ CAN_RAW_FD_FRAMES which is off by default. When the new socket option is
+ not supported by the CAN_RAW socket (e.g. on older kernels), switching the
+ CAN_RAW_FD_FRAMES option returns the error -ENOPROTOOPT.
+
+ Once CAN_RAW_FD_FRAMES is enabled the application can send both CAN frames
+ and CAN FD frames. OTOH the application has to handle CAN and CAN FD frames
+ when reading from the socket.
+
+ CAN_RAW_FD_FRAMES enabled: CAN_MTU and CANFD_MTU are allowed
+ CAN_RAW_FD_FRAMES disabled: only CAN_MTU is allowed (default)
+
+ Example:
+ [ remember: CANFD_MTU == sizeof(struct canfd_frame) ]
+
+ struct canfd_frame cfd;
+
+ nbytes = read(s, &cfd, CANFD_MTU);
+
+ if (nbytes == CANFD_MTU) {
+ printf("got CAN FD frame with length %d\n", cfd.len);
+ /* cfd.flags contains valid data */
+ } else if (nbytes == CAN_MTU) {
+ printf("got legacy CAN frame with length %d\n", cfd.len);
+ /* cfd.flags is undefined */
+ } else {
+ fprintf(stderr, "read: invalid CAN(FD) frame\n");
+ return 1;
+ }
+
+ /* the content can be handled independently from the received MTU size */
+
+ printf("can_id: %X data length: %d data: ", cfd.can_id, cfd.len);
+ for (i = 0; i < cfd.len; i++)
+ printf("%02X ", cfd.data[i]);
+
+ When reading with size CANFD_MTU only returns CAN_MTU bytes that have
+ been received from the socket a legacy CAN frame has been read into the
+ provided CAN FD structure. Note that the canfd_frame.flags data field is
+ not specified in the struct can_frame and therefore it is only valid in
+ CANFD_MTU sized CAN FD frames.
+
+ Implementation hint for new CAN applications:
+
+ To build a CAN FD aware application use struct canfd_frame as basic CAN
+ data structure for CAN_RAW based applications. When the application is
+ executed on an older Linux kernel and switching the CAN_RAW_FD_FRAMES
+ socket option returns an error: No problem. You'll get legacy CAN frames
+ or CAN FD frames and can process them the same way.
+
+ When sending to CAN devices make sure that the device is capable to handle
+ CAN FD frames by checking if the device maximum transfer unit is CANFD_MTU.
+ The CAN device MTU can be retrieved e.g. with a SIOCGIFMTU ioctl() syscall.
+
+ 4.1.6 RAW socket option CAN_RAW_JOIN_FILTERS
+
+ The CAN_RAW socket can set multiple CAN identifier specific filters that
+ lead to multiple filters in the af_can.c filter processing. These filters
+ are indenpendent from each other which leads to logical OR'ed filters when
+ applied (see 4.1.1).
+
+ This socket option joines the given CAN filters in the way that only CAN
+ frames are passed to user space that matched *all* given CAN filters. The
+ semantic for the applied filters is therefore changed to a logical AND.
+
+ This is useful especially when the filterset is a combination of filters
+ where the CAN_INV_FILTER flag is set in order to notch single CAN IDs or
+ CAN ID ranges from the incoming traffic.
+
+ 4.1.7 RAW socket returned message flags
+
+ When using recvmsg() call, the msg->msg_flags may contain following flags:
+
+ MSG_DONTROUTE: set when the received frame was created on the local host.
+
+ MSG_CONFIRM: set when the frame was sent via the socket it is received on.
+ This flag can be interpreted as a 'transmission confirmation' when the
+ CAN driver supports the echo of frames on driver level, see 3.2 and 6.2.
+ In order to receive such messages, CAN_RAW_RECV_OWN_MSGS must be set.
+
+ 4.2 Broadcast Manager protocol sockets (SOCK_DGRAM)
+
+ The Broadcast Manager protocol provides a command based configuration
+ interface to filter and send (e.g. cyclic) CAN messages in kernel space.
+
+ Receive filters can be used to down sample frequent messages; detect events
+ such as message contents changes, packet length changes, and do time-out
+ monitoring of received messages.
+
+ Periodic transmission tasks of CAN frames or a sequence of CAN frames can be
+ created and modified at runtime; both the message content and the two
+ possible transmit intervals can be altered.
+
+ A BCM socket is not intended for sending individual CAN frames using the
+ struct can_frame as known from the CAN_RAW socket. Instead a special BCM
+ configuration message is defined. The basic BCM configuration message used
+ to communicate with the broadcast manager and the available operations are
+ defined in the linux/can/bcm.h include. The BCM message consists of a
+ message header with a command ('opcode') followed by zero or more CAN frames.
+ The broadcast manager sends responses to user space in the same form:
+
+ struct bcm_msg_head {
+ __u32 opcode; /* command */
+ __u32 flags; /* special flags */
+ __u32 count; /* run 'count' times with ival1 */
+ struct timeval ival1, ival2; /* count and subsequent interval */
+ canid_t can_id; /* unique can_id for task */
+ __u32 nframes; /* number of can_frames following */
+ struct can_frame frames[0];
+ };
+
+ The aligned payload 'frames' uses the same basic CAN frame structure defined
+ at the beginning of section 4 and in the include/linux/can.h include. All
+ messages to the broadcast manager from user space have this structure.
+
+ Note a CAN_BCM socket must be connected instead of bound after socket
+ creation (example without error checking):
+
+ int s;
+ struct sockaddr_can addr;
+ struct ifreq ifr;
+
+ s = socket(PF_CAN, SOCK_DGRAM, CAN_BCM);
+
+ strcpy(ifr.ifr_name, "can0");
+ ioctl(s, SIOCGIFINDEX, &ifr);
+
+ addr.can_family = AF_CAN;
+ addr.can_ifindex = ifr.ifr_ifindex;
+
+ connect(s, (struct sockaddr *)&addr, sizeof(addr))
+
+ (..)
+
+ The broadcast manager socket is able to handle any number of in flight
+ transmissions or receive filters concurrently. The different RX/TX jobs are
+ distinguished by the unique can_id in each BCM message. However additional
+ CAN_BCM sockets are recommended to communicate on multiple CAN interfaces.
+ When the broadcast manager socket is bound to 'any' CAN interface (=> the
+ interface index is set to zero) the configured receive filters apply to any
+ CAN interface unless the sendto() syscall is used to overrule the 'any' CAN
+ interface index. When using recvfrom() instead of read() to retrieve BCM
+ socket messages the originating CAN interface is provided in can_ifindex.
+
+ 4.2.1 Broadcast Manager operations
+
+ The opcode defines the operation for the broadcast manager to carry out,
+ or details the broadcast managers response to several events, including
+ user requests.
+
+ Transmit Operations (user space to broadcast manager):
+
+ TX_SETUP: Create (cyclic) transmission task.
+
+ TX_DELETE: Remove (cyclic) transmission task, requires only can_id.
+
+ TX_READ: Read properties of (cyclic) transmission task for can_id.
+
+ TX_SEND: Send one CAN frame.
+
+ Transmit Responses (broadcast manager to user space):
+
+ TX_STATUS: Reply to TX_READ request (transmission task configuration).
+
+ TX_EXPIRED: Notification when counter finishes sending at initial interval
+ 'ival1'. Requires the TX_COUNTEVT flag to be set at TX_SETUP.
+
+ Receive Operations (user space to broadcast manager):
+
+ RX_SETUP: Create RX content filter subscription.
+
+ RX_DELETE: Remove RX content filter subscription, requires only can_id.
+
+ RX_READ: Read properties of RX content filter subscription for can_id.
+
+ Receive Responses (broadcast manager to user space):
+
+ RX_STATUS: Reply to RX_READ request (filter task configuration).
+
+ RX_TIMEOUT: Cyclic message is detected to be absent (timer ival1 expired).
+
+ RX_CHANGED: BCM message with updated CAN frame (detected content change).
+ Sent on first message received or on receipt of revised CAN messages.
+
+ 4.2.2 Broadcast Manager message flags
+
+ When sending a message to the broadcast manager the 'flags' element may
+ contain the following flag definitions which influence the behaviour:
+
+ SETTIMER: Set the values of ival1, ival2 and count
+
+ STARTTIMER: Start the timer with the actual values of ival1, ival2
+ and count. Starting the timer leads simultaneously to emit a CAN frame.
+
+ TX_COUNTEVT: Create the message TX_EXPIRED when count expires
+
+ TX_ANNOUNCE: A change of data by the process is emitted immediately.
+
+ TX_CP_CAN_ID: Copies the can_id from the message header to each
+ subsequent frame in frames. This is intended as usage simplification. For
+ TX tasks the unique can_id from the message header may differ from the
+ can_id(s) stored for transmission in the subsequent struct can_frame(s).
+
+ RX_FILTER_ID: Filter by can_id alone, no frames required (nframes=0).
+
+ RX_CHECK_DLC: A change of the DLC leads to an RX_CHANGED.
+
+ RX_NO_AUTOTIMER: Prevent automatically starting the timeout monitor.
+
+ RX_ANNOUNCE_RESUME: If passed at RX_SETUP and a receive timeout occurred, a
+ RX_CHANGED message will be generated when the (cyclic) receive restarts.
+
+ TX_RESET_MULTI_IDX: Reset the index for the multiple frame transmission.
+
+ RX_RTR_FRAME: Send reply for RTR-request (placed in op->frames[0]).
+
+ 4.2.3 Broadcast Manager transmission timers
+
+ Periodic transmission configurations may use up to two interval timers.
+ In this case the BCM sends a number of messages ('count') at an interval
+ 'ival1', then continuing to send at another given interval 'ival2'. When
+ only one timer is needed 'count' is set to zero and only 'ival2' is used.
+ When SET_TIMER and START_TIMER flag were set the timers are activated.
+ The timer values can be altered at runtime when only SET_TIMER is set.
+
+ 4.2.4 Broadcast Manager message sequence transmission
+
+ Up to 256 CAN frames can be transmitted in a sequence in the case of a cyclic
+ TX task configuration. The number of CAN frames is provided in the 'nframes'
+ element of the BCM message head. The defined number of CAN frames are added
+ as array to the TX_SETUP BCM configuration message.
+
+ /* create a struct to set up a sequence of four CAN frames */
+ struct {
+ struct bcm_msg_head msg_head;
+ struct can_frame frame[4];
+ } mytxmsg;
+
+ (..)
+ mytxmsg.nframes = 4;
+ (..)
+
+ write(s, &mytxmsg, sizeof(mytxmsg));
+
+ With every transmission the index in the array of CAN frames is increased
+ and set to zero at index overflow.
+
+ 4.2.5 Broadcast Manager receive filter timers
+
+ The timer values ival1 or ival2 may be set to non-zero values at RX_SETUP.
+ When the SET_TIMER flag is set the timers are enabled:
+
+ ival1: Send RX_TIMEOUT when a received message is not received again within
+ the given time. When START_TIMER is set at RX_SETUP the timeout detection
+ is activated directly - even without a former CAN frame reception.
+
+ ival2: Throttle the received message rate down to the value of ival2. This
+ is useful to reduce messages for the application when the signal inside the
+ CAN frame is stateless as state changes within the ival2 periode may get
+ lost.
+
+ 4.2.6 Broadcast Manager multiplex message receive filter
+
+ To filter for content changes in multiplex message sequences an array of more
+ than one CAN frames can be passed in a RX_SETUP configuration message. The
+ data bytes of the first CAN frame contain the mask of relevant bits that
+ have to match in the subsequent CAN frames with the received CAN frame.
+ If one of the subsequent CAN frames is matching the bits in that frame data
+ mark the relevant content to be compared with the previous received content.
+ Up to 257 CAN frames (multiplex filter bit mask CAN frame plus 256 CAN
+ filters) can be added as array to the TX_SETUP BCM configuration message.
+
+ /* usually used to clear CAN frame data[] - beware of endian problems! */
+ #define U64_DATA(p) (*(unsigned long long*)(p)->data)
+
+ struct {
+ struct bcm_msg_head msg_head;
+ struct can_frame frame[5];
+ } msg;
+
+ msg.msg_head.opcode = RX_SETUP;
+ msg.msg_head.can_id = 0x42;
+ msg.msg_head.flags = 0;
+ msg.msg_head.nframes = 5;
+ U64_DATA(&msg.frame[0]) = 0xFF00000000000000ULL; /* MUX mask */
+ U64_DATA(&msg.frame[1]) = 0x01000000000000FFULL; /* data mask (MUX 0x01) */
+ U64_DATA(&msg.frame[2]) = 0x0200FFFF000000FFULL; /* data mask (MUX 0x02) */
+ U64_DATA(&msg.frame[3]) = 0x330000FFFFFF0003ULL; /* data mask (MUX 0x33) */
+ U64_DATA(&msg.frame[4]) = 0x4F07FC0FF0000000ULL; /* data mask (MUX 0x4F) */
+
+ write(s, &msg, sizeof(msg));
+
+ 4.3 connected transport protocols (SOCK_SEQPACKET)
+ 4.4 unconnected transport protocols (SOCK_DGRAM)
+
+
+5. SocketCAN core module
+-------------------------
+
+ The SocketCAN core module implements the protocol family
+ PF_CAN. CAN protocol modules are loaded by the core module at
+ runtime. The core module provides an interface for CAN protocol
+ modules to subscribe needed CAN IDs (see chapter 3.1).
+
+ 5.1 can.ko module params
+
+ - stats_timer: To calculate the SocketCAN core statistics
+ (e.g. current/maximum frames per second) this 1 second timer is
+ invoked at can.ko module start time by default. This timer can be
+ disabled by using stattimer=0 on the module commandline.
+
+ - debug: (removed since SocketCAN SVN r546)
+
+ 5.2 procfs content
+
+ As described in chapter 3.1 the SocketCAN core uses several filter
+ lists to deliver received CAN frames to CAN protocol modules. These
+ receive lists, their filters and the count of filter matches can be
+ checked in the appropriate receive list. All entries contain the
+ device and a protocol module identifier:
+
+ foo@bar:~$ cat /proc/net/can/rcvlist_all
+
+ receive list 'rx_all':
+ (vcan3: no entry)
+ (vcan2: no entry)
+ (vcan1: no entry)
+ device can_id can_mask function userdata matches ident
+ vcan0 000 00000000 f88e6370 f6c6f400 0 raw
+ (any: no entry)
+
+ In this example an application requests any CAN traffic from vcan0.
+
+ rcvlist_all - list for unfiltered entries (no filter operations)
+ rcvlist_eff - list for single extended frame (EFF) entries
+ rcvlist_err - list for error message frames masks
+ rcvlist_fil - list for mask/value filters
+ rcvlist_inv - list for mask/value filters (inverse semantic)
+ rcvlist_sff - list for single standard frame (SFF) entries
+
+ Additional procfs files in /proc/net/can
+
+ stats - SocketCAN core statistics (rx/tx frames, match ratios, ...)
+ reset_stats - manual statistic reset
+ version - prints the SocketCAN core version and the ABI version
+
+ 5.3 writing own CAN protocol modules
+
+ To implement a new protocol in the protocol family PF_CAN a new
+ protocol has to be defined in include/linux/can.h .
+ The prototypes and definitions to use the SocketCAN core can be
+ accessed by including include/linux/can/core.h .
+ In addition to functions that register the CAN protocol and the
+ CAN device notifier chain there are functions to subscribe CAN
+ frames received by CAN interfaces and to send CAN frames:
+
+ can_rx_register - subscribe CAN frames from a specific interface
+ can_rx_unregister - unsubscribe CAN frames from a specific interface
+ can_send - transmit a CAN frame (optional with local loopback)
+
+ For details see the kerneldoc documentation in net/can/af_can.c or
+ the source code of net/can/raw.c or net/can/bcm.c .
+
+6. CAN network drivers
+----------------------
+
+ Writing a CAN network device driver is much easier than writing a
+ CAN character device driver. Similar to other known network device
+ drivers you mainly have to deal with:
+
+ - TX: Put the CAN frame from the socket buffer to the CAN controller.
+ - RX: Put the CAN frame from the CAN controller to the socket buffer.
+
+ See e.g. at Documentation/networking/netdevices.txt . The differences
+ for writing CAN network device driver are described below:
+
+ 6.1 general settings
+
+ dev->type = ARPHRD_CAN; /* the netdevice hardware type */
+ dev->flags = IFF_NOARP; /* CAN has no arp */
+
+ dev->mtu = CAN_MTU; /* sizeof(struct can_frame) -> legacy CAN interface */
+
+ or alternative, when the controller supports CAN with flexible data rate:
+ dev->mtu = CANFD_MTU; /* sizeof(struct canfd_frame) -> CAN FD interface */
+
+ The struct can_frame or struct canfd_frame is the payload of each socket
+ buffer (skbuff) in the protocol family PF_CAN.
+
+ 6.2 local loopback of sent frames
+
+ As described in chapter 3.2 the CAN network device driver should
+ support a local loopback functionality similar to the local echo
+ e.g. of tty devices. In this case the driver flag IFF_ECHO has to be
+ set to prevent the PF_CAN core from locally echoing sent frames
+ (aka loopback) as fallback solution:
+
+ dev->flags = (IFF_NOARP | IFF_ECHO);
+
+ 6.3 CAN controller hardware filters
+
+ To reduce the interrupt load on deep embedded systems some CAN
+ controllers support the filtering of CAN IDs or ranges of CAN IDs.
+ These hardware filter capabilities vary from controller to
+ controller and have to be identified as not feasible in a multi-user
+ networking approach. The use of the very controller specific
+ hardware filters could make sense in a very dedicated use-case, as a
+ filter on driver level would affect all users in the multi-user
+ system. The high efficient filter sets inside the PF_CAN core allow
+ to set different multiple filters for each socket separately.
+ Therefore the use of hardware filters goes to the category 'handmade
+ tuning on deep embedded systems'. The author is running a MPC603e
+ @133MHz with four SJA1000 CAN controllers from 2002 under heavy bus
+ load without any problems ...
+
+ 6.4 The virtual CAN driver (vcan)
+
+ Similar to the network loopback devices, vcan offers a virtual local
+ CAN interface. A full qualified address on CAN consists of
+
+ - a unique CAN Identifier (CAN ID)
+ - the CAN bus this CAN ID is transmitted on (e.g. can0)
+
+ so in common use cases more than one virtual CAN interface is needed.
+
+ The virtual CAN interfaces allow the transmission and reception of CAN
+ frames without real CAN controller hardware. Virtual CAN network
+ devices are usually named 'vcanX', like vcan0 vcan1 vcan2 ...
+ When compiled as a module the virtual CAN driver module is called vcan.ko
+
+ Since Linux Kernel version 2.6.24 the vcan driver supports the Kernel
+ netlink interface to create vcan network devices. The creation and
+ removal of vcan network devices can be managed with the ip(8) tool:
+
+ - Create a virtual CAN network interface:
+ $ ip link add type vcan
+
+ - Create a virtual CAN network interface with a specific name 'vcan42':
+ $ ip link add dev vcan42 type vcan
+
+ - Remove a (virtual CAN) network interface 'vcan42':
+ $ ip link del vcan42
+
+ 6.5 The CAN network device driver interface
+
+ The CAN network device driver interface provides a generic interface
+ to setup, configure and monitor CAN network devices. The user can then
+ configure the CAN device, like setting the bit-timing parameters, via
+ the netlink interface using the program "ip" from the "IPROUTE2"
+ utility suite. The following chapter describes briefly how to use it.
+ Furthermore, the interface uses a common data structure and exports a
+ set of common functions, which all real CAN network device drivers
+ should use. Please have a look to the SJA1000 or MSCAN driver to
+ understand how to use them. The name of the module is can-dev.ko.
+
+ 6.5.1 Netlink interface to set/get devices properties
+
+ The CAN device must be configured via netlink interface. The supported
+ netlink message types are defined and briefly described in
+ "include/linux/can/netlink.h". CAN link support for the program "ip"
+ of the IPROUTE2 utility suite is available and it can be used as shown
+ below:
+
+ - Setting CAN device properties:
+
+ $ ip link set can0 type can help
+ Usage: ip link set DEVICE type can
+ [ bitrate BITRATE [ sample-point SAMPLE-POINT] ] |
+ [ tq TQ prop-seg PROP_SEG phase-seg1 PHASE-SEG1
+ phase-seg2 PHASE-SEG2 [ sjw SJW ] ]
+
+ [ loopback { on | off } ]
+ [ listen-only { on | off } ]
+ [ triple-sampling { on | off } ]
+
+ [ restart-ms TIME-MS ]
+ [ restart ]
+
+ Where: BITRATE := { 1..1000000 }
+ SAMPLE-POINT := { 0.000..0.999 }
+ TQ := { NUMBER }
+ PROP-SEG := { 1..8 }
+ PHASE-SEG1 := { 1..8 }
+ PHASE-SEG2 := { 1..8 }
+ SJW := { 1..4 }
+ RESTART-MS := { 0 | NUMBER }
+
+ - Display CAN device details and statistics:
+
+ $ ip -details -statistics link show can0
+ 2: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP qlen 10
+ link/can
+ can <TRIPLE-SAMPLING> state ERROR-ACTIVE restart-ms 100
+ bitrate 125000 sample_point 0.875
+ tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
+ sja1000: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1
+ clock 8000000
+ re-started bus-errors arbit-lost error-warn error-pass bus-off
+ 41 17457 0 41 42 41
+ RX: bytes packets errors dropped overrun mcast
+ 140859 17608 17457 0 0 0
+ TX: bytes packets errors dropped carrier collsns
+ 861 112 0 41 0 0
+
+ More info to the above output:
+
+ "<TRIPLE-SAMPLING>"
+ Shows the list of selected CAN controller modes: LOOPBACK,
+ LISTEN-ONLY, or TRIPLE-SAMPLING.
+
+ "state ERROR-ACTIVE"
+ The current state of the CAN controller: "ERROR-ACTIVE",
+ "ERROR-WARNING", "ERROR-PASSIVE", "BUS-OFF" or "STOPPED"
+
+ "restart-ms 100"
+ Automatic restart delay time. If set to a non-zero value, a
+ restart of the CAN controller will be triggered automatically
+ in case of a bus-off condition after the specified delay time
+ in milliseconds. By default it's off.
+
+ "bitrate 125000 sample-point 0.875"
+ Shows the real bit-rate in bits/sec and the sample-point in the
+ range 0.000..0.999. If the calculation of bit-timing parameters
+ is enabled in the kernel (CONFIG_CAN_CALC_BITTIMING=y), the
+ bit-timing can be defined by setting the "bitrate" argument.
+ Optionally the "sample-point" can be specified. By default it's
+ 0.000 assuming CIA-recommended sample-points.
+
+ "tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1"
+ Shows the time quanta in ns, propagation segment, phase buffer
+ segment 1 and 2 and the synchronisation jump width in units of
+ tq. They allow to define the CAN bit-timing in a hardware
+ independent format as proposed by the Bosch CAN 2.0 spec (see
+ chapter 8 of http://www.semiconductors.bosch.de/pdf/can2spec.pdf).
+
+ "sja1000: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1
+ clock 8000000"
+ Shows the bit-timing constants of the CAN controller, here the
+ "sja1000". The minimum and maximum values of the time segment 1
+ and 2, the synchronisation jump width in units of tq, the
+ bitrate pre-scaler and the CAN system clock frequency in Hz.
+ These constants could be used for user-defined (non-standard)
+ bit-timing calculation algorithms in user-space.
+
+ "re-started bus-errors arbit-lost error-warn error-pass bus-off"
+ Shows the number of restarts, bus and arbitration lost errors,
+ and the state changes to the error-warning, error-passive and
+ bus-off state. RX overrun errors are listed in the "overrun"
+ field of the standard network statistics.
+
+ 6.5.2 Setting the CAN bit-timing
+
+ The CAN bit-timing parameters can always be defined in a hardware
+ independent format as proposed in the Bosch CAN 2.0 specification
+ specifying the arguments "tq", "prop_seg", "phase_seg1", "phase_seg2"
+ and "sjw":
+
+ $ ip link set canX type can tq 125 prop-seg 6 \
+ phase-seg1 7 phase-seg2 2 sjw 1
+
+ If the kernel option CONFIG_CAN_CALC_BITTIMING is enabled, CIA
+ recommended CAN bit-timing parameters will be calculated if the bit-
+ rate is specified with the argument "bitrate":
+
+ $ ip link set canX type can bitrate 125000
+
+ Note that this works fine for the most common CAN controllers with
+ standard bit-rates but may *fail* for exotic bit-rates or CAN system
+ clock frequencies. Disabling CONFIG_CAN_CALC_BITTIMING saves some
+ space and allows user-space tools to solely determine and set the
+ bit-timing parameters. The CAN controller specific bit-timing
+ constants can be used for that purpose. They are listed by the
+ following command:
+
+ $ ip -details link show can0
+ ...
+ sja1000: clock 8000000 tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1
+
+ 6.5.3 Starting and stopping the CAN network device
+
+ A CAN network device is started or stopped as usual with the command
+ "ifconfig canX up/down" or "ip link set canX up/down". Be aware that
+ you *must* define proper bit-timing parameters for real CAN devices
+ before you can start it to avoid error-prone default settings:
+
+ $ ip link set canX up type can bitrate 125000
+
+ A device may enter the "bus-off" state if too many errors occurred on
+ the CAN bus. Then no more messages are received or sent. An automatic
+ bus-off recovery can be enabled by setting the "restart-ms" to a
+ non-zero value, e.g.:
+
+ $ ip link set canX type can restart-ms 100
+
+ Alternatively, the application may realize the "bus-off" condition
+ by monitoring CAN error message frames and do a restart when
+ appropriate with the command:
+
+ $ ip link set canX type can restart
+
+ Note that a restart will also create a CAN error message frame (see
+ also chapter 3.4).
+
+ 6.6 CAN FD (flexible data rate) driver support
+
+ CAN FD capable CAN controllers support two different bitrates for the
+ arbitration phase and the payload phase of the CAN FD frame. Therefore a
+ second bit timing has to be specified in order to enable the CAN FD bitrate.
+
+ Additionally CAN FD capable CAN controllers support up to 64 bytes of
+ payload. The representation of this length in can_frame.can_dlc and
+ canfd_frame.len for userspace applications and inside the Linux network
+ layer is a plain value from 0 .. 64 instead of the CAN 'data length code'.
+ The data length code was a 1:1 mapping to the payload length in the legacy
+ CAN frames anyway. The payload length to the bus-relevant DLC mapping is
+ only performed inside the CAN drivers, preferably with the helper
+ functions can_dlc2len() and can_len2dlc().
+
+ The CAN netdevice driver capabilities can be distinguished by the network
+ devices maximum transfer unit (MTU):
+
+ MTU = 16 (CAN_MTU) => sizeof(struct can_frame) => 'legacy' CAN device
+ MTU = 72 (CANFD_MTU) => sizeof(struct canfd_frame) => CAN FD capable device
+
+ The CAN device MTU can be retrieved e.g. with a SIOCGIFMTU ioctl() syscall.
+ N.B. CAN FD capable devices can also handle and send legacy CAN frames.
+
+ FIXME: Add details about the CAN FD controller configuration when available.
+
+ 6.7 Supported CAN hardware
+
+ Please check the "Kconfig" file in "drivers/net/can" to get an actual
+ list of the support CAN hardware. On the SocketCAN project website
+ (see chapter 7) there might be further drivers available, also for
+ older kernel versions.
+
+7. SocketCAN resources
+-----------------------
+
+ The Linux CAN / SocketCAN project ressources (project site / mailing list)
+ are referenced in the MAINTAINERS file in the Linux source tree.
+ Search for CAN NETWORK [LAYERS|DRIVERS].
+
+8. Credits
+----------
+
+ Oliver Hartkopp (PF_CAN core, filters, drivers, bcm, SJA1000 driver)
+ Urs Thuermann (PF_CAN core, kernel integration, socket interfaces, raw, vcan)
+ Jan Kizka (RT-SocketCAN core, Socket-API reconciliation)
+ Wolfgang Grandegger (RT-SocketCAN core & drivers, Raw Socket-API reviews,
+ CAN device driver interface, MSCAN driver)
+ Robert Schwebel (design reviews, PTXdist integration)
+ Marc Kleine-Budde (design reviews, Kernel 2.6 cleanups, drivers)
+ Benedikt Spranger (reviews)
+ Thomas Gleixner (LKML reviews, coding style, posting hints)
+ Andrey Volkov (kernel subtree structure, ioctls, MSCAN driver)
+ Matthias Brukner (first SJA1000 CAN netdevice implementation Q2/2003)
+ Klaus Hitschler (PEAK driver integration)
+ Uwe Koppe (CAN netdevices with PF_PACKET approach)
+ Michael Schulze (driver layer loopback requirement, RT CAN drivers review)
+ Pavel Pisa (Bit-timing calculation)
+ Sascha Hauer (SJA1000 platform driver)
+ Sebastian Haas (SJA1000 EMS PCI driver)
+ Markus Plessing (SJA1000 EMS PCI driver)
+ Per Dalen (SJA1000 Kvaser PCI driver)
+ Sam Ravnborg (reviews, coding style, kbuild help)
diff --git a/Documentation/networking/cdc_mbim.txt b/Documentation/networking/cdc_mbim.txt
new file mode 100644
index 000000000..a15ea602a
--- /dev/null
+++ b/Documentation/networking/cdc_mbim.txt
@@ -0,0 +1,339 @@
+ cdc_mbim - Driver for CDC MBIM Mobile Broadband modems
+ ========================================================
+
+The cdc_mbim driver supports USB devices conforming to the "Universal
+Serial Bus Communications Class Subclass Specification for Mobile
+Broadband Interface Model" [1], which is a further development of
+"Universal Serial Bus Communications Class Subclass Specifications for
+Network Control Model Devices" [2] optimized for Mobile Broadband
+devices, aka "3G/LTE modems".
+
+
+Command Line Parameters
+=======================
+
+The cdc_mbim driver has no parameters of its own. But the probing
+behaviour for NCM 1.0 backwards compatible MBIM functions (an
+"NCM/MBIM function" as defined in section 3.2 of [1]) is affected
+by a cdc_ncm driver parameter:
+
+prefer_mbim
+-----------
+Type: Boolean
+Valid Range: N/Y (0-1)
+Default Value: Y (MBIM is preferred)
+
+This parameter sets the system policy for NCM/MBIM functions. Such
+functions will be handled by either the cdc_ncm driver or the cdc_mbim
+driver depending on the prefer_mbim setting. Setting prefer_mbim=N
+makes the cdc_mbim driver ignore these functions and lets the cdc_ncm
+driver handle them instead.
+
+The parameter is writable, and can be changed at any time. A manual
+unbind/bind is required to make the change effective for NCM/MBIM
+functions bound to the "wrong" driver
+
+
+Basic usage
+===========
+
+MBIM functions are inactive when unmanaged. The cdc_mbim driver only
+provides an userspace interface to the MBIM control channel, and will
+not participate in the management of the function. This implies that a
+userspace MBIM management application always is required to enable a
+MBIM function.
+
+Such userspace applications includes, but are not limited to:
+ - mbimcli (included with the libmbim [3] library), and
+ - ModemManager [4]
+
+Establishing a MBIM IP session reequires at least these actions by the
+management application:
+ - open the control channel
+ - configure network connection settings
+ - connect to network
+ - configure IP interface
+
+Management application development
+----------------------------------
+The driver <-> userspace interfaces are described below. The MBIM
+control channel protocol is described in [1].
+
+
+MBIM control channel userspace ABI
+==================================
+
+/dev/cdc-wdmX character device
+------------------------------
+The driver creates a two-way pipe to the MBIM function control channel
+using the cdc-wdm driver as a subdriver. The userspace end of the
+control channel pipe is a /dev/cdc-wdmX character device.
+
+The cdc_mbim driver does not process or police messages on the control
+channel. The channel is fully delegated to the userspace management
+application. It is therefore up to this application to ensure that it
+complies with all the control channel requirements in [1].
+
+The cdc-wdmX device is created as a child of the MBIM control
+interface USB device. The character device associated with a specific
+MBIM function can be looked up using sysfs. For example:
+
+ bjorn@nemi:~$ ls /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc
+ cdc-wdm0
+
+ bjorn@nemi:~$ grep . /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc/cdc-wdm0/dev
+ 180:0
+
+
+USB configuration descriptors
+-----------------------------
+The wMaxControlMessage field of the CDC MBIM functional descriptor
+limits the maximum control message size. The managament application is
+responsible for negotiating a control message size complying with the
+requirements in section 9.3.1 of [1], taking this descriptor field
+into consideration.
+
+The userspace application can access the CDC MBIM functional
+descriptor of a MBIM function using either of the two USB
+configuration descriptor kernel interfaces described in [6] or [7].
+
+See also the ioctl documentation below.
+
+
+Fragmentation
+-------------
+The userspace application is responsible for all control message
+fragmentation and defragmentaion, as described in section 9.5 of [1].
+
+
+/dev/cdc-wdmX write()
+---------------------
+The MBIM control messages from the management application *must not*
+exceed the negotiated control message size.
+
+
+/dev/cdc-wdmX read()
+--------------------
+The management application *must* accept control messages of up the
+negotiated control message size.
+
+
+/dev/cdc-wdmX ioctl()
+--------------------
+IOCTL_WDM_MAX_COMMAND: Get Maximum Command Size
+This ioctl returns the wMaxControlMessage field of the CDC MBIM
+functional descriptor for MBIM devices. This is intended as a
+convenience, eliminating the need to parse the USB descriptors from
+userspace.
+
+ #include <stdio.h>
+ #include <fcntl.h>
+ #include <sys/ioctl.h>
+ #include <linux/types.h>
+ #include <linux/usb/cdc-wdm.h>
+ int main()
+ {
+ __u16 max;
+ int fd = open("/dev/cdc-wdm0", O_RDWR);
+ if (!ioctl(fd, IOCTL_WDM_MAX_COMMAND, &max))
+ printf("wMaxControlMessage is %d\n", max);
+ }
+
+
+Custom device services
+----------------------
+The MBIM specification allows vendors to freely define additional
+services. This is fully supported by the cdc_mbim driver.
+
+Support for new MBIM services, including vendor specified services, is
+implemented entirely in userspace, like the rest of the MBIM control
+protocol
+
+New services should be registered in the MBIM Registry [5].
+
+
+
+MBIM data channel userspace ABI
+===============================
+
+wwanY network device
+--------------------
+The cdc_mbim driver represents the MBIM data channel as a single
+network device of the "wwan" type. This network device is initially
+mapped to MBIM IP session 0.
+
+
+Multiplexed IP sessions (IPS)
+-----------------------------
+MBIM allows multiplexing up to 256 IP sessions over a single USB data
+channel. The cdc_mbim driver models such IP sessions as 802.1q VLAN
+subdevices of the master wwanY device, mapping MBIM IP session Z to
+VLAN ID Z for all values of Z greater than 0.
+
+The device maximum Z is given in the MBIM_DEVICE_CAPS_INFO structure
+described in section 10.5.1 of [1].
+
+The userspace management application is responsible for adding new
+VLAN links prior to establishing MBIM IP sessions where the SessionId
+is greater than 0. These links can be added by using the normal VLAN
+kernel interfaces, either ioctl or netlink.
+
+For example, adding a link for a MBIM IP session with SessionId 3:
+
+ ip link add link wwan0 name wwan0.3 type vlan id 3
+
+The driver will automatically map the "wwan0.3" network device to MBIM
+IP session 3.
+
+
+Device Service Streams (DSS)
+----------------------------
+MBIM also allows up to 256 non-IP data streams to be multiplexed over
+the same shared USB data channel. The cdc_mbim driver models these
+sessions as another set of 802.1q VLAN subdevices of the master wwanY
+device, mapping MBIM DSS session A to VLAN ID (256 + A) for all values
+of A.
+
+The device maximum A is given in the MBIM_DEVICE_SERVICES_INFO
+structure described in section 10.5.29 of [1].
+
+The DSS VLAN subdevices are used as a practical interface between the
+shared MBIM data channel and a MBIM DSS aware userspace application.
+It is not intended to be presented as-is to an end user. The
+assumption is that an userspace application initiating a DSS session
+also takes care of the necessary framing of the DSS data, presenting
+the stream to the end user in an appropriate way for the stream type.
+
+The network device ABI requires a dummy ethernet header for every DSS
+data frame being transported. The contents of this header is
+arbitrary, with the following exceptions:
+ - TX frames using an IP protocol (0x0800 or 0x86dd) will be dropped
+ - RX frames will have the protocol field set to ETH_P_802_3 (but will
+ not be properly formatted 802.3 frames)
+ - RX frames will have the destination address set to the hardware
+ address of the master device
+
+The DSS supporting userspace management application is responsible for
+adding the dummy ethernet header on TX and stripping it on RX.
+
+This is a simple example using tools commonly available, exporting
+DssSessionId 5 as a pty character device pointed to by a /dev/nmea
+symlink:
+
+ ip link add link wwan0 name wwan0.dss5 type vlan id 261
+ ip link set dev wwan0.dss5 up
+ socat INTERFACE:wwan0.dss5,type=2 PTY:,echo=0,link=/dev/nmea
+
+This is only an example, most suitable for testing out a DSS
+service. Userspace applications supporting specific MBIM DSS services
+are expected to use the tools and programming interfaces required by
+that service.
+
+Note that adding VLAN links for DSS sessions is entirely optional. A
+management application may instead choose to bind a packet socket
+directly to the master network device, using the received VLAN tags to
+map frames to the correct DSS session and adding 18 byte VLAN ethernet
+headers with the appropriate tag on TX. In this case using a socket
+filter is recommended, matching only the DSS VLAN subset. This avoid
+unnecessary copying of unrelated IP session data to userspace. For
+example:
+
+ static struct sock_filter dssfilter[] = {
+ /* use special negative offsets to get VLAN tag */
+ BPF_STMT(BPF_LD|BPF_B|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG_PRESENT),
+ BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, 1, 0, 6), /* true */
+
+ /* verify DSS VLAN range */
+ BPF_STMT(BPF_LD|BPF_H|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG),
+ BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 256, 0, 4), /* 256 is first DSS VLAN */
+ BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 512, 3, 0), /* 511 is last DSS VLAN */
+
+ /* verify ethertype */
+ BPF_STMT(BPF_LD|BPF_H|BPF_ABS, 2 * ETH_ALEN),
+ BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, ETH_P_802_3, 0, 1),
+
+ BPF_STMT(BPF_RET|BPF_K, (u_int)-1), /* accept */
+ BPF_STMT(BPF_RET|BPF_K, 0), /* ignore */
+ };
+
+
+
+Tagged IP session 0 VLAN
+------------------------
+As described above, MBIM IP session 0 is treated as special by the
+driver. It is initially mapped to untagged frames on the wwanY
+network device.
+
+This mapping implies a few restrictions on multiplexed IPS and DSS
+sessions, which may not always be practical:
+ - no IPS or DSS session can use a frame size greater than the MTU on
+ IP session 0
+ - no IPS or DSS session can be in the up state unless the network
+ device representing IP session 0 also is up
+
+These problems can be avoided by optionally making the driver map IP
+session 0 to a VLAN subdevice, similar to all other IP sessions. This
+behaviour is triggered by adding a VLAN link for the magic VLAN ID
+4094. The driver will then immediately start mapping MBIM IP session
+0 to this VLAN, and will drop untagged frames on the master wwanY
+device.
+
+Tip: It might be less confusing to the end user to name this VLAN
+subdevice after the MBIM SessionID instead of the VLAN ID. For
+example:
+
+ ip link add link wwan0 name wwan0.0 type vlan id 4094
+
+
+VLAN mapping
+------------
+
+Summarizing the cdc_mbim driver mapping described above, we have this
+relationship between VLAN tags on the wwanY network device and MBIM
+sessions on the shared USB data channel:
+
+ VLAN ID MBIM type MBIM SessionID Notes
+ ---------------------------------------------------------
+ untagged IPS 0 a)
+ 1 - 255 IPS 1 - 255 <VLANID>
+ 256 - 511 DSS 0 - 255 <VLANID - 256>
+ 512 - 4093 b)
+ 4094 IPS 0 c)
+
+ a) if no VLAN ID 4094 link exists, else dropped
+ b) unsupported VLAN range, unconditionally dropped
+ c) if a VLAN ID 4094 link exists, else dropped
+
+
+
+
+References
+==========
+
+[1] USB Implementers Forum, Inc. - "Universal Serial Bus
+ Communications Class Subclass Specification for Mobile Broadband
+ Interface Model", Revision 1.0 (Errata 1), May 1, 2013
+ - http://www.usb.org/developers/docs/devclass_docs/
+
+[2] USB Implementers Forum, Inc. - "Universal Serial Bus
+ Communications Class Subclass Specifications for Network Control
+ Model Devices", Revision 1.0 (Errata 1), November 24, 2010
+ - http://www.usb.org/developers/docs/devclass_docs/
+
+[3] libmbim - "a glib-based library for talking to WWAN modems and
+ devices which speak the Mobile Interface Broadband Model (MBIM)
+ protocol"
+ - http://www.freedesktop.org/wiki/Software/libmbim/
+
+[4] ModemManager - "a DBus-activated daemon which controls mobile
+ broadband (2G/3G/4G) devices and connections"
+ - http://www.freedesktop.org/wiki/Software/ModemManager/
+
+[5] "MBIM (Mobile Broadband Interface Model) Registry"
+ - http://compliance.usb.org/mbim/
+
+[6] "/proc/bus/usb filesystem output"
+ - Documentation/usb/proc_usb_info.txt
+
+[7] "/sys/bus/usb/devices/.../descriptors"
+ - Documentation/ABI/stable/sysfs-bus-usb
diff --git a/Documentation/networking/cops.txt b/Documentation/networking/cops.txt
new file mode 100644
index 000000000..3e344b448
--- /dev/null
+++ b/Documentation/networking/cops.txt
@@ -0,0 +1,63 @@
+Text File for the COPS LocalTalk Linux driver (cops.c).
+ By Jay Schulist <jschlst@samba.org>
+
+This driver has two modes and they are: Dayna mode and Tangent mode.
+Each mode corresponds with the type of card. It has been found
+that there are 2 main types of cards and all other cards are
+the same and just have different names or only have minor differences
+such as more IO ports. As this driver is tested it will
+become more clear exactly what cards are supported.
+
+Right now these cards are known to work with the COPS driver. The
+LT-200 cards work in a somewhat more limited capacity than the
+DL200 cards, which work very well and are in use by many people.
+
+TANGENT driver mode:
+ Tangent ATB-II, Novell NL-1000, Daystar Digital LT-200
+DAYNA driver mode:
+ Dayna DL2000/DaynaTalk PC (Half Length), COPS LT-95,
+ Farallon PhoneNET PC III, Farallon PhoneNET PC II
+Other cards possibly supported mode unknown though:
+ Dayna DL2000 (Full length)
+
+The COPS driver defaults to using Dayna mode. To change the driver's
+mode if you built a driver with dual support use board_type=1 or
+board_type=2 for Dayna or Tangent with insmod.
+
+** Operation/loading of the driver.
+Use modprobe like this: /sbin/modprobe cops.o (IO #) (IRQ #)
+If you do not specify any options the driver will try and use the IO = 0x240,
+IRQ = 5. As of right now I would only use IRQ 5 for the card, if autoprobing.
+
+To load multiple COPS driver Localtalk cards you can do one of the following.
+
+insmod cops io=0x240 irq=5
+insmod -o cops2 cops io=0x260 irq=3
+
+Or in lilo.conf put something like this:
+ append="ether=5,0x240,lt0 ether=3,0x260,lt1"
+
+Then bring up the interface with ifconfig. It will look something like this:
+lt0 Link encap:UNSPEC HWaddr 00-00-00-00-00-00-00-F7-00-00-00-00-00-00-00-00
+ inet addr:192.168.1.2 Bcast:192.168.1.255 Mask:255.255.255.0
+ UP BROADCAST RUNNING NOARP MULTICAST MTU:600 Metric:1
+ RX packets:0 errors:0 dropped:0 overruns:0 frame:0
+ TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 coll:0
+
+** Netatalk Configuration
+You will need to configure atalkd with something like the following to make
+it work with the cops.c driver.
+
+* For single LTalk card use.
+dummy -seed -phase 2 -net 2000 -addr 2000.10 -zone "1033"
+lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033"
+
+* For multiple cards, Ethernet and LocalTalk.
+eth0 -seed -phase 2 -net 3000 -addr 3000.20 -zone "1033"
+lt0 -seed -phase 1 -net 1000 -addr 1000.50 -zone "1033"
+
+* For multiple LocalTalk cards, and an Ethernet card.
+* Order seems to matter here, Ethernet last.
+lt0 -seed -phase 1 -net 1000 -addr 1000.10 -zone "LocalTalk1"
+lt1 -seed -phase 1 -net 2000 -addr 2000.20 -zone "LocalTalk2"
+eth0 -seed -phase 2 -net 3000 -addr 3000.30 -zone "EtherTalk"
diff --git a/Documentation/networking/cs89x0.txt b/Documentation/networking/cs89x0.txt
new file mode 100644
index 000000000..0e190180e
--- /dev/null
+++ b/Documentation/networking/cs89x0.txt
@@ -0,0 +1,624 @@
+
+NOTE
+----
+
+This document was contributed by Cirrus Logic for kernel 2.2.5. This version
+has been updated for 2.3.48 by Andrew Morton.
+
+Cirrus make a copy of this driver available at their website, as
+described below. In general, you should use the driver version which
+comes with your Linux distribution.
+
+
+
+CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS
+Linux Network Interface Driver ver. 2.00 <kernel 2.3.48>
+===============================================================================
+
+
+TABLE OF CONTENTS
+
+1.0 CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS
+ 1.1 Product Overview
+ 1.2 Driver Description
+ 1.2.1 Driver Name
+ 1.2.2 File in the Driver Package
+ 1.3 System Requirements
+ 1.4 Licensing Information
+
+2.0 ADAPTER INSTALLATION and CONFIGURATION
+ 2.1 CS8900-based Adapter Configuration
+ 2.2 CS8920-based Adapter Configuration
+
+3.0 LOADING THE DRIVER AS A MODULE
+
+4.0 COMPILING THE DRIVER
+ 4.1 Compiling the Driver as a Loadable Module
+ 4.2 Compiling the driver to support memory mode
+ 4.3 Compiling the driver to support Rx DMA
+
+5.0 TESTING AND TROUBLESHOOTING
+ 5.1 Known Defects and Limitations
+ 5.2 Testing the Adapter
+ 5.2.1 Diagnostic Self-Test
+ 5.2.2 Diagnostic Network Test
+ 5.3 Using the Adapter's LEDs
+ 5.4 Resolving I/O Conflicts
+
+6.0 TECHNICAL SUPPORT
+ 6.1 Contacting Cirrus Logic's Technical Support
+ 6.2 Information Required Before Contacting Technical Support
+ 6.3 Obtaining the Latest Driver Version
+ 6.4 Current maintainer
+ 6.5 Kernel boot parameters
+
+
+1.0 CIRRUS LOGIC LAN CS8900/CS8920 ETHERNET ADAPTERS
+===============================================================================
+
+
+1.1 PRODUCT OVERVIEW
+
+The CS8900-based ISA Ethernet Adapters from Cirrus Logic follow
+IEEE 802.3 standards and support half or full-duplex operation in ISA bus
+computers on 10 Mbps Ethernet networks. The adapters are designed for operation
+in 16-bit ISA or EISA bus expansion slots and are available in
+10BaseT-only or 3-media configurations (10BaseT, 10Base2, and AUI for 10Base-5
+or fiber networks).
+
+CS8920-based adapters are similar to the CS8900-based adapter with additional
+features for Plug and Play (PnP) support and Wakeup Frame recognition. As
+such, the configuration procedures differ somewhat between the two types of
+adapters. Refer to the "Adapter Configuration" section for details on
+configuring both types of adapters.
+
+
+1.2 DRIVER DESCRIPTION
+
+The CS8900/CS8920 Ethernet Adapter driver for Linux supports the Linux
+v2.3.48 or greater kernel. It can be compiled directly into the kernel
+or loaded at run-time as a device driver module.
+
+1.2.1 Driver Name: cs89x0
+
+1.2.2 Files in the Driver Archive:
+
+The files in the driver at Cirrus' website include:
+
+ readme.txt - this file
+ build - batch file to compile cs89x0.c.
+ cs89x0.c - driver C code
+ cs89x0.h - driver header file
+ cs89x0.o - pre-compiled module (for v2.2.5 kernel)
+ config/Config.in - sample file to include cs89x0 driver in the kernel.
+ config/Makefile - sample file to include cs89x0 driver in the kernel.
+ config/Space.c - sample file to include cs89x0 driver in the kernel.
+
+
+
+1.3 SYSTEM REQUIREMENTS
+
+The following hardware is required:
+
+ * Cirrus Logic LAN (CS8900/20-based) Ethernet ISA Adapter
+
+ * IBM or IBM-compatible PC with:
+ * An 80386 or higher processor
+ * 16 bytes of contiguous IO space available between 210h - 370h
+ * One available IRQ (5,10,11,or 12 for the CS8900, 3-7,9-15 for CS8920).
+
+ * Appropriate cable (and connector for AUI, 10BASE-2) for your network
+ topology.
+
+The following software is required:
+
+* LINUX kernel version 2.3.48 or higher
+
+ * CS8900/20 Setup Utility (DOS-based)
+
+ * LINUX kernel sources for your kernel (if compiling into kernel)
+
+ * GNU Toolkit (gcc and make) v2.6 or above (if compiling into kernel
+ or a module)
+
+
+
+1.4 LICENSING INFORMATION
+
+This program is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free Software
+Foundation, version 1.
+
+This program is distributed in the hope that it will be useful, but WITHOUT
+ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+more details.
+
+For a full copy of the GNU General Public License, write to the Free Software
+Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+
+
+2.0 ADAPTER INSTALLATION and CONFIGURATION
+===============================================================================
+
+Both the CS8900 and CS8920-based adapters can be configured using parameters
+stored in an on-board EEPROM. You must use the DOS-based CS8900/20 Setup
+Utility if you want to change the adapter's configuration in EEPROM.
+
+When loading the driver as a module, you can specify many of the adapter's
+configuration parameters on the command-line to override the EEPROM's settings
+or for interface configuration when an EEPROM is not used. (CS8920-based
+adapters must use an EEPROM.) See Section 3.0 LOADING THE DRIVER AS A MODULE.
+
+Since the CS8900/20 Setup Utility is a DOS-based application, you must install
+and configure the adapter in a DOS-based system using the CS8900/20 Setup
+Utility before installation in the target LINUX system. (Not required if
+installing a CS8900-based adapter and the default configuration is acceptable.)
+
+
+2.1 CS8900-BASED ADAPTER CONFIGURATION
+
+CS8900-based adapters shipped from Cirrus Logic have been configured
+with the following "default" settings:
+
+ Operation Mode: Memory Mode
+ IRQ: 10
+ Base I/O Address: 300
+ Memory Base Address: D0000
+ Optimization: DOS Client
+ Transmission Mode: Half-duplex
+ BootProm: None
+ Media Type: Autodetect (3-media cards) or
+ 10BASE-T (10BASE-T only adapter)
+
+You should only change the default configuration settings if conflicts with
+another adapter exists. To change the adapter's configuration, run the
+CS8900/20 Setup Utility.
+
+
+2.2 CS8920-BASED ADAPTER CONFIGURATION
+
+CS8920-based adapters are shipped from Cirrus Logic configured as Plug
+and Play (PnP) enabled. However, since the cs89x0 driver does NOT
+support PnP, you must install the CS8920 adapter in a DOS-based PC and
+run the CS8900/20 Setup Utility to disable PnP and configure the
+adapter before installation in the target Linux system. Failure to do
+this will leave the adapter inactive and the driver will be unable to
+communicate with the adapter.
+
+
+ ****************************************************************
+ * CS8920-BASED ADAPTERS: *
+ * *
+ * CS8920-BASED ADAPTERS ARE PLUG and PLAY ENABLED BY DEFAULT. *
+ * THE CS89X0 DRIVER DOES NOT SUPPORT PnP. THEREFORE, YOU MUST *
+ * RUN THE CS8900/20 SETUP UTILITY TO DISABLE PnP SUPPORT AND *
+ * TO ACTIVATE THE ADAPTER. *
+ ****************************************************************
+
+
+
+
+3.0 LOADING THE DRIVER AS A MODULE
+===============================================================================
+
+If the driver is compiled as a loadable module, you can load the driver module
+with the 'modprobe' command. Many of the adapter's configuration parameters can
+be specified as command-line arguments to the load command. This facility
+provides a means to override the EEPROM's settings or for interface
+configuration when an EEPROM is not used.
+
+Example:
+
+ insmod cs89x0.o io=0x200 irq=0xA media=aui
+
+This example loads the module and configures the adapter to use an IO port base
+address of 200h, interrupt 10, and use the AUI media connection. The following
+configuration options are available on the command line:
+
+* io=### - specify IO address (200h-360h)
+* irq=## - specify interrupt level
+* use_dma=1 - Enable DMA
+* dma=# - specify dma channel (Driver is compiled to support
+ Rx DMA only)
+* dmasize=# (16 or 64) - DMA size 16K or 64K. Default value is set to 16.
+* media=rj45 - specify media type
+ or media=bnc
+ or media=aui
+ or media=auto
+* duplex=full - specify forced half/full/autonegotiate duplex
+ or duplex=half
+ or duplex=auto
+* debug=# - debug level (only available if the driver was compiled
+ for debugging)
+
+NOTES:
+
+a) If an EEPROM is present, any specified command-line parameter
+ will override the corresponding configuration value stored in
+ EEPROM.
+
+b) The "io" parameter must be specified on the command-line.
+
+c) The driver's hardware probe routine is designed to avoid
+ writing to I/O space until it knows that there is a cs89x0
+ card at the written addresses. This could cause problems
+ with device probing. To avoid this behaviour, add one
+ to the `io=' module parameter. This doesn't actually change
+ the I/O address, but it is a flag to tell the driver
+ to partially initialise the hardware before trying to
+ identify the card. This could be dangerous if you are
+ not sure that there is a cs89x0 card at the provided address.
+
+ For example, to scan for an adapter located at IO base 0x300,
+ specify an IO address of 0x301.
+
+d) The "duplex=auto" parameter is only supported for the CS8920.
+
+e) The minimum command-line configuration required if an EEPROM is
+ not present is:
+
+ io
+ irq
+ media type (no autodetect)
+
+f) The following additional parameters are CS89XX defaults (values
+ used with no EEPROM or command-line argument).
+
+ * DMA Burst = enabled
+ * IOCHRDY Enabled = enabled
+ * UseSA = enabled
+ * CS8900 defaults to half-duplex if not specified on command-line
+ * CS8920 defaults to autoneg if not specified on command-line
+ * Use reset defaults for other config parameters
+ * dma_mode = 0
+
+g) You can use ifconfig to set the adapter's Ethernet address.
+
+h) Many Linux distributions use the 'modprobe' command to load
+ modules. This program uses the '/etc/conf.modules' file to
+ determine configuration information which is passed to a driver
+ module when it is loaded. All the configuration options which are
+ described above may be placed within /etc/conf.modules.
+
+ For example:
+
+ > cat /etc/conf.modules
+ ...
+ alias eth0 cs89x0
+ options cs89x0 io=0x0200 dma=5 use_dma=1
+ ...
+
+ In this example we are telling the module system that the
+ ethernet driver for this machine should use the cs89x0 driver. We
+ are asking 'modprobe' to pass the 'io', 'dma' and 'use_dma'
+ arguments to the driver when it is loaded.
+
+i) Cirrus recommend that the cs89x0 use the ISA DMA channels 5, 6 or
+ 7. You will probably find that other DMA channels will not work.
+
+j) The cs89x0 supports DMA for receiving only. DMA mode is
+ significantly more efficient. Flooding a 400 MHz Celeron machine
+ with large ping packets consumes 82% of its CPU capacity in non-DMA
+ mode. With DMA this is reduced to 45%.
+
+k) If your Linux kernel was compiled with inbuilt plug-and-play
+ support you will be able to find information about the cs89x0 card
+ with the command
+
+ cat /proc/isapnp
+
+l) If during DMA operation you find erratic behavior or network data
+ corruption you should use your PC's BIOS to slow the EISA bus clock.
+
+m) If the cs89x0 driver is compiled directly into the kernel
+ (non-modular) then its I/O address is automatically determined by
+ ISA bus probing. The IRQ number, media options, etc are determined
+ from the card's EEPROM.
+
+n) If the cs89x0 driver is compiled directly into the kernel, DMA
+ mode may be selected by providing the kernel with a boot option
+ 'cs89x0_dma=N' where 'N' is the desired DMA channel number (5, 6 or 7).
+
+ Kernel boot options may be provided on the LILO command line:
+
+ LILO boot: linux cs89x0_dma=5
+
+ or they may be placed in /etc/lilo.conf:
+
+ image=/boot/bzImage-2.3.48
+ append="cs89x0_dma=5"
+ label=linux
+ root=/dev/hda5
+ read-only
+
+ The DMA Rx buffer size is hardwired to 16 kbytes in this mode.
+ (64k mode is not available).
+
+
+4.0 COMPILING THE DRIVER
+===============================================================================
+
+The cs89x0 driver can be compiled directly into the kernel or compiled into
+a loadable device driver module.
+
+
+4.1 COMPILING THE DRIVER AS A LOADABLE MODULE
+
+To compile the driver into a loadable module, use the following command
+(single command line, without quotes):
+
+"gcc -D__KERNEL__ -I/usr/src/linux/include -I/usr/src/linux/net/inet -Wall
+-Wstrict-prototypes -O2 -fomit-frame-pointer -DMODULE -DCONFIG_MODVERSIONS
+-c cs89x0.c"
+
+4.2 COMPILING THE DRIVER TO SUPPORT MEMORY MODE
+
+Support for memory mode was not carried over into the 2.3 series kernels.
+
+4.3 COMPILING THE DRIVER TO SUPPORT Rx DMA
+
+The compile-time optionality for DMA was removed in the 2.3 kernel
+series. DMA support is now unconditionally part of the driver. It is
+enabled by the 'use_dma=1' module option.
+
+
+5.0 TESTING AND TROUBLESHOOTING
+===============================================================================
+
+5.1 KNOWN DEFECTS and LIMITATIONS
+
+Refer to the RELEASE.TXT file distributed as part of this archive for a list of
+known defects, driver limitations, and work arounds.
+
+
+5.2 TESTING THE ADAPTER
+
+Once the adapter has been installed and configured, the diagnostic option of
+the CS8900/20 Setup Utility can be used to test the functionality of the
+adapter and its network connection. Use the diagnostics 'Self Test' option to
+test the functionality of the adapter with the hardware configuration you have
+assigned. You can use the diagnostics 'Network Test' to test the ability of the
+adapter to communicate across the Ethernet with another PC equipped with a
+CS8900/20-based adapter card (it must also be running the CS8900/20 Setup
+Utility).
+
+ NOTE: The Setup Utility's diagnostics are designed to run in a
+ DOS-only operating system environment. DO NOT run the diagnostics
+ from a DOS or command prompt session under Windows 95, Windows NT,
+ OS/2, or other operating system.
+
+To run the diagnostics tests on the CS8900/20 adapter:
+
+ 1.) Boot DOS on the PC and start the CS8900/20 Setup Utility.
+
+ 2.) The adapter's current configuration is displayed. Hit the ENTER key to
+ get to the main menu.
+
+ 4.) Select 'Diagnostics' (ALT-G) from the main menu.
+ * Select 'Self-Test' to test the adapter's basic functionality.
+ * Select 'Network Test' to test the network connection and cabling.
+
+
+5.2.1 DIAGNOSTIC SELF-TEST
+
+The diagnostic self-test checks the adapter's basic functionality as well as
+its ability to communicate across the ISA bus based on the system resources
+assigned during hardware configuration. The following tests are performed:
+
+ * IO Register Read/Write Test
+ The IO Register Read/Write test insures that the CS8900/20 can be
+ accessed in IO mode, and that the IO base address is correct.
+
+ * Shared Memory Test
+ The Shared Memory test insures the CS8900/20 can be accessed in memory
+ mode and that the range of memory addresses assigned does not conflict
+ with other devices in the system.
+
+ * Interrupt Test
+ The Interrupt test insures there are no conflicts with the assigned IRQ
+ signal.
+
+ * EEPROM Test
+ The EEPROM test insures the EEPROM can be read.
+
+ * Chip RAM Test
+ The Chip RAM test insures the 4K of memory internal to the CS8900/20 is
+ working properly.
+
+ * Internal Loop-back Test
+ The Internal Loop Back test insures the adapter's transmitter and
+ receiver are operating properly. If this test fails, make sure the
+ adapter's cable is connected to the network (check for LED activity for
+ example).
+
+ * Boot PROM Test
+ The Boot PROM test insures the Boot PROM is present, and can be read.
+ Failure indicates the Boot PROM was not successfully read due to a
+ hardware problem or due to a conflicts on the Boot PROM address
+ assignment. (Test only applies if the adapter is configured to use the
+ Boot PROM option.)
+
+Failure of a test item indicates a possible system resource conflict with
+another device on the ISA bus. In this case, you should use the Manual Setup
+option to reconfigure the adapter by selecting a different value for the system
+resource that failed.
+
+
+5.2.2 DIAGNOSTIC NETWORK TEST
+
+The Diagnostic Network Test verifies a working network connection by
+transferring data between two CS8900/20 adapters installed in different PCs
+on the same network. (Note: the diagnostic network test should not be run
+between two nodes across a router.)
+
+This test requires that each of the two PCs have a CS8900/20-based adapter
+installed and have the CS8900/20 Setup Utility running. The first PC is
+configured as a Responder and the other PC is configured as an Initiator.
+Once the Initiator is started, it sends data frames to the Responder which
+returns the frames to the Initiator.
+
+The total number of frames received and transmitted are displayed on the
+Initiator's display, along with a count of the number of frames received and
+transmitted OK or in error. The test can be terminated anytime by the user at
+either PC.
+
+To setup the Diagnostic Network Test:
+
+ 1.) Select a PC with a CS8900/20-based adapter and a known working network
+ connection to act as the Responder. Run the CS8900/20 Setup Utility
+ and select 'Diagnostics -> Network Test -> Responder' from the main
+ menu. Hit ENTER to start the Responder.
+
+ 2.) Return to the PC with the CS8900/20-based adapter you want to test and
+ start the CS8900/20 Setup Utility.
+
+ 3.) From the main menu, Select 'Diagnostic -> Network Test -> Initiator'.
+ Hit ENTER to start the test.
+
+You may stop the test on the Initiator at any time while allowing the Responder
+to continue running. In this manner, you can move to additional PCs and test
+them by starting the Initiator on another PC without having to stop/start the
+Responder.
+
+
+
+5.3 USING THE ADAPTER'S LEDs
+
+The 2 and 3-media adapters have two LEDs visible on the back end of the board
+located near the 10Base-T connector.
+
+Link Integrity LED: A "steady" ON of the green LED indicates a valid 10Base-T
+connection. (Only applies to 10Base-T. The green LED has no significance for
+a 10Base-2 or AUI connection.)
+
+TX/RX LED: The yellow LED lights briefly each time the adapter transmits or
+receives data. (The yellow LED will appear to "flicker" on a typical network.)
+
+
+5.4 RESOLVING I/O CONFLICTS
+
+An IO conflict occurs when two or more adapter use the same ISA resource (IO
+address, memory address or IRQ). You can usually detect an IO conflict in one
+of four ways after installing and or configuring the CS8900/20-based adapter:
+
+ 1.) The system does not boot properly (or at all).
+
+ 2.) The driver cannot communicate with the adapter, reporting an "Adapter
+ not found" error message.
+
+ 3.) You cannot connect to the network or the driver will not load.
+
+ 4.) If you have configured the adapter to run in memory mode but the driver
+ reports it is using IO mode when loading, this is an indication of a
+ memory address conflict.
+
+If an IO conflict occurs, run the CS8900/20 Setup Utility and perform a
+diagnostic self-test. Normally, the ISA resource in conflict will fail the
+self-test. If so, reconfigure the adapter selecting another choice for the
+resource in conflict. Run the diagnostics again to check for further IO
+conflicts.
+
+In some cases, such as when the PC will not boot, it may be necessary to remove
+the adapter and reconfigure it by installing it in another PC to run the
+CS8900/20 Setup Utility. Once reinstalled in the target system, run the
+diagnostics self-test to ensure the new configuration is free of conflicts
+before loading the driver again.
+
+When manually configuring the adapter, keep in mind the typical ISA system
+resource usage as indicated in the tables below.
+
+I/O Address Device IRQ Device
+----------- -------- --- --------
+ 200-20F Game I/O adapter 3 COM2, Bus Mouse
+ 230-23F Bus Mouse 4 COM1
+ 270-27F LPT3: third parallel port 5 LPT2
+ 2F0-2FF COM2: second serial port 6 Floppy Disk controller
+ 320-32F Fixed disk controller 7 LPT1
+ 8 Real-time Clock
+ 9 EGA/VGA display adapter
+ 12 Mouse (PS/2)
+Memory Address Device 13 Math Coprocessor
+-------------- --------------------- 14 Hard Disk controller
+A000-BFFF EGA Graphics Adapter
+A000-C7FF VGA Graphics Adapter
+B000-BFFF Mono Graphics Adapter
+B800-BFFF Color Graphics Adapter
+E000-FFFF AT BIOS
+
+
+
+
+6.0 TECHNICAL SUPPORT
+===============================================================================
+
+6.1 CONTACTING CIRRUS LOGIC'S TECHNICAL SUPPORT
+
+Cirrus Logic's CS89XX Technical Application Support can be reached at:
+
+Telephone :(800) 888-5016 (from inside U.S. and Canada)
+ :(512) 442-7555 (from outside the U.S. and Canada)
+Fax :(512) 912-3871
+Email :ethernet@crystal.cirrus.com
+WWW :http://www.cirrus.com
+
+
+6.2 INFORMATION REQUIRED BEFORE CONTACTING TECHNICAL SUPPORT
+
+Before contacting Cirrus Logic for technical support, be prepared to provide as
+Much of the following information as possible.
+
+1.) Adapter type (CRD8900, CDB8900, CDB8920, etc.)
+
+2.) Adapter configuration
+
+ * IO Base, Memory Base, IO or memory mode enabled, IRQ, DMA channel
+ * Plug and Play enabled/disabled (CS8920-based adapters only)
+ * Configured for media auto-detect or specific media type (which type).
+
+3.) PC System's Configuration
+
+ * Plug and Play system (yes/no)
+ * BIOS (make and version)
+ * System make and model
+ * CPU (type and speed)
+ * System RAM
+ * SCSI Adapter
+
+4.) Software
+
+ * CS89XX driver and version
+ * Your network operating system and version
+ * Your system's OS version
+ * Version of all protocol support files
+
+5.) Any Error Message displayed.
+
+
+
+6.3 OBTAINING THE LATEST DRIVER VERSION
+
+You can obtain the latest CS89XX drivers and support software from Cirrus Logic's
+Web site. You can also contact Cirrus Logic's Technical Support (email:
+ethernet@crystal.cirrus.com) and request that you be registered for automatic
+software-update notification.
+
+Cirrus Logic maintains a web page at http://www.cirrus.com with the
+latest drivers and technical publications.
+
+
+6.4 Current maintainer
+
+In February 2000 the maintenance of this driver was assumed by Andrew
+Morton.
+
+6.5 Kernel module parameters
+
+For use in embedded environments with no cs89x0 EEPROM, the kernel boot
+parameter `cs89x0_media=' has been implemented. Usage is:
+
+ cs89x0_media=rj45 or
+ cs89x0_media=aui or
+ cs89x0_media=bnc
+
diff --git a/Documentation/networking/cxacru-cf.py b/Documentation/networking/cxacru-cf.py
new file mode 100644
index 000000000..b41d29839
--- /dev/null
+++ b/Documentation/networking/cxacru-cf.py
@@ -0,0 +1,48 @@
+#!/usr/bin/env python
+# Copyright 2009 Simon Arlott
+#
+# This program is free software; you can redistribute it and/or modify it
+# under the terms of the GNU General Public License as published by the Free
+# Software Foundation; either version 2 of the License, or (at your option)
+# any later version.
+#
+# This program is distributed in the hope that it will be useful, but WITHOUT
+# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+# FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
+# more details.
+#
+# You should have received a copy of the GNU General Public License along with
+# this program; if not, write to the Free Software Foundation, Inc., 59
+# Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+#
+# Usage: cxacru-cf.py < cxacru-cf.bin
+# Output: values string suitable for the sysfs adsl_config attribute
+#
+# Warning: cxacru-cf.bin with MD5 hash cdbac2689969d5ed5d4850f117702110
+# contains mis-aligned values which will stop the modem from being able
+# to make a connection. If the first and last two bytes are removed then
+# the values become valid, but the modulation will be forced to ANSI
+# T1.413 only which may not be appropriate.
+#
+# The original binary format is a packed list of le32 values.
+
+import sys
+import struct
+
+i = 0
+while True:
+ buf = sys.stdin.read(4)
+
+ if len(buf) == 0:
+ break
+ elif len(buf) != 4:
+ sys.stdout.write("\n")
+ sys.stderr.write("Error: read {0} not 4 bytes\n".format(len(buf)))
+ sys.exit(1)
+
+ if i > 0:
+ sys.stdout.write(" ")
+ sys.stdout.write("{0:x}={1}".format(i, struct.unpack("<I", buf)[0]))
+ i += 1
+
+sys.stdout.write("\n")
diff --git a/Documentation/networking/cxacru.txt b/Documentation/networking/cxacru.txt
new file mode 100644
index 000000000..2cce04457
--- /dev/null
+++ b/Documentation/networking/cxacru.txt
@@ -0,0 +1,100 @@
+Firmware is required for this device: http://accessrunner.sourceforge.net/
+
+While it is capable of managing/maintaining the ADSL connection without the
+module loaded, the device will sometimes stop responding after unloading the
+driver and it is necessary to unplug/remove power to the device to fix this.
+
+Note: support for cxacru-cf.bin has been removed. It was not loaded correctly
+so it had no effect on the device configuration. Fixing it could have stopped
+existing devices working when an invalid configuration is supplied.
+
+There is a script cxacru-cf.py to convert an existing file to the sysfs form.
+
+Detected devices will appear as ATM devices named "cxacru". In /sys/class/atm/
+these are directories named cxacruN where N is the device number. A symlink
+named device points to the USB interface device's directory which contains
+several sysfs attribute files for retrieving device statistics:
+
+* adsl_controller_version
+
+* adsl_headend
+* adsl_headend_environment
+ Information about the remote headend.
+
+* adsl_config
+ Configuration writing interface.
+ Write parameters in hexadecimal format <index>=<value>,
+ separated by whitespace, e.g.:
+ "1=0 a=5"
+ Up to 7 parameters at a time will be sent and the modem will restart
+ the ADSL connection when any value is set. These are logged for future
+ reference.
+
+* downstream_attenuation (dB)
+* downstream_bits_per_frame
+* downstream_rate (kbps)
+* downstream_snr_margin (dB)
+ Downstream stats.
+
+* upstream_attenuation (dB)
+* upstream_bits_per_frame
+* upstream_rate (kbps)
+* upstream_snr_margin (dB)
+* transmitter_power (dBm/Hz)
+ Upstream stats.
+
+* downstream_crc_errors
+* downstream_fec_errors
+* downstream_hec_errors
+* upstream_crc_errors
+* upstream_fec_errors
+* upstream_hec_errors
+ Error counts.
+
+* line_startable
+ Indicates that ADSL support on the device
+ is/can be enabled, see adsl_start.
+
+* line_status
+ "initialising"
+ "down"
+ "attempting to activate"
+ "training"
+ "channel analysis"
+ "exchange"
+ "waiting"
+ "up"
+
+ Changes between "down" and "attempting to activate"
+ if there is no signal.
+
+* link_status
+ "not connected"
+ "connected"
+ "lost"
+
+* mac_address
+
+* modulation
+ "" (when not connected)
+ "ANSI T1.413"
+ "ITU-T G.992.1 (G.DMT)"
+ "ITU-T G.992.2 (G.LITE)"
+
+* startup_attempts
+ Count of total attempts to initialise ADSL.
+
+To enable/disable ADSL, the following can be written to the adsl_state file:
+ "start"
+ "stop
+ "restart" (stops, waits 1.5s, then starts)
+ "poll" (used to resume status polling if it was disabled due to failure)
+
+Changes in adsl/line state are reported via kernel log messages:
+ [4942145.150704] ATM dev 0: ADSL state: running
+ [4942243.663766] ATM dev 0: ADSL line: down
+ [4942249.665075] ATM dev 0: ADSL line: attempting to activate
+ [4942253.654954] ATM dev 0: ADSL line: training
+ [4942255.666387] ATM dev 0: ADSL line: channel analysis
+ [4942259.656262] ATM dev 0: ADSL line: exchange
+ [2635357.696901] ATM dev 0: ADSL line: up (8128 kb/s down | 832 kb/s up)
diff --git a/Documentation/networking/cxgb.txt b/Documentation/networking/cxgb.txt
new file mode 100644
index 000000000..20a887615
--- /dev/null
+++ b/Documentation/networking/cxgb.txt
@@ -0,0 +1,352 @@
+ Chelsio N210 10Gb Ethernet Network Controller
+
+ Driver Release Notes for Linux
+
+ Version 2.1.1
+
+ June 20, 2005
+
+CONTENTS
+========
+ INTRODUCTION
+ FEATURES
+ PERFORMANCE
+ DRIVER MESSAGES
+ KNOWN ISSUES
+ SUPPORT
+
+
+INTRODUCTION
+============
+
+ This document describes the Linux driver for Chelsio 10Gb Ethernet Network
+ Controller. This driver supports the Chelsio N210 NIC and is backward
+ compatible with the Chelsio N110 model 10Gb NICs.
+
+
+FEATURES
+========
+
+ Adaptive Interrupts (adaptive-rx)
+ ---------------------------------
+
+ This feature provides an adaptive algorithm that adjusts the interrupt
+ coalescing parameters, allowing the driver to dynamically adapt the latency
+ settings to achieve the highest performance during various types of network
+ load.
+
+ The interface used to control this feature is ethtool. Please see the
+ ethtool manpage for additional usage information.
+
+ By default, adaptive-rx is disabled.
+ To enable adaptive-rx:
+
+ ethtool -C <interface> adaptive-rx on
+
+ To disable adaptive-rx, use ethtool:
+
+ ethtool -C <interface> adaptive-rx off
+
+ After disabling adaptive-rx, the timer latency value will be set to 50us.
+ You may set the timer latency after disabling adaptive-rx:
+
+ ethtool -C <interface> rx-usecs <microseconds>
+
+ An example to set the timer latency value to 100us on eth0:
+
+ ethtool -C eth0 rx-usecs 100
+
+ You may also provide a timer latency value while disabling adaptive-rx:
+
+ ethtool -C <interface> adaptive-rx off rx-usecs <microseconds>
+
+ If adaptive-rx is disabled and a timer latency value is specified, the timer
+ will be set to the specified value until changed by the user or until
+ adaptive-rx is enabled.
+
+ To view the status of the adaptive-rx and timer latency values:
+
+ ethtool -c <interface>
+
+
+ TCP Segmentation Offloading (TSO) Support
+ -----------------------------------------
+
+ This feature, also known as "large send", enables a system's protocol stack
+ to offload portions of outbound TCP processing to a network interface card
+ thereby reducing system CPU utilization and enhancing performance.
+
+ The interface used to control this feature is ethtool version 1.8 or higher.
+ Please see the ethtool manpage for additional usage information.
+
+ By default, TSO is enabled.
+ To disable TSO:
+
+ ethtool -K <interface> tso off
+
+ To enable TSO:
+
+ ethtool -K <interface> tso on
+
+ To view the status of TSO:
+
+ ethtool -k <interface>
+
+
+PERFORMANCE
+===========
+
+ The following information is provided as an example of how to change system
+ parameters for "performance tuning" an what value to use. You may or may not
+ want to change these system parameters, depending on your server/workstation
+ application. Doing so is not warranted in any way by Chelsio Communications,
+ and is done at "YOUR OWN RISK". Chelsio will not be held responsible for loss
+ of data or damage to equipment.
+
+ Your distribution may have a different way of doing things, or you may prefer
+ a different method. These commands are shown only to provide an example of
+ what to do and are by no means definitive.
+
+ Making any of the following system changes will only last until you reboot
+ your system. You may want to write a script that runs at boot-up which
+ includes the optimal settings for your system.
+
+ Setting PCI Latency Timer:
+ setpci -d 1425:* 0x0c.l=0x0000F800
+
+ Disabling TCP timestamp:
+ sysctl -w net.ipv4.tcp_timestamps=0
+
+ Disabling SACK:
+ sysctl -w net.ipv4.tcp_sack=0
+
+ Setting large number of incoming connection requests:
+ sysctl -w net.ipv4.tcp_max_syn_backlog=3000
+
+ Setting maximum receive socket buffer size:
+ sysctl -w net.core.rmem_max=1024000
+
+ Setting maximum send socket buffer size:
+ sysctl -w net.core.wmem_max=1024000
+
+ Set smp_affinity (on a multiprocessor system) to a single CPU:
+ echo 1 > /proc/irq/<interrupt_number>/smp_affinity
+
+ Setting default receive socket buffer size:
+ sysctl -w net.core.rmem_default=524287
+
+ Setting default send socket buffer size:
+ sysctl -w net.core.wmem_default=524287
+
+ Setting maximum option memory buffers:
+ sysctl -w net.core.optmem_max=524287
+
+ Setting maximum backlog (# of unprocessed packets before kernel drops):
+ sysctl -w net.core.netdev_max_backlog=300000
+
+ Setting TCP read buffers (min/default/max):
+ sysctl -w net.ipv4.tcp_rmem="10000000 10000000 10000000"
+
+ Setting TCP write buffers (min/pressure/max):
+ sysctl -w net.ipv4.tcp_wmem="10000000 10000000 10000000"
+
+ Setting TCP buffer space (min/pressure/max):
+ sysctl -w net.ipv4.tcp_mem="10000000 10000000 10000000"
+
+ TCP window size for single connections:
+ The receive buffer (RX_WINDOW) size must be at least as large as the
+ Bandwidth-Delay Product of the communication link between the sender and
+ receiver. Due to the variations of RTT, you may want to increase the buffer
+ size up to 2 times the Bandwidth-Delay Product. Reference page 289 of
+ "TCP/IP Illustrated, Volume 1, The Protocols" by W. Richard Stevens.
+ At 10Gb speeds, use the following formula:
+ RX_WINDOW >= 1.25MBytes * RTT(in milliseconds)
+ Example for RTT with 100us: RX_WINDOW = (1,250,000 * 0.1) = 125,000
+ RX_WINDOW sizes of 256KB - 512KB should be sufficient.
+ Setting the min, max, and default receive buffer (RX_WINDOW) size:
+ sysctl -w net.ipv4.tcp_rmem="<min> <default> <max>"
+
+ TCP window size for multiple connections:
+ The receive buffer (RX_WINDOW) size may be calculated the same as single
+ connections, but should be divided by the number of connections. The
+ smaller window prevents congestion and facilitates better pacing,
+ especially if/when MAC level flow control does not work well or when it is
+ not supported on the machine. Experimentation may be necessary to attain
+ the correct value. This method is provided as a starting point for the
+ correct receive buffer size.
+ Setting the min, max, and default receive buffer (RX_WINDOW) size is
+ performed in the same manner as single connection.
+
+
+DRIVER MESSAGES
+===============
+
+ The following messages are the most common messages logged by syslog. These
+ may be found in /var/log/messages.
+
+ Driver up:
+ Chelsio Network Driver - version 2.1.1
+
+ NIC detected:
+ eth#: Chelsio N210 1x10GBaseX NIC (rev #), PCIX 133MHz/64-bit
+
+ Link up:
+ eth#: link is up at 10 Gbps, full duplex
+
+ Link down:
+ eth#: link is down
+
+
+KNOWN ISSUES
+============
+
+ These issues have been identified during testing. The following information
+ is provided as a workaround to the problem. In some cases, this problem is
+ inherent to Linux or to a particular Linux Distribution and/or hardware
+ platform.
+
+ 1. Large number of TCP retransmits on a multiprocessor (SMP) system.
+
+ On a system with multiple CPUs, the interrupt (IRQ) for the network
+ controller may be bound to more than one CPU. This will cause TCP
+ retransmits if the packet data were to be split across different CPUs
+ and re-assembled in a different order than expected.
+
+ To eliminate the TCP retransmits, set smp_affinity on the particular
+ interrupt to a single CPU. You can locate the interrupt (IRQ) used on
+ the N110/N210 by using ifconfig:
+ ifconfig <dev_name> | grep Interrupt
+ Set the smp_affinity to a single CPU:
+ echo 1 > /proc/irq/<interrupt_number>/smp_affinity
+
+ It is highly suggested that you do not run the irqbalance daemon on your
+ system, as this will change any smp_affinity setting you have applied.
+ The irqbalance daemon runs on a 10 second interval and binds interrupts
+ to the least loaded CPU determined by the daemon. To disable this daemon:
+ chkconfig --level 2345 irqbalance off
+
+ By default, some Linux distributions enable the kernel feature,
+ irqbalance, which performs the same function as the daemon. To disable
+ this feature, add the following line to your bootloader:
+ noirqbalance
+
+ Example using the Grub bootloader:
+ title Red Hat Enterprise Linux AS (2.4.21-27.ELsmp)
+ root (hd0,0)
+ kernel /vmlinuz-2.4.21-27.ELsmp ro root=/dev/hda3 noirqbalance
+ initrd /initrd-2.4.21-27.ELsmp.img
+
+ 2. After running insmod, the driver is loaded and the incorrect network
+ interface is brought up without running ifup.
+
+ When using 2.4.x kernels, including RHEL kernels, the Linux kernel
+ invokes a script named "hotplug". This script is primarily used to
+ automatically bring up USB devices when they are plugged in, however,
+ the script also attempts to automatically bring up a network interface
+ after loading the kernel module. The hotplug script does this by scanning
+ the ifcfg-eth# config files in /etc/sysconfig/network-scripts, looking
+ for HWADDR=<mac_address>.
+
+ If the hotplug script does not find the HWADDRR within any of the
+ ifcfg-eth# files, it will bring up the device with the next available
+ interface name. If this interface is already configured for a different
+ network card, your new interface will have incorrect IP address and
+ network settings.
+
+ To solve this issue, you can add the HWADDR=<mac_address> key to the
+ interface config file of your network controller.
+
+ To disable this "hotplug" feature, you may add the driver (module name)
+ to the "blacklist" file located in /etc/hotplug. It has been noted that
+ this does not work for network devices because the net.agent script
+ does not use the blacklist file. Simply remove, or rename, the net.agent
+ script located in /etc/hotplug to disable this feature.
+
+ 3. Transport Protocol (TP) hangs when running heavy multi-connection traffic
+ on an AMD Opteron system with HyperTransport PCI-X Tunnel chipset.
+
+ If your AMD Opteron system uses the AMD-8131 HyperTransport PCI-X Tunnel
+ chipset, you may experience the "133-Mhz Mode Split Completion Data
+ Corruption" bug identified by AMD while using a 133Mhz PCI-X card on the
+ bus PCI-X bus.
+
+ AMD states, "Under highly specific conditions, the AMD-8131 PCI-X Tunnel
+ can provide stale data via split completion cycles to a PCI-X card that
+ is operating at 133 Mhz", causing data corruption.
+
+ AMD's provides three workarounds for this problem, however, Chelsio
+ recommends the first option for best performance with this bug:
+
+ For 133Mhz secondary bus operation, limit the transaction length and
+ the number of outstanding transactions, via BIOS configuration
+ programming of the PCI-X card, to the following:
+
+ Data Length (bytes): 1k
+ Total allowed outstanding transactions: 2
+
+ Please refer to AMD 8131-HT/PCI-X Errata 26310 Rev 3.08 August 2004,
+ section 56, "133-MHz Mode Split Completion Data Corruption" for more
+ details with this bug and workarounds suggested by AMD.
+
+ It may be possible to work outside AMD's recommended PCI-X settings, try
+ increasing the Data Length to 2k bytes for increased performance. If you
+ have issues with these settings, please revert to the "safe" settings
+ and duplicate the problem before submitting a bug or asking for support.
+
+ NOTE: The default setting on most systems is 8 outstanding transactions
+ and 2k bytes data length.
+
+ 4. On multiprocessor systems, it has been noted that an application which
+ is handling 10Gb networking can switch between CPUs causing degraded
+ and/or unstable performance.
+
+ If running on an SMP system and taking performance measurements, it
+ is suggested you either run the latest netperf-2.4.0+ or use a binding
+ tool such as Tim Hockin's procstate utilities (runon)
+ <http://www.hockin.org/~thockin/procstate/>.
+
+ Binding netserver and netperf (or other applications) to particular
+ CPUs will have a significant difference in performance measurements.
+ You may need to experiment which CPU to bind the application to in
+ order to achieve the best performance for your system.
+
+ If you are developing an application designed for 10Gb networking,
+ please keep in mind you may want to look at kernel functions
+ sched_setaffinity & sched_getaffinity to bind your application.
+
+ If you are just running user-space applications such as ftp, telnet,
+ etc., you may want to try the runon tool provided by Tim Hockin's
+ procstate utility. You could also try binding the interface to a
+ particular CPU: runon 0 ifup eth0
+
+
+SUPPORT
+=======
+
+ If you have problems with the software or hardware, please contact our
+ customer support team via email at support@chelsio.com or check our website
+ at http://www.chelsio.com
+
+===============================================================================
+
+ Chelsio Communications
+ 370 San Aleso Ave.
+ Suite 100
+ Sunnyvale, CA 94085
+ http://www.chelsio.com
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License, version 2, as
+published by the Free Software Foundation.
+
+You should have received a copy of the GNU General Public License along
+with this program; if not, write to the Free Software Foundation, Inc.,
+59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+
+THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED
+WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
+
+ Copyright (c) 2003-2005 Chelsio Communications. All rights reserved.
+
+===============================================================================
diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt
new file mode 100644
index 000000000..55c575fca
--- /dev/null
+++ b/Documentation/networking/dccp.txt
@@ -0,0 +1,207 @@
+DCCP protocol
+=============
+
+
+Contents
+========
+- Introduction
+- Missing features
+- Socket options
+- Sysctl variables
+- IOCTLs
+- Other tunables
+- Notes
+
+
+Introduction
+============
+Datagram Congestion Control Protocol (DCCP) is an unreliable, connection
+oriented protocol designed to solve issues present in UDP and TCP, particularly
+for real-time and multimedia (streaming) traffic.
+It divides into a base protocol (RFC 4340) and pluggable congestion control
+modules called CCIDs. Like pluggable TCP congestion control, at least one CCID
+needs to be enabled in order for the protocol to function properly. In the Linux
+implementation, this is the TCP-like CCID2 (RFC 4341). Additional CCIDs, such as
+the TCP-friendly CCID3 (RFC 4342), are optional.
+For a brief introduction to CCIDs and suggestions for choosing a CCID to match
+given applications, see section 10 of RFC 4340.
+
+It has a base protocol and pluggable congestion control IDs (CCIDs).
+
+DCCP is a Proposed Standard (RFC 2026), and the homepage for DCCP as a protocol
+is at http://www.ietf.org/html.charters/dccp-charter.html
+
+
+Missing features
+================
+The Linux DCCP implementation does not currently support all the features that are
+specified in RFCs 4340...42.
+
+The known bugs are at:
+ http://www.linuxfoundation.org/collaborate/workgroups/networking/todo#DCCP
+
+For more up-to-date versions of the DCCP implementation, please consider using
+the experimental DCCP test tree; instructions for checking this out are on:
+http://www.linuxfoundation.org/collaborate/workgroups/networking/dccp_testing#Experimental_DCCP_source_tree
+
+
+Socket options
+==============
+DCCP_SOCKOPT_QPOLICY_ID sets the dequeuing policy for outgoing packets. It takes
+a policy ID as argument and can only be set before the connection (i.e. changes
+during an established connection are not supported). Currently, two policies are
+defined: the "simple" policy (DCCPQ_POLICY_SIMPLE), which does nothing special,
+and a priority-based variant (DCCPQ_POLICY_PRIO). The latter allows to pass an
+u32 priority value as ancillary data to sendmsg(), where higher numbers indicate
+a higher packet priority (similar to SO_PRIORITY). This ancillary data needs to
+be formatted using a cmsg(3) message header filled in as follows:
+ cmsg->cmsg_level = SOL_DCCP;
+ cmsg->cmsg_type = DCCP_SCM_PRIORITY;
+ cmsg->cmsg_len = CMSG_LEN(sizeof(uint32_t)); /* or CMSG_LEN(4) */
+
+DCCP_SOCKOPT_QPOLICY_TXQLEN sets the maximum length of the output queue. A zero
+value is always interpreted as unbounded queue length. If different from zero,
+the interpretation of this parameter depends on the current dequeuing policy
+(see above): the "simple" policy will enforce a fixed queue size by returning
+EAGAIN, whereas the "prio" policy enforces a fixed queue length by dropping the
+lowest-priority packet first. The default value for this parameter is
+initialised from /proc/sys/net/dccp/default/tx_qlen.
+
+DCCP_SOCKOPT_SERVICE sets the service. The specification mandates use of
+service codes (RFC 4340, sec. 8.1.2); if this socket option is not set,
+the socket will fall back to 0 (which means that no meaningful service code
+is present). On active sockets this is set before connect(); specifying more
+than one code has no effect (all subsequent service codes are ignored). The
+case is different for passive sockets, where multiple service codes (up to 32)
+can be set before calling bind().
+
+DCCP_SOCKOPT_GET_CUR_MPS is read-only and retrieves the current maximum packet
+size (application payload size) in bytes, see RFC 4340, section 14.
+
+DCCP_SOCKOPT_AVAILABLE_CCIDS is also read-only and returns the list of CCIDs
+supported by the endpoint. The option value is an array of type uint8_t whose
+size is passed as option length. The minimum array size is 4 elements, the
+value returned in the optlen argument always reflects the true number of
+built-in CCIDs.
+
+DCCP_SOCKOPT_CCID is write-only and sets both the TX and RX CCIDs at the same
+time, combining the operation of the next two socket options. This option is
+preferable over the latter two, since often applications will use the same
+type of CCID for both directions; and mixed use of CCIDs is not currently well
+understood. This socket option takes as argument at least one uint8_t value, or
+an array of uint8_t values, which must match available CCIDS (see above). CCIDs
+must be registered on the socket before calling connect() or listen().
+
+DCCP_SOCKOPT_TX_CCID is read/write. It returns the current CCID (if set) or sets
+the preference list for the TX CCID, using the same format as DCCP_SOCKOPT_CCID.
+Please note that the getsockopt argument type here is `int', not uint8_t.
+
+DCCP_SOCKOPT_RX_CCID is analogous to DCCP_SOCKOPT_TX_CCID, but for the RX CCID.
+
+DCCP_SOCKOPT_SERVER_TIMEWAIT enables the server (listening socket) to hold
+timewait state when closing the connection (RFC 4340, 8.3). The usual case is
+that the closing server sends a CloseReq, whereupon the client holds timewait
+state. When this boolean socket option is on, the server sends a Close instead
+and will enter TIMEWAIT. This option must be set after accept() returns.
+
+DCCP_SOCKOPT_SEND_CSCOV and DCCP_SOCKOPT_RECV_CSCOV are used for setting the
+partial checksum coverage (RFC 4340, sec. 9.2). The default is that checksums
+always cover the entire packet and that only fully covered application data is
+accepted by the receiver. Hence, when using this feature on the sender, it must
+be enabled at the receiver, too with suitable choice of CsCov.
+
+DCCP_SOCKOPT_SEND_CSCOV sets the sender checksum coverage. Values in the
+ range 0..15 are acceptable. The default setting is 0 (full coverage),
+ values between 1..15 indicate partial coverage.
+DCCP_SOCKOPT_RECV_CSCOV is for the receiver and has a different meaning: it
+ sets a threshold, where again values 0..15 are acceptable. The default
+ of 0 means that all packets with a partial coverage will be discarded.
+ Values in the range 1..15 indicate that packets with minimally such a
+ coverage value are also acceptable. The higher the number, the more
+ restrictive this setting (see [RFC 4340, sec. 9.2.1]). Partial coverage
+ settings are inherited to the child socket after accept().
+
+The following two options apply to CCID 3 exclusively and are getsockopt()-only.
+In either case, a TFRC info struct (defined in <linux/tfrc.h>) is returned.
+DCCP_SOCKOPT_CCID_RX_INFO
+ Returns a `struct tfrc_rx_info' in optval; the buffer for optval and
+ optlen must be set to at least sizeof(struct tfrc_rx_info).
+DCCP_SOCKOPT_CCID_TX_INFO
+ Returns a `struct tfrc_tx_info' in optval; the buffer for optval and
+ optlen must be set to at least sizeof(struct tfrc_tx_info).
+
+On unidirectional connections it is useful to close the unused half-connection
+via shutdown (SHUT_WR or SHUT_RD): this will reduce per-packet processing costs.
+
+
+Sysctl variables
+================
+Several DCCP default parameters can be managed by the following sysctls
+(sysctl net.dccp.default or /proc/sys/net/dccp/default):
+
+request_retries
+ The number of active connection initiation retries (the number of
+ Requests minus one) before timing out. In addition, it also governs
+ the behaviour of the other, passive side: this variable also sets
+ the number of times DCCP repeats sending a Response when the initial
+ handshake does not progress from RESPOND to OPEN (i.e. when no Ack
+ is received after the initial Request). This value should be greater
+ than 0, suggested is less than 10. Analogue of tcp_syn_retries.
+
+retries1
+ How often a DCCP Response is retransmitted until the listening DCCP
+ side considers its connecting peer dead. Analogue of tcp_retries1.
+
+retries2
+ The number of times a general DCCP packet is retransmitted. This has
+ importance for retransmitted acknowledgments and feature negotiation,
+ data packets are never retransmitted. Analogue of tcp_retries2.
+
+tx_ccid = 2
+ Default CCID for the sender-receiver half-connection. Depending on the
+ choice of CCID, the Send Ack Vector feature is enabled automatically.
+
+rx_ccid = 2
+ Default CCID for the receiver-sender half-connection; see tx_ccid.
+
+seq_window = 100
+ The initial sequence window (sec. 7.5.2) of the sender. This influences
+ the local ackno validity and the remote seqno validity windows (7.5.1).
+ Values in the range Wmin = 32 (RFC 4340, 7.5.2) up to 2^32-1 can be set.
+
+tx_qlen = 5
+ The size of the transmit buffer in packets. A value of 0 corresponds
+ to an unbounded transmit buffer.
+
+sync_ratelimit = 125 ms
+ The timeout between subsequent DCCP-Sync packets sent in response to
+ sequence-invalid packets on the same socket (RFC 4340, 7.5.4). The unit
+ of this parameter is milliseconds; a value of 0 disables rate-limiting.
+
+
+IOCTLS
+======
+FIONREAD
+ Works as in udp(7): returns in the `int' argument pointer the size of
+ the next pending datagram in bytes, or 0 when no datagram is pending.
+
+
+Other tunables
+==============
+Per-route rto_min support
+ CCID-2 supports the RTAX_RTO_MIN per-route setting for the minimum value
+ of the RTO timer. This setting can be modified via the 'rto_min' option
+ of iproute2; for example:
+ > ip route change 10.0.0.0/24 rto_min 250j dev wlan0
+ > ip route add 10.0.0.254/32 rto_min 800j dev wlan0
+ > ip route show dev wlan0
+ CCID-3 also supports the rto_min setting: it is used to define the lower
+ bound for the expiry of the nofeedback timer. This can be useful on LANs
+ with very low RTTs (e.g., loopback, Gbit ethernet).
+
+
+Notes
+=====
+DCCP does not travel through NAT successfully at present on many boxes. This is
+because the checksum covers the pseudo-header as per TCP and UDP. Linux NAT
+support for DCCP has been added.
diff --git a/Documentation/networking/dctcp.txt b/Documentation/networking/dctcp.txt
new file mode 100644
index 000000000..0d5dfbc89
--- /dev/null
+++ b/Documentation/networking/dctcp.txt
@@ -0,0 +1,43 @@
+DCTCP (DataCenter TCP)
+----------------------
+
+DCTCP is an enhancement to the TCP congestion control algorithm for data
+center networks and leverages Explicit Congestion Notification (ECN) in
+the data center network to provide multi-bit feedback to the end hosts.
+
+To enable it on end hosts:
+
+ sysctl -w net.ipv4.tcp_congestion_control=dctcp
+
+All switches in the data center network running DCTCP must support ECN
+marking and be configured for marking when reaching defined switch buffer
+thresholds. The default ECN marking threshold heuristic for DCTCP on
+switches is 20 packets (30KB) at 1Gbps, and 65 packets (~100KB) at 10Gbps,
+but might need further careful tweaking.
+
+For more details, see below documents:
+
+Paper:
+
+The algorithm is further described in detail in the following two
+SIGCOMM/SIGMETRICS papers:
+
+ i) Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye,
+ Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan:
+ "Data Center TCP (DCTCP)", Data Center Networks session
+ Proc. ACM SIGCOMM, New Delhi, 2010.
+ http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
+ http://www.sigcomm.org/ccr/papers/2010/October/1851275.1851192
+
+ii) Mohammad Alizadeh, Adel Javanmard, and Balaji Prabhakar:
+ "Analysis of DCTCP: Stability, Convergence, and Fairness"
+ Proc. ACM SIGMETRICS, San Jose, 2011.
+ http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
+
+IETF informational draft:
+
+ http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
+
+DCTCP site:
+
+ http://simula.stanford.edu/~alizade/Site/DCTCP.html
diff --git a/Documentation/networking/de4x5.txt b/Documentation/networking/de4x5.txt
new file mode 100644
index 000000000..c8e4ca9b2
--- /dev/null
+++ b/Documentation/networking/de4x5.txt
@@ -0,0 +1,178 @@
+ Originally, this driver was written for the Digital Equipment
+ Corporation series of EtherWORKS Ethernet cards:
+
+ DE425 TP/COAX EISA
+ DE434 TP PCI
+ DE435 TP/COAX/AUI PCI
+ DE450 TP/COAX/AUI PCI
+ DE500 10/100 PCI Fasternet
+
+ but it will now attempt to support all cards which conform to the
+ Digital Semiconductor SROM Specification. The driver currently
+ recognises the following chips:
+
+ DC21040 (no SROM)
+ DC21041[A]
+ DC21140[A]
+ DC21142
+ DC21143
+
+ So far the driver is known to work with the following cards:
+
+ KINGSTON
+ Linksys
+ ZNYX342
+ SMC8432
+ SMC9332 (w/new SROM)
+ ZNYX31[45]
+ ZNYX346 10/100 4 port (can act as a 10/100 bridge!)
+
+ The driver has been tested on a relatively busy network using the DE425,
+ DE434, DE435 and DE500 cards and benchmarked with 'ttcp': it transferred
+ 16M of data to a DECstation 5000/200 as follows:
+
+ TCP UDP
+ TX RX TX RX
+ DE425 1030k 997k 1170k 1128k
+ DE434 1063k 995k 1170k 1125k
+ DE435 1063k 995k 1170k 1125k
+ DE500 1063k 998k 1170k 1125k in 10Mb/s mode
+
+ All values are typical (in kBytes/sec) from a sample of 4 for each
+ measurement. Their error is +/-20k on a quiet (private) network and also
+ depend on what load the CPU has.
+
+ =========================================================================
+
+ The ability to load this driver as a loadable module has been included
+ and used extensively during the driver development (to save those long
+ reboot sequences). Loadable module support under PCI and EISA has been
+ achieved by letting the driver autoprobe as if it were compiled into the
+ kernel. Do make sure you're not sharing interrupts with anything that
+ cannot accommodate interrupt sharing!
+
+ To utilise this ability, you have to do 8 things:
+
+ 0) have a copy of the loadable modules code installed on your system.
+ 1) copy de4x5.c from the /linux/drivers/net directory to your favourite
+ temporary directory.
+ 2) for fixed autoprobes (not recommended), edit the source code near
+ line 5594 to reflect the I/O address you're using, or assign these when
+ loading by:
+
+ insmod de4x5 io=0xghh where g = bus number
+ hh = device number
+
+ NB: autoprobing for modules is now supported by default. You may just
+ use:
+
+ insmod de4x5
+
+ to load all available boards. For a specific board, still use
+ the 'io=?' above.
+ 3) compile de4x5.c, but include -DMODULE in the command line to ensure
+ that the correct bits are compiled (see end of source code).
+ 4) if you are wanting to add a new card, goto 5. Otherwise, recompile a
+ kernel with the de4x5 configuration turned off and reboot.
+ 5) insmod de4x5 [io=0xghh]
+ 6) run the net startup bits for your new eth?? interface(s) manually
+ (usually /etc/rc.inet[12] at boot time).
+ 7) enjoy!
+
+ To unload a module, turn off the associated interface(s)
+ 'ifconfig eth?? down' then 'rmmod de4x5'.
+
+ Automedia detection is included so that in principle you can disconnect
+ from, e.g. TP, reconnect to BNC and things will still work (after a
+ pause whilst the driver figures out where its media went). My tests
+ using ping showed that it appears to work....
+
+ By default, the driver will now autodetect any DECchip based card.
+ Should you have a need to restrict the driver to DIGITAL only cards, you
+ can compile with a DEC_ONLY define, or if loading as a module, use the
+ 'dec_only=1' parameter.
+
+ I've changed the timing routines to use the kernel timer and scheduling
+ functions so that the hangs and other assorted problems that occurred
+ while autosensing the media should be gone. A bonus for the DC21040
+ auto media sense algorithm is that it can now use one that is more in
+ line with the rest (the DC21040 chip doesn't have a hardware timer).
+ The downside is the 1 'jiffies' (10ms) resolution.
+
+ IEEE 802.3u MII interface code has been added in anticipation that some
+ products may use it in the future.
+
+ The SMC9332 card has a non-compliant SROM which needs fixing - I have
+ patched this driver to detect it because the SROM format used complies
+ to a previous DEC-STD format.
+
+ I have removed the buffer copies needed for receive on Intels. I cannot
+ remove them for Alphas since the Tulip hardware only does longword
+ aligned DMA transfers and the Alphas get alignment traps with non
+ longword aligned data copies (which makes them really slow). No comment.
+
+ I have added SROM decoding routines to make this driver work with any
+ card that supports the Digital Semiconductor SROM spec. This will help
+ all cards running the dc2114x series chips in particular. Cards using
+ the dc2104x chips should run correctly with the basic driver. I'm in
+ debt to <mjacob@feral.com> for the testing and feedback that helped get
+ this feature working. So far we have tested KINGSTON, SMC8432, SMC9332
+ (with the latest SROM complying with the SROM spec V3: their first was
+ broken), ZNYX342 and LinkSys. ZNYX314 (dual 21041 MAC) and ZNYX 315
+ (quad 21041 MAC) cards also appear to work despite their incorrectly
+ wired IRQs.
+
+ I have added a temporary fix for interrupt problems when some SCSI cards
+ share the same interrupt as the DECchip based cards. The problem occurs
+ because the SCSI card wants to grab the interrupt as a fast interrupt
+ (runs the service routine with interrupts turned off) vs. this card
+ which really needs to run the service routine with interrupts turned on.
+ This driver will now add the interrupt service routine as a fast
+ interrupt if it is bounced from the slow interrupt. THIS IS NOT A
+ RECOMMENDED WAY TO RUN THE DRIVER and has been done for a limited time
+ until people sort out their compatibility issues and the kernel
+ interrupt service code is fixed. YOU SHOULD SEPARATE OUT THE FAST
+ INTERRUPT CARDS FROM THE SLOW INTERRUPT CARDS to ensure that they do not
+ run on the same interrupt. PCMCIA/CardBus is another can of worms...
+
+ Finally, I think I have really fixed the module loading problem with
+ more than one DECchip based card. As a side effect, I don't mess with
+ the device structure any more which means that if more than 1 card in
+ 2.0.x is installed (4 in 2.1.x), the user will have to edit
+ linux/drivers/net/Space.c to make room for them. Hence, module loading
+ is the preferred way to use this driver, since it doesn't have this
+ limitation.
+
+ Where SROM media detection is used and full duplex is specified in the
+ SROM, the feature is ignored unless lp->params.fdx is set at compile
+ time OR during a module load (insmod de4x5 args='eth??:fdx' [see
+ below]). This is because there is no way to automatically detect full
+ duplex links except through autonegotiation. When I include the
+ autonegotiation feature in the SROM autoconf code, this detection will
+ occur automatically for that case.
+
+ Command line arguments are now allowed, similar to passing arguments
+ through LILO. This will allow a per adapter board set up of full duplex
+ and media. The only lexical constraints are: the board name (dev->name)
+ appears in the list before its parameters. The list of parameters ends
+ either at the end of the parameter list or with another board name. The
+ following parameters are allowed:
+
+ fdx for full duplex
+ autosense to set the media/speed; with the following
+ sub-parameters:
+ TP, TP_NW, BNC, AUI, BNC_AUI, 100Mb, 10Mb, AUTO
+
+ Case sensitivity is important for the sub-parameters. They *must* be
+ upper case. Examples:
+
+ insmod de4x5 args='eth1:fdx autosense=BNC eth0:autosense=100Mb'.
+
+ For a compiled in driver, in linux/drivers/net/CONFIG, place e.g.
+ DE4X5_OPTS = -DDE4X5_PARM='"eth0:fdx autosense=AUI eth2:autosense=TP"'
+
+ Yes, I know full duplex isn't permissible on BNC or AUI; they're just
+ examples. By default, full duplex is turned off and AUTO is the default
+ autosense setting. In reality, I expect only the full duplex option to
+ be used. Note the use of single quotes in the two examples above and the
+ lack of commas to separate items.
diff --git a/Documentation/networking/decnet.txt b/Documentation/networking/decnet.txt
new file mode 100644
index 000000000..e12a4900c
--- /dev/null
+++ b/Documentation/networking/decnet.txt
@@ -0,0 +1,232 @@
+ Linux DECnet Networking Layer Information
+ ===========================================
+
+1) Other documentation....
+
+ o Project Home Pages
+ http://www.chygwyn.com/ - Kernel info
+ http://linux-decnet.sourceforge.net/ - Userland tools
+ http://www.sourceforge.net/projects/linux-decnet/ - Status page
+
+2) Configuring the kernel
+
+Be sure to turn on the following options:
+
+ CONFIG_DECNET (obviously)
+ CONFIG_PROC_FS (to see what's going on)
+ CONFIG_SYSCTL (for easy configuration)
+
+if you want to try out router support (not properly debugged yet)
+you'll need the following options as well...
+
+ CONFIG_DECNET_ROUTER (to be able to add/delete routes)
+ CONFIG_NETFILTER (will be required for the DECnet routing daemon)
+
+ CONFIG_DECNET_ROUTE_FWMARK is optional
+
+Don't turn on SIOCGIFCONF support for DECnet unless you are really sure
+that you need it, in general you won't and it can cause ifconfig to
+malfunction.
+
+Run time configuration has changed slightly from the 2.4 system. If you
+want to configure an endnode, then the simplified procedure is as follows:
+
+ o Set the MAC address on your ethernet card before starting _any_ other
+ network protocols.
+
+As soon as your network card is brought into the UP state, DECnet should
+start working. If you need something more complicated or are unsure how
+to set the MAC address, see the next section. Also all configurations which
+worked with 2.4 will work under 2.5 with no change.
+
+3) Command line options
+
+You can set a DECnet address on the kernel command line for compatibility
+with the 2.4 configuration procedure, but in general it's not needed any more.
+If you do st a DECnet address on the command line, it has only one purpose
+which is that its added to the addresses on the loopback device.
+
+With 2.4 kernels, DECnet would only recognise addresses as local if they
+were added to the loopback device. In 2.5, any local interface address
+can be used to loop back to the local machine. Of course this does not
+prevent you adding further addresses to the loopback device if you
+want to.
+
+N.B. Since the address list of an interface determines the addresses for
+which "hello" messages are sent, if you don't set an address on the loopback
+interface then you won't see any entries in /proc/net/neigh for the local
+host until such time as you start a connection. This doesn't affect the
+operation of the local communications in any other way though.
+
+The kernel command line takes options looking like the following:
+
+ decnet.addr=1,2
+
+the two numbers are the node address 1,2 = 1.2 For 2.2.xx kernels
+and early 2.3.xx kernels, you must use a comma when specifying the
+DECnet address like this. For more recent 2.3.xx kernels, you may
+use almost any character except space, although a `.` would be the most
+obvious choice :-)
+
+There used to be a third number specifying the node type. This option
+has gone away in favour of a per interface node type. This is now set
+using /proc/sys/net/decnet/conf/<dev>/forwarding. This file can be
+set with a single digit, 0=EndNode, 1=L1 Router and 2=L2 Router.
+
+There are also equivalent options for modules. The node address can
+also be set through the /proc/sys/net/decnet/ files, as can other system
+parameters.
+
+Currently the only supported devices are ethernet and ip_gre. The
+ethernet address of your ethernet card has to be set according to the DECnet
+address of the node in order for it to be autoconfigured (and then appear in
+/proc/net/decnet_dev). There is a utility available at the above
+FTP sites called dn2ethaddr which can compute the correct ethernet
+address to use. The address can be set by ifconfig either before or
+at the time the device is brought up. If you are using RedHat you can
+add the line:
+
+ MACADDR=AA:00:04:00:03:04
+
+or something similar, to /etc/sysconfig/network-scripts/ifcfg-eth0 or
+wherever your network card's configuration lives. Setting the MAC address
+of your ethernet card to an address starting with "hi-ord" will cause a
+DECnet address which matches to be added to the interface (which you can
+verify with iproute2).
+
+The default device for routing can be set through the /proc filesystem
+by setting /proc/sys/net/decnet/default_device to the
+device you want DECnet to route packets out of when no specific route
+is available. Usually this will be eth0, for example:
+
+ echo -n "eth0" >/proc/sys/net/decnet/default_device
+
+If you don't set the default device, then it will default to the first
+ethernet card which has been autoconfigured as described above. You can
+confirm that by looking in the default_device file of course.
+
+There is a list of what the other files under /proc/sys/net/decnet/ do
+on the kernel patch web site (shown above).
+
+4) Run time kernel configuration
+
+This is either done through the sysctl/proc interface (see the kernel web
+pages for details on what the various options do) or through the iproute2
+package in the same way as IPv4/6 configuration is performed.
+
+Documentation for iproute2 is included with the package, although there is
+as yet no specific section on DECnet, most of the features apply to both
+IP and DECnet, albeit with DECnet addresses instead of IP addresses and
+a reduced functionality.
+
+If you want to configure a DECnet router you'll need the iproute2 package
+since its the _only_ way to add and delete routes currently. Eventually
+there will be a routing daemon to send and receive routing messages for
+each interface and update the kernel routing tables accordingly. The
+routing daemon will use netfilter to listen to routing packets, and
+rtnetlink to update the kernels routing tables.
+
+The DECnet raw socket layer has been removed since it was there purely
+for use by the routing daemon which will now use netfilter (a much cleaner
+and more generic solution) instead.
+
+5) How can I tell if its working ?
+
+Here is a quick guide of what to look for in order to know if your DECnet
+kernel subsystem is working.
+
+ - Is the node address set (see /proc/sys/net/decnet/node_address)
+ - Is the node of the correct type
+ (see /proc/sys/net/decnet/conf/<dev>/forwarding)
+ - Is the Ethernet MAC address of each Ethernet card set to match
+ the DECnet address. If in doubt use the dn2ethaddr utility available
+ at the ftp archive.
+ - If the previous two steps are satisfied, and the Ethernet card is up,
+ you should find that it is listed in /proc/net/decnet_dev and also
+ that it appears as a directory in /proc/sys/net/decnet/conf/. The
+ loopback device (lo) should also appear and is required to communicate
+ within a node.
+ - If you have any DECnet routers on your network, they should appear
+ in /proc/net/decnet_neigh, otherwise this file will only contain the
+ entry for the node itself (if it doesn't check to see if lo is up).
+ - If you want to send to any node which is not listed in the
+ /proc/net/decnet_neigh file, you'll need to set the default device
+ to point to an Ethernet card with connection to a router. This is
+ again done with the /proc/sys/net/decnet/default_device file.
+ - Try starting a simple server and client, like the dnping/dnmirror
+ over the loopback interface. With luck they should communicate.
+ For this step and those after, you'll need the DECnet library
+ which can be obtained from the above ftp sites as well as the
+ actual utilities themselves.
+ - If this seems to work, then try talking to a node on your local
+ network, and see if you can obtain the same results.
+ - At this point you are on your own... :-)
+
+6) How to send a bug report
+
+If you've found a bug and want to report it, then there are several things
+you can do to help me work out exactly what it is that is wrong. Useful
+information (_most_ of which _is_ _essential_) includes:
+
+ - What kernel version are you running ?
+ - What version of the patch are you running ?
+ - How far though the above set of tests can you get ?
+ - What is in the /proc/decnet* files and /proc/sys/net/decnet/* files ?
+ - Which services are you running ?
+ - Which client caused the problem ?
+ - How much data was being transferred ?
+ - Was the network congested ?
+ - How can the problem be reproduced ?
+ - Can you use tcpdump to get a trace ? (N.B. Most (all?) versions of
+ tcpdump don't understand how to dump DECnet properly, so including
+ the hex listing of the packet contents is _essential_, usually the -x flag.
+ You may also need to increase the length grabbed with the -s flag. The
+ -e flag also provides very useful information (ethernet MAC addresses))
+
+7) MAC FAQ
+
+A quick FAQ on ethernet MAC addresses to explain how Linux and DECnet
+interact and how to get the best performance from your hardware.
+
+Ethernet cards are designed to normally only pass received network frames
+to a host computer when they are addressed to it, or to the broadcast address.
+
+Linux has an interface which allows the setting of extra addresses for
+an ethernet card to listen to. If the ethernet card supports it, the
+filtering operation will be done in hardware, if not the extra unwanted packets
+received will be discarded by the host computer. In the latter case,
+significant processor time and bus bandwidth can be used up on a busy
+network (see the NAPI documentation for a longer explanation of these
+effects).
+
+DECnet makes use of this interface to allow running DECnet on an ethernet
+card which has already been configured using TCP/IP (presumably using the
+built in MAC address of the card, as usual) and/or to allow multiple DECnet
+addresses on each physical interface. If you do this, be aware that if your
+ethernet card doesn't support perfect hashing in its MAC address filter
+then your computer will be doing more work than required. Some cards
+will simply set themselves into promiscuous mode in order to receive
+packets from the DECnet specified addresses. So if you have one of these
+cards its better to set the MAC address of the card as described above
+to gain the best efficiency. Better still is to use a card which supports
+NAPI as well.
+
+
+8) Mailing list
+
+If you are keen to get involved in development, or want to ask questions
+about configuration, or even just report bugs, then there is a mailing
+list that you can join, details are at:
+
+http://sourceforge.net/mail/?group_id=4993
+
+9) Legal Info
+
+The Linux DECnet project team have placed their code under the GPL. The
+software is provided "as is" and without warranty express or implied.
+DECnet is a trademark of Compaq. This software is not a product of
+Compaq. We acknowledge the help of people at Compaq in providing extra
+documentation above and beyond what was previously publicly available.
+
+Steve Whitehouse <SteveW@ACM.org>
+
diff --git a/Documentation/networking/dl2k.txt b/Documentation/networking/dl2k.txt
new file mode 100644
index 000000000..cba74f7a3
--- /dev/null
+++ b/Documentation/networking/dl2k.txt
@@ -0,0 +1,282 @@
+
+ D-Link DL2000-based Gigabit Ethernet Adapter Installation
+ for Linux
+ May 23, 2002
+
+Contents
+========
+ - Compatibility List
+ - Quick Install
+ - Compiling the Driver
+ - Installing the Driver
+ - Option parameter
+ - Configuration Script Sample
+ - Troubleshooting
+
+
+Compatibility List
+=================
+Adapter Support:
+
+D-Link DGE-550T Gigabit Ethernet Adapter.
+D-Link DGE-550SX Gigabit Ethernet Adapter.
+D-Link DL2000-based Gigabit Ethernet Adapter.
+
+
+The driver support Linux kernel 2.4.7 later. We had tested it
+on the environments below.
+
+ . Red Hat v6.2 (update kernel to 2.4.7)
+ . Red Hat v7.0 (update kernel to 2.4.7)
+ . Red Hat v7.1 (kernel 2.4.7)
+ . Red Hat v7.2 (kernel 2.4.7-10)
+
+
+Quick Install
+=============
+Install linux driver as following command:
+
+1. make all
+2. insmod dl2k.ko
+3. ifconfig eth0 up 10.xxx.xxx.xxx netmask 255.0.0.0
+ ^^^^^^^^^^^^^^^\ ^^^^^^^^\
+ IP NETMASK
+Now eth0 should active, you can test it by "ping" or get more information by
+"ifconfig". If tested ok, continue the next step.
+
+4. cp dl2k.ko /lib/modules/`uname -r`/kernel/drivers/net
+5. Add the following line to /etc/modprobe.d/dl2k.conf:
+ alias eth0 dl2k
+6. Run depmod to updated module indexes.
+7. Run "netconfig" or "netconf" to create configuration script ifcfg-eth0
+ located at /etc/sysconfig/network-scripts or create it manually.
+ [see - Configuration Script Sample]
+8. Driver will automatically load and configure at next boot time.
+
+Compiling the Driver
+====================
+ In Linux, NIC drivers are most commonly configured as loadable modules.
+The approach of building a monolithic kernel has become obsolete. The driver
+can be compiled as part of a monolithic kernel, but is strongly discouraged.
+The remainder of this section assumes the driver is built as a loadable module.
+In the Linux environment, it is a good idea to rebuild the driver from the
+source instead of relying on a precompiled version. This approach provides
+better reliability since a precompiled driver might depend on libraries or
+kernel features that are not present in a given Linux installation.
+
+The 3 files necessary to build Linux device driver are dl2k.c, dl2k.h and
+Makefile. To compile, the Linux installation must include the gcc compiler,
+the kernel source, and the kernel headers. The Linux driver supports Linux
+Kernels 2.4.7. Copy the files to a directory and enter the following command
+to compile and link the driver:
+
+CD-ROM drive
+------------
+
+[root@XXX /] mkdir cdrom
+[root@XXX /] mount -r -t iso9660 -o conv=auto /dev/cdrom /cdrom
+[root@XXX /] cd root
+[root@XXX /root] mkdir dl2k
+[root@XXX /root] cd dl2k
+[root@XXX dl2k] cp /cdrom/linux/dl2k.tgz /root/dl2k
+[root@XXX dl2k] tar xfvz dl2k.tgz
+[root@XXX dl2k] make all
+
+Floppy disc drive
+-----------------
+
+[root@XXX /] cd root
+[root@XXX /root] mkdir dl2k
+[root@XXX /root] cd dl2k
+[root@XXX dl2k] mcopy a:/linux/dl2k.tgz /root/dl2k
+[root@XXX dl2k] tar xfvz dl2k.tgz
+[root@XXX dl2k] make all
+
+Installing the Driver
+=====================
+
+ Manual Installation
+ -------------------
+ Once the driver has been compiled, it must be loaded, enabled, and bound
+ to a protocol stack in order to establish network connectivity. To load a
+ module enter the command:
+
+ insmod dl2k.o
+
+ or
+
+ insmod dl2k.o <optional parameter> ; add parameter
+
+ ===============================================================
+ example: insmod dl2k.o media=100mbps_hd
+ or insmod dl2k.o media=3
+ or insmod dl2k.o media=3,2 ; for 2 cards
+ ===============================================================
+
+ Please reference the list of the command line parameters supported by
+ the Linux device driver below.
+
+ The insmod command only loads the driver and gives it a name of the form
+ eth0, eth1, etc. To bring the NIC into an operational state,
+ it is necessary to issue the following command:
+
+ ifconfig eth0 up
+
+ Finally, to bind the driver to the active protocol (e.g., TCP/IP with
+ Linux), enter the following command:
+
+ ifup eth0
+
+ Note that this is meaningful only if the system can find a configuration
+ script that contains the necessary network information. A sample will be
+ given in the next paragraph.
+
+ The commands to unload a driver are as follows:
+
+ ifdown eth0
+ ifconfig eth0 down
+ rmmod dl2k.o
+
+ The following are the commands to list the currently loaded modules and
+ to see the current network configuration.
+
+ lsmod
+ ifconfig
+
+
+ Automated Installation
+ ----------------------
+ This section describes how to install the driver such that it is
+ automatically loaded and configured at boot time. The following description
+ is based on a Red Hat 6.0/7.0 distribution, but it can easily be ported to
+ other distributions as well.
+
+ Red Hat v6.x/v7.x
+ -----------------
+ 1. Copy dl2k.o to the network modules directory, typically
+ /lib/modules/2.x.x-xx/net or /lib/modules/2.x.x/kernel/drivers/net.
+ 2. Locate the boot module configuration file, most commonly in the
+ /etc/modprobe.d/ directory. Add the following lines:
+
+ alias ethx dl2k
+ options dl2k <optional parameters>
+
+ where ethx will be eth0 if the NIC is the only ethernet adapter, eth1 if
+ one other ethernet adapter is installed, etc. Refer to the table in the
+ previous section for the list of optional parameters.
+ 3. Locate the network configuration scripts, normally the
+ /etc/sysconfig/network-scripts directory, and create a configuration
+ script named ifcfg-ethx that contains network information.
+ 4. Note that for most Linux distributions, Red Hat included, a configuration
+ utility with a graphical user interface is provided to perform steps 2
+ and 3 above.
+
+
+Parameter Description
+=====================
+You can install this driver without any additional parameter. However, if you
+are going to have extensive functions then it is necessary to set extra
+parameter. Below is a list of the command line parameters supported by the
+Linux device
+driver.
+
+mtu=packet_size - Specifies the maximum packet size. default
+ is 1500.
+
+media=media_type - Specifies the media type the NIC operates at.
+ autosense Autosensing active media.
+ 10mbps_hd 10Mbps half duplex.
+ 10mbps_fd 10Mbps full duplex.
+ 100mbps_hd 100Mbps half duplex.
+ 100mbps_fd 100Mbps full duplex.
+ 1000mbps_fd 1000Mbps full duplex.
+ 1000mbps_hd 1000Mbps half duplex.
+ 0 Autosensing active media.
+ 1 10Mbps half duplex.
+ 2 10Mbps full duplex.
+ 3 100Mbps half duplex.
+ 4 100Mbps full duplex.
+ 5 1000Mbps half duplex.
+ 6 1000Mbps full duplex.
+
+ By default, the NIC operates at autosense.
+ 1000mbps_fd and 1000mbps_hd types are only
+ available for fiber adapter.
+
+vlan=n - Specifies the VLAN ID. If vlan=0, the
+ Virtual Local Area Network (VLAN) function is
+ disable.
+
+jumbo=[0|1] - Specifies the jumbo frame support. If jumbo=1,
+ the NIC accept jumbo frames. By default, this
+ function is disabled.
+ Jumbo frame usually improve the performance
+ int gigabit.
+ This feature need jumbo frame compatible
+ remote.
+
+rx_coalesce=m - Number of rx frame handled each interrupt.
+rx_timeout=n - Rx DMA wait time for an interrupt.
+ If set rx_coalesce > 0, hardware only assert
+ an interrupt for m frames. Hardware won't
+ assert rx interrupt until m frames received or
+ reach timeout of n * 640 nano seconds.
+ Set proper rx_coalesce and rx_timeout can
+ reduce congestion collapse and overload which
+ has been a bottleneck for high speed network.
+
+ For example, rx_coalesce=10 rx_timeout=800.
+ that is, hardware assert only 1 interrupt
+ for 10 frames received or timeout of 512 us.
+
+tx_coalesce=n - Number of tx frame handled each interrupt.
+ Set n > 1 can reduce the interrupts
+ congestion usually lower performance of
+ high speed network card. Default is 16.
+
+tx_flow=[1|0] - Specifies the Tx flow control. If tx_flow=0,
+ the Tx flow control disable else driver
+ autodetect.
+rx_flow=[1|0] - Specifies the Rx flow control. If rx_flow=0,
+ the Rx flow control enable else driver
+ autodetect.
+
+
+Configuration Script Sample
+===========================
+Here is a sample of a simple configuration script:
+
+DEVICE=eth0
+USERCTL=no
+ONBOOT=yes
+POOTPROTO=none
+BROADCAST=207.200.5.255
+NETWORK=207.200.5.0
+NETMASK=255.255.255.0
+IPADDR=207.200.5.2
+
+
+Troubleshooting
+===============
+Q1. Source files contain ^ M behind every line.
+ Make sure all files are Unix file format (no LF). Try the following
+ shell command to convert files.
+
+ cat dl2k.c | col -b > dl2k.tmp
+ mv dl2k.tmp dl2k.c
+
+ OR
+
+ cat dl2k.c | tr -d "\r" > dl2k.tmp
+ mv dl2k.tmp dl2k.c
+
+Q2: Could not find header files (*.h) ?
+ To compile the driver, you need kernel header files. After
+ installing the kernel source, the header files are usually located in
+ /usr/src/linux/include, which is the default include directory configured
+ in Makefile. For some distributions, there is a copy of header files in
+ /usr/src/include/linux and /usr/src/include/asm, that you can change the
+ INCLUDEDIR in Makefile to /usr/include without installing kernel source.
+ Note that RH 7.0 didn't provide correct header files in /usr/include,
+ including those files will make a wrong version driver.
+
diff --git a/Documentation/networking/dm9000.txt b/Documentation/networking/dm9000.txt
new file mode 100644
index 000000000..5552e2e57
--- /dev/null
+++ b/Documentation/networking/dm9000.txt
@@ -0,0 +1,167 @@
+DM9000 Network driver
+=====================
+
+Copyright 2008 Simtec Electronics,
+ Ben Dooks <ben@simtec.co.uk> <ben-linux@fluff.org>
+
+
+Introduction
+------------
+
+This file describes how to use the DM9000 platform-device based network driver
+that is contained in the files drivers/net/dm9000.c and drivers/net/dm9000.h.
+
+The driver supports three DM9000 variants, the DM9000E which is the first chip
+supported as well as the newer DM9000A and DM9000B devices. It is currently
+maintained and tested by Ben Dooks, who should be CC: to any patches for this
+driver.
+
+
+Defining the platform device
+----------------------------
+
+The minimum set of resources attached to the platform device are as follows:
+
+ 1) The physical address of the address register
+ 2) The physical address of the data register
+ 3) The IRQ line the device's interrupt pin is connected to.
+
+These resources should be specified in that order, as the ordering of the
+two address regions is important (the driver expects these to be address
+and then data).
+
+An example from arch/arm/mach-s3c2410/mach-bast.c is:
+
+static struct resource bast_dm9k_resource[] = {
+ [0] = {
+ .start = S3C2410_CS5 + BAST_PA_DM9000,
+ .end = S3C2410_CS5 + BAST_PA_DM9000 + 3,
+ .flags = IORESOURCE_MEM,
+ },
+ [1] = {
+ .start = S3C2410_CS5 + BAST_PA_DM9000 + 0x40,
+ .end = S3C2410_CS5 + BAST_PA_DM9000 + 0x40 + 0x3f,
+ .flags = IORESOURCE_MEM,
+ },
+ [2] = {
+ .start = IRQ_DM9000,
+ .end = IRQ_DM9000,
+ .flags = IORESOURCE_IRQ | IORESOURCE_IRQ_HIGHLEVEL,
+ }
+};
+
+static struct platform_device bast_device_dm9k = {
+ .name = "dm9000",
+ .id = 0,
+ .num_resources = ARRAY_SIZE(bast_dm9k_resource),
+ .resource = bast_dm9k_resource,
+};
+
+Note the setting of the IRQ trigger flag in bast_dm9k_resource[2].flags,
+as this will generate a warning if it is not present. The trigger from
+the flags field will be passed to request_irq() when registering the IRQ
+handler to ensure that the IRQ is setup correctly.
+
+This shows a typical platform device, without the optional configuration
+platform data supplied. The next example uses the same resources, but adds
+the optional platform data to pass extra configuration data:
+
+static struct dm9000_plat_data bast_dm9k_platdata = {
+ .flags = DM9000_PLATF_16BITONLY,
+};
+
+static struct platform_device bast_device_dm9k = {
+ .name = "dm9000",
+ .id = 0,
+ .num_resources = ARRAY_SIZE(bast_dm9k_resource),
+ .resource = bast_dm9k_resource,
+ .dev = {
+ .platform_data = &bast_dm9k_platdata,
+ }
+};
+
+The platform data is defined in include/linux/dm9000.h and described below.
+
+
+Platform data
+-------------
+
+Extra platform data for the DM9000 can describe the IO bus width to the
+device, whether or not an external PHY is attached to the device and
+the availability of an external configuration EEPROM.
+
+The flags for the platform data .flags field are as follows:
+
+DM9000_PLATF_8BITONLY
+
+ The IO should be done with 8bit operations.
+
+DM9000_PLATF_16BITONLY
+
+ The IO should be done with 16bit operations.
+
+DM9000_PLATF_32BITONLY
+
+ The IO should be done with 32bit operations.
+
+DM9000_PLATF_EXT_PHY
+
+ The chip is connected to an external PHY.
+
+DM9000_PLATF_NO_EEPROM
+
+ This can be used to signify that the board does not have an
+ EEPROM, or that the EEPROM should be hidden from the user.
+
+DM9000_PLATF_SIMPLE_PHY
+
+ Switch to using the simpler PHY polling method which does not
+ try and read the MII PHY state regularly. This is only available
+ when using the internal PHY. See the section on link state polling
+ for more information.
+
+ The config symbol DM9000_FORCE_SIMPLE_PHY_POLL, Kconfig entry
+ "Force simple NSR based PHY polling" allows this flag to be
+ forced on at build time.
+
+
+PHY Link state polling
+----------------------
+
+The driver keeps track of the link state and informs the network core
+about link (carrier) availability. This is managed by several methods
+depending on the version of the chip and on which PHY is being used.
+
+For the internal PHY, the original (and currently default) method is
+to read the MII state, either when the status changes if we have the
+necessary interrupt support in the chip or every two seconds via a
+periodic timer.
+
+To reduce the overhead for the internal PHY, there is now the option
+of using the DM9000_FORCE_SIMPLE_PHY_POLL config, or DM9000_PLATF_SIMPLE_PHY
+platform data option to read the summary information without the
+expensive MII accesses. This method is faster, but does not print
+as much information.
+
+When using an external PHY, the driver currently has to poll the MII
+link status as there is no method for getting an interrupt on link change.
+
+
+DM9000A / DM9000B
+-----------------
+
+These chips are functionally similar to the DM9000E and are supported easily
+by the same driver. The features are:
+
+ 1) Interrupt on internal PHY state change. This means that the periodic
+ polling of the PHY status may be disabled on these devices when using
+ the internal PHY.
+
+ 2) TCP/UDP checksum offloading, which the driver does not currently support.
+
+
+ethtool
+-------
+
+The driver supports the ethtool interface for access to the driver
+state information, the PHY state and the EEPROM.
diff --git a/Documentation/networking/dmfe.txt b/Documentation/networking/dmfe.txt
new file mode 100644
index 000000000..25320bf19
--- /dev/null
+++ b/Documentation/networking/dmfe.txt
@@ -0,0 +1,66 @@
+Note: This driver doesn't have a maintainer.
+
+Davicom DM9102(A)/DM9132/DM9801 fast ethernet driver for Linux.
+
+This program is free software; you can redistribute it and/or
+modify it under the terms of the GNU General Public License
+as published by the Free Software Foundation; either version 2
+of the License, or (at your option) any later version.
+
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+GNU General Public License for more details.
+
+
+This driver provides kernel support for Davicom DM9102(A)/DM9132/DM9801 ethernet cards ( CNET
+10/100 ethernet cards uses Davicom chipset too, so this driver supports CNET cards too ).If you
+didn't compile this driver as a module, it will automatically load itself on boot and print a
+line similar to :
+
+ dmfe: Davicom DM9xxx net driver, version 1.36.4 (2002-01-17)
+
+If you compiled this driver as a module, you have to load it on boot.You can load it with command :
+
+ insmod dmfe
+
+This way it will autodetect the device mode.This is the suggested way to load the module.Or you can pass
+a mode= setting to module while loading, like :
+
+ insmod dmfe mode=0 # Force 10M Half Duplex
+ insmod dmfe mode=1 # Force 100M Half Duplex
+ insmod dmfe mode=4 # Force 10M Full Duplex
+ insmod dmfe mode=5 # Force 100M Full Duplex
+
+Next you should configure your network interface with a command similar to :
+
+ ifconfig eth0 172.22.3.18
+ ^^^^^^^^^^^
+ Your IP Address
+
+Then you may have to modify the default routing table with command :
+
+ route add default eth0
+
+
+Now your ethernet card should be up and running.
+
+
+TODO:
+
+Implement pci_driver::suspend() and pci_driver::resume() power management methods.
+Check on 64 bit boxes.
+Check and fix on big endian boxes.
+Test and make sure PCI latency is now correct for all cases.
+
+
+Authors:
+
+Sten Wang <sten_wang@davicom.com.tw > : Original Author
+
+Contributors:
+
+Marcelo Tosatti <marcelo@conectiva.com.br>
+Alan Cox <alan@lxorguk.ukuu.org.uk>
+Jeff Garzik <jgarzik@pobox.com>
+Vojtech Pavlik <vojtech@suse.cz>
diff --git a/Documentation/networking/dns_resolver.txt b/Documentation/networking/dns_resolver.txt
new file mode 100644
index 000000000..d86adcdae
--- /dev/null
+++ b/Documentation/networking/dns_resolver.txt
@@ -0,0 +1,157 @@
+ ===================
+ DNS Resolver Module
+ ===================
+
+Contents:
+
+ - Overview.
+ - Compilation.
+ - Setting up.
+ - Usage.
+ - Mechanism.
+ - Debugging.
+
+
+========
+OVERVIEW
+========
+
+The DNS resolver module provides a way for kernel services to make DNS queries
+by way of requesting a key of key type dns_resolver. These queries are
+upcalled to userspace through /sbin/request-key.
+
+These routines must be supported by userspace tools dns.upcall, cifs.upcall and
+request-key. It is under development and does not yet provide the full feature
+set. The features it does support include:
+
+ (*) Implements the dns_resolver key_type to contact userspace.
+
+It does not yet support the following AFS features:
+
+ (*) Dns query support for AFSDB resource record.
+
+This code is extracted from the CIFS filesystem.
+
+
+===========
+COMPILATION
+===========
+
+The module should be enabled by turning on the kernel configuration options:
+
+ CONFIG_DNS_RESOLVER - tristate "DNS Resolver support"
+
+
+==========
+SETTING UP
+==========
+
+To set up this facility, the /etc/request-key.conf file must be altered so that
+/sbin/request-key can appropriately direct the upcalls. For example, to handle
+basic dname to IPv4/IPv6 address resolution, the following line should be
+added:
+
+ #OP TYPE DESC CO-INFO PROGRAM ARG1 ARG2 ARG3 ...
+ #====== ============ ======= ======= ==========================
+ create dns_resolver * * /usr/sbin/cifs.upcall %k
+
+To direct a query for query type 'foo', a line of the following should be added
+before the more general line given above as the first match is the one taken.
+
+ create dns_resolver foo:* * /usr/sbin/dns.foo %k
+
+
+=====
+USAGE
+=====
+
+To make use of this facility, one of the following functions that are
+implemented in the module can be called after doing:
+
+ #include <linux/dns_resolver.h>
+
+ (1) int dns_query(const char *type, const char *name, size_t namelen,
+ const char *options, char **_result, time_t *_expiry);
+
+ This is the basic access function. It looks for a cached DNS query and if
+ it doesn't find it, it upcalls to userspace to make a new DNS query, which
+ may then be cached. The key description is constructed as a string of the
+ form:
+
+ [<type>:]<name>
+
+ where <type> optionally specifies the particular upcall program to invoke,
+ and thus the type of query to do, and <name> specifies the string to be
+ looked up. The default query type is a straight hostname to IP address
+ set lookup.
+
+ The name parameter is not required to be a NUL-terminated string, and its
+ length should be given by the namelen argument.
+
+ The options parameter may be NULL or it may be a set of options
+ appropriate to the query type.
+
+ The return value is a string appropriate to the query type. For instance,
+ for the default query type it is just a list of comma-separated IPv4 and
+ IPv6 addresses. The caller must free the result.
+
+ The length of the result string is returned on success, and a negative
+ error code is returned otherwise. -EKEYREJECTED will be returned if the
+ DNS lookup failed.
+
+ If _expiry is non-NULL, the expiry time (TTL) of the result will be
+ returned also.
+
+The kernel maintains an internal keyring in which it caches looked up keys.
+This can be cleared by any process that has the CAP_SYS_ADMIN capability by
+the use of KEYCTL_KEYRING_CLEAR on the keyring ID.
+
+
+===============================
+READING DNS KEYS FROM USERSPACE
+===============================
+
+Keys of dns_resolver type can be read from userspace using keyctl_read() or
+"keyctl read/print/pipe".
+
+
+=========
+MECHANISM
+=========
+
+The dnsresolver module registers a key type called "dns_resolver". Keys of
+this type are used to transport and cache DNS lookup results from userspace.
+
+When dns_query() is invoked, it calls request_key() to search the local
+keyrings for a cached DNS result. If that fails to find one, it upcalls to
+userspace to get a new result.
+
+Upcalls to userspace are made through the request_key() upcall vector, and are
+directed by means of configuration lines in /etc/request-key.conf that tell
+/sbin/request-key what program to run to instantiate the key.
+
+The upcall handler program is responsible for querying the DNS, processing the
+result into a form suitable for passing to the keyctl_instantiate_key()
+routine. This then passes the data to dns_resolver_instantiate() which strips
+off and processes any options included in the data, and then attaches the
+remainder of the string to the key as its payload.
+
+The upcall handler program should set the expiry time on the key to that of the
+lowest TTL of all the records it has extracted a result from. This means that
+the key will be discarded and recreated when the data it holds has expired.
+
+dns_query() returns a copy of the value attached to the key, or an error if
+that is indicated instead.
+
+See <file:Documentation/security/keys-request-key.txt> for further
+information about request-key function.
+
+
+=========
+DEBUGGING
+=========
+
+Debugging messages can be turned on dynamically by writing a 1 into the
+following file:
+
+ /sys/module/dnsresolver/parameters/debug
diff --git a/Documentation/networking/driver.txt b/Documentation/networking/driver.txt
new file mode 100644
index 000000000..da59e2884
--- /dev/null
+++ b/Documentation/networking/driver.txt
@@ -0,0 +1,93 @@
+Document about softnet driver issues
+
+Transmit path guidelines:
+
+1) The ndo_start_xmit method must not return NETDEV_TX_BUSY under
+ any normal circumstances. It is considered a hard error unless
+ there is no way your device can tell ahead of time when it's
+ transmit function will become busy.
+
+ Instead it must maintain the queue properly. For example,
+ for a driver implementing scatter-gather this means:
+
+ static netdev_tx_t drv_hard_start_xmit(struct sk_buff *skb,
+ struct net_device *dev)
+ {
+ struct drv *dp = netdev_priv(dev);
+
+ lock_tx(dp);
+ ...
+ /* This is a hard error log it. */
+ if (TX_BUFFS_AVAIL(dp) <= (skb_shinfo(skb)->nr_frags + 1)) {
+ netif_stop_queue(dev);
+ unlock_tx(dp);
+ printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n",
+ dev->name);
+ return NETDEV_TX_BUSY;
+ }
+
+ ... queue packet to card ...
+ ... update tx consumer index ...
+
+ if (TX_BUFFS_AVAIL(dp) <= (MAX_SKB_FRAGS + 1))
+ netif_stop_queue(dev);
+
+ ...
+ unlock_tx(dp);
+ ...
+ return NETDEV_TX_OK;
+ }
+
+ And then at the end of your TX reclamation event handling:
+
+ if (netif_queue_stopped(dp->dev) &&
+ TX_BUFFS_AVAIL(dp) > (MAX_SKB_FRAGS + 1))
+ netif_wake_queue(dp->dev);
+
+ For a non-scatter-gather supporting card, the three tests simply become:
+
+ /* This is a hard error log it. */
+ if (TX_BUFFS_AVAIL(dp) <= 0)
+
+ and:
+
+ if (TX_BUFFS_AVAIL(dp) == 0)
+
+ and:
+
+ if (netif_queue_stopped(dp->dev) &&
+ TX_BUFFS_AVAIL(dp) > 0)
+ netif_wake_queue(dp->dev);
+
+2) An ndo_start_xmit method must not modify the shared parts of a
+ cloned SKB.
+
+3) Do not forget that once you return NETDEV_TX_OK from your
+ ndo_start_xmit method, it is your driver's responsibility to free
+ up the SKB and in some finite amount of time.
+
+ For example, this means that it is not allowed for your TX
+ mitigation scheme to let TX packets "hang out" in the TX
+ ring unreclaimed forever if no new TX packets are sent.
+ This error can deadlock sockets waiting for send buffer room
+ to be freed up.
+
+ If you return NETDEV_TX_BUSY from the ndo_start_xmit method, you
+ must not keep any reference to that SKB and you must not attempt
+ to free it up.
+
+Probing guidelines:
+
+1) Any hardware layer address you obtain for your device should
+ be verified. For example, for ethernet check it with
+ linux/etherdevice.h:is_valid_ether_addr()
+
+Close/stop guidelines:
+
+1) After the ndo_stop routine has been called, the hardware must
+ not receive or transmit any data. All in flight packets must
+ be aborted. If necessary, poll or wait for completion of
+ any reset commands.
+
+2) The ndo_stop routine will be called by unregister_netdevice
+ if device is still UP.
diff --git a/Documentation/networking/e100.txt b/Documentation/networking/e100.txt
new file mode 100644
index 000000000..f862cf3af
--- /dev/null
+++ b/Documentation/networking/e100.txt
@@ -0,0 +1,197 @@
+Linux* Base Driver for the Intel(R) PRO/100 Family of Adapters
+==============================================================
+
+March 15, 2011
+
+Contents
+========
+
+- In This Release
+- Identifying Your Adapter
+- Building and Installation
+- Driver Configuration Parameters
+- Additional Configurations
+- Known Issues
+- Support
+
+
+In This Release
+===============
+
+This file describes the Linux* Base Driver for the Intel(R) PRO/100 Family of
+Adapters. This driver includes support for Itanium(R)2-based systems.
+
+For questions related to hardware requirements, refer to the documentation
+supplied with your Intel PRO/100 adapter.
+
+The following features are now available in supported kernels:
+ - Native VLANs
+ - Channel Bonding (teaming)
+ - SNMP
+
+Channel Bonding documentation can be found in the Linux kernel source:
+/Documentation/networking/bonding.txt
+
+
+Identifying Your Adapter
+========================
+
+For more information on how to identify your adapter, go to the Adapter &
+Driver ID Guide at:
+
+ http://support.intel.com/support/network/adapter/pro100/21397.htm
+
+For the latest Intel network drivers for Linux, refer to the following
+website. In the search field, enter your adapter name or type, or use the
+networking link on the left to search for your adapter:
+
+ http://downloadfinder.intel.com/scripts-df/support_intel.asp
+
+Driver Configuration Parameters
+===============================
+
+The default value for each parameter is generally the recommended setting,
+unless otherwise noted.
+
+Rx Descriptors: Number of receive descriptors. A receive descriptor is a data
+ structure that describes a receive buffer and its attributes to the network
+ controller. The data in the descriptor is used by the controller to write
+ data from the controller to host memory. In the 3.x.x driver the valid range
+ for this parameter is 64-256. The default value is 64. This parameter can be
+ changed using the command:
+
+ ethtool -G eth? rx n, where n is the number of desired rx descriptors.
+
+Tx Descriptors: Number of transmit descriptors. A transmit descriptor is a data
+ structure that describes a transmit buffer and its attributes to the network
+ controller. The data in the descriptor is used by the controller to read
+ data from the host memory to the controller. In the 3.x.x driver the valid
+ range for this parameter is 64-256. The default value is 64. This parameter
+ can be changed using the command:
+
+ ethtool -G eth? tx n, where n is the number of desired tx descriptors.
+
+Speed/Duplex: The driver auto-negotiates the link speed and duplex settings by
+ default. The ethtool utility can be used as follows to force speed/duplex.
+
+ ethtool -s eth? autoneg off speed {10|100} duplex {full|half}
+
+ NOTE: setting the speed/duplex to incorrect values will cause the link to
+ fail.
+
+Event Log Message Level: The driver uses the message level flag to log events
+ to syslog. The message level can be set at driver load time. It can also be
+ set using the command:
+
+ ethtool -s eth? msglvl n
+
+
+Additional Configurations
+=========================
+
+ Configuring the Driver on Different Distributions
+ -------------------------------------------------
+
+ Configuring a network driver to load properly when the system is started is
+ distribution dependent. Typically, the configuration process involves adding
+ an alias line to /etc/modprobe.d/*.conf as well as editing other system
+ startup scripts and/or configuration files. Many popular Linux
+ distributions ship with tools to make these changes for you. To learn the
+ proper way to configure a network device for your system, refer to your
+ distribution documentation. If during this process you are asked for the
+ driver or module name, the name for the Linux Base Driver for the Intel
+ PRO/100 Family of Adapters is e100.
+
+ As an example, if you install the e100 driver for two PRO/100 adapters
+ (eth0 and eth1), add the following to a configuration file in /etc/modprobe.d/
+
+ alias eth0 e100
+ alias eth1 e100
+
+ Viewing Link Messages
+ ---------------------
+ In order to see link messages and other Intel driver information on your
+ console, you must set the dmesg level up to six. This can be done by
+ entering the following on the command line before loading the e100 driver:
+
+ dmesg -n 8
+
+ If you wish to see all messages issued by the driver, including debug
+ messages, set the dmesg level to eight.
+
+ NOTE: This setting is not saved across reboots.
+
+
+ ethtool
+ -------
+
+ The driver utilizes the ethtool interface for driver configuration and
+ diagnostics, as well as displaying statistical information. The ethtool
+ version 1.6 or later is required for this functionality.
+
+ The latest release of ethtool can be found from
+ http://ftp.kernel.org/pub/software/network/ethtool/
+
+ Enabling Wake on LAN* (WoL)
+ ---------------------------
+ WoL is provided through the ethtool* utility. For instructions on enabling
+ WoL with ethtool, refer to the ethtool man page.
+
+ WoL will be enabled on the system during the next shut down or reboot. For
+ this driver version, in order to enable WoL, the e100 driver must be
+ loaded when shutting down or rebooting the system.
+
+ NAPI
+ ----
+
+ NAPI (Rx polling mode) is supported in the e100 driver.
+
+ See www.cyberus.ca/~hadi/usenix-paper.tgz for more information on NAPI.
+
+ Multiple Interfaces on Same Ethernet Broadcast Network
+ ------------------------------------------------------
+
+ Due to the default ARP behavior on Linux, it is not possible to have
+ one system on two IP networks in the same Ethernet broadcast domain
+ (non-partitioned switch) behave as expected. All Ethernet interfaces
+ will respond to IP traffic for any IP address assigned to the system.
+ This results in unbalanced receive traffic.
+
+ If you have multiple interfaces in a server, either turn on ARP
+ filtering by
+
+ (1) entering: echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
+ (this only works if your kernel's version is higher than 2.4.5), or
+
+ (2) installing the interfaces in separate broadcast domains (either
+ in different switches or in a switch partitioned to VLANs).
+
+
+Support
+=======
+
+For general information, go to the Intel support website at:
+
+ http://support.intel.com
+
+ or the Intel Wired Networking project hosted by Sourceforge at:
+
+ http://sourceforge.net/projects/e1000
+
+If an issue is identified with the released source code on the supported
+kernel with a supported adapter, email the specific information related to the
+issue to e1000-devel@lists.sourceforge.net.
+
+
+License
+=======
+
+This software program is released under the terms of a license agreement
+between you ('Licensee') and Intel. Do not use or load this software or any
+associated materials (collectively, the 'Software') until you have carefully
+read the full terms and conditions of the file COPYING located in this software
+package. By loading or using the Software, you agree to the terms of this
+Agreement. If you do not agree with the terms of this Agreement, do not install
+or use the Software.
+
+* Other names and brands may be claimed as the property of others.
diff --git a/Documentation/networking/e1000.txt b/Documentation/networking/e1000.txt
new file mode 100644
index 000000000..437b2099c
--- /dev/null
+++ b/Documentation/networking/e1000.txt
@@ -0,0 +1,461 @@
+Linux* Base Driver for Intel(R) Ethernet Network Connection
+===========================================================
+
+Intel Gigabit Linux driver.
+Copyright(c) 1999 - 2013 Intel Corporation.
+
+Contents
+========
+
+- Identifying Your Adapter
+- Command Line Parameters
+- Speed and Duplex Configuration
+- Additional Configurations
+- Support
+
+Identifying Your Adapter
+========================
+
+For more information on how to identify your adapter, go to the Adapter &
+Driver ID Guide at:
+
+ http://support.intel.com/support/go/network/adapter/idguide.htm
+
+For the latest Intel network drivers for Linux, refer to the following
+website. In the search field, enter your adapter name or type, or use the
+networking link on the left to search for your adapter:
+
+ http://support.intel.com/support/go/network/adapter/home.htm
+
+Command Line Parameters
+=======================
+
+The default value for each parameter is generally the recommended setting,
+unless otherwise noted.
+
+NOTES: For more information about the AutoNeg, Duplex, and Speed
+ parameters, see the "Speed and Duplex Configuration" section in
+ this document.
+
+ For more information about the InterruptThrottleRate,
+ RxIntDelay, TxIntDelay, RxAbsIntDelay, and TxAbsIntDelay
+ parameters, see the application note at:
+ http://www.intel.com/design/network/applnots/ap450.htm
+
+AutoNeg
+-------
+(Supported only on adapters with copper connections)
+Valid Range: 0x01-0x0F, 0x20-0x2F
+Default Value: 0x2F
+
+This parameter is a bit-mask that specifies the speed and duplex settings
+advertised by the adapter. When this parameter is used, the Speed and
+Duplex parameters must not be specified.
+
+NOTE: Refer to the Speed and Duplex section of this readme for more
+ information on the AutoNeg parameter.
+
+Duplex
+------
+(Supported only on adapters with copper connections)
+Valid Range: 0-2 (0=auto-negotiate, 1=half, 2=full)
+Default Value: 0
+
+This defines the direction in which data is allowed to flow. Can be
+either one or two-directional. If both Duplex and the link partner are
+set to auto-negotiate, the board auto-detects the correct duplex. If the
+link partner is forced (either full or half), Duplex defaults to half-
+duplex.
+
+FlowControl
+-----------
+Valid Range: 0-3 (0=none, 1=Rx only, 2=Tx only, 3=Rx&Tx)
+Default Value: Reads flow control settings from the EEPROM
+
+This parameter controls the automatic generation(Tx) and response(Rx)
+to Ethernet PAUSE frames.
+
+InterruptThrottleRate
+---------------------
+(not supported on Intel(R) 82542, 82543 or 82544-based adapters)
+Valid Range: 0,1,3,4,100-100000 (0=off, 1=dynamic, 3=dynamic conservative,
+ 4=simplified balancing)
+Default Value: 3
+
+The driver can limit the amount of interrupts per second that the adapter
+will generate for incoming packets. It does this by writing a value to the
+adapter that is based on the maximum amount of interrupts that the adapter
+will generate per second.
+
+Setting InterruptThrottleRate to a value greater or equal to 100
+will program the adapter to send out a maximum of that many interrupts
+per second, even if more packets have come in. This reduces interrupt
+load on the system and can lower CPU utilization under heavy load,
+but will increase latency as packets are not processed as quickly.
+
+The default behaviour of the driver previously assumed a static
+InterruptThrottleRate value of 8000, providing a good fallback value for
+all traffic types,but lacking in small packet performance and latency.
+The hardware can handle many more small packets per second however, and
+for this reason an adaptive interrupt moderation algorithm was implemented.
+
+Since 7.3.x, the driver has two adaptive modes (setting 1 or 3) in which
+it dynamically adjusts the InterruptThrottleRate value based on the traffic
+that it receives. After determining the type of incoming traffic in the last
+timeframe, it will adjust the InterruptThrottleRate to an appropriate value
+for that traffic.
+
+The algorithm classifies the incoming traffic every interval into
+classes. Once the class is determined, the InterruptThrottleRate value is
+adjusted to suit that traffic type the best. There are three classes defined:
+"Bulk traffic", for large amounts of packets of normal size; "Low latency",
+for small amounts of traffic and/or a significant percentage of small
+packets; and "Lowest latency", for almost completely small packets or
+minimal traffic.
+
+In dynamic conservative mode, the InterruptThrottleRate value is set to 4000
+for traffic that falls in class "Bulk traffic". If traffic falls in the "Low
+latency" or "Lowest latency" class, the InterruptThrottleRate is increased
+stepwise to 20000. This default mode is suitable for most applications.
+
+For situations where low latency is vital such as cluster or
+grid computing, the algorithm can reduce latency even more when
+InterruptThrottleRate is set to mode 1. In this mode, which operates
+the same as mode 3, the InterruptThrottleRate will be increased stepwise to
+70000 for traffic in class "Lowest latency".
+
+In simplified mode the interrupt rate is based on the ratio of TX and
+RX traffic. If the bytes per second rate is approximately equal, the
+interrupt rate will drop as low as 2000 interrupts per second. If the
+traffic is mostly transmit or mostly receive, the interrupt rate could
+be as high as 8000.
+
+Setting InterruptThrottleRate to 0 turns off any interrupt moderation
+and may improve small packet latency, but is generally not suitable
+for bulk throughput traffic.
+
+NOTE: InterruptThrottleRate takes precedence over the TxAbsIntDelay and
+ RxAbsIntDelay parameters. In other words, minimizing the receive
+ and/or transmit absolute delays does not force the controller to
+ generate more interrupts than what the Interrupt Throttle Rate
+ allows.
+
+CAUTION: If you are using the Intel(R) PRO/1000 CT Network Connection
+ (controller 82547), setting InterruptThrottleRate to a value
+ greater than 75,000, may hang (stop transmitting) adapters
+ under certain network conditions. If this occurs a NETDEV
+ WATCHDOG message is logged in the system event log. In
+ addition, the controller is automatically reset, restoring
+ the network connection. To eliminate the potential for the
+ hang, ensure that InterruptThrottleRate is set no greater
+ than 75,000 and is not set to 0.
+
+NOTE: When e1000 is loaded with default settings and multiple adapters
+ are in use simultaneously, the CPU utilization may increase non-
+ linearly. In order to limit the CPU utilization without impacting
+ the overall throughput, we recommend that you load the driver as
+ follows:
+
+ modprobe e1000 InterruptThrottleRate=3000,3000,3000
+
+ This sets the InterruptThrottleRate to 3000 interrupts/sec for
+ the first, second, and third instances of the driver. The range
+ of 2000 to 3000 interrupts per second works on a majority of
+ systems and is a good starting point, but the optimal value will
+ be platform-specific. If CPU utilization is not a concern, use
+ RX_POLLING (NAPI) and default driver settings.
+
+RxDescriptors
+-------------
+Valid Range: 80-256 for 82542 and 82543-based adapters
+ 80-4096 for all other supported adapters
+Default Value: 256
+
+This value specifies the number of receive buffer descriptors allocated
+by the driver. Increasing this value allows the driver to buffer more
+incoming packets, at the expense of increased system memory utilization.
+
+Each descriptor is 16 bytes. A receive buffer is also allocated for each
+descriptor and can be either 2048, 4096, 8192, or 16384 bytes, depending
+on the MTU setting. The maximum MTU size is 16110.
+
+NOTE: MTU designates the frame size. It only needs to be set for Jumbo
+ Frames. Depending on the available system resources, the request
+ for a higher number of receive descriptors may be denied. In this
+ case, use a lower number.
+
+RxIntDelay
+----------
+Valid Range: 0-65535 (0=off)
+Default Value: 0
+
+This value delays the generation of receive interrupts in units of 1.024
+microseconds. Receive interrupt reduction can improve CPU efficiency if
+properly tuned for specific network traffic. Increasing this value adds
+extra latency to frame reception and can end up decreasing the throughput
+of TCP traffic. If the system is reporting dropped receives, this value
+may be set too high, causing the driver to run out of available receive
+descriptors.
+
+CAUTION: When setting RxIntDelay to a value other than 0, adapters may
+ hang (stop transmitting) under certain network conditions. If
+ this occurs a NETDEV WATCHDOG message is logged in the system
+ event log. In addition, the controller is automatically reset,
+ restoring the network connection. To eliminate the potential
+ for the hang ensure that RxIntDelay is set to 0.
+
+RxAbsIntDelay
+-------------
+(This parameter is supported only on 82540, 82545 and later adapters.)
+Valid Range: 0-65535 (0=off)
+Default Value: 128
+
+This value, in units of 1.024 microseconds, limits the delay in which a
+receive interrupt is generated. Useful only if RxIntDelay is non-zero,
+this value ensures that an interrupt is generated after the initial
+packet is received within the set amount of time. Proper tuning,
+along with RxIntDelay, may improve traffic throughput in specific network
+conditions.
+
+Speed
+-----
+(This parameter is supported only on adapters with copper connections.)
+Valid Settings: 0, 10, 100, 1000
+Default Value: 0 (auto-negotiate at all supported speeds)
+
+Speed forces the line speed to the specified value in megabits per second
+(Mbps). If this parameter is not specified or is set to 0 and the link
+partner is set to auto-negotiate, the board will auto-detect the correct
+speed. Duplex should also be set when Speed is set to either 10 or 100.
+
+TxDescriptors
+-------------
+Valid Range: 80-256 for 82542 and 82543-based adapters
+ 80-4096 for all other supported adapters
+Default Value: 256
+
+This value is the number of transmit descriptors allocated by the driver.
+Increasing this value allows the driver to queue more transmits. Each
+descriptor is 16 bytes.
+
+NOTE: Depending on the available system resources, the request for a
+ higher number of transmit descriptors may be denied. In this case,
+ use a lower number.
+
+TxDescriptorStep
+----------------
+Valid Range: 1 (use every Tx Descriptor)
+ 4 (use every 4th Tx Descriptor)
+
+Default Value: 1 (use every Tx Descriptor)
+
+On certain non-Intel architectures, it has been observed that intense TX
+traffic bursts of short packets may result in an improper descriptor
+writeback. If this occurs, the driver will report a "TX Timeout" and reset
+the adapter, after which the transmit flow will restart, though data may
+have stalled for as much as 10 seconds before it resumes.
+
+The improper writeback does not occur on the first descriptor in a system
+memory cache-line, which is typically 32 bytes, or 4 descriptors long.
+
+Setting TxDescriptorStep to a value of 4 will ensure that all TX descriptors
+are aligned to the start of a system memory cache line, and so this problem
+will not occur.
+
+NOTES: Setting TxDescriptorStep to 4 effectively reduces the number of
+ TxDescriptors available for transmits to 1/4 of the normal allocation.
+ This has a possible negative performance impact, which may be
+ compensated for by allocating more descriptors using the TxDescriptors
+ module parameter.
+
+ There are other conditions which may result in "TX Timeout", which will
+ not be resolved by the use of the TxDescriptorStep parameter. As the
+ issue addressed by this parameter has never been observed on Intel
+ Architecture platforms, it should not be used on Intel platforms.
+
+TxIntDelay
+----------
+Valid Range: 0-65535 (0=off)
+Default Value: 64
+
+This value delays the generation of transmit interrupts in units of
+1.024 microseconds. Transmit interrupt reduction can improve CPU
+efficiency if properly tuned for specific network traffic. If the
+system is reporting dropped transmits, this value may be set too high
+causing the driver to run out of available transmit descriptors.
+
+TxAbsIntDelay
+-------------
+(This parameter is supported only on 82540, 82545 and later adapters.)
+Valid Range: 0-65535 (0=off)
+Default Value: 64
+
+This value, in units of 1.024 microseconds, limits the delay in which a
+transmit interrupt is generated. Useful only if TxIntDelay is non-zero,
+this value ensures that an interrupt is generated after the initial
+packet is sent on the wire within the set amount of time. Proper tuning,
+along with TxIntDelay, may improve traffic throughput in specific
+network conditions.
+
+XsumRX
+------
+(This parameter is NOT supported on the 82542-based adapter.)
+Valid Range: 0-1
+Default Value: 1
+
+A value of '1' indicates that the driver should enable IP checksum
+offload for received packets (both UDP and TCP) to the adapter hardware.
+
+Copybreak
+---------
+Valid Range: 0-xxxxxxx (0=off)
+Default Value: 256
+Usage: insmod e1000.ko copybreak=128
+
+Driver copies all packets below or equaling this size to a fresh RX
+buffer before handing it up the stack.
+
+This parameter is different than other parameters, in that it is a
+single (not 1,1,1 etc.) parameter applied to all driver instances and
+it is also available during runtime at
+/sys/module/e1000/parameters/copybreak
+
+SmartPowerDownEnable
+--------------------
+Valid Range: 0-1
+Default Value: 0 (disabled)
+
+Allows PHY to turn off in lower power states. The user can turn off
+this parameter in supported chipsets.
+
+KumeranLockLoss
+---------------
+Valid Range: 0-1
+Default Value: 1 (enabled)
+
+This workaround skips resetting the PHY at shutdown for the initial
+silicon releases of ICH8 systems.
+
+Speed and Duplex Configuration
+==============================
+
+Three keywords are used to control the speed and duplex configuration.
+These keywords are Speed, Duplex, and AutoNeg.
+
+If the board uses a fiber interface, these keywords are ignored, and the
+fiber interface board only links at 1000 Mbps full-duplex.
+
+For copper-based boards, the keywords interact as follows:
+
+ The default operation is auto-negotiate. The board advertises all
+ supported speed and duplex combinations, and it links at the highest
+ common speed and duplex mode IF the link partner is set to auto-negotiate.
+
+ If Speed = 1000, limited auto-negotiation is enabled and only 1000 Mbps
+ is advertised (The 1000BaseT spec requires auto-negotiation.)
+
+ If Speed = 10 or 100, then both Speed and Duplex should be set. Auto-
+ negotiation is disabled, and the AutoNeg parameter is ignored. Partner
+ SHOULD also be forced.
+
+The AutoNeg parameter is used when more control is required over the
+auto-negotiation process. It should be used when you wish to control which
+speed and duplex combinations are advertised during the auto-negotiation
+process.
+
+The parameter may be specified as either a decimal or hexadecimal value as
+determined by the bitmap below.
+
+Bit position 7 6 5 4 3 2 1 0
+Decimal Value 128 64 32 16 8 4 2 1
+Hex value 80 40 20 10 8 4 2 1
+Speed (Mbps) N/A N/A 1000 N/A 100 100 10 10
+Duplex Full Full Half Full Half
+
+Some examples of using AutoNeg:
+
+ modprobe e1000 AutoNeg=0x01 (Restricts autonegotiation to 10 Half)
+ modprobe e1000 AutoNeg=1 (Same as above)
+ modprobe e1000 AutoNeg=0x02 (Restricts autonegotiation to 10 Full)
+ modprobe e1000 AutoNeg=0x03 (Restricts autonegotiation to 10 Half or 10 Full)
+ modprobe e1000 AutoNeg=0x04 (Restricts autonegotiation to 100 Half)
+ modprobe e1000 AutoNeg=0x05 (Restricts autonegotiation to 10 Half or 100
+ Half)
+ modprobe e1000 AutoNeg=0x020 (Restricts autonegotiation to 1000 Full)
+ modprobe e1000 AutoNeg=32 (Same as above)
+
+Note that when this parameter is used, Speed and Duplex must not be specified.
+
+If the link partner is forced to a specific speed and duplex, then this
+parameter should not be used. Instead, use the Speed and Duplex parameters
+previously mentioned to force the adapter to the same speed and duplex.
+
+Additional Configurations
+=========================
+
+ Jumbo Frames
+ ------------
+ Jumbo Frames support is enabled by changing the MTU to a value larger than
+ the default of 1500. Use the ifconfig command to increase the MTU size.
+ For example:
+
+ ifconfig eth<x> mtu 9000 up
+
+ This setting is not saved across reboots. It can be made permanent if
+ you add:
+
+ MTU=9000
+
+ to the file /etc/sysconfig/network-scripts/ifcfg-eth<x>. This example
+ applies to the Red Hat distributions; other distributions may store this
+ setting in a different location.
+
+ Notes:
+ Degradation in throughput performance may be observed in some Jumbo frames
+ environments. If this is observed, increasing the application's socket buffer
+ size and/or increasing the /proc/sys/net/ipv4/tcp_*mem entry values may help.
+ See the specific application manual and /usr/src/linux*/Documentation/
+ networking/ip-sysctl.txt for more details.
+
+ - The maximum MTU setting for Jumbo Frames is 16110. This value coincides
+ with the maximum Jumbo Frames size of 16128.
+
+ - Using Jumbo frames at 10 or 100 Mbps is not supported and may result in
+ poor performance or loss of link.
+
+ - Adapters based on the Intel(R) 82542 and 82573V/E controller do not
+ support Jumbo Frames. These correspond to the following product names:
+ Intel(R) PRO/1000 Gigabit Server Adapter
+ Intel(R) PRO/1000 PM Network Connection
+
+ ethtool
+ -------
+ The driver utilizes the ethtool interface for driver configuration and
+ diagnostics, as well as displaying statistical information. The ethtool
+ version 1.6 or later is required for this functionality.
+
+ The latest release of ethtool can be found from
+ http://ftp.kernel.org/pub/software/network/ethtool/
+
+ Enabling Wake on LAN* (WoL)
+ ---------------------------
+ WoL is configured through the ethtool* utility.
+
+ WoL will be enabled on the system during the next shut down or reboot.
+ For this driver version, in order to enable WoL, the e1000 driver must be
+ loaded when shutting down or rebooting the system.
+
+Support
+=======
+
+For general information, go to the Intel support website at:
+
+ http://support.intel.com
+
+or the Intel Wired Networking project hosted by Sourceforge at:
+
+ http://sourceforge.net/projects/e1000
+
+If an issue is identified with the released source code on the supported
+kernel with a supported adapter, email the specific information related
+to the issue to e1000-devel@lists.sf.net
diff --git a/Documentation/networking/e1000e.txt b/Documentation/networking/e1000e.txt
new file mode 100644
index 000000000..ad2d9f38c
--- /dev/null
+++ b/Documentation/networking/e1000e.txt
@@ -0,0 +1,312 @@
+Linux* Driver for Intel(R) Ethernet Network Connection
+======================================================
+
+Intel Gigabit Linux driver.
+Copyright(c) 1999 - 2013 Intel Corporation.
+
+Contents
+========
+
+- Identifying Your Adapter
+- Command Line Parameters
+- Additional Configurations
+- Support
+
+Identifying Your Adapter
+========================
+
+The e1000e driver supports all PCI Express Intel(R) Gigabit Network
+Connections, except those that are 82575, 82576 and 82580-based*.
+
+* NOTE: The Intel(R) PRO/1000 P Dual Port Server Adapter is supported by
+ the e1000 driver, not the e1000e driver due to the 82546 part being used
+ behind a PCI Express bridge.
+
+For more information on how to identify your adapter, go to the Adapter &
+Driver ID Guide at:
+
+ http://support.intel.com/support/go/network/adapter/idguide.htm
+
+For the latest Intel network drivers for Linux, refer to the following
+website. In the search field, enter your adapter name or type, or use the
+networking link on the left to search for your adapter:
+
+ http://support.intel.com/support/go/network/adapter/home.htm
+
+Command Line Parameters
+=======================
+
+The default value for each parameter is generally the recommended setting,
+unless otherwise noted.
+
+NOTES: For more information about the InterruptThrottleRate,
+ RxIntDelay, TxIntDelay, RxAbsIntDelay, and TxAbsIntDelay
+ parameters, see the application note at:
+ http://www.intel.com/design/network/applnots/ap450.htm
+
+InterruptThrottleRate
+---------------------
+Valid Range: 0,1,3,4,100-100000 (0=off, 1=dynamic, 3=dynamic conservative,
+ 4=simplified balancing)
+Default Value: 3
+
+The driver can limit the amount of interrupts per second that the adapter
+will generate for incoming packets. It does this by writing a value to the
+adapter that is based on the maximum amount of interrupts that the adapter
+will generate per second.
+
+Setting InterruptThrottleRate to a value greater or equal to 100
+will program the adapter to send out a maximum of that many interrupts
+per second, even if more packets have come in. This reduces interrupt
+load on the system and can lower CPU utilization under heavy load,
+but will increase latency as packets are not processed as quickly.
+
+The default behaviour of the driver previously assumed a static
+InterruptThrottleRate value of 8000, providing a good fallback value for
+all traffic types, but lacking in small packet performance and latency.
+The hardware can handle many more small packets per second however, and
+for this reason an adaptive interrupt moderation algorithm was implemented.
+
+The driver has two adaptive modes (setting 1 or 3) in which
+it dynamically adjusts the InterruptThrottleRate value based on the traffic
+that it receives. After determining the type of incoming traffic in the last
+timeframe, it will adjust the InterruptThrottleRate to an appropriate value
+for that traffic.
+
+The algorithm classifies the incoming traffic every interval into
+classes. Once the class is determined, the InterruptThrottleRate value is
+adjusted to suit that traffic type the best. There are three classes defined:
+"Bulk traffic", for large amounts of packets of normal size; "Low latency",
+for small amounts of traffic and/or a significant percentage of small
+packets; and "Lowest latency", for almost completely small packets or
+minimal traffic.
+
+In dynamic conservative mode, the InterruptThrottleRate value is set to 4000
+for traffic that falls in class "Bulk traffic". If traffic falls in the "Low
+latency" or "Lowest latency" class, the InterruptThrottleRate is increased
+stepwise to 20000. This default mode is suitable for most applications.
+
+For situations where low latency is vital such as cluster or
+grid computing, the algorithm can reduce latency even more when
+InterruptThrottleRate is set to mode 1. In this mode, which operates
+the same as mode 3, the InterruptThrottleRate will be increased stepwise to
+70000 for traffic in class "Lowest latency".
+
+In simplified mode the interrupt rate is based on the ratio of TX and
+RX traffic. If the bytes per second rate is approximately equal, the
+interrupt rate will drop as low as 2000 interrupts per second. If the
+traffic is mostly transmit or mostly receive, the interrupt rate could
+be as high as 8000.
+
+Setting InterruptThrottleRate to 0 turns off any interrupt moderation
+and may improve small packet latency, but is generally not suitable
+for bulk throughput traffic.
+
+NOTE: InterruptThrottleRate takes precedence over the TxAbsIntDelay and
+ RxAbsIntDelay parameters. In other words, minimizing the receive
+ and/or transmit absolute delays does not force the controller to
+ generate more interrupts than what the Interrupt Throttle Rate
+ allows.
+
+NOTE: When e1000e is loaded with default settings and multiple adapters
+ are in use simultaneously, the CPU utilization may increase non-
+ linearly. In order to limit the CPU utilization without impacting
+ the overall throughput, we recommend that you load the driver as
+ follows:
+
+ modprobe e1000e InterruptThrottleRate=3000,3000,3000
+
+ This sets the InterruptThrottleRate to 3000 interrupts/sec for
+ the first, second, and third instances of the driver. The range
+ of 2000 to 3000 interrupts per second works on a majority of
+ systems and is a good starting point, but the optimal value will
+ be platform-specific. If CPU utilization is not a concern, use
+ RX_POLLING (NAPI) and default driver settings.
+
+RxIntDelay
+----------
+Valid Range: 0-65535 (0=off)
+Default Value: 0
+
+This value delays the generation of receive interrupts in units of 1.024
+microseconds. Receive interrupt reduction can improve CPU efficiency if
+properly tuned for specific network traffic. Increasing this value adds
+extra latency to frame reception and can end up decreasing the throughput
+of TCP traffic. If the system is reporting dropped receives, this value
+may be set too high, causing the driver to run out of available receive
+descriptors.
+
+CAUTION: When setting RxIntDelay to a value other than 0, adapters may
+ hang (stop transmitting) under certain network conditions. If
+ this occurs a NETDEV WATCHDOG message is logged in the system
+ event log. In addition, the controller is automatically reset,
+ restoring the network connection. To eliminate the potential
+ for the hang ensure that RxIntDelay is set to 0.
+
+RxAbsIntDelay
+-------------
+Valid Range: 0-65535 (0=off)
+Default Value: 8
+
+This value, in units of 1.024 microseconds, limits the delay in which a
+receive interrupt is generated. Useful only if RxIntDelay is non-zero,
+this value ensures that an interrupt is generated after the initial
+packet is received within the set amount of time. Proper tuning,
+along with RxIntDelay, may improve traffic throughput in specific network
+conditions.
+
+TxIntDelay
+----------
+Valid Range: 0-65535 (0=off)
+Default Value: 8
+
+This value delays the generation of transmit interrupts in units of
+1.024 microseconds. Transmit interrupt reduction can improve CPU
+efficiency if properly tuned for specific network traffic. If the
+system is reporting dropped transmits, this value may be set too high
+causing the driver to run out of available transmit descriptors.
+
+TxAbsIntDelay
+-------------
+Valid Range: 0-65535 (0=off)
+Default Value: 32
+
+This value, in units of 1.024 microseconds, limits the delay in which a
+transmit interrupt is generated. Useful only if TxIntDelay is non-zero,
+this value ensures that an interrupt is generated after the initial
+packet is sent on the wire within the set amount of time. Proper tuning,
+along with TxIntDelay, may improve traffic throughput in specific
+network conditions.
+
+Copybreak
+---------
+Valid Range: 0-xxxxxxx (0=off)
+Default Value: 256
+
+Driver copies all packets below or equaling this size to a fresh RX
+buffer before handing it up the stack.
+
+This parameter is different than other parameters, in that it is a
+single (not 1,1,1 etc.) parameter applied to all driver instances and
+it is also available during runtime at
+/sys/module/e1000e/parameters/copybreak
+
+SmartPowerDownEnable
+--------------------
+Valid Range: 0-1
+Default Value: 0 (disabled)
+
+Allows PHY to turn off in lower power states. The user can set this parameter
+in supported chipsets.
+
+KumeranLockLoss
+---------------
+Valid Range: 0-1
+Default Value: 1 (enabled)
+
+This workaround skips resetting the PHY at shutdown for the initial
+silicon releases of ICH8 systems.
+
+IntMode
+-------
+Valid Range: 0-2 (0=legacy, 1=MSI, 2=MSI-X)
+Default Value: 2
+
+Allows changing the interrupt mode at module load time, without requiring a
+recompile. If the driver load fails to enable a specific interrupt mode, the
+driver will try other interrupt modes, from least to most compatible. The
+interrupt order is MSI-X, MSI, Legacy. If specifying MSI (IntMode=1)
+interrupts, only MSI and Legacy will be attempted.
+
+CrcStripping
+------------
+Valid Range: 0-1
+Default Value: 1 (enabled)
+
+Strip the CRC from received packets before sending up the network stack. If
+you have a machine with a BMC enabled but cannot receive IPMI traffic after
+loading or enabling the driver, try disabling this feature.
+
+WriteProtectNVM
+---------------
+Valid Range: 0,1
+Default Value: 1
+
+If set to 1, configure the hardware to ignore all write/erase cycles to the
+GbE region in the ICHx NVM (in order to prevent accidental corruption of the
+NVM). This feature can be disabled by setting the parameter to 0 during initial
+driver load.
+NOTE: The machine must be power cycled (full off/on) when enabling NVM writes
+via setting the parameter to zero. Once the NVM has been locked (via the
+parameter at 1 when the driver loads) it cannot be unlocked except via power
+cycle.
+
+Additional Configurations
+=========================
+
+ Jumbo Frames
+ ------------
+ Jumbo Frames support is enabled by changing the MTU to a value larger than
+ the default of 1500. Use the ifconfig command to increase the MTU size.
+ For example:
+
+ ifconfig eth<x> mtu 9000 up
+
+ This setting is not saved across reboots.
+
+ Notes:
+
+ - The maximum MTU setting for Jumbo Frames is 9216. This value coincides
+ with the maximum Jumbo Frames size of 9234 bytes.
+
+ - Using Jumbo frames at 10 or 100 Mbps is not supported and may result in
+ poor performance or loss of link.
+
+ - Some adapters limit Jumbo Frames sized packets to a maximum of
+ 4096 bytes and some adapters do not support Jumbo Frames.
+
+ - Jumbo Frames cannot be configured on an 82579-based Network device, if
+ MACSec is enabled on the system.
+
+ ethtool
+ -------
+ The driver utilizes the ethtool interface for driver configuration and
+ diagnostics, as well as displaying statistical information. We
+ strongly recommend downloading the latest version of ethtool at:
+
+ http://ftp.kernel.org/pub/software/network/ethtool/
+
+ NOTE: When validating enable/disable tests on some parts (82578, for example)
+ you need to add a few seconds between tests when working with ethtool.
+
+ Speed and Duplex
+ ----------------
+ Speed and Duplex are configured through the ethtool* utility. For
+ instructions, refer to the ethtool man page.
+
+ Enabling Wake on LAN* (WoL)
+ ---------------------------
+ WoL is configured through the ethtool* utility. For instructions on
+ enabling WoL with ethtool, refer to the ethtool man page.
+
+ WoL will be enabled on the system during the next shut down or reboot.
+ For this driver version, in order to enable WoL, the e1000e driver must be
+ loaded when shutting down or rebooting the system.
+
+ In most cases Wake On LAN is only supported on port A for multiple port
+ adapters. To verify if a port supports Wake on Lan run ethtool eth<X>.
+
+Support
+=======
+
+For general information, go to the Intel support website at:
+
+ www.intel.com/support/
+
+or the Intel Wired Networking project hosted by Sourceforge at:
+
+ http://sourceforge.net/projects/e1000
+
+If an issue is identified with the released source code on the supported
+kernel with a supported adapter, email the specific information related
+to the issue to e1000-devel@lists.sf.net
diff --git a/Documentation/networking/eql.txt b/Documentation/networking/eql.txt
new file mode 100644
index 000000000..0f1550150
--- /dev/null
+++ b/Documentation/networking/eql.txt
@@ -0,0 +1,528 @@
+ EQL Driver: Serial IP Load Balancing HOWTO
+ Simon "Guru Aleph-Null" Janes, simon@ncm.com
+ v1.1, February 27, 1995
+
+ This is the manual for the EQL device driver. EQL is a software device
+ that lets you load-balance IP serial links (SLIP or uncompressed PPP)
+ to increase your bandwidth. It will not reduce your latency (i.e. ping
+ times) except in the case where you already have lots of traffic on
+ your link, in which it will help them out. This driver has been tested
+ with the 1.1.75 kernel, and is known to have patched cleanly with
+ 1.1.86. Some testing with 1.1.92 has been done with the v1.1 patch
+ which was only created to patch cleanly in the very latest kernel
+ source trees. (Yes, it worked fine.)
+
+ 1. Introduction
+
+ Which is worse? A huge fee for a 56K leased line or two phone lines?
+ It's probably the former. If you find yourself craving more bandwidth,
+ and have a ISP that is flexible, it is now possible to bind modems
+ together to work as one point-to-point link to increase your
+ bandwidth. All without having to have a special black box on either
+ side.
+
+
+ The eql driver has only been tested with the Livingston PortMaster-2e
+ terminal server. I do not know if other terminal servers support load-
+ balancing, but I do know that the PortMaster does it, and does it
+ almost as well as the eql driver seems to do it (-- Unfortunately, in
+ my testing so far, the Livingston PortMaster 2e's load-balancing is a
+ good 1 to 2 KB/s slower than the test machine working with a 28.8 Kbps
+ and 14.4 Kbps connection. However, I am not sure that it really is
+ the PortMaster, or if it's Linux's TCP drivers. I'm told that Linux's
+ TCP implementation is pretty fast though.--)
+
+
+ I suggest to ISPs out there that it would probably be fair to charge
+ a load-balancing client 75% of the cost of the second line and 50% of
+ the cost of the third line etc...
+
+
+ Hey, we can all dream you know...
+
+
+ 2. Kernel Configuration
+
+ Here I describe the general steps of getting a kernel up and working
+ with the eql driver. From patching, building, to installing.
+
+
+ 2.1. Patching The Kernel
+
+ If you do not have or cannot get a copy of the kernel with the eql
+ driver folded into it, get your copy of the driver from
+ ftp://slaughter.ncm.com/pub/Linux/LOAD_BALANCING/eql-1.1.tar.gz.
+ Unpack this archive someplace obvious like /usr/local/src/. It will
+ create the following files:
+
+
+
+ ______________________________________________________________________
+ -rw-r--r-- guru/ncm 198 Jan 19 18:53 1995 eql-1.1/NO-WARRANTY
+ -rw-r--r-- guru/ncm 30620 Feb 27 21:40 1995 eql-1.1/eql-1.1.patch
+ -rwxr-xr-x guru/ncm 16111 Jan 12 22:29 1995 eql-1.1/eql_enslave
+ -rw-r--r-- guru/ncm 2195 Jan 10 21:48 1995 eql-1.1/eql_enslave.c
+ ______________________________________________________________________
+
+ Unpack a recent kernel (something after 1.1.92) someplace convenient
+ like say /usr/src/linux-1.1.92.eql. Use symbolic links to point
+ /usr/src/linux to this development directory.
+
+
+ Apply the patch by running the commands:
+
+
+ ______________________________________________________________________
+ cd /usr/src
+ patch </usr/local/src/eql-1.1/eql-1.1.patch
+ ______________________________________________________________________
+
+
+
+
+
+ 2.2. Building The Kernel
+
+ After patching the kernel, run make config and configure the kernel
+ for your hardware.
+
+
+ After configuration, make and install according to your habit.
+
+
+ 3. Network Configuration
+
+ So far, I have only used the eql device with the DSLIP SLIP connection
+ manager by Matt Dillon (-- "The man who sold his soul to code so much
+ so quickly."--) . How you configure it for other "connection"
+ managers is up to you. Most other connection managers that I've seen
+ don't do a very good job when it comes to handling more than one
+ connection.
+
+
+ 3.1. /etc/rc.d/rc.inet1
+
+ In rc.inet1, ifconfig the eql device to the IP address you usually use
+ for your machine, and the MTU you prefer for your SLIP lines. One
+ could argue that MTU should be roughly half the usual size for two
+ modems, one-third for three, one-fourth for four, etc... But going
+ too far below 296 is probably overkill. Here is an example ifconfig
+ command that sets up the eql device:
+
+
+
+ ______________________________________________________________________
+ ifconfig eql 198.67.33.239 mtu 1006
+ ______________________________________________________________________
+
+
+
+
+
+ Once the eql device is up and running, add a static default route to
+ it in the routing table using the cool new route syntax that makes
+ life so much easier:
+
+
+
+ ______________________________________________________________________
+ route add default eql
+ ______________________________________________________________________
+
+
+ 3.2. Enslaving Devices By Hand
+
+ Enslaving devices by hand requires two utility programs: eql_enslave
+ and eql_emancipate (-- eql_emancipate hasn't been written because when
+ an enslaved device "dies", it is automatically taken out of the queue.
+ I haven't found a good reason to write it yet... other than for
+ completeness, but that isn't a good motivator is it?--)
+
+
+ The syntax for enslaving a device is "eql_enslave <master-name>
+ <slave-name> <estimated-bps>". Here are some example enslavings:
+
+
+
+ ______________________________________________________________________
+ eql_enslave eql sl0 28800
+ eql_enslave eql ppp0 14400
+ eql_enslave eql sl1 57600
+ ______________________________________________________________________
+
+
+
+
+
+ When you want to free a device from its life of slavery, you can
+ either down the device with ifconfig (eql will automatically bury the
+ dead slave and remove it from its queue) or use eql_emancipate to free
+ it. (-- Or just ifconfig it down, and the eql driver will take it out
+ for you.--)
+
+
+
+ ______________________________________________________________________
+ eql_emancipate eql sl0
+ eql_emancipate eql ppp0
+ eql_emancipate eql sl1
+ ______________________________________________________________________
+
+
+
+
+
+ 3.3. DSLIP Configuration for the eql Device
+
+ The general idea is to bring up and keep up as many SLIP connections
+ as you need, automatically.
+
+
+ 3.3.1. /etc/slip/runslip.conf
+
+ Here is an example runslip.conf:
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ ______________________________________________________________________
+ name sl-line-1
+ enabled
+ baud 38400
+ mtu 576
+ ducmd -e /etc/slip/dialout/cua2-288.xp -t 9
+ command eql_enslave eql $interface 28800
+ address 198.67.33.239
+ line /dev/cua2
+
+ name sl-line-2
+ enabled
+ baud 38400
+ mtu 576
+ ducmd -e /etc/slip/dialout/cua3-288.xp -t 9
+ command eql_enslave eql $interface 28800
+ address 198.67.33.239
+ line /dev/cua3
+ ______________________________________________________________________
+
+
+
+
+
+ 3.4. Using PPP and the eql Device
+
+ I have not yet done any load-balancing testing for PPP devices, mainly
+ because I don't have a PPP-connection manager like SLIP has with
+ DSLIP. I did find a good tip from LinuxNET:Billy for PPP performance:
+ make sure you have asyncmap set to something so that control
+ characters are not escaped.
+
+
+ I tried to fix up a PPP script/system for redialing lost PPP
+ connections for use with the eql driver the weekend of Feb 25-26 '95
+ (Hereafter known as the 8-hour PPP Hate Festival). Perhaps later this
+ year.
+
+
+ 4. About the Slave Scheduler Algorithm
+
+ The slave scheduler probably could be replaced with a dozen other
+ things and push traffic much faster. The formula in the current set
+ up of the driver was tuned to handle slaves with wildly different
+ bits-per-second "priorities".
+
+
+ All testing I have done was with two 28.8 V.FC modems, one connecting
+ at 28800 bps or slower, and the other connecting at 14400 bps all the
+ time.
+
+
+ One version of the scheduler was able to push 5.3 K/s through the
+ 28800 and 14400 connections, but when the priorities on the links were
+ very wide apart (57600 vs. 14400) the "faster" modem received all
+ traffic and the "slower" modem starved.
+
+
+ 5. Testers' Reports
+
+ Some people have experimented with the eql device with newer
+ kernels (than 1.1.75). I have since updated the driver to patch
+ cleanly in newer kernels because of the removal of the old "slave-
+ balancing" driver config option.
+
+
+ o icee from LinuxNET patched 1.1.86 without any rejects and was able
+ to boot the kernel and enslave a couple of ISDN PPP links.
+
+ 5.1. Randolph Bentson's Test Report
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ From bentson@grieg.seaslug.org Wed Feb 8 19:08:09 1995
+ Date: Tue, 7 Feb 95 22:57 PST
+ From: Randolph Bentson <bentson@grieg.seaslug.org>
+ To: guru@ncm.com
+ Subject: EQL driver tests
+
+
+ I have been checking out your eql driver. (Nice work, that!)
+ Although you may already done this performance testing, here
+ are some data I've discovered.
+
+ Randolph Bentson
+ bentson@grieg.seaslug.org
+
+ ---------------------------------------------------------
+
+
+ A pseudo-device driver, EQL, written by Simon Janes, can be used
+ to bundle multiple SLIP connections into what appears to be a
+ single connection. This allows one to improve dial-up network
+ connectivity gradually, without having to buy expensive DSU/CSU
+ hardware and services.
+
+ I have done some testing of this software, with two goals in
+ mind: first, to ensure it actually works as described and
+ second, as a method of exercising my device driver.
+
+ The following performance measurements were derived from a set
+ of SLIP connections run between two Linux systems (1.1.84) using
+ a 486DX2/66 with a Cyclom-8Ys and a 486SLC/40 with a Cyclom-16Y.
+ (Ports 0,1,2,3 were used. A later configuration will distribute
+ port selection across the different Cirrus chips on the boards.)
+ Once a link was established, I timed a binary ftp transfer of
+ 289284 bytes of data. If there were no overhead (packet headers,
+ inter-character and inter-packet delays, etc.) the transfers
+ would take the following times:
+
+ bits/sec seconds
+ 345600 8.3
+ 234600 12.3
+ 172800 16.7
+ 153600 18.8
+ 76800 37.6
+ 57600 50.2
+ 38400 75.3
+ 28800 100.4
+ 19200 150.6
+ 9600 301.3
+
+ A single line running at the lower speeds and with large packets
+ comes to within 2% of this. Performance is limited for the higher
+ speeds (as predicted by the Cirrus databook) to an aggregate of
+ about 160 kbits/sec. The next round of testing will distribute
+ the load across two or more Cirrus chips.
+
+ The good news is that one gets nearly the full advantage of the
+ second, third, and fourth line's bandwidth. (The bad news is
+ that the connection establishment seemed fragile for the higher
+ speeds. Once established, the connection seemed robust enough.)
+
+ #lines speed mtu seconds theory actual %of
+ kbit/sec duration speed speed max
+ 3 115200 900 _ 345600
+ 3 115200 400 18.1 345600 159825 46
+ 2 115200 900 _ 230400
+ 2 115200 600 18.1 230400 159825 69
+ 2 115200 400 19.3 230400 149888 65
+ 4 57600 900 _ 234600
+ 4 57600 600 _ 234600
+ 4 57600 400 _ 234600
+ 3 57600 600 20.9 172800 138413 80
+ 3 57600 900 21.2 172800 136455 78
+ 3 115200 600 21.7 345600 133311 38
+ 3 57600 400 22.5 172800 128571 74
+ 4 38400 900 25.2 153600 114795 74
+ 4 38400 600 26.4 153600 109577 71
+ 4 38400 400 27.3 153600 105965 68
+ 2 57600 900 29.1 115200 99410.3 86
+ 1 115200 900 30.7 115200 94229.3 81
+ 2 57600 600 30.2 115200 95789.4 83
+ 3 38400 900 30.3 115200 95473.3 82
+ 3 38400 600 31.2 115200 92719.2 80
+ 1 115200 600 31.3 115200 92423 80
+ 2 57600 400 32.3 115200 89561.6 77
+ 1 115200 400 32.8 115200 88196.3 76
+ 3 38400 400 33.5 115200 86353.4 74
+ 2 38400 900 43.7 76800 66197.7 86
+ 2 38400 600 44 76800 65746.4 85
+ 2 38400 400 47.2 76800 61289 79
+ 4 19200 900 50.8 76800 56945.7 74
+ 4 19200 400 53.2 76800 54376.7 70
+ 4 19200 600 53.7 76800 53870.4 70
+ 1 57600 900 54.6 57600 52982.4 91
+ 1 57600 600 56.2 57600 51474 89
+ 3 19200 900 60.5 57600 47815.5 83
+ 1 57600 400 60.2 57600 48053.8 83
+ 3 19200 600 62 57600 46658.7 81
+ 3 19200 400 64.7 57600 44711.6 77
+ 1 38400 900 79.4 38400 36433.8 94
+ 1 38400 600 82.4 38400 35107.3 91
+ 2 19200 900 84.4 38400 34275.4 89
+ 1 38400 400 86.8 38400 33327.6 86
+ 2 19200 600 87.6 38400 33023.3 85
+ 2 19200 400 91.2 38400 31719.7 82
+ 4 9600 900 94.7 38400 30547.4 79
+ 4 9600 400 106 38400 27290.9 71
+ 4 9600 600 110 38400 26298.5 68
+ 3 9600 900 118 28800 24515.6 85
+ 3 9600 600 120 28800 24107 83
+ 3 9600 400 131 28800 22082.7 76
+ 1 19200 900 155 19200 18663.5 97
+ 1 19200 600 161 19200 17968 93
+ 1 19200 400 170 19200 17016.7 88
+ 2 9600 600 176 19200 16436.6 85
+ 2 9600 900 180 19200 16071.3 83
+ 2 9600 400 181 19200 15982.5 83
+ 1 9600 900 305 9600 9484.72 98
+ 1 9600 600 314 9600 9212.87 95
+ 1 9600 400 332 9600 8713.37 90
+
+
+
+
+
+ 5.2. Anthony Healy's Report
+
+
+
+
+
+
+
+ Date: Mon, 13 Feb 1995 16:17:29 +1100 (EST)
+ From: Antony Healey <ahealey@st.nepean.uws.edu.au>
+ To: Simon Janes <guru@ncm.com>
+ Subject: Re: Load Balancing
+
+ Hi Simon,
+ I've installed your patch and it works great. I have trialed
+ it over twin SL/IP lines, just over null modems, but I was
+ able to data at over 48Kb/s [ISDN link -Simon]. I managed a
+ transfer of up to 7.5 Kbyte/s on one go, but averaged around
+ 6.4 Kbyte/s, which I think is pretty cool. :)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/Documentation/networking/fib_trie.txt b/Documentation/networking/fib_trie.txt
new file mode 100644
index 000000000..fe7193885
--- /dev/null
+++ b/Documentation/networking/fib_trie.txt
@@ -0,0 +1,145 @@
+ LC-trie implementation notes.
+
+Node types
+----------
+leaf
+ An end node with data. This has a copy of the relevant key, along
+ with 'hlist' with routing table entries sorted by prefix length.
+ See struct leaf and struct leaf_info.
+
+trie node or tnode
+ An internal node, holding an array of child (leaf or tnode) pointers,
+ indexed through a subset of the key. See Level Compression.
+
+A few concepts explained
+------------------------
+Bits (tnode)
+ The number of bits in the key segment used for indexing into the
+ child array - the "child index". See Level Compression.
+
+Pos (tnode)
+ The position (in the key) of the key segment used for indexing into
+ the child array. See Path Compression.
+
+Path Compression / skipped bits
+ Any given tnode is linked to from the child array of its parent, using
+ a segment of the key specified by the parent's "pos" and "bits"
+ In certain cases, this tnode's own "pos" will not be immediately
+ adjacent to the parent (pos+bits), but there will be some bits
+ in the key skipped over because they represent a single path with no
+ deviations. These "skipped bits" constitute Path Compression.
+ Note that the search algorithm will simply skip over these bits when
+ searching, making it necessary to save the keys in the leaves to
+ verify that they actually do match the key we are searching for.
+
+Level Compression / child arrays
+ the trie is kept level balanced moving, under certain conditions, the
+ children of a full child (see "full_children") up one level, so that
+ instead of a pure binary tree, each internal node ("tnode") may
+ contain an arbitrarily large array of links to several children.
+ Conversely, a tnode with a mostly empty child array (see empty_children)
+ may be "halved", having some of its children moved downwards one level,
+ in order to avoid ever-increasing child arrays.
+
+empty_children
+ the number of positions in the child array of a given tnode that are
+ NULL.
+
+full_children
+ the number of children of a given tnode that aren't path compressed.
+ (in other words, they aren't NULL or leaves and their "pos" is equal
+ to this tnode's "pos"+"bits").
+
+ (The word "full" here is used more in the sense of "complete" than
+ as the opposite of "empty", which might be a tad confusing.)
+
+Comments
+---------
+
+We have tried to keep the structure of the code as close to fib_hash as
+possible to allow verification and help up reviewing.
+
+fib_find_node()
+ A good start for understanding this code. This function implements a
+ straightforward trie lookup.
+
+fib_insert_node()
+ Inserts a new leaf node in the trie. This is bit more complicated than
+ fib_find_node(). Inserting a new node means we might have to run the
+ level compression algorithm on part of the trie.
+
+trie_leaf_remove()
+ Looks up a key, deletes it and runs the level compression algorithm.
+
+trie_rebalance()
+ The key function for the dynamic trie after any change in the trie
+ it is run to optimize and reorganize. It will walk the trie upwards
+ towards the root from a given tnode, doing a resize() at each step
+ to implement level compression.
+
+resize()
+ Analyzes a tnode and optimizes the child array size by either inflating
+ or shrinking it repeatedly until it fulfills the criteria for optimal
+ level compression. This part follows the original paper pretty closely
+ and there may be some room for experimentation here.
+
+inflate()
+ Doubles the size of the child array within a tnode. Used by resize().
+
+halve()
+ Halves the size of the child array within a tnode - the inverse of
+ inflate(). Used by resize();
+
+fn_trie_insert(), fn_trie_delete(), fn_trie_select_default()
+ The route manipulation functions. Should conform pretty closely to the
+ corresponding functions in fib_hash.
+
+fn_trie_flush()
+ This walks the full trie (using nextleaf()) and searches for empty
+ leaves which have to be removed.
+
+fn_trie_dump()
+ Dumps the routing table ordered by prefix length. This is somewhat
+ slower than the corresponding fib_hash function, as we have to walk the
+ entire trie for each prefix length. In comparison, fib_hash is organized
+ as one "zone"/hash per prefix length.
+
+Locking
+-------
+
+fib_lock is used for an RW-lock in the same way that this is done in fib_hash.
+However, the functions are somewhat separated for other possible locking
+scenarios. It might conceivably be possible to run trie_rebalance via RCU
+to avoid read_lock in the fn_trie_lookup() function.
+
+Main lookup mechanism
+---------------------
+fn_trie_lookup() is the main lookup function.
+
+The lookup is in its simplest form just like fib_find_node(). We descend the
+trie, key segment by key segment, until we find a leaf. check_leaf() does
+the fib_semantic_match in the leaf's sorted prefix hlist.
+
+If we find a match, we are done.
+
+If we don't find a match, we enter prefix matching mode. The prefix length,
+starting out at the same as the key length, is reduced one step at a time,
+and we backtrack upwards through the trie trying to find a longest matching
+prefix. The goal is always to reach a leaf and get a positive result from the
+fib_semantic_match mechanism.
+
+Inside each tnode, the search for longest matching prefix consists of searching
+through the child array, chopping off (zeroing) the least significant "1" of
+the child index until we find a match or the child index consists of nothing but
+zeros.
+
+At this point we backtrack (t->stats.backtrack++) up the trie, continuing to
+chop off part of the key in order to find the longest matching prefix.
+
+At this point we will repeatedly descend subtries to look for a match, and there
+are some optimizations available that can provide us with "shortcuts" to avoid
+descending into dead ends. Look for "HL_OPTIMIZE" sections in the code.
+
+To alleviate any doubts about the correctness of the route selection process,
+a new netlink operation has been added. Look for NETLINK_FIB_LOOKUP, which
+gives userland access to fib_lookup().
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
new file mode 100644
index 000000000..135581f01
--- /dev/null
+++ b/Documentation/networking/filter.txt
@@ -0,0 +1,1297 @@
+Linux Socket Filtering aka Berkeley Packet Filter (BPF)
+=======================================================
+
+Introduction
+------------
+
+Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter.
+Though there are some distinct differences between the BSD and Linux
+Kernel filtering, but when we speak of BPF or LSF in Linux context, we
+mean the very same mechanism of filtering in the Linux kernel.
+
+BPF allows a user-space program to attach a filter onto any socket and
+allow or disallow certain types of data to come through the socket. LSF
+follows exactly the same filter code structure as BSD's BPF, so referring
+to the BSD bpf.4 manpage is very helpful in creating filters.
+
+On Linux, BPF is much simpler than on BSD. One does not have to worry
+about devices or anything like that. You simply create your filter code,
+send it to the kernel via the SO_ATTACH_FILTER option and if your filter
+code passes the kernel check on it, you then immediately begin filtering
+data on that socket.
+
+You can also detach filters from your socket via the SO_DETACH_FILTER
+option. This will probably not be used much since when you close a socket
+that has a filter on it the filter is automagically removed. The other
+less common case may be adding a different filter on the same socket where
+you had another filter that is still running: the kernel takes care of
+removing the old one and placing your new one in its place, assuming your
+filter has passed the checks, otherwise if it fails the old filter will
+remain on that socket.
+
+SO_LOCK_FILTER option allows to lock the filter attached to a socket. Once
+set, a filter cannot be removed or changed. This allows one process to
+setup a socket, attach a filter, lock it then drop privileges and be
+assured that the filter will be kept until the socket is closed.
+
+The biggest user of this construct might be libpcap. Issuing a high-level
+filter command like `tcpdump -i em1 port 22` passes through the libpcap
+internal compiler that generates a structure that can eventually be loaded
+via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd`
+displays what is being placed into this structure.
+
+Although we were only speaking about sockets here, BPF in Linux is used
+in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel
+qdisc layer, SECCOMP-BPF (SECure COMPuting [1]), and lots of other places
+such as team driver, PTP code, etc where BPF is being used.
+
+ [1] Documentation/prctl/seccomp_filter.txt
+
+Original BPF paper:
+
+Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new
+architecture for user-level packet capture. In Proceedings of the
+USENIX Winter 1993 Conference Proceedings on USENIX Winter 1993
+Conference Proceedings (USENIX'93). USENIX Association, Berkeley,
+CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf]
+
+Structure
+---------
+
+User space applications include <linux/filter.h> which contains the
+following relevant structures:
+
+struct sock_filter { /* Filter block */
+ __u16 code; /* Actual filter code */
+ __u8 jt; /* Jump true */
+ __u8 jf; /* Jump false */
+ __u32 k; /* Generic multiuse field */
+};
+
+Such a structure is assembled as an array of 4-tuples, that contains
+a code, jt, jf and k value. jt and jf are jump offsets and k a generic
+value to be used for a provided code.
+
+struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
+ unsigned short len; /* Number of filter blocks */
+ struct sock_filter __user *filter;
+};
+
+For socket filtering, a pointer to this structure (as shown in
+follow-up example) is being passed to the kernel through setsockopt(2).
+
+Example
+-------
+
+#include <sys/socket.h>
+#include <sys/types.h>
+#include <arpa/inet.h>
+#include <linux/if_ether.h>
+/* ... */
+
+/* From the example above: tcpdump -i em1 port 22 -dd */
+struct sock_filter code[] = {
+ { 0x28, 0, 0, 0x0000000c },
+ { 0x15, 0, 8, 0x000086dd },
+ { 0x30, 0, 0, 0x00000014 },
+ { 0x15, 2, 0, 0x00000084 },
+ { 0x15, 1, 0, 0x00000006 },
+ { 0x15, 0, 17, 0x00000011 },
+ { 0x28, 0, 0, 0x00000036 },
+ { 0x15, 14, 0, 0x00000016 },
+ { 0x28, 0, 0, 0x00000038 },
+ { 0x15, 12, 13, 0x00000016 },
+ { 0x15, 0, 12, 0x00000800 },
+ { 0x30, 0, 0, 0x00000017 },
+ { 0x15, 2, 0, 0x00000084 },
+ { 0x15, 1, 0, 0x00000006 },
+ { 0x15, 0, 8, 0x00000011 },
+ { 0x28, 0, 0, 0x00000014 },
+ { 0x45, 6, 0, 0x00001fff },
+ { 0xb1, 0, 0, 0x0000000e },
+ { 0x48, 0, 0, 0x0000000e },
+ { 0x15, 2, 0, 0x00000016 },
+ { 0x48, 0, 0, 0x00000010 },
+ { 0x15, 0, 1, 0x00000016 },
+ { 0x06, 0, 0, 0x0000ffff },
+ { 0x06, 0, 0, 0x00000000 },
+};
+
+struct sock_fprog bpf = {
+ .len = ARRAY_SIZE(code),
+ .filter = code,
+};
+
+sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+if (sock < 0)
+ /* ... bail out ... */
+
+ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
+if (ret < 0)
+ /* ... bail out ... */
+
+/* ... */
+close(sock);
+
+The above example code attaches a socket filter for a PF_PACKET socket
+in order to let all IPv4/IPv6 packets with port 22 pass. The rest will
+be dropped for this socket.
+
+The setsockopt(2) call to SO_DETACH_FILTER doesn't need any arguments
+and SO_LOCK_FILTER for preventing the filter to be detached, takes an
+integer value with 0 or 1.
+
+Note that socket filters are not restricted to PF_PACKET sockets only,
+but can also be used on other socket families.
+
+Summary of system calls:
+
+ * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val));
+ * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val));
+ * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER, &val, sizeof(val));
+
+Normally, most use cases for socket filtering on packet sockets will be
+covered by libpcap in high-level syntax, so as an application developer
+you should stick to that. libpcap wraps its own layer around all that.
+
+Unless i) using/linking to libpcap is not an option, ii) the required BPF
+filters use Linux extensions that are not supported by libpcap's compiler,
+iii) a filter might be more complex and not cleanly implementable with
+libpcap's compiler, or iv) particular filter codes should be optimized
+differently than libpcap's internal compiler does; then in such cases
+writing such a filter "by hand" can be of an alternative. For example,
+xt_bpf and cls_bpf users might have requirements that could result in
+more complex filter code, or one that cannot be expressed with libpcap
+(e.g. different return codes for various code paths). Moreover, BPF JIT
+implementors may wish to manually write test cases and thus need low-level
+access to BPF code as well.
+
+BPF engine and instruction set
+------------------------------
+
+Under tools/net/ there's a small helper tool called bpf_asm which can
+be used to write low-level filters for example scenarios mentioned in the
+previous section. Asm-like syntax mentioned here has been implemented in
+bpf_asm and will be used for further explanations (instead of dealing with
+less readable opcodes directly, principles are the same). The syntax is
+closely modelled after Steven McCanne's and Van Jacobson's BPF paper.
+
+The BPF architecture consists of the following basic elements:
+
+ Element Description
+
+ A 32 bit wide accumulator
+ X 32 bit wide X register
+ M[] 16 x 32 bit wide misc registers aka "scratch memory
+ store", addressable from 0 to 15
+
+A program, that is translated by bpf_asm into "opcodes" is an array that
+consists of the following elements (as already mentioned):
+
+ op:16, jt:8, jf:8, k:32
+
+The element op is a 16 bit wide opcode that has a particular instruction
+encoded. jt and jf are two 8 bit wide jump targets, one for condition
+"jump if true", the other one "jump if false". Eventually, element k
+contains a miscellaneous argument that can be interpreted in different
+ways depending on the given instruction in op.
+
+The instruction set consists of load, store, branch, alu, miscellaneous
+and return instructions that are also represented in bpf_asm syntax. This
+table lists all bpf_asm instructions available resp. what their underlying
+opcodes as defined in linux/filter.h stand for:
+
+ Instruction Addressing mode Description
+
+ ld 1, 2, 3, 4, 10 Load word into A
+ ldi 4 Load word into A
+ ldh 1, 2 Load half-word into A
+ ldb 1, 2 Load byte into A
+ ldx 3, 4, 5, 10 Load word into X
+ ldxi 4 Load word into X
+ ldxb 5 Load byte into X
+
+ st 3 Store A into M[]
+ stx 3 Store X into M[]
+
+ jmp 6 Jump to label
+ ja 6 Jump to label
+ jeq 7, 8 Jump on k == A
+ jneq 8 Jump on k != A
+ jne 8 Jump on k != A
+ jlt 8 Jump on k < A
+ jle 8 Jump on k <= A
+ jgt 7, 8 Jump on k > A
+ jge 7, 8 Jump on k >= A
+ jset 7, 8 Jump on k & A
+
+ add 0, 4 A + <x>
+ sub 0, 4 A - <x>
+ mul 0, 4 A * <x>
+ div 0, 4 A / <x>
+ mod 0, 4 A % <x>
+ neg 0, 4 !A
+ and 0, 4 A & <x>
+ or 0, 4 A | <x>
+ xor 0, 4 A ^ <x>
+ lsh 0, 4 A << <x>
+ rsh 0, 4 A >> <x>
+
+ tax Copy A into X
+ txa Copy X into A
+
+ ret 4, 9 Return
+
+The next table shows addressing formats from the 2nd column:
+
+ Addressing mode Syntax Description
+
+ 0 x/%x Register X
+ 1 [k] BHW at byte offset k in the packet
+ 2 [x + k] BHW at the offset X + k in the packet
+ 3 M[k] Word at offset k in M[]
+ 4 #k Literal value stored in k
+ 5 4*([k]&0xf) Lower nibble * 4 at byte offset k in the packet
+ 6 L Jump label L
+ 7 #k,Lt,Lf Jump to Lt if true, otherwise jump to Lf
+ 8 #k,Lt Jump to Lt if predicate is true
+ 9 a/%a Accumulator A
+ 10 extension BPF extension
+
+The Linux kernel also has a couple of BPF extensions that are used along
+with the class of load instructions by "overloading" the k argument with
+a negative offset + a particular extension offset. The result of such BPF
+extensions are loaded into A.
+
+Possible BPF extensions are shown in the following table:
+
+ Extension Description
+
+ len skb->len
+ proto skb->protocol
+ type skb->pkt_type
+ poff Payload start offset
+ ifidx skb->dev->ifindex
+ nla Netlink attribute of type X with offset A
+ nlan Nested Netlink attribute of type X with offset A
+ mark skb->mark
+ queue skb->queue_mapping
+ hatype skb->dev->type
+ rxhash skb->hash
+ cpu raw_smp_processor_id()
+ vlan_tci skb_vlan_tag_get(skb)
+ vlan_avail skb_vlan_tag_present(skb)
+ vlan_tpid skb->vlan_proto
+ rand prandom_u32()
+
+These extensions can also be prefixed with '#'.
+Examples for low-level BPF:
+
+** ARP packets:
+
+ ldh [12]
+ jne #0x806, drop
+ ret #-1
+ drop: ret #0
+
+** IPv4 TCP packets:
+
+ ldh [12]
+ jne #0x800, drop
+ ldb [23]
+ jneq #6, drop
+ ret #-1
+ drop: ret #0
+
+** (Accelerated) VLAN w/ id 10:
+
+ ld vlan_tci
+ jneq #10, drop
+ ret #-1
+ drop: ret #0
+
+** icmp random packet sampling, 1 in 4
+ ldh [12]
+ jne #0x800, drop
+ ldb [23]
+ jneq #1, drop
+ # get a random uint32 number
+ ld rand
+ mod #4
+ jneq #1, drop
+ ret #-1
+ drop: ret #0
+
+** SECCOMP filter example:
+
+ ld [4] /* offsetof(struct seccomp_data, arch) */
+ jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */
+ ld [0] /* offsetof(struct seccomp_data, nr) */
+ jeq #15, good /* __NR_rt_sigreturn */
+ jeq #231, good /* __NR_exit_group */
+ jeq #60, good /* __NR_exit */
+ jeq #0, good /* __NR_read */
+ jeq #1, good /* __NR_write */
+ jeq #5, good /* __NR_fstat */
+ jeq #9, good /* __NR_mmap */
+ jeq #14, good /* __NR_rt_sigprocmask */
+ jeq #13, good /* __NR_rt_sigaction */
+ jeq #35, good /* __NR_nanosleep */
+ bad: ret #0 /* SECCOMP_RET_KILL */
+ good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */
+
+The above example code can be placed into a file (here called "foo"), and
+then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf
+and cls_bpf understands and can directly be loaded with. Example with above
+ARP code:
+
+$ ./bpf_asm foo
+4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0,
+
+In copy and paste C-like output:
+
+$ ./bpf_asm -c foo
+{ 0x28, 0, 0, 0x0000000c },
+{ 0x15, 0, 1, 0x00000806 },
+{ 0x06, 0, 0, 0xffffffff },
+{ 0x06, 0, 0, 0000000000 },
+
+In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF
+filters that might not be obvious at first, it's good to test filters before
+attaching to a live system. For that purpose, there's a small tool called
+bpf_dbg under tools/net/ in the kernel source directory. This debugger allows
+for testing BPF filters against given pcap files, single stepping through the
+BPF code on the pcap's packets and to do BPF machine register dumps.
+
+Starting bpf_dbg is trivial and just requires issuing:
+
+# ./bpf_dbg
+
+In case input and output do not equal stdin/stdout, bpf_dbg takes an
+alternative stdin source as a first argument, and an alternative stdout
+sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`.
+
+Other than that, a particular libreadline configuration can be set via
+file "~/.bpf_dbg_init" and the command history is stored in the file
+"~/.bpf_dbg_history".
+
+Interaction in bpf_dbg happens through a shell that also has auto-completion
+support (follow-up example commands starting with '>' denote bpf_dbg shell).
+The usual workflow would be to ...
+
+> load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0
+ Loads a BPF filter from standard output of bpf_asm, or transformed via
+ e.g. `tcpdump -iem1 -ddd port 22 | tr '\n' ','`. Note that for JIT
+ debugging (next section), this command creates a temporary socket and
+ loads the BPF code into the kernel. Thus, this will also be useful for
+ JIT developers.
+
+> load pcap foo.pcap
+ Loads standard tcpdump pcap file.
+
+> run [<n>]
+bpf passes:1 fails:9
+ Runs through all packets from a pcap to account how many passes and fails
+ the filter will generate. A limit of packets to traverse can be given.
+
+> disassemble
+l0: ldh [12]
+l1: jeq #0x800, l2, l5
+l2: ldb [23]
+l3: jeq #0x1, l4, l5
+l4: ret #0xffff
+l5: ret #0
+ Prints out BPF code disassembly.
+
+> dump
+/* { op, jt, jf, k }, */
+{ 0x28, 0, 0, 0x0000000c },
+{ 0x15, 0, 3, 0x00000800 },
+{ 0x30, 0, 0, 0x00000017 },
+{ 0x15, 0, 1, 0x00000001 },
+{ 0x06, 0, 0, 0x0000ffff },
+{ 0x06, 0, 0, 0000000000 },
+ Prints out C-style BPF code dump.
+
+> breakpoint 0
+breakpoint at: l0: ldh [12]
+> breakpoint 1
+breakpoint at: l1: jeq #0x800, l2, l5
+ ...
+ Sets breakpoints at particular BPF instructions. Issuing a `run` command
+ will walk through the pcap file continuing from the current packet and
+ break when a breakpoint is being hit (another `run` will continue from
+ the currently active breakpoint executing next instructions):
+
+ > run
+ -- register dump --
+ pc: [0] <-- program counter
+ code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction
+ curr: l0: ldh [12] <-- disassembly of current instruction
+ A: [00000000][0] <-- content of A (hex, decimal)
+ X: [00000000][0] <-- content of X (hex, decimal)
+ M[0,15]: [00000000][0] <-- folded content of M (hex, decimal)
+ -- packet dump -- <-- Current packet from pcap (hex)
+ len: 42
+ 0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01
+ 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26
+ 32: 00 00 00 00 00 00 0a 3b 01 01
+ (breakpoint)
+ >
+
+> breakpoint
+breakpoints: 0 1
+ Prints currently set breakpoints.
+
+> step [-<n>, +<n>]
+ Performs single stepping through the BPF program from the current pc
+ offset. Thus, on each step invocation, above register dump is issued.
+ This can go forwards and backwards in time, a plain `step` will break
+ on the next BPF instruction, thus +1. (No `run` needs to be issued here.)
+
+> select <n>
+ Selects a given packet from the pcap file to continue from. Thus, on
+ the next `run` or `step`, the BPF program is being evaluated against
+ the user pre-selected packet. Numbering starts just as in Wireshark
+ with index 1.
+
+> quit
+#
+ Exits bpf_dbg.
+
+JIT compiler
+------------
+
+The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, PowerPC,
+ARM, ARM64, MIPS and s390 and can be enabled through CONFIG_BPF_JIT. The JIT
+compiler is transparently invoked for each attached filter from user space
+or for internal kernel users if it has been previously enabled by root:
+
+ echo 1 > /proc/sys/net/core/bpf_jit_enable
+
+For JIT developers, doing audits etc, each compile run can output the generated
+opcode image into the kernel log via:
+
+ echo 2 > /proc/sys/net/core/bpf_jit_enable
+
+Example output from dmesg:
+
+[ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f
+[ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
+[ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00
+[ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00
+[ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
+[ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3
+
+In the kernel source tree under tools/net/, there's bpf_jit_disasm for
+generating disassembly out of the kernel log's hexdump:
+
+# ./bpf_jit_disasm
+70 bytes emitted from JIT compiler (pass:3, flen:6)
+ffffffffa0069c8f + <x>:
+ 0: push %rbp
+ 1: mov %rsp,%rbp
+ 4: sub $0x60,%rsp
+ 8: mov %rbx,-0x8(%rbp)
+ c: mov 0x68(%rdi),%r9d
+ 10: sub 0x6c(%rdi),%r9d
+ 14: mov 0xd8(%rdi),%r8
+ 1b: mov $0xc,%esi
+ 20: callq 0xffffffffe0ff9442
+ 25: cmp $0x800,%eax
+ 2a: jne 0x0000000000000042
+ 2c: mov $0x17,%esi
+ 31: callq 0xffffffffe0ff945e
+ 36: cmp $0x1,%eax
+ 39: jne 0x0000000000000042
+ 3b: mov $0xffff,%eax
+ 40: jmp 0x0000000000000044
+ 42: xor %eax,%eax
+ 44: leaveq
+ 45: retq
+
+Issuing option `-o` will "annotate" opcodes to resulting assembler
+instructions, which can be very useful for JIT developers:
+
+# ./bpf_jit_disasm -o
+70 bytes emitted from JIT compiler (pass:3, flen:6)
+ffffffffa0069c8f + <x>:
+ 0: push %rbp
+ 55
+ 1: mov %rsp,%rbp
+ 48 89 e5
+ 4: sub $0x60,%rsp
+ 48 83 ec 60
+ 8: mov %rbx,-0x8(%rbp)
+ 48 89 5d f8
+ c: mov 0x68(%rdi),%r9d
+ 44 8b 4f 68
+ 10: sub 0x6c(%rdi),%r9d
+ 44 2b 4f 6c
+ 14: mov 0xd8(%rdi),%r8
+ 4c 8b 87 d8 00 00 00
+ 1b: mov $0xc,%esi
+ be 0c 00 00 00
+ 20: callq 0xffffffffe0ff9442
+ e8 1d 94 ff e0
+ 25: cmp $0x800,%eax
+ 3d 00 08 00 00
+ 2a: jne 0x0000000000000042
+ 75 16
+ 2c: mov $0x17,%esi
+ be 17 00 00 00
+ 31: callq 0xffffffffe0ff945e
+ e8 28 94 ff e0
+ 36: cmp $0x1,%eax
+ 83 f8 01
+ 39: jne 0x0000000000000042
+ 75 07
+ 3b: mov $0xffff,%eax
+ b8 ff ff 00 00
+ 40: jmp 0x0000000000000044
+ eb 02
+ 42: xor %eax,%eax
+ 31 c0
+ 44: leaveq
+ c9
+ 45: retq
+ c3
+
+For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful
+toolchain for developing and testing the kernel's JIT compiler.
+
+BPF kernel internals
+--------------------
+Internally, for the kernel interpreter, a different instruction set
+format with similar underlying principles from BPF described in previous
+paragraphs is being used. However, the instruction set format is modelled
+closer to the underlying architecture to mimic native instruction sets, so
+that a better performance can be achieved (more details later). This new
+ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which
+originates from [e]xtended BPF is not the same as BPF extensions! While
+eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading'
+of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.)
+
+It is designed to be JITed with one to one mapping, which can also open up
+the possibility for GCC/LLVM compilers to generate optimized eBPF code through
+an eBPF backend that performs almost as fast as natively compiled code.
+
+The new instruction set was originally designed with the possible goal in
+mind to write programs in "restricted C" and compile into eBPF with a optional
+GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with
+minimal performance overhead over two steps, that is, C -> eBPF -> native code.
+
+Currently, the new format is being used for running user BPF programs, which
+includes seccomp BPF, classic socket filters, cls_bpf traffic classifier,
+team driver's classifier for its load-balancing mode, netfilter's xt_bpf
+extension, PTP dissector/classifier, and much more. They are all internally
+converted by the kernel into the new instruction set representation and run
+in the eBPF interpreter. For in-kernel handlers, this all works transparently
+by using bpf_prog_create() for setting up the filter, resp.
+bpf_prog_destroy() for destroying it. The macro
+BPF_PROG_RUN(filter, ctx) transparently invokes eBPF interpreter or JITed
+code to run the filter. 'filter' is a pointer to struct bpf_prog that we
+got from bpf_prog_create(), and 'ctx' the given context (e.g.
+skb pointer). All constraints and restrictions from bpf_check_classic() apply
+before a conversion to the new layout is being done behind the scenes!
+
+Currently, the classic BPF format is being used for JITing on most of the
+architectures. Only x86-64 performs JIT compilation from eBPF instruction set,
+however, future work will migrate other JIT compilers as well, so that they
+will profit from the very same benefits.
+
+Some core changes of the new internal format:
+
+- Number of registers increase from 2 to 10:
+
+ The old format had two registers A and X, and a hidden frame pointer. The
+ new layout extends this to be 10 internal registers and a read-only frame
+ pointer. Since 64-bit CPUs are passing arguments to functions via registers
+ the number of args from eBPF program to in-kernel function is restricted
+ to 5 and one register is used to accept return value from an in-kernel
+ function. Natively, x86_64 passes first 6 arguments in registers, aarch64/
+ sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved
+ registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers.
+
+ Therefore, eBPF calling convention is defined as:
+
+ * R0 - return value from in-kernel function, and exit value for eBPF program
+ * R1 - R5 - arguments from eBPF program to in-kernel function
+ * R6 - R9 - callee saved registers that in-kernel function will preserve
+ * R10 - read-only frame pointer to access stack
+
+ Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64,
+ etc, and eBPF calling convention maps directly to ABIs used by the kernel on
+ 64-bit architectures.
+
+ On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
+ and may let more complex programs to be interpreted.
+
+ R0 - R5 are scratch registers and eBPF program needs spill/fill them if
+ necessary across calls. Note that there is only one eBPF program (== one
+ eBPF main routine) and it cannot call other eBPF functions, it can only
+ call predefined in-kernel functions, though.
+
+- Register width increases from 32-bit to 64-bit:
+
+ Still, the semantics of the original 32-bit ALU operations are preserved
+ via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower
+ subregisters that zero-extend into 64-bit if they are being written to.
+ That behavior maps directly to x86_64 and arm64 subregister definition, but
+ makes other JITs more difficult.
+
+ 32-bit architectures run 64-bit internal BPF programs via interpreter.
+ Their JITs may convert BPF programs that only use 32-bit subregisters into
+ native instruction set and let the rest being interpreted.
+
+ Operation is 64-bit, because on 64-bit architectures, pointers are also
+ 64-bit wide, and we want to pass 64-bit values in/out of kernel functions,
+ so 32-bit eBPF registers would otherwise require to define register-pair
+ ABI, thus, there won't be able to use a direct eBPF register to HW register
+ mapping and JIT would need to do combine/split/move operations for every
+ register in and out of the function, which is complex, bug prone and slow.
+ Another reason is the use of atomic 64-bit counters.
+
+- Conditional jt/jf targets replaced with jt/fall-through:
+
+ While the original design has constructs such as "if (cond) jump_true;
+ else jump_false;", they are being replaced into alternative constructs like
+ "if (cond) jump_true; /* else fall-through */".
+
+- Introduces bpf_call insn and register passing convention for zero overhead
+ calls from/to other kernel functions:
+
+ Before an in-kernel function call, the internal BPF program needs to
+ place function arguments into R1 to R5 registers to satisfy calling
+ convention, then the interpreter will take them from registers and pass
+ to in-kernel function. If R1 - R5 registers are mapped to CPU registers
+ that are used for argument passing on given architecture, the JIT compiler
+ doesn't need to emit extra moves. Function arguments will be in the correct
+ registers and BPF_CALL instruction will be JITed as single 'call' HW
+ instruction. This calling convention was picked to cover common call
+ situations without performance penalty.
+
+ After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
+ a return value of the function. Since R6 - R9 are callee saved, their state
+ is preserved across the call.
+
+ For example, consider three C functions:
+
+ u64 f1() { return (*_f2)(1); }
+ u64 f2(u64 a) { return f3(a + 1, a); }
+ u64 f3(u64 a, u64 b) { return a - b; }
+
+ GCC can compile f1, f3 into x86_64:
+
+ f1:
+ movl $1, %edi
+ movq _f2(%rip), %rax
+ jmp *%rax
+ f3:
+ movq %rdi, %rax
+ subq %rsi, %rax
+ ret
+
+ Function f2 in eBPF may look like:
+
+ f2:
+ bpf_mov R2, R1
+ bpf_add R1, 1
+ bpf_call f3
+ bpf_exit
+
+ If f2 is JITed and the pointer stored to '_f2'. The calls f1 -> f2 -> f3 and
+ returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to
+ be used to call into f2.
+
+ For practical reasons all eBPF programs have only one argument 'ctx' which is
+ already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs
+ can call kernel functions with up to 5 arguments. Calls with 6 or more arguments
+ are currently not supported, but these restrictions can be lifted if necessary
+ in the future.
+
+ On 64-bit architectures all register map to HW registers one to one. For
+ example, x86_64 JIT compiler can map them as ...
+
+ R0 - rax
+ R1 - rdi
+ R2 - rsi
+ R3 - rdx
+ R4 - rcx
+ R5 - r8
+ R6 - rbx
+ R7 - r13
+ R8 - r14
+ R9 - r15
+ R10 - rbp
+
+ ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing
+ and rbx, r12 - r15 are callee saved.
+
+ Then the following internal BPF pseudo-program:
+
+ bpf_mov R6, R1 /* save ctx */
+ bpf_mov R2, 2
+ bpf_mov R3, 3
+ bpf_mov R4, 4
+ bpf_mov R5, 5
+ bpf_call foo
+ bpf_mov R7, R0 /* save foo() return value */
+ bpf_mov R1, R6 /* restore ctx for next call */
+ bpf_mov R2, 6
+ bpf_mov R3, 7
+ bpf_mov R4, 8
+ bpf_mov R5, 9
+ bpf_call bar
+ bpf_add R0, R7
+ bpf_exit
+
+ After JIT to x86_64 may look like:
+
+ push %rbp
+ mov %rsp,%rbp
+ sub $0x228,%rsp
+ mov %rbx,-0x228(%rbp)
+ mov %r13,-0x220(%rbp)
+ mov %rdi,%rbx
+ mov $0x2,%esi
+ mov $0x3,%edx
+ mov $0x4,%ecx
+ mov $0x5,%r8d
+ callq foo
+ mov %rax,%r13
+ mov %rbx,%rdi
+ mov $0x2,%esi
+ mov $0x3,%edx
+ mov $0x4,%ecx
+ mov $0x5,%r8d
+ callq bar
+ add %r13,%rax
+ mov -0x228(%rbp),%rbx
+ mov -0x220(%rbp),%r13
+ leaveq
+ retq
+
+ Which is in this example equivalent in C to:
+
+ u64 bpf_filter(u64 ctx)
+ {
+ return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
+ }
+
+ In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
+ arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper
+ registers and place their return value into '%rax' which is R0 in eBPF.
+ Prologue and epilogue are emitted by JIT and are implicit in the
+ interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve
+ them across the calls as defined by calling convention.
+
+ For example the following program is invalid:
+
+ bpf_mov R1, 1
+ bpf_call foo
+ bpf_mov R0, R1
+ bpf_exit
+
+ After the call the registers R1-R5 contain junk values and cannot be read.
+ In the future an eBPF verifier can be used to validate internal BPF programs.
+
+Also in the new design, eBPF is limited to 4096 insns, which means that any
+program will terminate quickly and will only call a fixed number of kernel
+functions. Original BPF and the new format are two operand instructions,
+which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT.
+
+The input context pointer for invoking the interpreter function is generic,
+its content is defined by a specific use case. For seccomp register R1 points
+to seccomp_data, for converted BPF filters R1 points to a skb.
+
+A program, that is translated internally consists of the following elements:
+
+ op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32
+
+So far 87 internal BPF instructions were implemented. 8-bit 'op' opcode field
+has room for new instructions. Some of them may use 16/24/32 byte encoding. New
+instructions must be multiple of 8 bytes to preserve backward compatibility.
+
+Internal BPF is a general purpose RISC instruction set. Not every register and
+every instruction are used during translation from original BPF to new format.
+For example, socket filters are not using 'exclusive add' instruction, but
+tracing filters may do to maintain counters of events, for example. Register R9
+is not used by socket filters either, but more complex filters may be running
+out of registers and would have to resort to spill/fill to stack.
+
+Internal BPF can used as generic assembler for last step performance
+optimizations, socket filters and seccomp are using it as assembler. Tracing
+filters may use it as assembler to generate code from kernel. In kernel usage
+may not be bounded by security considerations, since generated internal BPF code
+may be optimizing internal code path and not being exposed to the user space.
+Safety of internal BPF can come from a verifier (TBD). In such use cases as
+described, it may be used as safe instruction set.
+
+Just like the original BPF, the new format runs within a controlled environment,
+is deterministic and the kernel can easily prove that. The safety of the program
+can be determined in two steps: first step does depth-first-search to disallow
+loops and other CFG validation; second step starts from the first insn and
+descends all possible paths. It simulates execution of every insn and observes
+the state change of registers and stack.
+
+eBPF opcode encoding
+--------------------
+
+eBPF is reusing most of the opcode encoding from classic to simplify conversion
+of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code'
+field is divided into three parts:
+
+ +----------------+--------+--------------------+
+ | 4 bits | 1 bit | 3 bits |
+ | operation code | source | instruction class |
+ +----------------+--------+--------------------+
+ (MSB) (LSB)
+
+Three LSB bits store instruction class which is one of:
+
+ Classic BPF classes: eBPF classes:
+
+ BPF_LD 0x00 BPF_LD 0x00
+ BPF_LDX 0x01 BPF_LDX 0x01
+ BPF_ST 0x02 BPF_ST 0x02
+ BPF_STX 0x03 BPF_STX 0x03
+ BPF_ALU 0x04 BPF_ALU 0x04
+ BPF_JMP 0x05 BPF_JMP 0x05
+ BPF_RET 0x06 [ class 6 unused, for future if needed ]
+ BPF_MISC 0x07 BPF_ALU64 0x07
+
+When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ...
+
+ BPF_K 0x00
+ BPF_X 0x08
+
+ * in classic BPF, this means:
+
+ BPF_SRC(code) == BPF_X - use register X as source operand
+ BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
+
+ * in eBPF, this means:
+
+ BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
+ BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
+
+... and four MSB bits store operation code.
+
+If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of:
+
+ BPF_ADD 0x00
+ BPF_SUB 0x10
+ BPF_MUL 0x20
+ BPF_DIV 0x30
+ BPF_OR 0x40
+ BPF_AND 0x50
+ BPF_LSH 0x60
+ BPF_RSH 0x70
+ BPF_NEG 0x80
+ BPF_MOD 0x90
+ BPF_XOR 0xa0
+ BPF_MOV 0xb0 /* eBPF only: mov reg to reg */
+ BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */
+ BPF_END 0xd0 /* eBPF only: endianness conversion */
+
+If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of:
+
+ BPF_JA 0x00
+ BPF_JEQ 0x10
+ BPF_JGT 0x20
+ BPF_JGE 0x30
+ BPF_JSET 0x40
+ BPF_JNE 0x50 /* eBPF only: jump != */
+ BPF_JSGT 0x60 /* eBPF only: signed '>' */
+ BPF_JSGE 0x70 /* eBPF only: signed '>=' */
+ BPF_CALL 0x80 /* eBPF only: function call */
+ BPF_EXIT 0x90 /* eBPF only: function return */
+
+So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
+and eBPF. There are only two registers in classic BPF, so it means A += X.
+In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
+BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous
+src_reg = (u32) src_reg ^ (u32) imm32 in eBPF.
+
+Classic BPF is using BPF_MISC class to represent A = X and X = A moves.
+eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
+BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean
+exactly the same operations as BPF_ALU, but with 64-bit wide operands
+instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
+dst_reg = dst_reg + src_reg
+
+Classic BPF wastes the whole BPF_RET class to represent a single 'ret'
+operation. Classic BPF_RET | BPF_K means copy imm32 into return register
+and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
+in eBPF means function exit only. The eBPF program needs to store return
+value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is currently
+unused and reserved for future use.
+
+For load and store instructions the 8-bit 'code' field is divided as:
+
+ +--------+--------+-------------------+
+ | 3 bits | 2 bits | 3 bits |
+ | mode | size | instruction class |
+ +--------+--------+-------------------+
+ (MSB) (LSB)
+
+Size modifier is one of ...
+
+ BPF_W 0x00 /* word */
+ BPF_H 0x08 /* half word */
+ BPF_B 0x10 /* byte */
+ BPF_DW 0x18 /* eBPF only, double word */
+
+... which encodes size of load/store operation:
+
+ B - 1 byte
+ H - 2 byte
+ W - 4 byte
+ DW - 8 byte (eBPF only)
+
+Mode modifier is one of:
+
+ BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
+ BPF_ABS 0x20
+ BPF_IND 0x40
+ BPF_MEM 0x60
+ BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */
+ BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */
+ BPF_XADD 0xc0 /* eBPF only, exclusive add */
+
+eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and
+(BPF_IND | <size> | BPF_LD) which are used to access packet data.
+
+They had to be carried over from classic to have strong performance of
+socket filters running in eBPF interpreter. These instructions can only
+be used when interpreter context is a pointer to 'struct sk_buff' and
+have seven implicit operands. Register R6 is an implicit input that must
+contain pointer to sk_buff. Register R0 is an implicit output which contains
+the data fetched from the packet. Registers R1-R5 are scratch registers
+and must not be used to store the data across BPF_ABS | BPF_LD or
+BPF_IND | BPF_LD instructions.
+
+These instructions have implicit program exit condition as well. When
+eBPF program is trying to access the data beyond the packet boundary,
+the interpreter will abort the execution of the program. JIT compilers
+therefore must preserve this property. src_reg and imm32 fields are
+explicit inputs to these instructions.
+
+For example:
+
+ BPF_IND | BPF_W | BPF_LD means:
+
+ R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32))
+ and R1 - R5 were scratched.
+
+Unlike classic BPF instruction set, eBPF has generic load/store operations:
+
+BPF_MEM | <size> | BPF_STX: *(size *) (dst_reg + off) = src_reg
+BPF_MEM | <size> | BPF_ST: *(size *) (dst_reg + off) = imm32
+BPF_MEM | <size> | BPF_LDX: dst_reg = *(size *) (src_reg + off)
+BPF_XADD | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg
+BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
+
+Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
+2 byte atomic increments are not supported.
+
+eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists
+of two consecutive 'struct bpf_insn' 8-byte blocks and interpreted as single
+instruction that loads 64-bit immediate value into a dst_reg.
+Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads
+32-bit immediate value into a register.
+
+eBPF verifier
+-------------
+The safety of the eBPF program is determined in two steps.
+
+First step does DAG check to disallow loops and other CFG validation.
+In particular it will detect programs that have unreachable instructions.
+(though classic BPF checker allows them)
+
+Second step starts from the first insn and descends all possible paths.
+It simulates execution of every insn and observes the state change of
+registers and stack.
+
+At the start of the program the register R1 contains a pointer to context
+and has type PTR_TO_CTX.
+If verifier sees an insn that does R2=R1, then R2 has now type
+PTR_TO_CTX as well and can be used on the right hand side of expression.
+If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=UNKNOWN_VALUE,
+since addition of two valid pointers makes invalid pointer.
+(In 'secure' mode verifier will reject any type of pointer arithmetic to make
+sure that kernel addresses don't leak to unprivileged users)
+
+If register was never written to, it's not readable:
+ bpf_mov R0 = R2
+ bpf_exit
+will be rejected, since R2 is unreadable at the start of the program.
+
+After kernel function call, R1-R5 are reset to unreadable and
+R0 has a return type of the function.
+
+Since R6-R9 are callee saved, their state is preserved across the call.
+ bpf_mov R6 = 1
+ bpf_call foo
+ bpf_mov R0 = R6
+ bpf_exit
+is a correct program. If there was R1 instead of R6, it would have
+been rejected.
+
+load/store instructions are allowed only with registers of valid types, which
+are PTR_TO_CTX, PTR_TO_MAP, FRAME_PTR. They are bounds and alignment checked.
+For example:
+ bpf_mov R1 = 1
+ bpf_mov R2 = 2
+ bpf_xadd *(u32 *)(R1 + 3) += R2
+ bpf_exit
+will be rejected, since R1 doesn't have a valid pointer type at the time of
+execution of instruction bpf_xadd.
+
+At the start R1 type is PTR_TO_CTX (a pointer to generic 'struct bpf_context')
+A callback is used to customize verifier to restrict eBPF program access to only
+certain fields within ctx structure with specified size and alignment.
+
+For example, the following insn:
+ bpf_ld R0 = *(u32 *)(R6 + 8)
+intends to load a word from address R6 + 8 and store it into R0
+If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know
+that offset 8 of size 4 bytes can be accessed for reading, otherwise
+the verifier will reject the program.
+If R6=FRAME_PTR, then access should be aligned and be within
+stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8,
+so it will fail verification, since it's out of bounds.
+
+The verifier will allow eBPF program to read data from stack only after
+it wrote into it.
+Classic BPF verifier does similar check with M[0-15] memory slots.
+For example:
+ bpf_ld R0 = *(u32 *)(R10 - 4)
+ bpf_exit
+is invalid program.
+Though R10 is correct read-only register and has type FRAME_PTR
+and R10 - 4 is within stack bounds, there were no stores into that location.
+
+Pointer register spill/fill is tracked as well, since four (R6-R9)
+callee saved registers may not be enough for some programs.
+
+Allowed function calls are customized with bpf_verifier_ops->get_func_proto()
+The eBPF verifier will check that registers match argument constraints.
+After the call register R0 will be set to return type of the function.
+
+Function calls is a main mechanism to extend functionality of eBPF programs.
+Socket filters may let programs to call one set of functions, whereas tracing
+filters may allow completely different set.
+
+If a function made accessible to eBPF program, it needs to be thought through
+from safety point of view. The verifier will guarantee that the function is
+called with valid arguments.
+
+seccomp vs socket filters have different security restrictions for classic BPF.
+Seccomp solves this by two stage verifier: classic BPF verifier is followed
+by seccomp verifier. In case of eBPF one configurable verifier is shared for
+all use cases.
+
+See details of eBPF verifier in kernel/bpf/verifier.c
+
+eBPF maps
+---------
+'maps' is a generic storage of different types for sharing data between kernel
+and userspace.
+
+The maps are accessed from user space via BPF syscall, which has commands:
+- create a map with given type and attributes
+ map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)
+ using attr->map_type, attr->key_size, attr->value_size, attr->max_entries
+ returns process-local file descriptor or negative error
+
+- lookup key in a given map
+ err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)
+ using attr->map_fd, attr->key, attr->value
+ returns zero and stores found elem into value or negative error
+
+- create or update key/value pair in a given map
+ err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)
+ using attr->map_fd, attr->key, attr->value
+ returns zero or negative error
+
+- find and delete element by key in a given map
+ err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)
+ using attr->map_fd, attr->key
+
+- to delete map: close(fd)
+ Exiting process will delete maps automatically
+
+userspace programs use this syscall to create/access maps that eBPF programs
+are concurrently updating.
+
+maps can have different types: hash, array, bloom filter, radix-tree, etc.
+
+The map is defined by:
+ . type
+ . max number of elements
+ . key size in bytes
+ . value size in bytes
+
+Understanding eBPF verifier messages
+------------------------------------
+
+The following are few examples of invalid eBPF programs and verifier error
+messages as seen in the log:
+
+Program with unreachable instructions:
+static struct bpf_insn prog[] = {
+ BPF_EXIT_INSN(),
+ BPF_EXIT_INSN(),
+};
+Error:
+ unreachable insn 1
+
+Program that reads uninitialized register:
+ BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (bf) r0 = r2
+ R2 !read_ok
+
+Program that doesn't initialize R0 before exiting:
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_1),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (bf) r2 = r1
+ 1: (95) exit
+ R0 !read_ok
+
+Program that accesses stack out of bounds:
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (7a) *(u64 *)(r10 +8) = 0
+ invalid stack off=8 size=8
+
+Program that doesn't initialize stack before passing its address into function:
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (bf) r2 = r10
+ 1: (07) r2 += -8
+ 2: (b7) r1 = 0x0
+ 3: (85) call 1
+ invalid indirect read from stack off -8+0 size 8
+
+Program that uses invalid map_fd=0 while calling to map_lookup_elem() function:
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (7a) *(u64 *)(r10 -8) = 0
+ 1: (bf) r2 = r10
+ 2: (07) r2 += -8
+ 3: (b7) r1 = 0x0
+ 4: (85) call 1
+ fd 0 is not pointing to valid bpf_map
+
+Program that doesn't check return value of map_lookup_elem() before accessing
+map element:
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (7a) *(u64 *)(r10 -8) = 0
+ 1: (bf) r2 = r10
+ 2: (07) r2 += -8
+ 3: (b7) r1 = 0x0
+ 4: (85) call 1
+ 5: (7a) *(u64 *)(r0 +0) = 0
+ R0 invalid mem access 'map_value_or_null'
+
+Program that correctly checks map_lookup_elem() returned value for NULL, but
+accesses the memory with incorrect alignment:
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (7a) *(u64 *)(r10 -8) = 0
+ 1: (bf) r2 = r10
+ 2: (07) r2 += -8
+ 3: (b7) r1 = 1
+ 4: (85) call 1
+ 5: (15) if r0 == 0x0 goto pc+1
+ R0=map_ptr R10=fp
+ 6: (7a) *(u64 *)(r0 +4) = 0
+ misaligned access off 4 size 8
+
+Program that correctly checks map_lookup_elem() returned value for NULL and
+accesses memory with correct alignment in one side of 'if' branch, but fails
+to do so in the other side of 'if' branch:
+ BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+ BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+ BPF_LD_MAP_FD(BPF_REG_1, 0),
+ BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
+ BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
+ BPF_EXIT_INSN(),
+ BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
+ BPF_EXIT_INSN(),
+Error:
+ 0: (7a) *(u64 *)(r10 -8) = 0
+ 1: (bf) r2 = r10
+ 2: (07) r2 += -8
+ 3: (b7) r1 = 1
+ 4: (85) call 1
+ 5: (15) if r0 == 0x0 goto pc+2
+ R0=map_ptr R10=fp
+ 6: (7a) *(u64 *)(r0 +0) = 0
+ 7: (95) exit
+
+ from 5 to 8: R0=imm0 R10=fp
+ 8: (7a) *(u64 *)(r0 +0) = 1
+ R0 invalid mem access 'imm'
+
+Testing
+-------
+
+Next to the BPF toolchain, the kernel also ships a test module that contains
+various test cases for classic and internal BPF that can be executed against
+the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and
+enabled via Kconfig:
+
+ CONFIG_TEST_BPF=m
+
+After the module has been built and installed, the test suite can be executed
+via insmod or modprobe against 'test_bpf' module. Results of the test cases
+including timings in nsec can be found in the kernel log (dmesg).
+
+Misc
+----
+
+Also trinity, the Linux syscall fuzzer, has built-in support for BPF and
+SECCOMP-BPF kernel fuzzing.
+
+Written by
+----------
+
+The document was written in the hope that it is found useful and in order
+to give potential BPF hackers or security auditors a better overview of
+the underlying architecture.
+
+Jay Schulist <jschlst@samba.org>
+Daniel Borkmann <dborkman@redhat.com>
+Alexei Starovoitov <ast@plumgrid.com>
diff --git a/Documentation/networking/fore200e.txt b/Documentation/networking/fore200e.txt
new file mode 100644
index 000000000..7afbdcc8c
--- /dev/null
+++ b/Documentation/networking/fore200e.txt
@@ -0,0 +1,39 @@
+
+FORE Systems PCA-200E/SBA-200E ATM NIC driver
+---------------------------------------------
+
+This driver adds support for the FORE Systems 200E-series ATM adapters
+to the Linux operating system. It is based on the earlier PCA-200E driver
+written by Uwe Dannowski.
+
+The driver simultaneously supports PCA-200E and SBA-200E adapters on
+i386, alpha (untested), powerpc, sparc and sparc64 archs.
+
+The intent is to enable the use of different models of FORE adapters at the
+same time, by hosts that have several bus interfaces (such as PCI+SBUS,
+or PCI+EISA).
+
+Only PCI and SBUS devices are currently supported by the driver, but support
+for other bus interfaces such as EISA should not be too hard to add.
+
+
+Firmware Copyright Notice
+-------------------------
+
+Please read the fore200e_firmware_copyright file present
+in the linux/drivers/atm directory for details and restrictions.
+
+
+Firmware Updates
+----------------
+
+The FORE Systems 200E-series driver is shipped with firmware data being
+uploaded to the ATM adapters at system boot time or at module loading time.
+/*(DEBLOBBED)*/
+
+
+Feedback
+--------
+
+Feedback is welcome. Please send success stories/bug reports/
+patches/improvement/comments/flames to <lizzi@cnam.fr>.
diff --git a/Documentation/networking/framerelay.txt b/Documentation/networking/framerelay.txt
new file mode 100644
index 000000000..1a0b72044
--- /dev/null
+++ b/Documentation/networking/framerelay.txt
@@ -0,0 +1,39 @@
+Frame Relay (FR) support for linux is built into a two tiered system of device
+drivers. The upper layer implements RFC1490 FR specification, and uses the
+Data Link Connection Identifier (DLCI) as its hardware address. Usually these
+are assigned by your network supplier, they give you the number/numbers of
+the Virtual Connections (VC) assigned to you.
+
+Each DLCI is a point-to-point link between your machine and a remote one.
+As such, a separate device is needed to accommodate the routing. Within the
+net-tools archives is 'dlcicfg'. This program will communicate with the
+base "DLCI" device, and create new net devices named 'dlci00', 'dlci01'...
+The configuration script will ask you how many DLCIs you need, as well as
+how many DLCIs you want to assign to each Frame Relay Access Device (FRAD).
+
+The DLCI uses a number of function calls to communicate with the FRAD, all
+of which are stored in the FRAD's private data area. assoc/deassoc,
+activate/deactivate and dlci_config. The DLCI supplies a receive function
+to the FRAD to accept incoming packets.
+
+With this initial offering, only 1 FRAD driver is available. With many thanks
+to Sangoma Technologies, David Mandelstam & Gene Kozin, the S502A, S502E &
+S508 are supported. This driver is currently set up for only FR, but as
+Sangoma makes more firmware modules available, it can be updated to provide
+them as well.
+
+Configuration of the FRAD makes use of another net-tools program, 'fradcfg'.
+This program makes use of a configuration file (which dlcicfg can also read)
+to specify the types of boards to be configured as FRADs, as well as perform
+any board specific configuration. The Sangoma module of fradcfg loads the
+FR firmware into the card, sets the irq/port/memory information, and provides
+an initial configuration.
+
+Additional FRAD device drivers can be added as hardware is available.
+
+At this time, the dlcicfg and fradcfg programs have not been incorporated into
+the net-tools distribution. They can be found at ftp.invlogic.com, in
+/pub/linux. Note that with OS/2 FTPD, you end up in /pub by default, so just
+use 'cd linux'. v0.10 is for use on pre-2.0.3 and earlier, v0.15 is for
+pre-2.0.4 and later.
+
diff --git a/Documentation/networking/gen_stats.txt b/Documentation/networking/gen_stats.txt
new file mode 100644
index 000000000..70e6275b7
--- /dev/null
+++ b/Documentation/networking/gen_stats.txt
@@ -0,0 +1,117 @@
+Generic networking statistics for netlink users
+======================================================================
+
+Statistic counters are grouped into structs:
+
+Struct TLV type Description
+----------------------------------------------------------------------
+gnet_stats_basic TCA_STATS_BASIC Basic statistics
+gnet_stats_rate_est TCA_STATS_RATE_EST Rate estimator
+gnet_stats_queue TCA_STATS_QUEUE Queue statistics
+none TCA_STATS_APP Application specific
+
+
+Collecting:
+-----------
+
+Declare the statistic structs you need:
+struct mystruct {
+ struct gnet_stats_basic bstats;
+ struct gnet_stats_queue qstats;
+ ...
+};
+
+Update statistics:
+mystruct->tstats.packet++;
+mystruct->qstats.backlog += skb->pkt_len;
+
+
+Export to userspace (Dump):
+---------------------------
+
+my_dumping_routine(struct sk_buff *skb, ...)
+{
+ struct gnet_dump dump;
+
+ if (gnet_stats_start_copy(skb, TCA_STATS2, &mystruct->lock, &dump) < 0)
+ goto rtattr_failure;
+
+ if (gnet_stats_copy_basic(&dump, &mystruct->bstats) < 0 ||
+ gnet_stats_copy_queue(&dump, &mystruct->qstats) < 0 ||
+ gnet_stats_copy_app(&dump, &xstats, sizeof(xstats)) < 0)
+ goto rtattr_failure;
+
+ if (gnet_stats_finish_copy(&dump) < 0)
+ goto rtattr_failure;
+ ...
+}
+
+TCA_STATS/TCA_XSTATS backward compatibility:
+--------------------------------------------
+
+Prior users of struct tc_stats and xstats can maintain backward
+compatibility by calling the compat wrappers to keep providing the
+existing TLV types.
+
+my_dumping_routine(struct sk_buff *skb, ...)
+{
+ if (gnet_stats_start_copy_compat(skb, TCA_STATS2, TCA_STATS,
+ TCA_XSTATS, &mystruct->lock, &dump) < 0)
+ goto rtattr_failure;
+ ...
+}
+
+A struct tc_stats will be filled out during gnet_stats_copy_* calls
+and appended to the skb. TCA_XSTATS is provided if gnet_stats_copy_app
+was called.
+
+
+Locking:
+--------
+
+Locks are taken before writing and released once all statistics have
+been written. Locks are always released in case of an error. You
+are responsible for making sure that the lock is initialized.
+
+
+Rate Estimator:
+--------------
+
+0) Prepare an estimator attribute. Most likely this would be in user
+ space. The value of this TLV should contain a tc_estimator structure.
+ As usual, such a TLV needs to be 32 bit aligned and therefore the
+ length needs to be appropriately set, etc. The estimator interval
+ and ewma log need to be converted to the appropriate values.
+ tc_estimator.c::tc_setup_estimator() is advisable to be used as the
+ conversion routine. It does a few clever things. It takes a time
+ interval in microsecs, a time constant also in microsecs and a struct
+ tc_estimator to be populated. The returned tc_estimator can be
+ transported to the kernel. Transfer such a structure in a TLV of type
+ TCA_RATE to your code in the kernel.
+
+In the kernel when setting up:
+1) make sure you have basic stats and rate stats setup first.
+2) make sure you have initialized stats lock that is used to setup such
+ stats.
+3) Now initialize a new estimator:
+
+ int ret = gen_new_estimator(my_basicstats,my_rate_est_stats,
+ mystats_lock, attr_with_tcestimator_struct);
+
+ if ret == 0
+ success
+ else
+ failed
+
+From now on, every time you dump my_rate_est_stats it will contain
+up-to-date info.
+
+Once you are done, call gen_kill_estimator(my_basicstats,
+my_rate_est_stats) Make sure that my_basicstats and my_rate_est_stats
+are still valid (i.e still exist) at the time of making this call.
+
+
+Authors:
+--------
+Thomas Graf <tgraf@suug.ch>
+Jamal Hadi Salim <hadi@cyberus.ca>
diff --git a/Documentation/networking/generic-hdlc.txt b/Documentation/networking/generic-hdlc.txt
new file mode 100644
index 000000000..4eb3cc40b
--- /dev/null
+++ b/Documentation/networking/generic-hdlc.txt
@@ -0,0 +1,132 @@
+Generic HDLC layer
+Krzysztof Halasa <khc@pm.waw.pl>
+
+
+Generic HDLC layer currently supports:
+1. Frame Relay (ANSI, CCITT, Cisco and no LMI)
+ - Normal (routed) and Ethernet-bridged (Ethernet device emulation)
+ interfaces can share a single PVC.
+ - ARP support (no InARP support in the kernel - there is an
+ experimental InARP user-space daemon available on:
+ http://www.kernel.org/pub/linux/utils/net/hdlc/).
+2. raw HDLC - either IP (IPv4) interface or Ethernet device emulation
+3. Cisco HDLC
+4. PPP
+5. X.25 (uses X.25 routines).
+
+Generic HDLC is a protocol driver only - it needs a low-level driver
+for your particular hardware.
+
+Ethernet device emulation (using HDLC or Frame-Relay PVC) is compatible
+with IEEE 802.1Q (VLANs) and 802.1D (Ethernet bridging).
+
+
+Make sure the hdlc.o and the hardware driver are loaded. It should
+create a number of "hdlc" (hdlc0 etc) network devices, one for each
+WAN port. You'll need the "sethdlc" utility, get it from:
+ http://www.kernel.org/pub/linux/utils/net/hdlc/
+
+Compile sethdlc.c utility:
+ gcc -O2 -Wall -o sethdlc sethdlc.c
+Make sure you're using a correct version of sethdlc for your kernel.
+
+Use sethdlc to set physical interface, clock rate, HDLC mode used,
+and add any required PVCs if using Frame Relay.
+Usually you want something like:
+
+ sethdlc hdlc0 clock int rate 128000
+ sethdlc hdlc0 cisco interval 10 timeout 25
+or
+ sethdlc hdlc0 rs232 clock ext
+ sethdlc hdlc0 fr lmi ansi
+ sethdlc hdlc0 create 99
+ ifconfig hdlc0 up
+ ifconfig pvc0 localIP pointopoint remoteIP
+
+In Frame Relay mode, ifconfig master hdlc device up (without assigning
+any IP address to it) before using pvc devices.
+
+
+Setting interface:
+
+* v35 | rs232 | x21 | t1 | e1 - sets physical interface for a given port
+ if the card has software-selectable interfaces
+ loopback - activate hardware loopback (for testing only)
+* clock ext - both RX clock and TX clock external
+* clock int - both RX clock and TX clock internal
+* clock txint - RX clock external, TX clock internal
+* clock txfromrx - RX clock external, TX clock derived from RX clock
+* rate - sets clock rate in bps (for "int" or "txint" clock only)
+
+
+Setting protocol:
+
+* hdlc - sets raw HDLC (IP-only) mode
+ nrz / nrzi / fm-mark / fm-space / manchester - sets transmission code
+ no-parity / crc16 / crc16-pr0 (CRC16 with preset zeros) / crc32-itu
+ crc16-itu (CRC16 with ITU-T polynomial) / crc16-itu-pr0 - sets parity
+
+* hdlc-eth - Ethernet device emulation using HDLC. Parity and encoding
+ as above.
+
+* cisco - sets Cisco HDLC mode (IP, IPv6 and IPX supported)
+ interval - time in seconds between keepalive packets
+ timeout - time in seconds after last received keepalive packet before
+ we assume the link is down
+
+* ppp - sets synchronous PPP mode
+
+* x25 - sets X.25 mode
+
+* fr - Frame Relay mode
+ lmi ansi / ccitt / cisco / none - LMI (link management) type
+ dce - Frame Relay DCE (network) side LMI instead of default DTE (user).
+ It has nothing to do with clocks!
+ t391 - link integrity verification polling timer (in seconds) - user
+ t392 - polling verification timer (in seconds) - network
+ n391 - full status polling counter - user
+ n392 - error threshold - both user and network
+ n393 - monitored events count - both user and network
+
+Frame-Relay only:
+* create n | delete n - adds / deletes PVC interface with DLCI #n.
+ Newly created interface will be named pvc0, pvc1 etc.
+
+* create ether n | delete ether n - adds a device for Ethernet-bridged
+ frames. The device will be named pvceth0, pvceth1 etc.
+
+
+
+
+Board-specific issues
+---------------------
+
+n2.o and c101.o need parameters to work:
+
+ insmod n2 hw=io,irq,ram,ports[:io,irq,...]
+example:
+ insmod n2 hw=0x300,10,0xD0000,01
+
+or
+ insmod c101 hw=irq,ram[:irq,...]
+example:
+ insmod c101 hw=9,0xdc000
+
+If built into the kernel, these drivers need kernel (command line) parameters:
+ n2.hw=io,irq,ram,ports:...
+or
+ c101.hw=irq,ram:...
+
+
+
+If you have a problem with N2, C101 or PLX200SYN card, you can issue the
+"private" command to see port's packet descriptor rings (in kernel logs):
+
+ sethdlc hdlc0 private
+
+The hardware driver has to be build with #define DEBUG_RINGS.
+Attaching this info to bug reports would be helpful. Anyway, let me know
+if you have problems using this.
+
+For patches and other info look at:
+<http://www.kernel.org/pub/linux/utils/net/hdlc/>.
diff --git a/Documentation/networking/generic_netlink.txt b/Documentation/networking/generic_netlink.txt
new file mode 100644
index 000000000..3e071115c
--- /dev/null
+++ b/Documentation/networking/generic_netlink.txt
@@ -0,0 +1,3 @@
+A wiki document on how to use Generic Netlink can be found here:
+
+ * http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto
diff --git a/Documentation/networking/gianfar.txt b/Documentation/networking/gianfar.txt
new file mode 100644
index 000000000..ba1daea7f
--- /dev/null
+++ b/Documentation/networking/gianfar.txt
@@ -0,0 +1,42 @@
+The Gianfar Ethernet Driver
+
+Author: Andy Fleming <afleming@freescale.com>
+Updated: 2005-07-28
+
+
+CHECKSUM OFFLOADING
+
+The eTSEC controller (first included in parts from late 2005 like
+the 8548) has the ability to perform TCP, UDP, and IP checksums
+in hardware. The Linux kernel only offloads the TCP and UDP
+checksums (and always performs the pseudo header checksums), so
+the driver only supports checksumming for TCP/IP and UDP/IP
+packets. Use ethtool to enable or disable this feature for RX
+and TX.
+
+VLAN
+
+In order to use VLAN, please consult Linux documentation on
+configuring VLANs. The gianfar driver supports hardware insertion and
+extraction of VLAN headers, but not filtering. Filtering will be
+done by the kernel.
+
+MULTICASTING
+
+The gianfar driver supports using the group hash table on the
+TSEC (and the extended hash table on the eTSEC) for multicast
+filtering. On the eTSEC, the exact-match MAC registers are used
+before the hash tables. See Linux documentation on how to join
+multicast groups.
+
+PADDING
+
+The gianfar driver supports padding received frames with 2 bytes
+to align the IP header to a 16-byte boundary, when supported by
+hardware.
+
+ETHTOOL
+
+The gianfar driver supports the use of ethtool for many
+configuration options. You must run ethtool only on currently
+open interfaces. See ethtool documentation for details.
diff --git a/Documentation/networking/i40e.txt b/Documentation/networking/i40e.txt
new file mode 100644
index 000000000..a251bf4fe
--- /dev/null
+++ b/Documentation/networking/i40e.txt
@@ -0,0 +1,118 @@
+Linux Base Driver for the Intel(R) Ethernet Controller XL710 Family
+===================================================================
+
+Intel i40e Linux driver.
+Copyright(c) 2013 Intel Corporation.
+
+Contents
+========
+
+- Identifying Your Adapter
+- Additional Configurations
+- Performance Tuning
+- Known Issues
+- Support
+
+
+Identifying Your Adapter
+========================
+
+The driver in this release is compatible with the Intel Ethernet
+Controller XL710 Family.
+
+For more information on how to identify your adapter, go to the Adapter &
+Driver ID Guide at:
+
+ http://support.intel.com/support/network/sb/CS-012904.htm
+
+
+Enabling the driver
+===================
+
+The driver is enabled via the standard kernel configuration system,
+using the make command:
+
+ Make oldconfig/silentoldconfig/menuconfig/etc.
+
+The driver is located in the menu structure at:
+
+ -> Device Drivers
+ -> Network device support (NETDEVICES [=y])
+ -> Ethernet driver support
+ -> Intel devices
+ -> Intel(R) Ethernet Controller XL710 Family
+
+Additional Configurations
+=========================
+
+ Generic Receive Offload (GRO)
+ -----------------------------
+ The driver supports the in-kernel software implementation of GRO. GRO has
+ shown that by coalescing Rx traffic into larger chunks of data, CPU
+ utilization can be significantly reduced when under large Rx load. GRO is
+ an evolution of the previously-used LRO interface. GRO is able to coalesce
+ other protocols besides TCP. It's also safe to use with configurations that
+ are problematic for LRO, namely bridging and iSCSI.
+
+ Ethtool
+ -------
+ The driver utilizes the ethtool interface for driver configuration and
+ diagnostics, as well as displaying statistical information. The latest
+ ethtool version is required for this functionality.
+
+ The latest release of ethtool can be found from
+ https://www.kernel.org/pub/software/network/ethtool
+
+ Data Center Bridging (DCB)
+ --------------------------
+ DCB configuration is not currently supported.
+
+ FCoE
+ ----
+ The driver supports Fiber Channel over Ethernet (FCoE) and Data Center
+ Bridging (DCB) functionality. Configuring DCB and FCoE is outside the scope
+ of this driver doc. Refer to http://www.open-fcoe.org/ for FCoE project
+ information and http://www.open-lldp.org/ or email list
+ e1000-eedc@lists.sourceforge.net for DCB information.
+
+ MAC and VLAN anti-spoofing feature
+ ----------------------------------
+ When a malicious driver attempts to send a spoofed packet, it is dropped by
+ the hardware and not transmitted. An interrupt is sent to the PF driver
+ notifying it of the spoof attempt.
+
+ When a spoofed packet is detected the PF driver will send the following
+ message to the system log (displayed by the "dmesg" command):
+
+ Spoof event(s) detected on VF (n)
+
+ Where n=the VF that attempted to do the spoofing.
+
+
+Performance Tuning
+==================
+
+An excellent article on performance tuning can be found at:
+
+http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf
+
+
+Known Issues
+============
+
+
+Support
+=======
+
+For general information, go to the Intel support website at:
+
+ http://support.intel.com
+
+or the Intel Wired Networking project hosted by Sourceforge at:
+
+ http://e1000.sourceforge.net
+
+If an issue is identified with the released source code on the supported
+kernel with a supported adapter, email the specific information related
+to the issue to e1000-devel@lists.sourceforge.net and copy
+netdev@vger.kernel.org.
diff --git a/Documentation/networking/i40evf.txt b/Documentation/networking/i40evf.txt
new file mode 100644
index 000000000..21e41271a
--- /dev/null
+++ b/Documentation/networking/i40evf.txt
@@ -0,0 +1,47 @@
+Linux* Base Driver for Intel(R) Network Connection
+==================================================
+
+Intel XL710 X710 Virtual Function Linux driver.
+Copyright(c) 2013 Intel Corporation.
+
+Contents
+========
+
+- Identifying Your Adapter
+- Known Issues/Troubleshooting
+- Support
+
+This file describes the i40evf Linux* Base Driver for the Intel(R) XL710
+X710 Virtual Function.
+
+The i40evf driver supports XL710 and X710 virtual function devices that
+can only be activated on kernels with CONFIG_PCI_IOV enabled.
+
+The guest OS loading the i40evf driver must support MSI-X interrupts.
+
+Identifying Your Adapter
+========================
+
+For more information on how to identify your adapter, go to the Adapter &
+Driver ID Guide at:
+
+ http://support.intel.com/support/go/network/adapter/idguide.htm
+
+Known Issues/Troubleshooting
+============================
+
+
+Support
+=======
+
+For general information, go to the Intel support website at:
+
+ http://support.intel.com
+
+or the Intel Wired Networking project hosted by Sourceforge at:
+
+ http://sourceforge.net/projects/e1000
+
+If an issue is identified with the released source code on the supported
+kernel with a supported adapter, email the specific information related
+to the issue to e1000-devel@lists.sf.net
diff --git a/Documentation/networking/ieee802154.txt b/Documentation/networking/ieee802154.txt
new file mode 100644
index 000000000..22bbc7225
--- /dev/null
+++ b/Documentation/networking/ieee802154.txt
@@ -0,0 +1,151 @@
+
+ Linux IEEE 802.15.4 implementation
+
+
+Introduction
+============
+The IEEE 802.15.4 working group focuses on standardization of bottom
+two layers: Medium Access Control (MAC) and Physical (PHY). And there
+are mainly two options available for upper layers:
+ - ZigBee - proprietary protocol from ZigBee Alliance
+ - 6LowPAN - IPv6 networking over low rate personal area networks
+
+The Linux-ZigBee project goal is to provide complete implementation
+of IEEE 802.15.4 and 6LoWPAN protocols. IEEE 802.15.4 is a stack
+of protocols for organizing Low-Rate Wireless Personal Area Networks.
+
+The stack is composed of three main parts:
+ - IEEE 802.15.4 layer; We have chosen to use plain Berkeley socket API,
+ the generic Linux networking stack to transfer IEEE 802.15.4 messages
+ and a special protocol over genetlink for configuration/management
+ - MAC - provides access to shared channel and reliable data delivery
+ - PHY - represents device drivers
+
+
+Socket API
+==========
+
+int sd = socket(PF_IEEE802154, SOCK_DGRAM, 0);
+.....
+
+The address family, socket addresses etc. are defined in the
+include/net/af_ieee802154.h header or in the special header
+in our userspace package (see either linux-zigbee sourceforge download page
+or git tree at git://linux-zigbee.git.sourceforge.net/gitroot/linux-zigbee).
+
+One can use SOCK_RAW for passing raw data towards device xmit function. YMMV.
+
+
+Kernel side
+=============
+
+Like with WiFi, there are several types of devices implementing IEEE 802.15.4.
+1) 'HardMAC'. The MAC layer is implemented in the device itself, the device
+ exports MLME and data API.
+2) 'SoftMAC' or just radio. These types of devices are just radio transceivers
+ possibly with some kinds of acceleration like automatic CRC computation and
+ comparation, automagic ACK handling, address matching, etc.
+
+Those types of devices require different approach to be hooked into Linux kernel.
+
+
+MLME - MAC Level Management
+============================
+
+Most of IEEE 802.15.4 MLME interfaces are directly mapped on netlink commands.
+See the include/net/nl802154.h header. Our userspace tools package
+(see above) provides CLI configuration utility for radio interfaces and simple
+coordinator for IEEE 802.15.4 networks as an example users of MLME protocol.
+
+
+HardMAC
+=======
+
+See the header include/net/ieee802154_netdev.h. You have to implement Linux
+net_device, with .type = ARPHRD_IEEE802154. Data is exchanged with socket family
+code via plain sk_buffs. On skb reception skb->cb must contain additional
+info as described in the struct ieee802154_mac_cb. During packet transmission
+the skb->cb is used to provide additional data to device's header_ops->create
+function. Be aware that this data can be overridden later (when socket code
+submits skb to qdisc), so if you need something from that cb later, you should
+store info in the skb->data on your own.
+
+To hook the MLME interface you have to populate the ml_priv field of your
+net_device with a pointer to struct ieee802154_mlme_ops instance. The fields
+assoc_req, assoc_resp, disassoc_req, start_req, and scan_req are optional.
+All other fields are required.
+
+We provide an example of simple HardMAC driver at drivers/ieee802154/fakehard.c
+
+
+SoftMAC
+=======
+
+The MAC is the middle layer in the IEEE 802.15.4 Linux stack. This moment it
+provides interface for drivers registration and management of slave interfaces.
+
+NOTE: Currently the only monitor device type is supported - it's IEEE 802.15.4
+stack interface for network sniffers (e.g. WireShark).
+
+This layer is going to be extended soon.
+
+See header include/net/mac802154.h and several drivers in drivers/ieee802154/.
+
+
+Device drivers API
+==================
+
+The include/net/mac802154.h defines following functions:
+ - struct ieee802154_dev *ieee802154_alloc_device
+ (size_t priv_size, struct ieee802154_ops *ops):
+ allocation of IEEE 802.15.4 compatible device
+
+ - void ieee802154_free_device(struct ieee802154_dev *dev):
+ freeing allocated device
+
+ - int ieee802154_register_device(struct ieee802154_dev *dev):
+ register PHY in the system
+
+ - void ieee802154_unregister_device(struct ieee802154_dev *dev):
+ freeing registered PHY
+
+Moreover IEEE 802.15.4 device operations structure should be filled.
+
+Fake drivers
+============
+
+In addition there are two drivers available which simulate real devices with
+HardMAC (fakehard) and SoftMAC (fakelb - IEEE 802.15.4 loopback driver)
+interfaces. This option provides possibility to test and debug stack without
+usage of real hardware.
+
+See sources in drivers/ieee802154 folder for more details.
+
+
+6LoWPAN Linux implementation
+============================
+
+The IEEE 802.15.4 standard specifies an MTU of 128 bytes, yielding about 80
+octets of actual MAC payload once security is turned on, on a wireless link
+with a link throughput of 250 kbps or less. The 6LoWPAN adaptation format
+[RFC4944] was specified to carry IPv6 datagrams over such constrained links,
+taking into account limited bandwidth, memory, or energy resources that are
+expected in applications such as wireless Sensor Networks. [RFC4944] defines
+a Mesh Addressing header to support sub-IP forwarding, a Fragmentation header
+to support the IPv6 minimum MTU requirement [RFC2460], and stateless header
+compression for IPv6 datagrams (LOWPAN_HC1 and LOWPAN_HC2) to reduce the
+relatively large IPv6 and UDP headers down to (in the best case) several bytes.
+
+In Semptember 2011 the standard update was published - [RFC6282].
+It deprecates HC1 and HC2 compression and defines IPHC encoding format which is
+used in this Linux implementation.
+
+All the code related to 6lowpan you may find in files: net/ieee802154/6lowpan.*
+
+To setup 6lowpan interface you need (busybox release > 1.17.0):
+1. Add IEEE802.15.4 interface and initialize PANid;
+2. Add 6lowpan interface by command like:
+ # ip link add link wpan0 name lowpan0 type lowpan
+3. Set MAC (if needs):
+ # ip link set lowpan0 address de:ad:be:ef:ca:fe:ba:be
+4. Bring up 'lowpan0' interface
diff --git a/Documentation/networking/igb.txt b/Documentation/networking/igb.txt
new file mode 100644
index 000000000..15534fdd0
--- /dev/null
+++ b/Documentation/networking/igb.txt
@@ -0,0 +1,129 @@
+Linux* Base Driver for Intel(R) Ethernet Network Connection
+===========================================================
+
+Intel Gigabit Linux driver.
+Copyright(c) 1999 - 2013 Intel Corporation.
+
+Contents
+========
+
+- Identifying Your Adapter
+- Additional Configurations
+- Support
+
+Identifying Your Adapter
+========================
+
+This driver supports all 82575, 82576 and 82580-based Intel (R) gigabit network
+connections.
+
+For specific information on how to identify your adapter, go to the Adapter &
+Driver ID Guide at:
+
+ http://support.intel.com/support/go/network/adapter/idguide.htm
+
+Command Line Parameters
+=======================
+
+The default value for each parameter is generally the recommended setting,
+unless otherwise noted.
+
+max_vfs
+-------
+Valid Range: 0-7
+Default Value: 0
+
+This parameter adds support for SR-IOV. It causes the driver to spawn up to
+max_vfs worth of virtual function.
+
+Additional Configurations
+=========================
+
+ Jumbo Frames
+ ------------
+ Jumbo Frames support is enabled by changing the MTU to a value larger than
+ the default of 1500. Use the ip command to increase the MTU size.
+ For example:
+
+ ip link set dev eth<x> mtu 9000
+
+ This setting is not saved across reboots.
+
+ Notes:
+
+ - The maximum MTU setting for Jumbo Frames is 9216. This value coincides
+ with the maximum Jumbo Frames size of 9234 bytes.
+
+ - Using Jumbo frames at 10 or 100 Mbps is not supported and may result in
+ poor performance or loss of link.
+
+ ethtool
+ -------
+ The driver utilizes the ethtool interface for driver configuration and
+ diagnostics, as well as displaying statistical information. The latest
+ version of ethtool can be found at:
+
+ http://ftp.kernel.org/pub/software/network/ethtool/
+
+ Enabling Wake on LAN* (WoL)
+ ---------------------------
+ WoL is configured through the ethtool* utility.
+
+ For instructions on enabling WoL with ethtool, refer to the ethtool man page.
+
+ WoL will be enabled on the system during the next shut down or reboot.
+ For this driver version, in order to enable WoL, the igb driver must be
+ loaded when shutting down or rebooting the system.
+
+ Wake On LAN is only supported on port A of multi-port adapters.
+
+ Wake On LAN is not supported for the Intel(R) Gigabit VT Quad Port Server
+ Adapter.
+
+ Multiqueue
+ ----------
+ In this mode, a separate MSI-X vector is allocated for each queue and one
+ for "other" interrupts such as link status change and errors. All
+ interrupts are throttled via interrupt moderation. Interrupt moderation
+ must be used to avoid interrupt storms while the driver is processing one
+ interrupt. The moderation value should be at least as large as the expected
+ time for the driver to process an interrupt. Multiqueue is off by default.
+
+ REQUIREMENTS: MSI-X support is required for Multiqueue. If MSI-X is not
+ found, the system will fallback to MSI or to Legacy interrupts.
+
+ MAC and VLAN anti-spoofing feature
+ ----------------------------------
+ When a malicious driver attempts to send a spoofed packet, it is dropped by
+ the hardware and not transmitted. An interrupt is sent to the PF driver
+ notifying it of the spoof attempt.
+
+ When a spoofed packet is detected the PF driver will send the following
+ message to the system log (displayed by the "dmesg" command):
+
+ Spoof event(s) detected on VF(n)
+
+ Where n=the VF that attempted to do the spoofing.
+
+ Setting MAC Address, VLAN and Rate Limit Using IProute2 Tool
+ ------------------------------------------------------------
+ You can set a MAC address of a Virtual Function (VF), a default VLAN and the
+ rate limit using the IProute2 tool. Download the latest version of the
+ iproute2 tool from Sourceforge if your version does not have all the
+ features you require.
+
+
+Support
+=======
+
+For general information, go to the Intel support website at:
+
+ www.intel.com/support/
+
+or the Intel Wired Networking project hosted by Sourceforge at:
+
+ http://sourceforge.net/projects/e1000
+
+If an issue is identified with the released source code on the supported
+kernel with a supported adapter, email the specific information related
+to the issue to e1000-devel@lists.sf.net
diff --git a/Documentation/networking/igbvf.txt b/Documentation/networking/igbvf.txt
new file mode 100644
index 000000000..40db17a66
--- /dev/null
+++ b/Documentation/networking/igbvf.txt
@@ -0,0 +1,80 @@
+Linux* Base Driver for Intel(R) Ethernet Network Connection
+===========================================================
+
+Intel Gigabit Linux driver.
+Copyright(c) 1999 - 2013 Intel Corporation.
+
+Contents
+========
+
+- Identifying Your Adapter
+- Additional Configurations
+- Support
+
+This file describes the igbvf Linux* Base Driver for Intel Network Connection.
+
+The igbvf driver supports 82576-based virtual function devices that can only
+be activated on kernels that support SR-IOV. SR-IOV requires the correct
+platform and OS support.
+
+The igbvf driver requires the igb driver, version 2.0 or later. The igbvf
+driver supports virtual functions generated by the igb driver with a max_vfs
+value of 1 or greater. For more information on the max_vfs parameter refer
+to the README included with the igb driver.
+
+The guest OS loading the igbvf driver must support MSI-X interrupts.
+
+This driver is only supported as a loadable module at this time. Intel is
+not supplying patches against the kernel source to allow for static linking
+of the driver. For questions related to hardware requirements, refer to the
+documentation supplied with your Intel Gigabit adapter. All hardware
+requirements listed apply to use with Linux.
+
+Instructions on updating ethtool can be found in the section "Additional
+Configurations" later in this document.
+
+VLANs: There is a limit of a total of 32 shared VLANs to 1 or more VFs.
+
+Identifying Your Adapter
+========================
+
+The igbvf driver supports 82576-based virtual function devices that can only
+be activated on kernels that support SR-IOV.
+
+For more information on how to identify your adapter, go to the Adapter &
+Driver ID Guide at:
+
+ http://support.intel.com/support/go/network/adapter/idguide.htm
+
+For the latest Intel network drivers for Linux, refer to the following
+website. In the search field, enter your adapter name or type, or use the
+networking link on the left to search for your adapter:
+
+ http://downloadcenter.intel.com/scripts-df-external/Support_Intel.aspx
+
+Additional Configurations
+=========================
+
+ ethtool
+ -------
+ The driver utilizes the ethtool interface for driver configuration and
+ diagnostics, as well as displaying statistical information. The ethtool
+ version 3.0 or later is required for this functionality, although we
+ strongly recommend downloading the latest version at:
+
+ http://ftp.kernel.org/pub/software/network/ethtool/
+
+Support
+=======
+
+For general information, go to the Intel support website at:
+
+ http://support.intel.com
+
+or the Intel Wired Networking project hosted by Sourceforge at:
+
+ http://sourceforge.net/projects/e1000
+
+If an issue is identified with the released source code on the supported
+kernel with a supported adapter, email the specific information related
+to the issue to e1000-devel@lists.sf.net
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
new file mode 100644
index 000000000..071fb18dc
--- /dev/null
+++ b/Documentation/networking/ip-sysctl.txt
@@ -0,0 +1,1858 @@
+/proc/sys/net/ipv4/* Variables:
+
+ip_forward - BOOLEAN
+ 0 - disabled (default)
+ not 0 - enabled
+
+ Forward Packets between interfaces.
+
+ This variable is special, its change resets all configuration
+ parameters to their default state (RFC1122 for hosts, RFC1812
+ for routers)
+
+ip_default_ttl - INTEGER
+ Default value of TTL field (Time To Live) for outgoing (but not
+ forwarded) IP packets. Should be between 1 and 255 inclusive.
+ Default: 64 (as recommended by RFC1700)
+
+ip_no_pmtu_disc - INTEGER
+ Disable Path MTU Discovery. If enabled in mode 1 and a
+ fragmentation-required ICMP is received, the PMTU to this
+ destination will be set to min_pmtu (see below). You will need
+ to raise min_pmtu to the smallest interface MTU on your system
+ manually if you want to avoid locally generated fragments.
+
+ In mode 2 incoming Path MTU Discovery messages will be
+ discarded. Outgoing frames are handled the same as in mode 1,
+ implicitly setting IP_PMTUDISC_DONT on every created socket.
+
+ Mode 3 is a hardend pmtu discover mode. The kernel will only
+ accept fragmentation-needed errors if the underlying protocol
+ can verify them besides a plain socket lookup. Current
+ protocols for which pmtu events will be honored are TCP, SCTP
+ and DCCP as they verify e.g. the sequence number or the
+ association. This mode should not be enabled globally but is
+ only intended to secure e.g. name servers in namespaces where
+ TCP path mtu must still work but path MTU information of other
+ protocols should be discarded. If enabled globally this mode
+ could break other protocols.
+
+ Possible values: 0-3
+ Default: FALSE
+
+min_pmtu - INTEGER
+ default 552 - minimum discovered Path MTU
+
+ip_forward_use_pmtu - BOOLEAN
+ By default we don't trust protocol path MTUs while forwarding
+ because they could be easily forged and can lead to unwanted
+ fragmentation by the router.
+ You only need to enable this if you have user-space software
+ which tries to discover path mtus by itself and depends on the
+ kernel honoring this information. This is normally not the
+ case.
+ Default: 0 (disabled)
+ Possible values:
+ 0 - disabled
+ 1 - enabled
+
+fwmark_reflect - BOOLEAN
+ Controls the fwmark of kernel-generated IPv4 reply packets that are not
+ associated with a socket for example, TCP RSTs or ICMP echo replies).
+ If unset, these packets have a fwmark of zero. If set, they have the
+ fwmark of the packet they are replying to.
+ Default: 0
+
+route/max_size - INTEGER
+ Maximum number of routes allowed in the kernel. Increase
+ this when using large numbers of interfaces and/or routes.
+ From linux kernel 3.6 onwards, this is deprecated for ipv4
+ as route cache is no longer used.
+
+neigh/default/gc_thresh1 - INTEGER
+ Minimum number of entries to keep. Garbage collector will not
+ purge entries if there are fewer than this number.
+ Default: 128
+
+neigh/default/gc_thresh2 - INTEGER
+ Threshold when garbage collector becomes more aggressive about
+ purging entries. Entries older than 5 seconds will be cleared
+ when over this number.
+ Default: 512
+
+neigh/default/gc_thresh3 - INTEGER
+ Maximum number of neighbor entries allowed. Increase this
+ when using large numbers of interfaces and when communicating
+ with large numbers of directly-connected peers.
+ Default: 1024
+
+neigh/default/unres_qlen_bytes - INTEGER
+ The maximum number of bytes which may be used by packets
+ queued for each unresolved address by other network layers.
+ (added in linux 3.3)
+ Setting negative value is meaningless and will return error.
+ Default: 65536 Bytes(64KB)
+
+neigh/default/unres_qlen - INTEGER
+ The maximum number of packets which may be queued for each
+ unresolved address by other network layers.
+ (deprecated in linux 3.3) : use unres_qlen_bytes instead.
+ Prior to linux 3.3, the default value is 3 which may cause
+ unexpected packet loss. The current default value is calculated
+ according to default value of unres_qlen_bytes and true size of
+ packet.
+ Default: 31
+
+mtu_expires - INTEGER
+ Time, in seconds, that cached PMTU information is kept.
+
+min_adv_mss - INTEGER
+ The advertised MSS depends on the first hop route MTU, but will
+ never be lower than this setting.
+
+IP Fragmentation:
+
+ipfrag_high_thresh - INTEGER
+ Maximum memory used to reassemble IP fragments. When
+ ipfrag_high_thresh bytes of memory is allocated for this purpose,
+ the fragment handler will toss packets until ipfrag_low_thresh
+ is reached. This also serves as a maximum limit to namespaces
+ different from the initial one.
+
+ipfrag_low_thresh - INTEGER
+ Maximum memory used to reassemble IP fragments before the kernel
+ begins to remove incomplete fragment queues to free up resources.
+ The kernel still accepts new fragments for defragmentation.
+
+ipfrag_time - INTEGER
+ Time in seconds to keep an IP fragment in memory.
+
+ipfrag_max_dist - INTEGER
+ ipfrag_max_dist is a non-negative integer value which defines the
+ maximum "disorder" which is allowed among fragments which share a
+ common IP source address. Note that reordering of packets is
+ not unusual, but if a large number of fragments arrive from a source
+ IP address while a particular fragment queue remains incomplete, it
+ probably indicates that one or more fragments belonging to that queue
+ have been lost. When ipfrag_max_dist is positive, an additional check
+ is done on fragments before they are added to a reassembly queue - if
+ ipfrag_max_dist (or more) fragments have arrived from a particular IP
+ address between additions to any IP fragment queue using that source
+ address, it's presumed that one or more fragments in the queue are
+ lost. The existing fragment queue will be dropped, and a new one
+ started. An ipfrag_max_dist value of zero disables this check.
+
+ Using a very small value, e.g. 1 or 2, for ipfrag_max_dist can
+ result in unnecessarily dropping fragment queues when normal
+ reordering of packets occurs, which could lead to poor application
+ performance. Using a very large value, e.g. 50000, increases the
+ likelihood of incorrectly reassembling IP fragments that originate
+ from different IP datagrams, which could result in data corruption.
+ Default: 64
+
+INET peer storage:
+
+inet_peer_threshold - INTEGER
+ The approximate size of the storage. Starting from this threshold
+ entries will be thrown aggressively. This threshold also determines
+ entries' time-to-live and time intervals between garbage collection
+ passes. More entries, less time-to-live, less GC interval.
+
+inet_peer_minttl - INTEGER
+ Minimum time-to-live of entries. Should be enough to cover fragment
+ time-to-live on the reassembling side. This minimum time-to-live is
+ guaranteed if the pool size is less than inet_peer_threshold.
+ Measured in seconds.
+
+inet_peer_maxttl - INTEGER
+ Maximum time-to-live of entries. Unused entries will expire after
+ this period of time if there is no memory pressure on the pool (i.e.
+ when the number of entries in the pool is very small).
+ Measured in seconds.
+
+TCP variables:
+
+somaxconn - INTEGER
+ Limit of socket listen() backlog, known in userspace as SOMAXCONN.
+ Defaults to 128. See also tcp_max_syn_backlog for additional tuning
+ for TCP sockets.
+
+tcp_abort_on_overflow - BOOLEAN
+ If listening service is too slow to accept new connections,
+ reset them. Default state is FALSE. It means that if overflow
+ occurred due to a burst, connection will recover. Enable this
+ option _only_ if you are really sure that listening daemon
+ cannot be tuned to accept connections faster. Enabling this
+ option can harm clients of your server.
+
+tcp_adv_win_scale - INTEGER
+ Count buffering overhead as bytes/2^tcp_adv_win_scale
+ (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale),
+ if it is <= 0.
+ Possible values are [-31, 31], inclusive.
+ Default: 1
+
+tcp_allowed_congestion_control - STRING
+ Show/set the congestion control choices available to non-privileged
+ processes. The list is a subset of those listed in
+ tcp_available_congestion_control.
+ Default is "reno" and the default setting (tcp_congestion_control).
+
+tcp_app_win - INTEGER
+ Reserve max(window/2^tcp_app_win, mss) of window for application
+ buffer. Value 0 is special, it means that nothing is reserved.
+ Default: 31
+
+tcp_autocorking - BOOLEAN
+ Enable TCP auto corking :
+ When applications do consecutive small write()/sendmsg() system calls,
+ we try to coalesce these small writes as much as possible, to lower
+ total amount of sent packets. This is done if at least one prior
+ packet for the flow is waiting in Qdisc queues or device transmit
+ queue. Applications can still use TCP_CORK for optimal behavior
+ when they know how/when to uncork their sockets.
+ Default : 1
+
+tcp_available_congestion_control - STRING
+ Shows the available congestion control choices that are registered.
+ More congestion control algorithms may be available as modules,
+ but not loaded.
+
+tcp_base_mss - INTEGER
+ The initial value of search_low to be used by the packetization layer
+ Path MTU discovery (MTU probing). If MTU probing is enabled,
+ this is the initial MSS used by the connection.
+
+tcp_congestion_control - STRING
+ Set the congestion control algorithm to be used for new
+ connections. The algorithm "reno" is always available, but
+ additional choices may be available based on kernel configuration.
+ Default is set as part of kernel configuration.
+ For passive connections, the listener congestion control choice
+ is inherited.
+ [see setsockopt(listenfd, SOL_TCP, TCP_CONGESTION, "name" ...) ]
+
+tcp_dsack - BOOLEAN
+ Allows TCP to send "duplicate" SACKs.
+
+tcp_early_retrans - INTEGER
+ Enable Early Retransmit (ER), per RFC 5827. ER lowers the threshold
+ for triggering fast retransmit when the amount of outstanding data is
+ small and when no previously unsent data can be transmitted (such
+ that limited transmit could be used). Also controls the use of
+ Tail loss probe (TLP) that converts RTOs occurring due to tail
+ losses into fast recovery (draft-dukkipati-tcpm-tcp-loss-probe-01).
+ Possible values:
+ 0 disables ER
+ 1 enables ER
+ 2 enables ER but delays fast recovery and fast retransmit
+ by a fourth of RTT. This mitigates connection falsely
+ recovers when network has a small degree of reordering
+ (less than 3 packets).
+ 3 enables delayed ER and TLP.
+ 4 enables TLP only.
+ Default: 3
+
+tcp_ecn - INTEGER
+ Control use of Explicit Congestion Notification (ECN) by TCP.
+ ECN is used only when both ends of the TCP connection indicate
+ support for it. This feature is useful in avoiding losses due
+ to congestion by allowing supporting routers to signal
+ congestion before having to drop packets.
+ Possible values are:
+ 0 Disable ECN. Neither initiate nor accept ECN.
+ 1 Enable ECN when requested by incoming connections and
+ also request ECN on outgoing connection attempts.
+ 2 Enable ECN when requested by incoming connections
+ but do not request ECN on outgoing connections.
+ Default: 2
+
+tcp_fack - BOOLEAN
+ Enable FACK congestion avoidance and fast retransmission.
+ The value is not used, if tcp_sack is not enabled.
+
+tcp_fin_timeout - INTEGER
+ The length of time an orphaned (no longer referenced by any
+ application) connection will remain in the FIN_WAIT_2 state
+ before it is aborted at the local end. While a perfectly
+ valid "receive only" state for an un-orphaned connection, an
+ orphaned connection in FIN_WAIT_2 state could otherwise wait
+ forever for the remote to close its end of the connection.
+ Cf. tcp_max_orphans
+ Default: 60 seconds
+
+tcp_frto - INTEGER
+ Enables Forward RTO-Recovery (F-RTO) defined in RFC5682.
+ F-RTO is an enhanced recovery algorithm for TCP retransmission
+ timeouts. It is particularly beneficial in networks where the
+ RTT fluctuates (e.g., wireless). F-RTO is sender-side only
+ modification. It does not require any support from the peer.
+
+ By default it's enabled with a non-zero value. 0 disables F-RTO.
+
+tcp_invalid_ratelimit - INTEGER
+ Limit the maximal rate for sending duplicate acknowledgments
+ in response to incoming TCP packets that are for an existing
+ connection but that are invalid due to any of these reasons:
+
+ (a) out-of-window sequence number,
+ (b) out-of-window acknowledgment number, or
+ (c) PAWS (Protection Against Wrapped Sequence numbers) check failure
+
+ This can help mitigate simple "ack loop" DoS attacks, wherein
+ a buggy or malicious middlebox or man-in-the-middle can
+ rewrite TCP header fields in manner that causes each endpoint
+ to think that the other is sending invalid TCP segments, thus
+ causing each side to send an unterminating stream of duplicate
+ acknowledgments for invalid segments.
+
+ Using 0 disables rate-limiting of dupacks in response to
+ invalid segments; otherwise this value specifies the minimal
+ space between sending such dupacks, in milliseconds.
+
+ Default: 500 (milliseconds).
+
+tcp_keepalive_time - INTEGER
+ How often TCP sends out keepalive messages when keepalive is enabled.
+ Default: 2hours.
+
+tcp_keepalive_probes - INTEGER
+ How many keepalive probes TCP sends out, until it decides that the
+ connection is broken. Default value: 9.
+
+tcp_keepalive_intvl - INTEGER
+ How frequently the probes are send out. Multiplied by
+ tcp_keepalive_probes it is time to kill not responding connection,
+ after probes started. Default value: 75sec i.e. connection
+ will be aborted after ~11 minutes of retries.
+
+tcp_low_latency - BOOLEAN
+ If set, the TCP stack makes decisions that prefer lower
+ latency as opposed to higher throughput. By default, this
+ option is not set meaning that higher throughput is preferred.
+ An example of an application where this default should be
+ changed would be a Beowulf compute cluster.
+ Default: 0
+
+tcp_max_orphans - INTEGER
+ Maximal number of TCP sockets not attached to any user file handle,
+ held by system. If this number is exceeded orphaned connections are
+ reset immediately and warning is printed. This limit exists
+ only to prevent simple DoS attacks, you _must_ not rely on this
+ or lower the limit artificially, but rather increase it
+ (probably, after increasing installed memory),
+ if network conditions require more than default value,
+ and tune network services to linger and kill such states
+ more aggressively. Let me to remind again: each orphan eats
+ up to ~64K of unswappable memory.
+
+tcp_max_syn_backlog - INTEGER
+ Maximal number of remembered connection requests, which have not
+ received an acknowledgment from connecting client.
+ The minimal value is 128 for low memory machines, and it will
+ increase in proportion to the memory of machine.
+ If server suffers from overload, try increasing this number.
+
+tcp_max_tw_buckets - INTEGER
+ Maximal number of timewait sockets held by system simultaneously.
+ If this number is exceeded time-wait socket is immediately destroyed
+ and warning is printed. This limit exists only to prevent
+ simple DoS attacks, you _must_ not lower the limit artificially,
+ but rather increase it (probably, after increasing installed memory),
+ if network conditions require more than default value.
+
+tcp_mem - vector of 3 INTEGERs: min, pressure, max
+ min: below this number of pages TCP is not bothered about its
+ memory appetite.
+
+ pressure: when amount of memory allocated by TCP exceeds this number
+ of pages, TCP moderates its memory consumption and enters memory
+ pressure mode, which is exited when memory consumption falls
+ under "min".
+
+ max: number of pages allowed for queueing by all TCP sockets.
+
+ Defaults are calculated at boot time from amount of available
+ memory.
+
+tcp_moderate_rcvbuf - BOOLEAN
+ If set, TCP performs receive buffer auto-tuning, attempting to
+ automatically size the buffer (no greater than tcp_rmem[2]) to
+ match the size required by the path for full throughput. Enabled by
+ default.
+
+tcp_mtu_probing - INTEGER
+ Controls TCP Packetization-Layer Path MTU Discovery. Takes three
+ values:
+ 0 - Disabled
+ 1 - Disabled by default, enabled when an ICMP black hole detected
+ 2 - Always enabled, use initial MSS of tcp_base_mss.
+
+tcp_probe_interval - INTEGER
+ Controls how often to start TCP Packetization-Layer Path MTU
+ Discovery reprobe. The default is reprobing every 10 minutes as
+ per RFC4821.
+
+tcp_probe_threshold - INTEGER
+ Controls when TCP Packetization-Layer Path MTU Discovery probing
+ will stop in respect to the width of search range in bytes. Default
+ is 8 bytes.
+
+tcp_no_metrics_save - BOOLEAN
+ By default, TCP saves various connection metrics in the route cache
+ when the connection closes, so that connections established in the
+ near future can use these to set initial conditions. Usually, this
+ increases overall performance, but may sometimes cause performance
+ degradation. If set, TCP will not cache metrics on closing
+ connections.
+
+tcp_orphan_retries - INTEGER
+ This value influences the timeout of a locally closed TCP connection,
+ when RTO retransmissions remain unacknowledged.
+ See tcp_retries2 for more details.
+
+ The default value is 8.
+ If your machine is a loaded WEB server,
+ you should think about lowering this value, such sockets
+ may consume significant resources. Cf. tcp_max_orphans.
+
+tcp_reordering - INTEGER
+ Initial reordering level of packets in a TCP stream.
+ TCP stack can then dynamically adjust flow reordering level
+ between this initial value and tcp_max_reordering
+ Default: 3
+
+tcp_max_reordering - INTEGER
+ Maximal reordering level of packets in a TCP stream.
+ 300 is a fairly conservative value, but you might increase it
+ if paths are using per packet load balancing (like bonding rr mode)
+ Default: 300
+
+tcp_retrans_collapse - BOOLEAN
+ Bug-to-bug compatibility with some broken printers.
+ On retransmit try to send bigger packets to work around bugs in
+ certain TCP stacks.
+
+tcp_retries1 - INTEGER
+ This value influences the time, after which TCP decides, that
+ something is wrong due to unacknowledged RTO retransmissions,
+ and reports this suspicion to the network layer.
+ See tcp_retries2 for more details.
+
+ RFC 1122 recommends at least 3 retransmissions, which is the
+ default.
+
+tcp_retries2 - INTEGER
+ This value influences the timeout of an alive TCP connection,
+ when RTO retransmissions remain unacknowledged.
+ Given a value of N, a hypothetical TCP connection following
+ exponential backoff with an initial RTO of TCP_RTO_MIN would
+ retransmit N times before killing the connection at the (N+1)th RTO.
+
+ The default value of 15 yields a hypothetical timeout of 924.6
+ seconds and is a lower bound for the effective timeout.
+ TCP will effectively time out at the first RTO which exceeds the
+ hypothetical timeout.
+
+ RFC 1122 recommends at least 100 seconds for the timeout,
+ which corresponds to a value of at least 8.
+
+tcp_rfc1337 - BOOLEAN
+ If set, the TCP stack behaves conforming to RFC1337. If unset,
+ we are not conforming to RFC, but prevent TCP TIME_WAIT
+ assassination.
+ Default: 0
+
+tcp_rmem - vector of 3 INTEGERs: min, default, max
+ min: Minimal size of receive buffer used by TCP sockets.
+ It is guaranteed to each TCP socket, even under moderate memory
+ pressure.
+ Default: 1 page
+
+ default: initial size of receive buffer used by TCP sockets.
+ This value overrides net.core.rmem_default used by other protocols.
+ Default: 87380 bytes. This value results in window of 65535 with
+ default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit
+ less for default tcp_app_win. See below about these variables.
+
+ max: maximal size of receive buffer allowed for automatically
+ selected receiver buffers for TCP socket. This value does not override
+ net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables
+ automatic tuning of that socket's receive buffer size, in which
+ case this value is ignored.
+ Default: between 87380B and 6MB, depending on RAM size.
+
+tcp_sack - BOOLEAN
+ Enable select acknowledgments (SACKS).
+
+tcp_slow_start_after_idle - BOOLEAN
+ If set, provide RFC2861 behavior and time out the congestion
+ window after an idle period. An idle period is defined at
+ the current RTO. If unset, the congestion window will not
+ be timed out after an idle period.
+ Default: 1
+
+tcp_stdurg - BOOLEAN
+ Use the Host requirements interpretation of the TCP urgent pointer field.
+ Most hosts use the older BSD interpretation, so if you turn this on
+ Linux might not communicate correctly with them.
+ Default: FALSE
+
+tcp_synack_retries - INTEGER
+ Number of times SYNACKs for a passive TCP connection attempt will
+ be retransmitted. Should not be higher than 255. Default value
+ is 5, which corresponds to 31seconds till the last retransmission
+ with the current initial RTO of 1second. With this the final timeout
+ for a passive TCP connection will happen after 63seconds.
+
+tcp_syncookies - BOOLEAN
+ Only valid when the kernel was compiled with CONFIG_SYN_COOKIES
+ Send out syncookies when the syn backlog queue of a socket
+ overflows. This is to prevent against the common 'SYN flood attack'
+ Default: 1
+
+ Note, that syncookies is fallback facility.
+ It MUST NOT be used to help highly loaded servers to stand
+ against legal connection rate. If you see SYN flood warnings
+ in your logs, but investigation shows that they occur
+ because of overload with legal connections, you should tune
+ another parameters until this warning disappear.
+ See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow.
+
+ syncookies seriously violate TCP protocol, do not allow
+ to use TCP extensions, can result in serious degradation
+ of some services (f.e. SMTP relaying), visible not by you,
+ but your clients and relays, contacting you. While you see
+ SYN flood warnings in logs not being really flooded, your server
+ is seriously misconfigured.
+
+ If you want to test which effects syncookies have to your
+ network connections you can set this knob to 2 to enable
+ unconditionally generation of syncookies.
+
+tcp_fastopen - INTEGER
+ Enable TCP Fast Open feature (draft-ietf-tcpm-fastopen) to send data
+ in the opening SYN packet. To use this feature, the client application
+ must use sendmsg() or sendto() with MSG_FASTOPEN flag rather than
+ connect() to perform a TCP handshake automatically.
+
+ The values (bitmap) are
+ 1: Enables sending data in the opening SYN on the client w/ MSG_FASTOPEN.
+ 2: Enables TCP Fast Open on the server side, i.e., allowing data in
+ a SYN packet to be accepted and passed to the application before
+ 3-way hand shake finishes.
+ 4: Send data in the opening SYN regardless of cookie availability and
+ without a cookie option.
+ 0x100: Accept SYN data w/o validating the cookie.
+ 0x200: Accept data-in-SYN w/o any cookie option present.
+ 0x400/0x800: Enable Fast Open on all listeners regardless of the
+ TCP_FASTOPEN socket option. The two different flags designate two
+ different ways of setting max_qlen without the TCP_FASTOPEN socket
+ option.
+
+ Default: 1
+
+ Note that the client & server side Fast Open flags (1 and 2
+ respectively) must be also enabled before the rest of flags can take
+ effect.
+
+ See include/net/tcp.h and the code for more details.
+
+tcp_syn_retries - INTEGER
+ Number of times initial SYNs for an active TCP connection attempt
+ will be retransmitted. Should not be higher than 255. Default value
+ is 6, which corresponds to 63seconds till the last retransmission
+ with the current initial RTO of 1second. With this the final timeout
+ for an active TCP connection attempt will happen after 127seconds.
+
+tcp_timestamps - BOOLEAN
+ Enable timestamps as defined in RFC1323.
+
+tcp_min_tso_segs - INTEGER
+ Minimal number of segments per TSO frame.
+ Since linux-3.12, TCP does an automatic sizing of TSO frames,
+ depending on flow rate, instead of filling 64Kbytes packets.
+ For specific usages, it's possible to force TCP to build big
+ TSO frames. Note that TCP stack might split too big TSO packets
+ if available window is too small.
+ Default: 2
+
+tcp_tso_win_divisor - INTEGER
+ This allows control over what percentage of the congestion window
+ can be consumed by a single TSO frame.
+ The setting of this parameter is a choice between burstiness and
+ building larger TSO frames.
+ Default: 3
+
+tcp_tw_recycle - BOOLEAN
+ Enable fast recycling TIME-WAIT sockets. Default value is 0.
+ It should not be changed without advice/request of technical
+ experts.
+
+tcp_tw_reuse - BOOLEAN
+ Allow to reuse TIME-WAIT sockets for new connections when it is
+ safe from protocol viewpoint. Default value is 0.
+ It should not be changed without advice/request of technical
+ experts.
+
+tcp_window_scaling - BOOLEAN
+ Enable window scaling as defined in RFC1323.
+
+tcp_wmem - vector of 3 INTEGERs: min, default, max
+ min: Amount of memory reserved for send buffers for TCP sockets.
+ Each TCP socket has rights to use it due to fact of its birth.
+ Default: 1 page
+
+ default: initial size of send buffer used by TCP sockets. This
+ value overrides net.core.wmem_default used by other protocols.
+ It is usually lower than net.core.wmem_default.
+ Default: 16K
+
+ max: Maximal amount of memory allowed for automatically tuned
+ send buffers for TCP sockets. This value does not override
+ net.core.wmem_max. Calling setsockopt() with SO_SNDBUF disables
+ automatic tuning of that socket's send buffer size, in which case
+ this value is ignored.
+ Default: between 64K and 4MB, depending on RAM size.
+
+tcp_notsent_lowat - UNSIGNED INTEGER
+ A TCP socket can control the amount of unsent bytes in its write queue,
+ thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll()
+ reports POLLOUT events if the amount of unsent bytes is below a per
+ socket value, and if the write queue is not full. sendmsg() will
+ also not add new buffers if the limit is hit.
+
+ This global variable controls the amount of unsent data for
+ sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change
+ to the global variable has immediate effect.
+
+ Default: UINT_MAX (0xFFFFFFFF)
+
+tcp_workaround_signed_windows - BOOLEAN
+ If set, assume no receipt of a window scaling option means the
+ remote TCP is broken and treats the window as a signed quantity.
+ If unset, assume the remote TCP is not broken even if we do
+ not receive a window scaling option from them.
+ Default: 0
+
+tcp_thin_linear_timeouts - BOOLEAN
+ Enable dynamic triggering of linear timeouts for thin streams.
+ If set, a check is performed upon retransmission by timeout to
+ determine if the stream is thin (less than 4 packets in flight).
+ As long as the stream is found to be thin, up to 6 linear
+ timeouts may be performed before exponential backoff mode is
+ initiated. This improves retransmission latency for
+ non-aggressive thin streams, often found to be time-dependent.
+ For more information on thin streams, see
+ Documentation/networking/tcp-thin.txt
+ Default: 0
+
+tcp_thin_dupack - BOOLEAN
+ Enable dynamic triggering of retransmissions after one dupACK
+ for thin streams. If set, a check is performed upon reception
+ of a dupACK to determine if the stream is thin (less than 4
+ packets in flight). As long as the stream is found to be thin,
+ data is retransmitted on the first received dupACK. This
+ improves retransmission latency for non-aggressive thin
+ streams, often found to be time-dependent.
+ For more information on thin streams, see
+ Documentation/networking/tcp-thin.txt
+ Default: 0
+
+tcp_limit_output_bytes - INTEGER
+ Controls TCP Small Queue limit per tcp socket.
+ TCP bulk sender tends to increase packets in flight until it
+ gets losses notifications. With SNDBUF autotuning, this can
+ result in a large amount of packets queued in qdisc/device
+ on the local machine, hurting latency of other flows, for
+ typical pfifo_fast qdiscs.
+ tcp_limit_output_bytes limits the number of bytes on qdisc
+ or device to reduce artificial RTT/cwnd and reduce bufferbloat.
+ Default: 131072
+
+tcp_challenge_ack_limit - INTEGER
+ Limits number of Challenge ACK sent per second, as recommended
+ in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks)
+ Default: 100
+
+UDP variables:
+
+udp_mem - vector of 3 INTEGERs: min, pressure, max
+ Number of pages allowed for queueing by all UDP sockets.
+
+ min: Below this number of pages UDP is not bothered about its
+ memory appetite. When amount of memory allocated by UDP exceeds
+ this number, UDP starts to moderate memory usage.
+
+ pressure: This value was introduced to follow format of tcp_mem.
+
+ max: Number of pages allowed for queueing by all UDP sockets.
+
+ Default is calculated at boot time from amount of available memory.
+
+udp_rmem_min - INTEGER
+ Minimal size of receive buffer used by UDP sockets in moderation.
+ Each UDP socket is able to use the size for receiving data, even if
+ total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
+ Default: 1 page
+
+udp_wmem_min - INTEGER
+ Minimal size of send buffer used by UDP sockets in moderation.
+ Each UDP socket is able to use the size for sending data, even if
+ total pages of UDP sockets exceed udp_mem pressure. The unit is byte.
+ Default: 1 page
+
+CIPSOv4 Variables:
+
+cipso_cache_enable - BOOLEAN
+ If set, enable additions to and lookups from the CIPSO label mapping
+ cache. If unset, additions are ignored and lookups always result in a
+ miss. However, regardless of the setting the cache is still
+ invalidated when required when means you can safely toggle this on and
+ off and the cache will always be "safe".
+ Default: 1
+
+cipso_cache_bucket_size - INTEGER
+ The CIPSO label cache consists of a fixed size hash table with each
+ hash bucket containing a number of cache entries. This variable limits
+ the number of entries in each hash bucket; the larger the value the
+ more CIPSO label mappings that can be cached. When the number of
+ entries in a given hash bucket reaches this limit adding new entries
+ causes the oldest entry in the bucket to be removed to make room.
+ Default: 10
+
+cipso_rbm_optfmt - BOOLEAN
+ Enable the "Optimized Tag 1 Format" as defined in section 3.4.2.6 of
+ the CIPSO draft specification (see Documentation/netlabel for details).
+ This means that when set the CIPSO tag will be padded with empty
+ categories in order to make the packet data 32-bit aligned.
+ Default: 0
+
+cipso_rbm_structvalid - BOOLEAN
+ If set, do a very strict check of the CIPSO option when
+ ip_options_compile() is called. If unset, relax the checks done during
+ ip_options_compile(). Either way is "safe" as errors are caught else
+ where in the CIPSO processing code but setting this to 0 (False) should
+ result in less work (i.e. it should be faster) but could cause problems
+ with other implementations that require strict checking.
+ Default: 0
+
+IP Variables:
+
+ip_local_port_range - 2 INTEGERS
+ Defines the local port range that is used by TCP and UDP to
+ choose the local port. The first number is the first, the
+ second the last local port number. The default values are
+ 32768 and 61000 respectively.
+
+ip_local_reserved_ports - list of comma separated ranges
+ Specify the ports which are reserved for known third-party
+ applications. These ports will not be used by automatic port
+ assignments (e.g. when calling connect() or bind() with port
+ number 0). Explicit port allocation behavior is unchanged.
+
+ The format used for both input and output is a comma separated
+ list of ranges (e.g. "1,2-4,10-10" for ports 1, 2, 3, 4 and
+ 10). Writing to the file will clear all previously reserved
+ ports and update the current list with the one given in the
+ input.
+
+ Note that ip_local_port_range and ip_local_reserved_ports
+ settings are independent and both are considered by the kernel
+ when determining which ports are available for automatic port
+ assignments.
+
+ You can reserve ports which are not in the current
+ ip_local_port_range, e.g.:
+
+ $ cat /proc/sys/net/ipv4/ip_local_port_range
+ 32000 61000
+ $ cat /proc/sys/net/ipv4/ip_local_reserved_ports
+ 8080,9148
+
+ although this is redundant. However such a setting is useful
+ if later the port range is changed to a value that will
+ include the reserved ports.
+
+ Default: Empty
+
+ip_nonlocal_bind - BOOLEAN
+ If set, allows processes to bind() to non-local IP addresses,
+ which can be quite useful - but may break some applications.
+ Default: 0
+
+ip_dynaddr - BOOLEAN
+ If set non-zero, enables support for dynamic addresses.
+ If set to a non-zero value larger than 1, a kernel log
+ message will be printed when dynamic address rewriting
+ occurs.
+ Default: 0
+
+ip_early_demux - BOOLEAN
+ Optimize input packet processing down to one demux for
+ certain kinds of local sockets. Currently we only do this
+ for established TCP sockets.
+
+ It may add an additional cost for pure routing workloads that
+ reduces overall throughput, in such case you should disable it.
+ Default: 1
+
+icmp_echo_ignore_all - BOOLEAN
+ If set non-zero, then the kernel will ignore all ICMP ECHO
+ requests sent to it.
+ Default: 0
+
+icmp_echo_ignore_broadcasts - BOOLEAN
+ If set non-zero, then the kernel will ignore all ICMP ECHO and
+ TIMESTAMP requests sent to it via broadcast/multicast.
+ Default: 1
+
+icmp_ratelimit - INTEGER
+ Limit the maximal rates for sending ICMP packets whose type matches
+ icmp_ratemask (see below) to specific targets.
+ 0 to disable any limiting,
+ otherwise the minimal space between responses in milliseconds.
+ Note that another sysctl, icmp_msgs_per_sec limits the number
+ of ICMP packets sent on all targets.
+ Default: 1000
+
+icmp_msgs_per_sec - INTEGER
+ Limit maximal number of ICMP packets sent per second from this host.
+ Only messages whose type matches icmp_ratemask (see below) are
+ controlled by this limit.
+ Default: 1000
+
+icmp_msgs_burst - INTEGER
+ icmp_msgs_per_sec controls number of ICMP packets sent per second,
+ while icmp_msgs_burst controls the burst size of these packets.
+ Default: 50
+
+icmp_ratemask - INTEGER
+ Mask made of ICMP types for which rates are being limited.
+ Significant bits: IHGFEDCBA9876543210
+ Default mask: 0000001100000011000 (6168)
+
+ Bit definitions (see include/linux/icmp.h):
+ 0 Echo Reply
+ 3 Destination Unreachable *
+ 4 Source Quench *
+ 5 Redirect
+ 8 Echo Request
+ B Time Exceeded *
+ C Parameter Problem *
+ D Timestamp Request
+ E Timestamp Reply
+ F Info Request
+ G Info Reply
+ H Address Mask Request
+ I Address Mask Reply
+
+ * These are rate limited by default (see default mask above)
+
+icmp_ignore_bogus_error_responses - BOOLEAN
+ Some routers violate RFC1122 by sending bogus responses to broadcast
+ frames. Such violations are normally logged via a kernel warning.
+ If this is set to TRUE, the kernel will not give such warnings, which
+ will avoid log file clutter.
+ Default: 1
+
+icmp_errors_use_inbound_ifaddr - BOOLEAN
+
+ If zero, icmp error messages are sent with the primary address of
+ the exiting interface.
+
+ If non-zero, the message will be sent with the primary address of
+ the interface that received the packet that caused the icmp error.
+ This is the behaviour network many administrators will expect from
+ a router. And it can make debugging complicated network layouts
+ much easier.
+
+ Note that if no primary address exists for the interface selected,
+ then the primary address of the first non-loopback interface that
+ has one will be used regardless of this setting.
+
+ Default: 0
+
+igmp_max_memberships - INTEGER
+ Change the maximum number of multicast groups we can subscribe to.
+ Default: 20
+
+ Theoretical maximum value is bounded by having to send a membership
+ report in a single datagram (i.e. the report can't span multiple
+ datagrams, or risk confusing the switch and leaving groups you don't
+ intend to).
+
+ The number of supported groups 'M' is bounded by the number of group
+ report entries you can fit into a single datagram of 65535 bytes.
+
+ M = 65536-sizeof (ip header)/(sizeof(Group record))
+
+ Group records are variable length, with a minimum of 12 bytes.
+ So net.ipv4.igmp_max_memberships should not be set higher than:
+
+ (65536-24) / 12 = 5459
+
+ The value 5459 assumes no IP header options, so in practice
+ this number may be lower.
+
+ conf/interface/* changes special settings per interface (where
+ "interface" is the name of your network interface)
+
+ conf/all/* is special, changes the settings for all interfaces
+
+igmp_qrv - INTEGER
+ Controls the IGMP query robustness variable (see RFC2236 8.1).
+ Default: 2 (as specified by RFC2236 8.1)
+ Minimum: 1 (as specified by RFC6636 4.5)
+
+log_martians - BOOLEAN
+ Log packets with impossible addresses to kernel log.
+ log_martians for the interface will be enabled if at least one of
+ conf/{all,interface}/log_martians is set to TRUE,
+ it will be disabled otherwise
+
+accept_redirects - BOOLEAN
+ Accept ICMP redirect messages.
+ accept_redirects for the interface will be enabled if:
+ - both conf/{all,interface}/accept_redirects are TRUE in the case
+ forwarding for the interface is enabled
+ or
+ - at least one of conf/{all,interface}/accept_redirects is TRUE in the
+ case forwarding for the interface is disabled
+ accept_redirects for the interface will be disabled otherwise
+ default TRUE (host)
+ FALSE (router)
+
+forwarding - BOOLEAN
+ Enable IP forwarding on this interface.
+
+mc_forwarding - BOOLEAN
+ Do multicast routing. The kernel needs to be compiled with CONFIG_MROUTE
+ and a multicast routing daemon is required.
+ conf/all/mc_forwarding must also be set to TRUE to enable multicast
+ routing for the interface
+
+medium_id - INTEGER
+ Integer value used to differentiate the devices by the medium they
+ are attached to. Two devices can have different id values when
+ the broadcast packets are received only on one of them.
+ The default value 0 means that the device is the only interface
+ to its medium, value of -1 means that medium is not known.
+
+ Currently, it is used to change the proxy_arp behavior:
+ the proxy_arp feature is enabled for packets forwarded between
+ two devices attached to different media.
+
+proxy_arp - BOOLEAN
+ Do proxy arp.
+ proxy_arp for the interface will be enabled if at least one of
+ conf/{all,interface}/proxy_arp is set to TRUE,
+ it will be disabled otherwise
+
+proxy_arp_pvlan - BOOLEAN
+ Private VLAN proxy arp.
+ Basically allow proxy arp replies back to the same interface
+ (from which the ARP request/solicitation was received).
+
+ This is done to support (ethernet) switch features, like RFC
+ 3069, where the individual ports are NOT allowed to
+ communicate with each other, but they are allowed to talk to
+ the upstream router. As described in RFC 3069, it is possible
+ to allow these hosts to communicate through the upstream
+ router by proxy_arp'ing. Don't need to be used together with
+ proxy_arp.
+
+ This technology is known by different names:
+ In RFC 3069 it is called VLAN Aggregation.
+ Cisco and Allied Telesyn call it Private VLAN.
+ Hewlett-Packard call it Source-Port filtering or port-isolation.
+ Ericsson call it MAC-Forced Forwarding (RFC Draft).
+
+shared_media - BOOLEAN
+ Send(router) or accept(host) RFC1620 shared media redirects.
+ Overrides ip_secure_redirects.
+ shared_media for the interface will be enabled if at least one of
+ conf/{all,interface}/shared_media is set to TRUE,
+ it will be disabled otherwise
+ default TRUE
+
+secure_redirects - BOOLEAN
+ Accept ICMP redirect messages only for gateways,
+ listed in default gateway list.
+ secure_redirects for the interface will be enabled if at least one of
+ conf/{all,interface}/secure_redirects is set to TRUE,
+ it will be disabled otherwise
+ default TRUE
+
+send_redirects - BOOLEAN
+ Send redirects, if router.
+ send_redirects for the interface will be enabled if at least one of
+ conf/{all,interface}/send_redirects is set to TRUE,
+ it will be disabled otherwise
+ Default: TRUE
+
+bootp_relay - BOOLEAN
+ Accept packets with source address 0.b.c.d destined
+ not to this host as local ones. It is supposed, that
+ BOOTP relay daemon will catch and forward such packets.
+ conf/all/bootp_relay must also be set to TRUE to enable BOOTP relay
+ for the interface
+ default FALSE
+ Not Implemented Yet.
+
+accept_source_route - BOOLEAN
+ Accept packets with SRR option.
+ conf/all/accept_source_route must also be set to TRUE to accept packets
+ with SRR option on the interface
+ default TRUE (router)
+ FALSE (host)
+
+accept_local - BOOLEAN
+ Accept packets with local source addresses. In combination with
+ suitable routing, this can be used to direct packets between two
+ local interfaces over the wire and have them accepted properly.
+ default FALSE
+
+route_localnet - BOOLEAN
+ Do not consider loopback addresses as martian source or destination
+ while routing. This enables the use of 127/8 for local routing purposes.
+ default FALSE
+
+rp_filter - INTEGER
+ 0 - No source validation.
+ 1 - Strict mode as defined in RFC3704 Strict Reverse Path
+ Each incoming packet is tested against the FIB and if the interface
+ is not the best reverse path the packet check will fail.
+ By default failed packets are discarded.
+ 2 - Loose mode as defined in RFC3704 Loose Reverse Path
+ Each incoming packet's source address is also tested against the FIB
+ and if the source address is not reachable via any interface
+ the packet check will fail.
+
+ Current recommended practice in RFC3704 is to enable strict mode
+ to prevent IP spoofing from DDos attacks. If using asymmetric routing
+ or other complicated routing, then loose mode is recommended.
+
+ The max value from conf/{all,interface}/rp_filter is used
+ when doing source validation on the {interface}.
+
+ Default value is 0. Note that some distributions enable it
+ in startup scripts.
+
+arp_filter - BOOLEAN
+ 1 - Allows you to have multiple network interfaces on the same
+ subnet, and have the ARPs for each interface be answered
+ based on whether or not the kernel would route a packet from
+ the ARP'd IP out that interface (therefore you must use source
+ based routing for this to work). In other words it allows control
+ of which cards (usually 1) will respond to an arp request.
+
+ 0 - (default) The kernel can respond to arp requests with addresses
+ from other interfaces. This may seem wrong but it usually makes
+ sense, because it increases the chance of successful communication.
+ IP addresses are owned by the complete host on Linux, not by
+ particular interfaces. Only for more complex setups like load-
+ balancing, does this behaviour cause problems.
+
+ arp_filter for the interface will be enabled if at least one of
+ conf/{all,interface}/arp_filter is set to TRUE,
+ it will be disabled otherwise
+
+arp_announce - INTEGER
+ Define different restriction levels for announcing the local
+ source IP address from IP packets in ARP requests sent on
+ interface:
+ 0 - (default) Use any local address, configured on any interface
+ 1 - Try to avoid local addresses that are not in the target's
+ subnet for this interface. This mode is useful when target
+ hosts reachable via this interface require the source IP
+ address in ARP requests to be part of their logical network
+ configured on the receiving interface. When we generate the
+ request we will check all our subnets that include the
+ target IP and will preserve the source address if it is from
+ such subnet. If there is no such subnet we select source
+ address according to the rules for level 2.
+ 2 - Always use the best local address for this target.
+ In this mode we ignore the source address in the IP packet
+ and try to select local address that we prefer for talks with
+ the target host. Such local address is selected by looking
+ for primary IP addresses on all our subnets on the outgoing
+ interface that include the target IP address. If no suitable
+ local address is found we select the first local address
+ we have on the outgoing interface or on all other interfaces,
+ with the hope we will receive reply for our request and
+ even sometimes no matter the source IP address we announce.
+
+ The max value from conf/{all,interface}/arp_announce is used.
+
+ Increasing the restriction level gives more chance for
+ receiving answer from the resolved target while decreasing
+ the level announces more valid sender's information.
+
+arp_ignore - INTEGER
+ Define different modes for sending replies in response to
+ received ARP requests that resolve local target IP addresses:
+ 0 - (default): reply for any local target IP address, configured
+ on any interface
+ 1 - reply only if the target IP address is local address
+ configured on the incoming interface
+ 2 - reply only if the target IP address is local address
+ configured on the incoming interface and both with the
+ sender's IP address are part from same subnet on this interface
+ 3 - do not reply for local addresses configured with scope host,
+ only resolutions for global and link addresses are replied
+ 4-7 - reserved
+ 8 - do not reply for all local addresses
+
+ The max value from conf/{all,interface}/arp_ignore is used
+ when ARP request is received on the {interface}
+
+arp_notify - BOOLEAN
+ Define mode for notification of address and device changes.
+ 0 - (default): do nothing
+ 1 - Generate gratuitous arp requests when device is brought up
+ or hardware address changes.
+
+arp_accept - BOOLEAN
+ Define behavior for gratuitous ARP frames who's IP is not
+ already present in the ARP table:
+ 0 - don't create new entries in the ARP table
+ 1 - create new entries in the ARP table
+
+ Both replies and requests type gratuitous arp will trigger the
+ ARP table to be updated, if this setting is on.
+
+ If the ARP table already contains the IP address of the
+ gratuitous arp frame, the arp table will be updated regardless
+ if this setting is on or off.
+
+mcast_solicit - INTEGER
+ The maximum number of multicast probes in INCOMPLETE state,
+ when the associated hardware address is unknown. Defaults
+ to 3.
+
+ucast_solicit - INTEGER
+ The maximum number of unicast probes in PROBE state, when
+ the hardware address is being reconfirmed. Defaults to 3.
+
+app_solicit - INTEGER
+ The maximum number of probes to send to the user space ARP daemon
+ via netlink before dropping back to multicast probes (see
+ mcast_resolicit). Defaults to 0.
+
+mcast_resolicit - INTEGER
+ The maximum number of multicast probes after unicast and
+ app probes in PROBE state. Defaults to 0.
+
+disable_policy - BOOLEAN
+ Disable IPSEC policy (SPD) for this interface
+
+disable_xfrm - BOOLEAN
+ Disable IPSEC encryption on this interface, whatever the policy
+
+igmpv2_unsolicited_report_interval - INTEGER
+ The interval in milliseconds in which the next unsolicited
+ IGMPv1 or IGMPv2 report retransmit will take place.
+ Default: 10000 (10 seconds)
+
+igmpv3_unsolicited_report_interval - INTEGER
+ The interval in milliseconds in which the next unsolicited
+ IGMPv3 report retransmit will take place.
+ Default: 1000 (1 seconds)
+
+promote_secondaries - BOOLEAN
+ When a primary IP address is removed from this interface
+ promote a corresponding secondary IP address instead of
+ removing all the corresponding secondary IP addresses.
+
+
+tag - INTEGER
+ Allows you to write a number, which can be used as required.
+ Default value is 0.
+
+Alexey Kuznetsov.
+kuznet@ms2.inr.ac.ru
+
+Updated by:
+Andi Kleen
+ak@muc.de
+Nicolas Delon
+delon.nicolas@wanadoo.fr
+
+
+
+
+/proc/sys/net/ipv6/* Variables:
+
+IPv6 has no global variables such as tcp_*. tcp_* settings under ipv4/ also
+apply to IPv6 [XXX?].
+
+bindv6only - BOOLEAN
+ Default value for IPV6_V6ONLY socket option,
+ which restricts use of the IPv6 socket to IPv6 communication
+ only.
+ TRUE: disable IPv4-mapped address feature
+ FALSE: enable IPv4-mapped address feature
+
+ Default: FALSE (as specified in RFC3493)
+
+flowlabel_consistency - BOOLEAN
+ Protect the consistency (and unicity) of flow label.
+ You have to disable it to use IPV6_FL_F_REFLECT flag on the
+ flow label manager.
+ TRUE: enabled
+ FALSE: disabled
+ Default: TRUE
+
+auto_flowlabels - BOOLEAN
+ Automatically generate flow labels based based on a flow hash
+ of the packet. This allows intermediate devices, such as routers,
+ to idenfify packet flows for mechanisms like Equal Cost Multipath
+ Routing (see RFC 6438).
+ TRUE: enabled
+ FALSE: disabled
+ Default: false
+
+anycast_src_echo_reply - BOOLEAN
+ Controls the use of anycast addresses as source addresses for ICMPv6
+ echo reply
+ TRUE: enabled
+ FALSE: disabled
+ Default: FALSE
+
+idgen_delay - INTEGER
+ Controls the delay in seconds after which time to retry
+ privacy stable address generation if a DAD conflict is
+ detected.
+ Default: 1 (as specified in RFC7217)
+
+idgen_retries - INTEGER
+ Controls the number of retries to generate a stable privacy
+ address if a DAD conflict is detected.
+ Default: 3 (as specified in RFC7217)
+
+mld_qrv - INTEGER
+ Controls the MLD query robustness variable (see RFC3810 9.1).
+ Default: 2 (as specified by RFC3810 9.1)
+ Minimum: 1 (as specified by RFC6636 4.5)
+
+IPv6 Fragmentation:
+
+ip6frag_high_thresh - INTEGER
+ Maximum memory used to reassemble IPv6 fragments. When
+ ip6frag_high_thresh bytes of memory is allocated for this purpose,
+ the fragment handler will toss packets until ip6frag_low_thresh
+ is reached.
+
+ip6frag_low_thresh - INTEGER
+ See ip6frag_high_thresh
+
+ip6frag_time - INTEGER
+ Time in seconds to keep an IPv6 fragment in memory.
+
+conf/default/*:
+ Change the interface-specific default settings.
+
+
+conf/all/*:
+ Change all the interface-specific settings.
+
+ [XXX: Other special features than forwarding?]
+
+conf/all/forwarding - BOOLEAN
+ Enable global IPv6 forwarding between all interfaces.
+
+ IPv4 and IPv6 work differently here; e.g. netfilter must be used
+ to control which interfaces may forward packets and which not.
+
+ This also sets all interfaces' Host/Router setting
+ 'forwarding' to the specified value. See below for details.
+
+ This referred to as global forwarding.
+
+proxy_ndp - BOOLEAN
+ Do proxy ndp.
+
+fwmark_reflect - BOOLEAN
+ Controls the fwmark of kernel-generated IPv6 reply packets that are not
+ associated with a socket for example, TCP RSTs or ICMPv6 echo replies).
+ If unset, these packets have a fwmark of zero. If set, they have the
+ fwmark of the packet they are replying to.
+ Default: 0
+
+conf/interface/*:
+ Change special settings per interface.
+
+ The functional behaviour for certain settings is different
+ depending on whether local forwarding is enabled or not.
+
+accept_ra - INTEGER
+ Accept Router Advertisements; autoconfigure using them.
+
+ It also determines whether or not to transmit Router
+ Solicitations. If and only if the functional setting is to
+ accept Router Advertisements, Router Solicitations will be
+ transmitted.
+
+ Possible values are:
+ 0 Do not accept Router Advertisements.
+ 1 Accept Router Advertisements if forwarding is disabled.
+ 2 Overrule forwarding behaviour. Accept Router Advertisements
+ even if forwarding is enabled.
+
+ Functional default: enabled if local forwarding is disabled.
+ disabled if local forwarding is enabled.
+
+accept_ra_defrtr - BOOLEAN
+ Learn default router in Router Advertisement.
+
+ Functional default: enabled if accept_ra is enabled.
+ disabled if accept_ra is disabled.
+
+accept_ra_from_local - BOOLEAN
+ Accept RA with source-address that is found on local machine
+ if the RA is otherwise proper and able to be accepted.
+ Default is to NOT accept these as it may be an un-intended
+ network loop.
+
+ Functional default:
+ enabled if accept_ra_from_local is enabled
+ on a specific interface.
+ disabled if accept_ra_from_local is disabled
+ on a specific interface.
+
+accept_ra_pinfo - BOOLEAN
+ Learn Prefix Information in Router Advertisement.
+
+ Functional default: enabled if accept_ra is enabled.
+ disabled if accept_ra is disabled.
+
+accept_ra_rt_info_max_plen - INTEGER
+ Maximum prefix length of Route Information in RA.
+
+ Route Information w/ prefix larger than or equal to this
+ variable shall be ignored.
+
+ Functional default: 0 if accept_ra_rtr_pref is enabled.
+ -1 if accept_ra_rtr_pref is disabled.
+
+accept_ra_rtr_pref - BOOLEAN
+ Accept Router Preference in RA.
+
+ Functional default: enabled if accept_ra is enabled.
+ disabled if accept_ra is disabled.
+
+accept_ra_mtu - BOOLEAN
+ Apply the MTU value specified in RA option 5 (RFC4861). If
+ disabled, the MTU specified in the RA will be ignored.
+
+ Functional default: enabled if accept_ra is enabled.
+ disabled if accept_ra is disabled.
+
+accept_redirects - BOOLEAN
+ Accept Redirects.
+
+ Functional default: enabled if local forwarding is disabled.
+ disabled if local forwarding is enabled.
+
+accept_source_route - INTEGER
+ Accept source routing (routing extension header).
+
+ >= 0: Accept only routing header type 2.
+ < 0: Do not accept routing header.
+
+ Default: 0
+
+autoconf - BOOLEAN
+ Autoconfigure addresses using Prefix Information in Router
+ Advertisements.
+
+ Functional default: enabled if accept_ra_pinfo is enabled.
+ disabled if accept_ra_pinfo is disabled.
+
+dad_transmits - INTEGER
+ The amount of Duplicate Address Detection probes to send.
+ Default: 1
+
+forwarding - INTEGER
+ Configure interface-specific Host/Router behaviour.
+
+ Note: It is recommended to have the same setting on all
+ interfaces; mixed router/host scenarios are rather uncommon.
+
+ Possible values are:
+ 0 Forwarding disabled
+ 1 Forwarding enabled
+
+ FALSE (0):
+
+ By default, Host behaviour is assumed. This means:
+
+ 1. IsRouter flag is not set in Neighbour Advertisements.
+ 2. If accept_ra is TRUE (default), transmit Router
+ Solicitations.
+ 3. If accept_ra is TRUE (default), accept Router
+ Advertisements (and do autoconfiguration).
+ 4. If accept_redirects is TRUE (default), accept Redirects.
+
+ TRUE (1):
+
+ If local forwarding is enabled, Router behaviour is assumed.
+ This means exactly the reverse from the above:
+
+ 1. IsRouter flag is set in Neighbour Advertisements.
+ 2. Router Solicitations are not sent unless accept_ra is 2.
+ 3. Router Advertisements are ignored unless accept_ra is 2.
+ 4. Redirects are ignored.
+
+ Default: 0 (disabled) if global forwarding is disabled (default),
+ otherwise 1 (enabled).
+
+hop_limit - INTEGER
+ Default Hop Limit to set.
+ Default: 64
+
+mtu - INTEGER
+ Default Maximum Transfer Unit
+ Default: 1280 (IPv6 required minimum)
+
+router_probe_interval - INTEGER
+ Minimum interval (in seconds) between Router Probing described
+ in RFC4191.
+
+ Default: 60
+
+router_solicitation_delay - INTEGER
+ Number of seconds to wait after interface is brought up
+ before sending Router Solicitations.
+ Default: 1
+
+router_solicitation_interval - INTEGER
+ Number of seconds to wait between Router Solicitations.
+ Default: 4
+
+router_solicitations - INTEGER
+ Number of Router Solicitations to send until assuming no
+ routers are present.
+ Default: 3
+
+use_tempaddr - INTEGER
+ Preference for Privacy Extensions (RFC3041).
+ <= 0 : disable Privacy Extensions
+ == 1 : enable Privacy Extensions, but prefer public
+ addresses over temporary addresses.
+ > 1 : enable Privacy Extensions and prefer temporary
+ addresses over public addresses.
+ Default: 0 (for most devices)
+ -1 (for point-to-point devices and loopback devices)
+
+temp_valid_lft - INTEGER
+ valid lifetime (in seconds) for temporary addresses.
+ Default: 604800 (7 days)
+
+temp_prefered_lft - INTEGER
+ Preferred lifetime (in seconds) for temporary addresses.
+ Default: 86400 (1 day)
+
+max_desync_factor - INTEGER
+ Maximum value for DESYNC_FACTOR, which is a random value
+ that ensures that clients don't synchronize with each
+ other and generate new addresses at exactly the same time.
+ value is in seconds.
+ Default: 600
+
+regen_max_retry - INTEGER
+ Number of attempts before give up attempting to generate
+ valid temporary addresses.
+ Default: 5
+
+max_addresses - INTEGER
+ Maximum number of autoconfigured addresses per interface. Setting
+ to zero disables the limitation. It is not recommended to set this
+ value too large (or to zero) because it would be an easy way to
+ crash the kernel by allowing too many addresses to be created.
+ Default: 16
+
+disable_ipv6 - BOOLEAN
+ Disable IPv6 operation. If accept_dad is set to 2, this value
+ will be dynamically set to TRUE if DAD fails for the link-local
+ address.
+ Default: FALSE (enable IPv6 operation)
+
+ When this value is changed from 1 to 0 (IPv6 is being enabled),
+ it will dynamically create a link-local address on the given
+ interface and start Duplicate Address Detection, if necessary.
+
+ When this value is changed from 0 to 1 (IPv6 is being disabled),
+ it will dynamically delete all address on the given interface.
+
+accept_dad - INTEGER
+ Whether to accept DAD (Duplicate Address Detection).
+ 0: Disable DAD
+ 1: Enable DAD (default)
+ 2: Enable DAD, and disable IPv6 operation if MAC-based duplicate
+ link-local address has been found.
+
+force_tllao - BOOLEAN
+ Enable sending the target link-layer address option even when
+ responding to a unicast neighbor solicitation.
+ Default: FALSE
+
+ Quoting from RFC 2461, section 4.4, Target link-layer address:
+
+ "The option MUST be included for multicast solicitations in order to
+ avoid infinite Neighbor Solicitation "recursion" when the peer node
+ does not have a cache entry to return a Neighbor Advertisements
+ message. When responding to unicast solicitations, the option can be
+ omitted since the sender of the solicitation has the correct link-
+ layer address; otherwise it would not have be able to send the unicast
+ solicitation in the first place. However, including the link-layer
+ address in this case adds little overhead and eliminates a potential
+ race condition where the sender deletes the cached link-layer address
+ prior to receiving a response to a previous solicitation."
+
+ndisc_notify - BOOLEAN
+ Define mode for notification of address and device changes.
+ 0 - (default): do nothing
+ 1 - Generate unsolicited neighbour advertisements when device is brought
+ up or hardware address changes.
+
+mldv1_unsolicited_report_interval - INTEGER
+ The interval in milliseconds in which the next unsolicited
+ MLDv1 report retransmit will take place.
+ Default: 10000 (10 seconds)
+
+mldv2_unsolicited_report_interval - INTEGER
+ The interval in milliseconds in which the next unsolicited
+ MLDv2 report retransmit will take place.
+ Default: 1000 (1 second)
+
+force_mld_version - INTEGER
+ 0 - (default) No enforcement of a MLD version, MLDv1 fallback allowed
+ 1 - Enforce to use MLD version 1
+ 2 - Enforce to use MLD version 2
+
+suppress_frag_ndisc - INTEGER
+ Control RFC 6980 (Security Implications of IPv6 Fragmentation
+ with IPv6 Neighbor Discovery) behavior:
+ 1 - (default) discard fragmented neighbor discovery packets
+ 0 - allow fragmented neighbor discovery packets
+
+optimistic_dad - BOOLEAN
+ Whether to perform Optimistic Duplicate Address Detection (RFC 4429).
+ 0: disabled (default)
+ 1: enabled
+
+use_optimistic - BOOLEAN
+ If enabled, do not classify optimistic addresses as deprecated during
+ source address selection. Preferred addresses will still be chosen
+ before optimistic addresses, subject to other ranking in the source
+ address selection algorithm.
+ 0: disabled (default)
+ 1: enabled
+
+stable_secret - IPv6 address
+ This IPv6 address will be used as a secret to generate IPv6
+ addresses for link-local addresses and autoconfigured
+ ones. All addresses generated after setting this secret will
+ be stable privacy ones by default. This can be changed via the
+ addrgenmode ip-link. conf/default/stable_secret is used as the
+ secret for the namespace, the interface specific ones can
+ overwrite that. Writes to conf/all/stable_secret are refused.
+
+ It is recommended to generate this secret during installation
+ of a system and keep it stable after that.
+
+ By default the stable secret is unset.
+
+icmp/*:
+ratelimit - INTEGER
+ Limit the maximal rates for sending ICMPv6 packets.
+ 0 to disable any limiting,
+ otherwise the minimal space between responses in milliseconds.
+ Default: 1000
+
+
+IPv6 Update by:
+Pekka Savola <pekkas@netcore.fi>
+YOSHIFUJI Hideaki / USAGI Project <yoshfuji@linux-ipv6.org>
+
+
+/proc/sys/net/bridge/* Variables:
+
+bridge-nf-call-arptables - BOOLEAN
+ 1 : pass bridged ARP traffic to arptables' FORWARD chain.
+ 0 : disable this.
+ Default: 1
+
+bridge-nf-call-iptables - BOOLEAN
+ 1 : pass bridged IPv4 traffic to iptables' chains.
+ 0 : disable this.
+ Default: 1
+
+bridge-nf-call-ip6tables - BOOLEAN
+ 1 : pass bridged IPv6 traffic to ip6tables' chains.
+ 0 : disable this.
+ Default: 1
+
+bridge-nf-filter-vlan-tagged - BOOLEAN
+ 1 : pass bridged vlan-tagged ARP/IP/IPv6 traffic to {arp,ip,ip6}tables.
+ 0 : disable this.
+ Default: 0
+
+bridge-nf-filter-pppoe-tagged - BOOLEAN
+ 1 : pass bridged pppoe-tagged IP/IPv6 traffic to {ip,ip6}tables.
+ 0 : disable this.
+ Default: 0
+
+bridge-nf-pass-vlan-input-dev - BOOLEAN
+ 1: if bridge-nf-filter-vlan-tagged is enabled, try to find a vlan
+ interface on the bridge and set the netfilter input device to the vlan.
+ This allows use of e.g. "iptables -i br0.1" and makes the REDIRECT
+ target work with vlan-on-top-of-bridge interfaces. When no matching
+ vlan interface is found, or this switch is off, the input device is
+ set to the bridge interface.
+ 0: disable bridge netfilter vlan interface lookup.
+ Default: 0
+
+proc/sys/net/sctp/* Variables:
+
+addip_enable - BOOLEAN
+ Enable or disable extension of Dynamic Address Reconfiguration
+ (ADD-IP) functionality specified in RFC5061. This extension provides
+ the ability to dynamically add and remove new addresses for the SCTP
+ associations.
+
+ 1: Enable extension.
+
+ 0: Disable extension.
+
+ Default: 0
+
+addip_noauth_enable - BOOLEAN
+ Dynamic Address Reconfiguration (ADD-IP) requires the use of
+ authentication to protect the operations of adding or removing new
+ addresses. This requirement is mandated so that unauthorized hosts
+ would not be able to hijack associations. However, older
+ implementations may not have implemented this requirement while
+ allowing the ADD-IP extension. For reasons of interoperability,
+ we provide this variable to control the enforcement of the
+ authentication requirement.
+
+ 1: Allow ADD-IP extension to be used without authentication. This
+ should only be set in a closed environment for interoperability
+ with older implementations.
+
+ 0: Enforce the authentication requirement
+
+ Default: 0
+
+auth_enable - BOOLEAN
+ Enable or disable Authenticated Chunks extension. This extension
+ provides the ability to send and receive authenticated chunks and is
+ required for secure operation of Dynamic Address Reconfiguration
+ (ADD-IP) extension.
+
+ 1: Enable this extension.
+ 0: Disable this extension.
+
+ Default: 0
+
+prsctp_enable - BOOLEAN
+ Enable or disable the Partial Reliability extension (RFC3758) which
+ is used to notify peers that a given DATA should no longer be expected.
+
+ 1: Enable extension
+ 0: Disable
+
+ Default: 1
+
+max_burst - INTEGER
+ The limit of the number of new packets that can be initially sent. It
+ controls how bursty the generated traffic can be.
+
+ Default: 4
+
+association_max_retrans - INTEGER
+ Set the maximum number for retransmissions that an association can
+ attempt deciding that the remote end is unreachable. If this value
+ is exceeded, the association is terminated.
+
+ Default: 10
+
+max_init_retransmits - INTEGER
+ The maximum number of retransmissions of INIT and COOKIE-ECHO chunks
+ that an association will attempt before declaring the destination
+ unreachable and terminating.
+
+ Default: 8
+
+path_max_retrans - INTEGER
+ The maximum number of retransmissions that will be attempted on a given
+ path. Once this threshold is exceeded, the path is considered
+ unreachable, and new traffic will use a different path when the
+ association is multihomed.
+
+ Default: 5
+
+pf_retrans - INTEGER
+ The number of retransmissions that will be attempted on a given path
+ before traffic is redirected to an alternate transport (should one
+ exist). Note this is distinct from path_max_retrans, as a path that
+ passes the pf_retrans threshold can still be used. Its only
+ deprioritized when a transmission path is selected by the stack. This
+ setting is primarily used to enable fast failover mechanisms without
+ having to reduce path_max_retrans to a very low value. See:
+ http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt
+ for details. Note also that a value of pf_retrans > path_max_retrans
+ disables this feature
+
+ Default: 0
+
+rto_initial - INTEGER
+ The initial round trip timeout value in milliseconds that will be used
+ in calculating round trip times. This is the initial time interval
+ for retransmissions.
+
+ Default: 3000
+
+rto_max - INTEGER
+ The maximum value (in milliseconds) of the round trip timeout. This
+ is the largest time interval that can elapse between retransmissions.
+
+ Default: 60000
+
+rto_min - INTEGER
+ The minimum value (in milliseconds) of the round trip timeout. This
+ is the smallest time interval the can elapse between retransmissions.
+
+ Default: 1000
+
+hb_interval - INTEGER
+ The interval (in milliseconds) between HEARTBEAT chunks. These chunks
+ are sent at the specified interval on idle paths to probe the state of
+ a given path between 2 associations.
+
+ Default: 30000
+
+sack_timeout - INTEGER
+ The amount of time (in milliseconds) that the implementation will wait
+ to send a SACK.
+
+ Default: 200
+
+valid_cookie_life - INTEGER
+ The default lifetime of the SCTP cookie (in milliseconds). The cookie
+ is used during association establishment.
+
+ Default: 60000
+
+cookie_preserve_enable - BOOLEAN
+ Enable or disable the ability to extend the lifetime of the SCTP cookie
+ that is used during the establishment phase of SCTP association
+
+ 1: Enable cookie lifetime extension.
+ 0: Disable
+
+ Default: 1
+
+cookie_hmac_alg - STRING
+ Select the hmac algorithm used when generating the cookie value sent by
+ a listening sctp socket to a connecting client in the INIT-ACK chunk.
+ Valid values are:
+ * md5
+ * sha1
+ * none
+ Ability to assign md5 or sha1 as the selected alg is predicated on the
+ configuration of those algorithms at build time (CONFIG_CRYPTO_MD5 and
+ CONFIG_CRYPTO_SHA1).
+
+ Default: Dependent on configuration. MD5 if available, else SHA1 if
+ available, else none.
+
+rcvbuf_policy - INTEGER
+ Determines if the receive buffer is attributed to the socket or to
+ association. SCTP supports the capability to create multiple
+ associations on a single socket. When using this capability, it is
+ possible that a single stalled association that's buffering a lot
+ of data may block other associations from delivering their data by
+ consuming all of the receive buffer space. To work around this,
+ the rcvbuf_policy could be set to attribute the receiver buffer space
+ to each association instead of the socket. This prevents the described
+ blocking.
+
+ 1: rcvbuf space is per association
+ 0: rcvbuf space is per socket
+
+ Default: 0
+
+sndbuf_policy - INTEGER
+ Similar to rcvbuf_policy above, this applies to send buffer space.
+
+ 1: Send buffer is tracked per association
+ 0: Send buffer is tracked per socket.
+
+ Default: 0
+
+sctp_mem - vector of 3 INTEGERs: min, pressure, max
+ Number of pages allowed for queueing by all SCTP sockets.
+
+ min: Below this number of pages SCTP is not bothered about its
+ memory appetite. When amount of memory allocated by SCTP exceeds
+ this number, SCTP starts to moderate memory usage.
+
+ pressure: This value was introduced to follow format of tcp_mem.
+
+ max: Number of pages allowed for queueing by all SCTP sockets.
+
+ Default is calculated at boot time from amount of available memory.
+
+sctp_rmem - vector of 3 INTEGERs: min, default, max
+ Only the first value ("min") is used, "default" and "max" are
+ ignored.
+
+ min: Minimal size of receive buffer used by SCTP socket.
+ It is guaranteed to each SCTP socket (but not association) even
+ under moderate memory pressure.
+
+ Default: 1 page
+
+sctp_wmem - vector of 3 INTEGERs: min, default, max
+ Currently this tunable has no effect.
+
+addr_scope_policy - INTEGER
+ Control IPv4 address scoping - draft-stewart-tsvwg-sctp-ipv4-00
+
+ 0 - Disable IPv4 address scoping
+ 1 - Enable IPv4 address scoping
+ 2 - Follow draft but allow IPv4 private addresses
+ 3 - Follow draft but allow IPv4 link local addresses
+
+ Default: 1
+
+
+/proc/sys/net/core/*
+ Please see: Documentation/sysctl/net.txt for descriptions of these entries.
+
+
+/proc/sys/net/unix/*
+max_dgram_qlen - INTEGER
+ The maximum length of dgram socket receive queue
+
+ Default: 10
+
+
+UNDOCUMENTED:
+
+/proc/sys/net/irda/*
+ fast_poll_increase FIXME
+ warn_noreply_time FIXME
+ discovery_slots FIXME
+ slot_timeout FIXME
+ max_baud_rate FIXME
+ discovery_timeout FIXME
+ lap_keepalive_time FIXME
+ max_noreply_time FIXME
+ max_tx_data_size FIXME
+ max_tx_window FIXME
+ min_tx_turn_time FIXME
diff --git a/Documentation/networking/ip_dynaddr.txt b/Documentation/networking/ip_dynaddr.txt
new file mode 100644
index 000000000..45f3c1268
--- /dev/null
+++ b/Documentation/networking/ip_dynaddr.txt
@@ -0,0 +1,29 @@
+IP dynamic address hack-port v0.03
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+This stuff allows diald ONESHOT connections to get established by
+dynamically changing packet source address (and socket's if local procs).
+It is implemented for TCP diald-box connections(1) and IP_MASQuerading(2).
+
+If enabled[*] and forwarding interface has changed:
+ 1) Socket (and packet) source address is rewritten ON RETRANSMISSIONS
+ while in SYN_SENT state (diald-box processes).
+ 2) Out-bounded MASQueraded source address changes ON OUTPUT (when
+ internal host does retransmission) until a packet from outside is
+ received by the tunnel.
+
+This is specially helpful for auto dialup links (diald), where the
+``actual'' outgoing address is unknown at the moment the link is
+going up. So, the *same* (local AND masqueraded) connections requests that
+bring the link up will be able to get established.
+
+[*] At boot, by default no address rewriting is attempted.
+ To enable:
+ # echo 1 > /proc/sys/net/ipv4/ip_dynaddr
+ To enable verbose mode:
+ # echo 2 > /proc/sys/net/ipv4/ip_dynaddr
+ To disable (default)
+ # echo 0 > /proc/sys/net/ipv4/ip_dynaddr
+
+Enjoy!
+
+-- Juanjo <jjciarla@raiz.uncu.edu.ar>
diff --git a/Documentation/networking/ipddp.txt b/Documentation/networking/ipddp.txt
new file mode 100644
index 000000000..ba5c217ff
--- /dev/null
+++ b/Documentation/networking/ipddp.txt
@@ -0,0 +1,73 @@
+Text file for ipddp.c:
+ AppleTalk-IP Decapsulation and AppleTalk-IP Encapsulation
+
+This text file is written by Jay Schulist <jschlst@samba.org>
+
+Introduction
+------------
+
+AppleTalk-IP (IPDDP) is the method computers connected to AppleTalk
+networks can use to communicate via IP. AppleTalk-IP is simply IP datagrams
+inside AppleTalk packets.
+
+Through this driver you can either allow your Linux box to communicate
+IP over an AppleTalk network or you can provide IP gatewaying functions
+for your AppleTalk users.
+
+You can currently encapsulate or decapsulate AppleTalk-IP on LocalTalk,
+EtherTalk and PPPTalk. The only limit on the protocol is that of what
+kernel AppleTalk layer and drivers are available.
+
+Each mode requires its own user space software.
+
+Compiling AppleTalk-IP Decapsulation/Encapsulation
+=================================================
+
+AppleTalk-IP decapsulation needs to be compiled into your kernel. You
+will need to turn on AppleTalk-IP driver support. Then you will need to
+select ONE of the two options; IP to AppleTalk-IP encapsulation support or
+AppleTalk-IP to IP decapsulation support. If you compile the driver
+statically you will only be able to use the driver for the function you have
+enabled in the kernel. If you compile the driver as a module you can
+select what mode you want it to run in via a module loading param.
+ipddp_mode=1 for AppleTalk-IP encapsulation and ipddp_mode=2 for
+AppleTalk-IP to IP decapsulation.
+
+Basic instructions for user space tools
+=======================================
+
+I will briefly describe the operation of the tools, but you will
+need to consult the supporting documentation for each set of tools.
+
+Decapsulation - You will need to download a software package called
+MacGate. In this distribution there will be a tool called MacRoute
+which enables you to add routes to the kernel for your Macs by hand.
+Also the tool MacRegGateWay is included to register the
+proper IP Gateway and IP addresses for your machine. Included in this
+distribution is a patch to netatalk-1.4b2+asun2.0a17.2 (available from
+ftp.u.washington.edu/pub/user-supported/asun/) this patch is optional
+but it allows automatic adding and deleting of routes for Macs. (Handy
+for locations with large Mac installations)
+
+Encapsulation - You will need to download a software daemon called ipddpd.
+This software expects there to be an AppleTalk-IP gateway on the network.
+You will also need to add the proper routes to route your Linux box's IP
+traffic out the ipddp interface.
+
+Common Uses of ipddp.c
+----------------------
+Of course AppleTalk-IP decapsulation and encapsulation, but specifically
+decapsulation is being used most for connecting LocalTalk networks to
+IP networks. Although it has been used on EtherTalk networks to allow
+Macs that are only able to tunnel IP over EtherTalk.
+
+Encapsulation has been used to allow a Linux box stuck on a LocalTalk
+network to use IP. It should work equally well if you are stuck on an
+EtherTalk only network.
+
+Further Assistance
+-------------------
+You can contact me (Jay Schulist <jschlst@samba.org>) with any
+questions regarding decapsulation or encapsulation. Bradford W. Johnson
+<johns393@maroon.tc.umn.edu> originally wrote the ipddp.c driver for IP
+encapsulation in AppleTalk.
diff --git a/Documentation/networking/iphase.txt b/Documentation/networking/iphase.txt
new file mode 100644
index 000000000..670b72f16
--- /dev/null
+++ b/Documentation/networking/iphase.txt
@@ -0,0 +1,158 @@
+
+ READ ME FISRT
+ ATM (i)Chip IA Linux Driver Source
+--------------------------------------------------------------------------------
+ Read This Before You Begin!
+--------------------------------------------------------------------------------
+
+Description
+-----------
+
+This is the README file for the Interphase PCI ATM (i)Chip IA Linux driver
+source release.
+
+The features and limitations of this driver are as follows:
+ - A single VPI (VPI value of 0) is supported.
+ - Supports 4K VCs for the server board (with 512K control memory) and 1K
+ VCs for the client board (with 128K control memory).
+ - UBR, ABR and CBR service categories are supported.
+ - Only AAL5 is supported.
+ - Supports setting of PCR on the VCs.
+ - Multiple adapters in a system are supported.
+ - All variants of Interphase ATM PCI (i)Chip adapter cards are supported,
+ including x575 (OC3, control memory 128K , 512K and packet memory 128K,
+ 512K and 1M), x525 (UTP25) and x531 (DS3 and E3). See
+ http://www.iphase.com/
+ for details.
+ - Only x86 platforms are supported.
+ - SMP is supported.
+
+
+Before You Start
+----------------
+
+
+Installation
+------------
+
+1. Installing the adapters in the system
+ To install the ATM adapters in the system, follow the steps below.
+ a. Login as root.
+ b. Shut down the system and power off the system.
+ c. Install one or more ATM adapters in the system.
+ d. Connect each adapter to a port on an ATM switch. The green 'Link'
+ LED on the front panel of the adapter will be on if the adapter is
+ connected to the switch properly when the system is powered up.
+ e. Power on and boot the system.
+
+2. [ Removed ]
+
+3. Rebuild kernel with ABR support
+ [ a. and b. removed ]
+ c. Reconfigure the kernel, choose the Interphase ia driver through "make
+ menuconfig" or "make xconfig".
+ d. Rebuild the kernel, loadable modules and the atm tools.
+ e. Install the new built kernel and modules and reboot.
+
+4. Load the adapter hardware driver (ia driver) if it is built as a module
+ a. Login as root.
+ b. Change directory to /lib/modules/<kernel-version>/atm.
+ c. Run "insmod suni.o;insmod iphase.o"
+ The yellow 'status' LED on the front panel of the adapter will blink
+ while the driver is loaded in the system.
+ d. To verify that the 'ia' driver is loaded successfully, run the
+ following command:
+
+ cat /proc/atm/devices
+
+ If the driver is loaded successfully, the output of the command will
+ be similar to the following lines:
+
+ Itf Type ESI/"MAC"addr AAL(TX,err,RX,err,drop) ...
+ 0 ia xxxxxxxxx 0 ( 0 0 0 0 0 ) 5 ( 0 0 0 0 0 )
+
+ You can also check the system log file /var/log/messages for messages
+ related to the ATM driver.
+
+5. Ia Driver Configuration
+
+5.1 Configuration of adapter buffers
+ The (i)Chip boards have 3 different packet RAM size variants: 128K, 512K and
+ 1M. The RAM size decides the number of buffers and buffer size. The default
+ size and number of buffers are set as following:
+
+ Total Rx RAM Tx RAM Rx Buf Tx Buf Rx buf Tx buf
+ RAM size size size size size cnt cnt
+ -------- ------ ------ ------ ------ ------ ------
+ 128K 64K 64K 10K 10K 6 6
+ 512K 256K 256K 10K 10K 25 25
+ 1M 512K 512K 10K 10K 51 51
+
+ These setting should work well in most environments, but can be
+ changed by typing the following command:
+
+ insmod <IA_DIR>/ia.o IA_RX_BUF=<RX_CNT> IA_RX_BUF_SZ=<RX_SIZE> \
+ IA_TX_BUF=<TX_CNT> IA_TX_BUF_SZ=<TX_SIZE>
+ Where:
+ RX_CNT = number of receive buffers in the range (1-128)
+ RX_SIZE = size of receive buffers in the range (48-64K)
+ TX_CNT = number of transmit buffers in the range (1-128)
+ TX_SIZE = size of transmit buffers in the range (48-64K)
+
+ 1. Transmit and receive buffer size must be a multiple of 4.
+ 2. Care should be taken so that the memory required for the
+ transmit and receive buffers is less than or equal to the
+ total adapter packet memory.
+
+5.2 Turn on ia debug trace
+
+ When the ia driver is built with the CONFIG_ATM_IA_DEBUG flag, the driver
+ can provide more debug trace if needed. There is a bit mask variable,
+ IADebugFlag, which controls the output of the traces. You can find the bit
+ map of the IADebugFlag in iphase.h.
+ The debug trace can be turn on through the insmod command line option, for
+ example, "insmod iphase.o IADebugFlag=0xffffffff" can turn on all the debug
+ traces together with loading the driver.
+
+6. Ia Driver Test Using ttcp_atm and PVC
+
+ For the PVC setup, the test machines can either be connected back-to-back or
+ through a switch. If connected through the switch, the switch must be
+ configured for the PVC(s).
+
+ a. For UBR test:
+ At the test machine intended to receive data, type:
+ ttcp_atm -r -a -s 0.100
+ At the other test machine, type:
+ ttcp_atm -t -a -s 0.100 -n 10000
+ Run "ttcp_atm -h" to display more options of the ttcp_atm tool.
+ b. For ABR test:
+ It is the same as the UBR testing, but with an extra command option:
+ -Pabr:max_pcr=<xxx>
+ where:
+ xxx = the maximum peak cell rate, from 170 - 353207.
+ This option must be set on both the machines.
+ c. For CBR test:
+ It is the same as the UBR testing, but with an extra command option:
+ -Pcbr:max_pcr=<xxx>
+ where:
+ xxx = the maximum peak cell rate, from 170 - 353207.
+ This option may only be set on the transmit machine.
+
+
+OUTSTANDING ISSUES
+------------------
+
+
+
+Contact Information
+-------------------
+
+ Customer Support:
+ United States: Telephone: (214) 654-5555
+ Fax: (214) 654-5500
+ E-Mail: intouch@iphase.com
+ Europe: Telephone: 33 (0)1 41 15 44 00
+ Fax: 33 (0)1 41 15 12 13
+ World Wide Web: http://www.iphase.com
+ Anonymous FTP: ftp.iphase.com
diff --git a/Documentation/networking/ipsec.txt b/Documentation/networking/ipsec.txt
new file mode 100644
index 000000000..8dbc08b7e
--- /dev/null
+++ b/Documentation/networking/ipsec.txt
@@ -0,0 +1,38 @@
+
+Here documents known IPsec corner cases which need to be keep in mind when
+deploy various IPsec configuration in real world production environment.
+
+1. IPcomp: Small IP packet won't get compressed at sender, and failed on
+ policy check on receiver.
+
+Quote from RFC3173:
+2.2. Non-Expansion Policy
+
+ If the total size of a compressed payload and the IPComp header, as
+ defined in section 3, is not smaller than the size of the original
+ payload, the IP datagram MUST be sent in the original non-compressed
+ form. To clarify: If an IP datagram is sent non-compressed, no
+
+ IPComp header is added to the datagram. This policy ensures saving
+ the decompression processing cycles and avoiding incurring IP
+ datagram fragmentation when the expanded datagram is larger than the
+ MTU.
+
+ Small IP datagrams are likely to expand as a result of compression.
+ Therefore, a numeric threshold should be applied before compression,
+ where IP datagrams of size smaller than the threshold are sent in the
+ original form without attempting compression. The numeric threshold
+ is implementation dependent.
+
+Current IPComp implementation is indeed by the book, while as in practice
+when sending non-compressed packet to the peer(whether or not packet len
+is smaller than the threshold or the compressed len is large than original
+packet len), the packet is dropped when checking the policy as this packet
+matches the selector but not coming from any XFRM layer, i.e., with no
+security path. Such naked packet will not eventually make it to upper layer.
+The result is much more wired to the user when ping peer with different
+payload length.
+
+One workaround is try to set "level use" for each policy if user observed
+above scenario. The consequence of doing so is small packet(uncompressed)
+will skip policy checking on receiver side.
diff --git a/Documentation/networking/ipv6.txt b/Documentation/networking/ipv6.txt
new file mode 100644
index 000000000..6cd74fa55
--- /dev/null
+++ b/Documentation/networking/ipv6.txt
@@ -0,0 +1,72 @@
+
+Options for the ipv6 module are supplied as parameters at load time.
+
+Module options may be given as command line arguments to the insmod
+or modprobe command, but are usually specified in either
+/etc/modules.d/*.conf configuration files, or in a distro-specific
+configuration file.
+
+The available ipv6 module parameters are listed below. If a parameter
+is not specified the default value is used.
+
+The parameters are as follows:
+
+disable
+
+ Specifies whether to load the IPv6 module, but disable all
+ its functionality. This might be used when another module
+ has a dependency on the IPv6 module being loaded, but no
+ IPv6 addresses or operations are desired.
+
+ The possible values and their effects are:
+
+ 0
+ IPv6 is enabled.
+
+ This is the default value.
+
+ 1
+ IPv6 is disabled.
+
+ No IPv6 addresses will be added to interfaces, and
+ it will not be possible to open an IPv6 socket.
+
+ A reboot is required to enable IPv6.
+
+autoconf
+
+ Specifies whether to enable IPv6 address autoconfiguration
+ on all interfaces. This might be used when one does not wish
+ for addresses to be automatically generated from prefixes
+ received in Router Advertisements.
+
+ The possible values and their effects are:
+
+ 0
+ IPv6 address autoconfiguration is disabled on all interfaces.
+
+ Only the IPv6 loopback address (::1) and link-local addresses
+ will be added to interfaces.
+
+ 1
+ IPv6 address autoconfiguration is enabled on all interfaces.
+
+ This is the default value.
+
+disable_ipv6
+
+ Specifies whether to disable IPv6 on all interfaces.
+ This might be used when no IPv6 addresses are desired.
+
+ The possible values and their effects are:
+
+ 0
+ IPv6 is enabled on all interfaces.
+
+ This is the default value.
+
+ 1
+ IPv6 is disabled on all interfaces.
+
+ No IPv6 addresses will be added to interfaces.
+
diff --git a/Documentation/networking/ipvlan.txt b/Documentation/networking/ipvlan.txt
new file mode 100644
index 000000000..cf996394e
--- /dev/null
+++ b/Documentation/networking/ipvlan.txt
@@ -0,0 +1,107 @@
+
+ IPVLAN Driver HOWTO
+
+Initial Release:
+ Mahesh Bandewar <maheshb AT google.com>
+
+1. Introduction:
+ This is conceptually very similar to the macvlan driver with one major
+exception of using L3 for mux-ing /demux-ing among slaves. This property makes
+the master device share the L2 with it's slave devices. I have developed this
+driver in conjuntion with network namespaces and not sure if there is use case
+outside of it.
+
+
+2. Building and Installation:
+ In order to build the driver, please select the config item CONFIG_IPVLAN.
+The driver can be built into the kernel (CONFIG_IPVLAN=y) or as a module
+(CONFIG_IPVLAN=m).
+
+
+3. Configuration:
+ There are no module parameters for this driver and it can be configured
+using IProute2/ip utility.
+
+ ip link add link <master-dev> <slave-dev> type ipvlan mode { l2 | L3 }
+
+ e.g. ip link add link ipvl0 eth0 type ipvlan mode l2
+
+
+4. Operating modes:
+ IPvlan has two modes of operation - L2 and L3. For a given master device,
+you can select one of these two modes and all slaves on that master will
+operate in the same (selected) mode. The RX mode is almost identical except
+that in L3 mode the slaves wont receive any multicast / broadcast traffic.
+L3 mode is more restrictive since routing is controlled from the other (mostly)
+default namespace.
+
+4.1 L2 mode:
+ In this mode TX processing happens on the stack instance attached to the
+slave device and packets are switched and queued to the master device to send
+out. In this mode the slaves will RX/TX multicast and broadcast (if applicable)
+as well.
+
+4.2 L3 mode:
+ In this mode TX processing upto L3 happens on the stack instance attached
+to the slave device and packets are switched to the stack instance of the
+master device for the L2 processing and routing from that instance will be
+used before packets are queued on the outbound device. In this mode the slaves
+will not receive nor can send multicast / broadcast traffic.
+
+
+5. What to choose (macvlan vs. ipvlan)?
+ These two devices are very similar in many regards and the specific use
+case could very well define which device to choose. if one of the following
+situations defines your use case then you can choose to use ipvlan -
+ (a) The Linux host that is connected to the external switch / router has
+policy configured that allows only one mac per port.
+ (b) No of virtual devices created on a master exceed the mac capacity and
+puts the NIC in promiscous mode and degraded performance is a concern.
+ (c) If the slave device is to be put into the hostile / untrusted network
+namespace where L2 on the slave could be changed / misused.
+
+
+6. Example configuration:
+
+ +=============================================================+
+ | Host: host1 |
+ | |
+ | +----------------------+ +----------------------+ |
+ | | NS:ns0 | | NS:ns1 | |
+ | | | | | |
+ | | | | | |
+ | | ipvl0 | | ipvl1 | |
+ | +----------#-----------+ +-----------#----------+ |
+ | # # |
+ | ################################ |
+ | # eth0 |
+ +==============================#==============================+
+
+
+ (a) Create two network namespaces - ns0, ns1
+ ip netns add ns0
+ ip netns add ns1
+
+ (b) Create two ipvlan slaves on eth0 (master device)
+ ip link add link eth0 ipvl0 type ipvlan mode l2
+ ip link add link eth0 ipvl1 type ipvlan mode l2
+
+ (c) Assign slaves to the respective network namespaces
+ ip link set dev ipvl0 netns ns0
+ ip link set dev ipvl1 netns ns1
+
+ (d) Now switch to the namespace (ns0 or ns1) to configure the slave devices
+ - For ns0
+ (1) ip netns exec ns0 bash
+ (2) ip link set dev ipvl0 up
+ (3) ip link set dev lo up
+ (4) ip -4 addr add 127.0.0.1 dev lo
+ (5) ip -4 addr add $IPADDR dev ipvl0
+ (6) ip -4 route add default via $ROUTER dev ipvl0
+ - For ns1
+ (1) ip netns exec ns1 bash
+ (2) ip link set dev ipvl1 up
+ (3) ip link set dev lo up
+ (4) ip -4 addr add 127.0.0.1 dev lo
+ (5) ip -4 addr add $IPADDR dev ipvl1
+ (6) ip -4 route add default via $ROUTER dev ipvl1
diff --git a/Documentation/networking/ipvs-sysctl.txt b/Documentation/networking/ipvs-sysctl.txt
new file mode 100644
index 000000000..3ba709531
--- /dev/null
+++ b/Documentation/networking/ipvs-sysctl.txt
@@ -0,0 +1,232 @@
+/proc/sys/net/ipv4/vs/* Variables:
+
+am_droprate - INTEGER
+ default 10
+
+ It sets the always mode drop rate, which is used in the mode 3
+ of the drop_rate defense.
+
+amemthresh - INTEGER
+ default 1024
+
+ It sets the available memory threshold (in pages), which is
+ used in the automatic modes of defense. When there is no
+ enough available memory, the respective strategy will be
+ enabled and the variable is automatically set to 2, otherwise
+ the strategy is disabled and the variable is set to 1.
+
+backup_only - BOOLEAN
+ 0 - disabled (default)
+ not 0 - enabled
+
+ If set, disable the director function while the server is
+ in backup mode to avoid packet loops for DR/TUN methods.
+
+conn_reuse_mode - INTEGER
+ 1 - default
+
+ Controls how ipvs will deal with connections that are detected
+ port reuse. It is a bitmap, with the values being:
+
+ 0: disable any special handling on port reuse. The new
+ connection will be delivered to the same real server that was
+ servicing the previous connection. This will effectively
+ disable expire_nodest_conn.
+
+ bit 1: enable rescheduling of new connections when it is safe.
+ That is, whenever expire_nodest_conn and for TCP sockets, when
+ the connection is in TIME_WAIT state (which is only possible if
+ you use NAT mode).
+
+ bit 2: it is bit 1 plus, for TCP connections, when connections
+ are in FIN_WAIT state, as this is the last state seen by load
+ balancer in Direct Routing mode. This bit helps on adding new
+ real servers to a very busy cluster.
+
+conntrack - BOOLEAN
+ 0 - disabled (default)
+ not 0 - enabled
+
+ If set, maintain connection tracking entries for
+ connections handled by IPVS.
+
+ This should be enabled if connections handled by IPVS are to be
+ also handled by stateful firewall rules. That is, iptables rules
+ that make use of connection tracking. It is a performance
+ optimisation to disable this setting otherwise.
+
+ Connections handled by the IPVS FTP application module
+ will have connection tracking entries regardless of this setting.
+
+ Only available when IPVS is compiled with CONFIG_IP_VS_NFCT enabled.
+
+cache_bypass - BOOLEAN
+ 0 - disabled (default)
+ not 0 - enabled
+
+ If it is enabled, forward packets to the original destination
+ directly when no cache server is available and destination
+ address is not local (iph->daddr is RTN_UNICAST). It is mostly
+ used in transparent web cache cluster.
+
+debug_level - INTEGER
+ 0 - transmission error messages (default)
+ 1 - non-fatal error messages
+ 2 - configuration
+ 3 - destination trash
+ 4 - drop entry
+ 5 - service lookup
+ 6 - scheduling
+ 7 - connection new/expire, lookup and synchronization
+ 8 - state transition
+ 9 - binding destination, template checks and applications
+ 10 - IPVS packet transmission
+ 11 - IPVS packet handling (ip_vs_in/ip_vs_out)
+ 12 or more - packet traversal
+
+ Only available when IPVS is compiled with CONFIG_IP_VS_DEBUG enabled.
+
+ Higher debugging levels include the messages for lower debugging
+ levels, so setting debug level 2, includes level 0, 1 and 2
+ messages. Thus, logging becomes more and more verbose the higher
+ the level.
+
+drop_entry - INTEGER
+ 0 - disabled (default)
+
+ The drop_entry defense is to randomly drop entries in the
+ connection hash table, just in order to collect back some
+ memory for new connections. In the current code, the
+ drop_entry procedure can be activated every second, then it
+ randomly scans 1/32 of the whole and drops entries that are in
+ the SYN-RECV/SYNACK state, which should be effective against
+ syn-flooding attack.
+
+ The valid values of drop_entry are from 0 to 3, where 0 means
+ that this strategy is always disabled, 1 and 2 mean automatic
+ modes (when there is no enough available memory, the strategy
+ is enabled and the variable is automatically set to 2,
+ otherwise the strategy is disabled and the variable is set to
+ 1), and 3 means that that the strategy is always enabled.
+
+drop_packet - INTEGER
+ 0 - disabled (default)
+
+ The drop_packet defense is designed to drop 1/rate packets
+ before forwarding them to real servers. If the rate is 1, then
+ drop all the incoming packets.
+
+ The value definition is the same as that of the drop_entry. In
+ the automatic mode, the rate is determined by the follow
+ formula: rate = amemthresh / (amemthresh - available_memory)
+ when available memory is less than the available memory
+ threshold. When the mode 3 is set, the always mode drop rate
+ is controlled by the /proc/sys/net/ipv4/vs/am_droprate.
+
+expire_nodest_conn - BOOLEAN
+ 0 - disabled (default)
+ not 0 - enabled
+
+ The default value is 0, the load balancer will silently drop
+ packets when its destination server is not available. It may
+ be useful, when user-space monitoring program deletes the
+ destination server (because of server overload or wrong
+ detection) and add back the server later, and the connections
+ to the server can continue.
+
+ If this feature is enabled, the load balancer will expire the
+ connection immediately when a packet arrives and its
+ destination server is not available, then the client program
+ will be notified that the connection is closed. This is
+ equivalent to the feature some people requires to flush
+ connections when its destination is not available.
+
+expire_quiescent_template - BOOLEAN
+ 0 - disabled (default)
+ not 0 - enabled
+
+ When set to a non-zero value, the load balancer will expire
+ persistent templates when the destination server is quiescent.
+ This may be useful, when a user makes a destination server
+ quiescent by setting its weight to 0 and it is desired that
+ subsequent otherwise persistent connections are sent to a
+ different destination server. By default new persistent
+ connections are allowed to quiescent destination servers.
+
+ If this feature is enabled, the load balancer will expire the
+ persistence template if it is to be used to schedule a new
+ connection and the destination server is quiescent.
+
+nat_icmp_send - BOOLEAN
+ 0 - disabled (default)
+ not 0 - enabled
+
+ It controls sending icmp error messages (ICMP_DEST_UNREACH)
+ for VS/NAT when the load balancer receives packets from real
+ servers but the connection entries don't exist.
+
+secure_tcp - INTEGER
+ 0 - disabled (default)
+
+ The secure_tcp defense is to use a more complicated TCP state
+ transition table. For VS/NAT, it also delays entering the
+ TCP ESTABLISHED state until the three way handshake is completed.
+
+ The value definition is the same as that of drop_entry and
+ drop_packet.
+
+sync_threshold - INTEGER
+ default 3
+
+ It sets synchronization threshold, which is the minimum number
+ of incoming packets that a connection needs to receive before
+ the connection will be synchronized. A connection will be
+ synchronized, every time the number of its incoming packets
+ modulus 50 equals the threshold. The range of the threshold is
+ from 0 to 49.
+
+snat_reroute - BOOLEAN
+ 0 - disabled
+ not 0 - enabled (default)
+
+ If enabled, recalculate the route of SNATed packets from
+ realservers so that they are routed as if they originate from the
+ director. Otherwise they are routed as if they are forwarded by the
+ director.
+
+ If policy routing is in effect then it is possible that the route
+ of a packet originating from a director is routed differently to a
+ packet being forwarded by the director.
+
+ If policy routing is not in effect then the recalculated route will
+ always be the same as the original route so it is an optimisation
+ to disable snat_reroute and avoid the recalculation.
+
+sync_persist_mode - INTEGER
+ default 0
+
+ Controls the synchronisation of connections when using persistence
+
+ 0: All types of connections are synchronised
+ 1: Attempt to reduce the synchronisation traffic depending on
+ the connection type. For persistent services avoid synchronisation
+ for normal connections, do it only for persistence templates.
+ In such case, for TCP and SCTP it may need enabling sloppy_tcp and
+ sloppy_sctp flags on backup servers. For non-persistent services
+ such optimization is not applied, mode 0 is assumed.
+
+sync_version - INTEGER
+ default 1
+
+ The version of the synchronisation protocol used when sending
+ synchronisation messages.
+
+ 0 selects the original synchronisation protocol (version 0). This
+ should be used when sending synchronisation messages to a legacy
+ system that only understands the original synchronisation protocol.
+
+ 1 selects the current synchronisation protocol (version 1). This
+ should be used where possible.
+
+ Kernels with this sync_version entry are able to receive messages
+ of both version 1 and version 2 of the synchronisation protocol.
diff --git a/Documentation/networking/irda.txt b/Documentation/networking/irda.txt
new file mode 100644
index 000000000..bff26c138
--- /dev/null
+++ b/Documentation/networking/irda.txt
@@ -0,0 +1,10 @@
+To use the IrDA protocols within Linux you will need to get a suitable copy
+of the IrDA Utilities. More detailed information about these and associated
+programs can be found on http://irda.sourceforge.net/
+
+For more information about how to use the IrDA protocol stack, see the
+Linux Infrared HOWTO by Werner Heuser <wehe@tuxmobil.org>:
+<http://www.tuxmobil.org/Infrared-HOWTO/Infrared-HOWTO.html>
+
+There is an active mailing list for discussing Linux-IrDA matters called
+ irda-users@lists.sourceforge.net
diff --git a/Documentation/networking/ixgb.txt b/Documentation/networking/ixgb.txt
new file mode 100644
index 000000000..9b4a10a1c
--- /dev/null
+++ b/Documentation/networking/ixgb.txt
@@ -0,0 +1,433 @@
+Linux Base Driver for 10 Gigabit Intel(R) Ethernet Network Connection
+=====================================================================
+
+March 14, 2011
+
+
+Contents
+========
+
+- In This Release
+- Identifying Your Adapter
+- Building and Installation
+- Command Line Parameters
+- Improving Performance
+- Additional Configurations
+- Known Issues/Troubleshooting
+- Support
+
+
+
+In This Release
+===============
+
+This file describes the ixgb Linux Base Driver for the 10 Gigabit Intel(R)
+Network Connection. This driver includes support for Itanium(R)2-based
+systems.
+
+For questions related to hardware requirements, refer to the documentation
+supplied with your 10 Gigabit adapter. All hardware requirements listed apply
+to use with Linux.
+
+The following features are available in this kernel:
+ - Native VLANs
+ - Channel Bonding (teaming)
+ - SNMP
+
+Channel Bonding documentation can be found in the Linux kernel source:
+/Documentation/networking/bonding.txt
+
+The driver information previously displayed in the /proc filesystem is not
+supported in this release. Alternatively, you can use ethtool (version 1.6
+or later), lspci, and iproute2 to obtain the same information.
+
+Instructions on updating ethtool can be found in the section "Additional
+Configurations" later in this document.
+
+
+Identifying Your Adapter
+========================
+
+The following Intel network adapters are compatible with the drivers in this
+release:
+
+Controller Adapter Name Physical Layer
+---------- ------------ --------------
+82597EX Intel(R) PRO/10GbE LR/SR/CX4 10G Base-LR (1310 nm optical fiber)
+ Server Adapters 10G Base-SR (850 nm optical fiber)
+ 10G Base-CX4(twin-axial copper cabling)
+
+For more information on how to identify your adapter, go to the Adapter &
+Driver ID Guide at:
+
+ http://support.intel.com/support/network/sb/CS-012904.htm
+
+
+Building and Installation
+=========================
+
+select m for "Intel(R) PRO/10GbE support" located at:
+ Location:
+ -> Device Drivers
+ -> Network device support (NETDEVICES [=y])
+ -> Ethernet (10000 Mbit) (NETDEV_10000 [=y])
+1. make modules && make modules_install
+
+2. Load the module:
+
+    modprobe ixgb <parameter>=<value>
+
+ The insmod command can be used if the full
+ path to the driver module is specified. For example:
+
+ insmod /lib/modules/<KERNEL VERSION>/kernel/drivers/net/ixgb/ixgb.ko
+
+ With 2.6 based kernels also make sure that older ixgb drivers are
+ removed from the kernel, before loading the new module:
+
+ rmmod ixgb; modprobe ixgb
+
+3. Assign an IP address to the interface by entering the following, where
+ x is the interface number:
+
+ ip addr add ethx <IP_address>
+
+4. Verify that the interface works. Enter the following, where <IP_address>
+ is the IP address for another machine on the same subnet as the interface
+ that is being tested:
+
+ ping <IP_address>
+
+
+Command Line Parameters
+=======================
+
+If the driver is built as a module, the following optional parameters are
+used by entering them on the command line with the modprobe command using
+this syntax:
+
+ modprobe ixgb [<option>=<VAL1>,<VAL2>,...]
+
+For example, with two 10GbE PCI adapters, entering:
+
+ modprobe ixgb TxDescriptors=80,128
+
+loads the ixgb driver with 80 TX resources for the first adapter and 128 TX
+resources for the second adapter.
+
+The default value for each parameter is generally the recommended setting,
+unless otherwise noted.
+
+FlowControl
+Valid Range: 0-3 (0=none, 1=Rx only, 2=Tx only, 3=Rx&Tx)
+Default: Read from the EEPROM
+ If EEPROM is not detected, default is 1
+ This parameter controls the automatic generation(Tx) and response(Rx) to
+ Ethernet PAUSE frames. There are hardware bugs associated with enabling
+ Tx flow control so beware.
+
+RxDescriptors
+Valid Range: 64-512
+Default Value: 512
+ This value is the number of receive descriptors allocated by the driver.
+ Increasing this value allows the driver to buffer more incoming packets.
+ Each descriptor is 16 bytes. A receive buffer is also allocated for
+ each descriptor and can be either 2048, 4056, 8192, or 16384 bytes,
+ depending on the MTU setting. When the MTU size is 1500 or less, the
+ receive buffer size is 2048 bytes. When the MTU is greater than 1500 the
+ receive buffer size will be either 4056, 8192, or 16384 bytes. The
+ maximum MTU size is 16114.
+
+RxIntDelay
+Valid Range: 0-65535 (0=off)
+Default Value: 72
+ This value delays the generation of receive interrupts in units of
+ 0.8192 microseconds. Receive interrupt reduction can improve CPU
+ efficiency if properly tuned for specific network traffic. Increasing
+ this value adds extra latency to frame reception and can end up
+ decreasing the throughput of TCP traffic. If the system is reporting
+ dropped receives, this value may be set too high, causing the driver to
+ run out of available receive descriptors.
+
+TxDescriptors
+Valid Range: 64-4096
+Default Value: 256
+ This value is the number of transmit descriptors allocated by the driver.
+ Increasing this value allows the driver to queue more transmits. Each
+ descriptor is 16 bytes.
+
+XsumRX
+Valid Range: 0-1
+Default Value: 1
+ A value of '1' indicates that the driver should enable IP checksum
+ offload for received packets (both UDP and TCP) to the adapter hardware.
+
+
+Improving Performance
+=====================
+
+With the 10 Gigabit server adapters, the default Linux configuration will
+very likely limit the total available throughput artificially. There is a set
+of configuration changes that, when applied together, will increase the ability
+of Linux to transmit and receive data. The following enhancements were
+originally acquired from settings published at http://www.spec.org/web99/ for
+various submitted results using Linux.
+
+NOTE: These changes are only suggestions, and serve as a starting point for
+ tuning your network performance.
+
+The changes are made in three major ways, listed in order of greatest effect:
+- Use ip link to modify the mtu (maximum transmission unit) and the txqueuelen
+ parameter.
+- Use sysctl to modify /proc parameters (essentially kernel tuning)
+- Use setpci to modify the MMRBC field in PCI-X configuration space to increase
+ transmit burst lengths on the bus.
+
+NOTE: setpci modifies the adapter's configuration registers to allow it to read
+up to 4k bytes at a time (for transmits). However, for some systems the
+behavior after modifying this register may be undefined (possibly errors of
+some kind). A power-cycle, hard reset or explicitly setting the e6 register
+back to 22 (setpci -d 8086:1a48 e6.b=22) may be required to get back to a
+stable configuration.
+
+- COPY these lines and paste them into ixgb_perf.sh:
+#!/bin/bash
+echo "configuring network performance , edit this file to change the interface
+or device ID of 10GbE card"
+# set mmrbc to 4k reads, modify only Intel 10GbE device IDs
+# replace 1a48 with appropriate 10GbE device's ID installed on the system,
+# if needed.
+setpci -d 8086:1a48 e6.b=2e
+# set the MTU (max transmission unit) - it requires your switch and clients
+# to change as well.
+# set the txqueuelen
+# your ixgb adapter should be loaded as eth1 for this to work, change if needed
+ip li set dev eth1 mtu 9000 txqueuelen 1000 up
+# call the sysctl utility to modify /proc/sys entries
+sysctl -p ./sysctl_ixgb.conf
+- END ixgb_perf.sh
+
+- COPY these lines and paste them into sysctl_ixgb.conf:
+# some of the defaults may be different for your kernel
+# call this file with sysctl -p <this file>
+# these are just suggested values that worked well to increase throughput in
+# several network benchmark tests, your mileage may vary
+
+### IPV4 specific settings
+# turn TCP timestamp support off, default 1, reduces CPU use
+net.ipv4.tcp_timestamps = 0
+# turn SACK support off, default on
+# on systems with a VERY fast bus -> memory interface this is the big gainer
+net.ipv4.tcp_sack = 0
+# set min/default/max TCP read buffer, default 4096 87380 174760
+net.ipv4.tcp_rmem = 10000000 10000000 10000000
+# set min/pressure/max TCP write buffer, default 4096 16384 131072
+net.ipv4.tcp_wmem = 10000000 10000000 10000000
+# set min/pressure/max TCP buffer space, default 31744 32256 32768
+net.ipv4.tcp_mem = 10000000 10000000 10000000
+
+### CORE settings (mostly for socket and UDP effect)
+# set maximum receive socket buffer size, default 131071
+net.core.rmem_max = 524287
+# set maximum send socket buffer size, default 131071
+net.core.wmem_max = 524287
+# set default receive socket buffer size, default 65535
+net.core.rmem_default = 524287
+# set default send socket buffer size, default 65535
+net.core.wmem_default = 524287
+# set maximum amount of option memory buffers, default 10240
+net.core.optmem_max = 524287
+# set number of unprocessed input packets before kernel starts dropping them; default 300
+net.core.netdev_max_backlog = 300000
+- END sysctl_ixgb.conf
+
+Edit the ixgb_perf.sh script if necessary to change eth1 to whatever interface
+your ixgb driver is using and/or replace '1a48' with appropriate 10GbE device's
+ID installed on the system.
+
+NOTE: Unless these scripts are added to the boot process, these changes will
+ only last only until the next system reboot.
+
+
+Resolving Slow UDP Traffic
+--------------------------
+If your server does not seem to be able to receive UDP traffic as fast as it
+can receive TCP traffic, it could be because Linux, by default, does not set
+the network stack buffers as large as they need to be to support high UDP
+transfer rates. One way to alleviate this problem is to allow more memory to
+be used by the IP stack to store incoming data.
+
+For instance, use the commands:
+ sysctl -w net.core.rmem_max=262143
+and
+ sysctl -w net.core.rmem_default=262143
+to increase the read buffer memory max and default to 262143 (256k - 1) from
+defaults of max=131071 (128k - 1) and default=65535 (64k - 1). These variables
+will increase the amount of memory used by the network stack for receives, and
+can be increased significantly more if necessary for your application.
+
+
+Additional Configurations
+=========================
+
+ Configuring the Driver on Different Distributions
+ -------------------------------------------------
+ Configuring a network driver to load properly when the system is started is
+ distribution dependent. Typically, the configuration process involves adding
+ an alias line to /etc/modprobe.conf as well as editing other system startup
+ scripts and/or configuration files. Many popular Linux distributions ship
+ with tools to make these changes for you. To learn the proper way to
+ configure a network device for your system, refer to your distribution
+ documentation. If during this process you are asked for the driver or module
+ name, the name for the Linux Base Driver for the Intel 10GbE Family of
+ Adapters is ixgb.
+
+ Viewing Link Messages
+ ---------------------
+ Link messages will not be displayed to the console if the distribution is
+ restricting system messages. In order to see network driver link messages on
+ your console, set dmesg to eight by entering the following:
+
+ dmesg -n 8
+
+ NOTE: This setting is not saved across reboots.
+
+
+ Jumbo Frames
+ ------------
+ The driver supports Jumbo Frames for all adapters. Jumbo Frames support is
+ enabled by changing the MTU to a value larger than the default of 1500.
+ The maximum value for the MTU is 16114. Use the ip command to
+ increase the MTU size. For example:
+
+ ip li set dev ethx mtu 9000
+
+ The maximum MTU setting for Jumbo Frames is 16114. This value coincides
+ with the maximum Jumbo Frames size of 16128.
+
+
+ ethtool
+ -------
+ The driver utilizes the ethtool interface for driver configuration and
+ diagnostics, as well as displaying statistical information. The ethtool
+ version 1.6 or later is required for this functionality.
+
+ The latest release of ethtool can be found from
+ http://ftp.kernel.org/pub/software/network/ethtool/
+
+ NOTE: The ethtool version 1.6 only supports a limited set of ethtool options.
+ Support for a more complete ethtool feature set can be enabled by
+ upgrading to the latest version.
+
+
+ NAPI
+ ----
+
+ NAPI (Rx polling mode) is supported in the ixgb driver. NAPI is enabled
+ or disabled based on the configuration of the kernel. see CONFIG_IXGB_NAPI
+
+ See www.cyberus.ca/~hadi/usenix-paper.tgz for more information on NAPI.
+
+
+Known Issues/Troubleshooting
+============================
+
+ NOTE: After installing the driver, if your Intel Network Connection is not
+ working, verify in the "In This Release" section of the readme that you have
+ installed the correct driver.
+
+ Intel(R) PRO/10GbE CX4 Server Adapter Cable Interoperability Issue with
+ Fujitsu XENPAK Module in SmartBits Chassis
+ ---------------------------------------------------------------------
+ Excessive CRC errors may be observed if the Intel(R) PRO/10GbE CX4
+ Server adapter is connected to a Fujitsu XENPAK CX4 module in a SmartBits
+ chassis using 15 m/24AWG cable assemblies manufactured by Fujitsu or Leoni.
+ The CRC errors may be received either by the Intel(R) PRO/10GbE CX4
+ Server adapter or the SmartBits. If this situation occurs using a different
+ cable assembly may resolve the issue.
+
+ CX4 Server Adapter Cable Interoperability Issues with HP Procurve 3400cl
+ Switch Port
+ ------------------------------------------------------------------------
+ Excessive CRC errors may be observed if the Intel(R) PRO/10GbE CX4 Server
+ adapter is connected to an HP Procurve 3400cl switch port using short cables
+ (1 m or shorter). If this situation occurs, using a longer cable may resolve
+ the issue.
+
+ Excessive CRC errors may be observed using Fujitsu 24AWG cable assemblies that
+ Are 10 m or longer or where using a Leoni 15 m/24AWG cable assembly. The CRC
+ errors may be received either by the CX4 Server adapter or at the switch. If
+ this situation occurs, using a different cable assembly may resolve the issue.
+
+
+ Jumbo Frames System Requirement
+ -------------------------------
+ Memory allocation failures have been observed on Linux systems with 64 MB
+ of RAM or less that are running Jumbo Frames. If you are using Jumbo
+ Frames, your system may require more than the advertised minimum
+ requirement of 64 MB of system memory.
+
+
+ Performance Degradation with Jumbo Frames
+ -----------------------------------------
+ Degradation in throughput performance may be observed in some Jumbo frames
+ environments. If this is observed, increasing the application's socket buffer
+ size and/or increasing the /proc/sys/net/ipv4/tcp_*mem entry values may help.
+ See the specific application manual and /usr/src/linux*/Documentation/
+ networking/ip-sysctl.txt for more details.
+
+
+ Allocating Rx Buffers when Using Jumbo Frames
+ ---------------------------------------------
+ Allocating Rx buffers when using Jumbo Frames on 2.6.x kernels may fail if
+ the available memory is heavily fragmented. This issue may be seen with PCI-X
+ adapters or with packet split disabled. This can be reduced or eliminated
+ by changing the amount of available memory for receive buffer allocation, by
+ increasing /proc/sys/vm/min_free_kbytes.
+
+
+ Multiple Interfaces on Same Ethernet Broadcast Network
+ ------------------------------------------------------
+ Due to the default ARP behavior on Linux, it is not possible to have
+ one system on two IP networks in the same Ethernet broadcast domain
+ (non-partitioned switch) behave as expected. All Ethernet interfaces
+ will respond to IP traffic for any IP address assigned to the system.
+ This results in unbalanced receive traffic.
+
+ If you have multiple interfaces in a server, do either of the following:
+
+ - Turn on ARP filtering by entering:
+ echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
+
+ - Install the interfaces in separate broadcast domains - either in
+ different switches or in a switch partitioned to VLANs.
+
+
+ UDP Stress Test Dropped Packet Issue
+ --------------------------------------
+ Under small packets UDP stress test with 10GbE driver, the Linux system
+ may drop UDP packets due to the fullness of socket buffers. You may want
+ to change the driver's Flow Control variables to the minimum value for
+ controlling packet reception.
+
+
+ Tx Hangs Possible Under Stress
+ ------------------------------
+ Under stress conditions, if TX hangs occur, turning off TSO
+ "ethtool -K eth0 tso off" may resolve the problem.
+
+
+Support
+=======
+
+For general information, go to the Intel support website at:
+
+ http://support.intel.com
+
+or the Intel Wired Networking project hosted by Sourceforge at:
+
+ http://sourceforge.net/projects/e1000
+
+If an issue is identified with the released source code on the supported
+kernel with a supported adapter, email the specific information related
+to the issue to e1000-devel@lists.sf.net
diff --git a/Documentation/networking/ixgbe.txt b/Documentation/networking/ixgbe.txt
new file mode 100644
index 000000000..6f0cb57b5
--- /dev/null
+++ b/Documentation/networking/ixgbe.txt
@@ -0,0 +1,349 @@
+Linux* Base Driver for the Intel(R) Ethernet 10 Gigabit PCI Express Family of
+Adapters
+=============================================================================
+
+Intel 10 Gigabit Linux driver.
+Copyright(c) 1999 - 2013 Intel Corporation.
+
+Contents
+========
+
+- Identifying Your Adapter
+- Additional Configurations
+- Performance Tuning
+- Known Issues
+- Support
+
+Identifying Your Adapter
+========================
+
+The driver in this release is compatible with 82598, 82599 and X540-based
+Intel Network Connections.
+
+For more information on how to identify your adapter, go to the Adapter &
+Driver ID Guide at:
+
+ http://support.intel.com/support/network/sb/CS-012904.htm
+
+SFP+ Devices with Pluggable Optics
+----------------------------------
+
+82599-BASED ADAPTERS
+
+NOTES: If your 82599-based Intel(R) Network Adapter came with Intel optics, or
+is an Intel(R) Ethernet Server Adapter X520-2, then it only supports Intel
+optics and/or the direct attach cables listed below.
+
+When 82599-based SFP+ devices are connected back to back, they should be set to
+the same Speed setting via ethtool. Results may vary if you mix speed settings.
+82598-based adapters support all passive direct attach cables that comply
+with SFF-8431 v4.1 and SFF-8472 v10.4 specifications. Active direct attach
+cables are not supported.
+
+Supplier Type Part Numbers
+
+SR Modules
+Intel DUAL RATE 1G/10G SFP+ SR (bailed) FTLX8571D3BCV-IT
+Intel DUAL RATE 1G/10G SFP+ SR (bailed) AFBR-703SDDZ-IN1
+Intel DUAL RATE 1G/10G SFP+ SR (bailed) AFBR-703SDZ-IN2
+LR Modules
+Intel DUAL RATE 1G/10G SFP+ LR (bailed) FTLX1471D3BCV-IT
+Intel DUAL RATE 1G/10G SFP+ LR (bailed) AFCT-701SDDZ-IN1
+Intel DUAL RATE 1G/10G SFP+ LR (bailed) AFCT-701SDZ-IN2
+
+The following is a list of 3rd party SFP+ modules and direct attach cables that
+have received some testing. Not all modules are applicable to all devices.
+
+Supplier Type Part Numbers
+
+Finisar SFP+ SR bailed, 10g single rate FTLX8571D3BCL
+Avago SFP+ SR bailed, 10g single rate AFBR-700SDZ
+Finisar SFP+ LR bailed, 10g single rate FTLX1471D3BCL
+
+Finisar DUAL RATE 1G/10G SFP+ SR (No Bail) FTLX8571D3QCV-IT
+Avago DUAL RATE 1G/10G SFP+ SR (No Bail) AFBR-703SDZ-IN1
+Finisar DUAL RATE 1G/10G SFP+ LR (No Bail) FTLX1471D3QCV-IT
+Avago DUAL RATE 1G/10G SFP+ LR (No Bail) AFCT-701SDZ-IN1
+Finistar 1000BASE-T SFP FCLF8522P2BTL
+Avago 1000BASE-T SFP ABCU-5710RZ
+
+82599-based adapters support all passive and active limiting direct attach
+cables that comply with SFF-8431 v4.1 and SFF-8472 v10.4 specifications.
+
+Laser turns off for SFP+ when device is down
+-------------------------------------------
+"ip link set down" turns off the laser for 82599-based SFP+ fiber adapters.
+"ip link set up" turns on the laser.
+
+
+82598-BASED ADAPTERS
+
+NOTES for 82598-Based Adapters:
+- Intel(R) Network Adapters that support removable optical modules only support
+ their original module type (i.e., the Intel(R) 10 Gigabit SR Dual Port
+ Express Module only supports SR optical modules). If you plug in a different
+ type of module, the driver will not load.
+- Hot Swapping/hot plugging optical modules is not supported.
+- Only single speed, 10 gigabit modules are supported.
+- LAN on Motherboard (LOMs) may support DA, SR, or LR modules. Other module
+ types are not supported. Please see your system documentation for details.
+
+The following is a list of 3rd party SFP+ modules and direct attach cables that
+have received some testing. Not all modules are applicable to all devices.
+
+Supplier Type Part Numbers
+
+Finisar SFP+ SR bailed, 10g single rate FTLX8571D3BCL
+Avago SFP+ SR bailed, 10g single rate AFBR-700SDZ
+Finisar SFP+ LR bailed, 10g single rate FTLX1471D3BCL
+
+82598-based adapters support all passive direct attach cables that comply
+with SFF-8431 v4.1 and SFF-8472 v10.4 specifications. Active direct attach
+cables are not supported.
+
+
+Flow Control
+------------
+Ethernet Flow Control (IEEE 802.3x) can be configured with ethtool to enable
+receiving and transmitting pause frames for ixgbe. When TX is enabled, PAUSE
+frames are generated when the receive packet buffer crosses a predefined
+threshold. When rx is enabled, the transmit unit will halt for the time delay
+specified when a PAUSE frame is received.
+
+Flow Control is enabled by default. If you want to disable a flow control
+capable link partner, use ethtool:
+
+ ethtool -A eth? autoneg off RX off TX off
+
+NOTE: For 82598 backplane cards entering 1 gig mode, flow control default
+behavior is changed to off. Flow control in 1 gig mode on these devices can
+lead to Tx hangs.
+
+Intel(R) Ethernet Flow Director
+-------------------------------
+Supports advanced filters that direct receive packets by their flows to
+different queues. Enables tight control on routing a flow in the platform.
+Matches flows and CPU cores for flow affinity. Supports multiple parameters
+for flexible flow classification and load balancing.
+
+Flow director is enabled only if the kernel is multiple TX queue capable.
+
+An included script (set_irq_affinity.sh) automates setting the IRQ to CPU
+affinity.
+
+You can verify that the driver is using Flow Director by looking at the counter
+in ethtool: fdir_miss and fdir_match.
+
+Other ethtool Commands:
+To enable Flow Director
+ ethtool -K ethX ntuple on
+To add a filter
+ Use -U switch. e.g., ethtool -U ethX flow-type tcp4 src-ip 10.0.128.23
+ action 1
+To see the list of filters currently present:
+ ethtool -u ethX
+
+Perfect Filter: Perfect filter is an interface to load the filter table that
+funnels all flow into queue_0 unless an alternative queue is specified using
+"action". In that case, any flow that matches the filter criteria will be
+directed to the appropriate queue.
+
+If the queue is defined as -1, filter will drop matching packets.
+
+To account for filter matches and misses, there are two stats in ethtool:
+fdir_match and fdir_miss. In addition, rx_queue_N_packets shows the number of
+packets processed by the Nth queue.
+
+NOTE: Receive Packet Steering (RPS) and Receive Flow Steering (RFS) are not
+compatible with Flow Director. IF Flow Director is enabled, these will be
+disabled.
+
+The following three parameters impact Flow Director.
+
+FdirMode
+--------
+Valid Range: 0-2 (0=off, 1=ATR, 2=Perfect filter mode)
+Default Value: 1
+
+ Flow Director filtering modes.
+
+FdirPballoc
+-----------
+Valid Range: 0-2 (0=64k, 1=128k, 2=256k)
+Default Value: 0
+
+ Flow Director allocated packet buffer size.
+
+AtrSampleRate
+--------------
+Valid Range: 1-100
+Default Value: 20
+
+ Software ATR Tx packet sample rate. For example, when set to 20, every 20th
+ packet, looks to see if the packet will create a new flow.
+
+Node
+----
+Valid Range: 0-n
+Default Value: 1 (off)
+
+ 0 - n: where n is the number of NUMA nodes (i.e. 0 - 3) currently online in
+ your system
+ 1: turns this option off
+
+ The Node parameter will allow you to pick which NUMA node you want to have
+ the adapter allocate memory on.
+
+max_vfs
+-------
+Valid Range: 1-63
+Default Value: 0
+
+ If the value is greater than 0 it will also force the VMDq parameter to be 1
+ or more.
+
+ This parameter adds support for SR-IOV. It causes the driver to spawn up to
+ max_vfs worth of virtual function.
+
+
+Additional Configurations
+=========================
+
+ Jumbo Frames
+ ------------
+ The driver supports Jumbo Frames for all adapters. Jumbo Frames support is
+ enabled by changing the MTU to a value larger than the default of 1500.
+ The maximum value for the MTU is 16110. Use the ip command to
+ increase the MTU size. For example:
+
+ ip link set dev ethx mtu 9000
+
+ The maximum MTU setting for Jumbo Frames is 9710. This value coincides
+ with the maximum Jumbo Frames size of 9728.
+
+ Generic Receive Offload, aka GRO
+ --------------------------------
+ The driver supports the in-kernel software implementation of GRO. GRO has
+ shown that by coalescing Rx traffic into larger chunks of data, CPU
+ utilization can be significantly reduced when under large Rx load. GRO is an
+ evolution of the previously-used LRO interface. GRO is able to coalesce
+ other protocols besides TCP. It's also safe to use with configurations that
+ are problematic for LRO, namely bridging and iSCSI.
+
+ Data Center Bridging, aka DCB
+ -----------------------------
+ DCB is a configuration Quality of Service implementation in hardware.
+ It uses the VLAN priority tag (802.1p) to filter traffic. That means
+ that there are 8 different priorities that traffic can be filtered into.
+ It also enables priority flow control which can limit or eliminate the
+ number of dropped packets during network stress. Bandwidth can be
+ allocated to each of these priorities, which is enforced at the hardware
+ level.
+
+ To enable DCB support in ixgbe, you must enable the DCB netlink layer to
+ allow the userspace tools (see below) to communicate with the driver.
+ This can be found in the kernel configuration here:
+
+ -> Networking support
+ -> Networking options
+ -> Data Center Bridging support
+
+ Once this is selected, DCB support must be selected for ixgbe. This can
+ be found here:
+
+ -> Device Drivers
+ -> Network device support (NETDEVICES [=y])
+ -> Ethernet (10000 Mbit) (NETDEV_10000 [=y])
+ -> Intel(R) 10GbE PCI Express adapters support
+ -> Data Center Bridging (DCB) Support
+
+ After these options are selected, you must rebuild your kernel and your
+ modules.
+
+ In order to use DCB, userspace tools must be downloaded and installed.
+ The dcbd tools can be found at:
+
+ http://e1000.sf.net
+
+ Ethtool
+ -------
+ The driver utilizes the ethtool interface for driver configuration and
+ diagnostics, as well as displaying statistical information. The latest
+ ethtool version is required for this functionality.
+
+ The latest release of ethtool can be found from
+ http://ftp.kernel.org/pub/software/network/ethtool/
+
+ FCoE
+ ----
+ This release of the ixgbe driver contains new code to enable users to use
+ Fiber Channel over Ethernet (FCoE) and Data Center Bridging (DCB)
+ functionality that is supported by the 82598-based hardware. This code has
+ no default effect on the regular driver operation, and configuring DCB and
+ FCoE is outside the scope of this driver README. Refer to
+ http://www.open-fcoe.org/ for FCoE project information and contact
+ e1000-eedc@lists.sourceforge.net for DCB information.
+
+ MAC and VLAN anti-spoofing feature
+ ----------------------------------
+ When a malicious driver attempts to send a spoofed packet, it is dropped by
+ the hardware and not transmitted. An interrupt is sent to the PF driver
+ notifying it of the spoof attempt.
+
+ When a spoofed packet is detected the PF driver will send the following
+ message to the system log (displayed by the "dmesg" command):
+
+ Spoof event(s) detected on VF (n)
+
+ Where n=the VF that attempted to do the spoofing.
+
+
+Performance Tuning
+==================
+
+An excellent article on performance tuning can be found at:
+
+http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf
+
+
+Known Issues
+============
+
+ Enabling SR-IOV in a 32-bit or 64-bit Microsoft* Windows* Server 2008/R2
+ Guest OS using Intel (R) 82576-based GbE or Intel (R) 82599-based 10GbE
+ controller under KVM
+ ------------------------------------------------------------------------
+ KVM Hypervisor/VMM supports direct assignment of a PCIe device to a VM. This
+ includes traditional PCIe devices, as well as SR-IOV-capable devices using
+ Intel 82576-based and 82599-based controllers.
+
+ While direct assignment of a PCIe device or an SR-IOV Virtual Function (VF)
+ to a Linux-based VM running 2.6.32 or later kernel works fine, there is a
+ known issue with Microsoft Windows Server 2008 VM that results in a "yellow
+ bang" error. This problem is within the KVM VMM itself, not the Intel driver,
+ or the SR-IOV logic of the VMM, but rather that KVM emulates an older CPU
+ model for the guests, and this older CPU model does not support MSI-X
+ interrupts, which is a requirement for Intel SR-IOV.
+
+ If you wish to use the Intel 82576 or 82599-based controllers in SR-IOV mode
+ with KVM and a Microsoft Windows Server 2008 guest try the following
+ workaround. The workaround is to tell KVM to emulate a different model of CPU
+ when using qemu to create the KVM guest:
+
+ "-cpu qemu64,model=13"
+
+
+Support
+=======
+
+For general information, go to the Intel support website at:
+
+ http://support.intel.com
+
+or the Intel Wired Networking project hosted by Sourceforge at:
+
+ http://e1000.sourceforge.net
+
+If an issue is identified with the released source code on the supported
+kernel with a supported adapter, email the specific information related
+to the issue to e1000-devel@lists.sf.net
diff --git a/Documentation/networking/ixgbevf.txt b/Documentation/networking/ixgbevf.txt
new file mode 100644
index 000000000..53d8d2a5a
--- /dev/null
+++ b/Documentation/networking/ixgbevf.txt
@@ -0,0 +1,52 @@
+Linux* Base Driver for Intel(R) Ethernet Network Connection
+===========================================================
+
+Intel Gigabit Linux driver.
+Copyright(c) 1999 - 2013 Intel Corporation.
+
+Contents
+========
+
+- Identifying Your Adapter
+- Known Issues/Troubleshooting
+- Support
+
+This file describes the ixgbevf Linux* Base Driver for Intel Network
+Connection.
+
+The ixgbevf driver supports 82599-based virtual function devices that can only
+be activated on kernels with CONFIG_PCI_IOV enabled.
+
+The ixgbevf driver supports virtual functions generated by the ixgbe driver
+with a max_vfs value of 1 or greater.
+
+The guest OS loading the ixgbevf driver must support MSI-X interrupts.
+
+VLANs: There is a limit of a total of 32 shared VLANs to 1 or more VFs.
+
+Identifying Your Adapter
+========================
+
+For more information on how to identify your adapter, go to the Adapter &
+Driver ID Guide at:
+
+ http://support.intel.com/support/go/network/adapter/idguide.htm
+
+Known Issues/Troubleshooting
+============================
+
+
+Support
+=======
+
+For general information, go to the Intel support website at:
+
+ http://support.intel.com
+
+or the Intel Wired Networking project hosted by Sourceforge at:
+
+ http://sourceforge.net/projects/e1000
+
+If an issue is identified with the released source code on the supported
+kernel with a supported adapter, email the specific information related
+to the issue to e1000-devel@lists.sf.net
diff --git a/Documentation/networking/l2tp.txt b/Documentation/networking/l2tp.txt
new file mode 100644
index 000000000..c74434de2
--- /dev/null
+++ b/Documentation/networking/l2tp.txt
@@ -0,0 +1,348 @@
+This document describes how to use the kernel's L2TP drivers to
+provide L2TP functionality. L2TP is a protocol that tunnels one or
+more sessions over an IP tunnel. It is commonly used for VPNs
+(L2TP/IPSec) and by ISPs to tunnel subscriber PPP sessions over an IP
+network infrastructure. With L2TPv3, it is also useful as a Layer-2
+tunneling infrastructure.
+
+Features
+========
+
+L2TPv2 (PPP over L2TP (UDP tunnels)).
+L2TPv3 ethernet pseudowires.
+L2TPv3 PPP pseudowires.
+L2TPv3 IP encapsulation.
+Netlink sockets for L2TPv3 configuration management.
+
+History
+=======
+
+The original pppol2tp driver was introduced in 2.6.23 and provided
+L2TPv2 functionality (rfc2661). L2TPv2 is used to tunnel one or more PPP
+sessions over a UDP tunnel.
+
+L2TPv3 (rfc3931) changes the protocol to allow different frame types
+to be passed over an L2TP tunnel by moving the PPP-specific parts of
+the protocol out of the core L2TP packet headers. Each frame type is
+known as a pseudowire type. Ethernet, PPP, HDLC, Frame Relay and ATM
+pseudowires for L2TP are defined in separate RFC standards. Another
+change for L2TPv3 is that it can be carried directly over IP with no
+UDP header (UDP is optional). It is also possible to create static
+unmanaged L2TPv3 tunnels manually without a control protocol
+(userspace daemon) to manage them.
+
+To support L2TPv3, the original pppol2tp driver was split up to
+separate the L2TP and PPP functionality. Existing L2TPv2 userspace
+apps should be unaffected as the original pppol2tp sockets API is
+retained. L2TPv3, however, uses netlink to manage L2TPv3 tunnels and
+sessions.
+
+Design
+======
+
+The L2TP protocol separates control and data frames. The L2TP kernel
+drivers handle only L2TP data frames; control frames are always
+handled by userspace. L2TP control frames carry messages between L2TP
+clients/servers and are used to setup / teardown tunnels and
+sessions. An L2TP client or server is implemented in userspace.
+
+Each L2TP tunnel is implemented using a UDP or L2TPIP socket; L2TPIP
+provides L2TPv3 IP encapsulation (no UDP) and is implemented using a
+new l2tpip socket family. The tunnel socket is typically created by
+userspace, though for unmanaged L2TPv3 tunnels, the socket can also be
+created by the kernel. Each L2TP session (pseudowire) gets a network
+interface instance. In the case of PPP, these interfaces are created
+indirectly by pppd using a pppol2tp socket. In the case of ethernet,
+the netdevice is created upon a netlink request to create an L2TPv3
+ethernet pseudowire.
+
+For PPP, the PPPoL2TP driver, net/l2tp/l2tp_ppp.c, provides a
+mechanism by which PPP frames carried through an L2TP session are
+passed through the kernel's PPP subsystem. The standard PPP daemon,
+pppd, handles all PPP interaction with the peer. PPP network
+interfaces are created for each local PPP endpoint. The kernel's PPP
+subsystem arranges for PPP control frames to be delivered to pppd,
+while data frames are forwarded as usual.
+
+For ethernet, the L2TPETH driver, net/l2tp/l2tp_eth.c, implements a
+netdevice driver, managing virtual ethernet devices, one per
+pseudowire. These interfaces can be managed using standard Linux tools
+such as "ip" and "ifconfig". If only IP frames are passed over the
+tunnel, the interface can be given an IP addresses of itself and its
+peer. If non-IP frames are to be passed over the tunnel, the interface
+can be added to a bridge using brctl. All L2TP datapath protocol
+functions are handled by the L2TP core driver.
+
+Each tunnel and session within a tunnel is assigned a unique tunnel_id
+and session_id. These ids are carried in the L2TP header of every
+control and data packet. (Actually, in L2TPv3, the tunnel_id isn't
+present in data frames - it is inferred from the IP connection on
+which the packet was received.) The L2TP driver uses the ids to lookup
+internal tunnel and/or session contexts to determine how to handle the
+packet. Zero tunnel / session ids are treated specially - zero ids are
+never assigned to tunnels or sessions in the network. In the driver,
+the tunnel context keeps a reference to the tunnel UDP or L2TPIP
+socket. The session context holds data that lets the driver interface
+to the kernel's network frame type subsystems, i.e. PPP, ethernet.
+
+Userspace Programming
+=====================
+
+For L2TPv2, there are a number of requirements on the userspace L2TP
+daemon in order to use the pppol2tp driver.
+
+1. Use a UDP socket per tunnel.
+
+2. Create a single PPPoL2TP socket per tunnel bound to a special null
+ session id. This is used only for communicating with the driver but
+ must remain open while the tunnel is active. Opening this tunnel
+ management socket causes the driver to mark the tunnel socket as an
+ L2TP UDP encapsulation socket and flags it for use by the
+ referenced tunnel id. This hooks up the UDP receive path via
+ udp_encap_rcv() in net/ipv4/udp.c. PPP data frames are never passed
+ in this special PPPoX socket.
+
+3. Create a PPPoL2TP socket per L2TP session. This is typically done
+ by starting pppd with the pppol2tp plugin and appropriate
+ arguments. A PPPoL2TP tunnel management socket (Step 2) must be
+ created before the first PPPoL2TP session socket is created.
+
+When creating PPPoL2TP sockets, the application provides information
+to the driver about the socket in a socket connect() call. Source and
+destination tunnel and session ids are provided, as well as the file
+descriptor of a UDP socket. See struct pppol2tp_addr in
+include/linux/if_pppol2tp.h. Note that zero tunnel / session ids are
+treated specially. When creating the per-tunnel PPPoL2TP management
+socket in Step 2 above, zero source and destination session ids are
+specified, which tells the driver to prepare the supplied UDP file
+descriptor for use as an L2TP tunnel socket.
+
+Userspace may control behavior of the tunnel or session using
+setsockopt and ioctl on the PPPoX socket. The following socket
+options are supported:-
+
+DEBUG - bitmask of debug message categories. See below.
+SENDSEQ - 0 => don't send packets with sequence numbers
+ 1 => send packets with sequence numbers
+RECVSEQ - 0 => receive packet sequence numbers are optional
+ 1 => drop receive packets without sequence numbers
+LNSMODE - 0 => act as LAC.
+ 1 => act as LNS.
+REORDERTO - reorder timeout (in millisecs). If 0, don't try to reorder.
+
+Only the DEBUG option is supported by the special tunnel management
+PPPoX socket.
+
+In addition to the standard PPP ioctls, a PPPIOCGL2TPSTATS is provided
+to retrieve tunnel and session statistics from the kernel using the
+PPPoX socket of the appropriate tunnel or session.
+
+For L2TPv3, userspace must use the netlink API defined in
+include/linux/l2tp.h to manage tunnel and session contexts. The
+general procedure to create a new L2TP tunnel with one session is:-
+
+1. Open a GENL socket using L2TP_GENL_NAME for configuring the kernel
+ using netlink.
+
+2. Create a UDP or L2TPIP socket for the tunnel.
+
+3. Create a new L2TP tunnel using a L2TP_CMD_TUNNEL_CREATE
+ request. Set attributes according to desired tunnel parameters,
+ referencing the UDP or L2TPIP socket created in the previous step.
+
+4. Create a new L2TP session in the tunnel using a
+ L2TP_CMD_SESSION_CREATE request.
+
+The tunnel and all of its sessions are closed when the tunnel socket
+is closed. The netlink API may also be used to delete sessions and
+tunnels. Configuration and status info may be set or read using netlink.
+
+The L2TP driver also supports static (unmanaged) L2TPv3 tunnels. These
+are where there is no L2TP control message exchange with the peer to
+setup the tunnel; the tunnel is configured manually at each end of the
+tunnel. There is no need for an L2TP userspace application in this
+case -- the tunnel socket is created by the kernel and configured
+using parameters sent in the L2TP_CMD_TUNNEL_CREATE netlink
+request. The "ip" utility of iproute2 has commands for managing static
+L2TPv3 tunnels; do "ip l2tp help" for more information.
+
+Debugging
+=========
+
+The driver supports a flexible debug scheme where kernel trace
+messages may be optionally enabled per tunnel and per session. Care is
+needed when debugging a live system since the messages are not
+rate-limited and a busy system could be swamped. Userspace uses
+setsockopt on the PPPoX socket to set a debug mask.
+
+The following debug mask bits are available:
+
+PPPOL2TP_MSG_DEBUG verbose debug (if compiled in)
+PPPOL2TP_MSG_CONTROL userspace - kernel interface
+PPPOL2TP_MSG_SEQ sequence numbers handling
+PPPOL2TP_MSG_DATA data packets
+
+If enabled, files under a l2tp debugfs directory can be used to dump
+kernel state about L2TP tunnels and sessions. To access it, the
+debugfs filesystem must first be mounted.
+
+# mount -t debugfs debugfs /debug
+
+Files under the l2tp directory can then be accessed.
+
+# cat /debug/l2tp/tunnels
+
+The debugfs files should not be used by applications to obtain L2TP
+state information because the file format is subject to change. It is
+implemented to provide extra debug information to help diagnose
+problems.) Users should use the netlink API.
+
+/proc/net/pppol2tp is also provided for backwards compatibility with
+the original pppol2tp driver. It lists information about L2TPv2
+tunnels and sessions only. Its use is discouraged.
+
+Unmanaged L2TPv3 Tunnels
+========================
+
+Some commercial L2TP products support unmanaged L2TPv3 ethernet
+tunnels, where there is no L2TP control protocol; tunnels are
+configured at each side manually. New commands are available in
+iproute2's ip utility to support this.
+
+To create an L2TPv3 ethernet pseudowire between local host 192.168.1.1
+and peer 192.168.1.2, using IP addresses 10.5.1.1 and 10.5.1.2 for the
+tunnel endpoints:-
+
+# modprobe l2tp_eth
+# modprobe l2tp_netlink
+
+# ip l2tp add tunnel tunnel_id 1 peer_tunnel_id 1 udp_sport 5000 \
+ udp_dport 5000 encap udp local 192.168.1.1 remote 192.168.1.2
+# ip l2tp add session tunnel_id 1 session_id 1 peer_session_id 1
+# ifconfig -a
+# ip addr add 10.5.1.2/32 peer 10.5.1.1/32 dev l2tpeth0
+# ifconfig l2tpeth0 up
+
+Choose IP addresses to be the address of a local IP interface and that
+of the remote system. The IP addresses of the l2tpeth0 interface can be
+anything suitable.
+
+Repeat the above at the peer, with ports, tunnel/session ids and IP
+addresses reversed. The tunnel and session IDs can be any non-zero
+32-bit number, but the values must be reversed at the peer.
+
+Host 1 Host2
+udp_sport=5000 udp_sport=5001
+udp_dport=5001 udp_dport=5000
+tunnel_id=42 tunnel_id=45
+peer_tunnel_id=45 peer_tunnel_id=42
+session_id=128 session_id=5196755
+peer_session_id=5196755 peer_session_id=128
+
+When done at both ends of the tunnel, it should be possible to send
+data over the network. e.g.
+
+# ping 10.5.1.1
+
+
+Sample Userspace Code
+=====================
+
+1. Create tunnel management PPPoX socket
+
+ kernel_fd = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP);
+ if (kernel_fd >= 0) {
+ struct sockaddr_pppol2tp sax;
+ struct sockaddr_in const *peer_addr;
+
+ peer_addr = l2tp_tunnel_get_peer_addr(tunnel);
+ memset(&sax, 0, sizeof(sax));
+ sax.sa_family = AF_PPPOX;
+ sax.sa_protocol = PX_PROTO_OL2TP;
+ sax.pppol2tp.fd = udp_fd; /* fd of tunnel UDP socket */
+ sax.pppol2tp.addr.sin_addr.s_addr = peer_addr->sin_addr.s_addr;
+ sax.pppol2tp.addr.sin_port = peer_addr->sin_port;
+ sax.pppol2tp.addr.sin_family = AF_INET;
+ sax.pppol2tp.s_tunnel = tunnel_id;
+ sax.pppol2tp.s_session = 0; /* special case: mgmt socket */
+ sax.pppol2tp.d_tunnel = 0;
+ sax.pppol2tp.d_session = 0; /* special case: mgmt socket */
+
+ if(connect(kernel_fd, (struct sockaddr *)&sax, sizeof(sax) ) < 0 ) {
+ perror("connect failed");
+ result = -errno;
+ goto err;
+ }
+ }
+
+2. Create session PPPoX data socket
+
+ struct sockaddr_pppol2tp sax;
+ int fd;
+
+ /* Note, the target socket must be bound already, else it will not be ready */
+ sax.sa_family = AF_PPPOX;
+ sax.sa_protocol = PX_PROTO_OL2TP;
+ sax.pppol2tp.fd = tunnel_fd;
+ sax.pppol2tp.addr.sin_addr.s_addr = addr->sin_addr.s_addr;
+ sax.pppol2tp.addr.sin_port = addr->sin_port;
+ sax.pppol2tp.addr.sin_family = AF_INET;
+ sax.pppol2tp.s_tunnel = tunnel_id;
+ sax.pppol2tp.s_session = session_id;
+ sax.pppol2tp.d_tunnel = peer_tunnel_id;
+ sax.pppol2tp.d_session = peer_session_id;
+
+ /* session_fd is the fd of the session's PPPoL2TP socket.
+ * tunnel_fd is the fd of the tunnel UDP socket.
+ */
+ fd = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax));
+ if (fd < 0 ) {
+ return -errno;
+ }
+ return 0;
+
+Internal Implementation
+=======================
+
+The driver keeps a struct l2tp_tunnel context per L2TP tunnel and a
+struct l2tp_session context for each session. The l2tp_tunnel is
+always associated with a UDP or L2TP/IP socket and keeps a list of
+sessions in the tunnel. The l2tp_session context keeps kernel state
+about the session. It has private data which is used for data specific
+to the session type. With L2TPv2, the session always carried PPP
+traffic. With L2TPv3, the session can also carry ethernet frames
+(ethernet pseudowire) or other data types such as ATM, HDLC or Frame
+Relay.
+
+When a tunnel is first opened, the reference count on the socket is
+increased using sock_hold(). This ensures that the kernel socket
+cannot be removed while L2TP's data structures reference it.
+
+Some L2TP sessions also have a socket (PPP pseudowires) while others
+do not (ethernet pseudowires). We can't use the socket reference count
+as the reference count for session contexts. The L2TP implementation
+therefore has its own internal reference counts on the session
+contexts.
+
+To Do
+=====
+
+Add L2TP tunnel switching support. This would route tunneled traffic
+from one L2TP tunnel into another. Specified in
+http://tools.ietf.org/html/draft-ietf-l2tpext-tunnel-switching-08
+
+Add L2TPv3 VLAN pseudowire support.
+
+Add L2TPv3 IP pseudowire support.
+
+Add L2TPv3 ATM pseudowire support.
+
+Miscellaneous
+=============
+
+The L2TP drivers were developed as part of the OpenL2TP project by
+Katalix Systems Ltd. OpenL2TP is a full-featured L2TP client / server,
+designed from the ground up to have the L2TP datapath in the
+kernel. The project also implemented the pppol2tp plugin for pppd
+which allows pppd to use the kernel driver. Details can be found at
+http://www.openl2tp.org.
diff --git a/Documentation/networking/lapb-module.txt b/Documentation/networking/lapb-module.txt
new file mode 100644
index 000000000..d4fc8f221
--- /dev/null
+++ b/Documentation/networking/lapb-module.txt
@@ -0,0 +1,263 @@
+ The Linux LAPB Module Interface 1.3
+
+ Jonathan Naylor 29.12.96
+
+Changed (Henner Eisen, 2000-10-29): int return value for data_indication()
+
+The LAPB module will be a separately compiled module for use by any parts of
+the Linux operating system that require a LAPB service. This document
+defines the interfaces to, and the services provided by this module. The
+term module in this context does not imply that the LAPB module is a
+separately loadable module, although it may be. The term module is used in
+its more standard meaning.
+
+The interface to the LAPB module consists of functions to the module,
+callbacks from the module to indicate important state changes, and
+structures for getting and setting information about the module.
+
+Structures
+----------
+
+Probably the most important structure is the skbuff structure for holding
+received and transmitted data, however it is beyond the scope of this
+document.
+
+The two LAPB specific structures are the LAPB initialisation structure and
+the LAPB parameter structure. These will be defined in a standard header
+file, <linux/lapb.h>. The header file <net/lapb.h> is internal to the LAPB
+module and is not for use.
+
+LAPB Initialisation Structure
+-----------------------------
+
+This structure is used only once, in the call to lapb_register (see below).
+It contains information about the device driver that requires the services
+of the LAPB module.
+
+struct lapb_register_struct {
+ void (*connect_confirmation)(int token, int reason);
+ void (*connect_indication)(int token, int reason);
+ void (*disconnect_confirmation)(int token, int reason);
+ void (*disconnect_indication)(int token, int reason);
+ int (*data_indication)(int token, struct sk_buff *skb);
+ void (*data_transmit)(int token, struct sk_buff *skb);
+};
+
+Each member of this structure corresponds to a function in the device driver
+that is called when a particular event in the LAPB module occurs. These will
+be described in detail below. If a callback is not required (!!) then a NULL
+may be substituted.
+
+
+LAPB Parameter Structure
+------------------------
+
+This structure is used with the lapb_getparms and lapb_setparms functions
+(see below). They are used to allow the device driver to get and set the
+operational parameters of the LAPB implementation for a given connection.
+
+struct lapb_parms_struct {
+ unsigned int t1;
+ unsigned int t1timer;
+ unsigned int t2;
+ unsigned int t2timer;
+ unsigned int n2;
+ unsigned int n2count;
+ unsigned int window;
+ unsigned int state;
+ unsigned int mode;
+};
+
+T1 and T2 are protocol timing parameters and are given in units of 100ms. N2
+is the maximum number of tries on the link before it is declared a failure.
+The window size is the maximum number of outstanding data packets allowed to
+be unacknowledged by the remote end, the value of the window is between 1
+and 7 for a standard LAPB link, and between 1 and 127 for an extended LAPB
+link.
+
+The mode variable is a bit field used for setting (at present) three values.
+The bit fields have the following meanings:
+
+Bit Meaning
+0 LAPB operation (0=LAPB_STANDARD 1=LAPB_EXTENDED).
+1 [SM]LP operation (0=LAPB_SLP 1=LAPB=MLP).
+2 DTE/DCE operation (0=LAPB_DTE 1=LAPB_DCE)
+3-31 Reserved, must be 0.
+
+Extended LAPB operation indicates the use of extended sequence numbers and
+consequently larger window sizes, the default is standard LAPB operation.
+MLP operation is the same as SLP operation except that the addresses used by
+LAPB are different to indicate the mode of operation, the default is Single
+Link Procedure. The difference between DCE and DTE operation is (i) the
+addresses used for commands and responses, and (ii) when the DCE is not
+connected, it sends DM without polls set, every T1. The upper case constant
+names will be defined in the public LAPB header file.
+
+
+Functions
+---------
+
+The LAPB module provides a number of function entry points.
+
+
+int lapb_register(void *token, struct lapb_register_struct);
+
+This must be called before the LAPB module may be used. If the call is
+successful then LAPB_OK is returned. The token must be a unique identifier
+generated by the device driver to allow for the unique identification of the
+instance of the LAPB link. It is returned by the LAPB module in all of the
+callbacks, and is used by the device driver in all calls to the LAPB module.
+For multiple LAPB links in a single device driver, multiple calls to
+lapb_register must be made. The format of the lapb_register_struct is given
+above. The return values are:
+
+LAPB_OK LAPB registered successfully.
+LAPB_BADTOKEN Token is already registered.
+LAPB_NOMEM Out of memory
+
+
+int lapb_unregister(void *token);
+
+This releases all the resources associated with a LAPB link. Any current
+LAPB link will be abandoned without further messages being passed. After
+this call, the value of token is no longer valid for any calls to the LAPB
+function. The valid return values are:
+
+LAPB_OK LAPB unregistered successfully.
+LAPB_BADTOKEN Invalid/unknown LAPB token.
+
+
+int lapb_getparms(void *token, struct lapb_parms_struct *parms);
+
+This allows the device driver to get the values of the current LAPB
+variables, the lapb_parms_struct is described above. The valid return values
+are:
+
+LAPB_OK LAPB getparms was successful.
+LAPB_BADTOKEN Invalid/unknown LAPB token.
+
+
+int lapb_setparms(void *token, struct lapb_parms_struct *parms);
+
+This allows the device driver to set the values of the current LAPB
+variables, the lapb_parms_struct is described above. The values of t1timer,
+t2timer and n2count are ignored, likewise changing the mode bits when
+connected will be ignored. An error implies that none of the values have
+been changed. The valid return values are:
+
+LAPB_OK LAPB getparms was successful.
+LAPB_BADTOKEN Invalid/unknown LAPB token.
+LAPB_INVALUE One of the values was out of its allowable range.
+
+
+int lapb_connect_request(void *token);
+
+Initiate a connect using the current parameter settings. The valid return
+values are:
+
+LAPB_OK LAPB is starting to connect.
+LAPB_BADTOKEN Invalid/unknown LAPB token.
+LAPB_CONNECTED LAPB module is already connected.
+
+
+int lapb_disconnect_request(void *token);
+
+Initiate a disconnect. The valid return values are:
+
+LAPB_OK LAPB is starting to disconnect.
+LAPB_BADTOKEN Invalid/unknown LAPB token.
+LAPB_NOTCONNECTED LAPB module is not connected.
+
+
+int lapb_data_request(void *token, struct sk_buff *skb);
+
+Queue data with the LAPB module for transmitting over the link. If the call
+is successful then the skbuff is owned by the LAPB module and may not be
+used by the device driver again. The valid return values are:
+
+LAPB_OK LAPB has accepted the data.
+LAPB_BADTOKEN Invalid/unknown LAPB token.
+LAPB_NOTCONNECTED LAPB module is not connected.
+
+
+int lapb_data_received(void *token, struct sk_buff *skb);
+
+Queue data with the LAPB module which has been received from the device. It
+is expected that the data passed to the LAPB module has skb->data pointing
+to the beginning of the LAPB data. If the call is successful then the skbuff
+is owned by the LAPB module and may not be used by the device driver again.
+The valid return values are:
+
+LAPB_OK LAPB has accepted the data.
+LAPB_BADTOKEN Invalid/unknown LAPB token.
+
+
+Callbacks
+---------
+
+These callbacks are functions provided by the device driver for the LAPB
+module to call when an event occurs. They are registered with the LAPB
+module with lapb_register (see above) in the structure lapb_register_struct
+(see above).
+
+
+void (*connect_confirmation)(void *token, int reason);
+
+This is called by the LAPB module when a connection is established after
+being requested by a call to lapb_connect_request (see above). The reason is
+always LAPB_OK.
+
+
+void (*connect_indication)(void *token, int reason);
+
+This is called by the LAPB module when the link is established by the remote
+system. The value of reason is always LAPB_OK.
+
+
+void (*disconnect_confirmation)(void *token, int reason);
+
+This is called by the LAPB module when an event occurs after the device
+driver has called lapb_disconnect_request (see above). The reason indicates
+what has happened. In all cases the LAPB link can be regarded as being
+terminated. The values for reason are:
+
+LAPB_OK The LAPB link was terminated normally.
+LAPB_NOTCONNECTED The remote system was not connected.
+LAPB_TIMEDOUT No response was received in N2 tries from the remote
+ system.
+
+
+void (*disconnect_indication)(void *token, int reason);
+
+This is called by the LAPB module when the link is terminated by the remote
+system or another event has occurred to terminate the link. This may be
+returned in response to a lapb_connect_request (see above) if the remote
+system refused the request. The values for reason are:
+
+LAPB_OK The LAPB link was terminated normally by the remote
+ system.
+LAPB_REFUSED The remote system refused the connect request.
+LAPB_NOTCONNECTED The remote system was not connected.
+LAPB_TIMEDOUT No response was received in N2 tries from the remote
+ system.
+
+
+int (*data_indication)(void *token, struct sk_buff *skb);
+
+This is called by the LAPB module when data has been received from the
+remote system that should be passed onto the next layer in the protocol
+stack. The skbuff becomes the property of the device driver and the LAPB
+module will not perform any more actions on it. The skb->data pointer will
+be pointing to the first byte of data after the LAPB header.
+
+This method should return NET_RX_DROP (as defined in the header
+file include/linux/netdevice.h) if and only if the frame was dropped
+before it could be delivered to the upper layer.
+
+
+void (*data_transmit)(void *token, struct sk_buff *skb);
+
+This is called by the LAPB module when data is to be transmitted to the
+remote system by the device driver. The skbuff becomes the property of the
+device driver and the LAPB module will not perform any more actions on it.
+The skb->data pointer will be pointing to the first byte of the LAPB header.
diff --git a/Documentation/networking/ltpc.txt b/Documentation/networking/ltpc.txt
new file mode 100644
index 000000000..0bf3220c7
--- /dev/null
+++ b/Documentation/networking/ltpc.txt
@@ -0,0 +1,131 @@
+This is the ALPHA version of the ltpc driver.
+
+In order to use it, you will need at least version 1.3.3 of the
+netatalk package, and the Apple or Farallon LocalTalk PC card.
+There are a number of different LocalTalk cards for the PC; this
+driver applies only to the one with the 65c02 processor chip on it.
+
+To include it in the kernel, select the CONFIG_LTPC switch in the
+configuration dialog. You can also compile it as a module.
+
+While the driver will attempt to autoprobe the I/O port address, IRQ
+line, and DMA channel of the card, this does not always work. For
+this reason, you should be prepared to supply these parameters
+yourself. (see "Card Configuration" below for how to determine or
+change the settings on your card)
+
+When the driver is compiled into the kernel, you can add a line such
+as the following to your /etc/lilo.conf:
+
+ append="ltpc=0x240,9,1"
+
+where the parameters (in order) are the port address, IRQ, and DMA
+channel. The second and third values can be omitted, in which case
+the driver will try to determine them itself.
+
+If you load the driver as a module, you can pass the parameters "io=",
+"irq=", and "dma=" on the command line with insmod or modprobe, or add
+them as options in a configuration file in /etc/modprobe.d/ directory:
+
+ alias lt0 ltpc # autoload the module when the interface is configured
+ options ltpc io=0x240 irq=9 dma=1
+
+Before starting up the netatalk demons (perhaps in rc.local), you
+need to add a line such as:
+
+ /sbin/ifconfig lt0 127.0.0.42
+
+The address is unimportant - however, the card needs to be configured
+with ifconfig so that Netatalk can find it.
+
+The appropriate netatalk configuration depends on whether you are
+attached to a network that includes AppleTalk routers or not. If,
+like me, you are simply connecting to your home Macintoshes and
+printers, you need to set up netatalk to "seed". The way I do this
+is to have the lines
+
+ dummy -seed -phase 2 -net 2000 -addr 2000.26 -zone "1033"
+ lt0 -seed -phase 1 -net 1033 -addr 1033.27 -zone "1033"
+
+in my atalkd.conf. What is going on here is that I need to fool
+netatalk into thinking that there are two AppleTalk interfaces
+present; otherwise, it refuses to seed. This is a hack, and a more
+permanent solution would be to alter the netatalk code. Also, make
+sure you have the correct name for the dummy interface - If it's
+compiled as a module, you will need to refer to it as "dummy0" or some
+such.
+
+If you are attached to an extended AppleTalk network, with routers on
+it, then you don't need to fool around with this -- the appropriate
+line in atalkd.conf is
+
+ lt0 -phase 1
+
+--------------------------------------
+
+Card Configuration:
+
+The interrupts and so forth are configured via the dipswitch on the
+board. Set the switches so as not to conflict with other hardware.
+
+ Interrupts -- set at most one. If none are set, the driver uses
+ polled mode. Because the card was developed in the XT era, the
+ original documentation refers to IRQ2. Since you'll be running
+ this on an AT (or later) class machine, that really means IRQ9.
+
+ SW1 IRQ 4
+ SW2 IRQ 3
+ SW3 IRQ 9 (2 in original card documentation only applies to XT)
+
+
+ DMA -- choose DMA 1 or 3, and set both corresponding switches.
+
+ SW4 DMA 3
+ SW5 DMA 1
+ SW6 DMA 3
+ SW7 DMA 1
+
+
+ I/O address -- choose one.
+
+ SW8 220 / 240
+
+--------------------------------------
+
+IP:
+
+Yes, it is possible to do IP over LocalTalk. However, you can't just
+treat the LocalTalk device like an ordinary Ethernet device, even if
+that's what it looks like to Netatalk.
+
+Instead, you follow the same procedure as for doing IP in EtherTalk.
+See Documentation/networking/ipddp.txt for more information about the
+kernel driver and userspace tools needed.
+
+--------------------------------------
+
+BUGS:
+
+IRQ autoprobing often doesn't work on a cold boot. To get around
+this, either compile the driver as a module, or pass the parameters
+for the card to the kernel as described above.
+
+Also, as usual, autoprobing is not recommended when you use the driver
+as a module. (though it usually works at boot time, at least)
+
+Polled mode is *really* slow sometimes, but this seems to depend on
+the configuration of the network.
+
+It may theoretically be possible to use two LTPC cards in the same
+machine, but this is unsupported, so if you really want to do this,
+you'll probably have to hack the initialization code a bit.
+
+______________________________________
+
+THANKS:
+ Thanks to Alan Cox for helpful discussions early on in this
+work, and to Denis Hainsworth for doing the bleeding-edge testing.
+
+-- Bradford Johnson <bradford@math.umn.edu>
+
+-- Updated 11/09/1998 by David Huggins-Daines <dhd@debian.org>
diff --git a/Documentation/networking/mac80211-auth-assoc-deauth.txt b/Documentation/networking/mac80211-auth-assoc-deauth.txt
new file mode 100644
index 000000000..d7a15fe91
--- /dev/null
+++ b/Documentation/networking/mac80211-auth-assoc-deauth.txt
@@ -0,0 +1,95 @@
+#
+# This outlines the Linux authentication/association and
+# deauthentication/disassociation flows.
+#
+# This can be converted into a diagram using the service
+# at http://www.websequencediagrams.com/
+#
+
+participant userspace
+participant mac80211
+participant driver
+
+alt authentication needed (not FT)
+userspace->mac80211: authenticate
+
+alt authenticated/authenticating already
+mac80211->driver: sta_state(AP, not-exists)
+mac80211->driver: bss_info_changed(clear BSSID)
+else associated
+note over mac80211,driver
+like deauth/disassoc, without sending the
+BA session stop & deauth/disassoc frames
+end note
+end
+
+mac80211->driver: config(channel, channel type)
+mac80211->driver: bss_info_changed(set BSSID, basic rate bitmap)
+mac80211->driver: sta_state(AP, exists)
+
+alt no probe request data known
+mac80211->driver: TX directed probe request
+driver->mac80211: RX probe response
+end
+
+mac80211->driver: TX auth frame
+driver->mac80211: RX auth frame
+
+alt WEP shared key auth
+mac80211->driver: TX auth frame
+driver->mac80211: RX auth frame
+end
+
+mac80211->driver: sta_state(AP, authenticated)
+mac80211->userspace: RX auth frame
+
+end
+
+userspace->mac80211: associate
+alt authenticated or associated
+note over mac80211,driver: cleanup like for authenticate
+end
+
+alt not previously authenticated (FT)
+mac80211->driver: config(channel, channel type)
+mac80211->driver: bss_info_changed(set BSSID, basic rate bitmap)
+mac80211->driver: sta_state(AP, exists)
+mac80211->driver: sta_state(AP, authenticated)
+end
+mac80211->driver: TX assoc
+driver->mac80211: RX assoc response
+note over mac80211: init rate control
+mac80211->driver: sta_state(AP, associated)
+
+alt not using WPA
+mac80211->driver: sta_state(AP, authorized)
+end
+
+mac80211->driver: set up QoS parameters
+
+mac80211->driver: bss_info_changed(QoS, HT, associated with AID)
+mac80211->userspace: associated
+
+note left of userspace: associated now
+
+alt using WPA
+note over userspace
+do 4-way-handshake
+(data frames)
+end note
+userspace->mac80211: authorized
+mac80211->driver: sta_state(AP, authorized)
+end
+
+userspace->mac80211: deauthenticate/disassociate
+mac80211->driver: stop BA sessions
+mac80211->driver: TX deauth/disassoc
+mac80211->driver: flush frames
+mac80211->driver: sta_state(AP,associated)
+mac80211->driver: sta_state(AP,authenticated)
+mac80211->driver: sta_state(AP,exists)
+mac80211->driver: sta_state(AP,not-exists)
+mac80211->driver: turn off powersave
+mac80211->driver: bss_info_changed(clear BSSID, not associated, no QoS, ...)
+mac80211->driver: config(channel type to non-HT)
+mac80211->userspace: disconnected
diff --git a/Documentation/networking/mac80211-injection.txt b/Documentation/networking/mac80211-injection.txt
new file mode 100644
index 000000000..3a930072b
--- /dev/null
+++ b/Documentation/networking/mac80211-injection.txt
@@ -0,0 +1,67 @@
+How to use packet injection with mac80211
+=========================================
+
+mac80211 now allows arbitrary packets to be injected down any Monitor Mode
+interface from userland. The packet you inject needs to be composed in the
+following format:
+
+ [ radiotap header ]
+ [ ieee80211 header ]
+ [ payload ]
+
+The radiotap format is discussed in
+./Documentation/networking/radiotap-headers.txt.
+
+Despite many radiotap parameters being currently defined, most only make sense
+to appear on received packets. The following information is parsed from the
+radiotap headers and used to control injection:
+
+ * IEEE80211_RADIOTAP_FLAGS
+
+ IEEE80211_RADIOTAP_F_FCS: FCS will be removed and recalculated
+ IEEE80211_RADIOTAP_F_WEP: frame will be encrypted if key available
+ IEEE80211_RADIOTAP_F_FRAG: frame will be fragmented if longer than the
+ current fragmentation threshold.
+
+ * IEEE80211_RADIOTAP_TX_FLAGS
+
+ IEEE80211_RADIOTAP_F_TX_NOACK: frame should be sent without waiting for
+ an ACK even if it is a unicast frame
+
+The injection code can also skip all other currently defined radiotap fields
+facilitating replay of captured radiotap headers directly.
+
+Here is an example valid radiotap header defining some parameters
+
+ 0x00, 0x00, // <-- radiotap version
+ 0x0b, 0x00, // <- radiotap header length
+ 0x04, 0x0c, 0x00, 0x00, // <-- bitmap
+ 0x6c, // <-- rate
+ 0x0c, //<-- tx power
+ 0x01 //<-- antenna
+
+The ieee80211 header follows immediately afterwards, looking for example like
+this:
+
+ 0x08, 0x01, 0x00, 0x00,
+ 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
+ 0x13, 0x22, 0x33, 0x44, 0x55, 0x66,
+ 0x13, 0x22, 0x33, 0x44, 0x55, 0x66,
+ 0x10, 0x86
+
+Then lastly there is the payload.
+
+After composing the packet contents, it is sent by send()-ing it to a logical
+mac80211 interface that is in Monitor mode. Libpcap can also be used,
+(which is easier than doing the work to bind the socket to the right
+interface), along the following lines:
+
+ ppcap = pcap_open_live(szInterfaceName, 800, 1, 20, szErrbuf);
+...
+ r = pcap_inject(ppcap, u8aSendBuffer, nLength);
+
+You can also find a link to a complete inject application here:
+
+http://wireless.kernel.org/en/users/Documentation/packetspammer
+
+Andy Green <andy@warmcat.com>
diff --git a/Documentation/networking/mac80211_hwsim/README b/Documentation/networking/mac80211_hwsim/README
new file mode 100644
index 000000000..24ac91d56
--- /dev/null
+++ b/Documentation/networking/mac80211_hwsim/README
@@ -0,0 +1,68 @@
+mac80211_hwsim - software simulator of 802.11 radio(s) for mac80211
+Copyright (c) 2008, Jouni Malinen <j@w1.fi>
+
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License version 2 as
+published by the Free Software Foundation.
+
+
+Introduction
+
+mac80211_hwsim is a Linux kernel module that can be used to simulate
+arbitrary number of IEEE 802.11 radios for mac80211. It can be used to
+test most of the mac80211 functionality and user space tools (e.g.,
+hostapd and wpa_supplicant) in a way that matches very closely with
+the normal case of using real WLAN hardware. From the mac80211 view
+point, mac80211_hwsim is yet another hardware driver, i.e., no changes
+to mac80211 are needed to use this testing tool.
+
+The main goal for mac80211_hwsim is to make it easier for developers
+to test their code and work with new features to mac80211, hostapd,
+and wpa_supplicant. The simulated radios do not have the limitations
+of real hardware, so it is easy to generate an arbitrary test setup
+and always reproduce the same setup for future tests. In addition,
+since all radio operation is simulated, any channel can be used in
+tests regardless of regulatory rules.
+
+mac80211_hwsim kernel module has a parameter 'radios' that can be used
+to select how many radios are simulated (default 2). This allows
+configuration of both very simply setups (e.g., just a single access
+point and a station) or large scale tests (multiple access points with
+hundreds of stations).
+
+mac80211_hwsim works by tracking the current channel of each virtual
+radio and copying all transmitted frames to all other radios that are
+currently enabled and on the same channel as the transmitting
+radio. Software encryption in mac80211 is used so that the frames are
+actually encrypted over the virtual air interface to allow more
+complete testing of encryption.
+
+A global monitoring netdev, hwsim#, is created independent of
+mac80211. This interface can be used to monitor all transmitted frames
+regardless of channel.
+
+
+Simple example
+
+This example shows how to use mac80211_hwsim to simulate two radios:
+one to act as an access point and the other as a station that
+associates with the AP. hostapd and wpa_supplicant are used to take
+care of WPA2-PSK authentication. In addition, hostapd is also
+processing access point side of association.
+
+
+# Build mac80211_hwsim as part of kernel configuration
+
+# Load the module
+modprobe mac80211_hwsim
+
+# Run hostapd (AP) for wlan0
+hostapd hostapd.conf
+
+# Run wpa_supplicant (station) for wlan1
+wpa_supplicant -Dwext -iwlan1 -c wpa_supplicant.conf
+
+
+More test cases are available in hostap.git:
+git://w1.fi/srv/git/hostap.git and mac80211_hwsim/tests subdirectory
+(http://w1.fi/gitweb/gitweb.cgi?p=hostap.git;a=tree;f=mac80211_hwsim/tests)
diff --git a/Documentation/networking/mac80211_hwsim/hostapd.conf b/Documentation/networking/mac80211_hwsim/hostapd.conf
new file mode 100644
index 000000000..08cde7e35
--- /dev/null
+++ b/Documentation/networking/mac80211_hwsim/hostapd.conf
@@ -0,0 +1,11 @@
+interface=wlan0
+driver=nl80211
+
+hw_mode=g
+channel=1
+ssid=mac80211 test
+
+wpa=2
+wpa_key_mgmt=WPA-PSK
+wpa_pairwise=CCMP
+wpa_passphrase=12345678
diff --git a/Documentation/networking/mac80211_hwsim/wpa_supplicant.conf b/Documentation/networking/mac80211_hwsim/wpa_supplicant.conf
new file mode 100644
index 000000000..299128cff
--- /dev/null
+++ b/Documentation/networking/mac80211_hwsim/wpa_supplicant.conf
@@ -0,0 +1,10 @@
+ctrl_interface=/var/run/wpa_supplicant
+
+network={
+ ssid="mac80211 test"
+ psk="12345678"
+ key_mgmt=WPA-PSK
+ proto=WPA2
+ pairwise=CCMP
+ group=CCMP
+}
diff --git a/Documentation/networking/mpls-sysctl.txt b/Documentation/networking/mpls-sysctl.txt
new file mode 100644
index 000000000..9ed15f86c
--- /dev/null
+++ b/Documentation/networking/mpls-sysctl.txt
@@ -0,0 +1,29 @@
+/proc/sys/net/mpls/* Variables:
+
+platform_labels - INTEGER
+ Number of entries in the platform label table. It is not
+ possible to configure forwarding for label values equal to or
+ greater than the number of platform labels.
+
+ A dense utliziation of the entries in the platform label table
+ is possible and expected aas the platform labels are locally
+ allocated.
+
+ If the number of platform label table entries is set to 0 no
+ label will be recognized by the kernel and mpls forwarding
+ will be disabled.
+
+ Reducing this value will remove all label routing entries that
+ no longer fit in the table.
+
+ Possible values: 0 - 1048575
+ Default: 0
+
+conf/<interface>/input - BOOL
+ Control whether packets can be input on this interface.
+
+ If disabled, packets will be discarded without further
+ processing.
+
+ 0 - disabled (default)
+ not 0 - enabled
diff --git a/Documentation/networking/multiqueue.txt b/Documentation/networking/multiqueue.txt
new file mode 100644
index 000000000..4caa0e314
--- /dev/null
+++ b/Documentation/networking/multiqueue.txt
@@ -0,0 +1,79 @@
+
+ HOWTO for multiqueue network device support
+ ===========================================
+
+Section 1: Base driver requirements for implementing multiqueue support
+
+Intro: Kernel support for multiqueue devices
+---------------------------------------------------------
+
+Kernel support for multiqueue devices is always present.
+
+Section 1: Base driver requirements for implementing multiqueue support
+-----------------------------------------------------------------------
+
+Base drivers are required to use the new alloc_etherdev_mq() or
+alloc_netdev_mq() functions to allocate the subqueues for the device. The
+underlying kernel API will take care of the allocation and deallocation of
+the subqueue memory, as well as netdev configuration of where the queues
+exist in memory.
+
+The base driver will also need to manage the queues as it does the global
+netdev->queue_lock today. Therefore base drivers should use the
+netif_{start|stop|wake}_subqueue() functions to manage each queue while the
+device is still operational. netdev->queue_lock is still used when the device
+comes online or when it's completely shut down (unregister_netdev(), etc.).
+
+
+Section 2: Qdisc support for multiqueue devices
+
+-----------------------------------------------
+
+Currently two qdiscs are optimized for multiqueue devices. The first is the
+default pfifo_fast qdisc. This qdisc supports one qdisc per hardware queue.
+A new round-robin qdisc, sch_multiq also supports multiple hardware queues. The
+qdisc is responsible for classifying the skb's and then directing the skb's to
+bands and queues based on the value in skb->queue_mapping. Use this field in
+the base driver to determine which queue to send the skb to.
+
+sch_multiq has been added for hardware that wishes to avoid head-of-line
+blocking. It will cycle though the bands and verify that the hardware queue
+associated with the band is not stopped prior to dequeuing a packet.
+
+On qdisc load, the number of bands is based on the number of queues on the
+hardware. Once the association is made, any skb with skb->queue_mapping set,
+will be queued to the band associated with the hardware queue.
+
+
+Section 3: Brief howto using MULTIQ for multiqueue devices
+---------------------------------------------------------------
+
+The userspace command 'tc,' part of the iproute2 package, is used to configure
+qdiscs. To add the MULTIQ qdisc to your network device, assuming the device
+is called eth0, run the following command:
+
+# tc qdisc add dev eth0 root handle 1: multiq
+
+The qdisc will allocate the number of bands to equal the number of queues that
+the device reports, and bring the qdisc online. Assuming eth0 has 4 Tx
+queues, the band mapping would look like:
+
+band 0 => queue 0
+band 1 => queue 1
+band 2 => queue 2
+band 3 => queue 3
+
+Traffic will begin flowing through each queue based on either the simple_tx_hash
+function or based on netdev->select_queue() if you have it defined.
+
+The behavior of tc filters remains the same. However a new tc action,
+skbedit, has been added. Assuming you wanted to route all traffic to a
+specific host, for example 192.168.0.3, through a specific queue you could use
+this action and establish a filter such as:
+
+tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
+ match ip dst 192.168.0.3 \
+ action skbedit queue_mapping 3
+
+Author: Alexander Duyck <alexander.h.duyck@intel.com>
+Original Author: Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>
diff --git a/Documentation/networking/netconsole.txt b/Documentation/networking/netconsole.txt
new file mode 100644
index 000000000..a5d574a9a
--- /dev/null
+++ b/Documentation/networking/netconsole.txt
@@ -0,0 +1,177 @@
+
+started by Ingo Molnar <mingo@redhat.com>, 2001.09.17
+2.6 port and netpoll api by Matt Mackall <mpm@selenic.com>, Sep 9 2003
+IPv6 support by Cong Wang <xiyou.wangcong@gmail.com>, Jan 1 2013
+
+Please send bug reports to Matt Mackall <mpm@selenic.com>
+Satyam Sharma <satyam.sharma@gmail.com>, and Cong Wang <xiyou.wangcong@gmail.com>
+
+Introduction:
+=============
+
+This module logs kernel printk messages over UDP allowing debugging of
+problem where disk logging fails and serial consoles are impractical.
+
+It can be used either built-in or as a module. As a built-in,
+netconsole initializes immediately after NIC cards and will bring up
+the specified interface as soon as possible. While this doesn't allow
+capture of early kernel panics, it does capture most of the boot
+process.
+
+Sender and receiver configuration:
+==================================
+
+It takes a string configuration parameter "netconsole" in the
+following format:
+
+ netconsole=[src-port]@[src-ip]/[<dev>],[tgt-port]@<tgt-ip>/[tgt-macaddr]
+
+ where
+ src-port source for UDP packets (defaults to 6665)
+ src-ip source IP to use (interface address)
+ dev network interface (eth0)
+ tgt-port port for logging agent (6666)
+ tgt-ip IP address for logging agent
+ tgt-macaddr ethernet MAC address for logging agent (broadcast)
+
+Examples:
+
+ linux netconsole=4444@10.0.0.1/eth1,9353@10.0.0.2/12:34:56:78:9a:bc
+
+ or
+
+ insmod netconsole netconsole=@/,@10.0.0.2/
+
+ or using IPv6
+
+ insmod netconsole netconsole=@/,@fd00:1:2:3::1/
+
+It also supports logging to multiple remote agents by specifying
+parameters for the multiple agents separated by semicolons and the
+complete string enclosed in "quotes", thusly:
+
+ modprobe netconsole netconsole="@/,@10.0.0.2/;@/eth1,6892@10.0.0.3/"
+
+Built-in netconsole starts immediately after the TCP stack is
+initialized and attempts to bring up the supplied dev at the supplied
+address.
+
+The remote host has several options to receive the kernel messages,
+for example:
+
+1) syslogd
+
+2) netcat
+
+ On distributions using a BSD-based netcat version (e.g. Fedora,
+ openSUSE and Ubuntu) the listening port must be specified without
+ the -p switch:
+
+ 'nc -u -l -p <port>' / 'nc -u -l <port>' or
+ 'netcat -u -l -p <port>' / 'netcat -u -l <port>'
+
+3) socat
+
+ 'socat udp-recv:<port> -'
+
+Dynamic reconfiguration:
+========================
+
+Dynamic reconfigurability is a useful addition to netconsole that enables
+remote logging targets to be dynamically added, removed, or have their
+parameters reconfigured at runtime from a configfs-based userspace interface.
+[ Note that the parameters of netconsole targets that were specified/created
+from the boot/module option are not exposed via this interface, and hence
+cannot be modified dynamically. ]
+
+To include this feature, select CONFIG_NETCONSOLE_DYNAMIC when building the
+netconsole module (or kernel, if netconsole is built-in).
+
+Some examples follow (where configfs is mounted at the /sys/kernel/config
+mountpoint).
+
+To add a remote logging target (target names can be arbitrary):
+
+ cd /sys/kernel/config/netconsole/
+ mkdir target1
+
+Note that newly created targets have default parameter values (as mentioned
+above) and are disabled by default -- they must first be enabled by writing
+"1" to the "enabled" attribute (usually after setting parameters accordingly)
+as described below.
+
+To remove a target:
+
+ rmdir /sys/kernel/config/netconsole/othertarget/
+
+The interface exposes these parameters of a netconsole target to userspace:
+
+ enabled Is this target currently enabled? (read-write)
+ dev_name Local network interface name (read-write)
+ local_port Source UDP port to use (read-write)
+ remote_port Remote agent's UDP port (read-write)
+ local_ip Source IP address to use (read-write)
+ remote_ip Remote agent's IP address (read-write)
+ local_mac Local interface's MAC address (read-only)
+ remote_mac Remote agent's MAC address (read-write)
+
+The "enabled" attribute is also used to control whether the parameters of
+a target can be updated or not -- you can modify the parameters of only
+disabled targets (i.e. if "enabled" is 0).
+
+To update a target's parameters:
+
+ cat enabled # check if enabled is 1
+ echo 0 > enabled # disable the target (if required)
+ echo eth2 > dev_name # set local interface
+ echo 10.0.0.4 > remote_ip # update some parameter
+ echo cb:a9:87:65:43:21 > remote_mac # update more parameters
+ echo 1 > enabled # enable target again
+
+You can also update the local interface dynamically. This is especially
+useful if you want to use interfaces that have newly come up (and may not
+have existed when netconsole was loaded / initialized).
+
+Miscellaneous notes:
+====================
+
+WARNING: the default target ethernet setting uses the broadcast
+ethernet address to send packets, which can cause increased load on
+other systems on the same ethernet segment.
+
+TIP: some LAN switches may be configured to suppress ethernet broadcasts
+so it is advised to explicitly specify the remote agents' MAC addresses
+from the config parameters passed to netconsole.
+
+TIP: to find out the MAC address of, say, 10.0.0.2, you may try using:
+
+ ping -c 1 10.0.0.2 ; /sbin/arp -n | grep 10.0.0.2
+
+TIP: in case the remote logging agent is on a separate LAN subnet than
+the sender, it is suggested to try specifying the MAC address of the
+default gateway (you may use /sbin/route -n to find it out) as the
+remote MAC address instead.
+
+NOTE: the network device (eth1 in the above case) can run any kind
+of other network traffic, netconsole is not intrusive. Netconsole
+might cause slight delays in other traffic if the volume of kernel
+messages is high, but should have no other impact.
+
+NOTE: if you find that the remote logging agent is not receiving or
+printing all messages from the sender, it is likely that you have set
+the "console_loglevel" parameter (on the sender) to only send high
+priority messages to the console. You can change this at runtime using:
+
+ dmesg -n 8
+
+or by specifying "debug" on the kernel command line at boot, to send
+all kernel messages to the console. A specific value for this parameter
+can also be set using the "loglevel" kernel boot option. See the
+dmesg(8) man page and Documentation/kernel-parameters.txt for details.
+
+Netconsole was designed to be as instantaneous as possible, to
+enable the logging of even the most critical kernel bugs. It works
+from IRQ contexts as well, and does not enable interrupts while
+sending packets. Due to these unique needs, configuration cannot
+be more automatic, and some fundamental limitations will remain:
+only IP networks, UDP packets and ethernet devices are supported.
diff --git a/Documentation/networking/netdev-FAQ.txt b/Documentation/networking/netdev-FAQ.txt
new file mode 100644
index 000000000..0fe1c6e0d
--- /dev/null
+++ b/Documentation/networking/netdev-FAQ.txt
@@ -0,0 +1,224 @@
+
+Information you need to know about netdev
+-----------------------------------------
+
+Q: What is netdev?
+
+A: It is a mailing list for all network-related Linux stuff. This includes
+ anything found under net/ (i.e. core code like IPv6) and drivers/net
+ (i.e. hardware specific drivers) in the Linux source tree.
+
+ Note that some subsystems (e.g. wireless drivers) which have a high volume
+ of traffic have their own specific mailing lists.
+
+ The netdev list is managed (like many other Linux mailing lists) through
+ VGER ( http://vger.kernel.org/ ) and archives can be found below:
+
+ http://marc.info/?l=linux-netdev
+ http://www.spinics.net/lists/netdev/
+
+ Aside from subsystems like that mentioned above, all network-related Linux
+ development (i.e. RFC, review, comments, etc.) takes place on netdev.
+
+Q: How do the changes posted to netdev make their way into Linux?
+
+A: There are always two trees (git repositories) in play. Both are driven
+ by David Miller, the main network maintainer. There is the "net" tree,
+ and the "net-next" tree. As you can probably guess from the names, the
+ net tree is for fixes to existing code already in the mainline tree from
+ Linus, and net-next is where the new code goes for the future release.
+ You can find the trees here:
+
+ http://git.kernel.org/?p=linux/kernel/git/davem/net.git
+ http://git.kernel.org/?p=linux/kernel/git/davem/net-next.git
+
+Q: How often do changes from these trees make it to the mainline Linus tree?
+
+A: To understand this, you need to know a bit of background information
+ on the cadence of Linux development. Each new release starts off with
+ a two week "merge window" where the main maintainers feed their new
+ stuff to Linus for merging into the mainline tree. After the two weeks,
+ the merge window is closed, and it is called/tagged "-rc1". No new
+ features get mainlined after this -- only fixes to the rc1 content
+ are expected. After roughly a week of collecting fixes to the rc1
+ content, rc2 is released. This repeats on a roughly weekly basis
+ until rc7 (typically; sometimes rc6 if things are quiet, or rc8 if
+ things are in a state of churn), and a week after the last vX.Y-rcN
+ was done, the official "vX.Y" is released.
+
+ Relating that to netdev: At the beginning of the 2-week merge window,
+ the net-next tree will be closed - no new changes/features. The
+ accumulated new content of the past ~10 weeks will be passed onto
+ mainline/Linus via a pull request for vX.Y -- at the same time,
+ the "net" tree will start accumulating fixes for this pulled content
+ relating to vX.Y
+
+ An announcement indicating when net-next has been closed is usually
+ sent to netdev, but knowing the above, you can predict that in advance.
+
+ IMPORTANT: Do not send new net-next content to netdev during the
+ period during which net-next tree is closed.
+
+ Shortly after the two weeks have passed (and vX.Y-rc1 is released), the
+ tree for net-next reopens to collect content for the next (vX.Y+1) release.
+
+ If you aren't subscribed to netdev and/or are simply unsure if net-next
+ has re-opened yet, simply check the net-next git repository link above for
+ any new networking-related commits.
+
+ The "net" tree continues to collect fixes for the vX.Y content, and
+ is fed back to Linus at regular (~weekly) intervals. Meaning that the
+ focus for "net" is on stabilization and bugfixes.
+
+ Finally, the vX.Y gets released, and the whole cycle starts over.
+
+Q: So where are we now in this cycle?
+
+A: Load the mainline (Linus) page here:
+
+ http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git
+
+ and note the top of the "tags" section. If it is rc1, it is early
+ in the dev cycle. If it was tagged rc7 a week ago, then a release
+ is probably imminent.
+
+Q: How do I indicate which tree (net vs. net-next) my patch should be in?
+
+A: Firstly, think whether you have a bug fix or new "next-like" content.
+ Then once decided, assuming that you use git, use the prefix flag, i.e.
+
+ git format-patch --subject-prefix='PATCH net-next' start..finish
+
+ Use "net" instead of "net-next" (always lower case) in the above for
+ bug-fix net content. If you don't use git, then note the only magic in
+ the above is just the subject text of the outgoing e-mail, and you can
+ manually change it yourself with whatever MUA you are comfortable with.
+
+Q: I sent a patch and I'm wondering what happened to it. How can I tell
+ whether it got merged?
+
+A: Start by looking at the main patchworks queue for netdev:
+
+ http://patchwork.ozlabs.org/project/netdev/list/
+
+ The "State" field will tell you exactly where things are at with
+ your patch.
+
+Q: The above only says "Under Review". How can I find out more?
+
+A: Generally speaking, the patches get triaged quickly (in less than 48h).
+ So be patient. Asking the maintainer for status updates on your
+ patch is a good way to ensure your patch is ignored or pushed to
+ the bottom of the priority list.
+
+Q: How can I tell what patches are queued up for backporting to the
+ various stable releases?
+
+A: Normally Greg Kroah-Hartman collects stable commits himself, but
+ for networking, Dave collects up patches he deems critical for the
+ networking subsystem, and then hands them off to Greg.
+
+ There is a patchworks queue that you can see here:
+ http://patchwork.ozlabs.org/bundle/davem/stable/?state=*
+
+ It contains the patches which Dave has selected, but not yet handed
+ off to Greg. If Greg already has the patch, then it will be here:
+ http://git.kernel.org/cgit/linux/kernel/git/stable/stable-queue.git
+
+ A quick way to find whether the patch is in this stable-queue is
+ to simply clone the repo, and then git grep the mainline commit ID, e.g.
+
+ stable-queue$ git grep -l 284041ef21fdf2e
+ releases/3.0.84/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
+ releases/3.4.51/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
+ releases/3.9.8/ipv6-fix-possible-crashes-in-ip6_cork_release.patch
+ stable/stable-queue$
+
+Q: I see a network patch and I think it should be backported to stable.
+ Should I request it via "stable@vger.kernel.org" like the references in
+ the kernel's Documentation/stable_kernel_rules.txt file say?
+
+A: No, not for networking. Check the stable queues as per above 1st to see
+ if it is already queued. If not, then send a mail to netdev, listing
+ the upstream commit ID and why you think it should be a stable candidate.
+
+ Before you jump to go do the above, do note that the normal stable rules
+ in Documentation/stable_kernel_rules.txt still apply. So you need to
+ explicitly indicate why it is a critical fix and exactly what users are
+ impacted. In addition, you need to convince yourself that you _really_
+ think it has been overlooked, vs. having been considered and rejected.
+
+ Generally speaking, the longer it has had a chance to "soak" in mainline,
+ the better the odds that it is an OK candidate for stable. So scrambling
+ to request a commit be added the day after it appears should be avoided.
+
+Q: I have created a network patch and I think it should be backported to
+ stable. Should I add a "Cc: stable@vger.kernel.org" like the references
+ in the kernel's Documentation/ directory say?
+
+A: No. See above answer. In short, if you think it really belongs in
+ stable, then ensure you write a decent commit log that describes who
+ gets impacted by the bugfix and how it manifests itself, and when the
+ bug was introduced. If you do that properly, then the commit will
+ get handled appropriately and most likely get put in the patchworks
+ stable queue if it really warrants it.
+
+ If you think there is some valid information relating to it being in
+ stable that does _not_ belong in the commit log, then use the three
+ dash marker line as described in Documentation/SubmittingPatches to
+ temporarily embed that information into the patch that you send.
+
+Q: Someone said that the comment style and coding convention is different
+ for the networking content. Is this true?
+
+A: Yes, in a largely trivial way. Instead of this:
+
+ /*
+ * foobar blah blah blah
+ * another line of text
+ */
+
+ it is requested that you make it look like this:
+
+ /* foobar blah blah blah
+ * another line of text
+ */
+
+Q: I am working in existing code that has the former comment style and not the
+ latter. Should I submit new code in the former style or the latter?
+
+A: Make it the latter style, so that eventually all code in the domain of
+ netdev is of this format.
+
+Q: I found a bug that might have possible security implications or similar.
+ Should I mail the main netdev maintainer off-list?
+
+A: No. The current netdev maintainer has consistently requested that people
+ use the mailing lists and not reach out directly. If you aren't OK with
+ that, then perhaps consider mailing "security@kernel.org" or reading about
+ http://oss-security.openwall.org/wiki/mailing-lists/distros
+ as possible alternative mechanisms.
+
+Q: What level of testing is expected before I submit my change?
+
+A: If your changes are against net-next, the expectation is that you
+ have tested by layering your changes on top of net-next. Ideally you
+ will have done run-time testing specific to your change, but at a
+ minimum, your changes should survive an "allyesconfig" and an
+ "allmodconfig" build without new warnings or failures.
+
+Q: Any other tips to help ensure my net/net-next patch gets OK'd?
+
+A: Attention to detail. Re-read your own work as if you were the
+ reviewer. You can start with using checkpatch.pl, perhaps even
+ with the "--strict" flag. But do not be mindlessly robotic in
+ doing so. If your change is a bug-fix, make sure your commit log
+ indicates the end-user visible symptom, the underlying reason as
+ to why it happens, and then if necessary, explain why the fix proposed
+ is the best way to get things done. Don't mangle whitespace, and as
+ is common, don't mis-indent function arguments that span multiple lines.
+ If it is your first patch, mail it to yourself so you can test apply
+ it to an unpatched tree to confirm infrastructure didn't mangle it.
+
+ Finally, go back and read Documentation/SubmittingPatches to be
+ sure you are not repeating some common mistake documented there.
diff --git a/Documentation/networking/netdev-features.txt b/Documentation/networking/netdev-features.txt
new file mode 100644
index 000000000..f310edec8
--- /dev/null
+++ b/Documentation/networking/netdev-features.txt
@@ -0,0 +1,167 @@
+Netdev features mess and how to get out from it alive
+=====================================================
+
+Author:
+ Michał Mirosław <mirq-linux@rere.qmqm.pl>
+
+
+
+ Part I: Feature sets
+======================
+
+Long gone are the days when a network card would just take and give packets
+verbatim. Today's devices add multiple features and bugs (read: offloads)
+that relieve an OS of various tasks like generating and checking checksums,
+splitting packets, classifying them. Those capabilities and their state
+are commonly referred to as netdev features in Linux kernel world.
+
+There are currently three sets of features relevant to the driver, and
+one used internally by network core:
+
+ 1. netdev->hw_features set contains features whose state may possibly
+ be changed (enabled or disabled) for a particular device by user's
+ request. This set should be initialized in ndo_init callback and not
+ changed later.
+
+ 2. netdev->features set contains features which are currently enabled
+ for a device. This should be changed only by network core or in
+ error paths of ndo_set_features callback.
+
+ 3. netdev->vlan_features set contains features whose state is inherited
+ by child VLAN devices (limits netdev->features set). This is currently
+ used for all VLAN devices whether tags are stripped or inserted in
+ hardware or software.
+
+ 4. netdev->wanted_features set contains feature set requested by user.
+ This set is filtered by ndo_fix_features callback whenever it or
+ some device-specific conditions change. This set is internal to
+ networking core and should not be referenced in drivers.
+
+
+
+ Part II: Controlling enabled features
+=======================================
+
+When current feature set (netdev->features) is to be changed, new set
+is calculated and filtered by calling ndo_fix_features callback
+and netdev_fix_features(). If the resulting set differs from current
+set, it is passed to ndo_set_features callback and (if the callback
+returns success) replaces value stored in netdev->features.
+NETDEV_FEAT_CHANGE notification is issued after that whenever current
+set might have changed.
+
+The following events trigger recalculation:
+ 1. device's registration, after ndo_init returned success
+ 2. user requested changes in features state
+ 3. netdev_update_features() is called
+
+ndo_*_features callbacks are called with rtnl_lock held. Missing callbacks
+are treated as always returning success.
+
+A driver that wants to trigger recalculation must do so by calling
+netdev_update_features() while holding rtnl_lock. This should not be done
+from ndo_*_features callbacks. netdev->features should not be modified by
+driver except by means of ndo_fix_features callback.
+
+
+
+ Part III: Implementation hints
+================================
+
+ * ndo_fix_features:
+
+All dependencies between features should be resolved here. The resulting
+set can be reduced further by networking core imposed limitations (as coded
+in netdev_fix_features()). For this reason it is safer to disable a feature
+when its dependencies are not met instead of forcing the dependency on.
+
+This callback should not modify hardware nor driver state (should be
+stateless). It can be called multiple times between successive
+ndo_set_features calls.
+
+Callback must not alter features contained in NETIF_F_SOFT_FEATURES or
+NETIF_F_NEVER_CHANGE sets. The exception is NETIF_F_VLAN_CHALLENGED but
+care must be taken as the change won't affect already configured VLANs.
+
+ * ndo_set_features:
+
+Hardware should be reconfigured to match passed feature set. The set
+should not be altered unless some error condition happens that can't
+be reliably detected in ndo_fix_features. In this case, the callback
+should update netdev->features to match resulting hardware state.
+Errors returned are not (and cannot be) propagated anywhere except dmesg.
+(Note: successful return is zero, >0 means silent error.)
+
+
+
+ Part IV: Features
+===================
+
+For current list of features, see include/linux/netdev_features.h.
+This section describes semantics of some of them.
+
+ * Transmit checksumming
+
+For complete description, see comments near the top of include/linux/skbuff.h.
+
+Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM.
+It means that device can fill TCP/UDP-like checksum anywhere in the packets
+whatever headers there might be.
+
+ * Transmit TCP segmentation offload
+
+NETIF_F_TSO_ECN means that hardware can properly split packets with CWR bit
+set, be it TCPv4 (when NETIF_F_TSO is enabled) or TCPv6 (NETIF_F_TSO6).
+
+ * Transmit DMA from high memory
+
+On platforms where this is relevant, NETIF_F_HIGHDMA signals that
+ndo_start_xmit can handle skbs with frags in high memory.
+
+ * Transmit scatter-gather
+
+Those features say that ndo_start_xmit can handle fragmented skbs:
+NETIF_F_SG --- paged skbs (skb_shinfo()->frags), NETIF_F_FRAGLIST ---
+chained skbs (skb->next/prev list).
+
+ * Software features
+
+Features contained in NETIF_F_SOFT_FEATURES are features of networking
+stack. Driver should not change behaviour based on them.
+
+ * LLTX driver (deprecated for hardware drivers)
+
+NETIF_F_LLTX should be set in drivers that implement their own locking in
+transmit path or don't need locking at all (e.g. software tunnels).
+In ndo_start_xmit, it is recommended to use a try_lock and return
+NETDEV_TX_LOCKED when the spin lock fails. The locking should also properly
+protect against other callbacks (the rules you need to find out).
+
+Don't use it for new drivers.
+
+ * netns-local device
+
+NETIF_F_NETNS_LOCAL is set for devices that are not allowed to move between
+network namespaces (e.g. loopback).
+
+Don't use it in drivers.
+
+ * VLAN challenged
+
+NETIF_F_VLAN_CHALLENGED should be set for devices which can't cope with VLAN
+headers. Some drivers set this because the cards can't handle the bigger MTU.
+[FIXME: Those cases could be fixed in VLAN code by allowing only reduced-MTU
+VLANs. This may be not useful, though.]
+
+* rx-fcs
+
+This requests that the NIC append the Ethernet Frame Checksum (FCS)
+to the end of the skb data. This allows sniffers and other tools to
+read the CRC recorded by the NIC on receipt of the packet.
+
+* rx-all
+
+This requests that the NIC receive all possible frames, including errored
+frames (such as bad FCS, etc). This can be helpful when sniffing a link with
+bad packets on it. Some NICs may receive more packets if also put into normal
+PROMISC mode.
diff --git a/Documentation/networking/netdevices.txt b/Documentation/networking/netdevices.txt
new file mode 100644
index 000000000..0b1cf6b2a
--- /dev/null
+++ b/Documentation/networking/netdevices.txt
@@ -0,0 +1,107 @@
+
+Network Devices, the Kernel, and You!
+
+
+Introduction
+============
+The following is a random collection of documentation regarding
+network devices.
+
+struct net_device allocation rules
+==================================
+Network device structures need to persist even after module is unloaded and
+must be allocated with alloc_netdev_mqs() and friends.
+If device has registered successfully, it will be freed on last use
+by free_netdev(). This is required to handle the pathologic case cleanly
+(example: rmmod mydriver </sys/class/net/myeth/mtu )
+
+alloc_netdev_mqs()/alloc_netdev() reserve extra space for driver
+private data which gets freed when the network device is freed. If
+separately allocated data is attached to the network device
+(netdev_priv(dev)) then it is up to the module exit handler to free that.
+
+MTU
+===
+Each network device has a Maximum Transfer Unit. The MTU does not
+include any link layer protocol overhead. Upper layer protocols must
+not pass a socket buffer (skb) to a device to transmit with more data
+than the mtu. The MTU does not include link layer header overhead, so
+for example on Ethernet if the standard MTU is 1500 bytes used, the
+actual skb will contain up to 1514 bytes because of the Ethernet
+header. Devices should allow for the 4 byte VLAN header as well.
+
+Segmentation Offload (GSO, TSO) is an exception to this rule. The
+upper layer protocol may pass a large socket buffer to the device
+transmit routine, and the device will break that up into separate
+packets based on the current MTU.
+
+MTU is symmetrical and applies both to receive and transmit. A device
+must be able to receive at least the maximum size packet allowed by
+the MTU. A network device may use the MTU as mechanism to size receive
+buffers, but the device should allow packets with VLAN header. With
+standard Ethernet mtu of 1500 bytes, the device should allow up to
+1518 byte packets (1500 + 14 header + 4 tag). The device may either:
+drop, truncate, or pass up oversize packets, but dropping oversize
+packets is preferred.
+
+
+struct net_device synchronization rules
+=======================================
+ndo_open:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+
+ndo_stop:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+ Note: netif_running() is guaranteed false
+
+ndo_do_ioctl:
+ Synchronization: rtnl_lock() semaphore.
+ Context: process
+
+ndo_get_stats:
+ Synchronization: dev_base_lock rwlock.
+ Context: nominally process, but don't sleep inside an rwlock
+
+ndo_start_xmit:
+ Synchronization: __netif_tx_lock spinlock.
+
+ When the driver sets NETIF_F_LLTX in dev->features this will be
+ called without holding netif_tx_lock. In this case the driver
+ has to lock by itself when needed. It is recommended to use a try lock
+ for this and return NETDEV_TX_LOCKED when the spin lock fails.
+ The locking there should also properly protect against
+ set_rx_mode. Note that the use of NETIF_F_LLTX is deprecated.
+ Don't use it for new drivers.
+
+ Context: Process with BHs disabled or BH (timer),
+ will be called with interrupts disabled by netconsole.
+
+ Return codes:
+ o NETDEV_TX_OK everything ok.
+ o NETDEV_TX_BUSY Cannot transmit packet, try later
+ Usually a bug, means queue start/stop flow control is broken in
+ the driver. Note: the driver must NOT put the skb in its DMA ring.
+ o NETDEV_TX_LOCKED Locking failed, please retry quickly.
+ Only valid when NETIF_F_LLTX is set.
+
+ndo_tx_timeout:
+ Synchronization: netif_tx_lock spinlock; all TX queues frozen.
+ Context: BHs disabled
+ Notes: netif_queue_stopped() is guaranteed true
+
+ndo_set_rx_mode:
+ Synchronization: netif_addr_lock spinlock.
+ Context: BHs disabled
+
+struct napi_struct synchronization rules
+========================================
+napi->poll:
+ Synchronization: NAPI_STATE_SCHED bit in napi->state. Device
+ driver's ndo_stop method will invoke napi_disable() on
+ all NAPI instances which will do a sleeping poll on the
+ NAPI_STATE_SCHED napi->state bit, waiting for all pending
+ NAPI activity to cease.
+ Context: softirq
+ will be called with interrupts disabled by netconsole.
diff --git a/Documentation/networking/netif-msg.txt b/Documentation/networking/netif-msg.txt
new file mode 100644
index 000000000..c967ddb90
--- /dev/null
+++ b/Documentation/networking/netif-msg.txt
@@ -0,0 +1,79 @@
+
+________________
+NETIF Msg Level
+
+The design of the network interface message level setting.
+
+History
+
+ The design of the debugging message interface was guided and
+ constrained by backwards compatibility previous practice. It is useful
+ to understand the history and evolution in order to understand current
+ practice and relate it to older driver source code.
+
+ From the beginning of Linux, each network device driver has had a local
+ integer variable that controls the debug message level. The message
+ level ranged from 0 to 7, and monotonically increased in verbosity.
+
+ The message level was not precisely defined past level 3, but were
+ always implemented within +-1 of the specified level. Drivers tended
+ to shed the more verbose level messages as they matured.
+ 0 Minimal messages, only essential information on fatal errors.
+ 1 Standard messages, initialization status. No run-time messages
+ 2 Special media selection messages, generally timer-driver.
+ 3 Interface starts and stops, including normal status messages
+ 4 Tx and Rx frame error messages, and abnormal driver operation
+ 5 Tx packet queue information, interrupt events.
+ 6 Status on each completed Tx packet and received Rx packets
+ 7 Initial contents of Tx and Rx packets
+
+ Initially this message level variable was uniquely named in each driver
+ e.g. "lance_debug", so that a kernel symbolic debugger could locate and
+ modify the setting. When kernel modules became common, the variables
+ were consistently renamed to "debug" and allowed to be set as a module
+ parameter.
+
+ This approach worked well. However there is always a demand for
+ additional features. Over the years the following emerged as
+ reasonable and easily implemented enhancements
+ Using an ioctl() call to modify the level.
+ Per-interface rather than per-driver message level setting.
+ More selective control over the type of messages emitted.
+
+ The netif_msg recommendation adds these features with only a minor
+ complexity and code size increase.
+
+ The recommendation is the following points
+ Retaining the per-driver integer variable "debug" as a module
+ parameter with a default level of '1'.
+
+ Adding a per-interface private variable named "msg_enable". The
+ variable is a bit map rather than a level, and is initialized as
+ 1 << debug
+ Or more precisely
+ debug < 0 ? 0 : 1 << min(sizeof(int)-1, debug)
+
+ Messages should changes from
+ if (debug > 1)
+ printk(MSG_DEBUG "%s: ...
+ to
+ if (np->msg_enable & NETIF_MSG_LINK)
+ printk(MSG_DEBUG "%s: ...
+
+
+The set of message levels is named
+ Old level Name Bit position
+ 0 NETIF_MSG_DRV 0x0001
+ 1 NETIF_MSG_PROBE 0x0002
+ 2 NETIF_MSG_LINK 0x0004
+ 2 NETIF_MSG_TIMER 0x0004
+ 3 NETIF_MSG_IFDOWN 0x0008
+ 3 NETIF_MSG_IFUP 0x0008
+ 4 NETIF_MSG_RX_ERR 0x0010
+ 4 NETIF_MSG_TX_ERR 0x0010
+ 5 NETIF_MSG_TX_QUEUED 0x0020
+ 5 NETIF_MSG_INTR 0x0020
+ 6 NETIF_MSG_TX_DONE 0x0040
+ 6 NETIF_MSG_RX_STATUS 0x0040
+ 7 NETIF_MSG_PKTDATA 0x0080
+
diff --git a/Documentation/networking/netlink_mmap.txt b/Documentation/networking/netlink_mmap.txt
new file mode 100644
index 000000000..54f10478e
--- /dev/null
+++ b/Documentation/networking/netlink_mmap.txt
@@ -0,0 +1,332 @@
+This file documents how to use memory mapped I/O with netlink.
+
+Author: Patrick McHardy <kaber@trash.net>
+
+Overview
+--------
+
+Memory mapped netlink I/O can be used to increase throughput and decrease
+overhead of unicast receive and transmit operations. Some netlink subsystems
+require high throughput, these are mainly the netfilter subsystems
+nfnetlink_queue and nfnetlink_log, but it can also help speed up large
+dump operations of f.i. the routing database.
+
+Memory mapped netlink I/O used two circular ring buffers for RX and TX which
+are mapped into the processes address space.
+
+The RX ring is used by the kernel to directly construct netlink messages into
+user-space memory without copying them as done with regular socket I/O,
+additionally as long as the ring contains messages no recvmsg() or poll()
+syscalls have to be issued by user-space to get more message.
+
+The TX ring is used to process messages directly from user-space memory, the
+kernel processes all messages contained in the ring using a single sendmsg()
+call.
+
+Usage overview
+--------------
+
+In order to use memory mapped netlink I/O, user-space needs three main changes:
+
+- ring setup
+- conversion of the RX path to get messages from the ring instead of recvmsg()
+- conversion of the TX path to construct messages into the ring
+
+Ring setup is done using setsockopt() to provide the ring parameters to the
+kernel, then a call to mmap() to map the ring into the processes address space:
+
+- setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &params, sizeof(params));
+- setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &params, sizeof(params));
+- ring = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0)
+
+Usage of either ring is optional, but even if only the RX ring is used the
+mapping still needs to be writable in order to update the frame status after
+processing.
+
+Conversion of the reception path involves calling poll() on the file
+descriptor, once the socket is readable the frames from the ring are
+processed in order until no more messages are available, as indicated by
+a status word in the frame header.
+
+On kernel side, in order to make use of memory mapped I/O on receive, the
+originating netlink subsystem needs to support memory mapped I/O, otherwise
+it will use an allocated socket buffer as usual and the contents will be
+ copied to the ring on transmission, nullifying most of the performance gains.
+Dumps of kernel databases automatically support memory mapped I/O.
+
+Conversion of the transmit path involves changing message construction to
+use memory from the TX ring instead of (usually) a buffer declared on the
+stack and setting up the frame header appropriately. Optionally poll() can
+be used to wait for free frames in the TX ring.
+
+Structured and definitions for using memory mapped I/O are contained in
+<linux/netlink.h>.
+
+RX and TX rings
+----------------
+
+Each ring contains a number of continuous memory blocks, containing frames of
+fixed size dependent on the parameters used for ring setup.
+
+Ring: [ block 0 ]
+ [ frame 0 ]
+ [ frame 1 ]
+ [ block 1 ]
+ [ frame 2 ]
+ [ frame 3 ]
+ ...
+ [ block n ]
+ [ frame 2 * n ]
+ [ frame 2 * n + 1 ]
+
+The blocks are only visible to the kernel, from the point of view of user-space
+the ring just contains the frames in a continuous memory zone.
+
+The ring parameters used for setting up the ring are defined as follows:
+
+struct nl_mmap_req {
+ unsigned int nm_block_size;
+ unsigned int nm_block_nr;
+ unsigned int nm_frame_size;
+ unsigned int nm_frame_nr;
+};
+
+Frames are grouped into blocks, where each block is a continuous region of memory
+and holds nm_block_size / nm_frame_size frames. The total number of frames in
+the ring is nm_frame_nr. The following invariants hold:
+
+- frames_per_block = nm_block_size / nm_frame_size
+
+- nm_frame_nr = frames_per_block * nm_block_nr
+
+Some parameters are constrained, specifically:
+
+- nm_block_size must be a multiple of the architectures memory page size.
+ The getpagesize() function can be used to get the page size.
+
+- nm_frame_size must be equal or larger to NL_MMAP_HDRLEN, IOW a frame must be
+ able to hold at least the frame header
+
+- nm_frame_size must be smaller or equal to nm_block_size
+
+- nm_frame_size must be a multiple of NL_MMAP_MSG_ALIGNMENT
+
+- nm_frame_nr must equal the actual number of frames as specified above.
+
+When the kernel can't allocate physically continuous memory for a ring block,
+it will fall back to use physically discontinuous memory. This might affect
+performance negatively, in order to avoid this the nm_frame_size parameter
+should be chosen to be as small as possible for the required frame size and
+the number of blocks should be increased instead.
+
+Ring frames
+------------
+
+Each frames contain a frame header, consisting of a synchronization word and some
+meta-data, and the message itself.
+
+Frame: [ header message ]
+
+The frame header is defined as follows:
+
+struct nl_mmap_hdr {
+ unsigned int nm_status;
+ unsigned int nm_len;
+ __u32 nm_group;
+ /* credentials */
+ __u32 nm_pid;
+ __u32 nm_uid;
+ __u32 nm_gid;
+};
+
+- nm_status is used for synchronizing processing between the kernel and user-
+ space and specifies ownership of the frame as well as the operation to perform
+
+- nm_len contains the length of the message contained in the data area
+
+- nm_group specified the destination multicast group of message
+
+- nm_pid, nm_uid and nm_gid contain the netlink pid, UID and GID of the sending
+ process. These values correspond to the data available using SOCK_PASSCRED in
+ the SCM_CREDENTIALS cmsg.
+
+The possible values in the status word are:
+
+- NL_MMAP_STATUS_UNUSED:
+ RX ring: frame belongs to the kernel and contains no message
+ for user-space. Approriate action is to invoke poll()
+ to wait for new messages.
+
+ TX ring: frame belongs to user-space and can be used for
+ message construction.
+
+- NL_MMAP_STATUS_RESERVED:
+ RX ring only: frame is currently used by the kernel for message
+ construction and contains no valid message yet.
+ Appropriate action is to invoke poll() to wait for
+ new messages.
+
+- NL_MMAP_STATUS_VALID:
+ RX ring: frame contains a valid message. Approriate action is
+ to process the message and release the frame back to
+ the kernel by setting the status to
+ NL_MMAP_STATUS_UNUSED or queue the frame by setting the
+ status to NL_MMAP_STATUS_SKIP.
+
+ TX ring: the frame contains a valid message from user-space to
+ be processed by the kernel. After completing processing
+ the kernel will release the frame back to user-space by
+ setting the status to NL_MMAP_STATUS_UNUSED.
+
+- NL_MMAP_STATUS_COPY:
+ RX ring only: a message is ready to be processed but could not be
+ stored in the ring, either because it exceeded the
+ frame size or because the originating subsystem does
+ not support memory mapped I/O. Appropriate action is
+ to invoke recvmsg() to receive the message and release
+ the frame back to the kernel by setting the status to
+ NL_MMAP_STATUS_UNUSED.
+
+- NL_MMAP_STATUS_SKIP:
+ RX ring only: user-space queued the message for later processing, but
+ processed some messages following it in the ring. The
+ kernel should skip this frame when looking for unused
+ frames.
+
+The data area of a frame begins at a offset of NL_MMAP_HDRLEN relative to the
+frame header.
+
+TX limitations
+--------------
+
+As of Jan 2015 the message is always copied from the ring frame to an
+allocated buffer due to unresolved security concerns.
+See commit 4682a0358639b29cf ("netlink: Always copy on mmap TX.").
+
+Example
+-------
+
+Ring setup:
+
+ unsigned int block_size = 16 * getpagesize();
+ struct nl_mmap_req req = {
+ .nm_block_size = block_size,
+ .nm_block_nr = 64,
+ .nm_frame_size = 16384,
+ .nm_frame_nr = 64 * block_size / 16384,
+ };
+ unsigned int ring_size;
+ void *rx_ring, *tx_ring;
+
+ /* Configure ring parameters */
+ if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
+ exit(1);
+ if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
+ exit(1)
+
+ /* Calculate size of each individual ring */
+ ring_size = req.nm_block_nr * req.nm_block_size;
+
+ /* Map RX/TX rings. The TX ring is located after the RX ring */
+ rx_ring = mmap(NULL, 2 * ring_size, PROT_READ | PROT_WRITE,
+ MAP_SHARED, fd, 0);
+ if ((long)rx_ring == -1L)
+ exit(1);
+ tx_ring = rx_ring + ring_size:
+
+Message reception:
+
+This example assumes some ring parameters of the ring setup are available.
+
+ unsigned int frame_offset = 0;
+ struct nl_mmap_hdr *hdr;
+ struct nlmsghdr *nlh;
+ unsigned char buf[16384];
+ ssize_t len;
+
+ while (1) {
+ struct pollfd pfds[1];
+
+ pfds[0].fd = fd;
+ pfds[0].events = POLLIN | POLLERR;
+ pfds[0].revents = 0;
+
+ if (poll(pfds, 1, -1) < 0 && errno != -EINTR)
+ exit(1);
+
+ /* Check for errors. Error handling omitted */
+ if (pfds[0].revents & POLLERR)
+ <handle error>
+
+ /* If no new messages, poll again */
+ if (!(pfds[0].revents & POLLIN))
+ continue;
+
+ /* Process all frames */
+ while (1) {
+ /* Get next frame header */
+ hdr = rx_ring + frame_offset;
+
+ if (hdr->nm_status == NL_MMAP_STATUS_VALID) {
+ /* Regular memory mapped frame */
+ nlh = (void *)hdr + NL_MMAP_HDRLEN;
+ len = hdr->nm_len;
+
+ /* Release empty message immediately. May happen
+ * on error during message construction.
+ */
+ if (len == 0)
+ goto release;
+ } else if (hdr->nm_status == NL_MMAP_STATUS_COPY) {
+ /* Frame queued to socket receive queue */
+ len = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
+ if (len <= 0)
+ break;
+ nlh = buf;
+ } else
+ /* No more messages to process, continue polling */
+ break;
+
+ process_msg(nlh);
+release:
+ /* Release frame back to the kernel */
+ hdr->nm_status = NL_MMAP_STATUS_UNUSED;
+
+ /* Advance frame offset to next frame */
+ frame_offset = (frame_offset + frame_size) % ring_size;
+ }
+ }
+
+Message transmission:
+
+This example assumes some ring parameters of the ring setup are available.
+A single message is constructed and transmitted, to send multiple messages
+at once they would be constructed in consecutive frames before a final call
+to sendto().
+
+ unsigned int frame_offset = 0;
+ struct nl_mmap_hdr *hdr;
+ struct nlmsghdr *nlh;
+ struct sockaddr_nl addr = {
+ .nl_family = AF_NETLINK,
+ };
+
+ hdr = tx_ring + frame_offset;
+ if (hdr->nm_status != NL_MMAP_STATUS_UNUSED)
+ /* No frame available. Use poll() to avoid. */
+ exit(1);
+
+ nlh = (void *)hdr + NL_MMAP_HDRLEN;
+
+ /* Build message */
+ build_message(nlh);
+
+ /* Fill frame header: length and status need to be set */
+ hdr->nm_len = nlh->nlmsg_len;
+ hdr->nm_status = NL_MMAP_STATUS_VALID;
+
+ if (sendto(fd, NULL, 0, 0, &addr, sizeof(addr)) < 0)
+ exit(1);
+
+ /* Advance frame offset to next frame */
+ frame_offset = (frame_offset + frame_size) % ring_size;
diff --git a/Documentation/networking/nf_conntrack-sysctl.txt b/Documentation/networking/nf_conntrack-sysctl.txt
new file mode 100644
index 000000000..f55599c62
--- /dev/null
+++ b/Documentation/networking/nf_conntrack-sysctl.txt
@@ -0,0 +1,177 @@
+/proc/sys/net/netfilter/nf_conntrack_* Variables:
+
+nf_conntrack_acct - BOOLEAN
+ 0 - disabled (default)
+ not 0 - enabled
+
+ Enable connection tracking flow accounting. 64-bit byte and packet
+ counters per flow are added.
+
+nf_conntrack_buckets - INTEGER (read-only)
+ Size of hash table. If not specified as parameter during module
+ loading, the default size is calculated by dividing total memory
+ by 16384 to determine the number of buckets but the hash table will
+ never have fewer than 32 and limited to 16384 buckets. For systems
+ with more than 4GB of memory it will be 65536 buckets.
+
+nf_conntrack_checksum - BOOLEAN
+ 0 - disabled
+ not 0 - enabled (default)
+
+ Verify checksum of incoming packets. Packets with bad checksums are
+ in INVALID state. If this is enabled, such packets will not be
+ considered for connection tracking.
+
+nf_conntrack_count - INTEGER (read-only)
+ Number of currently allocated flow entries.
+
+nf_conntrack_events - BOOLEAN
+ 0 - disabled
+ not 0 - enabled (default)
+
+ If this option is enabled, the connection tracking code will
+ provide userspace with connection tracking events via ctnetlink.
+
+nf_conntrack_events_retry_timeout - INTEGER (seconds)
+ default 15
+
+ This option is only relevant when "reliable connection tracking
+ events" are used. Normally, ctnetlink is "lossy", that is,
+ events are normally dropped when userspace listeners can't keep up.
+
+ Userspace can request "reliable event mode". When this mode is
+ active, the conntrack will only be destroyed after the event was
+ delivered. If event delivery fails, the kernel periodically
+ re-tries to send the event to userspace.
+
+ This is the maximum interval the kernel should use when re-trying
+ to deliver the destroy event.
+
+ A higher number means there will be fewer delivery retries and it
+ will take longer for a backlog to be processed.
+
+nf_conntrack_expect_max - INTEGER
+ Maximum size of expectation table. Default value is
+ nf_conntrack_buckets / 256. Minimum is 1.
+
+nf_conntrack_frag6_high_thresh - INTEGER
+ default 262144
+
+ Maximum memory used to reassemble IPv6 fragments. When
+ nf_conntrack_frag6_high_thresh bytes of memory is allocated for this
+ purpose, the fragment handler will toss packets until
+ nf_conntrack_frag6_low_thresh is reached.
+
+nf_conntrack_frag6_low_thresh - INTEGER
+ default 196608
+
+ See nf_conntrack_frag6_low_thresh
+
+nf_conntrack_frag6_timeout - INTEGER (seconds)
+ default 60
+
+ Time to keep an IPv6 fragment in memory.
+
+nf_conntrack_generic_timeout - INTEGER (seconds)
+ default 600
+
+ Default for generic timeout. This refers to layer 4 unknown/unsupported
+ protocols.
+
+nf_conntrack_helper - BOOLEAN
+ 0 - disabled
+ not 0 - enabled (default)
+
+ Enable automatic conntrack helper assignment.
+
+nf_conntrack_icmp_timeout - INTEGER (seconds)
+ default 30
+
+ Default for ICMP timeout.
+
+nf_conntrack_icmpv6_timeout - INTEGER (seconds)
+ default 30
+
+ Default for ICMP6 timeout.
+
+nf_conntrack_log_invalid - INTEGER
+ 0 - disable (default)
+ 1 - log ICMP packets
+ 6 - log TCP packets
+ 17 - log UDP packets
+ 33 - log DCCP packets
+ 41 - log ICMPv6 packets
+ 136 - log UDPLITE packets
+ 255 - log packets of any protocol
+
+ Log invalid packets of a type specified by value.
+
+nf_conntrack_max - INTEGER
+ Size of connection tracking table. Default value is
+ nf_conntrack_buckets value * 4.
+
+nf_conntrack_tcp_be_liberal - BOOLEAN
+ 0 - disabled (default)
+ not 0 - enabled
+
+ Be conservative in what you do, be liberal in what you accept from others.
+ If it's non-zero, we mark only out of window RST segments as INVALID.
+
+nf_conntrack_tcp_loose - BOOLEAN
+ 0 - disabled
+ not 0 - enabled (default)
+
+ If it is set to zero, we disable picking up already established
+ connections.
+
+nf_conntrack_tcp_max_retrans - INTEGER
+ default 3
+
+ Maximum number of packets that can be retransmitted without
+ received an (acceptable) ACK from the destination. If this number
+ is reached, a shorter timer will be started.
+
+nf_conntrack_tcp_timeout_close - INTEGER (seconds)
+ default 10
+
+nf_conntrack_tcp_timeout_close_wait - INTEGER (seconds)
+ default 60
+
+nf_conntrack_tcp_timeout_established - INTEGER (seconds)
+ default 432000 (5 days)
+
+nf_conntrack_tcp_timeout_fin_wait - INTEGER (seconds)
+ default 120
+
+nf_conntrack_tcp_timeout_last_ack - INTEGER (seconds)
+ default 30
+
+nf_conntrack_tcp_timeout_max_retrans - INTEGER (seconds)
+ default 300
+
+nf_conntrack_tcp_timeout_syn_recv - INTEGER (seconds)
+ default 60
+
+nf_conntrack_tcp_timeout_syn_sent - INTEGER (seconds)
+ default 120
+
+nf_conntrack_tcp_timeout_time_wait - INTEGER (seconds)
+ default 120
+
+nf_conntrack_tcp_timeout_unacknowledged - INTEGER (seconds)
+ default 300
+
+nf_conntrack_timestamp - BOOLEAN
+ 0 - disabled (default)
+ not 0 - enabled
+
+ Enable connection tracking flow timestamping.
+
+nf_conntrack_udp_timeout - INTEGER (seconds)
+ default 30
+
+nf_conntrack_udp_timeout_stream2 - INTEGER (seconds)
+ default 180
+
+ This extended timeout will be used in case there is an UDP stream
+ detected.
diff --git a/Documentation/networking/nfc.txt b/Documentation/networking/nfc.txt
new file mode 100644
index 000000000..b24c29bda
--- /dev/null
+++ b/Documentation/networking/nfc.txt
@@ -0,0 +1,128 @@
+Linux NFC subsystem
+===================
+
+The Near Field Communication (NFC) subsystem is required to standardize the
+NFC device drivers development and to create an unified userspace interface.
+
+This document covers the architecture overview, the device driver interface
+description and the userspace interface description.
+
+Architecture overview
+---------------------
+
+The NFC subsystem is responsible for:
+ - NFC adapters management;
+ - Polling for targets;
+ - Low-level data exchange;
+
+The subsystem is divided in some parts. The 'core' is responsible for
+providing the device driver interface. On the other side, it is also
+responsible for providing an interface to control operations and low-level
+data exchange.
+
+The control operations are available to userspace via generic netlink.
+
+The low-level data exchange interface is provided by the new socket family
+PF_NFC. The NFC_SOCKPROTO_RAW performs raw communication with NFC targets.
+
+
+ +--------------------------------------+
+ | USER SPACE |
+ +--------------------------------------+
+ ^ ^
+ | low-level | control
+ | data exchange | operations
+ | |
+ | v
+ | +-----------+
+ | AF_NFC | netlink |
+ | socket +-----------+
+ | raw ^
+ | |
+ v v
+ +---------+ +-----------+
+ | rawsock | <--------> | core |
+ +---------+ +-----------+
+ ^
+ |
+ v
+ +-----------+
+ | driver |
+ +-----------+
+
+Device Driver Interface
+-----------------------
+
+When registering on the NFC subsystem, the device driver must inform the core
+of the set of supported NFC protocols and the set of ops callbacks. The ops
+callbacks that must be implemented are the following:
+
+* start_poll - setup the device to poll for targets
+* stop_poll - stop on progress polling operation
+* activate_target - select and initialize one of the targets found
+* deactivate_target - deselect and deinitialize the selected target
+* data_exchange - send data and receive the response (transceive operation)
+
+Userspace interface
+--------------------
+
+The userspace interface is divided in control operations and low-level data
+exchange operation.
+
+CONTROL OPERATIONS:
+
+Generic netlink is used to implement the interface to the control operations.
+The operations are composed by commands and events, all listed below:
+
+* NFC_CMD_GET_DEVICE - get specific device info or dump the device list
+* NFC_CMD_START_POLL - setup a specific device to polling for targets
+* NFC_CMD_STOP_POLL - stop the polling operation in a specific device
+* NFC_CMD_GET_TARGET - dump the list of targets found by a specific device
+
+* NFC_EVENT_DEVICE_ADDED - reports an NFC device addition
+* NFC_EVENT_DEVICE_REMOVED - reports an NFC device removal
+* NFC_EVENT_TARGETS_FOUND - reports START_POLL results when 1 or more targets
+are found
+
+The user must call START_POLL to poll for NFC targets, passing the desired NFC
+protocols through NFC_ATTR_PROTOCOLS attribute. The device remains in polling
+state until it finds any target. However, the user can stop the polling
+operation by calling STOP_POLL command. In this case, it will be checked if
+the requester of STOP_POLL is the same of START_POLL.
+
+If the polling operation finds one or more targets, the event TARGETS_FOUND is
+sent (including the device id). The user must call GET_TARGET to get the list of
+all targets found by such device. Each reply message has target attributes with
+relevant information such as the supported NFC protocols.
+
+All polling operations requested through one netlink socket are stopped when
+it's closed.
+
+LOW-LEVEL DATA EXCHANGE:
+
+The userspace must use PF_NFC sockets to perform any data communication with
+targets. All NFC sockets use AF_NFC:
+
+struct sockaddr_nfc {
+ sa_family_t sa_family;
+ __u32 dev_idx;
+ __u32 target_idx;
+ __u32 nfc_protocol;
+};
+
+To establish a connection with one target, the user must create an
+NFC_SOCKPROTO_RAW socket and call the 'connect' syscall with the sockaddr_nfc
+struct correctly filled. All information comes from NFC_EVENT_TARGETS_FOUND
+netlink event. As a target can support more than one NFC protocol, the user
+must inform which protocol it wants to use.
+
+Internally, 'connect' will result in an activate_target call to the driver.
+When the socket is closed, the target is deactivated.
+
+The data format exchanged through the sockets is NFC protocol dependent. For
+instance, when communicating with MIFARE tags, the data exchanged are MIFARE
+commands and their responses.
+
+The first received package is the response to the first sent package and so
+on. In order to allow valid "empty" responses, every data received has a NULL
+header of 1 byte.
diff --git a/Documentation/networking/openvswitch.txt b/Documentation/networking/openvswitch.txt
new file mode 100644
index 000000000..b3b9ac61d
--- /dev/null
+++ b/Documentation/networking/openvswitch.txt
@@ -0,0 +1,248 @@
+Open vSwitch datapath developer documentation
+=============================================
+
+The Open vSwitch kernel module allows flexible userspace control over
+flow-level packet processing on selected network devices. It can be
+used to implement a plain Ethernet switch, network device bonding,
+VLAN processing, network access control, flow-based network control,
+and so on.
+
+The kernel module implements multiple "datapaths" (analogous to
+bridges), each of which can have multiple "vports" (analogous to ports
+within a bridge). Each datapath also has associated with it a "flow
+table" that userspace populates with "flows" that map from keys based
+on packet headers and metadata to sets of actions. The most common
+action forwards the packet to another vport; other actions are also
+implemented.
+
+When a packet arrives on a vport, the kernel module processes it by
+extracting its flow key and looking it up in the flow table. If there
+is a matching flow, it executes the associated actions. If there is
+no match, it queues the packet to userspace for processing (as part of
+its processing, userspace will likely set up a flow to handle further
+packets of the same type entirely in-kernel).
+
+
+Flow key compatibility
+----------------------
+
+Network protocols evolve over time. New protocols become important
+and existing protocols lose their prominence. For the Open vSwitch
+kernel module to remain relevant, it must be possible for newer
+versions to parse additional protocols as part of the flow key. It
+might even be desirable, someday, to drop support for parsing
+protocols that have become obsolete. Therefore, the Netlink interface
+to Open vSwitch is designed to allow carefully written userspace
+applications to work with any version of the flow key, past or future.
+
+To support this forward and backward compatibility, whenever the
+kernel module passes a packet to userspace, it also passes along the
+flow key that it parsed from the packet. Userspace then extracts its
+own notion of a flow key from the packet and compares it against the
+kernel-provided version:
+
+ - If userspace's notion of the flow key for the packet matches the
+ kernel's, then nothing special is necessary.
+
+ - If the kernel's flow key includes more fields than the userspace
+ version of the flow key, for example if the kernel decoded IPv6
+ headers but userspace stopped at the Ethernet type (because it
+ does not understand IPv6), then again nothing special is
+ necessary. Userspace can still set up a flow in the usual way,
+ as long as it uses the kernel-provided flow key to do it.
+
+ - If the userspace flow key includes more fields than the
+ kernel's, for example if userspace decoded an IPv6 header but
+ the kernel stopped at the Ethernet type, then userspace can
+ forward the packet manually, without setting up a flow in the
+ kernel. This case is bad for performance because every packet
+ that the kernel considers part of the flow must go to userspace,
+ but the forwarding behavior is correct. (If userspace can
+ determine that the values of the extra fields would not affect
+ forwarding behavior, then it could set up a flow anyway.)
+
+How flow keys evolve over time is important to making this work, so
+the following sections go into detail.
+
+
+Flow key format
+---------------
+
+A flow key is passed over a Netlink socket as a sequence of Netlink
+attributes. Some attributes represent packet metadata, defined as any
+information about a packet that cannot be extracted from the packet
+itself, e.g. the vport on which the packet was received. Most
+attributes, however, are extracted from headers within the packet,
+e.g. source and destination addresses from Ethernet, IP, or TCP
+headers.
+
+The <linux/openvswitch.h> header file defines the exact format of the
+flow key attributes. For informal explanatory purposes here, we write
+them as comma-separated strings, with parentheses indicating arguments
+and nesting. For example, the following could represent a flow key
+corresponding to a TCP packet that arrived on vport 1:
+
+ in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4),
+ eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=17, tos=0,
+ frag=no), tcp(src=49163, dst=80)
+
+Often we ellipsize arguments not important to the discussion, e.g.:
+
+ in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...)
+
+
+Wildcarded flow key format
+--------------------------
+
+A wildcarded flow is described with two sequences of Netlink attributes
+passed over the Netlink socket. A flow key, exactly as described above, and an
+optional corresponding flow mask.
+
+A wildcarded flow can represent a group of exact match flows. Each '1' bit
+in the mask specifies a exact match with the corresponding bit in the flow key.
+A '0' bit specifies a don't care bit, which will match either a '1' or '0' bit
+of a incoming packet. Using wildcarded flow can improve the flow set up rate
+by reduce the number of new flows need to be processed by the user space program.
+
+Support for the mask Netlink attribute is optional for both the kernel and user
+space program. The kernel can ignore the mask attribute, installing an exact
+match flow, or reduce the number of don't care bits in the kernel to less than
+what was specified by the user space program. In this case, variations in bits
+that the kernel does not implement will simply result in additional flow setups.
+The kernel module will also work with user space programs that neither support
+nor supply flow mask attributes.
+
+Since the kernel may ignore or modify wildcard bits, it can be difficult for
+the userspace program to know exactly what matches are installed. There are
+two possible approaches: reactively install flows as they miss the kernel
+flow table (and therefore not attempt to determine wildcard changes at all)
+or use the kernel's response messages to determine the installed wildcards.
+
+When interacting with userspace, the kernel should maintain the match portion
+of the key exactly as originally installed. This will provides a handle to
+identify the flow for all future operations. However, when reporting the
+mask of an installed flow, the mask should include any restrictions imposed
+by the kernel.
+
+The behavior when using overlapping wildcarded flows is undefined. It is the
+responsibility of the user space program to ensure that any incoming packet
+can match at most one flow, wildcarded or not. The current implementation
+performs best-effort detection of overlapping wildcarded flows and may reject
+some but not all of them. However, this behavior may change in future versions.
+
+
+Unique flow identifiers
+-----------------------
+
+An alternative to using the original match portion of a key as the handle for
+flow identification is a unique flow identifier, or "UFID". UFIDs are optional
+for both the kernel and user space program.
+
+User space programs that support UFID are expected to provide it during flow
+setup in addition to the flow, then refer to the flow using the UFID for all
+future operations. The kernel is not required to index flows by the original
+flow key if a UFID is specified.
+
+
+Basic rule for evolving flow keys
+---------------------------------
+
+Some care is needed to really maintain forward and backward
+compatibility for applications that follow the rules listed under
+"Flow key compatibility" above.
+
+The basic rule is obvious:
+
+ ------------------------------------------------------------------
+ New network protocol support must only supplement existing flow
+ key attributes. It must not change the meaning of already defined
+ flow key attributes.
+ ------------------------------------------------------------------
+
+This rule does have less-obvious consequences so it is worth working
+through a few examples. Suppose, for example, that the kernel module
+did not already implement VLAN parsing. Instead, it just interpreted
+the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the
+packet. The flow key for any packet with an 802.1Q header would look
+essentially like this, ignoring metadata:
+
+ eth(...), eth_type(0x8100)
+
+Naively, to add VLAN support, it makes sense to add a new "vlan" flow
+key attribute to contain the VLAN tag, then continue to decode the
+encapsulated headers beyond the VLAN tag using the existing field
+definitions. With this change, a TCP packet in VLAN 10 would have a
+flow key much like this:
+
+ eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...)
+
+But this change would negatively affect a userspace application that
+has not been updated to understand the new "vlan" flow key attribute.
+The application could, following the flow compatibility rules above,
+ignore the "vlan" attribute that it does not understand and therefore
+assume that the flow contained IP packets. This is a bad assumption
+(the flow only contains IP packets if one parses and skips over the
+802.1Q header) and it could cause the application's behavior to change
+across kernel versions even though it follows the compatibility rules.
+
+The solution is to use a set of nested attributes. This is, for
+example, why 802.1Q support uses nested attributes. A TCP packet in
+VLAN 10 is actually expressed as:
+
+ eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800),
+ ip(proto=6, ...), tcp(...)))
+
+Notice how the "eth_type", "ip", and "tcp" flow key attributes are
+nested inside the "encap" attribute. Thus, an application that does
+not understand the "vlan" key will not see either of those attributes
+and therefore will not misinterpret them. (Also, the outer eth_type
+is still 0x8100, not changed to 0x0800.)
+
+Handling malformed packets
+--------------------------
+
+Don't drop packets in the kernel for malformed protocol headers, bad
+checksums, etc. This would prevent userspace from implementing a
+simple Ethernet switch that forwards every packet.
+
+Instead, in such a case, include an attribute with "empty" content.
+It doesn't matter if the empty content could be valid protocol values,
+as long as those values are rarely seen in practice, because userspace
+can always forward all packets with those values to userspace and
+handle them individually.
+
+For example, consider a packet that contains an IP header that
+indicates protocol 6 for TCP, but which is truncated just after the IP
+header, so that the TCP header is missing. The flow key for this
+packet would include a tcp attribute with all-zero src and dst, like
+this:
+
+ eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)
+
+As another example, consider a packet with an Ethernet type of 0x8100,
+indicating that a VLAN TCI should follow, but which is truncated just
+after the Ethernet type. The flow key for this packet would include
+an all-zero-bits vlan and an empty encap attribute, like this:
+
+ eth(...), eth_type(0x8100), vlan(0), encap()
+
+Unlike a TCP packet with source and destination ports 0, an
+all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka
+VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan
+attribute expressly to allow this situation to be distinguished.
+Thus, the flow key in this second example unambiguously indicates a
+missing or malformed VLAN TCI.
+
+Other rules
+-----------
+
+The other rules for flow keys are much less subtle:
+
+ - Duplicate attributes are not allowed at a given nesting level.
+
+ - Ordering of attributes is not significant.
+
+ - When the kernel sends a given flow key to userspace, it always
+ composes it the same way. This allows userspace to hash and
+ compare entire flow keys that it may not be able to fully
+ interpret.
diff --git a/Documentation/networking/operstates.txt b/Documentation/networking/operstates.txt
new file mode 100644
index 000000000..355c6d8ef
--- /dev/null
+++ b/Documentation/networking/operstates.txt
@@ -0,0 +1,162 @@
+
+1. Introduction
+
+Linux distinguishes between administrative and operational state of an
+interface. Administrative state is the result of "ip link set dev
+<dev> up or down" and reflects whether the administrator wants to use
+the device for traffic.
+
+However, an interface is not usable just because the admin enabled it
+- ethernet requires to be plugged into the switch and, depending on
+a site's networking policy and configuration, an 802.1X authentication
+to be performed before user data can be transferred. Operational state
+shows the ability of an interface to transmit this user data.
+
+Thanks to 802.1X, userspace must be granted the possibility to
+influence operational state. To accommodate this, operational state is
+split into two parts: Two flags that can be set by the driver only, and
+a RFC2863 compatible state that is derived from these flags, a policy,
+and changeable from userspace under certain rules.
+
+
+2. Querying from userspace
+
+Both admin and operational state can be queried via the netlink
+operation RTM_GETLINK. It is also possible to subscribe to RTMGRP_LINK
+to be notified of updates. This is important for setting from userspace.
+
+These values contain interface state:
+
+ifinfomsg::if_flags & IFF_UP:
+ Interface is admin up
+ifinfomsg::if_flags & IFF_RUNNING:
+ Interface is in RFC2863 operational state UP or UNKNOWN. This is for
+ backward compatibility, routing daemons, dhcp clients can use this
+ flag to determine whether they should use the interface.
+ifinfomsg::if_flags & IFF_LOWER_UP:
+ Driver has signaled netif_carrier_on()
+ifinfomsg::if_flags & IFF_DORMANT:
+ Driver has signaled netif_dormant_on()
+
+TLV IFLA_OPERSTATE
+
+contains RFC2863 state of the interface in numeric representation:
+
+IF_OPER_UNKNOWN (0):
+ Interface is in unknown state, neither driver nor userspace has set
+ operational state. Interface must be considered for user data as
+ setting operational state has not been implemented in every driver.
+IF_OPER_NOTPRESENT (1):
+ Unused in current kernel (notpresent interfaces normally disappear),
+ just a numerical placeholder.
+IF_OPER_DOWN (2):
+ Interface is unable to transfer data on L1, f.e. ethernet is not
+ plugged or interface is ADMIN down.
+IF_OPER_LOWERLAYERDOWN (3):
+ Interfaces stacked on an interface that is IF_OPER_DOWN show this
+ state (f.e. VLAN).
+IF_OPER_TESTING (4):
+ Unused in current kernel.
+IF_OPER_DORMANT (5):
+ Interface is L1 up, but waiting for an external event, f.e. for a
+ protocol to establish. (802.1X)
+IF_OPER_UP (6):
+ Interface is operational up and can be used.
+
+This TLV can also be queried via sysfs.
+
+TLV IFLA_LINKMODE
+
+contains link policy. This is needed for userspace interaction
+described below.
+
+This TLV can also be queried via sysfs.
+
+
+3. Kernel driver API
+
+Kernel drivers have access to two flags that map to IFF_LOWER_UP and
+IFF_DORMANT. These flags can be set from everywhere, even from
+interrupts. It is guaranteed that only the driver has write access,
+however, if different layers of the driver manipulate the same flag,
+the driver has to provide the synchronisation needed.
+
+__LINK_STATE_NOCARRIER, maps to !IFF_LOWER_UP:
+
+The driver uses netif_carrier_on() to clear and netif_carrier_off() to
+set this flag. On netif_carrier_off(), the scheduler stops sending
+packets. The name 'carrier' and the inversion are historical, think of
+it as lower layer.
+
+Note that for certain kind of soft-devices, which are not managing any
+real hardware, it is possible to set this bit from userspace. One
+should use TVL IFLA_CARRIER to do so.
+
+netif_carrier_ok() can be used to query that bit.
+
+__LINK_STATE_DORMANT, maps to IFF_DORMANT:
+
+Set by the driver to express that the device cannot yet be used
+because some driver controlled protocol establishment has to
+complete. Corresponding functions are netif_dormant_on() to set the
+flag, netif_dormant_off() to clear it and netif_dormant() to query.
+
+On device allocation, networking core sets the flags equivalent to
+netif_carrier_ok() and !netif_dormant().
+
+
+Whenever the driver CHANGES one of these flags, a workqueue event is
+scheduled to translate the flag combination to IFLA_OPERSTATE as
+follows:
+
+!netif_carrier_ok():
+ IF_OPER_LOWERLAYERDOWN if the interface is stacked, IF_OPER_DOWN
+ otherwise. Kernel can recognise stacked interfaces because their
+ ifindex != iflink.
+
+netif_carrier_ok() && netif_dormant():
+ IF_OPER_DORMANT
+
+netif_carrier_ok() && !netif_dormant():
+ IF_OPER_UP if userspace interaction is disabled. Otherwise
+ IF_OPER_DORMANT with the possibility for userspace to initiate the
+ IF_OPER_UP transition afterwards.
+
+
+4. Setting from userspace
+
+Applications have to use the netlink interface to influence the
+RFC2863 operational state of an interface. Setting IFLA_LINKMODE to 1
+via RTM_SETLINK instructs the kernel that an interface should go to
+IF_OPER_DORMANT instead of IF_OPER_UP when the combination
+netif_carrier_ok() && !netif_dormant() is set by the
+driver. Afterwards, the userspace application can set IFLA_OPERSTATE
+to IF_OPER_DORMANT or IF_OPER_UP as long as the driver does not set
+netif_carrier_off() or netif_dormant_on(). Changes made by userspace
+are multicasted on the netlink group RTMGRP_LINK.
+
+So basically a 802.1X supplicant interacts with the kernel like this:
+
+-subscribe to RTMGRP_LINK
+-set IFLA_LINKMODE to 1 via RTM_SETLINK
+-query RTM_GETLINK once to get initial state
+-if initial flags are not (IFF_LOWER_UP && !IFF_DORMANT), wait until
+ netlink multicast signals this state
+-do 802.1X, eventually abort if flags go down again
+-send RTM_SETLINK to set operstate to IF_OPER_UP if authentication
+ succeeds, IF_OPER_DORMANT otherwise
+-see how operstate and IFF_RUNNING is echoed via netlink multicast
+-set interface back to IF_OPER_DORMANT if 802.1X reauthentication
+ fails
+-restart if kernel changes IFF_LOWER_UP or IFF_DORMANT flag
+
+if supplicant goes down, bring back IFLA_LINKMODE to 0 and
+IFLA_OPERSTATE to a sane value.
+
+A routing daemon or dhcp client just needs to care for IFF_RUNNING or
+waiting for operstate to go IF_OPER_UP/IF_OPER_UNKNOWN before
+considering the interface / querying a DHCP address.
+
+
+For technical questions and/or comments please e-mail to Stefan Rompf
+(stefan at loplof.de).
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
new file mode 100644
index 000000000..daa015af1
--- /dev/null
+++ b/Documentation/networking/packet_mmap.txt
@@ -0,0 +1,1068 @@
+--------------------------------------------------------------------------------
++ ABSTRACT
+--------------------------------------------------------------------------------
+
+This file documents the mmap() facility available with the PACKET
+socket interface on 2.4/2.6/3.x kernels. This type of sockets is used for
+i) capture network traffic with utilities like tcpdump, ii) transmit network
+traffic, or any other that needs raw access to network interface.
+
+You can find the latest version of this document at:
+ http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap
+
+Howto can be found at:
+ http://wiki.gnu-log.net (packet_mmap)
+
+Please send your comments to
+ Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
+ Johann Baudy <johann.baudy@gnu-log.net>
+
+-------------------------------------------------------------------------------
++ Why use PACKET_MMAP
+--------------------------------------------------------------------------------
+
+In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very
+inefficient. It uses very limited buffers and requires one system call to
+capture each packet, it requires two if you want to get packet's timestamp
+(like libpcap always does).
+
+In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size
+configurable circular buffer mapped in user space that can be used to either
+send or receive packets. This way reading packets just needs to wait for them,
+most of the time there is no need to issue a single system call. Concerning
+transmission, multiple packets can be sent through one system call to get the
+highest bandwidth. By using a shared buffer between the kernel and the user
+also has the benefit of minimizing packet copies.
+
+It's fine to use PACKET_MMAP to improve the performance of the capture and
+transmission process, but it isn't everything. At least, if you are capturing
+at high speeds (this is relative to the cpu speed), you should check if the
+device driver of your network interface card supports some sort of interrupt
+load mitigation or (even better) if it supports NAPI, also make sure it is
+enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
+supported by devices of your network. CPU IRQ pinning of your network interface
+card can also be an advantage.
+
+--------------------------------------------------------------------------------
++ How to use mmap() to improve capture process
+--------------------------------------------------------------------------------
+
+From the user standpoint, you should use the higher level libpcap library, which
+is a de facto standard, portable across nearly all operating systems
+including Win32.
+
+Said that, at time of this writing, official libpcap 0.8.1 is out and doesn't include
+support for PACKET_MMAP, and also probably the libpcap included in your distribution.
+
+I'm aware of two implementations of PACKET_MMAP in libpcap:
+
+ http://wiki.ipxwarzone.com/ (by Simon Patarin, based on libpcap 0.6.2)
+ http://public.lanl.gov/cpw/ (by Phil Wood, based on lastest libpcap)
+
+The rest of this document is intended for people who want to understand
+the low level details or want to improve libpcap by including PACKET_MMAP
+support.
+
+--------------------------------------------------------------------------------
++ How to use mmap() directly to improve capture process
+--------------------------------------------------------------------------------
+
+From the system calls stand point, the use of PACKET_MMAP involves
+the following process:
+
+
+[setup] socket() -------> creation of the capture socket
+ setsockopt() ---> allocation of the circular buffer (ring)
+ option: PACKET_RX_RING
+ mmap() ---------> mapping of the allocated buffer to the
+ user process
+
+[capture] poll() ---------> to wait for incoming packets
+
+[shutdown] close() --------> destruction of the capture socket and
+ deallocation of all associated
+ resources.
+
+
+socket creation and destruction is straight forward, and is done
+the same way with or without PACKET_MMAP:
+
+ int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));
+
+where mode is SOCK_RAW for the raw interface were link level
+information can be captured or SOCK_DGRAM for the cooked
+interface where link level information capture is not
+supported and a link level pseudo-header is provided
+by the kernel.
+
+The destruction of the socket and all associated resources
+is done by a simple call to close(fd).
+
+Similarly as without PACKET_MMAP, it is possible to use one socket
+for capture and transmission. This can be done by mapping the
+allocated RX and TX buffer ring with a single mmap() call.
+See "Mapping and use of the circular buffer (ring)".
+
+Next I will describe PACKET_MMAP settings and its constraints,
+also the mapping of the circular buffer in the user process and
+the use of this buffer.
+
+--------------------------------------------------------------------------------
++ How to use mmap() directly to improve transmission process
+--------------------------------------------------------------------------------
+Transmission process is similar to capture as shown below.
+
+[setup] socket() -------> creation of the transmission socket
+ setsockopt() ---> allocation of the circular buffer (ring)
+ option: PACKET_TX_RING
+ bind() ---------> bind transmission socket with a network interface
+ mmap() ---------> mapping of the allocated buffer to the
+ user process
+
+[transmission] poll() ---------> wait for free packets (optional)
+ send() ---------> send all packets that are set as ready in
+ the ring
+ The flag MSG_DONTWAIT can be used to return
+ before end of transfer.
+
+[shutdown] close() --------> destruction of the transmission socket and
+ deallocation of all associated resources.
+
+Socket creation and destruction is also straight forward, and is done
+the same way as in capturing described in the previous paragraph:
+
+ int fd = socket(PF_PACKET, mode, 0);
+
+The protocol can optionally be 0 in case we only want to transmit
+via this socket, which avoids an expensive call to packet_rcv().
+In this case, you also need to bind(2) the TX_RING with sll_protocol = 0
+set. Otherwise, htons(ETH_P_ALL) or any other protocol, for example.
+
+Binding the socket to your network interface is mandatory (with zero copy) to
+know the header size of frames used in the circular buffer.
+
+As capture, each frame contains two parts:
+
+ --------------------
+| struct tpacket_hdr | Header. It contains the status of
+| | of this frame
+|--------------------|
+| data buffer |
+. . Data that will be sent over the network interface.
+. .
+ --------------------
+
+ bind() associates the socket to your network interface thanks to
+ sll_ifindex parameter of struct sockaddr_ll.
+
+ Initialization example:
+
+ struct sockaddr_ll my_addr;
+ struct ifreq s_ifr;
+ ...
+
+ strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));
+
+ /* get interface index of eth0 */
+ ioctl(this->socket, SIOCGIFINDEX, &s_ifr);
+
+ /* fill sockaddr_ll struct to prepare binding */
+ my_addr.sll_family = AF_PACKET;
+ my_addr.sll_protocol = htons(ETH_P_ALL);
+ my_addr.sll_ifindex = s_ifr.ifr_ifindex;
+
+ /* bind socket to eth0 */
+ bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));
+
+ A complete tutorial is available at: http://wiki.gnu-log.net/
+
+By default, the user should put data at :
+ frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll)
+
+So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW),
+the beginning of the user data will be at :
+ frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
+
+If you wish to put user data at a custom offset from the beginning of
+the frame (for payload alignment with SOCK_RAW mode for instance) you
+can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order
+to make this work it must be enabled previously with setsockopt()
+and the PACKET_TX_HAS_OFF option.
+
+--------------------------------------------------------------------------------
++ PACKET_MMAP settings
+--------------------------------------------------------------------------------
+
+To setup PACKET_MMAP from user level code is done with a call like
+
+ - Capture process
+ setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))
+ - Transmission process
+ setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))
+
+The most significant argument in the previous call is the req parameter,
+this parameter must to have the following structure:
+
+ struct tpacket_req
+ {
+ unsigned int tp_block_size; /* Minimal size of contiguous block */
+ unsigned int tp_block_nr; /* Number of blocks */
+ unsigned int tp_frame_size; /* Size of frame */
+ unsigned int tp_frame_nr; /* Total number of frames */
+ };
+
+This structure is defined in /usr/include/linux/if_packet.h and establishes a
+circular buffer (ring) of unswappable memory.
+Being mapped in the capture process allows reading the captured frames and
+related meta-information like timestamps without requiring a system call.
+
+Frames are grouped in blocks. Each block is a physically contiguous
+region of memory and holds tp_block_size/tp_frame_size frames. The total number
+of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because
+
+ frames_per_block = tp_block_size/tp_frame_size
+
+indeed, packet_set_ring checks that the following condition is true
+
+ frames_per_block * tp_block_nr == tp_frame_nr
+
+Lets see an example, with the following values:
+
+ tp_block_size= 4096
+ tp_frame_size= 2048
+ tp_block_nr = 4
+ tp_frame_nr = 8
+
+we will get the following buffer structure:
+
+ block #1 block #2
++---------+---------+ +---------+---------+
+| frame 1 | frame 2 | | frame 3 | frame 4 |
++---------+---------+ +---------+---------+
+
+ block #3 block #4
++---------+---------+ +---------+---------+
+| frame 5 | frame 6 | | frame 7 | frame 8 |
++---------+---------+ +---------+---------+
+
+A frame can be of any size with the only condition it can fit in a block. A block
+can only hold an integer number of frames, or in other words, a frame cannot
+be spawned across two blocks, so there are some details you have to take into
+account when choosing the frame_size. See "Mapping and use of the circular
+buffer (ring)".
+
+--------------------------------------------------------------------------------
++ PACKET_MMAP setting constraints
+--------------------------------------------------------------------------------
+
+In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
+the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
+16384 in a 64 bit architecture. For information on these kernel versions
+see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt
+
+ Block size limit
+------------------
+
+As stated earlier, each block is a contiguous physical region of memory. These
+memory regions are allocated with calls to the __get_free_pages() function. As
+the name indicates, this function allocates pages of memory, and the second
+argument is "order" or a power of two number of pages, that is
+(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
+order=2 ==> 16384 bytes, etc. The maximum size of a
+region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
+precisely the limit can be calculated as:
+
+ PAGE_SIZE << MAX_ORDER
+
+ In a i386 architecture PAGE_SIZE is 4096 bytes
+ In a 2.4/i386 kernel MAX_ORDER is 10
+ In a 2.6/i386 kernel MAX_ORDER is 11
+
+So get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
+respectively, with an i386 architecture.
+
+User space programs can include /usr/include/sys/user.h and
+/usr/include/linux/mmzone.h to get PAGE_SIZE MAX_ORDER declarations.
+
+The pagesize can also be determined dynamically with the getpagesize (2)
+system call.
+
+ Block number limit
+--------------------
+
+To understand the constraints of PACKET_MMAP, we have to see the structure
+used to hold the pointers to each block.
+
+Currently, this structure is a dynamically allocated vector with kmalloc
+called pg_vec, its size limits the number of blocks that can be allocated.
+
+ +---+---+---+---+
+ | x | x | x | x |
+ +---+---+---+---+
+ | | | |
+ | | | v
+ | | v block #4
+ | v block #3
+ v block #2
+ block #1
+
+kmalloc allocates any number of bytes of physically contiguous memory from
+a pool of pre-determined sizes. This pool of memory is maintained by the slab
+allocator which is at the end the responsible for doing the allocation and
+hence which imposes the maximum memory that kmalloc can allocate.
+
+In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The
+predetermined sizes that kmalloc uses can be checked in the "size-<bytes>"
+entries of /proc/slabinfo
+
+In a 32 bit architecture, pointers are 4 bytes long, so the total number of
+pointers to blocks is
+
+ 131072/4 = 32768 blocks
+
+ PACKET_MMAP buffer size calculator
+------------------------------------
+
+Definitions:
+
+<size-max> : is the maximum size of allocable with kmalloc (see /proc/slabinfo)
+<pointer size>: depends on the architecture -- sizeof(void *)
+<page size> : depends on the architecture -- PAGE_SIZE or getpagesize (2)
+<max-order> : is the value defined with MAX_ORDER
+<frame size> : it's an upper bound of frame's capture size (more on this later)
+
+from these definitions we will derive
+
+ <block number> = <size-max>/<pointer size>
+ <block size> = <pagesize> << <max-order>
+
+so, the max buffer size is
+
+ <block number> * <block size>
+
+and, the number of frames be
+
+ <block number> * <block size> / <frame size>
+
+Suppose the following parameters, which apply for 2.6 kernel and an
+i386 architecture:
+
+ <size-max> = 131072 bytes
+ <pointer size> = 4 bytes
+ <pagesize> = 4096 bytes
+ <max-order> = 11
+
+and a value for <frame size> of 2048 bytes. These parameters will yield
+
+ <block number> = 131072/4 = 32768 blocks
+ <block size> = 4096 << 11 = 8 MiB.
+
+and hence the buffer will have a 262144 MiB size. So it can hold
+262144 MiB / 2048 bytes = 134217728 frames
+
+Actually, this buffer size is not possible with an i386 architecture.
+Remember that the memory is allocated in kernel space, in the case of
+an i386 kernel's memory size is limited to 1GiB.
+
+All memory allocations are not freed until the socket is closed. The memory
+allocations are done with GFP_KERNEL priority, this basically means that
+the allocation can wait and swap other process' memory in order to allocate
+the necessary memory, so normally limits can be reached.
+
+ Other constraints
+-------------------
+
+If you check the source code you will see that what I draw here as a frame
+is not only the link level frame. At the beginning of each frame there is a
+header called struct tpacket_hdr used in PACKET_MMAP to hold link level's frame
+meta information like timestamp. So what we draw here a frame it's really
+the following (from include/linux/if_packet.h):
+
+/*
+ Frame structure:
+
+ - Start. Frame must be aligned to TPACKET_ALIGNMENT=16
+ - struct tpacket_hdr
+ - pad to TPACKET_ALIGNMENT=16
+ - struct sockaddr_ll
+ - Gap, chosen so that packet data (Start+tp_net) aligns to
+ TPACKET_ALIGNMENT=16
+ - Start+tp_mac: [ Optional MAC header ]
+ - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
+ - Pad to align to TPACKET_ALIGNMENT=16
+ */
+
+ The following are conditions that are checked in packet_set_ring
+
+ tp_block_size must be a multiple of PAGE_SIZE (1)
+ tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
+ tp_frame_size must be a multiple of TPACKET_ALIGNMENT
+ tp_frame_nr must be exactly frames_per_block*tp_block_nr
+
+Note that tp_block_size should be chosen to be a power of two or there will
+be a waste of memory.
+
+--------------------------------------------------------------------------------
++ Mapping and use of the circular buffer (ring)
+--------------------------------------------------------------------------------
+
+The mapping of the buffer in the user process is done with the conventional
+mmap function. Even the circular buffer is compound of several physically
+discontiguous blocks of memory, they are contiguous to the user space, hence
+just one call to mmap is needed:
+
+ mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+
+If tp_frame_size is a divisor of tp_block_size frames will be
+contiguously spaced by tp_frame_size bytes. If not, each
+tp_block_size/tp_frame_size frames there will be a gap between
+the frames. This is because a frame cannot be spawn across two
+blocks.
+
+To use one socket for capture and transmission, the mapping of both the
+RX and TX buffer ring has to be done with one call to mmap:
+
+ ...
+ setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo));
+ setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar));
+ ...
+ rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+ tx_ring = rx_ring + size;
+
+RX must be the first as the kernel maps the TX ring memory right
+after the RX one.
+
+At the beginning of each frame there is an status field (see
+struct tpacket_hdr). If this field is 0 means that the frame is ready
+to be used for the kernel, If not, there is a frame the user can read
+and the following flags apply:
+
++++ Capture process:
+ from include/linux/if_packet.h
+
+ #define TP_STATUS_COPY (1 << 1)
+ #define TP_STATUS_LOSING (1 << 2)
+ #define TP_STATUS_CSUMNOTREADY (1 << 3)
+ #define TP_STATUS_CSUM_VALID (1 << 7)
+
+TP_STATUS_COPY : This flag indicates that the frame (and associated
+ meta information) has been truncated because it's
+ larger than tp_frame_size. This packet can be
+ read entirely with recvfrom().
+
+ In order to make this work it must to be
+ enabled previously with setsockopt() and
+ the PACKET_COPY_THRESH option.
+
+ The number of frames that can be buffered to
+ be read with recvfrom is limited like a normal socket.
+ See the SO_RCVBUF option in the socket (7) man page.
+
+TP_STATUS_LOSING : indicates there were packet drops from last time
+ statistics where checked with getsockopt() and
+ the PACKET_STATISTICS option.
+
+TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets which
+ its checksum will be done in hardware. So while
+ reading the packet we should not try to check the
+ checksum.
+
+TP_STATUS_CSUM_VALID : This flag indicates that at least the transport
+ header checksum of the packet has been already
+ validated on the kernel side. If the flag is not set
+ then we are free to check the checksum by ourselves
+ provided that TP_STATUS_CSUMNOTREADY is also not set.
+
+for convenience there are also the following defines:
+
+ #define TP_STATUS_KERNEL 0
+ #define TP_STATUS_USER 1
+
+The kernel initializes all frames to TP_STATUS_KERNEL, when the kernel
+receives a packet it puts in the buffer and updates the status with
+at least the TP_STATUS_USER flag. Then the user can read the packet,
+once the packet is read the user must zero the status field, so the kernel
+can use again that frame buffer.
+
+The user can use poll (any other variant should apply too) to check if new
+packets are in the ring:
+
+ struct pollfd pfd;
+
+ pfd.fd = fd;
+ pfd.revents = 0;
+ pfd.events = POLLIN|POLLRDNORM|POLLERR;
+
+ if (status == TP_STATUS_KERNEL)
+ retval = poll(&pfd, 1, timeout);
+
+It doesn't incur in a race condition to first check the status value and
+then poll for frames.
+
+++ Transmission process
+Those defines are also used for transmission:
+
+ #define TP_STATUS_AVAILABLE 0 // Frame is available
+ #define TP_STATUS_SEND_REQUEST 1 // Frame will be sent on next send()
+ #define TP_STATUS_SENDING 2 // Frame is currently in transmission
+ #define TP_STATUS_WRONG_FORMAT 4 // Frame format is not correct
+
+First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a
+packet, the user fills a data buffer of an available frame, sets tp_len to
+current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST.
+This can be done on multiple frames. Once the user is ready to transmit, it
+calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are
+forwarded to the network device. The kernel updates each status of sent
+frames with TP_STATUS_SENDING until the end of transfer.
+At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE.
+
+ header->tp_len = in_i_size;
+ header->tp_status = TP_STATUS_SEND_REQUEST;
+ retval = send(this->socket, NULL, 0, 0);
+
+The user can also use poll() to check if a buffer is available:
+(status == TP_STATUS_SENDING)
+
+ struct pollfd pfd;
+ pfd.fd = fd;
+ pfd.revents = 0;
+ pfd.events = POLLOUT;
+ retval = poll(&pfd, 1, timeout);
+
+-------------------------------------------------------------------------------
++ What TPACKET versions are available and when to use them?
+-------------------------------------------------------------------------------
+
+ int val = tpacket_version;
+ setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
+ getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
+
+where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
+
+TPACKET_V1:
+ - Default if not otherwise specified by setsockopt(2)
+ - RX_RING, TX_RING available
+
+TPACKET_V1 --> TPACKET_V2:
+ - Made 64 bit clean due to unsigned long usage in TPACKET_V1
+ structures, thus this also works on 64 bit kernel with 32 bit
+ userspace and the like
+ - Timestamp resolution in nanoseconds instead of microseconds
+ - RX_RING, TX_RING available
+ - VLAN metadata information available for packets
+ (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID),
+ in the tpacket2_hdr structure:
+ - TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates
+ that the tp_vlan_tci field has valid VLAN TCI value
+ - TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field
+ indicates that the tp_vlan_tpid field has valid VLAN TPID value
+ - How to switch to TPACKET_V2:
+ 1. Replace struct tpacket_hdr by struct tpacket2_hdr
+ 2. Query header len and save
+ 3. Set protocol version to 2, set up ring as usual
+ 4. For getting the sockaddr_ll,
+ use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of
+ (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))
+
+TPACKET_V2 --> TPACKET_V3:
+ - Flexible buffer implementation:
+ 1. Blocks can be configured with non-static frame-size
+ 2. Read/poll is at a block-level (as opposed to packet-level)
+ 3. Added poll timeout to avoid indefinite user-space wait
+ on idle links
+ 4. Added user-configurable knobs:
+ 4.1 block::timeout
+ 4.2 tpkt_hdr::sk_rxhash
+ - RX Hash data available in user space
+ - Currently only RX_RING available
+
+-------------------------------------------------------------------------------
++ AF_PACKET fanout mode
+-------------------------------------------------------------------------------
+
+In the AF_PACKET fanout mode, packet reception can be load balanced among
+processes. This also works in combination with mmap(2) on packet sockets.
+
+Currently implemented fanout policies are:
+
+ - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash
+ - PACKET_FANOUT_LB: schedule to socket by round-robin
+ - PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on
+ - PACKET_FANOUT_RND: schedule to socket by random selection
+ - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another
+ - PACKET_FANOUT_QM: schedule to socket by skbs recorded queue_mapping
+
+Minimal example code by David S. Miller (try things like "./test eth0 hash",
+"./test eth0 lb", etc.):
+
+#include <stddef.h>
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+
+#include <unistd.h>
+
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+
+#include <net/if.h>
+
+static const char *device_name;
+static int fanout_type;
+static int fanout_id;
+
+#ifndef PACKET_FANOUT
+# define PACKET_FANOUT 18
+# define PACKET_FANOUT_HASH 0
+# define PACKET_FANOUT_LB 1
+#endif
+
+static int setup_socket(void)
+{
+ int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
+ struct sockaddr_ll ll;
+ struct ifreq ifr;
+ int fanout_arg;
+
+ if (fd < 0) {
+ perror("socket");
+ return EXIT_FAILURE;
+ }
+
+ memset(&ifr, 0, sizeof(ifr));
+ strcpy(ifr.ifr_name, device_name);
+ err = ioctl(fd, SIOCGIFINDEX, &ifr);
+ if (err < 0) {
+ perror("SIOCGIFINDEX");
+ return EXIT_FAILURE;
+ }
+
+ memset(&ll, 0, sizeof(ll));
+ ll.sll_family = AF_PACKET;
+ ll.sll_ifindex = ifr.ifr_ifindex;
+ err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
+ if (err < 0) {
+ perror("bind");
+ return EXIT_FAILURE;
+ }
+
+ fanout_arg = (fanout_id | (fanout_type << 16));
+ err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
+ &fanout_arg, sizeof(fanout_arg));
+ if (err) {
+ perror("setsockopt");
+ return EXIT_FAILURE;
+ }
+
+ return fd;
+}
+
+static void fanout_thread(void)
+{
+ int fd = setup_socket();
+ int limit = 10000;
+
+ if (fd < 0)
+ exit(fd);
+
+ while (limit-- > 0) {
+ char buf[1600];
+ int err;
+
+ err = read(fd, buf, sizeof(buf));
+ if (err < 0) {
+ perror("read");
+ exit(EXIT_FAILURE);
+ }
+ if ((limit % 10) == 0)
+ fprintf(stdout, "(%d) \n", getpid());
+ }
+
+ fprintf(stdout, "%d: Received 10000 packets\n", getpid());
+
+ close(fd);
+ exit(0);
+}
+
+int main(int argc, char **argp)
+{
+ int fd, err;
+ int i;
+
+ if (argc != 3) {
+ fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
+ return EXIT_FAILURE;
+ }
+
+ if (!strcmp(argp[2], "hash"))
+ fanout_type = PACKET_FANOUT_HASH;
+ else if (!strcmp(argp[2], "lb"))
+ fanout_type = PACKET_FANOUT_LB;
+ else {
+ fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
+ exit(EXIT_FAILURE);
+ }
+
+ device_name = argp[1];
+ fanout_id = getpid() & 0xffff;
+
+ for (i = 0; i < 4; i++) {
+ pid_t pid = fork();
+
+ switch (pid) {
+ case 0:
+ fanout_thread();
+
+ case -1:
+ perror("fork");
+ exit(EXIT_FAILURE);
+ }
+ }
+
+ for (i = 0; i < 4; i++) {
+ int status;
+
+ wait(&status);
+ }
+
+ return 0;
+}
+
+-------------------------------------------------------------------------------
++ AF_PACKET TPACKET_V3 example
+-------------------------------------------------------------------------------
+
+AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
+sizes by doing it's own memory management. It is based on blocks where polling
+works on a per block basis instead of per ring as in TPACKET_V2 and predecessor.
+
+It is said that TPACKET_V3 brings the following benefits:
+ *) ~15 - 20% reduction in CPU-usage
+ *) ~20% increase in packet capture rate
+ *) ~2x increase in packet density
+ *) Port aggregation analysis
+ *) Non static frame size to capture entire packet payload
+
+So it seems to be a good candidate to be used with packet fanout.
+
+Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
+it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.):
+
+/* Written from scratch, but kernel-to-user space API usage
+ * dissected from lolpcap:
+ * Copyright 2011, Chetan Loke <loke.chetan@gmail.com>
+ * License: GPL, version 2.0
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <assert.h>
+#include <net/if.h>
+#include <arpa/inet.h>
+#include <netdb.h>
+#include <poll.h>
+#include <unistd.h>
+#include <signal.h>
+#include <inttypes.h>
+#include <sys/socket.h>
+#include <sys/mman.h>
+#include <linux/if_packet.h>
+#include <linux/if_ether.h>
+#include <linux/ip.h>
+
+#ifndef likely
+# define likely(x) __builtin_expect(!!(x), 1)
+#endif
+#ifndef unlikely
+# define unlikely(x) __builtin_expect(!!(x), 0)
+#endif
+
+struct block_desc {
+ uint32_t version;
+ uint32_t offset_to_priv;
+ struct tpacket_hdr_v1 h1;
+};
+
+struct ring {
+ struct iovec *rd;
+ uint8_t *map;
+ struct tpacket_req3 req;
+};
+
+static unsigned long packets_total = 0, bytes_total = 0;
+static sig_atomic_t sigint = 0;
+
+static void sighandler(int num)
+{
+ sigint = 1;
+}
+
+static int setup_socket(struct ring *ring, char *netdev)
+{
+ int err, i, fd, v = TPACKET_V3;
+ struct sockaddr_ll ll;
+ unsigned int blocksiz = 1 << 22, framesiz = 1 << 11;
+ unsigned int blocknum = 64;
+
+ fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+ if (fd < 0) {
+ perror("socket");
+ exit(1);
+ }
+
+ err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v));
+ if (err < 0) {
+ perror("setsockopt");
+ exit(1);
+ }
+
+ memset(&ring->req, 0, sizeof(ring->req));
+ ring->req.tp_block_size = blocksiz;
+ ring->req.tp_frame_size = framesiz;
+ ring->req.tp_block_nr = blocknum;
+ ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz;
+ ring->req.tp_retire_blk_tov = 60;
+ ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;
+
+ err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
+ sizeof(ring->req));
+ if (err < 0) {
+ perror("setsockopt");
+ exit(1);
+ }
+
+ ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
+ PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
+ if (ring->map == MAP_FAILED) {
+ perror("mmap");
+ exit(1);
+ }
+
+ ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd));
+ assert(ring->rd);
+ for (i = 0; i < ring->req.tp_block_nr; ++i) {
+ ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size);
+ ring->rd[i].iov_len = ring->req.tp_block_size;
+ }
+
+ memset(&ll, 0, sizeof(ll));
+ ll.sll_family = PF_PACKET;
+ ll.sll_protocol = htons(ETH_P_ALL);
+ ll.sll_ifindex = if_nametoindex(netdev);
+ ll.sll_hatype = 0;
+ ll.sll_pkttype = 0;
+ ll.sll_halen = 0;
+
+ err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
+ if (err < 0) {
+ perror("bind");
+ exit(1);
+ }
+
+ return fd;
+}
+
+static void display(struct tpacket3_hdr *ppd)
+{
+ struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
+ struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN);
+
+ if (eth->h_proto == htons(ETH_P_IP)) {
+ struct sockaddr_in ss, sd;
+ char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST];
+
+ memset(&ss, 0, sizeof(ss));
+ ss.sin_family = PF_INET;
+ ss.sin_addr.s_addr = ip->saddr;
+ getnameinfo((struct sockaddr *) &ss, sizeof(ss),
+ sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST);
+
+ memset(&sd, 0, sizeof(sd));
+ sd.sin_family = PF_INET;
+ sd.sin_addr.s_addr = ip->daddr;
+ getnameinfo((struct sockaddr *) &sd, sizeof(sd),
+ dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST);
+
+ printf("%s -> %s, ", sbuff, dbuff);
+ }
+
+ printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash);
+}
+
+static void walk_block(struct block_desc *pbd, const int block_num)
+{
+ int num_pkts = pbd->h1.num_pkts, i;
+ unsigned long bytes = 0;
+ struct tpacket3_hdr *ppd;
+
+ ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
+ pbd->h1.offset_to_first_pkt);
+ for (i = 0; i < num_pkts; ++i) {
+ bytes += ppd->tp_snaplen;
+ display(ppd);
+
+ ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd +
+ ppd->tp_next_offset);
+ }
+
+ packets_total += num_pkts;
+ bytes_total += bytes;
+}
+
+static void flush_block(struct block_desc *pbd)
+{
+ pbd->h1.block_status = TP_STATUS_KERNEL;
+}
+
+static void teardown_socket(struct ring *ring, int fd)
+{
+ munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr);
+ free(ring->rd);
+ close(fd);
+}
+
+int main(int argc, char **argp)
+{
+ int fd, err;
+ socklen_t len;
+ struct ring ring;
+ struct pollfd pfd;
+ unsigned int block_num = 0, blocks = 64;
+ struct block_desc *pbd;
+ struct tpacket_stats_v3 stats;
+
+ if (argc != 2) {
+ fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]);
+ return EXIT_FAILURE;
+ }
+
+ signal(SIGINT, sighandler);
+
+ memset(&ring, 0, sizeof(ring));
+ fd = setup_socket(&ring, argp[argc - 1]);
+ assert(fd > 0);
+
+ memset(&pfd, 0, sizeof(pfd));
+ pfd.fd = fd;
+ pfd.events = POLLIN | POLLERR;
+ pfd.revents = 0;
+
+ while (likely(!sigint)) {
+ pbd = (struct block_desc *) ring.rd[block_num].iov_base;
+
+ if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
+ poll(&pfd, 1, -1);
+ continue;
+ }
+
+ walk_block(pbd, block_num);
+ flush_block(pbd);
+ block_num = (block_num + 1) % blocks;
+ }
+
+ len = sizeof(stats);
+ err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len);
+ if (err < 0) {
+ perror("getsockopt");
+ exit(1);
+ }
+
+ fflush(stdout);
+ printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n",
+ stats.tp_packets, bytes_total, stats.tp_drops,
+ stats.tp_freeze_q_cnt);
+
+ teardown_socket(&ring, fd);
+ return 0;
+}
+
+-------------------------------------------------------------------------------
++ PACKET_QDISC_BYPASS
+-------------------------------------------------------------------------------
+
+If there is a requirement to load the network with many packets in a similar
+fashion as pktgen does, you might set the following option after socket
+creation:
+
+ int one = 1;
+ setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));
+
+This has the side-effect, that packets sent through PF_PACKET will bypass the
+kernel's qdisc layer and are forcedly pushed to the driver directly. Meaning,
+packet are not buffered, tc disciplines are ignored, increased loss can occur
+and such packets are also not visible to other PF_PACKET sockets anymore. So,
+you have been warned; generally, this can be useful for stress testing various
+components of a system.
+
+On default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled
+on PF_PACKET sockets.
+
+-------------------------------------------------------------------------------
++ PACKET_TIMESTAMP
+-------------------------------------------------------------------------------
+
+The PACKET_TIMESTAMP setting determines the source of the timestamp in
+the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your
+NIC is capable of timestamping packets in hardware, you can request those
+hardware timestamps to be used. Note: you may need to enable the generation
+of hardware timestamps with SIOCSHWTSTAMP (see related information from
+Documentation/networking/timestamping.txt).
+
+PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING:
+
+ int req = SOF_TIMESTAMPING_RAW_HARDWARE;
+ setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req))
+
+For the mmap(2)ed ring buffers, such timestamps are stored in the
+tpacket{,2,3}_hdr structure's tp_sec and tp_{n,u}sec members. To determine
+what kind of timestamp has been reported, the tp_status field is binary |'ed
+with the following possible bits ...
+
+ TP_STATUS_TS_RAW_HARDWARE
+ TP_STATUS_TS_SOFTWARE
+
+... that are equivalent to its SOF_TIMESTAMPING_* counterparts. For the
+RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a
+software fallback was invoked *within* PF_PACKET's processing code (less
+precise).
+
+Getting timestamps for the TX_RING works as follows: i) fill the ring frames,
+ii) call sendto() e.g. in blocking mode, iii) wait for status of relevant
+frames to be updated resp. the frame handed over to the application, iv) walk
+through the frames to pick up the individual hw/sw timestamps.
+
+Only (!) if transmit timestamping is enabled, then these bits are combined
+with binary | with TP_STATUS_AVAILABLE, so you must check for that in your
+application (e.g. !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING))
+in a first step to see if the frame belongs to the application, and then
+one can extract the type of timestamp in a second step from tp_status)!
+
+If you don't care about them, thus having it disabled, checking for
+TP_STATUS_AVAILABLE resp. TP_STATUS_WRONG_FORMAT is sufficient. If in the
+TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec
+members do not contain a valid value. For TX_RINGs, by default no timestamp
+is generated!
+
+See include/linux/net_tstamp.h and Documentation/networking/timestamping
+for more information on hardware timestamps.
+
+-------------------------------------------------------------------------------
++ Miscellaneous bits
+-------------------------------------------------------------------------------
+
+- Packet sockets work well together with Linux socket filters, thus you also
+ might want to have a look at Documentation/networking/filter.txt
+
+--------------------------------------------------------------------------------
++ THANKS
+--------------------------------------------------------------------------------
+
+ Jesse Brandeburg, for fixing my grammathical/spelling errors
+
diff --git a/Documentation/networking/phonet.txt b/Documentation/networking/phonet.txt
new file mode 100644
index 000000000..81003581f
--- /dev/null
+++ b/Documentation/networking/phonet.txt
@@ -0,0 +1,214 @@
+Linux Phonet protocol family
+============================
+
+Introduction
+------------
+
+Phonet is a packet protocol used by Nokia cellular modems for both IPC
+and RPC. With the Linux Phonet socket family, Linux host processes can
+receive and send messages from/to the modem, or any other external
+device attached to the modem. The modem takes care of routing.
+
+Phonet packets can be exchanged through various hardware connections
+depending on the device, such as:
+ - USB with the CDC Phonet interface,
+ - infrared,
+ - Bluetooth,
+ - an RS232 serial port (with a dedicated "FBUS" line discipline),
+ - the SSI bus with some TI OMAP processors.
+
+
+Packets format
+--------------
+
+Phonet packets have a common header as follows:
+
+ struct phonethdr {
+ uint8_t pn_media; /* Media type (link-layer identifier) */
+ uint8_t pn_rdev; /* Receiver device ID */
+ uint8_t pn_sdev; /* Sender device ID */
+ uint8_t pn_res; /* Resource ID or function */
+ uint16_t pn_length; /* Big-endian message byte length (minus 6) */
+ uint8_t pn_robj; /* Receiver object ID */
+ uint8_t pn_sobj; /* Sender object ID */
+ };
+
+On Linux, the link-layer header includes the pn_media byte (see below).
+The next 7 bytes are part of the network-layer header.
+
+The device ID is split: the 6 higher-order bits constitute the device
+address, while the 2 lower-order bits are used for multiplexing, as are
+the 8-bit object identifiers. As such, Phonet can be considered as a
+network layer with 6 bits of address space and 10 bits for transport
+protocol (much like port numbers in IP world).
+
+The modem always has address number zero. All other device have a their
+own 6-bit address.
+
+
+Link layer
+----------
+
+Phonet links are always point-to-point links. The link layer header
+consists of a single Phonet media type byte. It uniquely identifies the
+link through which the packet is transmitted, from the modem's
+perspective. Each Phonet network device shall prepend and set the media
+type byte as appropriate. For convenience, a common phonet_header_ops
+link-layer header operations structure is provided. It sets the
+media type according to the network device hardware address.
+
+Linux Phonet network interfaces support a dedicated link layer packets
+type (ETH_P_PHONET) which is out of the Ethernet type range. They can
+only send and receive Phonet packets.
+
+The virtual TUN tunnel device driver can also be used for Phonet. This
+requires IFF_TUN mode, _without_ the IFF_NO_PI flag. In this case,
+there is no link-layer header, so there is no Phonet media type byte.
+
+Note that Phonet interfaces are not allowed to re-order packets, so
+only the (default) Linux FIFO qdisc should be used with them.
+
+
+Network layer
+-------------
+
+The Phonet socket address family maps the Phonet packet header:
+
+ struct sockaddr_pn {
+ sa_family_t spn_family; /* AF_PHONET */
+ uint8_t spn_obj; /* Object ID */
+ uint8_t spn_dev; /* Device ID */
+ uint8_t spn_resource; /* Resource or function */
+ uint8_t spn_zero[...]; /* Padding */
+ };
+
+The resource field is only used when sending and receiving;
+It is ignored by bind() and getsockname().
+
+
+Low-level datagram protocol
+---------------------------
+
+Applications can send Phonet messages using the Phonet datagram socket
+protocol from the PF_PHONET family. Each socket is bound to one of the
+2^10 object IDs available, and can send and receive packets with any
+other peer.
+
+ struct sockaddr_pn addr = { .spn_family = AF_PHONET, };
+ ssize_t len;
+ socklen_t addrlen = sizeof(addr);
+ int fd;
+
+ fd = socket(PF_PHONET, SOCK_DGRAM, 0);
+ bind(fd, (struct sockaddr *)&addr, sizeof(addr));
+ /* ... */
+
+ sendto(fd, msg, msglen, 0, (struct sockaddr *)&addr, sizeof(addr));
+ len = recvfrom(fd, buf, sizeof(buf), 0,
+ (struct sockaddr *)&addr, &addrlen);
+
+This protocol follows the SOCK_DGRAM connection-less semantics.
+However, connect() and getpeername() are not supported, as they did
+not seem useful with Phonet usages (could be added easily).
+
+
+Resource subscription
+---------------------
+
+A Phonet datagram socket can be subscribed to any number of 8-bits
+Phonet resources, as follow:
+
+ uint32_t res = 0xXX;
+ ioctl(fd, SIOCPNADDRESOURCE, &res);
+
+Subscription is similarly cancelled using the SIOCPNDELRESOURCE I/O
+control request, or when the socket is closed.
+
+Note that no more than one socket can be subcribed to any given
+resource at a time. If not, ioctl() will return EBUSY.
+
+
+Phonet Pipe protocol
+--------------------
+
+The Phonet Pipe protocol is a simple sequenced packets protocol
+with end-to-end congestion control. It uses the passive listening
+socket paradigm. The listening socket is bound to an unique free object
+ID. Each listening socket can handle up to 255 simultaneous
+connections, one per accept()'d socket.
+
+ int lfd, cfd;
+
+ lfd = socket(PF_PHONET, SOCK_SEQPACKET, PN_PROTO_PIPE);
+ listen (lfd, INT_MAX);
+
+ /* ... */
+ cfd = accept(lfd, NULL, NULL);
+ for (;;)
+ {
+ char buf[...];
+ ssize_t len = read(cfd, buf, sizeof(buf));
+
+ /* ... */
+
+ write(cfd, msg, msglen);
+ }
+
+Connections are traditionally established between two endpoints by a
+"third party" application. This means that both endpoints are passive.
+
+
+As of Linux kernel version 2.6.39, it is also possible to connect
+two endpoints directly, using connect() on the active side. This is
+intended to support the newer Nokia Wireless Modem API, as found in
+e.g. the Nokia Slim Modem in the ST-Ericsson U8500 platform:
+
+ struct sockaddr_spn spn;
+ int fd;
+
+ fd = socket(PF_PHONET, SOCK_SEQPACKET, PN_PROTO_PIPE);
+ memset(&spn, 0, sizeof(spn));
+ spn.spn_family = AF_PHONET;
+ spn.spn_obj = ...;
+ spn.spn_dev = ...;
+ spn.spn_resource = 0xD9;
+ connect(fd, (struct sockaddr *)&spn, sizeof(spn));
+ /* normal I/O here ... */
+ close(fd);
+
+
+WARNING:
+When polling a connected pipe socket for writability, there is an
+intrinsic race condition whereby writability might be lost between the
+polling and the writing system calls. In this case, the socket will
+block until write becomes possible again, unless non-blocking mode
+is enabled.
+
+
+The pipe protocol provides two socket options at the SOL_PNPIPE level:
+
+ PNPIPE_ENCAP accepts one integer value (int) of:
+
+ PNPIPE_ENCAP_NONE: The socket operates normally (default).
+
+ PNPIPE_ENCAP_IP: The socket is used as a backend for a virtual IP
+ interface. This requires CAP_NET_ADMIN capability. GPRS data
+ support on Nokia modems can use this. Note that the socket cannot
+ be reliably poll()'d or read() from while in this mode.
+
+ PNPIPE_IFINDEX is a read-only integer value. It contains the
+ interface index of the network interface created by PNPIPE_ENCAP,
+ or zero if encapsulation is off.
+
+ PNPIPE_HANDLE is a read-only integer value. It contains the underlying
+ identifier ("pipe handle") of the pipe. This is only defined for
+ socket descriptors that are already connected or being connected.
+
+
+Authors
+-------
+
+Linux Phonet was initially written by Sakari Ailus.
+Other contributors include Mikä Liljeberg, Andras Domokos,
+Carlos Chinea and Rémi Denis-Courmont.
+Copyright (C) 2008 Nokia Corporation.
diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt
new file mode 100644
index 000000000..e839e7efc
--- /dev/null
+++ b/Documentation/networking/phy.txt
@@ -0,0 +1,339 @@
+
+-------
+PHY Abstraction Layer
+(Updated 2008-04-08)
+
+Purpose
+
+ Most network devices consist of set of registers which provide an interface
+ to a MAC layer, which communicates with the physical connection through a
+ PHY. The PHY concerns itself with negotiating link parameters with the link
+ partner on the other side of the network connection (typically, an ethernet
+ cable), and provides a register interface to allow drivers to determine what
+ settings were chosen, and to configure what settings are allowed.
+
+ While these devices are distinct from the network devices, and conform to a
+ standard layout for the registers, it has been common practice to integrate
+ the PHY management code with the network driver. This has resulted in large
+ amounts of redundant code. Also, on embedded systems with multiple (and
+ sometimes quite different) ethernet controllers connected to the same
+ management bus, it is difficult to ensure safe use of the bus.
+
+ Since the PHYs are devices, and the management busses through which they are
+ accessed are, in fact, busses, the PHY Abstraction Layer treats them as such.
+ In doing so, it has these goals:
+
+ 1) Increase code-reuse
+ 2) Increase overall code-maintainability
+ 3) Speed development time for new network drivers, and for new systems
+
+ Basically, this layer is meant to provide an interface to PHY devices which
+ allows network driver writers to write as little code as possible, while
+ still providing a full feature set.
+
+The MDIO bus
+
+ Most network devices are connected to a PHY by means of a management bus.
+ Different devices use different busses (though some share common interfaces).
+ In order to take advantage of the PAL, each bus interface needs to be
+ registered as a distinct device.
+
+ 1) read and write functions must be implemented. Their prototypes are:
+
+ int write(struct mii_bus *bus, int mii_id, int regnum, u16 value);
+ int read(struct mii_bus *bus, int mii_id, int regnum);
+
+ mii_id is the address on the bus for the PHY, and regnum is the register
+ number. These functions are guaranteed not to be called from interrupt
+ time, so it is safe for them to block, waiting for an interrupt to signal
+ the operation is complete
+
+ 2) A reset function is optional. This is used to return the bus to an
+ initialized state.
+
+ 3) A probe function is needed. This function should set up anything the bus
+ driver needs, setup the mii_bus structure, and register with the PAL using
+ mdiobus_register. Similarly, there's a remove function to undo all of
+ that (use mdiobus_unregister).
+
+ 4) Like any driver, the device_driver structure must be configured, and init
+ exit functions are used to register the driver.
+
+ 5) The bus must also be declared somewhere as a device, and registered.
+
+ As an example for how one driver implemented an mdio bus driver, see
+ drivers/net/ethernet/freescale/fsl_pq_mdio.c and an associated DTS file
+ for one of the users. (e.g. "git grep fsl,.*-mdio arch/powerpc/boot/dts/")
+
+Connecting to a PHY
+
+ Sometime during startup, the network driver needs to establish a connection
+ between the PHY device, and the network device. At this time, the PHY's bus
+ and drivers need to all have been loaded, so it is ready for the connection.
+ At this point, there are several ways to connect to the PHY:
+
+ 1) The PAL handles everything, and only calls the network driver when
+ the link state changes, so it can react.
+
+ 2) The PAL handles everything except interrupts (usually because the
+ controller has the interrupt registers).
+
+ 3) The PAL handles everything, but checks in with the driver every second,
+ allowing the network driver to react first to any changes before the PAL
+ does.
+
+ 4) The PAL serves only as a library of functions, with the network device
+ manually calling functions to update status, and configure the PHY
+
+
+Letting the PHY Abstraction Layer do Everything
+
+ If you choose option 1 (The hope is that every driver can, but to still be
+ useful to drivers that can't), connecting to the PHY is simple:
+
+ First, you need a function to react to changes in the link state. This
+ function follows this protocol:
+
+ static void adjust_link(struct net_device *dev);
+
+ Next, you need to know the device name of the PHY connected to this device.
+ The name will look something like, "0:00", where the first number is the
+ bus id, and the second is the PHY's address on that bus. Typically,
+ the bus is responsible for making its ID unique.
+
+ Now, to connect, just call this function:
+
+ phydev = phy_connect(dev, phy_name, &adjust_link, interface);
+
+ phydev is a pointer to the phy_device structure which represents the PHY. If
+ phy_connect is successful, it will return the pointer. dev, here, is the
+ pointer to your net_device. Once done, this function will have started the
+ PHY's software state machine, and registered for the PHY's interrupt, if it
+ has one. The phydev structure will be populated with information about the
+ current state, though the PHY will not yet be truly operational at this
+ point.
+
+ PHY-specific flags should be set in phydev->dev_flags prior to the call
+ to phy_connect() such that the underlying PHY driver can check for flags
+ and perform specific operations based on them.
+ This is useful if the system has put hardware restrictions on
+ the PHY/controller, of which the PHY needs to be aware.
+
+ interface is a u32 which specifies the connection type used
+ between the controller and the PHY. Examples are GMII, MII,
+ RGMII, and SGMII. For a full list, see include/linux/phy.h
+
+ Now just make sure that phydev->supported and phydev->advertising have any
+ values pruned from them which don't make sense for your controller (a 10/100
+ controller may be connected to a gigabit capable PHY, so you would need to
+ mask off SUPPORTED_1000baseT*). See include/linux/ethtool.h for definitions
+ for these bitfields. Note that you should not SET any bits, or the PHY may
+ get put into an unsupported state.
+
+ Lastly, once the controller is ready to handle network traffic, you call
+ phy_start(phydev). This tells the PAL that you are ready, and configures the
+ PHY to connect to the network. If you want to handle your own interrupts,
+ just set phydev->irq to PHY_IGNORE_INTERRUPT before you call phy_start.
+ Similarly, if you don't want to use interrupts, set phydev->irq to PHY_POLL.
+
+ When you want to disconnect from the network (even if just briefly), you call
+ phy_stop(phydev).
+
+Keeping Close Tabs on the PAL
+
+ It is possible that the PAL's built-in state machine needs a little help to
+ keep your network device and the PHY properly in sync. If so, you can
+ register a helper function when connecting to the PHY, which will be called
+ every second before the state machine reacts to any changes. To do this, you
+ need to manually call phy_attach() and phy_prepare_link(), and then call
+ phy_start_machine() with the second argument set to point to your special
+ handler.
+
+ Currently there are no examples of how to use this functionality, and testing
+ on it has been limited because the author does not have any drivers which use
+ it (they all use option 1). So Caveat Emptor.
+
+Doing it all yourself
+
+ There's a remote chance that the PAL's built-in state machine cannot track
+ the complex interactions between the PHY and your network device. If this is
+ so, you can simply call phy_attach(), and not call phy_start_machine or
+ phy_prepare_link(). This will mean that phydev->state is entirely yours to
+ handle (phy_start and phy_stop toggle between some of the states, so you
+ might need to avoid them).
+
+ An effort has been made to make sure that useful functionality can be
+ accessed without the state-machine running, and most of these functions are
+ descended from functions which did not interact with a complex state-machine.
+ However, again, no effort has been made so far to test running without the
+ state machine, so tryer beware.
+
+ Here is a brief rundown of the functions:
+
+ int phy_read(struct phy_device *phydev, u16 regnum);
+ int phy_write(struct phy_device *phydev, u16 regnum, u16 val);
+
+ Simple read/write primitives. They invoke the bus's read/write function
+ pointers.
+
+ void phy_print_status(struct phy_device *phydev);
+
+ A convenience function to print out the PHY status neatly.
+
+ int phy_start_interrupts(struct phy_device *phydev);
+ int phy_stop_interrupts(struct phy_device *phydev);
+
+ Requests the IRQ for the PHY interrupts, then enables them for
+ start, or disables then frees them for stop.
+
+ struct phy_device * phy_attach(struct net_device *dev, const char *phy_id,
+ phy_interface_t interface);
+
+ Attaches a network device to a particular PHY, binding the PHY to a generic
+ driver if none was found during bus initialization.
+
+ int phy_start_aneg(struct phy_device *phydev);
+
+ Using variables inside the phydev structure, either configures advertising
+ and resets autonegotiation, or disables autonegotiation, and configures
+ forced settings.
+
+ static inline int phy_read_status(struct phy_device *phydev);
+
+ Fills the phydev structure with up-to-date information about the current
+ settings in the PHY.
+
+ int phy_ethtool_sset(struct phy_device *phydev, struct ethtool_cmd *cmd);
+ int phy_ethtool_gset(struct phy_device *phydev, struct ethtool_cmd *cmd);
+
+ Ethtool convenience functions.
+
+ int phy_mii_ioctl(struct phy_device *phydev,
+ struct mii_ioctl_data *mii_data, int cmd);
+
+ The MII ioctl. Note that this function will completely screw up the state
+ machine if you write registers like BMCR, BMSR, ADVERTISE, etc. Best to
+ use this only to write registers which are not standard, and don't set off
+ a renegotiation.
+
+
+PHY Device Drivers
+
+ With the PHY Abstraction Layer, adding support for new PHYs is
+ quite easy. In some cases, no work is required at all! However,
+ many PHYs require a little hand-holding to get up-and-running.
+
+Generic PHY driver
+
+ If the desired PHY doesn't have any errata, quirks, or special
+ features you want to support, then it may be best to not add
+ support, and let the PHY Abstraction Layer's Generic PHY Driver
+ do all of the work.
+
+Writing a PHY driver
+
+ If you do need to write a PHY driver, the first thing to do is
+ make sure it can be matched with an appropriate PHY device.
+ This is done during bus initialization by reading the device's
+ UID (stored in registers 2 and 3), then comparing it to each
+ driver's phy_id field by ANDing it with each driver's
+ phy_id_mask field. Also, it needs a name. Here's an example:
+
+ static struct phy_driver dm9161_driver = {
+ .phy_id = 0x0181b880,
+ .name = "Davicom DM9161E",
+ .phy_id_mask = 0x0ffffff0,
+ ...
+ }
+
+ Next, you need to specify what features (speed, duplex, autoneg,
+ etc) your PHY device and driver support. Most PHYs support
+ PHY_BASIC_FEATURES, but you can look in include/mii.h for other
+ features.
+
+ Each driver consists of a number of function pointers:
+
+ soft_reset: perform a PHY software reset
+ config_init: configures PHY into a sane state after a reset.
+ For instance, a Davicom PHY requires descrambling disabled.
+ probe: Allocate phy->priv, optionally refuse to bind.
+ PHY may not have been reset or had fixups run yet.
+ suspend/resume: power management
+ config_aneg: Changes the speed/duplex/negotiation settings
+ aneg_done: Determines the auto-negotiation result
+ read_status: Reads the current speed/duplex/negotiation settings
+ ack_interrupt: Clear a pending interrupt
+ did_interrupt: Checks if the PHY generated an interrupt
+ config_intr: Enable or disable interrupts
+ remove: Does any driver take-down
+ ts_info: Queries about the HW timestamping status
+ hwtstamp: Set the PHY HW timestamping configuration
+ rxtstamp: Requests a receive timestamp at the PHY level for a 'skb'
+ txtsamp: Requests a transmit timestamp at the PHY level for a 'skb'
+ set_wol: Enable Wake-on-LAN at the PHY level
+ get_wol: Get the Wake-on-LAN status at the PHY level
+ read_mmd_indirect: Read PHY MMD indirect register
+ write_mmd_indirect: Write PHY MMD indirect register
+
+ Of these, only config_aneg and read_status are required to be
+ assigned by the driver code. The rest are optional. Also, it is
+ preferred to use the generic phy driver's versions of these two
+ functions if at all possible: genphy_read_status and
+ genphy_config_aneg. If this is not possible, it is likely that
+ you only need to perform some actions before and after invoking
+ these functions, and so your functions will wrap the generic
+ ones.
+
+ Feel free to look at the Marvell, Cicada, and Davicom drivers in
+ drivers/net/phy/ for examples (the lxt and qsemi drivers have
+ not been tested as of this writing).
+
+ The PHY's MMD register accesses are handled by the PAL framework
+ by default, but can be overridden by a specific PHY driver if
+ required. This could be the case if a PHY was released for
+ manufacturing before the MMD PHY register definitions were
+ standardized by the IEEE. Most modern PHYs will be able to use
+ the generic PAL framework for accessing the PHY's MMD registers.
+ An example of such usage is for Energy Efficient Ethernet support,
+ implemented in the PAL. This support uses the PAL to access MMD
+ registers for EEE query and configuration if the PHY supports
+ the IEEE standard access mechanisms, or can use the PHY's specific
+ access interfaces if overridden by the specific PHY driver. See
+ the Micrel driver in drivers/net/phy/ for an example of how this
+ can be implemented.
+
+Board Fixups
+
+ Sometimes the specific interaction between the platform and the PHY requires
+ special handling. For instance, to change where the PHY's clock input is,
+ or to add a delay to account for latency issues in the data path. In order
+ to support such contingencies, the PHY Layer allows platform code to register
+ fixups to be run when the PHY is brought up (or subsequently reset).
+
+ When the PHY Layer brings up a PHY it checks to see if there are any fixups
+ registered for it, matching based on UID (contained in the PHY device's phy_id
+ field) and the bus identifier (contained in phydev->dev.bus_id). Both must
+ match, however two constants, PHY_ANY_ID and PHY_ANY_UID, are provided as
+ wildcards for the bus ID and UID, respectively.
+
+ When a match is found, the PHY layer will invoke the run function associated
+ with the fixup. This function is passed a pointer to the phy_device of
+ interest. It should therefore only operate on that PHY.
+
+ The platform code can either register the fixup using phy_register_fixup():
+
+ int phy_register_fixup(const char *phy_id,
+ u32 phy_uid, u32 phy_uid_mask,
+ int (*run)(struct phy_device *));
+
+ Or using one of the two stubs, phy_register_fixup_for_uid() and
+ phy_register_fixup_for_id():
+
+ int phy_register_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask,
+ int (*run)(struct phy_device *));
+ int phy_register_fixup_for_id(const char *phy_id,
+ int (*run)(struct phy_device *));
+
+ The stubs set one of the two matching criteria, and set the other one to
+ match anything.
+
diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt
new file mode 100644
index 000000000..0344f1d45
--- /dev/null
+++ b/Documentation/networking/pktgen.txt
@@ -0,0 +1,323 @@
+
+
+ HOWTO for the linux packet generator
+ ------------------------------------
+
+Enable CONFIG_NET_PKTGEN to compile and build pktgen either in-kernel
+or as a module. A module is preferred; modprobe pktgen if needed. Once
+running, pktgen creates a thread for each CPU with affinity to that CPU.
+Monitoring and controlling is done via /proc. It is easiest to select a
+suitable sample script and configure that.
+
+On a dual CPU:
+
+ps aux | grep pkt
+root 129 0.3 0.0 0 0 ? SW 2003 523:20 [pktgen/0]
+root 130 0.3 0.0 0 0 ? SW 2003 509:50 [pktgen/1]
+
+
+For monitoring and control pktgen creates:
+ /proc/net/pktgen/pgctrl
+ /proc/net/pktgen/kpktgend_X
+ /proc/net/pktgen/ethX
+
+
+Tuning NIC for max performance
+==============================
+
+The default NIC settings are (likely) not tuned for pktgen's artificial
+overload type of benchmarking, as this could hurt the normal use-case.
+
+Specifically increasing the TX ring buffer in the NIC:
+ # ethtool -G ethX tx 1024
+
+A larger TX ring can improve pktgen's performance, while it can hurt
+in the general case, 1) because the TX ring buffer might get larger
+than the CPU's L1/L2 cache, 2) because it allows more queueing in the
+NIC HW layer (which is bad for bufferbloat).
+
+One should hesitate to conclude that packets/descriptors in the HW
+TX ring cause delay. Drivers usually delay cleaning up the
+ring-buffers for various performance reasons, and packets stalling
+the TX ring might just be waiting for cleanup.
+
+This cleanup issue is specifically the case for the driver ixgbe
+(Intel 82599 chip). This driver (ixgbe) combines TX+RX ring cleanups,
+and the cleanup interval is affected by the ethtool --coalesce setting
+of parameter "rx-usecs".
+
+For ixgbe use e.g. "30" resulting in approx 33K interrupts/sec (1/30*10^6):
+ # ethtool -C ethX rx-usecs 30
+
+
+Viewing threads
+===============
+/proc/net/pktgen/kpktgend_0
+Name: kpktgend_0 max_before_softirq: 10000
+Running:
+Stopped: eth1
+Result: OK: max_before_softirq=10000
+
+Most important are the devices assigned to the thread. Note that a
+device can only belong to one thread.
+
+
+Viewing devices
+===============
+
+The Params section holds configured information. The Current section
+holds running statistics. The Result is printed after a run or after
+interruption. Example:
+
+/proc/net/pktgen/eth1
+
+Params: count 10000000 min_pkt_size: 60 max_pkt_size: 60
+ frags: 0 delay: 0 clone_skb: 1000000 ifname: eth1
+ flows: 0 flowlen: 0
+ dst_min: 10.10.11.2 dst_max:
+ src_min: src_max:
+ src_mac: 00:00:00:00:00:00 dst_mac: 00:04:23:AC:FD:82
+ udp_src_min: 9 udp_src_max: 9 udp_dst_min: 9 udp_dst_max: 9
+ src_mac_count: 0 dst_mac_count: 0
+ Flags:
+Current:
+ pkts-sofar: 10000000 errors: 39664
+ started: 1103053986245187us stopped: 1103053999346329us idle: 880401us
+ seq_num: 10000011 cur_dst_mac_offset: 0 cur_src_mac_offset: 0
+ cur_saddr: 0x10a0a0a cur_daddr: 0x20b0a0a
+ cur_udp_dst: 9 cur_udp_src: 9
+ flows: 0
+Result: OK: 13101142(c12220741+d880401) usec, 10000000 (60byte,0frags)
+ 763292pps 390Mb/sec (390805504bps) errors: 39664
+
+Configuring threads and devices
+================================
+This is done via the /proc interface, and most easily done via pgset
+as defined in the sample scripts.
+
+Examples:
+
+ pgset "clone_skb 1" sets the number of copies of the same packet
+ pgset "clone_skb 0" use single SKB for all transmits
+ pgset "burst 8" uses xmit_more API to queue 8 copies of the same
+ packet and update HW tx queue tail pointer once.
+ "burst 1" is the default
+ pgset "pkt_size 9014" sets packet size to 9014
+ pgset "frags 5" packet will consist of 5 fragments
+ pgset "count 200000" sets number of packets to send, set to zero
+ for continuous sends until explicitly stopped.
+
+ pgset "delay 5000" adds delay to hard_start_xmit(). nanoseconds
+
+ pgset "dst 10.0.0.1" sets IP destination address
+ (BEWARE! This generator is very aggressive!)
+
+ pgset "dst_min 10.0.0.1" Same as dst
+ pgset "dst_max 10.0.0.254" Set the maximum destination IP.
+ pgset "src_min 10.0.0.1" Set the minimum (or only) source IP.
+ pgset "src_max 10.0.0.254" Set the maximum source IP.
+ pgset "dst6 fec0::1" IPV6 destination address
+ pgset "src6 fec0::2" IPV6 source address
+ pgset "dstmac 00:00:00:00:00:00" sets MAC destination address
+ pgset "srcmac 00:00:00:00:00:00" sets MAC source address
+
+ pgset "queue_map_min 0" Sets the min value of tx queue interval
+ pgset "queue_map_max 7" Sets the max value of tx queue interval, for multiqueue devices
+ To select queue 1 of a given device,
+ use queue_map_min=1 and queue_map_max=1
+
+ pgset "src_mac_count 1" Sets the number of MACs we'll range through.
+ The 'minimum' MAC is what you set with srcmac.
+
+ pgset "dst_mac_count 1" Sets the number of MACs we'll range through.
+ The 'minimum' MAC is what you set with dstmac.
+
+ pgset "flag [name]" Set a flag to determine behaviour. Current flags
+ are: IPSRC_RND # IP source is random (between min/max)
+ IPDST_RND # IP destination is random
+ UDPSRC_RND, UDPDST_RND,
+ MACSRC_RND, MACDST_RND
+ TXSIZE_RND, IPV6,
+ MPLS_RND, VID_RND, SVID_RND
+ FLOW_SEQ,
+ QUEUE_MAP_RND # queue map random
+ QUEUE_MAP_CPU # queue map mirrors smp_processor_id()
+ UDPCSUM,
+ IPSEC # IPsec encapsulation (needs CONFIG_XFRM)
+ NODE_ALLOC # node specific memory allocation
+
+ pgset spi SPI_VALUE Set specific SA used to transform packet.
+
+ pgset "udp_src_min 9" set UDP source port min, If < udp_src_max, then
+ cycle through the port range.
+
+ pgset "udp_src_max 9" set UDP source port max.
+ pgset "udp_dst_min 9" set UDP destination port min, If < udp_dst_max, then
+ cycle through the port range.
+ pgset "udp_dst_max 9" set UDP destination port max.
+
+ pgset "mpls 0001000a,0002000a,0000000a" set MPLS labels (in this example
+ outer label=16,middle label=32,
+ inner label=0 (IPv4 NULL)) Note that
+ there must be no spaces between the
+ arguments. Leading zeros are required.
+ Do not set the bottom of stack bit,
+ that's done automatically. If you do
+ set the bottom of stack bit, that
+ indicates that you want to randomly
+ generate that address and the flag
+ MPLS_RND will be turned on. You
+ can have any mix of random and fixed
+ labels in the label stack.
+
+ pgset "mpls 0" turn off mpls (or any invalid argument works too!)
+
+ pgset "vlan_id 77" set VLAN ID 0-4095
+ pgset "vlan_p 3" set priority bit 0-7 (default 0)
+ pgset "vlan_cfi 0" set canonical format identifier 0-1 (default 0)
+
+ pgset "svlan_id 22" set SVLAN ID 0-4095
+ pgset "svlan_p 3" set priority bit 0-7 (default 0)
+ pgset "svlan_cfi 0" set canonical format identifier 0-1 (default 0)
+
+ pgset "vlan_id 9999" > 4095 remove vlan and svlan tags
+ pgset "svlan 9999" > 4095 remove svlan tag
+
+
+ pgset "tos XX" set former IPv4 TOS field (e.g. "tos 28" for AF11 no ECN, default 00)
+ pgset "traffic_class XX" set former IPv6 TRAFFIC CLASS (e.g. "traffic_class B8" for EF no ECN, default 00)
+
+ pgset stop aborts injection. Also, ^C aborts generator.
+
+ pgset "rate 300M" set rate to 300 Mb/s
+ pgset "ratep 1000000" set rate to 1Mpps
+
+Sample scripts
+==============
+
+A collection of small tutorial scripts for pktgen is in the
+samples/pktgen directory:
+
+pktgen.conf-1-1 # 1 CPU 1 dev
+pktgen.conf-1-2 # 1 CPU 2 dev
+pktgen.conf-2-1 # 2 CPU's 1 dev
+pktgen.conf-2-2 # 2 CPU's 2 dev
+pktgen.conf-1-1-rdos # 1 CPU 1 dev w. route DoS
+pktgen.conf-1-1-ip6 # 1 CPU 1 dev ipv6
+pktgen.conf-1-1-ip6-rdos # 1 CPU 1 dev ipv6 w. route DoS
+pktgen.conf-1-1-flows # 1 CPU 1 dev multiple flows.
+
+Run in shell: ./pktgen.conf-X-Y
+This does all the setup including sending.
+
+
+Interrupt affinity
+===================
+Note that when adding devices to a specific CPU it is a good idea to
+also assign /proc/irq/XX/smp_affinity so that the TX interrupts are bound
+to the same CPU. This reduces cache bouncing when freeing skbs.
+
+Enable IPsec
+============
+Default IPsec transformation with ESP encapsulation plus transport mode
+can be enabled by simply setting:
+
+pgset "flag IPSEC"
+pgset "flows 1"
+
+To avoid breaking existing testbed scripts for using AH type and tunnel mode,
+you can use "pgset spi SPI_VALUE" to specify which transformation mode
+to employ.
+
+
+Current commands and configuration options
+==========================================
+
+** Pgcontrol commands:
+
+start
+stop
+
+** Thread commands:
+
+add_device
+rem_device_all
+max_before_softirq
+
+
+** Device commands:
+
+count
+clone_skb
+debug
+
+frags
+delay
+
+src_mac_count
+dst_mac_count
+
+pkt_size
+min_pkt_size
+max_pkt_size
+
+mpls
+
+udp_src_min
+udp_src_max
+
+udp_dst_min
+udp_dst_max
+
+flag
+ IPSRC_RND
+ IPDST_RND
+ UDPSRC_RND
+ UDPDST_RND
+ MACSRC_RND
+ MACDST_RND
+ TXSIZE_RND
+ IPV6
+ MPLS_RND
+ VID_RND
+ SVID_RND
+ FLOW_SEQ
+ QUEUE_MAP_RND
+ QUEUE_MAP_CPU
+ UDPCSUM
+ IPSEC
+ NODE_ALLOC
+
+dst_min
+dst_max
+
+src_min
+src_max
+
+dst_mac
+src_mac
+
+clear_counters
+
+dst6
+src6
+
+flows
+flowlen
+
+rate
+ratep
+
+References:
+ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/
+ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/
+
+Paper from Linux-Kongress in Erlangen 2004.
+ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/pktgen_paper.pdf
+
+Thanks to:
+Grant Grundler for testing on IA-64 and parisc, Harald Welte, Lennert Buytenhek
+Stephen Hemminger, Andi Kleen, Dave Miller and many others.
+
+
+Good luck with the linux net-development.
diff --git a/Documentation/networking/policy-routing.txt b/Documentation/networking/policy-routing.txt
new file mode 100644
index 000000000..36f6936d7
--- /dev/null
+++ b/Documentation/networking/policy-routing.txt
@@ -0,0 +1,150 @@
+Classes
+-------
+
+ "Class" is a complete routing table in common sense.
+ I.e. it is tree of nodes (destination prefix, tos, metric)
+ with attached information: gateway, device etc.
+ This tree is looked up as specified in RFC1812 5.2.4.3
+ 1. Basic match
+ 2. Longest match
+ 3. Weak TOS.
+ 4. Metric. (should not be in kernel space, but they are)
+ 5. Additional pruning rules. (not in kernel space).
+
+ We have two special type of nodes:
+ REJECT - abort route lookup and return an error value.
+ THROW - abort route lookup in this class.
+
+
+ Currently the number of classes is limited to 255
+ (0 is reserved for "not specified class")
+
+ Three classes are builtin:
+
+ RT_CLASS_LOCAL=255 - local interface addresses,
+ broadcasts, nat addresses.
+
+ RT_CLASS_MAIN=254 - all normal routes are put there
+ by default.
+
+ RT_CLASS_DEFAULT=253 - if ip_fib_model==1, then
+ normal default routes are put there, if ip_fib_model==2
+ all gateway routes are put there.
+
+
+Rules
+-----
+ Rule is a record of (src prefix, src interface, tos, dst prefix)
+ with attached information.
+
+ Rule types:
+ RTP_ROUTE - lookup in attached class
+ RTP_NAT - lookup in attached class and if a match is found,
+ translate packet source address.
+ RTP_MASQUERADE - lookup in attached class and if a match is found,
+ masquerade packet as sourced by us.
+ RTP_DROP - silently drop the packet.
+ RTP_REJECT - drop the packet and send ICMP NET UNREACHABLE.
+ RTP_PROHIBIT - drop the packet and send ICMP COMM. ADM. PROHIBITED.
+
+ Rule flags:
+ RTRF_LOG - log route creations.
+ RTRF_VALVE - One way route (used with masquerading)
+
+Default setup:
+
+root@amber:/pub/ip-routing # iproute -r
+Kernel routing policy rules
+Pref Source Destination TOS Iface Cl
+ 0 default default 00 * 255
+ 254 default default 00 * 254
+ 255 default default 00 * 253
+
+
+Lookup algorithm
+----------------
+
+ We scan rules list, and if a rule is matched, apply it.
+ If a route is found, return it.
+ If it is not found or a THROW node was matched, continue
+ to scan rules.
+
+Applications
+------------
+
+1. Just ignore classes. All the routes are put into MAIN class
+ (and/or into DEFAULT class).
+
+ HOWTO: iproute add PREFIX [ tos TOS ] [ gw GW ] [ dev DEV ]
+ [ metric METRIC ] [ reject ] ... (look at iproute utility)
+
+ or use route utility from current net-tools.
+
+2. Opposite case. Just forget all that you know about routing
+ tables. Every rule is supplied with its own gateway, device
+ info. record. This approach is not appropriate for automated
+ route maintenance, but it is ideal for manual configuration.
+
+ HOWTO: iproute addrule [ from PREFIX ] [ to PREFIX ] [ tos TOS ]
+ [ dev INPUTDEV] [ pref PREFERENCE ] route [ gw GATEWAY ]
+ [ dev OUTDEV ] .....
+
+ Warning: As of now the size of the routing table in this
+ approach is limited to 256. If someone likes this model, I'll
+ relax this limitation.
+
+3. OSPF classes (see RFC1583, RFC1812 E.3.3)
+ Very clean, stable and robust algorithm for OSPF routing
+ domains. Unfortunately, it is not widely used in the Internet.
+
+ Proposed setup:
+ 255 local addresses
+ 254 interface routes
+ 253 ASE routes with external metric
+ 252 ASE routes with internal metric
+ 251 inter-area routes
+ 250 intra-area routes for 1st area
+ 249 intra-area routes for 2nd area
+ etc.
+
+ Rules:
+ iproute addrule class 253
+ iproute addrule class 252
+ iproute addrule class 251
+ iproute addrule to a-prefix-for-1st-area class 250
+ iproute addrule to another-prefix-for-1st-area class 250
+ ...
+ iproute addrule to a-prefix-for-2nd-area class 249
+ ...
+
+ Area classes must be terminated with reject record.
+ iproute add default reject class 250
+ iproute add default reject class 249
+ ...
+
+4. The Variant Router Requirements Algorithm (RFC1812 E.3.2)
+ Create 16 classes for different TOS values.
+ It is a funny, but pretty useless algorithm.
+ I listed it just to show the power of new routing code.
+
+5. All the variety of combinations......
+
+
+GATED
+-----
+
+ Gated does not understand classes, but it will work
+ happily in MAIN+DEFAULT. All policy routes can be set
+ and maintained manually.
+
+IMPORTANT NOTE
+--------------
+ route.c has a compilation time switch CONFIG_IP_LOCAL_RT_POLICY.
+ If it is set, locally originated packets are routed
+ using all the policy list. This is not very convenient and
+ pretty ambiguous when used with NAT and masquerading.
+ I set it to FALSE by default.
+
+
+Alexey Kuznetov
+kuznet@ms2.inr.ac.ru
diff --git a/Documentation/networking/ppp_generic.txt b/Documentation/networking/ppp_generic.txt
new file mode 100644
index 000000000..091d20273
--- /dev/null
+++ b/Documentation/networking/ppp_generic.txt
@@ -0,0 +1,432 @@
+ PPP Generic Driver and Channel Interface
+ ----------------------------------------
+
+ Paul Mackerras
+ paulus@samba.org
+ 7 Feb 2002
+
+The generic PPP driver in linux-2.4 provides an implementation of the
+functionality which is of use in any PPP implementation, including:
+
+* the network interface unit (ppp0 etc.)
+* the interface to the networking code
+* PPP multilink: splitting datagrams between multiple links, and
+ ordering and combining received fragments
+* the interface to pppd, via a /dev/ppp character device
+* packet compression and decompression
+* TCP/IP header compression and decompression
+* detecting network traffic for demand dialling and for idle timeouts
+* simple packet filtering
+
+For sending and receiving PPP frames, the generic PPP driver calls on
+the services of PPP `channels'. A PPP channel encapsulates a
+mechanism for transporting PPP frames from one machine to another. A
+PPP channel implementation can be arbitrarily complex internally but
+has a very simple interface with the generic PPP code: it merely has
+to be able to send PPP frames, receive PPP frames, and optionally
+handle ioctl requests. Currently there are PPP channel
+implementations for asynchronous serial ports, synchronous serial
+ports, and for PPP over ethernet.
+
+This architecture makes it possible to implement PPP multilink in a
+natural and straightforward way, by allowing more than one channel to
+be linked to each ppp network interface unit. The generic layer is
+responsible for splitting datagrams on transmit and recombining them
+on receive.
+
+
+PPP channel API
+---------------
+
+See include/linux/ppp_channel.h for the declaration of the types and
+functions used to communicate between the generic PPP layer and PPP
+channels.
+
+Each channel has to provide two functions to the generic PPP layer,
+via the ppp_channel.ops pointer:
+
+* start_xmit() is called by the generic layer when it has a frame to
+ send. The channel has the option of rejecting the frame for
+ flow-control reasons. In this case, start_xmit() should return 0
+ and the channel should call the ppp_output_wakeup() function at a
+ later time when it can accept frames again, and the generic layer
+ will then attempt to retransmit the rejected frame(s). If the frame
+ is accepted, the start_xmit() function should return 1.
+
+* ioctl() provides an interface which can be used by a user-space
+ program to control aspects of the channel's behaviour. This
+ procedure will be called when a user-space program does an ioctl
+ system call on an instance of /dev/ppp which is bound to the
+ channel. (Usually it would only be pppd which would do this.)
+
+The generic PPP layer provides seven functions to channels:
+
+* ppp_register_channel() is called when a channel has been created, to
+ notify the PPP generic layer of its presence. For example, setting
+ a serial port to the PPPDISC line discipline causes the ppp_async
+ channel code to call this function.
+
+* ppp_unregister_channel() is called when a channel is to be
+ destroyed. For example, the ppp_async channel code calls this when
+ a hangup is detected on the serial port.
+
+* ppp_output_wakeup() is called by a channel when it has previously
+ rejected a call to its start_xmit function, and can now accept more
+ packets.
+
+* ppp_input() is called by a channel when it has received a complete
+ PPP frame.
+
+* ppp_input_error() is called by a channel when it has detected that a
+ frame has been lost or dropped (for example, because of a FCS (frame
+ check sequence) error).
+
+* ppp_channel_index() returns the channel index assigned by the PPP
+ generic layer to this channel. The channel should provide some way
+ (e.g. an ioctl) to transmit this back to user-space, as user-space
+ will need it to attach an instance of /dev/ppp to this channel.
+
+* ppp_unit_number() returns the unit number of the ppp network
+ interface to which this channel is connected, or -1 if the channel
+ is not connected.
+
+Connecting a channel to the ppp generic layer is initiated from the
+channel code, rather than from the generic layer. The channel is
+expected to have some way for a user-level process to control it
+independently of the ppp generic layer. For example, with the
+ppp_async channel, this is provided by the file descriptor to the
+serial port.
+
+Generally a user-level process will initialize the underlying
+communications medium and prepare it to do PPP. For example, with an
+async tty, this can involve setting the tty speed and modes, issuing
+modem commands, and then going through some sort of dialog with the
+remote system to invoke PPP service there. We refer to this process
+as `discovery'. Then the user-level process tells the medium to
+become a PPP channel and register itself with the generic PPP layer.
+The channel then has to report the channel number assigned to it back
+to the user-level process. From that point, the PPP negotiation code
+in the PPP daemon (pppd) can take over and perform the PPP
+negotiation, accessing the channel through the /dev/ppp interface.
+
+At the interface to the PPP generic layer, PPP frames are stored in
+skbuff structures and start with the two-byte PPP protocol number.
+The frame does *not* include the 0xff `address' byte or the 0x03
+`control' byte that are optionally used in async PPP. Nor is there
+any escaping of control characters, nor are there any FCS or framing
+characters included. That is all the responsibility of the channel
+code, if it is needed for the particular medium. That is, the skbuffs
+presented to the start_xmit() function contain only the 2-byte
+protocol number and the data, and the skbuffs presented to ppp_input()
+must be in the same format.
+
+The channel must provide an instance of a ppp_channel struct to
+represent the channel. The channel is free to use the `private' field
+however it wishes. The channel should initialize the `mtu' and
+`hdrlen' fields before calling ppp_register_channel() and not change
+them until after ppp_unregister_channel() returns. The `mtu' field
+represents the maximum size of the data part of the PPP frames, that
+is, it does not include the 2-byte protocol number.
+
+If the channel needs some headroom in the skbuffs presented to it for
+transmission (i.e., some space free in the skbuff data area before the
+start of the PPP frame), it should set the `hdrlen' field of the
+ppp_channel struct to the amount of headroom required. The generic
+PPP layer will attempt to provide that much headroom but the channel
+should still check if there is sufficient headroom and copy the skbuff
+if there isn't.
+
+On the input side, channels should ideally provide at least 2 bytes of
+headroom in the skbuffs presented to ppp_input(). The generic PPP
+code does not require this but will be more efficient if this is done.
+
+
+Buffering and flow control
+--------------------------
+
+The generic PPP layer has been designed to minimize the amount of data
+that it buffers in the transmit direction. It maintains a queue of
+transmit packets for the PPP unit (network interface device) plus a
+queue of transmit packets for each attached channel. Normally the
+transmit queue for the unit will contain at most one packet; the
+exceptions are when pppd sends packets by writing to /dev/ppp, and
+when the core networking code calls the generic layer's start_xmit()
+function with the queue stopped, i.e. when the generic layer has
+called netif_stop_queue(), which only happens on a transmit timeout.
+The start_xmit function always accepts and queues the packet which it
+is asked to transmit.
+
+Transmit packets are dequeued from the PPP unit transmit queue and
+then subjected to TCP/IP header compression and packet compression
+(Deflate or BSD-Compress compression), as appropriate. After this
+point the packets can no longer be reordered, as the decompression
+algorithms rely on receiving compressed packets in the same order that
+they were generated.
+
+If multilink is not in use, this packet is then passed to the attached
+channel's start_xmit() function. If the channel refuses to take
+the packet, the generic layer saves it for later transmission. The
+generic layer will call the channel's start_xmit() function again
+when the channel calls ppp_output_wakeup() or when the core
+networking code calls the generic layer's start_xmit() function
+again. The generic layer contains no timeout and retransmission
+logic; it relies on the core networking code for that.
+
+If multilink is in use, the generic layer divides the packet into one
+or more fragments and puts a multilink header on each fragment. It
+decides how many fragments to use based on the length of the packet
+and the number of channels which are potentially able to accept a
+fragment at the moment. A channel is potentially able to accept a
+fragment if it doesn't have any fragments currently queued up for it
+to transmit. The channel may still refuse a fragment; in this case
+the fragment is queued up for the channel to transmit later. This
+scheme has the effect that more fragments are given to higher-
+bandwidth channels. It also means that under light load, the generic
+layer will tend to fragment large packets across all the channels,
+thus reducing latency, while under heavy load, packets will tend to be
+transmitted as single fragments, thus reducing the overhead of
+fragmentation.
+
+
+SMP safety
+----------
+
+The PPP generic layer has been designed to be SMP-safe. Locks are
+used around accesses to the internal data structures where necessary
+to ensure their integrity. As part of this, the generic layer
+requires that the channels adhere to certain requirements and in turn
+provides certain guarantees to the channels. Essentially the channels
+are required to provide the appropriate locking on the ppp_channel
+structures that form the basis of the communication between the
+channel and the generic layer. This is because the channel provides
+the storage for the ppp_channel structure, and so the channel is
+required to provide the guarantee that this storage exists and is
+valid at the appropriate times.
+
+The generic layer requires these guarantees from the channel:
+
+* The ppp_channel object must exist from the time that
+ ppp_register_channel() is called until after the call to
+ ppp_unregister_channel() returns.
+
+* No thread may be in a call to any of ppp_input(), ppp_input_error(),
+ ppp_output_wakeup(), ppp_channel_index() or ppp_unit_number() for a
+ channel at the time that ppp_unregister_channel() is called for that
+ channel.
+
+* ppp_register_channel() and ppp_unregister_channel() must be called
+ from process context, not interrupt or softirq/BH context.
+
+* The remaining generic layer functions may be called at softirq/BH
+ level but must not be called from a hardware interrupt handler.
+
+* The generic layer may call the channel start_xmit() function at
+ softirq/BH level but will not call it at interrupt level. Thus the
+ start_xmit() function may not block.
+
+* The generic layer will only call the channel ioctl() function in
+ process context.
+
+The generic layer provides these guarantees to the channels:
+
+* The generic layer will not call the start_xmit() function for a
+ channel while any thread is already executing in that function for
+ that channel.
+
+* The generic layer will not call the ioctl() function for a channel
+ while any thread is already executing in that function for that
+ channel.
+
+* By the time a call to ppp_unregister_channel() returns, no thread
+ will be executing in a call from the generic layer to that channel's
+ start_xmit() or ioctl() function, and the generic layer will not
+ call either of those functions subsequently.
+
+
+Interface to pppd
+-----------------
+
+The PPP generic layer exports a character device interface called
+/dev/ppp. This is used by pppd to control PPP interface units and
+channels. Although there is only one /dev/ppp, each open instance of
+/dev/ppp acts independently and can be attached either to a PPP unit
+or a PPP channel. This is achieved using the file->private_data field
+to point to a separate object for each open instance of /dev/ppp. In
+this way an effect similar to Solaris' clone open is obtained,
+allowing us to control an arbitrary number of PPP interfaces and
+channels without having to fill up /dev with hundreds of device names.
+
+When /dev/ppp is opened, a new instance is created which is initially
+unattached. Using an ioctl call, it can then be attached to an
+existing unit, attached to a newly-created unit, or attached to an
+existing channel. An instance attached to a unit can be used to send
+and receive PPP control frames, using the read() and write() system
+calls, along with poll() if necessary. Similarly, an instance
+attached to a channel can be used to send and receive PPP frames on
+that channel.
+
+In multilink terms, the unit represents the bundle, while the channels
+represent the individual physical links. Thus, a PPP frame sent by a
+write to the unit (i.e., to an instance of /dev/ppp attached to the
+unit) will be subject to bundle-level compression and to fragmentation
+across the individual links (if multilink is in use). In contrast, a
+PPP frame sent by a write to the channel will be sent as-is on that
+channel, without any multilink header.
+
+A channel is not initially attached to any unit. In this state it can
+be used for PPP negotiation but not for the transfer of data packets.
+It can then be connected to a PPP unit with an ioctl call, which
+makes it available to send and receive data packets for that unit.
+
+The ioctl calls which are available on an instance of /dev/ppp depend
+on whether it is unattached, attached to a PPP interface, or attached
+to a PPP channel. The ioctl calls which are available on an
+unattached instance are:
+
+* PPPIOCNEWUNIT creates a new PPP interface and makes this /dev/ppp
+ instance the "owner" of the interface. The argument should point to
+ an int which is the desired unit number if >= 0, or -1 to assign the
+ lowest unused unit number. Being the owner of the interface means
+ that the interface will be shut down if this instance of /dev/ppp is
+ closed.
+
+* PPPIOCATTACH attaches this instance to an existing PPP interface.
+ The argument should point to an int containing the unit number.
+ This does not make this instance the owner of the PPP interface.
+
+* PPPIOCATTCHAN attaches this instance to an existing PPP channel.
+ The argument should point to an int containing the channel number.
+
+The ioctl calls available on an instance of /dev/ppp attached to a
+channel are:
+
+* PPPIOCDETACH detaches the instance from the channel. This ioctl is
+ deprecated since the same effect can be achieved by closing the
+ instance. In order to prevent possible races this ioctl will fail
+ with an EINVAL error if more than one file descriptor refers to this
+ instance (i.e. as a result of dup(), dup2() or fork()).
+
+* PPPIOCCONNECT connects this channel to a PPP interface. The
+ argument should point to an int containing the interface unit
+ number. It will return an EINVAL error if the channel is already
+ connected to an interface, or ENXIO if the requested interface does
+ not exist.
+
+* PPPIOCDISCONN disconnects this channel from the PPP interface that
+ it is connected to. It will return an EINVAL error if the channel
+ is not connected to an interface.
+
+* All other ioctl commands are passed to the channel ioctl() function.
+
+The ioctl calls that are available on an instance that is attached to
+an interface unit are:
+
+* PPPIOCSMRU sets the MRU (maximum receive unit) for the interface.
+ The argument should point to an int containing the new MRU value.
+
+* PPPIOCSFLAGS sets flags which control the operation of the
+ interface. The argument should be a pointer to an int containing
+ the new flags value. The bits in the flags value that can be set
+ are:
+ SC_COMP_TCP enable transmit TCP header compression
+ SC_NO_TCP_CCID disable connection-id compression for
+ TCP header compression
+ SC_REJ_COMP_TCP disable receive TCP header decompression
+ SC_CCP_OPEN Compression Control Protocol (CCP) is
+ open, so inspect CCP packets
+ SC_CCP_UP CCP is up, may (de)compress packets
+ SC_LOOP_TRAFFIC send IP traffic to pppd
+ SC_MULTILINK enable PPP multilink fragmentation on
+ transmitted packets
+ SC_MP_SHORTSEQ expect short multilink sequence
+ numbers on received multilink fragments
+ SC_MP_XSHORTSEQ transmit short multilink sequence nos.
+
+ The values of these flags are defined in <linux/ppp-ioctl.h>. Note
+ that the values of the SC_MULTILINK, SC_MP_SHORTSEQ and
+ SC_MP_XSHORTSEQ bits are ignored if the CONFIG_PPP_MULTILINK option
+ is not selected.
+
+* PPPIOCGFLAGS returns the value of the status/control flags for the
+ interface unit. The argument should point to an int where the ioctl
+ will store the flags value. As well as the values listed above for
+ PPPIOCSFLAGS, the following bits may be set in the returned value:
+ SC_COMP_RUN CCP compressor is running
+ SC_DECOMP_RUN CCP decompressor is running
+ SC_DC_ERROR CCP decompressor detected non-fatal error
+ SC_DC_FERROR CCP decompressor detected fatal error
+
+* PPPIOCSCOMPRESS sets the parameters for packet compression or
+ decompression. The argument should point to a ppp_option_data
+ structure (defined in <linux/ppp-ioctl.h>), which contains a
+ pointer/length pair which should describe a block of memory
+ containing a CCP option specifying a compression method and its
+ parameters. The ppp_option_data struct also contains a `transmit'
+ field. If this is 0, the ioctl will affect the receive path,
+ otherwise the transmit path.
+
+* PPPIOCGUNIT returns, in the int pointed to by the argument, the unit
+ number of this interface unit.
+
+* PPPIOCSDEBUG sets the debug flags for the interface to the value in
+ the int pointed to by the argument. Only the least significant bit
+ is used; if this is 1 the generic layer will print some debug
+ messages during its operation. This is only intended for debugging
+ the generic PPP layer code; it is generally not helpful for working
+ out why a PPP connection is failing.
+
+* PPPIOCGDEBUG returns the debug flags for the interface in the int
+ pointed to by the argument.
+
+* PPPIOCGIDLE returns the time, in seconds, since the last data
+ packets were sent and received. The argument should point to a
+ ppp_idle structure (defined in <linux/ppp_defs.h>). If the
+ CONFIG_PPP_FILTER option is enabled, the set of packets which reset
+ the transmit and receive idle timers is restricted to those which
+ pass the `active' packet filter.
+
+* PPPIOCSMAXCID sets the maximum connection-ID parameter (and thus the
+ number of connection slots) for the TCP header compressor and
+ decompressor. The lower 16 bits of the int pointed to by the
+ argument specify the maximum connection-ID for the compressor. If
+ the upper 16 bits of that int are non-zero, they specify the maximum
+ connection-ID for the decompressor, otherwise the decompressor's
+ maximum connection-ID is set to 15.
+
+* PPPIOCSNPMODE sets the network-protocol mode for a given network
+ protocol. The argument should point to an npioctl struct (defined
+ in <linux/ppp-ioctl.h>). The `protocol' field gives the PPP protocol
+ number for the protocol to be affected, and the `mode' field
+ specifies what to do with packets for that protocol:
+
+ NPMODE_PASS normal operation, transmit and receive packets
+ NPMODE_DROP silently drop packets for this protocol
+ NPMODE_ERROR drop packets and return an error on transmit
+ NPMODE_QUEUE queue up packets for transmit, drop received
+ packets
+
+ At present NPMODE_ERROR and NPMODE_QUEUE have the same effect as
+ NPMODE_DROP.
+
+* PPPIOCGNPMODE returns the network-protocol mode for a given
+ protocol. The argument should point to an npioctl struct with the
+ `protocol' field set to the PPP protocol number for the protocol of
+ interest. On return the `mode' field will be set to the network-
+ protocol mode for that protocol.
+
+* PPPIOCSPASS and PPPIOCSACTIVE set the `pass' and `active' packet
+ filters. These ioctls are only available if the CONFIG_PPP_FILTER
+ option is selected. The argument should point to a sock_fprog
+ structure (defined in <linux/filter.h>) containing the compiled BPF
+ instructions for the filter. Packets are dropped if they fail the
+ `pass' filter; otherwise, if they fail the `active' filter they are
+ passed but they do not reset the transmit or receive idle timer.
+
+* PPPIOCSMRRU enables or disables multilink processing for received
+ packets and sets the multilink MRRU (maximum reconstructed receive
+ unit). The argument should point to an int containing the new MRRU
+ value. If the MRRU value is 0, processing of received multilink
+ fragments is disabled. This ioctl is only available if the
+ CONFIG_PPP_MULTILINK option is selected.
+
+Last modified: 7-feb-2002
diff --git a/Documentation/networking/proc_net_tcp.txt b/Documentation/networking/proc_net_tcp.txt
new file mode 100644
index 000000000..4a79209e7
--- /dev/null
+++ b/Documentation/networking/proc_net_tcp.txt
@@ -0,0 +1,48 @@
+This document describes the interfaces /proc/net/tcp and /proc/net/tcp6.
+Note that these interfaces are deprecated in favor of tcp_diag.
+
+These /proc interfaces provide information about currently active TCP
+connections, and are implemented by tcp4_seq_show() in net/ipv4/tcp_ipv4.c
+and tcp6_seq_show() in net/ipv6/tcp_ipv6.c, respectively.
+
+It will first list all listening TCP sockets, and next list all established
+TCP connections. A typical entry of /proc/net/tcp would look like this (split
+up into 3 parts because of the length of the line):
+
+ 46: 010310AC:9C4C 030310AC:1770 01
+ | | | | | |--> connection state
+ | | | | |------> remote TCP port number
+ | | | |-------------> remote IPv4 address
+ | | |--------------------> local TCP port number
+ | |---------------------------> local IPv4 address
+ |----------------------------------> number of entry
+
+ 00000150:00000000 01:00000019 00000000
+ | | | | |--> number of unrecovered RTO timeouts
+ | | | |----------> number of jiffies until timer expires
+ | | |----------------> timer_active (see below)
+ | |----------------------> receive-queue
+ |-------------------------------> transmit-queue
+
+ 1000 0 54165785 4 cd1e6040 25 4 27 3 -1
+ | | | | | | | | | |--> slow start size threshold,
+ | | | | | | | | | or -1 if the threshold
+ | | | | | | | | | is >= 0xFFFF
+ | | | | | | | | |----> sending congestion window
+ | | | | | | | |-------> (ack.quick<<1)|ack.pingpong
+ | | | | | | |---------> Predicted tick of soft clock
+ | | | | | | (delayed ACK control data)
+ | | | | | |------------> retransmit timeout
+ | | | | |------------------> location of socket in memory
+ | | | |-----------------------> socket reference count
+ | | |-----------------------------> inode
+ | |----------------------------------> unanswered 0-window probes
+ |---------------------------------------------> uid
+
+timer_active:
+ 0 no timer is pending
+ 1 retransmit-timer is pending
+ 2 another timer (e.g. delayed ack or keepalive) is pending
+ 3 this is a socket in TIME_WAIT state. Not all fields will contain
+ data (or even exist)
+ 4 zero window probe timer is pending
diff --git a/Documentation/networking/radiotap-headers.txt b/Documentation/networking/radiotap-headers.txt
new file mode 100644
index 000000000..953331c79
--- /dev/null
+++ b/Documentation/networking/radiotap-headers.txt
@@ -0,0 +1,152 @@
+How to use radiotap headers
+===========================
+
+Pointer to the radiotap include file
+------------------------------------
+
+Radiotap headers are variable-length and extensible, you can get most of the
+information you need to know on them from:
+
+./include/net/ieee80211_radiotap.h
+
+This document gives an overview and warns on some corner cases.
+
+
+Structure of the header
+-----------------------
+
+There is a fixed portion at the start which contains a u32 bitmap that defines
+if the possible argument associated with that bit is present or not. So if b0
+of the it_present member of ieee80211_radiotap_header is set, it means that
+the header for argument index 0 (IEEE80211_RADIOTAP_TSFT) is present in the
+argument area.
+
+ < 8-byte ieee80211_radiotap_header >
+ [ <possible argument bitmap extensions ... > ]
+ [ <argument> ... ]
+
+At the moment there are only 13 possible argument indexes defined, but in case
+we run out of space in the u32 it_present member, it is defined that b31 set
+indicates that there is another u32 bitmap following (shown as "possible
+argument bitmap extensions..." above), and the start of the arguments is moved
+forward 4 bytes each time.
+
+Note also that the it_len member __le16 is set to the total number of bytes
+covered by the ieee80211_radiotap_header and any arguments following.
+
+
+Requirements for arguments
+--------------------------
+
+After the fixed part of the header, the arguments follow for each argument
+index whose matching bit is set in the it_present member of
+ieee80211_radiotap_header.
+
+ - the arguments are all stored little-endian!
+
+ - the argument payload for a given argument index has a fixed size. So
+ IEEE80211_RADIOTAP_TSFT being present always indicates an 8-byte argument is
+ present. See the comments in ./include/net/ieee80211_radiotap.h for a nice
+ breakdown of all the argument sizes
+
+ - the arguments must be aligned to a boundary of the argument size using
+ padding. So a u16 argument must start on the next u16 boundary if it isn't
+ already on one, a u32 must start on the next u32 boundary and so on.
+
+ - "alignment" is relative to the start of the ieee80211_radiotap_header, ie,
+ the first byte of the radiotap header. The absolute alignment of that first
+ byte isn't defined. So even if the whole radiotap header is starting at, eg,
+ address 0x00000003, still the first byte of the radiotap header is treated as
+ 0 for alignment purposes.
+
+ - the above point that there may be no absolute alignment for multibyte
+ entities in the fixed radiotap header or the argument region means that you
+ have to take special evasive action when trying to access these multibyte
+ entities. Some arches like Blackfin cannot deal with an attempt to
+ dereference, eg, a u16 pointer that is pointing to an odd address. Instead
+ you have to use a kernel API get_unaligned() to dereference the pointer,
+ which will do it bytewise on the arches that require that.
+
+ - The arguments for a given argument index can be a compound of multiple types
+ together. For example IEEE80211_RADIOTAP_CHANNEL has an argument payload
+ consisting of two u16s of total length 4. When this happens, the padding
+ rule is applied dealing with a u16, NOT dealing with a 4-byte single entity.
+
+
+Example valid radiotap header
+-----------------------------
+
+ 0x00, 0x00, // <-- radiotap version + pad byte
+ 0x0b, 0x00, // <- radiotap header length
+ 0x04, 0x0c, 0x00, 0x00, // <-- bitmap
+ 0x6c, // <-- rate (in 500kHz units)
+ 0x0c, //<-- tx power
+ 0x01 //<-- antenna
+
+
+Using the Radiotap Parser
+-------------------------
+
+If you are having to parse a radiotap struct, you can radically simplify the
+job by using the radiotap parser that lives in net/wireless/radiotap.c and has
+its prototypes available in include/net/cfg80211.h. You use it like this:
+
+#include <net/cfg80211.h>
+
+/* buf points to the start of the radiotap header part */
+
+int MyFunction(u8 * buf, int buflen)
+{
+ int pkt_rate_100kHz = 0, antenna = 0, pwr = 0;
+ struct ieee80211_radiotap_iterator iterator;
+ int ret = ieee80211_radiotap_iterator_init(&iterator, buf, buflen);
+
+ while (!ret) {
+
+ ret = ieee80211_radiotap_iterator_next(&iterator);
+
+ if (ret)
+ continue;
+
+ /* see if this argument is something we can use */
+
+ switch (iterator.this_arg_index) {
+ /*
+ * You must take care when dereferencing iterator.this_arg
+ * for multibyte types... the pointer is not aligned. Use
+ * get_unaligned((type *)iterator.this_arg) to dereference
+ * iterator.this_arg for type "type" safely on all arches.
+ */
+ case IEEE80211_RADIOTAP_RATE:
+ /* radiotap "rate" u8 is in
+ * 500kbps units, eg, 0x02=1Mbps
+ */
+ pkt_rate_100kHz = (*iterator.this_arg) * 5;
+ break;
+
+ case IEEE80211_RADIOTAP_ANTENNA:
+ /* radiotap uses 0 for 1st ant */
+ antenna = *iterator.this_arg);
+ break;
+
+ case IEEE80211_RADIOTAP_DBM_TX_POWER:
+ pwr = *iterator.this_arg;
+ break;
+
+ default:
+ break;
+ }
+ } /* while more rt headers */
+
+ if (ret != -ENOENT)
+ return TXRX_DROP;
+
+ /* discard the radiotap header part */
+ buf += iterator.max_length;
+ buflen -= iterator.max_length;
+
+ ...
+
+}
+
+Andy Green <andy@warmcat.com>
diff --git a/Documentation/networking/ray_cs.txt b/Documentation/networking/ray_cs.txt
new file mode 100644
index 000000000..c0c12307e
--- /dev/null
+++ b/Documentation/networking/ray_cs.txt
@@ -0,0 +1,150 @@
+September 21, 1999
+
+Copyright (c) 1998 Corey Thomas (corey@world.std.com)
+
+This file is the documentation for the Raylink Wireless LAN card driver for
+Linux. The Raylink wireless LAN card is a PCMCIA card which provides IEEE
+802.11 compatible wireless network connectivity at 1 and 2 megabits/second.
+See http://www.raytheon.com/micro/raylink/ for more information on the Raylink
+card. This driver is in early development and does have bugs. See the known
+bugs and limitations at the end of this document for more information.
+This driver also works with WebGear's Aviator 2.4 and Aviator Pro
+wireless LAN cards.
+
+As of kernel 2.3.18, the ray_cs driver is part of the Linux kernel
+source. My web page for the development of ray_cs is at
+http://web.ralinktech.com/ralink/Home/Support/Linux.html
+and I can be emailed at corey@world.std.com
+
+The kernel driver is based on ray_cs-1.62.tgz
+
+The driver at my web page is intended to be used as an add on to
+David Hinds pcmcia package. All the command line parameters are
+available when compiled as a module. When built into the kernel, only
+the essid= string parameter is available via the kernel command line.
+This will change after the method of sorting out parameters for all
+the PCMCIA drivers is agreed upon. If you must have a built in driver
+with nondefault parameters, they can be edited in
+/usr/src/linux/drivers/net/pcmcia/ray_cs.c. Searching for module_param
+will find them all.
+
+Information on card services is available at:
+ http://pcmcia-cs.sourceforge.net/
+
+
+Card services user programs are still required for PCMCIA devices.
+pcmcia-cs-3.1.1 or greater is required for the kernel version of
+the driver.
+
+Currently, ray_cs is not part of David Hinds card services package,
+so the following magic is required.
+
+At the end of the /etc/pcmcia/config.opts file, add the line:
+source ./ray_cs.opts
+This will make card services read the ray_cs.opts file
+when starting. Create the file /etc/pcmcia/ray_cs.opts containing the
+following:
+
+#### start of /etc/pcmcia/ray_cs.opts ###################
+# Configuration options for Raylink Wireless LAN PCMCIA card
+device "ray_cs"
+ class "network" module "misc/ray_cs"
+
+card "RayLink PC Card WLAN Adapter"
+ manfid 0x01a6, 0x0000
+ bind "ray_cs"
+
+module "misc/ray_cs" opts ""
+#### end of /etc/pcmcia/ray_cs.opts #####################
+
+
+To join an existing network with
+different parameters, contact the network administrator for the
+configuration information, and edit /etc/pcmcia/ray_cs.opts.
+Add the parameters below between the empty quotes.
+
+Parameters for ray_cs driver which may be specified in ray_cs.opts:
+
+bc integer 0 = normal mode (802.11 timing)
+ 1 = slow down inter frame timing to allow
+ operation with older breezecom access
+ points.
+
+beacon_period integer beacon period in Kilo-microseconds
+ legal values = must be integer multiple
+ of hop dwell
+ default = 256
+
+country integer 1 = USA (default)
+ 2 = Europe
+ 3 = Japan
+ 4 = Korea
+ 5 = Spain
+ 6 = France
+ 7 = Israel
+ 8 = Australia
+
+essid string ESS ID - network name to join
+ string with maximum length of 32 chars
+ default value = "ADHOC_ESSID"
+
+hop_dwell integer hop dwell time in Kilo-microseconds
+ legal values = 16,32,64,128(default),256
+
+irq_mask integer linux standard 16 bit value 1bit/IRQ
+ lsb is IRQ 0, bit 1 is IRQ 1 etc.
+ Used to restrict choice of IRQ's to use.
+ Recommended method for controlling
+ interrupts is in /etc/pcmcia/config.opts
+
+net_type integer 0 (default) = adhoc network,
+ 1 = infrastructure
+
+phy_addr string string containing new MAC address in
+ hex, must start with x eg
+ x00008f123456
+
+psm integer 0 = continuously active
+ 1 = power save mode (not useful yet)
+
+pc_debug integer (0-5) larger values for more verbose
+ logging. Replaces ray_debug.
+
+ray_debug integer Replaced with pc_debug
+
+ray_mem_speed integer defaults to 500
+
+sniffer integer 0 = not sniffer (default)
+ 1 = sniffer which can be used to record all
+ network traffic using tcpdump or similar,
+ but no normal network use is allowed.
+
+translate integer 0 = no translation (encapsulate frames)
+ 1 = translation (RFC1042/802.1)
+
+
+More on sniffer mode:
+
+tcpdump does not understand 802.11 headers, so it can't
+interpret the contents, but it can record to a file. This is only
+useful for debugging 802.11 lowlevel protocols that are not visible to
+linux. If you want to watch ftp xfers, or do similar things, you
+don't need to use sniffer mode. Also, some packet types are never
+sent up by the card, so you will never see them (ack, rts, cts, probe
+etc.) There is a simple program (showcap) included in the ray_cs
+package which parses the 802.11 headers.
+
+Known Problems and missing features
+
+ Does not work with non x86
+
+ Does not work with SMP
+
+ Support for defragmenting frames is not yet debugged, and in
+ fact is known to not work. I have never encountered a net set
+ up to fragment, but still, it should be fixed.
+
+ The ioctl support is incomplete. The hardware address cannot be set
+ using ifconfig yet. If a different hardware address is needed, it may
+ be set using the phy_addr parameter in ray_cs.opts. This requires
+ a card insertion to take effect.
diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
new file mode 100644
index 000000000..e1a3d59bb
--- /dev/null
+++ b/Documentation/networking/rds.txt
@@ -0,0 +1,355 @@
+
+Overview
+========
+
+This readme tries to provide some background on the hows and whys of RDS,
+and will hopefully help you find your way around the code.
+
+In addition, please see this email about RDS origins:
+http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html
+
+RDS Architecture
+================
+
+RDS provides reliable, ordered datagram delivery by using a single
+reliable connection between any two nodes in the cluster. This allows
+applications to use a single socket to talk to any other process in the
+cluster - so in a cluster with N processes you need N sockets, in contrast
+to N*N if you use a connection-oriented socket transport like TCP.
+
+RDS is not Infiniband-specific; it was designed to support different
+transports. The current implementation used to support RDS over TCP as well
+as IB. Work is in progress to support RDS over iWARP, and using DCE to
+guarantee no dropped packets on Ethernet, it may be possible to use RDS over
+UDP in the future.
+
+The high-level semantics of RDS from the application's point of view are
+
+ * Addressing
+ RDS uses IPv4 addresses and 16bit port numbers to identify
+ the end point of a connection. All socket operations that involve
+ passing addresses between kernel and user space generally
+ use a struct sockaddr_in.
+
+ The fact that IPv4 addresses are used does not mean the underlying
+ transport has to be IP-based. In fact, RDS over IB uses a
+ reliable IB connection; the IP address is used exclusively to
+ locate the remote node's GID (by ARPing for the given IP).
+
+ The port space is entirely independent of UDP, TCP or any other
+ protocol.
+
+ * Socket interface
+ RDS sockets work *mostly* as you would expect from a BSD
+ socket. The next section will cover the details. At any rate,
+ all I/O is performed through the standard BSD socket API.
+ Some additions like zerocopy support are implemented through
+ control messages, while other extensions use the getsockopt/
+ setsockopt calls.
+
+ Sockets must be bound before you can send or receive data.
+ This is needed because binding also selects a transport and
+ attaches it to the socket. Once bound, the transport assignment
+ does not change. RDS will tolerate IPs moving around (eg in
+ a active-active HA scenario), but only as long as the address
+ doesn't move to a different transport.
+
+ * sysctls
+ RDS supports a number of sysctls in /proc/sys/net/rds
+
+
+Socket Interface
+================
+
+ AF_RDS, PF_RDS, SOL_RDS
+ AF_RDS and PF_RDS are the domain type to be used with socket(2)
+ to create RDS sockets. SOL_RDS is the socket-level to be used
+ with setsockopt(2) and getsockopt(2) for RDS specific socket
+ options.
+
+ fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
+ This creates a new, unbound RDS socket.
+
+ setsockopt(SOL_SOCKET): send and receive buffer size
+ RDS honors the send and receive buffer size socket options.
+ You are not allowed to queue more than SO_SNDSIZE bytes to
+ a socket. A message is queued when sendmsg is called, and
+ it leaves the queue when the remote system acknowledges
+ its arrival.
+
+ The SO_RCVSIZE option controls the maximum receive queue length.
+ This is a soft limit rather than a hard limit - RDS will
+ continue to accept and queue incoming messages, even if that
+ takes the queue length over the limit. However, it will also
+ mark the port as "congested" and send a congestion update to
+ the source node. The source node is supposed to throttle any
+ processes sending to this congested port.
+
+ bind(fd, &sockaddr_in, ...)
+ This binds the socket to a local IP address and port, and a
+ transport.
+
+ sendmsg(fd, ...)
+ Sends a message to the indicated recipient. The kernel will
+ transparently establish the underlying reliable connection
+ if it isn't up yet.
+
+ An attempt to send a message that exceeds SO_SNDSIZE will
+ return with -EMSGSIZE
+
+ An attempt to send a message that would take the total number
+ of queued bytes over the SO_SNDSIZE threshold will return
+ EAGAIN.
+
+ An attempt to send a message to a destination that is marked
+ as "congested" will return ENOBUFS.
+
+ recvmsg(fd, ...)
+ Receives a message that was queued to this socket. The sockets
+ recv queue accounting is adjusted, and if the queue length
+ drops below SO_SNDSIZE, the port is marked uncongested, and
+ a congestion update is sent to all peers.
+
+ Applications can ask the RDS kernel module to receive
+ notifications via control messages (for instance, there is a
+ notification when a congestion update arrived, or when a RDMA
+ operation completes). These notifications are received through
+ the msg.msg_control buffer of struct msghdr. The format of the
+ messages is described in manpages.
+
+ poll(fd)
+ RDS supports the poll interface to allow the application
+ to implement async I/O.
+
+ POLLIN handling is pretty straightforward. When there's an
+ incoming message queued to the socket, or a pending notification,
+ we signal POLLIN.
+
+ POLLOUT is a little harder. Since you can essentially send
+ to any destination, RDS will always signal POLLOUT as long as
+ there's room on the send queue (ie the number of bytes queued
+ is less than the sendbuf size).
+
+ However, the kernel will refuse to accept messages to
+ a destination marked congested - in this case you will loop
+ forever if you rely on poll to tell you what to do.
+ This isn't a trivial problem, but applications can deal with
+ this - by using congestion notifications, and by checking for
+ ENOBUFS errors returned by sendmsg.
+
+ setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
+ This allows the application to discard all messages queued to a
+ specific destination on this particular socket.
+
+ This allows the application to cancel outstanding messages if
+ it detects a timeout. For instance, if it tried to send a message,
+ and the remote host is unreachable, RDS will keep trying forever.
+ The application may decide it's not worth it, and cancel the
+ operation. In this case, it would use RDS_CANCEL_SENT_TO to
+ nuke any pending messages.
+
+
+RDMA for RDS
+============
+
+ see rds-rdma(7) manpage (available in rds-tools)
+
+
+Congestion Notifications
+========================
+
+ see rds(7) manpage
+
+
+RDS Protocol
+============
+
+ Message header
+
+ The message header is a 'struct rds_header' (see rds.h):
+ Fields:
+ h_sequence:
+ per-packet sequence number
+ h_ack:
+ piggybacked acknowledgment of last packet received
+ h_len:
+ length of data, not including header
+ h_sport:
+ source port
+ h_dport:
+ destination port
+ h_flags:
+ CONG_BITMAP - this is a congestion update bitmap
+ ACK_REQUIRED - receiver must ack this packet
+ RETRANSMITTED - packet has previously been sent
+ h_credit:
+ indicate to other end of connection that
+ it has more credits available (i.e. there is
+ more send room)
+ h_padding[4]:
+ unused, for future use
+ h_csum:
+ header checksum
+ h_exthdr:
+ optional data can be passed here. This is currently used for
+ passing RDMA-related information.
+
+ ACK and retransmit handling
+
+ One might think that with reliable IB connections you wouldn't need
+ to ack messages that have been received. The problem is that IB
+ hardware generates an ack message before it has DMAed the message
+ into memory. This creates a potential message loss if the HCA is
+ disabled for any reason between when it sends the ack and before
+ the message is DMAed and processed. This is only a potential issue
+ if another HCA is available for fail-over.
+
+ Sending an ack immediately would allow the sender to free the sent
+ message from their send queue quickly, but could cause excessive
+ traffic to be used for acks. RDS piggybacks acks on sent data
+ packets. Ack-only packets are reduced by only allowing one to be
+ in flight at a time, and by the sender only asking for acks when
+ its send buffers start to fill up. All retransmissions are also
+ acked.
+
+ Flow Control
+
+ RDS's IB transport uses a credit-based mechanism to verify that
+ there is space in the peer's receive buffers for more data. This
+ eliminates the need for hardware retries on the connection.
+
+ Congestion
+
+ Messages waiting in the receive queue on the receiving socket
+ are accounted against the sockets SO_RCVBUF option value. Only
+ the payload bytes in the message are accounted for. If the
+ number of bytes queued equals or exceeds rcvbuf then the socket
+ is congested. All sends attempted to this socket's address
+ should return block or return -EWOULDBLOCK.
+
+ Applications are expected to be reasonably tuned such that this
+ situation very rarely occurs. An application encountering this
+ "back-pressure" is considered a bug.
+
+ This is implemented by having each node maintain bitmaps which
+ indicate which ports on bound addresses are congested. As the
+ bitmap changes it is sent through all the connections which
+ terminate in the local address of the bitmap which changed.
+
+ The bitmaps are allocated as connections are brought up. This
+ avoids allocation in the interrupt handling path which queues
+ sages on sockets. The dense bitmaps let transports send the
+ entire bitmap on any bitmap change reasonably efficiently. This
+ is much easier to implement than some finer-grained
+ communication of per-port congestion. The sender does a very
+ inexpensive bit test to test if the port it's about to send to
+ is congested or not.
+
+
+RDS Transport Layer
+==================
+
+ As mentioned above, RDS is not IB-specific. Its code is divided
+ into a general RDS layer and a transport layer.
+
+ The general layer handles the socket API, congestion handling,
+ loopback, stats, usermem pinning, and the connection state machine.
+
+ The transport layer handles the details of the transport. The IB
+ transport, for example, handles all the queue pairs, work requests,
+ CM event handlers, and other Infiniband details.
+
+
+RDS Kernel Structures
+=====================
+
+ struct rds_message
+ aka possibly "rds_outgoing", the generic RDS layer copies data to
+ be sent and sets header fields as needed, based on the socket API.
+ This is then queued for the individual connection and sent by the
+ connection's transport.
+ struct rds_incoming
+ a generic struct referring to incoming data that can be handed from
+ the transport to the general code and queued by the general code
+ while the socket is awoken. It is then passed back to the transport
+ code to handle the actual copy-to-user.
+ struct rds_socket
+ per-socket information
+ struct rds_connection
+ per-connection information
+ struct rds_transport
+ pointers to transport-specific functions
+ struct rds_statistics
+ non-transport-specific statistics
+ struct rds_cong_map
+ wraps the raw congestion bitmap, contains rbnode, waitq, etc.
+
+Connection management
+=====================
+
+ Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
+ ERROR states.
+
+ The first time an attempt is made by an RDS socket to send data to
+ a node, a connection is allocated and connected. That connection is
+ then maintained forever -- if there are transport errors, the
+ connection will be dropped and re-established.
+
+ Dropping a connection while packets are queued will cause queued or
+ partially-sent datagrams to be retransmitted when the connection is
+ re-established.
+
+
+The send path
+=============
+
+ rds_sendmsg()
+ struct rds_message built from incoming data
+ CMSGs parsed (e.g. RDMA ops)
+ transport connection alloced and connected if not already
+ rds_message placed on send queue
+ send worker awoken
+ rds_send_worker()
+ calls rds_send_xmit() until queue is empty
+ rds_send_xmit()
+ transmits congestion map if one is pending
+ may set ACK_REQUIRED
+ calls transport to send either non-RDMA or RDMA message
+ (RDMA ops never retransmitted)
+ rds_ib_xmit()
+ allocs work requests from send ring
+ adds any new send credits available to peer (h_credits)
+ maps the rds_message's sg list
+ piggybacks ack
+ populates work requests
+ post send to connection's queue pair
+
+The recv path
+=============
+
+ rds_ib_recv_cq_comp_handler()
+ looks at write completions
+ unmaps recv buffer from device
+ no errors, call rds_ib_process_recv()
+ refill recv ring
+ rds_ib_process_recv()
+ validate header checksum
+ copy header to rds_ib_incoming struct if start of a new datagram
+ add to ibinc's fraglist
+ if competed datagram:
+ update cong map if datagram was cong update
+ call rds_recv_incoming() otherwise
+ note if ack is required
+ rds_recv_incoming()
+ drop duplicate packets
+ respond to pings
+ find the sock associated with this datagram
+ add to sock queue
+ wake up sock
+ do some congestion calculations
+ rds_recvmsg
+ copy data into user iovec
+ handle CMSGs
+ return to application
+
+
diff --git a/Documentation/networking/regulatory.txt b/Documentation/networking/regulatory.txt
new file mode 100644
index 000000000..356f791af
--- /dev/null
+++ b/Documentation/networking/regulatory.txt
@@ -0,0 +1,214 @@
+Linux wireless regulatory documentation
+---------------------------------------
+
+This document gives a brief review over how the Linux wireless
+regulatory infrastructure works.
+
+More up to date information can be obtained at the project's web page:
+
+http://wireless.kernel.org/en/developers/Regulatory
+
+Keeping regulatory domains in userspace
+---------------------------------------
+
+Due to the dynamic nature of regulatory domains we keep them
+in userspace and provide a framework for userspace to upload
+to the kernel one regulatory domain to be used as the central
+core regulatory domain all wireless devices should adhere to.
+
+How to get regulatory domains to the kernel
+-------------------------------------------
+
+Userspace gets a regulatory domain in the kernel by having
+a userspace agent build it and send it via nl80211. Only
+expected regulatory domains will be respected by the kernel.
+
+A currently available userspace agent which can accomplish this
+is CRDA - central regulatory domain agent. Its documented here:
+
+http://wireless.kernel.org/en/developers/Regulatory/CRDA
+
+Essentially the kernel will send a udev event when it knows
+it needs a new regulatory domain. A udev rule can be put in place
+to trigger crda to send the respective regulatory domain for a
+specific ISO/IEC 3166 alpha2.
+
+Below is an example udev rule which can be used:
+
+# Example file, should be put in /etc/udev/rules.d/regulatory.rules
+KERNEL=="regulatory*", ACTION=="change", SUBSYSTEM=="platform", RUN+="/sbin/crda"
+
+The alpha2 is passed as an environment variable under the variable COUNTRY.
+
+Who asks for regulatory domains?
+--------------------------------
+
+* Users
+
+Users can use iw:
+
+http://wireless.kernel.org/en/users/Documentation/iw
+
+An example:
+
+ # set regulatory domain to "Costa Rica"
+ iw reg set CR
+
+This will request the kernel to set the regulatory domain to
+the specificied alpha2. The kernel in turn will then ask userspace
+to provide a regulatory domain for the alpha2 specified by the user
+by sending a uevent.
+
+* Wireless subsystems for Country Information elements
+
+The kernel will send a uevent to inform userspace a new
+regulatory domain is required. More on this to be added
+as its integration is added.
+
+* Drivers
+
+If drivers determine they need a specific regulatory domain
+set they can inform the wireless core using regulatory_hint().
+They have two options -- they either provide an alpha2 so that
+crda can provide back a regulatory domain for that country or
+they can build their own regulatory domain based on internal
+custom knowledge so the wireless core can respect it.
+
+*Most* drivers will rely on the first mechanism of providing a
+regulatory hint with an alpha2. For these drivers there is an additional
+check that can be used to ensure compliance based on custom EEPROM
+regulatory data. This additional check can be used by drivers by
+registering on its struct wiphy a reg_notifier() callback. This notifier
+is called when the core's regulatory domain has been changed. The driver
+can use this to review the changes made and also review who made them
+(driver, user, country IE) and determine what to allow based on its
+internal EEPROM data. Devices drivers wishing to be capable of world
+roaming should use this callback. More on world roaming will be
+added to this document when its support is enabled.
+
+Device drivers who provide their own built regulatory domain
+do not need a callback as the channels registered by them are
+the only ones that will be allowed and therefore *additional*
+channels cannot be enabled.
+
+Example code - drivers hinting an alpha2:
+------------------------------------------
+
+This example comes from the zd1211rw device driver. You can start
+by having a mapping of your device's EEPROM country/regulatory
+domain value to a specific alpha2 as follows:
+
+static struct zd_reg_alpha2_map reg_alpha2_map[] = {
+ { ZD_REGDOMAIN_FCC, "US" },
+ { ZD_REGDOMAIN_IC, "CA" },
+ { ZD_REGDOMAIN_ETSI, "DE" }, /* Generic ETSI, use most restrictive */
+ { ZD_REGDOMAIN_JAPAN, "JP" },
+ { ZD_REGDOMAIN_JAPAN_ADD, "JP" },
+ { ZD_REGDOMAIN_SPAIN, "ES" },
+ { ZD_REGDOMAIN_FRANCE, "FR" },
+
+Then you can define a routine to map your read EEPROM value to an alpha2,
+as follows:
+
+static int zd_reg2alpha2(u8 regdomain, char *alpha2)
+{
+ unsigned int i;
+ struct zd_reg_alpha2_map *reg_map;
+ for (i = 0; i < ARRAY_SIZE(reg_alpha2_map); i++) {
+ reg_map = &reg_alpha2_map[i];
+ if (regdomain == reg_map->reg) {
+ alpha2[0] = reg_map->alpha2[0];
+ alpha2[1] = reg_map->alpha2[1];
+ return 0;
+ }
+ }
+ return 1;
+}
+
+Lastly, you can then hint to the core of your discovered alpha2, if a match
+was found. You need to do this after you have registered your wiphy. You
+are expected to do this during initialization.
+
+ r = zd_reg2alpha2(mac->regdomain, alpha2);
+ if (!r)
+ regulatory_hint(hw->wiphy, alpha2);
+
+Example code - drivers providing a built in regulatory domain:
+--------------------------------------------------------------
+
+[NOTE: This API is not currently available, it can be added when required]
+
+If you have regulatory information you can obtain from your
+driver and you *need* to use this we let you build a regulatory domain
+structure and pass it to the wireless core. To do this you should
+kmalloc() a structure big enough to hold your regulatory domain
+structure and you should then fill it with your data. Finally you simply
+call regulatory_hint() with the regulatory domain structure in it.
+
+Bellow is a simple example, with a regulatory domain cached using the stack.
+Your implementation may vary (read EEPROM cache instead, for example).
+
+Example cache of some regulatory domain
+
+struct ieee80211_regdomain mydriver_jp_regdom = {
+ .n_reg_rules = 3,
+ .alpha2 = "JP",
+ //.alpha2 = "99", /* If I have no alpha2 to map it to */
+ .reg_rules = {
+ /* IEEE 802.11b/g, channels 1..14 */
+ REG_RULE(2412-20, 2484+20, 40, 6, 20, 0),
+ /* IEEE 802.11a, channels 34..48 */
+ REG_RULE(5170-20, 5240+20, 40, 6, 20,
+ NL80211_RRF_NO_IR),
+ /* IEEE 802.11a, channels 52..64 */
+ REG_RULE(5260-20, 5320+20, 40, 6, 20,
+ NL80211_RRF_NO_IR|
+ NL80211_RRF_DFS),
+ }
+};
+
+Then in some part of your code after your wiphy has been registered:
+
+ struct ieee80211_regdomain *rd;
+ int size_of_regd;
+ int num_rules = mydriver_jp_regdom.n_reg_rules;
+ unsigned int i;
+
+ size_of_regd = sizeof(struct ieee80211_regdomain) +
+ (num_rules * sizeof(struct ieee80211_reg_rule));
+
+ rd = kzalloc(size_of_regd, GFP_KERNEL);
+ if (!rd)
+ return -ENOMEM;
+
+ memcpy(rd, &mydriver_jp_regdom, sizeof(struct ieee80211_regdomain));
+
+ for (i=0; i < num_rules; i++)
+ memcpy(&rd->reg_rules[i],
+ &mydriver_jp_regdom.reg_rules[i],
+ sizeof(struct ieee80211_reg_rule));
+ regulatory_struct_hint(rd);
+
+Statically compiled regulatory database
+---------------------------------------
+
+In most situations the userland solution using CRDA as described
+above is the preferred solution. However in some cases a set of
+rules built into the kernel itself may be desirable. To account
+for this situation, a configuration option has been provided
+(i.e. CONFIG_CFG80211_INTERNAL_REGDB). With this option enabled,
+the wireless database information contained in net/wireless/db.txt is
+used to generate a data structure encoded in net/wireless/regdb.c.
+That option also enables code in net/wireless/reg.c which queries
+the data in regdb.c as an alternative to using CRDA.
+
+The file net/wireless/db.txt should be kept up-to-date with the db.txt
+file available in the git repository here:
+
+ git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-regdb.git
+
+Again, most users in most situations should be using the CRDA package
+provided with their distribution, and in most other situations users
+should be building and using CRDA on their own rather than using
+this option. If you are not absolutely sure that you should be using
+CONFIG_CFG80211_INTERNAL_REGDB then _DO_NOT_USE_IT_.
diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt
new file mode 100644
index 000000000..16a924c48
--- /dev/null
+++ b/Documentation/networking/rxrpc.txt
@@ -0,0 +1,947 @@
+ ======================
+ RxRPC NETWORK PROTOCOL
+ ======================
+
+The RxRPC protocol driver provides a reliable two-phase transport on top of UDP
+that can be used to perform RxRPC remote operations. This is done over sockets
+of AF_RXRPC family, using sendmsg() and recvmsg() with control data to send and
+receive data, aborts and errors.
+
+Contents of this document:
+
+ (*) Overview.
+
+ (*) RxRPC protocol summary.
+
+ (*) AF_RXRPC driver model.
+
+ (*) Control messages.
+
+ (*) Socket options.
+
+ (*) Security.
+
+ (*) Example client usage.
+
+ (*) Example server usage.
+
+ (*) AF_RXRPC kernel interface.
+
+ (*) Configurable parameters.
+
+
+========
+OVERVIEW
+========
+
+RxRPC is a two-layer protocol. There is a session layer which provides
+reliable virtual connections using UDP over IPv4 (or IPv6) as the transport
+layer, but implements a real network protocol; and there's the presentation
+layer which renders structured data to binary blobs and back again using XDR
+(as does SunRPC):
+
+ +-------------+
+ | Application |
+ +-------------+
+ | XDR | Presentation
+ +-------------+
+ | RxRPC | Session
+ +-------------+
+ | UDP | Transport
+ +-------------+
+
+
+AF_RXRPC provides:
+
+ (1) Part of an RxRPC facility for both kernel and userspace applications by
+ making the session part of it a Linux network protocol (AF_RXRPC).
+
+ (2) A two-phase protocol. The client transmits a blob (the request) and then
+ receives a blob (the reply), and the server receives the request and then
+ transmits the reply.
+
+ (3) Retention of the reusable bits of the transport system set up for one call
+ to speed up subsequent calls.
+
+ (4) A secure protocol, using the Linux kernel's key retention facility to
+ manage security on the client end. The server end must of necessity be
+ more active in security negotiations.
+
+AF_RXRPC does not provide XDR marshalling/presentation facilities. That is
+left to the application. AF_RXRPC only deals in blobs. Even the operation ID
+is just the first four bytes of the request blob, and as such is beyond the
+kernel's interest.
+
+
+Sockets of AF_RXRPC family are:
+
+ (1) created as type SOCK_DGRAM;
+
+ (2) provided with a protocol of the type of underlying transport they're going
+ to use - currently only PF_INET is supported.
+
+
+The Andrew File System (AFS) is an example of an application that uses this and
+that has both kernel (filesystem) and userspace (utility) components.
+
+
+======================
+RXRPC PROTOCOL SUMMARY
+======================
+
+An overview of the RxRPC protocol:
+
+ (*) RxRPC sits on top of another networking protocol (UDP is the only option
+ currently), and uses this to provide network transport. UDP ports, for
+ example, provide transport endpoints.
+
+ (*) RxRPC supports multiple virtual "connections" from any given transport
+ endpoint, thus allowing the endpoints to be shared, even to the same
+ remote endpoint.
+
+ (*) Each connection goes to a particular "service". A connection may not go
+ to multiple services. A service may be considered the RxRPC equivalent of
+ a port number. AF_RXRPC permits multiple services to share an endpoint.
+
+ (*) Client-originating packets are marked, thus a transport endpoint can be
+ shared between client and server connections (connections have a
+ direction).
+
+ (*) Up to a billion connections may be supported concurrently between one
+ local transport endpoint and one service on one remote endpoint. An RxRPC
+ connection is described by seven numbers:
+
+ Local address }
+ Local port } Transport (UDP) address
+ Remote address }
+ Remote port }
+ Direction
+ Connection ID
+ Service ID
+
+ (*) Each RxRPC operation is a "call". A connection may make up to four
+ billion calls, but only up to four calls may be in progress on a
+ connection at any one time.
+
+ (*) Calls are two-phase and asymmetric: the client sends its request data,
+ which the service receives; then the service transmits the reply data
+ which the client receives.
+
+ (*) The data blobs are of indefinite size, the end of a phase is marked with a
+ flag in the packet. The number of packets of data making up one blob may
+ not exceed 4 billion, however, as this would cause the sequence number to
+ wrap.
+
+ (*) The first four bytes of the request data are the service operation ID.
+
+ (*) Security is negotiated on a per-connection basis. The connection is
+ initiated by the first data packet on it arriving. If security is
+ requested, the server then issues a "challenge" and then the client
+ replies with a "response". If the response is successful, the security is
+ set for the lifetime of that connection, and all subsequent calls made
+ upon it use that same security. In the event that the server lets a
+ connection lapse before the client, the security will be renegotiated if
+ the client uses the connection again.
+
+ (*) Calls use ACK packets to handle reliability. Data packets are also
+ explicitly sequenced per call.
+
+ (*) There are two types of positive acknowledgment: hard-ACKs and soft-ACKs.
+ A hard-ACK indicates to the far side that all the data received to a point
+ has been received and processed; a soft-ACK indicates that the data has
+ been received but may yet be discarded and re-requested. The sender may
+ not discard any transmittable packets until they've been hard-ACK'd.
+
+ (*) Reception of a reply data packet implicitly hard-ACK's all the data
+ packets that make up the request.
+
+ (*) An call is complete when the request has been sent, the reply has been
+ received and the final hard-ACK on the last packet of the reply has
+ reached the server.
+
+ (*) An call may be aborted by either end at any time up to its completion.
+
+
+=====================
+AF_RXRPC DRIVER MODEL
+=====================
+
+About the AF_RXRPC driver:
+
+ (*) The AF_RXRPC protocol transparently uses internal sockets of the transport
+ protocol to represent transport endpoints.
+
+ (*) AF_RXRPC sockets map onto RxRPC connection bundles. Actual RxRPC
+ connections are handled transparently. One client socket may be used to
+ make multiple simultaneous calls to the same service. One server socket
+ may handle calls from many clients.
+
+ (*) Additional parallel client connections will be initiated to support extra
+ concurrent calls, up to a tunable limit.
+
+ (*) Each connection is retained for a certain amount of time [tunable] after
+ the last call currently using it has completed in case a new call is made
+ that could reuse it.
+
+ (*) Each internal UDP socket is retained [tunable] for a certain amount of
+ time [tunable] after the last connection using it discarded, in case a new
+ connection is made that could use it.
+
+ (*) A client-side connection is only shared between calls if they have have
+ the same key struct describing their security (and assuming the calls
+ would otherwise share the connection). Non-secured calls would also be
+ able to share connections with each other.
+
+ (*) A server-side connection is shared if the client says it is.
+
+ (*) ACK'ing is handled by the protocol driver automatically, including ping
+ replying.
+
+ (*) SO_KEEPALIVE automatically pings the other side to keep the connection
+ alive [TODO].
+
+ (*) If an ICMP error is received, all calls affected by that error will be
+ aborted with an appropriate network error passed through recvmsg().
+
+
+Interaction with the user of the RxRPC socket:
+
+ (*) A socket is made into a server socket by binding an address with a
+ non-zero service ID.
+
+ (*) In the client, sending a request is achieved with one or more sendmsgs,
+ followed by the reply being received with one or more recvmsgs.
+
+ (*) The first sendmsg for a request to be sent from a client contains a tag to
+ be used in all other sendmsgs or recvmsgs associated with that call. The
+ tag is carried in the control data.
+
+ (*) connect() is used to supply a default destination address for a client
+ socket. This may be overridden by supplying an alternate address to the
+ first sendmsg() of a call (struct msghdr::msg_name).
+
+ (*) If connect() is called on an unbound client, a random local port will
+ bound before the operation takes place.
+
+ (*) A server socket may also be used to make client calls. To do this, the
+ first sendmsg() of the call must specify the target address. The server's
+ transport endpoint is used to send the packets.
+
+ (*) Once the application has received the last message associated with a call,
+ the tag is guaranteed not to be seen again, and so it can be used to pin
+ client resources. A new call can then be initiated with the same tag
+ without fear of interference.
+
+ (*) In the server, a request is received with one or more recvmsgs, then the
+ the reply is transmitted with one or more sendmsgs, and then the final ACK
+ is received with a last recvmsg.
+
+ (*) When sending data for a call, sendmsg is given MSG_MORE if there's more
+ data to come on that call.
+
+ (*) When receiving data for a call, recvmsg flags MSG_MORE if there's more
+ data to come for that call.
+
+ (*) When receiving data or messages for a call, MSG_EOR is flagged by recvmsg
+ to indicate the terminal message for that call.
+
+ (*) A call may be aborted by adding an abort control message to the control
+ data. Issuing an abort terminates the kernel's use of that call's tag.
+ Any messages waiting in the receive queue for that call will be discarded.
+
+ (*) Aborts, busy notifications and challenge packets are delivered by recvmsg,
+ and control data messages will be set to indicate the context. Receiving
+ an abort or a busy message terminates the kernel's use of that call's tag.
+
+ (*) The control data part of the msghdr struct is used for a number of things:
+
+ (*) The tag of the intended or affected call.
+
+ (*) Sending or receiving errors, aborts and busy notifications.
+
+ (*) Notifications of incoming calls.
+
+ (*) Sending debug requests and receiving debug replies [TODO].
+
+ (*) When the kernel has received and set up an incoming call, it sends a
+ message to server application to let it know there's a new call awaiting
+ its acceptance [recvmsg reports a special control message]. The server
+ application then uses sendmsg to assign a tag to the new call. Once that
+ is done, the first part of the request data will be delivered by recvmsg.
+
+ (*) The server application has to provide the server socket with a keyring of
+ secret keys corresponding to the security types it permits. When a secure
+ connection is being set up, the kernel looks up the appropriate secret key
+ in the keyring and then sends a challenge packet to the client and
+ receives a response packet. The kernel then checks the authorisation of
+ the packet and either aborts the connection or sets up the security.
+
+ (*) The name of the key a client will use to secure its communications is
+ nominated by a socket option.
+
+
+Notes on recvmsg:
+
+ (*) If there's a sequence of data messages belonging to a particular call on
+ the receive queue, then recvmsg will keep working through them until:
+
+ (a) it meets the end of that call's received data,
+
+ (b) it meets a non-data message,
+
+ (c) it meets a message belonging to a different call, or
+
+ (d) it fills the user buffer.
+
+ If recvmsg is called in blocking mode, it will keep sleeping, awaiting the
+ reception of further data, until one of the above four conditions is met.
+
+ (2) MSG_PEEK operates similarly, but will return immediately if it has put any
+ data in the buffer rather than sleeping until it can fill the buffer.
+
+ (3) If a data message is only partially consumed in filling a user buffer,
+ then the remainder of that message will be left on the front of the queue
+ for the next taker. MSG_TRUNC will never be flagged.
+
+ (4) If there is more data to be had on a call (it hasn't copied the last byte
+ of the last data message in that phase yet), then MSG_MORE will be
+ flagged.
+
+
+================
+CONTROL MESSAGES
+================
+
+AF_RXRPC makes use of control messages in sendmsg() and recvmsg() to multiplex
+calls, to invoke certain actions and to report certain conditions. These are:
+
+ MESSAGE ID SRT DATA MEANING
+ ======================= === =========== ===============================
+ RXRPC_USER_CALL_ID sr- User ID App's call specifier
+ RXRPC_ABORT srt Abort code Abort code to issue/received
+ RXRPC_ACK -rt n/a Final ACK received
+ RXRPC_NET_ERROR -rt error num Network error on call
+ RXRPC_BUSY -rt n/a Call rejected (server busy)
+ RXRPC_LOCAL_ERROR -rt error num Local error encountered
+ RXRPC_NEW_CALL -r- n/a New call received
+ RXRPC_ACCEPT s-- n/a Accept new call
+
+ (SRT = usable in Sendmsg / delivered by Recvmsg / Terminal message)
+
+ (*) RXRPC_USER_CALL_ID
+
+ This is used to indicate the application's call ID. It's an unsigned long
+ that the app specifies in the client by attaching it to the first data
+ message or in the server by passing it in association with an RXRPC_ACCEPT
+ message. recvmsg() passes it in conjunction with all messages except
+ those of the RXRPC_NEW_CALL message.
+
+ (*) RXRPC_ABORT
+
+ This is can be used by an application to abort a call by passing it to
+ sendmsg, or it can be delivered by recvmsg to indicate a remote abort was
+ received. Either way, it must be associated with an RXRPC_USER_CALL_ID to
+ specify the call affected. If an abort is being sent, then error EBADSLT
+ will be returned if there is no call with that user ID.
+
+ (*) RXRPC_ACK
+
+ This is delivered to a server application to indicate that the final ACK
+ of a call was received from the client. It will be associated with an
+ RXRPC_USER_CALL_ID to indicate the call that's now complete.
+
+ (*) RXRPC_NET_ERROR
+
+ This is delivered to an application to indicate that an ICMP error message
+ was encountered in the process of trying to talk to the peer. An
+ errno-class integer value will be included in the control message data
+ indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call
+ affected.
+
+ (*) RXRPC_BUSY
+
+ This is delivered to a client application to indicate that a call was
+ rejected by the server due to the server being busy. It will be
+ associated with an RXRPC_USER_CALL_ID to indicate the rejected call.
+
+ (*) RXRPC_LOCAL_ERROR
+
+ This is delivered to an application to indicate that a local error was
+ encountered and that a call has been aborted because of it. An
+ errno-class integer value will be included in the control message data
+ indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call
+ affected.
+
+ (*) RXRPC_NEW_CALL
+
+ This is delivered to indicate to a server application that a new call has
+ arrived and is awaiting acceptance. No user ID is associated with this,
+ as a user ID must subsequently be assigned by doing an RXRPC_ACCEPT.
+
+ (*) RXRPC_ACCEPT
+
+ This is used by a server application to attempt to accept a call and
+ assign it a user ID. It should be associated with an RXRPC_USER_CALL_ID
+ to indicate the user ID to be assigned. If there is no call to be
+ accepted (it may have timed out, been aborted, etc.), then sendmsg will
+ return error ENODATA. If the user ID is already in use by another call,
+ then error EBADSLT will be returned.
+
+
+==============
+SOCKET OPTIONS
+==============
+
+AF_RXRPC sockets support a few socket options at the SOL_RXRPC level:
+
+ (*) RXRPC_SECURITY_KEY
+
+ This is used to specify the description of the key to be used. The key is
+ extracted from the calling process's keyrings with request_key() and
+ should be of "rxrpc" type.
+
+ The optval pointer points to the description string, and optlen indicates
+ how long the string is, without the NUL terminator.
+
+ (*) RXRPC_SECURITY_KEYRING
+
+ Similar to above but specifies a keyring of server secret keys to use (key
+ type "keyring"). See the "Security" section.
+
+ (*) RXRPC_EXCLUSIVE_CONNECTION
+
+ This is used to request that new connections should be used for each call
+ made subsequently on this socket. optval should be NULL and optlen 0.
+
+ (*) RXRPC_MIN_SECURITY_LEVEL
+
+ This is used to specify the minimum security level required for calls on
+ this socket. optval must point to an int containing one of the following
+ values:
+
+ (a) RXRPC_SECURITY_PLAIN
+
+ Encrypted checksum only.
+
+ (b) RXRPC_SECURITY_AUTH
+
+ Encrypted checksum plus packet padded and first eight bytes of packet
+ encrypted - which includes the actual packet length.
+
+ (c) RXRPC_SECURITY_ENCRYPTED
+
+ Encrypted checksum plus entire packet padded and encrypted, including
+ actual packet length.
+
+
+========
+SECURITY
+========
+
+Currently, only the kerberos 4 equivalent protocol has been implemented
+(security index 2 - rxkad). This requires the rxkad module to be loaded and,
+on the client, tickets of the appropriate type to be obtained from the AFS
+kaserver or the kerberos server and installed as "rxrpc" type keys. This is
+normally done using the klog program. An example simple klog program can be
+found at:
+
+ http://people.redhat.com/~dhowells/rxrpc/klog.c
+
+The payload provided to add_key() on the client should be of the following
+form:
+
+ struct rxrpc_key_sec2_v1 {
+ uint16_t security_index; /* 2 */
+ uint16_t ticket_length; /* length of ticket[] */
+ uint32_t expiry; /* time at which expires */
+ uint8_t kvno; /* key version number */
+ uint8_t __pad[3];
+ uint8_t session_key[8]; /* DES session key */
+ uint8_t ticket[0]; /* the encrypted ticket */
+ };
+
+Where the ticket blob is just appended to the above structure.
+
+
+For the server, keys of type "rxrpc_s" must be made available to the server.
+They have a description of "<serviceID>:<securityIndex>" (eg: "52:2" for an
+rxkad key for the AFS VL service). When such a key is created, it should be
+given the server's secret key as the instantiation data (see the example
+below).
+
+ add_key("rxrpc_s", "52:2", secret_key, 8, keyring);
+
+A keyring is passed to the server socket by naming it in a sockopt. The server
+socket then looks the server secret keys up in this keyring when secure
+incoming connections are made. This can be seen in an example program that can
+be found at:
+
+ http://people.redhat.com/~dhowells/rxrpc/listen.c
+
+
+====================
+EXAMPLE CLIENT USAGE
+====================
+
+A client would issue an operation by:
+
+ (1) An RxRPC socket is set up by:
+
+ client = socket(AF_RXRPC, SOCK_DGRAM, PF_INET);
+
+ Where the third parameter indicates the protocol family of the transport
+ socket used - usually IPv4 but it can also be IPv6 [TODO].
+
+ (2) A local address can optionally be bound:
+
+ struct sockaddr_rxrpc srx = {
+ .srx_family = AF_RXRPC,
+ .srx_service = 0, /* we're a client */
+ .transport_type = SOCK_DGRAM, /* type of transport socket */
+ .transport.sin_family = AF_INET,
+ .transport.sin_port = htons(7000), /* AFS callback */
+ .transport.sin_address = 0, /* all local interfaces */
+ };
+ bind(client, &srx, sizeof(srx));
+
+ This specifies the local UDP port to be used. If not given, a random
+ non-privileged port will be used. A UDP port may be shared between
+ several unrelated RxRPC sockets. Security is handled on a basis of
+ per-RxRPC virtual connection.
+
+ (3) The security is set:
+
+ const char *key = "AFS:cambridge.redhat.com";
+ setsockopt(client, SOL_RXRPC, RXRPC_SECURITY_KEY, key, strlen(key));
+
+ This issues a request_key() to get the key representing the security
+ context. The minimum security level can be set:
+
+ unsigned int sec = RXRPC_SECURITY_ENCRYPTED;
+ setsockopt(client, SOL_RXRPC, RXRPC_MIN_SECURITY_LEVEL,
+ &sec, sizeof(sec));
+
+ (4) The server to be contacted can then be specified (alternatively this can
+ be done through sendmsg):
+
+ struct sockaddr_rxrpc srx = {
+ .srx_family = AF_RXRPC,
+ .srx_service = VL_SERVICE_ID,
+ .transport_type = SOCK_DGRAM, /* type of transport socket */
+ .transport.sin_family = AF_INET,
+ .transport.sin_port = htons(7005), /* AFS volume manager */
+ .transport.sin_address = ...,
+ };
+ connect(client, &srx, sizeof(srx));
+
+ (5) The request data should then be posted to the server socket using a series
+ of sendmsg() calls, each with the following control message attached:
+
+ RXRPC_USER_CALL_ID - specifies the user ID for this call
+
+ MSG_MORE should be set in msghdr::msg_flags on all but the last part of
+ the request. Multiple requests may be made simultaneously.
+
+ If a call is intended to go to a destination other than the default
+ specified through connect(), then msghdr::msg_name should be set on the
+ first request message of that call.
+
+ (6) The reply data will then be posted to the server socket for recvmsg() to
+ pick up. MSG_MORE will be flagged by recvmsg() if there's more reply data
+ for a particular call to be read. MSG_EOR will be set on the terminal
+ read for a call.
+
+ All data will be delivered with the following control message attached:
+
+ RXRPC_USER_CALL_ID - specifies the user ID for this call
+
+ If an abort or error occurred, this will be returned in the control data
+ buffer instead, and MSG_EOR will be flagged to indicate the end of that
+ call.
+
+
+====================
+EXAMPLE SERVER USAGE
+====================
+
+A server would be set up to accept operations in the following manner:
+
+ (1) An RxRPC socket is created by:
+
+ server = socket(AF_RXRPC, SOCK_DGRAM, PF_INET);
+
+ Where the third parameter indicates the address type of the transport
+ socket used - usually IPv4.
+
+ (2) Security is set up if desired by giving the socket a keyring with server
+ secret keys in it:
+
+ keyring = add_key("keyring", "AFSkeys", NULL, 0,
+ KEY_SPEC_PROCESS_KEYRING);
+
+ const char secret_key[8] = {
+ 0xa7, 0x83, 0x8a, 0xcb, 0xc7, 0x83, 0xec, 0x94 };
+ add_key("rxrpc_s", "52:2", secret_key, 8, keyring);
+
+ setsockopt(server, SOL_RXRPC, RXRPC_SECURITY_KEYRING, "AFSkeys", 7);
+
+ The keyring can be manipulated after it has been given to the socket. This
+ permits the server to add more keys, replace keys, etc. whilst it is live.
+
+ (2) A local address must then be bound:
+
+ struct sockaddr_rxrpc srx = {
+ .srx_family = AF_RXRPC,
+ .srx_service = VL_SERVICE_ID, /* RxRPC service ID */
+ .transport_type = SOCK_DGRAM, /* type of transport socket */
+ .transport.sin_family = AF_INET,
+ .transport.sin_port = htons(7000), /* AFS callback */
+ .transport.sin_address = 0, /* all local interfaces */
+ };
+ bind(server, &srx, sizeof(srx));
+
+ (3) The server is then set to listen out for incoming calls:
+
+ listen(server, 100);
+
+ (4) The kernel notifies the server of pending incoming connections by sending
+ it a message for each. This is received with recvmsg() on the server
+ socket. It has no data, and has a single dataless control message
+ attached:
+
+ RXRPC_NEW_CALL
+
+ The address that can be passed back by recvmsg() at this point should be
+ ignored since the call for which the message was posted may have gone by
+ the time it is accepted - in which case the first call still on the queue
+ will be accepted.
+
+ (5) The server then accepts the new call by issuing a sendmsg() with two
+ pieces of control data and no actual data:
+
+ RXRPC_ACCEPT - indicate connection acceptance
+ RXRPC_USER_CALL_ID - specify user ID for this call
+
+ (6) The first request data packet will then be posted to the server socket for
+ recvmsg() to pick up. At that point, the RxRPC address for the call can
+ be read from the address fields in the msghdr struct.
+
+ Subsequent request data will be posted to the server socket for recvmsg()
+ to collect as it arrives. All but the last piece of the request data will
+ be delivered with MSG_MORE flagged.
+
+ All data will be delivered with the following control message attached:
+
+ RXRPC_USER_CALL_ID - specifies the user ID for this call
+
+ (8) The reply data should then be posted to the server socket using a series
+ of sendmsg() calls, each with the following control messages attached:
+
+ RXRPC_USER_CALL_ID - specifies the user ID for this call
+
+ MSG_MORE should be set in msghdr::msg_flags on all but the last message
+ for a particular call.
+
+ (9) The final ACK from the client will be posted for retrieval by recvmsg()
+ when it is received. It will take the form of a dataless message with two
+ control messages attached:
+
+ RXRPC_USER_CALL_ID - specifies the user ID for this call
+ RXRPC_ACK - indicates final ACK (no data)
+
+ MSG_EOR will be flagged to indicate that this is the final message for
+ this call.
+
+(10) Up to the point the final packet of reply data is sent, the call can be
+ aborted by calling sendmsg() with a dataless message with the following
+ control messages attached:
+
+ RXRPC_USER_CALL_ID - specifies the user ID for this call
+ RXRPC_ABORT - indicates abort code (4 byte data)
+
+ Any packets waiting in the socket's receive queue will be discarded if
+ this is issued.
+
+Note that all the communications for a particular service take place through
+the one server socket, using control messages on sendmsg() and recvmsg() to
+determine the call affected.
+
+
+=========================
+AF_RXRPC KERNEL INTERFACE
+=========================
+
+The AF_RXRPC module also provides an interface for use by in-kernel utilities
+such as the AFS filesystem. This permits such a utility to:
+
+ (1) Use different keys directly on individual client calls on one socket
+ rather than having to open a whole slew of sockets, one for each key it
+ might want to use.
+
+ (2) Avoid having RxRPC call request_key() at the point of issue of a call or
+ opening of a socket. Instead the utility is responsible for requesting a
+ key at the appropriate point. AFS, for instance, would do this during VFS
+ operations such as open() or unlink(). The key is then handed through
+ when the call is initiated.
+
+ (3) Request the use of something other than GFP_KERNEL to allocate memory.
+
+ (4) Avoid the overhead of using the recvmsg() call. RxRPC messages can be
+ intercepted before they get put into the socket Rx queue and the socket
+ buffers manipulated directly.
+
+To use the RxRPC facility, a kernel utility must still open an AF_RXRPC socket,
+bind an address as appropriate and listen if it's to be a server socket, but
+then it passes this to the kernel interface functions.
+
+The kernel interface functions are as follows:
+
+ (*) Begin a new client call.
+
+ struct rxrpc_call *
+ rxrpc_kernel_begin_call(struct socket *sock,
+ struct sockaddr_rxrpc *srx,
+ struct key *key,
+ unsigned long user_call_ID,
+ gfp_t gfp);
+
+ This allocates the infrastructure to make a new RxRPC call and assigns
+ call and connection numbers. The call will be made on the UDP port that
+ the socket is bound to. The call will go to the destination address of a
+ connected client socket unless an alternative is supplied (srx is
+ non-NULL).
+
+ If a key is supplied then this will be used to secure the call instead of
+ the key bound to the socket with the RXRPC_SECURITY_KEY sockopt. Calls
+ secured in this way will still share connections if at all possible.
+
+ The user_call_ID is equivalent to that supplied to sendmsg() in the
+ control data buffer. It is entirely feasible to use this to point to a
+ kernel data structure.
+
+ If this function is successful, an opaque reference to the RxRPC call is
+ returned. The caller now holds a reference on this and it must be
+ properly ended.
+
+ (*) End a client call.
+
+ void rxrpc_kernel_end_call(struct rxrpc_call *call);
+
+ This is used to end a previously begun call. The user_call_ID is expunged
+ from AF_RXRPC's knowledge and will not be seen again in association with
+ the specified call.
+
+ (*) Send data through a call.
+
+ int rxrpc_kernel_send_data(struct rxrpc_call *call, struct msghdr *msg,
+ size_t len);
+
+ This is used to supply either the request part of a client call or the
+ reply part of a server call. msg.msg_iovlen and msg.msg_iov specify the
+ data buffers to be used. msg_iov may not be NULL and must point
+ exclusively to in-kernel virtual addresses. msg.msg_flags may be given
+ MSG_MORE if there will be subsequent data sends for this call.
+
+ The msg must not specify a destination address, control data or any flags
+ other than MSG_MORE. len is the total amount of data to transmit.
+
+ (*) Abort a call.
+
+ void rxrpc_kernel_abort_call(struct rxrpc_call *call, u32 abort_code);
+
+ This is used to abort a call if it's still in an abortable state. The
+ abort code specified will be placed in the ABORT message sent.
+
+ (*) Intercept received RxRPC messages.
+
+ typedef void (*rxrpc_interceptor_t)(struct sock *sk,
+ unsigned long user_call_ID,
+ struct sk_buff *skb);
+
+ void
+ rxrpc_kernel_intercept_rx_messages(struct socket *sock,
+ rxrpc_interceptor_t interceptor);
+
+ This installs an interceptor function on the specified AF_RXRPC socket.
+ All messages that would otherwise wind up in the socket's Rx queue are
+ then diverted to this function. Note that care must be taken to process
+ the messages in the right order to maintain DATA message sequentiality.
+
+ The interceptor function itself is provided with the address of the socket
+ and handling the incoming message, the ID assigned by the kernel utility
+ to the call and the socket buffer containing the message.
+
+ The skb->mark field indicates the type of message:
+
+ MARK MEANING
+ =============================== =======================================
+ RXRPC_SKB_MARK_DATA Data message
+ RXRPC_SKB_MARK_FINAL_ACK Final ACK received for an incoming call
+ RXRPC_SKB_MARK_BUSY Client call rejected as server busy
+ RXRPC_SKB_MARK_REMOTE_ABORT Call aborted by peer
+ RXRPC_SKB_MARK_NET_ERROR Network error detected
+ RXRPC_SKB_MARK_LOCAL_ERROR Local error encountered
+ RXRPC_SKB_MARK_NEW_CALL New incoming call awaiting acceptance
+
+ The remote abort message can be probed with rxrpc_kernel_get_abort_code().
+ The two error messages can be probed with rxrpc_kernel_get_error_number().
+ A new call can be accepted with rxrpc_kernel_accept_call().
+
+ Data messages can have their contents extracted with the usual bunch of
+ socket buffer manipulation functions. A data message can be determined to
+ be the last one in a sequence with rxrpc_kernel_is_data_last(). When a
+ data message has been used up, rxrpc_kernel_data_delivered() should be
+ called on it..
+
+ Non-data messages should be handled to rxrpc_kernel_free_skb() to dispose
+ of. It is possible to get extra refs on all types of message for later
+ freeing, but this may pin the state of a call until the message is finally
+ freed.
+
+ (*) Accept an incoming call.
+
+ struct rxrpc_call *
+ rxrpc_kernel_accept_call(struct socket *sock,
+ unsigned long user_call_ID);
+
+ This is used to accept an incoming call and to assign it a call ID. This
+ function is similar to rxrpc_kernel_begin_call() and calls accepted must
+ be ended in the same way.
+
+ If this function is successful, an opaque reference to the RxRPC call is
+ returned. The caller now holds a reference on this and it must be
+ properly ended.
+
+ (*) Reject an incoming call.
+
+ int rxrpc_kernel_reject_call(struct socket *sock);
+
+ This is used to reject the first incoming call on the socket's queue with
+ a BUSY message. -ENODATA is returned if there were no incoming calls.
+ Other errors may be returned if the call had been aborted (-ECONNABORTED)
+ or had timed out (-ETIME).
+
+ (*) Record the delivery of a data message and free it.
+
+ void rxrpc_kernel_data_delivered(struct sk_buff *skb);
+
+ This is used to record a data message as having been delivered and to
+ update the ACK state for the call. The socket buffer will be freed.
+
+ (*) Free a message.
+
+ void rxrpc_kernel_free_skb(struct sk_buff *skb);
+
+ This is used to free a non-DATA socket buffer intercepted from an AF_RXRPC
+ socket.
+
+ (*) Determine if a data message is the last one on a call.
+
+ bool rxrpc_kernel_is_data_last(struct sk_buff *skb);
+
+ This is used to determine if a socket buffer holds the last data message
+ to be received for a call (true will be returned if it does, false
+ if not).
+
+ The data message will be part of the reply on a client call and the
+ request on an incoming call. In the latter case there will be more
+ messages, but in the former case there will not.
+
+ (*) Get the abort code from an abort message.
+
+ u32 rxrpc_kernel_get_abort_code(struct sk_buff *skb);
+
+ This is used to extract the abort code from a remote abort message.
+
+ (*) Get the error number from a local or network error message.
+
+ int rxrpc_kernel_get_error_number(struct sk_buff *skb);
+
+ This is used to extract the error number from a message indicating either
+ a local error occurred or a network error occurred.
+
+ (*) Allocate a null key for doing anonymous security.
+
+ struct key *rxrpc_get_null_key(const char *keyname);
+
+ This is used to allocate a null RxRPC key that can be used to indicate
+ anonymous security for a particular domain.
+
+
+=======================
+CONFIGURABLE PARAMETERS
+=======================
+
+The RxRPC protocol driver has a number of configurable parameters that can be
+adjusted through sysctls in /proc/net/rxrpc/:
+
+ (*) req_ack_delay
+
+ The amount of time in milliseconds after receiving a packet with the
+ request-ack flag set before we honour the flag and actually send the
+ requested ack.
+
+ Usually the other side won't stop sending packets until the advertised
+ reception window is full (to a maximum of 255 packets), so delaying the
+ ACK permits several packets to be ACK'd in one go.
+
+ (*) soft_ack_delay
+
+ The amount of time in milliseconds after receiving a new packet before we
+ generate a soft-ACK to tell the sender that it doesn't need to resend.
+
+ (*) idle_ack_delay
+
+ The amount of time in milliseconds after all the packets currently in the
+ received queue have been consumed before we generate a hard-ACK to tell
+ the sender it can free its buffers, assuming no other reason occurs that
+ we would send an ACK.
+
+ (*) resend_timeout
+
+ The amount of time in milliseconds after transmitting a packet before we
+ transmit it again, assuming no ACK is received from the receiver telling
+ us they got it.
+
+ (*) max_call_lifetime
+
+ The maximum amount of time in seconds that a call may be in progress
+ before we preemptively kill it.
+
+ (*) dead_call_expiry
+
+ The amount of time in seconds before we remove a dead call from the call
+ list. Dead calls are kept around for a little while for the purpose of
+ repeating ACK and ABORT packets.
+
+ (*) connection_expiry
+
+ The amount of time in seconds after a connection was last used before we
+ remove it from the connection list. Whilst a connection is in existence,
+ it serves as a placeholder for negotiated security; when it is deleted,
+ the security must be renegotiated.
+
+ (*) transport_expiry
+
+ The amount of time in seconds after a transport was last used before we
+ remove it from the transport list. Whilst a transport is in existence, it
+ serves to anchor the peer data and keeps the connection ID counter.
+
+ (*) rxrpc_rx_window_size
+
+ The size of the receive window in packets. This is the maximum number of
+ unconsumed received packets we're willing to hold in memory for any
+ particular call.
+
+ (*) rxrpc_rx_mtu
+
+ The maximum packet MTU size that we're willing to receive in bytes. This
+ indicates to the peer whether we're willing to accept jumbo packets.
+
+ (*) rxrpc_rx_jumbo_max
+
+ The maximum number of packets that we're willing to accept in a jumbo
+ packet. Non-terminal packets in a jumbo packet must contain a four byte
+ header plus exactly 1412 bytes of data. The terminal packet must contain
+ a four byte header plus any amount of data. In any event, a jumbo packet
+ may not exceed rxrpc_rx_mtu in size.
diff --git a/Documentation/networking/s2io.txt b/Documentation/networking/s2io.txt
new file mode 100644
index 000000000..0362a42f7
--- /dev/null
+++ b/Documentation/networking/s2io.txt
@@ -0,0 +1,141 @@
+Release notes for Neterion's (Formerly S2io) Xframe I/II PCI-X 10GbE driver.
+
+Contents
+=======
+- 1. Introduction
+- 2. Identifying the adapter/interface
+- 3. Features supported
+- 4. Command line parameters
+- 5. Performance suggestions
+- 6. Available Downloads
+
+
+1. Introduction:
+This Linux driver supports Neterion's Xframe I PCI-X 1.0 and
+Xframe II PCI-X 2.0 adapters. It supports several features
+such as jumbo frames, MSI/MSI-X, checksum offloads, TSO, UFO and so on.
+See below for complete list of features.
+All features are supported for both IPv4 and IPv6.
+
+2. Identifying the adapter/interface:
+a. Insert the adapter(s) in your system.
+b. Build and load driver
+# insmod s2io.ko
+c. View log messages
+# dmesg | tail -40
+You will see messages similar to:
+eth3: Neterion Xframe I 10GbE adapter (rev 3), Version 2.0.9.1, Intr type INTA
+eth4: Neterion Xframe II 10GbE adapter (rev 2), Version 2.0.9.1, Intr type INTA
+eth4: Device is on 64 bit 133MHz PCIX(M1) bus
+
+The above messages identify the adapter type(Xframe I/II), adapter revision,
+driver version, interface name(eth3, eth4), Interrupt type(INTA, MSI, MSI-X).
+In case of Xframe II, the PCI/PCI-X bus width and frequency are displayed
+as well.
+
+To associate an interface with a physical adapter use "ethtool -p <ethX>".
+The corresponding adapter's LED will blink multiple times.
+
+3. Features supported:
+a. Jumbo frames. Xframe I/II supports MTU up to 9600 bytes,
+modifiable using ip command.
+
+b. Offloads. Supports checksum offload(TCP/UDP/IP) on transmit
+and receive, TSO.
+
+c. Multi-buffer receive mode. Scattering of packet across multiple
+buffers. Currently driver supports 2-buffer mode which yields
+significant performance improvement on certain platforms(SGI Altix,
+IBM xSeries).
+
+d. MSI/MSI-X. Can be enabled on platforms which support this feature
+(IA64, Xeon) resulting in noticeable performance improvement(up to 7%
+on certain platforms).
+
+e. Statistics. Comprehensive MAC-level and software statistics displayed
+using "ethtool -S" option.
+
+f. Multi-FIFO/Ring. Supports up to 8 transmit queues and receive rings,
+with multiple steering options.
+
+4. Command line parameters
+a. tx_fifo_num
+Number of transmit queues
+Valid range: 1-8
+Default: 1
+
+b. rx_ring_num
+Number of receive rings
+Valid range: 1-8
+Default: 1
+
+c. tx_fifo_len
+Size of each transmit queue
+Valid range: Total length of all queues should not exceed 8192
+Default: 4096
+
+d. rx_ring_sz
+Size of each receive ring(in 4K blocks)
+Valid range: Limited by memory on system
+Default: 30
+
+e. intr_type
+Specifies interrupt type. Possible values 0(INTA), 2(MSI-X)
+Valid values: 0, 2
+Default: 2
+
+5. Performance suggestions
+General:
+a. Set MTU to maximum(9000 for switch setup, 9600 in back-to-back configuration)
+b. Set TCP windows size to optimal value.
+For instance, for MTU=1500 a value of 210K has been observed to result in
+good performance.
+# sysctl -w net.ipv4.tcp_rmem="210000 210000 210000"
+# sysctl -w net.ipv4.tcp_wmem="210000 210000 210000"
+For MTU=9000, TCP window size of 10 MB is recommended.
+# sysctl -w net.ipv4.tcp_rmem="10000000 10000000 10000000"
+# sysctl -w net.ipv4.tcp_wmem="10000000 10000000 10000000"
+
+Transmit performance:
+a. By default, the driver respects BIOS settings for PCI bus parameters.
+However, you may want to experiment with PCI bus parameters
+max-split-transactions(MOST) and MMRBC (use setpci command).
+A MOST value of 2 has been found optimal for Opterons and 3 for Itanium.
+It could be different for your hardware.
+Set MMRBC to 4K**.
+
+For example you can set
+For opteron
+#setpci -d 17d5:* 62=1d
+For Itanium
+#setpci -d 17d5:* 62=3d
+
+For detailed description of the PCI registers, please see Xframe User Guide.
+
+b. Ensure Transmit Checksum offload is enabled. Use ethtool to set/verify this
+parameter.
+c. Turn on TSO(using "ethtool -K")
+# ethtool -K <ethX> tso on
+
+Receive performance:
+a. By default, the driver respects BIOS settings for PCI bus parameters.
+However, you may want to set PCI latency timer to 248.
+#setpci -d 17d5:* LATENCY_TIMER=f8
+For detailed description of the PCI registers, please see Xframe User Guide.
+b. Use 2-buffer mode. This results in large performance boost on
+certain platforms(eg. SGI Altix, IBM xSeries).
+c. Ensure Receive Checksum offload is enabled. Use "ethtool -K ethX" command to
+set/verify this option.
+d. Enable NAPI feature(in kernel configuration Device Drivers ---> Network
+device support ---> Ethernet (10000 Mbit) ---> S2IO 10Gbe Xframe NIC) to
+bring down CPU utilization.
+
+** For AMD opteron platforms with 8131 chipset, MMRBC=1 and MOST=1 are
+recommended as safe parameters.
+For more information, please review the AMD8131 errata at
+http://vip.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/
+26310_AMD-8131_HyperTransport_PCI-X_Tunnel_Revision_Guide_rev_3_18.pdf
+
+6. Support
+For further support please contact either your 10GbE Xframe NIC vendor (IBM,
+HP, SGI etc.)
diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
new file mode 100644
index 000000000..59f4db2a0
--- /dev/null
+++ b/Documentation/networking/scaling.txt
@@ -0,0 +1,445 @@
+Scaling in the Linux Networking Stack
+
+
+Introduction
+============
+
+This document describes a set of complementary techniques in the Linux
+networking stack to increase parallelism and improve performance for
+multi-processor systems.
+
+The following technologies are described:
+
+ RSS: Receive Side Scaling
+ RPS: Receive Packet Steering
+ RFS: Receive Flow Steering
+ Accelerated Receive Flow Steering
+ XPS: Transmit Packet Steering
+
+
+RSS: Receive Side Scaling
+=========================
+
+Contemporary NICs support multiple receive and transmit descriptor queues
+(multi-queue). On reception, a NIC can send different packets to different
+queues to distribute processing among CPUs. The NIC distributes packets by
+applying a filter to each packet that assigns it to one of a small number
+of logical flows. Packets for each flow are steered to a separate receive
+queue, which in turn can be processed by separate CPUs. This mechanism is
+generally known as “Receive-side Scaling” (RSS). The goal of RSS and
+the other scaling techniques is to increase performance uniformly.
+Multi-queue distribution can also be used for traffic prioritization, but
+that is not the focus of these techniques.
+
+The filter used in RSS is typically a hash function over the network
+and/or transport layer headers-- for example, a 4-tuple hash over
+IP addresses and TCP ports of a packet. The most common hardware
+implementation of RSS uses a 128-entry indirection table where each entry
+stores a queue number. The receive queue for a packet is determined
+by masking out the low order seven bits of the computed hash for the
+packet (usually a Toeplitz hash), taking this number as a key into the
+indirection table and reading the corresponding value.
+
+Some advanced NICs allow steering packets to queues based on
+programmable filters. For example, webserver bound TCP port 80 packets
+can be directed to their own receive queue. Such “n-tuple” filters can
+be configured from ethtool (--config-ntuple).
+
+==== RSS Configuration
+
+The driver for a multi-queue capable NIC typically provides a kernel
+module parameter for specifying the number of hardware queues to
+configure. In the bnx2x driver, for instance, this parameter is called
+num_queues. A typical RSS configuration would be to have one receive queue
+for each CPU if the device supports enough queues, or otherwise at least
+one for each memory domain, where a memory domain is a set of CPUs that
+share a particular memory level (L1, L2, NUMA node, etc.).
+
+The indirection table of an RSS device, which resolves a queue by masked
+hash, is usually programmed by the driver at initialization. The
+default mapping is to distribute the queues evenly in the table, but the
+indirection table can be retrieved and modified at runtime using ethtool
+commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
+indirection table could be done to give different queues different
+relative weights.
+
+== RSS IRQ Configuration
+
+Each receive queue has a separate IRQ associated with it. The NIC triggers
+this to notify a CPU when new packets arrive on the given queue. The
+signaling path for PCIe devices uses message signaled interrupts (MSI-X),
+that can route each interrupt to a particular CPU. The active mapping
+of queues to IRQs can be determined from /proc/interrupts. By default,
+an IRQ may be handled on any CPU. Because a non-negligible part of packet
+processing takes place in receive interrupt handling, it is advantageous
+to spread receive interrupts between CPUs. To manually adjust the IRQ
+affinity of each interrupt see Documentation/IRQ-affinity.txt. Some systems
+will be running irqbalance, a daemon that dynamically optimizes IRQ
+assignments and as a result may override any manual settings.
+
+== Suggested Configuration
+
+RSS should be enabled when latency is a concern or whenever receive
+interrupt processing forms a bottleneck. Spreading load between CPUs
+decreases queue length. For low latency networking, the optimal setting
+is to allocate as many queues as there are CPUs in the system (or the
+NIC maximum, if lower). The most efficient high-rate configuration
+is likely the one with the smallest number of receive queues where no
+receive queue overflows due to a saturated CPU, because in default
+mode with interrupt coalescing enabled, the aggregate number of
+interrupts (and thus work) grows with each additional queue.
+
+Per-cpu load can be observed using the mpstat utility, but note that on
+processors with hyperthreading (HT), each hyperthread is represented as
+a separate CPU. For interrupt handling, HT has shown no benefit in
+initial tests, so limit the number of queues to the number of CPU cores
+in the system.
+
+
+RPS: Receive Packet Steering
+============================
+
+Receive Packet Steering (RPS) is logically a software implementation of
+RSS. Being in software, it is necessarily called later in the datapath.
+Whereas RSS selects the queue and hence CPU that will run the hardware
+interrupt handler, RPS selects the CPU to perform protocol processing
+above the interrupt handler. This is accomplished by placing the packet
+on the desired CPU’s backlog queue and waking up the CPU for processing.
+RPS has some advantages over RSS: 1) it can be used with any NIC,
+2) software filters can easily be added to hash over new protocols,
+3) it does not increase hardware device interrupt rate (although it does
+introduce inter-processor interrupts (IPIs)).
+
+RPS is called during bottom half of the receive interrupt handler, when
+a driver sends a packet up the network stack with netif_rx() or
+netif_receive_skb(). These call the get_rps_cpu() function, which
+selects the queue that should process a packet.
+
+The first step in determining the target CPU for RPS is to calculate a
+flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash
+depending on the protocol). This serves as a consistent hash of the
+associated flow of the packet. The hash is either provided by hardware
+or will be computed in the stack. Capable hardware can pass the hash in
+the receive descriptor for the packet; this would usually be the same
+hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in
+skb->rx_hash and can be used elsewhere in the stack as a hash of the
+packet’s flow.
+
+Each receive hardware queue has an associated list of CPUs to which
+RPS may enqueue packets for processing. For each received packet,
+an index into the list is computed from the flow hash modulo the size
+of the list. The indexed CPU is the target for processing the packet,
+and the packet is queued to the tail of that CPU’s backlog queue. At
+the end of the bottom half routine, IPIs are sent to any CPUs for which
+packets have been queued to their backlog queue. The IPI wakes backlog
+processing on the remote CPU, and any queued packets are then processed
+up the networking stack.
+
+==== RPS Configuration
+
+RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
+by default for SMP). Even when compiled in, RPS remains disabled until
+explicitly configured. The list of CPUs to which RPS may forward traffic
+can be configured for each receive queue using a sysfs file entry:
+
+ /sys/class/net/<dev>/queues/rx-<n>/rps_cpus
+
+This file implements a bitmap of CPUs. RPS is disabled when it is zero
+(the default), in which case packets are processed on the interrupting
+CPU. Documentation/IRQ-affinity.txt explains how CPUs are assigned to
+the bitmap.
+
+== Suggested Configuration
+
+For a single queue device, a typical RPS configuration would be to set
+the rps_cpus to the CPUs in the same memory domain of the interrupting
+CPU. If NUMA locality is not an issue, this could also be all CPUs in
+the system. At high interrupt rate, it might be wise to exclude the
+interrupting CPU from the map since that already performs much work.
+
+For a multi-queue system, if RSS is configured so that a hardware
+receive queue is mapped to each CPU, then RPS is probably redundant
+and unnecessary. If there are fewer hardware queues than CPUs, then
+RPS might be beneficial if the rps_cpus for each queue are the ones that
+share the same memory domain as the interrupting CPU for that queue.
+
+==== RPS Flow Limit
+
+RPS scales kernel receive processing across CPUs without introducing
+reordering. The trade-off to sending all packets from the same flow
+to the same CPU is CPU load imbalance if flows vary in packet rate.
+In the extreme case a single flow dominates traffic. Especially on
+common server workloads with many concurrent connections, such
+behavior indicates a problem such as a misconfiguration or spoofed
+source Denial of Service attack.
+
+Flow Limit is an optional RPS feature that prioritizes small flows
+during CPU contention by dropping packets from large flows slightly
+ahead of those from small flows. It is active only when an RPS or RFS
+destination CPU approaches saturation. Once a CPU's input packet
+queue exceeds half the maximum queue length (as set by sysctl
+net.core.netdev_max_backlog), the kernel starts a per-flow packet
+count over the last 256 packets. If a flow exceeds a set ratio (by
+default, half) of these packets when a new packet arrives, then the
+new packet is dropped. Packets from other flows are still only
+dropped once the input packet queue reaches netdev_max_backlog.
+No packets are dropped when the input packet queue length is below
+the threshold, so flow limit does not sever connections outright:
+even large flows maintain connectivity.
+
+== Interface
+
+Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
+turned on. It is implemented for each CPU independently (to avoid lock
+and cache contention) and toggled per CPU by setting the relevant bit
+in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
+bitmap interface as rps_cpus (see above) when called from procfs:
+
+ /proc/sys/net/core/flow_limit_cpu_bitmap
+
+Per-flow rate is calculated by hashing each packet into a hashtable
+bucket and incrementing a per-bucket counter. The hash function is
+the same that selects a CPU in RPS, but as the number of buckets can
+be much larger than the number of CPUs, flow limit has finer-grained
+identification of large flows and fewer false positives. The default
+table has 4096 buckets. This value can be modified through sysctl
+
+ net.core.flow_limit_table_len
+
+The value is only consulted when a new table is allocated. Modifying
+it does not update active tables.
+
+== Suggested Configuration
+
+Flow limit is useful on systems with many concurrent connections,
+where a single connection taking up 50% of a CPU indicates a problem.
+In such environments, enable the feature on all CPUs that handle
+network rx interrupts (as set in /proc/irq/N/smp_affinity).
+
+The feature depends on the input packet queue length to exceed
+the flow limit threshold (50%) + the flow history length (256).
+Setting net.core.netdev_max_backlog to either 1000 or 10000
+performed well in experiments.
+
+
+RFS: Receive Flow Steering
+==========================
+
+While RPS steers packets solely based on hash, and thus generally
+provides good load distribution, it does not take into account
+application locality. This is accomplished by Receive Flow Steering
+(RFS). The goal of RFS is to increase datacache hitrate by steering
+kernel processing of packets to the CPU where the application thread
+consuming the packet is running. RFS relies on the same RPS mechanisms
+to enqueue packets onto the backlog of another CPU and to wake up that
+CPU.
+
+In RFS, packets are not forwarded directly by the value of their hash,
+but the hash is used as index into a flow lookup table. This table maps
+flows to the CPUs where those flows are being processed. The flow hash
+(see RPS section above) is used to calculate the index into this table.
+The CPU recorded in each entry is the one which last processed the flow.
+If an entry does not hold a valid CPU, then packets mapped to that entry
+are steered using plain RPS. Multiple table entries may point to the
+same CPU. Indeed, with many flows and few CPUs, it is very likely that
+a single application thread handles flows with many different flow hashes.
+
+rps_sock_flow_table is a global flow table that contains the *desired* CPU
+for flows: the CPU that is currently processing the flow in userspace.
+Each table value is a CPU index that is updated during calls to recvmsg
+and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage()
+and tcp_splice_read()).
+
+When the scheduler moves a thread to a new CPU while it has outstanding
+receive packets on the old CPU, packets may arrive out of order. To
+avoid this, RFS uses a second flow table to track outstanding packets
+for each flow: rps_dev_flow_table is a table specific to each hardware
+receive queue of each device. Each table value stores a CPU index and a
+counter. The CPU index represents the *current* CPU onto which packets
+for this flow are enqueued for further kernel processing. Ideally, kernel
+and userspace processing occur on the same CPU, and hence the CPU index
+in both tables is identical. This is likely false if the scheduler has
+recently migrated a userspace thread while the kernel still has packets
+enqueued for kernel processing on the old CPU.
+
+The counter in rps_dev_flow_table values records the length of the current
+CPU's backlog when a packet in this flow was last enqueued. Each backlog
+queue has a head counter that is incremented on dequeue. A tail counter
+is computed as head counter + queue length. In other words, the counter
+in rps_dev_flow[i] records the last element in flow i that has
+been enqueued onto the currently designated CPU for flow i (of course,
+entry i is actually selected by hash and multiple flows may hash to the
+same entry i).
+
+And now the trick for avoiding out of order packets: when selecting the
+CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table
+and the rps_dev_flow table of the queue that the packet was received on
+are compared. If the desired CPU for the flow (found in the
+rps_sock_flow table) matches the current CPU (found in the rps_dev_flow
+table), the packet is enqueued onto that CPU’s backlog. If they differ,
+the current CPU is updated to match the desired CPU if one of the
+following is true:
+
+- The current CPU's queue head counter >= the recorded tail counter
+ value in rps_dev_flow[i]
+- The current CPU is unset (>= nr_cpu_ids)
+- The current CPU is offline
+
+After this check, the packet is sent to the (possibly updated) current
+CPU. These rules aim to ensure that a flow only moves to a new CPU when
+there are no packets outstanding on the old CPU, as the outstanding
+packets could arrive later than those about to be processed on the new
+CPU.
+
+==== RFS Configuration
+
+RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on
+by default for SMP). The functionality remains disabled until explicitly
+configured. The number of entries in the global flow table is set through:
+
+ /proc/sys/net/core/rps_sock_flow_entries
+
+The number of entries in the per-queue flow table are set through:
+
+ /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt
+
+== Suggested Configuration
+
+Both of these need to be set before RFS is enabled for a receive queue.
+Values for both are rounded up to the nearest power of two. The
+suggested flow count depends on the expected number of active connections
+at any given time, which may be significantly less than the number of open
+connections. We have found that a value of 32768 for rps_sock_flow_entries
+works fairly well on a moderately loaded server.
+
+For a single queue device, the rps_flow_cnt value for the single queue
+would normally be configured to the same value as rps_sock_flow_entries.
+For a multi-queue device, the rps_flow_cnt for each queue might be
+configured as rps_sock_flow_entries / N, where N is the number of
+queues. So for instance, if rps_sock_flow_entries is set to 32768 and there
+are 16 configured receive queues, rps_flow_cnt for each queue might be
+configured as 2048.
+
+
+Accelerated RFS
+===============
+
+Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
+balancing mechanism that uses soft state to steer flows based on where
+the application thread consuming the packets of each flow is running.
+Accelerated RFS should perform better than RFS since packets are sent
+directly to a CPU local to the thread consuming the data. The target CPU
+will either be the same CPU where the application runs, or at least a CPU
+which is local to the application thread’s CPU in the cache hierarchy.
+
+To enable accelerated RFS, the networking stack calls the
+ndo_rx_flow_steer driver function to communicate the desired hardware
+queue for packets matching a particular flow. The network stack
+automatically calls this function every time a flow entry in
+rps_dev_flow_table is updated. The driver in turn uses a device specific
+method to program the NIC to steer the packets.
+
+The hardware queue for a flow is derived from the CPU recorded in
+rps_dev_flow_table. The stack consults a CPU to hardware queue map which
+is maintained by the NIC driver. This is an auto-generated reverse map of
+the IRQ affinity table shown by /proc/interrupts. Drivers can use
+functions in the cpu_rmap (“CPU affinity reverse map”) kernel library
+to populate the map. For each CPU, the corresponding queue in the map is
+set to be one whose processing CPU is closest in cache locality.
+
+==== Accelerated RFS Configuration
+
+Accelerated RFS is only available if the kernel is compiled with
+CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
+It also requires that ntuple filtering is enabled via ethtool. The map
+of CPU to queues is automatically deduced from the IRQ affinities
+configured for each receive queue by the driver, so no additional
+configuration should be necessary.
+
+== Suggested Configuration
+
+This technique should be enabled whenever one wants to use RFS and the
+NIC supports hardware acceleration.
+
+XPS: Transmit Packet Steering
+=============================
+
+Transmit Packet Steering is a mechanism for intelligently selecting
+which transmit queue to use when transmitting a packet on a multi-queue
+device. To accomplish this, a mapping from CPU to hardware queue(s) is
+recorded. The goal of this mapping is usually to assign queues
+exclusively to a subset of CPUs, where the transmit completions for
+these queues are processed on a CPU within this set. This choice
+provides two benefits. First, contention on the device queue lock is
+significantly reduced since fewer CPUs contend for the same queue
+(contention can be eliminated completely if each CPU has its own
+transmit queue). Secondly, cache miss rate on transmit completion is
+reduced, in particular for data cache lines that hold the sk_buff
+structures.
+
+XPS is configured per transmit queue by setting a bitmap of CPUs that
+may use that queue to transmit. The reverse mapping, from CPUs to
+transmit queues, is computed and maintained for each network device.
+When transmitting the first packet in a flow, the function
+get_xps_queue() is called to select a queue. This function uses the ID
+of the running CPU as a key into the CPU-to-queue lookup table. If the
+ID matches a single queue, that is used for transmission. If multiple
+queues match, one is selected by using the flow hash to compute an index
+into the set.
+
+The queue chosen for transmitting a particular flow is saved in the
+corresponding socket structure for the flow (e.g. a TCP connection).
+This transmit queue is used for subsequent packets sent on the flow to
+prevent out of order (ooo) packets. The choice also amortizes the cost
+of calling get_xps_queues() over all packets in the flow. To avoid
+ooo packets, the queue for a flow can subsequently only be changed if
+skb->ooo_okay is set for a packet in the flow. This flag indicates that
+there are no outstanding packets in the flow, so the transmit queue can
+change without the risk of generating out of order packets. The
+transport layer is responsible for setting ooo_okay appropriately. TCP,
+for instance, sets the flag when all data for a connection has been
+acknowledged.
+
+==== XPS Configuration
+
+XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
+default for SMP). The functionality remains disabled until explicitly
+configured. To enable XPS, the bitmap of CPUs that may use a transmit
+queue is configured using the sysfs file entry:
+
+/sys/class/net/<dev>/queues/tx-<n>/xps_cpus
+
+== Suggested Configuration
+
+For a network device with a single transmission queue, XPS configuration
+has no effect, since there is no choice in this case. In a multi-queue
+system, XPS is preferably configured so that each CPU maps onto one queue.
+If there are as many queues as there are CPUs in the system, then each
+queue can also map onto one CPU, resulting in exclusive pairings that
+experience no contention. If there are fewer queues than CPUs, then the
+best CPUs to share a given queue are probably those that share the cache
+with the CPU that processes transmit completions for that queue
+(transmit interrupts).
+
+Per TX Queue rate limitation:
+=============================
+
+These are rate-limitation mechanisms implemented by HW, where currently
+a max-rate attribute is supported, by setting a Mbps value to
+
+/sys/class/net/<dev>/queues/tx-<n>/tx_maxrate
+
+A value of zero means disabled, and this is the default.
+
+Further Information
+===================
+RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
+2.6.38. Original patches were submitted by Tom Herbert
+(therbert@google.com)
+
+Accelerated RFS was introduced in 2.6.35. Original patches were
+submitted by Ben Hutchings (bwh@kernel.org)
+
+Authors:
+Tom Herbert (therbert@google.com)
+Willem de Bruijn (willemb@google.com)
diff --git a/Documentation/networking/sctp.txt b/Documentation/networking/sctp.txt
new file mode 100644
index 000000000..97b810ca9
--- /dev/null
+++ b/Documentation/networking/sctp.txt
@@ -0,0 +1,35 @@
+Linux Kernel SCTP
+
+This is the current BETA release of the Linux Kernel SCTP reference
+implementation.
+
+SCTP (Stream Control Transmission Protocol) is a IP based, message oriented,
+reliable transport protocol, with congestion control, support for
+transparent multi-homing, and multiple ordered streams of messages.
+RFC2960 defines the core protocol. The IETF SIGTRAN working group originally
+developed the SCTP protocol and later handed the protocol over to the
+Transport Area (TSVWG) working group for the continued evolvement of SCTP as a
+general purpose transport.
+
+See the IETF website (http://www.ietf.org) for further documents on SCTP.
+See http://www.ietf.org/rfc/rfc2960.txt
+
+The initial project goal is to create an Linux kernel reference implementation
+of SCTP that is RFC 2960 compliant and provides an programming interface
+referred to as the UDP-style API of the Sockets Extensions for SCTP, as
+proposed in IETF Internet-Drafts.
+
+Caveats:
+
+-lksctp can be built as statically or as a module. However, be aware that
+module removal of lksctp is not yet a safe activity.
+
+-There is tentative support for IPv6, but most work has gone towards
+implementation and testing lksctp on IPv4.
+
+
+For more information, please visit the lksctp project website:
+ http://www.sf.net/projects/lksctp
+
+Or contact the lksctp developers through the mailing list:
+ <linux-sctp@vger.kernel.org>
diff --git a/Documentation/networking/secid.txt b/Documentation/networking/secid.txt
new file mode 100644
index 000000000..95ea06784
--- /dev/null
+++ b/Documentation/networking/secid.txt
@@ -0,0 +1,14 @@
+flowi structure:
+
+The secid member in the flow structure is used in LSMs (e.g. SELinux) to indicate
+the label of the flow. This label of the flow is currently used in selecting
+matching labeled xfrm(s).
+
+If this is an outbound flow, the label is derived from the socket, if any, or
+the incoming packet this flow is being generated as a response to (e.g. tcp
+resets, timewait ack, etc.). It is also conceivable that the label could be
+derived from other sources such as process context, device, etc., in special
+cases, as may be appropriate.
+
+If this is an inbound flow, the label is derived from the IPSec security
+associations, if any, used by the packet.
diff --git a/Documentation/networking/skfp.txt b/Documentation/networking/skfp.txt
new file mode 100644
index 000000000..203ec66c9
--- /dev/null
+++ b/Documentation/networking/skfp.txt
@@ -0,0 +1,220 @@
+(C)Copyright 1998-2000 SysKonnect,
+===========================================================================
+
+skfp.txt created 11-May-2000
+
+Readme File for skfp.o v2.06
+
+
+This file contains
+(1) OVERVIEW
+(2) SUPPORTED ADAPTERS
+(3) GENERAL INFORMATION
+(4) INSTALLATION
+(5) INCLUSION OF THE ADAPTER IN SYSTEM START
+(6) TROUBLESHOOTING
+(7) FUNCTION OF THE ADAPTER LEDS
+(8) HISTORY
+
+===========================================================================
+
+
+
+(1) OVERVIEW
+============
+
+This README explains how to use the driver 'skfp' for Linux with your
+network adapter.
+
+Chapter 2: Contains a list of all network adapters that are supported by
+ this driver.
+
+Chapter 3: Gives some general information.
+
+Chapter 4: Describes common problems and solutions.
+
+Chapter 5: Shows the changed functionality of the adapter LEDs.
+
+Chapter 6: History of development.
+
+***
+
+
+(2) SUPPORTED ADAPTERS
+======================
+
+The network driver 'skfp' supports the following network adapters:
+SysKonnect adapters:
+ - SK-5521 (SK-NET FDDI-UP)
+ - SK-5522 (SK-NET FDDI-UP DAS)
+ - SK-5541 (SK-NET FDDI-FP)
+ - SK-5543 (SK-NET FDDI-LP)
+ - SK-5544 (SK-NET FDDI-LP DAS)
+ - SK-5821 (SK-NET FDDI-UP64)
+ - SK-5822 (SK-NET FDDI-UP64 DAS)
+ - SK-5841 (SK-NET FDDI-FP64)
+ - SK-5843 (SK-NET FDDI-LP64)
+ - SK-5844 (SK-NET FDDI-LP64 DAS)
+Compaq adapters (not tested):
+ - Netelligent 100 FDDI DAS Fibre SC
+ - Netelligent 100 FDDI SAS Fibre SC
+ - Netelligent 100 FDDI DAS UTP
+ - Netelligent 100 FDDI SAS UTP
+ - Netelligent 100 FDDI SAS Fibre MIC
+***
+
+
+(3) GENERAL INFORMATION
+=======================
+
+From v2.01 on, the driver is integrated in the linux kernel sources.
+Therefore, the installation is the same as for any other adapter
+supported by the kernel.
+Refer to the manual of your distribution about the installation
+of network adapters.
+Makes my life much easier :-)
+***
+
+
+(4) TROUBLESHOOTING
+===================
+
+If you run into problems during installation, check those items:
+
+Problem: The FDDI adapter cannot be found by the driver.
+Reason: Look in /proc/pci for the following entry:
+ 'FDDI network controller: SysKonnect SK-FDDI-PCI ...'
+ If this entry exists, then the FDDI adapter has been
+ found by the system and should be able to be used.
+ If this entry does not exist or if the file '/proc/pci'
+ is not there, then you may have a hardware problem or PCI
+ support may not be enabled in your kernel.
+ The adapter can be checked using the diagnostic program
+ which is available from the SysKonnect web site:
+ www.syskonnect.de
+ Some COMPAQ machines have a problem with PCI under
+ Linux. This is described in the 'PCI howto' document
+ (included in some distributions or available from the
+ www, e.g. at 'www.linux.org') and no workaround is available.
+
+Problem: You want to use your computer as a router between
+ multiple IP subnetworks (using multiple adapters), but
+ you cannot reach computers in other subnetworks.
+Reason: Either the router's kernel is not configured for IP
+ forwarding or there is a problem with the routing table
+ and gateway configuration in at least one of the
+ computers.
+
+If your problem is not listed here, please contact our
+technical support for help.
+You can send email to:
+ linux@syskonnect.de
+When contacting our technical support,
+please ensure that the following information is available:
+- System Manufacturer and Model
+- Boards in your system
+- Distribution
+- Kernel version
+
+***
+
+
+(5) FUNCTION OF THE ADAPTER LEDS
+================================
+
+ The functionality of the LED's on the FDDI network adapters was
+ changed in SMT version v2.82. With this new SMT version, the yellow
+ LED works as a ring operational indicator. An active yellow LED
+ indicates that the ring is down. The green LED on the adapter now
+ works as a link indicator where an active GREEN LED indicates that
+ the respective port has a physical connection.
+
+ With versions of SMT prior to v2.82 a ring up was indicated if the
+ yellow LED was off while the green LED(s) showed the connection
+ status of the adapter. During a ring down the green LED was off and
+ the yellow LED was on.
+
+ All implementations indicate that a driver is not loaded if
+ all LEDs are off.
+
+***
+
+
+(6) HISTORY
+===========
+
+v2.06 (20000511) (In-Kernel version)
+ New features:
+ - 64 bit support
+ - new pci dma interface
+ - in kernel 2.3.99
+
+v2.05 (20000217) (In-Kernel version)
+ New features:
+ - Changes for 2.3.45 kernel
+
+v2.04 (20000207) (Standalone version)
+ New features:
+ - Added rx/tx byte counter
+
+v2.03 (20000111) (Standalone version)
+ Problems fixed:
+ - Fixed printk statements from v2.02
+
+v2.02 (991215) (Standalone version)
+ Problems fixed:
+ - Removed unnecessary output
+ - Fixed path for "printver.sh" in makefile
+
+v2.01 (991122) (In-Kernel version)
+ New features:
+ - Integration in Linux kernel sources
+ - Support for memory mapped I/O.
+
+v2.00 (991112)
+ New features:
+ - Full source released under GPL
+
+v1.05 (991023)
+ Problems fixed:
+ - Compilation with kernel version 2.2.13 failed
+
+v1.04 (990427)
+ Changes:
+ - New SMT module included, changing LED functionality
+ Problems fixed:
+ - Synchronization on SMP machines was buggy
+
+v1.03 (990325)
+ Problems fixed:
+ - Interrupt routing on SMP machines could be incorrect
+
+v1.02 (990310)
+ New features:
+ - Support for kernel versions 2.2.x added
+ - Kernel patch instead of private duplicate of kernel functions
+
+v1.01 (980812)
+ Problems fixed:
+ Connection hangup with telnet
+ Slow telnet connection
+
+v1.00 beta 01 (980507)
+ New features:
+ None.
+ Problems fixed:
+ None.
+ Known limitations:
+ - tar archive instead of standard package format (rpm).
+ - FDDI statistic is empty.
+ - not tested with 2.1.xx kernels
+ - integration in kernel not tested
+ - not tested simultaneously with FDDI adapters from other vendors.
+ - only X86 processors supported.
+ - SBA (Synchronous Bandwidth Allocator) parameters can
+ not be configured.
+ - does not work on some COMPAQ machines. See the PCI howto
+ document for details about this problem.
+ - data corruption with kernel versions below 2.0.33.
+
+*** End of information file ***
diff --git a/Documentation/networking/smc9.txt b/Documentation/networking/smc9.txt
new file mode 100644
index 000000000..d1e15074e
--- /dev/null
+++ b/Documentation/networking/smc9.txt
@@ -0,0 +1,42 @@
+
+SMC 9xxxx Driver
+Revision 0.12
+3/5/96
+Copyright 1996 Erik Stahlman
+Released under terms of the GNU General Public License.
+
+This file contains the instructions and caveats for my SMC9xxx driver. You
+should not be using the driver without reading this file.
+
+Things to note about installation:
+
+ 1. The driver should work on all kernels from 1.2.13 until 1.3.71.
+ (A kernel patch is supplied for 1.3.71 )
+
+ 2. If you include this into the kernel, you might need to change some
+ options, such as for forcing IRQ.
+
+
+ 3. To compile as a module, run 'make' .
+ Make will give you the appropriate options for various kernel support.
+
+ 4. Loading the driver as a module :
+
+ use: insmod smc9194.o
+ optional parameters:
+ io=xxxx : your base address
+ irq=xx : your irq
+ ifport=x : 0 for whatever is default
+ 1 for twisted pair
+ 2 for AUI ( or BNC on some cards )
+
+How to obtain the latest version?
+
+FTP:
+ ftp://fenris.campus.vt.edu/smc9/smc9-12.tar.gz
+ ftp://sfbox.vt.edu/filebox/F/fenris/smc9/smc9-12.tar.gz
+
+
+Contacting me:
+ erik@mail.vt.edu
+
diff --git a/Documentation/networking/spider_net.txt b/Documentation/networking/spider_net.txt
new file mode 100644
index 000000000..b0b75f846
--- /dev/null
+++ b/Documentation/networking/spider_net.txt
@@ -0,0 +1,204 @@
+
+ The Spidernet Device Driver
+ ===========================
+
+Written by Linas Vepstas <linas@austin.ibm.com>
+
+Version of 7 June 2007
+
+Abstract
+========
+This document sketches the structure of portions of the spidernet
+device driver in the Linux kernel tree. The spidernet is a gigabit
+ethernet device built into the Toshiba southbridge commonly used
+in the SONY Playstation 3 and the IBM QS20 Cell blade.
+
+The Structure of the RX Ring.
+=============================
+The receive (RX) ring is a circular linked list of RX descriptors,
+together with three pointers into the ring that are used to manage its
+contents.
+
+The elements of the ring are called "descriptors" or "descrs"; they
+describe the received data. This includes a pointer to a buffer
+containing the received data, the buffer size, and various status bits.
+
+There are three primary states that a descriptor can be in: "empty",
+"full" and "not-in-use". An "empty" or "ready" descriptor is ready
+to receive data from the hardware. A "full" descriptor has data in it,
+and is waiting to be emptied and processed by the OS. A "not-in-use"
+descriptor is neither empty or full; it is simply not ready. It may
+not even have a data buffer in it, or is otherwise unusable.
+
+During normal operation, on device startup, the OS (specifically, the
+spidernet device driver) allocates a set of RX descriptors and RX
+buffers. These are all marked "empty", ready to receive data. This
+ring is handed off to the hardware, which sequentially fills in the
+buffers, and marks them "full". The OS follows up, taking the full
+buffers, processing them, and re-marking them empty.
+
+This filling and emptying is managed by three pointers, the "head"
+and "tail" pointers, managed by the OS, and a hardware current
+descriptor pointer (GDACTDPA). The GDACTDPA points at the descr
+currently being filled. When this descr is filled, the hardware
+marks it full, and advances the GDACTDPA by one. Thus, when there is
+flowing RX traffic, every descr behind it should be marked "full",
+and everything in front of it should be "empty". If the hardware
+discovers that the current descr is not empty, it will signal an
+interrupt, and halt processing.
+
+The tail pointer tails or trails the hardware pointer. When the
+hardware is ahead, the tail pointer will be pointing at a "full"
+descr. The OS will process this descr, and then mark it "not-in-use",
+and advance the tail pointer. Thus, when there is flowing RX traffic,
+all of the descrs in front of the tail pointer should be "full", and
+all of those behind it should be "not-in-use". When RX traffic is not
+flowing, then the tail pointer can catch up to the hardware pointer.
+The OS will then note that the current tail is "empty", and halt
+processing.
+
+The head pointer (somewhat mis-named) follows after the tail pointer.
+When traffic is flowing, then the head pointer will be pointing at
+a "not-in-use" descr. The OS will perform various housekeeping duties
+on this descr. This includes allocating a new data buffer and
+dma-mapping it so as to make it visible to the hardware. The OS will
+then mark the descr as "empty", ready to receive data. Thus, when there
+is flowing RX traffic, everything in front of the head pointer should
+be "not-in-use", and everything behind it should be "empty". If no
+RX traffic is flowing, then the head pointer can catch up to the tail
+pointer, at which point the OS will notice that the head descr is
+"empty", and it will halt processing.
+
+Thus, in an idle system, the GDACTDPA, tail and head pointers will
+all be pointing at the same descr, which should be "empty". All of the
+other descrs in the ring should be "empty" as well.
+
+The show_rx_chain() routine will print out the locations of the
+GDACTDPA, tail and head pointers. It will also summarize the contents
+of the ring, starting at the tail pointer, and listing the status
+of the descrs that follow.
+
+A typical example of the output, for a nearly idle system, might be
+
+net eth1: Total number of descrs=256
+net eth1: Chain tail located at descr=20
+net eth1: Chain head is at 20
+net eth1: HW curr desc (GDACTDPA) is at 21
+net eth1: Have 1 descrs with stat=x40800101
+net eth1: HW next desc (GDACNEXTDA) is at 22
+net eth1: Last 255 descrs with stat=xa0800000
+
+In the above, the hardware has filled in one descr, number 20. Both
+head and tail are pointing at 20, because it has not yet been emptied.
+Meanwhile, hw is pointing at 21, which is free.
+
+The "Have nnn decrs" refers to the descr starting at the tail: in this
+case, nnn=1 descr, starting at descr 20. The "Last nnn descrs" refers
+to all of the rest of the descrs, from the last status change. The "nnn"
+is a count of how many descrs have exactly the same status.
+
+The status x4... corresponds to "full" and status xa... corresponds
+to "empty". The actual value printed is RXCOMST_A.
+
+In the device driver source code, a different set of names are
+used for these same concepts, so that
+
+"empty" == SPIDER_NET_DESCR_CARDOWNED == 0xa
+"full" == SPIDER_NET_DESCR_FRAME_END == 0x4
+"not in use" == SPIDER_NET_DESCR_NOT_IN_USE == 0xf
+
+
+The RX RAM full bug/feature
+===========================
+
+As long as the OS can empty out the RX buffers at a rate faster than
+the hardware can fill them, there is no problem. If, for some reason,
+the OS fails to empty the RX ring fast enough, the hardware GDACTDPA
+pointer will catch up to the head, notice the not-empty condition,
+ad stop. However, RX packets may still continue arriving on the wire.
+The spidernet chip can save some limited number of these in local RAM.
+When this local ram fills up, the spider chip will issue an interrupt
+indicating this (GHIINT0STS will show ERRINT, and the GRMFLLINT bit
+will be set in GHIINT1STS). When the RX ram full condition occurs,
+a certain bug/feature is triggered that has to be specially handled.
+This section describes the special handling for this condition.
+
+When the OS finally has a chance to run, it will empty out the RX ring.
+In particular, it will clear the descriptor on which the hardware had
+stopped. However, once the hardware has decided that a certain
+descriptor is invalid, it will not restart at that descriptor; instead
+it will restart at the next descr. This potentially will lead to a
+deadlock condition, as the tail pointer will be pointing at this descr,
+which, from the OS point of view, is empty; the OS will be waiting for
+this descr to be filled. However, the hardware has skipped this descr,
+and is filling the next descrs. Since the OS doesn't see this, there
+is a potential deadlock, with the OS waiting for one descr to fill,
+while the hardware is waiting for a different set of descrs to become
+empty.
+
+A call to show_rx_chain() at this point indicates the nature of the
+problem. A typical print when the network is hung shows the following:
+
+net eth1: Spider RX RAM full, incoming packets might be discarded!
+net eth1: Total number of descrs=256
+net eth1: Chain tail located at descr=255
+net eth1: Chain head is at 255
+net eth1: HW curr desc (GDACTDPA) is at 0
+net eth1: Have 1 descrs with stat=xa0800000
+net eth1: HW next desc (GDACNEXTDA) is at 1
+net eth1: Have 127 descrs with stat=x40800101
+net eth1: Have 1 descrs with stat=x40800001
+net eth1: Have 126 descrs with stat=x40800101
+net eth1: Last 1 descrs with stat=xa0800000
+
+Both the tail and head pointers are pointing at descr 255, which is
+marked xa... which is "empty". Thus, from the OS point of view, there
+is nothing to be done. In particular, there is the implicit assumption
+that everything in front of the "empty" descr must surely also be empty,
+as explained in the last section. The OS is waiting for descr 255 to
+become non-empty, which, in this case, will never happen.
+
+The HW pointer is at descr 0. This descr is marked 0x4.. or "full".
+Since its already full, the hardware can do nothing more, and thus has
+halted processing. Notice that descrs 0 through 254 are all marked
+"full", while descr 254 and 255 are empty. (The "Last 1 descrs" is
+descr 254, since tail was at 255.) Thus, the system is deadlocked,
+and there can be no forward progress; the OS thinks there's nothing
+to do, and the hardware has nowhere to put incoming data.
+
+This bug/feature is worked around with the spider_net_resync_head_ptr()
+routine. When the driver receives RX interrupts, but an examination
+of the RX chain seems to show it is empty, then it is probable that
+the hardware has skipped a descr or two (sometimes dozens under heavy
+network conditions). The spider_net_resync_head_ptr() subroutine will
+search the ring for the next full descr, and the driver will resume
+operations there. Since this will leave "holes" in the ring, there
+is also a spider_net_resync_tail_ptr() that will skip over such holes.
+
+As of this writing, the spider_net_resync() strategy seems to work very
+well, even under heavy network loads.
+
+
+The TX ring
+===========
+The TX ring uses a low-watermark interrupt scheme to make sure that
+the TX queue is appropriately serviced for large packet sizes.
+
+For packet sizes greater than about 1KBytes, the kernel can fill
+the TX ring quicker than the device can drain it. Once the ring
+is full, the netdev is stopped. When there is room in the ring,
+the netdev needs to be reawakened, so that more TX packets are placed
+in the ring. The hardware can empty the ring about four times per jiffy,
+so its not appropriate to wait for the poll routine to refill, since
+the poll routine runs only once per jiffy. The low-watermark mechanism
+marks a descr about 1/4th of the way from the bottom of the queue, so
+that an interrupt is generated when the descr is processed. This
+interrupt wakes up the netdev, which can then refill the queue.
+For large packets, this mechanism generates a relatively small number
+of interrupts, about 1K/sec. For smaller packets, this will drop to zero
+interrupts, as the hardware can empty the queue faster than the kernel
+can fill it.
+
+
+ ======= END OF DOCUMENT ========
+
diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt
new file mode 100644
index 000000000..e655e2453
--- /dev/null
+++ b/Documentation/networking/stmmac.txt
@@ -0,0 +1,369 @@
+ STMicroelectronics 10/100/1000 Synopsys Ethernet driver
+
+Copyright (C) 2007-2014 STMicroelectronics Ltd
+Author: Giuseppe Cavallaro <peppe.cavallaro@st.com>
+
+This is the driver for the MAC 10/100/1000 on-chip Ethernet controllers
+(Synopsys IP blocks).
+
+Currently this network device driver is for all STi embedded MAC/GMAC
+(i.e. 7xxx/5xxx SoCs), SPEAr (arm), Loongson1B (mips) and XLINX XC2V3000
+FF1152AMT0221 D1215994A VIRTEX FPGA board.
+
+DWC Ether MAC 10/100/1000 Universal version 3.70a (and older) and DWC Ether
+MAC 10/100 Universal version 4.0 have been used for developing this driver.
+
+This driver supports both the platform bus and PCI.
+
+Please, for more information also visit: www.stlinux.com
+
+1) Kernel Configuration
+The kernel configuration option is STMMAC_ETH:
+ Device Drivers ---> Network device support ---> Ethernet (1000 Mbit) --->
+ STMicroelectronics 10/100/1000 Ethernet driver (STMMAC_ETH)
+
+CONFIG_STMMAC_PLATFORM: is to enable the platform driver.
+CONFIG_STMMAC_PCI: is to enable the pci driver.
+
+2) Driver parameters list:
+ debug: message level (0: no output, 16: all);
+ phyaddr: to manually provide the physical address to the PHY device;
+ dma_rxsize: DMA rx ring size;
+ dma_txsize: DMA tx ring size;
+ buf_sz: DMA buffer size;
+ tc: control the HW FIFO threshold;
+ watchdog: transmit timeout (in milliseconds);
+ flow_ctrl: Flow control ability [on/off];
+ pause: Flow Control Pause Time;
+ eee_timer: tx EEE timer;
+ chain_mode: select chain mode instead of ring.
+
+3) Command line options
+Driver parameters can be also passed in command line by using:
+ stmmaceth=dma_rxsize:128,dma_txsize:512
+
+4) Driver information and notes
+
+4.1) Transmit process
+The xmit method is invoked when the kernel needs to transmit a packet; it sets
+the descriptors in the ring and informs the DMA engine that there is a packet
+ready to be transmitted.
+By default, the driver sets the NETIF_F_SG bit in the features field of the
+net_device structure enabling the scatter-gather feature. This is true on
+chips and configurations where the checksum can be done in hardware.
+Once the controller has finished transmitting the packet, napi will be
+scheduled to release the transmit resources.
+
+4.2) Receive process
+When one or more packets are received, an interrupt happens. The interrupts
+are not queued so the driver has to scan all the descriptors in the ring during
+the receive process.
+This is based on NAPI so the interrupt handler signals only if there is work
+to be done, and it exits.
+Then the poll method will be scheduled at some future point.
+The incoming packets are stored, by the DMA, in a list of pre-allocated socket
+buffers in order to avoid the memcpy (zero-copy).
+
+4.3) Interrupt Mitigation
+The driver is able to mitigate the number of its DMA interrupts
+using NAPI for the reception on chips older than the 3.50.
+New chips have an HW RX-Watchdog used for this mitigation.
+Mitigation parameters can be tuned by ethtool.
+
+4.4) WOL
+Wake up on Lan feature through Magic and Unicast frames are supported for the
+GMAC core.
+
+4.5) DMA descriptors
+Driver handles both normal and alternate descriptors. The latter has been only
+tested on DWC Ether MAC 10/100/1000 Universal version 3.41a and later.
+
+STMMAC supports DMA descriptor to operate both in dual buffer (RING)
+and linked-list(CHAINED) mode. In RING each descriptor points to two
+data buffer pointers whereas in CHAINED mode they point to only one data
+buffer pointer. RING mode is the default.
+
+In CHAINED mode each descriptor will have pointer to next descriptor in
+the list, hence creating the explicit chaining in the descriptor itself,
+whereas such explicit chaining is not possible in RING mode.
+
+4.5.1) Extended descriptors
+ The extended descriptors give us information about the Ethernet payload
+ when it is carrying PTP packets or TCP/UDP/ICMP over IP.
+ These are not available on GMAC Synopsys chips older than the 3.50.
+ At probe time the driver will decide if these can be actually used.
+ This support also is mandatory for PTPv2 because the extra descriptors
+ are used for saving the hardware timestamps and Extended Status.
+
+4.6) Ethtool support
+Ethtool is supported.
+
+For example, driver statistics (including RMON), internal errors can be taken
+using:
+ # ethtool -S ethX command
+
+4.7) Jumbo and Segmentation Offloading
+Jumbo frames are supported and tested for the GMAC.
+The GSO has been also added but it's performed in software.
+LRO is not supported.
+
+4.8) Physical
+The driver is compatible with Physical Abstraction Layer to be connected with
+PHY and GPHY devices.
+
+4.9) Platform information
+Several information can be passed through the platform and device-tree.
+
+struct plat_stmmacenet_data {
+ char *phy_bus_name;
+ int bus_id;
+ int phy_addr;
+ int interface;
+ struct stmmac_mdio_bus_data *mdio_bus_data;
+ struct stmmac_dma_cfg *dma_cfg;
+ int clk_csr;
+ int has_gmac;
+ int enh_desc;
+ int tx_coe;
+ int rx_coe;
+ int bugged_jumbo;
+ int pmt;
+ int force_sf_dma_mode;
+ int force_thresh_dma_mode;
+ int riwt_off;
+ int max_speed;
+ int maxmtu;
+ void (*fix_mac_speed)(void *priv, unsigned int speed);
+ void (*bus_setup)(void __iomem *ioaddr);
+ void *(*setup)(struct platform_device *pdev);
+ void (*free)(struct platform_device *pdev, void *priv);
+ int (*init)(struct platform_device *pdev, void *priv);
+ void (*exit)(struct platform_device *pdev, void *priv);
+ void *custom_cfg;
+ void *custom_data;
+ void *bsp_priv;
+};
+
+Where:
+ o phy_bus_name: phy bus name to attach to the stmmac.
+ o bus_id: bus identifier.
+ o phy_addr: the physical address can be passed from the platform.
+ If it is set to -1 the driver will automatically
+ detect it at run-time by probing all the 32 addresses.
+ o interface: PHY device's interface.
+ o mdio_bus_data: specific platform fields for the MDIO bus.
+ o dma_cfg: internal DMA parameters
+ o pbl: the Programmable Burst Length is maximum number of beats to
+ be transferred in one DMA transaction.
+ GMAC also enables the 4xPBL by default.
+ o fixed_burst/mixed_burst/burst_len
+ o clk_csr: fixed CSR Clock range selection.
+ o has_gmac: uses the GMAC core.
+ o enh_desc: if sets the MAC will use the enhanced descriptor structure.
+ o tx_coe: core is able to perform the tx csum in HW.
+ o rx_coe: the supports three check sum offloading engine types:
+ type_1, type_2 (full csum) and no RX coe.
+ o bugged_jumbo: some HWs are not able to perform the csum in HW for
+ over-sized frames due to limited buffer sizes.
+ Setting this flag the csum will be done in SW on
+ JUMBO frames.
+ o pmt: core has the embedded power module (optional).
+ o force_sf_dma_mode: force DMA to use the Store and Forward mode
+ instead of the Threshold.
+ o force_thresh_dma_mode: force DMA to use the Threshold mode other than
+ the Store and Forward mode.
+ o riwt_off: force to disable the RX watchdog feature and switch to NAPI mode.
+ o fix_mac_speed: this callback is used for modifying some syscfg registers
+ (on ST SoCs) according to the link speed negotiated by the
+ physical layer .
+ o bus_setup: perform HW setup of the bus. For example, on some ST platforms
+ this field is used to configure the AMBA bridge to generate more
+ efficient STBus traffic.
+ o setup/init/exit: callbacks used for calling a custom initialization;
+ this is sometime necessary on some platforms (e.g. ST boxes)
+ where the HW needs to have set some PIO lines or system cfg
+ registers. setup should return a pointer to private data,
+ which will be stored in bsp_priv, and then passed to init and
+ exit callbacks. init/exit callbacks should not use or modify
+ platform data.
+ o custom_cfg/custom_data: this is a custom configuration that can be passed
+ while initializing the resources.
+ o bsp_priv: another private pointer.
+
+For MDIO bus The we have:
+
+ struct stmmac_mdio_bus_data {
+ int (*phy_reset)(void *priv);
+ unsigned int phy_mask;
+ int *irqs;
+ int probed_phy_irq;
+ };
+
+Where:
+ o phy_reset: hook to reset the phy device attached to the bus.
+ o phy_mask: phy mask passed when register the MDIO bus within the driver.
+ o irqs: list of IRQs, one per PHY.
+ o probed_phy_irq: if irqs is NULL, use this for probed PHY.
+
+For DMA engine we have the following internal fields that should be
+tuned according to the HW capabilities.
+
+struct stmmac_dma_cfg {
+ int pbl;
+ int fixed_burst;
+ int burst_len_supported;
+};
+
+Where:
+ o pbl: Programmable Burst Length
+ o fixed_burst: program the DMA to use the fixed burst mode
+ o burst_len: this is the value we put in the register
+ supported values are provided as macros in
+ linux/stmmac.h header file.
+
+---
+
+Below an example how the structures above are using on ST platforms.
+
+ static struct plat_stmmacenet_data stxYYY_ethernet_platform_data = {
+ .has_gmac = 0,
+ .enh_desc = 0,
+ .fix_mac_speed = stxYYY_ethernet_fix_mac_speed,
+ |
+ |-> to write an internal syscfg
+ | on this platform when the
+ | link speed changes from 10 to
+ | 100 and viceversa
+ .init = &stmmac_claim_resource,
+ |
+ |-> On ST SoC this calls own "PAD"
+ | manager framework to claim
+ | all the resources necessary
+ | (GPIO ...). The .custom_cfg field
+ | is used to pass a custom config.
+};
+
+Below the usage of the stmmac_mdio_bus_data: on this SoC, in fact,
+there are two MAC cores: one MAC is for MDIO Bus/PHY emulation
+with fixed_link support.
+
+static struct stmmac_mdio_bus_data stmmac1_mdio_bus = {
+ .phy_reset = phy_reset;
+ |
+ |-> function to provide the phy_reset on this board
+ .phy_mask = 0,
+};
+
+static struct fixed_phy_status stmmac0_fixed_phy_status = {
+ .link = 1,
+ .speed = 100,
+ .duplex = 1,
+};
+
+During the board's device_init we can configure the first
+MAC for fixed_link by calling:
+ fixed_phy_add(PHY_POLL, 1, &stmmac0_fixed_phy_status));)
+and the second one, with a real PHY device attached to the bus,
+by using the stmmac_mdio_bus_data structure (to provide the id, the
+reset procedure etc).
+
+Note that, starting from new chips, where it is available the HW capability
+register, many configurations are discovered at run-time for example to
+understand if EEE, HW csum, PTP, enhanced descriptor etc are actually
+available. As strategy adopted in this driver, the information from the HW
+capability register can replace what has been passed from the platform.
+
+4.10) Device-tree support.
+
+Please see the following document:
+ Documentation/devicetree/bindings/net/stmmac.txt
+
+and the stmmac_of_data structure inside the include/linux/stmmac.h header file.
+
+4.11) This is a summary of the content of some relevant files:
+ o stmmac_main.c: to implement the main network device driver;
+ o stmmac_mdio.c: to provide mdio functions;
+ o stmmac_pci: this the PCI driver;
+ o stmmac_platform.c: this the platform driver (OF supported)
+ o stmmac_ethtool.c: to implement the ethtool support;
+ o stmmac.h: private driver structure;
+ o common.h: common definitions and VFTs;
+ o descs.h: descriptor structure definitions;
+ o dwmac1000_core.c: dwmac GiGa core functions;
+ o dwmac1000_dma.c: dma functions for the GMAC chip;
+ o dwmac1000.h: specific header file for the dwmac GiGa;
+ o dwmac100_core: dwmac 100 core code;
+ o dwmac100_dma.c: dma functions for the dwmac 100 chip;
+ o dwmac1000.h: specific header file for the MAC;
+ o dwmac_lib.c: generic DMA functions;
+ o enh_desc.c: functions for handling enhanced descriptors;
+ o norm_desc.c: functions for handling normal descriptors;
+ o chain_mode.c/ring_mode.c:: functions to manage RING/CHAINED modes;
+ o mmc_core.c/mmc.h: Management MAC Counters;
+ o stmmac_hwtstamp.c: HW timestamp support for PTP;
+ o stmmac_ptp.c: PTP 1588 clock;
+ o dwmac-<XXX>.c: these are for the platform glue-logic file; e.g. dwmac-sti.c
+ for STMicroelectronics SoCs.
+
+5) Debug Information
+
+The driver exports many information i.e. internal statistics,
+debug information, MAC and DMA registers etc.
+
+These can be read in several ways depending on the
+type of the information actually needed.
+
+For example a user can be use the ethtool support
+to get statistics: e.g. using: ethtool -S ethX
+(that shows the Management counters (MMC) if supported)
+or sees the MAC/DMA registers: e.g. using: ethtool -d ethX
+
+Compiling the Kernel with CONFIG_DEBUG_FS the driver will export the following
+debugfs entries:
+
+/sys/kernel/debug/stmmaceth/descriptors_status
+ To show the DMA TX/RX descriptor rings
+
+Developer can also use the "debug" module parameter to get further debug
+information (please see: NETIF Msg Level).
+
+6) Energy Efficient Ethernet
+
+Energy Efficient Ethernet(EEE) enables IEEE 802.3 MAC sublayer along
+with a family of Physical layer to operate in the Low power Idle(LPI)
+mode. The EEE mode supports the IEEE 802.3 MAC operation at 100Mbps,
+1000Mbps & 10Gbps.
+
+The LPI mode allows power saving by switching off parts of the
+communication device functionality when there is no data to be
+transmitted & received. The system on both the side of the link can
+disable some functionalities & save power during the period of low-link
+utilization. The MAC controls whether the system should enter or exit
+the LPI mode & communicate this to PHY.
+
+As soon as the interface is opened, the driver verifies if the EEE can
+be supported. This is done by looking at both the DMA HW capability
+register and the PHY devices MCD registers.
+To enter in Tx LPI mode the driver needs to have a software timer
+that enable and disable the LPI mode when there is nothing to be
+transmitted.
+
+7) Precision Time Protocol (PTP)
+The driver supports the IEEE 1588-2002, Precision Time Protocol (PTP),
+which enables precise synchronization of clocks in measurement and
+control systems implemented with technologies such as network
+communication.
+
+In addition to the basic timestamp features mentioned in IEEE 1588-2002
+Timestamps, new GMAC cores support the advanced timestamp features.
+IEEE 1588-2008 that can be enabled when configure the Kernel.
+
+8) SGMII/RGMII supports
+New GMAC devices provide own way to manage RGMII/SGMII.
+This information is available at run-time by looking at the
+HW capability register. This means that the stmmac can manage
+auto-negotiation and link status w/o using the PHYLIB stuff
+In fact, the HW provides a subset of extended registers to
+restart the ANE, verify Full/Half duplex mode and Speed.
+Also thanks to these registers it is possible to look at the
+Auto-negotiated Link Parter Ability.
diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.txt
new file mode 100644
index 000000000..f981a9295
--- /dev/null
+++ b/Documentation/networking/switchdev.txt
@@ -0,0 +1,59 @@
+Switch (and switch-ish) device drivers HOWTO
+===========================
+
+Please note that the word "switch" is here used in very generic meaning.
+This include devices supporting L2/L3 but also various flow offloading chips,
+including switches embedded into SR-IOV NICs.
+
+Lets describe a topology a bit. Imagine the following example:
+
+ +----------------------------+ +---------------+
+ | SOME switch chip | | CPU |
+ +----------------------------+ +---------------+
+ port1 port2 port3 port4 MNGMNT | PCI-E |
+ | | | | | +---------------+
+ PHY PHY | | | | NIC0 NIC1
+ | | | | | |
+ | | +- PCI-E -+ | |
+ | +------- MII -------+ |
+ +------------- MII ------------+
+
+In this example, there are two independent lines between the switch silicon
+and CPU. NIC0 and NIC1 drivers are not aware of a switch presence. They are
+separate from the switch driver. SOME switch chip is by managed by a driver
+via PCI-E device MNGMNT. Note that MNGMNT device, NIC0 and NIC1 may be
+connected to some other type of bus.
+
+Now, for the previous example show the representation in kernel:
+
+ +----------------------------+ +---------------+
+ | SOME switch chip | | CPU |
+ +----------------------------+ +---------------+
+ sw0p0 sw0p1 sw0p2 sw0p3 MNGMNT | PCI-E |
+ | | | | | +---------------+
+ PHY PHY | | | | eth0 eth1
+ | | | | | |
+ | | +- PCI-E -+ | |
+ | +------- MII -------+ |
+ +------------- MII ------------+
+
+Lets call the example switch driver for SOME switch chip "SOMEswitch". This
+driver takes care of PCI-E device MNGMNT. There is a netdevice instance sw0pX
+created for each port of a switch. These netdevices are instances
+of "SOMEswitch" driver. sw0pX netdevices serve as a "representation"
+of the switch chip. eth0 and eth1 are instances of some other existing driver.
+
+The only difference of the switch-port netdevice from the ordinary netdevice
+is that is implements couple more NDOs:
+
+ ndo_switch_parent_id_get - This returns the same ID for two port netdevices
+ of the same physical switch chip. This is
+ mandatory to be implemented by all switch drivers
+ and serves the caller for recognition of a port
+ netdevice.
+ ndo_switch_parent_* - Functions that serve for a manipulation of the switch
+ chip itself (it can be though of as a "parent" of the
+ port, therefore the name). They are not port-specific.
+ Caller might use arbitrary port netdevice of the same
+ switch and it will make no difference.
+ ndo_switch_port_* - Functions that serve for a port-specific manipulation.
diff --git a/Documentation/networking/tc-actions-env-rules.txt b/Documentation/networking/tc-actions-env-rules.txt
new file mode 100644
index 000000000..70d6cf608
--- /dev/null
+++ b/Documentation/networking/tc-actions-env-rules.txt
@@ -0,0 +1,30 @@
+
+The "environmental" rules for authors of any new tc actions are:
+
+1) If you stealeth or borroweth any packet thou shalt be branching
+from the righteous path and thou shalt cloneth.
+
+For example if your action queues a packet to be processed later,
+or intentionally branches by redirecting a packet, then you need to
+clone the packet.
+
+There are certain fields in the skb tc_verd that need to be reset so we
+avoid loops, etc. A few are generic enough that skb_act_clone()
+resets them for you, so invoke skb_act_clone() rather than skb_clone().
+
+2) If you munge any packet thou shalt call pskb_expand_head in the case
+someone else is referencing the skb. After that you "own" the skb.
+You must also tell us if it is ok to munge the packet (TC_OK2MUNGE),
+this way any action downstream can stomp on the packet.
+
+3) Dropping packets you don't own is a no-no. You simply return
+TC_ACT_SHOT to the caller and they will drop it.
+
+The "environmental" rules for callers of actions (qdiscs etc) are:
+
+*) Thou art responsible for freeing anything returned as being
+TC_ACT_SHOT/STOLEN/QUEUED. If none of TC_ACT_SHOT/STOLEN/QUEUED is
+returned, then all is great and you don't need to do anything.
+
+Post on netdev if something is unclear.
+
diff --git a/Documentation/networking/tcp-thin.txt b/Documentation/networking/tcp-thin.txt
new file mode 100644
index 000000000..151e22998
--- /dev/null
+++ b/Documentation/networking/tcp-thin.txt
@@ -0,0 +1,47 @@
+Thin-streams and TCP
+====================
+A wide range of Internet-based services that use reliable transport
+protocols display what we call thin-stream properties. This means
+that the application sends data with such a low rate that the
+retransmission mechanisms of the transport protocol are not fully
+effective. In time-dependent scenarios (like online games, control
+systems, stock trading etc.) where the user experience depends
+on the data delivery latency, packet loss can be devastating for
+the service quality. Extreme latencies are caused by TCP's
+dependency on the arrival of new data from the application to trigger
+retransmissions effectively through fast retransmit instead of
+waiting for long timeouts.
+
+After analysing a large number of time-dependent interactive
+applications, we have seen that they often produce thin streams
+and also stay with this traffic pattern throughout its entire
+lifespan. The combination of time-dependency and the fact that the
+streams provoke high latencies when using TCP is unfortunate.
+
+In order to reduce application-layer latency when packets are lost,
+a set of mechanisms has been made, which address these latency issues
+for thin streams. In short, if the kernel detects a thin stream,
+the retransmission mechanisms are modified in the following manner:
+
+1) If the stream is thin, fast retransmit on the first dupACK.
+2) If the stream is thin, do not apply exponential backoff.
+
+These enhancements are applied only if the stream is detected as
+thin. This is accomplished by defining a threshold for the number
+of packets in flight. If there are less than 4 packets in flight,
+fast retransmissions can not be triggered, and the stream is prone
+to experience high retransmission latencies.
+
+Since these mechanisms are targeted at time-dependent applications,
+they must be specifically activated by the application using the
+TCP_THIN_LINEAR_TIMEOUTS and TCP_THIN_DUPACK IOCTLS or the
+tcp_thin_linear_timeouts and tcp_thin_dupack sysctls. Both
+modifications are turned off by default.
+
+References
+==========
+More information on the modifications, as well as a wide range of
+experimental data can be found here:
+"Improving latency for interactive, thin-stream applications over
+reliable transport"
+http://simula.no/research/nd/publications/Simula.nd.477/simula_pdf_file
diff --git a/Documentation/networking/tcp.txt b/Documentation/networking/tcp.txt
new file mode 100644
index 000000000..bdc4c0db5
--- /dev/null
+++ b/Documentation/networking/tcp.txt
@@ -0,0 +1,106 @@
+TCP protocol
+============
+
+Last updated: 9 February 2008
+
+Contents
+========
+
+- Congestion control
+- How the new TCP output machine [nyi] works
+
+Congestion control
+==================
+
+The following variables are used in the tcp_sock for congestion control:
+snd_cwnd The size of the congestion window
+snd_ssthresh Slow start threshold. We are in slow start if
+ snd_cwnd is less than this.
+snd_cwnd_cnt A counter used to slow down the rate of increase
+ once we exceed slow start threshold.
+snd_cwnd_clamp This is the maximum size that snd_cwnd can grow to.
+snd_cwnd_stamp Timestamp for when congestion window last validated.
+snd_cwnd_used Used as a highwater mark for how much of the
+ congestion window is in use. It is used to adjust
+ snd_cwnd down when the link is limited by the
+ application rather than the network.
+
+As of 2.6.13, Linux supports pluggable congestion control algorithms.
+A congestion control mechanism can be registered through functions in
+tcp_cong.c. The functions used by the congestion control mechanism are
+registered via passing a tcp_congestion_ops struct to
+tcp_register_congestion_control. As a minimum name, ssthresh,
+cong_avoid must be valid.
+
+Private data for a congestion control mechanism is stored in tp->ca_priv.
+tcp_ca(tp) returns a pointer to this space. This is preallocated space - it
+is important to check the size of your private data will fit this space, or
+alternatively space could be allocated elsewhere and a pointer to it could
+be stored here.
+
+There are three kinds of congestion control algorithms currently: The
+simplest ones are derived from TCP reno (highspeed, scalable) and just
+provide an alternative the congestion window calculation. More complex
+ones like BIC try to look at other events to provide better
+heuristics. There are also round trip time based algorithms like
+Vegas and Westwood+.
+
+Good TCP congestion control is a complex problem because the algorithm
+needs to maintain fairness and performance. Please review current
+research and RFC's before developing new modules.
+
+The method that is used to determine which congestion control mechanism is
+determined by the setting of the sysctl net.ipv4.tcp_congestion_control.
+The default congestion control will be the last one registered (LIFO);
+so if you built everything as modules, the default will be reno. If you
+build with the defaults from Kconfig, then CUBIC will be builtin (not a
+module) and it will end up the default.
+
+If you really want a particular default value then you will need
+to set it with the sysctl. If you use a sysctl, the module will be autoloaded
+if needed and you will get the expected protocol. If you ask for an
+unknown congestion method, then the sysctl attempt will fail.
+
+If you remove a tcp congestion control module, then you will get the next
+available one. Since reno cannot be built as a module, and cannot be
+deleted, it will always be available.
+
+How the new TCP output machine [nyi] works.
+===========================================
+
+Data is kept on a single queue. The skb->users flag tells us if the frame is
+one that has been queued already. To add a frame we throw it on the end. Ack
+walks down the list from the start.
+
+We keep a set of control flags
+
+
+ sk->tcp_pend_event
+
+ TCP_PEND_ACK Ack needed
+ TCP_ACK_NOW Needed now
+ TCP_WINDOW Window update check
+ TCP_WINZERO Zero probing
+
+
+ sk->transmit_queue The transmission frame begin
+ sk->transmit_new First new frame pointer
+ sk->transmit_end Where to add frames
+
+ sk->tcp_last_tx_ack Last ack seen
+ sk->tcp_dup_ack Dup ack count for fast retransmit
+
+
+Frames are queued for output by tcp_write. We do our best to send the frames
+off immediately if possible, but otherwise queue and compute the body
+checksum in the copy.
+
+When a write is done we try to clear any pending events and piggy back them.
+If the window is full we queue full sized frames. On the first timeout in
+zero window we split this.
+
+On a timer we walk the retransmit list to send any retransmits, update the
+backoff timers etc. A change of route table stamp causes a change of header
+and recompute. We add any new tcp level headers and refinish the checksum
+before sending.
+
diff --git a/Documentation/networking/team.txt b/Documentation/networking/team.txt
new file mode 100644
index 000000000..5a013686b
--- /dev/null
+++ b/Documentation/networking/team.txt
@@ -0,0 +1,2 @@
+Team devices are driven from userspace via libteam library which is here:
+ https://github.com/jpirko/libteam
diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt
new file mode 100644
index 000000000..5f0922613
--- /dev/null
+++ b/Documentation/networking/timestamping.txt
@@ -0,0 +1,457 @@
+
+1. Control Interfaces
+
+The interfaces for receiving network packages timestamps are:
+
+* SO_TIMESTAMP
+ Generates a timestamp for each incoming packet in (not necessarily
+ monotonic) system time. Reports the timestamp via recvmsg() in a
+ control message as struct timeval (usec resolution).
+
+* SO_TIMESTAMPNS
+ Same timestamping mechanism as SO_TIMESTAMP, but reports the
+ timestamp as struct timespec (nsec resolution).
+
+* IP_MULTICAST_LOOP + SO_TIMESTAMP[NS]
+ Only for multicast:approximate transmit timestamp obtained by
+ reading the looped packet receive timestamp.
+
+* SO_TIMESTAMPING
+ Generates timestamps on reception, transmission or both. Supports
+ multiple timestamp sources, including hardware. Supports generating
+ timestamps for stream sockets.
+
+
+1.1 SO_TIMESTAMP:
+
+This socket option enables timestamping of datagrams on the reception
+path. Because the destination socket, if any, is not known early in
+the network stack, the feature has to be enabled for all packets. The
+same is true for all early receive timestamp options.
+
+For interface details, see `man 7 socket`.
+
+
+1.2 SO_TIMESTAMPNS:
+
+This option is identical to SO_TIMESTAMP except for the returned data type.
+Its struct timespec allows for higher resolution (ns) timestamps than the
+timeval of SO_TIMESTAMP (ms).
+
+
+1.3 SO_TIMESTAMPING:
+
+Supports multiple types of timestamp requests. As a result, this
+socket option takes a bitmap of flags, not a boolean. In
+
+ err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, (void *) val, &val);
+
+val is an integer with any of the following bits set. Setting other
+bit returns EINVAL and does not change the current state.
+
+
+1.3.1 Timestamp Generation
+
+Some bits are requests to the stack to try to generate timestamps. Any
+combination of them is valid. Changes to these bits apply to newly
+created packets, not to packets already in the stack. As a result, it
+is possible to selectively request timestamps for a subset of packets
+(e.g., for sampling) by embedding an send() call within two setsockopt
+calls, one to enable timestamp generation and one to disable it.
+Timestamps may also be generated for reasons other than being
+requested by a particular socket, such as when receive timestamping is
+enabled system wide, as explained earlier.
+
+SOF_TIMESTAMPING_RX_HARDWARE:
+ Request rx timestamps generated by the network adapter.
+
+SOF_TIMESTAMPING_RX_SOFTWARE:
+ Request rx timestamps when data enters the kernel. These timestamps
+ are generated just after a device driver hands a packet to the
+ kernel receive stack.
+
+SOF_TIMESTAMPING_TX_HARDWARE:
+ Request tx timestamps generated by the network adapter.
+
+SOF_TIMESTAMPING_TX_SOFTWARE:
+ Request tx timestamps when data leaves the kernel. These timestamps
+ are generated in the device driver as close as possible, but always
+ prior to, passing the packet to the network interface. Hence, they
+ require driver support and may not be available for all devices.
+
+SOF_TIMESTAMPING_TX_SCHED:
+ Request tx timestamps prior to entering the packet scheduler. Kernel
+ transmit latency is, if long, often dominated by queuing delay. The
+ difference between this timestamp and one taken at
+ SOF_TIMESTAMPING_TX_SOFTWARE will expose this latency independent
+ of protocol processing. The latency incurred in protocol
+ processing, if any, can be computed by subtracting a userspace
+ timestamp taken immediately before send() from this timestamp. On
+ machines with virtual devices where a transmitted packet travels
+ through multiple devices and, hence, multiple packet schedulers,
+ a timestamp is generated at each layer. This allows for fine
+ grained measurement of queuing delay.
+
+SOF_TIMESTAMPING_TX_ACK:
+ Request tx timestamps when all data in the send buffer has been
+ acknowledged. This only makes sense for reliable protocols. It is
+ currently only implemented for TCP. For that protocol, it may
+ over-report measurement, because the timestamp is generated when all
+ data up to and including the buffer at send() was acknowledged: the
+ cumulative acknowledgment. The mechanism ignores SACK and FACK.
+
+
+1.3.2 Timestamp Reporting
+
+The other three bits control which timestamps will be reported in a
+generated control message. Changes to the bits take immediate
+effect at the timestamp reporting locations in the stack. Timestamps
+are only reported for packets that also have the relevant timestamp
+generation request set.
+
+SOF_TIMESTAMPING_SOFTWARE:
+ Report any software timestamps when available.
+
+SOF_TIMESTAMPING_SYS_HARDWARE:
+ This option is deprecated and ignored.
+
+SOF_TIMESTAMPING_RAW_HARDWARE:
+ Report hardware timestamps as generated by
+ SOF_TIMESTAMPING_TX_HARDWARE when available.
+
+
+1.3.3 Timestamp Options
+
+The interface supports the options
+
+SOF_TIMESTAMPING_OPT_ID:
+
+ Generate a unique identifier along with each packet. A process can
+ have multiple concurrent timestamping requests outstanding. Packets
+ can be reordered in the transmit path, for instance in the packet
+ scheduler. In that case timestamps will be queued onto the error
+ queue out of order from the original send() calls. It is not always
+ possible to uniquely match timestamps to the original send() calls
+ based on timestamp order or payload inspection alone, then.
+
+ This option associates each packet at send() with a unique
+ identifier and returns that along with the timestamp. The identifier
+ is derived from a per-socket u32 counter (that wraps). For datagram
+ sockets, the counter increments with each sent packet. For stream
+ sockets, it increments with every byte.
+
+ The counter starts at zero. It is initialized the first time that
+ the socket option is enabled. It is reset each time the option is
+ enabled after having been disabled. Resetting the counter does not
+ change the identifiers of existing packets in the system.
+
+ This option is implemented only for transmit timestamps. There, the
+ timestamp is always looped along with a struct sock_extended_err.
+ The option modifies field ee_data to pass an id that is unique
+ among all possibly concurrently outstanding timestamp requests for
+ that socket.
+
+
+SOF_TIMESTAMPING_OPT_CMSG:
+
+ Support recv() cmsg for all timestamped packets. Control messages
+ are already supported unconditionally on all packets with receive
+ timestamps and on IPv6 packets with transmit timestamp. This option
+ extends them to IPv4 packets with transmit timestamp. One use case
+ is to correlate packets with their egress device, by enabling socket
+ option IP_PKTINFO simultaneously.
+
+
+SOF_TIMESTAMPING_OPT_TSONLY:
+
+ Applies to transmit timestamps only. Makes the kernel return the
+ timestamp as a cmsg alongside an empty packet, as opposed to
+ alongside the original packet. This reduces the amount of memory
+ charged to the socket's receive budget (SO_RCVBUF) and delivers
+ the timestamp even if sysctl net.core.tstamp_allow_data is 0.
+ This option disables SOF_TIMESTAMPING_OPT_CMSG.
+
+
+New applications are encouraged to pass SOF_TIMESTAMPING_OPT_ID to
+disambiguate timestamps and SOF_TIMESTAMPING_OPT_TSONLY to operate
+regardless of the setting of sysctl net.core.tstamp_allow_data.
+
+An exception is when a process needs additional cmsg data, for
+instance SOL_IP/IP_PKTINFO to detect the egress network interface.
+Then pass option SOF_TIMESTAMPING_OPT_CMSG. This option depends on
+having access to the contents of the original packet, so cannot be
+combined with SOF_TIMESTAMPING_OPT_TSONLY.
+
+
+1.4 Bytestream Timestamps
+
+The SO_TIMESTAMPING interface supports timestamping of bytes in a
+bytestream. Each request is interpreted as a request for when the
+entire contents of the buffer has passed a timestamping point. That
+is, for streams option SOF_TIMESTAMPING_TX_SOFTWARE will record
+when all bytes have reached the device driver, regardless of how
+many packets the data has been converted into.
+
+In general, bytestreams have no natural delimiters and therefore
+correlating a timestamp with data is non-trivial. A range of bytes
+may be split across segments, any segments may be merged (possibly
+coalescing sections of previously segmented buffers associated with
+independent send() calls). Segments can be reordered and the same
+byte range can coexist in multiple segments for protocols that
+implement retransmissions.
+
+It is essential that all timestamps implement the same semantics,
+regardless of these possible transformations, as otherwise they are
+incomparable. Handling "rare" corner cases differently from the
+simple case (a 1:1 mapping from buffer to skb) is insufficient
+because performance debugging often needs to focus on such outliers.
+
+In practice, timestamps can be correlated with segments of a
+bytestream consistently, if both semantics of the timestamp and the
+timing of measurement are chosen correctly. This challenge is no
+different from deciding on a strategy for IP fragmentation. There, the
+definition is that only the first fragment is timestamped. For
+bytestreams, we chose that a timestamp is generated only when all
+bytes have passed a point. SOF_TIMESTAMPING_TX_ACK as defined is easy to
+implement and reason about. An implementation that has to take into
+account SACK would be more complex due to possible transmission holes
+and out of order arrival.
+
+On the host, TCP can also break the simple 1:1 mapping from buffer to
+skbuff as a result of Nagle, cork, autocork, segmentation and GSO. The
+implementation ensures correctness in all cases by tracking the
+individual last byte passed to send(), even if it is no longer the
+last byte after an skbuff extend or merge operation. It stores the
+relevant sequence number in skb_shinfo(skb)->tskey. Because an skbuff
+has only one such field, only one timestamp can be generated.
+
+In rare cases, a timestamp request can be missed if two requests are
+collapsed onto the same skb. A process can detect this situation by
+enabling SOF_TIMESTAMPING_OPT_ID and comparing the byte offset at
+send time with the value returned for each timestamp. It can prevent
+the situation by always flushing the TCP stack in between requests,
+for instance by enabling TCP_NODELAY and disabling TCP_CORK and
+autocork.
+
+These precautions ensure that the timestamp is generated only when all
+bytes have passed a timestamp point, assuming that the network stack
+itself does not reorder the segments. The stack indeed tries to avoid
+reordering. The one exception is under administrator control: it is
+possible to construct a packet scheduler configuration that delays
+segments from the same stream differently. Such a setup would be
+unusual.
+
+
+2 Data Interfaces
+
+Timestamps are read using the ancillary data feature of recvmsg().
+See `man 3 cmsg` for details of this interface. The socket manual
+page (`man 7 socket`) describes how timestamps generated with
+SO_TIMESTAMP and SO_TIMESTAMPNS records can be retrieved.
+
+
+2.1 SCM_TIMESTAMPING records
+
+These timestamps are returned in a control message with cmsg_level
+SOL_SOCKET, cmsg_type SCM_TIMESTAMPING, and payload of type
+
+struct scm_timestamping {
+ struct timespec ts[3];
+};
+
+The structure can return up to three timestamps. This is a legacy
+feature. Only one field is non-zero at any time. Most timestamps
+are passed in ts[0]. Hardware timestamps are passed in ts[2].
+
+ts[1] used to hold hardware timestamps converted to system time.
+Instead, expose the hardware clock device on the NIC directly as
+a HW PTP clock source, to allow time conversion in userspace and
+optionally synchronize system time with a userspace PTP stack such
+as linuxptp. For the PTP clock API, see Documentation/ptp/ptp.txt.
+
+2.1.1 Transmit timestamps with MSG_ERRQUEUE
+
+For transmit timestamps the outgoing packet is looped back to the
+socket's error queue with the send timestamp(s) attached. A process
+receives the timestamps by calling recvmsg() with flag MSG_ERRQUEUE
+set and with a msg_control buffer sufficiently large to receive the
+relevant metadata structures. The recvmsg call returns the original
+outgoing data packet with two ancillary messages attached.
+
+A message of cm_level SOL_IP(V6) and cm_type IP(V6)_RECVERR
+embeds a struct sock_extended_err. This defines the error type. For
+timestamps, the ee_errno field is ENOMSG. The other ancillary message
+will have cm_level SOL_SOCKET and cm_type SCM_TIMESTAMPING. This
+embeds the struct scm_timestamping.
+
+
+2.1.1.2 Timestamp types
+
+The semantics of the three struct timespec are defined by field
+ee_info in the extended error structure. It contains a value of
+type SCM_TSTAMP_* to define the actual timestamp passed in
+scm_timestamping.
+
+The SCM_TSTAMP_* types are 1:1 matches to the SOF_TIMESTAMPING_*
+control fields discussed previously, with one exception. For legacy
+reasons, SCM_TSTAMP_SND is equal to zero and can be set for both
+SOF_TIMESTAMPING_TX_HARDWARE and SOF_TIMESTAMPING_TX_SOFTWARE. It
+is the first if ts[2] is non-zero, the second otherwise, in which
+case the timestamp is stored in ts[0].
+
+
+2.1.1.3 Fragmentation
+
+Fragmentation of outgoing datagrams is rare, but is possible, e.g., by
+explicitly disabling PMTU discovery. If an outgoing packet is fragmented,
+then only the first fragment is timestamped and returned to the sending
+socket.
+
+
+2.1.1.4 Packet Payload
+
+The calling application is often not interested in receiving the whole
+packet payload that it passed to the stack originally: the socket
+error queue mechanism is just a method to piggyback the timestamp on.
+In this case, the application can choose to read datagrams with a
+smaller buffer, possibly even of length 0. The payload is truncated
+accordingly. Until the process calls recvmsg() on the error queue,
+however, the full packet is queued, taking up budget from SO_RCVBUF.
+
+
+2.1.1.5 Blocking Read
+
+Reading from the error queue is always a non-blocking operation. To
+block waiting on a timestamp, use poll or select. poll() will return
+POLLERR in pollfd.revents if any data is ready on the error queue.
+There is no need to pass this flag in pollfd.events. This flag is
+ignored on request. See also `man 2 poll`.
+
+
+2.1.2 Receive timestamps
+
+On reception, there is no reason to read from the socket error queue.
+The SCM_TIMESTAMPING ancillary data is sent along with the packet data
+on a normal recvmsg(). Since this is not a socket error, it is not
+accompanied by a message SOL_IP(V6)/IP(V6)_RECVERROR. In this case,
+the meaning of the three fields in struct scm_timestamping is
+implicitly defined. ts[0] holds a software timestamp if set, ts[1]
+is again deprecated and ts[2] holds a hardware timestamp if set.
+
+
+3. Hardware Timestamping configuration: SIOCSHWTSTAMP and SIOCGHWTSTAMP
+
+Hardware time stamping must also be initialized for each device driver
+that is expected to do hardware time stamping. The parameter is defined in
+/include/linux/net_tstamp.h as:
+
+struct hwtstamp_config {
+ int flags; /* no flags defined right now, must be zero */
+ int tx_type; /* HWTSTAMP_TX_* */
+ int rx_filter; /* HWTSTAMP_FILTER_* */
+};
+
+Desired behavior is passed into the kernel and to a specific device by
+calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose
+ifr_data points to a struct hwtstamp_config. The tx_type and
+rx_filter are hints to the driver what it is expected to do. If
+the requested fine-grained filtering for incoming packets is not
+supported, the driver may time stamp more than just the requested types
+of packets.
+
+A driver which supports hardware time stamping shall update the struct
+with the actual, possibly more permissive configuration. If the
+requested packets cannot be time stamped, then nothing should be
+changed and ERANGE shall be returned (in contrast to EINVAL, which
+indicates that SIOCSHWTSTAMP is not supported at all).
+
+Only a processes with admin rights may change the configuration. User
+space is responsible to ensure that multiple processes don't interfere
+with each other and that the settings are reset.
+
+Any process can read the actual configuration by passing this
+structure to ioctl(SIOCGHWTSTAMP) in the same way. However, this has
+not been implemented in all drivers.
+
+/* possible values for hwtstamp_config->tx_type */
+enum {
+ /*
+ * no outgoing packet will need hardware time stamping;
+ * should a packet arrive which asks for it, no hardware
+ * time stamping will be done
+ */
+ HWTSTAMP_TX_OFF,
+
+ /*
+ * enables hardware time stamping for outgoing packets;
+ * the sender of the packet decides which are to be
+ * time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE
+ * before sending the packet
+ */
+ HWTSTAMP_TX_ON,
+};
+
+/* possible values for hwtstamp_config->rx_filter */
+enum {
+ /* time stamp no incoming packet at all */
+ HWTSTAMP_FILTER_NONE,
+
+ /* time stamp any incoming packet */
+ HWTSTAMP_FILTER_ALL,
+
+ /* return value: time stamp all packets requested plus some others */
+ HWTSTAMP_FILTER_SOME,
+
+ /* PTP v1, UDP, any kind of event packet */
+ HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
+
+ /* for the complete list of values, please check
+ * the include file /include/linux/net_tstamp.h
+ */
+};
+
+3.1 Hardware Timestamping Implementation: Device Drivers
+
+A driver which supports hardware time stamping must support the
+SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with
+the actual values as described in the section on SIOCSHWTSTAMP. It
+should also support SIOCGHWTSTAMP.
+
+Time stamps for received packets must be stored in the skb. To get a pointer
+to the shared time stamp structure of the skb call skb_hwtstamps(). Then
+set the time stamps in the structure:
+
+struct skb_shared_hwtstamps {
+ /* hardware time stamp transformed into duration
+ * since arbitrary point in time
+ */
+ ktime_t hwtstamp;
+};
+
+Time stamps for outgoing packets are to be generated as follows:
+- In hard_start_xmit(), check if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)
+ is set no-zero. If yes, then the driver is expected to do hardware time
+ stamping.
+- If this is possible for the skb and requested, then declare
+ that the driver is doing the time stamping by setting the flag
+ SKBTX_IN_PROGRESS in skb_shinfo(skb)->tx_flags , e.g. with
+
+ skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
+
+ You might want to keep a pointer to the associated skb for the next step
+ and not free the skb. A driver not supporting hardware time stamping doesn't
+ do that. A driver must never touch sk_buff::tstamp! It is used to store
+ software generated time stamps by the network subsystem.
+- Driver should call skb_tx_timestamp() as close to passing sk_buff to hardware
+ as possible. skb_tx_timestamp() provides a software time stamp if requested
+ and hardware timestamping is not possible (SKBTX_IN_PROGRESS not set).
+- As soon as the driver has sent the packet and/or obtained a
+ hardware time stamp for it, it passes the time stamp back by
+ calling skb_hwtstamp_tx() with the original skb, the raw
+ hardware time stamp. skb_hwtstamp_tx() clones the original skb and
+ adds the timestamps, therefore the original skb has to be freed now.
+ If obtaining the hardware time stamp somehow fails, then the driver
+ should not fall back to software time stamping. The rationale is that
+ this would occur at a later time in the processing pipeline than other
+ software time stamping and therefore could lead to unexpected deltas
+ between time stamps.
diff --git a/Documentation/networking/timestamping/.gitignore b/Documentation/networking/timestamping/.gitignore
new file mode 100644
index 000000000..9e69e982f
--- /dev/null
+++ b/Documentation/networking/timestamping/.gitignore
@@ -0,0 +1,3 @@
+timestamping
+txtimestamp
+hwtstamp_config
diff --git a/Documentation/networking/timestamping/Makefile b/Documentation/networking/timestamping/Makefile
new file mode 100644
index 000000000..8c20dfaa4
--- /dev/null
+++ b/Documentation/networking/timestamping/Makefile
@@ -0,0 +1,14 @@
+# To compile, from the source root
+#
+# make headers_install
+# make M=documentation
+
+# List of programs to build
+hostprogs-y := hwtstamp_config timestamping txtimestamp
+
+# Tell kbuild to always build the programs
+always := $(hostprogs-y)
+
+HOSTCFLAGS_timestamping.o += -I$(objtree)/usr/include
+HOSTCFLAGS_txtimestamp.o += -I$(objtree)/usr/include
+HOSTCFLAGS_hwtstamp_config.o += -I$(objtree)/usr/include
diff --git a/Documentation/networking/timestamping/hwtstamp_config.c b/Documentation/networking/timestamping/hwtstamp_config.c
new file mode 100644
index 000000000..e8b685a7f
--- /dev/null
+++ b/Documentation/networking/timestamping/hwtstamp_config.c
@@ -0,0 +1,134 @@
+/* Test program for SIOC{G,S}HWTSTAMP
+ * Copyright 2013 Solarflare Communications
+ * Author: Ben Hutchings
+ */
+
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+
+#include <linux/if.h>
+#include <linux/net_tstamp.h>
+#include <linux/sockios.h>
+
+static int
+lookup_value(const char **names, int size, const char *name)
+{
+ int value;
+
+ for (value = 0; value < size; value++)
+ if (names[value] && strcasecmp(names[value], name) == 0)
+ return value;
+
+ return -1;
+}
+
+static const char *
+lookup_name(const char **names, int size, int value)
+{
+ return (value >= 0 && value < size) ? names[value] : NULL;
+}
+
+static void list_names(FILE *f, const char **names, int size)
+{
+ int value;
+
+ for (value = 0; value < size; value++)
+ if (names[value])
+ fprintf(f, " %s\n", names[value]);
+}
+
+static const char *tx_types[] = {
+#define TX_TYPE(name) [HWTSTAMP_TX_ ## name] = #name
+ TX_TYPE(OFF),
+ TX_TYPE(ON),
+ TX_TYPE(ONESTEP_SYNC)
+#undef TX_TYPE
+};
+#define N_TX_TYPES ((int)(sizeof(tx_types) / sizeof(tx_types[0])))
+
+static const char *rx_filters[] = {
+#define RX_FILTER(name) [HWTSTAMP_FILTER_ ## name] = #name
+ RX_FILTER(NONE),
+ RX_FILTER(ALL),
+ RX_FILTER(SOME),
+ RX_FILTER(PTP_V1_L4_EVENT),
+ RX_FILTER(PTP_V1_L4_SYNC),
+ RX_FILTER(PTP_V1_L4_DELAY_REQ),
+ RX_FILTER(PTP_V2_L4_EVENT),
+ RX_FILTER(PTP_V2_L4_SYNC),
+ RX_FILTER(PTP_V2_L4_DELAY_REQ),
+ RX_FILTER(PTP_V2_L2_EVENT),
+ RX_FILTER(PTP_V2_L2_SYNC),
+ RX_FILTER(PTP_V2_L2_DELAY_REQ),
+ RX_FILTER(PTP_V2_EVENT),
+ RX_FILTER(PTP_V2_SYNC),
+ RX_FILTER(PTP_V2_DELAY_REQ),
+#undef RX_FILTER
+};
+#define N_RX_FILTERS ((int)(sizeof(rx_filters) / sizeof(rx_filters[0])))
+
+static void usage(void)
+{
+ fputs("Usage: hwtstamp_config if_name [tx_type rx_filter]\n"
+ "tx_type is any of (case-insensitive):\n",
+ stderr);
+ list_names(stderr, tx_types, N_TX_TYPES);
+ fputs("rx_filter is any of (case-insensitive):\n", stderr);
+ list_names(stderr, rx_filters, N_RX_FILTERS);
+}
+
+int main(int argc, char **argv)
+{
+ struct ifreq ifr;
+ struct hwtstamp_config config;
+ const char *name;
+ int sock;
+
+ if ((argc != 2 && argc != 4) || (strlen(argv[1]) >= IFNAMSIZ)) {
+ usage();
+ return 2;
+ }
+
+ if (argc == 4) {
+ config.flags = 0;
+ config.tx_type = lookup_value(tx_types, N_TX_TYPES, argv[2]);
+ config.rx_filter = lookup_value(rx_filters, N_RX_FILTERS, argv[3]);
+ if (config.tx_type < 0 || config.rx_filter < 0) {
+ usage();
+ return 2;
+ }
+ }
+
+ sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0) {
+ perror("socket");
+ return 1;
+ }
+
+ strcpy(ifr.ifr_name, argv[1]);
+ ifr.ifr_data = (caddr_t)&config;
+
+ if (ioctl(sock, (argc == 2) ? SIOCGHWTSTAMP : SIOCSHWTSTAMP, &ifr)) {
+ perror("ioctl");
+ return 1;
+ }
+
+ printf("flags = %#x\n", config.flags);
+ name = lookup_name(tx_types, N_TX_TYPES, config.tx_type);
+ if (name)
+ printf("tx_type = %s\n", name);
+ else
+ printf("tx_type = %d\n", config.tx_type);
+ name = lookup_name(rx_filters, N_RX_FILTERS, config.rx_filter);
+ if (name)
+ printf("rx_filter = %s\n", name);
+ else
+ printf("rx_filter = %d\n", config.rx_filter);
+
+ return 0;
+}
diff --git a/Documentation/networking/timestamping/timestamping.c b/Documentation/networking/timestamping/timestamping.c
new file mode 100644
index 000000000..5cdfd7434
--- /dev/null
+++ b/Documentation/networking/timestamping/timestamping.c
@@ -0,0 +1,528 @@
+/*
+ * This program demonstrates how the various time stamping features in
+ * the Linux kernel work. It emulates the behavior of a PTP
+ * implementation in stand-alone master mode by sending PTPv1 Sync
+ * multicasts once every second. It looks for similar packets, but
+ * beyond that doesn't actually implement PTP.
+ *
+ * Outgoing packets are time stamped with SO_TIMESTAMPING with or
+ * without hardware support.
+ *
+ * Incoming packets are time stamped with SO_TIMESTAMPING with or
+ * without hardware support, SIOCGSTAMP[NS] (per-socket time stamp) and
+ * SO_TIMESTAMP[NS].
+ *
+ * Copyright (C) 2009 Intel Corporation.
+ * Author: Patrick Ohly <patrick.ohly@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. * See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <errno.h>
+#include <string.h>
+
+#include <sys/time.h>
+#include <sys/socket.h>
+#include <sys/select.h>
+#include <sys/ioctl.h>
+#include <arpa/inet.h>
+#include <net/if.h>
+
+#include <asm/types.h>
+#include <linux/net_tstamp.h>
+#include <linux/errqueue.h>
+
+#ifndef SO_TIMESTAMPING
+# define SO_TIMESTAMPING 37
+# define SCM_TIMESTAMPING SO_TIMESTAMPING
+#endif
+
+#ifndef SO_TIMESTAMPNS
+# define SO_TIMESTAMPNS 35
+#endif
+
+#ifndef SIOCGSTAMPNS
+# define SIOCGSTAMPNS 0x8907
+#endif
+
+#ifndef SIOCSHWTSTAMP
+# define SIOCSHWTSTAMP 0x89b0
+#endif
+
+static void usage(const char *error)
+{
+ if (error)
+ printf("invalid option: %s\n", error);
+ printf("timestamping interface option*\n\n"
+ "Options:\n"
+ " IP_MULTICAST_LOOP - looping outgoing multicasts\n"
+ " SO_TIMESTAMP - normal software time stamping, ms resolution\n"
+ " SO_TIMESTAMPNS - more accurate software time stamping\n"
+ " SOF_TIMESTAMPING_TX_HARDWARE - hardware time stamping of outgoing packets\n"
+ " SOF_TIMESTAMPING_TX_SOFTWARE - software fallback for outgoing packets\n"
+ " SOF_TIMESTAMPING_RX_HARDWARE - hardware time stamping of incoming packets\n"
+ " SOF_TIMESTAMPING_RX_SOFTWARE - software fallback for incoming packets\n"
+ " SOF_TIMESTAMPING_SOFTWARE - request reporting of software time stamps\n"
+ " SOF_TIMESTAMPING_RAW_HARDWARE - request reporting of raw HW time stamps\n"
+ " SIOCGSTAMP - check last socket time stamp\n"
+ " SIOCGSTAMPNS - more accurate socket time stamp\n");
+ exit(1);
+}
+
+static void bail(const char *error)
+{
+ printf("%s: %s\n", error, strerror(errno));
+ exit(1);
+}
+
+static const unsigned char sync[] = {
+ 0x00, 0x01, 0x00, 0x01,
+ 0x5f, 0x44, 0x46, 0x4c,
+ 0x54, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00,
+ 0x01, 0x01,
+
+ /* fake uuid */
+ 0x00, 0x01,
+ 0x02, 0x03, 0x04, 0x05,
+
+ 0x00, 0x01, 0x00, 0x37,
+ 0x00, 0x00, 0x00, 0x08,
+ 0x00, 0x00, 0x00, 0x00,
+ 0x49, 0x05, 0xcd, 0x01,
+ 0x29, 0xb1, 0x8d, 0xb0,
+ 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x01,
+
+ /* fake uuid */
+ 0x00, 0x01,
+ 0x02, 0x03, 0x04, 0x05,
+
+ 0x00, 0x00, 0x00, 0x37,
+ 0x00, 0x00, 0x00, 0x04,
+ 0x44, 0x46, 0x4c, 0x54,
+ 0x00, 0x00, 0xf0, 0x60,
+ 0x00, 0x01, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x01,
+ 0x00, 0x00, 0xf0, 0x60,
+ 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x04,
+ 0x44, 0x46, 0x4c, 0x54,
+ 0x00, 0x01,
+
+ /* fake uuid */
+ 0x00, 0x01,
+ 0x02, 0x03, 0x04, 0x05,
+
+ 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00,
+ 0x00, 0x00, 0x00, 0x00
+};
+
+static void sendpacket(int sock, struct sockaddr *addr, socklen_t addr_len)
+{
+ struct timeval now;
+ int res;
+
+ res = sendto(sock, sync, sizeof(sync), 0,
+ addr, addr_len);
+ gettimeofday(&now, 0);
+ if (res < 0)
+ printf("%s: %s\n", "send", strerror(errno));
+ else
+ printf("%ld.%06ld: sent %d bytes\n",
+ (long)now.tv_sec, (long)now.tv_usec,
+ res);
+}
+
+static void printpacket(struct msghdr *msg, int res,
+ char *data,
+ int sock, int recvmsg_flags,
+ int siocgstamp, int siocgstampns)
+{
+ struct sockaddr_in *from_addr = (struct sockaddr_in *)msg->msg_name;
+ struct cmsghdr *cmsg;
+ struct timeval tv;
+ struct timespec ts;
+ struct timeval now;
+
+ gettimeofday(&now, 0);
+
+ printf("%ld.%06ld: received %s data, %d bytes from %s, %zu bytes control messages\n",
+ (long)now.tv_sec, (long)now.tv_usec,
+ (recvmsg_flags & MSG_ERRQUEUE) ? "error" : "regular",
+ res,
+ inet_ntoa(from_addr->sin_addr),
+ msg->msg_controllen);
+ for (cmsg = CMSG_FIRSTHDR(msg);
+ cmsg;
+ cmsg = CMSG_NXTHDR(msg, cmsg)) {
+ printf(" cmsg len %zu: ", cmsg->cmsg_len);
+ switch (cmsg->cmsg_level) {
+ case SOL_SOCKET:
+ printf("SOL_SOCKET ");
+ switch (cmsg->cmsg_type) {
+ case SO_TIMESTAMP: {
+ struct timeval *stamp =
+ (struct timeval *)CMSG_DATA(cmsg);
+ printf("SO_TIMESTAMP %ld.%06ld",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_usec);
+ break;
+ }
+ case SO_TIMESTAMPNS: {
+ struct timespec *stamp =
+ (struct timespec *)CMSG_DATA(cmsg);
+ printf("SO_TIMESTAMPNS %ld.%09ld",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_nsec);
+ break;
+ }
+ case SO_TIMESTAMPING: {
+ struct timespec *stamp =
+ (struct timespec *)CMSG_DATA(cmsg);
+ printf("SO_TIMESTAMPING ");
+ printf("SW %ld.%09ld ",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_nsec);
+ stamp++;
+ /* skip deprecated HW transformed */
+ stamp++;
+ printf("HW raw %ld.%09ld",
+ (long)stamp->tv_sec,
+ (long)stamp->tv_nsec);
+ break;
+ }
+ default:
+ printf("type %d", cmsg->cmsg_type);
+ break;
+ }
+ break;
+ case IPPROTO_IP:
+ printf("IPPROTO_IP ");
+ switch (cmsg->cmsg_type) {
+ case IP_RECVERR: {
+ struct sock_extended_err *err =
+ (struct sock_extended_err *)CMSG_DATA(cmsg);
+ printf("IP_RECVERR ee_errno '%s' ee_origin %d => %s",
+ strerror(err->ee_errno),
+ err->ee_origin,
+#ifdef SO_EE_ORIGIN_TIMESTAMPING
+ err->ee_origin == SO_EE_ORIGIN_TIMESTAMPING ?
+ "bounced packet" : "unexpected origin"
+#else
+ "probably SO_EE_ORIGIN_TIMESTAMPING"
+#endif
+ );
+ if (res < sizeof(sync))
+ printf(" => truncated data?!");
+ else if (!memcmp(sync, data + res - sizeof(sync),
+ sizeof(sync)))
+ printf(" => GOT OUR DATA BACK (HURRAY!)");
+ break;
+ }
+ case IP_PKTINFO: {
+ struct in_pktinfo *pktinfo =
+ (struct in_pktinfo *)CMSG_DATA(cmsg);
+ printf("IP_PKTINFO interface index %u",
+ pktinfo->ipi_ifindex);
+ break;
+ }
+ default:
+ printf("type %d", cmsg->cmsg_type);
+ break;
+ }
+ break;
+ default:
+ printf("level %d type %d",
+ cmsg->cmsg_level,
+ cmsg->cmsg_type);
+ break;
+ }
+ printf("\n");
+ }
+
+ if (siocgstamp) {
+ if (ioctl(sock, SIOCGSTAMP, &tv))
+ printf(" %s: %s\n", "SIOCGSTAMP", strerror(errno));
+ else
+ printf("SIOCGSTAMP %ld.%06ld\n",
+ (long)tv.tv_sec,
+ (long)tv.tv_usec);
+ }
+ if (siocgstampns) {
+ if (ioctl(sock, SIOCGSTAMPNS, &ts))
+ printf(" %s: %s\n", "SIOCGSTAMPNS", strerror(errno));
+ else
+ printf("SIOCGSTAMPNS %ld.%09ld\n",
+ (long)ts.tv_sec,
+ (long)ts.tv_nsec);
+ }
+}
+
+static void recvpacket(int sock, int recvmsg_flags,
+ int siocgstamp, int siocgstampns)
+{
+ char data[256];
+ struct msghdr msg;
+ struct iovec entry;
+ struct sockaddr_in from_addr;
+ struct {
+ struct cmsghdr cm;
+ char control[512];
+ } control;
+ int res;
+
+ memset(&msg, 0, sizeof(msg));
+ msg.msg_iov = &entry;
+ msg.msg_iovlen = 1;
+ entry.iov_base = data;
+ entry.iov_len = sizeof(data);
+ msg.msg_name = (caddr_t)&from_addr;
+ msg.msg_namelen = sizeof(from_addr);
+ msg.msg_control = &control;
+ msg.msg_controllen = sizeof(control);
+
+ res = recvmsg(sock, &msg, recvmsg_flags|MSG_DONTWAIT);
+ if (res < 0) {
+ printf("%s %s: %s\n",
+ "recvmsg",
+ (recvmsg_flags & MSG_ERRQUEUE) ? "error" : "regular",
+ strerror(errno));
+ } else {
+ printpacket(&msg, res, data,
+ sock, recvmsg_flags,
+ siocgstamp, siocgstampns);
+ }
+}
+
+int main(int argc, char **argv)
+{
+ int so_timestamping_flags = 0;
+ int so_timestamp = 0;
+ int so_timestampns = 0;
+ int siocgstamp = 0;
+ int siocgstampns = 0;
+ int ip_multicast_loop = 0;
+ char *interface;
+ int i;
+ int enabled = 1;
+ int sock;
+ struct ifreq device;
+ struct ifreq hwtstamp;
+ struct hwtstamp_config hwconfig, hwconfig_requested;
+ struct sockaddr_in addr;
+ struct ip_mreq imr;
+ struct in_addr iaddr;
+ int val;
+ socklen_t len;
+ struct timeval next;
+
+ if (argc < 2)
+ usage(0);
+ interface = argv[1];
+
+ for (i = 2; i < argc; i++) {
+ if (!strcasecmp(argv[i], "SO_TIMESTAMP"))
+ so_timestamp = 1;
+ else if (!strcasecmp(argv[i], "SO_TIMESTAMPNS"))
+ so_timestampns = 1;
+ else if (!strcasecmp(argv[i], "SIOCGSTAMP"))
+ siocgstamp = 1;
+ else if (!strcasecmp(argv[i], "SIOCGSTAMPNS"))
+ siocgstampns = 1;
+ else if (!strcasecmp(argv[i], "IP_MULTICAST_LOOP"))
+ ip_multicast_loop = 1;
+ else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_TX_HARDWARE"))
+ so_timestamping_flags |= SOF_TIMESTAMPING_TX_HARDWARE;
+ else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_TX_SOFTWARE"))
+ so_timestamping_flags |= SOF_TIMESTAMPING_TX_SOFTWARE;
+ else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_RX_HARDWARE"))
+ so_timestamping_flags |= SOF_TIMESTAMPING_RX_HARDWARE;
+ else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_RX_SOFTWARE"))
+ so_timestamping_flags |= SOF_TIMESTAMPING_RX_SOFTWARE;
+ else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_SOFTWARE"))
+ so_timestamping_flags |= SOF_TIMESTAMPING_SOFTWARE;
+ else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_RAW_HARDWARE"))
+ so_timestamping_flags |= SOF_TIMESTAMPING_RAW_HARDWARE;
+ else
+ usage(argv[i]);
+ }
+
+ sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
+ if (sock < 0)
+ bail("socket");
+
+ memset(&device, 0, sizeof(device));
+ strncpy(device.ifr_name, interface, sizeof(device.ifr_name));
+ if (ioctl(sock, SIOCGIFADDR, &device) < 0)
+ bail("getting interface IP address");
+
+ memset(&hwtstamp, 0, sizeof(hwtstamp));
+ strncpy(hwtstamp.ifr_name, interface, sizeof(hwtstamp.ifr_name));
+ hwtstamp.ifr_data = (void *)&hwconfig;
+ memset(&hwconfig, 0, sizeof(hwconfig));
+ hwconfig.tx_type =
+ (so_timestamping_flags & SOF_TIMESTAMPING_TX_HARDWARE) ?
+ HWTSTAMP_TX_ON : HWTSTAMP_TX_OFF;
+ hwconfig.rx_filter =
+ (so_timestamping_flags & SOF_TIMESTAMPING_RX_HARDWARE) ?
+ HWTSTAMP_FILTER_PTP_V1_L4_SYNC : HWTSTAMP_FILTER_NONE;
+ hwconfig_requested = hwconfig;
+ if (ioctl(sock, SIOCSHWTSTAMP, &hwtstamp) < 0) {
+ if ((errno == EINVAL || errno == ENOTSUP) &&
+ hwconfig_requested.tx_type == HWTSTAMP_TX_OFF &&
+ hwconfig_requested.rx_filter == HWTSTAMP_FILTER_NONE)
+ printf("SIOCSHWTSTAMP: disabling hardware time stamping not possible\n");
+ else
+ bail("SIOCSHWTSTAMP");
+ }
+ printf("SIOCSHWTSTAMP: tx_type %d requested, got %d; rx_filter %d requested, got %d\n",
+ hwconfig_requested.tx_type, hwconfig.tx_type,
+ hwconfig_requested.rx_filter, hwconfig.rx_filter);
+
+ /* bind to PTP port */
+ addr.sin_family = AF_INET;
+ addr.sin_addr.s_addr = htonl(INADDR_ANY);
+ addr.sin_port = htons(319 /* PTP event port */);
+ if (bind(sock,
+ (struct sockaddr *)&addr,
+ sizeof(struct sockaddr_in)) < 0)
+ bail("bind");
+
+ /* set multicast group for outgoing packets */
+ inet_aton("224.0.1.130", &iaddr); /* alternate PTP domain 1 */
+ addr.sin_addr = iaddr;
+ imr.imr_multiaddr.s_addr = iaddr.s_addr;
+ imr.imr_interface.s_addr =
+ ((struct sockaddr_in *)&device.ifr_addr)->sin_addr.s_addr;
+ if (setsockopt(sock, IPPROTO_IP, IP_MULTICAST_IF,
+ &imr.imr_interface.s_addr, sizeof(struct in_addr)) < 0)
+ bail("set multicast");
+
+ /* join multicast group, loop our own packet */
+ if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP,
+ &imr, sizeof(struct ip_mreq)) < 0)
+ bail("join multicast group");
+
+ if (setsockopt(sock, IPPROTO_IP, IP_MULTICAST_LOOP,
+ &ip_multicast_loop, sizeof(enabled)) < 0) {
+ bail("loop multicast");
+ }
+
+ /* set socket options for time stamping */
+ if (so_timestamp &&
+ setsockopt(sock, SOL_SOCKET, SO_TIMESTAMP,
+ &enabled, sizeof(enabled)) < 0)
+ bail("setsockopt SO_TIMESTAMP");
+
+ if (so_timestampns &&
+ setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPNS,
+ &enabled, sizeof(enabled)) < 0)
+ bail("setsockopt SO_TIMESTAMPNS");
+
+ if (so_timestamping_flags &&
+ setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING,
+ &so_timestamping_flags,
+ sizeof(so_timestamping_flags)) < 0)
+ bail("setsockopt SO_TIMESTAMPING");
+
+ /* request IP_PKTINFO for debugging purposes */
+ if (setsockopt(sock, SOL_IP, IP_PKTINFO,
+ &enabled, sizeof(enabled)) < 0)
+ printf("%s: %s\n", "setsockopt IP_PKTINFO", strerror(errno));
+
+ /* verify socket options */
+ len = sizeof(val);
+ if (getsockopt(sock, SOL_SOCKET, SO_TIMESTAMP, &val, &len) < 0)
+ printf("%s: %s\n", "getsockopt SO_TIMESTAMP", strerror(errno));
+ else
+ printf("SO_TIMESTAMP %d\n", val);
+
+ if (getsockopt(sock, SOL_SOCKET, SO_TIMESTAMPNS, &val, &len) < 0)
+ printf("%s: %s\n", "getsockopt SO_TIMESTAMPNS",
+ strerror(errno));
+ else
+ printf("SO_TIMESTAMPNS %d\n", val);
+
+ if (getsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING, &val, &len) < 0) {
+ printf("%s: %s\n", "getsockopt SO_TIMESTAMPING",
+ strerror(errno));
+ } else {
+ printf("SO_TIMESTAMPING %d\n", val);
+ if (val != so_timestamping_flags)
+ printf(" not the expected value %d\n",
+ so_timestamping_flags);
+ }
+
+ /* send packets forever every five seconds */
+ gettimeofday(&next, 0);
+ next.tv_sec = (next.tv_sec + 1) / 5 * 5;
+ next.tv_usec = 0;
+ while (1) {
+ struct timeval now;
+ struct timeval delta;
+ long delta_us;
+ int res;
+ fd_set readfs, errorfs;
+
+ gettimeofday(&now, 0);
+ delta_us = (long)(next.tv_sec - now.tv_sec) * 1000000 +
+ (long)(next.tv_usec - now.tv_usec);
+ if (delta_us > 0) {
+ /* continue waiting for timeout or data */
+ delta.tv_sec = delta_us / 1000000;
+ delta.tv_usec = delta_us % 1000000;
+
+ FD_ZERO(&readfs);
+ FD_ZERO(&errorfs);
+ FD_SET(sock, &readfs);
+ FD_SET(sock, &errorfs);
+ printf("%ld.%06ld: select %ldus\n",
+ (long)now.tv_sec, (long)now.tv_usec,
+ delta_us);
+ res = select(sock + 1, &readfs, 0, &errorfs, &delta);
+ gettimeofday(&now, 0);
+ printf("%ld.%06ld: select returned: %d, %s\n",
+ (long)now.tv_sec, (long)now.tv_usec,
+ res,
+ res < 0 ? strerror(errno) : "success");
+ if (res > 0) {
+ if (FD_ISSET(sock, &readfs))
+ printf("ready for reading\n");
+ if (FD_ISSET(sock, &errorfs))
+ printf("has error\n");
+ recvpacket(sock, 0,
+ siocgstamp,
+ siocgstampns);
+ recvpacket(sock, MSG_ERRQUEUE,
+ siocgstamp,
+ siocgstampns);
+ }
+ } else {
+ /* write one packet */
+ sendpacket(sock,
+ (struct sockaddr *)&addr,
+ sizeof(addr));
+ next.tv_sec += 5;
+ continue;
+ }
+ }
+
+ return 0;
+}
diff --git a/Documentation/networking/timestamping/txtimestamp.c b/Documentation/networking/timestamping/txtimestamp.c
new file mode 100644
index 000000000..8217510d3
--- /dev/null
+++ b/Documentation/networking/timestamping/txtimestamp.c
@@ -0,0 +1,549 @@
+/*
+ * Copyright 2014 Google Inc.
+ * Author: willemb@google.com (Willem de Bruijn)
+ *
+ * Test software tx timestamping, including
+ *
+ * - SCHED, SND and ACK timestamps
+ * - RAW, UDP and TCP
+ * - IPv4 and IPv6
+ * - various packet sizes (to test GSO and TSO)
+ *
+ * Consult the command line arguments for help on running
+ * the various testcases.
+ *
+ * This test requires a dummy TCP server.
+ * A simple `nc6 [-u] -l -p $DESTPORT` will do
+ *
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. * See the GNU General Public License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#define _GNU_SOURCE
+
+#include <arpa/inet.h>
+#include <asm/types.h>
+#include <error.h>
+#include <errno.h>
+#include <linux/errqueue.h>
+#include <linux/if_ether.h>
+#include <linux/net_tstamp.h>
+#include <netdb.h>
+#include <net/if.h>
+#include <netinet/in.h>
+#include <netinet/ip.h>
+#include <netinet/udp.h>
+#include <netinet/tcp.h>
+#include <netpacket/packet.h>
+#include <poll.h>
+#include <stdarg.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/select.h>
+#include <sys/socket.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <time.h>
+#include <unistd.h>
+
+/* command line parameters */
+static int cfg_proto = SOCK_STREAM;
+static int cfg_ipproto = IPPROTO_TCP;
+static int cfg_num_pkts = 4;
+static int do_ipv4 = 1;
+static int do_ipv6 = 1;
+static int cfg_payload_len = 10;
+static bool cfg_show_payload;
+static bool cfg_do_pktinfo;
+static bool cfg_loop_nodata;
+static uint16_t dest_port = 9000;
+
+static struct sockaddr_in daddr;
+static struct sockaddr_in6 daddr6;
+static struct timespec ts_prev;
+
+static void __print_timestamp(const char *name, struct timespec *cur,
+ uint32_t key, int payload_len)
+{
+ if (!(cur->tv_sec | cur->tv_nsec))
+ return;
+
+ fprintf(stderr, " %s: %lu s %lu us (seq=%u, len=%u)",
+ name, cur->tv_sec, cur->tv_nsec / 1000,
+ key, payload_len);
+
+ if ((ts_prev.tv_sec | ts_prev.tv_nsec)) {
+ int64_t cur_ms, prev_ms;
+
+ cur_ms = (long) cur->tv_sec * 1000 * 1000;
+ cur_ms += cur->tv_nsec / 1000;
+
+ prev_ms = (long) ts_prev.tv_sec * 1000 * 1000;
+ prev_ms += ts_prev.tv_nsec / 1000;
+
+ fprintf(stderr, " (%+ld us)", cur_ms - prev_ms);
+ }
+
+ ts_prev = *cur;
+ fprintf(stderr, "\n");
+}
+
+static void print_timestamp_usr(void)
+{
+ struct timespec ts;
+ struct timeval tv; /* avoid dependency on -lrt */
+
+ gettimeofday(&tv, NULL);
+ ts.tv_sec = tv.tv_sec;
+ ts.tv_nsec = tv.tv_usec * 1000;
+
+ __print_timestamp(" USR", &ts, 0, 0);
+}
+
+static void print_timestamp(struct scm_timestamping *tss, int tstype,
+ int tskey, int payload_len)
+{
+ const char *tsname;
+
+ switch (tstype) {
+ case SCM_TSTAMP_SCHED:
+ tsname = " ENQ";
+ break;
+ case SCM_TSTAMP_SND:
+ tsname = " SND";
+ break;
+ case SCM_TSTAMP_ACK:
+ tsname = " ACK";
+ break;
+ default:
+ error(1, 0, "unknown timestamp type: %u",
+ tstype);
+ }
+ __print_timestamp(tsname, &tss->ts[0], tskey, payload_len);
+}
+
+/* TODO: convert to check_and_print payload once API is stable */
+static void print_payload(char *data, int len)
+{
+ int i;
+
+ if (!len)
+ return;
+
+ if (len > 70)
+ len = 70;
+
+ fprintf(stderr, "payload: ");
+ for (i = 0; i < len; i++)
+ fprintf(stderr, "%02hhx ", data[i]);
+ fprintf(stderr, "\n");
+}
+
+static void print_pktinfo(int family, int ifindex, void *saddr, void *daddr)
+{
+ char sa[INET6_ADDRSTRLEN], da[INET6_ADDRSTRLEN];
+
+ fprintf(stderr, " pktinfo: ifindex=%u src=%s dst=%s\n",
+ ifindex,
+ saddr ? inet_ntop(family, saddr, sa, sizeof(sa)) : "unknown",
+ daddr ? inet_ntop(family, daddr, da, sizeof(da)) : "unknown");
+}
+
+static void __poll(int fd)
+{
+ struct pollfd pollfd;
+ int ret;
+
+ memset(&pollfd, 0, sizeof(pollfd));
+ pollfd.fd = fd;
+ ret = poll(&pollfd, 1, 100);
+ if (ret != 1)
+ error(1, errno, "poll");
+}
+
+static void __recv_errmsg_cmsg(struct msghdr *msg, int payload_len)
+{
+ struct sock_extended_err *serr = NULL;
+ struct scm_timestamping *tss = NULL;
+ struct cmsghdr *cm;
+ int batch = 0;
+
+ for (cm = CMSG_FIRSTHDR(msg);
+ cm && cm->cmsg_len;
+ cm = CMSG_NXTHDR(msg, cm)) {
+ if (cm->cmsg_level == SOL_SOCKET &&
+ cm->cmsg_type == SCM_TIMESTAMPING) {
+ tss = (void *) CMSG_DATA(cm);
+ } else if ((cm->cmsg_level == SOL_IP &&
+ cm->cmsg_type == IP_RECVERR) ||
+ (cm->cmsg_level == SOL_IPV6 &&
+ cm->cmsg_type == IPV6_RECVERR)) {
+ serr = (void *) CMSG_DATA(cm);
+ if (serr->ee_errno != ENOMSG ||
+ serr->ee_origin != SO_EE_ORIGIN_TIMESTAMPING) {
+ fprintf(stderr, "unknown ip error %d %d\n",
+ serr->ee_errno,
+ serr->ee_origin);
+ serr = NULL;
+ }
+ } else if (cm->cmsg_level == SOL_IP &&
+ cm->cmsg_type == IP_PKTINFO) {
+ struct in_pktinfo *info = (void *) CMSG_DATA(cm);
+ print_pktinfo(AF_INET, info->ipi_ifindex,
+ &info->ipi_spec_dst, &info->ipi_addr);
+ } else if (cm->cmsg_level == SOL_IPV6 &&
+ cm->cmsg_type == IPV6_PKTINFO) {
+ struct in6_pktinfo *info6 = (void *) CMSG_DATA(cm);
+ print_pktinfo(AF_INET6, info6->ipi6_ifindex,
+ NULL, &info6->ipi6_addr);
+ } else
+ fprintf(stderr, "unknown cmsg %d,%d\n",
+ cm->cmsg_level, cm->cmsg_type);
+
+ if (serr && tss) {
+ print_timestamp(tss, serr->ee_info, serr->ee_data,
+ payload_len);
+ serr = NULL;
+ tss = NULL;
+ batch++;
+ }
+ }
+
+ if (batch > 1)
+ fprintf(stderr, "batched %d timestamps\n", batch);
+}
+
+static int recv_errmsg(int fd)
+{
+ static char ctrl[1024 /* overprovision*/];
+ static struct msghdr msg;
+ struct iovec entry;
+ static char *data;
+ int ret = 0;
+
+ data = malloc(cfg_payload_len);
+ if (!data)
+ error(1, 0, "malloc");
+
+ memset(&msg, 0, sizeof(msg));
+ memset(&entry, 0, sizeof(entry));
+ memset(ctrl, 0, sizeof(ctrl));
+
+ entry.iov_base = data;
+ entry.iov_len = cfg_payload_len;
+ msg.msg_iov = &entry;
+ msg.msg_iovlen = 1;
+ msg.msg_name = NULL;
+ msg.msg_namelen = 0;
+ msg.msg_control = ctrl;
+ msg.msg_controllen = sizeof(ctrl);
+
+ ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
+ if (ret == -1 && errno != EAGAIN)
+ error(1, errno, "recvmsg");
+
+ if (ret >= 0) {
+ __recv_errmsg_cmsg(&msg, ret);
+ if (cfg_show_payload)
+ print_payload(data, cfg_payload_len);
+ }
+
+ free(data);
+ return ret == -1;
+}
+
+static void do_test(int family, unsigned int opt)
+{
+ char *buf;
+ int fd, i, val = 1, total_len;
+
+ if (family == AF_INET6 && cfg_proto != SOCK_STREAM) {
+ /* due to lack of checksum generation code */
+ fprintf(stderr, "test: skipping datagram over IPv6\n");
+ return;
+ }
+
+ total_len = cfg_payload_len;
+ if (cfg_proto == SOCK_RAW) {
+ total_len += sizeof(struct udphdr);
+ if (cfg_ipproto == IPPROTO_RAW)
+ total_len += sizeof(struct iphdr);
+ }
+
+ buf = malloc(total_len);
+ if (!buf)
+ error(1, 0, "malloc");
+
+ fd = socket(family, cfg_proto, cfg_ipproto);
+ if (fd < 0)
+ error(1, errno, "socket");
+
+ if (cfg_proto == SOCK_STREAM) {
+ if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
+ (char*) &val, sizeof(val)))
+ error(1, 0, "setsockopt no nagle");
+
+ if (family == PF_INET) {
+ if (connect(fd, (void *) &daddr, sizeof(daddr)))
+ error(1, errno, "connect ipv4");
+ } else {
+ if (connect(fd, (void *) &daddr6, sizeof(daddr6)))
+ error(1, errno, "connect ipv6");
+ }
+ }
+
+ if (cfg_do_pktinfo) {
+ if (family == AF_INET6) {
+ if (setsockopt(fd, SOL_IPV6, IPV6_RECVPKTINFO,
+ &val, sizeof(val)))
+ error(1, errno, "setsockopt pktinfo ipv6");
+ } else {
+ if (setsockopt(fd, SOL_IP, IP_PKTINFO,
+ &val, sizeof(val)))
+ error(1, errno, "setsockopt pktinfo ipv4");
+ }
+ }
+
+ opt |= SOF_TIMESTAMPING_SOFTWARE |
+ SOF_TIMESTAMPING_OPT_CMSG |
+ SOF_TIMESTAMPING_OPT_ID;
+ if (cfg_loop_nodata)
+ opt |= SOF_TIMESTAMPING_OPT_TSONLY;
+
+ if (setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
+ (char *) &opt, sizeof(opt)))
+ error(1, 0, "setsockopt timestamping");
+
+ for (i = 0; i < cfg_num_pkts; i++) {
+ memset(&ts_prev, 0, sizeof(ts_prev));
+ memset(buf, 'a' + i, total_len);
+
+ if (cfg_proto == SOCK_RAW) {
+ struct udphdr *udph;
+ int off = 0;
+
+ if (cfg_ipproto == IPPROTO_RAW) {
+ struct iphdr *iph = (void *) buf;
+
+ memset(iph, 0, sizeof(*iph));
+ iph->ihl = 5;
+ iph->version = 4;
+ iph->ttl = 2;
+ iph->daddr = daddr.sin_addr.s_addr;
+ iph->protocol = IPPROTO_UDP;
+ /* kernel writes saddr, csum, len */
+
+ off = sizeof(*iph);
+ }
+
+ udph = (void *) buf + off;
+ udph->source = ntohs(9000); /* random spoof */
+ udph->dest = ntohs(dest_port);
+ udph->len = ntohs(sizeof(*udph) + cfg_payload_len);
+ udph->check = 0; /* not allowed for IPv6 */
+ }
+
+ print_timestamp_usr();
+ if (cfg_proto != SOCK_STREAM) {
+ if (family == PF_INET)
+ val = sendto(fd, buf, total_len, 0, (void *) &daddr, sizeof(daddr));
+ else
+ val = sendto(fd, buf, total_len, 0, (void *) &daddr6, sizeof(daddr6));
+ } else {
+ val = send(fd, buf, cfg_payload_len, 0);
+ }
+ if (val != total_len)
+ error(1, errno, "send");
+
+ /* wait for all errors to be queued, else ACKs arrive OOO */
+ usleep(50 * 1000);
+
+ __poll(fd);
+
+ while (!recv_errmsg(fd)) {}
+ }
+
+ if (close(fd))
+ error(1, errno, "close");
+
+ free(buf);
+ usleep(400 * 1000);
+}
+
+static void __attribute__((noreturn)) usage(const char *filepath)
+{
+ fprintf(stderr, "\nUsage: %s [options] hostname\n"
+ "\nwhere options are:\n"
+ " -4: only IPv4\n"
+ " -6: only IPv6\n"
+ " -h: show this message\n"
+ " -I: request PKTINFO\n"
+ " -l N: send N bytes at a time\n"
+ " -n: set no-payload option\n"
+ " -r: use raw\n"
+ " -R: use raw (IP_HDRINCL)\n"
+ " -p N: connect to port N\n"
+ " -u: use udp\n"
+ " -x: show payload (up to 70 bytes)\n",
+ filepath);
+ exit(1);
+}
+
+static void parse_opt(int argc, char **argv)
+{
+ int proto_count = 0;
+ char c;
+
+ while ((c = getopt(argc, argv, "46hIl:np:rRux")) != -1) {
+ switch (c) {
+ case '4':
+ do_ipv6 = 0;
+ break;
+ case '6':
+ do_ipv4 = 0;
+ break;
+ case 'I':
+ cfg_do_pktinfo = true;
+ break;
+ case 'n':
+ cfg_loop_nodata = true;
+ break;
+ case 'r':
+ proto_count++;
+ cfg_proto = SOCK_RAW;
+ cfg_ipproto = IPPROTO_UDP;
+ break;
+ case 'R':
+ proto_count++;
+ cfg_proto = SOCK_RAW;
+ cfg_ipproto = IPPROTO_RAW;
+ break;
+ case 'u':
+ proto_count++;
+ cfg_proto = SOCK_DGRAM;
+ cfg_ipproto = IPPROTO_UDP;
+ break;
+ case 'l':
+ cfg_payload_len = strtoul(optarg, NULL, 10);
+ break;
+ case 'p':
+ dest_port = strtoul(optarg, NULL, 10);
+ break;
+ case 'x':
+ cfg_show_payload = true;
+ break;
+ case 'h':
+ default:
+ usage(argv[0]);
+ }
+ }
+
+ if (!cfg_payload_len)
+ error(1, 0, "payload may not be nonzero");
+ if (cfg_proto != SOCK_STREAM && cfg_payload_len > 1472)
+ error(1, 0, "udp packet might exceed expected MTU");
+ if (!do_ipv4 && !do_ipv6)
+ error(1, 0, "pass -4 or -6, not both");
+ if (proto_count > 1)
+ error(1, 0, "pass -r, -R or -u, not multiple");
+
+ if (optind != argc - 1)
+ error(1, 0, "missing required hostname argument");
+}
+
+static void resolve_hostname(const char *hostname)
+{
+ struct addrinfo *addrs, *cur;
+ int have_ipv4 = 0, have_ipv6 = 0;
+
+ if (getaddrinfo(hostname, NULL, NULL, &addrs))
+ error(1, errno, "getaddrinfo");
+
+ cur = addrs;
+ while (cur && !have_ipv4 && !have_ipv6) {
+ if (!have_ipv4 && cur->ai_family == AF_INET) {
+ memcpy(&daddr, cur->ai_addr, sizeof(daddr));
+ daddr.sin_port = htons(dest_port);
+ have_ipv4 = 1;
+ }
+ else if (!have_ipv6 && cur->ai_family == AF_INET6) {
+ memcpy(&daddr6, cur->ai_addr, sizeof(daddr6));
+ daddr6.sin6_port = htons(dest_port);
+ have_ipv6 = 1;
+ }
+ cur = cur->ai_next;
+ }
+ if (addrs)
+ freeaddrinfo(addrs);
+
+ do_ipv4 &= have_ipv4;
+ do_ipv6 &= have_ipv6;
+}
+
+static void do_main(int family)
+{
+ fprintf(stderr, "family: %s\n",
+ family == PF_INET ? "INET" : "INET6");
+
+ fprintf(stderr, "test SND\n");
+ do_test(family, SOF_TIMESTAMPING_TX_SOFTWARE);
+
+ fprintf(stderr, "test ENQ\n");
+ do_test(family, SOF_TIMESTAMPING_TX_SCHED);
+
+ fprintf(stderr, "test ENQ + SND\n");
+ do_test(family, SOF_TIMESTAMPING_TX_SCHED |
+ SOF_TIMESTAMPING_TX_SOFTWARE);
+
+ if (cfg_proto == SOCK_STREAM) {
+ fprintf(stderr, "\ntest ACK\n");
+ do_test(family, SOF_TIMESTAMPING_TX_ACK);
+
+ fprintf(stderr, "\ntest SND + ACK\n");
+ do_test(family, SOF_TIMESTAMPING_TX_SOFTWARE |
+ SOF_TIMESTAMPING_TX_ACK);
+
+ fprintf(stderr, "\ntest ENQ + SND + ACK\n");
+ do_test(family, SOF_TIMESTAMPING_TX_SCHED |
+ SOF_TIMESTAMPING_TX_SOFTWARE |
+ SOF_TIMESTAMPING_TX_ACK);
+ }
+}
+
+const char *sock_names[] = { NULL, "TCP", "UDP", "RAW" };
+
+int main(int argc, char **argv)
+{
+ if (argc == 1)
+ usage(argv[0]);
+
+ parse_opt(argc, argv);
+ resolve_hostname(argv[argc - 1]);
+
+ fprintf(stderr, "protocol: %s\n", sock_names[cfg_proto]);
+ fprintf(stderr, "payload: %u\n", cfg_payload_len);
+ fprintf(stderr, "server port: %u\n", dest_port);
+ fprintf(stderr, "\n");
+
+ if (do_ipv4)
+ do_main(PF_INET);
+ if (do_ipv6)
+ do_main(PF_INET6);
+
+ return 0;
+}
diff --git a/Documentation/networking/tlan.txt b/Documentation/networking/tlan.txt
new file mode 100644
index 000000000..34550dfce
--- /dev/null
+++ b/Documentation/networking/tlan.txt
@@ -0,0 +1,117 @@
+(C) 1997-1998 Caldera, Inc.
+(C) 1998 James Banks
+(C) 1999-2001 Torben Mathiasen <tmm@image.dk, torben.mathiasen@compaq.com>
+
+For driver information/updates visit http://www.compaq.com
+
+
+TLAN driver for Linux, version 1.14a
+README
+
+
+I. Supported Devices.
+
+ Only PCI devices will work with this driver.
+
+ Supported:
+ Vendor ID Device ID Name
+ 0e11 ae32 Compaq Netelligent 10/100 TX PCI UTP
+ 0e11 ae34 Compaq Netelligent 10 T PCI UTP
+ 0e11 ae35 Compaq Integrated NetFlex 3/P
+ 0e11 ae40 Compaq Netelligent Dual 10/100 TX PCI UTP
+ 0e11 ae43 Compaq Netelligent Integrated 10/100 TX UTP
+ 0e11 b011 Compaq Netelligent 10/100 TX Embedded UTP
+ 0e11 b012 Compaq Netelligent 10 T/2 PCI UTP/Coax
+ 0e11 b030 Compaq Netelligent 10/100 TX UTP
+ 0e11 f130 Compaq NetFlex 3/P
+ 0e11 f150 Compaq NetFlex 3/P
+ 108d 0012 Olicom OC-2325
+ 108d 0013 Olicom OC-2183
+ 108d 0014 Olicom OC-2326
+
+
+ Caveats:
+
+ I am not sure if 100BaseTX daughterboards (for those cards which
+ support such things) will work. I haven't had any solid evidence
+ either way.
+
+ However, if a card supports 100BaseTx without requiring an add
+ on daughterboard, it should work with 100BaseTx.
+
+ The "Netelligent 10 T/2 PCI UTP/Coax" (b012) device is untested,
+ but I do not expect any problems.
+
+
+II. Driver Options
+ 1. You can append debug=x to the end of the insmod line to get
+ debug messages, where x is a bit field where the bits mean
+ the following:
+
+ 0x01 Turn on general debugging messages.
+ 0x02 Turn on receive debugging messages.
+ 0x04 Turn on transmit debugging messages.
+ 0x08 Turn on list debugging messages.
+
+ 2. You can append aui=1 to the end of the insmod line to cause
+ the adapter to use the AUI interface instead of the 10 Base T
+ interface. This is also what to do if you want to use the BNC
+ connector on a TLAN based device. (Setting this option on a
+ device that does not have an AUI/BNC connector will probably
+ cause it to not function correctly.)
+
+ 3. You can set duplex=1 to force half duplex, and duplex=2 to
+ force full duplex.
+
+ 4. You can set speed=10 to force 10Mbs operation, and speed=100
+ to force 100Mbs operation. (I'm not sure what will happen
+ if a card which only supports 10Mbs is forced into 100Mbs
+ mode.)
+
+ 5. You have to use speed=X duplex=Y together now. If you just
+ do "insmod tlan.o speed=100" the driver will do Auto-Neg.
+ To force a 10Mbps Half-Duplex link do "insmod tlan.o speed=10
+ duplex=1".
+
+ 6. If the driver is built into the kernel, you can use the 3rd
+ and 4th parameters to set aui and debug respectively. For
+ example:
+
+ ether=0,0,0x1,0x7,eth0
+
+ This sets aui to 0x1 and debug to 0x7, assuming eth0 is a
+ supported TLAN device.
+
+ The bits in the third byte are assigned as follows:
+
+ 0x01 = aui
+ 0x02 = use half duplex
+ 0x04 = use full duplex
+ 0x08 = use 10BaseT
+ 0x10 = use 100BaseTx
+
+ You also need to set both speed and duplex settings when forcing
+ speeds with kernel-parameters.
+ ether=0,0,0x12,0,eth0 will force link to 100Mbps Half-Duplex.
+
+ 7. If you have more than one tlan adapter in your system, you can
+ use the above options on a per adapter basis. To force a 100Mbit/HD
+ link with your eth1 adapter use:
+
+ insmod tlan speed=0,100 duplex=0,1
+
+ Now eth0 will use auto-neg and eth1 will be forced to 100Mbit/HD.
+ Note that the tlan driver supports a maximum of 8 adapters.
+
+
+III. Things to try if you have problems.
+ 1. Make sure your card's PCI id is among those listed in
+ section I, above.
+ 2. Make sure routing is correct.
+ 3. Try forcing different speed/duplex settings
+
+
+There is also a tlan mailing list which you can join by sending "subscribe tlan"
+in the body of an email to majordomo@vuser.vu.union.edu.
+There is also a tlan website at http://www.compaq.com
+
diff --git a/Documentation/networking/tproxy.txt b/Documentation/networking/tproxy.txt
new file mode 100644
index 000000000..ec11429e1
--- /dev/null
+++ b/Documentation/networking/tproxy.txt
@@ -0,0 +1,84 @@
+Transparent proxy support
+=========================
+
+This feature adds Linux 2.2-like transparent proxy support to current kernels.
+To use it, enable the socket match and the TPROXY target in your kernel config.
+You will need policy routing too, so be sure to enable that as well.
+
+
+1. Making non-local sockets work
+================================
+
+The idea is that you identify packets with destination address matching a local
+socket on your box, set the packet mark to a certain value, and then match on that
+value using policy routing to have those packets delivered locally:
+
+# iptables -t mangle -N DIVERT
+# iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT
+# iptables -t mangle -A DIVERT -j MARK --set-mark 1
+# iptables -t mangle -A DIVERT -j ACCEPT
+
+# ip rule add fwmark 1 lookup 100
+# ip route add local 0.0.0.0/0 dev lo table 100
+
+Because of certain restrictions in the IPv4 routing output code you'll have to
+modify your application to allow it to send datagrams _from_ non-local IP
+addresses. All you have to do is enable the (SOL_IP, IP_TRANSPARENT) socket
+option before calling bind:
+
+fd = socket(AF_INET, SOCK_STREAM, 0);
+/* - 8< -*/
+int value = 1;
+setsockopt(fd, SOL_IP, IP_TRANSPARENT, &value, sizeof(value));
+/* - 8< -*/
+name.sin_family = AF_INET;
+name.sin_port = htons(0xCAFE);
+name.sin_addr.s_addr = htonl(0xDEADBEEF);
+bind(fd, &name, sizeof(name));
+
+A trivial patch for netcat is available here:
+http://people.netfilter.org/hidden/tproxy/netcat-ip_transparent-support.patch
+
+
+2. Redirecting traffic
+======================
+
+Transparent proxying often involves "intercepting" traffic on a router. This is
+usually done with the iptables REDIRECT target; however, there are serious
+limitations of that method. One of the major issues is that it actually
+modifies the packets to change the destination address -- which might not be
+acceptable in certain situations. (Think of proxying UDP for example: you won't
+be able to find out the original destination address. Even in case of TCP
+getting the original destination address is racy.)
+
+The 'TPROXY' target provides similar functionality without relying on NAT. Simply
+add rules like this to the iptables ruleset above:
+
+# iptables -t mangle -A PREROUTING -p tcp --dport 80 -j TPROXY \
+ --tproxy-mark 0x1/0x1 --on-port 50080
+
+Note that for this to work you'll have to modify the proxy to enable (SOL_IP,
+IP_TRANSPARENT) for the listening socket.
+
+
+3. Iptables extensions
+======================
+
+To use tproxy you'll need to have the 'socket' and 'TPROXY' modules
+compiled for iptables. A patched version of iptables is available
+here: http://git.balabit.hu/?p=bazsi/iptables-tproxy.git
+
+
+4. Application support
+======================
+
+4.1. Squid
+----------
+
+Squid 3.HEAD has support built-in. To use it, pass
+'--enable-linux-netfilter' to configure and set the 'tproxy' option on
+the HTTP listener you redirect traffic to with the TPROXY iptables
+target.
+
+For more information please consult the following page on the Squid
+wiki: http://wiki.squid-cache.org/Features/Tproxy4
diff --git a/Documentation/networking/tuntap.txt b/Documentation/networking/tuntap.txt
new file mode 100644
index 000000000..949d5dcdd
--- /dev/null
+++ b/Documentation/networking/tuntap.txt
@@ -0,0 +1,227 @@
+Universal TUN/TAP device driver.
+Copyright (C) 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com>
+
+ Linux, Solaris drivers
+ Copyright (C) 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com>
+
+ FreeBSD TAP driver
+ Copyright (c) 1999-2000 Maksim Yevmenkin <m_evmenkin@yahoo.com>
+
+ Revision of this document 2002 by Florian Thiel <florian.thiel@gmx.net>
+
+1. Description
+ TUN/TAP provides packet reception and transmission for user space programs.
+ It can be seen as a simple Point-to-Point or Ethernet device, which,
+ instead of receiving packets from physical media, receives them from
+ user space program and instead of sending packets via physical media
+ writes them to the user space program.
+
+ In order to use the driver a program has to open /dev/net/tun and issue a
+ corresponding ioctl() to register a network device with the kernel. A network
+ device will appear as tunXX or tapXX, depending on the options chosen. When
+ the program closes the file descriptor, the network device and all
+ corresponding routes will disappear.
+
+ Depending on the type of device chosen the userspace program has to read/write
+ IP packets (with tun) or ethernet frames (with tap). Which one is being used
+ depends on the flags given with the ioctl().
+
+ The package from http://vtun.sourceforge.net/tun contains two simple examples
+ for how to use tun and tap devices. Both programs work like a bridge between
+ two network interfaces.
+ br_select.c - bridge based on select system call.
+ br_sigio.c - bridge based on async io and SIGIO signal.
+ However, the best example is VTun http://vtun.sourceforge.net :))
+
+2. Configuration
+ Create device node:
+ mkdir /dev/net (if it doesn't exist already)
+ mknod /dev/net/tun c 10 200
+
+ Set permissions:
+ e.g. chmod 0666 /dev/net/tun
+ There's no harm in allowing the device to be accessible by non-root users,
+ since CAP_NET_ADMIN is required for creating network devices or for
+ connecting to network devices which aren't owned by the user in question.
+ If you want to create persistent devices and give ownership of them to
+ unprivileged users, then you need the /dev/net/tun device to be usable by
+ those users.
+
+ Driver module autoloading
+
+ Make sure that "Kernel module loader" - module auto-loading
+ support is enabled in your kernel. The kernel should load it on
+ first access.
+
+ Manual loading
+ insert the module by hand:
+ modprobe tun
+
+ If you do it the latter way, you have to load the module every time you
+ need it, if you do it the other way it will be automatically loaded when
+ /dev/net/tun is being opened.
+
+3. Program interface
+ 3.1 Network device allocation:
+
+ char *dev should be the name of the device with a format string (e.g.
+ "tun%d"), but (as far as I can see) this can be any valid network device name.
+ Note that the character pointer becomes overwritten with the real device name
+ (e.g. "tun0")
+
+ #include <linux/if.h>
+ #include <linux/if_tun.h>
+
+ int tun_alloc(char *dev)
+ {
+ struct ifreq ifr;
+ int fd, err;
+
+ if( (fd = open("/dev/net/tun", O_RDWR)) < 0 )
+ return tun_alloc_old(dev);
+
+ memset(&ifr, 0, sizeof(ifr));
+
+ /* Flags: IFF_TUN - TUN device (no Ethernet headers)
+ * IFF_TAP - TAP device
+ *
+ * IFF_NO_PI - Do not provide packet information
+ */
+ ifr.ifr_flags = IFF_TUN;
+ if( *dev )
+ strncpy(ifr.ifr_name, dev, IFNAMSIZ);
+
+ if( (err = ioctl(fd, TUNSETIFF, (void *) &ifr)) < 0 ){
+ close(fd);
+ return err;
+ }
+ strcpy(dev, ifr.ifr_name);
+ return fd;
+ }
+
+ 3.2 Frame format:
+ If flag IFF_NO_PI is not set each frame format is:
+ Flags [2 bytes]
+ Proto [2 bytes]
+ Raw protocol(IP, IPv6, etc) frame.
+
+ 3.3 Multiqueue tuntap interface:
+
+ From version 3.8, Linux supports multiqueue tuntap which can uses multiple
+ file descriptors (queues) to parallelize packets sending or receiving. The
+ device allocation is the same as before, and if user wants to create multiple
+ queues, TUNSETIFF with the same device name must be called many times with
+ IFF_MULTI_QUEUE flag.
+
+ char *dev should be the name of the device, queues is the number of queues to
+ be created, fds is used to store and return the file descriptors (queues)
+ created to the caller. Each file descriptor were served as the interface of a
+ queue which could be accessed by userspace.
+
+ #include <linux/if.h>
+ #include <linux/if_tun.h>
+
+ int tun_alloc_mq(char *dev, int queues, int *fds)
+ {
+ struct ifreq ifr;
+ int fd, err, i;
+
+ if (!dev)
+ return -1;
+
+ memset(&ifr, 0, sizeof(ifr));
+ /* Flags: IFF_TUN - TUN device (no Ethernet headers)
+ * IFF_TAP - TAP device
+ *
+ * IFF_NO_PI - Do not provide packet information
+ * IFF_MULTI_QUEUE - Create a queue of multiqueue device
+ */
+ ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE;
+ strcpy(ifr.ifr_name, dev);
+
+ for (i = 0; i < queues; i++) {
+ if ((fd = open("/dev/net/tun", O_RDWR)) < 0)
+ goto err;
+ err = ioctl(fd, TUNSETIFF, (void *)&ifr);
+ if (err) {
+ close(fd);
+ goto err;
+ }
+ fds[i] = fd;
+ }
+
+ return 0;
+ err:
+ for (--i; i >= 0; i--)
+ close(fds[i]);
+ return err;
+ }
+
+ A new ioctl(TUNSETQUEUE) were introduced to enable or disable a queue. When
+ calling it with IFF_DETACH_QUEUE flag, the queue were disabled. And when
+ calling it with IFF_ATTACH_QUEUE flag, the queue were enabled. The queue were
+ enabled by default after it was created through TUNSETIFF.
+
+ fd is the file descriptor (queue) that we want to enable or disable, when
+ enable is true we enable it, otherwise we disable it
+
+ #include <linux/if.h>
+ #include <linux/if_tun.h>
+
+ int tun_set_queue(int fd, int enable)
+ {
+ struct ifreq ifr;
+
+ memset(&ifr, 0, sizeof(ifr));
+
+ if (enable)
+ ifr.ifr_flags = IFF_ATTACH_QUEUE;
+ else
+ ifr.ifr_flags = IFF_DETACH_QUEUE;
+
+ return ioctl(fd, TUNSETQUEUE, (void *)&ifr);
+ }
+
+Universal TUN/TAP device driver Frequently Asked Question.
+
+1. What platforms are supported by TUN/TAP driver ?
+Currently driver has been written for 3 Unices:
+ Linux kernels 2.2.x, 2.4.x
+ FreeBSD 3.x, 4.x, 5.x
+ Solaris 2.6, 7.0, 8.0
+
+2. What is TUN/TAP driver used for?
+As mentioned above, main purpose of TUN/TAP driver is tunneling.
+It is used by VTun (http://vtun.sourceforge.net).
+
+Another interesting application using TUN/TAP is pipsecd
+(http://perso.enst.fr/~beyssac/pipsec/), a userspace IPSec
+implementation that can use complete kernel routing (unlike FreeS/WAN).
+
+3. How does Virtual network device actually work ?
+Virtual network device can be viewed as a simple Point-to-Point or
+Ethernet device, which instead of receiving packets from a physical
+media, receives them from user space program and instead of sending
+packets via physical media sends them to the user space program.
+
+Let's say that you configured IPX on the tap0, then whenever
+the kernel sends an IPX packet to tap0, it is passed to the application
+(VTun for example). The application encrypts, compresses and sends it to
+the other side over TCP or UDP. The application on the other side decompresses
+and decrypts the data received and writes the packet to the TAP device,
+the kernel handles the packet like it came from real physical device.
+
+4. What is the difference between TUN driver and TAP driver?
+TUN works with IP frames. TAP works with Ethernet frames.
+
+This means that you have to read/write IP packets when you are using tun and
+ethernet frames when using tap.
+
+5. What is the difference between BPF and TUN/TAP driver?
+BPF is an advanced packet filter. It can be attached to existing
+network interface. It does not provide a virtual network interface.
+A TUN/TAP driver does provide a virtual network interface and it is possible
+to attach BPF to this interface.
+
+6. Does TAP driver support kernel Ethernet bridging?
+Yes. Linux and FreeBSD drivers support Ethernet bridging.
diff --git a/Documentation/networking/udplite.txt b/Documentation/networking/udplite.txt
new file mode 100644
index 000000000..53a726855
--- /dev/null
+++ b/Documentation/networking/udplite.txt
@@ -0,0 +1,278 @@
+ ===========================================================================
+ The UDP-Lite protocol (RFC 3828)
+ ===========================================================================
+
+
+ UDP-Lite is a Standards-Track IETF transport protocol whose characteristic
+ is a variable-length checksum. This has advantages for transport of multimedia
+ (video, VoIP) over wireless networks, as partly damaged packets can still be
+ fed into the codec instead of being discarded due to a failed checksum test.
+
+ This file briefly describes the existing kernel support and the socket API.
+ For in-depth information, you can consult:
+
+ o The UDP-Lite Homepage:
+ http://web.archive.org/web/*/http://www.erg.abdn.ac.uk/users/gerrit/udp-lite/
+ From here you can also download some example application source code.
+
+ o The UDP-Lite HOWTO on
+ http://web.archive.org/web/*/http://www.erg.abdn.ac.uk/users/gerrit/udp-lite/
+ files/UDP-Lite-HOWTO.txt
+
+ o The Wireshark UDP-Lite WiKi (with capture files):
+ https://wiki.wireshark.org/Lightweight_User_Datagram_Protocol
+
+ o The Protocol Spec, RFC 3828, http://www.ietf.org/rfc/rfc3828.txt
+
+
+ I) APPLICATIONS
+
+ Several applications have been ported successfully to UDP-Lite. Ethereal
+ (now called wireshark) has UDP-Litev4/v6 support by default.
+ Porting applications to UDP-Lite is straightforward: only socket level and
+ IPPROTO need to be changed; senders additionally set the checksum coverage
+ length (default = header length = 8). Details are in the next section.
+
+
+ II) PROGRAMMING API
+
+ UDP-Lite provides a connectionless, unreliable datagram service and hence
+ uses the same socket type as UDP. In fact, porting from UDP to UDP-Lite is
+ very easy: simply add `IPPROTO_UDPLITE' as the last argument of the socket(2)
+ call so that the statement looks like:
+
+ s = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDPLITE);
+
+ or, respectively,
+
+ s = socket(PF_INET6, SOCK_DGRAM, IPPROTO_UDPLITE);
+
+ With just the above change you are able to run UDP-Lite services or connect
+ to UDP-Lite servers. The kernel will assume that you are not interested in
+ using partial checksum coverage and so emulate UDP mode (full coverage).
+
+ To make use of the partial checksum coverage facilities requires setting a
+ single socket option, which takes an integer specifying the coverage length:
+
+ * Sender checksum coverage: UDPLITE_SEND_CSCOV
+
+ For example,
+
+ int val = 20;
+ setsockopt(s, SOL_UDPLITE, UDPLITE_SEND_CSCOV, &val, sizeof(int));
+
+ sets the checksum coverage length to 20 bytes (12b data + 8b header).
+ Of each packet only the first 20 bytes (plus the pseudo-header) will be
+ checksummed. This is useful for RTP applications which have a 12-byte
+ base header.
+
+
+ * Receiver checksum coverage: UDPLITE_RECV_CSCOV
+
+ This option is the receiver-side analogue. It is truly optional, i.e. not
+ required to enable traffic with partial checksum coverage. Its function is
+ that of a traffic filter: when enabled, it instructs the kernel to drop
+ all packets which have a coverage _less_ than this value. For example, if
+ RTP and UDP headers are to be protected, a receiver can enforce that only
+ packets with a minimum coverage of 20 are admitted:
+
+ int min = 20;
+ setsockopt(s, SOL_UDPLITE, UDPLITE_RECV_CSCOV, &min, sizeof(int));
+
+ The calls to getsockopt(2) are analogous. Being an extension and not a stand-
+ alone protocol, all socket options known from UDP can be used in exactly the
+ same manner as before, e.g. UDP_CORK or UDP_ENCAP.
+
+ A detailed discussion of UDP-Lite checksum coverage options is in section IV.
+
+
+ III) HEADER FILES
+
+ The socket API requires support through header files in /usr/include:
+
+ * /usr/include/netinet/in.h
+ to define IPPROTO_UDPLITE
+
+ * /usr/include/netinet/udplite.h
+ for UDP-Lite header fields and protocol constants
+
+ For testing purposes, the following can serve as a `mini' header file:
+
+ #define IPPROTO_UDPLITE 136
+ #define SOL_UDPLITE 136
+ #define UDPLITE_SEND_CSCOV 10
+ #define UDPLITE_RECV_CSCOV 11
+
+ Ready-made header files for various distros are in the UDP-Lite tarball.
+
+
+ IV) KERNEL BEHAVIOUR WITH REGARD TO THE VARIOUS SOCKET OPTIONS
+
+ To enable debugging messages, the log level need to be set to 8, as most
+ messages use the KERN_DEBUG level (7).
+
+ 1) Sender Socket Options
+
+ If the sender specifies a value of 0 as coverage length, the module
+ assumes full coverage, transmits a packet with coverage length of 0
+ and according checksum. If the sender specifies a coverage < 8 and
+ different from 0, the kernel assumes 8 as default value. Finally,
+ if the specified coverage length exceeds the packet length, the packet
+ length is used instead as coverage length.
+
+ 2) Receiver Socket Options
+
+ The receiver specifies the minimum value of the coverage length it
+ is willing to accept. A value of 0 here indicates that the receiver
+ always wants the whole of the packet covered. In this case, all
+ partially covered packets are dropped and an error is logged.
+
+ It is not possible to specify illegal values (<0 and <8); in these
+ cases the default of 8 is assumed.
+
+ All packets arriving with a coverage value less than the specified
+ threshold are discarded, these events are also logged.
+
+ 3) Disabling the Checksum Computation
+
+ On both sender and receiver, checksumming will always be performed
+ and cannot be disabled using SO_NO_CHECK. Thus
+
+ setsockopt(sockfd, SOL_SOCKET, SO_NO_CHECK, ... );
+
+ will always will be ignored, while the value of
+
+ getsockopt(sockfd, SOL_SOCKET, SO_NO_CHECK, &value, ...);
+
+ is meaningless (as in TCP). Packets with a zero checksum field are
+ illegal (cf. RFC 3828, sec. 3.1) and will be silently discarded.
+
+ 4) Fragmentation
+
+ The checksum computation respects both buffersize and MTU. The size
+ of UDP-Lite packets is determined by the size of the send buffer. The
+ minimum size of the send buffer is 2048 (defined as SOCK_MIN_SNDBUF
+ in include/net/sock.h), the default value is configurable as
+ net.core.wmem_default or via setting the SO_SNDBUF socket(7)
+ option. The maximum upper bound for the send buffer is determined
+ by net.core.wmem_max.
+
+ Given a payload size larger than the send buffer size, UDP-Lite will
+ split the payload into several individual packets, filling up the
+ send buffer size in each case.
+
+ The precise value also depends on the interface MTU. The interface MTU,
+ in turn, may trigger IP fragmentation. In this case, the generated
+ UDP-Lite packet is split into several IP packets, of which only the
+ first one contains the L4 header.
+
+ The send buffer size has implications on the checksum coverage length.
+ Consider the following example:
+
+ Payload: 1536 bytes Send Buffer: 1024 bytes
+ MTU: 1500 bytes Coverage Length: 856 bytes
+
+ UDP-Lite will ship the 1536 bytes in two separate packets:
+
+ Packet 1: 1024 payload + 8 byte header + 20 byte IP header = 1052 bytes
+ Packet 2: 512 payload + 8 byte header + 20 byte IP header = 540 bytes
+
+ The coverage packet covers the UDP-Lite header and 848 bytes of the
+ payload in the first packet, the second packet is fully covered. Note
+ that for the second packet, the coverage length exceeds the packet
+ length. The kernel always re-adjusts the coverage length to the packet
+ length in such cases.
+
+ As an example of what happens when one UDP-Lite packet is split into
+ several tiny fragments, consider the following example.
+
+ Payload: 1024 bytes Send buffer size: 1024 bytes
+ MTU: 300 bytes Coverage length: 575 bytes
+
+ +-+-----------+--------------+--------------+--------------+
+ |8| 272 | 280 | 280 | 280 |
+ +-+-----------+--------------+--------------+--------------+
+ 280 560 840 1032
+ ^
+ *****checksum coverage*************
+
+ The UDP-Lite module generates one 1032 byte packet (1024 + 8 byte
+ header). According to the interface MTU, these are split into 4 IP
+ packets (280 byte IP payload + 20 byte IP header). The kernel module
+ sums the contents of the entire first two packets, plus 15 bytes of
+ the last packet before releasing the fragments to the IP module.
+
+ To see the analogous case for IPv6 fragmentation, consider a link
+ MTU of 1280 bytes and a write buffer of 3356 bytes. If the checksum
+ coverage is less than 1232 bytes (MTU minus IPv6/fragment header
+ lengths), only the first fragment needs to be considered. When using
+ larger checksum coverage lengths, each eligible fragment needs to be
+ checksummed. Suppose we have a checksum coverage of 3062. The buffer
+ of 3356 bytes will be split into the following fragments:
+
+ Fragment 1: 1280 bytes carrying 1232 bytes of UDP-Lite data
+ Fragment 2: 1280 bytes carrying 1232 bytes of UDP-Lite data
+ Fragment 3: 948 bytes carrying 900 bytes of UDP-Lite data
+
+ The first two fragments have to be checksummed in full, of the last
+ fragment only 598 (= 3062 - 2*1232) bytes are checksummed.
+
+ While it is important that such cases are dealt with correctly, they
+ are (annoyingly) rare: UDP-Lite is designed for optimising multimedia
+ performance over wireless (or generally noisy) links and thus smaller
+ coverage lengths are likely to be expected.
+
+
+ V) UDP-LITE RUNTIME STATISTICS AND THEIR MEANING
+
+ Exceptional and error conditions are logged to syslog at the KERN_DEBUG
+ level. Live statistics about UDP-Lite are available in /proc/net/snmp
+ and can (with newer versions of netstat) be viewed using
+
+ netstat -svu
+
+ This displays UDP-Lite statistics variables, whose meaning is as follows.
+
+ InDatagrams: The total number of datagrams delivered to users.
+
+ NoPorts: Number of packets received to an unknown port.
+ These cases are counted separately (not as InErrors).
+
+ InErrors: Number of erroneous UDP-Lite packets. Errors include:
+ * internal socket queue receive errors
+ * packet too short (less than 8 bytes or stated
+ coverage length exceeds received length)
+ * xfrm4_policy_check() returned with error
+ * application has specified larger min. coverage
+ length than that of incoming packet
+ * checksum coverage violated
+ * bad checksum
+
+ OutDatagrams: Total number of sent datagrams.
+
+ These statistics derive from the UDP MIB (RFC 2013).
+
+
+ VI) IPTABLES
+
+ There is packet match support for UDP-Lite as well as support for the LOG target.
+ If you copy and paste the following line into /etc/protocols,
+
+ udplite 136 UDP-Lite # UDP-Lite [RFC 3828]
+
+ then
+ iptables -A INPUT -p udplite -j LOG
+
+ will produce logging output to syslog. Dropping and rejecting packets also works.
+
+
+ VII) MAINTAINER ADDRESS
+
+ The UDP-Lite patch was developed at
+ University of Aberdeen
+ Electronics Research Group
+ Department of Engineering
+ Fraser Noble Building
+ Aberdeen AB24 3UE; UK
+ The current maintainer is Gerrit Renker, <gerrit@erg.abdn.ac.uk>. Initial
+ code was developed by William Stanislaus, <william@erg.abdn.ac.uk>.
diff --git a/Documentation/networking/vortex.txt b/Documentation/networking/vortex.txt
new file mode 100644
index 000000000..97282da82
--- /dev/null
+++ b/Documentation/networking/vortex.txt
@@ -0,0 +1,448 @@
+Documentation/networking/vortex.txt
+Andrew Morton
+30 April 2000
+
+
+This document describes the usage and errata of the 3Com "Vortex" device
+driver for Linux, 3c59x.c.
+
+The driver was written by Donald Becker <becker@scyld.com>
+
+Don is no longer the prime maintainer of this version of the driver.
+Please report problems to one or more of:
+
+ Andrew Morton
+ Netdev mailing list <netdev@vger.kernel.org>
+ Linux kernel mailing list <linux-kernel@vger.kernel.org>
+
+Please note the 'Reporting and Diagnosing Problems' section at the end
+of this file.
+
+
+Since kernel 2.3.99-pre6, this driver incorporates the support for the
+3c575-series Cardbus cards which used to be handled by 3c575_cb.c.
+
+This driver supports the following hardware:
+
+ 3c590 Vortex 10Mbps
+ 3c592 EISA 10Mbps Demon/Vortex
+ 3c597 EISA Fast Demon/Vortex
+ 3c595 Vortex 100baseTx
+ 3c595 Vortex 100baseT4
+ 3c595 Vortex 100base-MII
+ 3c900 Boomerang 10baseT
+ 3c900 Boomerang 10Mbps Combo
+ 3c900 Cyclone 10Mbps TPO
+ 3c900 Cyclone 10Mbps Combo
+ 3c900 Cyclone 10Mbps TPC
+ 3c900B-FL Cyclone 10base-FL
+ 3c905 Boomerang 100baseTx
+ 3c905 Boomerang 100baseT4
+ 3c905B Cyclone 100baseTx
+ 3c905B Cyclone 10/100/BNC
+ 3c905B-FX Cyclone 100baseFx
+ 3c905C Tornado
+ 3c920B-EMB-WNM (ATI Radeon 9100 IGP)
+ 3c980 Cyclone
+ 3c980C Python-T
+ 3cSOHO100-TX Hurricane
+ 3c555 Laptop Hurricane
+ 3c556 Laptop Tornado
+ 3c556B Laptop Hurricane
+ 3c575 [Megahertz] 10/100 LAN CardBus
+ 3c575 Boomerang CardBus
+ 3CCFE575BT Cyclone CardBus
+ 3CCFE575CT Tornado CardBus
+ 3CCFE656 Cyclone CardBus
+ 3CCFEM656B Cyclone+Winmodem CardBus
+ 3CXFEM656C Tornado+Winmodem CardBus
+ 3c450 HomePNA Tornado
+ 3c920 Tornado
+ 3c982 Hydra Dual Port A
+ 3c982 Hydra Dual Port B
+ 3c905B-T4
+ 3c920B-EMB-WNM Tornado
+
+Module parameters
+=================
+
+There are several parameters which may be provided to the driver when
+its module is loaded. These are usually placed in /etc/modprobe.d/*.conf
+configuration files. Example:
+
+options 3c59x debug=3 rx_copybreak=300
+
+If you are using the PCMCIA tools (cardmgr) then the options may be
+placed in /etc/pcmcia/config.opts:
+
+module "3c59x" opts "debug=3 rx_copybreak=300"
+
+
+The supported parameters are:
+
+debug=N
+
+ Where N is a number from 0 to 7. Anything above 3 produces a lot
+ of output in your system logs. debug=1 is default.
+
+options=N1,N2,N3,...
+
+ Each number in the list provides an option to the corresponding
+ network card. So if you have two 3c905's and you wish to provide
+ them with option 0x204 you would use:
+
+ options=0x204,0x204
+
+ The individual options are composed of a number of bitfields which
+ have the following meanings:
+
+ Possible media type settings
+ 0 10baseT
+ 1 10Mbs AUI
+ 2 undefined
+ 3 10base2 (BNC)
+ 4 100base-TX
+ 5 100base-FX
+ 6 MII (Media Independent Interface)
+ 7 Use default setting from EEPROM
+ 8 Autonegotiate
+ 9 External MII
+ 10 Use default setting from EEPROM
+
+ When generating a value for the 'options' setting, the above media
+ selection values may be OR'ed (or added to) the following:
+
+ 0x8000 Set driver debugging level to 7
+ 0x4000 Set driver debugging level to 2
+ 0x0400 Enable Wake-on-LAN
+ 0x0200 Force full duplex mode.
+ 0x0010 Bus-master enable bit (Old Vortex cards only)
+
+ For example:
+
+ insmod 3c59x options=0x204
+
+ will force full-duplex 100base-TX, rather than allowing the usual
+ autonegotiation.
+
+global_options=N
+
+ Sets the `options' parameter for all 3c59x NICs in the machine.
+ Entries in the `options' array above will override any setting of
+ this.
+
+full_duplex=N1,N2,N3...
+
+ Similar to bit 9 of 'options'. Forces the corresponding card into
+ full-duplex mode. Please use this in preference to the `options'
+ parameter.
+
+ In fact, please don't use this at all! You're better off getting
+ autonegotiation working properly.
+
+global_full_duplex=N1
+
+ Sets full duplex mode for all 3c59x NICs in the machine. Entries
+ in the `full_duplex' array above will override any setting of this.
+
+flow_ctrl=N1,N2,N3...
+
+ Use 802.3x MAC-layer flow control. The 3com cards only support the
+ PAUSE command, which means that they will stop sending packets for a
+ short period if they receive a PAUSE frame from the link partner.
+
+ The driver only allows flow control on a link which is operating in
+ full duplex mode.
+
+ This feature does not appear to work on the 3c905 - only 3c905B and
+ 3c905C have been tested.
+
+ The 3com cards appear to only respond to PAUSE frames which are
+ sent to the reserved destination address of 01:80:c2:00:00:01. They
+ do not honour PAUSE frames which are sent to the station MAC address.
+
+rx_copybreak=M
+
+ The driver preallocates 32 full-sized (1536 byte) network buffers
+ for receiving. When a packet arrives, the driver has to decide
+ whether to leave the packet in its full-sized buffer, or to allocate
+ a smaller buffer and copy the packet across into it.
+
+ This is a speed/space tradeoff.
+
+ The value of rx_copybreak is used to decide when to make the copy.
+ If the packet size is less than rx_copybreak, the packet is copied.
+ The default value for rx_copybreak is 200 bytes.
+
+max_interrupt_work=N
+
+ The driver's interrupt service routine can handle many receive and
+ transmit packets in a single invocation. It does this in a loop.
+ The value of max_interrupt_work governs how many times the interrupt
+ service routine will loop. The default value is 32 loops. If this
+ is exceeded the interrupt service routine gives up and generates a
+ warning message "eth0: Too much work in interrupt".
+
+hw_checksums=N1,N2,N3,...
+
+ Recent 3com NICs are able to generate IPv4, TCP and UDP checksums
+ in hardware. Linux has used the Rx checksumming for a long time.
+ The "zero copy" patch which is planned for the 2.4 kernel series
+ allows you to make use of the NIC's DMA scatter/gather and transmit
+ checksumming as well.
+
+ The driver is set up so that, when the zerocopy patch is applied,
+ all Tornado and Cyclone devices will use S/G and Tx checksums.
+
+ This module parameter has been provided so you can override this
+ decision. If you think that Tx checksums are causing a problem, you
+ may disable the feature with `hw_checksums=0'.
+
+ If you think your NIC should be performing Tx checksumming and the
+ driver isn't enabling it, you can force the use of hardware Tx
+ checksumming with `hw_checksums=1'.
+
+ The driver drops a message in the logfiles to indicate whether or
+ not it is using hardware scatter/gather and hardware Tx checksums.
+
+ Scatter/gather and hardware checksums provide considerable
+ performance improvement for the sendfile() system call, but a small
+ decrease in throughput for send(). There is no effect upon receive
+ efficiency.
+
+compaq_ioaddr=N
+compaq_irq=N
+compaq_device_id=N
+
+ "Variables to work-around the Compaq PCI BIOS32 problem"....
+
+watchdog=N
+
+ Sets the time duration (in milliseconds) after which the kernel
+ decides that the transmitter has become stuck and needs to be reset.
+ This is mainly for debugging purposes, although it may be advantageous
+ to increase this value on LANs which have very high collision rates.
+ The default value is 5000 (5.0 seconds).
+
+enable_wol=N1,N2,N3,...
+
+ Enable Wake-on-LAN support for the relevant interface. Donald
+ Becker's `ether-wake' application may be used to wake suspended
+ machines.
+
+ Also enables the NIC's power management support.
+
+global_enable_wol=N
+
+ Sets enable_wol mode for all 3c59x NICs in the machine. Entries in
+ the `enable_wol' array above will override any setting of this.
+
+Media selection
+---------------
+
+A number of the older NICs such as the 3c590 and 3c900 series have
+10base2 and AUI interfaces.
+
+Prior to January, 2001 this driver would autoeselect the 10base2 or AUI
+port if it didn't detect activity on the 10baseT port. It would then
+get stuck on the 10base2 port and a driver reload was necessary to
+switch back to 10baseT. This behaviour could not be prevented with a
+module option override.
+
+Later (current) versions of the driver _do_ support locking of the
+media type. So if you load the driver module with
+
+ modprobe 3c59x options=0
+
+it will permanently select the 10baseT port. Automatic selection of
+other media types does not occur.
+
+
+Transmit error, Tx status register 82
+-------------------------------------
+
+This is a common error which is almost always caused by another host on
+the same network being in full-duplex mode, while this host is in
+half-duplex mode. You need to find that other host and make it run in
+half-duplex mode or fix this host to run in full-duplex mode.
+
+As a last resort, you can force the 3c59x driver into full-duplex mode
+with
+
+ options 3c59x full_duplex=1
+
+but this has to be viewed as a workaround for broken network gear and
+should only really be used for equipment which cannot autonegotiate.
+
+
+Additional resources
+--------------------
+
+Details of the device driver implementation are at the top of the source file.
+
+Additional documentation is available at Don Becker's Linux Drivers site:
+
+ http://www.scyld.com/vortex.html
+
+Donald Becker's driver development site:
+
+ http://www.scyld.com/network.html
+
+Donald's vortex-diag program is useful for inspecting the NIC's state:
+
+ http://www.scyld.com/ethercard_diag.html
+
+Donald's mii-diag program may be used for inspecting and manipulating
+the NIC's Media Independent Interface subsystem:
+
+ http://www.scyld.com/ethercard_diag.html#mii-diag
+
+Donald's wake-on-LAN page:
+
+ http://www.scyld.com/wakeonlan.html
+
+3Com's DOS-based application for setting up the NICs EEPROMs:
+
+ ftp://ftp.3com.com/pub/nic/3c90x/3c90xx2.exe
+
+
+Autonegotiation notes
+---------------------
+
+ The driver uses a one-minute heartbeat for adapting to changes in
+ the external LAN environment if link is up and 5 seconds if link is down.
+ This means that when, for example, a machine is unplugged from a hubbed
+ 10baseT LAN plugged into a switched 100baseT LAN, the throughput
+ will be quite dreadful for up to sixty seconds. Be patient.
+
+ Cisco interoperability note from Walter Wong <wcw+@CMU.EDU>:
+
+ On a side note, adding HAS_NWAY seems to share a problem with the
+ Cisco 6509 switch. Specifically, you need to change the spanning
+ tree parameter for the port the machine is plugged into to 'portfast'
+ mode. Otherwise, the negotiation fails. This has been an issue
+ we've noticed for a while but haven't had the time to track down.
+
+ Cisco switches (Jeff Busch <jbusch@deja.com>)
+
+ My "standard config" for ports to which PC's/servers connect directly:
+
+ interface FastEthernet0/N
+ description machinename
+ load-interval 30
+ spanning-tree portfast
+
+ If autonegotiation is a problem, you may need to specify "speed
+ 100" and "duplex full" as well (or "speed 10" and "duplex half").
+
+ WARNING: DO NOT hook up hubs/switches/bridges to these
+ specially-configured ports! The switch will become very confused.
+
+
+Reporting and diagnosing problems
+---------------------------------
+
+Maintainers find that accurate and complete problem reports are
+invaluable in resolving driver problems. We are frequently not able to
+reproduce problems and must rely on your patience and efforts to get to
+the bottom of the problem.
+
+If you believe you have a driver problem here are some of the
+steps you should take:
+
+- Is it really a driver problem?
+
+ Eliminate some variables: try different cards, different
+ computers, different cables, different ports on the switch/hub,
+ different versions of the kernel or of the driver, etc.
+
+- OK, it's a driver problem.
+
+ You need to generate a report. Typically this is an email to the
+ maintainer and/or netdev@vger.kernel.org. The maintainer's
+ email address will be in the driver source or in the MAINTAINERS file.
+
+- The contents of your report will vary a lot depending upon the
+ problem. If it's a kernel crash then you should refer to the
+ REPORTING-BUGS file.
+
+ But for most problems it is useful to provide the following:
+
+ o Kernel version, driver version
+
+ o A copy of the banner message which the driver generates when
+ it is initialised. For example:
+
+ eth0: 3Com PCI 3c905C Tornado at 0xa400, 00:50:da:6a:88:f0, IRQ 19
+ 8K byte-wide RAM 5:3 Rx:Tx split, autoselect/Autonegotiate interface.
+ MII transceiver found at address 24, status 782d.
+ Enabling bus-master transmits and whole-frame receives.
+
+ NOTE: You must provide the `debug=2' modprobe option to generate
+ a full detection message. Please do this:
+
+ modprobe 3c59x debug=2
+
+ o If it is a PCI device, the relevant output from 'lspci -vx', eg:
+
+ 00:09.0 Ethernet controller: 3Com Corporation 3c905C-TX [Fast Etherlink] (rev 74)
+ Subsystem: 3Com Corporation: Unknown device 9200
+ Flags: bus master, medium devsel, latency 32, IRQ 19
+ I/O ports at a400 [size=128]
+ Memory at db000000 (32-bit, non-prefetchable) [size=128]
+ Expansion ROM at <unassigned> [disabled] [size=128K]
+ Capabilities: [dc] Power Management version 2
+ 00: b7 10 00 92 07 00 10 02 74 00 00 02 08 20 00 00
+ 10: 01 a4 00 00 00 00 00 db 00 00 00 00 00 00 00 00
+ 20: 00 00 00 00 00 00 00 00 00 00 00 00 b7 10 00 10
+ 30: 00 00 00 00 dc 00 00 00 00 00 00 00 05 01 0a 0a
+
+ o A description of the environment: 10baseT? 100baseT?
+ full/half duplex? switched or hubbed?
+
+ o Any additional module parameters which you may be providing to the driver.
+
+ o Any kernel logs which are produced. The more the merrier.
+ If this is a large file and you are sending your report to a
+ mailing list, mention that you have the logfile, but don't send
+ it. If you're reporting direct to the maintainer then just send
+ it.
+
+ To ensure that all kernel logs are available, add the
+ following line to /etc/syslog.conf:
+
+ kern.* /var/log/messages
+
+ Then restart syslogd with:
+
+ /etc/rc.d/init.d/syslog restart
+
+ (The above may vary, depending upon which Linux distribution you use).
+
+ o If your problem is reproducible then that's great. Try the
+ following:
+
+ 1) Increase the debug level. Usually this is done via:
+
+ a) modprobe driver debug=7
+ b) In /etc/modprobe.d/driver.conf:
+ options driver debug=7
+
+ 2) Recreate the problem with the higher debug level,
+ send all logs to the maintainer.
+
+ 3) Download you card's diagnostic tool from Donald
+ Becker's website <http://www.scyld.com/ethercard_diag.html>.
+ Download mii-diag.c as well. Build these.
+
+ a) Run 'vortex-diag -aaee' and 'mii-diag -v' when the card is
+ working correctly. Save the output.
+
+ b) Run the above commands when the card is malfunctioning. Send
+ both sets of output.
+
+Finally, please be patient and be prepared to do some work. You may
+end up working on this problem for a week or more as the maintainer
+asks more questions, asks for more tests, asks for patches to be
+applied, etc. At the end of it all, the problem may even remain
+unresolved.
diff --git a/Documentation/networking/vxge.txt b/Documentation/networking/vxge.txt
new file mode 100644
index 000000000..abfec245f
--- /dev/null
+++ b/Documentation/networking/vxge.txt
@@ -0,0 +1,93 @@
+Neterion's (Formerly S2io) X3100 Series 10GbE PCIe Server Adapter Linux driver
+==============================================================================
+
+Contents
+--------
+
+1) Introduction
+2) Features supported
+3) Configurable driver parameters
+4) Troubleshooting
+
+1) Introduction:
+----------------
+This Linux driver supports all Neterion's X3100 series 10 GbE PCIe I/O
+Virtualized Server adapters.
+The X3100 series supports four modes of operation, configurable via
+firmware -
+ Single function mode
+ Multi function mode
+ SRIOV mode
+ MRIOV mode
+The functions share a 10GbE link and the pci-e bus, but hardly anything else
+inside the ASIC. Features like independent hw reset, statistics, bandwidth/
+priority allocation and guarantees, GRO, TSO, interrupt moderation etc are
+supported independently on each function.
+
+(See below for a complete list of features supported for both IPv4 and IPv6)
+
+2) Features supported:
+----------------------
+
+i) Single function mode (up to 17 queues)
+
+ii) Multi function mode (up to 17 functions)
+
+iii) PCI-SIG's I/O Virtualization
+ - Single Root mode: v1.0 (up to 17 functions)
+ - Multi-Root mode: v1.0 (up to 17 functions)
+
+iv) Jumbo frames
+ X3100 Series supports MTU up to 9600 bytes, modifiable using
+ ip command.
+
+v) Offloads supported: (Enabled by default)
+ Checksum offload (TCP/UDP/IP) on transmit and receive paths
+ TCP Segmentation Offload (TSO) on transmit path
+ Generic Receive Offload (GRO) on receive path
+
+vi) MSI-X: (Enabled by default)
+ Resulting in noticeable performance improvement (up to 7% on certain
+ platforms).
+
+vii) NAPI: (Enabled by default)
+ For better Rx interrupt moderation.
+
+viii)RTH (Receive Traffic Hash): (Enabled by default)
+ Receive side steering for better scaling.
+
+ix) Statistics
+ Comprehensive MAC-level and software statistics displayed using
+ "ethtool -S" option.
+
+x) Multiple hardware queues: (Enabled by default)
+ Up to 17 hardware based transmit and receive data channels, with
+ multiple steering options (transmit multiqueue enabled by default).
+
+3) Configurable driver parameters:
+----------------------------------
+
+i) max_config_dev
+ Specifies maximum device functions to be enabled.
+ Valid range: 1-8
+
+ii) max_config_port
+ Specifies number of ports to be enabled.
+ Valid range: 1,2
+ Default: 1
+
+iii)max_config_vpath
+ Specifies maximum VPATH(s) configured for each device function.
+ Valid range: 1-17
+
+iv) vlan_tag_strip
+ Enables/disables vlan tag stripping from all received tagged frames that
+ are not replicated at the internal L2 switch.
+ Valid range: 0,1 (disabled, enabled respectively)
+ Default: 1
+
+v) addr_learn_en
+ Enable learning the mac address of the guest OS interface in
+ virtualization environment.
+ Valid range: 0,1 (disabled, enabled respectively)
+ Default: 0
diff --git a/Documentation/networking/vxlan.txt b/Documentation/networking/vxlan.txt
new file mode 100644
index 000000000..6d993510f
--- /dev/null
+++ b/Documentation/networking/vxlan.txt
@@ -0,0 +1,47 @@
+Virtual eXtensible Local Area Networking documentation
+======================================================
+
+The VXLAN protocol is a tunnelling protocol that is designed to
+solve the problem of limited number of available VLAN's (4096).
+With VXLAN identifier is expanded to 24 bits.
+
+It is a draft RFC standard, that is implemented by Cisco Nexus,
+Vmware and Brocade. The protocol runs over UDP using a single
+destination port (still not standardized by IANA).
+This document describes the Linux kernel tunnel device,
+there is also an implantation of VXLAN for Openvswitch.
+
+Unlike most tunnels, a VXLAN is a 1 to N network, not just point
+to point. A VXLAN device can either dynamically learn the IP address
+of the other end, in a manner similar to a learning bridge, or the
+forwarding entries can be configured statically.
+
+The management of vxlan is done in a similar fashion to it's
+too closest neighbors GRE and VLAN. Configuring VXLAN requires
+the version of iproute2 that matches the kernel release
+where VXLAN was first merged upstream.
+
+1. Create vxlan device
+ # ip li add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth1
+
+This creates a new device (vxlan0). The device uses the
+the multicast group 239.1.1.1 over eth1 to handle packets where
+no entry is in the forwarding table.
+
+2. Delete vxlan device
+ # ip link delete vxlan0
+
+3. Show vxlan info
+ # ip -d link show vxlan0
+
+It is possible to create, destroy and display the vxlan
+forwarding table using the new bridge command.
+
+1. Create forwarding table entry
+ # bridge fdb add to 00:17:42:8a:b4:05 dst 192.19.0.2 dev vxlan0
+
+2. Delete forwarding table entry
+ # bridge fdb delete 00:17:42:8a:b4:05 dev vxlan0
+
+3. Show forwarding table
+ # bridge fdb show dev vxlan0
diff --git a/Documentation/networking/x25-iface.txt b/Documentation/networking/x25-iface.txt
new file mode 100644
index 000000000..7f213b556
--- /dev/null
+++ b/Documentation/networking/x25-iface.txt
@@ -0,0 +1,123 @@
+ X.25 Device Driver Interface 1.1
+
+ Jonathan Naylor 26.12.96
+
+This is a description of the messages to be passed between the X.25 Packet
+Layer and the X.25 device driver. They are designed to allow for the easy
+setting of the LAPB mode from within the Packet Layer.
+
+The X.25 device driver will be coded normally as per the Linux device driver
+standards. Most X.25 device drivers will be moderately similar to the
+already existing Ethernet device drivers. However unlike those drivers, the
+X.25 device driver has a state associated with it, and this information
+needs to be passed to and from the Packet Layer for proper operation.
+
+All messages are held in sk_buff's just like real data to be transmitted
+over the LAPB link. The first byte of the skbuff indicates the meaning of
+the rest of the skbuff, if any more information does exist.
+
+
+Packet Layer to Device Driver
+-----------------------------
+
+First Byte = 0x00 (X25_IFACE_DATA)
+
+This indicates that the rest of the skbuff contains data to be transmitted
+over the LAPB link. The LAPB link should already exist before any data is
+passed down.
+
+First Byte = 0x01 (X25_IFACE_CONNECT)
+
+Establish the LAPB link. If the link is already established then the connect
+confirmation message should be returned as soon as possible.
+
+First Byte = 0x02 (X25_IFACE_DISCONNECT)
+
+Terminate the LAPB link. If it is already disconnected then the disconnect
+confirmation message should be returned as soon as possible.
+
+First Byte = 0x03 (X25_IFACE_PARAMS)
+
+LAPB parameters. To be defined.
+
+
+Device Driver to Packet Layer
+-----------------------------
+
+First Byte = 0x00 (X25_IFACE_DATA)
+
+This indicates that the rest of the skbuff contains data that has been
+received over the LAPB link.
+
+First Byte = 0x01 (X25_IFACE_CONNECT)
+
+LAPB link has been established. The same message is used for both a LAPB
+link connect_confirmation and a connect_indication.
+
+First Byte = 0x02 (X25_IFACE_DISCONNECT)
+
+LAPB link has been terminated. This same message is used for both a LAPB
+link disconnect_confirmation and a disconnect_indication.
+
+First Byte = 0x03 (X25_IFACE_PARAMS)
+
+LAPB parameters. To be defined.
+
+
+
+Possible Problems
+=================
+
+(Henner Eisen, 2000-10-28)
+
+The X.25 packet layer protocol depends on a reliable datalink service.
+The LAPB protocol provides such reliable service. But this reliability
+is not preserved by the Linux network device driver interface:
+
+- With Linux 2.4.x (and above) SMP kernels, packet ordering is not
+ preserved. Even if a device driver calls netif_rx(skb1) and later
+ netif_rx(skb2), skb2 might be delivered to the network layer
+ earlier that skb1.
+- Data passed upstream by means of netif_rx() might be dropped by the
+ kernel if the backlog queue is congested.
+
+The X.25 packet layer protocol will detect this and reset the virtual
+call in question. But many upper layer protocols are not designed to
+handle such N-Reset events gracefully. And frequent N-Reset events
+will always degrade performance.
+
+Thus, driver authors should make netif_rx() as reliable as possible:
+
+SMP re-ordering will not occur if the driver's interrupt handler is
+always executed on the same CPU. Thus,
+
+- Driver authors should use irq affinity for the interrupt handler.
+
+The probability of packet loss due to backlog congestion can be
+reduced by the following measures or a combination thereof:
+
+(1) Drivers for kernel versions 2.4.x and above should always check the
+ return value of netif_rx(). If it returns NET_RX_DROP, the
+ driver's LAPB protocol must not confirm reception of the frame
+ to the peer.
+ This will reliably suppress packet loss. The LAPB protocol will
+ automatically cause the peer to re-transmit the dropped packet
+ later.
+ The lapb module interface was modified to support this. Its
+ data_indication() method should now transparently pass the
+ netif_rx() return value to the (lapb module) caller.
+(2) Drivers for kernel versions 2.2.x should always check the global
+ variable netdev_dropping when a new frame is received. The driver
+ should only call netif_rx() if netdev_dropping is zero. Otherwise
+ the driver should not confirm delivery of the frame and drop it.
+ Alternatively, the driver can queue the frame internally and call
+ netif_rx() later when netif_dropping is 0 again. In that case, delivery
+ confirmation should also be deferred such that the internal queue
+ cannot grow to much.
+ This will not reliably avoid packet loss, but the probability
+ of packet loss in netif_rx() path will be significantly reduced.
+(3) Additionally, driver authors might consider to support
+ CONFIG_NET_HW_FLOWCONTROL. This allows the driver to be woken up
+ when a previously congested backlog queue becomes empty again.
+ The driver could uses this for flow-controlling the peer by means
+ of the LAPB protocol's flow-control service.
diff --git a/Documentation/networking/x25.txt b/Documentation/networking/x25.txt
new file mode 100644
index 000000000..c91c6d715
--- /dev/null
+++ b/Documentation/networking/x25.txt
@@ -0,0 +1,44 @@
+Linux X.25 Project
+
+As my third year dissertation at University I have taken it upon myself to
+write an X.25 implementation for Linux. My aim is to provide a complete X.25
+Packet Layer and a LAPB module to allow for "normal" X.25 to be run using
+Linux. There are two sorts of X.25 cards available, intelligent ones that
+implement LAPB on the card itself, and unintelligent ones that simply do
+framing, bit-stuffing and checksumming. These both need to be handled by the
+system.
+
+I therefore decided to write the implementation such that as far as the
+Packet Layer is concerned, the link layer was being performed by a lower
+layer of the Linux kernel and therefore it did not concern itself with
+implementation of LAPB. Therefore the LAPB modules would be called by
+unintelligent X.25 card drivers and not by intelligent ones, this would
+provide a uniform device driver interface, and simplify configuration.
+
+To confuse matters a little, an 802.2 LLC implementation for Linux is being
+written which will allow X.25 to be run over an Ethernet (or Token Ring) and
+conform with the JNT "Pink Book", this will have a different interface to
+the Packet Layer but there will be no confusion since the class of device
+being served by the LLC will be completely separate from LAPB. The LLC
+implementation is being done as part of another protocol project (SNA) and
+by a different author.
+
+Just when you thought that it could not become more confusing, another
+option appeared, XOT. This allows X.25 Packet Layer frames to operate over
+the Internet using TCP/IP as a reliable link layer. RFC1613 specifies the
+format and behaviour of the protocol. If time permits this option will also
+be actively considered.
+
+A linux-x25 mailing list has been created at vger.kernel.org to support the
+development and use of Linux X.25. It is early days yet, but interested
+parties are welcome to subscribe to it. Just send a message to
+majordomo@vger.kernel.org with the following in the message body:
+
+subscribe linux-x25
+end
+
+The contents of the Subject line are ignored.
+
+Jonathan
+
+g4klx@g4klx.demon.co.uk
diff --git a/Documentation/networking/xfrm_proc.txt b/Documentation/networking/xfrm_proc.txt
new file mode 100644
index 000000000..d0d8bafa9
--- /dev/null
+++ b/Documentation/networking/xfrm_proc.txt
@@ -0,0 +1,74 @@
+XFRM proc - /proc/net/xfrm_* files
+==================================
+Masahide NAKAMURA <nakam@linux-ipv6.org>
+
+
+Transformation Statistics
+-------------------------
+xfrm_proc is a statistics shown factor dropped by transformation
+for developer.
+It is a counter designed from current transformation source code
+and defined like linux private MIB.
+
+Inbound statistics
+~~~~~~~~~~~~~~~~~~
+XfrmInError:
+ All errors which is not matched others
+XfrmInBufferError:
+ No buffer is left
+XfrmInHdrError:
+ Header error
+XfrmInNoStates:
+ No state is found
+ i.e. Either inbound SPI, address, or IPsec protocol at SA is wrong
+XfrmInStateProtoError:
+ Transformation protocol specific error
+ e.g. SA key is wrong
+XfrmInStateModeError:
+ Transformation mode specific error
+XfrmInStateSeqError:
+ Sequence error
+ i.e. Sequence number is out of window
+XfrmInStateExpired:
+ State is expired
+XfrmInStateMismatch:
+ State has mismatch option
+ e.g. UDP encapsulation type is mismatch
+XfrmInStateInvalid:
+ State is invalid
+XfrmInTmplMismatch:
+ No matching template for states
+ e.g. Inbound SAs are correct but SP rule is wrong
+XfrmInNoPols:
+ No policy is found for states
+ e.g. Inbound SAs are correct but no SP is found
+XfrmInPolBlock:
+ Policy discards
+XfrmInPolError:
+ Policy error
+
+Outbound errors
+~~~~~~~~~~~~~~~
+XfrmOutError:
+ All errors which is not matched others
+XfrmOutBundleGenError:
+ Bundle generation error
+XfrmOutBundleCheckError:
+ Bundle check error
+XfrmOutNoStates:
+ No state is found
+XfrmOutStateProtoError:
+ Transformation protocol specific error
+XfrmOutStateModeError:
+ Transformation mode specific error
+XfrmOutStateSeqError:
+ Sequence error
+ i.e. Sequence number overflow
+XfrmOutStateExpired:
+ State is expired
+XfrmOutPolBlock:
+ Policy discards
+XfrmOutPolDead:
+ Policy is dead
+XfrmOutPolError:
+ Policy error
diff --git a/Documentation/networking/xfrm_sync.txt b/Documentation/networking/xfrm_sync.txt
new file mode 100644
index 000000000..d7aac9ded
--- /dev/null
+++ b/Documentation/networking/xfrm_sync.txt
@@ -0,0 +1,169 @@
+
+The sync patches work is based on initial patches from
+Krisztian <hidden@balabit.hu> and others and additional patches
+from Jamal <hadi@cyberus.ca>.
+
+The end goal for syncing is to be able to insert attributes + generate
+events so that the an SA can be safely moved from one machine to another
+for HA purposes.
+The idea is to synchronize the SA so that the takeover machine can do
+the processing of the SA as accurate as possible if it has access to it.
+
+We already have the ability to generate SA add/del/upd events.
+These patches add ability to sync and have accurate lifetime byte (to
+ensure proper decay of SAs) and replay counters to avoid replay attacks
+with as minimal loss at failover time.
+This way a backup stays as closely uptodate as an active member.
+
+Because the above items change for every packet the SA receives,
+it is possible for a lot of the events to be generated.
+For this reason, we also add a nagle-like algorithm to restrict
+the events. i.e we are going to set thresholds to say "let me
+know if the replay sequence threshold is reached or 10 secs have passed"
+These thresholds are set system-wide via sysctls or can be updated
+per SA.
+
+The identified items that need to be synchronized are:
+- the lifetime byte counter
+note that: lifetime time limit is not important if you assume the failover
+machine is known ahead of time since the decay of the time countdown
+is not driven by packet arrival.
+- the replay sequence for both inbound and outbound
+
+1) Message Structure
+----------------------
+
+nlmsghdr:aevent_id:optional-TLVs.
+
+The netlink message types are:
+
+XFRM_MSG_NEWAE and XFRM_MSG_GETAE.
+
+A XFRM_MSG_GETAE does not have TLVs.
+A XFRM_MSG_NEWAE will have at least two TLVs (as is
+discussed further below).
+
+aevent_id structure looks like:
+
+ struct xfrm_aevent_id {
+ struct xfrm_usersa_id sa_id;
+ xfrm_address_t saddr;
+ __u32 flags;
+ __u32 reqid;
+ };
+
+The unique SA is identified by the combination of xfrm_usersa_id,
+reqid and saddr.
+
+flags are used to indicate different things. The possible
+flags are:
+ XFRM_AE_RTHR=1, /* replay threshold*/
+ XFRM_AE_RVAL=2, /* replay value */
+ XFRM_AE_LVAL=4, /* lifetime value */
+ XFRM_AE_ETHR=8, /* expiry timer threshold */
+ XFRM_AE_CR=16, /* Event cause is replay update */
+ XFRM_AE_CE=32, /* Event cause is timer expiry */
+ XFRM_AE_CU=64, /* Event cause is policy update */
+
+How these flags are used is dependent on the direction of the
+message (kernel<->user) as well the cause (config, query or event).
+This is described below in the different messages.
+
+The pid will be set appropriately in netlink to recognize direction
+(0 to the kernel and pid = processid that created the event
+when going from kernel to user space)
+
+A program needs to subscribe to multicast group XFRMNLGRP_AEVENTS
+to get notified of these events.
+
+2) TLVS reflect the different parameters:
+-----------------------------------------
+
+a) byte value (XFRMA_LTIME_VAL)
+This TLV carries the running/current counter for byte lifetime since
+last event.
+
+b)replay value (XFRMA_REPLAY_VAL)
+This TLV carries the running/current counter for replay sequence since
+last event.
+
+c)replay threshold (XFRMA_REPLAY_THRESH)
+This TLV carries the threshold being used by the kernel to trigger events
+when the replay sequence is exceeded.
+
+d) expiry timer (XFRMA_ETIMER_THRESH)
+This is a timer value in milliseconds which is used as the nagle
+value to rate limit the events.
+
+3) Default configurations for the parameters:
+----------------------------------------------
+
+By default these events should be turned off unless there is
+at least one listener registered to listen to the multicast
+group XFRMNLGRP_AEVENTS.
+
+Programs installing SAs will need to specify the two thresholds, however,
+in order to not change existing applications such as racoon
+we also provide default threshold values for these different parameters
+in case they are not specified.
+
+the two sysctls/proc entries are:
+a) /proc/sys/net/core/sysctl_xfrm_aevent_etime
+used to provide default values for the XFRMA_ETIMER_THRESH in incremental
+units of time of 100ms. The default is 10 (1 second)
+
+b) /proc/sys/net/core/sysctl_xfrm_aevent_rseqth
+used to provide default values for XFRMA_REPLAY_THRESH parameter
+in incremental packet count. The default is two packets.
+
+4) Message types
+----------------
+
+a) XFRM_MSG_GETAE issued by user-->kernel.
+XFRM_MSG_GETAE does not carry any TLVs.
+The response is a XFRM_MSG_NEWAE which is formatted based on what
+XFRM_MSG_GETAE queried for.
+The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
+*if XFRM_AE_RTHR flag is set, then XFRMA_REPLAY_THRESH is also retrieved
+*if XFRM_AE_ETHR flag is set, then XFRMA_ETIMER_THRESH is also retrieved
+
+b) XFRM_MSG_NEWAE is issued by either user space to configure
+or kernel to announce events or respond to a XFRM_MSG_GETAE.
+
+i) user --> kernel to configure a specific SA.
+any of the values or threshold parameters can be updated by passing the
+appropriate TLV.
+A response is issued back to the sender in user space to indicate success
+or failure.
+In the case of success, additionally an event with
+XFRM_MSG_NEWAE is also issued to any listeners as described in iii).
+
+ii) kernel->user direction as a response to XFRM_MSG_GETAE
+The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
+The threshold TLVs will be included if explicitly requested in
+the XFRM_MSG_GETAE message.
+
+iii) kernel->user to report as event if someone sets any values or
+thresholds for an SA using XFRM_MSG_NEWAE (as described in #i above).
+In such a case XFRM_AE_CU flag is set to inform the user that
+the change happened as a result of an update.
+The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
+
+iv) kernel->user to report event when replay threshold or a timeout
+is exceeded.
+In such a case either XFRM_AE_CR (replay exceeded) or XFRM_AE_CE (timeout
+happened) is set to inform the user what happened.
+Note the two flags are mutually exclusive.
+The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs.
+
+Exceptions to threshold settings
+--------------------------------
+
+If you have an SA that is getting hit by traffic in bursts such that
+there is a period where the timer threshold expires with no packets
+seen, then an odd behavior is seen as follows:
+The first packet arrival after a timer expiry will trigger a timeout
+aevent; i.e we dont wait for a timeout period or a packet threshold
+to be reached. This is done for simplicity and efficiency reasons.
+
+-JHS
diff --git a/Documentation/networking/xfrm_sysctl.txt b/Documentation/networking/xfrm_sysctl.txt
new file mode 100644
index 000000000..5bbd16792
--- /dev/null
+++ b/Documentation/networking/xfrm_sysctl.txt
@@ -0,0 +1,4 @@
+/proc/sys/net/core/xfrm_* Variables:
+
+xfrm_acq_expires - INTEGER
+ default 30 - hard timeout in seconds for acquire requests
diff --git a/Documentation/networking/z8530drv.txt b/Documentation/networking/z8530drv.txt
new file mode 100644
index 000000000..2206abbc3
--- /dev/null
+++ b/Documentation/networking/z8530drv.txt
@@ -0,0 +1,657 @@
+This is a subset of the documentation. To use this driver you MUST have the
+full package from:
+
+Internet:
+=========
+
+1. ftp://ftp.ccac.rwth-aachen.de/pub/jr/z8530drv-utils_3.0-3.tar.gz
+
+2. ftp://ftp.pspt.fi/pub/ham/linux/ax25/z8530drv-utils_3.0-3.tar.gz
+
+Please note that the information in this document may be hopelessly outdated.
+A new version of the documentation, along with links to other important
+Linux Kernel AX.25 documentation and programs, is available on
+http://yaina.de/jreuter
+
+-----------------------------------------------------------------------------
+
+
+ SCC.C - Linux driver for Z8530 based HDLC cards for AX.25
+
+ ********************************************************************
+
+ (c) 1993,2000 by Joerg Reuter DL1BKE <jreuter@yaina.de>
+
+ portions (c) 1993 Guido ten Dolle PE1NNZ
+
+ for the complete copyright notice see >> Copying.Z8530DRV <<
+
+ ********************************************************************
+
+
+1. Initialization of the driver
+===============================
+
+To use the driver, 3 steps must be performed:
+
+ 1. if compiled as module: loading the module
+ 2. Setup of hardware, MODEM and KISS parameters with sccinit
+ 3. Attach each channel to the Linux kernel AX.25 with "ifconfig"
+
+Unlike the versions below 2.4 this driver is a real network device
+driver. If you want to run xNOS instead of our fine kernel AX.25
+use a 2.x version (available from above sites) or read the
+AX.25-HOWTO on how to emulate a KISS TNC on network device drivers.
+
+
+1.1 Loading the module
+======================
+
+(If you're going to compile the driver as a part of the kernel image,
+ skip this chapter and continue with 1.2)
+
+Before you can use a module, you'll have to load it with
+
+ insmod scc.o
+
+please read 'man insmod' that comes with module-init-tools.
+
+You should include the insmod in one of the /etc/rc.d/rc.* files,
+and don't forget to insert a call of sccinit after that. It
+will read your /etc/z8530drv.conf.
+
+1.2. /etc/z8530drv.conf
+=======================
+
+To setup all parameters you must run /sbin/sccinit from one
+of your rc.*-files. This has to be done BEFORE you can
+"ifconfig" an interface. Sccinit reads the file /etc/z8530drv.conf
+and sets the hardware, MODEM and KISS parameters. A sample file is
+delivered with this package. Change it to your needs.
+
+The file itself consists of two main sections.
+
+1.2.1 configuration of hardware parameters
+==========================================
+
+The hardware setup section defines the following parameters for each
+Z8530:
+
+chip 1
+data_a 0x300 # data port A
+ctrl_a 0x304 # control port A
+data_b 0x301 # data port B
+ctrl_b 0x305 # control port B
+irq 5 # IRQ No. 5
+pclock 4915200 # clock
+board BAYCOM # hardware type
+escc no # enhanced SCC chip? (8580/85180/85280)
+vector 0 # latch for interrupt vector
+special no # address of special function register
+option 0 # option to set via sfr
+
+
+chip - this is just a delimiter to make sccinit a bit simpler to
+ program. A parameter has no effect.
+
+data_a - the address of the data port A of this Z8530 (needed)
+ctrl_a - the address of the control port A (needed)
+data_b - the address of the data port B (needed)
+ctrl_b - the address of the control port B (needed)
+
+irq - the used IRQ for this chip. Different chips can use different
+ IRQs or the same. If they share an interrupt, it needs to be
+ specified within one chip-definition only.
+
+pclock - the clock at the PCLK pin of the Z8530 (option, 4915200 is
+ default), measured in Hertz
+
+board - the "type" of the board:
+
+ SCC type value
+ ---------------------------------
+ PA0HZP SCC card PA0HZP
+ EAGLE card EAGLE
+ PC100 card PC100
+ PRIMUS-PC (DG9BL) card PRIMUS
+ BayCom (U)SCC card BAYCOM
+
+escc - if you want support for ESCC chips (8580, 85180, 85280), set
+ this to "yes" (option, defaults to "no")
+
+vector - address of the vector latch (aka "intack port") for PA0HZP
+ cards. There can be only one vector latch for all chips!
+ (option, defaults to 0)
+
+special - address of the special function register on several cards.
+ (option, defaults to 0)
+
+option - The value you write into that register (option, default is 0)
+
+You can specify up to four chips (8 channels). If this is not enough,
+just change
+
+ #define MAXSCC 4
+
+to a higher value.
+
+Example for the BAYCOM USCC:
+----------------------------
+
+chip 1
+data_a 0x300 # data port A
+ctrl_a 0x304 # control port A
+data_b 0x301 # data port B
+ctrl_b 0x305 # control port B
+irq 5 # IRQ No. 5 (#)
+board BAYCOM # hardware type (*)
+#
+# SCC chip 2
+#
+chip 2
+data_a 0x302
+ctrl_a 0x306
+data_b 0x303
+ctrl_b 0x307
+board BAYCOM
+
+An example for a PA0HZP card:
+-----------------------------
+
+chip 1
+data_a 0x153
+data_b 0x151
+ctrl_a 0x152
+ctrl_b 0x150
+irq 9
+pclock 4915200
+board PA0HZP
+vector 0x168
+escc no
+#
+#
+#
+chip 2
+data_a 0x157
+data_b 0x155
+ctrl_a 0x156
+ctrl_b 0x154
+irq 9
+pclock 4915200
+board PA0HZP
+vector 0x168
+escc no
+
+A DRSI would should probably work with this:
+--------------------------------------------
+(actually: two DRSI cards...)
+
+chip 1
+data_a 0x303
+data_b 0x301
+ctrl_a 0x302
+ctrl_b 0x300
+irq 7
+pclock 4915200
+board DRSI
+escc no
+#
+#
+#
+chip 2
+data_a 0x313
+data_b 0x311
+ctrl_a 0x312
+ctrl_b 0x310
+irq 7
+pclock 4915200
+board DRSI
+escc no
+
+Note that you cannot use the on-board baudrate generator off DRSI
+cards. Use "mode dpll" for clock source (see below).
+
+This is based on information provided by Mike Bilow (and verified
+by Paul Helay)
+
+The utility "gencfg"
+--------------------
+
+If you only know the parameters for the PE1CHL driver for DOS,
+run gencfg. It will generate the correct port addresses (I hope).
+Its parameters are exactly the same as the ones you use with
+the "attach scc" command in net, except that the string "init" must
+not appear. Example:
+
+gencfg 2 0x150 4 2 0 1 0x168 9 4915200
+
+will print a skeleton z8530drv.conf for the OptoSCC to stdout.
+
+gencfg 2 0x300 2 4 5 -4 0 7 4915200 0x10
+
+does the same for the BAYCOM USCC card. In my opinion it is much easier
+to edit scc_config.h...
+
+
+1.2.2 channel configuration
+===========================
+
+The channel definition is divided into three sub sections for each
+channel:
+
+An example for scc0:
+
+# DEVICE
+
+device scc0 # the device for the following params
+
+# MODEM / BUFFERS
+
+speed 1200 # the default baudrate
+clock dpll # clock source:
+ # dpll = normal half duplex operation
+ # external = MODEM provides own Rx/Tx clock
+ # divider = use full duplex divider if
+ # installed (1)
+mode nrzi # HDLC encoding mode
+ # nrzi = 1k2 MODEM, G3RUH 9k6 MODEM
+ # nrz = DF9IC 9k6 MODEM
+ #
+bufsize 384 # size of buffers. Note that this must include
+ # the AX.25 header, not only the data field!
+ # (optional, defaults to 384)
+
+# KISS (Layer 1)
+
+txdelay 36 # (see chapter 1.4)
+persist 64
+slot 8
+tail 8
+fulldup 0
+wait 12
+min 3
+maxkey 7
+idle 3
+maxdef 120
+group 0
+txoff off
+softdcd on
+slip off
+
+The order WITHIN these sections is unimportant. The order OF these
+sections IS important. The MODEM parameters are set with the first
+recognized KISS parameter...
+
+Please note that you can initialize the board only once after boot
+(or insmod). You can change all parameters but "mode" and "clock"
+later with the Sccparam program or through KISS. Just to avoid
+security holes...
+
+(1) this divider is usually mounted on the SCC-PBC (PA0HZP) or not
+ present at all (BayCom). It feeds back the output of the DPLL
+ (digital pll) as transmit clock. Using this mode without a divider
+ installed will normally result in keying the transceiver until
+ maxkey expires --- of course without sending anything (useful).
+
+2. Attachment of a channel by your AX.25 software
+=================================================
+
+2.1 Kernel AX.25
+================
+
+To set up an AX.25 device you can simply type:
+
+ ifconfig scc0 44.128.1.1 hw ax25 dl0tha-7
+
+This will create a network interface with the IP number 44.128.20.107
+and the callsign "dl0tha". If you do not have any IP number (yet) you
+can use any of the 44.128.0.0 network. Note that you do not need
+axattach. The purpose of axattach (like slattach) is to create a KISS
+network device linked to a TTY. Please read the documentation of the
+ax25-utils and the AX.25-HOWTO to learn how to set the parameters of
+the kernel AX.25.
+
+2.2 NOS, NET and TFKISS
+=======================
+
+Since the TTY driver (aka KISS TNC emulation) is gone you need
+to emulate the old behaviour. The cost of using these programs is
+that you probably need to compile the kernel AX.25, regardless of whether
+you actually use it or not. First setup your /etc/ax25/axports,
+for example:
+
+ 9k6 dl0tha-9 9600 255 4 9600 baud port (scc3)
+ axlink dl0tha-15 38400 255 4 Link to NOS
+
+Now "ifconfig" the scc device:
+
+ ifconfig scc3 44.128.1.1 hw ax25 dl0tha-9
+
+You can now axattach a pseudo-TTY:
+
+ axattach /dev/ptys0 axlink
+
+and start your NOS and attach /dev/ptys0 there. The problem is that
+NOS is reachable only via digipeating through the kernel AX.25
+(disastrous on a DAMA controlled channel). To solve this problem,
+configure "rxecho" to echo the incoming frames from "9k6" to "axlink"
+and outgoing frames from "axlink" to "9k6" and start:
+
+ rxecho
+
+Or simply use "kissbridge" coming with z8530drv-utils:
+
+ ifconfig scc3 hw ax25 dl0tha-9
+ kissbridge scc3 /dev/ptys0
+
+
+3. Adjustment and Display of parameters
+=======================================
+
+3.1 Displaying SCC Parameters:
+==============================
+
+Once a SCC channel has been attached, the parameter settings and
+some statistic information can be shown using the param program:
+
+dl1bke-u:~$ sccstat scc0
+
+Parameters:
+
+speed : 1200 baud
+txdelay : 36
+persist : 255
+slottime : 0
+txtail : 8
+fulldup : 1
+waittime : 12
+mintime : 3 sec
+maxkeyup : 7 sec
+idletime : 3 sec
+maxdefer : 120 sec
+group : 0x00
+txoff : off
+softdcd : on
+SLIP : off
+
+Status:
+
+HDLC Z8530 Interrupts Buffers
+-----------------------------------------------------------------------
+Sent : 273 RxOver : 0 RxInts : 125074 Size : 384
+Received : 1095 TxUnder: 0 TxInts : 4684 NoSpace : 0
+RxErrors : 1591 ExInts : 11776
+TxErrors : 0 SpInts : 1503
+Tx State : idle
+
+
+The status info shown is:
+
+Sent - number of frames transmitted
+Received - number of frames received
+RxErrors - number of receive errors (CRC, ABORT)
+TxErrors - number of discarded Tx frames (due to various reasons)
+Tx State - status of the Tx interrupt handler: idle/busy/active/tail (2)
+RxOver - number of receiver overruns
+TxUnder - number of transmitter underruns
+RxInts - number of receiver interrupts
+TxInts - number of transmitter interrupts
+EpInts - number of receiver special condition interrupts
+SpInts - number of external/status interrupts
+Size - maximum size of an AX.25 frame (*with* AX.25 headers!)
+NoSpace - number of times a buffer could not get allocated
+
+An overrun is abnormal. If lots of these occur, the product of
+baudrate and number of interfaces is too high for the processing
+power of your computer. NoSpace errors are unlikely to be caused by the
+driver or the kernel AX.25.
+
+
+3.2 Setting Parameters
+======================
+
+
+The setting of parameters of the emulated KISS TNC is done in the
+same way in the SCC driver. You can change parameters by using
+the kissparms program from the ax25-utils package or use the program
+"sccparam":
+
+ sccparam <device> <paramname> <decimal-|hexadecimal value>
+
+You can change the following parameters:
+
+param : value
+------------------------
+speed : 1200
+txdelay : 36
+persist : 255
+slottime : 0
+txtail : 8
+fulldup : 1
+waittime : 12
+mintime : 3
+maxkeyup : 7
+idletime : 3
+maxdefer : 120
+group : 0x00
+txoff : off
+softdcd : on
+SLIP : off
+
+
+The parameters have the following meaning:
+
+speed:
+ The baudrate on this channel in bits/sec
+
+ Example: sccparam /dev/scc3 speed 9600
+
+txdelay:
+ The delay (in units of 10 ms) after keying of the
+ transmitter, until the first byte is sent. This is usually
+ called "TXDELAY" in a TNC. When 0 is specified, the driver
+ will just wait until the CTS signal is asserted. This
+ assumes the presence of a timer or other circuitry in the
+ MODEM and/or transmitter, that asserts CTS when the
+ transmitter is ready for data.
+ A normal value of this parameter is 30-36.
+
+ Example: sccparam /dev/scc0 txd 20
+
+persist:
+ This is the probability that the transmitter will be keyed
+ when the channel is found to be free. It is a value from 0
+ to 255, and the probability is (value+1)/256. The value
+ should be somewhere near 50-60, and should be lowered when
+ the channel is used more heavily.
+
+ Example: sccparam /dev/scc2 persist 20
+
+slottime:
+ This is the time between samples of the channel. It is
+ expressed in units of 10 ms. About 200-300 ms (value 20-30)
+ seems to be a good value.
+
+ Example: sccparam /dev/scc0 slot 20
+
+tail:
+ The time the transmitter will remain keyed after the last
+ byte of a packet has been transferred to the SCC. This is
+ necessary because the CRC and a flag still have to leave the
+ SCC before the transmitter is keyed down. The value depends
+ on the baudrate selected. A few character times should be
+ sufficient, e.g. 40ms at 1200 baud. (value 4)
+ The value of this parameter is in 10 ms units.
+
+ Example: sccparam /dev/scc2 4
+
+full:
+ The full-duplex mode switch. This can be one of the following
+ values:
+
+ 0: The interface will operate in CSMA mode (the normal
+ half-duplex packet radio operation)
+ 1: Fullduplex mode, i.e. the transmitter will be keyed at
+ any time, without checking the received carrier. It
+ will be unkeyed when there are no packets to be sent.
+ 2: Like 1, but the transmitter will remain keyed, also
+ when there are no packets to be sent. Flags will be
+ sent in that case, until a timeout (parameter 10)
+ occurs.
+
+ Example: sccparam /dev/scc0 fulldup off
+
+wait:
+ The initial waittime before any transmit attempt, after the
+ frame has been queue for transmit. This is the length of
+ the first slot in CSMA mode. In full duplex modes it is
+ set to 0 for maximum performance.
+ The value of this parameter is in 10 ms units.
+
+ Example: sccparam /dev/scc1 wait 4
+
+maxkey:
+ The maximal time the transmitter will be keyed to send
+ packets, in seconds. This can be useful on busy CSMA
+ channels, to avoid "getting a bad reputation" when you are
+ generating a lot of traffic. After the specified time has
+ elapsed, no new frame will be started. Instead, the trans-
+ mitter will be switched off for a specified time (parameter
+ min), and then the selected algorithm for keyup will be
+ started again.
+ The value 0 as well as "off" will disable this feature,
+ and allow infinite transmission time.
+
+ Example: sccparam /dev/scc0 maxk 20
+
+min:
+ This is the time the transmitter will be switched off when
+ the maximum transmission time is exceeded.
+
+ Example: sccparam /dev/scc3 min 10
+
+idle
+ This parameter specifies the maximum idle time in full duplex
+ 2 mode, in seconds. When no frames have been sent for this
+ time, the transmitter will be keyed down. A value of 0 is
+ has same result as the fullduplex mode 1. This parameter
+ can be disabled.
+
+ Example: sccparam /dev/scc2 idle off # transmit forever
+
+maxdefer
+ This is the maximum time (in seconds) to wait for a free channel
+ to send. When this timer expires the transmitter will be keyed
+ IMMEDIATELY. If you love to get trouble with other users you
+ should set this to a very low value ;-)
+
+ Example: sccparam /dev/scc0 maxdefer 240 # 2 minutes
+
+
+txoff:
+ When this parameter has the value 0, the transmission of packets
+ is enable. Otherwise it is disabled.
+
+ Example: sccparam /dev/scc2 txoff on
+
+group:
+ It is possible to build special radio equipment to use more than
+ one frequency on the same band, e.g. using several receivers and
+ only one transmitter that can be switched between frequencies.
+ Also, you can connect several radios that are active on the same
+ band. In these cases, it is not possible, or not a good idea, to
+ transmit on more than one frequency. The SCC driver provides a
+ method to lock transmitters on different interfaces, using the
+ "param <interface> group <x>" command. This will only work when
+ you are using CSMA mode (parameter full = 0).
+ The number <x> must be 0 if you want no group restrictions, and
+ can be computed as follows to create restricted groups:
+ <x> is the sum of some OCTAL numbers:
+
+ 200 This transmitter will only be keyed when all other
+ transmitters in the group are off.
+ 100 This transmitter will only be keyed when the carrier
+ detect of all other interfaces in the group is off.
+ 0xx A byte that can be used to define different groups.
+ Interfaces are in the same group, when the logical AND
+ between their xx values is nonzero.
+
+ Examples:
+ When 2 interfaces use group 201, their transmitters will never be
+ keyed at the same time.
+ When 2 interfaces use group 101, the transmitters will only key
+ when both channels are clear at the same time. When group 301,
+ the transmitters will not be keyed at the same time.
+
+ Don't forget to convert the octal numbers into decimal before
+ you set the parameter.
+
+ Example: (to be written)
+
+softdcd:
+ use a software dcd instead of the real one... Useful for a very
+ slow squelch.
+
+ Example: sccparam /dev/scc0 soft on
+
+
+4. Problems
+===========
+
+If you have tx-problems with your BayCom USCC card please check
+the manufacturer of the 8530. SGS chips have a slightly
+different timing. Try Zilog... A solution is to write to register 8
+instead to the data port, but this won't work with the ESCC chips.
+*SIGH!*
+
+A very common problem is that the PTT locks until the maxkeyup timer
+expires, although interrupts and clock source are correct. In most
+cases compiling the driver with CONFIG_SCC_DELAY (set with
+make config) solves the problems. For more hints read the (pseudo) FAQ
+and the documentation coming with z8530drv-utils.
+
+I got reports that the driver has problems on some 386-based systems.
+(i.e. Amstrad) Those systems have a bogus AT bus timing which will
+lead to delayed answers on interrupts. You can recognize these
+problems by looking at the output of Sccstat for the suspected
+port. If it shows under- and overruns you own such a system.
+
+Delayed processing of received data: This depends on
+
+- the kernel version
+
+- kernel profiling compiled or not
+
+- a high interrupt load
+
+- a high load of the machine --- running X, Xmorph, XV and Povray,
+ while compiling the kernel... hmm ... even with 32 MB RAM ... ;-)
+ Or running a named for the whole .ampr.org domain on an 8 MB
+ box...
+
+- using information from rxecho or kissbridge.
+
+Kernel panics: please read /linux/README and find out if it
+really occurred within the scc driver.
+
+If you cannot solve a problem, send me
+
+- a description of the problem,
+- information on your hardware (computer system, scc board, modem)
+- your kernel version
+- the output of cat /proc/net/z8530
+
+4. Thor RLC100
+==============
+
+Mysteriously this board seems not to work with the driver. Anyone
+got it up-and-running?
+
+
+Many thanks to Linus Torvalds and Alan Cox for including the driver
+in the Linux standard distribution and their support.
+
+Joerg Reuter ampr-net: dl1bke@db0pra.ampr.org
+ AX-25 : DL1BKE @ DB0ABH.#BAY.DEU.EU
+ Internet: jreuter@yaina.de
+ WWW : http://yaina.de/jreuter