Initial import

author: André Fabian Silva Delgado <emulatorman@parabola.nu> 2015-08-05 17:04:01 -0300
committer: André Fabian Silva Delgado <emulatorman@parabola.nu> 2015-08-05 17:04:01 -0300
commit: 57f0f512b273f60d52568b8c6b77e17f5636edc0 (patch)
tree: 5e910f0e82173f4ef4f51111366a3f1299037a7b /Documentation/x86
21 files changed, 3365 insertions, 0 deletions
diff --git a/Documentation/x86/00-INDEX b/Documentation/x86/00-INDEX
new file mode 100644
index 000000000..692264456
--- /dev/null
+++ b/Documentation/x86/00-INDEX
@@ -0,0 +1,20 @@
+00-INDEX
+	- this file
+boot.txt
+	- List of boot protocol versions
+early-microcode.txt
+	- How to load microcode from an initrd-CPIO archive early to fix CPU issues.
+earlyprintk.txt
+	- Using earlyprintk with a USB2 debug port key.
+entry_64.txt
+	- Describe (some of the) kernel entry points for x86.
+exception-tables.txt
+	- why and how Linux kernel uses exception tables on x86
+mtrr.txt
+	- how to use x86 Memory Type Range Registers to increase performance
+pat.txt
+	- Page Attribute Table intro and API
+usb-legacy-support.txt
+	- how to fix/avoid quirks when using emulated PS/2 mouse/keyboard.
+zero-page.txt
+	- layout of the first page of memory.
diff --git a/Documentation/x86/boot.txt b/Documentation/x86/boot.txt
new file mode 100644
index 000000000..88b85899d
--- /dev/null
+++ b/Documentation/x86/boot.txt
@@ -0,0 +1,1131 @@
+		     THE LINUX/x86 BOOT PROTOCOL
+		     ---------------------------
+
+On the x86 platform, the Linux kernel uses a rather complicated boot
+convention.  This has evolved partially due to historical aspects, as
+well as the desire in the early days to have the kernel itself be a
+bootable image, the complicated PC memory model and due to changed
+expectations in the PC industry caused by the effective demise of
+real-mode DOS as a mainstream operating system.
+
+Currently, the following versions of the Linux/x86 boot protocol exist.
+
+Old kernels:	zImage/Image support only.  Some very early kernels
+		may not even support a command line.
+
+Protocol 2.00:	(Kernel 1.3.73) Added bzImage and initrd support, as
+		well as a formalized way to communicate between the
+		boot loader and the kernel.  setup.S made relocatable,
+		although the traditional setup area still assumed
+		writable.
+
+Protocol 2.01:	(Kernel 1.3.76) Added a heap overrun warning.
+
+Protocol 2.02:	(Kernel 2.4.0-test3-pre3) New command line protocol.
+		Lower the conventional memory ceiling.	No overwrite
+		of the traditional setup area, thus making booting
+		safe for systems which use the EBDA from SMM or 32-bit
+		BIOS entry points.  zImage deprecated but still
+		supported.
+
+Protocol 2.03:	(Kernel 2.4.18-pre1) Explicitly makes the highest possible
+		initrd address available to the bootloader.
+
+Protocol 2.04:	(Kernel 2.6.14) Extend the syssize field to four bytes.
+
+Protocol 2.05:	(Kernel 2.6.20) Make protected mode kernel relocatable.
+		Introduce relocatable_kernel and kernel_alignment fields.
+
+Protocol 2.06:	(Kernel 2.6.22) Added a field that contains the size of
+		the boot command line.
+
+Protocol 2.07:	(Kernel 2.6.24) Added paravirtualised boot protocol.
+		Introduced hardware_subarch and hardware_subarch_data
+		and KEEP_SEGMENTS flag in load_flags.
+
+Protocol 2.08:	(Kernel 2.6.26) Added crc32 checksum and ELF format
+		payload. Introduced payload_offset and payload_length
+		fields to aid in locating the payload.
+
+Protocol 2.09:	(Kernel 2.6.26) Added a field of 64-bit physical
+		pointer to single linked list of struct	setup_data.
+
+Protocol 2.10:	(Kernel 2.6.31) Added a protocol for relaxed alignment
+		beyond the kernel_alignment added, new init_size and
+		pref_address fields.  Added extended boot loader IDs.
+
+Protocol 2.11:	(Kernel 3.6) Added a field for offset of EFI handover
+		protocol entry point.
+
+Protocol 2.12:	(Kernel 3.8) Added the xloadflags field and extension fields
+	 	to struct boot_params for loading bzImage and ramdisk
+		above 4G in 64bit.
+
+**** MEMORY LAYOUT
+
+The traditional memory map for the kernel loader, used for Image or
+zImage kernels, typically looks like:
+
+	|			 |
+0A0000	+------------------------+
+	|  Reserved for BIOS	 |	Do not use.  Reserved for BIOS EBDA.
+09A000	+------------------------+
+	|  Command line		 |
+	|  Stack/heap		 |	For use by the kernel real-mode code.
+098000	+------------------------+	
+	|  Kernel setup		 |	The kernel real-mode code.
+090200	+------------------------+
+	|  Kernel boot sector	 |	The kernel legacy boot sector.
+090000	+------------------------+
+	|  Protected-mode kernel |	The bulk of the kernel image.
+010000	+------------------------+
+	|  Boot loader		 |	<- Boot sector entry point 0000:7C00
+001000	+------------------------+
+	|  Reserved for MBR/BIOS |
+000800	+------------------------+
+	|  Typically used by MBR |
+000600	+------------------------+ 
+	|  BIOS use only	 |
+000000	+------------------------+
+
+
+When using bzImage, the protected-mode kernel was relocated to
+0x100000 ("high memory"), and the kernel real-mode block (boot sector,
+setup, and stack/heap) was made relocatable to any address between
+0x10000 and end of low memory. Unfortunately, in protocols 2.00 and
+2.01 the 0x90000+ memory range is still used internally by the kernel;
+the 2.02 protocol resolves that problem.
+
+It is desirable to keep the "memory ceiling" -- the highest point in
+low memory touched by the boot loader -- as low as possible, since
+some newer BIOSes have begun to allocate some rather large amounts of
+memory, called the Extended BIOS Data Area, near the top of low
+memory.	 The boot loader should use the "INT 12h" BIOS call to verify
+how much low memory is available.
+
+Unfortunately, if INT 12h reports that the amount of memory is too
+low, there is usually nothing the boot loader can do but to report an
+error to the user.  The boot loader should therefore be designed to
+take up as little space in low memory as it reasonably can.  For
+zImage or old bzImage kernels, which need data written into the
+0x90000 segment, the boot loader should make sure not to use memory
+above the 0x9A000 point; too many BIOSes will break above that point.
+
+For a modern bzImage kernel with boot protocol version >= 2.02, a
+memory layout like the following is suggested:
+
+	~                        ~
+        |  Protected-mode kernel |
+100000  +------------------------+
+	|  I/O memory hole	 |
+0A0000	+------------------------+
+	|  Reserved for BIOS	 |	Leave as much as possible unused
+	~                        ~
+	|  Command line		 |	(Can also be below the X+10000 mark)
+X+10000	+------------------------+
+	|  Stack/heap		 |	For use by the kernel real-mode code.
+X+08000	+------------------------+	
+	|  Kernel setup		 |	The kernel real-mode code.
+	|  Kernel boot sector	 |	The kernel legacy boot sector.
+X       +------------------------+
+	|  Boot loader		 |	<- Boot sector entry point 0000:7C00
+001000	+------------------------+
+	|  Reserved for MBR/BIOS |
+000800	+------------------------+
+	|  Typically used by MBR |
+000600	+------------------------+ 
+	|  BIOS use only	 |
+000000	+------------------------+
+
+... where the address X is as low as the design of the boot loader
+permits.
+
+
+**** THE REAL-MODE KERNEL HEADER
+
+In the following text, and anywhere in the kernel boot sequence, "a
+sector" refers to 512 bytes.  It is independent of the actual sector
+size of the underlying medium.
+
+The first step in loading a Linux kernel should be to load the
+real-mode code (boot sector and setup code) and then examine the
+following header at offset 0x01f1.  The real-mode code can total up to
+32K, although the boot loader may choose to load only the first two
+sectors (1K) and then examine the bootup sector size.
+
+The header looks like:
+
+Offset	Proto	Name		Meaning
+/Size
+
+01F1/1	ALL(1	setup_sects	The size of the setup in sectors
+01F2/2	ALL	root_flags	If set, the root is mounted readonly
+01F4/4	2.04+(2	syssize		The size of the 32-bit code in 16-byte paras
+01F8/2	ALL	ram_size	DO NOT USE - for bootsect.S use only
+01FA/2	ALL	vid_mode	Video mode control
+01FC/2	ALL	root_dev	Default root device number
+01FE/2	ALL	boot_flag	0xAA55 magic number
+0200/2	2.00+	jump		Jump instruction
+0202/4	2.00+	header		Magic signature "HdrS"
+0206/2	2.00+	version		Boot protocol version supported
+0208/4	2.00+	realmode_swtch	Boot loader hook (see below)
+020C/2	2.00+	start_sys_seg	The load-low segment (0x1000) (obsolete)
+020E/2	2.00+	kernel_version	Pointer to kernel version string
+0210/1	2.00+	type_of_loader	Boot loader identifier
+0211/1	2.00+	loadflags	Boot protocol option flags
+0212/2	2.00+	setup_move_size	Move to high memory size (used with hooks)
+0214/4	2.00+	code32_start	Boot loader hook (see below)
+0218/4	2.00+	ramdisk_image	initrd load address (set by boot loader)
+021C/4	2.00+	ramdisk_size	initrd size (set by boot loader)
+0220/4	2.00+	bootsect_kludge	DO NOT USE - for bootsect.S use only
+0224/2	2.01+	heap_end_ptr	Free memory after setup end
+0226/1	2.02+(3 ext_loader_ver	Extended boot loader version
+0227/1	2.02+(3	ext_loader_type	Extended boot loader ID
+0228/4	2.02+	cmd_line_ptr	32-bit pointer to the kernel command line
+022C/4	2.03+	initrd_addr_max	Highest legal initrd address
+0230/4	2.05+	kernel_alignment Physical addr alignment required for kernel
+0234/1	2.05+	relocatable_kernel Whether kernel is relocatable or not
+0235/1	2.10+	min_alignment	Minimum alignment, as a power of two
+0236/2	2.12+	xloadflags	Boot protocol option flags
+0238/4	2.06+	cmdline_size	Maximum size of the kernel command line
+023C/4	2.07+	hardware_subarch Hardware subarchitecture
+0240/8	2.07+	hardware_subarch_data Subarchitecture-specific data
+0248/4	2.08+	payload_offset	Offset of kernel payload
+024C/4	2.08+	payload_length	Length of kernel payload
+0250/8	2.09+	setup_data	64-bit physical pointer to linked list
+				of struct setup_data
+0258/8	2.10+	pref_address	Preferred loading address
+0260/4	2.10+	init_size	Linear memory required during initialization
+0264/4	2.11+	handover_offset	Offset of handover entry point
+
+(1) For backwards compatibility, if the setup_sects field contains 0, the
+    real value is 4.
+
+(2) For boot protocol prior to 2.04, the upper two bytes of the syssize
+    field are unusable, which means the size of a bzImage kernel
+    cannot be determined.
+
+(3) Ignored, but safe to set, for boot protocols 2.02-2.09.
+
+If the "HdrS" (0x53726448) magic number is not found at offset 0x202,
+the boot protocol version is "old".  Loading an old kernel, the
+following parameters should be assumed:
+
+	Image type = zImage
+	initrd not supported
+	Real-mode kernel must be located at 0x90000.
+
+Otherwise, the "version" field contains the protocol version,
+e.g. protocol version 2.01 will contain 0x0201 in this field.  When
+setting fields in the header, you must make sure only to set fields
+supported by the protocol version in use.
+
+
+**** DETAILS OF HEADER FIELDS
+
+For each field, some are information from the kernel to the bootloader
+("read"), some are expected to be filled out by the bootloader
+("write"), and some are expected to be read and modified by the
+bootloader ("modify").
+
+All general purpose boot loaders should write the fields marked
+(obligatory).  Boot loaders who want to load the kernel at a
+nonstandard address should fill in the fields marked (reloc); other
+boot loaders can ignore those fields.
+
+The byte order of all fields is littleendian (this is x86, after all.)
+
+Field name:	setup_sects
+Type:		read
+Offset/size:	0x1f1/1
+Protocol:	ALL
+
+  The size of the setup code in 512-byte sectors.  If this field is
+  0, the real value is 4.  The real-mode code consists of the boot
+  sector (always one 512-byte sector) plus the setup code.
+
+Field name:	 root_flags
+Type:		 modify (optional)
+Offset/size:	 0x1f2/2
+Protocol:	 ALL
+
+  If this field is nonzero, the root defaults to readonly.  The use of
+  this field is deprecated; use the "ro" or "rw" options on the
+  command line instead.
+
+Field name:	syssize
+Type:		read
+Offset/size:	0x1f4/4 (protocol 2.04+) 0x1f4/2 (protocol ALL)
+Protocol:	2.04+
+
+  The size of the protected-mode code in units of 16-byte paragraphs.
+  For protocol versions older than 2.04 this field is only two bytes
+  wide, and therefore cannot be trusted for the size of a kernel if
+  the LOAD_HIGH flag is set.
+
+Field name:	ram_size
+Type:		kernel internal
+Offset/size:	0x1f8/2
+Protocol:	ALL
+
+  This field is obsolete.
+
+Field name:	vid_mode
+Type:		modify (obligatory)
+Offset/size:	0x1fa/2
+
+  Please see the section on SPECIAL COMMAND LINE OPTIONS.
+
+Field name:	root_dev
+Type:		modify (optional)
+Offset/size:	0x1fc/2
+Protocol:	ALL
+
+  The default root device device number.  The use of this field is
+  deprecated, use the "root=" option on the command line instead.
+
+Field name:	boot_flag
+Type:		read
+Offset/size:	0x1fe/2
+Protocol:	ALL
+
+  Contains 0xAA55.  This is the closest thing old Linux kernels have
+  to a magic number.
+
+Field name:	jump
+Type:		read
+Offset/size:	0x200/2
+Protocol:	2.00+
+
+  Contains an x86 jump instruction, 0xEB followed by a signed offset
+  relative to byte 0x202.  This can be used to determine the size of
+  the header.
+
+Field name:	header
+Type:		read
+Offset/size:	0x202/4
+Protocol:	2.00+
+
+  Contains the magic number "HdrS" (0x53726448).
+
+Field name:	version
+Type:		read
+Offset/size:	0x206/2
+Protocol:	2.00+
+
+  Contains the boot protocol version, in (major << 8)+minor format,
+  e.g. 0x0204 for version 2.04, and 0x0a11 for a hypothetical version
+  10.17.
+
+Field name:	realmode_swtch
+Type:		modify (optional)
+Offset/size:	0x208/4
+Protocol:	2.00+
+
+  Boot loader hook (see ADVANCED BOOT LOADER HOOKS below.)
+
+Field name:	start_sys_seg
+Type:		read
+Offset/size:	0x20c/2
+Protocol:	2.00+
+
+  The load low segment (0x1000).  Obsolete.
+
+Field name:	kernel_version
+Type:		read
+Offset/size:	0x20e/2
+Protocol:	2.00+
+
+  If set to a nonzero value, contains a pointer to a NUL-terminated
+  human-readable kernel version number string, less 0x200.  This can
+  be used to display the kernel version to the user.  This value
+  should be less than (0x200*setup_sects).
+
+  For example, if this value is set to 0x1c00, the kernel version
+  number string can be found at offset 0x1e00 in the kernel file.
+  This is a valid value if and only if the "setup_sects" field
+  contains the value 15 or higher, as:
+
+	0x1c00  < 15*0x200 (= 0x1e00) but
+	0x1c00 >= 14*0x200 (= 0x1c00)
+
+	0x1c00 >> 9 = 14, so the minimum value for setup_secs is 15.
+
+Field name:	type_of_loader
+Type:		write (obligatory)
+Offset/size:	0x210/1
+Protocol:	2.00+
+
+  If your boot loader has an assigned id (see table below), enter
+  0xTV here, where T is an identifier for the boot loader and V is
+  a version number.  Otherwise, enter 0xFF here.
+
+  For boot loader IDs above T = 0xD, write T = 0xE to this field and
+  write the extended ID minus 0x10 to the ext_loader_type field.
+  Similarly, the ext_loader_ver field can be used to provide more than
+  four bits for the bootloader version.
+
+  For example, for T = 0x15, V = 0x234, write:
+
+  type_of_loader  <- 0xE4
+  ext_loader_type <- 0x05
+  ext_loader_ver  <- 0x23
+
+  Assigned boot loader ids (hexadecimal):
+
+	0  LILO			(0x00 reserved for pre-2.00 bootloader)
+	1  Loadlin
+	2  bootsect-loader	(0x20, all other values reserved)
+	3  Syslinux
+	4  Etherboot/gPXE/iPXE
+	5  ELILO
+	7  GRUB
+	8  U-Boot
+	9  Xen
+	A  Gujin
+	B  Qemu
+	C  Arcturus Networks uCbootloader
+	D  kexec-tools
+	E  Extended		(see ext_loader_type)
+	F  Special		(0xFF = undefined)
+       10  Reserved
+       11  Minimal Linux Bootloader <http://sebastian-plotz.blogspot.de>
+       12  OVMF UEFI virtualization stack
+
+  Please contact <hpa@zytor.com> if you need a bootloader ID
+  value assigned.
+
+Field name:	loadflags
+Type:		modify (obligatory)
+Offset/size:	0x211/1
+Protocol:	2.00+
+
+  This field is a bitmask.
+
+  Bit 0 (read):	LOADED_HIGH
+	- If 0, the protected-mode code is loaded at 0x10000.
+	- If 1, the protected-mode code is loaded at 0x100000.
+
+  Bit 1 (kernel internal): ALSR_FLAG
+	- Used internally by the compressed kernel to communicate
+	  KASLR status to kernel proper.
+	  If 1, KASLR enabled.
+	  If 0, KASLR disabled.
+
+  Bit 5 (write): QUIET_FLAG
+	- If 0, print early messages.
+	- If 1, suppress early messages.
+		This requests to the kernel (decompressor and early
+		kernel) to not write early messages that require
+		accessing the display hardware directly.
+
+  Bit 6 (write): KEEP_SEGMENTS
+	Protocol: 2.07+
+	- If 0, reload the segment registers in the 32bit entry point.
+	- If 1, do not reload the segment registers in the 32bit entry point.
+		Assume that %cs %ds %ss %es are all set to flat segments with
+		a base of 0 (or the equivalent for their environment).
+
+  Bit 7 (write): CAN_USE_HEAP
+	Set this bit to 1 to indicate that the value entered in the
+	heap_end_ptr is valid.  If this field is clear, some setup code
+	functionality will be disabled.
+
+Field name:	setup_move_size
+Type:		modify (obligatory)
+Offset/size:	0x212/2
+Protocol:	2.00-2.01
+
+  When using protocol 2.00 or 2.01, if the real mode kernel is not
+  loaded at 0x90000, it gets moved there later in the loading
+  sequence.  Fill in this field if you want additional data (such as
+  the kernel command line) moved in addition to the real-mode kernel
+  itself.
+
+  The unit is bytes starting with the beginning of the boot sector.
+  
+  This field is can be ignored when the protocol is 2.02 or higher, or
+  if the real-mode code is loaded at 0x90000.
+
+Field name:	code32_start
+Type:		modify (optional, reloc)
+Offset/size:	0x214/4
+Protocol:	2.00+
+
+  The address to jump to in protected mode.  This defaults to the load
+  address of the kernel, and can be used by the boot loader to
+  determine the proper load address.
+
+  This field can be modified for two purposes:
+
+  1. as a boot loader hook (see ADVANCED BOOT LOADER HOOKS below.)
+
+  2. if a bootloader which does not install a hook loads a
+     relocatable kernel at a nonstandard address it will have to modify
+     this field to point to the load address.
+
+Field name:	ramdisk_image
+Type:		write (obligatory)
+Offset/size:	0x218/4
+Protocol:	2.00+
+
+  The 32-bit linear address of the initial ramdisk or ramfs.  Leave at
+  zero if there is no initial ramdisk/ramfs.
+
+Field name:	ramdisk_size
+Type:		write (obligatory)
+Offset/size:	0x21c/4
+Protocol:	2.00+
+
+  Size of the initial ramdisk or ramfs.  Leave at zero if there is no
+  initial ramdisk/ramfs.
+
+Field name:	bootsect_kludge
+Type:		kernel internal
+Offset/size:	0x220/4
+Protocol:	2.00+
+
+  This field is obsolete.
+
+Field name:	heap_end_ptr
+Type:		write (obligatory)
+Offset/size:	0x224/2
+Protocol:	2.01+
+
+  Set this field to the offset (from the beginning of the real-mode
+  code) of the end of the setup stack/heap, minus 0x0200.
+
+Field name:	ext_loader_ver
+Type:		write (optional)
+Offset/size:	0x226/1
+Protocol:	2.02+
+
+  This field is used as an extension of the version number in the
+  type_of_loader field.  The total version number is considered to be
+  (type_of_loader & 0x0f) + (ext_loader_ver << 4).
+
+  The use of this field is boot loader specific.  If not written, it
+  is zero.
+
+  Kernels prior to 2.6.31 did not recognize this field, but it is safe
+  to write for protocol version 2.02 or higher.
+
+Field name:	ext_loader_type
+Type:		write (obligatory if (type_of_loader & 0xf0) == 0xe0)
+Offset/size:	0x227/1
+Protocol:	2.02+
+
+  This field is used as an extension of the type number in
+  type_of_loader field.  If the type in type_of_loader is 0xE, then
+  the actual type is (ext_loader_type + 0x10).
+
+  This field is ignored if the type in type_of_loader is not 0xE.
+
+  Kernels prior to 2.6.31 did not recognize this field, but it is safe
+  to write for protocol version 2.02 or higher.
+
+Field name:	cmd_line_ptr
+Type:		write (obligatory)
+Offset/size:	0x228/4
+Protocol:	2.02+
+
+  Set this field to the linear address of the kernel command line.
+  The kernel command line can be located anywhere between the end of
+  the setup heap and 0xA0000; it does not have to be located in the
+  same 64K segment as the real-mode code itself.
+
+  Fill in this field even if your boot loader does not support a
+  command line, in which case you can point this to an empty string
+  (or better yet, to the string "auto".)  If this field is left at
+  zero, the kernel will assume that your boot loader does not support
+  the 2.02+ protocol.
+
+Field name:	initrd_addr_max
+Type:		read
+Offset/size:	0x22c/4
+Protocol:	2.03+
+
+  The maximum address that may be occupied by the initial
+  ramdisk/ramfs contents.  For boot protocols 2.02 or earlier, this
+  field is not present, and the maximum address is 0x37FFFFFF.  (This
+  address is defined as the address of the highest safe byte, so if
+  your ramdisk is exactly 131072 bytes long and this field is
+  0x37FFFFFF, you can start your ramdisk at 0x37FE0000.)
+
+Field name:	kernel_alignment
+Type:		read/modify (reloc)
+Offset/size:	0x230/4
+Protocol:	2.05+ (read), 2.10+ (modify)
+
+  Alignment unit required by the kernel (if relocatable_kernel is
+  true.)  A relocatable kernel that is loaded at an alignment
+  incompatible with the value in this field will be realigned during
+  kernel initialization.
+
+  Starting with protocol version 2.10, this reflects the kernel
+  alignment preferred for optimal performance; it is possible for the
+  loader to modify this field to permit a lesser alignment.  See the
+  min_alignment and pref_address field below.
+
+Field name:	relocatable_kernel
+Type:		read (reloc)
+Offset/size:	0x234/1
+Protocol:	2.05+
+
+  If this field is nonzero, the protected-mode part of the kernel can
+  be loaded at any address that satisfies the kernel_alignment field.
+  After loading, the boot loader must set the code32_start field to
+  point to the loaded code, or to a boot loader hook.
+
+Field name:	min_alignment
+Type:		read (reloc)
+Offset/size:	0x235/1
+Protocol:	2.10+
+
+  This field, if nonzero, indicates as a power of two the minimum
+  alignment required, as opposed to preferred, by the kernel to boot.
+  If a boot loader makes use of this field, it should update the
+  kernel_alignment field with the alignment unit desired; typically:
+
+	kernel_alignment = 1 << min_alignment
+
+  There may be a considerable performance cost with an excessively
+  misaligned kernel.  Therefore, a loader should typically try each
+  power-of-two alignment from kernel_alignment down to this alignment.
+
+Field name:     xloadflags
+Type:           read
+Offset/size:    0x236/2
+Protocol:       2.12+
+
+  This field is a bitmask.
+
+  Bit 0 (read):	XLF_KERNEL_64
+	- If 1, this kernel has the legacy 64-bit entry point at 0x200.
+
+  Bit 1 (read): XLF_CAN_BE_LOADED_ABOVE_4G
+        - If 1, kernel/boot_params/cmdline/ramdisk can be above 4G.
+
+  Bit 2 (read):	XLF_EFI_HANDOVER_32
+	- If 1, the kernel supports the 32-bit EFI handoff entry point
+          given at handover_offset.
+
+  Bit 3 (read): XLF_EFI_HANDOVER_64
+	- If 1, the kernel supports the 64-bit EFI handoff entry point
+          given at handover_offset + 0x200.
+
+  Bit 4 (read): XLF_EFI_KEXEC
+	- If 1, the kernel supports kexec EFI boot with EFI runtime support.
+
+Field name:	cmdline_size
+Type:		read
+Offset/size:	0x238/4
+Protocol:	2.06+
+
+  The maximum size of the command line without the terminating
+  zero. This means that the command line can contain at most
+  cmdline_size characters. With protocol version 2.05 and earlier, the
+  maximum size was 255.
+
+Field name:	hardware_subarch
+Type:		write (optional, defaults to x86/PC)
+Offset/size:	0x23c/4
+Protocol:	2.07+
+
+  In a paravirtualized environment the hardware low level architectural
+  pieces such as interrupt handling, page table handling, and
+  accessing process control registers needs to be done differently.
+
+  This field allows the bootloader to inform the kernel we are in one
+  one of those environments.
+
+  0x00000000	The default x86/PC environment
+  0x00000001	lguest
+  0x00000002	Xen
+  0x00000003	Moorestown MID
+  0x00000004	CE4100 TV Platform
+
+Field name:	hardware_subarch_data
+Type:		write (subarch-dependent)
+Offset/size:	0x240/8
+Protocol:	2.07+
+
+  A pointer to data that is specific to hardware subarch
+  This field is currently unused for the default x86/PC environment,
+  do not modify.
+
+Field name:	payload_offset
+Type:		read
+Offset/size:	0x248/4
+Protocol:	2.08+
+
+  If non-zero then this field contains the offset from the beginning
+  of the protected-mode code to the payload.
+
+  The payload may be compressed. The format of both the compressed and
+  uncompressed data should be determined using the standard magic
+  numbers.  The currently supported compression formats are gzip
+  (magic numbers 1F 8B or 1F 9E), bzip2 (magic number 42 5A), LZMA
+  (magic number 5D 00), XZ (magic number FD 37), and LZ4 (magic number
+  02 21).  The uncompressed payload is currently always ELF (magic
+  number 7F 45 4C 46).
+
+Field name:	payload_length
+Type:		read
+Offset/size:	0x24c/4
+Protocol:	2.08+
+
+  The length of the payload.
+
+Field name:	setup_data
+Type:		write (special)
+Offset/size:	0x250/8
+Protocol:	2.09+
+
+  The 64-bit physical pointer to NULL terminated single linked list of
+  struct setup_data. This is used to define a more extensible boot
+  parameters passing mechanism. The definition of struct setup_data is
+  as follow:
+
+  struct setup_data {
+	  u64 next;
+	  u32 type;
+	  u32 len;
+	  u8  data[0];
+  };
+
+  Where, the next is a 64-bit physical pointer to the next node of
+  linked list, the next field of the last node is 0; the type is used
+  to identify the contents of data; the len is the length of data
+  field; the data holds the real payload.
+
+  This list may be modified at a number of points during the bootup
+  process.  Therefore, when modifying this list one should always make
+  sure to consider the case where the linked list already contains
+  entries.
+
+Field name:	pref_address
+Type:		read (reloc)
+Offset/size:	0x258/8
+Protocol:	2.10+
+
+  This field, if nonzero, represents a preferred load address for the
+  kernel.  A relocating bootloader should attempt to load at this
+  address if possible.
+
+  A non-relocatable kernel will unconditionally move itself and to run
+  at this address.
+
+Field name:	init_size
+Type:		read
+Offset/size:	0x260/4
+
+  This field indicates the amount of linear contiguous memory starting
+  at the kernel runtime start address that the kernel needs before it
+  is capable of examining its memory map.  This is not the same thing
+  as the total amount of memory the kernel needs to boot, but it can
+  be used by a relocating boot loader to help select a safe load
+  address for the kernel.
+
+  The kernel runtime start address is determined by the following algorithm:
+
+  if (relocatable_kernel)
+	runtime_start = align_up(load_address, kernel_alignment)
+  else
+	runtime_start = pref_address
+
+Field name:	handover_offset
+Type:		read
+Offset/size:	0x264/4
+
+  This field is the offset from the beginning of the kernel image to
+  the EFI handover protocol entry point. Boot loaders using the EFI
+  handover protocol to boot the kernel should jump to this offset.
+
+  See EFI HANDOVER PROTOCOL below for more details.
+
+
+**** THE IMAGE CHECKSUM
+
+From boot protocol version 2.08 onwards the CRC-32 is calculated over
+the entire file using the characteristic polynomial 0x04C11DB7 and an
+initial remainder of 0xffffffff.  The checksum is appended to the
+file; therefore the CRC of the file up to the limit specified in the
+syssize field of the header is always 0.
+
+
+**** THE KERNEL COMMAND LINE
+
+The kernel command line has become an important way for the boot
+loader to communicate with the kernel.  Some of its options are also
+relevant to the boot loader itself, see "special command line options"
+below.
+
+The kernel command line is a null-terminated string. The maximum
+length can be retrieved from the field cmdline_size.  Before protocol
+version 2.06, the maximum was 255 characters.  A string that is too
+long will be automatically truncated by the kernel.
+
+If the boot protocol version is 2.02 or later, the address of the
+kernel command line is given by the header field cmd_line_ptr (see
+above.)  This address can be anywhere between the end of the setup
+heap and 0xA0000.
+
+If the protocol version is *not* 2.02 or higher, the kernel
+command line is entered using the following protocol:
+
+	At offset 0x0020 (word), "cmd_line_magic", enter the magic
+	number 0xA33F.
+
+	At offset 0x0022 (word), "cmd_line_offset", enter the offset
+	of the kernel command line (relative to the start of the
+	real-mode kernel).
+	
+	The kernel command line *must* be within the memory region
+	covered by setup_move_size, so you may need to adjust this
+	field.
+
+
+**** MEMORY LAYOUT OF THE REAL-MODE CODE
+
+The real-mode code requires a stack/heap to be set up, as well as
+memory allocated for the kernel command line.  This needs to be done
+in the real-mode accessible memory in bottom megabyte.
+
+It should be noted that modern machines often have a sizable Extended
+BIOS Data Area (EBDA).  As a result, it is advisable to use as little
+of the low megabyte as possible.
+
+Unfortunately, under the following circumstances the 0x90000 memory
+segment has to be used:
+
+	- When loading a zImage kernel ((loadflags & 0x01) == 0).
+	- When loading a 2.01 or earlier boot protocol kernel.
+
+	  -> For the 2.00 and 2.01 boot protocols, the real-mode code
+	     can be loaded at another address, but it is internally
+	     relocated to 0x90000.  For the "old" protocol, the
+	     real-mode code must be loaded at 0x90000.
+
+When loading at 0x90000, avoid using memory above 0x9a000.
+
+For boot protocol 2.02 or higher, the command line does not have to be
+located in the same 64K segment as the real-mode setup code; it is
+thus permitted to give the stack/heap the full 64K segment and locate
+the command line above it.
+
+The kernel command line should not be located below the real-mode
+code, nor should it be located in high memory.
+
+
+**** SAMPLE BOOT CONFIGURATION
+
+As a sample configuration, assume the following layout of the real
+mode segment:
+
+    When loading below 0x90000, use the entire segment:
+
+	0x0000-0x7fff	Real mode kernel
+	0x8000-0xdfff	Stack and heap
+	0xe000-0xffff	Kernel command line
+
+    When loading at 0x90000 OR the protocol version is 2.01 or earlier:
+
+	0x0000-0x7fff	Real mode kernel
+	0x8000-0x97ff	Stack and heap
+	0x9800-0x9fff	Kernel command line
+
+Such a boot loader should enter the following fields in the header:
+
+	unsigned long base_ptr;	/* base address for real-mode segment */
+
+	if ( setup_sects == 0 ) {
+		setup_sects = 4;
+	}
+
+	if ( protocol >= 0x0200 ) {
+		type_of_loader = <type code>;
+		if ( loading_initrd ) {
+			ramdisk_image = <initrd_address>;
+			ramdisk_size = <initrd_size>;
+		}
+
+		if ( protocol >= 0x0202 && loadflags & 0x01 )
+			heap_end = 0xe000;
+		else
+			heap_end = 0x9800;
+
+		if ( protocol >= 0x0201 ) {
+			heap_end_ptr = heap_end - 0x200;
+			loadflags |= 0x80; /* CAN_USE_HEAP */
+		}
+
+		if ( protocol >= 0x0202 ) {
+			cmd_line_ptr = base_ptr + heap_end;
+			strcpy(cmd_line_ptr, cmdline);
+		} else {
+			cmd_line_magic	= 0xA33F;
+			cmd_line_offset = heap_end;
+			setup_move_size = heap_end + strlen(cmdline)+1;
+			strcpy(base_ptr+cmd_line_offset, cmdline);
+		}
+	} else {
+		/* Very old kernel */
+
+		heap_end = 0x9800;
+
+		cmd_line_magic	= 0xA33F;
+		cmd_line_offset = heap_end;
+
+		/* A very old kernel MUST have its real-mode code
+		   loaded at 0x90000 */
+
+		if ( base_ptr != 0x90000 ) {
+			/* Copy the real-mode kernel */
+			memcpy(0x90000, base_ptr, (setup_sects+1)*512);
+			base_ptr = 0x90000;		 /* Relocated */
+		}
+
+		strcpy(0x90000+cmd_line_offset, cmdline);
+
+		/* It is recommended to clear memory up to the 32K mark */
+		memset(0x90000 + (setup_sects+1)*512, 0,
+		       (64-(setup_sects+1))*512);
+	}
+
+
+**** LOADING THE REST OF THE KERNEL
+
+The 32-bit (non-real-mode) kernel starts at offset (setup_sects+1)*512
+in the kernel file (again, if setup_sects == 0 the real value is 4.)
+It should be loaded at address 0x10000 for Image/zImage kernels and
+0x100000 for bzImage kernels.
+
+The kernel is a bzImage kernel if the protocol >= 2.00 and the 0x01
+bit (LOAD_HIGH) in the loadflags field is set:
+
+	is_bzImage = (protocol >= 0x0200) && (loadflags & 0x01);
+	load_address = is_bzImage ? 0x100000 : 0x10000;
+
+Note that Image/zImage kernels can be up to 512K in size, and thus use
+the entire 0x10000-0x90000 range of memory.  This means it is pretty
+much a requirement for these kernels to load the real-mode part at
+0x90000.  bzImage kernels allow much more flexibility.
+
+
+**** SPECIAL COMMAND LINE OPTIONS
+
+If the command line provided by the boot loader is entered by the
+user, the user may expect the following command line options to work.
+They should normally not be deleted from the kernel command line even
+though not all of them are actually meaningful to the kernel.  Boot
+loader authors who need additional command line options for the boot
+loader itself should get them registered in
+Documentation/kernel-parameters.txt to make sure they will not
+conflict with actual kernel options now or in the future.
+
+  vga=<mode>
+	<mode> here is either an integer (in C notation, either
+	decimal, octal, or hexadecimal) or one of the strings
+	"normal" (meaning 0xFFFF), "ext" (meaning 0xFFFE) or "ask"
+	(meaning 0xFFFD).  This value should be entered into the
+	vid_mode field, as it is used by the kernel before the command
+	line is parsed.
+
+  mem=<size>
+	<size> is an integer in C notation optionally followed by
+	(case insensitive) K, M, G, T, P or E (meaning << 10, << 20,
+	<< 30, << 40, << 50 or << 60).  This specifies the end of
+	memory to the kernel. This affects the possible placement of
+	an initrd, since an initrd should be placed near end of
+	memory.  Note that this is an option to *both* the kernel and
+	the bootloader!
+
+  initrd=<file>
+	An initrd should be loaded.  The meaning of <file> is
+	obviously bootloader-dependent, and some boot loaders
+	(e.g. LILO) do not have such a command.
+
+In addition, some boot loaders add the following options to the
+user-specified command line:
+
+  BOOT_IMAGE=<file>
+	The boot image which was loaded.  Again, the meaning of <file>
+	is obviously bootloader-dependent.
+
+  auto
+	The kernel was booted without explicit user intervention.
+
+If these options are added by the boot loader, it is highly
+recommended that they are located *first*, before the user-specified
+or configuration-specified command line.  Otherwise, "init=/bin/sh"
+gets confused by the "auto" option.
+
+
+**** RUNNING THE KERNEL
+
+The kernel is started by jumping to the kernel entry point, which is
+located at *segment* offset 0x20 from the start of the real mode
+kernel.  This means that if you loaded your real-mode kernel code at
+0x90000, the kernel entry point is 9020:0000.
+
+At entry, ds = es = ss should point to the start of the real-mode
+kernel code (0x9000 if the code is loaded at 0x90000), sp should be
+set up properly, normally pointing to the top of the heap, and
+interrupts should be disabled.  Furthermore, to guard against bugs in
+the kernel, it is recommended that the boot loader sets fs = gs = ds =
+es = ss.
+
+In our example from above, we would do:
+
+	/* Note: in the case of the "old" kernel protocol, base_ptr must
+	   be == 0x90000 at this point; see the previous sample code */
+
+	seg = base_ptr >> 4;
+
+	cli();	/* Enter with interrupts disabled! */
+
+	/* Set up the real-mode kernel stack */
+	_SS = seg;
+	_SP = heap_end;
+
+	_DS = _ES = _FS = _GS = seg;
+	jmp_far(seg+0x20, 0);	/* Run the kernel */
+
+If your boot sector accesses a floppy drive, it is recommended to
+switch off the floppy motor before running the kernel, since the
+kernel boot leaves interrupts off and thus the motor will not be
+switched off, especially if the loaded kernel has the floppy driver as
+a demand-loaded module!
+
+
+**** ADVANCED BOOT LOADER HOOKS
+
+If the boot loader runs in a particularly hostile environment (such as
+LOADLIN, which runs under DOS) it may be impossible to follow the
+standard memory location requirements.  Such a boot loader may use the
+following hooks that, if set, are invoked by the kernel at the
+appropriate time.  The use of these hooks should probably be
+considered an absolutely last resort!
+
+IMPORTANT: All the hooks are required to preserve %esp, %ebp, %esi and
+%edi across invocation.
+
+  realmode_swtch:
+	A 16-bit real mode far subroutine invoked immediately before
+	entering protected mode.  The default routine disables NMI, so
+	your routine should probably do so, too.
+
+  code32_start:
+	A 32-bit flat-mode routine *jumped* to immediately after the
+	transition to protected mode, but before the kernel is
+	uncompressed.  No segments, except CS, are guaranteed to be
+	set up (current kernels do, but older ones do not); you should
+	set them up to BOOT_DS (0x18) yourself.
+
+	After completing your hook, you should jump to the address
+	that was in this field before your boot loader overwrote it
+	(relocated, if appropriate.)
+
+
+**** 32-bit BOOT PROTOCOL
+
+For machine with some new BIOS other than legacy BIOS, such as EFI,
+LinuxBIOS, etc, and kexec, the 16-bit real mode setup code in kernel
+based on legacy BIOS can not be used, so a 32-bit boot protocol needs
+to be defined.
+
+In 32-bit boot protocol, the first step in loading a Linux kernel
+should be to setup the boot parameters (struct boot_params,
+traditionally known as "zero page"). The memory for struct boot_params
+should be allocated and initialized to all zero. Then the setup header
+from offset 0x01f1 of kernel image on should be loaded into struct
+boot_params and examined. The end of setup header can be calculated as
+follow:
+
+	0x0202 + byte value at offset 0x0201
+
+In addition to read/modify/write the setup header of the struct
+boot_params as that of 16-bit boot protocol, the boot loader should
+also fill the additional fields of the struct boot_params as that
+described in zero-page.txt.
+
+After setting up the struct boot_params, the boot loader can load the
+32/64-bit kernel in the same way as that of 16-bit boot protocol.
+
+In 32-bit boot protocol, the kernel is started by jumping to the
+32-bit kernel entry point, which is the start address of loaded
+32/64-bit kernel.
+
+At entry, the CPU must be in 32-bit protected mode with paging
+disabled; a GDT must be loaded with the descriptors for selectors
+__BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat
+segment; __BOOT_CS must have execute/read permission, and __BOOT_DS
+must have read/write permission; CS must be __BOOT_CS and DS, ES, SS
+must be __BOOT_DS; interrupt must be disabled; %esi must hold the base
+address of the struct boot_params; %ebp, %edi and %ebx must be zero.
+
+**** 64-bit BOOT PROTOCOL
+
+For machine with 64bit cpus and 64bit kernel, we could use 64bit bootloader
+and we need a 64-bit boot protocol.
+
+In 64-bit boot protocol, the first step in loading a Linux kernel
+should be to setup the boot parameters (struct boot_params,
+traditionally known as "zero page"). The memory for struct boot_params
+could be allocated anywhere (even above 4G) and initialized to all zero.
+Then, the setup header at offset 0x01f1 of kernel image on should be
+loaded into struct boot_params and examined. The end of setup header
+can be calculated as follows:
+
+	0x0202 + byte value at offset 0x0201
+
+In addition to read/modify/write the setup header of the struct
+boot_params as that of 16-bit boot protocol, the boot loader should
+also fill the additional fields of the struct boot_params as described
+in zero-page.txt.
+
+After setting up the struct boot_params, the boot loader can load
+64-bit kernel in the same way as that of 16-bit boot protocol, but
+kernel could be loaded above 4G.
+
+In 64-bit boot protocol, the kernel is started by jumping to the
+64-bit kernel entry point, which is the start address of loaded
+64-bit kernel plus 0x200.
+
+At entry, the CPU must be in 64-bit mode with paging enabled.
+The range with setup_header.init_size from start address of loaded
+kernel and zero page and command line buffer get ident mapping;
+a GDT must be loaded with the descriptors for selectors
+__BOOT_CS(0x10) and __BOOT_DS(0x18); both descriptors must be 4G flat
+segment; __BOOT_CS must have execute/read permission, and __BOOT_DS
+must have read/write permission; CS must be __BOOT_CS and DS, ES, SS
+must be __BOOT_DS; interrupt must be disabled; %rsi must hold the base
+address of the struct boot_params.
+
+**** EFI HANDOVER PROTOCOL
+
+This protocol allows boot loaders to defer initialisation to the EFI
+boot stub. The boot loader is required to load the kernel/initrd(s)
+from the boot media and jump to the EFI handover protocol entry point
+which is hdr->handover_offset bytes from the beginning of
+startup_{32,64}.
+
+The function prototype for the handover entry point looks like this,
+
+    efi_main(void *handle, efi_system_table_t *table, struct boot_params *bp)
+
+'handle' is the EFI image handle passed to the boot loader by the EFI
+firmware, 'table' is the EFI system table - these are the first two
+arguments of the "handoff state" as described in section 2.3 of the
+UEFI specification. 'bp' is the boot loader-allocated boot params.
+
+The boot loader *must* fill out the following fields in bp,
+
+    o hdr.code32_start
+    o hdr.cmd_line_ptr
+    o hdr.cmdline_size
+    o hdr.ramdisk_image (if applicable)
+    o hdr.ramdisk_size  (if applicable)
+
+All other fields should be zero.
diff --git a/Documentation/x86/early-microcode.txt b/Documentation/x86/early-microcode.txt
new file mode 100644
index 000000000..41465f971
--- /dev/null
+++ b/Documentation/x86/early-microcode.txt
@@ -0,0 +1,42 @@
+Early load microcode
+====================
+By Fenghua Yu <fenghua.yu@intel.com>
+
+Kernel can update microcode in early phase of boot time. Loading microcode early
+can fix CPU issues before they are observed during kernel boot time.
+
+Microcode is stored in an initrd file. The microcode is read from the initrd
+file and loaded to CPUs during boot time.
+
+The format of the combined initrd image is microcode in cpio format followed by
+the initrd image (maybe compressed). Kernel parses the combined initrd image
+during boot time. The microcode file in cpio name space is:
+on Intel: /*(DEBLOBBED)*/
+on AMD  : /*(DEBLOBBED)*/
+
+During BSP boot (before SMP starts), if the kernel finds the microcode file in
+the initrd file, it parses the microcode and saves matching microcode in memory.
+If matching microcode is found, it will be uploaded in BSP and later on in all
+APs.
+
+The cached microcode patch is applied when CPUs resume from a sleep state.
+
+There are two legacy user space interfaces to load microcode, either through
+/dev/cpu/microcode or through /sys/devices/system/cpu/microcode/reload file
+in sysfs.
+
+In addition to these two legacy methods, the early loading method described
+here is the third method with which microcode can be uploaded to a system's
+CPUs.
+
+The following example script shows how to generate a new combined initrd file in
+/boot/initrd-3.5.0.ucode.img with original microcode microcode.bin and
+original initrd image /boot/initrd-3.5.0.img.
+
+mkdir initrd
+cd initrd
+mkdir -p kernel/x86/microcode
+cp ../microcode.bin /*(DEBLOBBED)*/ (or /*(DEBLOBBED)*/)
+find . | cpio -o -H newc >../ucode.cpio
+cd ..
+cat ucode.cpio /boot/initrd-3.5.0.img >/boot/initrd-3.5.0.ucode.img
diff --git a/Documentation/x86/earlyprintk.txt b/Documentation/x86/earlyprintk.txt
new file mode 100644
index 000000000..688e3eeed
--- /dev/null
+++ b/Documentation/x86/earlyprintk.txt
@@ -0,0 +1,136 @@
+
+Mini-HOWTO for using the earlyprintk=dbgp boot option with a
+USB2 Debug port key and a debug cable, on x86 systems.
+
+You need two computers, the 'USB debug key' special gadget and
+and two USB cables, connected like this:
+
+  [host/target] <-------> [USB debug key] <-------> [client/console]
+
+1. There are a number of specific hardware requirements:
+
+ a.) Host/target system needs to have USB debug port capability.
+
+ You can check this capability by looking at a 'Debug port' bit in
+ the lspci -vvv output:
+
+ # lspci -vvv
+ ...
+ 00:1d.7 USB Controller: Intel Corporation 82801H (ICH8 Family) USB2 EHCI Controller #1 (rev 03) (prog-if 20 [EHCI])
+         Subsystem: Lenovo ThinkPad T61
+         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
+         Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
+         Latency: 0
+         Interrupt: pin D routed to IRQ 19
+         Region 0: Memory at fe227000 (32-bit, non-prefetchable) [size=1K]
+         Capabilities: [50] Power Management version 2
+                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
+                 Status: D0 PME-Enable- DSel=0 DScale=0 PME+
+         Capabilities: [58] Debug port: BAR=1 offset=00a0
+                            ^^^^^^^^^^^ <==================== [ HERE ]
+	 Kernel driver in use: ehci_hcd
+         Kernel modules: ehci-hcd
+ ...
+
+( If your system does not list a debug port capability then you probably
+  won't be able to use the USB debug key. )
+
+ b.) You also need a Netchip USB debug cable/key:
+
+        http://www.plxtech.com/products/NET2000/NET20DC/default.asp
+
+     This is a small blue plastic connector with two USB connections,
+     it draws power from its USB connections.
+
+ c.) You need a second client/console system with a high speed USB 2.0
+     port.
+
+ d.) The Netchip device must be plugged directly into the physical
+     debug port on the "host/target" system.  You cannot use a USB hub in
+     between the physical debug port and the "host/target" system.
+
+     The EHCI debug controller is bound to a specific physical USB
+     port and the Netchip device will only work as an early printk
+     device in this port.  The EHCI host controllers are electrically
+     wired such that the EHCI debug controller is hooked up to the
+     first physical and there is no way to change this via software.
+     You can find the physical port through experimentation by trying
+     each physical port on the system and rebooting.  Or you can try
+     and use lsusb or look at the kernel info messages emitted by the
+     usb stack when you plug a usb device into various ports on the
+     "host/target" system.
+
+     Some hardware vendors do not expose the usb debug port with a
+     physical connector and if you find such a device send a complaint
+     to the hardware vendor, because there is no reason not to wire
+     this port into one of the physically accessible ports.
+
+ e.) It is also important to note, that many versions of the Netchip
+     device require the "client/console" system to be plugged into the
+     right and side of the device (with the product logo facing up and
+     readable left to right).  The reason being is that the 5 volt
+     power supply is taken from only one side of the device and it
+     must be the side that does not get rebooted.
+
+2. Software requirements:
+
+ a.) On the host/target system:
+
+    You need to enable the following kernel config option:
+
+      CONFIG_EARLY_PRINTK_DBGP=y
+
+    And you need to add the boot command line: "earlyprintk=dbgp".
+    (If you are using Grub, append it to the 'kernel' line in
+     /etc/grub.conf)
+
+    On systems with more than one EHCI debug controller you must
+    specify the correct EHCI debug controller number.  The ordering
+    comes from the PCI bus enumeration of the EHCI controllers.  The
+    default with no number argument is "0" the first EHCI debug
+    controller.  To use the second EHCI debug controller, you would
+    use the command line: "earlyprintk=dbgp1"
+
+    NOTE: normally earlyprintk console gets turned off once the
+    regular console is alive - use "earlyprintk=dbgp,keep" to keep
+    this channel open beyond early bootup. This can be useful for
+    debugging crashes under Xorg, etc.
+
+ b.) On the client/console system:
+
+    You should enable the following kernel config option:
+
+      CONFIG_USB_SERIAL_DEBUG=y
+
+    On the next bootup with the modified kernel you should
+    get a /dev/ttyUSBx device(s).
+
+    Now this channel of kernel messages is ready to be used: start
+    your favorite terminal emulator (minicom, etc.) and set
+    it up to use /dev/ttyUSB0 - or use a raw 'cat /dev/ttyUSBx' to
+    see the raw output.
+
+ c.) On Nvidia Southbridge based systems: the kernel will try to probe
+     and find out which port has debug device connected.
+
+3. Testing that it works fine:
+
+   You can test the output by using earlyprintk=dbgp,keep and provoking
+   kernel messages on the host/target system. You can provoke a harmless
+   kernel message by for example doing:
+
+     echo h > /proc/sysrq-trigger
+
+   On the host/target system you should see this help line in "dmesg" output:
+
+     SysRq : HELP : loglevel(0-9) reBoot Crashdump terminate-all-tasks(E) memory-full-oom-kill(F) kill-all-tasks(I) saK show-backtrace-all-active-cpus(L) show-memory-usage(M) nice-all-RT-tasks(N) powerOff show-registers(P) show-all-timers(Q) unRaw Sync show-task-states(T) Unmount show-blocked-tasks(W) dump-ftrace-buffer(Z)
+
+   On the client/console system do:
+
+       cat /dev/ttyUSB0
+
+   And you should see the help line above displayed shortly after you've
+   provoked it on the host system.
+
+If it does not work then please ask about it on the linux-kernel@vger.kernel.org
+mailing list or contact the x86 maintainers.
diff --git a/Documentation/x86/entry_64.txt b/Documentation/x86/entry_64.txt
new file mode 100644
index 000000000..9132b8617
--- /dev/null
+++ b/Documentation/x86/entry_64.txt
@@ -0,0 +1,104 @@
+This file documents some of the kernel entries in
+arch/x86/kernel/entry_64.S.  A lot of this explanation is adapted from
+an email from Ingo Molnar:
+
+http://lkml.kernel.org/r/<20110529191055.GC9835%40elte.hu>
+
+The x86 architecture has quite a few different ways to jump into
+kernel code.  Most of these entry points are registered in
+arch/x86/kernel/traps.c and implemented in arch/x86/kernel/entry_64.S
+for 64-bit, arch/x86/kernel/entry_32.S for 32-bit and finally
+arch/x86/ia32/ia32entry.S which implements the 32-bit compatibility
+syscall entry points and thus provides for 32-bit processes the
+ability to execute syscalls when running on 64-bit kernels.
+
+The IDT vector assignments are listed in arch/x86/include/asm/irq_vectors.h.
+
+Some of these entries are:
+
+ - system_call: syscall instruction from 64-bit code.
+
+ - ia32_syscall: int 0x80 from 32-bit or 64-bit code; compat syscall
+   either way.
+
+ - ia32_syscall, ia32_sysenter: syscall and sysenter from 32-bit
+   code
+
+ - interrupt: An array of entries.  Every IDT vector that doesn't
+   explicitly point somewhere else gets set to the corresponding
+   value in interrupts.  These point to a whole array of
+   magically-generated functions that make their way to do_IRQ with
+   the interrupt number as a parameter.
+
+ - APIC interrupts: Various special-purpose interrupts for things
+   like TLB shootdown.
+
+ - Architecturally-defined exceptions like divide_error.
+
+There are a few complexities here.  The different x86-64 entries
+have different calling conventions.  The syscall and sysenter
+instructions have their own peculiar calling conventions.  Some of
+the IDT entries push an error code onto the stack; others don't.
+IDT entries using the IST alternative stack mechanism need their own
+magic to get the stack frames right.  (You can find some
+documentation in the AMD APM, Volume 2, Chapter 8 and the Intel SDM,
+Volume 3, Chapter 6.)
+
+Dealing with the swapgs instruction is especially tricky.  Swapgs
+toggles whether gs is the kernel gs or the user gs.  The swapgs
+instruction is rather fragile: it must nest perfectly and only in
+single depth, it should only be used if entering from user mode to
+kernel mode and then when returning to user-space, and precisely
+so. If we mess that up even slightly, we crash.
+
+So when we have a secondary entry, already in kernel mode, we *must
+not* use SWAPGS blindly - nor must we forget doing a SWAPGS when it's
+not switched/swapped yet.
+
+Now, there's a secondary complication: there's a cheap way to test
+which mode the CPU is in and an expensive way.
+
+The cheap way is to pick this info off the entry frame on the kernel
+stack, from the CS of the ptregs area of the kernel stack:
+
+	xorl %ebx,%ebx
+	testl $3,CS+8(%rsp)
+	je error_kernelspace
+	SWAPGS
+
+The expensive (paranoid) way is to read back the MSR_GS_BASE value
+(which is what SWAPGS modifies):
+
+	movl $1,%ebx
+	movl $MSR_GS_BASE,%ecx
+	rdmsr
+	testl %edx,%edx
+	js 1f   /* negative -> in kernel */
+	SWAPGS
+	xorl %ebx,%ebx
+1:	ret
+
+If we are at an interrupt or user-trap/gate-alike boundary then we can
+use the faster check: the stack will be a reliable indicator of
+whether SWAPGS was already done: if we see that we are a secondary
+entry interrupting kernel mode execution, then we know that the GS
+base has already been switched. If it says that we interrupted
+user-space execution then we must do the SWAPGS.
+
+But if we are in an NMI/MCE/DEBUG/whatever super-atomic entry context,
+which might have triggered right after a normal entry wrote CS to the
+stack but before we executed SWAPGS, then the only safe way to check
+for GS is the slower method: the RDMSR.
+
+Therefore, super-atomic entries (except NMI, which is handled separately)
+must use idtentry with paranoid=1 to handle gsbase correctly.  This
+triggers three main behavior changes:
+
+ - Interrupt entry will use the slower gsbase check.
+ - Interrupt entry from user mode will switch off the IST stack.
+ - Interrupt exit to kernel mode will not attempt to reschedule.
+
+We try to only use IST entries and the paranoid entry code for vectors
+that absolutely need the more expensive check for the GS base - and we
+generate all 'normal' entry points with the regular (faster) paranoid=0
+variant.
diff --git a/Documentation/x86/exception-tables.txt b/Documentation/x86/exception-tables.txt
new file mode 100644
index 000000000..32901aa36
--- /dev/null
+++ b/Documentation/x86/exception-tables.txt
@@ -0,0 +1,292 @@
+     Kernel level exception handling in Linux
+  Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com>
+
+When a process runs in kernel mode, it often has to access user
+mode memory whose address has been passed by an untrusted program.
+To protect itself the kernel has to verify this address.
+
+In older versions of Linux this was done with the
+int verify_area(int type, const void * addr, unsigned long size)
+function (which has since been replaced by access_ok()).
+
+This function verified that the memory area starting at address
+'addr' and of size 'size' was accessible for the operation specified
+in type (read or write). To do this, verify_read had to look up the
+virtual memory area (vma) that contained the address addr. In the
+normal case (correctly working program), this test was successful.
+It only failed for a few buggy programs. In some kernel profiling
+tests, this normally unneeded verification used up a considerable
+amount of time.
+
+To overcome this situation, Linus decided to let the virtual memory
+hardware present in every Linux-capable CPU handle this test.
+
+How does this work?
+
+Whenever the kernel tries to access an address that is currently not
+accessible, the CPU generates a page fault exception and calls the
+page fault handler
+
+void do_page_fault(struct pt_regs *regs, unsigned long error_code)
+
+in arch/x86/mm/fault.c. The parameters on the stack are set up by
+the low level assembly glue in arch/x86/kernel/entry_32.S. The parameter
+regs is a pointer to the saved registers on the stack, error_code
+contains a reason code for the exception.
+
+do_page_fault first obtains the unaccessible address from the CPU
+control register CR2. If the address is within the virtual address
+space of the process, the fault probably occurred, because the page
+was not swapped in, write protected or something similar. However,
+we are interested in the other case: the address is not valid, there
+is no vma that contains this address. In this case, the kernel jumps
+to the bad_area label.
+
+There it uses the address of the instruction that caused the exception
+(i.e. regs->eip) to find an address where the execution can continue
+(fixup). If this search is successful, the fault handler modifies the
+return address (again regs->eip) and returns. The execution will
+continue at the address in fixup.
+
+Where does fixup point to?
+
+Since we jump to the contents of fixup, fixup obviously points
+to executable code. This code is hidden inside the user access macros.
+I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h
+as an example. The definition is somewhat hard to follow, so let's peek at
+the code generated by the preprocessor and the compiler. I selected
+the get_user call in drivers/char/sysrq.c for a detailed examination.
+
+The original code in sysrq.c line 587:
+        get_user(c, buf);
+
+The preprocessor output (edited to become somewhat readable):
+
+(
+  {
+    long __gu_err = - 14 , __gu_val = 0;
+    const __typeof__(*( (  buf ) )) *__gu_addr = ((buf));
+    if (((((0 + current_set[0])->tss.segment) == 0x18 )  ||
+       (((sizeof(*(buf))) <= 0xC0000000UL) &&
+       ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf)))))))
+      do {
+        __gu_err  = 0;
+        switch ((sizeof(*(buf)))) {
+          case 1:
+            __asm__ __volatile__(
+              "1:      mov" "b" " %2,%" "b" "1\n"
+              "2:\n"
+              ".section .fixup,\"ax\"\n"
+              "3:      movl %3,%0\n"
+              "        xor" "b" " %" "b" "1,%" "b" "1\n"
+              "        jmp 2b\n"
+              ".section __ex_table,\"a\"\n"
+              "        .align 4\n"
+              "        .long 1b,3b\n"
+              ".text"        : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *)
+                            (   __gu_addr   )) ), "i"(- 14 ), "0"(  __gu_err  )) ;
+              break;
+          case 2:
+            __asm__ __volatile__(
+              "1:      mov" "w" " %2,%" "w" "1\n"
+              "2:\n"
+              ".section .fixup,\"ax\"\n"
+              "3:      movl %3,%0\n"
+              "        xor" "w" " %" "w" "1,%" "w" "1\n"
+              "        jmp 2b\n"
+              ".section __ex_table,\"a\"\n"
+              "        .align 4\n"
+              "        .long 1b,3b\n"
+              ".text"        : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
+                            (   __gu_addr   )) ), "i"(- 14 ), "0"(  __gu_err  ));
+              break;
+          case 4:
+            __asm__ __volatile__(
+              "1:      mov" "l" " %2,%" "" "1\n"
+              "2:\n"
+              ".section .fixup,\"ax\"\n"
+              "3:      movl %3,%0\n"
+              "        xor" "l" " %" "" "1,%" "" "1\n"
+              "        jmp 2b\n"
+              ".section __ex_table,\"a\"\n"
+              "        .align 4\n"        "        .long 1b,3b\n"
+              ".text"        : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
+                            (   __gu_addr   )) ), "i"(- 14 ), "0"(__gu_err));
+              break;
+          default:
+            (__gu_val) = __get_user_bad();
+        }
+      } while (0) ;
+    ((c)) = (__typeof__(*((buf))))__gu_val;
+    __gu_err;
+  }
+);
+
+WOW! Black GCC/assembly magic. This is impossible to follow, so let's
+see what code gcc generates:
+
+ >         xorl %edx,%edx
+ >         movl current_set,%eax
+ >         cmpl $24,788(%eax)
+ >         je .L1424
+ >         cmpl $-1073741825,64(%esp)
+ >         ja .L1423
+ > .L1424:
+ >         movl %edx,%eax
+ >         movl 64(%esp),%ebx
+ > #APP
+ > 1:      movb (%ebx),%dl                /* this is the actual user access */
+ > 2:
+ > .section .fixup,"ax"
+ > 3:      movl $-14,%eax
+ >         xorb %dl,%dl
+ >         jmp 2b
+ > .section __ex_table,"a"
+ >         .align 4
+ >         .long 1b,3b
+ > .text
+ > #NO_APP
+ > .L1423:
+ >         movzbl %dl,%esi
+
+The optimizer does a good job and gives us something we can actually
+understand. Can we? The actual user access is quite obvious. Thanks
+to the unified address space we can just access the address in user
+memory. But what does the .section stuff do?????
+
+To understand this we have to look at the final kernel:
+
+ > objdump --section-headers vmlinux
+ >
+ > vmlinux:     file format elf32-i386
+ >
+ > Sections:
+ > Idx Name          Size      VMA       LMA       File off  Algn
+ >   0 .text         00098f40  c0100000  c0100000  00001000  2**4
+ >                   CONTENTS, ALLOC, LOAD, READONLY, CODE
+ >   1 .fixup        000016bc  c0198f40  c0198f40  00099f40  2**0
+ >                   CONTENTS, ALLOC, LOAD, READONLY, CODE
+ >   2 .rodata       0000f127  c019a5fc  c019a5fc  0009b5fc  2**2
+ >                   CONTENTS, ALLOC, LOAD, READONLY, DATA
+ >   3 __ex_table    000015c0  c01a9724  c01a9724  000aa724  2**2
+ >                   CONTENTS, ALLOC, LOAD, READONLY, DATA
+ >   4 .data         0000ea58  c01abcf0  c01abcf0  000abcf0  2**4
+ >                   CONTENTS, ALLOC, LOAD, DATA
+ >   5 .bss          00018e21  c01ba748  c01ba748  000ba748  2**2
+ >                   ALLOC
+ >   6 .comment      00000ec4  00000000  00000000  000ba748  2**0
+ >                   CONTENTS, READONLY
+ >   7 .note         00001068  00000ec4  00000ec4  000bb60c  2**0
+ >                   CONTENTS, READONLY
+
+There are obviously 2 non standard ELF sections in the generated object
+file. But first we want to find out what happened to our code in the
+final kernel executable:
+
+ > objdump --disassemble --section=.text vmlinux
+ >
+ > c017e785 <do_con_write+c1> xorl   %edx,%edx
+ > c017e787 <do_con_write+c3> movl   0xc01c7bec,%eax
+ > c017e78c <do_con_write+c8> cmpl   $0x18,0x314(%eax)
+ > c017e793 <do_con_write+cf> je     c017e79f <do_con_write+db>
+ > c017e795 <do_con_write+d1> cmpl   $0xbfffffff,0x40(%esp,1)
+ > c017e79d <do_con_write+d9> ja     c017e7a7 <do_con_write+e3>
+ > c017e79f <do_con_write+db> movl   %edx,%eax
+ > c017e7a1 <do_con_write+dd> movl   0x40(%esp,1),%ebx
+ > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
+ > c017e7a7 <do_con_write+e3> movzbl %dl,%esi
+
+The whole user memory access is reduced to 10 x86 machine instructions.
+The instructions bracketed in the .section directives are no longer
+in the normal execution path. They are located in a different section
+of the executable file:
+
+ > objdump --disassemble --section=.fixup vmlinux
+ >
+ > c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax
+ > c0199ffa <.fixup+10ba> xorb   %dl,%dl
+ > c0199ffc <.fixup+10bc> jmp    c017e7a7 <do_con_write+e3>
+
+And finally:
+ > objdump --full-contents --section=__ex_table vmlinux
+ >
+ >  c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0  ................
+ >  c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0  ................
+ >  c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0  ................
+
+or in human readable byte order:
+
+ >  c01aa7c4 c017c093 c0199fe0 c017c097 c017c099  ................
+ >  c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
+                               ^^^^^^^^^^^^^^^^^
+                               this is the interesting part!
+ >  c01aa7e4 c0180a08 c019a001 c0180a0a c019a004  ................
+
+What happened? The assembly directives
+
+.section .fixup,"ax"
+.section __ex_table,"a"
+
+told the assembler to move the following code to the specified
+sections in the ELF object file. So the instructions
+3:      movl $-14,%eax
+        xorb %dl,%dl
+        jmp 2b
+ended up in the .fixup section of the object file and the addresses
+        .long 1b,3b
+ended up in the __ex_table section of the object file. 1b and 3b
+are local labels. The local label 1b (1b stands for next label 1
+backward) is the address of the instruction that might fault, i.e.
+in our case the address of the label 1 is c017e7a5:
+the original assembly code: > 1:      movb (%ebx),%dl
+and linked in vmlinux     : > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
+
+The local label 3 (backwards again) is the address of the code to handle
+the fault, in our case the actual value is c0199ff5:
+the original assembly code: > 3:      movl $-14,%eax
+and linked in vmlinux     : > c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax
+
+The assembly code
+ > .section __ex_table,"a"
+ >         .align 4
+ >         .long 1b,3b
+
+becomes the value pair
+ >  c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
+                               ^this is ^this is
+                               1b       3b
+c017e7a5,c0199ff5 in the exception table of the kernel.
+
+So, what actually happens if a fault from kernel mode with no suitable
+vma occurs?
+
+1.) access to invalid address:
+ > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
+2.) MMU generates exception
+3.) CPU calls do_page_fault
+4.) do page fault calls search_exception_table (regs->eip == c017e7a5);
+5.) search_exception_table looks up the address c017e7a5 in the
+    exception table (i.e. the contents of the ELF section __ex_table)
+    and returns the address of the associated fault handle code c0199ff5.
+6.) do_page_fault modifies its own return address to point to the fault
+    handle code and returns.
+7.) execution continues in the fault handling code.
+8.) 8a) EAX becomes -EFAULT (== -14)
+    8b) DL  becomes zero (the value we "read" from user space)
+    8c) execution continues at local label 2 (address of the
+        instruction immediately after the faulting user access).
+
+The steps 8a to 8c in a certain way emulate the faulting instruction.
+
+That's it, mostly. If you look at our example, you might ask why
+we set EAX to -EFAULT in the exception handler code. Well, the
+get_user macro actually returns a value: 0, if the user access was
+successful, -EFAULT on failure. Our original code did not test this
+return value, however the inline assembly code in get_user tries to
+return -EFAULT. GCC selected EAX to return this value.
+
+NOTE:
+Due to the way that the exception table is built and needs to be ordered,
+only use exceptions for code in the .text section.  Any other section
+will cause the exception table to not be sorted correctly, and the
+exceptions will fail.
diff --git a/Documentation/x86/i386/IO-APIC.txt b/Documentation/x86/i386/IO-APIC.txt
new file mode 100644
index 000000000..15f5baf7e
--- /dev/null
+++ b/Documentation/x86/i386/IO-APIC.txt
@@ -0,0 +1,119 @@
+Most (all) Intel-MP compliant SMP boards have the so-called 'IO-APIC',
+which is an enhanced interrupt controller. It enables us to route
+hardware interrupts to multiple CPUs, or to CPU groups. Without an
+IO-APIC, interrupts from hardware will be delivered only to the
+CPU which boots the operating system (usually CPU#0).
+
+Linux supports all variants of compliant SMP boards, including ones with
+multiple IO-APICs. Multiple IO-APICs are used in high-end servers to
+distribute IRQ load further.
+
+There are (a few) known breakages in certain older boards, such bugs are
+usually worked around by the kernel. If your MP-compliant SMP board does
+not boot Linux, then consult the linux-smp mailing list archives first.
+
+If your box boots fine with enabled IO-APIC IRQs, then your
+/proc/interrupts will look like this one:
+
+   ---------------------------->
+  hell:~> cat /proc/interrupts
+             CPU0
+    0:    1360293    IO-APIC-edge  timer
+    1:          4    IO-APIC-edge  keyboard
+    2:          0          XT-PIC  cascade
+   13:          1          XT-PIC  fpu
+   14:       1448    IO-APIC-edge  ide0
+   16:      28232   IO-APIC-level  Intel EtherExpress Pro 10/100 Ethernet
+   17:      51304   IO-APIC-level  eth0
+  NMI:          0
+  ERR:          0
+  hell:~>
+  <----------------------------
+
+Some interrupts are still listed as 'XT PIC', but this is not a problem;
+none of those IRQ sources is performance-critical.
+
+
+In the unlikely case that your board does not create a working mp-table,
+you can use the pirq= boot parameter to 'hand-construct' IRQ entries. This
+is non-trivial though and cannot be automated. One sample /etc/lilo.conf
+entry:
+
+	append="pirq=15,11,10"
+
+The actual numbers depend on your system, on your PCI cards and on their
+PCI slot position. Usually PCI slots are 'daisy chained' before they are
+connected to the PCI chipset IRQ routing facility (the incoming PIRQ1-4
+lines):
+
+               ,-.        ,-.        ,-.        ,-.        ,-.
+     PIRQ4 ----| |-.    ,-| |-.    ,-| |-.    ,-| |--------| |
+               |S|  \  /  |S|  \  /  |S|  \  /  |S|        |S|
+     PIRQ3 ----|l|-. `/---|l|-. `/---|l|-. `/---|l|--------|l|
+               |o|  \/    |o|  \/    |o|  \/    |o|        |o|
+     PIRQ2 ----|t|-./`----|t|-./`----|t|-./`----|t|--------|t|
+               |1| /\     |2| /\     |3| /\     |4|        |5|
+     PIRQ1 ----| |-  `----| |-  `----| |-  `----| |--------| |
+               `-'        `-'        `-'        `-'        `-'
+
+Every PCI card emits a PCI IRQ, which can be INTA, INTB, INTC or INTD:
+
+                               ,-.
+                         INTD--| |
+                               |S|
+                         INTC--|l|
+                               |o|
+                         INTB--|t|
+                               |x|
+                         INTA--| |
+                               `-'
+
+These INTA-D PCI IRQs are always 'local to the card', their real meaning
+depends on which slot they are in. If you look at the daisy chaining diagram,
+a card in slot4, issuing INTA IRQ, it will end up as a signal on PIRQ4 of
+the PCI chipset. Most cards issue INTA, this creates optimal distribution
+between the PIRQ lines. (distributing IRQ sources properly is not a
+necessity, PCI IRQs can be shared at will, but it's a good for performance
+to have non shared interrupts). Slot5 should be used for videocards, they
+do not use interrupts normally, thus they are not daisy chained either.
+
+so if you have your SCSI card (IRQ11) in Slot1, Tulip card (IRQ9) in
+Slot2, then you'll have to specify this pirq= line:
+
+	append="pirq=11,9"
+
+the following script tries to figure out such a default pirq= line from
+your PCI configuration:
+
+	echo -n pirq=; echo `scanpci | grep T_L | cut -c56-` | sed 's/ /,/g'
+
+note that this script won't work if you have skipped a few slots or if your
+board does not do default daisy-chaining. (or the IO-APIC has the PIRQ pins
+connected in some strange way). E.g. if in the above case you have your SCSI
+card (IRQ11) in Slot3, and have Slot1 empty:
+
+	append="pirq=0,9,11"
+
+[value '0' is a generic 'placeholder', reserved for empty (or non-IRQ emitting)
+slots.]
+
+Generally, it's always possible to find out the correct pirq= settings, just
+permute all IRQ numbers properly ... it will take some time though. An
+'incorrect' pirq line will cause the booting process to hang, or a device
+won't function properly (e.g. if it's inserted as a module).
+
+If you have 2 PCI buses, then you can use up to 8 pirq values, although such
+boards tend to have a good configuration.
+
+Be prepared that it might happen that you need some strange pirq line:
+
+	append="pirq=0,0,0,0,0,0,9,11"
+
+Use smart trial-and-error techniques to find out the correct pirq line ...
+
+Good luck and mail to linux-smp@vger.kernel.org or
+linux-kernel@vger.kernel.org if you have any problems that are not covered
+by this document.
+
+-- mingo
+
diff --git a/Documentation/x86/intel_mpx.txt b/Documentation/x86/intel_mpx.txt
new file mode 100644
index 000000000..818518a3f
--- /dev/null
+++ b/Documentation/x86/intel_mpx.txt
@@ -0,0 +1,244 @@
+1. Intel(R) MPX Overview
+========================
+
+Intel(R) Memory Protection Extensions (Intel(R) MPX) is a new capability
+introduced into Intel Architecture. Intel MPX provides hardware features
+that can be used in conjunction with compiler changes to check memory
+references, for those references whose compile-time normal intentions are
+usurped at runtime due to buffer overflow or underflow.
+
+You can tell if your CPU supports MPX by looking in /proc/cpuinfo:
+
+	cat /proc/cpuinfo  | grep ' mpx '
+
+For more information, please refer to Intel(R) Architecture Instruction
+Set Extensions Programming Reference, Chapter 9: Intel(R) Memory Protection
+Extensions.
+
+Note: As of December 2014, no hardware with MPX is available but it is
+possible to use SDE (Intel(R) Software Development Emulator) instead, which
+can be downloaded from
+http://software.intel.com/en-us/articles/intel-software-development-emulator
+
+
+2. How to get the advantage of MPX
+==================================
+
+For MPX to work, changes are required in the kernel, binutils and compiler.
+No source changes are required for applications, just a recompile.
+
+There are a lot of moving parts of this to all work right. The following
+is how we expect the compiler, application and kernel to work together.
+
+1) Application developer compiles with -fmpx. The compiler will add the
+   instrumentation as well as some setup code called early after the app
+   starts. New instruction prefixes are noops for old CPUs.
+2) That setup code allocates (virtual) space for the "bounds directory",
+   points the "bndcfgu" register to the directory (must also set the valid
+   bit) and notifies the kernel (via the new prctl(PR_MPX_ENABLE_MANAGEMENT))
+   that the app will be using MPX.  The app must be careful not to access
+   the bounds tables between the time when it populates "bndcfgu" and
+   when it calls the prctl().  This might be hard to guarantee if the app
+   is compiled with MPX.  You can add "__attribute__((bnd_legacy))" to
+   the function to disable MPX instrumentation to help guarantee this.
+   Also be careful not to call out to any other code which might be
+   MPX-instrumented.
+3) The kernel detects that the CPU has MPX, allows the new prctl() to
+   succeed, and notes the location of the bounds directory. Userspace is
+   expected to keep the bounds directory at that locationWe note it
+   instead of reading it each time because the 'xsave' operation needed
+   to access the bounds directory register is an expensive operation.
+4) If the application needs to spill bounds out of the 4 registers, it
+   issues a bndstx instruction. Since the bounds directory is empty at
+   this point, a bounds fault (#BR) is raised, the kernel allocates a
+   bounds table (in the user address space) and makes the relevant entry
+   in the bounds directory point to the new table.
+5) If the application violates the bounds specified in the bounds registers,
+   a separate kind of #BR is raised which will deliver a signal with
+   information about the violation in the 'struct siginfo'.
+6) Whenever memory is freed, we know that it can no longer contain valid
+   pointers, and we attempt to free the associated space in the bounds
+   tables. If an entire table becomes unused, we will attempt to free
+   the table and remove the entry in the directory.
+
+To summarize, there are essentially three things interacting here:
+
+GCC with -fmpx:
+ * enables annotation of code with MPX instructions and prefixes
+ * inserts code early in the application to call in to the "gcc runtime"
+GCC MPX Runtime:
+ * Checks for hardware MPX support in cpuid leaf
+ * allocates virtual space for the bounds directory (malloc() essentially)
+ * points the hardware BNDCFGU register at the directory
+ * calls a new prctl(PR_MPX_ENABLE_MANAGEMENT) to notify the kernel to
+   start managing the bounds directories
+Kernel MPX Code:
+ * Checks for hardware MPX support in cpuid leaf
+ * Handles #BR exceptions and sends SIGSEGV to the app when it violates
+   bounds, like during a buffer overflow.
+ * When bounds are spilled in to an unallocated bounds table, the kernel
+   notices in the #BR exception, allocates the virtual space, then
+   updates the bounds directory to point to the new table. It keeps
+   special track of the memory with a VM_MPX flag.
+ * Frees unused bounds tables at the time that the memory they described
+   is unmapped.
+
+
+3. How does MPX kernel code work
+================================
+
+Handling #BR faults caused by MPX
+---------------------------------
+
+When MPX is enabled, there are 2 new situations that can generate
+#BR faults.
+  * new bounds tables (BT) need to be allocated to save bounds.
+  * bounds violation caused by MPX instructions.
+
+We hook #BR handler to handle these two new situations.
+
+On-demand kernel allocation of bounds tables
+--------------------------------------------
+
+MPX only has 4 hardware registers for storing bounds information. If
+MPX-enabled code needs more than these 4 registers, it needs to spill
+them somewhere. It has two special instructions for this which allow
+the bounds to be moved between the bounds registers and some new "bounds
+tables".
+
+#BR exceptions are a new class of exceptions just for MPX. They are
+similar conceptually to a page fault and will be raised by the MPX
+hardware during both bounds violations or when the tables are not
+present. The kernel handles those #BR exceptions for not-present tables
+by carving the space out of the normal processes address space and then
+pointing the bounds-directory over to it.
+
+The tables need to be accessed and controlled by userspace because
+the instructions for moving bounds in and out of them are extremely
+frequent. They potentially happen every time a register points to
+memory. Any direct kernel involvement (like a syscall) to access the
+tables would obviously destroy performance.
+
+Why not do this in userspace? MPX does not strictly require anything in
+the kernel. It can theoretically be done completely from userspace. Here
+are a few ways this could be done. We don't think any of them are practical
+in the real-world, but here they are.
+
+Q: Can virtual space simply be reserved for the bounds tables so that we
+   never have to allocate them?
+A: MPX-enabled application will possibly create a lot of bounds tables in
+   process address space to save bounds information. These tables can take
+   up huge swaths of memory (as much as 80% of the memory on the system)
+   even if we clean them up aggressively. In the worst-case scenario, the
+   tables can be 4x the size of the data structure being tracked. IOW, a
+   1-page structure can require 4 bounds-table pages. An X-GB virtual
+   area needs 4*X GB of virtual space, plus 2GB for the bounds directory.
+   If we were to preallocate them for the 128TB of user virtual address
+   space, we would need to reserve 512TB+2GB, which is larger than the
+   entire virtual address space today. This means they can not be reserved
+   ahead of time. Also, a single process's pre-popualated bounds directory
+   consumes 2GB of virtual *AND* physical memory. IOW, it's completely
+   infeasible to prepopulate bounds directories.
+
+Q: Can we preallocate bounds table space at the same time memory is
+   allocated which might contain pointers that might eventually need
+   bounds tables?
+A: This would work if we could hook the site of each and every memory
+   allocation syscall. This can be done for small, constrained applications.
+   But, it isn't practical at a larger scale since a given app has no
+   way of controlling how all the parts of the app might allocate memory
+   (think libraries). The kernel is really the only place to intercept
+   these calls.
+
+Q: Could a bounds fault be handed to userspace and the tables allocated
+   there in a signal handler intead of in the kernel?
+A: mmap() is not on the list of safe async handler functions and even
+   if mmap() would work it still requires locking or nasty tricks to
+   keep track of the allocation state there.
+
+Having ruled out all of the userspace-only approaches for managing
+bounds tables that we could think of, we create them on demand in
+the kernel.
+
+Decoding MPX instructions
+-------------------------
+
+If a #BR is generated due to a bounds violation caused by MPX.
+We need to decode MPX instructions to get violation address and
+set this address into extended struct siginfo.
+
+The _sigfault feild of struct siginfo is extended as follow:
+
+87		/* SIGILL, SIGFPE, SIGSEGV, SIGBUS */
+88		struct {
+89			void __user *_addr; /* faulting insn/memory ref. */
+90 #ifdef __ARCH_SI_TRAPNO
+91			int _trapno;	/* TRAP # which caused the signal */
+92 #endif
+93			short _addr_lsb; /* LSB of the reported address */
+94			struct {
+95				void __user *_lower;
+96				void __user *_upper;
+97			} _addr_bnd;
+98		} _sigfault;
+
+The '_addr' field refers to violation address, and new '_addr_and'
+field refers to the upper/lower bounds when a #BR is caused.
+
+Glibc will be also updated to support this new siginfo. So user
+can get violation address and bounds when bounds violations occur.
+
+Cleanup unused bounds tables
+----------------------------
+
+When a BNDSTX instruction attempts to save bounds to a bounds directory
+entry marked as invalid, a #BR is generated. This is an indication that
+no bounds table exists for this entry. In this case the fault handler
+will allocate a new bounds table on demand.
+
+Since the kernel allocated those tables on-demand without userspace
+knowledge, it is also responsible for freeing them when the associated
+mappings go away.
+
+Here, the solution for this issue is to hook do_munmap() to check
+whether one process is MPX enabled. If yes, those bounds tables covered
+in the virtual address region which is being unmapped will be freed also.
+
+Adding new prctl commands
+-------------------------
+
+Two new prctl commands are added to enable and disable MPX bounds tables
+management in kernel.
+
+155	#define PR_MPX_ENABLE_MANAGEMENT	43
+156	#define PR_MPX_DISABLE_MANAGEMENT	44
+
+Runtime library in userspace is responsible for allocation of bounds
+directory. So kernel have to use XSAVE instruction to get the base
+of bounds directory from BNDCFG register.
+
+But XSAVE is expected to be very expensive. In order to do performance
+optimization, we have to get the base of bounds directory and save it
+into struct mm_struct to be used in future during PR_MPX_ENABLE_MANAGEMENT
+command execution.
+
+
+4. Special rules
+================
+
+1) If userspace is requesting help from the kernel to do the management
+of bounds tables, it may not create or modify entries in the bounds directory.
+
+Certainly users can allocate bounds tables and forcibly point the bounds
+directory at them through XSAVE instruction, and then set valid bit
+of bounds entry to have this entry valid.  But, the kernel will decline
+to assist in managing these tables.
+
+2) Userspace may not take multiple bounds directory entries and point
+them at the same bounds table.
+
+This is allowed architecturally.  See more information "Intel(R) Architecture
+Instruction Set Extensions Programming Reference" (9.3.4).
+
+However, if users did this, the kernel might be fooled in to unmaping an
+in-use bounds table since it does not recognize sharing.
diff --git a/Documentation/x86/mtrr.txt b/Documentation/x86/mtrr.txt
new file mode 100644
index 000000000..cc071dc33
--- /dev/null
+++ b/Documentation/x86/mtrr.txt
@@ -0,0 +1,305 @@
+MTRR (Memory Type Range Register) control
+3 Jun 1999
+Richard Gooch
+<rgooch@atnf.csiro.au>
+
+  On Intel P6 family processors (Pentium Pro, Pentium II and later)
+  the Memory Type Range Registers (MTRRs) may be used to control
+  processor access to memory ranges. This is most useful when you have
+  a video (VGA) card on a PCI or AGP bus. Enabling write-combining
+  allows bus write transfers to be combined into a larger transfer
+  before bursting over the PCI/AGP bus. This can increase performance
+  of image write operations 2.5 times or more.
+
+  The Cyrix 6x86, 6x86MX and M II processors have Address Range
+  Registers (ARRs) which provide a similar functionality to MTRRs. For
+  these, the ARRs are used to emulate the MTRRs.
+
+  The AMD K6-2 (stepping 8 and above) and K6-3 processors have two
+  MTRRs. These are supported.  The AMD Athlon family provide 8 Intel
+  style MTRRs.
+
+  The Centaur C6 (WinChip) has 8 MCRs, allowing write-combining. These
+  are supported.
+
+  The VIA Cyrix III and VIA C3 CPUs offer 8 Intel style MTRRs.
+
+  The CONFIG_MTRR option creates a /proc/mtrr file which may be used
+  to manipulate your MTRRs. Typically the X server should use
+  this. This should have a reasonably generic interface so that
+  similar control registers on other processors can be easily
+  supported.
+
+
+There are two interfaces to /proc/mtrr: one is an ASCII interface
+which allows you to read and write. The other is an ioctl()
+interface. The ASCII interface is meant for administration. The
+ioctl() interface is meant for C programs (i.e. the X server). The
+interfaces are described below, with sample commands and C code.
+
+===============================================================================
+Reading MTRRs from the shell:
+
+% cat /proc/mtrr
+reg00: base=0x00000000 (   0MB), size= 128MB: write-back, count=1
+reg01: base=0x08000000 ( 128MB), size=  64MB: write-back, count=1
+===============================================================================
+Creating MTRRs from the C-shell:
+# echo "base=0xf8000000 size=0x400000 type=write-combining" >! /proc/mtrr
+or if you use bash:
+# echo "base=0xf8000000 size=0x400000 type=write-combining" >| /proc/mtrr
+
+And the result thereof:
+% cat /proc/mtrr
+reg00: base=0x00000000 (   0MB), size= 128MB: write-back, count=1
+reg01: base=0x08000000 ( 128MB), size=  64MB: write-back, count=1
+reg02: base=0xf8000000 (3968MB), size=   4MB: write-combining, count=1
+
+This is for video RAM at base address 0xf8000000 and size 4 megabytes. To
+find out your base address, you need to look at the output of your X
+server, which tells you where the linear framebuffer address is. A
+typical line that you may get is:
+
+(--) S3: PCI: 968 rev 0, Linear FB @ 0xf8000000
+
+Note that you should only use the value from the X server, as it may
+move the framebuffer base address, so the only value you can trust is
+that reported by the X server.
+
+To find out the size of your framebuffer (what, you don't actually
+know?), the following line will tell you:
+
+(--) S3: videoram:  4096k
+
+That's 4 megabytes, which is 0x400000 bytes (in hexadecimal).
+A patch is being written for XFree86 which will make this automatic:
+in other words the X server will manipulate /proc/mtrr using the
+ioctl() interface, so users won't have to do anything. If you use a
+commercial X server, lobby your vendor to add support for MTRRs.
+===============================================================================
+Creating overlapping MTRRs:
+
+%echo "base=0xfb000000 size=0x1000000 type=write-combining" >/proc/mtrr
+%echo "base=0xfb000000 size=0x1000 type=uncachable" >/proc/mtrr
+
+And the results: cat /proc/mtrr
+reg00: base=0x00000000 (   0MB), size=  64MB: write-back, count=1
+reg01: base=0xfb000000 (4016MB), size=  16MB: write-combining, count=1
+reg02: base=0xfb000000 (4016MB), size=   4kB: uncachable, count=1
+
+Some cards (especially Voodoo Graphics boards) need this 4 kB area
+excluded from the beginning of the region because it is used for
+registers.
+
+NOTE: You can only create type=uncachable region, if the first
+region that you created is type=write-combining.
+===============================================================================
+Removing MTRRs from the C-shell:
+% echo "disable=2" >! /proc/mtrr
+or using bash:
+% echo "disable=2" >| /proc/mtrr
+===============================================================================
+Reading MTRRs from a C program using ioctl()'s:
+
+/*  mtrr-show.c
+
+    Source file for mtrr-show (example program to show MTRRs using ioctl()'s)
+
+    Copyright (C) 1997-1998  Richard Gooch
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with this program; if not, write to the Free Software
+    Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+    Richard Gooch may be reached by email at  rgooch@atnf.csiro.au
+    The postal address is:
+      Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia.
+*/
+
+/*
+    This program will use an ioctl() on /proc/mtrr to show the current MTRR
+    settings. This is an alternative to reading /proc/mtrr.
+
+
+    Written by      Richard Gooch   17-DEC-1997
+
+    Last updated by Richard Gooch   2-MAY-1998
+
+
+*/
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <errno.h>
+#include <asm/mtrr.h>
+
+#define TRUE 1
+#define FALSE 0
+#define ERRSTRING strerror (errno)
+
+static char *mtrr_strings[MTRR_NUM_TYPES] =
+{
+    "uncachable",               /* 0 */
+    "write-combining",          /* 1 */
+    "?",                        /* 2 */
+    "?",                        /* 3 */
+    "write-through",            /* 4 */
+    "write-protect",            /* 5 */
+    "write-back",               /* 6 */
+};
+
+int main ()
+{
+    int fd;
+    struct mtrr_gentry gentry;
+
+    if ( ( fd = open ("/proc/mtrr", O_RDONLY, 0) ) == -1 )
+    {
+	if (errno == ENOENT)
+	{
+	    fputs ("/proc/mtrr not found: not supported or you don't have a PPro?\n",
+		   stderr);
+	    exit (1);
+	}
+	fprintf (stderr, "Error opening /proc/mtrr\t%s\n", ERRSTRING);
+	exit (2);
+    }
+    for (gentry.regnum = 0; ioctl (fd, MTRRIOC_GET_ENTRY, &gentry) == 0;
+	 ++gentry.regnum)
+    {
+	if (gentry.size < 1)
+	{
+	    fprintf (stderr, "Register: %u disabled\n", gentry.regnum);
+	    continue;
+	}
+	fprintf (stderr, "Register: %u base: 0x%lx size: 0x%lx type: %s\n",
+		 gentry.regnum, gentry.base, gentry.size,
+		 mtrr_strings[gentry.type]);
+    }
+    if (errno == EINVAL) exit (0);
+    fprintf (stderr, "Error doing ioctl(2) on /dev/mtrr\t%s\n", ERRSTRING);
+    exit (3);
+}   /*  End Function main  */
+===============================================================================
+Creating MTRRs from a C programme using ioctl()'s:
+
+/*  mtrr-add.c
+
+    Source file for mtrr-add (example programme to add an MTRRs using ioctl())
+
+    Copyright (C) 1997-1998  Richard Gooch
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with this program; if not, write to the Free Software
+    Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+
+    Richard Gooch may be reached by email at  rgooch@atnf.csiro.au
+    The postal address is:
+      Richard Gooch, c/o ATNF, P. O. Box 76, Epping, N.S.W., 2121, Australia.
+*/
+
+/*
+    This programme will use an ioctl() on /proc/mtrr to add an entry. The first
+    available mtrr is used. This is an alternative to writing /proc/mtrr.
+
+
+    Written by      Richard Gooch   17-DEC-1997
+
+    Last updated by Richard Gooch   2-MAY-1998
+
+
+*/
+#include <stdio.h>
+#include <string.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <sys/ioctl.h>
+#include <errno.h>
+#include <asm/mtrr.h>
+
+#define TRUE 1
+#define FALSE 0
+#define ERRSTRING strerror (errno)
+
+static char *mtrr_strings[MTRR_NUM_TYPES] =
+{
+    "uncachable",               /* 0 */
+    "write-combining",          /* 1 */
+    "?",                        /* 2 */
+    "?",                        /* 3 */
+    "write-through",            /* 4 */
+    "write-protect",            /* 5 */
+    "write-back",               /* 6 */
+};
+
+int main (int argc, char **argv)
+{
+    int fd;
+    struct mtrr_sentry sentry;
+
+    if (argc != 4)
+    {
+	fprintf (stderr, "Usage:\tmtrr-add base size type\n");
+	exit (1);
+    }
+    sentry.base = strtoul (argv[1], NULL, 0);
+    sentry.size = strtoul (argv[2], NULL, 0);
+    for (sentry.type = 0; sentry.type < MTRR_NUM_TYPES; ++sentry.type)
+    {
+	if (strcmp (argv[3], mtrr_strings[sentry.type]) == 0) break;
+    }
+    if (sentry.type >= MTRR_NUM_TYPES)
+    {
+	fprintf (stderr, "Illegal type: \"%s\"\n", argv[3]);
+	exit (2);
+    }
+    if ( ( fd = open ("/proc/mtrr", O_WRONLY, 0) ) == -1 )
+    {
+	if (errno == ENOENT)
+	{
+	    fputs ("/proc/mtrr not found: not supported or you don't have a PPro?\n",
+		   stderr);
+	    exit (3);
+	}
+	fprintf (stderr, "Error opening /proc/mtrr\t%s\n", ERRSTRING);
+	exit (4);
+    }
+    if (ioctl (fd, MTRRIOC_ADD_ENTRY, &sentry) == -1)
+    {
+	fprintf (stderr, "Error doing ioctl(2) on /dev/mtrr\t%s\n", ERRSTRING);
+	exit (5);
+    }
+    fprintf (stderr, "Sleeping for 5 seconds so you can see the new entry\n");
+    sleep (5);
+    close (fd);
+    fputs ("I've just closed /proc/mtrr so now the new entry should be gone\n",
+	   stderr);
+}   /*  End Function main  */
+===============================================================================
diff --git a/Documentation/x86/pat.txt b/Documentation/x86/pat.txt
new file mode 100644
index 000000000..cf08c9fff
--- /dev/null
+++ b/Documentation/x86/pat.txt
@@ -0,0 +1,160 @@
+
+PAT (Page Attribute Table)
+
+x86 Page Attribute Table (PAT) allows for setting the memory attribute at the
+page level granularity. PAT is complementary to the MTRR settings which allows
+for setting of memory types over physical address ranges. However, PAT is
+more flexible than MTRR due to its capability to set attributes at page level
+and also due to the fact that there are no hardware limitations on number of
+such attribute settings allowed. Added flexibility comes with guidelines for
+not having memory type aliasing for the same physical memory with multiple
+virtual addresses.
+
+PAT allows for different types of memory attributes. The most commonly used
+ones that will be supported at this time are Write-back, Uncached,
+Write-combined and Uncached Minus.
+
+
+PAT APIs
+--------
+
+There are many different APIs in the kernel that allows setting of memory
+attributes at the page level. In order to avoid aliasing, these interfaces
+should be used thoughtfully. Below is a table of interfaces available,
+their intended usage and their memory attribute relationships. Internally,
+these APIs use a reserve_memtype()/free_memtype() interface on the physical
+address range to avoid any aliasing.
+
+
+-------------------------------------------------------------------
+API                    |    RAM   |  ACPI,...  |  Reserved/Holes  |
+-----------------------|----------|------------|------------------|
+                       |          |            |                  |
+ioremap                |    --    |    UC-     |       UC-        |
+                       |          |            |                  |
+ioremap_cache          |    --    |    WB      |       WB         |
+                       |          |            |                  |
+ioremap_nocache        |    --    |    UC-     |       UC-        |
+                       |          |            |                  |
+ioremap_wc             |    --    |    --      |       WC         |
+                       |          |            |                  |
+set_memory_uc          |    UC-   |    --      |       --         |
+ set_memory_wb         |          |            |                  |
+                       |          |            |                  |
+set_memory_wc          |    WC    |    --      |       --         |
+ set_memory_wb         |          |            |                  |
+                       |          |            |                  |
+pci sysfs resource     |    --    |    --      |       UC-        |
+                       |          |            |                  |
+pci sysfs resource_wc  |    --    |    --      |       WC         |
+ is IORESOURCE_PREFETCH|          |            |                  |
+                       |          |            |                  |
+pci proc               |    --    |    --      |       UC-        |
+ !PCIIOC_WRITE_COMBINE |          |            |                  |
+                       |          |            |                  |
+pci proc               |    --    |    --      |       WC         |
+ PCIIOC_WRITE_COMBINE  |          |            |                  |
+                       |          |            |                  |
+/dev/mem               |    --    |  WB/WC/UC- |    WB/WC/UC-     |
+ read-write            |          |            |                  |
+                       |          |            |                  |
+/dev/mem               |    --    |    UC-     |       UC-        |
+ mmap SYNC flag        |          |            |                  |
+                       |          |            |                  |
+/dev/mem               |    --    |  WB/WC/UC- |    WB/WC/UC-     |
+ mmap !SYNC flag       |          |(from exist-|  (from exist-    |
+ and                   |          |  ing alias)|    ing alias)    |
+ any alias to this area|          |            |                  |
+                       |          |            |                  |
+/dev/mem               |    --    |    WB      |       WB         |
+ mmap !SYNC flag       |          |            |                  |
+ no alias to this area |          |            |                  |
+ and                   |          |            |                  |
+ MTRR says WB          |          |            |                  |
+                       |          |            |                  |
+/dev/mem               |    --    |    --      |       UC-        |
+ mmap !SYNC flag       |          |            |                  |
+ no alias to this area |          |            |                  |
+ and                   |          |            |                  |
+ MTRR says !WB         |          |            |                  |
+                       |          |            |                  |
+-------------------------------------------------------------------
+
+Advanced APIs for drivers
+-------------------------
+A. Exporting pages to users with remap_pfn_range, io_remap_pfn_range,
+vm_insert_pfn
+
+Drivers wanting to export some pages to userspace do it by using mmap
+interface and a combination of
+1) pgprot_noncached()
+2) io_remap_pfn_range() or remap_pfn_range() or vm_insert_pfn()
+
+With PAT support, a new API pgprot_writecombine is being added. So, drivers can
+continue to use the above sequence, with either pgprot_noncached() or
+pgprot_writecombine() in step 1, followed by step 2.
+
+In addition, step 2 internally tracks the region as UC or WC in memtype
+list in order to ensure no conflicting mapping.
+
+Note that this set of APIs only works with IO (non RAM) regions. If driver
+wants to export a RAM region, it has to do set_memory_uc() or set_memory_wc()
+as step 0 above and also track the usage of those pages and use set_memory_wb()
+before the page is freed to free pool.
+
+
+
+Notes:
+
+-- in the above table mean "Not suggested usage for the API". Some of the --'s
+are strictly enforced by the kernel. Some others are not really enforced
+today, but may be enforced in future.
+
+For ioremap and pci access through /sys or /proc - The actual type returned
+can be more restrictive, in case of any existing aliasing for that address.
+For example: If there is an existing uncached mapping, a new ioremap_wc can
+return uncached mapping in place of write-combine requested.
+
+set_memory_[uc|wc] and set_memory_wb should be used in pairs, where driver will
+first make a region uc or wc and switch it back to wb after use.
+
+Over time writes to /proc/mtrr will be deprecated in favor of using PAT based
+interfaces. Users writing to /proc/mtrr are suggested to use above interfaces.
+
+Drivers should use ioremap_[uc|wc] to access PCI BARs with [uc|wc] access
+types.
+
+Drivers should use set_memory_[uc|wc] to set access type for RAM ranges.
+
+
+PAT debugging
+-------------
+
+With CONFIG_DEBUG_FS enabled, PAT memtype list can be examined by
+
+# mount -t debugfs debugfs /sys/kernel/debug
+# cat /sys/kernel/debug/x86/pat_memtype_list
+PAT memtype list:
+uncached-minus @ 0x7fadf000-0x7fae0000
+uncached-minus @ 0x7fb19000-0x7fb1a000
+uncached-minus @ 0x7fb1a000-0x7fb1b000
+uncached-minus @ 0x7fb1b000-0x7fb1c000
+uncached-minus @ 0x7fb1c000-0x7fb1d000
+uncached-minus @ 0x7fb1d000-0x7fb1e000
+uncached-minus @ 0x7fb1e000-0x7fb25000
+uncached-minus @ 0x7fb25000-0x7fb26000
+uncached-minus @ 0x7fb26000-0x7fb27000
+uncached-minus @ 0x7fb27000-0x7fb28000
+uncached-minus @ 0x7fb28000-0x7fb2e000
+uncached-minus @ 0x7fb2e000-0x7fb2f000
+uncached-minus @ 0x7fb2f000-0x7fb30000
+uncached-minus @ 0x7fb31000-0x7fb32000
+uncached-minus @ 0x80000000-0x90000000
+
+This list shows physical address ranges and various PAT settings used to
+access those physical address ranges.
+
+Another, more verbose way of getting PAT related debug messages is with
+"debugpat" boot parameter. With this parameter, various debug messages are
+printed to dmesg log.
+
diff --git a/Documentation/x86/tlb.txt b/Documentation/x86/tlb.txt
new file mode 100644
index 000000000..39d172326
--- /dev/null
+++ b/Documentation/x86/tlb.txt
@@ -0,0 +1,75 @@
+When the kernel unmaps or modified the attributes of a range of
+memory, it has two choices:
+ 1. Flush the entire TLB with a two-instruction sequence.  This is
+    a quick operation, but it causes collateral damage: TLB entries
+    from areas other than the one we are trying to flush will be
+    destroyed and must be refilled later, at some cost.
+ 2. Use the invlpg instruction to invalidate a single page at a
+    time.  This could potentialy cost many more instructions, but
+    it is a much more precise operation, causing no collateral
+    damage to other TLB entries.
+
+Which method to do depends on a few things:
+ 1. The size of the flush being performed.  A flush of the entire
+    address space is obviously better performed by flushing the
+    entire TLB than doing 2^48/PAGE_SIZE individual flushes.
+ 2. The contents of the TLB.  If the TLB is empty, then there will
+    be no collateral damage caused by doing the global flush, and
+    all of the individual flush will have ended up being wasted
+    work.
+ 3. The size of the TLB.  The larger the TLB, the more collateral
+    damage we do with a full flush.  So, the larger the TLB, the
+    more attrative an individual flush looks.  Data and
+    instructions have separate TLBs, as do different page sizes.
+ 4. The microarchitecture.  The TLB has become a multi-level
+    cache on modern CPUs, and the global flushes have become more
+    expensive relative to single-page flushes.
+
+There is obviously no way the kernel can know all these things,
+especially the contents of the TLB during a given flush.  The
+sizes of the flush will vary greatly depending on the workload as
+well.  There is essentially no "right" point to choose.
+
+You may be doing too many individual invalidations if you see the
+invlpg instruction (or instructions _near_ it) show up high in
+profiles.  If you believe that individual invalidations being
+called too often, you can lower the tunable:
+
+	/sys/kernel/debug/x86/tlb_single_page_flush_ceiling
+
+This will cause us to do the global flush for more cases.
+Lowering it to 0 will disable the use of the individual flushes.
+Setting it to 1 is a very conservative setting and it should
+never need to be 0 under normal circumstances.
+
+Despite the fact that a single individual flush on x86 is
+guaranteed to flush a full 2MB [1], hugetlbfs always uses the full
+flushes.  THP is treated exactly the same as normal memory.
+
+You might see invlpg inside of flush_tlb_mm_range() show up in
+profiles, or you can use the trace_tlb_flush() tracepoints. to
+determine how long the flush operations are taking.
+
+Essentially, you are balancing the cycles you spend doing invlpg
+with the cycles that you spend refilling the TLB later.
+
+You can measure how expensive TLB refills are by using
+performance counters and 'perf stat', like this:
+
+perf stat -e
+	cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/,
+	cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/,
+	cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/,
+	cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/,
+	cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/,
+	cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/
+
+That works on an IvyBridge-era CPU (i5-3320M).  Different CPUs
+may have differently-named counters, but they should at least
+be there in some form.  You can use pmu-tools 'ocperf list'
+(https://github.com/andikleen/pmu-tools) to find the right
+counters for a given CPU.
+
+1. A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation"
+   says: "One execution of INVLPG is sufficient even for a page
+   with size greater than 4 KBytes."
diff --git a/Documentation/x86/usb-legacy-support.txt b/Documentation/x86/usb-legacy-support.txt
new file mode 100644
index 000000000..1894cdfc6
--- /dev/null
+++ b/Documentation/x86/usb-legacy-support.txt
@@ -0,0 +1,44 @@
+USB Legacy support
+~~~~~~~~~~~~~~~~~~
+
+Vojtech Pavlik <vojtech@suse.cz>, January 2004
+
+
+Also known as "USB Keyboard" or "USB Mouse support" in the BIOS Setup is a
+feature that allows one to use the USB mouse and keyboard as if they were
+their classic PS/2 counterparts.  This means one can use an USB keyboard to
+type in LILO for example.
+
+It has several drawbacks, though:
+
+1) On some machines, the emulated PS/2 mouse takes over even when no USB
+   mouse is present and a real PS/2 mouse is present.  In that case the extra
+   features (wheel, extra buttons, touchpad mode) of the real PS/2 mouse may
+   not be available.
+
+2) If CONFIG_HIGHMEM64G is enabled, the PS/2 mouse emulation can cause
+   system crashes, because the SMM BIOS is not expecting to be in PAE mode.
+   The Intel E7505 is a typical machine where this happens.
+
+3) If AMD64 64-bit mode is enabled, again system crashes often happen,
+   because the SMM BIOS isn't expecting the CPU to be in 64-bit mode.  The
+   BIOS manufacturers only test with Windows, and Windows doesn't do 64-bit
+   yet.
+
+Solutions:
+
+Problem 1) can be solved by loading the USB drivers prior to loading the
+PS/2 mouse driver. Since the PS/2 mouse driver is in 2.6 compiled into
+the kernel unconditionally, this means the USB drivers need to be
+compiled-in, too.
+
+Problem 2) can currently only be solved by either disabling HIGHMEM64G
+in the kernel config or USB Legacy support in the BIOS. A BIOS update
+could help, but so far no such update exists.
+
+Problem 3) is usually fixed by a BIOS update. Check the board
+manufacturers web site. If an update is not available, disable USB
+Legacy support in the BIOS. If this alone doesn't help, try also adding
+idle=poll on the kernel command line. The BIOS may be entering the SMM
+on the HLT instruction as well.
+
diff --git a/Documentation/x86/x86_64/00-INDEX b/Documentation/x86/x86_64/00-INDEX
new file mode 100644
index 000000000..92fc20ab5
--- /dev/null
+++ b/Documentation/x86/x86_64/00-INDEX
@@ -0,0 +1,16 @@
+00-INDEX
+	- This file
+boot-options.txt
+	- AMD64-specific boot options.
+cpu-hotplug-spec
+	- Firmware support for CPU hotplug under Linux/x86-64
+fake-numa-for-cpusets
+	- Using numa=fake and CPUSets for Resource Management
+kernel-stacks
+	- Context-specific per-processor interrupt stacks.
+machinecheck
+	- Configurable sysfs parameters for the x86-64 machine check code.
+mm.txt
+	- Memory layout of x86-64 (4 level page tables, 46 bits physical).
+uefi.txt
+	- Booting Linux via Unified Extensible Firmware Interface.
diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt
new file mode 100644
index 000000000..522347929
--- /dev/null
+++ b/Documentation/x86/x86_64/boot-options.txt
@@ -0,0 +1,284 @@
+AMD64 specific boot options
+
+There are many others (usually documented in driver documentation), but
+only the AMD64 specific ones are listed here.
+
+Machine check
+
+   Please see Documentation/x86/x86_64/machinecheck for sysfs runtime tunables.
+
+   mce=off
+		Disable machine check
+   mce=no_cmci
+		Disable CMCI(Corrected Machine Check Interrupt) that
+		Intel processor supports.  Usually this disablement is
+		not recommended, but it might be handy if your hardware
+		is misbehaving.
+		Note that you'll get more problems without CMCI than with
+		due to the shared banks, i.e. you might get duplicated
+		error logs.
+   mce=dont_log_ce
+		Don't make logs for corrected errors.  All events reported
+		as corrected are silently cleared by OS.
+		This option will be useful if you have no interest in any
+		of corrected errors.
+   mce=ignore_ce
+		Disable features for corrected errors, e.g. polling timer
+		and CMCI.  All events reported as corrected are not cleared
+		by OS and remained in its error banks.
+		Usually this disablement is not recommended, however if
+		there is an agent checking/clearing corrected errors
+		(e.g. BIOS or hardware monitoring applications), conflicting
+		with OS's error handling, and you cannot deactivate the agent,
+		then this option will be a help.
+   mce=bootlog
+		Enable logging of machine checks left over from booting.
+		Disabled by default on AMD because some BIOS leave bogus ones.
+		If your BIOS doesn't do that it's a good idea to enable though
+		to make sure you log even machine check events that result
+		in a reboot. On Intel systems it is enabled by default.
+   mce=nobootlog
+		Disable boot machine check logging.
+   mce=tolerancelevel[,monarchtimeout] (number,number)
+		tolerance levels:
+		0: always panic on uncorrected errors, log corrected errors
+		1: panic or SIGBUS on uncorrected errors, log corrected errors
+		2: SIGBUS or log uncorrected errors, log corrected errors
+		3: never panic or SIGBUS, log all errors (for testing only)
+		Default is 1
+		Can be also set using sysfs which is preferable.
+		monarchtimeout:
+		Sets the time in us to wait for other CPUs on machine checks. 0
+		to disable.
+   mce=bios_cmci_threshold
+		Don't overwrite the bios-set CMCI threshold. This boot option
+		prevents Linux from overwriting the CMCI threshold set by the
+		bios. Without this option, Linux always sets the CMCI
+		threshold to 1. Enabling this may make memory predictive failure
+		analysis less effective if the bios sets thresholds for memory
+		errors since we will not see details for all errors.
+
+   nomce (for compatibility with i386): same as mce=off
+
+   Everything else is in sysfs now.
+
+APICs
+
+   apic		 Use IO-APIC. Default
+
+   noapic	 Don't use the IO-APIC.
+
+   disableapic	 Don't use the local APIC
+
+   nolapic	 Don't use the local APIC (alias for i386 compatibility)
+
+   pirq=...	 See Documentation/x86/i386/IO-APIC.txt
+
+   noapictimer	 Don't set up the APIC timer
+
+   no_timer_check Don't check the IO-APIC timer. This can work around
+		 problems with incorrect timer initialization on some boards.
+   apicpmtimer
+		 Do APIC timer calibration using the pmtimer. Implies
+		 apicmaintimer. Useful when your PIT timer is totally
+		 broken.
+
+Timing
+
+  notsc
+  Don't use the CPU time stamp counter to read the wall time.
+  This can be used to work around timing problems on multiprocessor systems
+  with not properly synchronized CPUs.
+
+  nohpet
+  Don't use the HPET timer.
+
+Idle loop
+
+  idle=poll
+  Don't do power saving in the idle loop using HLT, but poll for rescheduling
+  event. This will make the CPUs eat a lot more power, but may be useful
+  to get slightly better performance in multiprocessor benchmarks. It also
+  makes some profiling using performance counters more accurate.
+  Please note that on systems with MONITOR/MWAIT support (like Intel EM64T
+  CPUs) this option has no performance advantage over the normal idle loop.
+  It may also interact badly with hyperthreading.
+
+Rebooting
+
+   reboot=b[ios] | t[riple] | k[bd] | a[cpi] | e[fi] [, [w]arm | [c]old]
+   bios	  Use the CPU reboot vector for warm reset
+   warm   Don't set the cold reboot flag
+   cold   Set the cold reboot flag
+   triple Force a triple fault (init)
+   kbd    Use the keyboard controller. cold reset (default)
+   acpi   Use the ACPI RESET_REG in the FADT. If ACPI is not configured or the
+          ACPI reset does not work, the reboot path attempts the reset using
+          the keyboard controller.
+   efi    Use efi reset_system runtime service. If EFI is not configured or the
+          EFI reset does not work, the reboot path attempts the reset using
+          the keyboard controller.
+
+   Using warm reset will be much faster especially on big memory
+   systems because the BIOS will not go through the memory check.
+   Disadvantage is that not all hardware will be completely reinitialized
+   on reboot so there may be boot problems on some systems.
+
+   reboot=force
+
+   Don't stop other CPUs on reboot. This can make reboot more reliable
+   in some cases.
+
+Non Executable Mappings
+
+  noexec=on|off
+
+  on      Enable(default)
+  off     Disable
+
+NUMA
+
+  numa=off	Only set up a single NUMA node spanning all memory.
+
+  numa=noacpi   Don't parse the SRAT table for NUMA setup
+
+  numa=fake=<size>[MG]
+		If given as a memory unit, fills all system RAM with nodes of
+		size interleaved over physical nodes.
+
+  numa=fake=<N>
+		If given as an integer, fills all system RAM with N fake nodes
+		interleaved over physical nodes.
+
+ACPI
+
+  acpi=off	Don't enable ACPI
+  acpi=ht	Use ACPI boot table parsing, but don't enable ACPI
+		interpreter
+  acpi=force	Force ACPI on (currently not needed)
+
+  acpi=strict   Disable out of spec ACPI workarounds.
+
+  acpi_sci={edge,level,high,low}  Set up ACPI SCI interrupt.
+
+  acpi=noirq	Don't route interrupts
+
+  acpi=nocmcff	Disable firmware first mode for corrected errors. This
+		disables parsing the HEST CMC error source to check if
+		firmware has set the FF flag. This may result in
+		duplicate corrected error reports.
+
+PCI
+
+  pci=off		Don't use PCI
+  pci=conf1		Use conf1 access.
+  pci=conf2		Use conf2 access.
+  pci=rom		Assign ROMs.
+  pci=assign-busses	Assign busses
+  pci=irqmask=MASK	Set PCI interrupt mask to MASK
+  pci=lastbus=NUMBER	Scan up to NUMBER busses, no matter what the mptable says.
+  pci=noacpi		Don't use ACPI to set up PCI interrupt routing.
+
+IOMMU (input/output memory management unit)
+
+ Currently four x86-64 PCI-DMA mapping implementations exist:
+
+   1. <arch/x86_64/kernel/pci-nommu.c>: use no hardware/software IOMMU at all
+      (e.g. because you have < 3 GB memory).
+      Kernel boot message: "PCI-DMA: Disabling IOMMU"
+
+   2. <arch/x86/kernel/amd_gart_64.c>: AMD GART based hardware IOMMU.
+      Kernel boot message: "PCI-DMA: using GART IOMMU"
+
+   3. <arch/x86_64/kernel/pci-swiotlb.c> : Software IOMMU implementation. Used
+      e.g. if there is no hardware IOMMU in the system and it is need because
+      you have >3GB memory or told the kernel to us it (iommu=soft))
+      Kernel boot message: "PCI-DMA: Using software bounce buffering
+      for IO (SWIOTLB)"
+
+   4. <arch/x86_64/pci-calgary.c> : IBM Calgary hardware IOMMU. Used in IBM
+      pSeries and xSeries servers. This hardware IOMMU supports DMA address
+      mapping with memory protection, etc.
+      Kernel boot message: "PCI-DMA: Using Calgary IOMMU"
+
+ iommu=[<size>][,noagp][,off][,force][,noforce][,leak[=<nr_of_leak_pages>]
+	[,memaper[=<order>]][,merge][,forcesac][,fullflush][,nomerge]
+	[,noaperture][,calgary]
+
+  General iommu options:
+    off                Don't initialize and use any kind of IOMMU.
+    noforce            Don't force hardware IOMMU usage when it is not needed.
+                       (default).
+    force              Force the use of the hardware IOMMU even when it is
+                       not actually needed (e.g. because < 3 GB memory).
+    soft               Use software bounce buffering (SWIOTLB) (default for
+                       Intel machines). This can be used to prevent the usage
+                       of an available hardware IOMMU.
+
+  iommu options only relevant to the AMD GART hardware IOMMU:
+    <size>             Set the size of the remapping area in bytes.
+    allowed            Overwrite iommu off workarounds for specific chipsets.
+    fullflush          Flush IOMMU on each allocation (default).
+    nofullflush        Don't use IOMMU fullflush.
+    leak               Turn on simple iommu leak tracing (only when
+                       CONFIG_IOMMU_LEAK is on). Default number of leak pages
+                       is 20.
+    memaper[=<order>]  Allocate an own aperture over RAM with size 32MB<<order.
+                       (default: order=1, i.e. 64MB)
+    merge              Do scatter-gather (SG) merging. Implies "force"
+                       (experimental).
+    nomerge            Don't do scatter-gather (SG) merging.
+    noaperture         Ask the IOMMU not to touch the aperture for AGP.
+    forcesac           Force single-address cycle (SAC) mode for masks <40bits
+                       (experimental).
+    noagp              Don't initialize the AGP driver and use full aperture.
+    allowdac           Allow double-address cycle (DAC) mode, i.e. DMA >4GB.
+                       DAC is used with 32-bit PCI to push a 64-bit address in
+                       two cycles. When off all DMA over >4GB is forced through
+                       an IOMMU or software bounce buffering.
+    nodac              Forbid DAC mode, i.e. DMA >4GB.
+    panic              Always panic when IOMMU overflows.
+    calgary            Use the Calgary IOMMU if it is available
+
+  iommu options only relevant to the software bounce buffering (SWIOTLB) IOMMU
+  implementation:
+    swiotlb=<pages>[,force]
+    <pages>            Prereserve that many 128K pages for the software IO
+                       bounce buffering.
+    force              Force all IO through the software TLB.
+
+  Settings for the IBM Calgary hardware IOMMU currently found in IBM
+  pSeries and xSeries machines:
+
+    calgary=[64k,128k,256k,512k,1M,2M,4M,8M]
+    calgary=[translate_empty_slots]
+    calgary=[disable=<PCI bus number>]
+    panic              Always panic when IOMMU overflows
+
+    64k,...,8M - Set the size of each PCI slot's translation table
+    when using the Calgary IOMMU. This is the size of the translation
+    table itself in main memory. The smallest table, 64k, covers an IO
+    space of 32MB; the largest, 8MB table, can cover an IO space of
+    4GB. Normally the kernel will make the right choice by itself.
+
+    translate_empty_slots - Enable translation even on slots that have
+    no devices attached to them, in case a device will be hotplugged
+    in the future.
+
+    disable=<PCI bus number> - Disable translation on a given PHB. For
+    example, the built-in graphics adapter resides on the first bridge
+    (PCI bus number 0); if translation (isolation) is enabled on this
+    bridge, X servers that access the hardware directly from user
+    space might stop working. Use this option if you have devices that
+    are accessed from userspace directly on some PCI host bridge.
+
+Debugging
+
+  kstack=N	Print N words from the kernel stack in oops dumps.
+
+Miscellaneous
+
+	nogbpages
+		Do not use GB pages for kernel direct mappings.
+	gbpages
+		Use GB pages for kernel direct mappings.
diff --git a/Documentation/x86/x86_64/cpu-hotplug-spec b/Documentation/x86/x86_64/cpu-hotplug-spec
new file mode 100644
index 000000000..3c23e0587
--- /dev/null
+++ b/Documentation/x86/x86_64/cpu-hotplug-spec
@@ -0,0 +1,21 @@
+Firmware support for CPU hotplug under Linux/x86-64
+---------------------------------------------------
+
+Linux/x86-64 supports CPU hotplug now. For various reasons Linux wants to
+know in advance of boot time the maximum number of CPUs that could be plugged
+into the system. ACPI 3.0 currently has no official way to supply
+this information from the firmware to the operating system.
+
+In ACPI each CPU needs an LAPIC object in the MADT table (5.2.11.5 in the
+ACPI 3.0 specification).  ACPI already has the concept of disabled LAPIC
+objects by setting the Enabled bit in the LAPIC object to zero.
+
+For CPU hotplug Linux/x86-64 expects now that any possible future hotpluggable
+CPU is already available in the MADT. If the CPU is not available yet
+it should have its LAPIC Enabled bit set to 0. Linux will use the number
+of disabled LAPICs to compute the maximum number of future CPUs.
+
+In the worst case the user can overwrite this choice using a command line
+option (additional_cpus=...), but it is recommended to supply the correct
+number (or a reasonable approximation of it, with erring towards more not less)
+in the MADT to avoid manual configuration.
diff --git a/Documentation/x86/x86_64/fake-numa-for-cpusets b/Documentation/x86/x86_64/fake-numa-for-cpusets
new file mode 100644
index 000000000..0f11d9bec
--- /dev/null
+++ b/Documentation/x86/x86_64/fake-numa-for-cpusets
@@ -0,0 +1,67 @@
+Using numa=fake and CPUSets for Resource Management
+Written by David Rientjes <rientjes@cs.washington.edu>
+
+This document describes how the numa=fake x86_64 command-line option can be used
+in conjunction with cpusets for coarse memory management.  Using this feature,
+you can create fake NUMA nodes that represent contiguous chunks of memory and
+assign them to cpusets and their attached tasks.  This is a way of limiting the
+amount of system memory that are available to a certain class of tasks.
+
+For more information on the features of cpusets, see
+Documentation/cgroups/cpusets.txt.
+There are a number of different configurations you can use for your needs.  For
+more information on the numa=fake command line option and its various ways of
+configuring fake nodes, see Documentation/x86/x86_64/boot-options.txt.
+
+For the purposes of this introduction, we'll assume a very primitive NUMA
+emulation setup of "numa=fake=4*512,".  This will split our system memory into
+four equal chunks of 512M each that we can now use to assign to cpusets.  As
+you become more familiar with using this combination for resource control,
+you'll determine a better setup to minimize the number of nodes you have to deal
+with.
+
+A machine may be split as follows with "numa=fake=4*512," as reported by dmesg:
+
+	Faking node 0 at 0000000000000000-0000000020000000 (512MB)
+	Faking node 1 at 0000000020000000-0000000040000000 (512MB)
+	Faking node 2 at 0000000040000000-0000000060000000 (512MB)
+	Faking node 3 at 0000000060000000-0000000080000000 (512MB)
+	...
+	On node 0 totalpages: 130975
+	On node 1 totalpages: 131072
+	On node 2 totalpages: 131072
+	On node 3 totalpages: 131072
+
+Now following the instructions for mounting the cpusets filesystem from
+Documentation/cgroups/cpusets.txt, you can assign fake nodes (i.e. contiguous memory
+address spaces) to individual cpusets:
+
+	[root@xroads /]# mkdir exampleset
+	[root@xroads /]# mount -t cpuset none exampleset
+	[root@xroads /]# mkdir exampleset/ddset
+	[root@xroads /]# cd exampleset/ddset
+	[root@xroads /exampleset/ddset]# echo 0-1 > cpus
+	[root@xroads /exampleset/ddset]# echo 0-1 > mems
+
+Now this cpuset, 'ddset', will only allowed access to fake nodes 0 and 1 for
+memory allocations (1G).
+
+You can now assign tasks to these cpusets to limit the memory resources
+available to them according to the fake nodes assigned as mems:
+
+	[root@xroads /exampleset/ddset]# echo $$ > tasks
+	[root@xroads /exampleset/ddset]# dd if=/dev/zero of=tmp bs=1024 count=1G
+	[1] 13425
+
+Notice the difference between the system memory usage as reported by
+/proc/meminfo between the restricted cpuset case above and the unrestricted
+case (i.e. running the same 'dd' command without assigning it to a fake NUMA
+cpuset):
+				Unrestricted	Restricted
+	MemTotal:		3091900 kB	3091900 kB
+	MemFree:		  42113 kB	1513236 kB
+
+This allows for coarse memory management for the tasks you assign to particular
+cpusets.  Since cpusets can form a hierarchy, you can create some pretty
+interesting combinations of use-cases for various classes of tasks for your
+memory management needs.
diff --git a/Documentation/x86/x86_64/kernel-stacks b/Documentation/x86/x86_64/kernel-stacks
new file mode 100644
index 000000000..e3c8a49d1
--- /dev/null
+++ b/Documentation/x86/x86_64/kernel-stacks
@@ -0,0 +1,101 @@
+Most of the text from Keith Owens, hacked by AK
+
+x86_64 page size (PAGE_SIZE) is 4K.
+
+Like all other architectures, x86_64 has a kernel stack for every
+active thread.  These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big.
+These stacks contain useful data as long as a thread is alive or a
+zombie. While the thread is in user space the kernel stack is empty
+except for the thread_info structure at the bottom.
+
+In addition to the per thread stacks, there are specialized stacks
+associated with each CPU.  These stacks are only used while the kernel
+is in control on that CPU; when a CPU returns to user space the
+specialized stacks contain no useful data.  The main CPU stacks are:
+
+* Interrupt stack.  IRQSTACKSIZE
+
+  Used for external hardware interrupts.  If this is the first external
+  hardware interrupt (i.e. not a nested hardware interrupt) then the
+  kernel switches from the current task to the interrupt stack.  Like
+  the split thread and interrupt stacks on i386, this gives more room
+  for kernel interrupt processing without having to increase the size
+  of every per thread stack.
+
+  The interrupt stack is also used when processing a softirq.
+
+Switching to the kernel interrupt stack is done by software based on a
+per CPU interrupt nest counter. This is needed because x86-64 "IST"
+hardware stacks cannot nest without races.
+
+x86_64 also has a feature which is not available on i386, the ability
+to automatically switch to a new stack for designated events such as
+double fault or NMI, which makes it easier to handle these unusual
+events on x86_64.  This feature is called the Interrupt Stack Table
+(IST).  There can be up to 7 IST entries per CPU. The IST code is an
+index into the Task State Segment (TSS). The IST entries in the TSS
+point to dedicated stacks; each stack can be a different size.
+
+An IST is selected by a non-zero value in the IST field of an
+interrupt-gate descriptor.  When an interrupt occurs and the hardware
+loads such a descriptor, the hardware automatically sets the new stack
+pointer based on the IST value, then invokes the interrupt handler.  If
+the interrupt came from user mode, then the interrupt handler prologue
+will switch back to the per-thread stack.  If software wants to allow
+nested IST interrupts then the handler must adjust the IST values on
+entry to and exit from the interrupt handler.  (This is occasionally
+done, e.g. for debug exceptions.)
+
+Events with different IST codes (i.e. with different stacks) can be
+nested.  For example, a debug interrupt can safely be interrupted by an
+NMI.  arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack
+pointers on entry to and exit from all IST events, in theory allowing
+IST events with the same code to be nested.  However in most cases, the
+stack size allocated to an IST assumes no nesting for the same code.
+If that assumption is ever broken then the stacks will become corrupt.
+
+The currently assigned IST stacks are :-
+
+* STACKFAULT_STACK.  EXCEPTION_STKSZ (PAGE_SIZE).
+
+  Used for interrupt 12 - Stack Fault Exception (#SS).
+
+  This allows the CPU to recover from invalid stack segments. Rarely
+  happens.
+
+* DOUBLEFAULT_STACK.  EXCEPTION_STKSZ (PAGE_SIZE).
+
+  Used for interrupt 8 - Double Fault Exception (#DF).
+
+  Invoked when handling one exception causes another exception. Happens
+  when the kernel is very confused (e.g. kernel stack pointer corrupt).
+  Using a separate stack allows the kernel to recover from it well enough
+  in many cases to still output an oops.
+
+* NMI_STACK.  EXCEPTION_STKSZ (PAGE_SIZE).
+
+  Used for non-maskable interrupts (NMI).
+
+  NMI can be delivered at any time, including when the kernel is in the
+  middle of switching stacks.  Using IST for NMI events avoids making
+  assumptions about the previous state of the kernel stack.
+
+* DEBUG_STACK.  DEBUG_STKSZ
+
+  Used for hardware debug interrupts (interrupt 1) and for software
+  debug interrupts (INT3).
+
+  When debugging a kernel, debug interrupts (both hardware and
+  software) can occur at any time.  Using IST for these interrupts
+  avoids making assumptions about the previous state of the kernel
+  stack.
+
+* MCE_STACK.  EXCEPTION_STKSZ (PAGE_SIZE).
+
+  Used for interrupt 18 - Machine Check Exception (#MC).
+
+  MCE can be delivered at any time, including when the kernel is in the
+  middle of switching stacks.  Using IST for MCE events avoids making
+  assumptions about the previous state of the kernel stack.
+
+For more details see the Intel IA32 or AMD AMD64 architecture manuals.
diff --git a/Documentation/x86/x86_64/machinecheck b/Documentation/x86/x86_64/machinecheck
new file mode 100644
index 000000000..b1fb30273
--- /dev/null
+++ b/Documentation/x86/x86_64/machinecheck
@@ -0,0 +1,83 @@
+
+Configurable sysfs parameters for the x86-64 machine check code.
+
+Machine checks report internal hardware error conditions detected
+by the CPU. Uncorrected errors typically cause a machine check
+(often with panic), corrected ones cause a machine check log entry.
+
+Machine checks are organized in banks (normally associated with
+a hardware subsystem) and subevents in a bank. The exact meaning
+of the banks and subevent is CPU specific.
+
+mcelog knows how to decode them.
+
+When you see the "Machine check errors logged" message in the system
+log then mcelog should run to collect and decode machine check entries
+from /dev/mcelog. Normally mcelog should be run regularly from a cronjob.
+
+Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN
+(N = CPU number)
+
+The directory contains some configurable entries:
+
+Entries:
+
+bankNctl
+(N bank number)
+	64bit Hex bitmask enabling/disabling specific subevents for bank N
+	When a bit in the bitmask is zero then the respective
+	subevent will not be reported.
+	By default all events are enabled.
+	Note that BIOS maintain another mask to disable specific events
+	per bank.  This is not visible here
+
+The following entries appear for each CPU, but they are truly shared
+between all CPUs.
+
+check_interval
+	How often to poll for corrected machine check errors, in seconds
+	(Note output is hexademical). Default 5 minutes.  When the poller
+	finds MCEs it triggers an exponential speedup (poll more often) on
+	the polling interval.  When the poller stops finding MCEs, it
+	triggers an exponential backoff (poll less often) on the polling
+	interval. The check_interval variable is both the initial and
+	maximum polling interval. 0 means no polling for corrected machine
+	check errors (but some corrected errors might be still reported
+	in other ways)
+
+tolerant
+	Tolerance level. When a machine check exception occurs for a non
+	corrected machine check the kernel can take different actions.
+	Since machine check exceptions can happen any time it is sometimes
+	risky for the kernel to kill a process because it defies
+	normal kernel locking rules. The tolerance level configures
+	how hard the kernel tries to recover even at some risk of
+	deadlock.  Higher tolerant values trade potentially better uptime
+	with the risk of a crash or even corruption (for tolerant >= 3).
+
+	0: always panic on uncorrected errors, log corrected errors
+	1: panic or SIGBUS on uncorrected errors, log corrected errors
+	2: SIGBUS or log uncorrected errors, log corrected errors
+	3: never panic or SIGBUS, log all errors (for testing only)
+
+	Default: 1
+
+	Note this only makes a difference if the CPU allows recovery
+	from a machine check exception. Current x86 CPUs generally do not.
+
+trigger
+	Program to run when a machine check event is detected.
+	This is an alternative to running mcelog regularly from cron
+	and allows to detect events faster.
+monarch_timeout
+	How long to wait for the other CPUs to machine check too on a
+	exception. 0 to disable waiting for other CPUs.
+	Unit: us
+
+TBD document entries for AMD threshold interrupt configuration
+
+For more details about the x86 machine check architecture
+see the Intel and AMD architecture manuals from their developer websites.
+
+For more details about the architecture see
+see http://one.firstfloor.org/~andi/mce.pdf
diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt
new file mode 100644
index 000000000..05712ac83
--- /dev/null
+++ b/Documentation/x86/x86_64/mm.txt
@@ -0,0 +1,42 @@
+
+<previous description obsolete, deleted>
+
+Virtual memory map with 4 level page tables:
+
+0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
+hole caused by [48:63] sign extension
+ffff800000000000 - ffff87ffffffffff (=43 bits) guard hole, reserved for hypervisor
+ffff880000000000 - ffffc7ffffffffff (=64 TB) direct mapping of all phys. memory
+ffffc80000000000 - ffffc8ffffffffff (=40 bits) hole
+ffffc90000000000 - ffffe8ffffffffff (=45 bits) vmalloc/ioremap space
+ffffe90000000000 - ffffe9ffffffffff (=40 bits) hole
+ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
+... unused hole ...
+ffffec0000000000 - fffffc0000000000 (=44 bits) kasan shadow memory (16TB)
+... unused hole ...
+ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
+... unused hole ...
+ffffffff80000000 - ffffffffa0000000 (=512 MB)  kernel text mapping, from phys 0
+ffffffffa0000000 - ffffffffff5fffff (=1525 MB) module mapping space
+ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
+ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
+
+The direct mapping covers all memory in the system up to the highest
+memory address (this means in some cases it can also include PCI memory
+holes).
+
+vmalloc space is lazily synchronized into the different PML4 pages of
+the processes using the page fault handler, with init_level4_pgt as
+reference.
+
+Current X86-64 implementations only support 40 bits of address space,
+but we support up to 46 bits. This expands into MBZ space in the page tables.
+
+->trampoline_pgd:
+
+We map EFI runtime services in the aforementioned PGD in the virtual
+range of 64Gb (arbitrarily set, can be raised if needed)
+
+0xffffffef00000000 - 0xffffffff00000000
+
+-Andi Kleen, Jul 2004
diff --git a/Documentation/x86/x86_64/uefi.txt b/Documentation/x86/x86_64/uefi.txt
new file mode 100644
index 000000000..a5e2b4fdb
--- /dev/null
+++ b/Documentation/x86/x86_64/uefi.txt
@@ -0,0 +1,42 @@
+General note on [U]EFI x86_64 support
+-------------------------------------
+
+The nomenclature EFI and UEFI are used interchangeably in this document.
+
+Although the tools below are _not_ needed for building the kernel,
+the needed bootloader support and associated tools for x86_64 platforms
+with EFI firmware and specifications are listed below.
+
+1. UEFI specification:  http://www.uefi.org
+
+2. Booting Linux kernel on UEFI x86_64 platform requires bootloader
+   support. Elilo with x86_64 support can be used.
+
+3. x86_64 platform with EFI/UEFI firmware.
+
+Mechanics:
+---------
+- Build the kernel with the following configuration.
+	CONFIG_FB_EFI=y
+	CONFIG_FRAMEBUFFER_CONSOLE=y
+  If EFI runtime services are expected, the following configuration should
+  be selected.
+	CONFIG_EFI=y
+	CONFIG_EFI_VARS=y or m		# optional
+- Create a VFAT partition on the disk
+- Copy the following to the VFAT partition:
+	elilo bootloader with x86_64 support, elilo configuration file,
+	kernel image built in first step and corresponding
+	initrd. Instructions on building elilo	and its dependencies
+	can be found in the elilo sourceforge project.
+- Boot to EFI shell and invoke elilo choosing the kernel image built
+  in first step.
+- If some or all EFI runtime services don't work, you can try following
+  kernel command line parameters to turn off some or all EFI runtime
+  services.
+	noefi		turn off all EFI runtime services
+	reboot_type=k	turn off EFI reboot runtime service
+- If the EFI memory map has additional entries not in the E820 map,
+  you can include those entries in the kernels memory map of available
+  physical RAM by using the following kernel command line parameter.
+	add_efi_memmap	include EFI memory map of available physical RAM
diff --git a/Documentation/x86/zero-page.txt b/Documentation/x86/zero-page.txt
new file mode 100644
index 000000000..82fbdbc1e
--- /dev/null
+++ b/Documentation/x86/zero-page.txt
@@ -0,0 +1,37 @@
+The additional fields in struct boot_params as a part of 32-bit boot
+protocol of kernel. These should be filled by bootloader or 16-bit
+real-mode setup code of the kernel. References/settings to it mainly
+are in:
+
+  arch/x86/include/uapi/asm/bootparam.h
+
+
+Offset	Proto	Name		Meaning
+/Size
+
+000/040	ALL	screen_info	Text mode or frame buffer information
+				(struct screen_info)
+040/014	ALL	apm_bios_info	APM BIOS information (struct apm_bios_info)
+058/008	ALL	tboot_addr      Physical address of tboot shared page
+060/010	ALL	ist_info	Intel SpeedStep (IST) BIOS support information
+				(struct ist_info)
+080/010	ALL	hd0_info	hd0 disk parameter, OBSOLETE!!
+090/010	ALL	hd1_info	hd1 disk parameter, OBSOLETE!!
+0A0/010	ALL	sys_desc_table	System description table (struct sys_desc_table)
+0B0/010	ALL	olpc_ofw_header	OLPC's OpenFirmware CIF and friends
+0C0/004	ALL	ext_ramdisk_image ramdisk_image high 32bits
+0C4/004	ALL	ext_ramdisk_size  ramdisk_size high 32bits
+0C8/004	ALL	ext_cmd_line_ptr  cmd_line_ptr high 32bits
+140/080	ALL	edid_info	Video mode setup (struct edid_info)
+1C0/020	ALL	efi_info	EFI 32 information (struct efi_info)
+1E0/004	ALL	alk_mem_k	Alternative mem check, in KB
+1E4/004	ALL	scratch		Scratch field for the kernel setup code
+1E8/001	ALL	e820_entries	Number of entries in e820_map (below)
+1E9/001	ALL	eddbuf_entries	Number of entries in eddbuf (below)
+1EA/001	ALL	edd_mbr_sig_buf_entries	Number of entries in edd_mbr_sig_buffer
+				(below)
+1EF/001	ALL	sentinel	Used to detect broken bootloaders
+290/040	ALL	edd_mbr_sig_buffer EDD MBR signatures
+2D0/A00	ALL	e820_map	E820 memory map table
+				(array of struct e820entry)
+D00/1EC	ALL	eddbuf		EDD data (array of struct edd_info)
author	André Fabian Silva Delgado <emulatorman@parabola.nu>	2015-08-05 17:04:01 -0300
committer	André Fabian Silva Delgado <emulatorman@parabola.nu>	2015-08-05 17:04:01 -0300
commit	57f0f512b273f60d52568b8c6b77e17f5636edc0 (patch)
tree	5e910f0e82173f4ef4f51111366a3f1299037a7b /Documentation/x86