diff options
Diffstat (limited to 'Documentation/x86')
-rw-r--r-- | Documentation/x86/early-microcode.txt | 28 | ||||
-rw-r--r-- | Documentation/x86/exception-tables.txt | 35 | ||||
-rw-r--r-- | Documentation/x86/intel_mpx.txt | 10 | ||||
-rw-r--r-- | Documentation/x86/pat.txt | 32 | ||||
-rw-r--r-- | Documentation/x86/protection-keys.txt | 27 | ||||
-rw-r--r-- | Documentation/x86/tlb.txt | 4 | ||||
-rw-r--r-- | Documentation/x86/topology.txt | 208 | ||||
-rw-r--r-- | Documentation/x86/x86_64/boot-options.txt | 2 | ||||
-rw-r--r-- | Documentation/x86/x86_64/fake-numa-for-cpusets | 4 | ||||
-rw-r--r-- | Documentation/x86/x86_64/machinecheck | 2 | ||||
-rw-r--r-- | Documentation/x86/x86_64/mm.txt | 20 |
11 files changed, 354 insertions, 18 deletions
diff --git a/Documentation/x86/early-microcode.txt b/Documentation/x86/early-microcode.txt index 41465f971..c2a71f527 100644 --- a/Documentation/x86/early-microcode.txt +++ b/Documentation/x86/early-microcode.txt @@ -40,3 +40,31 @@ cp ../microcode.bin /*(DEBLOBBED)*/ (or /*(DEBLOBBED)*/) find . | cpio -o -H newc >../ucode.cpio cd .. cat ucode.cpio /boot/initrd-3.5.0.img >/boot/initrd-3.5.0.ucode.img + +Builtin microcode +================= + +We can also load builtin microcode supplied through the regular firmware +builtin method CONFIG_FIRMWARE_IN_KERNEL. Only 64-bit is currently +supported. + +Here's an example: + +CONFIG_FIRMWARE_IN_KERNEL=y +CONFIG_EXTRA_FIRMWARE="/*(DEBLOBBED)*/ /*(DEBLOBBED)*/" +CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware" + +This basically means, you have the following tree structure locally: + +/lib/firmware/ +|-- amd-ucode +... +| |-- microcode_amd_fam15h.bin +... +|-- intel-ucode +... +| |-- 06-3a-09 +... + +so that the build system can find those files and integrate them into +the final kernel image. The early loader finds them and applies them. diff --git a/Documentation/x86/exception-tables.txt b/Documentation/x86/exception-tables.txt index 32901aa36..e396bcd8d 100644 --- a/Documentation/x86/exception-tables.txt +++ b/Documentation/x86/exception-tables.txt @@ -290,3 +290,38 @@ Due to the way that the exception table is built and needs to be ordered, only use exceptions for code in the .text section. Any other section will cause the exception table to not be sorted correctly, and the exceptions will fail. + +Things changed when 64-bit support was added to x86 Linux. Rather than +double the size of the exception table by expanding the two entries +from 32-bits to 64 bits, a clever trick was used to store addresses +as relative offsets from the table itself. The assembly code changed +from: + .long 1b,3b +to: + .long (from) - . + .long (to) - . + +and the C-code that uses these values converts back to absolute addresses +like this: + + ex_insn_addr(const struct exception_table_entry *x) + { + return (unsigned long)&x->insn + x->insn; + } + +In v4.6 the exception table entry was expanded with a new field "handler". +This is also 32-bits wide and contains a third relative function +pointer which points to one of: + +1) int ex_handler_default(const struct exception_table_entry *fixup) + This is legacy case that just jumps to the fixup code +2) int ex_handler_fault(const struct exception_table_entry *fixup) + This case provides the fault number of the trap that occurred at + entry->insn. It is used to distinguish page faults from machine + check. +3) int ex_handler_ext(const struct exception_table_entry *fixup) + This case is used for uaccess_err ... we need to set a flag + in the task structure. Before the handler functions existed this + case was handled by adding a large offset to the fixup to tag + it as special. +More functions can easily be added. diff --git a/Documentation/x86/intel_mpx.txt b/Documentation/x86/intel_mpx.txt index 818518a3f..85d0549ad 100644 --- a/Documentation/x86/intel_mpx.txt +++ b/Documentation/x86/intel_mpx.txt @@ -45,7 +45,7 @@ is how we expect the compiler, application and kernel to work together. MPX-instrumented. 3) The kernel detects that the CPU has MPX, allows the new prctl() to succeed, and notes the location of the bounds directory. Userspace is - expected to keep the bounds directory at that locationWe note it + expected to keep the bounds directory at that location. We note it instead of reading it each time because the 'xsave' operation needed to access the bounds directory register is an expensive operation. 4) If the application needs to spill bounds out of the 4 registers, it @@ -136,7 +136,7 @@ A: MPX-enabled application will possibly create a lot of bounds tables in If we were to preallocate them for the 128TB of user virtual address space, we would need to reserve 512TB+2GB, which is larger than the entire virtual address space today. This means they can not be reserved - ahead of time. Also, a single process's pre-popualated bounds directory + ahead of time. Also, a single process's pre-populated bounds directory consumes 2GB of virtual *AND* physical memory. IOW, it's completely infeasible to prepopulate bounds directories. @@ -151,7 +151,7 @@ A: This would work if we could hook the site of each and every memory these calls. Q: Could a bounds fault be handed to userspace and the tables allocated - there in a signal handler intead of in the kernel? + there in a signal handler instead of in the kernel? A: mmap() is not on the list of safe async handler functions and even if mmap() would work it still requires locking or nasty tricks to keep track of the allocation state there. @@ -167,7 +167,7 @@ If a #BR is generated due to a bounds violation caused by MPX. We need to decode MPX instructions to get violation address and set this address into extended struct siginfo. -The _sigfault feild of struct siginfo is extended as follow: +The _sigfault field of struct siginfo is extended as follow: 87 /* SIGILL, SIGFPE, SIGSEGV, SIGBUS */ 88 struct { @@ -240,5 +240,5 @@ them at the same bounds table. This is allowed architecturally. See more information "Intel(R) Architecture Instruction Set Extensions Programming Reference" (9.3.4). -However, if users did this, the kernel might be fooled in to unmaping an +However, if users did this, the kernel might be fooled in to unmapping an in-use bounds table since it does not recognize sharing. diff --git a/Documentation/x86/pat.txt b/Documentation/x86/pat.txt index 54944c71b..2a4ee6302 100644 --- a/Documentation/x86/pat.txt +++ b/Documentation/x86/pat.txt @@ -196,3 +196,35 @@ Another, more verbose way of getting PAT related debug messages is with "debugpat" boot parameter. With this parameter, various debug messages are printed to dmesg log. +PAT Initialization +------------------ + +The following table describes how PAT is initialized under various +configurations. The PAT MSR must be updated by Linux in order to support WC +and WT attributes. Otherwise, the PAT MSR has the value programmed in it +by the firmware. Note, Xen enables WC attribute in the PAT MSR for guests. + + MTRR PAT Call Sequence PAT State PAT MSR + ========================================================= + E E MTRR -> PAT init Enabled OS + E D MTRR -> PAT init Disabled - + D E MTRR -> PAT disable Disabled BIOS + D D MTRR -> PAT disable Disabled - + - np/E PAT -> PAT disable Disabled BIOS + - np/D PAT -> PAT disable Disabled - + E !P/E MTRR -> PAT init Disabled BIOS + D !P/E MTRR -> PAT disable Disabled BIOS + !M !P/E MTRR stub -> PAT disable Disabled BIOS + + Legend + ------------------------------------------------ + E Feature enabled in CPU + D Feature disabled/unsupported in CPU + np "nopat" boot option specified + !P CONFIG_X86_PAT option unset + !M CONFIG_MTRR option unset + Enabled PAT state set to enabled + Disabled PAT state set to disabled + OS PAT initializes PAT MSR with OS setting + BIOS PAT keeps PAT MSR with BIOS setting + diff --git a/Documentation/x86/protection-keys.txt b/Documentation/x86/protection-keys.txt new file mode 100644 index 000000000..c281ded1b --- /dev/null +++ b/Documentation/x86/protection-keys.txt @@ -0,0 +1,27 @@ +Memory Protection Keys for Userspace (PKU aka PKEYs) is a CPU feature +which will be found on future Intel CPUs. + +Memory Protection Keys provides a mechanism for enforcing page-based +protections, but without requiring modification of the page tables +when an application changes protection domains. It works by +dedicating 4 previously ignored bits in each page table entry to a +"protection key", giving 16 possible keys. + +There is also a new user-accessible register (PKRU) with two separate +bits (Access Disable and Write Disable) for each key. Being a CPU +register, PKRU is inherently thread-local, potentially giving each +thread a different set of protections from every other thread. + +There are two new instructions (RDPKRU/WRPKRU) for reading and writing +to the new register. The feature is only available in 64-bit mode, +even though there is theoretically space in the PAE PTEs. These +permissions are enforced on data access only and have no effect on +instruction fetches. + +=========================== Config Option =========================== + +This config option adds approximately 1.5kb of text. and 50 bytes of +data to the executable. A workload which does large O_DIRECT reads +of holes in XFS files was run to exercise get_user_pages_fast(). No +performance delta was observed with the config option +enabled or disabled. diff --git a/Documentation/x86/tlb.txt b/Documentation/x86/tlb.txt index 39d172326..6a0607b99 100644 --- a/Documentation/x86/tlb.txt +++ b/Documentation/x86/tlb.txt @@ -5,7 +5,7 @@ memory, it has two choices: from areas other than the one we are trying to flush will be destroyed and must be refilled later, at some cost. 2. Use the invlpg instruction to invalidate a single page at a - time. This could potentialy cost many more instructions, but + time. This could potentially cost many more instructions, but it is a much more precise operation, causing no collateral damage to other TLB entries. @@ -19,7 +19,7 @@ Which method to do depends on a few things: work. 3. The size of the TLB. The larger the TLB, the more collateral damage we do with a full flush. So, the larger the TLB, the - more attrative an individual flush looks. Data and + more attractive an individual flush looks. Data and instructions have separate TLBs, as do different page sizes. 4. The microarchitecture. The TLB has become a multi-level cache on modern CPUs, and the global flushes have become more diff --git a/Documentation/x86/topology.txt b/Documentation/x86/topology.txt new file mode 100644 index 000000000..06afac252 --- /dev/null +++ b/Documentation/x86/topology.txt @@ -0,0 +1,208 @@ +x86 Topology +============ + +This documents and clarifies the main aspects of x86 topology modelling and +representation in the kernel. Update/change when doing changes to the +respective code. + +The architecture-agnostic topology definitions are in +Documentation/cputopology.txt. This file holds x86-specific +differences/specialities which must not necessarily apply to the generic +definitions. Thus, the way to read up on Linux topology on x86 is to start +with the generic one and look at this one in parallel for the x86 specifics. + +Needless to say, code should use the generic functions - this file is *only* +here to *document* the inner workings of x86 topology. + +Started by Thomas Gleixner <tglx@linutronix.de> and Borislav Petkov <bp@alien8.de>. + +The main aim of the topology facilities is to present adequate interfaces to +code which needs to know/query/use the structure of the running system wrt +threads, cores, packages, etc. + +The kernel does not care about the concept of physical sockets because a +socket has no relevance to software. It's an electromechanical component. In +the past a socket always contained a single package (see below), but with the +advent of Multi Chip Modules (MCM) a socket can hold more than one package. So +there might be still references to sockets in the code, but they are of +historical nature and should be cleaned up. + +The topology of a system is described in the units of: + + - packages + - cores + - threads + +* Package: + + Packages contain a number of cores plus shared resources, e.g. DRAM + controller, shared caches etc. + + AMD nomenclature for package is 'Node'. + + Package-related topology information in the kernel: + + - cpuinfo_x86.x86_max_cores: + + The number of cores in a package. This information is retrieved via CPUID. + + - cpuinfo_x86.phys_proc_id: + + The physical ID of the package. This information is retrieved via CPUID + and deduced from the APIC IDs of the cores in the package. + + - cpuinfo_x86.logical_id: + + The logical ID of the package. As we do not trust BIOSes to enumerate the + packages in a consistent way, we introduced the concept of logical package + ID so we can sanely calculate the number of maximum possible packages in + the system and have the packages enumerated linearly. + + - topology_max_packages(): + + The maximum possible number of packages in the system. Helpful for per + package facilities to preallocate per package information. + + +* Cores: + + A core consists of 1 or more threads. It does not matter whether the threads + are SMT- or CMT-type threads. + + AMDs nomenclature for a CMT core is "Compute Unit". The kernel always uses + "core". + + Core-related topology information in the kernel: + + - smp_num_siblings: + + The number of threads in a core. The number of threads in a package can be + calculated by: + + threads_per_package = cpuinfo_x86.x86_max_cores * smp_num_siblings + + +* Threads: + + A thread is a single scheduling unit. It's the equivalent to a logical Linux + CPU. + + AMDs nomenclature for CMT threads is "Compute Unit Core". The kernel always + uses "thread". + + Thread-related topology information in the kernel: + + - topology_core_cpumask(): + + The cpumask contains all online threads in the package to which a thread + belongs. + + The number of online threads is also printed in /proc/cpuinfo "siblings." + + - topology_sibling_mask(): + + The cpumask contains all online threads in the core to which a thread + belongs. + + - topology_logical_package_id(): + + The logical package ID to which a thread belongs. + + - topology_physical_package_id(): + + The physical package ID to which a thread belongs. + + - topology_core_id(); + + The ID of the core to which a thread belongs. It is also printed in /proc/cpuinfo + "core_id." + + + +System topology examples + +Note: + +The alternative Linux CPU enumeration depends on how the BIOS enumerates the +threads. Many BIOSes enumerate all threads 0 first and then all threads 1. +That has the "advantage" that the logical Linux CPU numbers of threads 0 stay +the same whether threads are enabled or not. That's merely an implementation +detail and has no practical impact. + +1) Single Package, Single Core + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + +2) Single Package, Dual Core + + a) One thread per core + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + -> [core 1] -> [thread 0] -> Linux CPU 1 + + b) Two threads per core + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + -> [thread 1] -> Linux CPU 1 + -> [core 1] -> [thread 0] -> Linux CPU 2 + -> [thread 1] -> Linux CPU 3 + + Alternative enumeration: + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + -> [thread 1] -> Linux CPU 2 + -> [core 1] -> [thread 0] -> Linux CPU 1 + -> [thread 1] -> Linux CPU 3 + + AMD nomenclature for CMT systems: + + [node 0] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 0 + -> [Compute Unit Core 1] -> Linux CPU 1 + -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 2 + -> [Compute Unit Core 1] -> Linux CPU 3 + +4) Dual Package, Dual Core + + a) One thread per core + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + -> [core 1] -> [thread 0] -> Linux CPU 1 + + [package 1] -> [core 0] -> [thread 0] -> Linux CPU 2 + -> [core 1] -> [thread 0] -> Linux CPU 3 + + b) Two threads per core + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + -> [thread 1] -> Linux CPU 1 + -> [core 1] -> [thread 0] -> Linux CPU 2 + -> [thread 1] -> Linux CPU 3 + + [package 1] -> [core 0] -> [thread 0] -> Linux CPU 4 + -> [thread 1] -> Linux CPU 5 + -> [core 1] -> [thread 0] -> Linux CPU 6 + -> [thread 1] -> Linux CPU 7 + + Alternative enumeration: + + [package 0] -> [core 0] -> [thread 0] -> Linux CPU 0 + -> [thread 1] -> Linux CPU 4 + -> [core 1] -> [thread 0] -> Linux CPU 1 + -> [thread 1] -> Linux CPU 5 + + [package 1] -> [core 0] -> [thread 0] -> Linux CPU 2 + -> [thread 1] -> Linux CPU 6 + -> [core 1] -> [thread 0] -> Linux CPU 3 + -> [thread 1] -> Linux CPU 7 + + AMD nomenclature for CMT systems: + + [node 0] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 0 + -> [Compute Unit Core 1] -> Linux CPU 1 + -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 2 + -> [Compute Unit Core 1] -> Linux CPU 3 + + [node 1] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 4 + -> [Compute Unit Core 1] -> Linux CPU 5 + -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 6 + -> [Compute Unit Core 1] -> Linux CPU 7 diff --git a/Documentation/x86/x86_64/boot-options.txt b/Documentation/x86/x86_64/boot-options.txt index 68ed3114c..0965a71f9 100644 --- a/Documentation/x86/x86_64/boot-options.txt +++ b/Documentation/x86/x86_64/boot-options.txt @@ -60,6 +60,8 @@ Machine check threshold to 1. Enabling this may make memory predictive failure analysis less effective if the bios sets thresholds for memory errors since we will not see details for all errors. + mce=recovery + Force-enable recoverable machine check code paths nomce (for compatibility with i386): same as mce=off diff --git a/Documentation/x86/x86_64/fake-numa-for-cpusets b/Documentation/x86/x86_64/fake-numa-for-cpusets index 0f11d9bec..4b09f1883 100644 --- a/Documentation/x86/x86_64/fake-numa-for-cpusets +++ b/Documentation/x86/x86_64/fake-numa-for-cpusets @@ -8,7 +8,7 @@ assign them to cpusets and their attached tasks. This is a way of limiting the amount of system memory that are available to a certain class of tasks. For more information on the features of cpusets, see -Documentation/cgroups/cpusets.txt. +Documentation/cgroup-v1/cpusets.txt. There are a number of different configurations you can use for your needs. For more information on the numa=fake command line option and its various ways of configuring fake nodes, see Documentation/x86/x86_64/boot-options.txt. @@ -33,7 +33,7 @@ A machine may be split as follows with "numa=fake=4*512," as reported by dmesg: On node 3 totalpages: 131072 Now following the instructions for mounting the cpusets filesystem from -Documentation/cgroups/cpusets.txt, you can assign fake nodes (i.e. contiguous memory +Documentation/cgroup-v1/cpusets.txt, you can assign fake nodes (i.e. contiguous memory address spaces) to individual cpusets: [root@xroads /]# mkdir exampleset diff --git a/Documentation/x86/x86_64/machinecheck b/Documentation/x86/x86_64/machinecheck index b1fb30273..d0648a74f 100644 --- a/Documentation/x86/x86_64/machinecheck +++ b/Documentation/x86/x86_64/machinecheck @@ -36,7 +36,7 @@ between all CPUs. check_interval How often to poll for corrected machine check errors, in seconds - (Note output is hexademical). Default 5 minutes. When the poller + (Note output is hexadecimal). Default 5 minutes. When the poller finds MCEs it triggers an exponential speedup (poll more often) on the polling interval. When the poller stops finding MCEs, it triggers an exponential backoff (poll less often) on the polling diff --git a/Documentation/x86/x86_64/mm.txt b/Documentation/x86/x86_64/mm.txt index 05712ac83..8c7dd5957 100644 --- a/Documentation/x86/x86_64/mm.txt +++ b/Documentation/x86/x86_64/mm.txt @@ -16,8 +16,10 @@ ffffec0000000000 - fffffc0000000000 (=44 bits) kasan shadow memory (16TB) ... unused hole ... ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks ... unused hole ... +ffffffef00000000 - ffffffff00000000 (=64 GB) EFI region mapping space +... unused hole ... ffffffff80000000 - ffffffffa0000000 (=512 MB) kernel text mapping, from phys 0 -ffffffffa0000000 - ffffffffff5fffff (=1525 MB) module mapping space +ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole @@ -29,14 +31,16 @@ vmalloc space is lazily synchronized into the different PML4 pages of the processes using the page fault handler, with init_level4_pgt as reference. -Current X86-64 implementations only support 40 bits of address space, -but we support up to 46 bits. This expands into MBZ space in the page tables. - -->trampoline_pgd: +Current X86-64 implementations support up to 46 bits of address space (64 TB), +which is our current limit. This expands into MBZ space in the page tables. -We map EFI runtime services in the aforementioned PGD in the virtual -range of 64Gb (arbitrarily set, can be raised if needed) +We map EFI runtime services in the 'efi_pgd' PGD in a 64Gb large virtual +memory window (this size is arbitrary, it can be raised later if needed). +The mappings are not part of any other kernel PGD and are only available +during EFI runtime calls. -0xffffffef00000000 - 0xffffffff00000000 +Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct mapping of all +physical memory, vmalloc/ioremap space and virtual memory map are randomized. +Their order is preserved but their base will be offset early at boot time. -Andi Kleen, Jul 2004 |