12 files changed, 148 insertions, 60 deletions
diff --git a/tools/perf/Documentation/intel-pt.txt b/tools/perf/Documentation/intel-pt.txt
index c94c9de31..be764f9ec 100644
--- a/tools/perf/Documentation/intel-pt.txt
+++ b/tools/perf/Documentation/intel-pt.txt
@@ -671,6 +671,7 @@ The letters are:
 	e	synthesize tracing error events
 	d	create a debug log
 	g	synthesize a call chain (use with i or x)
+	l	synthesize last branch entries (use with i or x)
 
 "Instructions" events look like they were recorded by "perf record -e
 instructions".
@@ -707,12 +708,26 @@ on the sample is *not* adjusted and reflects the last known value of TSC.
 
 For Intel PT, the default period is 100us.
 
+Setting it to a zero period means "as often as possible".
+
+In the case of Intel PT that is the same as a period of 1 and a unit of
+'instructions' (i.e. --itrace=i1i).
+
 Also the call chain size (default 16, max. 1024) for instructions or
 transactions events can be specified. e.g.
 
 	--itrace=ig32
 	--itrace=xg32
 
+Also the number of last branch entries (default 64, max. 1024) for instructions or
+transactions events can be specified. e.g.
+
+       --itrace=il10
+       --itrace=xl10
+
+Note that last branch entries are cleared for each sample, so there is no overlap
+from one sample to the next.
+
 To disable trace decoding entirely, use the option --no-itrace.
 
 
@@ -749,3 +764,32 @@ perf inject also accepts the --itrace option in which case tracing data is
 removed and replaced with the synthesized events. e.g.
 
 	perf inject --itrace -i perf.data -o perf.data.new
+
+Below is an example of using Intel PT with autofdo.  It requires autofdo
+(https://github.com/google/autofdo) and gcc version 5.  The bubble
+sort example is from the AutoFDO tutorial (https://gcc.gnu.org/wiki/AutoFDO/Tutorial)
+amended to take the number of elements as a parameter.
+
+	$ gcc-5 -O3 sort.c -o sort_optimized
+	$ ./sort_optimized 30000
+	Bubble sorting array of 30000 elements
+	2254 ms
+
+	$ cat ~/.perfconfig
+	[intel-pt]
+		mispred-all
+
+	$ perf record -e intel_pt//u ./sort 3000
+	Bubble sorting array of 3000 elements
+	58 ms
+	[ perf record: Woken up 2 times to write data ]
+	[ perf record: Captured and wrote 3.939 MB perf.data ]
+	$ perf inject -i perf.data -o inj --itrace=i100usle --strip
+	$ ./create_gcov --binary=./sort --profile=inj --gcov=sort.gcov -gcov_version=1
+	$ gcc-5 -O3 -fauto-profile=sort.gcov sort.c -o sort_autofdo
+	$ ./sort_autofdo 30000
+	Bubble sorting array of 30000 elements
+	2155 ms
+
+Note there is currently no advantage to using Intel PT instead of LBR, but
+that may change in the future if greater use is made of the data.
diff --git a/tools/perf/Documentation/itrace.txt b/tools/perf/Documentation/itrace.txt
index 2ff946677..65453f4c7 100644
--- a/tools/perf/Documentation/itrace.txt
+++ b/tools/perf/Documentation/itrace.txt
@@ -6,6 +6,7 @@
 		e	synthesize error events
 		d	create a debug log
 		g	synthesize a call chain (use with i or x)
+		l	synthesize last branch entries (use with i or x)
 
 	The default is all events i.e. the same as --itrace=ibxe
 
@@ -20,3 +21,6 @@
 
 	Also the call chain size (default 16, max. 1024) for instructions or
 	transactions events can be specified.
+
+	Also the number of last branch entries (default 64, max. 1024) for
+	instructions or transactions events can be specified.
diff --git a/tools/perf/Documentation/perf-bench.txt b/tools/perf/Documentation/perf-bench.txt
index ab632d9fb..34750fc32 100644
--- a/tools/perf/Documentation/perf-bench.txt
+++ b/tools/perf/Documentation/perf-bench.txt
@@ -82,7 +82,7 @@ Be multi thread instead of multi process
 Specify number of groups
 
 -l::
---loop=::
+--nr_loops=::
 Specify number of loops
 
 Example of *messaging*
@@ -139,64 +139,48 @@ Suite for evaluating performance of simple memory copy in various ways.
 Options of *memcpy*
 ^^^^^^^^^^^^^^^^^^^
 -l::
---length::
-Specify length of memory to copy (default: 1MB).
+--size::
+Specify size of memory to copy (default: 1MB).
 Available units are B, KB, MB, GB and TB (case insensitive).
 
--r::
---routine::
-Specify routine to copy (default: default).
-Available routines are depend on the architecture.
+-f::
+--function::
+Specify function to copy (default: default).
+Available functions are depend on the architecture.
 On x86-64, x86-64-unrolled, x86-64-movsq and x86-64-movsb are supported.
 
--i::
---iterations::
+-l::
+--nr_loops::
 Repeat memcpy invocation this number of times.
 
 -c::
---cycle::
+--cycles::
 Use perf's cpu-cycles event instead of gettimeofday syscall.
 
--o::
---only-prefault::
-Show only the result with page faults before memcpy.
-
--n::
---no-prefault::
-Show only the result without page faults before memcpy.
-
 *memset*::
 Suite for evaluating performance of simple memory set in various ways.
 
 Options of *memset*
 ^^^^^^^^^^^^^^^^^^^
 -l::
---length::
-Specify length of memory to set (default: 1MB).
+--size::
+Specify size of memory to set (default: 1MB).
 Available units are B, KB, MB, GB and TB (case insensitive).
 
--r::
---routine::
-Specify routine to set (default: default).
-Available routines are depend on the architecture.
+-f::
+--function::
+Specify function to set (default: default).
+Available functions are depend on the architecture.
 On x86-64, x86-64-unrolled, x86-64-stosq and x86-64-stosb are supported.
 
--i::
---iterations::
+-l::
+--nr_loops::
 Repeat memset invocation this number of times.
 
 -c::
---cycle::
+--cycles::
 Use perf's cpu-cycles event instead of gettimeofday syscall.
 
--o::
---only-prefault::
-Show only the result with page faults before memset.
-
--n::
---no-prefault::
-Show only the result without page faults before memset.
-
 SUITES FOR 'numa'
 ~~~~~~~~~~~~~~~~~
 *mem*::
diff --git a/tools/perf/Documentation/perf-inject.txt b/tools/perf/Documentation/perf-inject.txt
index 0c721c3e3..0b1cedeef 100644
--- a/tools/perf/Documentation/perf-inject.txt
+++ b/tools/perf/Documentation/perf-inject.txt
@@ -50,6 +50,9 @@ OPTIONS
 
 include::itrace.txt[]
 
+--strip::
+	Use with --itrace to strip out non-synthesized events.
+
 SEE ALSO
 --------
 linkperf:perf-record[1], linkperf:perf-report[1], linkperf:perf-archive[1]
diff --git a/tools/perf/Documentation/perf-list.txt b/tools/perf/Documentation/perf-list.txt
index bada8933f..79483f40e 100644
--- a/tools/perf/Documentation/perf-list.txt
+++ b/tools/perf/Documentation/perf-list.txt
@@ -30,6 +30,7 @@ counted. The following modifiers exist:
  G - guest counting (in KVM guests)
  H - host counting (not in KVM guests)
  p - precise level
+ P - use maximum detected precise level
  S - read sample value (PERF_SAMPLE_READ)
  D - pin the event to the PMU
 
@@ -125,6 +126,8 @@ To limit the list use:
 . If none of the above is matched, it will apply the supplied glob to all
   events, printing the ones that match.
 
+. As a last resort, it will do a substring search in all event names.
+
 One or more types can be used at the same time, listing the events for the
 types specified.
 
diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index 2e9ce77b5..e630a7d2c 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -144,7 +144,7 @@ OPTIONS
 
 --call-graph::
 	Setup and enable call-graph (stack chain/backtrace) recording,
-	implies -g.
+	implies -g.  Default is "fp".
 
 	Allows specifying "fp" (frame pointer) or "dwarf"
 	(DWARF's CFI - Call Frame Information) or "lbr"
@@ -154,13 +154,18 @@ OPTIONS
 	In some systems, where binaries are build with gcc
 	--fomit-frame-pointer, using the "fp" method will produce bogus
 	call graphs, using "dwarf", if available (perf tools linked to
-	the libunwind library) should be used instead.
+	the libunwind or libdw library) should be used instead.
 	Using the "lbr" method doesn't require any compiler options. It
 	will produce call graphs from the hardware LBR registers. The
 	main limition is that it is only available on new Intel
 	platforms, such as Haswell. It can only get user call chain. It
 	doesn't work with branch stack sampling at the same time.
 
+	When "dwarf" recording is used, perf also records (user) stack dump
+	when sampled.  Default size of the stack dump is 8192 (bytes).
+	User can change the size by passing the size after comma like
+	"--call-graph dwarf,4096".
+
 -q::
 --quiet::
 	Don't print any message, useful for scripting.
@@ -236,6 +241,7 @@ following filters are defined:
         - any_call: any function call or system call
         - any_ret: any function return or system call return
         - ind_call: any indirect branch
+        - call: direct calls, including far (to/from kernel) calls
         - u:  only when the branch target is at the user level
         - k: only when the branch target is in the kernel
         - hv: only when the target is at the hypervisor level
@@ -308,6 +314,12 @@ This option sets the time out limit. The default value is 500 ms.
 Record context switch events i.e. events of type PERF_RECORD_SWITCH or
 PERF_RECORD_SWITCH_CPU_WIDE.
 
+--clang-path::
+Path to clang binary to use for compiling BPF scriptlets.
+
+--clang-opt::
+Options passed to clang when compiling BPF scriptlets.
+
 SEE ALSO
 --------
 linkperf:perf-stat[1], linkperf:perf-list[1]
diff --git a/tools/perf/Documentation/perf-report.txt b/tools/perf/Documentation/perf-report.txt
index 9c7981bfd..5ce8da1e1 100644
--- a/tools/perf/Documentation/perf-report.txt
+++ b/tools/perf/Documentation/perf-report.txt
@@ -29,7 +29,7 @@ OPTIONS
 --show-nr-samples::
 	Show the number of samples for each symbol
 
---showcpuutilization::
+--show-cpu-utilization::
         Show sample percentage for different cpu modes.
 
 -T::
@@ -68,7 +68,7 @@ OPTIONS
 --sort=::
 	Sort histogram entries by given key(s) - multiple keys can be specified
 	in CSV format.  Following sort keys are available:
-	pid, comm, dso, symbol, parent, cpu, srcline, weight, local_weight.
+	pid, comm, dso, symbol, parent, cpu, socket, srcline, weight, local_weight.
 
 	Each key has following meaning:
 
@@ -79,6 +79,7 @@ OPTIONS
 	- parent: name of function matched to the parent regex filter. Unmatched
 	entries are displayed as "[other]".
 	- cpu: cpu number the task ran at the time of sample
+	- socket: processor socket number the task ran at the time of sample
 	- srcline: filename and line number executed at the time of sample.  The
 	DWARF debugging info must be provided.
 	- srcfile: file name of the source file of the same. Requires dwarf
@@ -168,30 +169,40 @@ OPTIONS
 --dump-raw-trace::
         Dump raw trace in ASCII.
 
--g [type,min[,limit],order[,key][,branch]]::
---call-graph::
-        Display call chains using type, min percent threshold, optional print
-	limit and order.
-	type can be either:
+-g::
+--call-graph=<print_type,threshold[,print_limit],order,sort_key,branch>::
+        Display call chains using type, min percent threshold, print limit,
+	call order, sort key and branch.  Note that ordering of parameters is not
+	fixed so any parement can be given in an arbitraty order.  One exception
+	is the print_limit which should be preceded by threshold.
+
+	print_type can be either:
 	- flat: single column, linear exposure of call chains.
-	- graph: use a graph tree, displaying absolute overhead rates.
+	- graph: use a graph tree, displaying absolute overhead rates. (default)
 	- fractal: like graph, but displays relative rates. Each branch of
-		 the tree is considered as a new profiled object. +
+		 the tree is considered as a new profiled object.
+	- none: disable call chain display.
+
+	threshold is a percentage value which specifies a minimum percent to be
+	included in the output call graph.  Default is 0.5 (%).
+
+	print_limit is only applied when stdio interface is used.  It's to limit
+	number of call graph entries in a single hist entry.  Note that it needs
+	to be given after threshold (but not necessarily consecutive).
+	Default is 0 (unlimited).
 
 	order can be either:
 	- callee: callee based call graph.
 	- caller: inverted caller based call graph.
+	Default is 'caller' when --children is used, otherwise 'callee'.
 
-	key can be:
-	- function: compare on functions
+	sort_key can be:
+	- function: compare on functions (default)
 	- address: compare on individual code addresses
 
 	branch can be:
-	- branch: include last branch information in callgraph
-	when available. Usually more convenient to use --branch-history
-	for this.
-
-	Default: fractal,0.5,callee,function.
+	- branch: include last branch information in callgraph when available.
+	          Usually more convenient to use --branch-history for this.
 
 --children::
 	Accumulate callchain of children to parent entry so that then can
@@ -204,6 +215,8 @@ OPTIONS
 	beyond the specified depth will be ignored. This is a trade-off
 	between information loss and faster processing especially for
 	workloads that can have a very long callchain stack.
+	Note that when using the --itrace option the synthesized callchain size
+	will override this value if the synthesized callchain size is bigger.
 
 	Default: 127
 
@@ -349,6 +362,9 @@ include::itrace.txt[]
 	This option extends the perf report to show reference callgraphs,
 	which collected by reference event, in no callgraph event.
 
+--socket-filter::
+	Only report the samples on the processor socket that match with this filter
+
 include::callchain-overhead-calculation.txt[]
 
 SEE ALSO
diff --git a/tools/perf/Documentation/perf-script.txt b/tools/perf/Documentation/perf-script.txt
index dc3ec783b..382ddfb45 100644
--- a/tools/perf/Documentation/perf-script.txt
+++ b/tools/perf/Documentation/perf-script.txt
@@ -112,11 +112,11 @@ OPTIONS
 --debug-mode::
         Do various checks like samples ordering and lost events.
 
--f::
+-F::
 --fields::
         Comma separated list of fields to print. Options are:
         comm, tid, pid, time, cpu, event, trace, ip, sym, dso, addr, symoff,
-	srcline, period, iregs, flags.
+	srcline, period, iregs, brstack, brstacksym, flags.
         Field list can be prepended with the type, trace, sw or hw,
         to indicate to which event type the field list applies.
         e.g., -f sw:comm,tid,time,ip,sym  and -f trace:time,cpu,trace
@@ -175,6 +175,16 @@ OPTIONS
 	Finally, a user may not set fields to none for all event types.
 	i.e., -f "" is not allowed.
 
+	The brstack output includes branch related information with raw addresses using the
+	/v/v/v/v/ syntax in the following order:
+	FROM: branch source instruction
+	TO  : branch target instruction
+        M/P/-: M=branch target mispredicted or branch direction was mispredicted, P=target predicted or direction predicted, -=not supported
+	X/- : X=branch inside a transactional region, -=not in transaction region or not supported
+	A/- : A=TSX abort entry, -=not aborted region or not supported
+
+	The brstacksym is identical to brstack, except that the FROM and TO addresses are printed in a symbolic form if possible.
+
 -k::
 --vmlinux=<file>::
         vmlinux pathname
@@ -249,6 +259,9 @@ include::itrace.txt[]
 --full-source-path::
 	Show the full path for source files for srcline output.
 
+--ns::
+	Use 9 decimal places when displaying time (i.e. show the nanoseconds)
+
 SEE ALSO
 --------
 linkperf:perf-record[1], linkperf:perf-script-perl[1],
diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 47469abdc..4e074a660 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -128,8 +128,9 @@ perf stat --repeat 10 --null --sync --pre 'make -s O=defconfig-build/clean' -- m
 
 -I msecs::
 --interval-print msecs::
-	Print count deltas every N milliseconds (minimum: 100ms)
-	example: perf stat -I 1000 -e cycles -a sleep 5
+Print count deltas every N milliseconds (minimum: 10ms)
+The overhead percentage could be high in some cases, for instance with small, sub 100ms intervals.  Use with caution.
+	example: 'perf stat -I 1000 -e cycles -a sleep 5'
 
 --per-socket::
 Aggregate counts per processor socket for system-wide mode measurements.  This
diff --git a/tools/perf/Documentation/perf-top.txt b/tools/perf/Documentation/perf-top.txt
index f6a23eb29..556cec09b 100644
--- a/tools/perf/Documentation/perf-top.txt
+++ b/tools/perf/Documentation/perf-top.txt
@@ -160,9 +160,10 @@ Default is to monitor all CPUS.
 -g::
 	Enables call-graph (stack chain/backtrace) recording.
 
---call-graph::
+--call-graph [mode,type,min[,limit],order[,key][,branch]]::
 	Setup and enable call-graph (stack chain/backtrace) recording,
-	implies -g.
+	implies -g.  See `--call-graph` section in perf-record and
+	perf-report man pages for details.
 
 --children::
 	Accumulate callchain of children to parent entry so that then can
diff --git a/tools/perf/Documentation/perf-trace.txt b/tools/perf/Documentation/perf-trace.txt
index 7ea078658..13293de88 100644
--- a/tools/perf/Documentation/perf-trace.txt
+++ b/tools/perf/Documentation/perf-trace.txt
@@ -62,7 +62,6 @@ OPTIONS
 --verbose=::
         Verbosity level.
 
--i::
 --no-inherit::
 	Child tasks do not inherit counters.
 
diff --git a/tools/perf/Documentation/perf.txt b/tools/perf/Documentation/perf.txt
index 2b1317763..864e37597 100644
--- a/tools/perf/Documentation/perf.txt
+++ b/tools/perf/Documentation/perf.txt
@@ -27,6 +27,14 @@ OPTIONS
 	Setup buildid cache directory. It has higher priority than
 	buildid.dir config file option.
 
+-v::
+--version::
+  Display perf version.
+
+-h::
+--help::
+  Run perf help command.
+
 DESCRIPTION
 -----------
 Performance counters for Linux are a new kernel-based subsystem