summaryrefslogtreecommitdiff
path: root/arch/x86/kernel/cpu
AgeCommit message (Collapse)Author
2022-09-23x86/resctrl: Add resctrl_rmid_realloc_limit to abstract x86's boot_cpu_dataJames Morse
resctrl_rmid_realloc_threshold can be set by user-space. The maximum value is specified by the architecture. Currently max_threshold_occ_write() reads the maximum value from boot_cpu_data.x86_cache_size, which is not portable to another architecture. Add resctrl_rmid_realloc_limit to describe the maximum size in bytes that user-space can set the threshold to. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-21-james.morse@arm.com
2022-09-23x86/resctrl: Rename and change the units of resctrl_cqm_thresholdJames Morse
resctrl_cqm_threshold is stored in a hardware specific chunk size, but exposed to user-space as bytes. This means the filesystem parts of resctrl need to know how the hardware counts, to convert the user provided byte value to chunks. The interface between the architecture's resctrl code and the filesystem ought to treat everything as bytes. Change the unit of resctrl_cqm_threshold to bytes. resctrl_arch_rmid_read() still returns its value in chunks, so this needs converting to bytes. As all the users have been touched, rename the variable to resctrl_rmid_realloc_threshold, which describes what the value is for. Neither r->num_rmid nor hw_res->mon_scale are guaranteed to be a power of 2, so the existing code introduces a rounding error from resctrl's theoretical fraction of the cache usage. This behaviour is kept as it ensures the user visible value matches the value read from hardware when the rmid will be reallocated. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-20-james.morse@arm.com
2022-09-23x86/resctrl: Move get_corrected_mbm_count() into resctrl_arch_rmid_read()James Morse
resctrl_arch_rmid_read() is intended as the function that an architecture agnostic resctrl filesystem driver can use to read a value in bytes from a counter. Currently the function returns the MBM values in chunks directly from hardware. When reading a bandwidth counter, get_corrected_mbm_count() must be used to correct the value read. get_corrected_mbm_count() is architecture specific, this work should be done in resctrl_arch_rmid_read(). Move the function calls. This allows the resctrl filesystems's chunks value to be removed in favour of the architecture private version. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-19-james.morse@arm.com
2022-09-23x86/resctrl: Move mbm_overflow_count() into resctrl_arch_rmid_read()James Morse
resctrl_arch_rmid_read() is intended as the function that an architecture agnostic resctrl filesystem driver can use to read a value in bytes from a counter. Currently the function returns the MBM values in chunks directly from hardware. When reading a bandwidth counter, mbm_overflow_count() must be used to correct for any possible overflow. mbm_overflow_count() is architecture specific, its behaviour should be part of resctrl_arch_rmid_read(). Move the mbm_overflow_count() calls into resctrl_arch_rmid_read(). This allows the resctrl filesystems's prev_msr to be removed in favour of the architecture private version. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-18-james.morse@arm.com
2022-09-23x86/resctrl: Pass the required parameters into resctrl_arch_rmid_read()James Morse
resctrl_arch_rmid_read() is intended as the function that an architecture agnostic resctrl filesystem driver can use to read a value in bytes from a hardware register. Currently the function returns the MBM values in chunks directly from hardware. To convert this to bytes, some correction and overflow calculations are needed. These depend on the resource and domain structures. Overflow detection requires the old chunks value. None of this is available to resctrl_arch_rmid_read(). MPAM requires the resource and domain structures to find the MMIO device that holds the registers. Pass the resource and domain to resctrl_arch_rmid_read(). This makes rmid_dirty() too big. Instead merge it with its only caller, and the name is kept as a local variable. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-17-james.morse@arm.com
2022-09-23x86/resctrl: Abstract __rmid_read()James Morse
__rmid_read() selects the specified eventid and returns the counter value from the MSR. The error handling is architecture specific, and handled by the callers, rdtgroup_mondata_show() and __mon_event_count(). Error handling should be handled by architecture specific code, as a different architecture may have different requirements. MPAM's counters can report that they are 'not ready', requiring a second read after a short delay. This should be hidden from resctrl. Make __rmid_read() the architecture specific function for reading a counter. Rename it resctrl_arch_rmid_read() and move the error handling into it. A read from a counter that hardware supports but resctrl does not now returns -EINVAL instead of -EIO from the default case in __mon_event_count(). It isn't possible for user-space to see this change as resctrl doesn't expose counters it doesn't support. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-16-james.morse@arm.com
2022-09-23x86/microcode/AMD: Track patch allocation size explicitlyKees Cook
In preparation for reducing the use of ksize(), record the actual allocation size for later memcpy(). This avoids copying extra (uninitialized!) bytes into the patch buffer when the requested allocation size isn't exactly the size of a kmalloc bucket. Additionally, fix potential future issues where runtime bounds checking will notice that the buffer was allocated to a smaller value than returned by ksize(). Fixes: 757885e94a22 ("x86, microcode, amd: Early microcode patch loading support for AMD") Suggested-by: Daniel Micay <danielmicay@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/lkml/CA+DvKQ+bp7Y7gmaVhacjv9uF6Ar-o4tet872h4Q8RPYPJjcJQA@mail.gmail.com/
2022-09-23x86/resctrl: Allow per-rmid arch private storage to be resetJames Morse
To abstract the rmid counters into a helper that returns the number of bytes counted, architecture specific per-rmid state is needed. It needs to be possible to reset this hidden state, as the values may outlive the life of an rmid, or the mount time of the filesystem. mon_event_read() is called with first = true when an rmid is first allocated in mkdir_mondata_subdir(). Add resctrl_arch_reset_rmid() and call it from __mon_event_count()'s rr->first check. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-15-james.morse@arm.com
2022-09-22x86/resctrl: Add per-rmid arch private storage for overflow and chunksJames Morse
A renamed __rmid_read() is intended as the function that an architecture agnostic resctrl filesystem driver can use to read a value in bytes from a counter. Currently the function returns the MBM values in chunks directly from hardware. For bandwidth counters the resctrl filesystem uses this to calculate the number of bytes ever seen. MPAM's scaling of counters can be changed at runtime, reducing the resolution but increasing the range. When this is changed the prev_msr values need to be converted by the architecture code. Add an array for per-rmid private storage. The prev_msr and chunks values will move here to allow resctrl_arch_rmid_read() to always return the number of bytes read by this counter without assistance from the filesystem. The values are moved in later patches when the overflow and correction calls are moved into __rmid_read(). Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-14-james.morse@arm.com
2022-09-22x86/resctrl: Calculate bandwidth from the previous __mon_event_count() chunksJames Morse
mbm_bw_count() is only called by the mbm_handle_overflow() worker once a second. It reads the hardware register, calculates the bandwidth and updates m->prev_bw_msr which is used to hold the previous hardware register value. Operating directly on hardware register values makes it difficult to make this code architecture independent, so that it can be moved to /fs/, making the mba_sc feature something resctrl supports with no additional support from the architecture. Prior to calling mbm_bw_count(), mbm_update() reads from the same hardware register using __mon_event_count(). Change mbm_bw_count() to use the current chunks value most recently saved by __mon_event_count(). This removes an extra call to __rmid_read(). Instead of using m->prev_msr to calculate the number of chunks seen, use the rr->val that was updated by __mon_event_count(). This removes an extra call to mbm_overflow_count() and get_corrected_mbm_count(). Calculating bandwidth like this means mbm_bw_count() no longer operates on hardware register values directly. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-13-james.morse@arm.com
2022-09-22x86/resctrl: Allow update_mba_bw() to update controls directlyJames Morse
update_mba_bw() calculates a new control value for the MBA resource based on the user provided mbps_val and the current measured bandwidth. Some control values need remapping by delay_bw_map(). It does this by calling wrmsrl() directly. This needs splitting up to be done by an architecture specific helper, so that the remainder can eventually be moved to /fs/. Add resctrl_arch_update_one() to apply one configuration value to the provided resource and domain. This avoids the staging and cross-calling that is only needed with changes made by user-space. delay_bw_map() moves to be part of the arch code, to maintain the 'percentage control' view of MBA resources in resctrl. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-12-james.morse@arm.com
2022-09-22x86/resctrl: Remove architecture copy of mbps_valJames Morse
The resctrl arch code provides a second configuration array mbps_val[] for the MBA software controller. Since resctrl switched over to allocating and freeing its own array when needed, nothing uses the arch code version. Remove it. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-11-james.morse@arm.com
2022-09-22x86/resctrl: Switch over to the resctrl mbps_val listJames Morse
Updates to resctrl's software controller follow the same path as other configuration updates, but they don't modify the hardware state. rdtgroup_schemata_write() uses parse_line() and the resource's parse_ctrlval() function to stage the configuration. resctrl_arch_update_domains() then updates the mbps_val[] array instead, and resctrl_arch_update_domains() skips the rdt_ctrl_update() call that would update hardware. This complicates the interface between resctrl's filesystem parts and architecture specific code. It should be possible for mba_sc to be completely implemented by the filesystem parts of resctrl. This would allow it to work on a second architecture with no additional code. resctrl_arch_update_domains() using the mbps_val[] array prevents this. Change parse_bw() to write the configuration value directly to the mbps_val[] array in the domain structure. Change rdtgroup_schemata_write() to skip the call to resctrl_arch_update_domains(), meaning all the mba_sc specific code in resctrl_arch_update_domains() can be removed. On the read-side, show_doms() and update_mba_bw() are changed to read the mbps_val[] array from the domain structure. With this, resctrl_arch_get_config() no longer needs to consider mba_sc resources. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-10-james.morse@arm.com
2022-09-22x86/resctrl: Create mba_sc configuration in the rdt_domainJames Morse
To support resctrl's MBA software controller, the architecture must provide a second configuration array to hold the mbps_val[] from user-space. This complicates the interface between the architecture specific code and the filesystem portions of resctrl that will move to /fs/, to allow multiple architectures to support resctrl. Make the filesystem parts of resctrl create an array for the mba_sc values. The software controller can be changed to use this, allowing the architecture code to only consider the values configured in hardware. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-9-james.morse@arm.com
2022-09-22x86/resctrl: Abstract and use supports_mba_mbps()James Morse
To determine whether the mba_MBps option to resctrl should be supported, resctrl tests the boot CPUs' x86_vendor. This isn't portable, and needs abstracting behind a helper so this check can be part of the filesystem code that moves to /fs/. Re-use the tests set_mba_sc() does to determine if the mba_sc is supported on this system. An 'alloc_capable' test is added so that support for the controls isn't implied by the 'delay_linear' property, which is always true for MPAM. Because mbm_update() only update mba_sc if the mbm_local counters are enabled, supports_mba_mbps() checks is_mbm_local_enabled(). (instead of using is_mbm_enabled(), which checks both). Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-8-james.morse@arm.com
2022-09-22x86/resctrl: Remove set_mba_sc()s control array re-initialisationJames Morse
set_mba_sc() enables the 'software controller' to regulate the bandwidth based on the byte counters. This can be managed entirely in the parts of resctrl that move to /fs/, without any extra support from the architecture specific code. set_mba_sc() is called by rdt_enable_ctx() during mount and unmount. It currently resets the arch code's ctrl_val[] and mbps_val[] arrays. The ctrl_val[] was already reset when the domain was created, and by reset_all_ctrls() when the filesystem was last unmounted. Doing the work in set_mba_sc() is not necessary as the values are already at their defaults due to the creation of the domain, or were previously reset during umount(), or are about to reset during umount(). Add a reset of the mbps_val[] in reset_all_ctrls(), allowing the code in set_mba_sc() that reaches in to the architecture specific structures to be removed. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-7-james.morse@arm.com
2022-09-22x86/resctrl: Add domain offline callback for resctrl workJames Morse
Because domains are exposed to user-space via resctrl, the filesystem must update its state when CPU hotplug callbacks are triggered. Some of this work is common to any architecture that would support resctrl, but the work is tied up with the architecture code to free the memory. Move the monitor subdir removal and the cancelling of the mbm/limbo works into a new resctrl_offline_domain() call. These bits are not specific to the architecture. Grouping them in one function allows that code to be moved to /fs/ and re-used by another architecture. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-6-james.morse@arm.com
2022-09-22x86/resctrl: Group struct rdt_hw_domain cleanupJames Morse
domain_add_cpu() and domain_remove_cpu() need to kfree() the child arrays that were allocated by domain_setup_ctrlval(). As this memory is moved around, and new arrays are created, adjusting the error handling cleanup code becomes noisier. To simplify this, move all the kfree() calls into a domain_free() helper. This depends on struct rdt_hw_domain being kzalloc()d, allowing it to unconditionally kfree() all the child arrays. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-5-james.morse@arm.com
2022-09-22x86/resctrl: Add domain online callback for resctrl workJames Morse
Because domains are exposed to user-space via resctrl, the filesystem must update its state when CPU hotplug callbacks are triggered. Some of this work is common to any architecture that would support resctrl, but the work is tied up with the architecture code to allocate the memory. Move domain_setup_mon_state(), the monitor subdir creation call and the mbm/limbo workers into a new resctrl_online_domain() call. These bits are not specific to the architecture. Grouping them in one function allows that code to be moved to /fs/ and re-used by another architecture. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-4-james.morse@arm.com
2022-09-22x86/resctrl: Merge mon_capable and mon_enabledJames Morse
mon_enabled and mon_capable are always set as a pair by rdt_get_mon_l3_config(). There is no point having two values. Merge them together. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-3-james.morse@arm.com
2022-09-22x86/resctrl: Kill off alloc_enabledJames Morse
rdt_resources_all[] used to have extra entries for L2CODE/L2DATA. These were hidden from resctrl by the alloc_enabled value. Now that the L2/L2CODE/L2DATA resources have been merged together, alloc_enabled doesn't mean anything, it always has the same value as alloc_capable which indicates allocation is supported by this resource. Remove alloc_enabled and its helpers. Signed-off-by: James Morse <james.morse@arm.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jamie Iles <quic_jiles@quicinc.com> Reviewed-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Xin Hao <xhao@linux.alibaba.com> Tested-by: Shaopeng Tan <tan.shaopeng@fujitsu.com> Tested-by: Cristian Marussi <cristian.marussi@arm.com> Link: https://lore.kernel.org/r/20220902154829.30399-2-james.morse@arm.com
2022-09-08x86/sgx: Handle VA page allocation failure for EAUG on PF.Haitao Huang
VM_FAULT_NOPAGE is expected behaviour for -EBUSY failure path, when augmenting a page, as this means that the reclaimer thread has been triggered, and the intention is just to round-trip in ring-3, and retry with a new page fault. Fixes: 5a90d2c3f5ef ("x86/sgx: Support adding of pages to an initialized enclave") Signed-off-by: Haitao Huang <haitao.huang@linux.intel.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Tested-by: Vijay Dhanraj <vijay.dhanraj@intel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20220906000221.34286-3-jarkko@kernel.org
2022-09-08x86/sgx: Do not fail on incomplete sanitization on premature stop of ksgxdJarkko Sakkinen
Unsanitized pages trigger WARN_ON() unconditionally, which can panic the whole computer, if /proc/sys/kernel/panic_on_warn is set. In sgx_init(), if misc_register() fails or misc_register() succeeds but neither sgx_drv_init() nor sgx_vepc_init() succeeds, then ksgxd will be prematurely stopped. This may leave unsanitized pages, which will result a false warning. Refine __sgx_sanitize_pages() to return: 1. Zero when the sanitization process is complete or ksgxd has been requested to stop. 2. The number of unsanitized pages otherwise. Fixes: 51ab30eb2ad4 ("x86/sgx: Replace section->init_laundry_list with sgx_dirty_page_list") Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Reinette Chatre <reinette.chatre@intel.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/linux-sgx/20220825051827.246698-1-jarkko@kernel.org/T/#u Link: https://lkml.kernel.org/r/20220906000221.34286-2-jarkko@kernel.org
2022-09-02x86/microcode: Print previous version of microcode after reloadAshok Raj
Print both old and new versions of microcode after a reload is complete because knowing the previous microcode version is sometimes important from a debugging perspective. [ bp: Massage commit message. ] Signed-off-by: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20220829181030.722891-1-ashok.raj@intel.com
2022-09-01sgx: use ->f_mapping...Al Viro
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-31x86/resctrl: Fix to restore to original value when re-enabling hardware ↵Kohei Tarumizu
prefetch register The current pseudo_lock.c code overwrites the value of the MSR_MISC_FEATURE_CONTROL to 0 even if the original value is not 0. Therefore, modify it to save and restore the original values. Fixes: 018961ae5579 ("x86/intel_rdt: Pseudo-lock region creation/removal core") Fixes: 443810fe6160 ("x86/intel_rdt: Create debugfs files for pseudo-locking testing") Fixes: 8a2fc0e1bc0c ("x86/intel_rdt: More precise L2 hit/miss measurements") Signed-off-by: Kohei Tarumizu <tarumizu.kohei@fujitsu.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Reinette Chatre <reinette.chatre@intel.com> Link: https://lkml.kernel.org/r/eb660f3c2010b79a792c573c02d01e8e841206ad.1661358182.git.reinette.chatre@intel.com
2022-08-29x86/mce: Retrieve poison range from hardwareJane Chu
When memory poison consumption machine checks fire, MCE notifier handlers like nfit_handle_mce() record the impacted physical address range which is reported by the hardware in the MCi_MISC MSR. The error information includes data about blast radius, i.e. how many cachelines did the hardware determine are impacted. A recent change 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison granularity") updated nfit_handle_mce() to stop hard coding the blast radius value of 1 cacheline, and instead rely on the blast radius reported in 'struct mce' which can be up to 4K (64 cachelines). It turns out that apei_mce_report_mem_error() had a similar problem in that it hard coded a blast radius of 4K rather than reading the blast radius from the error information. Fix apei_mce_report_mem_error() to convey the proper poison granularity. Signed-off-by: Jane Chu <jane.chu@oracle.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/7ed50fd8-521e-cade-77b1-738b8bfb8502@oracle.com Link: https://lore.kernel.org/r/20220826233851.1319100-1-jane.chu@oracle.com
2022-08-27x86/cpufeatures: Add LbrExtV2 feature bitSandipan Das
CPUID leaf 0x80000022 i.e. ExtPerfMonAndDbg advertises some new performance monitoring features for AMD processors. Bit 1 of EAX indicates support for Last Branch Record Extension Version 2 (LbrExtV2) features. If found to be set during PMU initialization, the EBX bits of the same leaf can be used to determine the number of available LBR entries. For better utilization of feature words, LbrExtV2 is added as a scattered feature bit. [peterz: Rename to AMD_LBR_V2] Signed-off-by: Sandipan Das <sandipan.das@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/172d2b0df39306ed77221c45ee1aa62e8ae0548d.1660211399.git.sandipan.das@amd.com
2022-08-26x86/microcode: Remove ->request_microcode_user()Borislav Petkov
181b6f40e9ea ("x86/microcode: Rip out the OLD_INTERFACE") removed the old microcode loading interface but forgot to remove the related ->request_microcode_user() functionality which it uses. Rip it out now too. Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/r/20220825075445.28171-1-bp@alien8.de
2022-08-18x86/bugs: Add "unknown" reporting for MMIO Stale DataPawan Gupta
Older Intel CPUs that are not in the affected processor list for MMIO Stale Data vulnerabilities currently report "Not affected" in sysfs, which may not be correct. Vulnerability status for these older CPUs is unknown. Add known-not-affected CPUs to the whitelist. Report "unknown" mitigation status for CPUs that are not in blacklist, whitelist and also don't enumerate MSR ARCH_CAPABILITIES bits that reflect hardware immunity to MMIO Stale Data vulnerabilities. Mitigation is not deployed when the status is unknown. [ bp: Massage, fixup. ] Fixes: 8d50cdf8b834 ("x86/speculation/mmio: Add sysfs reporting for Processor MMIO Stale Data") Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Suggested-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/a932c154772f2121794a5f2eded1a11013114711.1657846269.git.pawan.kumar.gupta@linux.intel.com
2022-08-15x86/sgx: Improve comments for sgx_encl_lookup/alloc_backing()Kristen Carlson Accardi
Modify the comments for sgx_encl_lookup_backing() and for sgx_encl_alloc_backing() to indicate that they take a reference which must be dropped with a call to sgx_encl_put_backing(). Make sgx_encl_lookup_backing() static for now, and change the name of sgx_encl_get_backing() to __sgx_encl_get_backing() to make it more clear that sgx_encl_get_backing() is an internal function. Signed-off-by: Kristen Carlson Accardi <kristen.c.accardi@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lore.kernel.org/all/YtUs3MKLzFg+rqEV@zn.tnic/
2022-08-13Merge tag 'x86-urgent-2022-08-13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Ingo Molnar: "Fix the 'IBPB mitigated RETBleed' mode of operation on AMD CPUs (not turned on by default), which also need STIBP enabled (if available) to be '100% safe' on even the shortest speculation windows" * tag 'x86-urgent-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/bugs: Enable STIBP for IBPB mitigated RETBleed
2022-08-09Merge tag 'x86_bugs_pbrsb' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 eIBRS fixes from Borislav Petkov: "More from the CPU vulnerability nightmares front: Intel eIBRS machines do not sufficiently mitigate against RET mispredictions when doing a VM Exit therefore an additional RSB, one-entry stuffing is needed" * tag 'x86_bugs_pbrsb' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/speculation: Add LFENCE to RSB fill sequence x86/speculation: Add RSB VM Exit protections
2022-08-08x86/bugs: Enable STIBP for IBPB mitigated RETBleedKim Phillips
AMD's "Technical Guidance for Mitigating Branch Type Confusion, Rev. 1.0 2022-07-12" whitepaper, under section 6.1.2 "IBPB On Privileged Mode Entry / SMT Safety" says: Similar to the Jmp2Ret mitigation, if the code on the sibling thread cannot be trusted, software should set STIBP to 1 or disable SMT to ensure SMT safety when using this mitigation. So, like already being done for retbleed=unret, and now also for retbleed=ibpb, force STIBP on machines that have it, and report its SMT vulnerability status accordingly. [ bp: Remove the "we" and remove "[AMD]" applicability parameter which doesn't work here. ] Fixes: 3ebc17006888 ("x86/bugs: Add retbleed=ibpb") Signed-off-by: Kim Phillips <kim.phillips@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org # 5.10, 5.15, 5.19 Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 Link: https://lore.kernel.org/r/20220804192201.439596-1-kim.phillips@amd.com
2022-08-07Merge tag 'mm-nonmm-stable-2022-08-06-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc updates from Andrew Morton: "Updates to various subsystems which I help look after. lib, ocfs2, fatfs, autofs, squashfs, procfs, etc. A relatively small amount of material this time" * tag 'mm-nonmm-stable-2022-08-06-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (72 commits) scripts/gdb: ensure the absolute path is generated on initial source MAINTAINERS: kunit: add David Gow as a maintainer of KUnit mailmap: add linux.dev alias for Brendan Higgins mailmap: update Kirill's email profile: setup_profiling_timer() is moslty not implemented ocfs2: fix a typo in a comment ocfs2: use the bitmap API to simplify code ocfs2: remove some useless functions lib/mpi: fix typo 'the the' in comment proc: add some (hopefully) insightful comments bdi: remove enum wb_congested_state kernel/hung_task: fix address space of proc_dohung_task_timeout_secs lib/lzo/lzo1x_compress.c: replace ternary operator with min() and min_t() squashfs: support reading fragments in readahead call squashfs: implement readahead squashfs: always build "file direct" version of page actor Revert "squashfs: provide backing_dev_info in order to disable read-ahead" fs/ocfs2: Fix spelling typo in comment ia64: old_rr4 added under CONFIG_HUGETLB_PAGE proc: fix test for "vsyscall=xonly" boot option ...
2022-08-06Merge tag 'x86-urgent-2022-08-06' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar: - build fix for old(er) binutils - build fix for new GCC - kexec boot environment fix * tag 'x86-urgent-2022-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/entry: Build thunk_$(BITS) only if CONFIG_PREEMPTION=y x86/numa: Use cpumask_available instead of hardcoded NULL check x86/bus_lock: Don't assume the init value of DEBUGCTLMSR.BUS_LOCK_DETECT to be zero
2022-08-05Merge tag 'x86_sgx_for_v6.0-2022-08-03.1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 SGX updates from Dave Hansen: "A set of x86/sgx changes focused on implementing the "SGX2" features, plus a minor cleanup: - SGX2 ISA support which makes enclave memory management much more dynamic. For instance, enclaves can now change enclave page permissions on the fly. - Removal of an unused structure member" * tag 'x86_sgx_for_v6.0-2022-08-03.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits) x86/sgx: Drop 'page_index' from sgx_backing selftests/sgx: Page removal stress test selftests/sgx: Test reclaiming of untouched page selftests/sgx: Test invalid access to removed enclave page selftests/sgx: Test faulty enclave behavior selftests/sgx: Test complete changing of page type flow selftests/sgx: Introduce TCS initialization enclave operation selftests/sgx: Introduce dynamic entry point selftests/sgx: Test two different SGX2 EAUG flows selftests/sgx: Add test for TCS page permission changes selftests/sgx: Add test for EPCM permission changes Documentation/x86: Introduce enclave runtime management section x86/sgx: Free up EPC pages directly to support large page ranges x86/sgx: Support complete page removal x86/sgx: Support modifying SGX page type x86/sgx: Tighten accessible memory range after enclave initialization x86/sgx: Support adding of pages to an initialized enclave x86/sgx: Support restricting of enclave page permissions x86/sgx: Support VA page allocation without reclaiming x86/sgx: Export sgx_encl_page_alloc() ...
2022-08-04Merge tag 'pci-v5.20-changes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull pci updates from Bjorn Helgaas: "Enumeration: - Consolidate duplicated 'next function' scanning and extend to allow 'isolated functions' on s390, similar to existing hypervisors (Niklas Schnelle) Resource management: - Implement pci_iobar_pfn() for sparc, which allows us to remove the sparc-specific pci_mmap_page_range() and pci_mmap_resource_range(). This removes the ability to map the entire PCI I/O space using /proc/bus/pci, but we believe that's already been broken since v2.6.28 (Arnd Bergmann) - Move common PCI definitions to asm-generic/pci.h and rework others to be be more specific and more encapsulated in arches that need them (Stafford Horne) Power management: - Convert drivers to new *_PM_OPS macros to avoid need for '#ifdef CONFIG_PM_SLEEP' or '__maybe_unused' (Bjorn Helgaas) Virtualization: - Add ACS quirk for Broadcom BCM5750x multifunction NICs that isolate the functions but don't advertise an ACS capability (Pavan Chebbi) Error handling: - Clear PCI Status register during enumeration in case firmware left errors logged (Kai-Heng Feng) - When we have native control of AER, enable error reporting for all devices that support AER. Previously only a few drivers enabled this (Stefan Roese) - Keep AER error reporting enabled for switches. Previously we enabled this during enumeration but immediately disabled it (Stefan Roese) - Iterate over error counters instead of error strings to avoid printing junk in AER sysfs counters (Mohamed Khalfella) ASPM: - Remove pcie_aspm_pm_state_change() so ASPM config changes, e.g., via sysfs, are not lost across power state changes (Kai-Heng Feng) Endpoint framework: - Don't stop an EPC when unbinding an EPF from it (Shunsuke Mie) Endpoint embedded DMA controller driver: - Simplify and clean up support for the DesignWare embedded DMA (eDMA) controller (Frank Li, Serge Semin) Broadcom STB PCIe controller driver: - Avoid config space accesses when link is down because we can't recover from the CPU aborts these cause (Jim Quinlan) - Look for power regulators described under Root Ports in DT and enable them before scanning the secondary bus (Jim Quinlan) - Disable/enable regulators in suspend/resume (Jim Quinlan) Freescale i.MX6 PCIe controller driver: - Simplify and clean up clock and PHY management (Richard Zhu) - Disable/enable regulators in suspend/resume (Richard Zhu) - Set PCIE_DBI_RO_WR_EN before writing DBI registers (Richard Zhu) - Allow speeds faster than Gen2 (Richard Zhu) - Make link being down a non-fatal error so controller probe doesn't fail if there are no Endpoints connected (Richard Zhu) Loongson PCIe controller driver: - Add ACPI and MCFG support for Loongson LS7A (Huacai Chen) - Avoid config reads to non-existent LS2K/LS7A devices because a hardware defect causes machine hangs (Huacai Chen) - Work around LS7A integrated devices that report incorrect Interrupt Pin values (Jianmin Lv) Marvell Aardvark PCIe controller driver: - Add support for AER and Slot capability on emulated bridge (Pali Rohár) MediaTek PCIe controller driver: - Add Airoha EN7532 to DT binding (John Crispin) - Allow building of driver for ARCH_AIROHA (Felix Fietkau) MediaTek PCIe Gen3 controller driver: - Print decoded LTSSM state when the link doesn't come up (Jianjun Wang) NVIDIA Tegra194 PCIe controller driver: - Convert DT binding to json-schema (Vidya Sagar) - Add DT bindings and driver support for Tegra234 Root Port and Endpoint mode (Vidya Sagar) - Fix some Root Port interrupt handling issues (Vidya Sagar) - Set default Max Payload Size to 256 bytes (Vidya Sagar) - Fix Data Link Feature capability programming (Vidya Sagar) - Extend Endpoint mode support to devices beyond Controller-5 (Vidya Sagar) Qualcomm PCIe controller driver: - Rework clock, reset, PHY power-on ordering to avoid hangs and improve consistency (Robert Marko, Christian Marangi) - Move pipe_clk handling to PHY drivers (Dmitry Baryshkov) - Add IPQ60xx support (Selvam Sathappan Periakaruppan) - Allow ASPM L1 and substates for 2.7.0 (Krishna chaitanya chundru) - Add support for more than 32 MSI interrupts (Dmitry Baryshkov) Renesas R-Car PCIe controller driver: - Convert DT binding to json-schema (Herve Codina) - Add Renesas RZ/N1D (R9A06G032) to rcar-gen2 DT binding and driver (Herve Codina) Samsung Exynos PCIe controller driver: - Fix phy-exynos-pcie driver so it follows the 'phy_init() before phy_power_on()' PHY programming model (Marek Szyprowski) Synopsys DesignWare PCIe controller driver: - Simplify and clean up the DWC core extensively (Serge Semin) - Fix an issue with programming the ATU for regions that cross a 4GB boundary (Serge Semin) - Enable the CDM check if 'snps,enable-cdm-check' exists; previously we skipped it if 'num-lanes' was absent (Serge Semin) - Allocate a 32-bit DMA-able page to be MSI target instead of using a driver data structure that may not be addressable with 32-bit address (Will McVicker) - Add DWC core support for more than 32 MSI interrupts (Dmitry Baryshkov) Xilinx Versal CPM PCIe controller driver: - Add DT binding and driver support for Versal CPM5 Gen5 Root Port (Bharat Kumar Gogada)" * tag 'pci-v5.20-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (150 commits) PCI: imx6: Support more than Gen2 speed link mode PCI: imx6: Set PCIE_DBI_RO_WR_EN before writing DBI registers PCI: imx6: Reformat suspend callback to keep symmetric with resume PCI: imx6: Move the imx6_pcie_ltssm_disable() earlier PCI: imx6: Disable clocks in reverse order of enable PCI: imx6: Do not hide PHY driver callbacks and refine the error handling PCI: imx6: Reduce resume time by only starting link if it was up before suspend PCI: imx6: Mark the link down as non-fatal error PCI: imx6: Move regulator enable out of imx6_pcie_deassert_core_reset() PCI: imx6: Turn off regulator when system is in suspend mode PCI: imx6: Call host init function directly in resume PCI: imx6: Disable i.MX6QDL clock when disabling ref clocks PCI: imx6: Propagate .host_init() errors to caller PCI: imx6: Collect clock enables in imx6_pcie_clk_enable() PCI: imx6: Factor out ref clock disable to match enable PCI: imx6: Move imx6_pcie_clk_disable() earlier PCI: imx6: Move imx6_pcie_enable_ref_clk() earlier PCI: imx6: Move PHY management functions together PCI: imx6: Move imx6_pcie_grp_offset(), imx6_pcie_configure_type() earlier PCI: imx6: Convert to NOIRQ_SYSTEM_SLEEP_PM_OPS() ...
2022-08-04Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull kvm updates from Paolo Bonzini: "Quite a large pull request due to a selftest API overhaul and some patches that had come in too late for 5.19. ARM: - Unwinder implementations for both nVHE modes (classic and protected), complete with an overflow stack - Rework of the sysreg access from userspace, with a complete rewrite of the vgic-v3 view to allign with the rest of the infrastructure - Disagregation of the vcpu flags in separate sets to better track their use model. - A fix for the GICv2-on-v3 selftest - A small set of cosmetic fixes RISC-V: - Track ISA extensions used by Guest using bitmap - Added system instruction emulation framework - Added CSR emulation framework - Added gfp_custom flag in struct kvm_mmu_memory_cache - Added G-stage ioremap() and iounmap() functions - Added support for Svpbmt inside Guest s390: - add an interface to provide a hypervisor dump for secure guests - improve selftests to use TAP interface - enable interpretive execution of zPCI instructions (for PCI passthrough) - First part of deferred teardown - CPU Topology - PV attestation - Minor fixes x86: - Permit guests to ignore single-bit ECC errors - Intel IPI virtualization - Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS - PEBS virtualization - Simplify PMU emulation by just using PERF_TYPE_RAW events - More accurate event reinjection on SVM (avoid retrying instructions) - Allow getting/setting the state of the speaker port data bit - Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent - "Notify" VM exit (detect microarchitectural hangs) for Intel - Use try_cmpxchg64 instead of cmpxchg64 - Ignore benign host accesses to PMU MSRs when PMU is disabled - Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior - Allow NX huge page mitigation to be disabled on a per-vm basis - Port eager page splitting to shadow MMU as well - Enable CMCI capability by default and handle injected UCNA errors - Expose pid of vcpu threads in debugfs - x2AVIC support for AMD - cleanup PIO emulation - Fixes for LLDT/LTR emulation - Don't require refcounted "struct page" to create huge SPTEs - Miscellaneous cleanups: - MCE MSR emulation - Use separate namespaces for guest PTEs and shadow PTEs bitmasks - PIO emulation - Reorganize rmap API, mostly around rmap destruction - Do not workaround very old KVM bugs for L0 that runs with nesting enabled - new selftests API for CPUID Generic: - Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache - new selftests API using struct kvm_vcpu instead of a (vm, id) tuple" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (606 commits) selftests: kvm: set rax before vmcall selftests: KVM: Add exponent check for boolean stats selftests: KVM: Provide descriptive assertions in kvm_binary_stats_test selftests: KVM: Check stat name before other fields KVM: x86/mmu: remove unused variable RISC-V: KVM: Add support for Svpbmt inside Guest/VM RISC-V: KVM: Use PAGE_KERNEL_IO in kvm_riscv_gstage_ioremap() RISC-V: KVM: Add G-stage ioremap() and iounmap() functions KVM: Add gfp_custom flag in struct kvm_mmu_memory_cache RISC-V: KVM: Add extensible CSR emulation framework RISC-V: KVM: Add extensible system instruction emulation framework RISC-V: KVM: Factor-out instruction emulation into separate sources RISC-V: KVM: move preempt_disable() call in kvm_arch_vcpu_ioctl_run RISC-V: KVM: Make kvm_riscv_guest_timer_init a void function RISC-V: KVM: Fix variable spelling mistake RISC-V: KVM: Improve ISA extension by using a bitmap KVM, x86/mmu: Fix the comment around kvm_tdp_mmu_zap_leafs() KVM: SVM: Dump Virtual Machine Save Area (VMSA) to klog KVM: x86/mmu: Treat NX as a valid SPTE bit for NPT KVM: x86: Do not block APIC write for non ICR registers ...
2022-08-04x86/acrn: Set up timekeepingFei Li
ACRN Hypervisor reports timing information via CPUID leaf 0x40000010. Get the TSC and CPU frequency via CPUID leaf 0x40000010 and set the kernel values accordingly. Signed-off-by: Fei Li <fei1.li@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Conghui <conghui.chen@intel.com> Link: https://lore.kernel.org/r/20220804055903.365211-1-fei1.li@intel.com
2022-08-03x86/speculation: Add RSB VM Exit protectionsDaniel Sneddon
tl;dr: The Enhanced IBRS mitigation for Spectre v2 does not work as documented for RET instructions after VM exits. Mitigate it with a new one-entry RSB stuffing mechanism and a new LFENCE. == Background == Indirect Branch Restricted Speculation (IBRS) was designed to help mitigate Branch Target Injection and Speculative Store Bypass, i.e. Spectre, attacks. IBRS prevents software run in less privileged modes from affecting branch prediction in more privileged modes. IBRS requires the MSR to be written on every privilege level change. To overcome some of the performance issues of IBRS, Enhanced IBRS was introduced. eIBRS is an "always on" IBRS, in other words, just turn it on once instead of writing the MSR on every privilege level change. When eIBRS is enabled, more privileged modes should be protected from less privileged modes, including protecting VMMs from guests. == Problem == Here's a simplification of how guests are run on Linux' KVM: void run_kvm_guest(void) { // Prepare to run guest VMRESUME(); // Clean up after guest runs } The execution flow for that would look something like this to the processor: 1. Host-side: call run_kvm_guest() 2. Host-side: VMRESUME 3. Guest runs, does "CALL guest_function" 4. VM exit, host runs again 5. Host might make some "cleanup" function calls 6. Host-side: RET from run_kvm_guest() Now, when back on the host, there are a couple of possible scenarios of post-guest activity the host needs to do before executing host code: * on pre-eIBRS hardware (legacy IBRS, or nothing at all), the RSB is not touched and Linux has to do a 32-entry stuffing. * on eIBRS hardware, VM exit with IBRS enabled, or restoring the host IBRS=1 shortly after VM exit, has a documented side effect of flushing the RSB except in this PBRSB situation where the software needs to stuff the last RSB entry "by hand". IOW, with eIBRS supported, host RET instructions should no longer be influenced by guest behavior after the host retires a single CALL instruction. However, if the RET instructions are "unbalanced" with CALLs after a VM exit as is the RET in #6, it might speculatively use the address for the instruction after the CALL in #3 as an RSB prediction. This is a problem since the (untrusted) guest controls this address. Balanced CALL/RET instruction pairs such as in step #5 are not affected. == Solution == The PBRSB issue affects a wide variety of Intel processors which support eIBRS. But not all of them need mitigation. Today, X86_FEATURE_RSB_VMEXIT triggers an RSB filling sequence that mitigates PBRSB. Systems setting RSB_VMEXIT need no further mitigation - i.e., eIBRS systems which enable legacy IBRS explicitly. However, such systems (X86_FEATURE_IBRS_ENHANCED) do not set RSB_VMEXIT and most of them need a new mitigation. Therefore, introduce a new feature flag X86_FEATURE_RSB_VMEXIT_LITE which triggers a lighter-weight PBRSB mitigation versus RSB_VMEXIT. The lighter-weight mitigation performs a CALL instruction which is immediately followed by a speculative execution barrier (INT3). This steers speculative execution to the barrier -- just like a retpoline -- which ensures that speculation can never reach an unbalanced RET. Then, ensure this CALL is retired before continuing execution with an LFENCE. In other words, the window of exposure is opened at VM exit where RET behavior is troublesome. While the window is open, force RSB predictions sampling for RET targets to a dead end at the INT3. Close the window with the LFENCE. There is a subset of eIBRS systems which are not vulnerable to PBRSB. Add these systems to the cpu_vuln_whitelist[] as NO_EIBRS_PBRSB. Future systems that aren't vulnerable will set ARCH_CAP_PBRSB_NO. [ bp: Massage, incorporate review comments from Andy Cooper. ] Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com> Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Borislav Petkov <bp@suse.de>
2022-08-02Merge tag 'random-6.0-rc1-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/crng/random Pull random number generator updates from Jason Donenfeld: "Though there's been a decent amount of RNG-related development during this last cycle, not all of it is coming through this tree, as this cycle saw a shift toward tackling early boot time seeding issues, which took place in other trees as well. Here's a summary of the various patches: - The CONFIG_ARCH_RANDOM .config option and the "nordrand" boot option have been removed, as they overlapped with the more widely supported and more sensible options, CONFIG_RANDOM_TRUST_CPU and "random.trust_cpu". This change allowed simplifying a bit of arch code. - x86's RDRAND boot time test has been made a bit more robust, with RDRAND disabled if it's clearly producing bogus results. This would be a tip.git commit, technically, but I took it through random.git to avoid a large merge conflict. - The RNG has long since mixed in a timestamp very early in boot, on the premise that a computer that does the same things, but does so starting at different points in wall time, could be made to still produce a different RNG state. Unfortunately, the clock isn't set early in boot on all systems, so now we mix in that timestamp when the time is actually set. - User Mode Linux now uses the host OS's getrandom() syscall to generate a bootloader RNG seed and later on treats getrandom() as the platform's RDRAND-like faculty. - The arch_get_random_{seed_,}_long() family of functions is now arch_get_random_{seed_,}_longs(), which enables certain platforms, such as s390, to exploit considerable performance advantages from requesting multiple CPU random numbers at once, while at the same time compiling down to the same code as before on platforms like x86. - A small cleanup changing a cmpxchg() into a try_cmpxchg(), from Uros. - A comment spelling fix" More info about other random number changes that come in through various architecture trees in the full commentary in the pull request: https://lore.kernel.org/all/20220731232428.2219258-1-Jason@zx2c4.com/ * tag 'random-6.0-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: random: correct spelling of "overwrites" random: handle archrandom with multiple longs um: seed rng using host OS rng random: use try_cmpxchg in _credit_init_bits timekeeping: contribute wall clock to rng on time change x86/rdrand: Remove "nordrand" flag in favor of "random.trust_cpu" random: remove CONFIG_ARCH_RANDOM
2022-08-02x86/bus_lock: Don't assume the init value of DEBUGCTLMSR.BUS_LOCK_DETECT to ↵Chenyi Qiang
be zero It's possible that this kernel has been kexec'd from a kernel that enabled bus lock detection, or (hypothetically) BIOS/firmware has set DEBUGCTLMSR_BUS_LOCK_DETECT. Disable bus lock detection explicitly if not wanted. Fixes: ebb1064e7c2e ("x86/traps: Handle #DB for bus lock") Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20220802033206.21333-1-chenyi.qiang@intel.com
2022-08-01Merge tag 'x86_cpu_for_v6.0_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpu updates from Borislav Petkov: - Remove the vendor check when selecting MWAIT as the default idle state - Respect idle=nomwait when supplied on the kernel cmdline - Two small cleanups * tag 'x86_cpu_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/cpu: Use MSR_IA32_MISC_ENABLE constants x86: Fix comment for X86_FEATURE_ZEN x86: Remove vendor checks from prefer_mwait_c1_over_halt x86: Handle idle=nomwait cmdline properly for x86_idle
2022-08-01Merge tag 'x86_vmware_for_v6.0_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 vmware cleanup from Borislav Petkov: - A single statement simplification by using the BIT() macro * tag 'x86_vmware_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/vmware: Use BIT() macro for shifting
2022-08-01Merge tag 'ras_core_for_v6.0_rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RAS update from Borislav Petkov: "A single RAS change: - Probe whether hardware error injection (direct MSR writes) is possible when injecting errors on AMD platforms. In some cases, the platform could prohibit those" * tag 'ras_core_for_v6.0_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mce: Check whether writes to MCA_STATUS are getting ignored
2022-08-01Merge remote-tracking branch 'kvm/next' into kvm-next-5.20Paolo Bonzini
KVM/s390, KVM/x86 and common infrastructure changes for 5.20 x86: * Permit guests to ignore single-bit ECC errors * Fix races in gfn->pfn cache refresh; do not pin pages tracked by the cache * Intel IPI virtualization * Allow getting/setting pending triple fault with KVM_GET/SET_VCPU_EVENTS * PEBS virtualization * Simplify PMU emulation by just using PERF_TYPE_RAW events * More accurate event reinjection on SVM (avoid retrying instructions) * Allow getting/setting the state of the speaker port data bit * Refuse starting the kvm-intel module if VM-Entry/VM-Exit controls are inconsistent * "Notify" VM exit (detect microarchitectural hangs) for Intel * Cleanups for MCE MSR emulation s390: * add an interface to provide a hypervisor dump for secure guests * improve selftests to use TAP interface * enable interpretive execution of zPCI instructions (for PCI passthrough) * First part of deferred teardown * CPU Topology * PV attestation * Minor fixes Generic: * new selftests API using struct kvm_vcpu instead of a (vm, id) tuple x86: * Use try_cmpxchg64 instead of cmpxchg64 * Bugfixes * Ignore benign host accesses to PMU MSRs when PMU is disabled * Allow disabling KVM's "MONITOR/MWAIT are NOPs!" behavior * x86/MMU: Allow NX huge pages to be disabled on a per-vm basis * Port eager page splitting to shadow MMU as well * Enable CMCI capability by default and handle injected UCNA errors * Expose pid of vcpu threads in debugfs * x2AVIC support for AMD * cleanup PIO emulation * Fixes for LLDT/LTR emulation * Don't require refcounted "struct page" to create huge SPTEs x86 cleanups: * Use separate namespaces for guest PTEs and shadow PTEs bitmasks * PIO emulation * Reorganize rmap API, mostly around rmap destruction * Do not workaround very old KVM bugs for L0 that runs with nesting enabled * new selftests API for CPUID
2022-07-29x86/bugs: Do not enable IBPB at firmware entry when IBPB is not availableThadeu Lima de Souza Cascardo
Some cloud hypervisors do not provide IBPB on very recent CPU processors, including AMD processors affected by Retbleed. Using IBPB before firmware calls on such systems would cause a GPF at boot like the one below. Do not enable such calls when IBPB support is not present. EFI Variables Facility v0.08 2004-May-17 general protection fault, maybe for address 0x1: 0000 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 24 Comm: kworker/u2:1 Not tainted 5.19.0-rc8+ #7 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 Workqueue: efi_rts_wq efi_call_rts RIP: 0010:efi_call_rts Code: e8 37 33 58 ff 41 bf 48 00 00 00 49 89 c0 44 89 f9 48 83 c8 01 4c 89 c2 48 c1 ea 20 66 90 b9 49 00 00 00 b8 01 00 00 00 31 d2 <0f> 30 e8 7b 9f 5d ff e8 f6 f8 ff ff 4c 89 f1 4c 89 ea 4c 89 e6 48 RSP: 0018:ffffb373800d7e38 EFLAGS: 00010246 RAX: 0000000000000001 RBX: 0000000000000006 RCX: 0000000000000049 RDX: 0000000000000000 RSI: ffff94fbc19d8fe0 RDI: ffff94fbc1b2b300 RBP: ffffb373800d7e70 R08: 0000000000000000 R09: 0000000000000000 R10: 000000000000000b R11: 000000000000000b R12: ffffb3738001fd78 R13: ffff94fbc2fcfc00 R14: ffffb3738001fd80 R15: 0000000000000048 FS: 0000000000000000(0000) GS:ffff94fc3da00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff94fc30201000 CR3: 000000006f610000 CR4: 00000000000406f0 Call Trace: <TASK> ? __wake_up process_one_work worker_thread ? rescuer_thread kthread ? kthread_complete_and_exit ret_from_fork </TASK> Modules linked in: Fixes: 28a99e95f55c ("x86/amd: Use IBPB for firmware calls") Reported-by: Dimitri John Ledkov <dimitri.ledkov@canonical.com> Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220728122602.2500509-1-cascardo@canonical.com
2022-07-26x86/cyrix: include header linux/isa-dma.hRandy Dunlap
x86/kernel/cpu/cyrix.c now needs to include <linux/isa-dma.h> since the 'isa_dma_bridge_buggy' variable was moved to it. Fixes this build error: ../arch/x86/kernel/cpu/cyrix.c: In function ‘init_cyrix’: ../arch/x86/kernel/cpu/cyrix.c:277:17: error: ‘isa_dma_bridge_buggy’ undeclared (first use in this function) 277 | isa_dma_bridge_buggy = 2; Fixes: abb4970ac335 ("PCI: Move isa_dma_bridge_buggy out of asm/dma.h") Link: https://lore.kernel.org/r/20220725202224.29269-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Acked-by: Stafford Horne <shorne@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: x86@kernel.org
2022-07-20x86/bugs: Warn when "ibrs" mitigation is selected on Enhanced IBRS partsPawan Gupta
IBRS mitigation for spectre_v2 forces write to MSR_IA32_SPEC_CTRL at every kernel entry/exit. On Enhanced IBRS parts setting MSR_IA32_SPEC_CTRL[IBRS] only once at boot is sufficient. MSR writes at every kernel entry/exit incur unnecessary performance loss. When Enhanced IBRS feature is present, print a warning about this unnecessary performance loss. Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/2a5eaf54583c2bfe0edc4fea64006656256cca17.1657814857.git.pawan.kumar.gupta@linux.intel.com