Enable PMP for memory isolation #32
Conversation
jserv left a comment:
Use unified "flexpage" notation.
Got it! Thanks for the correction and the L4 X.2 reference.
jserv left a comment:
Rebase the latest 'main' branch to resolve rtsched issues.
0d55f21 to
865a5d6
Compare
Finished. I also removed the M-mode fault-handling commits, as they are not aligned with the upcoming work.
User mode tasks require kernel stack isolation to prevent malicious or corrupted user stack pointers from compromising kernel memory during interrupt handling. Without this protection, a user task could set its stack pointer to an invalid or controlled address, causing the ISR to write trap frames to arbitrary memory locations.

This commit implements stack isolation using the mscratch register as a discriminator between machine mode and user mode execution contexts. The ISR entry performs a blind swap with mscratch: for machine mode tasks (mscratch=0), the swap is immediately undone to restore the kernel stack pointer. For user mode tasks (mscratch=kernel_stack), the swap provides the kernel stack while preserving the user stack pointer in mscratch.

Each user mode task is allocated a dedicated 512-byte kernel stack to ensure complete isolation between tasks and prevent stack overflow attacks. The task control block is extended to track per-task kernel stack allocations. A global pointer references the current task's kernel stack and is updated during each context switch. The ISR loads this pointer to access the appropriate per-task kernel stack through mscratch, replacing the previous approach of using a single global kernel stack shared by all user mode tasks.

The interrupt frame structure is extended to include dedicated storage for the stack pointer. Task initialization zeroes the entire frame and correctly sets the initial stack pointer to support the new restoration path. For user mode tasks, the initial ISR frame is constructed on the kernel stack rather than the user stack, ensuring the frame is protected from user manipulation. Enumeration constants replace magic numbers for improved code clarity and consistency.

The ISR implementation now includes separate entry and restoration paths for each privilege mode. The M-mode path maintains mscratch=0 throughout execution. The U-mode path saves the user stack pointer from mscratch immediately after frame allocation and restores mscratch to the current task's kernel stack address before returning to user mode, enabling the next trap to use the correct per-task kernel stack.

Task initialization was updated to configure mscratch appropriately during the first dispatch. The dispatcher checks the current privilege level and sets mscratch to zero for machine mode tasks or to the per-task kernel stack base for user mode tasks. The main scheduler initialization ensures the first task's kernel stack pointer is set before entering the scheduling loop.

The user mode output system call was modified to bypass the asynchronous logger queue and implement task-level synchronization. Direct output ensures strict FIFO ordering for test output clarity, while preventing task preemption during character transmission avoids interleaving when multiple user tasks print concurrently. This ensures each string is output atomically with respect to other tasks.

A test helper function was added to support stack pointer manipulation during validation. Following the Linux kernel's context switching pattern, this provides precise control over stack operations without compiler interference. The validation harness uses this to verify syscall stability under corrupted stack pointer conditions.

Documentation updates include the calling convention guide's stack layout section, which now distinguishes between machine mode and user mode task stack organization with detailed diagrams of the dual-stack design. The context switching guide's task initialization section reflects the updated function signature for building initial interrupt frames with per-task kernel stack parameters.

Testing validates that system calls succeed even when invoked with a malicious stack pointer (0xDEADBEEF), confirming the ISR correctly uses the per-task kernel stack from mscratch rather than the user-controlled stack pointer.
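The blind-swap discriminator can be modeled in host-runnable C. This is a sketch only: `mscratch` is an ordinary variable standing in for the CSR, and `trap_entry_sp` is an illustrative name, not the kernel's; on the real target this logic is a `csrrw sp, mscratch, sp` at ISR entry.

```c
#include <stdint.h>

/* Model of the mscratch discriminator: 0 means an M-mode task is running
 * (already on a kernel stack); non-zero holds the current U-mode task's
 * kernel stack top. */
static uintptr_t mscratch;

/* Models the blind "csrrw sp, mscratch, sp" at trap entry; returns the
 * stack pointer the trap handler should run on. */
static uintptr_t trap_entry_sp(uintptr_t sp)
{
    uintptr_t swapped = mscratch; /* csrrw: sp <-> mscratch */
    mscratch = sp;

    if (swapped == 0) { /* M-mode task: immediately undo the swap */
        sp = mscratch;
        mscratch = 0;
        return sp; /* keep running on the kernel stack */
    }
    /* U-mode task: run on the per-task kernel stack; the user stack
     * pointer is preserved in mscratch for the trap frame. */
    return swapped;
}
```

Even a hostile user stack pointer (e.g. 0xDEADBEEF) only ends up parked in mscratch; the handler always runs on the kernel stack.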
Introduces RISC-V Physical Memory Protection (PMP) support for hardware-enforced memory isolation. TOR mode is adopted as the addressing scheme for its flexibility in supporting arbitrary address ranges without alignment requirements, simplifying region management for task stacks of varying sizes. Adds CSR definitions for PMP registers, permission encodings, and hardware constants. Provides structures for region configuration and state tracking, with priority-based management to handle the 16-region hardware limit. Includes error codes and functions for region configuration and access verification.
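A minimal sketch of the TOR matching rule (names are illustrative, not the kernel's API): in TOR mode, `pmpaddr[i]` holds a region's top address shifted right by 2, and the bottom boundary is the previous entry's top (or 0 for entry 0), which is what allows arbitrary, unaligned ranges.

```c
#include <stdbool.h>
#include <stdint.h>

/* TOR-mode range check per the RISC-V privileged spec:
 * region i matches addr when pmpaddr[i-1] <= addr>>2 < pmpaddr[i].
 * pmpaddr[] here is a plain array standing in for the CSRs. */
static bool tor_match(const uintptr_t *pmpaddr, int i, uintptr_t addr)
{
    uintptr_t lo = (i == 0) ? 0 : (pmpaddr[i - 1] << 2);
    uintptr_t hi = pmpaddr[i] << 2;
    return addr >= lo && addr < hi; /* bottom inclusive, top exclusive */
}
```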
Introduces three abstractions that build upon the PMP infrastructure for managing memory protection at different granularities. Flexpages represent contiguous physical memory regions with protection attributes, providing arbitrary base addresses and sizes without alignment constraints. Memory spaces implement the address space concept but use distinct terminology to avoid confusion with virtual address spaces, as this structure represents a task's memory protection domain in a physical-address-only system. They organize flexpages into task memory views and support sharing across multiple tasks without requiring an MMU. Memory pools define static regions for boot-time initialization of kernel memory protection. Field naming retains 'as_' prefix (e.g., as_id, as_next) to reflect the underlying address space concept, while documentation uses "memory space" terminology for clarity in physical-memory-only contexts. Structures are used to enable runtime iteration, simplify debugging, and maintain consistency with other subsystems. Macro helpers reduce initialization boilerplate while maintaining type safety. Memory protection APIs are exposed to test programs for validation. This follows the established pattern where kernel subsystem interfaces are made available for testing purposes.
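The relationship between the abstractions might look like the following sketch. The layout is illustrative: apart from the `as_` prefix mentioned in the text, every field name here is an assumption, not the actual Linmo definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Flexpage: a contiguous physical region with protection attributes;
 * TOR mode permits arbitrary base and size. */
typedef struct fpage {
    uintptr_t base;          /* arbitrary physical base address */
    size_t size;             /* arbitrary size, no alignment constraint */
    uint8_t perms;           /* R/W/X permission bits */
    uint8_t priority;        /* eviction priority; 0 = kernel, pinned */
    struct fpage *next;      /* next flexpage in the same memory space */
} fpage_t;

/* Memory space: a task's memory protection domain (physical-only),
 * shareable across tasks without an MMU. */
typedef struct memspace {
    uint32_t as_id;          /* 'as_' prefix kept per the commit message */
    fpage_t *fpages;         /* flexpages composing this task's view */
    struct memspace *as_next;/* linkage across memory spaces */
} memspace_t;
```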
Defines static memory pools for boot-time PMP initialization using linker symbols to identify kernel memory regions. Linker symbol declarations are updated to include text segment boundaries and match actual linker script definitions for stack regions. Five kernel memory pools protect text as read-execute, data and bss as read-write, heap and stack as read-write without execute to prevent code injection. Macro helpers reduce initialization boilerplate while maintaining debuggability through struct arrays. Priority-based management handles the 16-region hardware constraint.
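A sketch of what such a pool table could look like. Names, the permission encoding, and the addresses are all illustrative; on the real target the boundaries come from linker symbols (the dummy constants below stand in for them so the sketch runs on a host).

```c
#include <stdint.h>

/* Illustrative permission bits (not the kernel's encoding). */
#define MP_R 0x1
#define MP_W 0x2
#define MP_X 0x4

typedef struct {
    const char *name;   /* struct array keeps pools debuggable at runtime */
    uintptr_t start, end;
    uint8_t perms;
} mem_pool_t;

/* Macro helper cuts initialization boilerplate while keeping types. */
#define POOL(n, s, e, p) { .name = (n), .start = (s), .end = (e), .perms = (p) }

/* Five boot-time pools; addresses are placeholders for linker symbols. */
static const mem_pool_t kernel_pools[] = {
    POOL(".text", 0x80000000u, 0x80010000u, MP_R | MP_X), /* read-execute */
    POOL(".data", 0x80010000u, 0x80012000u, MP_R | MP_W),
    POOL(".bss",  0x80012000u, 0x80014000u, MP_R | MP_W),
    POOL("heap",  0x80014000u, 0x80020000u, MP_R | MP_W), /* no execute */
    POOL("stack", 0x80020000u, 0x80022000u, MP_R | MP_W), /* no execute */
};
```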
Extends TCB with a memory space pointer to enable per-task memory isolation. Each task can now reference its own memory protection domain through the flexpage mechanism.
Adds creation and destruction functions for flexpages, which are software abstractions representing contiguous physical memory regions with hardware-enforced protection attributes. These primitives will be used by higher-level memory space management to construct per-task memory views for PMP-based isolation. Function naming follows kernel conventions to reflect that these operations manage abstract memory protection objects rather than just memory allocation.
Adds functions to create and destroy memory spaces, which serve as containers for flexpages. A memory space can be dedicated to a single task or shared across multiple tasks, supporting both isolated and shared memory models.
Provides helper functions for runtime-indexed access to PMP control and status registers alongside existing compile-time CSR macros. RISC-V CSR instructions encode register addresses as immediate values in the instruction itself, making dynamic selection impossible through simple arithmetic. These helpers use switch-case dispatch to map runtime indices to specific CSR instructions while preserving type safety. This enables PMP register management code to iterate over regions without knowing exact register numbers at compile time. These helpers are designed for use by subsequent region management operations and are marked unused to allow incremental development without compiler warnings. The PMP implementation is now included in the build system to make these helpers and future PMP functionality available at link time.
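The switch-case dispatch pattern looks roughly like this. To keep the sketch runnable off-target, a shadow array stands in for the actual `csrw`; the commented-out inline asm shows the shape each case would take on real hardware. All names here are illustrative.

```c
#include <stdint.h>

static uintptr_t fake_pmpaddr[16]; /* host stand-in for pmpaddr0..15 */

#define WRITE_PMPADDR(n, v) (fake_pmpaddr[n] = (v))
/* On the real target each case is a distinct instruction, e.g.:
 *   case 0: asm volatile ("csrw pmpaddr0, %0" :: "r"(v)); break;
 * because the CSR number is an immediate baked into the opcode. */

static int pmpaddr_write(unsigned idx, uintptr_t v)
{
    switch (idx) {
    case 0: WRITE_PMPADDR(0, v); break;
    case 1: WRITE_PMPADDR(1, v); break;
    case 2: WRITE_PMPADDR(2, v); break;
    case 3: WRITE_PMPADDR(3, v); break;
    /* cases 4..15 follow the same shape (elided in this sketch) */
    default:
        return -1; /* beyond the 16 PMP entries (or elided here) */
    }
    return 0;
}
```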
Establishes a centralized PMP configuration state that maintains a shadow copy of hardware register state in memory. This design allows the kernel to track and coordinate PMP region usage without repeatedly reading from hardware CSRs. The global configuration serves as the single source of truth for all PMP management operations throughout the kernel. A public accessor function provides controlled access to this shared state.
Provides a complete set of functions for managing Physical Memory Protection regions in TOR mode, maintaining shadow configuration state synchronized with hardware CSRs. Hardware initialization clears all PMP regions by zeroing address and configuration registers, then initializes shadow state with default values for each region slot. This establishes clean hardware and software state for subsequent region configuration. Region configuration validates that the address range is valid and the region is not locked, then constructs configuration bytes with TOR addressing mode and permission bits. Both hardware CSRs and shadow state are updated atomically, with optional locking to prevent further modification. A helper function computes configuration register index and bit offset from region index, eliminating code duplication across multiple operations. Region disabling clears the configuration byte to remove protection while preserving other regions in the same configuration register. Region locking sets the lock bit to prevent modification until hardware reset. Region retrieval reads address range, permissions, priority, and lock status from shadow configuration. Access verification checks whether a memory operation falls within configured region boundaries by comparing address and size, then validates that region permissions match the requested operation type. Address register read helpers are marked unused as the shadow state design eliminates the need to read hardware registers during normal operation. They remain available for potential future use cases requiring hardware state verification.
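The index/offset helper and config-byte construction can be sketched as follows. The bit values follow the RISC-V privileged spec (on RV32, each `pmpcfg` CSR packs four 8-bit entry configs); the helper names are illustrative.

```c
#include <stdint.h>

/* pmpNcfg bit fields per the RISC-V privileged spec. */
#define PMP_R     0x01
#define PMP_W     0x02
#define PMP_X     0x04
#define PMP_A_TOR 0x08 /* A field = 01 (TOR addressing) */
#define PMP_L     0x80 /* lock bit: immutable until hardware reset */

/* Compute which pmpcfg register holds region's byte, and at what shift
 * (RV32: 4 entries per pmpcfg register). */
static void pmpcfg_slot(unsigned region, unsigned *cfg_idx, unsigned *shift)
{
    *cfg_idx = region / 4;
    *shift = (region % 4) * 8;
}

/* Patch one region's config byte into a shadow pmpcfg value without
 * disturbing the other three entries in the same register. */
static uint32_t pmpcfg_set(uint32_t cfg, unsigned region, uint8_t byte)
{
    unsigned idx, shift;
    pmpcfg_slot(region, &idx, &shift);
    (void)idx; /* caller picks the register; we only place the byte */
    return (cfg & ~(0xFFu << shift)) | ((uint32_t)byte << shift);
}
```

Disabling a region is then just `pmpcfg_set(cfg, region, 0)`, which clears its byte while preserving neighbors, matching the description above.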
Implements hardware driver functions that bridge software flexpages with PMP hardware regions, enabling dynamic loading and eviction of memory protection mappings. This establishes the foundation for on-demand memory protection where tasks can access more memory than the 16 hardware PMP entries allow. The driver provides three core operations. Loading translates flexpage attributes to PMP configuration and programs the hardware region. Eviction disables a hardware region and clears the mapping. Victim selection examines all loaded flexpages and identifies the one with highest priority value for eviction. Kernel regions with priority 0 are never selected, ensuring system stability during context switches. To maintain architectural independence, the architecture layer implements the hardware-specific operations while the kernel layer provides architecture-agnostic wrappers. This layering allows kernel code to remain portable while leveraging hardware-specific features. The wrapper pattern enables future support for other memory protection units without modifying higher-level kernel logic.
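The victim-selection policy can be sketched in a few lines (structure and names are illustrative): scan the loaded flexpages, skip empty slots and priority-0 kernel regions, and pick the highest priority value.

```c
#include <stdint.h>

typedef struct {
    int loaded;       /* currently occupies a hardware PMP slot? */
    uint8_t priority; /* higher value = better eviction candidate */
} fpage_slot_t;

/* Returns the index of the eviction victim, or -1 if nothing evictable
 * (kernel regions with priority 0 are never selected). */
static int select_victim(const fpage_slot_t *slots, int n)
{
    int victim = -1;
    uint8_t best = 0;
    for (int i = 0; i < n; i++) {
        if (!slots[i].loaded || slots[i].priority == 0)
            continue; /* empty slot or pinned kernel region */
        if (slots[i].priority > best) {
            best = slots[i].priority;
            victim = i;
        }
    }
    return victim;
}
```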
When a task accesses memory not currently loaded in a hardware region, the system raises an access fault. Rather than panicking, the fault handler attempts recovery by dynamically loading the required region, enabling tasks to access more memory than can fit simultaneously in the available hardware regions. The fault handler examines the faulting address from mtval CSR to locate the corresponding flexpage in the task's memory space. If all hardware regions are occupied, a victim selection algorithm identifies the flexpage with highest priority value for eviction, then reuses its hardware slot for the newly required flexpage. This establishes demand-paging semantics for memory protection where region mappings are loaded on first access. The fault recovery mechanism ensures tasks can utilize their full memory space regardless of hardware region constraints, with kernel regions protected from eviction to maintain system stability.
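The lookup step of that recovery path might look like this sketch (names illustrative): walk the faulting task's memory space for a flexpage containing the address from mtval; a miss means a genuine access violation rather than a recoverable demand-load.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct fpage {
    uintptr_t base;
    size_t size;
    int hw_slot; /* -1 when not loaded in a hardware region */
    struct fpage *next;
} fpage_t;

/* Locate the flexpage covering the faulting address (from mtval).
 * NULL means the task really has no mapping there, so the fault is a
 * true violation and cannot be recovered by loading a region. */
static fpage_t *find_fpage(fpage_t *list, uintptr_t mtval)
{
    for (fpage_t *f = list; f; f = f->next)
        if (mtval >= f->base && mtval < f->base + f->size)
            return f;
    return NULL;
}
```

On a hit, the handler loads the flexpage into a free hardware region, evicting a victim first if all 16 slots are occupied.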
/* PMP fault handled successfully, return current frame */
ret_sp = isr_sp;
goto trap_exit;
}
If a user task triggers a fault that pmp_handle_access_fault cannot resolve, the kernel panics.
Considering that the goal of this series is to provide isolation between user tasks and the kernel, is panicking the entire system the desired behavior here? It seems better to kill only the faulting task and let the rest of the system continue running.
My current implementation uses panic as a deliberate "fail-fast" strategy during development - it helps catch bugs immediately rather than silently killing tasks and potentially masking issues.
However, I agree this is not the correct final behavior. I will first complete validation that PMP correctly isolates the kernel from user tasks, then circle back to implement proper fault handling - terminating the faulting task instead of causing a kernel panic that crashes the system.
Hardware-enforced memory protection requires dynamic region management to support per-task isolation. Kernel regions must remain locked and accessible across all tasks, while user task regions need to be dynamically loaded and evicted during context switches. This implementation separates hardware lock control from region configuration. Region setup no longer automatically sets hardware lock bits based on shadow configuration flags. Instead, lock control is delegated to an explicit locking operation. This design allows kernel regions to be marked as protected in software state while remaining modifiable through the API, enabling dynamic region management without compromising protection semantics. During context switches, the previous task's dynamic regions are evicted from hardware slots, and the incoming task's regions are loaded into available slots. Kernel regions remain locked and preserved across all transitions, ensuring system stability and isolation boundaries.
Each task now receives a dedicated memory space during creation, with its stack registered as a flexpage. This establishes the protection metadata required for hardware enforcement. During context switches, the scheduler triggers memory protection reconfiguration for both preemptive and cooperative scheduling modes. The outgoing task's memory space is captured before scheduler state updates, ensuring both old and new memory spaces are available for the protection switching logic.
Configure memory protection for kernel text, data, BSS, heap, and stack regions during hardware initialization. Halt on setup failure. Also remove the temporary PMP validation hack that granted U-mode full access. With memory protection integrated into task management in previous development, this bypass is no longer necessary and must be removed to enforce actual isolation.
User mode system calls can trigger nested trap scenarios where an outer U-mode ecall handler invokes the syscall dispatcher, which attempts to yield control, triggering an inner M-mode ecall. The inner trap can overwrite state that the outer trap requires for context switching, corrupting task state during restoration. This commit introduces trap nesting depth tracking to maintain proper state isolation between trap levels. Only the outermost trap is responsible for establishing the ISR frame pointer that context switching uses. Inner traps preserve their own state and return control without affecting outer trap context. The trap handler ensures the saved program counter is correct before any operation that might trigger nested traps. This guarantees that context switches save valid return addresses regardless of nesting depth, preventing returns to trap-triggering instructions. Nested traps restore only their own execution context, while the outermost trap retains sole responsibility for context switch restoration. This ownership policy prevents inner traps from incorrectly restoring task contexts. When yielding from within trap context, the implementation invokes the scheduler directly rather than triggering additional trap nesting. A test application validates the fix by spawning both machine mode and user mode tasks that continuously yield and delay. This creates the mixed privilege mode scenario where nested traps occur.
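The ownership policy can be modeled in host-runnable C. This is a sketch; the counter and names are illustrative, not the kernel's actual implementation.

```c
#include <stdint.h>

static unsigned trap_depth;  /* current nesting level */
static uintptr_t isr_frame;  /* frame pointer owned by the outermost trap */

/* Only the outermost trap publishes the ISR frame pointer that the
 * context switcher uses; inner traps keep their frames local. */
static void trap_enter(uintptr_t frame)
{
    if (trap_depth++ == 0)
        isr_frame = frame;
    /* inner traps must not touch isr_frame */
}

static void trap_exit(void)
{
    --trap_depth; /* inner traps restore only their own context */
}
```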
Validate PMP-based memory isolation during task context switches. U-mode tasks verify stack integrity across multiple switches using magic values, ensuring PMP correctly protects each task's memory space. A secondary kernel protection test validates that U-mode cannot write to kernel memory, triggering PMP access faults as expected. The test is excluded from crash-detection CI but included in functional tests with expected exception criteria to verify both isolation correctness and fault handling.
This PR implements PMP (Physical Memory Protection) support for RISC-V to enable hardware-enforced memory isolation in Linmo, addressing #30.
Currently Phase 1 (infrastructure) is complete. This branch will continue development through the remaining phases. Phase 1 adds the foundational structures and declarations: the PMP hardware layer in arch/riscv with CSR definitions and region management structures, architecture-independent memory abstractions (flexpages, address spaces, memory pools), kernel memory pool declarations from linker symbols, and a TCB extension for address space linkage.
The actual PMP operations including region configuration, CSR manipulation, and context switching integration are not yet implemented.
TOR mode is used for its flexibility with arbitrary address ranges without alignment constraints, simplifying region management for task stacks of varying sizes. Priority-based eviction allows the system to manage competing demands when the 16 hardware regions are exhausted, ensuring critical kernel and stack regions remain protected while allowing temporary mappings to be reclaimed as needed.
Summary by cubic
Enables RISC-V PMP for hardware memory isolation (#30). Uses TOR mode with boot-time kernel protection, trap-time flexpage loading, per-task context switching, and U-mode kernel stack isolation via mscratch.
Written for commit ab59cdd.