
Conversation

@HeatCrab HeatCrab commented Oct 29, 2025

This PR implements PMP (Physical Memory Protection) support for RISC-V to enable hardware-enforced memory isolation in Linmo, addressing #30.

Phase 1 (infrastructure) is currently complete; this branch will continue development through the remaining phases. Phase 1 adds the foundational structures and declarations: a PMP hardware layer in arch/riscv with CSR definitions and region management structures, architecture-independent memory abstractions (flexpages, address spaces, memory pools), kernel memory pool declarations derived from linker symbols, and a TCB extension for address space linkage.

The actual PMP operations including region configuration, CSR manipulation, and context switching integration are not yet implemented.

TOR mode is used for its flexibility: it supports arbitrary address ranges without alignment constraints, which simplifies region management for task stacks of varying sizes. Priority-based eviction lets the system arbitrate competing demands once the 16 hardware regions are exhausted, keeping critical kernel and stack regions protected while allowing temporary mappings to be reclaimed as needed.


Summary by cubic

Enables RISC-V PMP for hardware memory isolation (#30). Uses TOR mode with boot-time kernel protection, trap-time flexpage loading, per-task context switching, and U-mode kernel stack isolation via mscratch.

  • New Features
  • PMP CSR definitions and runtime-indexed accessors; TOR-mode region set/disable/lock/read and access checks with shadow state.
    • Kernel memory pools from linker symbols: text RX; data/bss RW (no execute); heap/stack RW (no execute).
    • Flexpages and memory spaces with on-demand load/evict and victim selection; TCB linked to a memory space.
    • U-mode kernel stack isolation via mscratch (ISR frame includes SP); trap handler is nested-trap safe and performs load/evict on access faults; context switch swaps task regions.

Written for commit ab59cdd. Summary will update on new commits.

@jserv jserv left a comment

Use unified "flexpage" notation.

@HeatCrab HeatCrab force-pushed the pmp/memory-isolation branch from e264a35 to 4a62d5b on October 31, 2025 13:25
@HeatCrab
Collaborator Author

> Use unified "flexpage" notation.

Got it! Thanks for the correction and the L4 X.2 reference.
I've fixed all occurrences to use "flexpage" notation.

@HeatCrab HeatCrab force-pushed the pmp/memory-isolation branch 5 times, most recently from 109259d to f6c3912 on November 6, 2025 09:16
@HeatCrab HeatCrab force-pushed the pmp/memory-isolation branch 2 times, most recently from 2644558 to 1bb5fcf on November 16, 2025 13:18
jserv

This comment was marked as outdated.

@HeatCrab HeatCrab force-pushed the pmp/memory-isolation branch 6 times, most recently from 904e972 to ed800fc on November 21, 2025 12:38
@HeatCrab

This comment was marked as outdated.

@jserv jserv left a comment

Rebase the latest 'main' branch to resolve rtsched issues.

@HeatCrab HeatCrab force-pushed the pmp/memory-isolation branch 6 times, most recently from 0d55f21 to 865a5d6 on November 22, 2025 08:36
@HeatCrab
Collaborator Author

> Rebase the latest 'main' branch to resolve rtsched issues.

Done. I also removed the M-mode fault-handling commits, as they no longer align with the upcoming work.
Next, I plan to start U-mode support (#19) on a new branch, and then circle back to complete the PMP development and apply any adjustments that may be needed after the U-mode integration.

@HeatCrab HeatCrab force-pushed the pmp/memory-isolation branch from 865a5d6 to 7e3992e on December 11, 2025 08:51
@HeatCrab HeatCrab force-pushed the pmp/memory-isolation branch 5 times, most recently from 998ce20 to e720ec9 on January 3, 2026 08:39

User mode tasks require kernel stack isolation to prevent malicious or
corrupted user stack pointers from compromising kernel memory during
interrupt handling. Without this protection, a user task could set its
stack pointer to an invalid or controlled address, causing the ISR to
write trap frames to arbitrary memory locations.

This commit implements stack isolation using the mscratch register as a
discriminator between machine mode and user mode execution contexts. The
ISR entry performs a blind swap with mscratch: for machine mode tasks
(mscratch=0), the swap is immediately undone to restore the kernel stack
pointer. For user mode tasks (mscratch=kernel_stack), the swap provides
the kernel stack while preserving the user stack pointer in mscratch.
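
A minimal sketch of the entry discriminator, assuming compiler support for naked functions; names and structure are illustrative rather than the actual entry code:

```c
/* Blind swap at trap entry: mscratch is 0 for M-mode tasks and holds
 * the per-task kernel stack top for U-mode tasks. */
__attribute__((naked)) void isr_entry(void)
{
    asm volatile(
        "csrrw sp, mscratch, sp\n" /* swap sp <-> mscratch */
        "bnez  sp, 1f\n"           /* nonzero: U-mode, sp is now the kernel stack */
        "csrrw sp, mscratch, sp\n" /* zero: M-mode task, undo the swap */
        "1:\n"
        /* ... allocate the ISR frame on sp and save registers ... */
    );
}
```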

Each user mode task is allocated a dedicated 512-byte kernel stack to
ensure complete isolation between tasks and prevent stack overflow
attacks. The task control block is extended to track per-task kernel
stack allocations. A global pointer references the current task's kernel
stack and is updated during each context switch. The ISR loads this
pointer to access the appropriate per-task kernel stack through
mscratch, replacing the previous approach of using a single global
kernel stack shared by all user mode tasks.
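
A hedged sketch of the bookkeeping this describes; field and symbol names are assumptions:

```c
#include <stddef.h>
#include <stdint.h>

#define KSTACK_SIZE 512 /* dedicated kernel stack per U-mode task */

typedef struct tcb {
    /* ... existing scheduler fields ... */
    uint8_t *kstack; /* base of this task's kernel stack; NULL for M-mode */
} tcb_t;

/* Read by the ISR (through mscratch) to find the current task's kernel
 * stack; refreshed on every context switch. */
static uint8_t *kernel_stack_top;

static void update_kernel_stack(const tcb_t *next)
{
    kernel_stack_top = next->kstack ? next->kstack + KSTACK_SIZE : NULL;
}
```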

The interrupt frame structure is extended to include dedicated storage
for the stack pointer. Task initialization zeroes the entire frame and
correctly sets the initial stack pointer to support the new restoration
path. For user mode tasks, the initial ISR frame is constructed on the
kernel stack rather than the user stack, ensuring the frame is protected
from user manipulation. Enumeration constants replace magic number usage
for improved code clarity and consistency.
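
For illustration, the extended frame might look like this (an RV32 layout is assumed; the actual field ordering may differ):

```c
#include <stdint.h>

typedef struct isr_frame {
    uint32_t mepc;    /* saved program counter */
    uint32_t ra;
    uint32_t sp;      /* dedicated storage for the interrupted stack pointer */
    uint32_t gpr[29]; /* remaining general-purpose registers (x3..x31) */
} isr_frame_t;
```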

The ISR implementation now includes separate entry and restoration paths
for each privilege mode. The M-mode path maintains mscratch=0 throughout
execution. The U-mode path saves the user stack pointer from mscratch
immediately after frame allocation and restores mscratch to the current
task's kernel stack address before returning to user mode, enabling the
next trap to use the correct per-task kernel stack.

Task initialization was updated to configure mscratch appropriately
during the first dispatch. The dispatcher checks the current privilege
level and sets mscratch to zero for machine mode tasks or to the
per-task kernel stack base for user mode tasks. The main scheduler
initialization ensures the first task's kernel stack pointer is set
before entering the scheduling loop.
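
A sketch of the first-dispatch configuration, reusing the tcb_t shape sketched above (helper name assumed):

```c
#include <stdint.h>

/* Set the mscratch discriminator for the task about to run: 0 for
 * M-mode tasks, kernel stack top for U-mode tasks. */
static void setup_mscratch(const tcb_t *task)
{
    uintptr_t val = task->kstack
        ? (uintptr_t) (task->kstack + KSTACK_SIZE) : 0;
    asm volatile("csrw mscratch, %0" : : "r"(val));
}
```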

The user mode output system call was modified to bypass the asynchronous logger queue and add task-level synchronization. Direct output ensures strict FIFO ordering for test clarity, and disabling preemption during character transmission prevents interleaving when multiple user tasks print concurrently, so each string is output atomically with respect to other tasks.
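
Roughly, the direct-output path described above could look like this; the helper names are assumptions, not the actual API:

```c
void preempt_disable(void); /* assumed primitives */
void preempt_enable(void);
void uart_putc(char c);

static void sys_write_direct(const char *s)
{
    preempt_disable();   /* no task switch while transmitting */
    while (*s)
        uart_putc(*s++); /* bypassing the logger preserves FIFO order */
    preempt_enable();
}
```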

A test helper function was added to support stack pointer manipulation
during validation. Following the Linux kernel's context switching
pattern, this provides precise control over stack operations without
compiler interference. The validation harness uses this to verify
syscall stability under corrupted stack pointer conditions.

Documentation updates include the calling convention guide's stack layout
section, which now distinguishes between machine mode and user mode task
stack organization with detailed diagrams of the dual-stack design. The
context switching guide's task initialization section reflects the
updated function signature for building initial interrupt frames with
per-task kernel stack parameters.

Testing validates that system calls succeed even when invoked with a
malicious stack pointer (0xDEADBEEF), confirming the ISR correctly uses
the per-task kernel stack from mscratch rather than the user-controlled
stack pointer.
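
A sketch of how such a helper might issue a syscall under a corrupted stack pointer; this illustrates the technique, not the PR's actual helper:

```c
#include <stdint.h>

/* Entirely in asm so the compiler never spills to the bogus stack. */
static inline void ecall_with_bogus_sp(uintptr_t bogus)
{
    asm volatile(
        "mv   t0, sp\n" /* save the real stack pointer */
        "mv   sp, %0\n" /* install e.g. 0xDEADBEEF */
        "ecall\n"       /* trap: the ISR must switch to the kernel stack */
        "mv   sp, t0\n" /* restore the real stack pointer */
        : : "r"(bogus) : "t0", "memory");
}
```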
Introduces RISC-V Physical Memory Protection (PMP) support for
hardware-enforced memory isolation.

TOR mode is adopted as the addressing scheme for its flexibility in
supporting arbitrary address ranges without alignment requirements,
simplifying region management for task stacks of varying sizes.

Adds CSR definitions for PMP registers, permission encodings, and
hardware constants. Provides structures for region configuration and
state tracking, with priority-based management to handle the 16-region
hardware limit. Includes error codes and functions for region
configuration and access verification.
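
For orientation, the TOR-mode encodings involved look roughly like this; the macro names are illustrative, but the bit values follow the RISC-V privileged specification:

```c
#define PMP_R 0x01     /* readable */
#define PMP_W 0x02     /* writable */
#define PMP_X 0x04     /* executable */
#define PMP_A_TOR 0x08 /* address matching: top-of-range */
#define PMP_L 0x80     /* locked until hardware reset */

#define PMP_MAX_REGIONS 16 /* hardware limit, managed by priority */

/* In TOR mode, region i covers [pmpaddr(i-1), pmpaddr(i)), with region
 * 0 starting at address 0; pmpaddr registers hold address >> 2. */
```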
Introduces three abstractions that build upon the PMP infrastructure
for managing memory protection at different granularities.

Flexpages represent contiguous physical memory regions with
protection attributes, providing arbitrary base addresses and sizes
without alignment constraints. Memory spaces implement the address
space concept but use distinct terminology to avoid confusion with
virtual address spaces, as this structure represents a task's memory
protection domain in a physical-address-only system. They organize
flexpages into task memory views and support sharing across multiple
tasks without requiring an MMU. Memory pools define static regions for
boot-time initialization of kernel memory protection.

Field naming retains 'as_' prefix (e.g., as_id, as_next) to reflect
the underlying address space concept, while documentation uses "memory
space" terminology for clarity in physical-memory-only contexts.

Structures are used to enable runtime iteration, simplify debugging,
and maintain consistency with other subsystems. Macro helpers reduce
initialization boilerplate while maintaining type safety.
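
Illustrative shapes for the two structures; fields beyond the documented 'as_' prefix are assumptions:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct fpage {
    uintptr_t base;     /* arbitrary base address, no alignment required */
    size_t size;        /* arbitrary size */
    uint8_t perms;      /* PMP_R / PMP_W / PMP_X */
    uint8_t priority;   /* 0 = kernel region, never evicted */
    struct fpage *next;
} fpage_t;

typedef struct mem_space {
    uint32_t as_id;            /* 'as_' reflects the address-space roots */
    fpage_t *fpages;           /* flexpages forming the task's memory view */
    struct mem_space *as_next;
} mem_space_t;
```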

Memory protection APIs are exposed to test programs for validation.
This follows the established pattern where kernel subsystem interfaces
are made available for testing purposes.

Defines static memory pools for boot-time PMP initialization using
linker symbols to identify kernel memory regions.

Linker symbol declarations are updated to include text segment
boundaries and match actual linker script definitions for stack
regions. Five kernel memory pools protect text as read-execute, data
and bss as read-write, heap and stack as read-write without execute
to prevent code injection.

Macro helpers reduce initialization boilerplate while maintaining
debuggability through struct arrays. Priority-based management handles
the 16-region hardware constraint.
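
A hedged sketch of such pool definitions; the linker symbols, struct, and macro are assumptions built on the PMP_* flags sketched earlier:

```c
#include <stdint.h>

extern char _stext[], _etext[], _sdata[], _edata[], _sbss[], _ebss[];

struct mem_pool {
    const char *name;
    uintptr_t base, end;
    uint8_t perms;
};

#define MEM_POOL(n, s, e, p) \
    { .name = (n), .base = (uintptr_t) (s), .end = (uintptr_t) (e), .perms = (p) }

static const struct mem_pool kernel_pools[] = {
    MEM_POOL("text", _stext, _etext, PMP_R | PMP_X), /* RX: no write */
    MEM_POOL("data", _sdata, _edata, PMP_R | PMP_W), /* RW: no execute */
    MEM_POOL("bss", _sbss, _ebss, PMP_R | PMP_W),
    /* ... heap and stack pools, likewise RW without execute ... */
};
```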
Extends TCB with a memory space pointer to enable per-task memory
isolation. Each task can now reference its own memory protection
domain through the flexpage mechanism.
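
Sketched, the linkage is one extra field on the tcb_t shown earlier (the field name is an assumption):

```c
typedef struct tcb {
    /* ... existing scheduler and stack fields ... */
    mem_space_t *mspace; /* per-task memory protection domain */
} tcb_t;
```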
Adds creation and destruction functions for flexpages, which are
software abstractions representing contiguous physical memory regions
with hardware-enforced protection attributes. These primitives will be
used by higher-level memory space management to construct per-task
memory views for PMP-based isolation.

Function naming follows kernel conventions to reflect that these
operations manage abstract memory protection objects rather than
just memory allocation.

Add functions to create and destroy memory spaces, which serve as
containers for flexpages. A memory space can be dedicated to a single
task or shared across multiple tasks, supporting both isolated and
shared memory models.
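
Plausible signatures for these primitives; exact names and parameters are assumptions:

```c
fpage_t *fpage_create(uintptr_t base, size_t size, uint8_t perms,
                      uint8_t priority);
void fpage_destroy(fpage_t *fp);

mem_space_t *mem_space_create(void);
void mem_space_destroy(mem_space_t *ms);
void mem_space_add(mem_space_t *ms, fpage_t *fp); /* attach a flexpage */
```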
Provide helper functions for runtime-indexed access to PMP control
and status registers alongside existing compile-time CSR macros.
RISC-V CSR instructions encode register addresses as immediate
values in the instruction itself, making dynamic selection
impossible through simple arithmetic. These helpers use
switch-case dispatch to map runtime indices to specific CSR
instructions while preserving type safety.

This enables PMP register management code to iterate over regions
without knowing exact register numbers at compile-time. These
helpers are designed for use by subsequent region management
operations and are marked unused to allow incremental development
without compiler warnings.
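
The dispatch pattern, sketched for reads (two of the sixteen cases shown; the function name is illustrative):

```c
#include <stdint.h>

static inline uintptr_t pmpaddr_read(int idx)
{
    uintptr_t v = 0;
    switch (idx) {
    /* The CSR number is an immediate in the instruction, hence one
     * case per register rather than computed addressing. */
    case 0: asm volatile("csrr %0, pmpaddr0" : "=r"(v)); break;
    case 1: asm volatile("csrr %0, pmpaddr1" : "=r"(v)); break;
    /* ... cases 2..15 follow the same pattern ... */
    }
    return v;
}
```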

PMP implementation is now included in the build system to make
these helpers and future PMP functionality available at link time.

Establishes a centralized PMP configuration state that maintains
a shadow copy of hardware register state in memory. This design
allows the kernel to track and coordinate PMP region usage
without repeatedly reading from hardware CSRs.

The global configuration serves as the single source of truth
for all PMP management operations throughout the kernel. A
public accessor function provides controlled access to this
shared state.
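
A sketch of what the shadow state and its accessor might look like; field names are assumptions, reusing PMP_MAX_REGIONS from the earlier encoding sketch:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct pmp_region {
    uintptr_t start, end; /* TOR range [start, end) */
    uint8_t perms;
    uint8_t priority;
    bool locked;
    bool used;
} pmp_region_t;

typedef struct pmp_config {
    pmp_region_t regions[PMP_MAX_REGIONS]; /* mirrors the 16 hardware slots */
} pmp_config_t;

static pmp_config_t pmp_config; /* single source of truth */

pmp_config_t *pmp_get_config(void)
{
    return &pmp_config;
}
```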
Provides a complete set of functions for managing Physical Memory
Protection regions in TOR mode, maintaining shadow configuration state
synchronized with hardware CSRs.

Hardware initialization clears all PMP regions by zeroing address and
configuration registers, then initializes shadow state with default
values for each region slot. This establishes clean hardware and
software state for subsequent region configuration.

Region configuration validates that the address range is valid and the
region is not locked, then constructs configuration bytes with TOR
addressing mode and permission bits. Both hardware CSRs and shadow
state are updated atomically, with optional locking to prevent further
modification. A helper function computes configuration register index
and bit offset from region index, eliminating code duplication across
multiple operations.
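
The helper reduces to simple arithmetic; a sketch assuming RV32, where each pmpcfg register packs four 8-bit region fields:

```c
static inline void pmp_cfg_pos(int region, int *cfg_idx, int *shift)
{
    *cfg_idx = region / 4;     /* which pmpcfgN register */
    *shift = (region % 4) * 8; /* bit offset of this region's byte */
}
```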

Region disabling clears the configuration byte to remove protection
while preserving other regions in the same configuration register.
Region locking sets the lock bit to prevent modification until
hardware reset. Region retrieval reads address range, permissions,
priority, and lock status from shadow configuration.

Access verification checks whether a memory operation falls within
configured region boundaries by comparing address and size, then
validates that region permissions match the requested operation type.
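
Against the shadow state, the check might look like this sketch ('req' is a PMP_R/PMP_W/PMP_X mask; names assumed):

```c
static bool pmp_check_access(uintptr_t addr, size_t size, uint8_t req)
{
    for (int i = 0; i < PMP_MAX_REGIONS; i++) {
        const pmp_region_t *r = &pmp_config.regions[i];
        if (r->used && addr >= r->start && addr + size <= r->end)
            return (r->perms & req) == req;
    }
    return false; /* no configured region covers the access */
}
```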

Address register read helpers are marked unused as the shadow state
design eliminates the need to read hardware registers during normal
operation. They remain available for potential future use cases
requiring hardware state verification.

Implements hardware driver functions that bridge software flexpages
with PMP hardware regions, enabling dynamic loading and eviction of
memory protection mappings. This establishes the foundation for
on-demand memory protection where tasks can access more memory than
the 16 hardware PMP entries allow.

The driver provides three core operations. Loading translates flexpage
attributes to PMP configuration and programs the hardware region.
Eviction disables a hardware region and clears the mapping. Victim
selection examines all loaded flexpages and identifies the one with
highest priority value for eviction. Kernel regions with priority 0
are never selected, ensuring system stability during context switches.
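
Victim selection, sketched with a hypothetical slot table:

```c
static fpage_t *loaded[PMP_MAX_REGIONS]; /* flexpage per hardware slot */

static fpage_t *pmp_select_victim(void)
{
    fpage_t *victim = NULL;
    for (int i = 0; i < PMP_MAX_REGIONS; i++) {
        fpage_t *fp = loaded[i];
        if (!fp || fp->priority == 0) /* skip empty slots and kernel regions */
            continue;
        if (!victim || fp->priority > victim->priority)
            victim = fp; /* highest priority value loses its slot */
    }
    return victim;
}
```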

To maintain architectural independence, the architecture layer
implements the hardware-specific operations while the kernel layer
provides architecture-agnostic wrappers. This layering allows kernel
code to remain portable while leveraging hardware-specific features.
The wrapper pattern enables future support for other memory protection
units without modifying higher-level kernel logic.

When a task accesses memory not currently loaded in a hardware region,
the system raises an access fault. Rather than panicking, the fault
handler attempts recovery by dynamically loading the required region,
enabling tasks to access more memory than can fit simultaneously in the
available hardware regions.

The fault handler reads the faulting address from the mtval CSR to locate
the corresponding flexpage in the task's memory space. If all hardware
regions are occupied, a victim selection algorithm identifies the
flexpage with highest priority value for eviction, then reuses its
hardware slot for the newly required flexpage.

This establishes demand-paging semantics for memory protection where
region mappings are loaded on first access. The fault recovery mechanism
ensures tasks can utilize their full memory space regardless of hardware
region constraints, with kernel regions protected from eviction to
maintain system stability.
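
Putting the pieces together, the recovery path might look like this sketch, building on the earlier struct sketches; the lookup and slot helpers are assumptions:

```c
static bool pmp_handle_access_fault(const tcb_t *task)
{
    uintptr_t addr;
    asm volatile("csrr %0, mtval" : "=r"(addr)); /* faulting address */

    fpage_t *fp = mem_space_find(task->mspace, addr); /* hypothetical lookup */
    if (!fp)
        return false; /* genuinely illegal access: let the caller decide */

    int slot = pmp_free_slot(); /* hypothetical: negative when full */
    if (slot < 0) {
        fpage_t *victim = pmp_select_victim();
        if (!victim)
            return false; /* only kernel regions are loaded */
        slot = pmp_evict(victim); /* reuse the victim's hardware slot */
    }
    return pmp_load(fp, slot) == 0; /* program the hardware region */
}
```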
@HeatCrab HeatCrab force-pushed the pmp/memory-isolation branch 3 times, most recently from 32f3fae to 82f18b3 on January 6, 2026 08:05
/* PMP fault handled successfully, return current frame */
ret_sp = isr_sp;
goto trap_exit;
}
Collaborator

If a user task triggers a fault that pmp_handle_access_fault cannot resolve, the kernel panics.

Considering that the goal of this series is to provide isolation between user tasks and the kernel, is panicking the entire system the desired behavior here? It seems better to kill only the faulting task and let the rest of the system continue running.

Collaborator Author

My current implementation panics as a deliberate fail-fast strategy during development: it catches bugs immediately rather than silently killing tasks and potentially masking issues.

However, I agree this is not the correct final behavior. I will first finish validating that PMP correctly isolates the kernel from user tasks, then circle back to implement proper fault handling: terminating the faulting task rather than panicking the whole system.

@HeatCrab HeatCrab force-pushed the pmp/memory-isolation branch 2 times, most recently from b168f50 to b1b3199 on January 6, 2026 08:49

Hardware-enforced memory protection requires dynamic region management
to support per-task isolation. Kernel regions must remain locked and
accessible across all tasks, while user task regions need to be
dynamically loaded and evicted during context switches.

This implementation separates hardware lock control from region
configuration. Region setup no longer automatically sets hardware lock
bits based on shadow configuration flags. Instead, lock control is
delegated to an explicit locking operation. This design allows kernel
regions to be marked as protected in software state while remaining
modifiable through the API, enabling dynamic region management without
compromising protection semantics.
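
A sketch of the resulting split, with assumed signatures and reusing the kernel_pools sketch from earlier:

```c
int pmp_region_set(int idx, uintptr_t start, uintptr_t end, uint8_t perms);
int pmp_region_lock(int idx); /* sets PMP_L; irreversible until reset */

static void pmp_protect_kernel(void)
{
    /* Kernel pools: configure once at boot, then lock explicitly. */
    for (size_t i = 0; i < sizeof(kernel_pools) / sizeof(kernel_pools[0]); i++) {
        pmp_region_set((int) i, kernel_pools[i].base, kernel_pools[i].end,
                       kernel_pools[i].perms);
        pmp_region_lock((int) i); /* survives every context switch */
    }
}
```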

During context switches, the previous task's dynamic regions are
evicted from hardware slots, and the incoming task's regions are loaded
into available slots. Kernel regions remain locked and preserved across
all transitions, ensuring system stability and isolation boundaries.
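
The context-switch hook, sketched (names assumed; locked kernel slots are untouched):

```c
void pmp_switch_space(mem_space_t *prev, mem_space_t *next)
{
    if (prev)
        pmp_unload_space(prev); /* evict the outgoing task's dynamic regions */
    if (next)
        pmp_load_space(next); /* load what fits; the rest faults in on demand */
}
```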
Each task now receives a dedicated memory space during creation, with
its stack registered as a flexpage. This establishes the protection
metadata required for hardware enforcement.

During context switches, the scheduler triggers memory protection
reconfiguration for both preemptive and cooperative scheduling modes.
The outgoing task's memory space is captured before scheduler state
updates, ensuring both old and new memory spaces are available for the
protection switching logic.

Configure memory protection for kernel text, data, BSS, heap, and stack regions during hardware initialization. Halt on setup failure.

Also remove the temporary PMP validation hack that granted U-mode full access. With memory protection now integrated into task management, this bypass is no longer necessary and must be removed to enforce actual isolation.

User mode system calls can trigger nested trap scenarios where an outer
U-mode ecall handler invokes the syscall dispatcher, which attempts to
yield control, triggering an inner M-mode ecall. The inner trap can
overwrite state that the outer trap requires for context switching,
corrupting task state during restoration.

This commit introduces trap nesting depth tracking to maintain proper
state isolation between trap levels. Only the outermost trap is
responsible for establishing the ISR frame pointer that context
switching uses. Inner traps preserve their own state and return control
without affecting outer trap context.

The trap handler ensures the saved program counter is correct before
any operation that might trigger nested traps. This guarantees that
context switches save valid return addresses regardless of nesting
depth, preventing returns to trap-triggering instructions.

Nested traps restore only their own execution context, while the
outermost trap retains sole responsibility for context switch
restoration. This ownership policy prevents inner traps from
incorrectly restoring task contexts.
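
A sketch of the depth tracking described above; names are assumptions and the real handler does considerably more:

```c
static int trap_depth;
static void *outer_isr_frame;

void *do_trap(void *isr_sp)
{
    if (++trap_depth == 1)
        outer_isr_frame = isr_sp; /* outermost trap owns restoration */

    handle_trap_cause(isr_sp); /* may itself trap (nested ecall) */

    /* Inner traps restore only their own frame; only the outermost trap
     * may return a different frame after a context switch. */
    void *ret_sp = (trap_depth == 1) ? outer_isr_frame : isr_sp;
    trap_depth--;
    return ret_sp;
}
```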

When yielding from within trap context, the implementation invokes the
scheduler directly rather than triggering additional trap nesting.

A test application validates the fix by spawning both machine mode and
user mode tasks that continuously yield and delay. This creates the
mixed privilege mode scenario where nested traps occur.

Validate PMP-based memory isolation during task context switches.
U-mode tasks verify stack integrity across multiple switches using
magic values, ensuring PMP correctly protects each task's memory
space.
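
The integrity check each U-mode task performs might be sketched as follows (the magic value and hooks are illustrative):

```c
#include <stdint.h>

#define STACK_MAGIC 0xA5A5A5A5u

static void umode_task(void)
{
    volatile uint32_t canary = STACK_MAGIC; /* lives on this task's stack */
    for (;;) {
        sys_yield(); /* force context switches */
        if (canary != STACK_MAGIC)
            test_fail("stack corrupted"); /* hypothetical test hook */
    }
}
```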

A secondary kernel protection test validates that U-mode cannot
write to kernel memory, triggering PMP access faults as expected.

The test is excluded from crash-detection CI but included in
functional tests with expected exception criteria to verify both
isolation correctness and fault handling.

@HeatCrab HeatCrab force-pushed the pmp/memory-isolation branch from b1b3199 to ab59cdd on January 6, 2026 09:17