Enhance task scheduler #13

Open
@jserv

Description

Executive Summary

In conjunction with pull request #6, this proposal outlines a comprehensive enhancement of Linmo's task scheduler to address current scalability limitations and prepare for multi-hart RISC-V systems. The enhancement transforms the current O(n) linear-search scheduler into an O(1) bitmap-based system while preserving Linmo's weighted round-robin fairness model and adding fairness guarantees suitable for SMP operation.

Key Objectives:

  • Performance: O(n) → O(1) scheduling complexity
  • Fairness: Multi-dimensional fairness across priority, time, and load
  • Scalability: Single-hart → Multi-hart ready architecture
  • Compatibility: 100% API preservation with existing applications

Current State Analysis

Existing Scheduler Limitations

1. Algorithmic Complexity Issues

/* Current O(n) scheduler search - scales poorly */
static list_node_t *find_next_ready_task(void)
{
    list_node_t *node = kcb->task_current;
    int itcnt = 0;

    while (itcnt++ < SCHED_IMAX) { /* Up to 500 iterations! */
        node = list_cnext(kcb->tasks, node);
        tcb_t *task = node->data;
        if (task->state == TASK_READY && (task->prio & 0xFF) == 0) {
            return node; /* Found after O(n) search */
        }
    }
    return NULL; /* Gave up after SCHED_IMAX iterations */
}

Problems:

  • O(n) Complexity: Search time grows linearly with task count
  • Worst-Case Bound: SCHED_IMAX = 500 iterations maximum
  • Cache Misses: Scattered task nodes hurt memory performance
  • Unpredictable Latency: Scheduling time varies with system load

2. Memory Layout Inefficiencies

/* Current single mixed list - poor cache locality */
typedef struct {
    list_t *tasks;              /* All tasks mixed regardless of priority */
    list_node_t *task_current;  /* Linear search starting point */
    list_node_t *last_ready_hint; /* Band-aid for performance */
} kcb_t;

Issues:

  • Poor Locality: Tasks of different priorities mixed in single list
  • Cache Thrashing: Frequent cache misses during priority searches
  • No Priority Ordering: High-priority tasks buried among low-priority ones

3. Fairness Limitations

/* Current fairness is rudimentary */
typedef struct {
    uint16_t prio;    /* Only basic priority counter */
    uint16_t delay;   /* Simple delay mechanism */
    /* No starvation prevention */
    /* No fairness tracking */
    /* No load balancing support */
} tcb_t;

Deficiencies:

  • No Starvation Prevention: Low-priority tasks can be starved indefinitely
  • Limited Fairness Metrics: No tracking of actual vs. expected CPU time
  • No Load Awareness: No consideration of system-wide load distribution

4. Single-Hart Architecture Constraints

  • No SMP Support: Cannot utilize multiple RISC-V harts
  • No Load Balancing: No mechanism for distributing work across harts
  • Global Bottlenecks: Single scheduler becomes bottleneck

Enhancement Objectives & Requirements

Primary Objectives

  1. O(1) Scheduling Complexity

    • Constant-time task selection regardless of system size
    • Bounded worst-case scheduling latency < 10 μs
    • Eliminate SCHED_IMAX iteration limits
  2. Comprehensive Fairness

    • Priority Fairness: Strict priority ordering preservation
    • Temporal Fairness: Proportional CPU time allocation
    • Load Fairness: Even distribution across harts
    • Starvation Prevention: Guaranteed progress for all tasks
  3. Multi-Hart Scalability

    • Support for 2-8 RISC-V harts initially
    • Per-hart local scheduling with global coordination
    • Load balancing and work migration capabilities
    • NUMA-awareness preparation
  4. Backward Compatibility

    • 100% API preservation for existing applications
    • Configurable enhancement levels
    • Graceful fallback mechanisms

Technical Design Architecture

1. O(1) Bitmap-Based Priority Selection

Enhanced Priority Queue Design

#define PRIO_LEVELS 8           /* Optimized for RISC-V cache lines */
#define MAX_HARTS 8             /* Support up to 8 RISC-V harts */

/* Per-hart scheduler state for cache locality */
typedef struct {
    /* O(1) Priority Selection Infrastructure */
    volatile uint32_t ready_bitmap;         /* 8-bit priority bitmap */
    list_t *ready_queues[PRIO_LEVELS];      /* Separate queue per priority */
    uint16_t queue_counts[PRIO_LEVELS];     /* O(1) size tracking */
    
    /* Weighted Round-Robin State per Priority Level */
    list_node_t *rr_cursors[PRIO_LEVELS];   /* Round-robin position */
    uint32_t quantum_cycles[PRIO_LEVELS];   /* Scheduling cycles per level */
    
    /* Performance Optimization */
    uint8_t last_selected_prio;             /* Cache hint for next selection */
    uint32_t local_switches;                /* Context switch count */
    
    /* Hart-Specific Data */
    tcb_t *current_task;                    /* Currently running task */
    uint8_t hart_id;                        /* RISC-V hart identifier */
    
} hart_sched_t;

/* Global multi-hart coordination */
typedef struct {
    hart_sched_t per_hart[MAX_HARTS];       /* Per-hart local state */
    
    /* Global Priority Coordination */
    volatile uint32_t global_ready_bitmap;   /* All harts combined */
    volatile uint32_t hart_active_mask;      /* Which harts are online */
    
    /* Fairness & Load Balancing */
    fairness_engine_t fairness_engine;      /* Global fairness coordinator */
    load_balancer_t load_balancer;          /* Multi-hart load balancer */
    
    /* Performance Metrics */
    uint64_t total_context_switches;        /* System-wide switches */
    uint64_t total_migrations;              /* Inter-hart migrations */
    
} multi_hart_sched_t;

RISC-V Optimized Bit Operations

/* RISC-V optimized priority finding using De Bruijn sequence */
static const uint8_t debruijn_lut[32] = {
    0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
    31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};

/* O(1) priority selection optimized for RISC-V */
static inline uint8_t find_highest_ready_priority_rv(uint32_t bitmap)
{
    if (unlikely(bitmap == 0))
        return IDLE_PRIORITY_LEVEL;
    
    /* Isolate rightmost set bit (highest priority) */
    uint32_t isolated = bitmap & (-bitmap);
    
    /* De Bruijn multiplication for O(1) bit position finding */
    uint32_t hash = (isolated * 0x077CB531U) >> 27;
    
    return debruijn_lut[hash & 0x1F];
}
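
As a worked example: for bitmap = 0x28 (priorities 3 and 5 ready), bitmap & (-bitmap) isolates bit 3 (0x08), the De Bruijn multiply-shift hashes it to LUT index 7, and debruijn_lut[7] = 3 — the numerically lowest, and therefore highest-priority, ready level.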

/* Atomic bitmap operations for multi-hart safety */
static inline void mark_priority_ready_atomic(hart_sched_t *hart, uint8_t prio)
{
    /* Hart-local update */
    hart->ready_bitmap |= (1U << prio);
    
    /* Global coordination */
    __atomic_or_fetch(&multi_sched.global_ready_bitmap, 
                      (1U << prio), __ATOMIC_RELAXED);
}

static inline void mark_priority_empty_atomic(hart_sched_t *hart, uint8_t prio)
{
    /* Hart-local update */
    hart->ready_bitmap &= ~(1U << prio);
    
    /* Check if priority is empty across all harts */
    uint32_t global_prio_mask = 0;
    for (int h = 0; h < MAX_HARTS; h++) {
        if (multi_sched.per_hart[h].ready_bitmap & (1U << prio)) {
            global_prio_mask |= (1U << prio);
            break;
        }
    }
    
    if (!global_prio_mask) {
        __atomic_and_fetch(&multi_sched.global_ready_bitmap,
                          ~(1U << prio), __ATOMIC_RELAXED);
    }
}
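
For symmetry, below is a minimal sketch of the corresponding enqueue path, which must keep the queue, its count, and both bitmaps consistent. list_pushback() is an assumed list helper, and priority_to_queue_index() is the hypothetical 16-bit-to-8-level mapping sketched under Phase 1 below.

/* Sketch: make a task ready on a given hart */
static void enqueue_ready_task_local(hart_sched_t *hart, tcb_t *task)
{
    uint8_t prio = priority_to_queue_index(task->prio); /* Assumed mapping */

    task->state = TASK_READY;
    list_pushback(hart->ready_queues[prio], task); /* Assumed list helper */
    hart->queue_counts[prio]++;

    /* Publish readiness only after the queue update, so no other hart
     * can observe a set bitmap bit while the queue is still empty */
    mark_priority_ready_atomic(hart, prio);
}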

2. Comprehensive Fairness Framework

Multi-Dimensional Fairness Tracking

/* Comprehensive fairness metrics per task */
typedef struct {
    /* Temporal Fairness */
    uint64_t total_runtime_us;          /* Actual CPU time consumed */
    uint64_t expected_runtime_us;       /* Fair share based on priority */
    int32_t fairness_deficit_us;        /* Expected - Actual (signed) */
    uint64_t last_start_time;           /* When current execution started */
    
    /* Starvation Prevention */
    uint32_t wait_cycles;               /* Consecutive scheduling cycles waiting */
    uint8_t starvation_level;           /* Escalating boost level (0-7) */
    uint32_t boost_deadline;            /* When to remove temporary boost */
    
    /* Multi-Hart Fairness */
    uint16_t hart_affinity_mask;        /* Preferred hart assignment */
    uint8_t last_hart;                  /* Last hart where task executed */
    uint16_t migration_count;           /* Number of inter-hart moves */
    uint32_t cache_warmth_score;        /* Cache locality metric */
    
    /* Load Distribution */
    uint32_t load_weight;               /* Task's contribution to hart load */
    uint64_t total_wait_time_us;        /* Total time spent waiting */
    
} task_fairness_t;

/* Hart load metrics for balancing decisions */
typedef struct {
    uint32_t total_tasks;               /* Tasks assigned to this hart */
    uint32_t ready_tasks;               /* Currently ready tasks */
    uint64_t weighted_load;             /* Priority-weighted load value */
    uint32_t utilization_percent;       /* Recent CPU utilization */
    
    /* Migration Tracking */
    uint32_t migrations_in;             /* Tasks migrated to this hart */
    uint32_t migrations_out;            /* Tasks migrated from this hart */
    uint32_t migration_cost_total;      /* Cumulative migration overhead */
    
    /* Performance Metrics */
    uint64_t idle_time_us;              /* Time spent idle */
    uint32_t context_switches;          /* Local context switches */
    
} hart_load_t;

Proportional Fair Scheduling Algorithm

/* Calculate task's fair share of CPU time */
static uint64_t calculate_fair_share_us(tcb_t *task, uint64_t period_us)
{
    /* Convert priority to weight using exponential mapping */
    uint8_t prio_level = priority_to_queue_index(task->prio);
    uint32_t task_weight = 1U << (PRIO_LEVELS - 1 - prio_level);
    
    /* Calculate system-wide total weight */
    uint32_t total_weight = 0;
    for (int h = 0; h < MAX_HARTS; h++) {
        hart_sched_t *hart = &multi_sched.per_hart[h];
        for (int p = 0; p < PRIO_LEVELS - 1; p++) { /* Exclude idle level */
            uint32_t tasks_at_level = hart->queue_counts[p];
            uint32_t level_weight = 1U << (PRIO_LEVELS - 1 - p);
            total_weight += tasks_at_level * level_weight;
        }
    }
    
    if (total_weight == 0)
        return 0;
    
    /* Proportional fair share calculation */
    return (period_us * task_weight) / total_weight;
}
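
As a worked example: one task at level 0 (weight 1 << 7 = 128) plus two tasks at level 4 (weight 8 each) gives total_weight = 144, so over a 10,000 μs period the level-0 task's fair share is 10000 × 128 / 144 ≈ 8889 μs, while each level-4 task receives ≈ 556 μs.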

/* Update fairness metrics after task execution */
static void update_fairness_metrics(tcb_t *task, uint64_t runtime_us)
{
    task_fairness_t *fairness = &task->fairness;
    
    /* Update actual runtime */
    fairness->total_runtime_us += runtime_us;
    
    /* Calculate expected runtime based on fair share */
    static uint64_t last_fairness_update = 0;
    uint64_t current_time = _read_us();
    
    if (last_fairness_update > 0) {
        uint64_t period = current_time - last_fairness_update;
        uint64_t fair_share = calculate_fair_share_us(task, period);
        
        fairness->expected_runtime_us += fair_share;
        
        /* Update fairness deficit (negative = behind, positive = ahead) */
        fairness->fairness_deficit_us = 
            (int32_t)(fairness->expected_runtime_us - fairness->total_runtime_us);
    }
    
    /* Reset wait cycles since task just ran */
    fairness->wait_cycles = 0;
    
    /* Global fairness update every 100ms */
    if (current_time - last_fairness_update > 100000) {
        last_fairness_update = current_time;
        global_fairness_reconciliation();
    }
}

Anti-Starvation Mechanisms

/* Starvation detection and prevention */
#define STARVATION_THRESHOLD_CYCLES 50    /* Cycles before boost */
#define MAX_STARVATION_LEVEL 7           /* Maximum boost level */

static void check_and_prevent_starvation(hart_sched_t *hart)
{
    for (int prio = PRIO_LEVELS - 2; prio >= 0; prio--) { /* Skip idle */
        list_t *queue = hart->ready_queues[prio];
        if (!queue) continue;
        
        list_node_t *node = queue->head->next;
        while (node != queue->tail) {
            tcb_t *task = (tcb_t *)node->data;
            node = node->next;

            if (!task || task->state != TASK_READY)
                continue;

            task_fairness_t *fairness = &task->fairness;
            
            /* Increment wait cycles */
            fairness->wait_cycles++;
            
            /* Check for starvation condition */
            if (fairness->wait_cycles > STARVATION_THRESHOLD_CYCLES) {
                apply_starvation_boost(task, hart, prio);
            }
            /* Check for severe fairness deficit */
            else if (fairness->fairness_deficit_us < -10000) { /* 10ms behind */
                fairness->starvation_level = 
                    MIN(fairness->starvation_level + 1, MAX_STARVATION_LEVEL);
            }
            /* Reset boost if fairness restored */
            else if (fairness->fairness_deficit_us > -1000) { /* Within 1ms */
                fairness->starvation_level = 0;
            }
        }
    }
}

/* Apply temporary priority boost to starved task */
static void apply_starvation_boost(tcb_t *task, hart_sched_t *hart, uint8_t current_prio)
{
    if (current_prio == 0) /* Already highest priority */
        return;
    
    task_fairness_t *fairness = &task->fairness;
    uint8_t boost_prio = MAX(0, current_prio - 2); /* Boost by 2 levels */
    
    printf("FAIRNESS: Boosting starved task %u from prio %u to %u (deficit: %d μs)\n",
           task->id, current_prio, boost_prio, fairness->fairness_deficit_us);
    
    /* Move task to higher priority queue */
    migrate_task_to_priority_level(task, hart, current_prio, boost_prio);
    
    /* Set boost removal deadline */
    fairness->boost_deadline = multi_sched.global_ticks + 
                              (F_TIMER * fairness->starvation_level); /* More time for higher starvation */
    fairness->wait_cycles = 0;
    fairness->starvation_level++;
    
    /* Mark task for boost removal after execution */
    task->flags |= TASK_FLAG_PRIORITY_BOOSTED;
}
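
The TASK_FLAG_PRIORITY_BOOSTED flag set above needs a matching removal path. A hedged sketch follows; orig_prio_level is an assumed extra task_fairness_t field that would record the pre-boost level at boost time.

/* Sketch: undo a starvation boost once its deadline has passed */
static void maybe_remove_starvation_boost(tcb_t *task, hart_sched_t *hart,
                                          uint8_t current_prio)
{
    task_fairness_t *fairness = &task->fairness;

    if (!(task->flags & TASK_FLAG_PRIORITY_BOOSTED))
        return;

    if (multi_sched.global_ticks < fairness->boost_deadline)
        return; /* Boost still active - let it run its course */

    /* Return the task to its pre-boost level (assumed field) */
    migrate_task_to_priority_level(task, hart, current_prio,
                                   fairness->orig_prio_level);
    fairness->starvation_level = 0;
    task->flags &= ~TASK_FLAG_PRIORITY_BOOSTED;
}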

3. Multi-Hart Architecture Design

Hart-Local Scheduling with Global Coordination

/* Per-hart O(1) scheduler */
static tcb_t *select_next_task_local(uint8_t hart_id)
{
    hart_sched_t *hart = &multi_sched.per_hart[hart_id];
    
    /* O(1) priority selection using local bitmap */
    uint8_t highest_prio = find_highest_ready_priority_rv(hart->ready_bitmap);
    
    if (highest_prio == IDLE_PRIORITY_LEVEL)
        return get_idle_task(hart_id);
    
    /* Select from highest priority queue using weighted round-robin */
    tcb_t *selected = select_from_priority_queue_wrr(hart, highest_prio);
    
    if (!selected) {
        /* Queue became empty, update bitmaps */
        mark_priority_empty_atomic(hart, highest_prio);
        return select_next_task_local(hart_id); /* Retry */
    }
    
    return selected;
}

/* Weighted round-robin within priority queue */
static tcb_t *select_from_priority_queue_wrr(hart_sched_t *hart, uint8_t prio)
{
    list_t *queue = hart->ready_queues[prio];
    if (!queue || queue->size == 0)
        return NULL;
    
    list_node_t *cursor = hart->rr_cursors[prio];
    if (!cursor || cursor == queue->tail)
        cursor = queue->head->next; /* Start from beginning */
    
    /* Linmo's weighted round-robin: find task with counter = 0 */
    list_node_t *start = cursor;
    do {
        if (cursor == queue->tail) {
            cursor = queue->head->next;
            continue;
        }
        
        tcb_t *task = (tcb_t *)cursor->data;
        if (!task || task->state != TASK_READY) {
            cursor = cursor->next;
            continue;
        }
        
        /* Starvation boost: right-shifting the counter by the boost level
         * makes starved tasks reach zero (and get selected) sooner */
        uint8_t effective_counter = (task->prio & 0xFF) >> task->fairness.starvation_level;
        
        if (effective_counter == 0) {
            /* Task ready to run - reload counter and select */
            uint8_t base_prio = (task->prio >> 8) & 0xFF;
            task->prio = (task->prio & 0xFF00) | base_prio;
            
            /* Remove from queue and update cursor */
            hart->rr_cursors[prio] = cursor->next;
            list_remove(queue, cursor);
            hart->queue_counts[prio]--;
            
            /* Update bitmap if queue became empty */
            if (hart->queue_counts[prio] == 0)
                mark_priority_empty_atomic(hart, prio);
            
            return task;
        }
        
        /* Decrement counter */
        task->prio = (task->prio & 0xFF00) | (effective_counter - 1);
        cursor = cursor->next;
        
    } while (cursor != start);
    
    return NULL; /* No task ready in this priority level */
}

Global Load Balancing Engine

/* Global load balancing coordinator */
typedef struct {
    volatile uint32_t balance_generation;    /* Prevent race conditions */
    uint64_t last_balance_time;             /* When balancing last ran */
    uint32_t balance_interval_us;           /* How often to balance */
    
    /* Load metrics */
    hart_load_t hart_loads[MAX_HARTS];      /* Per-hart load tracking */
    uint32_t system_wide_load;              /* Total system load */
    
    /* Migration policies */
    uint32_t migration_cost_threshold;      /* Minimum benefit for migration */
    uint32_t max_migrations_per_balance;    /* Rate limiting */
    
} load_balancer_t;

/* Calculate comprehensive hart load metric */
static uint32_t calculate_hart_load(uint8_t hart_id)
{
    hart_sched_t *hart = &multi_sched.per_hart[hart_id];
    hart_load_t *load = &multi_sched.load_balancer.hart_loads[hart_id];
    
    uint32_t priority_weighted_load = 0;
    uint32_t task_count_load = 0;
    
    /* Weight by priority levels - higher priority = exponentially more load */
    for (int prio = 0; prio < PRIO_LEVELS - 1; prio++) { /* Exclude idle */
        uint32_t tasks = hart->queue_counts[prio];
        uint32_t weight = 1U << (PRIO_LEVELS - 1 - prio); /* Exponential weight */
        
        priority_weighted_load += tasks * weight * weight; /* Square for emphasis */
        task_count_load += tasks;
    }
    
    /* Combine multiple load factors */
    uint32_t utilization_load = load->utilization_percent * 10; /* Busier hart = higher load */
    uint32_t total_load = priority_weighted_load * 100 + 
                         task_count_load * 50 + 
                         utilization_load;
    
    return total_load;
}
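
As a quick sanity check of the weighting: a hart running two level-0 tasks (weight 128) at 80% utilization scores priority_weighted_load = 2 × 128² = 32768, task_count_load = 2, and utilization_load = 800, giving total_load = 32768 × 100 + 2 × 50 + 800 = 3,277,700. The squared priority term dominates by design, so high-priority work outweighs raw task count.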

/* Smart load balancing with migration cost analysis */
static void global_load_balance(void)
{
    load_balancer_t *balancer = &multi_sched.load_balancer;
    uint64_t current_time = _read_us();
    
    /* Rate limiting */
    if (current_time - balancer->last_balance_time < balancer->balance_interval_us)
        return;
    
    balancer->last_balance_time = current_time;
    balancer->balance_generation++;
    
    /* Calculate current load distribution */
    uint32_t hart_loads[MAX_HARTS];
    uint32_t total_load = 0;
    uint32_t min_load = UINT32_MAX, max_load = 0;
    uint8_t min_hart = 0, max_hart = 0;
    
    for (uint8_t h = 0; h < MAX_HARTS; h++) {
        hart_loads[h] = calculate_hart_load(h);
        total_load += hart_loads[h];
        
        if (hart_loads[h] < min_load) {
            min_load = hart_loads[h];
            min_hart = h;
        }
        if (hart_loads[h] > max_load) {
            max_load = hart_loads[h];
            max_hart = h;
        }
    }
    
    uint32_t avg_load = total_load / MAX_HARTS;
    
    /* Check if rebalancing needed (>30% imbalance) */
    if (max_load > avg_load + (avg_load * 30 / 100)) {
        uint32_t migrations = 0;
        
        /* Migrate tasks from overloaded to underloaded harts */
        while (migrations < balancer->max_migrations_per_balance &&
               calculate_hart_load(max_hart) > avg_load + (avg_load / 10)) {
            
            if (migrate_task_for_balance(max_hart, min_hart)) {
                migrations++;
                /* Recalculate min_hart after migration */
                min_hart = find_least_loaded_hart();
            } else {
                break; /* No suitable tasks to migrate */
            }
        }
        
        printf("BALANCE: Migrated %u tasks from hart %u to hart %u\n",
               migrations, max_hart, min_hart);
    }
}

Work-Stealing Protocol

/* Work-stealing for idle harts */
static tcb_t *attempt_work_stealing(uint8_t idle_hart)
{
    /* Try to steal from the most loaded hart */
    uint8_t victim_hart = find_most_loaded_hart();
    
    if (victim_hart == idle_hart)
        return NULL; /* No other harts available */
    
    /* Only steal if victim has significantly more work */
    uint32_t victim_load = calculate_hart_load(victim_hart);
    uint32_t idle_load = calculate_hart_load(idle_hart);
    
    if (victim_load <= idle_load * 2) /* Less than 2x load difference */
        return NULL;
    
    /* Prefer to steal lower priority tasks (less disruptive) */
    for (int prio = PRIO_LEVELS - 2; prio >= 4; prio--) { /* Skip idle and high prio */
        hart_sched_t *victim = &multi_sched.per_hart[victim_hart];
        list_t *queue = victim->ready_queues[prio];
        
        if (!queue || queue->size <= 1) /* Leave at least one task */
            continue;
        
        /* Steal from tail (oldest waiting task) */
        list_node_t *node = list_popback(queue);
        if (!node)
            continue;
        
        tcb_t *stolen_task = (tcb_t *)node->data;
        if (!stolen_task)
            continue;
        
        /* Update victim hart state */
        victim->queue_counts[prio]--;
        if (victim->queue_counts[prio] == 0) {
            mark_priority_empty_atomic(victim, prio);
        }
        
        /* Update task migration tracking */
        stolen_task->fairness.migration_count++;
        stolen_task->fairness.last_hart = idle_hart;
        stolen_task->fairness.cache_warmth_score = 0; /* Cache is cold */
        
        printf("STEAL: Task %u from hart %u to hart %u (prio %u)\n",
               stolen_task->id, victim_hart, idle_hart, prio);
        
        return stolen_task;
    }
    
    return NULL; /* No suitable work to steal */
}

4. Enhanced Main Scheduler Integration

/* Main O(1) multi-hart scheduler */
static uint16_t schedule_next_task_o1_multihart(uint8_t hart_id)
{
    tcb_t *next_task = NULL;
    hart_sched_t *hart = &multi_sched.per_hart[hart_id];
    
    /* Handle current task state transition */
    handle_current_task_transition_multihart(hart_id);
    
    /* 1. Try local hart scheduling first (cache locality) */
    next_task = select_next_task_local(hart_id);
    
    /* 2. Check for global priority violations */
    if (!next_task || should_check_global_priority()) {
        tcb_t *global_candidate = check_global_priority_enforcement(hart_id);
        if (global_candidate && 
            (!next_task || task_has_higher_priority(global_candidate, next_task))) {
            next_task = global_candidate;
        }
    }
    
    /* 3. Try work-stealing if still no local work */
    if (!next_task) {
        next_task = attempt_work_stealing(hart_id);
    }
    
    /* 4. Check starvation and apply boosts */
    check_and_prevent_starvation(hart);
    
    /* 5. Fallback to idle task */
    if (!next_task) {
        next_task = get_idle_task(hart_id);
    }
    
    /* Update scheduler state and metrics */
    if (next_task) {
        next_task->state = TASK_RUNNING;
        next_task->fairness.last_start_time = _read_us();
        next_task->fairness.wait_cycles = 0;
        hart->current_task = next_task;
        hart->last_selected_prio = priority_to_queue_index(next_task->prio);
        
        /* Update performance counters */
        hart->local_switches++;
        multi_sched.total_context_switches++;
    }
    
    /* Periodic global maintenance */
    static uint32_t balance_counter = 0;
    if ((++balance_counter % 100) == 0) { /* Every 100 schedules */
        global_load_balance();
    }
    
    return next_task ? next_task->id : 0;
}

/* Global priority enforcement across harts */
static tcb_t *check_global_priority_enforcement(uint8_t current_hart)
{
    /* Find highest global priority across all harts */
    uint8_t global_highest = find_highest_ready_priority_rv(multi_sched.global_ready_bitmap);
    
    /* Check if current hart's highest priority matches global */
    hart_sched_t *hart = &multi_sched.per_hart[current_hart];
    uint8_t local_highest = find_highest_ready_priority_rv(hart->ready_bitmap);
    
    if (local_highest <= global_highest)
        return NULL; /* Local hart has appropriate priority work */
    
    /* Look for higher priority work on other harts */
    for (uint8_t h = 0; h < MAX_HARTS; h++) {
        if (h == current_hart)
            continue;
        
        hart_sched_t *other_hart = &multi_sched.per_hart[h];
        uint8_t other_highest = find_highest_ready_priority_rv(other_hart->ready_bitmap);
        
        if (other_highest < local_highest) { /* Strictly better than our local best */
            /* Found higher priority work - attempt to steal it */
            tcb_t *high_prio_task = steal_high_priority_task(h, current_hart, other_highest);
            if (high_prio_task) {
                printf("GLOBAL: Migrated high-priority task %u from hart %u to hart %u\n",
                       high_prio_task->id, h, current_hart);
                return high_prio_task;
            }
        }
    }
    
    return NULL;
}

Implementation Roadmap

Phase 1: Single-Hart O(1) Enhancement

Tasks:

  • Implement priority queue data structures
  • Add bitmap manipulation functions with RISC-V optimization
  • Create priority mapping from 16-bit Linmo priorities to 8 levels (a sketch follows this phase's deliverables)
  • Add performance measurement infrastructure
  • Unit tests for bitmap operations
  • Implement O(1) priority selection algorithm
  • Add weighted round-robin within priority levels
  • Update task state transition logic
  • Integrate with existing HAL context switching
  • Performance benchmarking vs. current scheduler

Deliverables: Single-hart O(1) scheduler.
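
Below is a minimal sketch of the priority-mapping task above, assuming the packed layout used throughout this proposal (base priority in the high byte, round-robin counter in the low byte); the equal-width bands are an illustrative choice, not a final design.

/* Sketch: map a 16-bit packed Linmo priority onto the 8 queue levels */
static inline uint8_t priority_to_queue_index(uint16_t packed_prio)
{
    uint8_t base = (packed_prio >> 8) & 0xFF; /* Lower value = more urgent */

    /* Eight equal 32-wide bands: base 0-31 -> level 0 (highest),
     * base 224-255 -> level 7 (idle) */
    return base >> 5;
}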

Phase 2: Fairness Framework

Tasks:

  • Add fairness tracking fields to TCB
  • Implement proportional fair scheduling
  • Add basic starvation prevention
  • Create fairness measurement framework (see the metric sketch after this phase's deliverables)
  • Fairness validation tests
  • Implement deficit-based fairness tracking
  • Add dynamic priority boosting for starved tasks
  • Create fairness reconciliation algorithms
  • Add comprehensive fairness metrics
  • Long-running fairness stress tests

Deliverables: Comprehensive fairness framework with <5% deviation from ideal.
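
One candidate metric for the <5% target, sketched under the assumption that fairness is judged on weight-normalized runtime: Jain's fairness index, which is 1.0 for perfect fairness and degrades toward 1/n as the distribution skews.

/* Sketch: Jain's fairness index over weight-normalized runtimes */
static float jain_fairness_index(const uint64_t *runtime_us,
                                 const uint32_t *weights, uint32_t n)
{
    float sum = 0.0f, sum_sq = 0.0f;

    for (uint32_t i = 0; i < n; i++) {
        float x = (float)runtime_us[i] / (float)weights[i];
        sum += x;
        sum_sq += x * x;
    }

    /* (sum x_i)^2 / (n * sum x_i^2), guarding the empty/zero case */
    return (n > 0 && sum_sq > 0.0f) ? (sum * sum) / ((float)n * sum_sq) : 1.0f;
}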

Phase 3: Multi-Hart Architecture

Tasks:

  • Design per-hart scheduler state structures
  • Implement hart-local ready queues
  • Add global coordination mechanisms
  • Basic multi-hart task spawning
  • Hart identification and management (see the sketch after this phase's deliverables)
  • Implement load calculation algorithms
  • Add task migration mechanisms
  • Create work-stealing protocol
  • Add migration cost analysis
  • Load balancing effectiveness testing
  • Global priority enforcement
  • Cross-hart fairness coordination
  • Advanced migration policies
  • Multi-hart stress testing

Deliverables: Full multi-hart scheduler with >85% SMP efficiency.
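
For the hart-identification item above, a minimal sketch assuming M-mode execution (where the standard mhartid CSR is readable) and hart IDs contiguous from zero:

/* Sketch: identify the executing hart via the mhartid CSR (M-mode) */
static inline uint8_t current_hart_id(void)
{
    uint32_t id;
    __asm__ volatile("csrr %0, mhartid" : "=r"(id));
    return (uint8_t)(id & (MAX_HARTS - 1)); /* MAX_HARTS is a power of 2 */
}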

Performance Evaluation Framework

Comprehensive Benchmark Suite

1. Scheduling Performance Benchmarks

/* Scheduler performance test suite */
typedef struct {
    uint32_t task_count;              /* Number of tasks in test */
    uint64_t avg_schedule_time_us;    /* Average scheduling overhead */
    uint64_t max_schedule_time_us;    /* Worst-case scheduling time */
    uint64_t total_context_switches;  /* Total switches during test */
    uint32_t cache_misses;            /* Cache miss count */
} sched_perf_result_t;

/* Performance test scenarios */
static const struct {
    const char *name;
    uint32_t task_count;
    uint32_t priority_mix[8];  /* Tasks per priority level */
    uint32_t duration_seconds;
} perf_tests[] = {
    {"Minimal Load",     10,  {1,1,2,2,2,1,1,0}, 60},
    {"Moderate Load",    50,  {2,3,8,10,15,8,4,0}, 120},
    {"Heavy Load",      100,  {5,5,15,20,25,15,10,5}, 180},
    {"Stress Test",     200,  {10,10,30,40,50,30,20,10}, 300},
    {"RT Mixed",         30,  {5,10,8,4,2,1,0,0}, 90},
};

/* Run comprehensive performance evaluation */
void run_scheduler_performance_evaluation(void)
{
    printf("=== Linmo Enhanced Scheduler Performance Evaluation ===\n");
    
    for (size_t i = 0; i < ARRAY_SIZE(perf_tests); i++) {
        printf("\nRunning test: %s\n", perf_tests[i].name);
        
        sched_perf_result_t result = run_perf_test(&perf_tests[i]);
        
        printf("Results:\n");
        printf("  Avg scheduling time: %llu μs\n", result.avg_schedule_time_us);
        printf("  Max scheduling time: %llu μs\n", result.max_schedule_time_us);
        printf("  Total context switches: %llu\n", result.total_context_switches);
        printf("  Scheduling efficiency: %.2f switches/ms\n", 
               (float)result.total_context_switches / (perf_tests[i].duration_seconds * 1000));
        
        /* Compare with O(n) baseline */
        uint64_t baseline_time = estimate_linear_scheduler_time(perf_tests[i].task_count);
        float improvement = (float)baseline_time / result.avg_schedule_time_us;
        printf("  Performance improvement: %.1fx faster than O(n)\n", improvement);
    }
}

2. Fairness Validation Tests

/* Fairness measurement framework */
typedef struct {
    uint32_t task_id;
    uint64_t actual_runtime_us;
    uint64_t expected_runtime_us;
    float fairness_ratio;           /* actual/expected */
    uint32_t starvation_events;
    uint32_t priority_boosts;
} task_fairness_result_t;

/* System-wide fairness metrics */
typedef struct {
    float overall_fairness_score;      /* 0.0-1.0, 1.0 = perfect fairness */
    uint32_t total_starvation_events;  /* Count of starvation incidents */
    float max_unfairness_ratio;        /* Worst fairness deviation */
    uint64_t total_test_duration_us;   /* Test duration */
    uint32_t priority_inversions;      /* Priority ordering violations */
} system_fairness_result_t;

/* Fairness test scenarios */
void run_fairness_validation_tests(void)
{
    printf("=== Fairness Validation Test Suite ===\n");
    
    /* Test 1: Equal priority fairness */
    system_fairness_result_t equal_prio = test_equal_priority_fairness(20, 300); /* 20 tasks, 5 minutes */
    printf("Equal Priority Test: Fairness Score = %.3f\n", equal_prio.overall_fairness_score);
    
    /* Test 2: Mixed priority fairness */
    system_fairness_result_t mixed_prio = test_mixed_priority_fairness(50, 600); /* 50 tasks, 10 minutes */
    printf("Mixed Priority Test: Fairness Score = %.3f\n", mixed_prio.overall_fairness_score);
    
    /* Test 3: Starvation prevention */
    system_fairness_result_t starvation = test_starvation_prevention(30, 180); /* 30 tasks, 3 minutes */
    printf("Starvation Prevention: Events = %u (should be 0)\n", starvation.total_starvation_events);
    
    /* Test 4: Load burst handling */
    system_fairness_result_t burst = test_load_burst_fairness(100, 120); /* 100 tasks, 2 minutes */
    printf("Load Burst Test: Max Unfairness = %.2f%% (should be <5%%)\n", 
           (burst.max_unfairness_ratio - 1.0) * 100);
}

3. Multi-Hart Scalability Tests

/* Multi-hart performance and fairness evaluation */
typedef struct {
    uint8_t hart_count;
    float smp_efficiency;              /* 0.0-1.0, 1.0 = perfect scaling */
    uint32_t total_migrations;         /* Task migrations between harts */
    float migration_efficiency;        /* Successful/beneficial migrations */
    uint32_t load_imbalance_events;    /* Times load became imbalanced */
    float avg_load_distribution;       /* How evenly work was distributed */
} multihart_result_t;

/* Multi-hart test scenarios */
void run_multihart_scalability_tests(void)
{
    printf("=== Multi-Hart Scalability Tests ===\n");
    
    for (uint8_t harts = 2; harts <= MAX_HARTS; harts++) {
        printf("\nTesting with %u harts:\n", harts);
        
        multihart_result_t result = run_multihart_test(harts, 100, 300); /* 100 tasks, 5 min */
        
        printf("  SMP Efficiency: %.1f%%\n", result.smp_efficiency * 100);
        printf("  Task Migrations: %u\n", result.total_migrations);
        printf("  Migration Efficiency: %.1f%%\n", result.migration_efficiency * 100);
        printf("  Load Distribution: %.1f%% even\n", result.avg_load_distribution * 100);
        
        /* Theoretical vs actual performance */
        float theoretical_speedup = (float)harts;
        float actual_speedup = result.smp_efficiency * harts;
        printf("  Speedup: %.2fx (theoretical: %.2fx)\n", actual_speedup, theoretical_speedup);
    }
}

Implementation Guidelines and Best Practices

Code Quality Standards

1. Performance-Critical Code

All scheduler hot paths must meet these requirements:

  • Maximum 50 instructions for O(1) operations
  • No dynamic memory allocation in the scheduler core
  • Atomic operations must be lock-free where possible
  • Cache-line alignment for frequently accessed structures

/* Example: Performance-critical task selection */
static inline tcb_t *fast_task_select(hart_sched_t *hart)
{
    /* This function must complete in <10 μs */
    uint8_t prio = find_highest_ready_priority_rv(hart->ready_bitmap);
    return (prio < PRIO_LEVELS) ? dequeue_ready_task(hart, prio) : NULL;
}

2. Multi-Hart Safety

/* All shared data must be properly synchronized */
#define HART_LOCAL    /* Data accessed only by one hart */
#define HART_SHARED   /* Data shared between harts - needs atomics */
#define HART_GLOBAL   /* Data accessed by all harts - needs coordination */

/* Example: Safe bitmap updates */
static inline void safe_bitmap_update(volatile uint32_t *bitmap, uint8_t bit, bool set)
{
    if (set) {
        __atomic_or_fetch(bitmap, (1U << bit), __ATOMIC_RELAXED);
    } else {
        __atomic_and_fetch(bitmap, ~(1U << bit), __ATOMIC_RELAXED);
    }
}

3. Fairness Invariants

/* Fairness constraints that must always hold */
assert(task->fairness.total_runtime_us <= task->fairness.expected_runtime_us + FAIRNESS_TOLERANCE);
assert(task->fairness.wait_cycles < MAX_WAIT_CYCLES);
assert(task->fairness.starvation_level <= MAX_STARVATION_LEVEL);

/* Fairness verification in debug builds */
#ifdef CONFIG_FAIRNESS_DEBUG
#define VERIFY_FAIRNESS(task) verify_task_fairness_invariants(task)
#else
#define VERIFY_FAIRNESS(task) do {} while(0)
#endif
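
A hedged sketch of the checker behind VERIFY_FAIRNESS(); the panic() failure hook is an assumption about the kernel's error path.

#ifdef CONFIG_FAIRNESS_DEBUG
/* Sketch: enforce the three invariants above in debug builds */
static void verify_task_fairness_invariants(const tcb_t *task)
{
    const task_fairness_t *f = &task->fairness;

    if (f->total_runtime_us > f->expected_runtime_us + FAIRNESS_TOLERANCE)
        panic("fairness: task %u over budget", task->id); /* Assumed hook */

    if (f->wait_cycles >= MAX_WAIT_CYCLES)
        panic("fairness: task %u starving", task->id);

    if (f->starvation_level > MAX_STARVATION_LEVEL)
        panic("fairness: task %u boost overflow", task->id);
}
#endif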

Configuration Options

Build-Time Configuration

/* Enhanced scheduler feature levels */
#define CONFIG_ENHANCED_SCHEDULER_LEVEL_BASIC    1  /* O(1) only */
#define CONFIG_ENHANCED_SCHEDULER_LEVEL_FAIR     2  /* + fairness */
#define CONFIG_ENHANCED_SCHEDULER_LEVEL_SMP      3  /* + multi-hart */

/* Memory usage control */
#define CONFIG_MAX_HARTS                         4  /* Hart limit for this build (<= MAX_HARTS) */
#define CONFIG_PRIORITY_LEVELS                   8  /* Number of priority levels */
#define CONFIG_FAIRNESS_TRACKING                 1  /* Enable fairness metrics */
#define CONFIG_MIGRATION_SUPPORT                 1  /* Enable task migration */

/* Performance tuning */
#define CONFIG_LOAD_BALANCE_INTERVAL_MS        100  /* Load balancing frequency */
#define CONFIG_FAIRNESS_UPDATE_INTERVAL_MS     100  /* Fairness reconciliation */
#define CONFIG_STARVATION_THRESHOLD_CYCLES      50  /* Starvation detection */

Runtime Configuration

/* Runtime tunable parameters */
typedef struct {
    bool enhanced_scheduler_enabled;        /* Enable enhanced scheduler */
    uint8_t active_hart_count;             /* Number of harts to use */
    uint32_t migration_cost_threshold;     /* Minimum benefit for migration */
    uint32_t fairness_tolerance_us;        /* Acceptable fairness deviation */
    bool strict_priority_enforcement;      /* Global priority consistency */
    bool load_balancing_enabled;           /* Enable automatic load balancing */
} scheduler_config_t;

/* Configuration API */
int32_t mo_sched_configure(const scheduler_config_t *config);
int32_t mo_sched_get_stats(scheduler_stats_t *stats);
int32_t mo_sched_reset_stats(void);
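
A hedged usage sketch of the configuration API above; the field values are illustrative, not recommended defaults.

/* Sketch: enable the enhanced scheduler on two harts at boot */
static void enable_enhanced_scheduling(void)
{
    scheduler_config_t cfg = {
        .enhanced_scheduler_enabled = true,
        .active_hart_count = 2,
        .migration_cost_threshold = 100, /* Minimum benefit, in μs */
        .fairness_tolerance_us = 5000,   /* Accept up to 5 ms deviation */
        .strict_priority_enforcement = true,
        .load_balancing_enabled = true,
    };

    if (mo_sched_configure(&cfg) < 0) {
        /* Configuration rejected - keep running the legacy scheduler */
    }
}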
