[sival] rv_core_ibex_nmi_irq_test_silicon_owner_sival_rom_ext failure #22954

Closed
mundaym opened this issue May 3, 2024 · 16 comments
@mundaym

mundaym commented May 3, 2024

Description

Enabling the alert_handler ping mechanism results in an alert_handler NMI without any reported local or regular alerts. The only way to recover from this is by resetting the device.

@mundaym mundaym added this to the Earlgrey-PROD.M4 milestone May 3, 2024
@pamaury

pamaury commented May 3, 2024

I think @nbdd0121 has investigated the test in more depth and might comment here?

@andreaskurth

@moidx is currently investigating this. If others have any comments, please provide them.

@moidx

moidx commented May 3, 2024

Configuring the alert_handler triggers an NMI without any alert class association.

CHECK_STATUS_OK(
    alert_handler_testutils_configure_all(&alert_handler, config,
                                          /*lock=*/kDifToggleDisabled));

The alert NMI is triggered even when there are no pending alerts. I verified this by removing the forced alert:

// Trigger the alert handler to escalate.
CHECK_DIF_OK(dif_pwrmgr_alert_force(&pwrmgr, kDifPwrmgrAlertFatalFault));

Disabling the ping timer removes the spurious NMI issue:

// Configure the ping timer.
TRY(dif_alert_handler_configure_ping_timer(alert_handler, config.ping_timeout,
                                           kDifToggleEnabled, locked));

The following write in dif_alert_handler.c is what triggers the NMI from alert_handler:

  if (enabled == kDifToggleEnabled) {
    mmio_region_write32_shadowed(
        alert_handler->base_addr,
        ALERT_HANDLER_PING_TIMER_EN_SHADOWED_REG_OFFSET, 1);
  }

@moidx

moidx commented May 3, 2024

I tried switching the test to use a recoverable alert, different escalation sequences, and a valid ping timeout timer configuration, but none of these changes helped to remove the alert handler NMI.

I will submit these test updates separately, but I think at this point I am going to flag this for CDC analysis.

CC: @a-will @matutem @nbdd0121 who also took a look at this issue.

@moidx

moidx commented May 3, 2024

I am going to measure the delay between enabling the ping mechanism and the NMI to try to determine if this is an issue with the reverse ping mechanism supported by the receivers.

This was recommended by @msfschaffner.

@moidx

moidx commented May 3, 2024

Measured alert NMI trigger time

Running the test with the alerts and ping configured, but without triggering the alert, results in the first alert NMI firing between 500 and 4000 microseconds.

This was measured with rv_timer using an equivalent 1 us tick.

alert_handler NMI trigger time when alert ping is disabled

The interrupt fires consistently within 8-9 us. This seems to indicate that the issue is due to a ping timeout.

Reverse ping timeout calculation

The reverse ping timeout calculation is done using the following formula available in
prim_esc_receiver:

4 * N_ESC_SEV * (2 * 2 * 2^PING_CNT_DW)

pwrmgr is the only block consuming the N_ESC_SEV and PING_CNT_DW compile time
parameters:

alert_handler_reg_pkg::N_ESC_SEV = 4
alert_handler_reg_pkg::PING_CNT_DW = 16

The alert escalation responder inside pwrmgr is connected to the io_div4 clock,
yielding a target 24 MHz frequency. The resulting expected timeout based on the
above parameters is thus:

reverse_ping_timeout = 0.175s = (4 * 4 * (2 * 2 * 2^16)) / 24e6
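
To make the arithmetic above easy to re-check with other parameter values, here is a minimal C sketch of the same calculation. The function and constant names (`reverse_ping_timeout_cycles`, `cycles_to_us`, `kNEscSev`, `kPingCntDw`, `kIoDiv4Hz`) are hypothetical helpers made up for this comment and are not part of the OpenTitan code base. The 24 MHz value is the io_div4 target frequency quoted above, not a measured clock.

```c
#include <assert.h>
#include <stdint.h>

// Parameter values quoted above from alert_handler_reg_pkg, plus the
// io_div4 target frequency. The names are hypothetical, for this sketch only.
enum { kNEscSev = 4, kPingCntDw = 16, kIoDiv4Hz = 24000000 };

// Timeout in clock cycles: 4 * N_ESC_SEV * (2 * 2 * 2^PING_CNT_DW).
static uint64_t reverse_ping_timeout_cycles(uint32_t n_esc_sev,
                                            uint32_t ping_cnt_dw) {
  // Sanity bounds chosen for this sketch so the product fits in 64 bits.
  assert(n_esc_sev <= 16 && ping_cnt_dw <= 32);
  return 4u * (uint64_t)n_esc_sev * (2u * 2u * ((uint64_t)1 << ping_cnt_dw));
}

// Converts a cycle count to whole microseconds, rounding down.
static uint64_t cycles_to_us(uint64_t cycles, uint64_t clk_hz) {
  assert(clk_hz > 0);
  return cycles * 1000000u / clk_hz;
}
```

With the values above, `reverse_ping_timeout_cycles(kNEscSev, kPingCntDw)` is 4194304 cycles, and `cycles_to_us(4194304, kIoDiv4Hz)` is 174762 us, which matches the 0.175 s figure and can be compared against the 500 to 4000 microsecond window measured with rv_timer.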

The interrupt trigger measurement does not seem to rule out any potential issues with the reverse ping mechanism.

alert_handler configuration

  uint32_t cycles[3] = {0};
  CHECK_STATUS_OK(alert_handler_testutils_get_cycles_from_us(
      kEscalationPhase0Micros, &cycles[0]));
  CHECK_STATUS_OK(alert_handler_testutils_get_cycles_from_us(
      kEscalationPhase2Micros, &cycles[1]));
  CHECK_STATUS_OK(alert_handler_testutils_get_cycles_from_us(kIrqDeadlineMicros,
                                                             &cycles[2]));
  dif_alert_handler_escalation_phase_t esc_phases[] = {
      {.phase = kDifAlertHandlerClassStatePhase0,
       .signal = 0,
       .duration_cycles = cycles[0]},
      {.phase = kDifAlertHandlerClassStatePhase1,
       .signal = 3,
       .duration_cycles = cycles[1]}};
  dif_alert_handler_class_config_t class_config[] = {{
      .auto_lock_accumulation_counter = kDifToggleDisabled,
      .accumulator_threshold = 0,
      .irq_deadline_cycles = cycles[2],
      .escalation_phases = esc_phases,
      .escalation_phases_len = ARRAYSIZE(esc_phases),
      .crashdump_escalation_phase = kDifAlertHandlerClassStatePhase2,
  }};

  dif_alert_handler_alert_t alerts[] = {kTopEarlgreyAlertIdAesRecovCtrlUpdateErr};
  dif_alert_handler_class_t alert_classes[] = {kDifAlertHandlerClassA};
  dif_alert_handler_class_t classes[] = {kDifAlertHandlerClassA};
  dif_alert_handler_config_t config = {
      .alerts = alerts,
      .alert_classes = alert_classes,
      .alerts_len = ARRAYSIZE(alerts),
      .classes = classes,
      .class_configs = class_config,
      .classes_len = ARRAYSIZE(class_config),
      .ping_timeout = 0x100,
  };

@moidx

moidx commented May 4, 2024

@andreaskurth, this test is reproducible without a ROM_EXT running. @a-will suggested we can try to get a DV test configuration ready to run in GLS in case this is something we want to try.

CC: @sha-ron @OTshimeon

@andreaskurth

By @moidx: We can run chip_sw_rv_core_ibex_nmi_irq with test ROM on the netlist. We need to update the testcase to not trigger a fake alert, just wait. Then we shouldn't see any NMIs and can wait for the timeout. We should run this GLS over the next weekend. @moidx will create a test case.

@moidx

moidx commented Jun 2, 2024

Created the test case in #23441 and added a test point to the GLS test plan. If we are unable to debug further on Z1, I propose we close this issue and consider dropping the ping mechanism from A1 if we continue to run into problems during bring-up. We can make this decision as part of M5 triage.

@johannheyszl FYI, since we'll have to test alert handler behavior with pinging mechanism disabled for Z1.

@vogelpi

vogelpi commented Jun 4, 2024

Discussed during triage meeting. Okay to move this to M5.

@johannheyszl

@moidx thanks for the heads up. IMHO this is OK for the testing we currently do, i.e. not fully invasive and cutting wires.

@andreaskurth andreaskurth added the Triage Priority Issue to be discussed with priority in the next triage meeting label Jul 4, 2024
@andreaskurth

andreaskurth commented Jul 4, 2024

Moving this to M6 as P1. It should be tested early on the final netlist. CC @sha-ron, we'll discuss this in our next meeting.

@andreaskurth andreaskurth removed the Triage Priority Issue to be discussed with priority in the next triage meeting label Jul 4, 2024
@moidx

moidx commented Jul 24, 2024

#24119 tracks the findings after running the sw_alert_handler_ping_ok test post synthesis.

@vogelpi

vogelpi commented Jul 26, 2024

@moidx We could add this test to the GLS test plan as a P2. It should be passing now with the latest ECO fixes in.

@andreaskurth

Confirmed that chip_sw_rv_core_ibex_nmi_irq is on the GLS testplan. Suggest moving this to M7 (still as P1).

@moidx

moidx commented Aug 18, 2024

Closing this issue as the sw_alert_handler_ping_ok test is now passing in GLS. We can create a new issue if we are able to run rv_core_ibex_nmi_req_irq_test in GLS and it results in a failure.

@moidx moidx closed this as completed Aug 18, 2024