Bug #4428
closedstrange reboot loop
100%
Description
I'm not able to trigger this reliably, but I've occasionally seen the SAM3S (ST12) on QMOD perform a constant reboot loop like this:
============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba� ============================================================================= SIMtrace2 firmware 0.7.0.49-ba�
The actual firmware installed is:
============================================================================= SIMtrace2 firmware 0.7.0.49-baa6, BOARD=qmod, APP=cardem (C) 2010-2019 by Harald Welte, 2018-2019 by Kevin Redon =============================================================================
Attaching with OpenOCD and even issuing a reset there didn't help. Only after
dumping the flash contents (attached), and a restart of OpenOCD + "reset" command
the firmware would boot again normally.
Files
Related issues
Updated by laforge about 4 years ago
this seems to be reproducable by manually resetting the board with 'R' on the serial console.
When breaking in openocd, I get
target halted due to debug-request, current mode: Handler HardFault xPSR: 0x00000003 pc: 0x35323030 msp: 0x51203100 > reg ===== arm v7m registers (0) r0 (/32): 0x2000721C (1) r1 (/32): 0x5A00000F (2) r2 (/32): 0x35323030 (3) r3 (/32): 0x400E0A00 (4) r4 (/32): 0x20000190 (5) r5 (/32): 0x00000200 (6) r6 (/32): 0x24004D0C (7) r7 (/32): 0x02000000 (8) r8 (/32): 0x00200002 (9) r9 (/32): 0x00000000 (10) r10 (/32): 0x20200015 (11) r11 (/32): 0x9A130220 (12) r12 (/32): 0x00000000 (13) sp (/32): 0x51203100 (14) lr (/32): 0xFFFFFFF9 (15) pc (/32): 0x35323030 (16) xPSR (/32): 0x00000003 (17) msp (/32): 0x51203100 (18) psp (/32): 0x50481000 (19) primask (/1): 0x00 (20) basepri (/8): 0x00 (21) faultmask (/1): 0x00 (22) control (/2): 0x00
Updated by laforge about 4 years ago
this can also be triggered on ST12 if an overly-long OUT message is sent. See also #4429.
Updated by laforge about 4 years ago
- Related to Bug #4429: cardem firmware crashes on too long OUT EP message added
Updated by laforge about 4 years ago
Even having the board fully disconnected all night long (no power, no UART, no JTAG) doesn't resolve the problem upon power-up. Either the flash content is somehow damaged, or the board (QMOD Serial Nr. 230) has some hardware issue. I'll continue with a different board.
Updated by laforge about 4 years ago
- Status changed from New to In Progress
- % Done changed from 0 to 10
I upgraded another board to most recent DFU loader and cardem firmware (0.7.0.50), and I can reproduce the problem: Pushing the "R" button on the serial console of ST12 triggers the problem:
Before pressing 'R':
-I- 1: WWAN LED 0 -I- 1: WWAN LED 1 -I- 1: WWAN LED 0 -I- 1: WWAN LED 1 -I- 1: WWAN LED 0 -I- 1: WWAN LED 1 -I- 1: WWAN LED 0 -I- 1: WWAN LED 1 -I- 1: WWAN LED 0 -I- 1: WWAN LED 1 -I- 1: WWAN LED 0 -I- 1: WWAN LED 1 -I- 1: WWAN LED 0 -I- 1: WWAN LED 1 -I- 1: WWAN LED 0 -I- 1: WWAN LED 1 -I- 1: WWAN LED 0 -I- 1: WWAN LED 1 -I- 1: WWAN LED 0 -I- 1: WWAN LED 1 -I- 1: WWAN LED 0 -I- 1: WWAN LED 1 -I- 1: WWAN LED 0 -I- 1: WWAN LED 1 -I- 1: WWAN LED 0 -I- 1: WWAN LED 1 -I- 1: WWAN LED 0 -I- 1: WWAN LED 1 -I- 1: WWAN LED 0
After pressing the 'R' key on the serial console:
RAsking NVIC to reset us � ============================================================================= SIMtrace2 firmware 0.7.0.50-02� ============================================================================= SIMtrace2 firmware 0.7.0.50-02� ============================================================================= SIMtrace2 firmware 0.7.0.50-02� ============================================================================= SIMtrace2 firmware 0.7.0.50-02
Updated by laforge about 4 years ago
- File 20200301-qmod-resetloop.dd 20200301-qmod-resetloop.dd added
- File 20200301-qmod-resetloop.dfu 20200301-qmod-resetloop.dfu added
- File 20200301-qmod-resetloop.cardem 20200301-qmod-resetloop.cardem added
- File qmod-dfu-flash-0.7.0.50-0271.bin qmod-dfu-flash-0.7.0.50-0271.bin added
- File qmod-cardem-dfu-0.7.0.50-0271.bin qmod-cardem-dfu-0.7.0.50-0271.bin added
- Assignee set to laforge
- Priority changed from Low to Normal
I've used "flash read_bank 0 /tmp/20200301-qmod-resetloop.dd 0 512000" to dump the flash and compared its contents with the binary files I had just flashed beforehand. They are a 1:1 match according to vbindiff. So no flash corruption.
Updated by laforge about 4 years ago
I'm now also seeing this occasionally on ST34:
-I- 1: WWAN LED 0 � ============================================================================= SIMtrace2 firmware 0.7.0.50-02� ============================================================================= SIMtrace2 firmware 0.7.0.50-02� ============================================================================= SIMtrace2 firmware 0.7.0.50-02�
Resetting the ST34 from ST12 also doesn't cause any change, it straight away goes into the same loop.
Updated by laforge about 4 years ago
openocd results from after the loop starting:
,
> halt target halted due to debug-request, current mode: Handler HardFault xPSR: 0x40000003 pc: 0x35323034 msp: 0x200071d0 > step target halted due to single-step, current mode: Handler HardFault xPSR: 0x40000003 pc: 0x35323034 msp: 0x200071d0 halted: PC: 0x35323034 > reg ===== arm v7m registers (0) r0 (/32): 0x2000721C (1) r1 (/32): 0x5A00000F (2) r2 (/32): 0x35323034 (3) r3 (/32): 0x400E0A00 (4) r4 (/32): 0x20000190 (5) r5 (/32): 0x00000200 (6) r6 (/32): 0x0040A988 (7) r7 (/32): 0x20000190 (8) r8 (/32): 0x00000000 (9) r9 (/32): 0x00001000 (10) r10 (/32): 0x20002D60 (11) r11 (/32): 0x20000198 (12) r12 (/32): 0x00000000 (13) sp (/32): 0x200071D0 (14) lr (/32): 0xFFFFFFF1 (15) pc (/32): 0x35323034 (16) xPSR (/32): 0x40000003 (17) msp (/32): 0x200071D0 (18) psp (/32): 0x80000000 (19) primask (/1): 0x00 (20) basepri (/8): 0x00 (21) faultmask (/1): 0x00 (22) control (/2): 0x00
Updated by laforge about 4 years ago
Ok, it is - like suspected - a watchdog reset that causes the looping. Now the question is, of course, what gets stuck to make the watchdog trigger:
Look at RSTC_SR:
> mdw 0x400E1404 0x400e1404: 00010201
where the '2' indicates watchdog reset.
Watchdog config looks like this:
> mdw 0x400e1450 3 0x400e1450: 00000000 32002200 00000000
Updated by laforge about 4 years ago
Note: WDRPROC=0, which means the watchdog only performs a processor reset, but not a reset of all peripherals. So if some peripheral is in a weird state, it will continue to be so after watchdog reset.
Updated by laforge about 4 years ago
Interestingly, the PID4 (WDT) peripheral is not even clocked:
> mdw 0x400E0418 0x400e0418: 00001900
to make things more odd, it seems impossible to set the lower 8 bits at allvia PCER0; it seems they're permanently clocked without showing up in the PCSR0.
looking at address 0, we see
> mdw 0 16 0x00000000: 51203120 39513450 32303031 35323034 0ffc0000 00010040 0ff90000 00010120 0x00000020: 0ff80000 000100c0 0ff60000 00010240 0ff80000 00010180 00070000 0000ff00
At CPU reset time the VTOR is at 0x00000000, so the stack would be initialized to 0x51203120, the reset code vector t0 0x39513450 and hardfault to 0x35323034 - which is exactly where our PC currently is. Those are of course all invalid addresses.
Interstingly, looking at VTOR, it actually doesn't point to 0x00000000 anymore, but to the start of the 'application' partition:
> mdw 0xE000Ed08 0xe000ed08: 00404000
but there we see exactly the same crap:
> mdw 0x00404000 16 0x00404000: 51203120 39513450 32303031 35323034 0ffc0000 00010040 0ff90000 00010120 0x00404020: 0ff80000 000100c0 0ff60000 00010240 0ff80000 00010180 00070000 0000ff00
and even at the start of flash (DFU loader):
> mdw 0x00404000 16 0x00404000: 51203120 39513450 32303031 35323034 0ffc0000 00010040 0ff90000 00010120 0x00404020: 0ff80000 000100c0 0ff60000 00010240 0ff80000 00010180 00070000 0000ff00
But why am I getting completely different results when dumping the flash via JTAG/openocd? The data almost looks like ascii with ascii-encoded digits (0x30+x).
Updated by laforge about 4 years ago
From SAM3S Data sheet (Section 19.3.3.8 Unique Identifier):
- Send the Start Read unique Identifier command (STUI) by writing the Flash Command Register with the STUI command.
- When the Unique Identifier is ready to be read, the FRDY bit in the Flash Programming Status Register (EEFC_FSR) falls.
- The Unique Identifier is located in the first 128 bits of the Flash memory mapping. So, at the address 0x400000-0x40008F.
- To stop the Unique Identifier mode, the user needs to send the Stop Read unique Identifier command (SPUI) by writing the Flash Command Register with the SPUI command.
- When the Stop read Unique Identifier command (SPUI) has been performed, the FRDY bit in the Flash Programming Status Register (EEFC_FSR) rises. If an interrupt was enabled by setting the FRDY bit in EEFC_FMR, the interrupt line of the NVIC is activated.
Note that during the sequence, the software can not run out of Flash (or the second plane in case of dual plane).
Updated by laforge about 4 years ago
Ok, that did the trick to recover:
We can see that EEFC_FSR (third dword below) is all-zero, i.e. FRDY=0 indicating "The Enhanced Embedded Flash Controller (EEFC) is busy".
> mdw 0x400E0A00 16 0x400e0a00: 00010300 00000000 00000000 00000000 00000000 00000412 00000000 00000000 0x400e0a20: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Now we write the "Stop Read Unique Identifier" (SPUI) command 0x0F to the EEFC_FCR:
> mww 0x400e0a04 0x5a00000f
And afterwards, we can see FRDY is 1 (ready) again:
> mdw 0x400E0A00 16 0x400e0a00: 00010300 00000000 00000001 00000000 00000000 00000412 00000000 00000000 0x400e0a20: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
If we now issue a JTAG or othe reset, the firmware recovers.
So what is happening is that somehow a reset is happening after we issued the Start Read Unique Identifier, but before the code has been able to issue the Stop Read Unique Identifier. And in their infinite wisdom, Atmel is putting the Unique Identity exactly where the exception vector table is read on CPU reset -> boom.
Now the good question is: How to read the Unique ID in a way that a reset cannot happen in between?
Updated by laforge about 4 years ago
- % Done changed from 40 to 80
- our firmware starts up, enables interrupts
- we then call (once from the USB code, once from main) EEFC_ReadUniqueID()
- That function is correctly relocated to memory and not executing from flash
- however, interrupt handlers might get called meanwhile, and interrupt code is in flash
- reading flash is not possible between STUI and SPUI
- the interrupt handler deadlocks (or reads wrong vector from the vector table and runs into hardfault)
- the watchdog resets the processor
Some initial testing with __disable_irq() / __enable_irq() in EEFC_ReadUniqueID() have been rather positive.
Updated by laforge about 4 years ago
Updated by laforge about 4 years ago
- Status changed from In Progress to Resolved
- % Done changed from 80 to 100
patch merged