Project

General

Profile

Actions

Bug #4428

closed

strange reboot loop

Added by laforge about 4 years ago. Updated about 4 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Category:
-
Target version:
-
Start date:
02/29/2020
Due date:
% Done:

100%

Spec Reference:

Description

I'm not able to trigger this reliably, but I've occasionally seen the SAM3S (ST12) on QMOD perform a constant reboot loop like this:

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

=============================================================================
SIMtrace2 firmware 0.7.0.49-ba�

The actual firmware installed is:

=============================================================================
SIMtrace2 firmware 0.7.0.49-baa6, BOARD=qmod, APP=cardem
(C) 2010-2019 by Harald Welte, 2018-2019 by Kevin Redon
=============================================================================

Attaching with OpenOCD and even issuing a reset there didn't help. Only after
dumping the flash contents (attached), and a restart of OpenOCD + "reset" command
the firmware would boot again normally.


Files

sam3s-flash_bank.dd sam3s-flash_bank.dd 500 KB laforge, 02/29/2020 02:14 PM
20200301-qmod-resetloop.dd 20200301-qmod-resetloop.dd 500 KB full flash image extracted via JTAG/openocd laforge, 03/01/2020 10:34 AM
20200301-qmod-resetloop.dfu 20200301-qmod-resetloop.dfu 16 KB first 16kb of image containing DFU laforge, 03/01/2020 10:34 AM
20200301-qmod-resetloop.cardem 20200301-qmod-resetloop.cardem 484 KB 16kb and upwards immage containing cardem laforge, 03/01/2020 10:34 AM
qmod-dfu-flash-0.7.0.50-0271.bin qmod-dfu-flash-0.7.0.50-0271.bin 15.9 KB DFU image that was flashed before laforge, 03/01/2020 10:34 AM
qmod-cardem-dfu-0.7.0.50-0271.bin qmod-cardem-dfu-0.7.0.50-0271.bin 32 KB cardem image that was flashed before laforge, 03/01/2020 10:34 AM

Related issues

Related to SIMtrace 2 - Bug #4429: cardem firmware crashes on too long OUT EP message Resolvedlaforge02/29/2020

Actions
Actions #1

Updated by laforge about 4 years ago

this seems to be reproducable by manually resetting the board with 'R' on the serial console.

When breaking in openocd, I get

target halted due to debug-request, current mode: Handler HardFault
xPSR: 0x00000003 pc: 0x35323030 msp: 0x51203100
> reg
===== arm v7m registers
(0) r0 (/32): 0x2000721C
(1) r1 (/32): 0x5A00000F
(2) r2 (/32): 0x35323030
(3) r3 (/32): 0x400E0A00
(4) r4 (/32): 0x20000190
(5) r5 (/32): 0x00000200
(6) r6 (/32): 0x24004D0C
(7) r7 (/32): 0x02000000
(8) r8 (/32): 0x00200002
(9) r9 (/32): 0x00000000
(10) r10 (/32): 0x20200015
(11) r11 (/32): 0x9A130220
(12) r12 (/32): 0x00000000
(13) sp (/32): 0x51203100
(14) lr (/32): 0xFFFFFFF9
(15) pc (/32): 0x35323030
(16) xPSR (/32): 0x00000003
(17) msp (/32): 0x51203100
(18) psp (/32): 0x50481000
(19) primask (/1): 0x00
(20) basepri (/8): 0x00
(21) faultmask (/1): 0x00
(22) control (/2): 0x00

Actions #2

Updated by laforge about 4 years ago

this can also be triggered on ST12 if an overly-long OUT message is sent. See also #4429.

Actions #3

Updated by laforge about 4 years ago

  • Related to Bug #4429: cardem firmware crashes on too long OUT EP message added
Actions #4

Updated by laforge about 4 years ago

Even having the board fully disconnected all night long (no power, no UART, no JTAG) doesn't resolve the problem upon power-up. Either the flash content is somehow damaged, or the board (QMOD Serial Nr. 230) has some hardware issue. I'll continue with a different board.

Actions #5

Updated by laforge about 4 years ago

  • Status changed from New to In Progress
  • % Done changed from 0 to 10

I upgraded another board to most recent DFU loader and cardem firmware (0.7.0.50), and I can reproduce the problem: Pushing the "R" button on the serial console of ST12 triggers the problem:

Before pressing 'R':

-I- 1: WWAN LED 0
-I- 1: WWAN LED 1
-I- 1: WWAN LED 0
-I- 1: WWAN LED 1
-I- 1: WWAN LED 0
-I- 1: WWAN LED 1
-I- 1: WWAN LED 0
-I- 1: WWAN LED 1
-I- 1: WWAN LED 0
-I- 1: WWAN LED 1
-I- 1: WWAN LED 0
-I- 1: WWAN LED 1
-I- 1: WWAN LED 0
-I- 1: WWAN LED 1
-I- 1: WWAN LED 0
-I- 1: WWAN LED 1
-I- 1: WWAN LED 0
-I- 1: WWAN LED 1
-I- 1: WWAN LED 0
-I- 1: WWAN LED 1
-I- 1: WWAN LED 0
-I- 1: WWAN LED 1
-I- 1: WWAN LED 0
-I- 1: WWAN LED 1
-I- 1: WWAN LED 0
-I- 1: WWAN LED 1
-I- 1: WWAN LED 0
-I- 1: WWAN LED 1
-I- 1: WWAN LED 0

After pressing the 'R' key on the serial console:

RAsking NVIC to reset us
�

=============================================================================
SIMtrace2 firmware 0.7.0.50-02�

=============================================================================
SIMtrace2 firmware 0.7.0.50-02�

=============================================================================
SIMtrace2 firmware 0.7.0.50-02�

=============================================================================
SIMtrace2 firmware 0.7.0.50-02

Actions #6

Updated by laforge about 4 years ago

I've used "flash read_bank 0 /tmp/20200301-qmod-resetloop.dd 0 512000" to dump the flash and compared its contents with the binary files I had just flashed beforehand. They are a 1:1 match according to vbindiff. So no flash corruption.

Actions #7

Updated by laforge about 4 years ago

I'm now also seeing this occasionally on ST34:

-I- 1: WWAN LED 0
�

=============================================================================
SIMtrace2 firmware 0.7.0.50-02�

=============================================================================
SIMtrace2 firmware 0.7.0.50-02�

=============================================================================
SIMtrace2 firmware 0.7.0.50-02�

Resetting the ST34 from ST12 also doesn't cause any change, it straight away goes into the same loop.

Actions #8

Updated by laforge about 4 years ago

openocd results from after the loop starting:
,

> halt
target halted due to debug-request, current mode: Handler HardFault
xPSR: 0x40000003 pc: 0x35323034 msp: 0x200071d0
> step
target halted due to single-step, current mode: Handler HardFault
xPSR: 0x40000003 pc: 0x35323034 msp: 0x200071d0
halted: PC: 0x35323034
> reg
===== arm v7m registers
(0) r0 (/32): 0x2000721C
(1) r1 (/32): 0x5A00000F
(2) r2 (/32): 0x35323034
(3) r3 (/32): 0x400E0A00
(4) r4 (/32): 0x20000190
(5) r5 (/32): 0x00000200
(6) r6 (/32): 0x0040A988
(7) r7 (/32): 0x20000190
(8) r8 (/32): 0x00000000
(9) r9 (/32): 0x00001000
(10) r10 (/32): 0x20002D60
(11) r11 (/32): 0x20000198
(12) r12 (/32): 0x00000000
(13) sp (/32): 0x200071D0
(14) lr (/32): 0xFFFFFFF1
(15) pc (/32): 0x35323034
(16) xPSR (/32): 0x40000003
(17) msp (/32): 0x200071D0
(18) psp (/32): 0x80000000
(19) primask (/1): 0x00
(20) basepri (/8): 0x00
(21) faultmask (/1): 0x00
(22) control (/2): 0x00

Actions #9

Updated by laforge about 4 years ago

Ok, it is - like suspected - a watchdog reset that causes the looping. Now the question is, of course, what gets stuck to make the watchdog trigger:

Look at RSTC_SR:

> mdw 0x400E1404
0x400e1404: 00010201 

where the '2' indicates watchdog reset.

Watchdog config looks like this:

> mdw 0x400e1450 3 
0x400e1450: 00000000 32002200 00000000 

Actions #10

Updated by laforge about 4 years ago

Note: WDRPROC=0, which means the watchdog only performs a processor reset, but not a reset of all peripherals. So if some peripheral is in a weird state, it will continue to be so after watchdog reset.

Actions #11

Updated by laforge about 4 years ago

Interestingly, the PID4 (WDT) peripheral is not even clocked:

> mdw 0x400E0418
0x400e0418: 00001900 

to make things more odd, it seems impossible to set the lower 8 bits at allvia PCER0; it seems they're permanently clocked without showing up in the PCSR0.

looking at address 0, we see

> mdw 0 16
0x00000000: 51203120 39513450 32303031 35323034 0ffc0000 00010040 0ff90000 00010120 
0x00000020: 0ff80000 000100c0 0ff60000 00010240 0ff80000 00010180 00070000 0000ff00 

At CPU reset time the VTOR is at 0x00000000, so the stack would be initialized to 0x51203120, the reset code vector t0 0x39513450 and hardfault to 0x35323034 - which is exactly where our PC currently is. Those are of course all invalid addresses.

Interstingly, looking at VTOR, it actually doesn't point to 0x00000000 anymore, but to the start of the 'application' partition:

> mdw 0xE000Ed08   
0xe000ed08: 00404000 

but there we see exactly the same crap:

> mdw 0x00404000 16
0x00404000: 51203120 39513450 32303031 35323034 0ffc0000 00010040 0ff90000 00010120 
0x00404020: 0ff80000 000100c0 0ff60000 00010240 0ff80000 00010180 00070000 0000ff00 

and even at the start of flash (DFU loader):

> mdw 0x00404000 16
0x00404000: 51203120 39513450 32303031 35323034 0ffc0000 00010040 0ff90000 00010120 
0x00404020: 0ff80000 000100c0 0ff60000 00010240 0ff80000 00010180 00070000 0000ff00 

But why am I getting completely different results when dumping the flash via JTAG/openocd? The data almost looks like ascii with ascii-encoded digits (0x30+x).

Actions #12

Updated by laforge about 4 years ago

From SAM3S Data sheet (Section 19.3.3.8 Unique Identifier):

  • Send the Start Read unique Identifier command (STUI) by writing the Flash Command Register with the STUI command.
  • When the Unique Identifier is ready to be read, the FRDY bit in the Flash Programming Status Register (EEFC_FSR) falls.
  • The Unique Identifier is located in the first 128 bits of the Flash memory mapping. So, at the address 0x400000-0x40008F.
  • To stop the Unique Identifier mode, the user needs to send the Stop Read unique Identifier command (SPUI) by writing the Flash Command Register with the SPUI command.
  • When the Stop read Unique Identifier command (SPUI) has been performed, the FRDY bit in the Flash Programming Status Register (EEFC_FSR) rises. If an interrupt was enabled by setting the FRDY bit in EEFC_FMR, the interrupt line of the NVIC is activated.
    Note that during the sequence, the software can not run out of Flash (or the second plane in case of dual plane).
Actions #13

Updated by laforge about 4 years ago

Ok, that did the trick to recover:

We can see that EEFC_FSR (third dword below) is all-zero, i.e. FRDY=0 indicating "The Enhanced Embedded Flash Controller (EEFC) is busy".

> mdw 0x400E0A00 16
0x400e0a00: 00010300 00000000 00000000 00000000 00000000 00000412 00000000 00000000 
0x400e0a20: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

Now we write the "Stop Read Unique Identifier" (SPUI) command 0x0F to the EEFC_FCR:

> mww 0x400e0a04 0x5a00000f

And afterwards, we can see FRDY is 1 (ready) again:

> mdw 0x400E0A00 16        
0x400e0a00: 00010300 00000000 00000001 00000000 00000000 00000412 00000000 00000000 
0x400e0a20: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 

If we now issue a JTAG or othe reset, the firmware recovers.

So what is happening is that somehow a reset is happening after we issued the Start Read Unique Identifier, but before the code has been able to issue the Stop Read Unique Identifier. And in their infinite wisdom, Atmel is putting the Unique Identity exactly where the exception vector table is read on CPU reset -> boom.

Now the good question is: How to read the Unique ID in a way that a reset cannot happen in between?

Actions #14

Updated by laforge about 4 years ago

  • % Done changed from 10 to 40
Actions #15

Updated by laforge about 4 years ago

  • % Done changed from 40 to 80
Ok, the likely sequence of events has been established as:
  • our firmware starts up, enables interrupts
  • we then call (once from the USB code, once from main) EEFC_ReadUniqueID()
  • That function is correctly relocated to memory and not executing from flash
  • however, interrupt handlers might get called meanwhile, and interrupt code is in flash
  • reading flash is not possible between STUI and SPUI
  • the interrupt handler deadlocks (or reads wrong vector from the vector table and runs into hardfault)
  • the watchdog resets the processor

Some initial testing with __disable_irq() / __enable_irq() in EEFC_ReadUniqueID() have been rather positive.

Actions #17

Updated by laforge about 4 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 80 to 100

patch merged

Actions

Also available in: Atom PDF

Add picture from clipboard (Maximum size: 48.8 MB)