GNSS Logger Logging Failure #137

Closed
derekpickell opened this issue Apr 14, 2022 · 13 comments

Comments

@derekpickell
Contributor

Hi all. I've encountered an issue where my Artemis - ZED-F9P system will unexpectedly and permanently freeze/fail while logging RAWX/SFRBX messages. It has been incredibly hard to reproduce this failure (occurred 16 hours into a logging session, and after 12 hours a second time), and I'm having a hard time narrowing down where the root cause may be coming from, so I figured I'd ask some of the experts here.

  • Software versions:
    Recently upgraded to GNSS Library v2.2.8 from v2.0.15. Apollo/Artemis Core v1.2.3. I'm using the RAWX Logging without Callbacks example modified with SdFat and a WDT.

  • How is the breakout board wired to your microcontroller?
    I2C @ 400kHz with all pull-up jumpers disabled.

  • How is everything being powered?
    Issue encountered while powered with battery. Currently seeing if I can replicate the issue over USB with serial debug messages.

Steps to reproduce

Unfortunately, the first time this issue appeared I was not outputting any debug messages to serial. However, the behavior is qualitatively similar to this issue https://github.com/sparkfun/OpenLog_Artemis_GNSS_Logger/issues/8 in that the STAT LED was no longer flashing but the power LED was on. I also had all constellations enabled and NavigationFrequency = 1, whereas in the past I only ever used GPS L1/L2. The second time around I output "important GNSS messages", which read "checkUbloxI2C: I2C error: endTransmission returned 4" ad infinitum.
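Since the error repeated "ad infinitum", one possible mitigation is to count consecutive failed transfers and trigger a recovery once a limit is reached (for example, by no longer feeding the WDT so the board resets). The following is only a minimal sketch of that idea. `I2cFaultTracker`, `recordResult` and `needsRecovery` are hypothetical names and are not part of the SparkFun u-blox GNSS library or the Apollo3 core; the only assumption taken from the Arduino Wire API is that `endTransmission` returns 0 on success.

```cpp
#include <cstdint>

// Hypothetical helper (not part of any library mentioned in this thread).
// Tracks how many I2C transfers have failed in a row.
struct I2cFaultTracker {
    std::uint16_t consecutiveErrors;  // failures since the last success
    std::uint16_t threshold;          // failures in a row that trigger recovery
};

// Returns an updated tracker. A code of 0 (success in the Arduino Wire API)
// resets the count; any other code increments it, saturating at 65535.
constexpr I2cFaultTracker recordResult(I2cFaultTracker t, std::uint8_t endTransmissionCode) {
    if (endTransmissionCode == 0) {
        t.consecutiveErrors = 0;
    } else if (t.consecutiveErrors < 0xFFFF) {
        t.consecutiveErrors = static_cast<std::uint16_t>(t.consecutiveErrors + 1);
    }
    return t;
}

// True once the bus has failed `threshold` times in a row.
constexpr bool needsRecovery(I2cFaultTracker t) {
    return t.consecutiveErrors >= t.threshold;
}
```

In a logging loop, the result of each transfer would be passed to `recordResult`, and the sketch would stop feeding the watchdog (or re-initialize the bus) as soon as `needsRecovery` returns true. The threshold value is a tuning choice, not something established in this thread.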

As I mentioned, it's been hard to debug because of the difficulty in reproducing the issue. There are quite a few variables I'd like to test (number of constellations, I2C clock speed, reverting to GNSS library version 2.0.15, eliminating load on the Artemis by disabling SPI/Serial streams)...

Thanks!

@PaulZC
Collaborator

PaulZC commented Apr 14, 2022

Hi Derek (@derekpickell ),

Thanks for reporting. I will try to replicate...

Which physical board are you using? The MicroMod Data Logging Carrier Board with Artemis Processor Board?

Which version of SdFat are you using? How is your SD card formatted (FAT32 or exFAT)?

Do you need to use 400kHz I2C? Could you make do with 100kHz?

The other thing to throw into the mix is the Apollo3 core version. I will give v2.2.1 a try first.

With Apollo3 v1.2.1, error 4 is "other error" (returned from am_hal_iom_blocking_transfer):

	case AM_HAL_STATUS_INVALID_OPERATION:
	case AM_HAL_STATUS_INVALID_ARG:
	case AM_HAL_STATUS_INVALID_HANDLE:
	default:
		return 4;
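
For context, the return codes of `Wire.endTransmission()` as documented for the standard Arduino Wire API can be mapped to text as in the sketch below. `describeEndTransmission` is a hypothetical helper name, not a function from the Apollo3 core or the GNSS library, and individual cores may not use every code (per the excerpt above, Apollo3 v1.2.1 folds several HAL failures into code 4).

```cpp
#include <cstdint>

// Hypothetical helper: maps the documented Arduino Wire endTransmission()
// return codes to short descriptions. Cores may use only a subset of these.
constexpr const char* describeEndTransmission(std::uint8_t code) {
    switch (code) {
        case 0:  return "success";
        case 1:  return "data too long to fit in transmit buffer";
        case 2:  return "NACK received on transmit of address";
        case 3:  return "NACK received on transmit of data";
        case 4:  return "other error";
        case 5:  return "timeout";
        default: return "unknown code";
    }
}
```

Printing this text alongside the raw code in debug output makes a long log easier to scan, although code 4 by itself still does not say which underlying failure occurred.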

That could be tricky to diagnose, but I'll give it a go...

Best wishes,
Paul

@derekpickell
Contributor Author

derekpickell commented Apr 14, 2022

Hi @PaulZC,

Thanks for the fast reply. To answer your questions:

  1. I am using the Artemis Module with a ZED-F9P on a custom PCB, but with a similar architecture to the MMDLCB + Artemis combo.
  2. SdFat v2.1.0 with 32GB Sandisk extreme formatted to FAT32
  3. 100kHz should theoretically be OK given my 1Hz nav rate. It's on my list of "variables" to try to isolate and tweak to see if there is any improvement.

@PaulZC
Collaborator

PaulZC commented Apr 14, 2022

Hi Derek,

Thanks for this. I will try and get as close as I can with the hardware I have available.

I hate to say it, but this could be down to a hardware glitch…

I’ll let you know what I find.

All the best,
Paul

@PaulZC
Collaborator

PaulZC commented Apr 14, 2022

Hi Derek,

I'm running a test using the attached code (see zip file below).

I'm using: v2.2.9 of this library; Apollo v2.2.1 (latest); SdFat 2.1.2 (latest); 16GB SanDisk "EDGE" card - FAT32 - freshly formatted using the SD Association formatter; MicroMod Data Logging Carrier Board; MicroMod Artemis Processor Board; ZED-F9P GPS RTK2 connected via Qwiic (100kHz, no pull-ups); a dual-band antenna with a good but not perfect view of the sky. The ZED is running F9 HPG 1.30 (latest) - protocol version 27.30.

I'm logging around 2-3KBytes/sec:

image

I'll let you know how it goes. I'll leave it for ~36 hours, unless I see it crash before then.
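
As a rough sanity check on the figures above, the arithmetic below estimates how much of the I2C bus a 2-3KBytes/sec stream occupies and how much data a ~36 hour run produces. This is a back-of-the-envelope sketch only: it assumes 9 clock cycles per byte (8 data bits plus the ACK bit), takes 1 KByte as 1000 bytes, and ignores address bytes, start/stop conditions and the library's polling traffic, so real bus load will be higher. The function names are illustrative.

```cpp
#include <cstdint>

// Each byte on the I2C bus costs 9 clock cycles: 8 data bits plus ACK/NACK.
constexpr std::uint32_t kClocksPerByte = 9;

// Approximate share of the bus clock consumed by payload bytes, in whole
// percent (rounded down). Protocol overhead is deliberately ignored.
constexpr std::uint32_t busLoadPercent(std::uint32_t bytesPerSecond, std::uint32_t clockHz) {
    return (bytesPerSecond * kClocksPerByte * 100U) / clockHz;
}

// Total bytes written to the SD card over a run of the given length.
constexpr std::uint64_t bytesLogged(std::uint32_t bytesPerSecond, std::uint32_t hours) {
    return static_cast<std::uint64_t>(bytesPerSecond) * hours * 3600ULL;
}

// At the upper figure of 3000 bytes/sec:
//   busLoadPercent(3000, 100000) is 27 (percent of a 100kHz bus)
//   busLoadPercent(3000, 400000) is 6  (percent of a 400kHz bus)
//   bytesLogged(3000, 36)        is 388800000 (about 389 MB in 36 hours)
```

Under these assumptions, 100kHz leaves a comfortable margin at this data rate, which is consistent with the suggestion earlier in the thread that 100kHz could be enough.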

All the best,
Paul

DataLoggingExample4_RXM_without_Callbacks_SdFat.zip

@PaulZC
Collaborator

PaulZC commented Apr 15, 2022

14 hours in and it is still chugging along nicely...

Note to self:

I'm not quite using a vanilla copy of Apollo3 v2.2.1. My copy includes paulvha's SPI end fix:

In libraries/SPI/src/SPI.cpp change end() to:

void arduino::MbedSPI::end() {
  if (dev) {
    delete dev;
    dev = NULL;
  }
}      

(The dev = NULL is important.)
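
To illustrate why resetting the pointer matters, here is a small stand-alone model of the pattern that runs on a desktop compiler. `FakeDevice` and `FakeSpiBus` are hypothetical stand-ins and not the real `arduino::MbedSPI` class. The sketch also assumes that `begin()` only allocates when the pointer is null, which is not confirmed in this thread. Under that assumption, leaving a stale pointer after `delete` would make a second `end()` delete the same object twice and would make a later `begin()` skip re-allocation.

```cpp
// Hypothetical stand-in for the underlying SPI device object.
struct FakeDevice {
    static int liveCount;               // number of currently allocated devices
    FakeDevice() { ++liveCount; }
    ~FakeDevice() { --liveCount; }
};
int FakeDevice::liveCount = 0;

// Hypothetical stand-in for the bus wrapper, following the same end() pattern.
class FakeSpiBus {
public:
    // Assumption: allocate only if no device exists yet.
    void begin() {
        if (dev == nullptr) {
            dev = new FakeDevice();
        }
    }

    void end() {
        if (dev) {
            delete dev;
            dev = nullptr;  // makes a repeated end() a no-op and lets begin() re-allocate
        }
    }

    bool isOpen() const { return dev != nullptr; }

    ~FakeSpiBus() { end(); }

private:
    FakeDevice* dev = nullptr;
};
```

With the pointer reset in place, calling `end()` twice in a row is harmless and a begin/end/begin cycle leaves exactly one live device.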

@PaulZC
Collaborator

PaulZC commented Apr 16, 2022

Hi Derek (@derekpickell ),

No signs of badness with this test...

image

image

I re-started the test once, after ~100kBytes, which is why the "bytes written" doesn't quite match. The logged data is completely clean.

I'll give 400kHz a try. My money's on that being the cause...

Best wishes,
Paul

@PaulZC
Collaborator

PaulZC commented Apr 18, 2022

Hi Derek (@derekpickell ),

Sorry. No clues here. I left the 400kHz test running for almost 48 hours and the data is completely clean:

image

image

Just to summarize, I was using:

  • Arduino 1.8.19
  • v2.2.9 of this library
  • Apollo v2.2.1 (latest) - modified with the SPI.end fix described above
  • SdFat 2.1.2 (latest)
  • 16GB SanDisk "EDGE" card - FAT32 - freshly formatted using the SD Association formatter
  • MicroMod Data Logging Carrier Board
  • MicroMod Artemis Processor Board
  • ZED-F9P GPS RTK2 connected via Qwiic
    • 400kHz
    • Pull-up resistors disabled on the ZED board
    • Pull-up resistors disabled on the MMDL Carrier Board
    • Artemis internal pull-ups disabled as per the zip file attached above
  • A dual-band antenna with a good but not perfect view of the sky
  • The ZED has been updated to F9 HPG 1.30 (latest) - protocol version 27.30
    • The release notes for this firmware do mention "Improved I2C interface robustness" - please do upgrade if required

I have of course seen I2C badness in the past, especially on Artemis, especially at 400kHz, especially with pull-ups enabled. I'll try a quick test with the pull-ups enabled just to see if I can replicate your issue.

It seems more likely that your issue is caused by the Apollo3 core, not this library. But I'm happy to try to help you debug the issue - if I can.

All the best,
Paul

@PaulZC
Collaborator

PaulZC commented Apr 18, 2022

OK. With the Artemis pull-ups enabled, the ZED board pull-ups disabled, I see bus errors at 400kHz:

image

At 100kHz, there are fewer checksum errors, but they are still present:

image

@PaulZC
Collaborator

PaulZC commented Apr 18, 2022

I've downgraded the ZED to HPG 1.13. I'll do a test using that, just to see if I can replicate your I2C endTransmission error.

@PaulZC
Collaborator

PaulZC commented Apr 18, 2022

Ah ha! We might have a winner!!

image

image

For this test, I was using:

  • 400kHz I2C
  • Logging RAWX only (no SFRBX) at 2Hz, instead of the normal 1Hz, from all constellations
  • No pull-ups (Artemis internal pull-ups disabled)
  • ZED-F9P HPG 1.13 (not 1.30)

The endTransmission errors appeared only an hour or so into the test. I didn't see exactly when it happened.

I don't know if this is the smoking gun we're looking for, but it is certainly mighty suspicious! Logging SFRBX and RAWX at 1Hz with HPG 1.30 at 400kHz I2C ran for almost 48 hours with no errors. Switching to RAWX only at 2Hz with HPG 1.13 at 400kHz produced a failure within approx. an hour... The HPG version appears critical.

Can you please confirm which version of HPG your ZED is running? This tutorial may help.
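
One way to guard against running on older firmware is to compare the reported firmware string with a minimum version at start-up. The sketch below assumes the string has the form "HPG 1.13", as the versions are written in this thread; how the string is read from the module is outside the scope of the example. `FwVersion`, `parseHpgVersion` and `isAtLeast` are hypothetical names, not library functions, and the 1.30 minimum simply reflects the version that ran cleanly in the tests above.

```cpp
// Hypothetical version holder: "HPG 1.13" becomes {1, 13}.
struct FwVersion {
    int majorNum;
    int minorNum;
};

// Parses the first "<digits>.<digits>" pair found in the string.
// Missing parts are left as 0.
constexpr FwVersion parseHpgVersion(const char* s) {
    FwVersion v{0, 0};
    // Skip everything before the first digit (for example "HPG ").
    while (*s != '\0' && (*s < '0' || *s > '9')) {
        ++s;
    }
    while (*s >= '0' && *s <= '9') {
        v.majorNum = v.majorNum * 10 + (*s - '0');
        ++s;
    }
    if (*s == '.') {
        ++s;
    }
    while (*s >= '0' && *s <= '9') {
        v.minorNum = v.minorNum * 10 + (*s - '0');
        ++s;
    }
    return v;
}

// True if v is the same as or newer than the required version.
constexpr bool isAtLeast(FwVersion v, FwVersion required) {
    return v.majorNum > required.majorNum ||
           (v.majorNum == required.majorNum && v.minorNum >= required.minorNum);
}
```

For example, `isAtLeast(parseHpgVersion("HPG 1.13"), FwVersion{1, 30})` is false, while the same check on "HPG 1.30" is true, so a logger could print a warning before starting a long session on older firmware.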

Best wishes,
Paul

@derekpickell
Contributor Author

Wow, great find!! That certainly seems quite suspicious. Looking at my SparkFun ZED (SMA) boards, I see HPG 1.12, while my custom modules have HPG 1.13. I'll update all of them to HPG 1.30 and run a trial as soon as I wrap up my current test, which has been running at 100kHz for 24+ hrs now without issues. So something seems up with the 400kHz + HPG 1.13 combo...
-Derek

@derekpickell
Contributor Author

My combo of 100kHz I2C and HPG 1.30 has been running on 2 boards for ~48 hours now and both are happily blinking away. I feel comfortable saying that the firmware update fixed the issue (the u-blox release notes are very vague about what I2C improvements were made under the hood, and it is strange that HPG 1.12 doesn't present any problems either). Thanks for the help!

@PaulZC
Collaborator

PaulZC commented Apr 21, 2022

No problem Derek - glad that's working for you!

Please close this issue once you're happy.

Very best wishes,
Paul
