openwrt/target/linux
Leon M. Busch-George 98d325aaf8 ipq40xx: wpj428: panic on squashfs error to work around boot limbo
Apparently, a few ipq40xx devices have sporadic problems when reading the
flash over SPI. When that happens, the result of the faulty SPI read is
cached and it isn't re-attempted. Depending on when it happens, the router
either panics and reboots or is left in a partially broken state (an
application won't start).
The data on the flash itself is intact.

This wasn't the case with OpenWrt on Linux < 5.x, but I wasn't able to
work out which software change was responsible.

GitHub user karlpip created a patch for testing that disabled the cache
entirely and added logs. Typically, only one or two SPI operations fail at
a time:

  [689200.631152] spi-nor spi0.0: SPI transfer failed: -110
  [689200.631280] spi_master spi0: failed to transfer one message from queue
  [689200.635369] jffs2: Write of 68 bytes at 0x00ffccf4 failed. returned -110, retlen 0
  [689200.642014] jffs2: Not marking the space at 0x00ffccf4 as dirty because the flash driver returned retlen zero

Because reads aren't re-attempted, squashfs can't recover:

  [3171844.279235] SQUASHFS error: Failed to read block 0x2bb912: -5
  [3171844.279284] SQUASHFS error: Unable to read fragment cache entry [2bb912]
  [3171844.283980] SQUASHFS error: Unable to read page, block 2bb912, size 14e6c
  [3171844.291650] SQUASHFS error: Unable to read fragment cache entry [2bb912]
  [3171844.297831] SQUASHFS error: Unable to read page, block 2bb912, size 14e6c

I assume there to be some kind of underlying electrical problem because,
in my experience, this happens a lot more when PoE is used.

NoTengoBattery did an in-depth investigation:
https://forum.openwrt.org/t/patch-squashfs-data-probably-corrupt/70480

... and created a patch that evicts the page cache and retries reading:
https://github.com/NoTengoBattery/openwrt/blob/linksys-ea6350v3-mastertrack/target/linux/ipq40xx/patches-5.4/9996-fs_squashfs_improve_squashfs_error_resistance.patch

The patch also works well with the WPJ428 but NoTengoBattery didn't try to
upstream it ("This is not the solution that should be used").
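
As a rough illustration of the difference between caching a failed read
and evicting and retrying it, here is a toy model in Python. It is not
kernel code: FlakyFlash, StickyCache, RetryingCache and read_block are
hypothetical names made up for this sketch, and the retry count is an
assumption.

```python
# Toy model only. None of these names are kernel or squashfs APIs.

class FlakyFlash:
    """Flash whose data is fine but whose first reads of a block may fail."""

    def __init__(self, blocks, failures):
        self.blocks = blocks              # block number -> bytes on flash
        self.failures = dict(failures)    # block number -> reads that still fail

    def read(self, block):
        if self.failures.get(block, 0) > 0:
            self.failures[block] -= 1
            raise OSError("SPI transfer failed: -110")
        return self.blocks[block]


class StickyCache:
    """Remembers the outcome of the first read, including a failure."""

    def __init__(self, flash):
        self.flash = flash
        self.cache = {}

    def read_block(self, block):
        if block not in self.cache:
            try:
                self.cache[block] = self.flash.read(block)
            except OSError as err:
                self.cache[block] = err   # the failure itself gets cached
        entry = self.cache[block]
        if isinstance(entry, Exception):
            raise entry
        return entry


class RetryingCache(StickyCache):
    """Evicts a failed entry and reads the flash again, a few times."""

    def __init__(self, flash, attempts=3):
        super().__init__(flash)
        self.attempts = attempts          # assumed value, must be >= 1

    def read_block(self, block):
        last_error = None
        for _ in range(self.attempts):
            try:
                return super().read_block(block)
            except OSError as err:
                last_error = err
                self.cache.pop(block, None)   # drop the cached failure
        raise last_error
```

With a single transient failure, StickyCache keeps returning the error
even though the flash would answer correctly on a second read, while
RetryingCache recovers. A panic and reboot has a similar effect to the
eviction, since it discards the whole cache.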

In 2020, I tried and failed to create a working patch that prevents
faulty pages from being cached in the first place. Because I needed a
solution, I backported
  "squashfs: add option to panic on errors" (10dde05b89980ef)
which has since become available in OpenWrt.

The 'errors=panic' option has been tested on a fleet of several hundred
WPJ428s over multiple years. Without this patch, devices regularly went
into 'limbo' on reboot or update and required a manual reboot.
Devices with this patch don't. I was initially concerned that the kernel
panic would leave devices with really corrupted data, but I haven't seen a
case of actual corruption since (outside of people turning off the power
during upgrades).

The WPJ428 is the only device I tested this patch on - others might also
benefit.

Reviewed-by: Robert Marko <robimarko@gmail.com>
Signed-off-by: Leon M. Busch-George <leon@georgemail.eu>
2023-09-24 18:55:35 +02:00
..
airoha kernel: bump 5.15 to 5.15.125 2023-08-09 22:06:24 +02:00
apm821xx kernel: bump 6.1 to 6.1.53 2023-09-23 13:10:28 +02:00
archs38
armsr armsr: ensure kmod-fs-vfat is selected for mounting ESP 2023-09-24 12:51:14 +02:00
at91
ath25 kernel: bump 5.15 to 5.15.125 2023-08-09 22:06:24 +02:00
ath79 kernel: bump 6.1 to 6.1.54 2023-09-23 13:10:28 +02:00
bcm27xx kernel: bump 6.1 to 6.1.53 2023-09-23 13:10:28 +02:00
bcm47xx kernel: bump 5.15 to 5.15.132 2023-09-20 14:13:00 +02:00
bcm53xx kernel: bump 6.1 to 6.1.54 2023-09-23 13:10:28 +02:00
bcm63xx kernel: bump 5.15 to 5.15.125 2023-08-09 22:06:24 +02:00
bcm4908 kernel: bump 5.15 to 5.15.126 2023-08-13 13:03:43 +02:00
bmips kernel: bump 6.1 to 6.1.53 2023-09-23 13:10:28 +02:00
gemini gemini: Fix up kernel v6.1 config 2023-08-10 19:31:37 +02:00
generic kernel: bump 6.1 to 6.1.55 2023-09-24 12:45:34 +02:00
imx kernel: backport NVMEM patches queued for the v6.5 2023-06-16 09:45:38 +02:00
ipq40xx ipq40xx: wpj428: panic on squashfs error to work around boot limbo 2023-09-24 18:55:35 +02:00
ipq806x ipq806x: sync config-6.1 with latest kernel 2023-09-24 18:12:30 +02:00
kirkwood
lantiq kernel: bump 5.15 to 5.15.125 2023-08-09 22:06:24 +02:00
layerscape generic: sync MediaTek Ethernet driver with upstream 2023-08-28 16:35:22 +01:00
malta kernel: remove CRYPTO_BLAKE2S from all >=5.15 2023-07-08 16:54:01 +02:00
mediatek mediatek: add support for Buffalo WSR-3200AX4S 2023-09-24 18:42:12 +02:00
mpc85xx mpc85xx: correct WS-AP3715i eth LED assignment 2023-09-21 01:10:40 +02:00
mvebu mvebu: eDPU: add support for version with external switch 2023-09-19 12:12:17 +02:00
mxs mxs: add testing kernel 6.1 2023-07-01 12:54:30 +02:00
octeon kernel: bump 5.15 to 5.15.125 2023-08-09 22:06:24 +02:00
octeontx kernel: bump 5.15 to 5.15.123 2023-07-30 18:02:47 +02:00
omap
oxnas kernel: bump 5.15 to 5.15.125 2023-08-09 22:06:24 +02:00
pistachio kernel: bump 5.15 to 5.15.132 2023-09-20 14:13:00 +02:00
qoriq
qualcommax ipq807x: add support for Netgear WAX620 2023-09-24 13:09:16 +02:00
ramips ramips: fix Mercusys MR70X LAN port assignments 2023-09-24 17:09:26 +02:00
realtek kernel: bump 5.15 to 5.15.132 2023-09-20 14:13:00 +02:00
rockchip rockchip: add support for Radxa ROCK Pi E 2023-09-05 00:20:51 +05:30
sifiveu kernel: bump 5.15 to 5.15.117 2023-06-16 19:44:28 +02:00
sunxi sunxi: generalize top-level BOARDNAME and update suported SoCs 2023-09-24 18:16:40 +02:00
tegra
uml kernel: bump 6.1 to 6.1.53 2023-09-23 13:10:28 +02:00
x86 x86: add 6.1 testing kernel 2023-09-19 11:38:38 +02:00
zynq
Makefile