PCI driver failed: Detected PCI bus error on device

Time:01-18

I'm trying to reset a specific PCI device using my own custom driver on a ppc64 (PowerPC) machine.

This driver works on another ppc64 machine.

This is the function responsible for performing the reset. I removed several lines of code to emphasize the important flow.

int reset_device(void)
{
    pdev =  g_reset_info.devs[ix];        
    err = pci_enable_device(pdev);

    if (err) {
        return err;
    }
    pci_set_master(pdev);
    err = pci_save_state(pdev);
    if (err) {
        return err;
    }

    pdev =  g_reset_info.devs[ix];

    err = pci_set_pcie_reset_state(pdev, pcie_hot_reset);
    if (err) {
        return err;
    }

    msleep(jiffies_to_msecs(HZ/2));
    msleep(jiffies_to_msecs(HZ/2));

    pdev =  g_reset_info.devs[ix];

    err = pci_set_pcie_reset_state(pdev, pcie_deassert_reset);
    if (err) {
        return err;
    }

    pdev =  g_reset_info.devs[ix];
    pci_restore_state(pdev);

    msleep(jiffies_to_msecs(HZ/2));
    msleep(jiffies_to_msecs(HZ/2));

    return 0;
}

This is the output from dmesg:

mst_ppc_pci_reset_driver reset_device 63 Send hot reset to device: 0000:50:00.0 
mst_ppc_pci_reset_driver reset_device 81 Deassert device: 0000:50:00.0 
Call Trace: 
[c000000186f92fe0] [c0000000000155ac] .show_stack 0x6c/0x198 (unreliable) 
[c000000186f93090] [c000000000076a8c] .eeh_dn_check_failure 0x354/0x3f0 
[c000000186f93150] [c000000000029b7c] .rtas_read_config 0x13c/0x198 
[c000000186f931f0] [c00000000039c8d0] .pci_bus_read_config_word 0xa0/0xf8 
[c000000186f932b0] [c0000000003a2730] .pci_find_capability 0x40/0xd0 
[c000000186f93360] [c0000000003a2b6c] .pci_restore_pcie_state 0x54/0x2e8 
[c000000186f93410] [c0000000003a501c] .pci_restore_state 0x84/0x1b8 
[c000000186f934d0] [d000000003810384] .reset_device 0x184/0x430 [mst_ppc_pci_reset] 
[c000000186f93590] [c0000000003a6254] .local_pci_probe 0x7c/0xf8 
[c000000186f93620] [c0000000003a63a8] .__pci_device_probe 0xd8/0x128 
[c000000186f936d0] [c0000000003a72a8] .pci_device_probe 0x38/0x68 
[c000000186f93760] [c0000000004d0bd8] .really_probe 0xb0/0x288 
[c000000186f93810] [c0000000004d0e4c] .driver_probe_device 0x9c/0x110 
[c000000186f938a0] [c0000000004d0fbc] .__driver_attach 0xfc/0x100 
[c000000186f93930] [c0000000004cfee4] .bus_for_each_dev 0xc4/0x118 
[c000000186f939e0] [c0000000004d08a8] .driver_attach 0x28/0x40 
[c000000186f93a60] [c0000000004cf3b0] .bus_add_driver 0x190/0x340 
[c000000186f93b10] [c0000000004d1950] .driver_register 0x98/0x1b8 
[c000000186f93bb0] [c0000000003a760c] .__pci_register_driver 0x64/0x140 
[c000000186f93c50] [d0000000038107c0] .init 0x28/0x400 [mst_ppc_pci_reset] 
[c000000186f93cd0] [c00000000000ab68] .do_one_initcall 0x68/0x1e0 
[c000000186f93d90] [c00000000010893c] .SyS_init_module 0xcc/0x218 
[c000000186f93e30] [c0000000000098ec] syscall_exit 0x0/0x40 
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0xf (was 0xffffffff, writing 0xff) 
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0xe (was 0xffffffff, writing 0x0)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0xd (was 0xffffffff, writing 0x40)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0xc (was 0xffffffff, writing 0x0)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0xb (was 0xffffffff, writing 0x6115b3)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0xa (was 0xffffffff, writing 0x0)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0x9 (was 0xffffffff, writing 0x0)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0x8 (was 0xffffffff, writing 0x0)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0x7 (was 0xffffffff, writing 0x0)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0x6 (was 0xffffffff, writing 0x0)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0x5 (was 0xffffffff, writing 0x0)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0x4 (was 0xffffffff, writing 0x9e00000c)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0x3 (was 0xffffffff, writing 0x20)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0x2 (was 0xffffffff, writing 0x2070000)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0x1 (was 0xffffffff, writing 0x100146)
mst_ppc_pci_reset_driver 0000:50:00.0: restoring config space at offset 0x0 (was 0xffffffff, writing 0x101115b3)
EEH: Detected PCI bus error on device <null>
EEH: This PCI device has failed 1 times in the last hour:
EEH: Bus location=U78CB.001.WZS02VY-P1-C11-T1 driver=mst_ppc_pci_reset_driver pci addr=0000:50:00.0
EEH: Device location=U78CB.001.WZS02VY-P1-C11-T1 driver= pci addr=<null>
EEH: of node=/pci@800000020000013/pci15b3,61@0
EEH: PCI device/vendor: 101115b3
EEH: PCI cmd/status register: 00100140
EEH: PCI-E capabilities and status follow:
EEH: PCI-E 00: 0002c010
EEH: PCI-E 01: 19008fe2
EEH: PCI-E 02: 0000595e
EEH: PCI-E 03: 0043f103
EEH: PCI-E 04: 10830000
EEH: PCI-E 05: 00000000
EEH: PCI-E 06: 00000000
EEH: PCI-E 07: 00000000
EEH: PCI-E 08: 00000000
EEH: PCI-E AER capability register set follows:
EEH: PCI-E AER 00: 00010001
EEH: PCI-E AER 01: 00000000
EEH: PCI-E AER 02: 00000000
EEH: PCI-E AER 03: 00062010
EEH: PCI-E AER 04: 00000000
EEH: PCI-E AER 05: 00002000
EEH: PCI-E AER 06: 000001e4
EEH: PCI-E AER 07: 00000000
EEH: PCI-E AER 08: 00000000
EEH: PCI-E AER 09: 00000000
EEH: PCI-E AER 0a: 00000000
EEH: PCI-E AER 0b: 00000000
EEH: PCI-E AER 0c: 00000000
EEH: PCI-E AER 0d: 00000000
RTAS: event: 2736, Type: Platform Error, Severity: 2
mst_ppc_pci_reset_driver 0000:50:00.0: PME# disabled

CodePudding user response:

When debugging this sort of issue it's a very good idea to track which kernel versions you are using and to provide specific details about the hardware you are testing with. From the fact that your kernel has eeh_dn_check_failure() rather than eeh_check_dev_failure() I can gather this is a very old kernel. Does the other system you tested with have the same kernel? The same firmware? All of this is relevant to your problem.

Anyway, I'd say you need a one-second wait between de-asserting the reset and restoring config space. The PCI spec requires that system software give the device one second to initialise after a reset before attempting IO, config cycles included. In 2015, commit 26833a5029b7 ("powerpc/eeh: Make the delay for PE reset unified") added a delay after de-assert (on powerpc at least), so on kernels that include it the wait would be handled for you. Considering your kernel is old enough to still have eeh_dn_check_failure() (renamed in 2012, see f8f7d63fd96e), you probably don't have that patch and need to do the wait yourself.
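As a rough illustration of the suggested ordering, here is a minimal sketch, not a tested driver. The function name reset_device_settled, the two *_MS constants and everything in the #else branch are hypothetical: the #else branch is only a userspace stand-in that models a device which ignores config access for one second after de-assert, so the sketch can be compiled outside a kernel tree. The pci_enable_device() and pci_set_master() calls from the question are left out for brevity.

```c
#ifdef __KERNEL__
#include <linux/pci.h>
#include <linux/delay.h>
#else
/*
 * Hypothetical userspace stand-ins, NOT the kernel implementations.
 * They only model time: the fake device does not answer config
 * accesses until 1000 ms after the reset is de-asserted.
 */
static unsigned int sim_now_ms;      /* simulated clock */
static unsigned int sim_ready_at_ms; /* when the fake device answers again */
static int sim_frozen;               /* set if config space is touched too early */

struct pci_dev { int unused; };
enum pcie_reset_state { pcie_deassert_reset = 1, pcie_hot_reset = 3 };

static void msleep(unsigned int ms) { sim_now_ms += ms; }

static int pci_save_state(struct pci_dev *pdev) { (void)pdev; return 0; }

static int pci_set_pcie_reset_state(struct pci_dev *pdev,
                                    enum pcie_reset_state state)
{
    (void)pdev;
    if (state == pcie_deassert_reset)
        sim_ready_at_ms = sim_now_ms + 1000;
    return 0;
}

static void pci_restore_state(struct pci_dev *pdev)
{
    (void)pdev;
    if (sim_now_ms < sim_ready_at_ms)
        sim_frozen = 1; /* stands in for the EEH freeze seen in the log */
}
#endif

#define RESET_ASSERT_MS 1000 /* same total hold time as the two HZ/2 sleeps */
#define RESET_SETTLE_MS 1000 /* the one-second wait suggested above */

/* Hot-reset pdev, then wait settle_ms before touching config space. */
static int reset_device_settled(struct pci_dev *pdev, unsigned int settle_ms)
{
    int err;

    err = pci_save_state(pdev);
    if (err)
        return err;

    err = pci_set_pcie_reset_state(pdev, pcie_hot_reset);
    if (err)
        return err;
    msleep(RESET_ASSERT_MS);

    err = pci_set_pcie_reset_state(pdev, pcie_deassert_reset);
    if (err)
        return err;

    /* The key change: sleep BEFORE pci_restore_state(), not after it. */
    msleep(settle_ms);

    pci_restore_state(pdev);
    return 0;
}
```

In the question's code the sleeps come after pci_restore_state(), so the config reads in the call trace are issued immediately after de-assert. Calling reset_device_settled(pdev, RESET_SETTLE_MS) moves the wait in front of the restore. Whether one second is enough on a given platform and firmware is an assumption you would have to verify.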

What's probably happening is that the device isn't ready to respond to config accesses and drops them. The hypervisor detects a timeout, assumes the device is malfunctioning, and isolates (freezes) it using the EEH mechanism that IBM's Power hardware has. Normally the OS will try to un-freeze and reset the device after that happens, but that can fail for a lot of reasons, especially on older kernels.
