
Memory Faults and EDAC

有灵魂的程序猿 61


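What is EDAC

EDAC stands for Error Detection and Correction. In computer systems, EDAC is a technique for detecting and correcting hardware errors in memory. Such errors can corrupt data or crash the system. With EDAC in place, the system can detect an error when it occurs and correct it, which preserves data integrity and system stability.

EDAC usually relies on hardware and software working together. The hardware side consists of the memory controller and its associated circuitry, which detect memory errors and correct them automatically where possible. The software side consists of drivers and tools that monitor hardware errors, log error information, and notify the system administrator.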

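How EDAC works

EDAC works by adding redundant information that lets the system detect and correct hardware errors in memory:

1. Error detection: when data is stored, the system also stores redundant information such as a checksum, parity bits, or an error-correcting code. When the data is read back, the system recomputes the redundant information and compares it with the stored value. A mismatch means an error has occurred.

2. Error correction: in EDAC systems that can correct errors, the system applies the code's algorithm to the redundant information and repairs the error automatically.

3. Fault location and notification: when an error occurs, the EDAC system records information about it, including the error type and where it occurred.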
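The detect-then-correct idea can be sketched with a Hamming(7,4) code. This is only a toy illustration: the function names are made up for this example, and real ECC memory uses wider codes (commonly 8 check bits per 64 data bits) implemented in the memory controller, not in software.

```python
from typing import Tuple


def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits into a 7-bit codeword (bit i holds position i+1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    # Codeword positions: 1=p1, 2=p2, 3=d0, 4=p4, 5=d1, 6=d2, 7=d3
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))


def hamming74_decode(word: int) -> Tuple[int, int]:
    """Return (data nibble, corrected position); position 0 means no error."""
    # The syndrome is the XOR of the positions of all set bits.
    # It is 0 for a valid codeword and equals the flipped position otherwise.
    syndrome = 0
    for pos in range(1, 8):
        if (word >> (pos - 1)) & 1:
            syndrome ^= pos
    if syndrome:
        word ^= 1 << (syndrome - 1)  # correct the single flipped bit
    d = [(word >> 2) & 1, (word >> 4) & 1, (word >> 5) & 1, (word >> 6) & 1]
    return sum(b << i for i, b in enumerate(d)), syndrome
```

Flipping any one bit of a codeword and decoding it returns the original nibble together with the position that was repaired, which corresponds to a corrected error (CE). A plain Hamming(7,4) code cannot tell a double-bit error from a single-bit one; codes that can do so add an extra overall parity bit.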

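EDAC drivers

The Linux kernel ships EDAC drivers for many platforms. If an older kernel lacks the driver for your hardware, look for it in a newer kernel. You can then either backport the driver to the older kernel or upgrade the system kernel. If no kernel version has the driver, the only option is to contact the vendor, for example Intel or AMD.

The driver I needed is the "MC Driver for Intel 10nm server processors", i10nm_edac. The 4.19 kernel does not include it, so I backported it from the 5.10 kernel (it took several days to get working).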

Loading the i10nm_edac driver

[ 3.945980] EDAC MC0: Giving out device to module i10nm_edac controller Intel_10nm Socket#0 IMC#0: DEV 0000:fe:0c.0 (INTERRUPT)
[ 3.945985] EDAC i10nm: v0.0.3
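Once an EDAC driver has registered a memory controller, the kernel exposes it under /sys/devices/system/edac/mc. The following is a minimal sketch for checking that from a script. The function name is hypothetical, and it reads only the mc_name, ce_count and ue_count attributes; which attributes are present can vary with kernel version.

```python
from pathlib import Path

EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_controllers(root=EDAC_MC_ROOT):
    """Return {"mc0": {"mc_name": ..., "ce_count": ..., "ue_count": ...}, ...}.

    An empty dict means no EDAC memory controller is registered
    (for example, because no EDAC driver is loaded).
    """
    root = Path(root)
    result = {}
    if not root.is_dir():
        return result
    for mc in sorted(root.glob("mc[0-9]*")):
        info = {}
        for name in ("mc_name", "ce_count", "ue_count"):
            attr = mc / name
            if attr.is_file():
                text = attr.read_text().strip()
                # Counters are decimal integers; the name is free text.
                info[name] = int(text) if name.endswith("_count") else text
        result[mc.name] = info
    return result
```

On the machine above, one would expect an entry for mc0 whose mc_name matches the controller string in the dmesg line.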

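EINJ error injection

EINJ (Error Injection) is a mechanism for injecting memory errors, typically used to test and validate a system's error handling. With EINJ, a user can simulate hardware errors in memory to check whether the system's error detection, correction, and handling actually work.

EINJ usually involves the following key elements:

Error type: EINJ defines different error types, for example correctable memory errors and uncorrectable memory errors (non-fatal or fatal).

Error location: EINJ specifies where the error is injected, usually a memory address or a memory module.

Injection parameters: EINJ also takes injection parameters, such as the error severity, the number of injections, and the injection rate. These let the user control how, and at what scale, errors are injected.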
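The error type is expressed as a bit mask. A small decoder might look like the sketch below. The three memory entries match the available_error_type listing shown later in this article; the remaining names follow my reading of the ACPI EINJ error-type definitions and should be checked against the specification. The function name is hypothetical.

```python
# Bit value -> error type name. Only 0x8, 0x10 and 0x20 are confirmed by
# the output in this article; the others are assumptions based on the ACPI spec.
EINJ_ERROR_TYPES = {
    0x001: "Processor Correctable",
    0x002: "Processor Uncorrectable non-fatal",
    0x004: "Processor Uncorrectable fatal",
    0x008: "Memory Correctable",
    0x010: "Memory Uncorrectable non-fatal",
    0x020: "Memory Uncorrectable fatal",
    0x040: "PCI Express Correctable",
    0x080: "PCI Express Uncorrectable non-fatal",
    0x100: "PCI Express Uncorrectable fatal",
    0x200: "Platform Correctable",
    0x400: "Platform Uncorrectable non-fatal",
    0x800: "Platform Uncorrectable fatal",
}


def decode_error_types(mask):
    """Return the names of all error types whose bit is set in mask."""
    return [name for bit, name in sorted(EINJ_ERROR_TYPES.items()) if mask & bit]
```

For example, decode_error_types(0x38) lists the three memory error types that this platform reports as supported.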

Injection driver

root@sonic:/home/admin# insmod einj.ko
root@sonic:/home/admin# dmesg
[ 66.882973] EINJ: Error INJection is initialized.

Injection tools

As far as I recall, these were provided by Intel.

root@sonic:/home/admin/test_tools# chmod +x inject_ce inject_ue

The EINJ table

Viewing the EINJ table

1. cat /sys/firmware/acpi/tables/EINJ > EINJ.bin

2. iasl -d EINJ.bin

3. cat EINJ.dsl

root@bsp:/home/bsp-server# cat /sys/firmware/acpi/tables/EINJ > EINJ.bin
root@bsp:/home/bsp-server# iasl -d EINJ.bin
Command 'iasl' not found, but can be installed with:
apt install acpica-tools
root@bsp:/home/bsp-server# apt install acpica-tools
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  acpica-tools
0 upgraded, 1 newly installed, 0 to remove and 91 not upgraded.
Need to get 900 kB of archives.
After this operation, 2,570 kB of additional disk space will be used.
Get:1  jammy/universe amd64 acpica-tools amd64 20200925-6 [900 kB]
Fetched 900 kB in 4s (223 kB/s)
Selecting previously unselected package acpica-tools.
(Reading database ... 316633 files and directories currently installed.)
Preparing to unpack .../acpica-tools_20200925-6_amd64.deb ...
Unpacking acpica-tools (20200925-6) ...
Setting up acpica-tools (20200925-6) ...
Processing triggers for man-db (2.10.2-1) ...
root@bsp:/home/bsp-server# iasl -d EINJ.bin
Intel ACPI Component Architecture
ASL+ Optimizing Compiler/Disassembler version 20200925
Copyright (c) 2000 - 2020 Intel Corporation
File appears to be binary: found 324 non-ASCII characters, disassembling
Binary file appears to be a valid ACPI table, disassembling
Input file EINJ.bin, Length 0x170 (368) bytes
ACPI: EINJ 0x0000000000000000 000170 (v01 HPE    Server   00000001 INTL 00000001)
Acpi Data Table [EINJ] decoded
Formatted output:  EINJ.dsl - 10805 bytes
root@bsp:/home/bsp-server# cat EINJ.dsl
/*
 * Intel ACPI Component Architecture
 * AML/ASL+ Disassembler version 20200925 (64-bit version)
 * Copyright (c) 2000 - 2020 Intel Corporation
 *
 * Disassembly of EINJ.bin, Thu Feb 29 11:19:54 2024
 *
 * ACPI Data Table [EINJ]
 *
 * Format: [HexOffset DecimalOffset ByteLength]  FieldName : FieldValue
 */
[000h 0000   4]                    Signature : "EINJ"    [Error Injection table]
[004h 0004   4]                 Table Length : 00000170
[008h 0008   1]                     Revision : 01
[009h 0009   1]                     Checksum : DC
[00Ah 0010   6]                       Oem ID : "HPE   "
[010h 0016   8]                 Oem Table ID : "Server  "
[018h 0024   4]                 Oem Revision : 00000001
[01Ch 0028   4]              Asl Compiler ID : "INTL"
[020h 0032   4]        Asl Compiler Revision : 00000001
[024h 0036   4]      Injection Header Length : 0000000C
[028h 0040   1]                        Flags : 00
[029h 0041   3]                     Reserved : 000000
[02Ch 0044   4]        Injection Entry Count : 0000000A
[030h 0048   1]                       Action : 00 [Begin Operation]
[031h 0049   1]                  Instruction : 03 [Write Register Value]
[032h 0050   1]        Flags (decoded below) : 01
                      Preserve Register Bits : 1
[033h 0051   1]                     Reserved : 00
[034h 0052  12]              Register Region : [Generic Address Structure]
[034h 0052   1]                     Space ID : 00 [SystemMemory]
[035h 0053   1]                    Bit Width : 40
[036h 0054   1]                   Bit Offset : 00
[037h 0055   1]         Encoded Access Width : 04 [QWord Access:64]
[038h 0056   8]                      Address : 00000000A0DCA018
[040h 0064   8]                        Value : 0000000055AA55AA
[048h 0072   8]                         Mask : 00000000FFFFFFFF
[050h 0080   1]                       Action : 01 [Get Trigger Table]
[051h 0081   1]                  Instruction : 00 [Read Register]
[052h 0082   1]        Flags (decoded below) : 00
                      Preserve Register Bits : 0
[053h 0083   1]                     Reserved : 00
[054h 0084  12]              Register Region : [Generic Address Structure]
[054h 0084   1]                     Space ID : 00 [SystemMemory]
[055h 0085   1]                    Bit Width : 40
[056h 0086   1]                   Bit Offset : 00
[057h 0087   1]         Encoded Access Width : 04 [QWord Access:64]
[058h 0088   8]                      Address : 00000000A0DCA048
[060h 0096   8]                        Value : 0000000000000000
[068h 0104   8]                         Mask : FFFFFFFFFFFFFFFF
[070h 0112   1]                       Action : 02 [Set Error Type]
[071h 0113   1]                  Instruction : 02 [Write Register]
[072h 0114   1]        Flags (decoded below) : 01
                      Preserve Register Bits : 1
[073h 0115   1]                     Reserved : 00
[074h 0116  12]              Register Region : [Generic Address Structure]
[074h 0116   1]                     Space ID : 00 [SystemMemory]
[075h 0117   1]                    Bit Width : 40
[076h 0118   1]                   Bit Offset : 00
[077h 0119   1]         Encoded Access Width : 04 [QWord Access:64]
[078h 0120   8]                      Address : 00000000A0DCA020
[080h 0128   8]                        Value : 0000000000000000
[088h 0136   8]                         Mask : 00000000FFFFFFFF
[090h 0144   1]                       Action : 03 [Get Error Type]
[091h 0145   1]                  Instruction : 00 [Read Register]
[092h 0146   1]        Flags (decoded below) : 00
                      Preserve Register Bits : 0
[093h 0147   1]                     Reserved : 00
[094h 0148  12]              Register Region : [Generic Address Structure]
[094h 0148   1]                     Space ID : 00 [SystemMemory]
[095h 0149   1]                    Bit Width : 40
[096h 0150   1]                   Bit Offset : 00
[097h 0151   1]         Encoded Access Width : 04 [QWord Access:64]
[098h 0152   8]                      Address : 00000000A0DCA050
[0A0h 0160   8]                        Value : 0000000000000000
[0A8h 0168   8]                         Mask : 00000000FFFFFFFF
[0B0h 0176   1]                       Action : 04 [End Operation]
[0B1h 0177   1]                  Instruction : 03 [Write Register Value]
[0B2h 0178   1]        Flags (decoded below) : 01
                      Preserve Register Bits : 1
[0B3h 0179   1]                     Reserved : 00
[0B4h 0180  12]              Register Region : [Generic Address Structure]
[0B4h 0180   1]                     Space ID : 00 [SystemMemory]
[0B5h 0181   1]                    Bit Width : 40
[0B6h 0182   1]                   Bit Offset : 00
[0B7h 0183   1]         Encoded Access Width : 04 [QWord Access:64]
[0B8h 0184   8]                      Address : 00000000A0DCA018
[0C0h 0192   8]                        Value : 0000000000000000
[0C8h 0200   8]                         Mask : 00000000FFFFFFFF
[0D0h 0208   1]                       Action : 05 [Execute Operation]
[0D1h 0209   1]                  Instruction : 03 [Write Register Value]
[0D2h 0210   1]        Flags (decoded below) : 01
                      Preserve Register Bits : 1
[0D3h 0211   1]                     Reserved : 00
[0D4h 0212  12]              Register Region : [Generic Address Structure]
[0D4h 0212   1]                     Space ID : 01 [SystemIO]
[0D5h 0213   1]                    Bit Width : 10
[0D6h 0214   1]                   Bit Offset : 00
[0D7h 0215   1]         Encoded Access Width : 02 [Word Access:16]
[0D8h 0216   8]                      Address : 00000000000000B2
[0E0h 0224   8]                        Value : 000000000000009A
[0E8h 0232   8]                         Mask : 000000000000FFFF
[0F0h 0240   1]                       Action : 06 [Check Busy Status]
[0F1h 0241   1]                  Instruction : 01 [Read Register Value]
[0F2h 0242   1]        Flags (decoded below) : 00
                      Preserve Register Bits : 0
[0F3h 0243   1]                     Reserved : 00
[0F4h 0244  12]              Register Region : [Generic Address Structure]
[0F4h 0244   1]                     Space ID : 00 [SystemMemory]
[0F5h 0245   1]                    Bit Width : 40
[0F6h 0246   1]                   Bit Offset : 00
[0F7h 0247   1]         Encoded Access Width : 04 [QWord Access:64]
[0F8h 0248   8]                      Address : 00000000A0DCA058
[100h 0256   8]                        Value : 0000000000000001
[108h 0264   8]                         Mask : 0000000000000001
[110h 0272   1]                       Action : 07 [Get Command Status]
[111h 0273   1]                  Instruction : 00 [Read Register]
[112h 0274   1]        Flags (decoded below) : 01
                      Preserve Register Bits : 1
[113h 0275   1]                     Reserved : 00
[114h 0276  12]              Register Region : [Generic Address Structure]
[114h 0276   1]                     Space ID : 00 [SystemMemory]
[115h 0277   1]                    Bit Width : 40
[116h 0278   1]                   Bit Offset : 00
[117h 0279   1]         Encoded Access Width : 04 [QWord Access:64]
[118h 0280   8]                      Address : 00000000A0DCA060
[120h 0288   8]                        Value : 0000000000000000
[128h 0296   8]                         Mask : 00000000000001FE
[130h 0304   1]                       Action : 08 [Set Error Type With Address]
[131h 0305   1]                  Instruction : 02 [Write Register]
[132h 0306   1]        Flags (decoded below) : 01
                      Preserve Register Bits : 1
[133h 0307   1]                     Reserved : 00
[134h 0308  12]              Register Region : [Generic Address Structure]
[134h 0308   1]                     Space ID : 00 [SystemMemory]
[135h 0309   1]                    Bit Width : 40
[136h 0310   1]                   Bit Offset : 00
[137h 0311   1]         Encoded Access Width : 04 [QWord Access:64]
[138h 0312   8]                      Address : 00000000A0DCA078
[140h 0320   8]                        Value : 0000000000000000
[148h 0328   8]                         Mask : 00000000FFFFFFFF
[150h 0336   1]                       Action : 09 [Get Execute Timings]
[151h 0337   1]                  Instruction : 00 [Read Register]
[152h 0338   1]        Flags (decoded below) : 00
                      Preserve Register Bits : 0
[153h 0339   1]                     Reserved : 00
[154h 0340  12]              Register Region : [Generic Address Structure]
[154h 0340   1]                     Space ID : 00 [SystemMemory]
[155h 0341   1]                    Bit Width : 40
[156h 0342   1]                   Bit Offset : 00
[157h 0343   1]         Encoded Access Width : 04 [QWord Access:64]
[158h 0344   8]                      Address : 00000000A0DCA09C
[160h 0352   8]                        Value : 00007FFF00003FFF
[168h 0360   8]                         Mask : FFFFFFFFFFFFFFFF
Raw Table Data: Length 368 (0x170)
    0000: 45 49 4E 4A 70 01 00 00 01 DC 48 50 45 20 20 20  // EINJp.....HPE
    0010: 53 65 72 76 65 72 20 20 01 00 00 00 49 4E 54 4C  // Server  ....INTL
    0020: 01 00 00 00 0C 00 00 00 00 00 00 00 0A 00 00 00  // ................
    0030: 00 03 01 00 00 40 00 04 18 A0 DC A0 00 00 00 00  // .....@..........
    0040: AA 55 AA 55 00 00 00 00 FF FF FF FF 00 00 00 00  // .U.U............
    0050: 01 00 00 00 00 40 00 04 48 A0 DC A0 00 00 00 00  // .....@..H.......
    0060: 00 00 00 00 00 00 00 00 FF FF FF FF FF FF FF FF  // ................
    0070: 02 02 01 00 00 40 00 04 20 A0 DC A0 00 00 00 00  // .....@.. .......
    0080: 00 00 00 00 00 00 00 00 FF FF FF FF 00 00 00 00  // ................
    0090: 03 00 00 00 00 40 00 04 50 A0 DC A0 00 00 00 00  // .....@..P.......
    00A0: 00 00 00 00 00 00 00 00 FF FF FF FF 00 00 00 00  // ................
    00B0: 04 03 01 00 00 40 00 04 18 A0 DC A0 00 00 00 00  // .....@..........
    00C0: 00 00 00 00 00 00 00 00 FF FF FF FF 00 00 00 00  // ................
    00D0: 05 03 01 00 01 10 00 02 B2 00 00 00 00 00 00 00  // ................
    00E0: 9A 00 00 00 00 00 00 00 FF FF 00 00 00 00 00 00  // ................
    00F0: 06 01 00 00 00 40 00 04 58 A0 DC A0 00 00 00 00  // .....@..X.......
    0100: 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  // ................
    0110: 07 00 01 00 00 40 00 04 60 A0 DC A0 00 00 00 00  // .....@..`.......
    0120: 00 00 00 00 00 00 00 00 FE 01 00 00 00 00 00 00  // ................
    0130: 08 02 01 00 00 40 00 04 78 A0 DC A0 00 00 00 00  // .....@..x.......
    0140: 00 00 00 00 00 00 00 00 FF FF FF FF 00 00 00 00  // ................
    0150: 09 00 00 00 00 40 00 04 9C A0 DC A0 00 00 00 00  // .....@..........
    0160: FF 3F 00 00 FF 7F 00 00 FF FF FF FF FF FF FF FF  // .?..............
root@bsp:/home/bsp-server#
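If iasl is not available, the same fields can be read directly from EINJ.bin. The sketch below follows the field order and byte lengths visible in the disassembly above (36-byte table header, 12-byte injection header, 32-byte entries). It is an illustration, not a replacement for iasl: parse_einj is a hypothetical name, and the action names are copied from the listing.

```python
import struct

# Layouts taken from the [HexOffset DecimalOffset ByteLength] columns of EINJ.dsl.
ACPI_HEADER = struct.Struct("<4sIBB6s8sI4sI")  # standard table header, 36 bytes
EINJ_HEADER = struct.Struct("<IB3sI")          # header length, flags, reserved, entry count
EINJ_ENTRY = struct.Struct("<BBBBBBBBQQQ")     # action, instruction, flags, reserved,
                                               # 12-byte register region, value, mask

ACTIONS = {
    0: "Begin Operation", 1: "Get Trigger Table", 2: "Set Error Type",
    3: "Get Error Type", 4: "End Operation", 5: "Execute Operation",
    6: "Check Busy Status", 7: "Get Command Status",
    8: "Set Error Type With Address", 9: "Get Execute Timings",
}


def parse_einj(data):
    """Parse the raw bytes of an EINJ table into a dict."""
    sig, length, revision, checksum, oem_id, oem_table_id, *_ = ACPI_HEADER.unpack_from(data, 0)
    if sig != b"EINJ":
        raise ValueError("not an EINJ table: %r" % sig)
    _hdr_len, _flags, _reserved, count = EINJ_HEADER.unpack_from(data, ACPI_HEADER.size)

    entries = []
    offset = ACPI_HEADER.size + EINJ_HEADER.size
    for _ in range(count):
        if offset + EINJ_ENTRY.size > len(data):
            break  # tolerate a truncated dump
        (action, instruction, flags, _r, space_id, bit_width, _bit_offset,
         _access, address, value, mask) = EINJ_ENTRY.unpack_from(data, offset)
        entries.append({
            "action": ACTIONS.get(action, "Unknown (%d)" % action),
            "instruction": instruction,
            "preserve_register": bool(flags & 1),
            "space_id": space_id,      # 0 = SystemMemory, 1 = SystemIO
            "bit_width": bit_width,
            "address": address,
            "value": value,
            "mask": mask,
        })
        offset += EINJ_ENTRY.size

    return {
        "length": length,
        "revision": revision,
        "checksum": checksum,
        "oem_id": oem_id.decode("ascii").strip(),
        "oem_table_id": oem_table_id.decode("ascii").strip(),
        "entry_count": count,
        "entries": entries,
    }
```

Typical use would be parse_einj(open("EINJ.bin", "rb").read()), then printing each entry's action and address to compare against the iasl output.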

Installing edac-utils

root@sonic:/home/admin# apt install edac-utils

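Injection methods

1. Injecting with the tools

CE injection

Inject CEs (correctable errors). The BIOS on this machine sets the CE threshold to 2000, so the first injection of 1900 errors does not trigger a report. After the second injection of 100 errors, the CE count can be seen with "edac-util -v". The threshold differs from machine to machine.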

root@sonic:/home/admin/test_tools# chmod +x inject_*
root@sonic:/home/admin/test_tools# ./inject_ce -c 1900
repeat_times = 1900
allocate memory in virt addr=0x7fad9d079000 , physical addr=0x143c64000
[2023-9-22 14:15:30] inject 1900 CE error to paddr = 0x143c64000 begin...
[2023-9-22 14:15:41] inject 1900 CE error to paddr = 0x143c64000 finish.
root@sonic:/home/admin/test_tools# ./inject_ce -c 100
repeat_times = 100
allocate memory in virt addr=0x7f75020d9000 , physical addr=0x135935000
[2023-9-22 14:16:02] inject 100 CE error to paddr = 0x135935000 begin...
[2023-9-22 14:16:02] inject 100 CE error to paddr = 0x135935000 finish.
root@sonic:/home/admin/test_tools#
root@sonic:/home/admin/test_tools# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 0 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_MC#0_Chan#1_DIMM#0: 2000 Corrected Errors
root@sonic:/home/admin/test_tools# ./inject_ce -c 100
repeat_times = 100
allocate memory in virt addr=0x7f22f5384000 , physical addr=0x130454000
[2023-9-22 14:16:19] inject 100 CE error to paddr = 0x130454000 begin...
[2023-9-22 14:16:20] inject 100 CE error to paddr = 0x130454000 finish.
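For automated tests it helps to turn the "edac-util -v" report into numbers. The sketch below parses only the line shapes that appear in this article; the regular expression and function names are my own, and other edac-util versions or options may print different formats.

```python
import re

# Matches lines such as:
#   mc0: 0 Corrected Errors with no DIMM info
#   mc0: csrow0: 0 Uncorrected Errors
#   mc0: csrow0: CPU_SrcID#0_MC#0_Chan#1_DIMM#0: 2000 Corrected Errors
LINE_RE = re.compile(
    r"^(?P<mc>mc\d+): (?:(?P<csrow>csrow\d+): )?(?:(?P<label>.+?): )?"
    r"(?P<count>\d+) (?P<kind>Corrected|Uncorrected) Errors(?: with no DIMM info)?$"
)


def parse_edac_util(text):
    """Return one dict per recognised report line; other lines are ignored."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            record = match.groupdict()
            record["count"] = int(record["count"])
            records.append(record)
    return records


def totals(records):
    """Sum the counts per error kind across all report lines."""
    result = {"Corrected": 0, "Uncorrected": 0}
    for record in records:
        result[record["kind"]] += record["count"]
    return result
```

A test script could capture the tool's output (for instance with subprocess.run(["edac-util", "-v"], capture_output=True, text=True)), inject errors, capture it again, and compare the two totals.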

UE injection

Inject a UE (uncorrectable error). Whether the system goes down depends on the BIOS and OS configuration.

root@sonic:/home/admin/test_tools# ./inject_ue -addr=0x0003f004
inject addr = 0x0
allocate memory in virt addr=0x7efd8e3d7000 , physical addr=0x189bfe000
[2023-9-22 14:17:37] inject 1 UE error to paddr = 0x189bfe000 begin...
[2023-9-22 14:17:37] inject 1 UE error to paddr = 0x189bfe000 finish.
Message from syslogd@sonic at Sep 22 14:17:39 ...
 kernel:[  219.279745] mce: [Hardware Error]: CPU 12: Machine Check Exception: f Bank 1: bd80000000100134
Bus error (core dumped)
root@sonic:/home/admin/test_tools#
Message from syslogd@sonic at Sep 22 14:17:39 ...
 kernel:[  219.384566] mce: [Hardware Error]: RIP 33:<0000000000401b23>
Message from syslogd@sonic at Sep 22 14:17:39 ...
 kernel:[  219.455029] mce: [Hardware Error]: TSC 8efdf8d6b7 ADDR 189bfe000 MISC 86 PPIN 5d6f3e0a91d69cd8
Message from syslogd@sonic at Sep 22 14:17:39 ...
 kernel:[  219.560913] mce: [Hardware Error]: PROCESSOR 0:606c1 TIME 1695363457 SOCKET 0 APIC 9 microcode 1000211
root@sonic:/home/admin/test_tools# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 3 Uncorrected Errors
mc0: csrow0: CPU_SrcID#0_MC#0_Chan#1_DIMM#0: 4000 Corrected Errors
root@sonic:/home/admin/test_tools#
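The bank status value in the machine check line (bd80000000100134) can be decoded bit by bit. The sketch below uses the IA32_MCi_STATUS flag positions as I understand them from the Intel SDM (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57, S=56, AR=55). The S and AR bits are meaningful only on processors with software error recovery support, so treat the severity label as an assumption rather than a definitive diagnosis.

```python
# (bit position, flag name) pairs for the upper bits of IA32_MCi_STATUS.
MCI_STATUS_FLAGS = (
    (63, "VAL"), (62, "OVER"), (61, "UC"), (60, "EN"), (59, "MISCV"),
    (58, "ADDRV"), (57, "PCC"), (56, "S"), (55, "AR"),
)


def classify(flags):
    """Map the flag combination to a coarse severity label."""
    if "UC" not in flags:
        return "corrected"
    if "PCC" in flags:
        return "uncorrected, processor context corrupt"
    if "S" not in flags:
        return "UCNA"  # uncorrected, no action required
    return "SRAR" if "AR" in flags else "SRAO"


def decode_mci_status(status):
    """Split a 64-bit bank status value into flags and error codes."""
    flags = [name for bit, name in MCI_STATUS_FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "severity": classify(flags),
    }
```

Decoding 0xbd80000000100134 this way yields the flags VAL, UC, EN, MISCV, ADDRV, S and AR with MCA error code 0x134, that is, an uncorrected error that requires software recovery action. That is consistent with the injecting process being killed with "Bus error" while the system kept running.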

2. Injecting through /sys nodes

View the memory address ranges:

root@bsp:/home/bsp-server# cat /proc/iomem | grep "System RAM"
00001000-0008dfff : System RAM
00090000-0009ffff : System RAM
00100000-89d7c017 : System RAM
89d7c018-89dade57 : System RAM
89dade58-89dae017 : System RAM
89dae018-89edf457 : System RAM
89edf458-89ee4fff : System RAM
89f16000-8a02efff : System RAM
8a037000-8a752fff : System RAM
8a754000-94695fff : System RAM
95733000-95736fff : System RAM
9697a000-9e98a017 : System RAM
9e98a018-9e9bbe57 : System RAM
9e9bbe58-9e9bc017 : System RAM
9e9bc018-9e9ede57 : System RAM
9e9ede58-9e9ee017 : System RAM
9e9ee018-9ea1fe57 : System RAM
9ea1fe58-9ea20017 : System RAM
9ea20018-9ea82857 : System RAM
9ea82858-9ea83017 : System RAM
9ea83018-9eae5857 : System RAM
9eae5858-9eae6017 : System RAM
9eae6018-9eaf0c57 : System RAM
9eaf0c58-9eaf1017 : System RAM
9eaf1018-9eaf9057 : System RAM
9eaf9058-a0179fff : System RAM
a347a000-a347afff : System RAM
a34fc000-afffffff : System RAM
100000000-203fffefff : System RAM

Change to the EINJ debugfs directory:

cd /sys/kernel/debug/apei/einj/

List the supported error types:

cat available_error_type
0x00000008 Memory Correctable
0x00000010 Memory Uncorrectable non-fatal
0x00000020 Memory Uncorrectable fatal

Then inject the error:

# Error type (0x8 = Memory Correctable)
echo 0x8 > error_type
# Memory address mask
echo 0xfffffffffffff000 > param2
# Memory address (must lie inside a System RAM range)
echo 0x32dec000 > param1
# Write 0x0; if set to 1, the trigger step is skipped
echo 0x0 > notrigger
# Inject the error
echo 1 > error_inject
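The same sequence can be scripted. The sketch below writes the five files in the order used above and, optionally, first checks that the target address lies inside a System RAM range parsed from /proc/iomem. Both function names are hypothetical, the script must run as root with the einj module loaded, and injecting errors on a production machine is risky.

```python
from pathlib import Path

EINJ_DIR = Path("/sys/kernel/debug/apei/einj")


def parse_system_ram(iomem_text):
    """Extract (start, end) pairs of 'System RAM' ranges from /proc/iomem text."""
    ranges = []
    for line in iomem_text.splitlines():
        span, _, name = line.partition(" : ")
        if name.strip() != "System RAM":
            continue
        start, _, end = span.strip().partition("-")
        ranges.append((int(start, 16), int(end, 16)))
    return ranges


def inject_memory_error(addr, error_type=0x8, mask=0xFFFFFFFFFFFFF000,
                        ram_ranges=None, einj_dir=EINJ_DIR):
    """Write the EINJ debugfs files in the same order as the manual steps."""
    if ram_ranges is not None and not any(lo <= addr <= hi for lo, hi in ram_ranges):
        raise ValueError("address %#x is not inside any System RAM range" % addr)
    einj_dir = Path(einj_dir)
    steps = [
        ("error_type", "%#x" % error_type),  # 0x8 = Memory Correctable
        ("param2", "%#x" % mask),            # memory address mask
        ("param1", "%#x" % addr),            # memory address
        ("notrigger", "0x0"),                # 0 = do not skip the trigger step
        ("error_inject", "1"),               # start the injection
    ]
    for name, value in steps:
        (einj_dir / name).write_text(value + "\n")
    return steps
```

A call mirroring the manual example would be inject_memory_error(0x32dec000, ram_ranges=parse_system_ram(Path("/proc/iomem").read_text())). Note that /proc/iomem shows real addresses only to root.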
