
Linux kernel bug troubleshooting: a detailed report

Linux kernel BUG: soft lockup - CPU stuck analysis

1. Kernel bug log from production

kernel: Delta way too big! 18428729675200069867 ts=18446743954022816244 write stamp = 18014278822746377

       kernel:------------[ cut here ]------------

       kernel:WARNING: at kernel/trace/ring_buffer.c:1988 rb_reserve_next_event+0x2ce/0x370()(Not tainted)

       kernel:Hardware name: ProLiant DL360 G7

kernel: Modules linked in: fuse ipv6 power_meter bnx2 sg microcode serio_raw iTCO_wdt iTCO_vendor_support hpilo hpwdt i7core_edac edac_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif hpsa radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

       kernel: Pid:5483, comm: master Not tainted 2.6.32-220.el6.x86_64 #1

kernel: Call Trace:

       kernel:[<ffffffff81069b77>] ? warn_slowpath_common+0x87/0xc0

       kernel:[<ffffffff81069bca>] ? warn_slowpath_null+0x1a/0x20

       kernel:[<ffffffff810ea8ae>] ? rb_reserve_next_event+0x2ce/0x370

       kernel:[<ffffffff810eab02>] ? ring_buffer_lock_reserve+0xa2/0x160

       kernel:[<ffffffff810ec97c>] ? trace_buffer_lock_reserve+0x2c/0x70

       kernel:[<ffffffff810ecb16>] ? trace_current_buffer_lock_reserve+0x16/0x20

      kernel:[<ffffffff8107ae1e>] ? ftrace_raw_event_hrtimer_cancel+0x4e/0xb0

       kernel:[<ffffffff81095e7a>] ? hrtimer_try_to_cancel+0xba/0xd0

      kernel:[<ffffffff8106f634>] ? do_setitimer+0xd4/0x220

       kernel:[<ffffffff8106f88a>] ? alarm_setitimer+0x3a/0x60

       kernel:[<ffffffff8107c27e>] ? sys_alarm+0xe/0x20

      kernel:[<ffffffff8100b308>] ? tracesys+0xd9/0xde

       kernel: ---[end trace 4d0a1ef2e62cb1a2 ]---

      abrt-dump-oops: Reported 1 kernel oopses to Abrt

kernel: BUG: soft lockup - CPU#11 stuck for 4278190091s! [qmgr:5492]

kernel: Modules linked in: fuse ipv6 power_meter bnx2 sg microcode serio_raw iTCO_wdt iTCO_vendor_support hpilo hpwdt i7core_edac edac_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif hpsa radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

       kernel: CPU 11

kernel: Modules linked in: fuse ipv6 power_meter bnx2 sg microcode serio_raw iTCO_wdt iTCO_vendor_support hpilo hpwdt i7core_edac edac_core shpchp ext4 mbcache jbd2 sd_mod crc_t10dif hpsa radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

       kernel:

kernel: Pid: 5492, comm: qmgr Tainted: G W ---------------- 2.6.32-220.el6.x86_64 #1 HP ProLiant DL360 G7

       kernel: RIP:0010:[<ffffffff8106f730>] [<ffffffff8106f730>]do_setitimer+0x1d0/0x220

       kernel: RSP:0018:ffff88080a661ef8 EFLAGS: 00000286

       kernel: RAX:ffff88080b175a08 RBX: ffff88080a661f18 RCX: 0000000000000000

       kernel: RDX:0000000000000000 RSI: 0000000000000082 RDI: ffff88080c8c4c40

       kernel: RBP:ffffffff8100bc0e R08: 0000000000000000 R09: 0099d7270e01c3f1

       kernel: R10:0000000000000000 R11: 0000000000000246 R12: ffffffff810ef9a3

      kernel: R13:ffff88080a661e88 R14: 0000000000000000 R15: ffff88080a65a544

      kernel: FS:00007f10b245f7c0(0000) GS:ffff88083c4a0000(0000) knlGS:0000000000000000

       kernel: CS:0010 DS: 0000 ES: 0000 CR0: 000000008005003b

       kernel: CR2:00007ff955977380 CR3: 000000100a80b000 CR4: 00000000000006e0

      kernel: DR0:0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000

      kernel: DR3:0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

       kernel:Process qmgr (pid: 5492, threadinfo ffff88080a660000, task ffff880809577500)

       kernel: Stack:

       kernel:00007f10b323def0 00007f10b248ead0 00007f10b26d0f78 00007f10b248ede0

       kernel:<0> ffff88080a661f68 ffffffff8106f88a 0000000000000000 0000000000000000

       kernel:<0> 000000000000014c 00000000000f423d 0000000000000000 0000000000000000

kernel: Call Trace:

       kernel:[<ffffffff8106f88a>] ? alarm_setitimer+0x3a/0x60

       kernel:[<ffffffff8107c27e>] ? sys_alarm+0xe/0x20

       kernel:[<ffffffff8100b308>] ? tracesys+0xd9/0xde

      kernel: Code:89 ef e8 74 66 02 00 83 3d 15 69 b5 00 00 75 37 49 8b 84 24 70 07 00 00 48 0508 08 00 00 66 ff 00 66 66 90 fb 66 0f 1f 44 00 00 <31> c0 e9 64 fe ff ff49 8b 84 24 68 07 00 00 48 c7 80 d0 00 00

kernel: Call Trace:

      kernel:[<ffffffff8106f769>] ? do_setitimer+0x209/0x220

       kernel:[<ffffffff8106f88a>] ? alarm_setitimer+0x3a/0x60

      kernel:[<ffffffff8107c27e>] ? sys_alarm+0xe/0x20

       kernel:[<ffffffff8100b308>] ? tracesys+0xd9/0xde

      abrt-dump-oops: Reported 1 kernel oopses to Abrt

       

2. Root-cause analysis of the kernel soft lockup bug

What "soft lockup" means: this kind of bug does not take the whole system down; instead, one or more processes (or kernel threads) get stuck in some state, usually inside the kernel, and in many cases the cause is incorrect use of kernel locks.

The Linux kernel runs a monitoring task for every CPU, commonly called a watchdog. With ps -ef | grep watchdog you can see them; the thread names look like watchdog/N, where N is the logical CPU number (1/2/3/4 and so on). Each watchdog thread wakes up once per second and sleeps the rest of the time; when it runs, it records the current time into a kernel data structure that belongs to that CPU. The kernel also has a set of timer interrupt handlers that drive the soft lockup check: they compare the current timestamp with the one saved for the corresponding CPU, and if the difference exceeds the configured threshold they conclude that the watchdog thread has not been able to run for a considerable time. Why would a CPU soft lockup happen at all, given that kernel CPU scheduling is carefully designed? Usually it is introduced by user code or third-party software; in our case the kernel log points at the qmgr process (qmgr is a background mail queue service daemon). Any unbounded loop keeps one flow of execution on a CPU at some priority; if the scheduler runs, say, a driver, and that driver misbehaves without being detected, it can occupy the CPU for a very long time. As described above, the watchdog catches this and reports a soft lockup error. A soft lockup hangs the CPU and makes the system unusable.

If the problem is caused by a user-space process or thread, the backtrace usually contains nothing useful; if a kernel thread is responsible, the soft lockup message includes a meaningful backtrace.
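To make the mechanism concrete, here is a minimal, hypothetical kernel-module sketch (not from the original post; the module name and the 120-second duration are invented) that would provoke exactly this kind of report by holding one CPU with preemption disabled for longer than the watchdog threshold. It is for illustration only and must not be loaded on a production machine.

/* softlockup_demo.c - hypothetical illustration: spin with preemption
 * disabled long enough for the per-CPU watchdog to report a soft lockup. */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/jiffies.h>
#include <linux/preempt.h>

static int __init softlockup_demo_init(void)
{
        unsigned long end = jiffies + 120 * HZ;   /* ~120 s, beyond the 60 s threshold */

        preempt_disable();                        /* keep this CPU to ourselves */
        while (time_before(jiffies, end))
                ;                                 /* busy-wait; the watchdog thread starves */
        preempt_enable();
        return 0;
}

static void __exit softlockup_demo_exit(void)
{
}

module_init(softlockup_demo_init);
module_exit(softlockup_demo_exit);
MODULE_LICENSE("GPL");

Loading such a module would also stall insmod itself for the whole spin and may trigger other hung-task warnings; the point is only that a CPU held without feeding the watchdog is what produces the "soft lockup" report.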

3. Analyzing the error against the Linux kernel source

Using the error messages and the call trace that the kernel printed in part 1, we can now look for the concrete cause in the source.

First, install the kernel source matching our CentOS release. The steps are:

(1) Download the source rpm: kernel-2.6.32-220.17.1.el6.src.rpm

(2) Install the build dependencies: yum install rpm-build redhat-rpm-config asciidoc newt-devel

(3) Install the source package: rpm -i kernel-2.6.32-220.17.1.el6.src.rpm

(4) Enter the spec directory: cd ~/rpmbuild/SPECS

(5) Unpack and patch the source tree: rpmbuild -bp --target=`uname -m` kernel.spec

       

Now let's analyze the source code against the kernel bug log.

(1) First-stage error log (the messages stamped Dec 4 14:03:34). The code behind these messages does not itself cause the CPU soft lockup; the soft lockup is reported by the second-stage log analyzed later.

We first locate the source from the log line: Dec 4 14:03:34 BP-YZH-1-xxxx kernel: WARNING: at kernel/trace/ring_buffer.c:1988 rb_reserve_next_event+0x2ce/0x370() (Not tainted)

This points directly at line 1988 of kernel/trace/ring_buffer.c, which is simply: WARN_ON(1).

A quick note on WARN_ON: it only prints the current stack, it does not panic, which is why a large stack dump follows. The macro is defined as:

#ifndef WARN_ON
#define WARN_ON(condition) ({                                   \
        int __ret_warn_on = !!(condition);                      \
        if (unlikely(__ret_warn_on))                            \
                __WARN();                                       \
        unlikely(__ret_warn_on);                                \
})
#endif

The macro is simple: the double logical negation forces the condition into 0 or 1, the branch-prediction hint (unlikely) keeps the common, non-warning path adjacent to the preceding instructions, and if the condition holds, __WARN() is invoked. WARN_ON is called here with 1, so __WARN() runs. __WARN() itself is defined as:

#define __WARN()            warn_slowpath_null(__FILE__, __LINE__)

The call trace indeed shows warn_slowpath_null being called next. Searching the kernel source for it, we find the implementation in panic.c (where the kernel panic machinery lives):

void warn_slowpath_null(const char *file, int line)
{
        warn_slowpath_common(file, line, __builtin_return_address(0),
                             TAINT_WARN, NULL);
}
EXPORT_SYMBOL(warn_slowpath_null); /* export the symbol so other modules can call it */

We also see warn_slowpath_common here, and in the call trace it appears directly above warn_slowpath_null, which again confirms the flow. warn_slowpath_common is implemented in the same panic.c:

static void warn_slowpath_common(const char *file, int line, void *caller,
                                 unsigned taint, struct slowpath_args *args)
{
        const char *board;

        printk(KERN_WARNING "------------[ cut here ]------------\n");
        printk(KERN_WARNING "WARNING: at %s:%d %pS() (%s)\n",
               file, line, caller, print_tainted());
        board = dmi_get_system_info(DMI_PRODUCT_NAME); /* fetch the DMI system info */
        if (board)
                printk(KERN_WARNING "Hardware name: %s\n", board); /* our log shows ProLiant DL360 G7 here */

        if (args)
                vprintk(args->fmt, args->args);

        print_modules();          /* print the loaded module list */
        dump_stack();             /* dump the stack; the call trace starts here */
        print_oops_end_marker();  /* print the oops end marker */
        add_taint(taint);
}

Working through this function it is easy to see where many of our log lines come from, including the system information, so I will not dig deeper here (see the comments added to the code above: the functions called there print the corresponding messages, and they match our log exactly; dump_stack is CPU-architecture specific, and our server is x86). Let us look further at dump_stack, both because it is architecture specific and because it directly shows the process that triggered the warning. Its implementation:

/*
 * The architecture-independent dump_stack generator
 */
void dump_stack(void)
{
        unsigned long stack;

        printk("Pid: %d, comm: %.20s %s %s %.*s\n",
               current->pid, current->comm, print_tainted(),
               init_utsname()->release,
               (int)strcspn(init_utsname()->version, " "),
               init_utsname()->version);
        show_trace(NULL, NULL, &stack);
}
EXPORT_SYMBOL(dump_stack);

From the log line (Dec 4 14:03:34 BP-YZH-1-xxxx kernel: Pid: 5483, comm: master Not tainted 2.6.32-220.el6.x86_64 #1) we can easily see that this first warning came from the context of the master process, pid 5483. The last line of dump_stack calls show_trace to print the call trace; it is implemented as follows:

void show_trace(struct task_struct *task, struct pt_regs *regs,
                unsigned long *stack)
{
        show_trace_log_lvl(task, regs, stack, "");
}

which simply forwards to show_trace_log_lvl:

void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
                        unsigned long *stack, char *log_lvl)
{
        printk("%sCall Trace:\n", log_lvl);
        dump_trace(task, regs, stack, &print_trace_ops, log_lvl);
}

This prints the "Call Trace:" line we see in the log and then calls dump_trace (shown here in its x86-64 form; there is a corresponding 32-bit implementation):

/*
 * x86-64 can have up to three kernel stacks:
 * process stack
 * interrupt stack
 * severe exception (double fault, nmi, stack fault, debug, mce) hardware stack
 */
void dump_trace(struct task_struct *task, struct pt_regs *regs,
                unsigned long *stack,
                const struct stacktrace_ops *ops, void *data)
{
        const unsigned cpu = get_cpu();
        unsigned long *irq_stack_end =
                (unsigned long *)per_cpu(irq_stack_ptr, cpu);
        unsigned used = 0;
        struct thread_info *tinfo;
        int graph = 0;
        unsigned long bp;

        if (!task)
                task = current;

        if (!stack) {
                unsigned long dummy;
                stack = &dummy;
                if (task && task != current)
                        stack = (unsigned long *)task->thread.sp;
        }

        bp = stack_frame(task, regs);
        /*
         * Print function call entries in all stacks, starting at the
         * current stack address. If the stacks consist of nested
         * exceptions
         */
        tinfo = task_thread_info(task);
        for (;;) {
                char *id;
                unsigned long *estack_end;
                estack_end = in_exception_stack(cpu, (unsigned long)stack,
                                                &used, &id);

                if (estack_end) {
                        if (ops->stack(data, id) < 0)
                                break;

                        bp = ops->walk_stack(tinfo, stack, bp, ops,
                                             data, estack_end, &graph);
                        ops->stack(data, "<EOE>");
                        /*
                         * We link to the next stack via the
                         * second-to-last pointer (index -2 to end) in the
                         * exception stack:
                         */
                        stack = (unsigned long *) estack_end[-2];
                        continue;
                }
                if (irq_stack_end) {
                        unsigned long *irq_stack;
                        irq_stack = irq_stack_end -
                                (IRQ_STACK_SIZE - 64) / sizeof(*irq_stack);

                        if (in_irq_stack(stack, irq_stack, irq_stack_end)) {
                                if (ops->stack(data, "IRQ") < 0)
                                        break;
                                bp = ops->walk_stack(tinfo, stack, bp,
                                        ops, data, irq_stack_end, &graph);
                                /*
                                 * We link to the next stack (which would be
                                 * the process stack normally) the last
                                 * pointer (index -1 to end) in the IRQ stack:
                                 */
                                stack = (unsigned long *) (irq_stack_end[-1]);
                                bp = fixup_bp_irq_link(bp, stack, irq_stack,
                                                       irq_stack_end);
                                irq_stack_end = NULL;
                                ops->stack(data, "EOI");
                                continue;
                        }
                }
                break;
        }

        /*
         * This handles the process stack:
         */
        bp = ops->walk_stack(tinfo, stack, bp, ops, data, NULL, &graph);
        put_cpu();
}
EXPORT_SYMBOL(dump_trace);

The comment at the top tells us this function walks several stacks: the process stack, the interrupt stack, and the hardware stacks used by severe exceptions (double fault, NMI, stack fault, debug, MCE).

The point of walking through all of this is simply to confirm that the whole flow is sound. The only log line not yet accounted for is the very first one, and the source makes it obvious: it is printed on the line just above the WARN_ON, namely:

                printk(KERN_WARNING "Delta way too big! %llu"
                       " ts=%llu write stamp = %llu\n%s",
                       (unsigned long long)*delta,
                       (unsigned long long)*ts,
                       (unsigned long long)cpu_buffer->write_stamp,
                       local_clock_stable ? "" :
                       "If you just came from a suspend/resume,\n"
                       "please switch to the trace global clock:\n"
                       "  echo global > /sys/kernel/debug/tracing/trace_clock\n");

printk is the kernel's counterpart of printf (with some kernel-specific behavior). From this code we can also infer that local_clock_stable evaluated as non-zero here, since the suspend/resume hint at the end was not printed. At this point we have covered essentially all of the code behind the first-stage errors (the messages stamped Dec 4 14:03:34); the underlying cause of the oversized delta is analyzed further below.

       

(2) Second stage (timestamped Dec 4 14:03:41, only a few seconds after the first stage, so the two are probably related): analysis of the code behind the log output that actually reports the CPU soft lockup.

From the discussion of possible soft lockup causes in part 2, we know the soft lockup is ultimately reported by the watchdog code. Searching for the first message of the second stage leads to line 296 of kernel/watchdog.c, inside the function watchdog_timer_fn, which is defined as follows:

/* watchdog kicker functions */
static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
{
        unsigned long touch_ts = __get_cpu_var(watchdog_touch_ts);
        struct pt_regs *regs = get_irq_regs();
        int duration;

        /* kick the hardlockup detector */
        watchdog_interrupt_count();

        /* kick the softlockup detector */
        wake_up_process(__get_cpu_var(softlockup_watchdog));

        /* .. and repeat */
        hrtimer_forward_now(hrtimer, ns_to_ktime(get_sample_period()));

        if (touch_ts == 0) {
                if (unlikely(__get_cpu_var(softlockup_touch_sync))) {
                        /* If the time stamp was touched atomically
                         * make sure the scheduler tick is up to date. */
                        __get_cpu_var(softlockup_touch_sync) = false;
                        sched_clock_tick();
                }
                __touch_watchdog();
                return HRTIMER_RESTART;
        }

        /* check for a softlockup
         * This is done by making sure a high priority task is
         * being scheduled.  The task touches the watchdog to
         * indicate it is getting cpu time.  If it hasn't then
         * this is a good indication some task is hogging the cpu
         */
        duration = is_softlockup(touch_ts);
        if (unlikely(duration)) {
                /* only warn once */
                if (__get_cpu_var(soft_watchdog_warn) == true)
                        return HRTIMER_RESTART;

                printk(KERN_ERR "BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
                        smp_processor_id(), duration,
                        current->comm, task_pid_nr(current));
                print_modules();
                print_irqtrace_events(current);
                if (regs)
                        show_regs(regs);
                else
                        dump_stack();

                if (softlockup_panic)
                        panic("softlockup: hung tasks");
                __get_cpu_var(soft_watchdog_warn) = true;
        } else
                __get_cpu_var(soft_watchdog_warn) = false;

        return HRTIMER_RESTART;
}

Next we need to understand when this function runs and how that relates to a CPU soft lockup. As explained in part 2, each CPU starts a dedicated watchdog thread to monitor that CPU. The watchdog threads are created when the watchdog feature is enabled, in watchdog_enable:

static int watchdog_enable(int cpu)
{
        struct task_struct *p = per_cpu(softlockup_watchdog, cpu); /* per-CPU private data */
        int err = 0;

        /* enable the perf event */
        err = watchdog_nmi_enable(cpu);

        /* Regardless of err above, fall through and start softlockup */

        /* create the watchdog thread */
        if (!p) {
                p = kthread_create(watchdog, (void *)(unsigned long)cpu, "watchdog/%d", cpu);
                if (IS_ERR(p)) {
                        printk(KERN_ERR "softlockup watchdog for %i failed\n", cpu);
                        if (!err)
                                /* if hardlockup hasn't already set this */
                                err = PTR_ERR(p);
                        goto out;
                }
                kthread_bind(p, cpu);
                per_cpu(watchdog_touch_ts, cpu) = 0;
                per_cpu(softlockup_watchdog, cpu) = p;
                wake_up_process(p);
        }

out:
        return err;
}

From this enable function we can see that the entry point of each CPU's watchdog monitoring thread is the watchdog function, and that the perf-based (NMI) lockup detector is enabled alongside it. Let us continue from the watchdog function:

/* The watchdog thread - touches the timestamp. */
static int watchdog(void *unused)
{
        struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
        struct hrtimer *hrtimer = &__raw_get_cpu_var(watchdog_hrtimer);

        sched_setscheduler(current, SCHED_FIFO, &param);

        /* initialize timestamp */
        __touch_watchdog();

        /* kick off the timer for the hardlockup detector */
        /* done here because hrtimer_start can only pin to smp_processor_id() */
        hrtimer_start(hrtimer, ns_to_ktime(get_sample_period()),
                      HRTIMER_MODE_REL_PINNED);

        set_current_state(TASK_INTERRUPTIBLE);
        /*
         * Run briefly once per second to reset the softlockup timestamp.
         * If this gets delayed for more than 60 seconds then the
         * debug-printout triggers in watchdog_timer_fn().
         */
        while (!kthread_should_stop()) {
                __touch_watchdog();
                schedule();

                if (kthread_should_stop())
                        break;

                set_current_state(TASK_INTERRUPTIBLE);
        }
        __set_current_state(TASK_RUNNING);

        return 0;
}

This function first switches the watchdog thread to the SCHED_FIFO (first in, first out) scheduling policy at the highest real-time priority (MAX_RT_PRIO-1), initializes the timestamp, and sets the task state to interruptible. It then enters the while loop. As the comment above the loop explains, the loop refreshes the softlockup timestamp roughly once per second; if that refresh is delayed for more than 60 seconds, the debug printout in watchdog_timer_fn() triggers. watchdog_timer_fn itself is registered when the watchdog code initializes (a module's init function is the first thing run when it is loaded, and a matching cleanup function runs on unload).
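As a purely illustrative user-space analogue of this split (nothing below is kernel code; the names and the 60-second constant merely mirror the description above), one thread keeps refreshing a shared timestamp every second while another thread plays the role of the timer callback and complains when the timestamp goes stale:

/* Compile with: cc -pthread watchdog_demo.c */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define SOFTLOCKUP_THRESH 60              /* seconds, mirrors the kernel default */

static volatile time_t touch_ts;          /* last time the dog was fed */

static void *feeder(void *unused)
{
        for (;;) {
                touch_ts = time(NULL);    /* __touch_watchdog() equivalent */
                sleep(1);
        }
        return NULL;
}

static void *checker(void *unused)
{
        for (;;) {
                time_t now = time(NULL);
                if (now > touch_ts + SOFTLOCKUP_THRESH)
                        fprintf(stderr, "soft lockup: stuck for %lds\n",
                                (long)(now - touch_ts));
                sleep(1);
        }
        return NULL;
}

int main(void)
{
        pthread_t f, c;

        touch_ts = time(NULL);
        pthread_create(&f, NULL, feeder, NULL);
        pthread_create(&c, NULL, checker, NULL);
        pthread_join(f, NULL);
        return 0;
}

The kernel version differs in the important details (per-CPU threads, an hrtimer callback instead of a checker thread, real-time priority), but the feed/check split is the same idea.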

So now we know the condition for our server's error message to appear: more than 60 seconds must pass without the softlockup timestamp being refreshed. (Refreshing the timestamp every second is colloquially called "feeding the dog", since the watchdog thread is, after all, a watchdog; if the dog has gone unfed for over 60 seconds, the watchdog thread has been starved of CPU time, and the most likely explanation is a CPU soft lockup.) I will not go further into the CPU scheduler here; the question is why the dog was not fed, i.e. why the softlockup timestamp was not refreshed, for over 60 seconds. This should be considered together with the first-stage log, which was also about a computed time difference exceeding a configured limit.

Let us first confirm that the log output matches the code we identified, and then turn to the call trace.

We already reached the place in watchdog_timer_fn where the debug message is printed (the function is listed above). Matching it against our error log, the following statement produces the first line of the second-stage log:

                printk(KERN_ERR "BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
                        smp_processor_id(), duration,
                        current->comm, task_pid_nr(current));

The corresponding log line: Dec 4 14:03:41 BP-YZH-1-xxxx kernel: BUG: soft lockup - CPU#11 stuck for 4278190091s! [qmgr:5492]

       

This line is printed only when the kernel has detected a soft lockup (the preceding if condition and the comments make that clear). It tells us the logical CPU number, the process name and the pid; in our log these are 11, qmgr and 5492. The other important value, the one that decides whether a soft lockup is reported at all, is duration, shown here as 4278190091 seconds. How is it computed? Looking at the code, it comes from the following function:

static int is_softlockup(unsigned long touch_ts)
{
        unsigned long now = get_timestamp(smp_processor_id());

        /* Warn about unreasonable delays: */
        if (time_after(now, touch_ts + softlockup_thresh))
                return now - touch_ts;

        return 0;
}
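Before walking through it, here is a small user-space sketch of the arithmetic this function performs (illustrative only; the >> 30 shift mirrors the cpu_clock(cpu) >> 30 conversion used by get_timestamp, and the sample values are made up):

#include <stdio.h>

#define SOFTLOCKUP_THRESH 60UL

/* wrap-safe "a is after b" check, same idea as the kernel's time_after() */
static int after(unsigned long a, unsigned long b)
{
        return (long)(b - a) < 0;
}

static unsigned long ns_to_watchdog_units(unsigned long long ns)
{
        return (unsigned long)(ns >> 30);      /* 2^30 ns is roughly one second */
}

int main(void)
{
        unsigned long touch_ts = ns_to_watchdog_units(100ULL << 30); /* fed at ~100 "s" */
        unsigned long now      = ns_to_watchdog_units(170ULL << 30); /* checked at ~170 "s" */

        if (after(now, touch_ts + SOFTLOCKUP_THRESH))
                printf("soft lockup: stuck for %lus\n", now - touch_ts); /* prints 70 */
        return 0;
}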

This function first obtains the current timestamp of the CPU. I will not trace the exact derivation; it comes from the CPU clock in nanoseconds, converted to roughly seconds by a shift (2^30 ns is about 1.074 s, via cpu_clock(this_cpu) >> 30LL). It then uses time_after to compare the current timestamp now against the last feeding timestamp plus softlockup_thresh (60), i.e. it checks whether more than 60 seconds have passed since the dog was last fed. If so, it returns now minus the last feeding timestamp; otherwise it returns 0. Per our log the threshold was clearly exceeded, so the return value was non-zero and the soft lockup was detected.

The kernel then prints the list of loaded modules and the irq trace events; following the code further, everything lines up exactly with the bug messages in our log.

       

Now to the call trace itself. In both the first-stage and the second-stage log the trace ends at tracesys, which tells us the events happened while a system call was being traced, i.e. the trace subsystem was active at the time. tracesys is implemented in assembly and is used for tracing system calls. The second-stage call trace carries almost no information, so little can be learned from it. One thing worth noting: the first-stage log names the master process, the second-stage log names qmgr; on CentOS, master is the parent process of qmgr. Here is the first-stage call trace (common prefixes removed):

      Call Trace:

      [<ffffffff81069b77>]? warn_slowpath_common+0x87/0xc0

      [<ffffffff81069bca>]? warn_slowpath_null+0x1a/0x20

      [<ffffffff810ea8ae>]? rb_reserve_next_event+0x2ce/0x370

      [<ffffffff810eab02>]? ring_buffer_lock_reserve+0xa2/0x160

      [<ffffffff810ec97c>]? trace_buffer_lock_reserve+0x2c/0x70

      [<ffffffff810ecb16>]? trace_current_buffer_lock_reserve+0x16/0x20

      [<ffffffff8107ae1e>]? ftrace_raw_event_hrtimer_cancel+0x4e/0xb0

      [<ffffffff81095e7a>]? hrtimer_try_to_cancel+0xba/0xd0

      [<ffffffff8106f634>]? do_setitimer+0xd4/0x220

      [<ffffffff8106f88a>]? alarm_setitimer+0x3a/0x60

      [<ffffffff8107c27e>]? sys_alarm+0xe/0x20

      [<ffffffff8100b308>]? tracesys+0xd9/0xde

Let us start from trace_current_buffer_lock_reserve. The two frames before it are hrtimer_try_to_cancel (an attempt to cancel a timer) and ftrace_raw_event_hrtimer_cancel (the ftrace event emitted when an hrtimer is cancelled); these run when the trace subsystem fires while tracing a system call. Following the call trace, trace_current_buffer_lock_reserve eventually reaches rb_reserve_next_event, implemented as follows:

static struct ring_buffer_event *
rb_reserve_next_event(struct ring_buffer *buffer,
                      struct ring_buffer_per_cpu *cpu_buffer, unsigned long length)
{
        struct ring_buffer_event *event;
        u64 ts, delta = 0;
        int commit = 0;
        int nr_loops = 0;

        rb_start_commit(cpu_buffer);

#ifdef CONFIG_RING_BUFFER_ALLOW_SWAP
        /*
         * Due to the ability to swap a cpu buffer from a buffer
         * it is possible it was swapped before we committed.
         * (committing stops a swap). We check for it here and
         * if it happened, we have to fail the write.
         */
        barrier();
        if (unlikely(ACCESS_ONCE(cpu_buffer->buffer) != buffer)) {
                local_dec(&cpu_buffer->committing);
                local_dec(&cpu_buffer->commits);
                return NULL;
        }
#endif

        length = rb_calculate_event_length(length);
 again:
        /*
         * We allow for interrupts to reenter here and do a trace.
         * If one does, it will cause this original code to loop
         * back here. Even with heavy interrupts happening, this
         * should only happen a few times in a row. If this happens
         * 1000 times in a row, there must be either an interrupt
         * storm or we have something buggy.
         * Bail!
         */
        if (RB_WARN_ON(cpu_buffer, ++nr_loops > 1000))
                goto out_fail;

        ts = rb_time_stamp(cpu_buffer->buffer);

        /*
         * Only the first commit can update the timestamp.
         * Yes there is a race here. If an interrupt comes in
         * just after the conditional and it traces too, then it
         * will also check the deltas. More than one timestamp may
         * also be made. But only the entry that did the actual
         * commit will be something other than zero.
         */
        if (likely(cpu_buffer->tail_page == cpu_buffer->commit_page &&
                   rb_page_write(cpu_buffer->tail_page) ==
                   rb_commit_index(cpu_buffer))) {
                u64 diff;

                diff = ts - cpu_buffer->write_stamp;

                /* make sure this diff is calculated here */
                barrier();

                /* Did the write stamp get updated already? */
                if (unlikely(ts < cpu_buffer->write_stamp))
                        goto get_event;

                delta = diff;
                if (unlikely(test_time_stamp(delta))) {

                        commit = rb_add_time_stamp(cpu_buffer, &ts, &delta);
                        if (commit == -EBUSY)
                                goto out_fail;

                        if (commit == -EAGAIN)
                                goto again;

                        RB_WARN_ON(cpu_buffer, commit < 0);
                }
        }

 get_event:
        event = __rb_reserve_next(cpu_buffer, 0, length, &ts);
        if (unlikely(PTR_ERR(event) == -EAGAIN))
                goto again;

        if (!event)
                goto out_fail;

        if (!rb_event_is_commit(cpu_buffer, event))
                delta = 0;

        event->time_delta = delta;

        return event;

 out_fail:
        rb_end_commit(cpu_buffer);
        return NULL;
}

First look at the parameters: two fairly involved and important structures, ring_buffer and ring_buffer_per_cpu. Going through them prepares us for the analysis of the function body. The ring_buffer structure (RB for short) is defined as follows:

struct ring_buffer {
        unsigned                    pages;           /* number of pages in the RB */
        unsigned                    flags;           /* RB flags; currently only RB_FL_OVERWRITE */
        int                         cpus;            /* number of CPUs covered by the ring buffer */
        atomic_t                    record_disabled; /* global disable flag, atomic to avoid races */
        cpumask_var_t               cpumask;         /* cpu bitmap */
        struct lock_class_key       *reader_lock_key;/* reader lock key */
        struct mutex                mutex;           /* serializes RB-wide updates */
        struct ring_buffer_per_cpu  **buffers;       /* per-CPU buffers, one entry per CPU */
#ifdef CONFIG_HOTPLUG_CPU
        struct notifier_block       cpu_notify;      /* cpu hotplug notifier chain */
#endif
        u64                         (*clock)(void);  /* clock used by the RB for timestamps */
};

RB operations can be disabled globally: for example, when tracing is turned off entirely, the RB must not be operated on at all, and the atomic counter record_disabled is incremented by 1. Enabling decrements it. Only while record_disabled reads zero may the RB be used.

Likewise, some updates to the RB, such as resizing its buffers, must be serialized, which is what the mutex member of struct ring_buffer is for: it prevents concurrent RB reconfiguration.
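A minimal sketch of that counting convention (illustrative only; the demo_* names are invented and this is not the kernel implementation, although the kernel's enable/disable helpers follow the same increment/decrement idea):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct demo_ring_buffer {
        atomic_int record_disabled;     /* 0 means recording is allowed */
};

static void demo_record_disable(struct demo_ring_buffer *rb)
{
        atomic_fetch_add(&rb->record_disabled, 1);      /* disable: +1 */
}

static void demo_record_enable(struct demo_ring_buffer *rb)
{
        atomic_fetch_sub(&rb->record_disabled, 1);      /* enable: -1 */
}

static bool demo_recording_allowed(struct demo_ring_buffer *rb)
{
        return atomic_load(&rb->record_disabled) == 0;
}

int main(void)
{
        struct demo_ring_buffer rb = { 0 };

        demo_record_disable(&rb);       /* e.g. tracing switched off */
        printf("allowed: %d\n", demo_recording_allowed(&rb));  /* 0 */
        demo_record_enable(&rb);
        printf("allowed: %d\n", demo_recording_allowed(&rb));  /* 1 */
        return 0;
}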

ring_buffer_per_cpu is defined as follows:

struct ring_buffer_per_cpu {
        int                      cpu;            /* the CPU this cpu buffer belongs to */
        struct ring_buffer       *buffer;        /* the RB this cpu buffer belongs to */
        /* Reader lock: serializes readers; the writer also takes it
         * in some cases when switching pages. */
        spinlock_t               reader_lock;    /* serialize readers */
        raw_spinlock_t           lock;
        struct lock_class_key    lock_key;
        struct list_head         pages;          /* list of this cpu buffer's pages */
        struct buffer_page       *head_page;     /* read from head: current read position */
        struct buffer_page       *tail_page;     /* write to tail: current write position */
        /* Commit position: a written page may be read only after it is committed. */
        struct buffer_page       *commit_page;   /* committed pages */
        struct buffer_page       *reader_page;   /* reader page, swapped in for reading */
        unsigned long            nmi_dropped;
        unsigned long            commit_overrun;
        unsigned long            overrun;
        unsigned long            read;
        local_t                  entries;
        u64                      write_stamp;    /* timestamp of the most recent commit */
        u64                      read_stamp;     /* timestamp of the most recent read */
        atomic_t                 record_disabled;/* per-cpu-buffer disable/enable flag */
};

First, per-CPU recording is gated by ring_buffer_per_cpu->record_disabled in the same way as for the RB as a whole: increment to disable, decrement to enable.

We can also see that each CPU owns a series of pages, all linked into the pages list.

A page is described by the following structure:

struct buffer_page {
        struct list_head        list;     /* list of buffer pages: links the pages together */
        local_t                 write;    /* index for next write */
        unsigned                read;     /* index for next read */
        local_t                 entries;  /* entries on this page */
        struct buffer_data_page *page;    /* actual data page */
};

The actual data area is reached through struct buffer_data_page; in fact it is the management header of the concrete page:

struct buffer_data_page {
        u64             time_stamp;  /* page time stamp: set when the page is first written */
        local_t         commit;      /* write committed index */
        unsigned char   data[];      /* data of buffer page: holds the recorded events */
};

With the basic structures understood, we can analyze the function itself (its code is listed above). It first declares a ring_buffer_event variable; the structure is defined as:

/* Don't refer to this struct directly, use functions below. */
struct ring_buffer_event {
        kmemcheck_bitfield_begin(bitfield); /* see Documentation/kmemcheck.txt */
        u32             type_len:5, time_delta:27;
        kmemcheck_bitfield_end(bitfield);
        u32             array[];
};

The comment warns not to access this structure directly but to use the accessor functions; here it is only declared and used inside the function body. The kmemcheck bitfield markers belong to the kernel's dynamic memory checker, which is conceptually similar to the user-space tool valgrind although implemented very differently. Back in the function: after declaring the variables it needs, the first call is rb_start_commit:

static void rb_start_commit(struct ring_buffer_per_cpu *cpu_buffer)
{
        local_inc(&cpu_buffer->committing);
        local_inc(&cpu_buffer->commits);
}

It calls local_inc on the cpu_buffer's committing and commits fields; following the call chain all the way down, it ends up in:

/**
 * atomic_add - add integer to atomic variable
 * @i: integer value to add
 * @v: pointer of type atomic_t
 *
 * Atomically adds @i to @v.
 */
static inline void atomic_add(int i, atomic_t *v)
{
        asm volatile(LOCK_PREFIX "addl %1,%0"
                     : "=m" (v->counter)
                     : "ir" (i), "m" (v->counter));
}

So in the end this is just an atomic add (addl) of 1 to an integer, implemented in assembly. The next block only applies when ring-buffer swapping is enabled (CONFIG_RING_BUFFER_ALLOW_SWAP, swapping of per-CPU buffers, not the disk-backed VM swap area). Then rb_calculate_event_length computes the event length. Next comes a long comment which says, roughly: preemption is disabled before this function is called, but interrupts and NMIs can still come in and trace as well, so at any point we can be re-entered. While taking a page from struct ring_buffer_per_cpu, the current page can run out of space and writing has to advance to the next page; each time that happens the code jumps back to the again label and nr_loops grows by one. If this happens 1000 times within a single request, there must be an interrupt storm or a bug, so RB_WARN_ON prints a call trace and the function bails out; per our server log this did not happen. Next, rb_time_stamp reads the current timestamp into the local variable ts. Another long comment follows: only the first commit may update the timestamp, i.e. cpu_buffer->write_stamp, and there is a race here. Now let us look at this code path in detail, because the delta computed here is the most basic, initial cause of our server's bug.
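As an aside, the "goto again with a bounded retry count" pattern used here is easy to mirror in plain C; a hypothetical sketch (reserve_slot is an invented stand-in for work that a concurrent interrupt can disturb):

#include <stdio.h>

static int reserve_slot(void)
{
        /* pretend the reservation sometimes has to be retried */
        static int attempts;
        return ++attempts < 3 ? -1 : 0;   /* succeed on the third try */
}

int try_reserve(void)
{
        int nr_loops = 0;
again:
        if (++nr_loops > 1000) {          /* interrupt storm or a bug: bail */
                fprintf(stderr, "too many retries, giving up\n");
                return -1;
        }
        if (reserve_slot() != 0)
                goto again;               /* disturbed, start over */
        return 0;
}

int main(void)
{
        return try_reserve();
}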

The if condition shows that delta is computed only on the first commit; in every other case we simply go on and fetch the next event. But why, having already established that this is the first commit and entered the if block, do we still need the check

        if (unlikely(ts < cpu_buffer->write_stamp))
                goto get_event;

at all? What is delta actually for, and why is it checked this way? As mentioned earlier, every piece of data in ring_buffer_per_cpu carries an event header, namely:

      struct ring_buffer_event {

          u32     type:2, len:3, time_delta:27;

          u32     array[];

      };

It has a time_delta member of 27 bits. The header of every page, struct buffer_data_page, also carries a timestamp:

struct buffer_data_page {
        u64             time_stamp;  /* page time stamp */
        local_t         commit;      /* write committed index */
        unsigned char   data[];      /* data of buffer page */
};

So how do these timestamps relate? ring_buffer_per_cpu carries write_stamp, the timestamp of the most recent commit. Each time writing switches to a new page, that page's buffer_data_page->time_stamp records the current time, i.e. the moment the page became the write page. ring_buffer_event->time_delta is the difference between the current time and the previous commit time, i.e. ring_buffer_per_cpu->write_stamp.

Putting it together:

the timestamp at which the first event on a page, event1, was written is
buffer_data_page->time_stamp + ring_buffer_event1->time_delta;

the timestamp of the second event, event2, is
buffer_data_page->time_stamp + ring_buffer_event1->time_delta + ring_buffer_event2->time_delta;

and so on. There is one special case: an event of type RINGBUF_TYPE_TIME_EXTEND, used when a delta does not fit into 27 bits; it extends the record by another 32 bits to hold the time. That is exactly what the if (test_time_stamp(delta)) above is for. Note also that the return value commit is 1 only on the first commit.
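A small illustrative sketch of this timestamp reconstruction and of the 27-bit overflow test (the TS_* constants mirror ring_buffer.c; the per-event values are made up for the example):

#include <stdio.h>
#include <stdint.h>

#define TS_SHIFT      27
#define TS_MASK       ((1ULL << TS_SHIFT) - 1)
#define TS_DELTA_TEST (~TS_MASK)

static int test_time_stamp(uint64_t delta)
{
        return (delta & TS_DELTA_TEST) != 0;   /* 1: delta needs a TIME_EXTEND event */
}

int main(void)
{
        uint64_t page_time_stamp = 1000000;    /* when the page became the write page */
        uint64_t delta1 = 500, delta2 = 1200;  /* per-event 27-bit deltas */

        printf("event1 time: %llu\n",
               (unsigned long long)(page_time_stamp + delta1));
        printf("event2 time: %llu\n",
               (unsigned long long)(page_time_stamp + delta1 + delta2));

        /* the delta from our bug log clearly does not fit in 27 bits */
        printf("needs TIME_EXTEND? %d\n",
               test_time_stamp(18428729675200069867ULL));   /* prints 1 */
        return 0;
}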

At this point we can see that the computed delta exceeded the allowed range, which is why the debug message was printed. Let us look at the time values it reports (presumably derived from the system clock). The log line is:

Delta way too big! 18428729675200069867 ts=18446743954022816244 write stamp = 18014278822746377
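Before reading the values off, a quick check that the three numbers are internally consistent (unsigned 64-bit arithmetic, matching the u64 fields in the kernel code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t ts          = 18446743954022816244ULL;  /* from the log */
        uint64_t write_stamp = 18014278822746377ULL;     /* from the log */
        uint64_t delta       = ts - write_stamp;

        printf("%llu\n", (unsigned long long)delta);     /* 18428729675200069867, as logged */
        return 0;
}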

From this line we have delta = 18428729675200069867, ts = 18446743954022816244 and write stamp = 18014278822746377 (the last write time recorded in the cpu buffer). They are related by diff = ts - cpu_buffer->write_stamp, which the check above confirms. So how is ts computed? It measures time since the server booted, depends on the configured HZ, and is ultimately produced by sched_clock; the generic implementation is:

unsigned long long __attribute__((weak)) sched_clock(void)
{
        /* the kernel initializes jiffies to INITIAL_JIFFIES, i.e. -300*HZ, at boot */
        return (unsigned long long)(jiffies - INITIAL_JIFFIES)
                                        * (NSEC_PER_SEC / HZ);
}

This is the generic, tick-based clock. The global variable jiffies records the number of timer ticks since the system booted; the kernel initializes it to INITIAL_JIFFIES at boot (some articles claim it starts at 0; to settle the question I printed the value from a small kernel module at load time, and it is indeed INITIAL_JIFFIES). Every timer interrupt increments the variable. The number of timer interrupts per second equals HZ, so jiffies grows by HZ every second, and the uptime in seconds is jiffies/HZ. Note that jiffies is an unsigned long; storing it in any other type is incorrect. Seconds convert to jiffies as seconds * HZ, and back as jiffies / HZ. With HZ set to 100 in param.h, the ts printed in our log (in nanoseconds) would imply an uptime of 18446743954022816244 / 1000000000 / 24 / 60 / 60 = 213504 days, which is obviously wrong. Checking the previously recorded write value the same way, 18014278822746377 / 1000000000 / 24 / 60 / 60 = 208 days, which looks plausible: the last write stamp corresponds to roughly 208 days of uptime.
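For reference, a minimal sketch of the kind of probe module mentioned above (a hypothetical reconstruction, not the author's actual module):

/* jiffies_probe.c - print jiffies and INITIAL_JIFFIES at load time to see
 * what jiffies was initialized to. Illustrative sketch only. */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/jiffies.h>

static int __init jiffies_probe_init(void)
{
        printk(KERN_INFO "jiffies=%lu INITIAL_JIFFIES=%lu diff=%lu\n",
               jiffies, (unsigned long)INITIAL_JIFFIES,
               jiffies - (unsigned long)INITIAL_JIFFIES);
        return 0;
}

static void __exit jiffies_probe_exit(void)
{
}

module_init(jiffies_probe_init);
module_exit(jiffies_probe_exit);
MODULE_LICENSE("GPL");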

INITIAL_JIFFIES is defined as:

#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))

NSEC_PER_SEC is the number of nanoseconds in one second (1000000000).

So the expression evaluates (jiffies - INITIAL_JIFFIES) * (NSEC_PER_SEC / HZ), where INITIAL_JIFFIES works out to 4294937296 and NSEC_PER_SEC / HZ = 1000000000 / 100 = 10000000 ns per tick. From the ts in our log, 18446743954022816244, we can work backwards to jiffies - INITIAL_JIFFIES of about 1844674395402, which is clearly not a sane tick count. Look again at the soft lockup message itself:

BUG: soft lockup - CPU#11 stuck for 4278190091s! [qmgr:5492]
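(As a quick numeric check of the back-computation above, with HZ = 100 as stated:)

#include <stdio.h>
#include <stdint.h>

#define HZ              100
#define NSEC_PER_SEC    1000000000ULL
#define INITIAL_JIFFIES ((unsigned long)(unsigned int)(-300 * HZ))

int main(void)
{
        uint64_t ts = 18446743954022816244ULL;                      /* from the log */

        printf("INITIAL_JIFFIES = %lu\n", INITIAL_JIFFIES);         /* 4294937296 */
        printf("ns per tick     = %llu\n",
               (unsigned long long)(NSEC_PER_SEC / HZ));            /* 10000000 */
        printf("implied ticks   = %llu\n",
               (unsigned long long)(ts / (NSEC_PER_SEC / HZ)));     /* ~1844674395402 */
        return 0;
}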

This figure of 4278190091 seconds is also plainly abnormal. The most likely explanation for values like these is memory corruption.

A new finding: while experimenting, modifying the jiffies value from a kernel module immediately produced a CPU soft lockup on CentOS, and the kernel's error output listed my test module as the most recently loaded module. Comparing this with the crash log from the production server, the module named there is scsi_wait_scan. From this we can speculate that the scsi_wait_scan module caused the qmgr process to hit the CPU soft lockup, i.e. left CPU#11 unable to feed the watchdog for more than 60 seconds.

4. Conclusion

Because the error has never been reproduced, the real root cause has not been pinned down yet; the investigation continues and we are looking for a way to reproduce it. If anyone has run into the same problem before and solved it, please let me know. Many thanks!

Posted @ 2012-12-16 17:48 by 薔薇理想人生