Reflections on a Program Crash Caused by a .NET Thread Exiting Abnormally
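I: Background
1. The Story
The day before yesterday I received a dump from a crashed .NET program. After some analysis I found that the root cause was a .NET managed thread (DBG=XXXX) that had exited abnormally: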
0:011> !t
ThreadCount: 17
UnstartedThread: 0
BackgroundThread: 16
PendingThread: 0
DeadThread: 0
Hosted Runtime: no
Lock
DBG ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception
0 1 84d8 000001C0801EAC20 26020 Preemptive 0000000000000000:0000000000000000 000001c080266300 -00001 STA
3 2 9d78 000001C0801F8210 2b220 Preemptive 0000000000000000:0000000000000000 000001c080266300 -00001 MTA (Finalizer)
4 4 8760 000001C08466C800 102b220 Preemptive 0000000000000000:0000000000000000 000001c080266300 -00001 MTA (Threadpool Worker)
...
44 16 b2fc 000001C08F949450 102b220 Preemptive 0000000000000000:0000000000000000 000001c080266300 -00001 MTA (GC) (Threadpool Worker)
46 15 9904 000001C08F9487B0 102b220 Preemptive 0000000000000000:0000000000000000 000001c080266300 -00001 MTA (Threadpool Worker)
XXXX 3 a23c 000001C08F948E00 102b220 Preemptive 0000000000000000:0000000000000000 000001c080266300 -00001 Ukn (Threadpool Worker)
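Because the thread exited abnormally, the CLR knew nothing about it. When a GC was triggered, the GC went looking for reference roots on this XXXX thread. The thread no longer exists, so touching its stack memory is an access violation. The ScanStackRoots call stack shows this clearly. Note also that rbp and r9 both hold 000001c08f948e00, which is exactly the ThreadOBJ of the XXXX thread: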
0:011> .ecxr
rax=00007ffdbefcc8a0 rbx=000000a42007f5f0 rcx=000000a42187f688
rdx=0000000000000000 rsi=000000a42007ee60 rdi=000000a42007f100
rip=00007ffdbec36cbb rsp=000000a42007f828 rbp=000001c08f948e00
r8=000000a42007f910 r9=000001c08f948e00 r10=00000fffb7da5860
r11=0555501544555545 r12=ffffffffffffffff r13=0000000000000000
r14=0000000000000000 r15=00007ffdbec14fb0
iopl=0 nv up ei pl nz ac pe cy
cs=0033 ss=002b ds=002b es=002b fs=0053 gs=002b efl=00010211
coreclr!InlinedCallFrame::FrameHasActiveCall+0x13:
00007ffd`bec36cbb 483b01 cmp rax,qword ptr [rcx] ds:000000a4`2187f688=????????????????
0:011> k
*** Stack trace for last set context - .thread/.cxr resets it
# Child-SP RetAddr Call Site
00 000000a4`2007f828 00007ffd`bec36c2e coreclr!InlinedCallFrame::FrameHasActiveCall+0x13 [D:\a\_work\1\s\src\coreclr\vm\frames.h @ 2927]
01 000000a4`2007f830 00007ffd`bec36aef coreclr!ScanStackRoots+0x3a [D:\a\_work\1\s\src\coreclr\vm\gcenv.ee.cpp @ 121]
02 000000a4`2007f8a0 00007ffd`bec29627 coreclr!GCToEEInterface::GcScanRoots+0x8f [D:\a\_work\1\s\src\coreclr\vm\gcenv.ee.cpp @ 282]
03 (Inline Function) --------`-------- coreclr!GCScan::GcScanRoots+0x73 [D:\a\_work\1\s\src\coreclr\gc\gcscan.cpp @ 152]
04 000000a4`2007f8e0 00007ffd`bec14865 coreclr!WKS::gc_heap::background_mark_phase+0xdf [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 37866]
05 000000a4`2007f990 00007ffd`bed286a0 coreclr!WKS::gc_heap::gc1+0x511 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 22315]
06 000000a4`2007f9f0 00007ffd`bed391c1 coreclr!WKS::gc_heap::bgc_thread_function+0x68 [D:\a\_work\1\s\src\coreclr\gc\gc.cpp @ 39244]
07 000000a4`2007fa20 00007ffe`3533e8d7 coreclr!<lambda_7303b2ca2c5f80d5f81ddddfcd2de660>::operator()+0xa1 [D:\a\_work\1\s\src\coreclr\vm\gcenv.ee.cpp @ 1441]
08 000000a4`2007fa50 00007ffe`363f14fc kernel32!BaseThreadInitThunk+0x17
09 000000a4`2007fa80 00000000`00000000 ntdll!RtlUserThreadStart+0x2c
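To be honest, I have seen many crashes like this. Most of them involved threads created with new Thread, so intercepting Thread.StartCore with Harmony easily revealed the culprit. This crash was different: the thread did not come from new Thread but was a free-range thread pool (ThreadPool) worker, which made the analysis considerably harder. Since this is a reflection, let's properly summarize how to approach this class of problem.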
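II: Reproducing the Failure
1. The Problem Code
To keep the demo simple, we call C++ from C# and have the C++ code kill its own thread with TerminateThread. First, the C++ code (exported with C linkage):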
extern "C"
{
    __declspec(dllexport) void dowork();
}
#include <cstdio>
#include <Windows.h>
void dowork()
{
    DWORD threadId = GetCurrentThreadId();
    printf("C++: current thread ID (decimal): %lu, hex: 0x%lX\n", threadId, threadId);
    printf("C++: I'm about to exit...\n");
    // Kills the calling thread immediately: no cleanup code runs and attached DLLs
    // (including the CLR) are not notified that this thread is gone
    TerminateThread(GetCurrentThread(), 1);
}
Next, call the exported dowork function from C#:
using System;
using System.Runtime.InteropServices;
using System.Threading.Tasks;
namespace Example_1_1
{
    internal class Program
    {
        static void Main(string[] args)
        {
            DoRequest();
            Console.ReadLine();
        }
        static void DoRequest()
        {
            Task.Run(() =>
            {
                Console.WriteLine("1. Calling C++ code...");
                try
                {
                    dowork();
                    Console.WriteLine("2. C++ code finished...");
                }
                catch (Exception ex)
                {
                    // Never reached for TerminateThread: the thread dies without raising an exception
                    Console.WriteLine($"2. C++ code threw an exception: {ex.Message}");
                }
            });
        }
        [DllImport("Example_1_2", CallingConvention = CallingConvention.Cdecl)]
        public extern static void dowork();
    }
}
Finally, run the program and attach WinDbg. Sure enough, there is an XXXX thread, as the screenshot shows:

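The failure is reproduced. The next step is to find out who made the ThreadPool thread exit abnormally...
III: How to Find the Original Crime Scene
1. Process Monitor
To find the root cause, we need the call stack of whoever called TerminateThread. A simple, brute-force way to get it is Process Monitor. Windows raises an event when a thread exits, Process Monitor captures it as a Thread Exit event, and it records the call stack as well. No time like the present; the configuration is shown below: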

Next, run the program, attach WinDbg to the process, and find the ID of the problem thread:
0:005> !t
ThreadCount: 5
UnstartedThread: 0
BackgroundThread: 3
PendingThread: 0
DeadThread: 1
Hosted Runtime: no
Lock
DBG ID OSID ThreadOBJ State GC Mode GC Alloc Context Domain Count Apt Exception
0 1 153c 00000202C603C240 2a020 Preemptive 00000202CA819060:00000202CA81B020 00000202c6088980 -00001 MTA
3 2 afc 00000202C60F0DB0 2b220 Preemptive 0000000000000000:0000000000000000 00000202c6088980 -00001 MTA (Finalizer)
XXXX 4 4718 00000202C6057D10 102b220 Preemptive 00000202CA80CF70:00000202CA80E740 00000202c6088980 -00001 Ukn (Threadpool Worker)
4 5 4420 00000202C605D510 302b220 Preemptive 00000202CA80EB40:00000202CA810760 00000202c6088980 -00001 MTA (Threadpool Worker)
0:005> ? 4718
Evaluate expression: 18200 = 00000000`00004718
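The !t output shows that the thread with OSID 0x4718 (18200 in decimal) is the one that exited abnormally. Process Monitor does indeed show a Thread Exit event for Thread ID 18200. Perfect, as the screenshot shows: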

Double-click the event and open the Stack tab. It clearly shows that the exit came from a call in Example_1_2!dowork:

In a real project, seeing the dowork function should tell you what happened. The search space shrinks dramatically, and I'm confident you can take it from there.
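2. Hooking with MinHook
Process Monitor is handy, but it has one drawback: it cannot show managed stacks, and there is no way around that. Can we see the managed stack some other way? That would be ideal. It turns out to be simple: hook kernel32!TerminateThread. Whenever someone calls it, the hook logs the ID of the thread being terminated along with the call stack. The complete code: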
using System;
using System.Runtime.InteropServices;
using System.Threading.Tasks;
namespace Example_1_1
{
    internal class Program
    {
        static void Main(string[] args)
        {
            // Install the hook before any TerminateThread calls can occur
            TerminateThreadHook.InstallHook();
            Console.WriteLine("Hook installed. Starting test...");
            DoRequest();
            // DoRequest only queues work on the thread pool, so keep the hook
            // installed until the user presses Enter; uninstalling it right
            // after DoRequest() could remove it before dowork() even runs
            Console.ReadLine();
            TerminateThreadHook.UninstallHook();
        }
        static void DoRequest()
        {
            Task.Run(() =>
            {
                Console.WriteLine("1. Calling C++ code...");
                try
                {
                    dowork();
                    Console.WriteLine("2. C++ code finished...");
                }
                catch (Exception ex)
                {
                    Console.WriteLine($"2. C++ code threw an exception: {ex.Message}");
                }
            });
        }
        [DllImport("Example_1_2", CallingConvention = CallingConvention.Cdecl)]
        public extern static void dowork();
    }
    public static class TerminateThreadHook
    {
        // TerminateThread function signature
        [UnmanagedFunctionPointer(CallingConvention.StdCall)]
        private delegate bool TerminateThreadDelegate(IntPtr hThread, uint dwExitCode);
        private static TerminateThreadDelegate _originalTerminateThread;
        // Keep the detour delegate rooted: if the GC collected it, native code
        // would jump through a dangling function pointer
        private static TerminateThreadDelegate _detour;
        private static IntPtr _terminateThreadPtr = IntPtr.Zero;
        public static void InstallHook()
        {
            // 1. Get TerminateThread address from kernel32.dll
            _terminateThreadPtr = MinHook.GetProcAddress(
                MinHook.GetModuleHandle("kernel32.dll"), "TerminateThread");
            if (_terminateThreadPtr == IntPtr.Zero)
            {
                Console.WriteLine("Failed to find TerminateThread address.");
                return;
            }
            // 2. Initialize MinHook
            var status = MinHook.MH_Initialize();
            if (status != MinHook.MH_STATUS.MH_OK)
            {
                Console.WriteLine($"MH_Initialize failed: {status}");
                return;
            }
            // 3. Create Hook (the detour delegate is stored in a static field, see above)
            _detour = new TerminateThreadDelegate(HookedTerminateThread);
            var detourPtr = Marshal.GetFunctionPointerForDelegate(_detour);
            status = MinHook.MH_CreateHook(_terminateThreadPtr, detourPtr, out var originalPtr);
            if (status != MinHook.MH_STATUS.MH_OK)
            {
                Console.WriteLine($"MH_CreateHook failed: {status}");
                return;
            }
            _originalTerminateThread = Marshal.GetDelegateForFunctionPointer<TerminateThreadDelegate>(originalPtr);
            // 4. Enable Hook
            status = MinHook.MH_EnableHook(_terminateThreadPtr);
            if (status != MinHook.MH_STATUS.MH_OK)
            {
                Console.WriteLine($"MH_EnableHook failed: {status}");
                return;
            }
            Console.WriteLine("TerminateThread hook installed successfully!");
        }
        public static void UninstallHook()
        {
            if (_terminateThreadPtr == IntPtr.Zero)
                return;
            // 1. Disable Hook
            var status = MinHook.MH_DisableHook(_terminateThreadPtr);
            if (status != MinHook.MH_STATUS.MH_OK)
                Console.WriteLine($"MH_DisableHook failed: {status}");
            // 2. Uninitialize MinHook
            status = MinHook.MH_Uninitialize();
            if (status != MinHook.MH_STATUS.MH_OK)
                Console.WriteLine($"MH_Uninitialize failed: {status}");
            _terminateThreadPtr = IntPtr.Zero;
            Console.WriteLine("Hook uninstalled.");
        }
        private static bool HookedTerminateThread(IntPtr hThread, uint dwExitCode)
        {
            // Get the calling thread ID and the ID of the thread being terminated
            uint currentThreadId = GetCurrentThreadId();
            uint targetThreadId = GetThreadId(hThread);
            Console.WriteLine($"[HOOK] TerminateThread intercepted!");
            Console.WriteLine($"    Attempting to terminate thread: 0x{targetThreadId.ToString("X")} (ID: {targetThreadId})");
            Console.WriteLine($"    Called from thread ID: {currentThreadId}");
            // Print managed call stack
            Console.WriteLine("\n    [Managed Call Stack]:");
            Console.WriteLine(Environment.StackTrace);
            // Forward to the real TerminateThread so behavior is unchanged
            return _originalTerminateThread(hThread, dwExitCode);
        }
        [DllImport("kernel32.dll")]
        private static extern uint GetCurrentThreadId();
        [DllImport("kernel32.dll")]
        private static extern uint GetThreadId(IntPtr hThread);
    }
    public static class MinHook
    {
        public enum MH_STATUS
        {
            MH_OK = 0,
            MH_ERROR_ALREADY_INITIALIZED,
            MH_ERROR_NOT_INITIALIZED,
            // ... other status codes
        }
        [DllImport("MinHook.x64.dll", CallingConvention = CallingConvention.Cdecl)]
        public static extern MH_STATUS MH_Initialize();
        [DllImport("MinHook.x64.dll", CallingConvention = CallingConvention.Cdecl)]
        public static extern MH_STATUS MH_Uninitialize();
        [DllImport("MinHook.x64.dll", CallingConvention = CallingConvention.Cdecl)]
        public static extern MH_STATUS MH_CreateHook(IntPtr pTarget, IntPtr pDetour, out IntPtr ppOriginal);
        [DllImport("MinHook.x64.dll", CallingConvention = CallingConvention.Cdecl)]
        public static extern MH_STATUS MH_EnableHook(IntPtr pTarget);
        [DllImport("MinHook.x64.dll", CallingConvention = CallingConvention.Cdecl)]
        public static extern MH_STATUS MH_DisableHook(IntPtr pTarget);
        [DllImport("kernel32.dll", CharSet = CharSet.Unicode)]
        public static extern IntPtr GetModuleHandle(string lpModuleName);
        [DllImport("kernel32.dll", CharSet = CharSet.Ansi)]
        public static extern IntPtr GetProcAddress(IntPtr hModule, string lpProcName);
    }
}

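The output shows that the call was intercepted, and the Environment.StackTrace property displays the managed stack nicely. The one small regret is that the unmanaged frames are missing. If you really need them, you can use dbghelp.dll, which I won't cover here. Either way, matching these call stack logs against the abnormally exited thread in the dump will eventually bring the truth to light...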
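IV: Summary
Industrial control is now a major battleground for .NET, and it is full of C#/C++ interop. A careless mistake on the C++ side can have catastrophic consequences for the C# side. I hope the lessons in this article save those who come after some trouble!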
