記一次 .NET 某跨境物流系統 內存暴漲分析
一:背景
1. 講故事
分享了幾篇崩潰,這一篇跟大家分享一例內存暴漲,訓練營里有位學員朋友找到我,說他們的軟件存在內存暴漲,讓我幫忙看下怎么回事,dump也抓到了。
二:內存暴漲分析
1. 內存真的暴漲嗎
一直都給訓練營里的學員灌輸一個思想,就是不要相信任何人說的話,而應該是以數據說話,由于是 linux dump,可以使用 !maddress -summary 命令觀察。
0:000> !maddress -summary
+-------------------------------------------------------------------------+
| Memory Type | Count | Size | Size (bytes) |
+-------------------------------------------------------------------------+
| GCHeap | 12 | 2.39gb | 2,565,107,712 |
| Stack | 32 | 237.33mb | 248,860,672 |
| Image | 1,380 | 200.21mb | 209,934,848 |
| PAGE_READWRITE | 83 | 121.45mb | 127,348,736 |
| LowFrequencyHeap | 208 | 13.89mb | 14,561,280 |
| HighFrequencyHeap | 146 | 9.10mb | 9,539,584 |
| LoaderCodeHeap | 9 | 9.00mb | 9,437,184 |
| HostCodeHeap | 6 | 360.00kb | 368,640 |
| ResolveHeap | 1 | 348.00kb | 356,352 |
| DispatchHeap | 1 | 196.00kb | 200,704 |
| PAGE_READONLY | 84 | 175.50kb | 179,712 |
| CacheEntryHeap | 2 | 100.00kb | 102,400 |
| IndirectionCellHeap | 2 | 88.00kb | 90,112 |
| LookupHeap | 2 | 80.00kb | 81,920 |
| StubHeap | 1 | 12.00kb | 12,288 |
| PAGE_EXECUTE_WRITECOPY | 2 | 8.00kb | 8,192 |
| PAGE_EXECUTE_READ | 1 | 4.00kb | 4,096 |
+-------------------------------------------------------------------------+
| [TOTAL] | 1,972 | 2.97gb | 3,186,194,432 |
+-------------------------------------------------------------------------+
從卦象看,總計 2.97G 的總內存,托管堆就吃了 2.39G,也就表示當前存在托管內存泄露,接下來的關注點就是托管堆了。
2. 托管堆暴漲分析
要想觀察托管堆中到底有什么,可以使用 !dumpheap -stat 觀察詳情,輸出如下:
0:000> !dumpheap -stat
Statistics:
MT Count TotalSize Class Name
...
7f5640cb1168 115,010 71,585,040 System.Collections.Generic.Dictionary<System.String, Newtonsoft.Json.Linq.JToken>+Entry[]
7f5640ae64f8 629,539 75,544,680 xxx.TraceContext+TraceScope
7f5640937348 2,100,061 134,403,904 Newtonsoft.Json.Linq.JValue
7f563b060f90 1,945,658 205,832,084 System.String
7f5640c8d660 2,118,456 220,319,424 Newtonsoft.Json.Linq.JProperty
7f563b0614c0 101,757 1,047,579,498 System.Byte[]
Total 18,591,667 objects, 2,539,831,376 bytes
從卦中看托管堆中有 10w 個 System.Byte[] ,大概吃了一半的托管堆內存,從經驗上來說,這里面肯定有不少大的 Byte[],接下來做個簡單的過濾,發現有2.8w 的 size=0n32792 的 Byte[] 數組,簡化后如下:
0:000> !dumpheap -mt 7f563b0614c0 -min 0x8018 -max 0x8018
Address MT Size
7f552fc394a0 7f563b0614c0 32,792
7f552fc48858 7f563b0614c0 32,792
7f552fc52590 7f563b0614c0 32,792
7f552fc5e500 7f563b0614c0 32,792
7f552fc77fa0 7f563b0614c0 32,792
7f552fc834d0 7f563b0614c0 32,792
7f552fc8b718 7f563b0614c0 32,792
7f552fc93730 7f563b0614c0 32,792
...
7f5623f3f9a0 7f563b0614c0 32,792
7f5623f912e0 7f563b0614c0 32,792
7f5623f9b0a8 7f563b0614c0 32,792
7f5623fa5ee0 7f563b0614c0 32,792
Statistics:
MT Count TotalSize Class Name
7f563b0614c0 28,563 936,637,896 System.Byte[]
Total 28,563 objects, 936,637,896 bytes
那為什么有 2.8w 的 byte[] 得不到GC回收呢?隨機抽幾個使用 !gcroot 命令觀察,參考如下:
0:000> !gcroot 7f55c7a6cf90
r15:
-> 7f5567484b08 System.Threading.ThreadPoolWorkQueueThreadLocals
-> 7f5567484b40 System.Threading.Thread
-> 7f55bb5d7d58 System.Threading.ExecutionContext
-> 7f55bb5d7ce8 System.Threading.AsyncLocalValueMap+MultiElementAsyncLocalValueMap
-> 7f55bb5d7d00 System.Collections.Generic.KeyValuePair<System.Threading.IAsyncLocal, System.Object>[]
-> 7f55bb5d7b98 xxxt+TraceScope
-> 7f555658d478 Microsoft.Extensions.DependencyInjection.ServiceLookup.ServiceProviderEngineScope
-> 7f56140ea038 Microsoft.Extensions.DependencyInjection.ServiceLookup.DynamicServiceProviderEngine
-> 7f56140ea0c8 Microsoft.Extensions.DependencyInjection.ServiceLookup.ServiceProviderEngineScope
-> 7f56140f7160 System.Collections.Generic.List<System.IDisposable>
-> 7f561412fd80 System.IDisposable[]
...
-> 7f561434b6f8 xxx.RequestTrace.AspNetTraceContext
...
-> 7f561406ef10 Microsoft.Extensions.Configuration.ConfigurationRoot
-> 7f561406ef30 Microsoft.Extensions.Configuration.ConfigurationReloadToken
-> 7f561406ef48 System.Threading.CancellationTokenSource
-> 7f56140f50a0 System.Threading.CancellationTokenSource+CallbackPartition[]
-> 7f56142b1d38 System.Threading.CancellationTokenSource+CallbackPartition
-> 7f55cfe26900 System.Threading.CancellationTokenSource+CallbackNode
-> 7f55a585a858 System.Threading.CancellationTokenSource+CallbackNode
...
-> 7f55a59b4848 System.Threading.CancellationTokenSource+CallbackNode
-> 7f55a59b4178 System.Threading.ExecutionContext
-> 7f55a59b4108 System.Threading.AsyncLocalValueMap+MultiElementAsyncLocalValueMap
-> 7f55a59b4120 System.Collections.Generic.KeyValuePair<System.Threading.IAsyncLocal, System.Object>[]
-> 7f55a59b3bb0 xxx.RequestTrace.AspNetTraceContext
-> 7f55a59b3708 Microsoft.AspNetCore.Http.DefaultHttpContext
-> 7f55a59b2308 Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.Http1Connection
-> 7f55a59b4b68 Microsoft.AspNetCore.WebUtilities.FileBufferingReadStream
-> 7f55a5b219b0 System.Byte[]
通過卦中的輸出,很快就發現有超多的 CallbackNode 節點,熟悉這個類的朋友應該知道,它是你通過 CancellationTokenSource.Token.Register 注冊的回調函數,在底層的話可以觀察首節點的 Id=17812 即可,參考如下:
0:000> !do 7f55cfe26900
Name: System.Threading.CancellationTokenSource+CallbackNode
MethodTable: 00007f56405c1c18
EEClass: 00007f56405aa078
Size: 80(0x50) bytes
File: /usr/share/dotnet/shared/Microsoft.NETCore.App/3.1.32/System.Private.CoreLib.dll
Fields:
MT Field Offset Type VT Attr Value Name
00007f56405c2690 4000705 8 ...CallbackPartition 0 instance 00007f56142b1d38 Partition
00007f56405c1c18 4000706 10 ...urce+CallbackNode 0 instance 0000000000000000 Prev
00007f56405c1c18 4000707 18 ...urce+CallbackNode 0 instance 00007f55cc695010 Next
00007f563b05b468 4000708 40 System.Int64 1 instance 17812 Id
00007f56405b55b8 4000709 20 ...Private.CoreLib]] 0 instance 00007f55cfe268c0 Callback
00007f563a5dc7a8 400070a 28 System.Object 0 instance 00007f561406ef10 CallbackState
00007f563b0f8fe0 400070b 30 ....ExecutionContext 0 instance 00007f55cfe23e88 ExecutionContext
00007f563b0f91d0 400070c 38 ...ronizationContext 0 instance 0000000000000000 SynchronizationContext
從卦中看到這個單鏈表已經有17812個節點,正常情況下不會有w級別的注冊函數,這是一種失控的狀態。
3. 為什么Register失控了
要想找到這個答案,可以深挖 Callback 和 CallbackState 兩個字段,參考如下:
0:000> !dumpdelegate 00007f55cfe268c0
Target Method Name
00007f55cfe26788 00007f5640ae1040 xxx.ConfigurationExtensions+<>c__DisplayClass18_0.<OnChange>g__Callback|0(System.Object)
0:000> !DumpObj /d 00007f561406ef10
Name: Microsoft.Extensions.Configuration.ConfigurationRoot
MethodTable: 00007f56407bd910
EEClass: 00007f564079f808
Size: 32(0x20) bytes
File: /data/apps/xxx/Microsoft.Extensions.Configuration.dll
Fields:
MT Field Offset Type VT Attr Value Name
00007f56407bd3a0 400000a 8 ...on.Abstractions]] 0 instance 00007f561406ece0 _providers
00007f56407be120 400000b 10 ...rationReloadToken 0 instance 00007f561406ef30 _changeToken

從卦象來看是和 IConfiguration 相關的變更令牌有關,比如對配置文件的監視,由于這些都是底層的東西,看樣子是高層使用者用的不對。
那到底是哪里的不對呢?一時也沒搞清楚,接下來回頭觀察引用鏈,發現有用戶代碼 TraceScope 和 AspNetTraceContext,趕緊研究相關代碼,發現有相關的 Register 操作。

并且在托管堆上也發現了3w多的 AspNetTraceContext 和 62w 的TraceScope,這也是一種失控狀態,輸出如下:
...
7f5640af9a70 31,949 4,345,064 xxx.RequestTrace.AspNetTraceContext
...
7f5640ae64f8 629,539 75,544,680 xxx.TraceContext+TraceScope
...
7f563b0614c0 101,757 1,047,579,498 System.Byte[]
最后發現它是一個 HttpTrace的中間件,公司的架構組做的,然后就是告訴朋友重點關注對 AspNetTraceContext 和 TraceScope 相關的注冊代碼,或者直接將這個中間件剔除,觀察下效果。
三:總結
這篇文章需要你對 CancellationToken 和 IChangeToken 有一定的了解,調試工作者難呀,不管是開源的還是非開源的,若要解決問題,需要對這些東西的底層有一個較微觀的認識。。。否則如何給出指導意見呢!

浙公網安備 33010602011771號