Notes on a compatibility issue between the Mono Runtime and the Mono debugger agent
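When Mono is embedded in a C++ application, it can be configured to pause at startup and continue running only after a debugger has connected. This is very useful for debugging code that runs very early during startup. It is done by passing the embedding and suspend options (inside the --debugger-agent option string) to mono_jit_parse_options. The debugger agent's usage text: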
```c
static void
print_usage (void)
{
    PRINT_ERROR_MSG ("Usage: mono --debugger-agent=[<option>=<value>,...] ...\n");
    PRINT_ERROR_MSG ("Available options:\n");
    PRINT_ERROR_MSG (" transport=<transport>\t\tTransport to use for connecting to the debugger (mandatory, possible values: 'dt_socket')\n");
    PRINT_ERROR_MSG (" address=<hostname>:<port>\tAddress to connect to (mandatory)\n");
    PRINT_ERROR_MSG (" loglevel=<n>\t\t\tLog level (defaults to 0)\n");
    PRINT_ERROR_MSG (" logfile=<file>\t\tFile to log to (defaults to stdout)\n");
    PRINT_ERROR_MSG (" suspend=y/n\t\t\tWhether to suspend after startup.\n");
    PRINT_ERROR_MSG (" timeout=<n>\t\t\tTimeout for connecting in milliseconds.\n");
    PRINT_ERROR_MSG (" server=y/n\t\t\tWhether to listen for a client connection.\n");
    PRINT_ERROR_MSG (" keepalive=<n>\t\t\tSend keepalive events every n milliseconds.\n");
    PRINT_ERROR_MSG (" setpgid=y/n\t\t\tWhether to call setpid(0, 0) after startup.\n");
    PRINT_ERROR_MSG (" help\t\t\t\tPrint this help.\n");
}
```
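Below is a sketch of the option string a host could pass. It assumes this Mono build accepts embedding=1 alongside the documented options. build_agent_option is a hypothetical helper, and both the server=y choice and the 127.0.0.1:56000 address are illustrative rather than taken from the original setup. The Mono calls in the trailing comment belong to the public embedding API. They stay in a comment so the block compiles without the Mono headers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: formats a --debugger-agent option that listens on
 * `address` and, if `suspend` is non-zero, waits for a debugger at startup.
 * Assumption: the Mono build accepts "embedding=1"; "server=y" is a choice
 * made for this sketch. Returns 0 on success, -1 if `buf` is too small. */
static int build_agent_option(char *buf, size_t size, const char *address, int suspend)
{
    assert(buf != NULL && size > 0 && address != NULL);
    int n = snprintf(buf, size,
                     "--debugger-agent=transport=dt_socket,address=%s,"
                     "server=y,suspend=%s,embedding=1",
                     address, suspend ? "y" : "n");
    /* snprintf returns the untruncated length; >= size means it was cut off */
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}

/*
 * Host side (requires the Mono headers; domain name and runtime version
 * are example values):
 *
 *     char opt[256];
 *     build_agent_option(opt, sizeof opt, "127.0.0.1:56000", 1);
 *     char *argv[] = { opt };
 *     mono_jit_parse_options(1, argv);
 *     mono_debug_init(MONO_DEBUG_FORMAT_MONO);
 *     MonoDomain *domain = mono_jit_init_version("host", "v4.0.30319");
 */
```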
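In my testing, timeout does not behave the way one might expect. If no debugger has connected by the time it expires, the agent stops accepting debugger connections altogether. That is a problem in its own right.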
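With startup suspended this way, the domain is initialized via mono_jit_init_version and the host then goes on to run other code. At that point an error like the following can occur: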
```
mono_coop_mutex_lock Cannot transition thread 000000??0000???? from STATE_BLOCKING with DO_BLOCKING
```
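After some analysis, it appears to be a compatibility problem with cooperative GC synchronization. When Mono's startup is suspended, the debugger issues a series of queries as soon as it connects, and CMD_VM_GET_TYPES_FOR_SOURCE_FILE is one of them. The VM commands are defined as follows: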
```c
#define CMD_VM_VERSION MDBGPROT_CMD_VM_VERSION
#define CMD_VM_SET_PROTOCOL_VERSION MDBGPROT_CMD_VM_SET_PROTOCOL_VERSION
#define CMD_VM_ALL_THREADS MDBGPROT_CMD_VM_ALL_THREADS
#define CMD_VM_SUSPEND MDBGPROT_CMD_VM_SUSPEND
#define CMD_VM_RESUME MDBGPROT_CMD_VM_RESUME
#define CMD_VM_DISPOSE MDBGPROT_CMD_VM_DISPOSE
#define CMD_VM_EXIT MDBGPROT_CMD_VM_EXIT
#define CMD_VM_INVOKE_METHOD MDBGPROT_CMD_VM_INVOKE_METHOD
#define CMD_VM_INVOKE_METHODS MDBGPROT_CMD_VM_INVOKE_METHODS
#define CMD_VM_ABORT_INVOKE MDBGPROT_CMD_VM_ABORT_INVOKE
#define CMD_VM_SET_KEEPALIVE MDBGPROT_CMD_VM_SET_KEEPALIVE
#define CMD_VM_GET_TYPES_FOR_SOURCE_FILE MDBGPROT_CMD_VM_GET_TYPES_FOR_SOURCE_FILE
#define CMD_VM_GET_TYPES MDBGPROT_CMD_VM_GET_TYPES
#define CMD_VM_START_BUFFERING MDBGPROT_CMD_VM_START_BUFFERING
#define CMD_VM_STOP_BUFFERING MDBGPROT_CMD_VM_STOP_BUFFERING
```
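To make the request concrete: the handler below reads its arguments with decode_string and decode_byte. The sketch encodes a matching argument payload. It assumes decode_string expects a 4-byte big-endian length followed by the raw bytes with no terminator, and that decode_byte reads a single byte. The packet header and numeric command IDs are left out, and encode_types_for_source_file is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical encoder for the arguments of CMD_VM_GET_TYPES_FOR_SOURCE_FILE,
 * in the order the handler decodes them:
 *   decode_string -> 4-byte big-endian length + bytes (assumed wire format)
 *   decode_byte   -> 1 byte, ignore_case
 * Returns the number of bytes written, or 0 if `out` is too small. */
static size_t encode_types_for_source_file(uint8_t *out, size_t cap,
                                           const char *fname, int ignore_case)
{
    assert(out != NULL && fname != NULL);
    size_t len = strlen(fname);
    if (len > INT32_MAX || cap < 4 + len + 1)
        return 0;
    out[0] = (uint8_t)(len >> 24);   /* length, most significant byte first */
    out[1] = (uint8_t)(len >> 16);
    out[2] = (uint8_t)(len >> 8);
    out[3] = (uint8_t)len;
    memcpy(out + 4, fname, len);     /* file name bytes, no NUL terminator */
    out[4 + len] = ignore_case ? 1 : 0;
    return 4 + len + 1;
}
```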
```c
case CMD_VM_GET_TYPES_FOR_SOURCE_FILE: {
    char *fname, *basename;
    gboolean ignore_case;
    GPtrArray *res_classes, *res_domains;
    /* timing instrumentation used in this investigation (needs <time.h>) */
    clock_t t_start, t_end;
    double time_spent;

    fname = decode_string (p, &p, end);
    ignore_case = decode_byte (p, &p, end);
    basename = dbg_path_get_basename (fname);
    res_classes = g_ptr_array_new ();
    res_domains = g_ptr_array_new ();

    /* the loader lock is held for the whole scan over all domains */
    mono_loader_lock ();
    t_start = clock ();
    GetTypesForSourceFileArgs args;
    memset (&args, 0, sizeof (args));
    args.ignore_case = ignore_case;
    args.basename = basename;
    args.res_classes = res_classes;
    args.res_domains = res_domains;
    mono_de_foreach_domain (get_types_for_source_file, &args);
    t_end = clock ();
    mono_loader_unlock ();

    time_spent = (double)(t_end - t_start) / CLOCKS_PER_SEC;
    g_print ("CMD_VM_GET_TYPES_FOR_SOURCE_FILE:%f\n", (float)time_spent);

    g_free (fname);
    g_free (basename);
    buffer_add_int (buf, res_classes->len);
    for (guint i = 0; i < res_classes->len; ++i)
        buffer_add_typeid (buf, (MonoDomain *)g_ptr_array_index (res_domains, i), (MonoClass *)g_ptr_array_index (res_classes, i));
    g_ptr_array_free (res_classes, TRUE);
    g_ptr_array_free (res_domains, TRUE);
    break;
}
```
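This is the handler for that command. It takes the loader lock and holds it while it scans every domain. That can take anywhere from a moment to a long time, and meanwhile any other Mono runtime code that needs the loader lock has to wait. Here is mono_loader_lock: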
```c
/**
 * mono_loader_lock:
 *
 * See \c docs/thread-safety.txt for the locking strategy.
 */
void
mono_loader_lock (void)
{
    /* mono_locks_coop_acquire ends up in mono_coop_mutex_lock (below) */
    mono_locks_coop_acquire (&loader_mutex, LoaderLock);
    if (G_UNLIKELY (loader_lock_track_ownership)) {
        mono_native_tls_set_value (loader_lock_nest_id, GUINT_TO_POINTER (GPOINTER_TO_UINT (mono_native_tls_get_value (loader_lock_nest_id)) + 1));
    }
}

static inline void
mono_coop_mutex_lock (MonoCoopMutex *mutex)
{
    /* Avoid thread state switch if lock is not contended */
    if (mono_os_mutex_trylock (&mutex->m) == 0)
        return;

    MONO_ENTER_GC_SAFE;

    mono_os_mutex_lock (&mutex->m);

    MONO_EXIT_GC_SAFE;
}
```
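The fast and slow paths of mono_coop_mutex_lock can be reproduced with plain pthreads. In this sketch, on_slow_path stands in for MONO_ENTER_GC_SAFE. The thread that already holds the mutex plays the debugger-agent thread running CMD_VM_GET_TYPES_FOR_SOURCE_FILE. All names are hypothetical. The point is only that the state switch happens exclusively when the lock is contended, which is why the failure depends on timing.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

/* Same shape as mono_coop_mutex_lock: try first; only on failure run the
 * hook (where Mono switches thread state) and then block. */
static void coop_like_lock(pthread_mutex_t *m, void (*on_slow_path)(void))
{
    if (pthread_mutex_trylock(m) == 0)
        return;                 /* uncontended: no thread-state switch */
    if (on_slow_path != NULL)
        on_slow_path();         /* where Mono runs MONO_ENTER_GC_SAFE */
    pthread_mutex_lock(m);      /* Mono runs MONO_EXIT_GC_SAFE after this */
}

static pthread_mutex_t loader = PTHREAD_MUTEX_INITIALIZER; /* stands in for the loader lock */
static atomic_int slow_path_taken;

static void note_slow_path(void) { atomic_store(&slow_path_taken, 1); }

/* Plays the host's main thread calling into the runtime. */
static void *main_thread_body(void *arg)
{
    (void)arg;
    coop_like_lock(&loader, note_slow_path);
    pthread_mutex_unlock(&loader);
    return NULL;
}

/* The calling thread holds the lock, as the agent thread does while it
 * handles CMD_VM_GET_TYPES_FOR_SOURCE_FILE. Returns 1 if the other thread
 * had to take the slow path, -1 if the thread could not be created. */
static int demo_contended(void)
{
    pthread_t t;
    atomic_store(&slow_path_taken, 0);
    pthread_mutex_lock(&loader);
    if (pthread_create(&t, NULL, main_thread_body, NULL) != 0) {
        pthread_mutex_unlock(&loader);
        return -1;
    }
    while (atomic_load(&slow_path_taken) == 0)
        sched_yield();               /* wait until the other thread contends */
    pthread_mutex_unlock(&loader);   /* "agent" finishes and releases */
    pthread_join(t, NULL);
    return atomic_load(&slow_path_taken);
}
```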
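If trylock fails, the lock is contended and the thread has to wait for it. Before it blocks, the current thread is switched into the GC Safe state. This innocent-looking step is what can trigger the crash described above. My application calls mono_assembly_get_image, which itself changes the thread state, as this log shows: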
```
[ABORT_BLOCKING][000000000000B844] STATE_BLOCKING . -> RUNNING . (0 -> 0) mono_assembly_get_image
[DO_BLOCKING][000000000000B844] RUNNING . -> STATE_BLOCKING . (0 -> 0) mono_assembly_get_image
```
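As the log shows, the main thread is now in STATE_BLOCKING. Suppose another Mono API is called at this point while the debugger-agent thread holds the loader lock for CMD_VM_GET_TYPES_FOR_SOURCE_FILE. The main thread's mono_loader_lock then tries to enter the GC Safe state, which requires a thread-state transition. Because the thread is still in STATE_BLOCKING, the transition fails with the error above: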
```c
/*
This transitions the thread into a cooperative state where it's assumed to be suspended but can continue.

Native runtime code might want to put itself into a state where the thread is considered suspended but can keep running.
That state only works as long as the only managed state touched is blitable and was pinned before the transition.

It returns the action the caller must perform:

- Continue: Entered blocking state successfully;
- PollAndRetry: Async suspend raced and won, try to suspend and then retry;
*/
MonoDoBlockingResult
mono_threads_transition_do_blocking (MonoThreadInfo* info, const char *func)
{
    int raw_state, cur_state, suspend_count;
    gboolean no_safepoints;

retry_state_change:
    UNWRAP_THREAD_STATE (raw_state, cur_state, suspend_count, no_safepoints, info);
    switch (cur_state) {

    case STATE_RUNNING: //transition to blocked
        if (!(suspend_count == 0))
            mono_fatal_with_history ("suspend_count = %d, but should be == 0", suspend_count);
        if (no_safepoints)
            mono_fatal_with_history ("no_safepoints = TRUE, but should be FALSE in state RUNNING with DO_BLOCKING");
        if (thread_state_cas (&info->thread_state, build_thread_state (STATE_BLOCKING, suspend_count, no_safepoints), raw_state) != raw_state)
            goto retry_state_change;
        trace_state_change_sigsafe ("DO_BLOCKING", info, raw_state, STATE_BLOCKING, no_safepoints, 0, func);
        return DoBlockingContinue;

    case STATE_ASYNC_SUSPEND_REQUESTED:
        if (!(suspend_count > 0))
            mono_fatal_with_history ("suspend_count = %d, but should be > 0", suspend_count);
        if (no_safepoints)
            mono_fatal_with_history ("no_safepoints = TRUE, but should be FALSE in state ASYNC_SUSPEND_REQUESTED with DO_BLOCKING");
        trace_state_change_sigsafe ("DO_BLOCKING", info, raw_state, cur_state, no_safepoints, 0, func);
        return DoBlockingPollAndRetry;
/*
STATE_ASYNC_SUSPENDED
STATE_SELF_SUSPENDED: Code should not be running while suspended.

STATE_BLOCKING:
STATE_BLOCKING_SUSPEND_REQUESTED:
STATE_BLOCKING_SELF_SUSPENDED: Blocking is not nestable
STATE_BLOCKING_ASYNC_SUSPENDED: Blocking is not nestable _and_ code should not be running while suspended
*/
    default:
        mono_fatal_with_history ("%s Cannot transition thread %p from %s with DO_BLOCKING", func, mono_thread_info_get_tid (info), state_name (cur_state));
    }
}
```
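The sequence from the log can be replayed against a heavily simplified model of this state machine. Only RUNNING and BLOCKING are modelled. Suspend counts, the async-suspend states and the CAS retry loop are omitted, and the fatal case is returned as a value instead of aborting. It illustrates the rule that blocking is not nestable and is not Mono's implementation; do_blocking and abort_blocking here are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdio.h>

/* Simplified model: only the two states that appear in the log. */
typedef enum { STATE_RUNNING, STATE_BLOCKING } ThreadState;
typedef enum { TRANSITION_OK, TRANSITION_FATAL } TransitionResult;

/* DO_BLOCKING (what MONO_ENTER_GC_SAFE requests): only legal from RUNNING,
 * because blocking is not nestable. */
static TransitionResult do_blocking(ThreadState *s, const char *func)
{
    assert(s != NULL && func != NULL);
    if (*s != STATE_RUNNING) {
        /* Mono calls mono_fatal_with_history here and aborts. */
        fprintf(stderr, "%s Cannot transition thread from STATE_BLOCKING with DO_BLOCKING\n", func);
        return TRANSITION_FATAL;
    }
    *s = STATE_BLOCKING;
    return TRANSITION_OK;
}

/* ABORT_BLOCKING: back from BLOCKING to RUNNING. This model accepts only
 * BLOCKING as the source state. */
static TransitionResult abort_blocking(ThreadState *s)
{
    assert(s != NULL);
    if (*s != STATE_BLOCKING)
        return TRANSITION_FATAL;
    *s = STATE_RUNNING;
    return TRANSITION_OK;
}
```

The last assertion in the test reproduces the reported failure. The main thread left mono_assembly_get_image in STATE_BLOCKING, and the contended loader lock then requested a second, nested DO_BLOCKING.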
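I don't have a good fix for this yet. I didn't write Mono, and cooperative GC is fairly complex, so I'm wary of changing it. My current workaround is to delay the code that runs after domain initialization, for example by 1 to 2 seconds, so that it does not contend with the debugger-agent thread. This doesn't fix the root cause, of course. I have reported the details on GitHub for Microsoft to look into.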
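The workaround can be sketched as a plain delay between mono_jit_init_version and the rest of the host's startup code. The sketch assumes a POSIX host; on Windows, Sleep from <windows.h> would play the same role. sleep_ms and run_remaining_startup are hypothetical names. The 1500 ms value simply falls inside the 1 to 2 second range mentioned above. It is a timing heuristic, not a synchronization guarantee.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Hypothetical helper: sleep for `ms` milliseconds, resuming after signals. */
static void sleep_ms(long ms)
{
    assert(ms >= 0);
    struct timespec req = { .tv_sec = ms / 1000, .tv_nsec = (ms % 1000) * 1000000L };
    struct timespec rem;
    while (nanosleep(&req, &rem) == -1 && errno == EINTR)
        req = rem;   /* interrupted by a signal: sleep for what is left */
}

/*
 * Host side, following the workaround (requires the Mono headers; the
 * domain name, runtime version and run_remaining_startup are examples):
 *
 *     MonoDomain *domain = mono_jit_init_version("host", "v4.0.30319");
 *     sleep_ms(1500);               // let the agent's first queries finish
 *     run_remaining_startup(domain);
 */
```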
