Image Segmentation: Dense Prediction with Attentive Feature Aggregation

Dense Prediction with Attentive Feature Aggregation

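Original document: https://www.yuque.com/lart/papers/xnqoi0

I stumbled on this paper while browsing arXiv. It can be seen as an extension of the earlier work Hierarchical multi-scale attention for semantic segmentation.

Reading the paper through its abstract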

Aggregating information from features across different layers is an essential operation for dense prediction models.

This paper focuses on the problem of aggregating features across layers.

Despite its limited expressiveness, feature concatenation dominates the choice of aggregation operations.

Although it is called feature concatenation, in most cases it is followed by some fairly complex convolutional structure.

In this paper, we introduce Attentive Feature Aggregation (AFA) to fuse different network layers with more expressive non-linear operations. AFA exploits both spatial and channel attention to compute weighted average of the layer activations.

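The core module is AFA. It uses spatial and channel attention to compute a weighted sum of features from different layers, which yields a non-linear aggregation operation.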

Inspired by neural volume rendering, we extend AFA with Scale-Space Rendering (SSR) to perform late fusion of multi-scale predictions.

An interesting point here: this is a structure for fusing multi-scale predictions, and its design borrows the idea of neural volume rendering (an area I am not very familiar with).

AFA is applicable to a wide range of existing network designs.

Since AFA is a model-agnostic module, it can easily be transferred to different models to aggregate features across layers.

Our experiments show consistent and significant improvements on challenging semantic segmentation benchmarks, including Cityscapes, BDD100K, and Mapillary Vistas, at negligible computational and parameter overhead. In particular, AFA improves the performance of the Deep Layer Aggregation (DLA) model by nearly 6% mIoU on Cityscapes. Our experimental analyses show that AFA learns to progressively refine segmentation maps and to improve boundary details, leading to new state-of-the-art results on boundary detection benchmarks on BSDS500 and NYUDv2.

The authors evaluate on both segmentation and boundary detection tasks.

Main content

We propose Attentive Feature Aggregation (AFA) as a non-linear feature fusion operation to replace the prevailing tensor concatenation or summation strategies.
- Our attention module uses both spatial and channel attention to learn and predict the importance of each input signal during fusion. Aggregation is accomplished by computing a linear combination of the input features at each spatial location, weighted by their relevance.
- Compared to linear fusion operations, our AFA module can take into consideration complex feature interactions and attend to different feature levels depending on their importance.
- AFA introduces negligible computation and parameter overhead and can be easily used to replace fusion operations in existing methods, such as skip connections.
- Unlike linear aggregation, our AFA module leverages extracted spatial and channel information to efficiently select the essential features and to increase the receptive field at the same time.
Inspired by neural volume rendering [Volume rendering, Nerf: Representing scenes as neural radiance fields for view synthesis], we propose Scale-Space Rendering (SSR) as a novel attention computation mechanism to fuse multi-scale predictions.
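- _We treat those predictions as sampled data in scale-space and design a coarse-to-fine attention concept to render final predictions._ (This is an interesting idea: obtaining the final prediction is treated as sampling predictions at different scales from scale-space and rendering the final prediction from them.)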
- Repeated use of attention layers may lead to numerical instability or vanishing gradients. We extend the above-mentioned attention mechanism to fuse the dense predictions from multi-scale inputs more effectively.
- Our solution resembles a volume rendering scheme applied to the scale space. This scheme provides a hierarchical, coarse-to-fine strategy to combine features, leveraging a scale-specific attention mechanism. We will also show that our approach generalizes the hierarchical multi-scale attention method [Hierarchical multi-scale attention for semantic segmentation].

Attentive Feature Aggregation (AFA)

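Two forms of aggregation are designed here: one for two inputs, and one for progressive aggregation of multiple inputs. Both are built on spatial attention and channel attention. Note that the computation is always pairwise, so in each case an attention map is computed and then passed through a sigmoid to form relative weights.

For the two-input form, spatial attention is computed from the shallower feature, since it carries rich spatial information, while channel attention is computed from the deeper feature, since it carries more complex channel features. For the multi-input form (the figure in the paper shows only three layers, but more inputs can be used), both channel and spatial attention are computed entirely from the current layer's input, and if there is an output from a previous stage, this attention is used to weight the current input against that previous output. As for the aggregation order, the paper says "a feature with higher priority will have gone through a higher number of aggregations"; my understanding is that this is a process going from deep to shallow layers.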
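To make the two-input form more concrete, here is a minimal NumPy sketch. It only illustrates the description above: spatial attention from the shallow feature, channel attention from the deep feature, and a sigmoid to turn them into relative weights. The names `binary_afa`, `w_spatial` and `w_channel` are hypothetical, the attention heads are reduced to single linear maps, and the way the two attention maps are merged into one gate is an assumption of this sketch, so the paper's exact formula may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def binary_afa(f_shallow, f_deep, w_spatial, w_channel):
    """Fuse two (C, H, W) feature maps with spatial and channel attention.

    w_spatial: (C,) hypothetical 1x1-conv weights giving one logit per pixel.
    w_channel: (C, C) hypothetical linear layer on globally pooled features.
    """
    # Spatial attention from the shallow feature: one weight per pixel, (1, H, W).
    a_s = sigmoid(np.tensordot(w_spatial, f_shallow, axes=([0], [0])))[None]
    # Channel attention from the deep feature: one weight per channel, (C, 1, 1).
    pooled = f_deep.mean(axis=(1, 2))
    a_c = sigmoid(w_channel @ pooled)[:, None, None]
    # ASSUMPTION: merge both maps into a single gate in [0, 1] and take a
    # convex combination of the two inputs (a weighted average per element).
    gate = a_s * a_c
    return gate * f_shallow + (1.0 - gate) * f_deep

rng = np.random.default_rng(0)
f_s = rng.normal(size=(4, 5, 6))
f_d = rng.normal(size=(4, 5, 6))
fused = binary_afa(f_s, f_d, rng.normal(size=4), rng.normal(size=(4, 4)))
print(fused.shape)
```

Because the gate lies in [0, 1], every output element is a weighted average of the two inputs at that position, which is the property the abstract emphasises.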

The proposed aggregation module can be used in many architectures, such as DLA, UNet, HRNet, and FCN.

Scale-Space Rendering (SSR)

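The SSR proposed here is a strategy closer to model ensembling.

It fuses multi-scale inference by computing relative weights for the predictions at different scales. This raises two questions:

- How is SSR learned? The paper does not say. But according to the figure in the paper, training uses inputs at two scales, which indicates that SSR is trainable. Since it is a learnable structure that predicts parameters, it automatically predicts an attention parameter for each input. The parameters computed for the inputs at different scales then give the final weighting across scales.
- At which scale are predictions of different sizes finally merged? The paper does not say this either. But since the figure describes scales relative to the original input, the predictions should ultimately be merged at 1.0x the original input scale (this should match the design in hierarchical multi-scale attention).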

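Formulation

To formulate the fusion of multi-scale predictions, the authors first focus on a single pixel and assume that the model provides predictions for this target pixel at \(k\) different scales.
The prediction at the \(i\)-th scale is written as \(P_i \in \mathbb{R}^{d}\). The feature representation of the target pixel in scale-space is then defined as \(P \triangleq (P_1, \dots, P_k)\). Furthermore, \(i<j\) is assumed to mean that scale \(i\) is coarser than scale \(j\).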

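The target pixel can then be pictured as a ray travelling through scale-space, from scale \(1\) towards scale \(k\).

Based on this idea, the original hierarchical attention in the proposed multi-feature fusion mechanism is redesigned to mimic the volume-rendering equation, where the volume is given implicitly by the scale-space.

To this end, besides the feature representation \(P_i\) at scale \(i\), the model is assumed to also predict a scalar \(y_i \in \mathbb{R}\) for the target pixel. In the volume rendering context, the probability that a particle passes through scale \(i\), given some non-negative scalar function \(\phi: \mathbb{R} \rightarrow \mathbb{R}_{+}\), can be expressed as \(e^{-\phi(y_i)}\).

The scale attention \(\alpha_i\) can then be expressed as the probability that the particle reaches scale \(i\) and stops there (each step is a Bernoulli trial, stay or pass: the particle passes at every earlier scale and stays at the current one):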

\(\alpha_i(y) \triangleq [1 - e^{-\phi(y_i)}] \prod^{i-1}_{j=1}e^{-\phi(y_j)}, \, y \triangleq (y_1, \dots, y_k)\)

\(y\) denotes the scalar parameters predicted for the target pixel at each scale.

\(P_{final} \triangleq \sum^{k}_{i=1}P_i \alpha_i(y)\)

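Finally, following the volume rendering equation, the final prediction for the target pixel is obtained as a sum of its multi-scale predictions, weighted by the attention coefficients of the different scales. This also reflects that the final feature for the target pixel is obtained by fusing the feature representations of all scales, driven by \(y\).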
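The two formulas above translate almost directly into code. The following NumPy sketch computes \(\alpha_i(y)\) and \(P_{final}\) for a single pixel; the function names `ssr_weights` and `ssr_fuse` are hypothetical, and \(\phi\) is passed in as a plain Python callable (the absolute value is used as the default here).

```python
import numpy as np

def ssr_weights(y, phi=np.abs):
    """alpha_i = (1 - a_i) * prod_{j<i} a_j, with a_i = exp(-phi(y_i))."""
    a = np.exp(-phi(np.asarray(y, dtype=float)))   # pass-through probabilities
    # Probability of having passed all earlier scales: [1, a_1, a_1*a_2, ...]
    reach = np.concatenate(([1.0], np.cumprod(a)[:-1]))
    return (1.0 - a) * reach

def ssr_fuse(preds, y, phi=np.abs):
    """preds: (k, d) per-scale predictions for one pixel; returns the fused (d,) vector."""
    alpha = ssr_weights(y, phi)
    return alpha @ np.asarray(preds, dtype=float)

y = [0.5, -1.0, 2.0]                     # one scalar per scale, coarse to fine
alpha = ssr_weights(y)
preds = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
print(alpha, ssr_fuse(preds, y))
# The weights telescope: sum_i alpha_i = 1 - prod_j a_j
print(alpha.sum(), 1.0 - np.exp(-np.abs(y).sum()))
```

Note that the weights sum to \(1 - \prod_{j=1}^{k} e^{-\phi(y_j)}\) rather than exactly 1; they sum to 1 only if some scale has pass-through probability 0.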

Taking the context together, the design here should ultimately merge all scales into scale 1.

The proposed SSR can be seen as a generalization of Hierarchical Multi-Scale Attention (HMA) [Hierarchical multi-scale attention for semantic segmentation, https://github.com/NVIDIA/semantic-segmentation].

Setting \(\phi(y_i) \triangleq \log(1 + e^{y_i})\) and fixing \(\phi(y_k) \triangleq \infty\) recovers the latter. In this case:

\[\alpha_i = [1-\frac{1}{1+e^{y_i}}] \prod^{i-1}_{j=1}\frac{1}{1+e^{y_j}}, \\ \alpha_1=1-\frac{1}{1+e^{y_1}}, \\ \alpha_k=\prod^{k-1}_{j=1}\frac{1}{1+e^{y_j}}. \]

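Two things about this form are puzzling:

- The form does not look quite right. The original hierarchical multi-scale attention uses a sigmoid to fuse the different scales, and at first glance this does not look like a sigmoid. (Algebraically, though, \(1-\frac{1}{1+e^{y_i}} = \text{sigmoid}(y_i)\), so the first factor is in fact a sigmoid.)
- Following this form, and combining it with the cascade of (sigmoid) spatial attention, the output sits at position \(i=1\); that is, information from the other levels is merged step by step in decreasing order of level index. This is roughly similar to the form in the paper's figure.

The input is rescaled before being fed to the model, while the final output corresponds to 1.0x the original input size. So, assume features are merged following the scale index from \(k\) down to 1, with the result output at level 1.

Since the attention in this paper is built on the probability of not selecting the current level (passing through it), the overall form corresponding to the figure is: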

\[\alpha_i = [1-p(y_i)]\prod_{j=1}^{i-1} p(y_j), \\ \alpha_1 = 1-p(y_1), \\ \alpha_k = \prod_{j=1}^{k-1} p(y_j), \\ p(y_i) = 1-\text{sigmoid}(y_i), \\ \Rightarrow P = \sum^{k}_{i=1} P_{i}\alpha_i(y). \]

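As can be seen, the attention weight of the first level is simply the sigmoid output, while the weight of level \(k\) is obtained by taking the complement of each level's sigmoid output and multiplying them together.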
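This special case can be checked numerically. The sketch below plugs \(\phi(y) = \log(1 + e^{y})\) into the scale attention and forces the last scale to have pass-through probability 0 (that is, \(\phi(y_k) = \infty\)). The helper name `hma_weights` is hypothetical; it only evaluates the formulas in this post and is not taken from the HMA code base.

```python
import numpy as np

def softplus(y):
    return np.logaddexp(0.0, y)          # log(1 + e^y), numerically stable

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def hma_weights(y):
    """y: scalars for scales 1..k-1; scale k uses phi = inf, i.e. a_k = 0."""
    a = np.append(np.exp(-softplus(np.asarray(y, dtype=float))), 0.0)
    reach = np.concatenate(([1.0], np.cumprod(a)[:-1]))
    return (1.0 - a) * reach

y = np.array([0.3, -1.2])                # three scales in total
alpha = hma_weights(y)

# The first weight is a plain sigmoid of y_1.
print(alpha[0], sigmoid(y[0]))
# The last weight is the product of the complements of the sigmoids.
print(alpha[-1], np.prod(1.0 - sigmoid(y)))
# With a_k = 0 the weights form a proper convex combination.
print(alpha.sum())
```

Since \(e^{-\log(1+e^{y})} = 1 - \text{sigmoid}(y)\), this choice of \(\phi\) reproduces a chain of sigmoid gates.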

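Choice of \(\phi\)

The experiments use the absolute value function: \(\phi(y_i) \triangleq |y_i|\). This is motivated by an analysis of how to better preserve the gradient flow through the attention mechanism, since the authors found that existing attention mechanisms can suffer from vanishing gradients.

Recall the form of the attention coefficients from above: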

\[\alpha_i(y) \triangleq [1 - e^{-\phi(y_i)}] \prod^{i-1}_{j=1}e^{-\phi(y_j)} = \prod^{i-1}_{j=1}e^{-\phi(y_j)} - \prod^{i}_{j=1}e^{-\phi(y_j)}, \, y \triangleq (y_1, \dots, y_k) \]

Consider the derivative of the level-\(i\) coefficient \(\alpha_i(y)\) with respect to the learnable parameter \(y_l\):

\[J_{il} \triangleq \frac{\partial \alpha_i(y)}{\partial y_l} = \begin{cases} \frac{\partial [-e^{-\phi(y_i)}]}{\partial y_l}\prod^{i-1}_{j=1}e^{-\phi(y_j)} = \frac{\partial \phi(y_i)}{\partial y_l}\prod^{i}_{j=1}e^{-\phi(y_j)} = \phi '(y_l)\prod^{i}_{j=1}e^{-\phi(y_j)} & \text{ if } l= i\\ 0 & \text{ if } l> i \\ -\phi '(y_l)\prod^{i-1}_{j=1}e^{-\phi(y_j)} + \phi '(y_l)\prod^{i}_{j=1}e^{-\phi(y_j)} = -\phi '(y_l)\alpha_i(y) & \text{ if } l< i \end{cases} \]

With two scales, i.e. \(k=2\):

\[J = \begin{bmatrix} \phi '(y_1)a_1 & 0 \\ -\phi '(y_1)a_1(1-a_2) & \phi '(y_2)a_1a_2 \end{bmatrix}, \\ a_i \triangleq e^{-\phi(y_i)}. \]

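The top-left entry is the derivative of the level-1 attention coefficient with respect to the level-1 parameter, and the top-right entry is the derivative of level 1 with respect to the level-2 parameter. As can be seen, when \(a_1 \rightarrow 0\) the gradient vanishes, regardless of the value of \(a_2\).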
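The case analysis for \(J_{il}\) can be sanity-checked against finite differences. The sketch below does this for \(k=2\), using the smooth choice \(\phi(y) = \log(1+e^{y})\) so that central differences are well behaved everywhere. All function names are hypothetical helpers for this check.

```python
import numpy as np

def softplus(y):
    return np.logaddexp(0.0, y)

def d_softplus(y):
    return 1.0 / (1.0 + np.exp(-y))

def alpha(y, phi):
    a = np.exp(-phi(y))
    reach = np.concatenate(([1.0], np.cumprod(a)[:-1]))
    return (1.0 - a) * reach

def jacobian_closed_form(y, phi, dphi):
    """J[i, l] = d alpha_i / d y_l, following the three cases l = i, l > i, l < i."""
    a = np.exp(-phi(y))
    al = alpha(y, phi)
    k = len(y)
    J = np.zeros((k, k))                 # entries with l > i stay zero
    for i in range(k):
        for l in range(i + 1):
            if l == i:
                J[i, l] = dphi(y[l]) * np.prod(a[: i + 1])
            else:
                J[i, l] = -dphi(y[l]) * al[i]
    return J

def jacobian_numeric(y, phi, eps=1e-6):
    k = len(y)
    J = np.zeros((k, k))
    for l in range(k):
        step = np.zeros(k)
        step[l] = eps
        J[:, l] = (alpha(y + step, phi) - alpha(y - step, phi)) / (2 * eps)
    return J

y = np.array([0.4, -0.7])
J_cf = jacobian_closed_form(y, softplus, d_softplus)
J_fd = jacobian_numeric(y, softplus)
print(J_cf)
print(np.max(np.abs(J_cf - J_fd)))
```

For \(k=2\) the \(l=i\) case gives \(\phi'(y_1)a_1\) for the top-left entry and \(\phi'(y_2)a_1a_2\) for the bottom-right entry, so every non-zero entry carries a factor \(a_1\).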

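So to avoid vanishing gradients, \(\phi\) has to be chosen carefully. With the absolute value function, the Jacobian does not vanish when \(a_1 > 0\) and \((y_1, y_2) \neq (0, 0)\).

But with the absolute value function the derivative is \(\pm 1\), so wouldn't the gradient still vanish?

Consider the HMA case. With the form given by the authors, we have: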

\[\phi '(y_i) = \frac{\partial \log(1+e^{y_i})}{\partial y_i} = \frac{e^{y_i}}{1+e^{y_i}} = 1 - \frac{1}{1+e^{y_i}} = 1 - e^{-\log(1+e^{y_i})} = 1 - a_i, \\ a_2 = 0. \]

Branch 2 does not take part in the attention computation. The gradient vanishes when \(a_1 \rightarrow 1\).
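A small numeric comparison may help here. The sketch below evaluates the derivative of \(\alpha_1\) with respect to \(y_1\), which is \(\phi'(y_1)a_1\), for the absolute value and for the HMA choice \(\phi(y)=\log(1+e^{y})\). The function names are hypothetical, and the sample points are chosen only to approach \(a_1 \rightarrow 1\) and \(a_1 \rightarrow 0\).

```python
import numpy as np

def j11_abs(y1):
    # phi(y) = |y|: phi'(y) = sign(y) for y != 0, and a_1 = exp(-|y|)
    return np.sign(y1) * np.exp(-np.abs(y1))

def j11_softplus(y1):
    # phi(y) = log(1 + e^y): a_1 = 1 / (1 + e^y), and phi'(y) = 1 - a_1
    a1 = 1.0 / (1.0 + np.exp(y1))
    return (1.0 - a1) * a1

# a_1 -> 1: this means y1 -> 0 for |y|, and y1 -> -inf for softplus.
print(abs(j11_abs(1e-3)), j11_softplus(-10.0))

# a_1 -> 0: this means |y1| -> inf for |y|, and y1 -> +inf for softplus.
print(abs(j11_abs(10.0)), j11_softplus(10.0))
```

In this sketch both choices lose the gradient as \(a_1 \rightarrow 0\), but only the softplus form also loses it as \(a_1 \rightarrow 1\); with the absolute value the magnitude stays close to 1 there, because \(|\phi'| = 1\) away from \(y_1 = 0\).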

With my form from earlier, we have instead:

\[\phi '(y_i) = \frac{\partial \log(1+e^{-y_i})}{\partial y_i} = -\frac{e^{-y_i}}{1+e^{-y_i}}, \\ a_i = e^{-\log(1+e^{-y_i})} = \frac{1}{1+e^{-y_i}}, \\ \phi '(y_i) = a_i - 1. \]

The vanishing problem appears here as well.

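Links

Paper: https://arxiv.org/abs/2111.00770
Code: http://vis.xyz/pub/dla-afa
The idea of this paper comes from NeRF, so it may help to read an introduction to NeRF before looking at the design of SSR.
Some material on volume rendering:
- A very rich and comprehensive Chinese CG learning resource: GPU 編程與 CG 語言之陽春白雪下里巴人 (GPU Programming And Cg Language Primer)
- A short survey that appeared on CNKI in 2021: 基于神經輻射場的視點合成算法綜述 (a survey of view synthesis algorithms based on neural radiance fields)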

posted @ 2021-11-04 23:06 lart
