[link]
Robust and fast detection of anatomical structures is a prerequisite for both diagnostic and interventional medical image analysis. Current solutions for anatomy detection are typically based on machine learning techniques that exploit large annotated image databases in order to learn the appearance of the captured anatomy. These solutions are subject to several limitations, including the use of suboptimal feature engineering techniques and, most importantly, computationally suboptimal search schemes for anatomy detection. To address these issues, Ghesu et al. follow a new paradigm and reformulate landmark detection as a behaviour learning task for an artificial agent: a reinforcement learning problem in which an agent is trained to navigate the 3D image space efficiently and stop as close to the landmark as possible. The authors present this as the first application of deep reinforcement learning to this kind of detection problem. They build the agent on the Deep Q-Network (DQN) framework, in which a Convolutional Neural Network parameterizes the optimal action-value function $Q^*$ in a non-linear way. To ensure that the agent finds an object of interest regardless of its size, the authors propose a multi-scale search strategy in which a separate CNN operates on the image at each scale. They also employ other techniques such as the $\epsilon$-greedy strategy, which chooses actions randomly with a small probability at every iteration, and experience replay, in which a buffer records the agent's trajectories and is then sampled to fine-tune the Q-network. The method is evaluated on an internal data set of CT scans using several metrics, the most important being the failure rate, defined as the percentage of cases in which the agent's final position lies farther than a fixed distance (in mm) from the true landmark. The method achieved a 0\% failure rate for landmark detection in CT scans. |
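A minimal sketch of the two ingredients discussed above, the $\epsilon$-greedy action choice over the six axis-aligned moves in 3D and the experience replay buffer; the names (e.g., \texttt{q\_network} as a callable returning six Q-values) are assumptions for illustration, not the authors' implementation.
\begin{verbatim}
import random
from collections import deque
import numpy as np

ACTIONS = [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]  # +/- x, y, z

def epsilon_greedy(q_network, state_patch, epsilon):
    # With probability epsilon explore randomly, otherwise exploit the Q-network.
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))
    q_values = q_network(state_patch)      # assumed callable returning 6 Q-values
    return int(np.argmax(q_values))

replay_buffer = deque(maxlen=100_000)      # experience replay memory

def store_transition(state, action, reward, next_state, done):
    replay_buffer.append((state, action, reward, next_state, done))

def sample_minibatch(batch_size=32):
    # Random minibatch of past transitions used to fine-tune the Q-network.
    batch = random.sample(replay_buffer, batch_size)
    return [np.array(x) for x in zip(*batch)]
\end{verbatim}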
[link]
Accurate anatomical landmark correspondence is critical for medical image registration. Traditionally, many works proposed hand-crafted feature sets to establish correspondence; however, these features tend to be highly specialized to a particular application area and cannot always be generalized to other applications without significant modification. Other works perform automatic feature extraction, but their reliance on labelled data hinders their use when no labels are available. To this end, Wu et al. propose an unsupervised feature learning method that does not require labelled data. Their approach directly learns basis filters that can effectively represent all observed image patches; the learnt filters are then regarded as general image features describing the morphological structure of a patch. To learn the basis filters, the authors employ the Independent Subspace Analysis (ISA) algorithm, stacked into a two-layer convolutional network. As an extension of Independent Component Analysis (ICA), ISA does not require all filter responses to be mutually independent. Instead, the responses are divided into several groups, each called an independent subspace: responses may be dependent within a group, but dependencies across groups are not allowed. Similar features can thereby be grouped into the same subspace to achieve invariance. Accurate correspondence detection requires multi-scale image features, but learning features directly from large-scale image patches raises a high-dimensionality problem. The authors address this by constructing a two-layer network that scales ISA up to large patches. Specifically, the first-layer ISA is trained on small-scale image patches. A sliding window (of the same scale as the first layer) is then convolved over each large-scale patch to obtain a sequence of overlapping small-scale patches. The combined first-layer ISA responses of these overlapping patches are whitened by PCA and used as input to the second layer, which is trained with another ISA. In this way, a high-level description of a large-scale patch is built from the low-level features detected by the first-layer basis filters. The authors compare their method with two others and apply it to the IXI and ADNI datasets. |
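A rough sketch, under assumed inputs, of how the stacked ISA feature extraction described above could be wired: \texttt{W1} and \texttt{W2} are pre-learnt ISA basis matrices, \texttt{pca} is a fitted scikit-learn PCA used for whitening, and patch/window sizes are illustrative placeholders rather than the paper's values.
\begin{verbatim}
import numpy as np

def isa_response(patches, W, group_size=2):
    # patches: (n, d) flattened patches; W: (k, d) learnt basis filters,
    # with k divisible by group_size.  ISA pools squared filter responses
    # within each independent subspace.
    z = patches @ W.T                      # linear filter responses, (n, k)
    z = z.reshape(len(patches), -1, group_size)
    return np.sqrt((z ** 2).sum(axis=2))   # one invariant response per subspace

def two_layer_features(big_patch, W1, pca, W2, small=13, stride=4):
    # Slide a small window over the large patch, stack first-layer responses,
    # whiten them with the fitted PCA, and feed them to the second-layer ISA.
    windows = []
    for i in range(0, big_patch.shape[0] - small + 1, stride):
        for j in range(0, big_patch.shape[1] - small + 1, stride):
            windows.append(big_patch[i:i+small, j:j+small].ravel())
    r1 = isa_response(np.array(windows), W1).ravel()[None, :]
    return isa_response(pca.transform(r1), W2)[0]
\end{verbatim}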
[link]
Tumor segmentation from brain MRI sequences is usually performed manually by a radiologist. Being highly tedious and error prone, mainly due to factors such as human fatigue, the overabundance of MRI slices per patient, and the increasing number of patients, manual delineation is often inaccurate. Moreover, the use of qualitative measures of evaluation by radiologists results in high inter- and intra-observer error rates. There is therefore an evident need for automated systems. To this end, Pereira et al. propose a Convolutional Neural Network (CNN) to predict a segmentation mask for patient MRI scans. They approach segmentation as a pixel-wise classification problem, in which each pixel of an input 2D MRI slice is classified into one of five categories: background, necrosis, edema, non-enhancing and enhancing region. The authors propose two networks, one for High Grade Gliomas (HGG) and one for Low Grade Gliomas (LGG); the HGG network has more convolutional layers, while the LGG network is kept shallower because less LGG training data is available. Both architectures combine convolution, ReLU, and max-pooling layers, followed by fully connected layers, and are trained with a categorical cross-entropy loss. The networks were trained on 33x33 2D patches extracted from the 2D MRI slices of the brain in the BRATS 2012 dataset; patches were randomly sampled from the images, and the task was to predict the class of the pixel at the centre of each patch. To address class imbalance, approximately 40\% of the sampled patches were normal (non-tumor) patches, and data augmentation was used to increase the effective number of training patches. To assess the performance of the proposed method and to understand the impact of various hyperparameters, the authors perform a multitude of tests, including turning data augmentation on/off, using leaky ReLU instead of ReLU, changing the patch extraction plane, using deeper networks, and using larger kernels. |
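An illustrative PyTorch sketch of a patch-wise classifier of this kind, not the paper's exact architecture: a stack of convolution, ReLU and max-pooling layers followed by fully connected layers, mapping a 33x33 patch (here assumed to have four MRI channels) to one of five classes, trained with categorical cross-entropy.
\begin{verbatim}
import torch
import torch.nn as nn

class PatchCNN(nn.Module):
    def __init__(self, in_channels=4, n_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 33 -> 16
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # 16 -> 8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(128 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, x):                          # x: (batch, channels, 33, 33)
        return self.classifier(self.features(x))

loss_fn = nn.CrossEntropyLoss()                    # categorical cross-entropy
\end{verbatim}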
[link]
Alzheimer's Disease (AD) is characterized by impairment of cognitive and memory function, mostly leading to dementia in elderly subjects. Over the last decade it has been shown that neuroimaging can be a potential tool for diagnosing AD and its prodromal stage, Mild Cognitive Impairment (MCI), and that fusing different modalities can provide complementary information that enhances diagnostic accuracy. Multimodal information such as MRI and PET can thus aid diagnosis of AD in its early stages. However, most previous works in this domain either concentrate on a single modality (MRI or PET), or use hand-crafted features that are simply concatenated into a single vector. There is increasing evidence that biomarkers from different modalities provide complementary information for AD/MCI diagnosis. In this paper, Suk et al. propose a Deep Boltzmann Machine (DBM) based method that learns a high-level latent and shared feature representation from two neuroimaging modalities (MRI and PET). Specifically, they use a DBM as a building block to find a latent hierarchical feature representation of a 3D patch, and then devise a systematic method for a joint feature representation from paired MRI and PET patches with a multimodal DBM. The method first selects class-discriminative patches from a pair of MRI and PET images using a statistical significance test between classes. A MultiModal DBM (MM-DBM) is then built to find a shared feature representation from the paired patches. The MM-DBM is not trained directly on the patches; instead, it is trained on binary vectors obtained by passing the patches through a Restricted Boltzmann Machine (RBM), which transforms the real-valued observations into binary vectors. The top hidden layer of the MM-DBM takes multiple entries from the lower hidden layers and the label layer, extracting a shared representation that fuses the MRI and PET neuroimaging information into a single feature vector. A Support Vector Machine (SVM) based classification step then follows. Instead of considering all patch-level classifier outputs simultaneously, the SVM outputs are agglomerated over locally distributed patches by constructing spatially distributed ``mega-patches", under the consideration that disease-related brain areas are distributed over distant brain regions of arbitrary shape and size. Following this step, the training data is divided into multiple subsets, and an image-level classifier is trained on each subset individually. The method was tested on the ADNI dataset with MR and PET images. |
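A schematic sketch of the hierarchical classification stage described above, using scikit-learn SVMs on assumed feature arrays (this is not the authors' code): patch-level SVMs trained on the shared DBM features, their outputs pooled over spatially grouped ``mega-patches", and a final image-level classifier on the pooled scores.
\begin{verbatim}
import numpy as np
from sklearn.svm import SVC

def train_patch_svms(patch_features, labels):
    # patch_features: list over patch locations, each (n_subjects, feature_dim)
    # holding the shared MM-DBM representation; one SVM per patch location.
    return [SVC(kernel='linear', probability=True).fit(f, labels)
            for f in patch_features]

def mega_patch_scores(patch_svms, patch_features, groups):
    # groups: list of index lists, each grouping nearby patches into a mega-patch.
    per_patch = np.stack([svm.predict_proba(f)[:, 1]
                          for svm, f in zip(patch_svms, patch_features)], axis=1)
    return np.stack([per_patch[:, g].mean(axis=1) for g in groups], axis=1)

def train_image_classifier(mega_scores, labels):
    # Final image-level decision from the agglomerated mega-patch scores.
    return SVC(kernel='linear').fit(mega_scores, labels)
\end{verbatim}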
[link]
Medical image segmentation has been a classic problem in medical image analysis, with a large body of research behind it. Many approaches rely on hand-crafted features, while others use global or local intensity cues. These approaches have occasionally been extended to 3D, but most algorithms work with 2D images (or 2D slices of a 3D volume). It is hypothesized that using the full 3D volume of a scan can improve segmentation performance because of the additional context available to the algorithm, but such approaches have been computationally very expensive. Deep learning approaches such as ConvNets have been applied to segmentation and are computationally efficient at inference time thanks to highly optimized linear algebra routines. Although these approaches form the state of the art, they still operate on 2D views of a scan and do not work well on full 3D volumes. To this end, Milletari et al. propose a new CNN architecture built from volumetric convolutions with 3D kernels, applied to full 3D MRI prostate scans and trained to segment the prostate. The network primarily consists of 3D convolutions with volumetric kernels of size 5x5x5 voxels. As the data proceeds through the stages of the compression path, its resolution is reduced; this is performed by convolutions with 2x2x2 kernels applied with stride 2, so the architecture contains no pooling layers. The architecture resembles an encoder-decoder: the left (compression, or downsampling) path reduces the size of the signal and increases the receptive field of the features computed in subsequent layers, and each of its stages computes twice as many feature channels as the previous one. The right portion of the network extracts features and expands the spatial support of the lower-resolution feature maps in order to gather and assemble the information necessary to output a two-channel volumetric segmentation. The two feature maps computed by the final convolutional layer, which has 1x1x1 kernels and produces outputs of the same size as the input volume, are converted to probabilistic segmentations of the foreground and background regions by applying a voxel-wise soft-max. To train the network, the authors propose a Dice loss function. The CNN is trained end-to-end on a dataset of 50 prostate MRI scans. The network achieved a Dice coefficient of 0.869 $\pm$ 0.033 and outperformed the other state-of-the-art models. |
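A minimal differentiable Dice loss in PyTorch, in the spirit of the objective described above (the squared-term denominator is one common variant; exact details may differ from the paper): minimising $1 - \mathrm{Dice}$ maximises the overlap between the predicted foreground probabilities and the ground truth.
\begin{verbatim}
import torch

def dice_loss(probs, target, eps=1e-6):
    # probs: soft-max foreground probabilities, target: binary ground truth,
    # both of shape (batch, D, H, W).  Dice is computed per volume and averaged.
    dims = (1, 2, 3)
    intersection = (probs * target).sum(dims)
    denominator = (probs ** 2).sum(dims) + (target ** 2).sum(dims)
    dice = (2 * intersection + eps) / (denominator + eps)
    return (1 - dice).mean()
\end{verbatim}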
[link]
Skin cancer is one of the most common cancer types in humans. The lesion is primarily diagnosed visually through a series of 2D color images of the affected area, which may be followed by dermoscopic analysis, a biopsy and histopathological examination. Automated classification of skin lesions from images is challenging owing to the fine-grained variability in the appearance of lesions. To this end, Esteva et al. propose a deep learning based solution to automate the diagnosis of skin lesions into fine-grained categories. Specifically, they use the GoogleNet Inception v3 CNN architecture, from the family of networks that won the ImageNet Large Scale Visual Recognition Challenge in 2014. The method leverages transfer learning, in which an already trained network is fine-tuned on a related task, allowing it to reuse the convolutional filters learnt from a much larger dataset. Accordingly, the Inception v3 CNN was first trained on approximately 1.28 million images covering about 1,000 classes from the 2014 ImageNet Large Scale Visual Recognition Challenge, and then fine-tuned on the dermatology dataset. The dataset used in the study was obtained from open-access online repositories and the Stanford Medical Center; it consists of 127,463 training and validation images and a held-out set of 1,942 labelled test images. The labels are organized hierarchically in a tree-like structure, where each deeper level represents a finer-grained classification of the disease. The network is trained to perform three tasks: i) classify the first-level nodes of the taxonomy, which represent benign lesions, malignant lesions and non-neoplastic lesions; ii) classify a nine-class disease partition (the second-level nodes), chosen so that the diseases within each class have similar medical treatment plans; and iii) using only biopsy-proven images of medically important use cases, determine whether the algorithm and dermatologists can distinguish malignant versus benign lesions of epidermal (keratinocyte carcinoma compared to benign seborrheic keratosis) or melanocytic (malignant melanoma compared to benign nevus) origin. The CNN achieved 72.1 $\pm$ 0.9\% (mean $\pm$ s.d.) overall accuracy (the average of the individual per-class inference accuracies), while two dermatologists attained 65.56\% and 66.0\% accuracy on a subset of the validation set for the first task. On the second task, the CNN achieves 55.4 $\pm$ 1.7\% overall accuracy, whereas the same two dermatologists attain 53.3\% and 55.0\%. On the third task, the CNN outperforms the dermatologists, obtaining an area under the curve (AUC) above 91\% in each case. |
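A hedged torchvision sketch of the fine-tuning setup described above (not the authors' TensorFlow pipeline): an ImageNet-pretrained Inception v3 has its final classification layers replaced to match the dermatology taxonomy, and all layers are then trained on the skin-lesion data; the class count and optimizer settings are illustrative assumptions.
\begin{verbatim}
import torch
import torch.nn as nn
from torchvision import models

n_skin_classes = 757     # assumed number of fine-grained training classes
# Weight-loading API varies by torchvision version (older versions use
# pretrained=True instead of the weights argument).
model = models.inception_v3(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, n_skin_classes)
model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, n_skin_classes)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()   # trained end-to-end on the lesion images
\end{verbatim}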
[link]
The shape registration problem has been an active research topic in the computational geometry, computer vision, medical image analysis and pattern recognition communities. Also called shape alignment, it has extensive uses in recognition, indexing, retrieval, generation and other downstream analysis of sets of shapes. A variety of works approach this problem, differing mostly in the three pillars of registration: the shape representation, the transformation model, and the registration criterion. One such method is proposed by Huang et al. in this paper, which uses a novel combination of the three pillars in which an implicit shape representation is used to register an object both globally and locally. For the registration criterion, the proposed method uses a Mutual Information based criterion in its global registration phase and the sum of squared differences (SSD) in its local phase. The method starts by defining an implicit, non-parametric shape representation that is translation, rotation and scale invariant; this forms the first step of the pipeline, transforming the input images into a domain where the shape is implicitly defined. The image domain $[\Omega]$ is partitioned into three regions: $[R_S]$ (points inside the shape), $[\Omega - R_S]$ (points outside the shape), and $[S]$ (points lying on the shape boundary). Using this partition, a Lipschitz function $\phi : \Omega \rightarrow \mathbb{R}$ is defined as: \begin{equation} \phi_S(x,y) = \begin{cases} 0, & (x,y) \in S \\ +D((x,y), S) > 0, & (x,y) \in [R_S] \\ -D((x,y), S) < 0, & (x,y) \in [\Omega - R_S] \end{cases} \end{equation} where $D((x,y),S)$ is the minimum Euclidean distance between the point $(x,y)$ and the shape boundary $S$. Given this implicit representation, global shape alignment is performed by maximizing a Mutual Information (MI) objective defined between the probability density functions of the source and target representations sampled over the domain $\Omega$: \begin{equation} MI(f_{\Omega}, g_{\Omega}^{A}) = \underbrace{\mathcal{H}[p^{f_{\Omega}}(l_1)]}_{\substack{\text{Entropy of the}\\ \text{distribution representing $f_{\Omega}$}}} + \underbrace{\mathcal{H}[p^{g_{\Omega}^{A}}(l_2)]}_{\substack{\text{Entropy of the}\\ \text{distribution representing $g_{\Omega}^{A}$,} \\ \text{the source ISR} \\ \text{transformed by $A(\theta)$}}} - \underbrace{\mathcal{H}[p^{f_{\Omega}, g_{\Omega}^{A}}(l_1, l_2)]}_{\substack{\text{Entropy of the}\\ \text{joint distribution}\\\text{representing $f_{\Omega}, g_{\Omega}^{A}$}}} \end{equation} Following global registration, local registration is performed by embedding a control point grid using the Incremental Free Form Deformation (IFFD) method, with the sum of squared differences (SSD) as the objective to minimize. The local registration is further refined by a multi-resolution framework that deforms control point grids of varying resolution, in order to account for small local deformations of the shape. When prior information on feature point correspondence between the two shapes is available, it can be added as a plug-in term to the overall local registration objective. The method was applied to statistical modeling of anatomical structures, 3D face scans and mesh registration. |
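A small sketch of how the implicit shape representation $\phi_S$ could be computed on a discrete grid as a signed Euclidean distance map using SciPy; the zero level is only approximated at pixel resolution, and the binary mask input is an assumption for illustration.
\begin{verbatim}
import numpy as np
from scipy.ndimage import distance_transform_edt

def implicit_shape_representation(mask):
    # mask: boolean array, True inside the shape region R_S.
    # Returns an approximation of phi_S: ~0 near the boundary S,
    # positive distances inside, negative distances outside.
    dist_inside = distance_transform_edt(mask)     # distance to the background
    dist_outside = distance_transform_edt(~mask)   # distance to the shape
    return dist_inside - dist_outside
\end{verbatim}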
[link]
The point pattern matching problem has been an active research topic in the computational geometry and pattern recognition communities. Point sets arise in a variety of applications, and registering them is a core step in stereo matching, feature-based image registration and so on. Mathematically, registering two point sets translates to the following: let $\{\mathcal{M}, \mathcal{S}\}$ be two finite point sets to be registered, where $\mathcal{M}$ is the ``moving" point set and $\mathcal{S}$ is the ``fixed" or scene set; a transformation $\mathcal{T}$ is then sought which maps points from $\mathcal{M}$ onto $\mathcal{S}$. To this end, Jian and Vemuri propose a Gaussian Mixture Model (GMM) based representation of the point sets, which are registered by minimizing a cost function over the transformation parameters, in contrast to correspondence-based schemes such as the Iterative Closest Point (ICP) algorithm. In this setting, point set registration becomes the problem of aligning two Gaussian mixtures by minimizing the discrepancy between them. The cost function is chosen as a closed-form expression of the $L_2$ distance between Gaussian mixtures, which keeps the algorithm computationally efficient. The main reason for representing discrete point sets with Gaussian mixtures is that it corresponds to interpreting each point set as samples drawn from a distribution over point locations, which naturally models the uncertainty introduced during feature extraction. A second reason is that hard discrete optimization problems encountered in the point matching literature become tractable continuous optimization problems. The probability density function of a general Gaussian mixture is: \begin{equation} p(x) = \sum_{i=1}^{k}w_i\, \phi(x \mid \mu_i, \sigma_i) \end{equation} where $\phi(x \mid \mu_i, \sigma_i) = \dfrac{\exp\left[-\frac{1}{2}(x - \mu_i)^{T}\sigma_{i}^{-1}(x-\mu_i)\right]}{\sqrt{(2\pi)^d |\det(\sigma_i)|}}$. The GMM is constructed from a point set as follows: i) the number of components equals the number of points, and every component is weighted equally; ii) each component's mean vector is the spatial location of the corresponding point; iii) all components share the same spherical covariance matrix. Registration is then reformulated as an optimization problem in which a dissimilarity measure between the Gaussian mixtures built from the transformed model set and the fixed scene set is minimized; the $L_2$ distance is chosen as this measure because it admits a closed-form expression for Gaussian mixtures. The objective can be minimized either by a closed-form numerical method (where the problem allows it) or by an iterative gradient-based method. The method was applied and tested on both rigid and non-rigid point set registration problems. |
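A NumPy sketch of the closed-form $L_2$ distance between two point-set GMMs under the construction above (equal weights, a shared isotropic covariance $\sigma^2 I$, here called \texttt{var}); it uses the standard identity that the integral of the product of two Gaussians is itself a Gaussian density evaluated at the difference of the means.
\begin{verbatim}
import numpy as np

def gauss_inner(mu1, mu2, var, d):
    # Integral of the product of two isotropic Gaussians with variance var*I
    # each; the resulting Gaussian has variance 2*var*I.
    diff2 = np.sum((mu1[:, None, :] - mu2[None, :, :]) ** 2, axis=-1)
    norm = (2 * np.pi * 2 * var) ** (-d / 2.0)
    return norm * np.exp(-diff2 / (4 * var))

def l2_distance(model_pts, scene_pts, var=1.0):
    # Each point set is a GMM with equal weights and spherical covariance var*I.
    d = model_pts.shape[1]
    wm, ws = 1.0 / len(model_pts), 1.0 / len(scene_pts)
    mm = wm * wm * gauss_inner(model_pts, model_pts, var, d).sum()
    ss = ws * ws * gauss_inner(scene_pts, scene_pts, var, d).sum()
    ms = wm * ws * gauss_inner(model_pts, scene_pts, var, d).sum()
    return mm + ss - 2 * ms     # ||f||^2 + ||g||^2 - 2<f, g>
\end{verbatim}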
[link]
Despite being an ill-posed problem, non-rigid image registration has been the subject of numerous works, which apply the framework to applications where rigid and affine transformations cannot fully model the variation between image sets. One such application is the registration of pre- and post-contrast breast MR images for estimating contrast uptake, which in turn is an indicator of tumor malignancy. Registration of these images is challenging because of large variations between the pre- and post-contrast scans caused by patient movement and by breast motion that is both global and local; classic methods cannot capture the full semantics of these movements. To this end, the authors propose a non-rigid registration method which combines the advantages of voxel-based similarity measures such as Mutual Information (MI) with non-rigid transformation models of the breast. The method is built on a hierarchical transformation model that captures both the global and local motion of the breast across pre- and post-contrast scans. The global motion model consists of a 3D affine transformation parameterized by 12 degrees of freedom. The local model is a free-form deformation (FFD) based on B-splines, a powerful tool for modeling 3D deformable objects: the object is deformed by manipulating an underlying mesh of control points, with the deformation between control points interpolated by B-splines. This formulation exposes a trade-off between computational running time and accurate modelling of the object (the breast). To achieve the best compromise, a hierarchical multi-resolution approach is used in which the resolution of the control mesh is increased along with the image resolution in a coarse-to-fine fashion. In addition to modeling the movement of the breast, a regularization term is added to the final objective which forces the B-spline based FFD to be smooth; the term is zero for affine transformations and only penalizes non-affine ones. The function that the method optimizes is: $\mathcal{C}(\theta, \phi) = - \mathcal{C}_{similarity}\left( I(t_0), T(I(t))\right) + \lambda \mathcal{C}_{smooth}(T)$. The optimization is performed in multiple stages using a gradient descent algorithm that takes steps along the gradient vector with step size $\mu$; a local optimum is assumed when $||\nabla \mathcal{C}|| \leq \epsilon$. To assess the quality of the proposed method, it is tested on both clinical and volunteer 3D MR data. The results show that rigid-only and affine-only registration performs significantly worse than the proposed method, and that the results improve with finer control point resolution. However, the use of SSD as a quantitative metric is debatable, since the contrast-enhanced and pre-contrast images have different intensity distributions. |
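A simplified 2D sketch of the B-spline FFD idea behind the local model (the paper's model is 3D): the displacement at a point is a weighted combination of the 4x4 surrounding control-point displacements, with cubic B-spline basis weights. The lattice indexing here assumes a suitably padded control grid and is only illustrative.
\begin{verbatim}
import numpy as np

def bspline_basis(u):
    # Cubic B-spline basis functions for local coordinate u in [0, 1).
    return np.array([(1 - u) ** 3 / 6.0,
                     (3 * u ** 3 - 6 * u ** 2 + 4) / 6.0,
                     (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6.0,
                     u ** 3 / 6.0])

def ffd_displacement(x, y, control, spacing):
    # control: (nx, ny, 2) lattice of control-point displacements with spacing
    # `spacing`; returns the interpolated displacement at image point (x, y).
    i, u = int(x / spacing), (x / spacing) % 1.0
    j, v = int(y / spacing), (y / spacing) % 1.0
    Bu, Bv = bspline_basis(u), bspline_basis(v)
    disp = np.zeros(2)
    for a in range(4):
        for b in range(4):
            disp += Bu[a] * Bv[b] * control[i + a, j + b]
    return disp
\end{verbatim}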
[link]
Image registration is a well-studied problem in the medical image analysis community, with rigid registration taking much of the spotlight. Beyond rigid registration, non-rigid registration is of great interest due to its applications in inter-patient and inter-modality registration where organ deformations are pronounced. However, non-rigid registration is an ill-posed problem with numerous degrees of freedom, which makes finding the best transformation from the source to the target image very difficult. To counter this, some methods constrain the non-rigid transformation $T$ to lie within certain bounds, which is not always ideal. To this end, Tang et al. propose a novel framework that casts non-rigid image registration as a graph-cut problem, which provides strong optimality guarantees under certain conditions. The formulation requires that each pixel in the source image be assigned a displacement label (a vector) indicating its corresponding position in the floating image, according to an objective function. A smoothness constraint ensures that the values of the transformation are meaningful and stay within natural limits (neighbouring pixels should not have wildly different displacements). The authors propose the following objective function, which they then solve using graph-cut methods: $D^* = \arg\min_{D} \sum_{x\in X}\|I(x) - J(x + D(x))\| + \lambda \sum_{(x,y)\in \mathcal{N}}\|D(x) - D(y)\|$. This formulation is not yet fully discretized, in the sense that $D$ is unbounded and can take any value in $(-\infty, \infty)$. To allow optimization with graph cuts, the transformation is mapped to a finite label set $\mathcal{W} = \{0, \pm s, \pm 2s, \ldots, \pm ws\}^d$. With this discretization, the objective can be solved via a sequence of $\alpha$-expansion moves. Each $\alpha$-expansion is a two-label problem in which the cost of assigning the label $\alpha$ is computed with respect to the pixel's previous label; different costs are assigned to the scenarios in which a pixel keeps its original label or switches to the new label $\alpha$. It is imperative that these costs satisfy the inequality given by Kolmogorov \& Zabih, which guarantees that each expansion move can be solved optimally by a graph cut. The method was tested on MR data from the BrainWeb dataset, affinely pre-registered and intensity-normalized to the range 0 to 255. The method demonstrated good qualitative results when compared with two state-of-the-art methods, DEMONS and FFD: the average intensity difference for the proposed method was much lower than the competition, while the tissue overlap was higher. |
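A small sketch of the discretized formulation above: building the finite label set $\mathcal{W}$ and evaluating the data-plus-smoothness energy for a given labelling; the actual minimization via $\alpha$-expansion graph cuts is not reproduced here, and boundary handling is simplified by clipping.
\begin{verbatim}
import numpy as np
from itertools import product

def build_label_set(s=1.0, w=2, dim=2):
    # Finite displacement label set W = {0, +/-s, +/-2s, ..., +/-ws}^d.
    steps = np.arange(-w, w + 1) * s
    return np.array(list(product(steps, repeat=dim)))

def energy(I, J, D, labels, lam=0.1):
    # D: integer label index per pixel.  Data term compares I(x) with
    # J(x + D(x)); smoothness term penalises differing neighbour labels.
    H, W = I.shape
    E = 0.0
    for x in range(H):
        for y in range(W):
            dx, dy = labels[D[x, y]]
            xx = int(np.clip(x + dx, 0, H - 1))
            yy = int(np.clip(y + dy, 0, W - 1))
            E += abs(float(I[x, y]) - float(J[xx, yy]))
            if x + 1 < H:
                E += lam * np.linalg.norm(labels[D[x, y]] - labels[D[x + 1, y]])
            if y + 1 < W:
                E += lam * np.linalg.norm(labels[D[x, y]] - labels[D[x, y + 1]])
    return E
\end{verbatim}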
[link]
Current medical imaging modalities like Computed Tomography (CT), Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET) allow minimally invasive imaging of internal organs. Rapid advancement in these technologies has led to an influx of data which, along with rising clinical need, has created a demand for quantitative image interpretation in routine practice; applications include volumetric measurements of brain regions and surgery or radiotherapy planning using CT/MRI images. This influx of data has also opened up avenues for using multi-modality images to make decisions. However, this is not always straightforward: imaging parameters, scanner types, patient movement and anatomical motion leave the images misaligned with one another, making direct comparison between, say, a CT and an MRI image tricky. This is formally known as the image registration problem, for which numerous computational methods have been proposed and which this paper surveys. Among the methods proposed for both inter- and intra-modality registration, maximization of a Mutual Information based objective has been extremely successful at computing the registration between 3D multi-modal medical images of various organs. Mutual Information (MI) stems from information theory, pioneered by Shannon, and in the context of medical image registration the MI criterion postulates that the MI between two images (say CT and MRI) is maximal when the images are aligned. The basic formulation of an MI-based registration algorithm is as follows: let $\mathcal{A}$ and $\mathcal{B}$ be two images geometrically related by a transformation $T_\alpha$, such that a voxel $p$ in $\mathcal{A}$ with intensity $a$ physically corresponds to the voxel $T_\alpha(p)$ in $\mathcal{B}$ with intensity $b$. The joint distribution $p_{AB}(a,b)$ of $a$ and $b$, and hence their MI, depends on $T_\alpha$, and the MI criterion states that the MI of geometrically aligned images is maximal: \begin{equation} \alpha^* = \arg\max_{\alpha} I(A,B) \end{equation} where $A$ and $B$ are two discrete random variables and $I(A,B) = \sum_{a,b}p_{AB}(a,b) \log \frac{p_{AB}(a,b)}{p_A(a)\,p_B(b)}$. An optimization algorithm is used to find the parameter set $\alpha^*$ that maximizes the MI between $A$ and $B$; classically, Powell's multidimensional direction set method is used, but other optimizers exist as well. Although the MI criterion does not explicitly take the spatial dependence of image intensities into account, this dependence is in fact essential for the criterion to be well-behaved around the registration solution. MI does not rely on raw intensity values to measure correspondence between images, but rather on their joint distribution and the relationship of co-occurrence; it also imposes no modality-specific constraints, which makes it general enough to be applied to any inter- or intra-modality problem. MI-based registration may fail when the images carry insufficient information, for example due to low resolution, a small number of images, lack of spatial invariance, or shading artifacts; in some of these cases the MI criterion exhibits multiple local optima. |
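A compact NumPy sketch of estimating $I(A,B)$ from the joint intensity histogram of two already-resampled, overlapping images, following the definition above; the bin count is an illustrative choice.
\begin{verbatim}
import numpy as np

def mutual_information(a, b, bins=64):
    # a, b: arrays of intensities of corresponding (overlapping) voxels.
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p_ab = joint / joint.sum()                 # joint distribution p_AB(a, b)
    p_a = p_ab.sum(axis=1, keepdims=True)      # marginal p_A(a)
    p_b = p_ab.sum(axis=0, keepdims=True)      # marginal p_B(b)
    nz = p_ab > 0
    return float(np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])))
\end{verbatim}
An optimizer would evaluate this quantity for candidate transformation parameters $\alpha$ and keep the parameters that maximize it.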
[link]
Active Shape Models brought with them the ability to deform intelligently to intra-class shape variations according to a labelled training set of landmark points. However, the dependence of such methods on a low-noise, manually marked training set poses challenges due to inter-observer differences, which become even more pronounced in higher dimensions (3D). To this end, the authors propose a method that addresses this problem by introducing automatic shape modelling. The method is based on the Minimum Description Length (MDL) principle, a formalization of Occam's razor in which the best hypothesis (a model and its parameters) for a given set of data is the one that leads to the best compression of the data; the MDL criterion can therefore be used to select, from the training data, the hypothesis that describes the training set fully but in the most compressed form. The authors use a simple two-part coding formulation of MDL which, although it does not guarantee a minimum coding length, provides a computationally simple functional form that is suitable as an objective function for numerical optimization. The proposed objective function is: $F = \sum_{p=1}^{n_g}D^{(1)}\left(\hat{Y}^p, R, \delta \right) + \sum_{q=n_g + 1}^{n_g + n_{min}}D^{(2)}\left(\hat{Y}^q, R, \delta \right)$. The algorithm first generates a parameterization for each shape recursively, to the same level. The shapes are then sampled according to the correspondence defined by the parameterization, a model is built automatically from the sampled shapes, and this model is used to evaluate the objective function. The parameterization is then modified so as to converge to an optimal value of the objective. During this optimization, a ``reference" shape is fixed in order to avoid the points converging to a bad local minimum (for example, all points collapsing onto a single part of the boundary). Because the objective function is non-convex, optimization is performed using a genetic algorithm. The method was tested both qualitatively and quantitatively on several sets of outlines of 2-D biomedical objects, covering multiple anatomical sites in the human body to give an idea of how the method performs across a variety of shape settings. Quantitatively, the resulting models were shown to be highly compact in terms of MDL. Qualitatively, the models were able to generate shapes that respected the overall shape of the training set while still allowing a good amount of deformation without producing implausible shapes. |
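A toy stand-in for the description-length objective, not the authors' exact $D^{(1)}/D^{(2)}$ terms: it scores a candidate correspondence by how cheaply a linear (PCA) model built from the resampled shapes encodes the training set, with modes above a noise threshold $\delta$ costing more. The functional form is a simplification made for illustration only.
\begin{verbatim}
import numpy as np

def description_length(shapes, delta=0.01):
    # shapes: (n_shapes, 2 * n_points) resampled shape vectors for one
    # candidate parameterisation/correspondence.
    X = shapes - shapes.mean(axis=0)
    eigvals = np.linalg.svd(X, compute_uv=False) ** 2 / len(shapes)
    cost = 0.0
    for lam in eigvals:
        if lam > delta ** 2:
            cost += 1.0 + np.log(lam / delta ** 2)   # modes that must be encoded
        else:
            cost += lam / delta ** 2                 # cheap, noise-level modes
    return cost
\end{verbatim}
An optimizer (here, the genetic algorithm mentioned above) would perturb each shape's parameterization and keep the changes that lower this cost.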
[link]
Delineation of vessel structures in the human vasculature is a precursor to a number of clinical applications. Typically, the delineation is performed using both 2D (DSA) and 3D techniques (CT, MR, X-ray angiography), yet decisions are still commonly made on a maximum intensity projection (MIP) of the data. This is problematic because the MIP is also affected by other high-intensity tissue, and low-intensity vasculature may never be fully visible in the MIP. This calls for a vessel enhancement step that can be applied prior to the MIP so that low-intensity vessels are adequately represented; such enhancement can also facilitate volumetric views of the vasculature and enable quantitative measurements. To this end, Frangi et al. propose a vessel enhancement method that defines a ``vesselness measure" from the eigenvalues of the Hessian matrix. Eigenvalue analysis of the Hessian provides the direction of smallest curvature (along the tubular vessel structure): the eigenvalue decomposition of the Hessian computed over a spherical neighbourhood around a point $x_0$ maps out an ellipsoid whose axes are given by the eigenvectors and whose semi-axis magnitudes are the corresponding eigenvalues. The method orders the three eigenvalues as $|\lambda_1| \leq |\lambda_2| \leq |\lambda_3|$ and formulates heuristic rules about their magnitudes when a vessel is present. In particular, to derive a well-formed ``vessel measure" as a function of these eigenvalues, it is assumed that for a vessel structure $\lambda_1$ will be very small (ideally zero). The authors also add the prior knowledge that vessels appear as bright tubes on a dark background in most images; such a structure must satisfy $|\lambda_1| \approx 0$, $|\lambda_1| \ll |\lambda_2|$ and $|\lambda_2| \approx |\lambda_3|$. Combining these eigenvalue ratios with a term based on the norm of the Hessian, the authors propose the following vesselness measure: $\mathcal{V}_0(s) = \begin{cases} 0 & \text{if } \lambda_2 > 0 \text{ or } \lambda_3 > 0,\\ \left(1 - \exp\left(-\dfrac{\mathcal{R}_A^2}{2\alpha^2}\right)\right)\exp\left(-\dfrac{\mathcal{R}_B^2}{2\beta^2}\right)\left(1 - \exp\left(-\dfrac{S^2}{2c^2}\right)\right) & \text{otherwise.} \end{cases}$ The measure is built from three terms: $\mathcal{R}_A$, $\mathcal{R}_B$, and $S$. The first term, $\mathcal{R}_A$, relates to the largest cross-sectional area of the ellipsoid given by the eigenvalue decomposition and distinguishes between plate-like and line-like structures. The second term, $\mathcal{R}_B$, accounts for deviation from a blob-like structure, but cannot distinguish between line-like and plate-like patterns. The third term, $S$, is simply the Frobenius norm of the Hessian matrix, which accounts for the lack of structure in the background and is high where contrast against the background is high. The vesselness measure is analyzed at multiple scales to ensure that vessels of all sizes are detected. The method was applied to 2D DSA images, obtained from X-ray projections before and after injection of a contrast agent, as well as to 3D MRA images. The results showed promising background suppression when vessel enhancement filtering was applied before computing the MIP. |
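A per-voxel sketch of the measure above for a single scale, assuming the Hessian eigenvalues have already been computed and sorted by absolute value; the ratio definitions follow the standard Frangi formulation ($\mathcal{R}_A = |\lambda_2|/|\lambda_3|$, $\mathcal{R}_B = |\lambda_1|/\sqrt{|\lambda_2\lambda_3|}$), and the default parameters are illustrative.
\begin{verbatim}
import numpy as np

def vesselness(l1, l2, l3, alpha=0.5, beta=0.5, c=15.0, eps=1e-12):
    # Eigenvalues sorted so that |l1| <= |l2| <= |l3|; bright vessel on a
    # dark background requires l2 and l3 to be negative.
    if l2 > 0 or l3 > 0:
        return 0.0
    Ra = abs(l2) / (abs(l3) + eps)                    # plate-like vs line-like
    Rb = abs(l1) / (np.sqrt(abs(l2 * l3)) + eps)      # deviation from blob
    S = np.sqrt(l1 ** 2 + l2 ** 2 + l3 ** 2)          # Frobenius norm of Hessian
    return ((1 - np.exp(-Ra ** 2 / (2 * alpha ** 2)))
            * np.exp(-Rb ** 2 / (2 * beta ** 2))
            * (1 - np.exp(-S ** 2 / (2 * c ** 2))))
\end{verbatim}
The full filter evaluates this at several Gaussian scales and keeps the maximum response per voxel.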
[link]
Object detection in 2D scenes has mostly been performed using model-based approaches, which model the appearance of certain objects of interest. Although such approaches tend to work well in cluttered, noisy and occluded settings, their failure to adapt to the intra-object variability apparent in domains like medical imaging, where organ shapes vary greatly, has led to a need for a more robust approach. To this end, Cootes et al. propose a training-based method that adapts and deforms to per-object variations present in the training data, while still maintaining rigidity across different objects. The proposed method relies on a hand-labelled training set featuring a set of ``landmark points" that describe specific positions on the object; for a face, for example, the points might be the tip of the nose and the start, middle and end of the left eyebrow. The landmark points across the whole training set are then aligned using affine transformations by minimizing a weighted sum-of-squares difference between corresponding landmark points of the training examples, where each landmark's weight is derived from its apparent variance: the higher the variance across training samples, the lower the weight. In order to summarize the shape in the high-dimensional space of landmark point vectors, the method uses Principal Component Analysis (PCA). PCA provides the eigenvectors pointing in the directions of greatest variation of the points in $2n$-dimensional space, while the corresponding eigenvalues give the significance of each eigenvector. The best $t$ eigenvectors are chosen such that they describe a certain percentage of the variance of the data. The model can then produce any shape derived from the mean shape of the object using the equation $x = \bar{x} + Pb$, where $\bar{x}$ is the mean shape, $P$ is the matrix of the $t$ eigenvectors, and $b$ is a vector of free weights that can be tuned to generate new shapes. The values of $b$ are constrained to stay within bounds determined from the training set, which is the basis of the claim that the model only deforms in ways consistent with the training set. The method was tested on a variety of shapes, namely resistor models in electric circuits, a heart model, a worm model, and a hand model. The resulting models were robust and could successfully generate new examples by varying the values of $b$ along a straight line. For the worm model, however, it was found that varying $b$ only along a line may not always be suitable, especially when different dimensions of $b$ have an underlying non-linear relationship. Once a shape model has been built, it is used to detect objects/shapes in new images. This is done by first initializing the model points on the image; the points are then adjusted towards the shape using image information such as edges. The adjustment is performed iteratively, applying constraints on the computed updates $dX$ and $db$ so that they respect the training set, and the iterations continue until the model points converge onto the shape of interest in the image. One drawback of the proposed method is its high sensitivity to noise in the training annotations. Also, the relationship between the variables in $b$ is not entirely clear, and a non-linear relationship between them may negatively affect the model.
Finally, convergence depends to some extent on the initialization of the model points and relies on local edge features for guidance, which may fail in some instances. |
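A NumPy sketch of the statistical shape model described above: PCA on aligned landmark vectors, retention of the $t$ modes explaining a chosen fraction of the variance, and generation of new shapes $x = \bar{x} + Pb$ with $b$ clipped to the training-derived bounds (the common $\pm 3\sqrt{\lambda_k}$ limits are used here).
\begin{verbatim}
import numpy as np

def build_shape_model(X, variance_kept=0.98):
    # X: (n_shapes, 2 * n_landmarks) aligned landmark vectors.
    x_bar = X.mean(axis=0)
    cov = np.cov(X - x_bar, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    order = np.argsort(eigval)[::-1]                 # largest variance first
    eigval, eigvec = eigval[order], eigvec[:, order]
    t = np.searchsorted(np.cumsum(eigval) / eigval.sum(), variance_kept) + 1
    return x_bar, eigvec[:, :t], eigval[:t]

def generate_shape(x_bar, P, eigval, b):
    # New shape x = x_bar + P b, with each b_k limited to +/- 3 sqrt(lambda_k)
    # so the generated shape stays within the variation seen in training.
    b = np.clip(b, -3 * np.sqrt(eigval), 3 * np.sqrt(eigval))
    return x_bar + P @ b
\end{verbatim}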
[link]
Image segmentation has been a topic of research in the computer vision domain for decades. A multitude of methods have been proposed, but most depend on high-level user input to guide the contour or boundary towards the real object boundary. To move closer to a fully or partially automated solution, a novel method is proposed for multilabel, interactive image segmentation using the random walk algorithm as the fundamental driver of the segmentation. The problem is formulated as follows: given a small number of pixels with user-defined (or pre-defined) labels, compute, for each unlabeled pixel, the probability that a random walker starting at that pixel first reaches each of the pre-labeled pixels; the pixel is then assigned the label with the highest such probability. This leads to high-quality segmentations of an image into $K$ components. The algorithm operates on an image graph in which pixels are nodes connected by edges to their 8-connected neighbours. The user-defined seeds represent example regions of the image belonging to the $K$ objects, each seed specifying a location with a user-defined label. The algorithm labels an unseeded pixel by answering the question: given a random walker starting at this location, what is the probability that it first reaches each of the $K$ seed points? The paper shows that this calculation can be performed exactly, without simulating a random walk, yielding a $K$-tuple of probabilities per pixel from which the final segmentation is obtained by selecting, for each pixel, the most probable seed destination. The graph weights are a function of the pixel intensities, specifically $w_{ij} = \exp(-\beta(g_i - g_j)^2)$. This weighting biases the random walker against crossing sharp intensity gradients, which leads to quality segmentations that respect object boundaries, including weak ones. The algorithm exposes only one free parameter, $\beta$, and can be combined with pre- and post-filtering techniques. Additionally, it allows computationally efficient, on-the-fly correction of previously detected boundaries. |
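A short usage sketch based on scikit-image's implementation of the algorithm, with a placeholder image and two arbitrary seed locations (both assumptions for illustration); the edge-weight helper simply restates the weighting function quoted above.
\begin{verbatim}
import numpy as np
from skimage.segmentation import random_walker

def edge_weight(g_i, g_j, beta=130.0):
    # w_ij = exp(-beta * (g_i - g_j)^2): the walker rarely crosses sharp
    # intensity gradients, so labels respect (even weak) object boundaries.
    return np.exp(-beta * (g_i - g_j) ** 2)

image = np.random.rand(64, 64)          # placeholder image for illustration
labels = np.zeros_like(image, dtype=int)
labels[5, 5], labels[60, 60] = 1, 2     # two user-defined seeds (K = 2)
segmentation = random_walker(image, labels, beta=130, mode='bf')
\end{verbatim}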
[link]
Over the last decade and a half, a plethora of image segmentation algorithms have been proposed, which can be grouped into roughly four categories formed by combining two labels: explicit or implicit boundary representation, and variational or combinatorial optimization. Classic methods like Snakes [1] and Level-Sets [2] belong to the explicit/variational and implicit/variational categories respectively, while the combinatorial side contains DP or path-based algorithms, which are explicit/combinatorial, and Graph-Cuts, which are implicit/combinatorial. The main difference between the categories is the space in which the search for a solution is performed: for variational methods the search space is $\mathbb{R}^\infty$, while for combinatorial methods it is confined to $\mathbb{Z}^n$. An obvious advantage of combinatorial methods is better computational performance, but they also provide a globally optimal solution with respect to the whole image. This makes the algorithm's performance independent of numerical stability considerations and dependent only on the quality of the global descriptors, yielding a highly general, globally optimal framework that can be applied to a variety of problems, including image segmentation. Graph-Cut methods are based on the $s$-$t$ cut of an image graph (pixels are nodes and edges connect 8-connected neighbours). An $s$-$t$ cut of a graph $\mathcal{G} = \{V,E\}$ is a subset of edges $C \subset E$ whose removal separates the two terminal nodes $s$ and $t$, representing the foreground and background of the image, into disjoint components, so that every pixel node remains connected to exactly one terminal. The best cut of an image graph is the one whose cost, defined as $|C| = \sum_{e\in C}w_e$, is minimal; this corresponds to segmenting the image with a desirable balance of boundary and regional properties. Graph-Cut exposes a general segmentation energy function (the ``cost") as a combination of a regional term and a boundary term, $E(A) = \lambda \cdot R(A) + B(A)$. The regional term can model the a priori distribution of the pixel classes (the probability of a pixel belonging to foreground or background). The boundary term can be built from any boundary feature, such as local intensity gradients, zero-crossings, gradient direction or geometric costs. Although the region and boundary terms push the algorithm to find a boundary that balances the two, a lack of information in either or both terms can lead to incorrect segmentations. To compensate, hard constraints can be added to the energy function; a constraint may, for instance, force pixels of particular intensities to belong to the foreground or background, or be shape-based, forcing shapes like circles or ellipses onto the final segmentation. The proposed algorithm was applied to a variety of 2D and 3D images, and graph-cuts generalize to N-D segmentation problems as well. A number of qualitative observations are reported; however, there is a lack of quantitative comparison of the algorithm's performance against state-of-the-art algorithms, including those outside the graph-cut category. |
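A small illustrative sketch of binary graph-cut segmentation using NetworkX's min-cut routine rather than the specialized max-flow solvers used in practice: terminal edges carry the regional costs $R(A)$, neighbour edges carry the boundary costs $B(A)$, and the minimum $s$-$t$ cut yields the labelling. The dictionary-based inputs are assumptions for illustration.
\begin{verbatim}
import networkx as nx

def binary_graph_cut(pixels, unary_fg, unary_bg, pair_weight, lam=1.0):
    # pixels: iterable of node ids; unary_*: dict node -> regional cost R(A);
    # pair_weight: dict (p, q) -> boundary cost for neighbouring pixels.
    G = nx.DiGraph()
    for p in pixels:
        G.add_edge('s', p, capacity=lam * unary_bg[p])  # cost if p = background
        G.add_edge(p, 't', capacity=lam * unary_fg[p])  # cost if p = foreground
    for (p, q), w in pair_weight.items():
        G.add_edge(p, q, capacity=w)                    # boundary term
        G.add_edge(q, p, capacity=w)
    cut_value, (src_side, sink_side) = nx.minimum_cut(G, 's', 't')
    return {p: 1 if p in src_side else 0 for p in pixels}
\end{verbatim}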
[link]
Edge, contour and boundary detection in 2D images has been an area of active research, with a wide variety of algorithms proposed. However, due to the diversity of image types and content, developing automatic segmentation algorithms has been challenging, while manual segmentation is tedious and time consuming. Previous approaches have incorporated higher-level constraints, energy functionals (snakes) and global properties (graph-based formulations), yet they still fall short of fully automated operation for a variety of reasons. To this end, Barrett et al. propose a graph-based boundary extraction algorithm called interactive Live-Wire, an extension of the original live-wire algorithm presented by Mortensen et al. [1]. The algorithm recasts segmentation as a graph problem: an image $I$ is converted to an undirected graph with edges from each pixel $p$ to its 8-connected neighbours, and each link from a pixel to a neighbour is assigned a local cost by the function described below. Segmentation then becomes the problem of finding the path of minimum cumulative cost from a seed pixel to a free goal point. The local cost function is defined as $l(p,q) = w_G \cdot f_G(q) + w_Z \cdot f_Z(q) + w_D \cdot f_D(p,q)$, where $w_i,\ i \in \{G, Z, D\}$, are weight coefficients controlling the relative importance of the terms: the gradient magnitude $f_G(q)$, the Laplacian zero-crossing feature $f_Z(q)$, and the gradient direction feature $f_D(p,q)$. The first term, $f_G(q)$, is a strong measure of edge strength and is heavily weighted. The term $f_Z(q)$ provides a second measure of edge strength in the form of zero-crossing information. The third term, $f_D(p,q)$, adds a smoothness constraint to the live-wire boundary by assigning a high cost to rapidly changing gradient directions. The algorithm offers several additional features, namely boundary freezing, on-the-fly learning and data-driven path cooling. Boundary freezing is useful when the live-wire digresses from the desired object boundary during interactive use: by specifying another seed point right before the digression, the boundary up to that point is ``frozen" and no longer allowed to change. On-the-fly learning adds robustness by learning the cost distribution of a known ``good" boundary segment and using it to guide the live-wire to follow a similar distribution. Data-driven path cooling generates new seed points automatically as a function of the image data and path properties: pixels on ``stable" paths cool down and eventually freeze, producing new seed points. The reported results show that the average time taken to segment a region with Live-Wire was roughly 4.6 times shorter than manual tracing, while providing the same accuracy as manual tracing with high reproducibility. However, the method does not provide a way to ``snap out" of an \textit{automatically} frozen live-wire segment, on-the-fly training can fail where the edge characteristics of the object change too quickly, and little implementation-related information is provided, especially for the freezing and training components. |
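A sketch of the graph-search core using Dijkstra's algorithm: starting from the seed, it expands minimum cumulative-cost paths over the 8-connected pixel graph, and the boundary to any free goal point is read back through the predecessor pointers. The \texttt{link\_cost} callable stands in for $l(p,q)$ and is an assumption of this sketch.
\begin{verbatim}
import heapq

def neighbours(p, shape):
    # 8-connected neighbours of pixel p = (row, col) inside an image of `shape`.
    r, c = p
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if (dr or dc) and 0 <= r + dr < shape[0] and 0 <= c + dc < shape[1]:
                yield (r + dr, c + dc)

def live_wire_paths(link_cost, seed, shape):
    # link_cost(p, q) returns the local cost l(p, q) described above.
    dist, previous = {seed: 0.0}, {seed: None}
    heap = [(0.0, seed)]
    while heap:
        d, p = heapq.heappop(heap)
        if d > dist[p]:
            continue                     # stale heap entry
        for q in neighbours(p, shape):
            nd = d + link_cost(p, q)
            if nd < dist.get(q, float('inf')):
                dist[q], previous[q] = nd, p
                heapq.heappush(heap, (nd, q))
    return dist, previous
\end{verbatim}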
[link]
Automated segmentation of anatomical structures of interest from medical images is a well-established field of research in medical imaging. One such problem is segmenting the whole heart region from a magnetic resonance imaging (MRI) sequence, which is currently done manually and is time consuming and tedious. Although many automated techniques exist, the task remains challenging due to the complex nature of the problem, partly because of the low contrast between the heart and nearby tissue; moreover, many methods are unable to incorporate prior information into the process. To this end, Pluempitiwiriyawej et al. propose an active-contour, energy-minimization based method to segment the whole heart region, including the epicardium and the left and right ventricular endocardia. The method follows the framework laid out by Chan and Vese \cite{Chan2001}, but with a modified energy function consisting of four terms. The energy function is given below, where $C$ is the contour represented as a level set function $\phi(x,y)$: $J(C) = \lambda_1 J_1(C) + \lambda_2 J_2(C) + \lambda_3 J_3(C) + \lambda_4 J_4(C)$. The coefficients $\lambda_1, \ldots, \lambda_4$ determine the weights of the terms $J_1, \ldots, J_4$. The first term, $J_1(C)$, introduces stochastic models $\mathcal{M}_1, \mathcal{M}_2$ for the regions inside and outside the active contour $C$; the models dictate the probability distributions from which the image intensities of the two regions are assumed to be sampled. The negative log of this term is minimized, which maximizes the probability $p(u \mid C, \mathcal{M}_1, \mathcal{M}_2)$ given the contour $C$ and the models $\mathcal{M}_1, \mathcal{M}_2$. The second term, $J_2(C)$, is designed in the spirit of classical Snakes \cite{Kass1988} in that it uses edges to guide the contour towards the structure of interest; a simple edge map is used after convolution with a Gaussian filter to smooth out noise. The term $J_3(C)$ encodes a shape prior that constrains the contour to follow an elliptical shape, guiding it in conjunction with the region and edge information. The final term, $J_4(C)$, encodes the total Euclidean arc length of the contour, forcing it to be smooth and free of rough edges. Minimizing the energy function follows a three-task approach. The first task estimates the stochastic model parameters $\mu_k, \sigma^2_k$ by fixing the current contour $C$, taking derivatives of $J$ with respect to the model parameters, and setting them to zero. The second task estimates the parameters of the ellipse using a least squares method. The third task evolves the contour using the parameters estimated in the first two tasks so as to minimize $J$. The method also performs stochastic relaxation by dynamically changing the values of $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ as the optimization proceeds: the intuition is that the edge and region terms should guide the contour at the start of the optimization, while the shape prior and contour length terms should carry more weight near the end to regularize the final shape of the contour.
The study used 48 MRI studies acquired by imaging rat hearts and compared the proposed method with two earlier methods, namely Xu and Prince's GVF \cite{ChenyangXu1998} and Chan and Vese \cite{Chan2001}. The authors also design a new quantitative metric, a modification of the Chamfer matching technique \cite{Barrow}. The reported results are in excellent agreement with the gold-standard hand-traced contours; however, the corresponding similarity values of the competing methods against the human gold standard were not reported. |
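A sketch of two pieces of the optimization described above, under simplifying assumptions: task one reduces to sample means and variances of the intensities inside and outside the current level set when the region models are Gaussian, and the stochastic relaxation is shown here as a simple linear schedule (the actual schedule used by the authors is not reproduced).
\begin{verbatim}
import numpy as np

def estimate_region_models(image, phi):
    # With the contour (level set phi) fixed, the maximum-likelihood Gaussian
    # parameters of the inside (phi > 0) and outside (phi < 0) regions are the
    # sample means and variances of the corresponding intensities.
    inside, outside = image[phi > 0], image[phi < 0]
    return (inside.mean(), inside.var()), (outside.mean(), outside.var())

def annealed_weights(iteration, n_iterations):
    # Illustrative relaxation of the lambda weights: early iterations let the
    # region/edge terms dominate, later iterations give more weight to the
    # shape prior and arc-length terms.
    t = iteration / float(n_iterations)
    return (1.0 - 0.5 * t, 1.0 - 0.5 * t, 0.5 + 0.5 * t, 0.5 + 0.5 * t)
\end{verbatim}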
[link]
Typically, energy-minimization (snakes) based object detection frameworks evolve a parametrized curve guided by some form of image gradient information. Because of this heavy reliance on gradients, such approaches tend to fail when gradient information is misleading or unavailable, leaving the snake stuck in a local minimum away from the actual object. Moreover, a parametrized snake cannot model multiple evolving curves in a single run. To address these issues, Chan and Vese introduced a framework that uses region-based information to guide an evolving curve and aims to solve the minimal partition problem formulated by Mumford and Shah. The framework is built upon the following energy, where $C$ is a level-set formulation of the curve: $F(c_1, c_2, C) = \mu \cdot \mathrm{Length}(C) + \nu \cdot \mathrm{Area}(\mathrm{inside}(C)) + \lambda_1 \int_{\mathrm{inside}(C)}|u_0(x,y) - c_1|^2 \,dx\,dy + \lambda_2 \int_{\mathrm{outside}(C)}|u_0(x,y) - c_2|^2 \,dx\,dy$. The framework divides the image into two regions (per curve), referred to as the inside and outside of the curve. The first two terms control the physical aspects of the curve, namely its length and the area it encloses, with their contributions weighted by the parameters $\mu$ and $\nu$. The image forces correspond to the third and fourth terms, which are identical in form but act in their respective regions: they use $c_1$ and $c_2$, the mean intensity values inside and outside the curve, to drive the curve towards a minimum at which both regions are consistent with respect to their mean intensities. The framework also adopts an improved representation of the curve as a level set function $\phi(x, y)$, which has many numerical advantages and naturally supports multiple curves evolving in a single run, in contrast to the traditional snakes model where only one curve can be evolved at a time. The unknown function $\phi$ is computed from the Euler-Lagrange equations, formulated using a regularized Heaviside function $H$ and Dirac measure $\delta$. The framework was applied to numerous challenging 2D images with varying degrees of difficulty; it was able to segment point clouds, objects with blurred boundaries, and contours without gradients, all without requiring image denoising. Owing to the formulation of the $\phi$ approximation routine, the framework tends to find the actual global minimum largely independently of the initial position of the curve. However, the choice of the parameters $\lambda_{1,2}, \mu, \nu$ is heuristic and appears to be problem dependent. Furthermore, the framework's implicit dependence on absolute image intensities inside and outside the curve fails in specific cases where the region averages are not discriminative, though the authors propose using curvature and orientation information from the initial image $u_0$ in such cases. |
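A minimal sketch of the region statistics and fitting force that drive the evolution described above; a full implementation additionally includes the length/area regularization and a smeared $\delta(\phi)$ factor in the update, and scikit-image's reference implementation is noted as an alternative.
\begin{verbatim}
import numpy as np

def region_means(u0, phi):
    # c1, c2: average intensities inside (phi >= 0) and outside (phi < 0) the
    # current curve, the only image statistics the fitting terms depend on.
    return u0[phi >= 0].mean(), u0[phi < 0].mean()

def fitting_force(u0, phi, lambda1=1.0, lambda2=1.0):
    # Region-competition force from the two fitting terms: positive values
    # push the level set outward, negative values pull it inward.
    c1, c2 = region_means(u0, phi)
    return -lambda1 * (u0 - c1) ** 2 + lambda2 * (u0 - c2) ** 2

# Alternatively, scikit-image ships an implementation of this model:
# from skimage.segmentation import chan_vese
# seg = chan_vese(u0, mu=0.25, lambda1=1.0, lambda2=1.0)
\end{verbatim}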
[link]
Low-level tasks such as edge, contour and line detection are an essential precursor to downstream image analysis. However, most approaches to these problems work as isolated, autonomous entities, without using higher-level image information such as context, global shape, or user-level input; this leads to errors that propagate through the pipeline with no opportunity for later correction. To address this problem, Kass et al. investigate an energy-minimization based framework for edge, line and contour detection in 2D images. Although energy minimization had been used for similar tasks before, Kass et al.'s framework exposes a novel external force factor that allows external forces or stimuli to guide the ``snake" towards a ``correct" answer. Moreover, the framework exhibits active behaviour, since it is designed to always minimize its energy functional. A ``snake" is a controlled-continuity spline under the influence of the forces in the energy functional, which is made up of three terms, where $v(s)$ is the parametrized snake: $E_{snake}^{*} = \int_{0}^{1}\left[E_{internal}(v(s)) + E_{image}(v(s)) + E_{constraint}(v(s))\right]ds$. The internal energy term depends entirely on the shape of the curve and constrains the snake to be continuous and well behaved; it encapsulates two terms that control the degree of stretching of the curve (via the first derivative of the curve) and prevent it from bending too much (via the second derivative of the curve). The image energy term controls which salient image features the snake tracks. It encapsulates three terms corresponding to lines, edges and terminations: the line term uses the raw image intensity as its energy, the edge term uses the negative squared image gradient so that the snake is attracted to contours with large gradients, and the termination term allows the snake to find the ends of line segments and corners in the image. The combination of these terms attracts the snake to edges and terminations. The constraint term models external stimuli, which can come from higher-level processes or from user intervention; for instance, the framework allows a spring to be attached to the curve to constrain the snake or pull it in a desired direction. To test the framework, a user interface called ``Snake Pit" was created which allowed user control in the form of spring attachments, and the overall approach was evaluated on a number of different images, including ones containing contour illusions. The framework was also extended to stereo matching and motion tracking; for stereo, an extra energy term constrains the snakes in the two disparate images to stay close in coordinate space, modelling the fact that image disparities do not change too rapidly. However, the framework struggles when subjected to high rates of change in both stereo matching and motion tracking. Overall the proposed framework performed acceptably in many challenging scenarios, but its underlying assumption of following edges and contours using image gradients may fail when little gradient information is present in the image. |
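A discrete sketch of the energy evaluated for a sampled snake, covering the internal stretching/bending terms and the edge attraction term only (the line, termination and constraint terms are omitted, and the weights are illustrative); an optimizer would move the points $v$ to reduce this value.
\begin{verbatim}
import numpy as np

def snake_energy(v, image_grad_mag, alpha=0.1, beta=0.1, w_edge=1.0):
    # v: (n, 2) array of snake points (row, col) on a closed contour;
    # image_grad_mag: gradient-magnitude image.
    d1 = np.roll(v, -1, axis=0) - v                        # stretching (1st diff)
    d2 = np.roll(v, -1, axis=0) - 2 * v + np.roll(v, 1, axis=0)  # bending (2nd diff)
    e_internal = alpha * (d1 ** 2).sum() + beta * (d2 ** 2).sum()
    rows, cols = v[:, 0].astype(int), v[:, 1].astype(int)
    e_image = -w_edge * (image_grad_mag[rows, cols] ** 2).sum()  # edge attraction
    return e_internal + e_image
\end{verbatim}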