Computational modeling for parallel grid-based recursive Bayesian estimation: parallel computation using graphics processing unit

Tong, Xianqiao; Furukawa, Tomonari; Durrant-Whyte, Hugh

doi:10.1186/2195-5468-1-15

Research
Open access
Published: 16 December 2013

Journal of Uncertainty Analysis and Applications volume 1, Article number: 15 (2013)

Abstract

This paper presents the performance modeling of the real-time grid-based recursive Bayesian estimation (RBE), particularly the parallel computation using graphics processing unit (GPU). The proposed modeling formulates data transmission between the central processing unit (CPU) and the GPU as well as floating point operations to be carried out in each CPU and GPU necessary for one iteration of the real-time grid-based RBE. Given the specifications of the computer hardware, the proposed modeling can thus estimate the total amount of time cost for performing the grid-based RBE in a real-time environment. A new prediction formulation, which adopted separable convolution, is proposed to further accelerate the real-time grid-based RBE. The performance of the proposed modeling was investigated, and parametric studies have first demonstrated its validity in various conditions by showing that the average error of estimation in computational performance stays below 6% to 7%. Utilizing the prediction with separable convolution, the grid-based RBE has also been found to perform within 1 ms, although the size of the problem was relatively large.

Introduction

Recursive Bayesian estimation (RBE) estimates the belief of a dynamically moving target by updating the belief in both time and observation [1]. The RBE consists of two fundamental processes: prediction and correction. The prediction process updates the belief using the motion model of the target, whereas the correction process updates the belief using the current observation. If the target is observable, the accuracy of the RBE can be maintained by the correction process using valid observations. When the target is not observable, the accuracy of the RBE relies heavily on the prediction process, and the error accumulates because no valid observation is available for the correction process. For accurate estimation, the RBE has to run fast enough to capture the motion of the target with a well-defined target motion model, which requires good synchronization between the discrete representation of the model and the RBE. As a result, recent years have seen many real-time enhanced RBE techniques that improve the speed of the RBE.

One such technique is the modified ensemble Kalman filter (EnKF). The EnKF allows non-Gaussian estimation by minimizing a cost function defined by a non-Gaussian observation error with a pre-conditioned conjugate gradient method [2]. The Langevin-Markov chain Monte Carlo (MCMC) method, which represents the non-Gaussian belief by sampling it using a Markov chain and the Langevin equation, can also serve as a non-Gaussian RBE technique [3]. Another sampling method is the interactive particle filter (IPF), which is able to flexibly mitigate the belief space complexity [4]. An ensemble Kalman-particle predictor-corrector filter is a hybrid method that combines the advantages of the EnKF and the IPF and is able to effectively deal with high-dimensional non-Gaussian problems [5]. A tree-based estimator approximates the posterior belief distribution at multiple resolutions to be effective for high-dimensional problems [6], whereas a maximum likelihood state estimation method can also achieve non-Gaussian RBE [7] by using a finite Gaussian mixture model.

The grid-based RBE technique maintains good accuracy for the belief since the entire target space is spatially discretized [8]. The accuracy is obtained by fine discretization of the target space, which at the same time makes the computation inefficient. Furukawa et al. [9, 10] refined the grid-based RBE by developing a more general element-based RBE. The generalized elements can accurately represent an arbitrary target space with a smaller number of elements than the grid-based RBE and thus reduce the computation of the RBE. Lavis et al. proposed an enhanced grid-based RBE that allows the update of not only the belief but also the target space [11]. Because of the dynamic adjustment of the target space, the computation of the RBE is further reduced. In addition, the parallel grid-based RBE has been proposed; it significantly accelerated the computation of the RBE and made its real-time implementation possible by utilizing the GPU’s strong parallel computational capability [12]. Although these efforts successfully reduce the computation of the RBE, the accuracy of the RBE is not well maintained when the prediction process dominates the RBE during a no-observation period. The time cost of one iteration of the RBE is critical for overcoming this issue because the RBE can maintain its accuracy during the no-observation period only if that time cost matches the time increment of the discrete target motion model.

This paper presents a performance modeling for the parallel grid-based RBE, particularly the parallel computation using the GPU, and it is able to determine the time cost of one iteration of the RBE. The proposed modeling formulates the total amount of data transmission between the CPU and the GPU and the total number of floating point operations to be carried out in each CPU and GPU necessary for one iteration of the parallel grid-based RBE. Given the specifications of the computer hardware, it is thus possible to estimate the time cost for one iteration of the parallel grid-based RBE. In order to perform the parallel grid-based RBE at maximum speed, the proposed modeling also reformulates and implements the prediction process with separable convolution.

The paper is organized as follows. The following section reviews the recursive Bayesian estimation as well as the parallel grid-based RBE. The "Computational performance modeling" section presents the proposed reformulation of the prediction process for the parallel grid-based RBE and its computational performance modeling. The "Experimental validation" section demonstrates the validity and efficacy of the proposed modeling through numerical examples, and the conclusion and future work are summarized in the final section.

Parallel grid-based RBE

Problem statement

The motion of an object, o, is deterministically given by the following equation:

{\dot{x}}^{o} = f^{o} (x^{o}, u^{o}, w^{o}, t),

(1)

where x^o represents the state of the object, u^o represents the object control input, w^o represents the system noise, which includes environmental influences on the target, and t represents the time. In general, the state of the object describes its two-dimensional location but may also include other variables such as velocity. Let the time interval between the consecutive time steps be defined as Δ t. By integrating Equation (1), the state of the object at the time step k is given by

x_{k}^{o} = x_{k - 1}^{o} + \int_{t_{k - 1}}^{t_{k - 1} + Δt} f^{o} (x^{o}, u^{o}, w^{o}, t) dt,

(2)

where $t_{k - 1}$ is the time which corresponds to the time step k−1.

Recursive Bayesian estimation

Prediction

The prediction process starts with the numerical implementation of the object motion model defined in Equation (2). For simplicity, the numerical integration is carried out by the left Riemann sum. By dividing the time interval Δ t between the consecutive time steps into n subintervals, the state of the object at the time step k is given by

x_{k}^{o} = x_{k - 1}^{o} + \sum_{i = 0}^{n - 1} f^{o} (x^{o}, u^{o}, w^{o}, t_{k - 1} + i \frac{Δt}{n}) \frac{Δt}{n} .

(3)
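As an illustration of Equation (3), the following is a minimal sketch of one left Riemann sum step in plain Python. The function name `left_riemann_step`, the scalar state, and the constant-velocity motion model `f` are hypothetical choices made for this example only. The sketch also assumes that the state passed to f is held at its previous value inside the sum, which Equation (3) does not specify.

```python
def left_riemann_step(f, x_prev, u, w, t_prev, dt, n):
    """Advance the state by one time step dt using n left-endpoint subintervals."""
    h = dt / n  # width of each subinterval, dt / n in Equation (3)
    # Assumption: the state argument of f is frozen at x_prev during the step.
    increment = sum(f(x_prev, u, w, t_prev + i * h) * h for i in range(n))
    return x_prev + increment


# Hypothetical 1D motion model: velocity equals control input plus noise.
def f(x, u, w, t):
    return u + w


x_next = left_riemann_step(f, x_prev=1.0, u=2.0, w=0.0, t_prev=0.0, dt=0.5, n=10)
```

With a constant velocity of 2 over a time interval of 0.5, the state advances by approximately 1, up to floating point rounding.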

Let a sequence of the observations of the object from time step 1 to k−1 be defined as $^{s} {\tilde{z}}_{1 : k - 1} \equiv \{^{s} {\tilde{z}}_{i} | \forall i \in \{1, \dots, k - 1\}\}$ . Notice here that $\tilde{(\cdot)}$ represents an instance of variable (·). The prediction process computes the belief of the current state $p (x_{k}^{o} |^{s} {\tilde{z}}_{1 : k - 1})$ from the belief in the previous time step $p (x_{k - 1}^{o} |^{s} {\tilde{z}}_{1 : k - 1})$ . The prediction is iteratively carried out by the Chapman-Kolmogorov equation and given by

\begin{array}{lcr} p (x_{k}^{o} |^{s} {\tilde{z}}_{1 : k - 1}) \\ = \int_{X^{o}} p (x_{k}^{o} | x_{k - 1}^{o}) p (x_{k - 1}^{o} |^{s} {\tilde{z}}_{1 : k - 1}) d x_{k - 1}^{o}, \end{array}

(4)

where $p (x_{k}^{o} | x_{k - 1}^{o})$ is the probabilistic representation of the object motion model defined in Equation (3), which maps the probability of transition from the previous state $x_{k - 1}^{o}$ to the current state $x_{k}^{o}$ . The prediction process at k=1 is carried out by letting $p (x_{k - 1}^{o} |^{s} {\tilde{z}}_{1 : k - 1}) = p ({\tilde{x}}_{0}^{o})$ , where $p ({\tilde{x}}_{0}^{o})$ is defined as a prior belief of the object in terms of the probability density function. Equation (4) indicates that the performance of the prediction process relies on the object motion model $p (x_{k}^{o} | x_{k - 1}^{o})$ . Because the object motion model is usually non-Gaussian, the belief can eventually become heavily non-Gaussian when only the prediction process is applied in the RBE.

Correction

The correction process is associated with the definition of the observation model. Let the probability of detection (PoD) be $0 \leq P_{d} (x_{k}^{o}) \leq 1$ as a reliable measure for detecting the object in terms of the object state. Observable region $^{s} X_{k}^{o}$ is defined as

^{s} X_{k}^{o} = \{x_{k}^{o} | 0 < P_{d} (x_{k}^{o}) \leq 1\} .

(5)

The observation $^{s} z_{k}$ at the time step k is given by

^{s} z_{k} = \begin{cases} h^{s} (x_{k}^{o}, v_{k}^{s}) & x_{k}^{o} \in {}^{s} X_{k}^{o} \\ \emptyset & x_{k}^{o} \notin {}^{s} X_{k}^{o}, \end{cases}

(6)

where $v_{k}^{s}$ represents the observation noise at the time step k, and $\emptyset$ represents an empty element, indicating that the observation contained no information on the object or that the target is unobservable when it is not within the observable region.

The correction process then computes the belief $p (x_{k}^{o} |^{s} {\tilde{z}}_{1 : k})$ from the predicted belief $p (x_{k}^{o} |^{s} {\tilde{z}}_{1 : k - 1})$ , which is conditioned on the observations up to the previous time step, and a new observation $^{s} {\tilde{z}}_{k}$ . The equation is derived by applying formulas for marginal distribution and conditional independence and given by

\begin{array}{lcr} p (x_{k}^{o} |^{s} {\tilde{z}}_{1 : k}) \\ = \frac{l (x_{k}^{o} |^{s} {\tilde{z}}_{k}) p (x_{k}^{o} |^{s} {\tilde{z}}_{1 : k - 1})}{\int_{X^{o}} l (x_{k}^{o} |^{s} {\tilde{z}}_{k}) p (x_{k}^{o} |^{s} {\tilde{z}}_{1 : k - 1}) d x_{k}^{o}}, \end{array}

(7)

where $l (x_{k}^{o} |^{s} {\tilde{z}}_{k})$ represents the observation likelihood of $x_{k}^{o}$ . The observation likelihood is defined with reference to the PoD and is given by

l (x_{k}^{o} |^{s} {\tilde{z}}_{k}) = \begin{cases} p (x_{k}^{o} |^{s} {\tilde{z}}_{k}) & ^{s} {\tilde{z}}_{k} \in {}^{s} X_{k}^{o} \\ 1 - P_{d} (x_{k}^{o}) & ^{s} {\tilde{z}}_{k} \notin {}^{s} X_{k}^{o}, \end{cases}

(8)

where $p (x_{k}^{o} |^{s} {\tilde{z}}_{k})$ is the probabilistic representation of the observation model defined in Equation (6). When the object is within the observable region, a positive observation is obtained and the observation likelihood is a probability density function given the current observation of the object. When the object is out of the observable region, the negative observation is defined with respect to the PoD as the observation likelihood. Because the observation likelihood of the negative observation is non-Gaussian, the object belief becomes heavily non-Gaussian as soon as a negative observation occurs in the RBE.

Parallel grid-based RBE

Representation of target space and belief

The grid-based RBE achieves non-Gaussian belief estimation by first representing the arbitrary target space $X^{t}$ in terms of a set of grid cells by constructing a rectangular space $X^{r}$ that covers the target space. For simplicity, let us consider a two-dimensional target space, and it is represented as $m^{t} = [x^{t}, y^{t}] \in X^{t}$ . The creation of a rectangular space $X^{r}$ is achieved then by defining the minimum and maximum values of the target space

\begin{array}{lcr} x_{min}^{t} = \min \{x^{t}\}, x_{max}^{t} = \max \{x^{t}\} \\ y_{min}^{t} = \min \{y^{t}\}, y_{max}^{t} = \max \{y^{t}\} \end{array}

and subsequently creating a rectangular space as $X^{r} = \{m | \forall x \in [x_{min}^{t}, x_{max}^{t}], \forall y \in [y_{min}^{t}, y_{max}^{t}]\} \supseteq X^{t}$ , where $m = [x, y]$ . The grid space is further introduced by discretizing the rectangular space into $n_{x}$ and $n_{y}$ grid cells in the two directions, respectively. The dimensions of a grid cell are defined as $Δ x^{r} = (x_{max}^{t} - x_{min}^{t}) / n_{x}$ and $Δ y^{r} = (y_{max}^{t} - y_{min}^{t}) / n_{y}$ . This results in introducing the center of each grid cell as

\begin{array}{lcr} {\bar{m}}_{i_{x}, i_{y}}^{r} = [{\bar{x}}_{i_{x}}^{r}, {\bar{y}}_{i_{y}}^{r}] = [(i_{x} - 0.5) Δ x^{r} + x_{min}^{t}, (i_{y} - 0.5) Δ y^{r} + y_{min}^{t}], \end{array}

(9)

where $\forall i_{x} \in \{1, \dots, n_{x}\}$ and $\forall i_{y} \in \{1, \dots, n_{y}\}$ . Each grid cell is defined as

X_{i_{x}, i_{y}}^{r} = \{m | | x - {\bar{x}}_{i_{x}}^{r} | < \frac{1}{2} Δ x^{r}, | y - {\bar{y}}_{i_{y}}^{r} | < \frac{1}{2} Δ y^{r}\} .

(10)

Note that $⋃_{i_{x} = 1}^{n_{x}} ⋃_{i_{y} = 1}^{n_{y}} X_{i_{x}, i_{y}}^{r} = X^{r}$ and $⋂_{i_{x} = 1}^{n_{x}} ⋂_{i_{y} = 1}^{n_{y}} X_{i_{x}, i_{y}}^{r} = \emptyset$ . Finally, the grid cells that represent the target space are selected: a grid cell is selected when its center is located in the target space, that is, $X_{i_{x}, i_{y}}^{r} \subset X^{t}$ if ${\bar{m}}_{i_{x}, i_{y}}^{r} \in X^{t}$ . The approximate target space derived by the processes described above is $X^{t} \approx \{X_{1}^{r}, X_{2}^{r}, \dots, X_{n_{g}}^{r}\}$ , where $n_{g}$ is the number of grid cells approximating the target space.
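The construction of the grid and the selection of cells can be sketched as follows. This is a minimal illustration of Equation (9) and the center-based selection rule, not the authors' implementation. The function names `grid_centers` and `select_cells` and the unit-disc target space are hypothetical and are introduced for this example only.

```python
def grid_centers(x_min, x_max, y_min, y_max, n_x, n_y):
    """Return the cell centers of Equation (9), keyed by the 1-based index (i_x, i_y)."""
    dx = (x_max - x_min) / n_x  # cell width
    dy = (y_max - y_min) / n_y  # cell height
    centers = {}
    for i_x in range(1, n_x + 1):
        for i_y in range(1, n_y + 1):
            centers[(i_x, i_y)] = ((i_x - 0.5) * dx + x_min,
                                   (i_y - 0.5) * dy + y_min)
    return centers, dx, dy


def select_cells(centers, in_target):
    """Keep the cells whose center lies inside the target space."""
    return [idx for idx, m in centers.items() if in_target(m)]


# Hypothetical target space: the unit disc, covered by the rectangle [-1, 1] x [-1, 1].
centers, dx, dy = grid_centers(-1.0, 1.0, -1.0, 1.0, n_x=4, n_y=4)
cells = select_cells(centers, lambda m: m[0] ** 2 + m[1] ** 2 <= 1.0)
n_g = len(cells)  # number of grid cells approximating the target space
```

In this small example the four corner cells have centers outside the disc, so 12 of the 16 cells are retained.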

The belief is usually represented by a probability density function over the target space. Similar to the discretization of the target space, the belief could also be represented discretely by grid cells. The position of each grid cell can be described in the two-dimensional integer space as $[i_{x}, i_{y}]$ , where $i_{x} \in \{1, \dots, n_{x}\}$ and $i_{y} \in \{1, \dots, n_{y}\}$ . With the integer representation, the belief at the grid cell $[i_{x}, i_{y}]$ can be represented as $p^{i_{x}, i_{y}} (\cdot)$ .

Prediction

The prediction process requires the numerical evaluation of Equation (4). Given the belief of the previous state $p^{i_{x}, i_{y}} (x_{k - 1}^{t} |^{s} {\tilde{z}}_{1 : k - 1})$ at the grid cell $[i_{x}, i_{y}]$ and the target motion model $p^{I_{x}, I_{y}} (x_{k}^{t} | x_{k - 1}^{t})$ constructed as a matrix of size $I_{x} \times I_{y}$ and used as a convolution kernel, the predicted belief of the current state can be numerically computed as

\begin{array}{lcr} p^{i_{x}, i_{y}} (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k - 1}) \\ = p^{i_{x}, i_{y}} (x_{k - 1}^{t} |^{s} {\tilde{z}}_{1 : k - 1}) \otimes p^{I_{x}, I_{y}} (x_{k}^{t} | x_{k - 1}^{t}), \end{array}

(11)

where ⊗ indicates the two-dimensional convolution of the belief of the previous state with the probabilistic target motion model. Therefore, the belief of the current state is given by

\begin{array}{lcr} p^{i_{x}, i_{y}} (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k - 1}) \\ = \sum_{β = 1}^{I_{y}} \sum_{α = 1}^{I_{x}} p^{α, β} (x_{k}^{t} | x_{k - 1}^{t}) p^{i_{x} - α + 1, i_{y} - β + 1} (x_{k - 1}^{t} |^{s} {\tilde{z}}_{1 : k - 1}) . \end{array}

(12)

The parallelization of the prediction process is straightforward. Since the prediction at each grid cell, given by Equation (12), can be performed independently, the parallelization of the prediction corresponds to the parallelization of the equation and achieves a parallel efficiency of 100% in an ideal environment. However, this equation also shows that the computation for the prediction process is largely dominated by the size of the convolution kernel. For real-time performance, it is important to use a convolution kernel of an appropriate size: large enough to capture the motion of the target and small enough to allow fast computation.
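The per-cell sum of Equation (12) can be sketched serially as below. This is an illustrative sketch in plain Python rather than GPU code. The function name `predict_2d` is hypothetical, indices are 0-based instead of the 1-based indices of the equation, and belief values outside the grid are assumed to be zero, a boundary treatment the text does not specify. Each pass of the two outer loops is independent of the others, which is what makes the process parallelizable per grid cell.

```python
def predict_2d(belief, kernel):
    """Two-dimensional convolution of the previous belief with the motion model kernel."""
    n_x, n_y = len(belief), len(belief[0])
    I_x, I_y = len(kernel), len(kernel[0])
    predicted = [[0.0] * n_y for _ in range(n_x)]
    for ix in range(n_x):          # each (ix, iy) is independent of the others
        for iy in range(n_y):
            acc = 0.0
            for a in range(I_x):
                for b in range(I_y):
                    jx, jy = ix - a, iy - b
                    # Assumption: cells outside the grid contribute nothing.
                    if 0 <= jx < n_x and 0 <= jy < n_y:
                        acc += kernel[a][b] * belief[jx][jy]
            predicted[ix][iy] = acc
    return predicted


# Hypothetical example: all belief in one cell, uniform 2 x 2 motion kernel.
belief = [[0.0] * 4 for _ in range(4)]
belief[1][1] = 1.0
kernel = [[0.25, 0.25], [0.25, 0.25]]
predicted = predict_2d(belief, kernel)
```

The inner double loop performs one multiplication and one addition per kernel entry, so the cost per grid cell grows with the product of the two kernel dimensions.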

Correction

The correction process corresponds to the numerical computation of Equation (7). Given the predicted belief $p (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k - 1})$ and the new observation likelihood $l^{i_{x}, i_{y}} (x_{k}^{t} |^{s} {\tilde{z}}_{k})$ at the grid cell $[i_{x}, i_{y}]$ , the corrected belief is computed by

p^{i_{x}, i_{y}} (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k}) = \frac{q^{i_{x}, i_{y}} (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k})}{A_{c} \sum_{α = 1}^{n_{g}} q^{α} (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k})},

(13)

where $A_{c}$ is the area of a grid cell, and

q^{i_{x}, i_{y}} (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k}) = l^{i_{x}, i_{y}} (x_{k}^{t} |^{s} {\tilde{z}}_{k}) p^{i_{x}, i_{y}} (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k - 1}) .

(14)

The parallelization of the correction process requires the breakdown of the process as it identifies which subprocesses are parallelizable. By observing the mathematical operations, the correction process can be broken down into three steps:

1. Calculate $q^{i_{x}, i_{y}} (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k})$ by multiplying the predicted belief $p^{i_{x}, i_{y}} (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k - 1})$ with the observation likelihood $l^{i_{x}, i_{y}} (x_{k}^{t} |^{s} {\tilde{z}}_{k})$ ;
2. Sum $\sum_{α = 1}^{n_{g}} q^{α} (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k})$ and multiply the sum by $A_{c}$ ;
3. Calculate $p^{i_{x}, i_{y}} (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k})$ by dividing $q^{i_{x}, i_{y}} (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k})$ by $A_{c} \sum_{α = 1}^{n_{g}} q^{α} (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k})$ .

The breakdown indicates that steps 1 and 3 are grid-wise sub-processes, which can be conducted independently. Therefore, for the correction process, steps 1 and 3 can be computed in parallel, whereas step 2 is not parallelizable.
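The three steps can be sketched serially as follows, as a minimal illustration of Equations (13) and (14) in plain Python. The function name `correct` and the 2 x 2 example values are hypothetical. The sketch assumes the normalizing sum is nonzero, since a likelihood that is zero everywhere would cause a division by zero.

```python
def correct(predicted, likelihood, cell_area):
    """Apply the correction of Equation (13) to a gridded predicted belief."""
    # Step 1: cell-wise product of likelihood and predicted belief (Equation (14)).
    q = [[l * p for l, p in zip(l_row, p_row)]
         for l_row, p_row in zip(likelihood, predicted)]
    # Step 2: a single sum over all cells, scaled by the cell area A_c.
    norm = cell_area * sum(sum(row) for row in q)
    # Step 3: cell-wise division by the normalizing constant.
    return [[value / norm for value in row] for row in q]


# Hypothetical example: uniform density on four cells of area 0.25 each.
predicted = [[1.0, 1.0], [1.0, 1.0]]
likelihood = [[1.0, 0.5], [0.5, 0.0]]
corrected = correct(predicted, likelihood, cell_area=0.25)
```

After the correction, the belief multiplied by the cell area again sums to one over the grid, and the cell with zero likelihood carries no belief.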

Target state evaluation

In the parallel grid-based RBE, the state of the target is evaluated by Equation (2) in the integral form at each time step. For an accurate evaluation of the target state, an appropriate choice of the time interval Δ t is necessary. Given a specific computer hardware configuration, each iteration of the parallel grid-based RBE requires a certain amount of time $Δ t_{c}$ to perform the computation, including both the prediction and correction processes. In order to achieve an accurate evaluation of the target state, the time interval Δ t needs to be chosen such that it matches $Δ t_{c}$ . As shown in Figure 1, the evaluated target states match the real target states only when Δ t is identical to $Δ t_{c}$ . When Δ t is smaller or larger than $Δ t_{c}$ , the evaluation of the target states fails and eventually leads to large accumulated errors. $Δ t_{c}$ is determined not only by the parallel grid-based RBE itself but also by its computational performance on the specific computer hardware configuration.

Computational performance modeling

Acceleration of prediction process

Since an RBE designed to run at high frequency uses a Markovian target motion model that is well approximated by a Gaussian probability density, the proposed modeling first reformulates the prediction process under the Gaussian assumption as a pre-process, which accelerates the parallel grid-based RBE toward its maximum performance. Under the Gaussian assumption, the convolution kernel, a matrix of size $I_{x} \times I_{y}$ , can be separated into two vector kernels, a technique known as separable convolution: a column kernel of length $I_{x}$ and a row kernel of length $I_{y}$ . Therefore, the target motion model matrix is separated as

p^{I_{x}, I_{y}} (x_{k}^{t} | x_{k - 1}^{t}) = {}^{c} p^{I_{x}} (x_{k}^{t} | x_{k - 1}^{t}) \, {}^{r} p^{I_{y}} (x_{k}^{t} | x_{k - 1}^{t}),

(15)

where $^{c} p^{I_{x}} (x_{k}^{t} | x_{k - 1}^{t})$ and $^{r} p^{I_{y}} (x_{k}^{t} | x_{k - 1}^{t})$ are the column kernel and row kernel, respectively, which are vectors of length $I_{x}$ and $I_{y}$ . Substituting Equation (15) into Equation (11), the predicted belief of the current state can be computed as

\begin{array}{lcr} p^{i_{x}, i_{y}} (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k - 1}) \\ = p^{i_{x}, i_{y}} (x_{k - 1}^{t} |^{s} {\tilde{z}}_{1 : k - 1}) \otimes {}^{c} p^{I_{x}} (x_{k}^{t} | x_{k - 1}^{t}) \otimes {}^{r} p^{I_{y}} (x_{k}^{t} | x_{k - 1}^{t}), \end{array}

(16)

which means that the prediction process can be broken down into two steps:

\begin{array}{lcr} u^{i_{x}, i_{y}} (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k - 1}) \\ = p^{i_{x}, i_{y}} (x_{k - 1}^{t} |^{s} {\tilde{z}}_{1 : k - 1}) \otimes {}^{c} p^{I_{x}} (x_{k}^{t} | x_{k - 1}^{t}) \\ = \sum_{α = 1}^{I_{x}} {}^{c} p^{α} (x_{k}^{t} | x_{k - 1}^{t}) p^{i_{x} - α + 1, i_{y}} (x_{k - 1}^{t} |^{s} {\tilde{z}}_{1 : k - 1}), \end{array}

(17)

and

\begin{array}{lcr} p^{i_{x}, i_{y}} (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k - 1}) \\ = u^{i_{x}, i_{y}} (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k - 1}) \otimes {}^{r} p^{I_{y}} (x_{k}^{t} | x_{k - 1}^{t}) \\ = \sum_{β = 1}^{I_{y}} {}^{r} p^{β} (x_{k}^{t} | x_{k - 1}^{t}) u^{i_{x}, i_{y} - β + 1} (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k - 1}) . \end{array}

(18)

These equations show that the prediction process at each grid cell is carried out by performing two one-dimensional convolutions, one in each of the horizontal and vertical directions, instead of the original single two-dimensional convolution, while retaining complete parallelizability. For Equation (17), the number of floating point operations for each grid cell is $2 I_{x}$ since $I_{x}$ repetitions of one multiplication and one summation are necessary, whereas the number of floating point operations for Equation (18) is $2 I_{y}$ by the same reasoning. With a total of $n_{g}$ grid cells, the total number of floating point operations for the prediction process is thus given by

N_{p} = 2 n_{g} (I_{x} + I_{y}) .

(19)

This is considerably smaller than that of the original formulation, which is derived as $2 n_{g} I_{x} I_{y}$ via Equation (12), since $I_{x} + I_{y} \ll I_{x} I_{y}$ for an appropriate prediction process.
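The two-pass scheme of Equations (17) and (18), together with the two operation counts, can be sketched in plain Python as below. This is a serial illustration, not the GPU implementation. The function names are hypothetical, indices are 0-based, and cells outside the grid are assumed to hold zero belief, which the text does not specify. The example kernels are arbitrary rather than Gaussian, since the two-pass result only depends on the kernel being a product of a column and a row vector.

```python
def convolve_x(belief, col_kernel):
    """First pass (Equation (17)): 1D convolution along the x index."""
    n_x, n_y = len(belief), len(belief[0])
    u = [[0.0] * n_y for _ in range(n_x)]
    for ix in range(n_x):
        for iy in range(n_y):
            u[ix][iy] = sum(col_kernel[a] * belief[ix - a][iy]
                            for a in range(len(col_kernel)) if ix - a >= 0)
    return u


def convolve_y(u, row_kernel):
    """Second pass (Equation (18)): 1D convolution along the y index."""
    n_x, n_y = len(u), len(u[0])
    p = [[0.0] * n_y for _ in range(n_x)]
    for ix in range(n_x):
        for iy in range(n_y):
            p[ix][iy] = sum(row_kernel[b] * u[ix][iy - b]
                            for b in range(len(row_kernel)) if iy - b >= 0)
    return p


def flops_separable(n_g, I_x, I_y):
    """Operation count of the separable prediction: 2 n_g (I_x + I_y)."""
    return 2 * n_g * (I_x + I_y)


def flops_full(n_g, I_x, I_y):
    """Operation count of the original prediction: 2 n_g I_x I_y."""
    return 2 * n_g * I_x * I_y


# Hypothetical example: all belief in one cell, kernels of length 2.
belief = [[0.0] * 4 for _ in range(4)]
belief[1][1] = 1.0
col_kernel, row_kernel = [0.5, 0.5], [0.25, 0.75]
predicted = convolve_y(convolve_x(belief, col_kernel), row_kernel)

# Operation counts for a 1,000 x 1,000 grid and a 50 x 50 kernel.
ratio = flops_full(10**6, 50, 50) // flops_separable(10**6, 50, 50)
```

For the single occupied cell, each predicted value equals the product of one column-kernel entry and one row-kernel entry, which is what the full two-dimensional kernel of Equation (15) would give. For the 50 x 50 kernel, the operation count of the original formulation is 25 times that of the separable one.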

Parallel computation using GPU

Following Equations (16) and (13) for the prediction and correction process, respectively, Figure 2 shows the schematic diagram of the proposed accelerated parallel grid-based RBE using GPU. For efficiency, the GPU stores the entire data for RBE in the global memory and performs RBE using local memories. As a result, the data transmission between the CPU’s memory and the GPU’s local memories is carried out via the GPU’s global memory, and all the parallelizable floating point operations are executed using the local memories. For the prediction process, the data to be transmitted from the CPU’s memory to the GPU’s local memories are the previous belief $p (x_{k - 1}^{t} |^{s} {\tilde{z}}_{1 : k - 1})$ and the target motion model $p (x_{k}^{t} | x_{k - 1}^{t})$ . Since the predicted belief is in the local memories, the correction needs only the observation likelihood to be initially transmitted in addition. After performing the multiplication of $p (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k - 1})$ and the observation likelihood $l (x_{k}^{t} |^{s} {\tilde{z}}_{k})$ using GPU’s local memories, the result $q (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k})$ is transmitted to the CPU’s memory to calculate the sum $A_{c} \sum_{α = 1}^{n_{g}} q^{α} (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k})$ . The sum is then transmitted back to the GPU’s local memories to perform divisions in parallel and update the belief $p (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k})$ . Finally, the belief is transmitted back to the CPU’s memory for the next iteration of the accelerated parallel grid-based RBE.

Modeling of computational performance

The computational performance of the accelerated parallel grid-based RBE using GPU is determined not only by the performance of the CPU but also by the performance of the GPU and that of data transmission. As a result, the time cost of one iteration of the accelerated parallel grid-based RBE is given by

\begin{array}{lcr} Δ t_{c} & = & Δ t_{trans} + Δ t_{G} + Δ t_{C}, \end{array}

(20)

where $Δ t_{trans}$ represents the data transmission time cost between the CPU’s memory and the GPU’s global memory as well as that between the local and the global memory inside the GPU, $Δ t_{G}$ represents the time cost of the parallel computation performed on the GPU, and $Δ t_{C}$ represents the time cost of the computation performed on the CPU.

Data transmission

In order to determine the data transmission time cost $Δ t_{trans}$ for one iteration of the accelerated parallel grid-based RBE, the data transmitted among the CPU’s memory, GPU’s global memory, and GPU’s local memory need to be evaluated in both the prediction and correction processes. Let the amount of data transmitted in the unit of bytes be defined as

A = PN,

(21)

where P is the precision of the numerical representation, and N is defined as the number of data transmitted. Since the precision is usually constant, the amount of data transmitted could be derived in terms of the number of data transmitted. The numbers of data of the belief and the target motion model for the prediction process are $n_{g}$ and $I_{x} + I_{y}$ , respectively. The same numbers of data, $n_{g}$ and $I_{x} + I_{y}$ , are transmitted to the GPU’s local memory to perform parallel calculation. In the correction process, the number of data of the likelihood to be transmitted from the CPU’s memory to the GPU’s local memory through the GPU’s global memory is $n_{g}$ , whereas the number of data of the result $q (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k})$ to be transmitted from the GPU’s local memory to the CPU’s memory through the GPU’s global memory is similarly $n_{g}$ . The number of data of the sum, $A_{c} \sum_{α = 1}^{n_{g}} q^{α} (x_{k}^{t} |^{s} {\tilde{z}}_{1 : k})$ , to be then transmitted to the GPU’s local memory to perform parallel divisions is 1, and finally, the number of data to be transmitted back to the CPU’s memory for the next RBE is $n_{g}$ .

By observing the data transmission for one iteration of the accelerated parallel grid-based RBE, the total number of data transmitted from the CPU’s memory to the GPU’s global memory is given by

\begin{array}{lcl} N_{CG} & = & n_{g} + I_{x} + I_{y} + 1 + n_{g} \\ & = & 2 n_{g} + I_{x} + I_{y} + 1, \end{array}

(22)

and all the data are transmitted continuously from the GPU’s global memory to the GPU’s local memory

\begin{array}{lcr} N_{GL} = N_{CG} = 2 n_{g} + I_{x} + I_{y} + 1 . \end{array}

(23)

The total number of data transmitted from the GPU’s local memory to the GPU’s global memory is

N_{LG} = n_{g} + n_{g} = 2 n_{g},

(24)

and that from the GPU’s global memory to the CPU’s memory similarly becomes

N_{GC} = N_{LG} = 2 n_{g} .

(25)

The data transmission time cost Δ t _trans for one iteration of the accelerated parallel grid-based RBE is given by

\begin{array}{lcr} Δ t_{trans} = P (\frac{N_{CG}}{B_{CG}} + \frac{N_{GC}}{B_{GC}} + \frac{N_{GG}}{B_{GG}}), \end{array}

(26)

where $N_{CG}$ and $B_{CG}$ are the total number of data transmitted and the copy bandwidth in bytes per second from the CPU’s memory to the GPU’s global memory, respectively, $N_{GC}$ and $B_{GC}$ are those from the GPU’s global memory to the CPU’s memory, and $N_{GG}$ and $B_{GG}$ are those between the GPU’s global memory and the GPU’s local memory. Because the copy bandwidth from the GPU’s global memory to the GPU’s local memory is the same as that in the opposite direction, the number of data transmitted inside the GPU is given by

N_{GG} = N_{GL} + N_{LG} = 4 n_{g} + I_{x} + I_{y} + 1 .

(27)

Substituting Equations (22), (25), and (27) into Equation (26), the data transmission time cost for one iteration of the accelerated parallel grid-based RBE is given by

Δ t_{trans} = P (\frac{2 n_{g} + I_{x} + I_{y} + 1}{B_{CG}} + \frac{2 n_{g}}{B_{GC}} + \frac{4 n_{g} + I_{x} + I_{y} + 1}{B_{GG}}) .

(28)

It is to be noted here that these parameters of copy bandwidths are inherent for a specific computer hardware configuration and can be determined experimentally.
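Equation (28) can be evaluated directly once the bandwidths are known. The sketch below is a minimal illustration in plain Python. The function name `transmission_time` is hypothetical, and the bandwidth and precision values in the example are placeholders chosen for illustration, not measurements from the setups in this paper.

```python
def transmission_time(n_g, I_x, I_y, P, B_CG, B_GC, B_GG):
    """Data transmission time cost of one RBE iteration, following Equation (28).

    P is the precision in bytes per value; B_CG, B_GC, and B_GG are copy
    bandwidths in bytes per second.
    """
    N_CG = 2 * n_g + I_x + I_y + 1   # CPU memory to GPU global memory, Equation (22)
    N_GC = 2 * n_g                   # GPU global memory to CPU memory, Equation (25)
    N_GG = 4 * n_g + I_x + I_y + 1   # inside the GPU, both directions, Equation (27)
    return P * (N_CG / B_CG + N_GC / B_GC + N_GG / B_GG)


# Placeholder values (assumptions): 4-byte single precision, a 1,000 x 1,000
# grid, kernel lengths of 50, and illustrative bandwidths.
dt_trans = transmission_time(n_g=10**6, I_x=50, I_y=50, P=4,
                             B_CG=4.0e9, B_GC=4.0e9, B_GG=1.0e11)
```

Because the three terms of Equation (28) all scale with the number of grid cells, the transmission cost depends only weakly on the kernel lengths when the grid is much larger than the kernel.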

Floating point operations

In order to determine the GPU computation time cost $Δ t_{G}$ and CPU computation time cost $Δ t_{C}$ for one iteration of the accelerated parallel grid-based RBE, the number of floating point operations performed on both CPU and GPU needs to be evaluated. The number of floating point operations performed on the GPU for the prediction process is $2 n_{g} (I_{x} + I_{y})$ , as Equation (19) indicates. The number of floating point operations performed on the GPU for the correction process is $2 n_{g}$ in total since $n_{g}$ parallel multiplications and $n_{g}$ parallel divisions are performed for steps 1 and 3 of the correction process, respectively. Meanwhile, the number of floating point operations performed on the CPU is $n_{g}$ , from the $n_{g}$ summations in step 2 of the correction process. As a consequence, the total number of floating point operations performed on the GPU and the CPU for one iteration of the accelerated parallel grid-based RBE is given, respectively, by

N_{G} = 2 n_{g} (I_{x} + I_{y}) + 2 n_{g} = 2 n_{g} (I_{x} + I_{y} + 1),

N_{C} = n_{g} .

The GPU computation time cost for one iteration of the accelerated parallel grid-based RBE is given by

Δ t_{G} = \frac{N_{G}}{V_{G}},

(29)

where $N_{G}$ is the number of floating point operations performed on the GPU, and $V_{G}$ is the computational rate of the GPU in FLOPS. Substituting the expression for $N_{G}$ above into Equation (29), the GPU computation time cost is given by

Δ t_{G} = 2 n_{g} \frac{I_{x} + I_{y} + 1}{V_{G}} .

(30)

Similarly, the CPU computation time cost for one iteration of the accelerated parallel grid-based RBE is given by

Δ t_{C} = \frac{N_{C}}{V_{C}},

(31)

where $N_{C}$ represents the number of floating point operations performed on the CPU, and $V_{C}$ is the computational rate of the CPU in FLOPS. In the same way, by substituting the expression for $N_{C}$ above into Equation (31), the CPU computation time cost is given by

Δ t_{C} = \frac{n_{g}}{V_{C}} .

(32)

It is to be noted here that the computational rates, $V_{G}$ and $V_{C}$ , are also inherent for a specific CPU and GPU configuration and can be determined experimentally.
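The pieces of the model combine as in Equation (20). The following minimal sketch in plain Python evaluates Equations (30) and (32) and adds a given transmission time cost. The function names are hypothetical, and the numerical values in the example are small placeholders chosen to make the arithmetic easy to follow, not measured hardware rates.

```python
def gpu_time(n_g, I_x, I_y, V_G):
    """GPU computation time cost, Equation (30); V_G is in FLOPS."""
    return 2 * n_g * (I_x + I_y + 1) / V_G


def cpu_time(n_g, V_C):
    """CPU computation time cost, Equation (32); V_C is in FLOPS."""
    return n_g / V_C


def iteration_time(dt_trans, n_g, I_x, I_y, V_G, V_C):
    """Total time cost of one RBE iteration, Equation (20)."""
    return dt_trans + gpu_time(n_g, I_x, I_y, V_G) + cpu_time(n_g, V_C)


# Placeholder values (assumptions) chosen for easy arithmetic.
dt_c = iteration_time(dt_trans=0.5, n_g=10, I_x=2, I_y=3, V_G=120.0, V_C=5.0)
```

With these placeholder values the GPU term is 1.0 s, the CPU term is 2.0 s, and the total is 3.5 s. In practice the estimated total would be compared with the time increment of the discrete target motion model when choosing the grid and kernel sizes.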

Experimental validation

Table 1 shows the specifications of the computer setups that were available for the validation and other investigations. Setup 1 is the fastest in both CPU and GPU, whereas setup 3 is the slowest. This section first shows the improvement of the parallel grid-based RBE using GPU obtained by adopting the separable convolution in the prediction process, using the specification listed as setup 1. The proposed computational modeling for the parallel grid-based RBE is then validated on setups 1 to 3. Finally, a simulated target searching task is introduced to further evaluate the efficacy of the proposed modeling.

Table 1 Test computer system specifications

Improvement in prediction process

The efficiency of the prediction process accelerated by separable convolution was evaluated on computer setup 1 with a problem having a fixed grid space size of 1,000×1,000 and a convolution kernel size varied from 1 to 50. The resulting GPU time cost is shown in Figure 3 together with the corresponding result for the original prediction. Even when the convolution kernel size is 50, the accelerated prediction is seen to require a time cost of only 1 ms. Its superiority can also be understood by comparing it to the original prediction, whose time cost is 25 times that of the accelerated prediction process when the convolution kernel size is 50.

Validation

This set of tests was aimed at validating the proposed modeling of computer performance by estimating the total iteration time cost Δt of the parallel grid-based RBE using GPU and comparing it with the actual iteration time cost measured experimentally on three different computer setups. Each component, Δt_trans, Δt_G, and Δt_C, is also compared with its actual measured value. All measured time costs are averages over 10,000 iterations. The convolution kernel size I_x+I_y and the grid space size n_g are the two major factors in the proposed modeling. Two tests were thus conducted, one varying the convolution kernel size and the other varying the grid space size.

Test 1

Test 1 was performed by fixing the grid space size of the parallel grid-based RBE to 1,000×1,000 and varying the convolution kernel size I_x=I_y=i from 1 to 200. Convolution kernel sizes over 200 were not explored since it is unlikely that the target motion model requires such a large convolution kernel. A square convolution kernel was used because varying the sizes in the x and y directions separately is of little significance, and a single size parameter additionally allows the results to be visualized in two-dimensional space.

The results of all the components of the time cost for the three computer setups are shown in Figures 4, 5, and 6. Each solid line represents an estimated total or component time cost, whereas the line with solid dots in the same color represents the corresponding actual performance. These figures primarily show that the total and component time costs estimated by the proposed modeling match the actual performance well. The values listed in Table 2 support this and indicate the effectiveness of the proposed modeling, since the average and maximum relative errors are below 7% and 12%, respectively. While the time cost of data transmission is seen to contribute most, the GPU time cost also increases the total time cost as the convolution kernel size grows, particularly when the GPU has low performance. It is thus important to use a high-performance GPU if fast RBE with a large convolution kernel size is necessary.

Table 2 Quantitative results for test 1
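The average and maximum relative errors quoted above can be computed as in the following minimal sketch. It assumes the relative error is defined as |estimated − measured| / measured, which this section does not state explicitly, and the time costs used here are hypothetical values, not the paper's data.

```python
def relative_errors(estimated, measured):
    """Relative error of each estimate with respect to the measured value."""
    return [abs(e - m) / m for e, m in zip(estimated, measured)]


def summarize(estimated, measured):
    """Return (average, maximum) relative error over all test points."""
    errs = relative_errors(estimated, measured)
    return sum(errs) / len(errs), max(errs)


# Hypothetical total time costs in ms at three kernel sizes.
estimated_ms = [10.0, 21.0, 33.0]
measured_ms = [10.5, 20.0, 30.0]

avg_err, max_err = summarize(estimated_ms, measured_ms)
print(f"average: {100 * avg_err:.1f}%, maximum: {100 * max_err:.1f}%")
```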


Test 2

Test 2 was performed by fixing the convolution kernel size of the parallel grid-based RBE to 16×16 or 32×32 and varying the grid space size n_x=n_y=n from 100 to 1,000. These convolution kernel sizes often represent the target motion model with sufficient accuracy, and the grid space size n=1,000, which creates 1,000,000 grid cells, also provides good accuracy in many practical problems. As in test 1, the square grid enables two-dimensional visualization of the results.

The results for all the components of the time cost for the three computer setups are shown in Figures 7, 8, and 9, respectively. First, these figures show that the proposed modeling also estimates the actual performance of the parallel grid-based RBE well, regardless of the grid space size. As in test 1, Table 3 shows small average and maximum relative errors, which are below 6% and 11%, respectively. Second, the results show that the total time cost is dominated by the time cost of data transmission, particularly when the ratio of the grid space size to the convolution kernel size is large. Since the data transmission rate is determined by the quality of the memory, the use of high-quality memory is the first priority for fast RBE.

Table 3 Quantitative results for test 2


Simulated target searching task

The performance of the prediction process dominates the accuracy of the RBE when no valid observations are obtained. The aim of this test is to evaluate how well the proposed modeling helps the prediction process maintain accuracy during periods without observations. A simplified target searching task is described in this subsection. The motion model of the simulated target is given by

\begin{array}{lcr} x_{k + 1}^{t} & = & x_{k}^{t} + Δt \cdot v_{k}^{t} \cos γ_{k}^{t} \\ y_{k + 1}^{t} & = & y_{k}^{t} + Δt \cdot v_{k}^{t} \sin γ_{k}^{t}, \end{array}

(33)

where v^t and γ^t are the velocity and direction of the target motion, respectively, each subject to Gaussian noise, and Δt is the time increment. The prior belief on the target is also Gaussian. The autonomous sensor platforms are assumed to move on a horizontal plane, and their motion model is given by

\begin{array}{lcr} x_{k + 1}^{s_{i}} & = & x_{k}^{s_{i}} + Δt \cdot v_{k}^{s_{i}} \cos γ_{k}^{s_{i}} \\ y_{k + 1}^{s_{i}} & = & y_{k}^{s_{i}} + Δt \cdot v_{k}^{s_{i}} \sin γ_{k}^{s_{i}} \\ θ_{k + 1}^{s_{i}} & = & θ_{k}^{s_{i}} + Δt \cdot α^{s_{i}} γ_{k}^{s_{i}}, \end{array}

(34)

where $v^{s_{i}}$ and $γ^{s_{i}}$ are the velocity and turn of the sensor platform s_i, respectively, and $α^{s_{i}}$ is a coefficient governing the rate of turn. The probability of detection $P_{d} (x_{k}^{t} | x_{k}^{s_{i}})$ is given by a Gaussian distribution, whereas the likelihood $l (x_{k}^{t} | {}^{s_{i}}{\tilde{z}}_{k}^{t}, {\tilde{x}}_{k}^{s_{i}})$ when the target is detected is given by a Gaussian distribution with variances proportional to the distance between the sensor platform s_i and the target. Table 4 shows the major parameters of this simulated target searching task. The convolution kernel constructed from the target motion model is represented by a 32×32 matrix, and the grid space size is set to 1,000×1,000. The computer used was setup 3 in Table 1. With the proposed approach, the time increment Δt was chosen as 0.032 s, which is the time cost of one iteration of the RBE estimated by the proposed modeling. For the case without the proposed approach, the time increment Δt was arbitrarily chosen as 0.02 s for comparison.

Table 4 Major parameters of the target searching task
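The motion models in Equations (33) and (34) can be stepped forward as in the sketch below, which also illustrates one plausible reading of why the time increment matters. If each RBE iteration actually takes 0.032 s but the motion model advances the target by only 0.02 s per iteration, the predicted position falls steadily behind the true one. The function names, the noise parameters, and the target speed of 1.0 (in arbitrary length units per second) are assumptions made for illustration; only the two time increments come from the text.

```python
import math
import random


def target_step(x, y, v, gamma, dt, sigma_v=0.0, sigma_gamma=0.0, rng=random):
    """One step of Equation (33) with Gaussian noise on speed and heading."""
    v_n = v + rng.gauss(0.0, sigma_v)
    g_n = gamma + rng.gauss(0.0, sigma_gamma)
    return x + dt * v_n * math.cos(g_n), y + dt * v_n * math.sin(g_n)


def sensor_step(x, y, theta, v, gamma, alpha, dt):
    """One step of Equation (34), transcribed term by term as written."""
    return (
        x + dt * v * math.cos(gamma),
        y + dt * v * math.sin(gamma),
        theta + dt * alpha * gamma,
    )


def propagate(n_steps, dt, v, gamma):
    """Noise-free propagation of the target from the origin."""
    x, y = 0.0, 0.0
    for _ in range(n_steps):
        x, y = target_step(x, y, v, gamma, dt)
    return x, y


n_iter = 100
iteration_time = 0.032   # estimated time cost of one RBE iteration (s)
mismatched_dt = 0.02     # time increment used without the proposed modeling (s)
speed, heading = 1.0, 0.0  # hypothetical target motion

true_x, _ = propagate(n_iter, iteration_time, speed, heading)
model_x, _ = propagate(n_iter, mismatched_dt, speed, heading)
lag = true_x - model_x
print(f"true x: {true_x:.2f}, predicted x: {model_x:.2f}, lag: {lag:.2f}")
```

In this noise-free example the prediction lags by 0.012 length units per iteration, which accumulates for as long as no observation corrects the belief.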


Figure 10 shows the initial and final states of four sensor platforms without and with the proposed prediction reformulation, respectively. Without the proposed prediction improvement, all the sensor platforms lost the target, whereas all of them successfully found the target when the proposed prediction reformulation was used. The reason is that the proposed prediction process allowed the grid-based RBE to update the belief much faster than the original one, resulting in reliable tracking of the moving target. The proposed modeling was also evaluated for this simulated search and rescue task, and the corresponding quantitative results are summarized in Table 5. The results again show small average and maximum relative errors, which are below 7% and 10%, respectively, and indicate that the proposed modeling is able to estimate the actual time cost of the grid-based RBE using GPU.

Table 5 Quantitative results for simulated search and rescue task


Conclusion and future work

The performance modeling for the real-time grid-based RBE, especially parallel computation using GPU, has been proposed to identify the best resolution of the RBE with given computer hardware. The modeling allows the estimation of time costs necessary within CPU and GPU and that of data transmission between CPU and GPU for the real-time grid-based RBE. In order to speed up the RBE, the prediction has been additionally reformulated with the separable convolution.

The proposed modeling was experimentally investigated by varying its major parameters. The result of the first test with varying convolution kernel size shows that the average error of the estimation by the proposed modeling stays below 7% regardless of the convolution kernel size and that a high-performance GPU is necessary if the convolution kernel size is large. In the second test with varying grid space size, it is found that the proposed modeling estimates within the average error of 6%, irrespective of the grid space size, and that a high-quality memory is necessary if fast RBE is required for large grid space. Utilizing prediction with separable convolution, the RBE has also been found to perform within 1 ms, although the size of the problem was relatively large.

The current study is a first step toward achieving high-fidelity RBE in a real-time environment. Future work is planned to utilize the best resolution of the RBE identified by the proposed modeling and to investigate its efficacy.

References

Tarantola A: Inverse Problem Theory and Methods for Model Parameter Estimation. Philadelphia: Society for Industrial and Applied Mathematics; 2005.
Harlim J, Hunt BR: A non-Gaussian ensemble filter for assimilating infrequent noisy observations. Tellus A 2007, 59: 225–237. 10.1111/j.1600-0870.2007.00225.x
Apte A, Hairer M, Stuart AM, Voss J: Sampling the posterior: an approach to non-Gaussian data assimilation. Physica D 2007, 230: 50–64. 10.1016/j.physd.2006.06.009
Doshi P, Gmytrasiewicz PJ: Monte Carlo sampling methods for approximating interactive POMDPs. J. Artif. Intell. Res 2009, 34: 297–337.
Mandel J, Beezley JD: An ensemble Kalman-particle predictor-corrector filter for non-Gaussian data assimilation. Comput. Sci. ICCS 2009, 2009: 470–478.
Stenger B, Thayananthan A, Torr PHS, Cipolla R: Filtering using a tree-based estimator. IEEE Int. Conf. Comput. Vis 2003, 2: 1063–1070.
Huang D, Leung H: Maximum likelihood state estimation of semi-Markovian switching system in non-Gaussian measurement noise. IEEE Trans. Aerosp. Electron. Syst 2010, 46: 133–146.
Bergman N: Recursive Bayesian estimation navigation and tracking applications. PhD Dissertation, Linkopings University; 1999.
Furukawa T, Durrant-Whyte HF, Lavis B: The element-based method—theory and its application to Bayesian search and tracking. San Diego: Paper presented at the IEEE/RSJ international conference on intelligent robots and systems; 29 Oct–2 Nov 2007.
Lavis B, Furukawa T: HyPE: Hybrid particle-element approach for recursive Bayesian searching and tracking. Proceedings of Robotics: Science and Systems IV. Zurich: MIT Press; 2008.
Lavis B, Furukawa T, Durrant-Whyte HF: Dynamic space reconfiguration for Bayesian search and tracking with moving targets. Auto. Robots 2008, 24(4): 387–399. 10.1007/s10514-007-9081-4
Furukawa T, Lavis B, Durrant-Whyte HF: Parallel grid-based recursive Bayesian estimation using GPU for real-time autonomous navigation. Paper presented at the IEEE international conference on robotics and automation. Anchorage, AK, USA; 3–7 May 2010.


Author information

Authors and Affiliations

Department of Mechanical Engineering, Virginia Tech, 800 Drillfield Dr, Blacksburg, VA, 24061, USA
Xianqiao Tong & Tomonari Furukawa
Australian Centre for Field Robotics (ACFR), Rose St, Sydney, 2006, Australia
Hugh Durrant-Whyte

Corresponding author

Correspondence to Xianqiao Tong.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Tong, X., Furukawa, T. & Durrant-Whyte, H. Computational modeling for parallel grid-based recursive Bayesian estimation: parallel computation using graphics processing unit. J. Uncertain. Anal. Appl. 1, 15 (2013). https://doi.org/10.1186/2195-5468-1-15

Received: 29 August 2013
Accepted: 11 November 2013
Published: 16 December 2013
DOI: https://doi.org/10.1186/2195-5468-1-15
