Computational modeling for parallel grid-based recursive Bayesian estimation: parallel computation using graphics processing unit
Journal of Uncertainty Analysis and Applications volume 1, Article number: 15 (2013)
Abstract
This paper presents performance modeling of real-time grid-based recursive Bayesian estimation (RBE), in particular its parallel computation using a graphics processing unit (GPU). The proposed modeling formulates the data transmission between the central processing unit (CPU) and the GPU, as well as the floating point operations to be carried out on the CPU and the GPU, that are necessary for one iteration of the real-time grid-based RBE. Given the specifications of the computer hardware, the proposed modeling can thus estimate the total time cost of performing the grid-based RBE in a real-time environment. A new prediction formulation, which adopts separable convolution, is proposed to further accelerate the real-time grid-based RBE. The performance of the proposed modeling was investigated, and parametric studies demonstrated its validity in various conditions by showing that the average error of estimation in computational performance stays below 6% to 7%. Using the prediction with separable convolution, the grid-based RBE was also found to perform within 1 ms, although the size of the problem was relatively large.
Introduction
Recursive Bayesian estimation (RBE) estimates the belief of a dynamically moving target by updating the belief both in time and in observation [1]. The RBE has two fundamental processes: the prediction process and the correction process. The prediction process updates the belief through the motion model of the target, whereas the correction process updates the belief through the current observation. If the target is observable, the accuracy of the RBE can be maintained by the correction process using valid observations. When the target is not observable, the accuracy of the RBE relies heavily on the prediction process, and the error accumulates because no valid observation is available for the correction process. For accurate estimation, the RBE has to be performed fast enough to capture the motion of the target with a well-defined target motion model, which requires good synchronization between the discrete representation of the model and the RBE. As a result, recent years have seen many real-time enhanced RBE techniques that help improve the speed of the RBE.
One such technique is the modified ensemble Kalman filter (EnKF). The EnKF allows non-Gaussian estimation by minimizing a cost function defined by a non-Gaussian observation error with a preconditioned conjugate gradient method [2]. The Langevin Markov chain Monte Carlo (MCMC) method, which represents the non-Gaussian belief by sampling it using a Markov chain and the Langevin equation, can also serve as a non-Gaussian RBE technique [3]. Another sampling method is the interactive particle filter (IPF), which is able to flexibly mitigate the belief space complexity [4]. An ensemble Kalman-particle predictor-corrector filter is a hybrid method that combines the advantages of the EnKF and the IPF and is able to deal effectively with high-dimensional non-Gaussian problems [5]. A tree-based estimator approximates the posterior belief distribution at multiple resolutions to be effective for high-dimensional problems [6], whereas a maximum likelihood state estimation method can also achieve non-Gaussian RBE [7] by using a finite Gaussian mixture model.
The grid-based RBE technique is able to maintain good accuracy for the belief since the entire target space is spatially discretized [8]. The good accuracy is obtained by the fine discretization of the target space, which at the same time leads to inefficient computation. Furukawa et al. [9, 10] refined the grid-based RBE by developing a more general element-based RBE. The generalized element can accurately represent an arbitrary target space with only a small number of elements compared with the grid-based RBE, which reduces the computation of the RBE. Lavis et al. proposed an enhanced grid-based RBE that allows the update of not only the belief but also the target space [11]. Because of the dynamic adjustment of the target space, the computation of the RBE is additionally reduced. Further, the parallel grid-based RBE has been proposed; it significantly accelerated the computation of the RBE and made its real-time implementation possible by utilizing the GPU's strong parallel computational capability [12]. Although these efforts successfully reduce the computation of the RBE to achieve fast RBE, the accuracy of the RBE is not well kept when the prediction process dominates the RBE during the no-observation period. The time cost of one iteration of the RBE becomes critical for overcoming this issue, because the RBE can maintain its accuracy during the no-observation period only if this time cost matches the time increment of the discrete target motion model.
This paper presents a performance modeling for the parallel grid-based RBE, particularly the parallel computation using the GPU, and it is able to determine the time cost of one iteration of the RBE. The proposed modeling formulates the total amount of data transmission between the CPU and the GPU and the total number of floating point operations to be carried out in each CPU and GPU necessary for one iteration of the parallel grid-based RBE. Given the specifications of the computer hardware, it is thus possible to estimate the time cost for one iteration of the parallel grid-based RBE. In order to perform the parallel grid-based RBE at maximum speed, the proposed modeling also reformulates and implements the prediction process with separable convolution.
The paper is organized as follows. The following section reviews the recursive Bayesian estimation as well as the parallel grid-based RBE. The 'Computational performance modeling' section presents the proposed reformulation of the prediction process for the parallel grid-based RBE and its computational performance modeling. The 'Experimental validation' section demonstrates the validity and efficacy of the proposed modeling through numerical examples, and the conclusion and future work are summarized in the final section.
Parallel grid-based RBE
Problem statement
The motion of an object, o, is deterministically given by the following equation:
where x^{o} represents the state of the object, u^{o} represents the object control input, w^{o} represents the system noise, which includes environmental influences on the target, and t represents the time. In general, the state of the object describes its twodimensional location but may also include other variables such as velocity. Let the time interval between the consecutive time steps be defined as Δ t. By integrating Equation (1), the state of the object at the time step k is given by
where t _{ k−1} is the time which corresponds to the time step k−1.
Recursive Bayesian estimation
Prediction
The prediction process starts with the numerical implementation of the object motion model defined in Equation (2). For simplicity, the numerical integration is carried out by the left Riemann sum algorithm. By dividing the time interval Δ t between the consecutive time steps into n subintervals, the state of the object at the time step k is given by
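The displayed form of this sum is not reproduced in this text, so the following is only a minimal sketch of a left Riemann sum over n subintervals. The function name `integrate_left_riemann`, the callable `f` standing for the right-hand side of the motion model, and the list-based state are assumptions made for illustration. Because the state is refreshed at every subinterval, the scheme coincides with forward Euler stepping.

```python
def integrate_left_riemann(f, x_prev, u, w, t_prev, dt, n):
    """Advance the state from t_prev to t_prev + dt with n left-endpoint steps.

    f(x, u, w, t) is a hypothetical callable returning the state derivative
    as a list; it stands in for the motion model, whose exact form is not
    given in this text.
    """
    h = dt / n                      # width of one subinterval
    x = list(x_prev)
    for i in range(n):
        t = t_prev + i * h          # left end of the i-th subinterval
        dx = f(x, u, w, t)          # derivative evaluated at the left end
        x = [xi + h * dxi for xi, dxi in zip(x, dx)]
    return x


if __name__ == "__main__":
    # Constant-velocity example: velocity (1, 2) over 0.5 s in 5 subintervals.
    constant_velocity = lambda x, u, w, t: [1.0, 2.0]
    print(integrate_left_riemann(constant_velocity, [0.0, 0.0], None, None, 0.0, 0.5, 5))
```

A larger n reduces the integration error within one time step at the cost of more evaluations of the motion model.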
Let a sequence of the observations of the object from time step 1 to k−1 be defined as ${}^{s}\tilde{\mathbf{z}}_{1:k-1}\equiv\{{}^{s}\tilde{\mathbf{z}}_{i}\mid\forall i\in\{1,\dots,k-1\}\}$. Notice here that $\tilde{(\cdot)}$ represents an instance of variable (·). The prediction process computes the belief of the current state $p(\mathbf{x}_{k}^{o}\mid{}^{s}\tilde{\mathbf{z}}_{1:k-1})$ from the belief in the previous time step $p(\mathbf{x}_{k-1}^{o}\mid{}^{s}\tilde{\mathbf{z}}_{1:k-1})$. The prediction is carried out iteratively by the Chapman-Kolmogorov equation and is given by
where $p(\mathbf{x}_{k}^{o}\mid\mathbf{x}_{k-1}^{o})$ is the probabilistic representation of the object motion model defined in Equation (3), which gives the probability of transition from the previous state $\mathbf{x}_{k-1}^{o}$ to the current state $\mathbf{x}_{k}^{o}$. The prediction process at k=1 is carried out by letting $p(\mathbf{x}_{k-1}^{o}\mid{}^{s}\tilde{\mathbf{z}}_{1:k-1})=p(\tilde{\mathbf{x}}_{0}^{o})$, where $p(\tilde{\mathbf{x}}_{0}^{o})$ is defined as a prior belief of the object in terms of the probability density function. Equation (4) indicates that the performance of the prediction process relies on the object motion model $p(\mathbf{x}_{k}^{o}\mid\mathbf{x}_{k-1}^{o})$. Because the object motion model is usually non-Gaussian, the belief can eventually become heavily non-Gaussian when only the prediction process is applied in the RBE.
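For reference, the Chapman-Kolmogorov prediction has the following standard form when written in the notation above. The displayed equation is not reproduced in this text, so this is a reconstruction under that assumption rather than a quotation of Equation (4):

$$p(\mathbf{x}_{k}^{o}\mid{}^{s}\tilde{\mathbf{z}}_{1:k-1})=\int p(\mathbf{x}_{k}^{o}\mid\mathbf{x}_{k-1}^{o})\,p(\mathbf{x}_{k-1}^{o}\mid{}^{s}\tilde{\mathbf{z}}_{1:k-1})\,\mathrm{d}\mathbf{x}_{k-1}^{o}$$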
Correction
The correction process is associated with the definition of the observation model. Let the probability of detection (PoD) be $0\le P_{\mathrm{d}}(\mathbf{x}_{k}^{o})\le 1$, a measure of the reliability of detecting the object in terms of the object state. The observable region ${}^{s}\mathcal{X}_{k}^{o}$ is defined as
The observation ^{s} z _{ k } at the time step k is given by
where $\mathbf{v}_{k}^{s}$ represents the observation noise at the time step k, and $\varnothing$ represents an empty element, indicating that the observation contained no information on the object or that the target is unobservable when it is not within the observable region.
The correction process then computes the belief $p(\mathbf{x}_{k}^{o}\mid{}^{s}\tilde{\mathbf{z}}_{1:k})$ given the predicted belief $p(\mathbf{x}_{k}^{o}\mid{}^{s}\tilde{\mathbf{z}}_{1:k-1})$, which is conditioned on the observations up to the previous time step, and a new observation ${}^{s}\tilde{\mathbf{z}}_{k}$. The equation is derived by applying the formulas for marginal distribution and conditional independence and is given by
where $l(\mathbf{x}_{k}^{o}\mid{}^{s}\tilde{\mathbf{z}}_{k})$ represents the observation likelihood of $\mathbf{x}_{k}^{o}$. The observation likelihood is defined with reference to the PoD and is given by
where $p(\mathbf{x}_{k}^{o}\mid{}^{s}\tilde{\mathbf{z}}_{k})$ is the probabilistic representation of the observation model defined in Equation (6). When the object is within the observable region, a positive observation is obtained and the observation likelihood is a probability density function given the current observation of the object. When the object is outside the observable region, the observation is negative and the observation likelihood is defined with respect to the PoD. Because the observation likelihood of a negative observation is non-Gaussian, the object belief becomes heavily non-Gaussian as soon as a negative observation occurs in the RBE.
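For reference, the correction step has the following standard Bayesian form in the notation above, in which the denominator normalizes the corrected belief. The displayed equation is not reproduced in this text, so this is a reconstruction under that assumption rather than a quotation of Equation (7):

$$p(\mathbf{x}_{k}^{o}\mid{}^{s}\tilde{\mathbf{z}}_{1:k})=\frac{l(\mathbf{x}_{k}^{o}\mid{}^{s}\tilde{\mathbf{z}}_{k})\,p(\mathbf{x}_{k}^{o}\mid{}^{s}\tilde{\mathbf{z}}_{1:k-1})}{\int l(\mathbf{x}_{k}^{o}\mid{}^{s}\tilde{\mathbf{z}}_{k})\,p(\mathbf{x}_{k}^{o}\mid{}^{s}\tilde{\mathbf{z}}_{1:k-1})\,\mathrm{d}\mathbf{x}_{k}^{o}}$$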
Parallel grid-based RBE
Representation of target space and belief
The grid-based RBE achieves non-Gaussian belief estimation by first representing the arbitrary target space $\mathcal{X}^{t}$ in terms of a set of grid cells by constructing a rectangular space $\mathcal{X}^{r}$ that covers the target space. For simplicity, let us consider a two-dimensional target space, a point of which is represented as $\mathbf{m}^{t}=[x^{t},y^{t}]\in\mathcal{X}^{t}$. The creation of a rectangular space $\mathcal{X}^{r}$ is then achieved by defining the minimum and maximum values of the target space
and subsequently creating a rectangular space as $\mathcal{X}^{r}=\{\mathbf{m}\mid\forall x\in[x_{\min}^{t},x_{\max}^{t}],\forall y\in[y_{\min}^{t},y_{\max}^{t}]\}\supseteq\mathcal{X}^{t}$, where m= [ x,y]. The grid space is further introduced by discretizing the rectangular space by n _{ x } and n _{ y } grid cells in two directions, respectively. The dimensions of a grid cell are defined as $\Delta x^{r}=(x_{\max}^{t}-x_{\min}^{t})/n_{x}$ and $\Delta y^{r}=(y_{\max}^{t}-y_{\min}^{t})/n_{y}$. This results in introducing the center of each grid cell as
where ∀i _{ x }∈{1,…,n _{ x }} and ∀i _{ y }∈{1,…,n _{ y }}. Each grid cell is defined as
Note that $\bigcup_{i_{x}=1}^{n_{x}}\bigcup_{i_{y}=1}^{n_{y}}\mathcal{X}_{i_{x},i_{y}}^{r}=\mathcal{X}^{r}$ and $\bigcap_{i_{x}=1}^{n_{x}}\bigcap_{i_{y}=1}^{n_{y}}\mathcal{X}_{i_{x},i_{y}}^{r}=\varnothing$. Finally, the selection of grid cells that represent the target space is performed by selecting a grid cell when its center is located in the target space, $\mathcal{X}_{i_{x},i_{y}}^{r}\subset\mathcal{X}^{t}$ if $\bar{\mathbf{x}}_{i_{x},i_{y}}^{r}\in\mathcal{X}^{t}$. The approximate target space derived by the processes described above is $\mathcal{X}^{t}\approx\{\mathcal{X}_{1}^{r},\mathcal{X}_{2}^{r},\dots,\mathcal{X}_{n_{g}}^{r}\}$, where n _{ g } is the number of grid cells approximating the target space.
The belief is usually represented by a probability density function over the target space. Similar to the discretization of the target space, the belief can also be represented discretely by grid cells. The position of each grid cell can be described in the two-dimensional integer space as [ i _{ x },i _{ y }], where i _{ x }∈{1,…,n _{ x }} and i _{ y }∈{1,…,n _{ y }}. With the integer representation, the belief at the grid cell [ i _{ x },i _{ y }] can be represented as $p^{i_{x},i_{y}}(\cdot)$.
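The grid construction described above can be sketched as follows. The function names `build_grid` and `select_cells` and the predicate `in_target` are hypothetical, the indices are 0-based rather than the 1-based indices of the text, and the cell-center expression (lower bound plus half a cell plus a whole number of cells) is an assumption, since the displayed equation is not reproduced here.

```python
def build_grid(x_min, x_max, y_min, y_max, n_x, n_y):
    """Discretize the bounding rectangle into n_x by n_y cells.

    Returns the cell dimensions and a nested list of cell centers indexed
    as centers[i_x][i_y] (0-based).
    """
    dx = (x_max - x_min) / n_x
    dy = (y_max - y_min) / n_y
    centers = [
        [(x_min + (ix + 0.5) * dx, y_min + (iy + 0.5) * dy) for iy in range(n_y)]
        for ix in range(n_x)
    ]
    return dx, dy, centers


def select_cells(centers, in_target):
    """Keep the cells whose center lies inside the target space.

    in_target is a hypothetical predicate taking a center (x, y) and
    returning True when that point belongs to the target space.
    """
    return [
        (ix, iy)
        for ix, row in enumerate(centers)
        for iy, center in enumerate(row)
        if in_target(center)
    ]


if __name__ == "__main__":
    dx, dy, centers = build_grid(0.0, 4.0, 0.0, 2.0, 4, 2)
    # Illustrative target space: the half of the rectangle with x < 2.
    cells = select_cells(centers, lambda c: c[0] < 2.0)
    print(dx, dy, len(cells))  # len(cells) plays the role of n_g
```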
Prediction
The prediction process requires the numerical evaluation of Equation (4). Given the belief of the previous state $p^{i_{x},i_{y}}(\mathbf{x}_{k-1}^{t}\mid{}^{s}\tilde{\mathbf{z}}_{1:k-1})$ at the grid cell [ i _{ x },i _{ y }] and the target motion model $p^{I_{x},I_{y}}(\mathbf{x}_{k}^{t}\mid\mathbf{x}_{k-1}^{t})$ constructed as a matrix of size I _{ x }×I _{ y } that serves as a convolution kernel, the predicted belief of the current state can be numerically computed as
where ⊗ indicates the twodimensional convolution of the belief of the previous state with the probabilistic target motion model. Therefore, the belief of the current state is given by
The parallelization of the prediction process is straightforward. Since the prediction at each grid cell, given by Equation (12), can be performed independently, the parallelization of the prediction corresponds to the parallelization of the equation and achieves a parallel efficiency of 100% in an ideal environment. However, this equation also shows that the computation for the prediction process is largely dominated by the size of the convolution kernel. For real-time performance, it is important to use a convolution kernel of an appropriate size, one that is big enough to capture the motion of the target yet small enough to allow fast computation.
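The grid-wise prediction can be sketched as a direct two-dimensional convolution. This is an illustration under stated assumptions rather than the implementation used in the paper: the name `predict_2d` is hypothetical, the kernel entries are taken to be transition probabilities that sum to 1 with the kernel center meaning zero displacement, cells outside the grid contribute nothing (zero padding), and plain Python loops stand in for the one-thread-per-cell execution on a GPU.

```python
def predict_2d(belief, kernel):
    """Predict the belief on the grid by direct 2-D convolution.

    belief[ix][iy] is the previous belief; kernel[a][b] is the assumed
    probability of moving by (a - c_x, b - c_y) cells in one time step.
    """
    n_x, n_y = len(belief), len(belief[0])
    k_x, k_y = len(kernel), len(kernel[0])
    c_x, c_y = k_x // 2, k_y // 2          # kernel center = zero displacement
    out = [[0.0] * n_y for _ in range(n_x)]
    for ix in range(n_x):                   # each (ix, iy) is independent,
        for iy in range(n_y):               # so it could map to one GPU thread
            acc = 0.0
            for a in range(k_x):
                for b in range(k_y):
                    jx = ix - (a - c_x)     # source cell of this contribution
                    jy = iy - (b - c_y)
                    if 0 <= jx < n_x and 0 <= jy < n_y:
                        acc += kernel[a][b] * belief[jx][jy]
            out[ix][iy] = acc
    return out


if __name__ == "__main__":
    belief = [[0.0] * 5 for _ in range(5)]
    belief[2][2] = 1.0
    # Kernel that moves all probability one cell in the +x direction.
    kernel = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    print(predict_2d(belief, kernel)[3][2])
```

Each output cell costs one multiplication and one addition per kernel entry, which is why the kernel size dominates the cost of this step.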
Correction
The correction process corresponds to the numerical computation of Equation (7). Given the predicted belief $p(\mathbf{x}_{k}^{t}\mid{}^{s}\tilde{\mathbf{z}}_{1:k-1})$ and the new observation likelihood $l^{i_{x},i_{y}}(\mathbf{x}_{k}^{t}\mid{}^{s}\tilde{\mathbf{z}}_{k})$ at the grid cell [ i _{ x },i _{ y }], the corrected belief is computed by
where A _{c} is the area of a grid cell, and
The parallelization of the correction process requires breaking the process down to identify which subprocesses are parallelizable. By observing the mathematical operations, the correction process can be broken down into three steps:
1. Calculate $q^{i_{x},i_{y}}(\mathbf{x}_{k}^{t}\mid{}^{s}\tilde{\mathbf{z}}_{1:k})$ by multiplying the predicted belief $p^{i_{x},i_{y}}(\mathbf{x}_{k}^{t}\mid{}^{s}\tilde{\mathbf{z}}_{1:k-1})$ with the observation likelihood $l^{i_{x},i_{y}}(\mathbf{x}_{k}^{t}\mid{}^{s}\tilde{\mathbf{z}}_{k})$;
2. Sum $\sum_{\alpha=1}^{n_{g}}q^{\alpha}(\mathbf{x}_{k}^{t}\mid{}^{s}\tilde{\mathbf{z}}_{1:k})$ and multiply the sum by A _{c};
3. Calculate $p^{i_{x},i_{y}}(\mathbf{x}_{k}^{t}\mid{}^{s}\tilde{\mathbf{z}}_{1:k})$ by dividing $q^{i_{x},i_{y}}(\mathbf{x}_{k}^{t}\mid{}^{s}\tilde{\mathbf{z}}_{1:k})$ by $A_{\mathrm{c}}\sum_{\alpha=1}^{n_{g}}q^{\alpha}(\mathbf{x}_{k}^{t}\mid{}^{s}\tilde{\mathbf{z}}_{1:k})$.
The breakdown indicates that steps 1 and 3 are grid-wise subprocesses, which can be conducted independently. Therefore, for the correction process, steps 1 and 3 can be computed in parallel, whereas step 2, which sums over all grid cells, is not parallelizable in this grid-wise sense.
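The three steps above can be sketched as follows. The function name `correct` and the nested-list representation are assumptions for illustration; the guard against a zero normalizer is an addition that the text does not discuss.

```python
def correct(predicted, likelihood, cell_area):
    """Correct the predicted belief with an observation likelihood.

    predicted and likelihood are nested lists over the same grid;
    cell_area is the area A_c of one grid cell.
    """
    # Step 1 (grid-wise, parallelizable): q = predicted belief * likelihood.
    q = [
        [p * l for p, l in zip(p_row, l_row)]
        for p_row, l_row in zip(predicted, likelihood)
    ]
    # Step 2 (a single sum over all cells): normalizer A_c * sum(q).
    norm = cell_area * sum(sum(row) for row in q)
    if norm <= 0.0:
        raise ValueError("likelihood leaves no probability mass on the grid")
    # Step 3 (grid-wise, parallelizable): divide every cell by the normalizer.
    return [[value / norm for value in row] for row in q]


if __name__ == "__main__":
    predicted = [[0.25, 0.25], [0.25, 0.25]]     # uniform density, A_c = 1
    likelihood = [[1.0, 0.0], [0.0, 1.0]]
    print(correct(predicted, likelihood, 1.0))
```

After step 3, the corrected belief multiplied by the cell area sums to 1 over the grid, which is what the normalizer in step 2 is for.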
Target state evaluation
In the parallel grid-based RBE, the state of the target is evaluated by Equation (2) in the integral form at each time step. For an accurate evaluation of the target state, an appropriate choice of the time interval Δ t is necessary. Given a specific computer hardware configuration, each iteration of the parallel grid-based RBE requires a certain amount of time Δ t _{c} to perform the computation, including both the prediction and correction processes. To achieve an accurate evaluation of the target state, the time interval Δ t needs to be chosen such that it matches Δ t _{c}. As shown in Figure 1, the evaluated target states match the real target states only when Δ t is identical to Δ t _{c}. When Δ t is smaller or larger than Δ t _{c}, the evaluation of the target states fails and eventually leads to large accumulated errors. Δ t _{c} is determined not only by the parallel grid-based RBE itself but also by its computational performance on the specific computer hardware configuration.
Computational performance modeling
Acceleration of prediction process
Since an RBE designed to run at high frequency leads to a Markovian target motion model that is well approximated by a Gaussian probability density, the proposed modeling first reformulates the prediction process under the Gaussian assumption as a preprocess and accelerates the parallel grid-based RBE to achieve the maximum performance. With the Gaussian assumption, the convolution kernel in the matrix of size I _{ x }×I _{ y } can be separated into two vector kernels, a technique known as separable convolution: a column kernel of length I _{ x } and a row kernel of length I _{ y }. Therefore, the target motion model matrix is separated as
where ${}^{c}p^{I_{x}}(\mathbf{x}_{k}^{t}\mid\mathbf{x}_{k-1}^{t})$ and ${}^{r}p^{I_{y}}(\mathbf{x}_{k}^{t}\mid\mathbf{x}_{k-1}^{t})$ are the column kernel and row kernel, respectively, with a combined vector length of I _{ x }+I _{ y }. Substituting Equation (15) into Equation (11), the predicted belief of the current state can be computed as
which means that the prediction process can be broken down into two steps:
and
These equations show that the prediction process at each grid cell is carried out by performing two one-dimensional convolutions, one in the horizontal and one in the vertical direction, instead of the original single two-dimensional convolution, while retaining complete parallelizability. For Equation (17), the number of floating point operations for each grid cell is 2I _{ x }, since one multiplication and one addition are needed I _{ x } times, whereas the number of floating point operations for Equation (18) is 2I _{ y } by the same reasoning. With a total of n _{ g } grid cells, the total number of floating point operations for the prediction process is thus given by
This is considerably smaller than that of the original formulation, which is derived as 2n _{ g } I _{ x } I _{ y } via Equation (12), since I _{ x }+I _{ y }≪I _{ x } I _{ y } for an appropriate prediction process.
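The two-pass prediction and the two operation counts can be sketched as follows. The function names are hypothetical, zero padding at the grid boundary is an assumption, and the plain Python loops are for illustration only. The counts follow the expressions in the text, 2n_g(I_x+I_y) for the separable form and 2n_g I_x I_y for the original form.

```python
def conv_axis(belief, kernel, axis):
    """Zero-padded 1-D convolution along axis 0 (x) or axis 1 (y)."""
    n_x, n_y = len(belief), len(belief[0])
    c = len(kernel) // 2                    # kernel center = zero displacement
    out = [[0.0] * n_y for _ in range(n_x)]
    for ix in range(n_x):
        for iy in range(n_y):
            acc = 0.0
            for a, k in enumerate(kernel):
                jx = ix - (a - c) if axis == 0 else ix
                jy = iy - (a - c) if axis == 1 else iy
                if 0 <= jx < n_x and 0 <= jy < n_y:
                    acc += k * belief[jx][jy]
            out[ix][iy] = acc
    return out


def predict_separable(belief, col_kernel, row_kernel):
    """Column pass along x followed by row pass along y."""
    return conv_axis(conv_axis(belief, col_kernel, 0), row_kernel, 1)


def flops_separable(n_g, i_x, i_y):
    """Operation count of the separable prediction: 2 n_g (I_x + I_y)."""
    return 2 * n_g * (i_x + i_y)


def flops_full(n_g, i_x, i_y):
    """Operation count of the original 2-D prediction: 2 n_g I_x I_y."""
    return 2 * n_g * i_x * i_y


if __name__ == "__main__":
    n_g = 1000 * 1000
    ratio = flops_full(n_g, 50, 50) / flops_separable(n_g, 50, 50)
    print(ratio)  # I_x I_y / (I_x + I_y) for a 50 x 50 kernel
```

For a square kernel of size 50, the ratio of the two counts is 2,500/100 = 25, so the operation count alone predicts a 25-fold saving for that kernel size.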
Parallel computation using GPU
Following Equations (16) and (13) for the prediction and correction processes, respectively, Figure 2 shows the schematic diagram of the proposed accelerated parallel grid-based RBE using the GPU. For efficiency, the GPU stores the entire data for the RBE in the global memory and performs the RBE using local memories. As a result, the data transmission between the CPU's memory and the GPU's local memories is carried out via the GPU's global memory, and all the parallelizable floating point operations are executed using the local memories. For the prediction process, the data to be transmitted from the CPU's memory to the GPU's local memories are the previous belief $p(\mathbf{x}_{k-1}^{t}\mid{}^{s}\tilde{\mathbf{z}}_{1:k-1})$ and the target motion model $p(\mathbf{x}_{k}^{t}\mid\mathbf{x}_{k-1}^{t})$. Since the predicted belief is already in the local memories, the correction additionally needs only the observation likelihood to be transmitted. After the multiplication of $p(\mathbf{x}_{k}^{t}\mid{}^{s}\tilde{\mathbf{z}}_{1:k-1})$ and the observation likelihood $l(\mathbf{x}_{k}^{t}\mid{}^{s}\tilde{\mathbf{z}}_{k})$ is performed using the GPU's local memories, the result $q(\mathbf{x}_{k}^{t}\mid{}^{s}\tilde{\mathbf{z}}_{1:k})$ is transmitted to the CPU's memory to calculate the sum $A_{\mathrm{c}}\sum_{\alpha=1}^{n_{g}}q^{\alpha}(\mathbf{x}_{k}^{t}\mid{}^{s}\tilde{\mathbf{z}}_{1:k})$. The sum is then transmitted back to the GPU's local memories to perform divisions in parallel and update the belief $p(\mathbf{x}_{k}^{t}\mid{}^{s}\tilde{\mathbf{z}}_{1:k})$. Finally, the belief is transmitted back to the CPU's memory for the next iteration of the accelerated parallel grid-based RBE.
Modeling of computational performance
The computational performance of the accelerated parallel grid-based RBE using GPU is determined not only by the performance of the CPU but also by the performance of the GPU and that of data transmission. As a result, the time cost of one iteration of the accelerated parallel grid-based RBE is given by
where Δ t _{trans} represents the data transmission time cost between the CPU’s memory and the GPU’s global memory as well as that between the local and the global memory inside the GPU, Δ t _{G} represents the time cost of the parallel computation performed on the GPU, and Δ t _{C} represents the time cost of the computation performed on the CPU.
Data transmission
In order to determine the data transmission time cost Δ t _{trans} for one iteration of the accelerated parallel grid-based RBE, the data transmitted among the CPU's memory, the GPU's global memory, and the GPU's local memory need to be evaluated in both the prediction and correction processes. Let the amount of data transmitted in the unit of bytes be defined as
where P is the precision of the numerical representation, and N is defined as the number of data transmitted. Since the precision is usually constant, the amount of data transmitted can be derived in terms of the number of data transmitted. The numbers of data of the belief and the target motion model for the prediction process are n _{ g } and I _{ x }+I _{ y }, respectively. The same numbers of data, n _{ g } and I _{ x }+I _{ y }, are transmitted to the GPU's local memory to perform parallel calculation. In the correction process, the number of data of the likelihood to be transmitted from the CPU's memory to the GPU's local memory through the GPU's global memory is n _{ g }, whereas the number of data of the result $q(\mathbf{x}_{k}^{t}\mid{}^{s}\tilde{\mathbf{z}}_{1:k})$ to be transmitted from the GPU's local memory to the CPU's memory through the GPU's global memory is similarly n _{ g }. The number of data of the sum, $A_{\mathrm{c}}\sum_{\alpha=1}^{n_{g}}q^{\alpha}(\mathbf{x}_{k}^{t}\mid{}^{s}\tilde{\mathbf{z}}_{1:k})$, which is then transmitted to the GPU's local memory to perform parallel divisions, is 1, and finally, the number of data to be transmitted back to the CPU's memory for the next RBE is n _{ g }.
By observing the data transmission for one iteration of the accelerated parallel grid-based RBE, the total number of data transmitted from the CPU's memory to the GPU's global memory is given by
and all the data are transmitted continuously from the GPU’s global memory to the GPU’s local memory
The total number of data transmitted from the GPU’s local memory to the GPU’s global memory is
and that from the GPU’s global memory to the CPU’s memory similarly becomes
The data transmission time cost Δ t _{trans} for one iteration of the accelerated parallel grid-based RBE is given by
where N _{CG} and B _{CG} are the total number of data transmitted and the copy bandwidth in bytes per second from the CPU's memory to the GPU's global memory, respectively, N _{GC} and B _{GC} are those from the GPU's global memory to the CPU's memory, respectively, and N _{GG} and B _{GG} represent those between the GPU's global memory and the GPU's local memory. Because the copy bandwidth from the GPU's global memory to the GPU's local memory is the same as that in the opposite direction, the number of data transmitted inside the GPU is given by
Substituting Equations (22), (25), and (27) into Equation (26), the data transmission time cost for one iteration of the accelerated parallel grid-based RBE is given by
It is to be noted here that these parameters of copy bandwidths are inherent for a specific computer hardware configuration and can be determined experimentally.
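The transmission model can be sketched from the counts itemized above. The closed forms below are a reconstruction and therefore an assumption, because the displayed equations are not reproduced in this text: the CPU-to-GPU count adds the belief (n_g), the two kernels (I_x+I_y), the likelihood (n_g), and the sum (1); the GPU-to-CPU count adds the result q (n_g) and the corrected belief (n_g); and the count inside the GPU is taken as the total of both directions. The function names and the bandwidth values in the example are hypothetical.

```python
def transfer_counts(n_g, i_x, i_y):
    """Numbers of data items moved in one RBE iteration (reconstructed)."""
    n_cg = 2 * n_g + i_x + i_y + 1   # belief + kernels + likelihood + sum
    n_gc = 2 * n_g                   # result q + corrected belief
    n_gg = n_cg + n_gc               # global <-> local traffic, both directions
    return n_cg, n_gc, n_gg


def transfer_time(n_g, i_x, i_y, precision_bytes, b_cg, b_gc, b_gg):
    """Transmission time in seconds; bandwidths are in bytes per second."""
    n_cg, n_gc, n_gg = transfer_counts(n_g, i_x, i_y)
    return precision_bytes * (n_cg / b_cg + n_gc / b_gc + n_gg / b_gg)


if __name__ == "__main__":
    # Hypothetical hardware: single precision (4 bytes) and placeholder
    # bandwidths; measured values would replace these in practice.
    dt_trans = transfer_time(1000 * 1000, 32, 32, 4, 3.0e9, 3.0e9, 1.0e11)
    print(f"{dt_trans * 1e3:.3f} ms")
```

Because n_g is usually far larger than I_x+I_y, the transmission time in this sketch grows almost linearly with the number of grid cells and is nearly independent of the kernel size.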
Floating point operations
In order to determine the GPU computation time cost Δ t _{G} and the CPU computation time cost Δ t _{C} for one iteration of the accelerated parallel grid-based RBE, the number of floating point operations performed on both the CPU and the GPU needs to be evaluated. The number of floating point operations performed on the GPU for the prediction process is 2n _{ g }(I _{ x }+I _{ y }), as Equation (19) indicates. The number of floating point operations performed on the GPU for the correction process is 2n _{ g } in total, since n _{ g } parallel multiplications and n _{ g } parallel divisions are performed in steps 1 and 3 of the correction process, respectively. Meanwhile, the number of floating point operations performed on the CPU is n _{ g }, from the n _{ g } summations in step 2 of the correction process. As a consequence, the total numbers of floating point operations performed on the GPU and the CPU for one iteration of the accelerated parallel grid-based RBE are given, respectively, by
The GPU computation time cost for one iteration of the accelerated parallel grid-based RBE is given by
where N _{G} is the number of floating point operations performed on the GPU, and V _{G} is the computational rate of the GPU in FLOPS. Substituting Equation (29) into Equation (30), the GPU computation time cost is given by
Similarly, the CPU computation time cost for one iteration of the accelerated parallel grid-based RBE is given by
where N _{C} represents the number of floating point operations performed on the CPU, and V _{C} is the computational rate of the CPU in FLOPS. In the same way, by substituting Equation (29) into Equation (31), the CPU computation time cost is given by
It is to be noted here that the computational rates, V _{G} and V _{C}, are also inherent to a specific CPU and GPU configuration and can be determined experimentally.
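The operation counts and the resulting time costs can be combined into a small estimator. The counts follow the text: 2n_g(I_x+I_y) operations for the prediction plus 2n_g for the correction on the GPU, and n_g on the CPU. The function names are hypothetical, and the computational rates and transmission time in the example are placeholders rather than measured values.

```python
def gpu_flops(n_g, i_x, i_y):
    """GPU operations: separable prediction plus correction steps 1 and 3."""
    return 2 * n_g * (i_x + i_y) + 2 * n_g


def cpu_flops(n_g):
    """CPU operations: the n_g summations of correction step 2."""
    return n_g


def iteration_time(dt_trans, n_g, i_x, i_y, v_g, v_c):
    """Total time of one iteration: transmission + GPU + CPU time.

    v_g and v_c are the computational rates of the GPU and CPU in FLOPS.
    """
    dt_gpu = gpu_flops(n_g, i_x, i_y) / v_g
    dt_cpu = cpu_flops(n_g) / v_c
    return dt_trans + dt_gpu + dt_cpu


if __name__ == "__main__":
    # Placeholder rates and transmission time, for illustration only.
    dt_c = iteration_time(0.010, 1000 * 1000, 32, 32, 1.3e11, 1.0e9)
    print(f"{dt_c * 1e3:.3f} ms per iteration")
```

The value returned by such an estimator is what the time increment of the discrete target motion model would be matched to.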
Experimental validation
Table 1 shows the setup specifications which have been available for the validation and other investigations. Setup 1 is the fastest in both CPU and GPU, whereas setup 3 is the slowest. This section firstly shows the improvement of the parallel grid-based RBE using GPU by adopting the separable convolution in the prediction process with the specification listed in setup 1. Moreover, the proposed computational modeling for the parallel grid-based RBE is validated via setups 1 to 3. In the end, a simulated target searching task is introduced to further evaluate the efficacy of the proposed modeling.
Improvement in prediction process
The efficiency of the prediction process accelerated by separable convolution was evaluated on computer setup 1 with a problem having a fixed grid space size of 1,000×1,000 and a convolution kernel size varying from 1 to 50. The GPU time cost is shown in Figure 3 together with the corresponding result for the original prediction. Even when the convolution kernel size is 50, the accelerated prediction requires a time cost of only 1 ms. Its superiority can also be seen by comparison with the original prediction, which takes 25 times as long as the accelerated prediction process when the convolution kernel size is 50.
Validation
This set of tests was aimed at validating the proposed modeling of computer performance by estimating the total iteration time cost Δ t of the parallel grid-based RBE using GPU and comparing it with the actual iteration time cost measured experimentally on three different computer setups. Each component, Δ t _{trans}, Δ t _{G}, and Δ t _{C}, is also compared with the corresponding actual performance. All the time cost results are obtained by averaging the time cost over 10,000 iterations. The convolution kernel size I _{ x }+I _{ y } and the grid space size n _{ g } are the two major factors in the proposed modeling. Two tests were thus conducted, one varying the convolution kernel size and the other varying the grid space size.
Test 1
Test 1 was performed by fixing the grid space size of the parallel grid-based RBE to 1,000×1,000 and varying the convolution kernel size I _{ x }=I _{ y }=i from 1 to 200. A convolution kernel size over 200 was not explored since it is unlikely that the target motion model requires such a large convolution kernel. A square convolution kernel was used because varying the sizes in the x and y directions independently is of little significance, and this additionally allows visualization of the results in two-dimensional space.
The results of all the components of the time cost for the three computer setups are shown in Figures 4, 5, and 6. Each solid line represents the estimated total and component time costs, whereas each solid-dot line of the same color represents the corresponding actual performance. These figures primarily show that the total and component time costs estimated by the proposed modeling closely match the actual performance. The values listed in Table 2 also support this and indicate the effectiveness of the proposed modeling, since the average and the maximum relative errors are below 7% and 12%, respectively. While the time cost of data transmission is seen to contribute most, it is also seen that the GPU time cost raises the total time cost as the convolution kernel size increases, particularly when the GPU is of low performance. It is thus important to use a high-performance GPU if fast RBE with a large convolution kernel size is necessary.
Test 2
Test 2 was performed by fixing the convolution kernel size of the parallel grid-based RBE to 16×16 or 32×32 and varying the grid space size n _{ x }=n _{ y }=n from 100 to 1,000. These convolution kernel sizes often represent the target motion model with sufficient accuracy, and the grid space size n=1,000, which creates 1,000,000 grid cells, also provides good accuracy in many practical problems. As in test 1, the square grid size enables two-dimensional visualization of the results.
The results for all the components of the time cost for the three computer setups are shown in Figures 7, 8, and 9, respectively. These figures firstly show that the proposed modeling is also able to estimate the actual performance of the parallel grid-based RBE well, regardless of the grid space size. As in test 1, Table 3 shows small average and maximum relative errors, which are below 6% and 11%, respectively. Secondly, these results show that the total time cost is dominated by the time cost of data transmission, particularly when the ratio of the grid space size to the convolution kernel size is large. Since the data transmission rate is determined by the quality of the memory, the use of high-quality memory is the first priority for fast RBE.
Simulated target searching task
The performance of the prediction process dominates the accuracy of the RBE when no valid observations are obtained. The aim of this test is to evaluate how well the proposed modeling helps the prediction process maintain accuracy during periods without observations. A simplified target searching task is described in this subsection. The motion model of the simulated target is given by
where $v^{t}$ and $\gamma^{t}$ are the velocity and direction of the target motion, respectively, each subject to Gaussian noise, and $\Delta t$ is the time increment. The prior belief on the target is also Gaussian. The autonomous sensor platforms are assumed to move on a horizontal plane, and their motion is given by
where $v^{s_i}$ and $\gamma^{s_i}$ are the velocity and turn of the sensor platform $s_i$, respectively, and $\alpha^{s_i}$ is a coefficient governing the rate of turn. The probability of detection $P_{\mathrm{d}}(\mathbf{x}_k^{t} \mid \mathbf{x}_k^{s_i})$ is given by a Gaussian distribution, whereas the likelihood $l(\mathbf{x}_k^{t} \mid {}^{s_i}\tilde{\mathbf{z}}_k^{t}, \tilde{\mathbf{x}}_k^{s_i})$ when the target is detected is given by a Gaussian distribution with variances proportional to the distance between the sensor platform $s_i$ and the target. Table 4 shows the major parameters of this simulated target searching task. The convolution kernel constructed from the target motion model is represented by a 32×32 matrix, and the grid space size is set to 1,000×1,000. The computer specifications follow setup 3 in Table 1. With the proposed approach, the time increment $\Delta t$ was chosen as 0.032 s, which is the time cost of one iteration of the RBE estimated by the proposed modeling. For the case without the proposed approach, the time increment $\Delta t$ was chosen arbitrarily as 0.02 s for comparison.
Figure 10 shows the initial and final states of four sensor platforms without and with the proposed prediction reformulation. Without it, all the sensor platforms lost the target, whereas with it all of them successfully found the target. The reason is that the proposed prediction process allowed the grid-based RBE to update the belief much faster than the original one, resulting in reliable tracking of the moving target. The proposed modeling was also evaluated for this simulated search and rescue task, and the corresponding quantitative results are summarized in Table 5. The results again show small average and maximum relative errors, below 7% and 10%, respectively, indicating that the proposed modeling is able to estimate the actual time cost of the grid-based RBE using a GPU.
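The speed-up from a prediction based on separable convolution can be illustrated with a small sketch. It assumes that the motion kernel factors into the outer product of a column kernel and a row kernel (for example, an isotropic Gaussian), which need not hold for every target motion model, and it runs on the CPU with NumPy rather than on a GPU. The function names are hypothetical and the code is not the authors' implementation.

```python
import numpy as np

def conv2d_full(grid, kernel):
    """Direct 2D 'full' convolution: about m_r * m_c multiply-adds per grid cell."""
    gr, gc = grid.shape
    kr, kc = kernel.shape
    out = np.zeros((gr + kr - 1, gc + kc - 1))
    for i in range(kr):
        for j in range(kc):
            # Shift the grid by (i, j) and accumulate it weighted by kernel[i, j].
            out[i:i + gr, j:j + gc] += kernel[i, j] * grid
    return out

def conv2d_separable(grid, col_kernel, row_kernel):
    """Two 1D passes: about m_r + m_c multiply-adds per grid cell."""
    # Convolve every column with col_kernel, then every row with row_kernel.
    tmp = np.apply_along_axis(np.convolve, 0, grid, col_kernel)
    return np.apply_along_axis(np.convolve, 1, tmp, row_kernel)

# Hypothetical belief grid and a 32-tap Gaussian kernel, for illustration only.
rng = np.random.default_rng(0)
belief = rng.random((50, 50))
belief /= belief.sum()
x = np.arange(32) - 15.5
g = np.exp(-0.5 * (x / 4.0) ** 2)
g /= g.sum()

full = conv2d_full(belief, np.outer(g, g))
sep = conv2d_separable(belief, g, g)
print(np.allclose(full, sep))
```

For a 32×32 kernel, the direct form needs 1,024 multiply-adds per cell, whereas the two 1D passes need 64, which is where the reduction in GPU time cost comes from under this assumption.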
Conclusion and future work
Performance modeling for the real-time grid-based RBE, especially for parallel computation using a GPU, has been proposed to identify the best resolution of the RBE for given computer hardware. The modeling allows the estimation of the time costs incurred within the CPU and the GPU and of the data transmission between them for the real-time grid-based RBE. In order to speed up the RBE, the prediction has additionally been reformulated with separable convolution.
The proposed modeling was experimentally investigated by varying its major parameters. The first test, with varying convolution kernel size, shows that the average error of the estimation by the proposed modeling stays below 7% regardless of the convolution kernel size and that a high-performance GPU is necessary if the convolution kernel is large. In the second test, with varying grid space size, the proposed modeling estimates the time cost with an average error below 6%, irrespective of the grid space size, and high-quality memory is found to be necessary if fast RBE is required for a large grid space. Utilizing prediction with separable convolution, the RBE has also been found to perform within 1 ms, although the size of the problem was relatively large.
The current study is a first step toward achieving high-fidelity RBE in a real-time environment. Future work will use the best resolution of the RBE identified by the proposed modeling and investigate its efficacy.
References
1. Tarantola A: Inverse Problem Theory and Methods for Model Parameter Estimation. Philadelphia: Society for Industrial and Applied Mathematics; 2005.
2. Harlim J, Hunt BR: A non-Gaussian ensemble filter for assimilating infrequent noisy observations. Tellus A 2007, 59: 225–237. 10.1111/j.1600-0870.2007.00225.x
3. Apte A, Hairer M, Stuart AM, Voss J: Sampling the posterior: an approach to non-Gaussian data assimilation. Physica D 2007, 230: 50–64. 10.1016/j.physd.2006.06.009
4. Doshi P, Gmytrasiewicz PJ: Monte Carlo sampling methods for approximating interactive POMDPs. J. Artif. Intell. Res 2009, 34: 297–337.
5. Mandel J, Beezley JD: An ensemble Kalman-particle predictor-corrector filter for non-Gaussian data assimilation. Comput. Sci. ICCS 2009, 2009: 470–478.
6. Stenger B, Thayananthan A, Torr PHS, Cipolla R: Filtering using a tree-based estimator. IEEE Int. Conf. Comput. Vis 2003, 2: 1063–1070.
7. Huang D, Leung H: Maximum likelihood state estimation of semi-Markovian switching system in non-Gaussian measurement noise. IEEE Trans. Aerosp. Electron. Syst 2010, 46: 133–146.
8. Bergman N: Recursive Bayesian estimation: navigation and tracking applications. PhD Dissertation, Linköping University; 1999.
9. Furukawa T, Durrant-Whyte HF, Lavis B: The element-based method - theory and its application to Bayesian search and tracking. Paper presented at the IEEE/RSJ international conference on intelligent robots and systems. San Diego; 29 Oct–2 Nov 2007.
10. Lavis B, Furukawa T: HyPE: Hybrid particle-element approach for recursive Bayesian searching and tracking. Proceedings of Robotics: Science and Systems IV. Zurich: MIT Press; 2008.
11. Lavis B, Furukawa T, Durrant-Whyte HF: Dynamic space reconfiguration for Bayesian search and tracking with moving targets. Auto. Robots 2008, 24(4): 387–399. 10.1007/s10514-007-9081-4
12. Furukawa T, Lavis B, Durrant-Whyte HF: Parallel grid-based recursive Bayesian estimation using GPU for real-time autonomous navigation. Paper presented at the IEEE international conference on robotics and automation. Anchorage, AK, USA; 3–7 May 2010.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Keywords
 RBE
 Bayesian
 GPU
 Real-time
 Grid-based
 Parallel