Global point cloud transformers~\cite{Guo2021,Yu2022,He2022} can only process point clouds containing several thousand points;
thus, we omit comparisons with them and focus on recently proposed efficient point cloud transformers, including Stratified Transformer~\cite{Lai2022} and Point Transformer v2~\cite{Wu2022}.
Stratified Transformer extends window attention~\cite{Liu2021a} to point clouds with cubic windows~\cite{Fan2022,Sun2022} and leverages stratified sampling to improve its performance.
Point Transformer v2 applies attention to the $k$ nearest neighbours of each point in a sliding-window fashion.
Since the network configurations vary greatly, we first record the running time of a \emph{single} transformer block on an Nvidia 3090 GPU to eliminate the influence of uncontrolled factors.
We choose the spatial number of the input tensor from $\{10k, 20k, 50k, 100k, 200k\}$ and set the channel number to 96.
For the attention modules, we set the number of heads to 6, and set both the window point number of our OctFormer and the neighbourhood number of Point Transformer v2 to 32.
Since the point number varies across windows in Stratified Transformer, we set its window size to 7 so that the average point number per window is about 32.
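To make the measurement protocol concrete, the following is a minimal sketch of how such a micro-benchmark can be set up in PyTorch with CUDA events; the function name, the warm-up and repetition counts, and the padding of the point number to full windows are illustrative choices rather than the exact script used for our measurements, and the sketch only models a dense windowed attention block, not the neighbourhood gathering required by the baselines.
\begin{verbatim}
import torch

def time_attention_block(num_points, channels=96, heads=6,
                         window=32, runs=20):
    """Time batched multi-head attention over fixed-size point windows."""
    num_windows = (num_points + window - 1) // window   # pad to full windows
    attn = torch.nn.MultiheadAttention(channels, heads,
                                       batch_first=True).cuda()
    x = torch.randn(num_windows, window, channels, device='cuda')

    for _ in range(5):                        # warm-up iterations
        attn(x, x, x, need_weights=False)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(runs):
        attn(x, x, x, need_weights=False)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / runs     # milliseconds per forward pass

for n in (10_000, 20_000, 50_000, 100_000, 200_000):
    print(f'{n:>7} points: {time_attention_block(n):.2f} ms')
\end{verbatim}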
The results are shown in Figure~\ref{fig:efficiency}.
It can be seen that although the computational complexities of the three methods are all linear, our OctFormer runs significantly faster than Point Transformer v2 and Stratified Transformer.
OctFormer is over 17 times faster than the other two methods when the spatial number of the input tensor is $200k$.
The key reason for the efficiency of our OctFormer is that our novel octree attention does not require expensive neighbourhood searching;
instead, it converts sparse point clouds into dense tensors and mainly leverages standard operators supported by deep learning frameworks, e.g., the multi-head attention of PyTorch, which is in turn built upon general matrix multiplication routines that have been optimized towards the computational limits of GPUs~\cite{cublas2022}.
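Concretely, this amounts to reshaping the octree-sorted features into equal-sized dense windows and issuing a single batched attention call; the sketch below illustrates the idea in simplified form, omitting positional encodings, the dilated variant, and other details of our block, and assuming the features are already sorted by octree shuffled keys.
\begin{verbatim}
import torch

def windowed_attention(feats, attn, window=32):
    """Dense multi-head attention over equal-sized groups of consecutive points.

    feats: an (N, C) tensor whose rows are sorted by octree shuffled keys,
           so that consecutive rows are spatially close.
    attn:  a torch.nn.MultiheadAttention module with batch_first=True.
    """
    n, c = feats.shape
    pad = (-n) % window                          # pad N up to a full window
    x = torch.cat([feats, feats.new_zeros(pad, c)])
    x = x.view(-1, window, c)                    # (num_windows, window, C)
    out, _ = attn(x, x, x, need_weights=False)   # one GEMM-backed batched call
    return out.reshape(-1, c)[:n]
\end{verbatim}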
In contrast, the point numbers in the windows of Stratified Transformer are highly unbalanced, which makes efficiency tuning difficult even with hand-crafted GPU programming.
Although the neighbourhood number of Point Transformer v2 is fixed, its sliding-window execution pattern wastes considerable computation that could otherwise be shared.
We also compare the efficiency of the complete networks, as shown in Figure~\ref{fig:teaser}.
We record the time of one forward pass of each network with a batch of $250k$ points on an Nvidia 3090 GPU.
Our OctFormer-Small is slightly faster than MinkowskiNet, 3 times faster than Point Transformer v2, and 20 times faster than Stratified Transformer.
It is worth mentioning that our OctFormer takes point clouds quantized with a voxel size of 1cm as input, whereas the other networks take point clouds quantized with a voxel size of 2cm.
We will ablate the effect of voxel sizes in the following experiments.
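For reference, quantization with a given voxel size amounts to a simple grid subsampling step; the sketch below averages the points falling into each occupied voxel, although keeping one representative point per voxel is an equally common variant, and we do not assert which variant each compared implementation uses.
\begin{verbatim}
import torch

def voxel_quantize(points, voxel_size=0.01):
    """Grid-subsample an (N, 3) point cloud; 0.01 corresponds to 1cm voxels."""
    coords = torch.floor(points / voxel_size).long()    # integer voxel indices
    unique_coords, inverse = torch.unique(coords, dim=0, return_inverse=True)
    num_voxels = unique_coords.shape[0]
    # Average the coordinates of the points falling into the same voxel.
    sums = torch.zeros(num_voxels, 3).index_add_(0, inverse, points)
    counts = torch.zeros(num_voxels).index_add_(0, inverse,
                                                torch.ones(points.shape[0]))
    return sums / counts.unsqueeze(1)
\end{verbatim}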