
LIDAR SEGMENTATION
Luke Ludington, Skyler Kim, Yu Ruan, Yuncong Ma
How it Works
01
LEARN FEATURES
A Lidar sample is first broken into a grid of voxels. The network learns the shape of the major surface captured in each voxel.
02
CONVOLVE RESULTS
Data is then passed through a series of convolution layers to better extract edges and features of cars on the x-y plane. These layers also have the purpose of slowly flattening the z dimension, allowing the sample to be treated as a 2D image for region proposal calculations.
03
REGION IDENTIFICATION
The network then applies a region proposal network to identify objects in the scene and suggest their bounding box. The output of the network is a classification map for finding objects and a regression map for finding bounding box dimensions.

Learn Features

Voxel Partitioning
Pre-processing starts by partitioning the sample's Lidar data into voxels. Our voxels were 0.5 m long x 0.25 m wide x 0.25 m high. For each set of Lidar data, we broke up the 50 m x 50 m x 2 m area around the car into a 200 x 400 x 8 voxel grid. This grid is later reshaped to 8 x 200 x 400.
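As a concrete sketch, each point's voxel index can be computed with a floor division by the voxel size. Only the voxel size and region extents come from the text above; centring the region on the car and the exact z range are assumptions for illustration.

```python
import numpy as np

VOXEL_SIZE = np.array([0.5, 0.25, 0.25])     # (x, y, z) in metres, per the text
REGION_MIN = np.array([-25.0, -25.0, -1.0])  # assumed: 50 m x 50 m x 2 m region centred on the car
REGION_MAX = np.array([25.0, 25.0, 1.0])

def voxelize(points):
    """Map each (x, y, z) point to an integer voxel index,
    dropping points outside the region of interest."""
    in_range = np.all((points >= REGION_MIN) & (points < REGION_MAX), axis=1)
    kept = points[in_range]
    idx = np.floor((kept - REGION_MIN) / VOXEL_SIZE).astype(np.int64)
    return kept, idx

pts = np.array([[0.0, 0.0, 0.0], [10.3, -4.2, 0.5], [99.0, 0.0, 0.0]])
kept, idx = voxelize(pts)
print(kept.shape)  # (2, 3) -- the point at x = 99 m falls outside the region
```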
Grouping
For each voxel, we find the mean of its points and subtract it from every point in the voxel. We then concatenate this difference with the original point to form the input to our network. Keeping each point's offset from the mean helps the start of the network learn feature shapes within the voxel.
Since some voxels can contain thousands of points, we randomly sample to limit the number of points considered per voxel. Our network takes a random sample of 35 points per voxel. If a voxel has fewer than 35 points, the remaining slots are zero-filled.
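The grouping step above can be sketched as follows; the function name and random seed are illustrative, and the 35-point cap and 6-feature layout (point plus offset from the voxel mean) come from the text.

```python
import numpy as np

T = 35  # maximum points considered per voxel, per the text

def voxel_features(points_in_voxel, rng=np.random.default_rng(0)):
    """Build one voxel's network input: randomly sample at most T points,
    append each point's offset from the voxel mean, and zero-pad to T rows.
    Output shape: (T, 6)."""
    pts = np.asarray(points_in_voxel, dtype=np.float32)
    if len(pts) > T:
        pts = pts[rng.choice(len(pts), T, replace=False)]
    offsets = pts - pts.mean(axis=0)                  # difference from the mean point
    feats = np.concatenate([pts, offsets], axis=1)    # (n, 6): point + offset
    out = np.zeros((T, 6), dtype=np.float32)          # unused slots stay all-zero
    out[:len(feats)] = feats
    return out
```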
Stacked VFE
The network starts with a series of Voxel Feature Encoding (VFE) layers. Each VFE layer consists of an FCN, a max pooling over each voxel, and a concatenation of the FCN output with the max-pooled output. The FCN consists of a Dense layer, a Batch Normalization layer, and a ReLU activation layer.
Combining point-wise transformations with the Max Pooling per voxel output helps the network better learn descriptive shape information. This is more apparent as more VFE layers are fed into each other.
Our network has two VFE layers: One that maps the input 6 features to 32 features per point, and one that maps 32 features to 64 features.
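A single VFE layer can be sketched in NumPy as a stand-in for the Keras FCN, max pool, and concatenation described above. The weights here are random and batch normalization is omitted for brevity:

```python
import numpy as np

def vfe_layer(voxel_points, w, b):
    """One Voxel Feature Encoding layer over a single voxel.
    voxel_points: (T, c_in); w: (c_in, c_out // 2); b: (c_out // 2,).
    Dense + ReLU per point, max-pool over the voxel, then concatenate the
    pooled vector back onto every point -> (T, c_out)."""
    pointwise = np.maximum(voxel_points @ w + b, 0.0)   # FCN (batch norm omitted)
    pooled = pointwise.max(axis=0, keepdims=True)       # voxel-wise max pool
    return np.concatenate(
        [pointwise, np.broadcast_to(pooled, pointwise.shape)], axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(35, 6)).astype(np.float32)         # one voxel's 35 x 6 input
out = vfe_layer(x, rng.normal(size=(6, 16)), np.zeros(16))
print(out.shape)  # (35, 32): VFE-1 maps 6 features to 32 per point
```

Note how the concatenation is what lets stacked VFE layers mix point-wise detail with voxel-wide shape information.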
Sparse Tensor
Because much of the observable sample space contains no Lidar points, we construct and save the input to our model as a Sparse Tensor, storing only the locations where there is meaningful data. Using Sparse Tensors greatly reduces the size of the network's input and is necessary for the model to fit in the memory of a general-purpose computer.
Unfortunately, Keras layers do not work directly with Sparse Tensors, so the input data must be converted to a dense Tensor before being fed to the network.
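The idea can be sketched without TensorFlow using a coordinate (COO) representation: store only the non-empty voxel coordinates and their features, and rebuild the dense grid just before the network needs it. The toy grid shape below is illustrative; the real grid is 8 x 200 x 400 voxels.

```python
import numpy as np

def to_coo(dense):
    """Keep only non-empty voxels: their coordinates plus their feature rows.
    dense: (D, H, W, C) voxel grid where empty voxels are all-zero."""
    coords = np.argwhere(np.any(dense != 0, axis=-1))  # (n_nonempty, 3)
    values = dense[tuple(coords.T)]                    # (n_nonempty, C)
    return coords, values, dense.shape

def to_dense(coords, values, shape):
    """Rebuild the dense grid before feeding Keras layers, as noted above."""
    dense = np.zeros(shape, dtype=values.dtype)
    dense[tuple(coords.T)] = values
    return dense

grid = np.zeros((8, 5, 5, 3), dtype=np.float32)  # toy grid; real one is 8 x 200 x 400
grid[3, 1, 2] = [1.0, 2.0, 3.0]
coords, values, shape = to_coo(grid)
print(coords.shape[0])  # 1 non-empty voxel stored instead of all 200
```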
Convolutional Middle Layers
3D Convolution
The purpose of the convolutional middle layers is to extract voxel features while reducing the size of the data. They are made up of a series of Convolution 3D layers, Batch Normalization layers, and ReLU activation layers. A Convolution 3D layer takes a rank 4 Tensor and convolves it by applying a kernel to every section of the input Tensor, moving from one section to the next according to a stride. For these layers, we use the default kernel initializer of Keras' Conv3D layers. Padding is also applied at the edges of the data so the kernel can be applied there as well.
The Convolution 3D layers use a stride of either (2, 1, 1) or (1, 1, 1), with padding of 1 in all dimensions. With this padding, a stride of (1, 1, 1) leaves the output Tensor the same size as the input, while a stride of (2, 1, 1) halves the z dimension. The z dimension was reduced in this way until it reached 1, at which point the data could be reshaped into a rank 3 Tensor.
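The size arithmetic behind this is the standard convolution output formula. The exact sequence of strides below is illustrative; the point is how a kernel of 3 with padding 1 preserves size at stride 1 and halves the z dimension at stride 2:

```python
def conv_out(size, kernel=3, stride=1, pad=1):
    """Convolution output size: floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# z dimension of the voxel grid through the middle layers (kernel 3, padding 1):
z = 8
z = conv_out(z, stride=2)  # stride (2, 1, 1): 8 -> 4
z = conv_out(z, stride=1)  # stride (1, 1, 1): 4 -> 4, size preserved
z = conv_out(z, stride=2)  # 4 -> 2
z = conv_out(z, stride=2)  # 2 -> 1: z flattened, ready for the rank 3 reshape
print(z)  # 1
```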
Batch Normalization
A Batch Normalization layer takes the output of the previous layer and normalizes it: the previous layer's activations are transformed to have a mean of 0 and a standard deviation of 1 across the whole batch. As a result, the next layer in the network sees input activations that closely resemble the standard normal distribution. Having the input to the next layer follow this tighter distribution helps the network quickly tune that layer's weights; without Batch Normalization, the next layer may instead see a large and varied distribution of inputs across different samples in a batch.
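The core transformation can be sketched in a few lines of NumPy. Keras' BatchNormalization also learns a scale (gamma) and shift (beta) and tracks running statistics for inference; those are omitted here for clarity:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize activations to zero mean and unit variance across the
    batch dimension (axis 0). eps guards against division by zero."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

# Activations with a mean of 5 and std of 3 come out standardized:
x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))
y = batch_norm(x)
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))
```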
ReLU Layers
The ReLU layers help the network isolate the important features found by the convolutional layers by zeroing out negative activations. The resulting output is better suited to the region proposal network, as it captures the general identity and size of objects without carrying all of the original information.
Region Proposal Network

The final part of the network is a region proposal network (RPN). The RPN works over a grid of voxels similar to those used in pre-processing, but with the z dimension of each voxel set to the height of the scene and the x-y dimensions doubled. The doubling reflects the fact that the convolution layers halve the x and y resolution of the voxel grid.
In each new voxel there are a number of anchor boxes, which are used to detect the presence of objects. Our network had 2 anchor boxes per x-y section: one that is 1.6 m x 3.9 m x 1.56 m, and the same box rotated 90 degrees around the z axis. These boxes were centered at the center of the voxel but placed 1 m off the ground.
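Anchor generation under these rules can be sketched as below. The grid size, cell footprint, and the choice of a ground-referenced z coordinate are assumptions for illustration; the box dimensions, the two yaw orientations, and the 1 m height come from the text.

```python
import numpy as np

def make_anchors(nx, ny, cell=(1.0, 0.5)):
    """Return an (nx, ny, 2, 7) array of anchors, each (x, y, z, l, w, h, yaw).
    Two anchors per x-y cell: one at yaw 0 and one rotated 90 degrees.
    cell is the (assumed) x-y footprint of a doubled voxel, in metres."""
    l, w, h = 3.9, 1.6, 1.56                 # car-sized anchor box, per the text
    xs = (np.arange(nx) + 0.5) * cell[0]     # anchor centres at cell centres
    ys = (np.arange(ny) + 0.5) * cell[1]
    cx, cy = np.meshgrid(xs, ys, indexing="ij")
    anchors = np.zeros((nx, ny, 2, 7))
    for k, yaw in enumerate((0.0, np.pi / 2)):
        anchors[..., k, :] = np.stack(
            [cx, cy, np.full_like(cx, 1.0),  # centres placed 1 m off the ground
             np.full_like(cx, l), np.full_like(cx, w), np.full_like(cx, h),
             np.full_like(cx, yaw)], axis=-1)
    return anchors

a = make_anchors(4, 8)
print(a.shape)  # (4, 8, 2, 7)
```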
The RPN outputs two Tensors: a classification map and a regression map. The classification map encodes the likelihood that an object is present in a particular anchor box. The regression map holds regression values per anchor box that describe the proposed bounding box for the object found inside it. We have 7 regression values per anchor: x center, y center, z center, length, width, height, and yaw.
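Such regression values are typically learned as residuals relative to the anchor rather than absolute coordinates. Assuming the usual VoxelNet-style encoding (centre offsets scaled by the anchor's ground-plane diagonal, log-scaled sizes, additive yaw), decoding one anchor's 7 values back into a box looks like this:

```python
import numpy as np

def decode(anchor, deltas):
    """Turn 7 regression residuals into a box, assuming the common encoding:
    centre offsets scaled by the anchor diagonal, log-residual sizes, and an
    additive yaw offset. anchor, deltas: (7,) = (x, y, z, l, w, h, yaw)."""
    ax, ay, az, al, aw, ah, ayaw = anchor
    d = np.sqrt(al**2 + aw**2)                    # anchor ground-plane diagonal
    x = deltas[0] * d + ax
    y = deltas[1] * d + ay
    z = deltas[2] * ah + az                       # z offset scaled by anchor height
    l, w, h = np.exp(deltas[3:6]) * (al, aw, ah)  # sizes are log-residuals
    yaw = deltas[6] + ayaw
    return np.array([x, y, z, l, w, h, yaw])

anchor = np.array([10.0, 5.0, 1.0, 3.9, 1.6, 1.56, 0.0])
box = decode(anchor, np.zeros(7))
print(np.allclose(box, anchor))  # True: zero residuals recover the anchor itself
```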
Source and Model
Training
Because of the size of the network, and because Tensorflow's limitations forced us to represent sparse tensors as dense tensors, the network was very large. To overcome this, we train the network in stages: at each stage we train on a very small batch, save the model weights to disk, then load the saved weights back and continue training. This process significantly slowed training, but it allows the model to run on far more modest machines, down to 64 GB of RAM with a batch size of 1.
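The staged loop can be sketched as below, with a toy "model" (a dict of NumPy weight arrays) and a placeholder update standing in for the real Keras network and training step. The checkpoint path and update rule are illustrative; the shape of the loop (train on a small batch, save, reload, continue) follows the text.

```python
import numpy as np
import os
import tempfile

def train_stage(weights, batch):
    """Placeholder for one stage of real training on a small batch."""
    return {k: v - 0.01 * batch.mean() for k, v in weights.items()}

ckpt = os.path.join(tempfile.mkdtemp(), "stage.npz")  # hypothetical checkpoint path
weights = {"w": np.ones((4, 4))}

for stage in range(3):
    if os.path.exists(ckpt):                 # resume from the previous stage's weights
        weights = dict(np.load(ckpt))
    batch = np.full((1, 4), float(stage))    # batch size 1, as in the text
    weights = train_stage(weights, batch)
    np.savez(ckpt, **weights)                # checkpoint to disk before the next stage

print(weights["w"][0, 0])
```

Only one stage's activations need to be in memory at a time, which is what bounds the peak RAM use.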
Results


Above are two sample predictions of our trained model. The first has an IoU of 0.006 while the second has an IoU of 0.008.
The model seems able to find the bounding box of a car when it detects one, but has difficulty distinguishing cars from other features in the background.
