
LIDAR SEGMENTATION

Luke Ludington, Skyler Kim, Yu Ruan, Yuncong Ma

How it Works

01

LEARN FEATURES

A Lidar sample is first broken up into a grid of voxels. The network tries to learn the shape of the major surface captured in each voxel.

02

CONVOLVE RESULTS

Data is then passed through a series of convolution layers to better extract edges and features of cars on the x-y plane. These layers also have the purpose of slowly flattening the z dimension, allowing the sample to be treated as a 2D image for region proposal calculations.

03

REGION IDENTIFICATION 

The network then applies a region proposal network to identify objects in the scene and suggest their bounding box. The output of the network is a classification map for finding objects and a regression map for finding bounding box dimensions.


Learn Features

Voxel Partitioning

Pre-processing starts by partitioning the sample's Lidar data into voxels. Our voxels were 0.5 m long x 0.25 m wide x 0.25 m high. For each set of Lidar data, we broke up the 50 m x 50 m x 2 m area around the car into 200 x 400 x 8 voxels. This is later reshaped to 8 x 200 x 400.
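As a minimal NumPy sketch of the binning step (assuming points are given in meters with the origin at the car, and parameterizing the grid rather than hard-coding the exact voxel sizes above):

```python
import numpy as np

def voxelize(points, ranges=(50.0, 50.0, 2.0), grid=(200, 400, 8)):
    """Map each (x, y, z) point to an integer voxel index.

    `points` is an (N, 3) array in meters, with the origin at the car
    and all points assumed to lie inside `ranges`.  Returns an (N, 3)
    array of voxel indices along (x, y, z).
    """
    ranges = np.asarray(ranges)
    grid = np.asarray(grid)
    voxel_size = ranges / grid          # meters per voxel along each axis
    # Shift points so coordinates are non-negative, then bin them.
    idx = np.floor((points + ranges / 2) / voxel_size).astype(int)
    return np.clip(idx, 0, grid - 1)    # guard points exactly on the far edge
```

A point at the origin lands in the middle of the grid, and points on the far boundary are clipped into the last voxel rather than indexing out of range.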

Grouping

For each voxel, we find the mean point in the voxel and subtract it from every point in the voxel. We then combine this difference with the original point to create the input to our network. Saving a point's difference from the mean will help the beginning of the network better learn feature shapes in the voxel.

Since some voxels could contain thousands of points, we randomly sample to limit the number of points considered per voxel. Our network takes a random sample of 35 points per voxel. If a voxel has fewer than 35 points, the remaining slots are zero-padded.
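The grouping step above can be sketched for a single voxel as follows (a NumPy illustration; the function name and the fixed random seed are ours):

```python
import numpy as np

def voxel_features(points, max_points=35, seed=0):
    """Build the per-voxel network input from one voxel's points.

    `points` is an (N, 3) array of the Lidar points that fell in one
    voxel.  Returns a (max_points, 6) array: each row is the original
    point concatenated with its offset from the voxel's mean point,
    zero-padded when the voxel holds fewer than `max_points` points.
    """
    rng = np.random.default_rng(seed)
    if len(points) > max_points:                      # cap dense voxels
        points = points[rng.choice(len(points), max_points, replace=False)]
    offsets = points - points.mean(axis=0)            # difference from the mean
    feats = np.zeros((max_points, 6), dtype=np.float32)
    feats[: len(points)] = np.hstack([points, offsets])
    return feats
```

Each point thus carries 6 features: its raw coordinates plus its offset from the voxel mean, matching the 6-feature input to the first VFE layer described below.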

Stacked VFE

The network starts with a series of Voxel Feature Encoding (VFE) layers. Each VFE layer consists of an FCN, Max Pooling per voxel, and a concatenation of the FCN output with the Max Pooling output. The FCN consists of a Dense layer, a Batch Normalization layer, and a ReLU activation layer.

Combining point-wise transformations with the Max Pooling per voxel output helps the network better learn descriptive shape information. This is more apparent as more VFE layers are fed into each other.

Our network has two VFE layers: one that maps the 6 input features to 32 features per point, and one that maps 32 features to 64 features.
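The forward pass of one VFE layer can be sketched in NumPy as below (Batch Normalization omitted for brevity, and the FCN reduced to a single weight matrix; these simplifications are ours):

```python
import numpy as np

def vfe_forward(x, w):
    """One Voxel Feature Encoding layer, forward pass only.

    x: (V, T, C_in) -- V voxels, T points per voxel, C_in features.
    w: (C_in, C_out/2) dense weights (the FCN; batch norm omitted).
    Returns (V, T, C_out): point-wise features concatenated with the
    voxel's max-pooled feature, broadcast back to every point.
    """
    pointwise = np.maximum(x @ w, 0)                   # Dense + ReLU
    pooled = pointwise.max(axis=1, keepdims=True)      # Max Pooling per voxel
    pooled = np.broadcast_to(pooled, pointwise.shape)  # repeat for each point
    return np.concatenate([pointwise, pooled], axis=-1)
```

Because the output concatenates point-wise and pooled halves, a layer that maps 6 features to 32 uses a (6, 16) weight matrix, and stacking a second layer with a (32, 32) matrix yields the 64-feature output.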

Sparse Tensor

Because much of the observable sample space contains no Lidar points, we construct and save the input to our model as a Sparse Tensor, storing only entries with meaningful data. Using Sparse Tensors greatly reduces the size of the network's input, and is necessary for the model to fit in the memory of a general-use computer.

Unfortunately, Keras layers do not work well with Sparse Tensors, so the input data must be converted to a dense Tensor before it can pass through the network.
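A minimal sketch of the idea with TensorFlow's `tf.sparse` API (a toy 3 x 4 tensor, not our real input shape):

```python
import tensorflow as tf

# Only two of the twelve cells hold data; the rest are implicit zeros.
st = tf.sparse.SparseTensor(
    indices=[[0, 0], [1, 2]],   # coordinates of the occupied cells
    values=[1.0, 2.0],
    dense_shape=[3, 4],
)

# Keras layers need a dense input, so expand just before the network.
dense = tf.sparse.to_dense(st)
```

The sparse form stores two (index, value) pairs instead of twelve floats; the densification step is exactly where the memory cost returns.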


Convolutional Middle Layers

3D Convolution

The purpose of the Convolutional middle layers is to extract voxel features while reducing the size of the data. These layers are made up of a series of Convolution 3D layers, Batch Normalization layers, and ReLU activation layers. A Convolution 3D layer takes a rank 4 Tensor and convolves it by applying a kernel at every position in the input Tensor, moving by a fixed stride, which describes how to get from one position to the next. For these layers, we use the default kernel initializer in Keras' Conv3D layers. Padding is also applied at the edges of the data so that the kernel can be applied there as well.

The Convolution 3D layers apply a convolution with a stride of either (2, 1, 1) or (1, 1, 1). The input tensor is padded by 1 in every dimension, so a stride of (1, 1, 1) leaves the output Tensor the same size as the input, while a stride of (2, 1, 1) cuts the z dimension in half. The z dimension is reduced until it reaches 1, at which point the data can be reshaped into a rank 3 Tensor.
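The z-flattening schedule can be checked with the standard convolution output-size formula (assuming a kernel size of 3 along z, which the text does not state):

```python
def conv_out(size, kernel=3, stride=2, pad=1):
    """Output length of one convolution along a single axis."""
    return (size + 2 * pad - kernel) // stride + 1

z = 8
steps = [z]
while z > 1:
    z = conv_out(z)        # stride-2 conv along z; x and y keep stride 1
    steps.append(z)
# steps == [8, 4, 2, 1]: three strided convs flatten the z dimension
```

Starting from the 8 voxels along z, three stride-(2, 1, 1) convolutions bring the z dimension to 1, which is what allows the reshape to a rank 3 Tensor.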

Batch Normalization

A Batch Normalization layer takes the output of the previous layer and attempts to normalize it. This means the previous layer's activations are transformed to have a mean of 0 and a standard deviation of 1 across the whole batch, so the next layer in the network sees input activations that closely resemble the standard normal distribution. Having the input to the next layer follow a tighter distribution helps the network quickly tune that layer's weights. Without Batch Normalization, the next layer may instead see a large and varied distribution of inputs across different samples in a batch.
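The normalization itself is simple to state in NumPy (this sketch omits the learnable scale and shift parameters that a real Batch Normalization layer also applies):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize activations to zero mean, unit variance over the batch.

    x: (batch, features).  eps guards against division by zero; the
    learnable gamma/beta of a real BatchNorm layer are omitted here.
    """
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

# Activations with very different scales per feature...
acts = np.array([[10.0, 200.0], [30.0, 400.0]])
out = batch_norm(acts)   # ...come out with mean 0 and std 1 per feature
```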

ReLU Layers

The ReLU layers zero out negative activations, isolating the important features found by the convolutional layers. This output is better suited to the region proposal network, as it keeps the general identity and size of the objects without carrying all of the original information.


Region Proposal Network


The final part of the network is a region proposal network (RPN). The RPN breaks the space around the car into voxels similar to those used in pre-processing, but with the z dimension of each voxel set to the full height of the scene and the x-y dimensions doubled. The doubling comes from the convolution layers in the network halving the grid of x-y voxel positions.

In each new voxel, there are a number of anchor boxes. These boxes are used to identify the presence of objects. For our network, we had 2 anchor boxes per x-y section: an anchor that is 1.6 m x 3.9 m x 1.56 m, and the same size box but rotated 90 degrees around the z axis. These boxes were centered at the center of the voxel, but placed 1 m off the ground.
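Laying out the anchors can be sketched as below. The grid shape, cell spacing, the (length, width, height) ordering of the 1.6 m x 3.9 m x 1.56 m box, and placing the center at z = 1.0 m are all our assumptions for illustration:

```python
import numpy as np

def make_anchors(grid_hw=(100, 200), cell=(1.0, 0.5), z_center=1.0):
    """Place two anchor boxes at the center of every x-y cell.

    Each anchor is (x, y, z, length, width, height, yaw); the second
    anchor is the first rotated 90 degrees about the z axis.  The grid
    shape and cell spacing assume the pre-processing grid halved and
    the voxel footprint doubled.
    """
    H, W = grid_hw
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    cy = (ys + 0.5) * cell[0]            # y center of each cell, meters
    cx = (xs + 0.5) * cell[1]            # x center of each cell, meters
    anchors = np.zeros((H, W, 2, 7), dtype=np.float32)
    for a, yaw in enumerate((0.0, np.pi / 2)):
        anchors[..., a, 0] = cx
        anchors[..., a, 1] = cy
        anchors[..., a, 2] = z_center
        anchors[..., a, 3:6] = (3.9, 1.6, 1.56)  # length, width, height in m
        anchors[..., a, 6] = yaw
    return anchors
```

Every x-y cell thus contributes two candidate boxes that differ only in yaw, which is what the classification and regression maps below score and refine.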

The RPN outputs two Tensors: a classification map and a regression map. The classification map encodes the likelihood of an object being present in a particular anchor box. The regression map holds regression values per anchor box that describe the proposed bounding box for the object found inside the anchor box. We have 7 regression values per anchor: x center, y center, z center, length, width, height, and yaw.
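The shapes of the two output heads can be sketched as 1x1 convolutions over the flattened feature map (a toy version with random weights; the function name and sigmoid scoring are our illustration, not the exact production head):

```python
import numpy as np

def rpn_heads(feature_map, n_anchors=2, seed=0):
    """Toy RPN output heads as 1x1 convolutions with random weights.

    feature_map: (H, W, C).  Returns a classification map with one
    objectness score per anchor and a regression map with 7 values per
    anchor (x, y, z, length, width, height, yaw).
    """
    H, W, C = feature_map.shape
    rng = np.random.default_rng(seed)
    w_cls = rng.standard_normal((C, n_anchors))
    w_reg = rng.standard_normal((C, 7 * n_anchors))
    score = 1 / (1 + np.exp(-(feature_map @ w_cls)))  # objectness in (0, 1)
    regress = feature_map @ w_reg                     # 7 values per anchor
    return score, regress
```

With 2 anchors per cell, the classification map has 2 channels and the regression map has 14, regardless of the feature map's spatial size.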

Training

The size of the network, combined with having to represent sparse tensors as dense tensors due to the limitations of TensorFlow, made the model very large. To overcome this, we train the network in stages: at each stage we train on a very small batch, save the model weights to disk, then load the saved weights and continue training. This process significantly reduced training speed, but allows the model to run on far more modest computers, down to 64 GB of RAM with a batch size of 1.
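The staging loop looks roughly like the following (a schematic sketch: the arithmetic update stands in for the actual Keras training step, and the checkpoint path is arbitrary):

```python
import numpy as np

def train_in_stages(weights, batches, n_stages, ckpt_path):
    """Staged training: train on one small batch, checkpoint to disk,
    reload, and continue.  The in-place arithmetic update below is a
    stand-in for the real Keras fit/train step on the full model.
    """
    for stage in range(n_stages):
        batch = batches[stage % len(batches)]
        weights = weights + 0.1 * batch.mean()  # stand-in gradient update
        np.save(ckpt_path, weights)             # persist after each stage
        weights = np.load(ckpt_path)            # reload before next stage
    return weights
```

Round-tripping through disk after every stage is what keeps the peak memory bounded by a single small batch rather than the whole training run.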

Results

sample0.png
sample2.png

Above are two sample predictions of our trained model. The first has an IoU of 0.006 while the second has an IoU of 0.008.

The model seems to be able to find the bounding box of a car when it can find one, but has difficulty distinguishing cars from other features in the background.


Future Improvements

In order to make LiSec more publicly accessible, we hope to improve its RAM efficiency so it can run on personal computers rather than servers. To do this, we will try various methods of reducing the complexity of the model without reducing its efficacy. There are 3 main ideas we hope will shrink the model's RAM footprint.

  1. General complexity reduction using principles shown in the Shrinknet paper

  2. Using real sparse tensors instead of null values

  3. Reducing Batch Normalization, which takes up large amounts of RAM
