Selected from the Google AI Blog


Author: Shreeyak Sajjan

Compiled by Machine Heart

Contributors: Prince Jia, Egg Sauce


Optical 3D range sensors, widely used in industry, have long had an unsolved problem: they fail every time they encounter a transparent object. Recently, Google, together with researchers from Synthesis AI and Columbia University, developed ClearGrasp, a machine learning algorithm that can estimate accurate 3D data of transparent objects from RGB-D images.

From self-driving vehicles to autonomous robots, optical 3D range sensors such as RGB-D cameras are widely used to generate rich and accurate 3D maps of the environment.

But these sensors have a “natural enemy”: transparent objects. Even an ordinary glass container can confound commonly used, expensive sensors.

This is because optical 3D sensors rest on one premise: all surfaces are assumed to be Lambertian, that is, they reflect light evenly in all directions, so the surface appears equally bright from every viewing angle. Transparent objects clearly violate this assumption, because their surfaces refract light in addition to reflecting it. As a result, most depth data for transparent objects is invalid or contains unpredictable noise.
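To make the Lambertian assumption concrete, the minimal sketch below compares a Lambertian surface, whose brightness depends only on the angle between the surface normal and the light, with a mirror-like highlight that changes with the viewing direction. The Phong-style specular term, the function names, the exponent and the vectors are illustrative assumptions and not part of ClearGrasp. Refraction, which also affects transparent objects, is not modeled here.

```python
import numpy as np

def normalize(v):
    """Return v scaled to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def lambertian(normal, light, albedo=1.0):
    """Diffuse brightness: depends on normal and light only, not on the viewer."""
    n, l = normalize(normal), normalize(light)
    return albedo * max(float(n @ l), 0.0)

def phong_specular(normal, light, view, shininess=50.0):
    """Mirror-like highlight (Phong model): depends on the viewing direction."""
    n, l, v = normalize(normal), normalize(light), normalize(view)
    r = 2.0 * (n @ l) * n - l          # light direction mirrored about the normal
    return max(float(r @ v), 0.0) ** shininess

normal = [0.0, 0.0, 1.0]
light = [1.0, 0.0, 1.0]
view_a = [-1.0, 0.0, 1.0]   # viewer placed along the mirror direction
view_b = [0.0, 0.0, 1.0]    # viewer looking straight down the normal

diffuse = lambertian(normal, light)              # identical for every viewer
spec_a = phong_specular(normal, light, view_a)   # strong highlight
spec_b = phong_specular(normal, light, view_b)   # highlight almost gone
print(diffuse, spec_a, spec_b)
```

The diffuse value is the same for both viewers, while the specular value changes sharply between them. A sensor that assumes the first behavior will misread a surface dominated by the second.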


Optical 3D sensors often fail to detect transparent objects. Right: for example, the glass bottle does not appear in the 3D depth image captured by an Intel® RealSense™ D415 RGB-D camera. Bottom: a 3D visualization of the point cloud built from the depth image.

Enabling machines to better perceive transparent surfaces would not only improve safety but also open up new interactions in unstructured applications, such as handling kitchenware, sorting plastics for recycling, navigating indoor environments, or generating augmented reality (AR) visualizations on glass tabletops.

To solve this problem, Google worked with researchers from Synthesis AI and Columbia University to develop ClearGrasp, a machine learning algorithm that estimates accurate 3D data of transparent objects from RGB-D images.

This is made possible mainly by a large-scale synthetic dataset that Google has released publicly. ClearGrasp accepts input from any standard RGB-D camera, uses deep learning to accurately reconstruct the depth of transparent objects, and generalizes to new objects that were not seen during training. This differs from previous methods, which required prior knowledge of the transparent objects (such as their 3D models), combined with maps of the background lighting and camera positions. In this work, Google also shows that ClearGrasp benefits robotic manipulation: after it was integrated into the control system of a pick-and-place robot, the grasping success rate for transparent plastic objects improved significantly.

ClearGrasp uses deep learning to reconstruct accurate 3D depth data for transparent surfaces.

Transparent object visual data set

Any effective deep learning model needs a large amount of training data (for example, ImageNet in the vision field, or the text corpora used to train BERT), and ClearGrasp is no exception. Unfortunately, no dataset with 3D data of transparent objects was available. Existing 3D datasets (such as Matterport3D and ScanNet) do not record transparent surfaces, because labeling them is expensive and time-consuming.

To overcome this problem, Google created its own large-scale dataset of transparent objects. It contains more than 50,000 photorealistic renderings with corresponding surface normals (which represent the surface curvature), segmentation masks, edges, and depth, all of which are useful for training various 2D and 3D detection tasks. Each image contains up to five transparent objects, either on a flat plane or inside a tote, with a variety of backgrounds and lighting.

Some examples of transparent objects from the ClearGrasp synthetic dataset.

Google also includes in the dataset a test set of 286 real-world images with ground-truth depth. Capturing these images was painstaking: each transparent object in the scene had to be replaced with a painted copy of the same size in exactly the same position. The images were taken under many different indoor lighting conditions, using a variety of cloth and veneer backgrounds, with random opaque objects scattered around the scene. They contain both known objects that appear in the synthetic training set and new objects.

Left: the setup for capturing real-world images. Middle: a custom user interface makes it possible to precisely replace each transparent object with a painted copy. Right: an example of captured data.


Although the distorted view of the background seen through a transparent object confuses typical depth estimation methods, some cues still hint at the object’s shape. Transparent surfaces also exhibit specular reflections: mirror-like reflections that show up as bright highlights in a well-lit environment. Because these visual cues are prominent in RGB images and are determined mainly by the shape of the object, a convolutional neural network can use them to infer accurate surface normals, which can then be used for depth estimation.

Specular reflections on transparent objects create distinct features that vary with the shape of the object and provide very useful visual cues for estimating surface normals.

Most machine learning algorithms try to estimate depth directly from a monocular RGB image. However, monocular depth estimation is an ill-posed task, even for humans. The team observed large errors in the estimated depth of flat background surfaces, which compound the errors in the depth estimates for transparent objects resting on them. Therefore, rather than directly estimating the depth of all geometry, it may be more practical to correct the initial depth estimates from an RGB-D 3D camera, so that the depth of the non-transparent surfaces can inform the depth of the transparent ones.
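One way to see why monocular depth estimation is ill-posed: under a pinhole camera model, a 3D point and any scaled copy of it along the same viewing ray project to the same pixel, so a single RGB image cannot fix absolute depth on its own. A small sketch follows; the intrinsics fx, fy, cx and cy are made-up illustrative values, not those of any particular camera.

```python
import numpy as np

def project(point, fx=600.0, fy=600.0, cx=320.0, cy=240.0):
    """Project a 3D camera-frame point (X, Y, Z) to pixel coordinates (u, v)."""
    x, y, z = point
    return np.array([fx * x / z + cx, fy * y / z + cy])

near = np.array([0.1, -0.05, 0.5])   # a point 0.5 m from the camera
far = 4.0 * near                     # a point 2.0 m away on the same viewing ray

pixel_near = project(near)
pixel_far = project(far)
print(pixel_near, pixel_far)         # both points land on the same pixel
```

An RGB-D camera removes this ambiguity wherever its depth reading is valid, which is why starting from the sensor depth and correcting only the transparent regions is attractive.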

ClearGrasp algorithm


ClearGrasp uses three neural networks: one to estimate surface normals, one to estimate occlusion boundaries (depth discontinuities), and one to mask transparent objects. The mask is used to remove all pixels belonging to transparent objects so that their correct depths can be filled in. A global optimization module then extends the depth from the known surfaces, using the predicted surface normals to guide the shape of the reconstruction and the predicted occlusion boundaries to keep distinct objects separate.
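The idea of removing the masked pixels and then extending depth from known surfaces under the guidance of surface normals can be illustrated with a much simpler stand-in: Poisson integration of normal-derived gradients, solved by Jacobi iteration. This is not ClearGrasp's optimizer. The sketch assumes an orthographic view, ignores occlusion boundaries, requires that the mask does not touch the image border, and uses hypothetical function names and a toy scene in which the "predicted" normals are exact.

```python
import numpy as np

def normals_to_gradients(normals):
    """Convert unit normals (H, W, 3) to depth gradients, assuming an
    orthographic view where n is proportional to (-dz/dx, -dz/dy, 1)."""
    nz = normals[..., 2]
    return -normals[..., 0] / nz, -normals[..., 1] / nz

def fill_masked_depth(depth, mask, normals, iters=500):
    """Fill depth inside `mask` (True = removed transparent-object pixel).

    Known depth outside the mask acts as the boundary condition; inside,
    Jacobi iterations solve the Poisson equation implied by the gradients.
    """
    gx, gy = normals_to_gradients(normals)
    # Divergence term for forward-difference gradients.
    div = (np.roll(gx, 1, axis=1) - gx) + (np.roll(gy, 1, axis=0) - gy)
    z = depth.astype(float).copy()
    z[mask] = z[~mask].mean()                      # crude initial guess
    for _ in range(iters):
        neighbours = (np.roll(z, 1, 0) + np.roll(z, -1, 0)
                      + np.roll(z, 1, 1) + np.roll(z, -1, 1))
        z[mask] = ((neighbours + div) / 4.0)[mask]  # update masked pixels only
    return z

# Toy scene: a smooth bowl whose centre the "sensor" failed to measure.
ys, xs = np.mgrid[0:16, 0:16].astype(float)
true_depth = 1.0 + 0.002 * ((xs - 8) ** 2 + (ys - 8) ** 2)

# Stand-in for predicted normals, built from the true forward differences.
gx = np.zeros_like(true_depth)
gy = np.zeros_like(true_depth)
gx[:, :-1] = np.diff(true_depth, axis=1)
gy[:-1, :] = np.diff(true_depth, axis=0)
normals = np.dstack([-gx, -gy, np.ones_like(gx)])
normals /= np.linalg.norm(normals, axis=2, keepdims=True)

mask = np.zeros(true_depth.shape, dtype=bool)
mask[5:11, 5:11] = True                          # stand-in for the transparent-object mask
sensor_depth = np.where(mask, 0.0, true_depth)   # masked pixels carry no valid depth

filled = fill_masked_depth(sensor_depth, mask, normals)
max_error = np.abs(filled - true_depth)[mask].max()
print(max_error)
```

With exact normals the hole is recovered almost perfectly. In practice the normals are network predictions, and depth should not be propagated across object edges, which is the role the article assigns to the predicted occlusion boundaries.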

Method overview: the point cloud is generated from the output depth and colored according to its surface normals.

Each neural network was trained on the synthetic transparent-object dataset and performed well on transparent objects in real-world images. However, the surface normal estimates for other surfaces, such as walls or fruit, were poor. This is due to a limitation of the synthetic dataset, which contains only transparent objects on a ground plane. To alleviate this problem, the team added some real indoor scenes from the Matterport3D and ScanNet datasets to the surface normal training loop. After training on both the in-domain synthetic dataset and the out-of-domain real-world datasets, the model performed well on the test set.
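Mixing in-domain synthetic data with out-of-domain real data can be as simple as drawing each training batch from both sources. The sketch below is a generic illustration only: the article does not state the ratio or batch size used, so `real_fraction`, `batch_size`, the function name and the placeholder samples are all assumptions.

```python
import random

def mixed_batches(synthetic, real, batch_size=8, real_fraction=0.5, steps=3, seed=0):
    """Yield batches that combine samples from two datasets.

    `real_fraction` is a hypothetical knob: the share of each batch drawn
    from the real-world (out-of-domain) dataset.
    """
    rng = random.Random(seed)
    n_real = round(batch_size * real_fraction)
    n_syn = batch_size - n_real
    for _ in range(steps):
        batch = rng.sample(synthetic, n_syn) + rng.sample(real, n_real)
        rng.shuffle(batch)   # avoid a fixed synthetic-then-real ordering
        yield batch

# Placeholder identifiers standing in for (image, surface-normal label) pairs.
synthetic = [("synthetic", i) for i in range(100)]
real = [("real", i) for i in range(40)]

batches = list(mixed_batches(synthetic, real))
print(batches[0])
```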

Surface normal estimation on real images after training on a) Matterport3D and ScanNet only (MP + SN), b) Google’s synthetic dataset only, and c) both MP + SN and Google’s synthetic dataset. Note that the model trained on MP + SN fails to detect the transparent objects. The model trained only on synthetic data recognizes the real plastic bottles well but fails on other objects and surfaces. When the model is trained on both, it meets both needs.



Overall, quantitative experiments show that ClearGrasp can reconstruct the depth of transparent objects with higher fidelity than other methods. Although the models were trained only on synthetic transparent objects, they adapt well to the real-world domain, achieving almost the same quantitative reconstruction performance on known objects across domains. The models also generalize well to new objects with complex shapes never seen before.


To check the qualitative performance of ClearGrasp, the team constructed 3D point clouds from the input and output depth images, as shown in the figure below (more examples can be found on the project page). The estimated 3D surfaces have clean and coherent reconstructed shapes, which is important for applications such as 3D mapping and 3D object detection, without the jagged noise seen in monocular depth estimation methods. The models are also robust under challenging conditions, such as identifying transparent objects against a patterned background.
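Point clouds like the ones described above are typically built by back-projecting each depth pixel through a pinhole camera model. A minimal sketch follows, assuming depth in meters, zero marking invalid pixels, and toy intrinsics; the function name and all values are illustrative, not taken from the ClearGrasp code.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (H, W), in metres, to an (N, 3) point cloud.

    Pixels with zero depth are treated as invalid and dropped.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]            # pixel row (v) and column (u) indices
    z = depth.astype(float)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]

# Toy 4x4 depth image: a flat surface at 1 m with one invalid (zero) pixel,
# similar to the holes a sensor leaves on transparent objects.
depth = np.ones((4, 4))
depth[1, 2] = 0.0
cloud = depth_to_point_cloud(depth, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
print(cloud.shape)
```

Running the same function on the raw sensor depth and on the completed depth gives two clouds that can be compared side by side: missing or noisy regions in the first should appear as coherent surfaces in the second.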

Qualitative results on real images. Top two rows: results for known objects. Bottom two rows: results for new objects. Each point cloud is generated from its corresponding depth image and colored according to its surface normals.

Most importantly, ClearGrasp’s output depth can be used directly as input to state-of-the-art manipulation algorithms that use RGB-D images. When ClearGrasp’s depth estimates were used in place of the raw sensor data, the grasping algorithm on a UR5 robot arm achieved a significantly higher success rate on transparent objects. With the parallel-jaw gripper, the success rate rose from a baseline of 12% to 74%; with suction, it rose from 64% to 86%.

Manipulating new transparent objects with ClearGrasp. Note that the conditions are challenging: a textureless background, complex object shapes, and directional light that causes confusing shadows and caustics (the patterns of light produced when light rays are reflected or refracted by a surface).


Limitations and future work

One limitation of the synthetic dataset is that it does not represent caustics accurately, which is itself a limitation of rendering with traditional path-tracing algorithms. As a result, the models mistake bright caustics combined with shadows for independent transparent objects. Despite these drawbacks, Google’s work on ClearGrasp shows that synthetic data remains a viable way to obtain good results with learning-based depth reconstruction methods. A promising direction for future work is to improve domain transfer to real-world images by generating renderings with physically correct caustics and surface imperfections such as fingerprints.


ClearGrasp demonstrates that high-quality renderings can be used to successfully train models that perform well in the real world. The team also hopes that the dataset will drive further research on data-driven perception algorithms for transparent objects. Download links and more sample images can be found on Google’s project website (mentioned earlier) and on Google’s GitHub page.
