Visual Perception Enabled Industry Intelligence: State of the Art, Challenges and Prospects

Visual perception refers to the process of organizing, identifying, and interpreting visual information for environmental awareness and understanding. With the rapid progress of multimedia acquisition technology, visual perception has become a hot research topic in both academia and industry. Especially since the introduction of artificial intelligence theory, intelligent visual perception has been widely used to push industrial production toward intelligence. In this article, we review previous research on and applications of visual perception in different industrial fields, such as product surface defect detection, intelligent agricultural production, intelligent driving, image synthesis, and event reconstruction. Together, these applications cover most intelligent visual perception processing technologies, so this survey provides a comprehensive reference for research in this direction. Finally, this article summarizes the current challenges of visual perception and predicts its future development trends.



I. INTRODUCTION
Among the five senses, vision provides a wealth of information for humans to observe and understand the world. Visual perception is an intuitive, internal process of observation and understanding. According to the feature-integration theory proposed by Treisman and Gelade [1], human visual perception comprises a feature registration stage and a feature integration stage. In the first stage, the visual system performs parallel, automated processing of features such as color, brightness, orientation, and size from the pattern of light stimulation. In the feature integration stage, the visual system locates the feature representations that are separate from each other; through focused attention, acting like glue, the originally separate features are integrated into a single object to complete visual perception. Researchers have long sought to imitate this mode of human visual perception, hoping that machines can convert real-world 3-D information into pictures and videos through visual acquisition devices (CCD cameras, CMOS cameras, etc.), and then process, identify, and interpret these pictures or videos to understand the real environment, so that machines can replace humans in completing various tasks. In industrial applications, this process is also called machine vision [2].
At present, although there is a large gap between machines' visual perception and comprehension abilities and the human level, machines have a much wider observation range, covering modalities the human eye cannot perceive, such as infrared [3], microwave [4], and ultrasound [5]. In addition, machine perception is noncontact and can operate for long periods in harsh working environments [6], so the theoretical research and practical application of industrial visual information perception technology have been research hotspots in various industrial fields. Especially in the era of Industry 4.0, visual perception technology is destined to become a leading technology [7].
With the rapid development of artificial intelligence technology, visual perception has become an essential research topic in the field of artificial intelligence. Computer vision is usually the pioneer, and it is gradually being adopted and developed in industry to promote the automation and informatization of industrial production, enabling machines to autonomously perform intelligent activities such as analysis, reasoning, judgment, conception, and decision making [8] to save manpower, improve efficiency, and reduce risk.
With the development of visual perception technology, there have been many review articles [14], [22]-[24], but most of them summarize a single application or technology field, and there has not been a review of visual perception as a whole. In this survey, we introduce some applications and corresponding technologies of intelligent visual perception from a macro perspective, showing its advantages and its progressive impact on human production and life. The contributions of this article can be summarized as follows.
1) Compared with other surveys in the field of visual perception, this survey summarizes the latest industrial applications of visual perception for the first time; rather than examining a single field, it covers product surface defect detection, agricultural production intelligence, intelligent driving, image synthesis, event reconstruction, and object pose measurement. These applications span the aerospace, military, ocean, medical, infrastructure, agriculture, transportation, and food fields, as shown in Fig. 1. At the same time, they basically cover the latest technologies used in current visual perception, such as image and video processing, generation, object location, recognition, detection, and 3-D reconstruction. Through these, readers can clearly understand the application fields and technical composition of intelligent visual perception.
2) We summarize some limitations and introduce the main challenges faced by current visual perception from the perspectives of both users and researchers.
3) We also put forward some foreseeable development prospects of visual perception, pointing out directions for future research in this field.
The rest of this article is organized as follows. Section II introduces different industrial applications of visual perception and the corresponding technologies, Section III discusses the challenges visual perception currently faces, Section IV illustrates its development prospects and trends, and finally, Section V concludes this article.

II. INDUSTRIAL APPLICATIONS
As visual perception becomes more intelligent and information-based, its applications have penetrated many aspects of industry [25]. In this section, we introduce some applications of intelligent vision, including product surface defect detection, intelligent agricultural production, intelligent driving, image synthesis, event reconstruction, and pose measurement. These applications directly affect people's production and daily life, are hot spots for researchers, and together cover the main intelligent visual perception technologies.

A. Product Surface Defect Detection
A high-quality industrial product not only meets the requirements of clients and production enterprises in terms of performance, but also has good aesthetics and safety in appearance. Surface defects can seriously affect the use of a product and even cause serious consequences. Different products have different defect definitions and types. Generally, surface defects are areas with uneven physical or chemical properties on the product surface, such as scratches, spots, holes, and inclusions on metal surfaces, and stains and damage on nonmetal surfaces. In the manufacturing process, the occurrence of surface defects is often unavoidable; therefore, in the industrial field, the detection of surface defects has received much attention. Vision-based manual inspection is the traditional detection method for product surface defects, but it has a low sampling rate, low accuracy, poor real-time performance, low efficiency, and high labor intensity, and it is greatly affected by operator experience and subjective factors. Detection based on machine vision can largely overcome these drawbacks [26]. The usual detection process includes predefining normal product templates, image preprocessing, target region segmentation, defect recognition, and classification, as shown in Fig. 2.
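To make the segmentation and recognition steps of this pipeline concrete, the following is a minimal sketch on a synthetic surface image, using a plain-NumPy Otsu threshold. All names, image sizes, and gray levels here are our own illustrative choices, not drawn from any cited system.

```python
import numpy as np

def otsu_threshold(img):
    """Return the Otsu threshold for an 8-bit grayscale image by
    maximizing the between-class variance over all cut points."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, 0.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# Synthetic "product surface": bright background with one dark scratch.
rng = np.random.default_rng(0)
surface = rng.normal(200, 5, (64, 64)).clip(0, 255).astype(np.uint8)
surface[30:34, 10:50] = 40  # injected defect (4 x 40 pixels)

t = otsu_threshold(surface)
defect_mask = surface < t              # target region segmentation
n_defect_px = int(defect_mask.sum())   # simple "recognition" by area
print(t, n_defect_px)
```

In a real system the mask would then be passed to a defect recognition and classification stage; here the area alone flags the scratch.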
Machine vision has enabled extensive work on the automatic detection and classification of textile defects.

1) Textile Defect Detection:
For textile defect detection, there are mainly statistical-based methods, transform-domain-based methods, and model-based methods [27], [28]. Ngan et al. [29] proposed a method based on pattern primitives to detect defects in patterned textured fabrics, using the symmetry of the primitives to calculate the moving energy variance between different primitives; the distribution of these values is learned, boundary conditions are determined, and defects are then identified. Chandra et al. [30] argued that, because it is difficult to select structural elements expediently for basic morphological operations, it is not easy to detect all kinds of defects appearing on woven fabrics; they proposed a method that utilizes artificial neural networks to obtain structural elements and performs morphological reconstruction of the binary fabric image to detect defects. Chan et al. [31] used a simulated fabric model to obtain the relationship between the fabric structure in image space and frequency space, defined two central space spectra in the 3-D spectrum, and then used the difference between the simulation model and real samples to analyze fabric defects.
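The frequency-space intuition behind methods such as that of Chan et al. [31] can be illustrated with a toy sketch: a regular weave concentrates its energy in a few frequency components, so suppressing those components leaves a residual that highlights aperiodic defects. The texture, the injected defect, and the `keep` parameter below are all illustrative assumptions, not the cited algorithm.

```python
import numpy as np

def spectral_residual_defect_map(img, keep=8):
    """Zero out the `keep` strongest components of the 2-D spectrum
    (the regular weave); what remains highlights aperiodic defects."""
    F = np.fft.fft2(img - img.mean())
    idx = np.argsort(np.abs(F).ravel())[-keep:]
    F.ravel()[idx] = 0
    return np.abs(np.fft.ifft2(F))

# Periodic "fabric" texture (stripes of period 8) with one broken region.
x = np.arange(64)
fabric = 100 + 50 * np.sin(2 * np.pi * x / 8)
fabric = np.tile(fabric, (64, 1))
fabric[20:24, 20:24] = 0  # defect interrupts the weave

res = spectral_residual_defect_map(fabric)
peak = np.unravel_index(np.argmax(res), res.shape)
print(peak)
```

Because the stripe pattern lives almost entirely in two conjugate frequency peaks, the residual map is nearly flat except where the weave is broken.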
2) Textile Defect Classification: Among machine vision-based defect classification methods, some studies use Bayesian classifiers to classify fabric defects [32]. The gray-level co-occurrence matrix method has been used to extract defect features from the image, followed by the k-nearest neighbor algorithm to classify the defects [33]. The support vector machine (SVM) has also often been used in fabric defect classification [34]. Other studies use neural networks to classify fabric defects in images or videos [35]. Although traditional image processing-based fabric detection and classification methods have achieved good results, most require manual feature extraction, which often consumes considerable computing time, so the processing is neither intelligent nor robust. Since deep learning was applied to the field of image processing, the convolutional neural network (CNN) [36] has made remarkable progress there, and it has also been widely used in textile defect detection and classification. Jing et al. [37] improved AlexNet to extract the characteristics of defective fabrics and realized the classification of yarn-dyed fabric defects. Wei et al. [38] studied the combination of the compressed sampling theorem and CNNs in the few-shot case and applied it to the classification of fabric texture defects, achieving good results. Recently, Zhao et al. [39], inspired by the human visual perception and memory mechanism, proposed a CNN model based on visual long-term and short-term memory, which greatly improved fabric defect classification.
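A minimal sketch of the classical pipeline cited in [33] — gray-level co-occurrence features followed by k-nearest-neighbor classification — on synthetic patches. The particular features (contrast and energy), the quantization level, and all data below are our own illustrative choices, not taken from the cited work.

```python
import numpy as np

def glcm_features(img, levels=8):
    """Gray-level co-occurrence matrix (horizontal neighbor, distance 1)
    reduced to two classic Haralick-style features: contrast and energy."""
    q = (img.astype(float) / 256 * levels).astype(int)  # quantize gray levels
    glcm = np.zeros((levels, levels))
    for a, b in zip(q[:, :-1].ravel(), q[:, 1:].ravel()):
        glcm[a, b] += 1
    p = glcm / glcm.sum()
    i, j = np.indices(p.shape)
    contrast = ((i - j) ** 2 * p).sum()
    energy = (p ** 2).sum()
    return np.array([contrast, energy])

def knn_predict(x, X_train, y_train, k=3):
    """Plain k-nearest-neighbor vote in feature space."""
    d = np.linalg.norm(X_train - x, axis=1)
    votes = y_train[np.argsort(d)[:k]]
    return int(np.bincount(votes).argmax())

rng = np.random.default_rng(1)
def make_patch(defective):
    patch = rng.normal(128, 10, (32, 32))
    if defective:
        patch[8:24, 8:24] += rng.normal(0, 60, (16, 16))  # noisy defect area
    return patch.clip(0, 255).astype(np.uint8)

X = np.array([glcm_features(make_patch(d)) for d in [0] * 10 + [1] * 10])
y = np.array([0] * 10 + [1] * 10)
test_feat = glcm_features(make_patch(True))
print(knn_predict(test_feat, X, y))
```

The defective patches produce far higher co-occurrence contrast than the smooth ones, so even this tiny feature vector separates the two classes.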

B. Intelligent Agricultural Production
Agricultural production is an important part of the global economy. As the global population continues to grow, urbanization will lead to a continuous reduction in the area of arable land and the number of farmers, so agricultural production systems face many challenges [46]. We must therefore seek efficient, intelligent, and information-based agricultural technologies that save manpower and material resources and promote high-quality, high-yield agricultural development [47]. Applied research on machine vision technology in agriculture started in the 1970s, when most initial research concerned the feasibility of machine vision in agricultural applications and the development of image processing and analysis algorithms. With the rapid development of computer software and hardware, image acquisition and processing devices, and image processing technology, the application of machine vision technology in agriculture continues to expand. At present, some countries have begun to apply machine vision systems to various stages of agricultural production to address increasing population aging and labor shortages [48]. This section mainly introduces two applications of machine vision technology in agricultural production: agricultural robots, and crop pest and disease monitoring.
1) Agricultural Robot: An agricultural robot is automated or semiautomated equipment that can identify targets, collaborate, and recognize color, texture, and odor characteristics [49]. It can not only greatly increase labor productivity and reduce labor costs, but also reduce the damage of pesticides to the natural environment, such as soil and water resources [50].
Research on agricultural robots began in Japan, where picking robots appeared in the early 1980s. Kondo et al. [51] developed a cherry tomato picking robot, which used color cameras to collect images, segmented fruits from the image background through thresholding, filtering, and other steps, identified the number of fruits, and located the 3-D position of each fruit by stereoscopic vision. However, due to environmental influences, it could not complete obstacle-free picking, and it was difficult to harvest short, hard fruits with inflorescences. Yaguchi et al. [52] combined an electric wheeled omnidirectional chassis, a robotic arm, a binocular stereo camera, and a two-degree-of-freedom twisting actuator into a tomato picking robot that could complete picking operations in the shallow passages of a greenhouse under natural light.
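The stereoscopic localization step used by such picking robots rests on the standard disparity-to-depth relation of a rectified binocular rig. The camera numbers below are invented for illustration and are not from any cited system.

```python
# Depth from binocular stereo: Z = f * B / d, where f is the focal length
# in pixels, B the baseline between the two cameras in meters, and d the
# disparity (horizontal pixel shift of the same fruit between the left
# and right images).
def stereo_depth(focal_px, baseline_m, disparity_px):
    return focal_px * baseline_m / disparity_px

# Illustrative numbers: an 800-px focal length, a 12 cm baseline, and a
# fruit observed with a 24-px disparity.
z = stereo_depth(800, 0.12, 24)
print(z)  # 4.0 (meters)
```

Nearer fruits produce larger disparities, which is why localization accuracy degrades quickly for distant targets.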
In recent years, to realize the automatic recognition of cherries in the natural environment, Zhang et al. [53] designed a robot vision method using median filtering preprocessing, Otsu threshold segmentation, and region threshold denoising, with which the cherry recognition success rate exceeded 96% and picking efficiency was improved. To make robots pick mature apples more efficiently and operate continuously at night, Wei et al. [54] proposed a Retinex algorithm based on guided filtering to enhance nighttime images. Improvements to vision-based control systems have expanded the application scope of agricultural robots in greenhouses, orchards, and elsewhere [55], [56], and have reduced workload and labor intensity.
2) Crop Pest and Disease Monitoring: The control of crop diseases and insect pests and weeds is the key to achieving high quality, pollution-free, and high yield in agricultural production. Traditional large-scale spraying of pesticides not only wastes resources, but also causes pollution and damage to the environment. The development of intelligent vision technology makes crop disease and pest diagnosis and weed identification faster, cheaper, and nondestructive [46].
Early on, Pydipati et al. [57] proposed a method for distinguishing diseased citrus leaves from normal ones. The method represented images with color and texture features, combined with designed feature extraction and classification algorithms, and finally realized the detection of citrus leaf disease. Mayo et al. [58] described features in moth images, experimented with various classifiers and datasets, and applied them to the automatic recognition of living moths.
In recent years, Liu et al. [59] proposed using region descriptors to simplify images containing aphids, and then used histogram of oriented gradients features and an SVM to build models for identifying aphids and monitoring their populations. Simple and easy to use, the method could be applied to investigate aphid infestation in wheat fields. Liu et al. [60] argued that traditional machine vision was limited to laboratories or pest traps for counting and identification, and developed a vision-based multispectral detector for detecting 12 pest species on crops. Recently, a method [61] was proposed for long-term pest behavior observation and integrated pest management. This work proposed a sensor network based on an integrated camera module and an embedded system that could simultaneously perform automatic detection and counting of sticky-trap pests and other tasks, achieving integrated pest monitoring.
In this section, we have introduced the advantages of visual perception in agricultural intelligence and described agricultural robots and crop pest and disease monitoring in detail, as they are representative applications of visual perception technology. Beyond these two areas, the applications of machine vision in agricultural intelligence also include agricultural product quality inspection [62], crop growth monitoring [63], agricultural vehicle visual navigation [64], and unmanned aerial vehicle (UAV) farmland information monitoring [65], as shown in Fig. 3.

C. Intelligent Driving
Vision is the main source of information for humans in all types of traffic [66]. Since the first successful attempts at autonomous driving in the 1980s [67], people have paid close attention to it. At present, research on autonomous driving has attracted a large number of researchers and investors, and many review papers have appeared [68]-[70]. Autonomous driving refers to autonomously completing environmental perception and action execution, for which vision-based environmental perception is an important source of information. Its main technologies are: detection of vehicles, pedestrians, and nonmotorized vehicles on the road [71], [72], traffic sign detection [73], lane detection [74], departure warning [75], drivable area detection [76], 3-D detection [77], [78], 3-D map reconstruction [79], and object ranging [80]. Among them, lane detection is an important link in realizing autonomous driving. In this section, we mainly introduce the application and development of visual perception technology in lane detection.
Most lane detection methods have three main steps: image preprocessing, feature extraction, and parameter curve fitting [81], as shown in Fig. 4.
1) Image Preprocessing: The main purpose of image preprocessing is to enhance image features and robustness, so that detection can adapt to a variety of conditions such as day, night, sun, and rain. Usual preprocessing methods include grayscale conversion of color images [82], gradient enhancement, low-contrast image enhancement by histogram equalization [83], image binarization by edge detectors [84], and image cropping by region of interest [85].
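Of the preprocessing methods listed, histogram equalization is easy to sketch: each gray level is mapped through the normalized cumulative histogram, stretching a low-contrast night or shadow image across the full intensity range. The toy image below is our own illustration.

```python
import numpy as np

def equalize_hist(img):
    """Histogram equalization for an 8-bit image: map each gray level
    through the normalized cumulative histogram (a lookup table)."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum().astype(float)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())  # normalize to [0, 1]
    lut = (cdf * 255).astype(np.uint8)
    return lut[img]

# A dark, low-contrast "night road" patch: gray levels squeezed into 40-79.
rng = np.random.default_rng(2)
dark_road = rng.integers(40, 80, (48, 48)).astype(np.uint8)
eq = equalize_hist(dark_road)
print(int(dark_road.max()) - int(dark_road.min()),
      int(eq.max()) - int(eq.min()))
```

After equalization the dynamic range expands to nearly the full 0-255 span, which makes subsequent edge detection and binarization far more stable.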
2) Feature Extraction: Feature extraction is a key step in detecting lanes. Liu et al. [86] obtained the light intensity and width characteristics of the lane and used a local threshold segmentation algorithm and morphological operations to accurately identify it. In contrast, Gopalan et al. [87] used the edge and texture features of the lane to achieve detection. Abramov et al. [88] proposed collecting information from multisource sensors, using graph-based simultaneous localization and mapping (SLAM) to fuse the multisource features obtained from the various sensors, and finally perceiving multiple lanes in real time from the fusion results.
3) Parameter Curve Fitting: The detected lane points usually need to be formed into lane lines by curve fitting. General algorithms usually use approximate clothoid curve models, such as quadratic curves, cubic curves, hyperbolic polynomial curves, parabolas, B-splines, and straight lines [89]. Some algorithms do not use a fixed curve model and are mainly aimed at scenarios without clear lanes, such as deserts; Broggi and Cattani [90] proposed an ant colony optimization method, a road boundary determination approach based on swarm optimization, for such settings.
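The quadratic-curve case of this fitting step can be sketched directly as a least-squares polynomial fit over detected lane points. The points below are synthetic, generated from a known curve plus noise so the recovered coefficients can be checked.

```python
import numpy as np

# Lane points in image coordinates: row index y (toward the horizon) and
# noisy column positions x along a gently curving lane.
y = np.linspace(0, 47, 20)
x_true = 0.01 * y**2 + 0.5 * y + 30          # ground-truth quadratic lane
x_noisy = x_true + np.random.default_rng(3).normal(0, 0.2, y.size)

# Fit x = a*y^2 + b*y + c, the usual local approximation of a clothoid.
a, b, c = np.polyfit(y, x_noisy, deg=2)
print(round(a, 3), round(b, 2), round(c, 1))
```

With only mild pixel noise, the fit recovers the curvature term `a` well, which is what downstream modules use for steering and departure warning.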
The detection algorithms mentioned previously mostly rely on manual feature extraction, which usually suffers from low detection efficiency, poor robustness, and poor performance on curves. In recent years, lane detection methods using CNNs have become popular, making lane detection easier to implement and highly accurate. Lee et al. [91] proposed an end-to-end multitask network that utilized vanishing-point information to simultaneously identify lanes and road markings in extreme weather, addressing rainy and low-light conditions for the first time. Pan et al. [92] proposed a new network structure, Spatial CNN, which converts the traditional layer-by-layer convolutional connections into slice-by-slice connections within a layer, enabling information to be passed between rows and columns of pixels; this enhanced the ability of CNNs to capture the semantics of long, continuous structures or large objects, such as lane lines and telephone poles. In the detection phase, a network branch was added to enable the network to directly distinguish between different lanes and improve robustness. Recently, Hou et al. [93] introduced a method named self-attention distillation (SAD), with which a CNN model can learn by itself without additional labels and achieve substantial improvements. The method not only detects well, but also runs fast with few model parameters.

D. Image Synthesis
The generative adversarial network (GAN), proposed in 2014, is an emerging neural network technique whose basic idea derives from the zero-sum game. It treats the generation problem as a contest between two networks, the generator and the discriminator: the former tries to produce data ever closer to the real data, and the latter tries to distinguish real data from generated data ever more reliably [94]. GANs can be applied to different types of signal processing, and visual image or video processing is one of their most important application fields, including image generation [95], video generation and prediction [96], object detection [97], image translation [98], image editing [99], image restoration [100], style transfer [101], and superresolution reconstruction [102].
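The zero-sum game between the two networks is usually formalized as the minimax objective of [94]:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big],
```

where the discriminator $D$ maximizes its ability to tell real samples $x$ from generated samples $G(z)$, while the generator $G$, fed noise $z$, minimizes the same quantity, driving its output distribution toward the real data distribution.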
Face synthesis has long been a research hotspot in computer vision and has achieved many remarkable results [103], [104]. With the success of GANs in image and video generation, this end-to-end approach has attracted many researchers to face synthesis. By now, GAN-based face synthesis produces faces that the human eye cannot recognize as fake, as shown in Fig. 5, and it is expected to be used in movies, animation, games, virtual reality, and elsewhere. At the same time, it has prompted much public discussion about the dangers of this technology, and many techniques for identifying computer-generated fake faces have been developed [105]. In this section, we introduce a new application of visual perception: GAN-based face synthesis and the corresponding fake face recognition techniques.
1) Face Synthesis: In 2016, Isola et al. [106] modified the objective function of the conditional GAN [107] along with the generator and discriminator architectures so that the network could learn both the mapping between input and output images and the loss function during training, providing a new framework for pixel-to-pixel conversion. On this basis, Wang et al. [108] used a multiscale generator and discriminator to implement interactive semantic editing for generating high-resolution images. Based on the pix2pix principle [106], the face swapping software Face2Face aroused wide interest; it could control the facial expressions and movements of people in TV or videos through cameras and face tracking software. Shen et al. [109] learned a symmetrical triad GAN to ease GAN training difficulty; it could generate faces with multiple perspectives and expressions while retaining the person's identity. For video-to-video synthesis, Wang et al. [110] added optical flow constraints to the generator and discriminator and designed a spatio-temporal objective function to address inconsistency between adjacent frames during video-to-video conversion, which applied well to face swapping. Recently, to overcome earlier models' inability to work in few-shot settings and their heavy data requirements, Wang et al. [111] introduced few-shot video-to-video synthesis by adding an attention mechanism to the network. Beyond the techniques discussed previously, some studies change specific facial attributes, such as aging [112], makeup [113], and complexion [114].
2) Forgery Detection: With the continuous advance of face faking technology, public demand for identifying fakes is growing, and more and more researchers have begun to study forgery detection methods. Two scientists from the Idiap Research Institute in Switzerland comprehensively evaluated how effectively face recognition methods detect DeepFake [115], a popular face swapping tool, and found that general face recognition algorithms such as FaceNet [116] perform extremely poorly at identifying GAN-generated faces; they argued that only image-based methods could effectively detect DeepFake videos. At present, many detection algorithms are proposed specifically for detecting forged images. Cozzolino et al. [117] used a CNN for this task, with performance significantly better than traditional detectors. Later, many CNN-based image forgery detection technologies were developed, such as [118], [119], and Rossler et al. [120] evaluated the performance of related technologies in face forgery detection. Addressing the problem that face fraud technology is often applied to national leaders, Agarwal et al. [105] studied the facial expressions and movements of people while speaking and used their correlations to distinguish real from fake faces; the probability of identifying fake videos reached 92%, and they stated that future work on the rhythm and characteristics of speakers' voices would further improve detection accuracy.

E. Event Reconstruction
The development of the world's information industry has experienced two major waves: the computer and the Internet. With the rapid development of mobile communication and perception technology, a large number of innovative applications and services have emerged, quickly bringing us into the third information industry revolution: the Internet-of-Things (IoT) [121]. In the IoT era, people increasingly use mobile smart terminals equipped with cameras and various sensors, such as laptops, smartphones, GPS devices, smart bracelets, automotive sensing devices, and smart watches. The large amounts of data obtained by these mobile terminals are connected through networks (Wi-Fi, 3G/4G/5G, Bluetooth, etc.) to form a group-aware network, which enables more comprehensive, large-scale perception of physical objects and environmental conditions in the real world [122], [123]. This greatly expands the dimensions of human perception of the world, changes the way people perceive it, and opens up a new field of the mobile Internet: mobile crowd sensing (MCS) [124], whose architecture is shown in Fig. 6. At present, MCS has entered a stage of rapid, deep development and has penetrated many aspects of society, such as intelligent transportation [125], infrastructure and municipal management services [126], environmental monitoring and early warning [127], and social relations and public safety services [128].
In MCS, the built-in camera of the mobile device remains an extremely important sensing modality, and research on vision-based MCS has attracted many researchers. Guo et al. [129] put forward the concept of visual crowd sensing (VCS) and summarized its task models, characteristics, important technologies, and applications in recent years. According to that summary [129], VCS applications can be divided into: floor plan generation [130], scene reconstruction [131], event reconstruction [132], indoor localization [133], indoor navigation [134], personal wellness and health [135], disaster relief [136], and city awareness [137]. In most cases, MCS outperforms traditional visual perception methods that rely on fixed visual perception devices for monitoring. Event reconstruction is closest to people's daily life, so it has high research value. In this section, we discuss the development and significance of VCS-based event reconstruction technologies.
With the popularity of the wireless Internet and smartphones, people can record events in their lives as pictures or videos and share them with others, for example on the popular short video platforms Vine, Instagram, and Douyin. Thousands of users record events in all corners of the world this way, which not only broadens people's horizons, but also provides ample data for researchers in various fields [138]. Bao and Choudhury [132] proposed a smartphone-based on-demand system, MoVi, which used smartphones to cooperatively sense the surrounding environment and performed video recording based on event trigger points (laughter, etc.); videos recorded on different phones were spliced into video highlights to provide users with key social information. Giridhar et al. [139] introduced an adaptive positioning algorithm that used image information from the social network Instagram to locate urban events in time and space, allowing people in other cities to experience a current event remotely through the eyes of a witness. Bano and Cavallaro [138] proposed a framework that matched and clustered user-generated videos in time and space, automatically grouping and aligning videos captured simultaneously by multiple user devices from different locations to complete event reconstruction; participants can then review the entire event from different perspectives through information provided by others. Bohez et al. [140] introduced an integrated framework that mixes users' phone shooting perspectives with professional camera footage displayed during an event; the framework could transmit, process, and display hundreds of user videos in real time in an ultradense Wi-Fi environment.
Some studies focus on user feedback on online videos to evaluate video quality and classification, and then feed the results back to users. Singhal et al. [141] analyzed the emotions of multiple users watching the same video through electroencephalogram signals, including sadness, happiness, and neutrality, and associated the video with these emotions; they then adopted a crowdsourcing mode [142] to summarize and evaluate the video and extract a video summary, letting users better understand it. VCS-based event reconstruction will also have an impact on e-commerce. Recently, Diwanji and Cortese [143] examined users' product review videos posted after purchase and found that such user-generated videos greatly affect other consumers' perceptions, attitudes, and purchase intentions, providing an important management reference for online sellers.

F. Pose Measurement
Object pose measurement is another important application direction of visual perception; it refers to obtaining the three position parameters and three attitude parameters of a target in a specific coordinate system, which can be the world, object, or camera coordinate system. Object pose measurement has very important applications in the fields of robots [146], aerospace [145], industrial production [147], rotorcraft [148], vehicles [149], and the ocean [150]. For example, in space docking between a spacecraft and a target spacecraft, it is indispensable to accurately measure the relative position and attitude parameters between them. The same holds in industrial production: only by accurately measuring the pose of a part can an industrial robot grasp it in a prescribed posture and align it for installation. Vision-based pose measurement is noncontact, highly accurate, and stable [162], which is of great significance for improving industrial production efficiency. The monocular method is the mainstream of pose measurement; its biggest advantage is that the equipment is simple and easy to implement [151]. There are also binocular methods, which add auxiliary depth information to the RGB image to help improve measurement accuracy [152].
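The geometric relation underlying such vision-based pose measurement is the standard pinhole projection: a world point $[X, Y, Z]^\top$ maps to a pixel $[u, v]^\top$ through the intrinsic matrix $K$ and the pose $(R, t)$ being sought,

```latex
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
= K \, [\, R \mid t \,] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix},
\qquad R \in SO(3), \; t \in \mathbb{R}^3 ,
```

where $s$ is a scale factor, $R$ carries the three attitude parameters, and $t$ the three position parameters. Monocular methods recover $(R, t)$ from several such 2-D/3-D correspondences (the perspective-n-point problem), while binocular methods add depth cues to constrain the solution.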
Most traditional pose measurement methods are based on geometric features. These methods depend to some extent on the texture of the target surface and are susceptible to factors such as lighting, occlusion, and complex backgrounds. Later work mostly used feature descriptor-based methods, training classifiers on distinguishing descriptors constructed around the feature points of objects [153], [154]. Gee and Mayol-Cuevas [155] proposed a method for estimating the 6-D pose of the camera using RGB-D information. It extracted points of interest from the image based on sparse features, described them with local descriptors, and then matched them to a database. Sparse feature-based methods share a weakness with traditional geometry-based methods: both struggle to recognize objects with little texture. Other studies used dense feature-based methods that predict the desired result from every pixel. Brachmann et al. [156] introduced a method for estimating the 6-D pose of a specific target from a single RGB-D frame by using a new representation that combined dense 3-D target coordinates and object class labels. The method could flexibly handle textured or untextured targets and was robust under different lighting conditions. The authors later improved on this work [157], proposing a method that estimates the 6-D pose from only a single RGB image by marginalizing out the depth channel and using color alone. There are also studies that use template-based matching, scanning images with a fixed template to find the best match. In [158], the authors densely sampled the object to be detected by rendering over the possible SE(3) space, extracted a sufficiently robust set of templates, and then matched the templates to estimate the pose.
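The core of template-based matching can be sketched in a toy 2-D form (a hypothetical numpy example, far simpler than the SE(3)-sampled templates of [158]): slide the template over the image and keep the location with the highest normalized cross-correlation.

```python
import numpy as np

def match_template(image, template):
    """Slide `template` over `image` and return the top-left corner of
    the best match under normalized cross-correlation (NCC)."""
    ih, iw = image.shape
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.linalg.norm(t)
    best_score, best_pos = -np.inf, (0, 0)
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            patch = image[y:y + th, x:x + tw]
            p = patch - patch.mean()
            denom = np.linalg.norm(p) * t_norm
            score = (p * t).sum() / denom if denom > 0 else 0.0
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_pos

# Toy image with a distinctive 2x2 pattern at row 3, column 4.
img = np.zeros((8, 8))
img[3:5, 4:6] = [[1.0, 2.0], [3.0, 4.0]]
tmpl = np.array([[1.0, 2.0], [3.0, 4.0]])
pos = match_template(img, tmpl)   # -> (3, 4)
```

Real template-based pose estimation repeats this search over thousands of rendered viewpoints, which is why the CNN-based methods discussed next aim to replace the explicit template scan.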
In recent years, CNNs have also been widely used for vision-based object pose estimation in industrial production, spacecraft docking, and robotics. This process can be summarized as in Fig. 7. Building on [158], Wohlhart and Lepetit [159] trained a CNN on object types and object view templates together to learn descriptors representing object type and pose, achieving promising results on low-texture objects. Kehl et al. [160] extended the 2-D detector SSD to 3-D object detection and full 6-D pose estimation using only RGB data and training on renderings of synthetic models. For each 2-D detection, the most likely viewpoint and in-plane rotation were analyzed, and a series of 6-D hypotheses was generated, from which the best one was selected as the result. Recently, Yang et al. [161] proposed a CNN-based target pose measurement method that directly regresses the 6-D pose of the object, eliminating the templates used by previous methods and achieving a simpler pipeline with faster speed and higher accuracy.
To compare the listed applications of visual perception more intuitively, we summarize the relevant fields and technologies of each application in Table 1.

III. SERIOUS CHALLENGES FACED BY VISUAL PERCEPTION
With the development of software and hardware technologies such as parallel computing, cloud computing, and machine learning, visual perception technologies have improved greatly, and their applications have taken root in various fields. However, current visual perception still faces many problems. The technology and its applications are not mature enough in many respects, and in some cases cannot yet be applied to actual production and daily life. In this section, we analyze the challenges faced by current visual perception.

A. Vision Acquisition
Most existing vision acquisition methods use various sensors to convert perceptual information into images or videos. For example, the most common CCD and CMOS cameras convert incident light into electronic signals. The quality of vision acquisition and imaging technology directly affects the fidelity of the information and is an important foundation for visual information processing. Existing vision acquisition equipment and imaging technology have made significant progress, such as high dynamic range, global shutter, near-infrared enhancement (NIR+), RGB-IR, and power scalability. However, under real-world lighting changes and lens distortion, current vision acquisition and imaging technologies still sometimes fail to reflect the real world accurately. Backward vision acquisition equipment and imaging technology may become an obstacle to the development of visual perception technology.

B. Information Security
With the combination of artificial intelligence and visual perception technology, examples in which what is perceived is not true keep emerging, so the security of visual information deserves particular attention. Consider the GAN-based face synthesis technology mentioned earlier: criminals have used AI face swapping to impersonate national leaders in fabricated speeches and to interfere with presidential elections. If society cannot detect such forgeries in time, there may be serious consequences [105]. The security of visual perception is a key issue that researchers must take seriously during its rapid development.

C. Speed, Accuracy, and Robustness
The tradeoff between speed and accuracy has always been an important issue in visual perception, especially in computer vision [163]. Increasing processing speed tends to reduce the information acquisition and analysis capability of deep networks, and vice versa. The importance of this tradeoff is self-evident: in automatic driving, for example, obstacle detection that cannot run in real time, or recognition that is not accurate enough, hinders the realization of autonomous driving [70]. Moreover, no current machine vision technology achieves batch detection in the true sense while ensuring extremely high accuracy, minimal false detections, and no missed detections. This unmet goal lowers the application expectations of machine vision.
Because the real world is highly variable, the visual information people collect is also diverse, and current visual perception and processing technologies often cannot adapt to changing visual conditions such as light intensity and shadows. The low robustness of algorithms is a universal problem in this field.

D. Construction in Deep Learning
CNNs under deep learning are currently widely used in processing visual images and videos. Their theoretical problems mainly concern statistics and computation. For any nonlinear function, both a shallow network and a deep network can be found to represent it, and the deep model generally represents nonlinear functions more efficiently than the shallow one. But the representability of deep networks does not imply learnability [164]. That is to say, deep learning is not intelligent enough: it is often accompanied by overfitting and underfitting problems [165] and requires the support of big data, whereas humans achieve comparable functions without performing a large number of calculations. Therefore, deep learning alone cannot serve as the main route to intelligent vision. Whether in terms of learning or implementation, the intelligence of visual perception still faces a severe test.
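The overfitting problem mentioned above can be seen even in a toy setting with no neural network at all (a hypothetical numpy sketch): a high-capacity model fits a handful of noisy training points almost perfectly, yet this near-zero training error says nothing about how well it generalizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Eight noisy training samples of an underlying sine function.
x_train = np.linspace(0.0, 1.0, 8)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.1, 8)

# Clean held-out test points from the same underlying function.
x_test = np.linspace(0.05, 0.95, 50)
y_test = np.sin(2 * np.pi * x_test)

def fit_eval(degree):
    """Fit a polynomial of the given degree; return (train_mse, test_mse)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

train3, test3 = fit_eval(3)   # moderate capacity: nonzero training error
train7, test7 = fit_eval(7)   # interpolates all 8 points: training error ~ 0
```

The degree-7 fit drives training error essentially to zero by also fitting the noise, which is the same pathology that, at a vastly larger scale, motivates the regularization and big-data requirements of deep networks.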

E. Computing Power and Device Volume
The success of computer vision depends not only on deep learning and large-scale data, but also on the computing carriers that implement them, such as the central processing unit, graphics processing unit, application-specific integrated circuit, and field-programmable gate array [166]. Future visual perception technology will likewise be inseparable from these computing units, and insufficient or slow computing power will restrict its development. The volume of integrated computing devices is also an important factor. At present, many companies are producing high-performance development boards for edge computing, such as the Jetson TX2 and Jetson AGX Xavier. Small computing devices are of great significance to the practical application of algorithms, but they still suffer from problems such as slow speed, insufficient computing power, and limited memory.

F. Combination of Software and Hardware
The convergence of hardware and software has reached a turning point: the two are no longer independent of each other but increasingly show a mirrored dependency. However, since software and hardware are two quite different fields, many researchers in visual perception fail to implement their excellent algorithms well on hardware. The combination of software and hardware is therefore also a challenge in this field.

IV. DEVELOPMENT PROSPECTS OF VISUAL PERCEPTION
Vision is the most important source of information for humans to understand the world, and research on visual perception and processing will always accompany humanity's scientific progress. In this section, we introduce its future development directions and trends based on the current challenges of visual perception, as shown in Fig. 8.
1) Multisource information fusion technology will become a hot research topic in the future. A single vision sensor has a specific range of use and shortcomings such as limited information and lower accuracy, whereas different visual sensors have specific advantages: an ordinary visible-light camera is good at acquiring color and shape information, lidar can obtain depth and point cloud information, infrared detectors can sense ambient temperature, hyperspectral sensors can improve the ability to detect the attribute information of ground objects, and so on. Multisource information fusion has always been an effective way to maximize the amount of information, and it will remain an important research direction. On the one hand, researchers can focus on sensors and hardware devices that acquire more visual information simultaneously, improving the ability to capture visual information and compute on big data. On the other hand, on the software side, fusion algorithms with high precision, low latency, and low computational cost will be further improved to achieve more reliable and accurate results for specific visual perception tasks.
2) Active vision and visual question answering are hotspots in computer vision and machine vision research today, and will be an important direction for solving current visual perception problems. Here, the vision system actively senses the environment and, according to certain rules, lets the computer actively extract the required image features and answer questions about the picture. Active vision may integrate multiple artificial intelligence methods, such as reinforcement learning and other unsupervised or weakly supervised learning, which may help move research beyond its current over-reliance on mathematical modeling and calculation and meet requirements for system speed and intelligence.
3) Visual perception will develop towards higher adaptability and robustness across different tasks, which may involve domain adaptation and meta-learning. Domain adaptation is a subdiscipline of machine learning that deals with using models trained on a source distribution in the context of a different target distribution. Given the amount of training data required for a new computer vision task, the performance of deep domain adaptation comes closer to human intelligence. Progress in this field is critical to the entire field of computer vision, and deep domain adaptation can ultimately let people reuse effective and simple knowledge across vision tasks. Similarly, meta-learning is intended to let machines learn to learn. Once a machine has the ability to learn, it can quickly adapt to different tasks, so meta-learning is also an important direction for improving the robustness of future visual perception.
4) Visual crowd sensing is a technological idea that fits the trend of world development. As humans enter the age of the IoT, valuable data are gradually becoming socialized, shared, and experiential. In VCS, pictures and videos can contain richer information and are more closely related to the environment and to other people. The volume of data is larger and conforms to the development idea of the IoT, so VCS may become a mainstream technology in visual perception. Similarly, federated machine learning [167] is an emerging basic artificial intelligence technology, proposed to solve the problem of data islands and to strengthen data security. Research on federated learning has continuously emerged in recent years and will lead the next wave of commercialization of machine learning technology. Federated learning is also a new road for the development of visual perception under the tide of the IoT.
5) Global Internet and semiconductor giants have made strategic layouts in this area, showing that intelligent image and video processing will be the next arena, which may mean that vision technology is ushering in a golden period of development.
In the future, visual perception will continue to make breakthroughs in applications such as UAVs, autonomous driving, smart healthcare, smart security, and smart cities.
Exploring new technical support and application areas is always the trend of visual perception development.
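As a toy illustration of the multisource fusion idea in item 1) above (a hypothetical numpy sketch with made-up numbers; real systems fuse far richer modalities such as RGB, lidar, and infrared), independent sensor estimates of the same quantity can be combined by inverse-variance weighting, so that the more certain sensor contributes more and the fused estimate is more certain than either alone:

```python
import numpy as np

def fuse_estimates(means, variances):
    """Fuse independent estimates of one quantity by inverse-variance
    weighting: weight each sensor by 1/variance, then normalize."""
    means = np.asarray(means, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    fused_mean = (w * means).sum() / w.sum()
    fused_var = 1.0 / w.sum()          # always <= the smallest input variance
    return fused_mean, fused_var

# Hypothetical depth readings (meters): a noisy camera estimate and a
# much more precise lidar estimate of the same obstacle.
mean, var = fuse_estimates([10.4, 10.0], [0.5 ** 2, 0.1 ** 2])
```

Here the fused depth lands close to the lidar reading because its variance is 25 times smaller, while the fused variance drops below that of either sensor, which is the quantitative sense in which fusion "maximizes the amount of information."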

V. CONCLUSION
Overall, in this article, we reviewed and analyzed several major application fields of visual perception, including industrial quality inspection, agricultural production, autonomous driving, visual fraud, and crowd sensing. Specifically, we introduced textile defect detection in product surface inspection; agricultural robots and agricultural pest and disease monitoring in intelligent agricultural production; lane detection in autonomous driving; image synthesis and forgery detection in visual fraud; event reconstruction in crowd sensing; and object pose measurement. These applications basically cover the popular visual perception research directions of recent years, including image and video classification, segmentation, object detection, tracking, image and video generation, forgery detection, 3-D reconstruction, and multisource information fusion. We can conclude that most current visual perception technologies and applications are combined with artificial intelligence, are helpful to human production and life, and offer low cost, high precision, and high efficiency.
In addition, based on the status quo, we analyzed the challenges humans currently face when using visual perception technology, including vision acquisition, computing power, device volume, technology security, speed, accuracy, robustness, intelligence, and the combination of software and hardware. Based on these challenges, we made predictions about the development prospects of visual perception. In the future, visual perception will be more closely integrated with artificial intelligence and will move towards multisource information fusion, active vision, domain adaptation, meta-learning, reinforcement learning, federated learning, crowd sensing, and other directions, and visual perception technology will be applied in more fields. With the continuous development and intelligentization of visual perception technology, human production efficiency and quality will continue to improve, which will be one of the important driving forces of human social progress.