Computer vision is a field in computer science that falls under the umbrella of artificial intelligence (AI). Computer vision (CV) software developers strive to give computers the ability to process images in much the same way that humans do. They expect the computer will be able to identify objects, to make appropriate decisions based on what it “sees,” and then to produce relevant outputs.
Today, facial recognition software, autonomous vehicles, certain forms of surveillance, and gesture recognition are just a few examples of CV systems at work.
Why is computer vision so complicated? Every parent can recall their child going through phases when “what's that?” became a recurring question. The child was building a collection of words (definitions) and memories that inform and allow the child to process what they're seeing — where “process” means to recognize what they're seeing, then to make judgments and decisions about those objects, and often, to take specific actions.
We see the world around us with our eyes, which couple to the human nervous system. Even today, scientists don't understand precisely how photons reflect off objects in the environment, enter our eyes and are translated into the constellation of things we refer to — and respond to — as “the world around us.”
This riddle of how perception takes place and makes us consciously aware of our surroundings has puzzled scientists and philosophers for decades; it is known as the "hard problem of consciousness." This philosophical and neurological question, coupled with the complexity of the brain and nervous system, makes modeling a CV software solution after the human experience of perception nearly impossible.
However, with advances in machine learning, deep learning, and neural networks, computers are becoming more adept at defining, categorizing, and remembering objects. Even so, very young children continue to outpace today's CV systems. Despite humanity's naturally superior perceptual skills and capabilities, CV continues to advance at a rapid pace.
One might ask why it was necessary to design software and hardware that allowed computers to read text from printed pages. Ray Kurzweil[2,3] answered that question when he brought a machine to market in the 1970s that could scan text and read it to visually impaired people via voice synthesis.
That innovation served as a flashpoint that spawned an entire industry segment based on optical character recognition (OCR). Today, computers can read printed text in a variety of writing systems (e.g., Latin, Cyrillic, Arabic, Hebrew, Chinese, Japanese, Korean, and others) and convert it to text strings suitable for computer storage and processing. Scanning technology has streamlined almost countless business processes, ranging from data entry to Project Gutenberg's book scanning to vehicle license plate recognition.
Optical character recognition is no longer considered an example of AI or CV. It's been commercialized and has become commonplace, but it serves as an excellent example of how CV can revolutionize many aspects of the human experience. Therein lies the primary value of teaching computers to “see.”
But there's another reason to continue pursuing CV. Once software has been developed to solve a particular problem, that solution can be applied to similar problems. For instance, late-20th-century efforts at speech recognition driven by DARPA, IBM, Microsoft, and others have matured into everyday conveniences. Anyone who uses a smartphone, a computer, or an Alexa-like device can issue voice commands. Now, speech recognition can be "dropped into" a new product almost as a plug-and-play exercise — which is just what Samsung did when it released a line of refrigerators you can talk to.
More critically, CV is finding uses in, and improving, medical screening and diagnosis, physical security of premises, and manufacturing, to name a few areas.
First, computers don't "see" images the way people do; they see only numbers. Each pixel that makes up an image has a numeric value. In a grayscale (black and white) image, a pixel can have a value ranging from zero (black) to 255 (white), which fits in an 8-bit memory location. If the image has color, each pixel is represented by a 24-bit number that can range from zero to just over 16 million. Imagine a still image converted, one pixel at a time, to numbers. That's what a CV application "sees": a matrix — an array — filled with numbers that each represent a tiny part of the overall image.
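To make the pixel-to-number idea concrete, here is a minimal Python sketch. The pixel values are invented for illustration; the point is simply that a grayscale image is an array of 8-bit numbers, and a color pixel packs three 8-bit channels into one 24-bit value.

```python
import numpy as np

# A tiny hypothetical 3x3 grayscale "image": each pixel is one 8-bit
# number from 0 (black) to 255 (white).
gray = np.array([[  0, 128, 255],
                 [ 34, 200,  90],
                 [255, 255,   0]], dtype=np.uint8)

# A color pixel packs three 8-bit channels (red, green, blue) into
# 24 bits, giving values from 0 up to 2**24 - 1 (just over 16 million).
r, g, b = 200, 100, 50
packed = (r << 16) | (g << 8) | b

print(gray.shape)   # the dimensions of the "matrix" a CV application sees
print(packed)       # one color pixel as a single 24-bit number
```

This array of numbers, not the picture itself, is the only thing the software ever works with.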
A static image presented to a CV system is a two-dimensional representation of physical objects in our three-dimensional world. Images contain shadows and other gradients of intensity and may contain more than one object. Analysis of the image that’s so natural to human observers requires a great deal of processing (see below) before a CV system can determine the contents of the scene.
In the case of videos, CV software treats each frame as one of a series of flat 2-D images streaming into the system, converting each frame into an array as explained above. Processing video requires significant computing power. For instance, each frame of a 1080p color video contains 49,766,400 bits (1920 x 1080 pixels x 24 bits per pixel). With most videos running at 30 frames per second, converting each frame to numeric pixel values is a massive job.
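The arithmetic behind those figures is worth spelling out. Using the numbers from the text, a short Python calculation shows why raw video is so demanding:

```python
# Bits in one uncompressed 1080p color frame
width, height, bits_per_pixel = 1920, 1080, 24
bits_per_frame = width * height * bits_per_pixel
print(bits_per_frame)          # 49766400 bits per frame

# At 30 frames per second, the raw pixel stream approaches 1.5 Gbit/s
fps = 30
bits_per_second = bits_per_frame * fps
print(bits_per_second / 1e9)   # roughly 1.49 Gbit/s before any processing
```

That is the data rate before the system has done any actual recognition work at all.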
Developers and engineers working in AI and CV must master many complex and detailed knowledge domains. However, the key steps and technologies involved in CV can be summarized as follows.
Once the image or video has been converted to numeric pixel values, features within the image need to be defined — a time-consuming and expensive process, traditionally performed by humans, called feature engineering. A feature can be an edge of an object, a corner, a blob (an area where most pixels are reasonably similar to one another but different from surrounding areas), or another notable characteristic. Fortunately, various deep learning procedures can help automate this process. [29,30]
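To illustrate what a "feature" looks like numerically, here is a small Python sketch. The toy image and the hand-rolled filtering helper are for illustration only; the 3x3 kernel is the standard Sobel x-kernel, which responds strongly wherever image intensity changes from left to right — that is, at vertical edges.

```python
import numpy as np

def filter2d(image, kernel):
    """Naive sliding-window filtering (cross-correlation, 'valid' mode) —
    just enough to show how a kernel scans an image for a feature."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y+kh, x:x+kw] * kernel)
    return out

# Sobel x-kernel: large response where pixel values jump left-to-right.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

# Toy 5x5 image: dark left region, bright right region — one vertical edge.
img = np.zeros((5, 5))
img[:, 3:] = 255

edges = filter2d(img, sobel_x)
print(edges)  # near-zero in flat areas, large values along the edge
```

Deep networks learn stacks of kernels like this one automatically instead of having engineers design them by hand.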
Without prior training, a CV system cannot "know" what various objects look like. Therefore, thousands (even millions) of images are presented to the system so it can gradually learn the difference between, say, an airplane and a toothbrush. When CV is used in a specific knowledge domain — say, ophthalmology, to diagnose pathologies of the retina — the system is trained on images of healthy, diseased, and damaged retinas.
As pixel values are analyzed, CV software uses filters and neural network computing to estimate what the image contains. Each estimate is fed back through the layers of the neural network several times, with each pass refining the estimate. With enough processing, the system eventually reaches a likely conclusion about what the image contains and presents its judgment as a probability — a percentage. For example, it may determine the image includes a damaged retina with, say, 91 percent certainty.
CV plays a more significant role in our daily lives than most people realize. Here are just a few examples from a sampling of domains.
Large enterprises taking part in a digital transformation over the past several years have recognized that a good deal of corporate data is stored in binary form — largely in photographic images and video files. Even those companies that have mastered the challenges of capturing unstructured big data find that binary data requires a different approach. CV is the solution to that latest conundrum.
Retailers like eBay have launched CV tools that confirm "a picture is worth a thousand words." Rather than searching eBay with textual descriptions and guessing at the keywords that will find the desired product, customers can now submit a picture of a physical object — say, a clothing item, a football, or a lawnmower — which triggers the CV app to find similar and matching items. Web-savvy customers can also find the URL of an item's image on any website, then submit it to eBay to search its inventory of more than one billion items.
In a similar vein, since 2016, Shutterstock has allowed customers to shop among their 70 million photo and video images by uploading a picture or dropping it into its query box. Google has offered a similar CV functionality for even longer. 
In the automotive industry, the promise of fully autonomous vehicles has spawned exceptional levels of investment over the past decade. All the traditional automakers, as well as Waymo, Tesla, and others are gradually working toward building what is known in the industry as SAE Level 4 and Level 5 vehicles — which can operate safely without human input to the driving task.  As has been well documented in the popular press, several technologies, including LIDAR, radar and conventional video serve as inputs to the CV systems in such vehicles.
Companies like Orbital Insight have brought satellite imaging to market. The company uses CV and AI to turn raw satellite data into actionable information that's used by investors, governments, and corporations to identify opportunities and manage risk. For example, insurance companies use satellite imaging to predict sales at shopping malls and to monitor oil production. [23,24] Governments and insurance companies use it to document damage caused by hurricanes and other natural disasters.
There's no question that medical applications for CV can positively affect health and longevity. A partnership between Massachusetts General Hospital, Nvidia, and Google established in 2016 aims to use its database of 10 billion medical images to improve patient care and outcomes. This effort began with radiographic images used to identify pathological conditions, then moved on to electronic health records and genomics. [24,25]
Beyond shortcomings in computer chips and servers designed for CV applications, there are a few philosophical and process problems.
Science has yet to understand how the human mind works; the "hard problem of consciousness" remains unsolved. In similar fashion, understanding how artificial intelligence works at the granular level has been called "the black box problem of AI." Even the designers, data scientists, and engineers who create an AI rarely understand the individual steps that lead to its conclusions. There's an ongoing call throughout the AI community to build transparency into AIs so humans can at least evaluate the processes that lead to a conclusion.
Because several of AI's component technologies — machine learning, deep learning, neural networks, to name a few — are all foundational to computer vision, the questions and uncertainty surrounding the inner workings of an AI apply as well to CV. How, for example, did Google's attempt to use CV to identify a weight-lifting dumbbell result in images that often had a human arm attached? And how was that strange “bug” in the system discovered and repaired? Moreover, when AI and CV are used for, say, military targeting systems — or when an autonomous vehicle encounters a unique new situation — we're best served if we can know the system will choose the safest solution.
While the black box problem doesn't itself limit the advancement of CV, the attention researchers and developers must devote to transparency and safety — desirable as those goals are — could slow the field's progress.
Humans have general knowledge that allows us to understand scenes, whether they're captured in a still image or a video, or encountered in the physical world. For instance, even a child viewing a scene that contains a person riding a horse understands that the two are individual entities, and that riding is a form of transportation. AI and CV systems, once again, lack the broad experience humans have, and they have no human consciousness. This may well be a long-lasting, perhaps permanent, limitation on advancement in the field.
As noted earlier, CV software processes numbers that represent the intensity and color of pixels that make up a given image. The image itself is a translation of the real world projected through high quality, but imperfect, camera systems. Image compression, noise, and artifacts reduce the accurate representation of the real world further. Such degradations are perhaps impossible to overcome entirely, and they impose limits on the utility of CV systems. 
The enjoyment of one's privacy has always been considered a fundamental human right, but it's often difficult to maintain in the online world. Nowhere is this truer than in China, where the government is issuing each of its citizens a "social credit score" that determines what privileges one may enjoy and what penalties and restrictions one must endure. Cameras and CV software are the foundation of this intrusive, "Big Brother" social project. According to CBS News, China has installed nearly 200 million surveillance cameras, a number projected to grow to 600 million by 2020.
Citizens are tracked throughout the day with government officials noting where each person goes, what he does, and with whom he meets. An infraction as simple as jaywalking can lower one's social credit score. Those with the lowest scores find themselves without permission to travel by plane or train, unable to buy a car or enroll their children in a private school. 
In the U.S., facial recognition is cropping up in unexpected places. The NFL has screened people's faces as they entered the stadium for the Super Bowl. Retailers including Walmart and Walgreens use it, too. Walgreens is testing soft drink coolers that look at customers' faces and classify them by gender, age, and what they buy, all in hopes of collecting more actionable marketing intelligence.
The ACLU and Electronic Frontier Foundation have each weighed in on the domestic use of facial recognition, pointing out that facial recognition systems could lead to a compromise of civil liberties.
Research into CV is underway at all the technology giants today. Key players have all released cloud-based CV packages designed for businesses that want an "easy start" into the complexities of AI and CV.
For instance, Google’s Perception research team aims to “interpret, reason about, and transform sensory data,” which includes still pictures, videos, music, and sound. To teach machines to understand human actions taking place in videos, Google personnel have assigned labels to people involved in various activities within selected YouTube videos. They've built a dataset that contains 96,000 people and more than 210,000 actions that fall into some 80 classes: talking, listening, sleeping, standing up, and so on. This research project, one of many that focus on CV, is expected to improve computers' ability to accurately recognize human actions within video files.
Amazon, too, has considerable CV research efforts underway, and its Rekognition software provides visual search and object classification as a SaaS package that can be trained or can work from pre-trained algorithms. The package allows fast, easy addition of facial recognition to existing applications. At present, it can recognize gender, age range, emotions, and a variety of facial details.
At IBM, teams have repurposed the Watson system to perform visual recognition tasks. The IBM Watson Visual Recognition product analyzes images for faces, scenes, text, and other objects. This, too, is a cloud-based CV solution.
A one-liner that tries to define AI suggests that "artificial intelligence is what we can't make computers do today." For instance, forty years ago optical character recognition was considered an example of AI; today, it's not. CV today can identify objects within a photograph, a video, or on a street sign, but pundits predict it will become commonplace in our electronic devices and will evolve into a technology people use every day — much like we use smartphones now.
Just as GPS receivers have shrunk from handheld appliances down to tiny chips that fit inside a smartphone, CV is expected to fit on a silicon chip or two that contain neural network computing and other required components. Connecting such devices to cloud analytics servers will enable every industry, government, and social segment to engage more fully in their unique activities, both business and personal. It's even conceivable that companies may use CV tools in tandem, where one CV system makes recommendations to another CV, then to another, so that the final outcome is fully vetted and optimized. 
Based on the trajectories already in place, CV is going to have significant effects on medical diagnoses, which can improve medical outcomes and help extend life expectancies. Similarly, CV is already improving manufacturing by identifying defects as items pass cameras on the assembly line. Businesses across many industries are already ramping up productivity by letting CV applications do what humans used to do — whether curating legal documents or handling automobile insurance claims.
As Alan Turing famously said, “I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted.” While he may have been a century too early, there’s no question that the advances underway in deep learning, AI, and computer vision will prove his prediction right.
Sanjyot Gindi leads the computer vision team at Skan.AI. A graduate of Purdue University, Sanjyot has over a decade of experience in building advanced computer vision applications in many sectors. An accomplished author and patent holder, Sanjyot also conducts workshops in Computer Vision to advance the practice.
“One of the things you don’t ever want to do is to automate a bad process. You are just going to make bad things happen faster, and that is not what anyone wants.”