
What is a Convolutional Neural Network?

A convolutional neural network (CNN or ConvNet) is a type of neural network that is especially good at working with grid-like data such as images, video frames, and sensor maps. Instead of looking at all input pixels at once with one big layer, a CNN scans small filters across the input and learns patterns like edges, textures, and shapes that repeat across the image.
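To make the scanning idea concrete, here is a minimal sketch of a single convolution step in Python with NumPy. The edge-detecting kernel and image size are illustrative assumptions, and, like most deep learning libraries, it computes cross-correlation (the kernel is not flipped) rather than textbook convolution:

    import numpy as np

    def conv2d(image, kernel):
        # Slide the kernel over the image (valid padding, stride 1),
        # taking a dot product with each local patch.
        kh, kw = kernel.shape
        out_h = image.shape[0] - kh + 1
        out_w = image.shape[1] - kw + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.random.rand(28, 28)             # e.g. a small grayscale image
    edge_kernel = np.array([[-1.0, 0.0, 1.0],  # responds to vertical edges
                            [-2.0, 0.0, 2.0],
                            [-1.0, 0.0, 1.0]])
    feature_map = conv2d(image, edge_kernel)
    print(feature_map.shape)                   # (26, 26)

A trained CNN learns many such kernels rather than using hand-designed ones; the output grid is the feature map described below.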

At a high level, a CNN is made of a few standard building blocks:

  1. Convolution layers apply many small learnable filters (also called kernels) over the input. Each filter produces a feature map that responds strongly when a particular pattern is present in a local region of the image.
  2. Non-linearities such as ReLU are applied after each convolution so the network can model complex functions, not just simple linear combinations.
  3. Pooling layers downsample feature maps, typically by taking a max or average over small windows. This reduces spatial resolution, makes representations more robust to small shifts, and keeps computation manageable.
  4. Fully connected layers near the end of the network combine extracted features to make a prediction, for example, which class an image belongs to or the coordinates of a bounding box.
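Stacked together, those four blocks are enough to define a working network. Below is a minimal sketch using PyTorch; the framework choice, layer sizes, and 28×28 grayscale input are illustrative assumptions rather than a prescribed recipe:

    import torch
    import torch.nn as nn

    class TinyCNN(nn.Module):
        # A minimal CNN: two conv/ReLU/pool stages, then a classifier head.
        def __init__(self, num_classes=10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1),  # 1) learnable filters
                nn.ReLU(),                                   # 2) non-linearity
                nn.MaxPool2d(2),                             # 3) downsample 28 -> 14
                nn.Conv2d(16, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(2),                             #    14 -> 7
            )
            self.classifier = nn.Linear(32 * 7 * 7, num_classes)  # 4) fully connected

        def forward(self, x):
            x = self.features(x)       # (N, 32, 7, 7) feature maps
            x = torch.flatten(x, 1)    # flatten for the linear head
            return self.classifier(x)  # class scores (logits)

    model = TinyCNN()
    logits = model(torch.randn(8, 1, 28, 28))  # a batch of 8 grayscale images
    print(logits.shape)                        # torch.Size([8, 10])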

Because convolution filters are reused across all locations in the image, CNNs use far fewer parameters than a fully connected network with the same input size. This weight sharing, plus the fact that each neuron only looks at a small local region (its receptive field), makes CNNs efficient and well suited to high-resolution inputs.
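A back-of-the-envelope comparison makes the savings concrete (the layer sizes here are illustrative assumptions):

    # One 3x3 conv layer: 64 filters over a 3-channel input.
    # The same 1,792 weights are reused at every spatial location.
    conv_params = 3 * 3 * 3 * 64 + 64      # weights + biases = 1,792

    # One fully connected layer mapping a 224x224 RGB image to just
    # 64 units: every pixel needs its own weight for every unit.
    fc_params = (224 * 224 * 3) * 64 + 64  # 9,633,856, about 9.6 million

    print(conv_params, fc_params)          # 1792 9633856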

Although CNNs are most famous in 2D computer vision, the same idea carries over to 1D and 3D inputs (see the sketch after the list below). Common applications include:

  1. Image classification – assigning a label like “stop sign,” “cat,” or “defect present” to an image.
  2. Object detection – predicting bounding boxes and classes for each object in a scene.
  3. Instance and semantic segmentation – assigning a class to each pixel and, in the instance case, separating individual objects.
  4. Keypoint and pose estimation – predicting landmarks, skeletons, or other structured outputs on top of images.
  5. Stereo and depth estimation – inferring depth or disparity maps from one or more camera views.
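Moving between dimensionalities is largely a matter of swapping the convolution operator, so the filter slides along one, two, or three axes. A quick sketch, again assuming PyTorch, with illustrative input shapes:

    import torch
    import torch.nn as nn

    # 1D: 16 filters of width 5 sliding over a 1-channel, 1,000-sample signal.
    signal = torch.randn(1, 1, 1000)
    print(nn.Conv1d(1, 16, kernel_size=5)(signal).shape)  # (1, 16, 996)

    # 2D: the familiar image case, 3 input channels (RGB).
    image = torch.randn(1, 3, 224, 224)
    print(nn.Conv2d(3, 16, kernel_size=5)(image).shape)   # (1, 16, 220, 220)

    # 3D: filters also slide along depth/time, e.g. video clips or CT volumes.
    volume = torch.randn(1, 1, 16, 112, 112)
    print(nn.Conv3d(1, 8, kernel_size=3)(volume).shape)   # (1, 8, 14, 110, 110)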

In many production systems, CNNs sit inside larger pipelines. For example, a self-driving stack might use CNNs to detect lane markings, traffic signs, and pedestrians, and then hand those detections to a planning module. A quality inspection line might use CNN-based detectors and segmenters to flag defects, which then trigger downstream business logic or human review.

Training strong CNNs requires large, well-labeled datasets: often millions of images or frames, with precise bounding boxes, segmentation masks, keypoints, or other annotations. This is where high-quality annotation workflows and tools matter. If bounding boxes are inconsistent, masks are sloppy, or labels drift over time, CNNs will learn those errors and their performance in production will degrade.

Platforms like Taskmonk help teams create, manage, and audit the labels that CNNs rely on: setting up clear ontologies, routing tricky edge cases to expert reviewers, and tracking quality metrics by class or region. That foundation is critical whether you are training classic CNN architectures or newer hybrid models that mix convolutional layers with transformers.