
What is keypoint annotation?

Keypoint annotation is the process of marking precise landmarks on objects or people, typically as small points with x–y coordinates, to teach models about structure and pose. In computer vision, it is most common for human pose estimation, facial landmarking, hand pose, and product geometry in retail images (see also image annotation). You may also see it called landmark annotation or pose keypoints.

A keypoint dataset stores a list of points per instance, each with an order and sometimes a visibility flag. For example, COCO-style human pose uses 17 body joints per person, each represented by pixel coordinates and a visibility value that indicates whether the joint is visible, occluded, or not labeled. This structure lets models learn relationships among joints, improving downstream tracking and action recognition. COCO and other pose datasets remain the reference standard for training and benchmarking.
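To make the storage format concrete, here is a minimal Python sketch that decodes a COCO-style keypoint array; the joint order and visibility semantics follow the COCO convention, while the sample coordinate values are invented for illustration.

```python
# COCO stores each person's pose as a flat [x1, y1, v1, x2, y2, v2, ...]
# array over 17 joints, where v is the visibility flag:
# 0 = not labeled, 1 = labeled but occluded, 2 = labeled and visible.
COCO_JOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def decode_keypoints(flat):
    """Turn a flat COCO keypoint array into {joint: (x, y, visibility)}."""
    assert len(flat) == 3 * len(COCO_JOINTS)
    return {
        name: (flat[3 * i], flat[3 * i + 1], flat[3 * i + 2])
        for i, name in enumerate(COCO_JOINTS)
    }

# Invented sample: nose visible, left eye occluded, everything else unlabeled.
sample = [230, 115, 2, 225, 110, 1] + [0, 0, 0] * 15
pose = decode_keypoints(sample)
print(pose["nose"])      # (230, 115, 2) -> labeled and visible
print(pose["left_ear"])  # (0, 0, 0)     -> not labeled
```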

Keypoint annotation differs from bounding boxes and instance segmentation because it communicates topology rather than only extent. Boxes answer “where is the person,” masks answer “which pixels are the person,” and keypoints answer “where are the joints and how are they connected.” In practice, teams often combine all three to power richer applications such as motion analysis or shelf-planogram checks in retail video.

A concise data model usually includes (see the sketch after this list):

  1. an ontology listing keypoints and their order,
  2. per-instance arrays of coordinates,
  3. visibility flags, and
  4. optional skeletal edges for visualization.
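Expressed as code, that data model might look like the following minimal Python sketch; the class and field names, the three-point hand ontology, and all values are hypothetical and for illustration only.

```python
from dataclasses import dataclass, field

@dataclass
class KeypointOntology:
    """Ordered keypoint names plus optional skeletal edges for visualization."""
    names: list[str]                      # fixed order defines the array layout
    edges: list[tuple[int, int]] = field(default_factory=list)

@dataclass
class InstanceKeypoints:
    """One labeled instance: parallel arrays of coordinates and visibility."""
    xy: list[tuple[float, float]]         # one (x, y) per ontology entry
    visibility: list[int]                 # e.g. 0/1/2 as in COCO

# Hypothetical three-point hand ontology.
HAND = KeypointOntology(
    names=["wrist", "thumb_tip", "index_tip"],
    edges=[(0, 1), (0, 2)],               # wrist->thumb, wrist->index
)
instance = InstanceKeypoints(
    xy=[(101.5, 240.0), (130.2, 210.7), (142.8, 221.3)],
    visibility=[2, 2, 1],                 # index tip labeled but occluded
)
```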

Choosing the right ontology matters. Whole-body schemes add facial, hand, and foot landmarks to improve fine-grained actions and human–object interactions. Public references like COCO Keypoints and COCO-WholeBody show how richer definitions increase model capability.

Quality assurance focuses on inter-annotator agreement for the same frames, point tolerance thresholds (in pixels or normalized units), and reviewer spot checks of occluded or motion-blurred parts. Teams often pre-label with a model and route low-confidence frames to humans, which speeds throughput without sacrificing accuracy (see also model-assisted labeling and active learning).
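As one concrete check, here is a sketch of a PCK-style agreement score between two annotators, assuming distances are normalized by instance size; the 0.05 tolerance and all coordinate values are illustrative.

```python
import math

def keypoint_agreement(a, b, instance_size, tol=0.05):
    """Fraction of keypoints two annotators placed within a tolerance.

    a, b: lists of (x, y) in the same ontology order.
    instance_size: normalizer such as the bounding-box diagonal, in pixels.
    tol: normalized distance threshold (0.05 is an illustrative default).
    """
    hits = sum(
        math.dist(pa, pb) / instance_size <= tol
        for pa, pb in zip(a, b)
    )
    return hits / len(a)

# Two annotators labeling the same three-point instance (invented values).
ann_a = [(101.5, 240.0), (130.2, 210.7), (142.8, 221.3)]
ann_b = [(103.0, 241.2), (129.5, 212.1), (160.0, 230.0)]  # last point disagrees
print(keypoint_agreement(ann_a, ann_b, instance_size=150.0))  # -> 0.666...
```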

Example: A sports analytics team labels 17 body joints for basketball players in broadcast footage. The model uses keypoints to estimate jump height and detect illegal screens. Frames with fast motion blur that produce low keypoint confidence get queued for human review.
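The routing step in that example could look like the following sketch; the frame dictionary layout and the 0.6 confidence threshold are hypothetical.

```python
def route_frames(frames, min_confidence=0.6):
    """Split pre-labeled frames into auto-accept and human-review queues."""
    auto, review = [], []
    for frame in frames:
        # Gate on the weakest joint so one motion-blurred limb triggers review.
        worst = min(frame["keypoint_confidences"])
        (auto if worst >= min_confidence else review).append(frame)
    return auto, review

frames = [
    {"frame_id": 1, "keypoint_confidences": [0.95, 0.91, 0.88]},
    {"frame_id": 2, "keypoint_confidences": [0.92, 0.41, 0.89]},  # motion blur
]
auto, review = route_frames(frames)
print([f["frame_id"] for f in review])  # -> [2]
```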