Computer vision is an interdisciplinary field focused on enabling computers to gain high-level understanding from images and video—automatically extracting, analyzing, and interpreting visual information to produce outputs such as labels, measurements, 3D structure, or decisions.
In practice, computer vision methods combine geometry, physics, statistics, and machine learning to connect pixel data to semantic concepts like objects, actions, and scenes.
Image processing primarily transforms images (e.g., denoising, contrast enhancement, geometric warping) where the output is another image.
Computer vision uses images/video as input but often outputs information about the scene (e.g., detections, segmentation masks, pose estimates, tracking results, or a decision), which may then drive downstream behavior in a larger system.