Depth (d): The distance of a 3D point from the camera, measured along the optical axis (the z-coordinate of the point in camera coordinates).
Camera Pose (T): The position and orientation of the camera in the world (usually represented as a 4×4 transformation matrix, combining rotation and translation).

Projection (π): Maps a 3D point (in camera coordinates) to 2D pixel coordinates.
Backprojection (π⁻¹): Given a pixel (u, v) and its depth d, reconstructs the 3D point in camera coordinates.

Given a pixel (u, v) in image Iᵢ and its depth d, we compute its 3D coordinates in the camera frame of Iᵢ:

$$P_i = \pi^{-1}(u, v, d)$$

For a pinhole camera: $P_i = d \cdot \begin{bmatrix} (u - c_x) / f_x \\ (v - c_y) / f_y \\ 1 \end{bmatrix}$ where $(f_x, f_y)$ are the focal lengths and $(c_x, c_y)$ is the principal point.
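The pinhole backprojection above can be sketched in a few lines of NumPy. The function name `backproject` and its argument order are illustrative choices, not part of the original text.

```python
import numpy as np

def backproject(u, v, d, fx, fy, cx, cy):
    """Lift pixel (u, v) with depth d to a 3D point in the camera frame."""
    x = (u - cx) / fx  # normalized image coordinate along x
    y = (v - cy) / fy  # normalized image coordinate along y
    return d * np.array([x, y, 1.0])

# A pixel at the principal point lies on the optical axis, so x = y = 0.
P_i = backproject(320, 320, 1.0, fx=500, fy=500, cx=320, cy=320)
```

Because the third component of the result equals d, depth here is the z-coordinate of the point rather than its Euclidean distance from the camera center.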

The 3D point $P_{i}$ is transformed to the camera frame of $I_{j}$ using:

$$T_{ij} = (T_{jw})^{-1} \cdot T_{iw}$$

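where $T_{iw}$ and $T_{jw}$ are the poses of the cameras of $I_i$ and $I_j$ in the world frame (camera-to-world transforms), so $T_{ij}$ maps points from the camera frame of $I_i$ to the camera frame of $I_j$.

The transformed point, with $P_i$ written in homogeneous coordinates, is: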

$$P_j = T_{ij} \cdot P_i$$
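A minimal sketch of this step, assuming the poses are 4×4 camera-to-world matrices (the convention that matches the worked example in this text). The helper names `relative_pose` and `transform_point` are hypothetical.

```python
import numpy as np

def relative_pose(T_iw, T_jw):
    """Return T_ij, which maps points from camera i's frame to camera j's frame."""
    return np.linalg.inv(T_jw) @ T_iw

def transform_point(T, P):
    """Apply a 4x4 rigid transform to a 3D point using homogeneous coordinates."""
    P_h = np.append(P, 1.0)  # (x, y, z) -> (x, y, z, 1)
    return (T @ P_h)[:3]

T_iw = np.eye(4)                 # camera i at the world origin
T_jw = np.eye(4)
T_jw[:3, 3] = [0.0, 0.0, 0.2]    # camera j is 0.2 m further along the z axis

T_ij = relative_pose(T_iw, T_jw)
P_j = transform_point(T_ij, np.array([0.0, 0.0, 1.0]))
```

Moving the camera 0.2 m toward a point that was 1 m away leaves the point 0.8 m in front of camera j, which is what `P_j` holds.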

Project $P_{j}$ onto the image plane of $I_{j}$ :

$$(u', v') = \pi(P_j)$$

For a pinhole camera:

$$u' = f_x \cdot \frac{P_{j,x}}{P_{j,z}} + c_x, \quad v' = f_y \cdot \frac{P_{j,y}}{P_{j,z}} + c_y$$

The flow vector for pixel (u, v) is:

$$\hat{f}(u, v) = (u' - u, v' - v)$$
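The projection and flow steps can be sketched as follows. The names `project` and `flow_vector` are illustrative, and rejecting points with non-positive z is an added assumption: the text does not say how points behind the camera should be handled.

```python
import numpy as np

def project(P, fx, fy, cx, cy):
    """Project a 3D point in the camera frame to pixel coordinates (u, v)."""
    x, y, z = P
    if z <= 0:
        # Assumption: points at or behind the camera plane have no valid projection.
        raise ValueError("point is not in front of the camera")
    return np.array([fx * x / z + cx, fy * y / z + cy])

def flow_vector(u, v, P_j, fx, fy, cx, cy):
    """Flow of pixel (u, v), given its 3D point P_j expressed in camera j's frame."""
    u_new, v_new = project(P_j, fx, fy, cx, cy)
    return np.array([u_new - u, v_new - v])

# Illustrative numbers: the point projects to (400, 200) in image j.
flow = flow_vector(390.0, 205.0, np.array([0.5, -0.25, 2.0]),
                   fx=400.0, fy=400.0, cx=300.0, cy=250.0)
```

With these values the pixel moves from (390, 205) to (400, 200), so the flow is (10, -5).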

Intuition

Static Scene Assumption: Only the camera moves; the world is static.
Flow Depends on Depth: Under camera translation, closer points (small d) induce larger flow (more motion in the image), while distant points (large d) induce smaller flow. Flow caused by pure rotation does not depend on depth.
Flow Depends on Camera Motion: The direction and magnitude of flow depend on how the camera moves (rotation vs. translation).
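The depth and motion dependence can be checked numerically. The sketch below assumes camera-to-world poses and a pinhole model. The function `induced_flow`, the 2-degree rotation, and the chosen depths are illustrative values, not taken from the text.

```python
import numpy as np

def induced_flow(u, v, d, K, T_iw, T_jw):
    """Flow of pixel (u, v) with depth d, for intrinsics K = (fx, fy, cx, cy)."""
    fx, fy, cx, cy = K
    P_i = d * np.array([(u - cx) / fx, (v - cy) / fy, 1.0])   # backproject
    T_ij = np.linalg.inv(T_jw) @ T_iw                         # camera i -> camera j
    P_j = (T_ij @ np.append(P_i, 1.0))[:3]                    # transform
    u_new = fx * P_j[0] / P_j[2] + cx                         # project
    v_new = fy * P_j[1] / P_j[2] + cy
    return np.array([u_new - u, v_new - v])

K = (500.0, 500.0, 320.0, 320.0)
T_iw = np.eye(4)

# Pure translation: camera j sits 0.1 m to the right of camera i.
T_trans = np.eye(4)
T_trans[0, 3] = 0.1

# Pure rotation: camera j is rotated by 2 degrees about the y axis.
a = np.deg2rad(2.0)
T_rot = np.eye(4)
T_rot[:3, :3] = [[np.cos(a), 0.0, np.sin(a)],
                 [0.0, 1.0, 0.0],
                 [-np.sin(a), 0.0, np.cos(a)]]

depths = [1.0, 2.0, 10.0]
flow_trans = [induced_flow(320, 320, d, K, T_iw, T_trans) for d in depths]
flow_rot = [induced_flow(320, 320, d, K, T_iw, T_rot) for d in depths]
```

Under translation the horizontal flow shrinks in proportion to 1/d (about -50, -25, and -5 pixels here), while under pure rotation it is the same at every depth.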

Example

Consider two images, Iᵢ and Iⱼ, with the following setup:
Camera intrinsics: fx = fy = 500, cx = cy = 320.
Camera poses:
- Tᵢw = identity (camera i is at world origin).
- Tⱼw = translation of (0.1, 0, 0) (camera j moves right by 0.1 m).
Pixel (u, v) = (320, 320) (the principal point, assumed to be the center of the image).
Depth d = 1 m.

$$P_i = \pi^{-1}(320, 320, 1) = \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$$

$$T_{ij} = (T_{jw})^{-1} \cdot T_{iw} = \begin{bmatrix} 1 & 0 & 0 & -0.1 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

$$P_j = T_{ij} \cdot P_i = \begin{bmatrix} -0.1 \\ 0 \\ 1 \end{bmatrix}$$

$$u' = 500 \cdot (-0.1 / 1) + 320 = 270$$

$$v' = 500 \cdot (0 / 1) + 320 = 320$$

$$\hat{f}(320, 320) = (270 - 320, 320 - 320) = (-50, 0)$$
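The arithmetic of this example can be reproduced step by step in NumPy. This is only a numerical check of the values above; the variable names are illustrative.

```python
import numpy as np

fx = fy = 500.0
cx = cy = 320.0
u, v, d = 320.0, 320.0, 1.0

T_iw = np.eye(4)        # camera i at the world origin
T_jw = np.eye(4)
T_jw[0, 3] = 0.1        # camera j translated by (0.1, 0, 0)

# Step 1: backproject the pixel into camera i's frame.
P_i = d * np.array([(u - cx) / fx, (v - cy) / fy, 1.0])

# Step 2: move the point into camera j's frame (homogeneous coordinates).
T_ij = np.linalg.inv(T_jw) @ T_iw
P_j = (T_ij @ np.append(P_i, 1.0))[:3]

# Step 3: project into image j and subtract the original pixel.
u_new = fx * P_j[0] / P_j[2] + cx
v_new = fy * P_j[1] / P_j[2] + cy
flow = np.array([u_new - u, v_new - v])
```

Up to floating-point rounding, `flow` equals (-50, 0): the camera moves right, so the point appears to shift 50 pixels to the left.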

Instead of computing flow for one pixel, we compute it for all pixels (u, v) in a grid (h × w), given a dense depth map d. This gives a flow field:

$$h_{ij}(T_{iw}, T_{jw}, d_i) = \pi\left(T_{ij} \cdot \pi^{-1}(u, v, d_i)\right) - \begin{bmatrix} u \\ v \end{bmatrix}$$
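A vectorized sketch of the dense flow field, assuming camera-to-world poses and a pinhole model. The function name `dense_flow`, the 640×640 image size, and the choice to mark points that land behind camera j as NaN are assumptions made for illustration.

```python
import numpy as np

def dense_flow(depth, K, T_iw, T_jw):
    """Flow field of shape (h, w, 2) induced by a depth map of shape (h, w)."""
    fx, fy, cx, cy = K
    h, w = depth.shape
    # Pixel grid: u indexes columns, v indexes rows; both have shape (h, w).
    u, v = np.meshgrid(np.arange(w, dtype=float), np.arange(h, dtype=float))
    # Backproject every pixel into camera i's frame, shape (h, w, 3).
    P_i = np.stack([(u - cx) / fx * depth, (v - cy) / fy * depth, depth], axis=-1)
    # Apply T_ij = R, t to all points at once: P_j = R P_i + t.
    T_ij = np.linalg.inv(T_jw) @ T_iw
    P_j = P_i @ T_ij[:3, :3].T + T_ij[:3, 3]
    # Project into image j; non-positive z is replaced by NaN (an assumption).
    z = np.where(P_j[..., 2] > 0, P_j[..., 2], np.nan)
    u_new = fx * P_j[..., 0] / z + cx
    v_new = fy * P_j[..., 1] / z + cy
    return np.stack([u_new - u, v_new - v], axis=-1)

depth = np.full((640, 640), 1.0)      # fronto-parallel plane at 1 m
T_jw = np.eye(4)
T_jw[0, 3] = 0.1                      # camera j is 0.1 m to the right
flow = dense_flow(depth, (500.0, 500.0, 320.0, 320.0), np.eye(4), T_jw)
```

For a constant depth of 1 m and a pure sideways translation, every pixel receives the same flow of about (-50, 0), which matches the single-pixel case with these intrinsics and poses.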

For visual odometry: Estimate camera motion by comparing predicted flow (from depth and pose) with observed flow.
For depth estimation: If camera motion is known, we can estimate depth from flow.
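Both uses can be illustrated in a deliberately restricted special case: no rotation, and camera j translated by t_x along the x axis of camera i. In that case the horizontal flow reduces to -f_x · t_x / d. The sketch below inverts this relation in each direction. It is not a general visual odometry or depth estimation method, and the function names are hypothetical.

```python
import numpy as np

def depth_from_flow(flow_u, fx, t_x):
    """Depth from horizontal flow, valid only for a pure x-translation t_x."""
    flow_u = np.asarray(flow_u, dtype=float)
    return -fx * t_x / flow_u

def translation_from_flow(flow_u, depth, fx):
    """Least-squares estimate of t_x from observed flow and known depth."""
    flow_u = np.asarray(flow_u, dtype=float)
    a = -fx / np.asarray(depth, dtype=float)   # predicted flow per unit of t_x
    return np.sum(a * flow_u) / np.sum(a * a)

fx = 500.0
observed_flow_u = [-50.0, -25.0, -5.0]

# Known motion, unknown depth.
depths = depth_from_flow(observed_flow_u, fx, t_x=0.1)

# Known depth, unknown motion.
t_x = translation_from_flow(observed_flow_u, [1.0, 2.0, 10.0], fx)
```

Here `depths` comes out as roughly 1, 2, and 10 m, and `t_x` as roughly 0.1 m. A pixel with zero flow corresponds to infinite depth, so `depth_from_flow` is only meaningful where the flow is nonzero.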