Mobile Robotics Mid-Semester Q&A Compilation

Q1: Warm-up Table

Question

Fill in the following table by indicating whether each quantity is known, to be estimated, or unknown, and the type of measurements that are needed.

Solution Table

Problem                   | Structure (Scene Geometry) | Motion (Camera Parameters) | Measurements
--------------------------|----------------------------|----------------------------|----------------
F-matrix estimation       | Unknown                    | To be estimated            | 2D-2D features
Camera calibration        | Known                      | To be estimated            | 2D-3D features
Triangulation             | To be estimated            | Known                      | 2D-2D features
Stereo rectification      | Unknown                    | Known & to be estimated    | Unknown
PnP (Perspective-n-Point) | Known                      | To be estimated            | 2D-3D features
Bundle adjustment         | To be estimated            | To be estimated            | 2D-3D features

Q2: Transformations

Question (i)

Derive the expression for $T_W^C$ if $T_C^W = \begin{bmatrix} R_C^W & P_{C_{ORG}}^W \\ 0_{1\times 3} & 1 \end{bmatrix}$.

Solution (i)

We know that the transformation from a frame to itself is the identity matrix. Therefore, the product of a transform and its inverse must be the identity matrix:

$T_C^W T_W^C = I$

Expanding this with the block matrix form gives:

$\begin{bmatrix} R_C^W & P_{C_{ORG}}^W \\ 0 & 1 \end{bmatrix} \begin{bmatrix} R_W^C & P_{W_{ORG}}^C \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} I & 0 \\ 0 & 1 \end{bmatrix}$

From the block matrix multiplication, we get two key equations:

  1. For the rotation part: $R_C^W R_W^C = I$. This implies that $R_W^C = (R_C^W)^{-1}$. Since rotation matrices are orthogonal, the inverse is equal to the transpose, so $R_W^C = (R_C^W)^T$.
  2. For the translation part: $R_C^W P_{W_{ORG}}^C + P_{C_{ORG}}^W = 0$.

Solving the second equation for the unknown translation vector $P_{W_{ORG}}^C$:

$P_{W_{ORG}}^C = -(R_C^W)^{-1} P_{C_{ORG}}^W = -(R_C^W)^T P_{C_{ORG}}^W$

Combining these results, the final expression for the inverse transformation is:

$T_W^C = \begin{bmatrix} (R_C^W)^T & -(R_C^W)^T P_{C_{ORG}}^W \\ 0 & 1 \end{bmatrix}$

Question (ii)

Consider the figure below, where {W} is the world frame and {C} is the camera frame. The Z-axis of {W} comes out of the plane, while the Y-axis of {C} goes into the plane. Find (a) $R_C^W$, (b) the YXZ-Euler angles for $R_C^W$, and (c) $P_{W_{ORG}}^C$ and $T_W^C$.

Solution (ii)

(a) Find the Rotation Matrix $R_C^W$

To find the rotation matrix that describes the orientation of frame {C} with respect to frame {W}, we express the unit axes of {C} in {W}'s coordinates.
From the diagram, the unit axes of {C} expressed in {W} are $\hat{X}_C = \hat{Y}_W$, $\hat{Y}_C = \hat{Z}_W$, and $\hat{Z}_C = \hat{X}_W$.

Stacking these as column vectors gives the rotation matrix:

$R_C^W = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$

(b) Find the YXZ-Euler Angles

The YXZ-Euler angle representation corresponds to the matrix product $R_Y(\alpha) R_X(\beta) R_Z(\gamma)$. We need to find angles $\alpha, \beta, \gamma$ that produce the $R_C^W$ matrix. One possible solution is a rotation of 90° around the Y-axis, followed by 0° around the new X-axis, and then 90° around the new Z-axis:

$R_C^W = R_Y(90°) R_X(0°) R_Z(90°) = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ -1 & 0 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$

So, the YXZ-Euler angles are $(\alpha, \beta, \gamma) = (90°, 0°, 90°)$. Note: Euler angle representations are not unique.

(c) Find $P_{W_{ORG}}^C$ and $T_W^C$

The origin of frame {C} is located at (1, 5, 2) in the world frame {W}, so $P_{C_{ORG}}^W = [1, 5, 2]^T$. We want to find the position of the world origin as seen from the camera frame, $P_{W_{ORG}}^C$. Using the formula from part (i):

$P_{W_{ORG}}^C = -(R_C^W)^T P_{C_{ORG}}^W = -\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 5 \\ 2 \end{bmatrix} = -\begin{bmatrix} 5 \\ 2 \\ 1 \end{bmatrix} = \begin{bmatrix} -5 \\ -2 \\ -1 \end{bmatrix}$

Finally, we construct the full transformation matrix $T_W^C$:

$T_W^C = \begin{bmatrix} R_W^C & P_{W_{ORG}}^C \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} (R_C^W)^T & P_{W_{ORG}}^C \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 & -5 \\ 0 & 0 & 1 & -2 \\ 1 & 0 & 0 & -1 \\ 0 & 0 & 0 & 1 \end{bmatrix}$
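
As a numerical sanity check, the closed-form inverse can be compared against a generic matrix inverse. A minimal numpy sketch using the values from this example:

```python
import numpy as np

# Rotation of {C} w.r.t. {W} and the origin of {C} expressed in {W}
R_cw = np.array([[0, 0, 1],
                 [1, 0, 0],
                 [0, 1, 0]], dtype=float)
p_corg_w = np.array([1.0, 5.0, 2.0])

# Assemble the 4x4 homogeneous transform T_C^W
T_cw = np.eye(4)
T_cw[:3, :3] = R_cw
T_cw[:3, 3] = p_corg_w

# Closed-form inverse: [R^T, -R^T p; 0, 1]
T_wc = np.eye(4)
T_wc[:3, :3] = R_cw.T
T_wc[:3, 3] = -R_cw.T @ p_corg_w

# Should match the generic matrix inverse
assert np.allclose(T_wc, np.linalg.inv(T_cw))
print(T_wc)  # last column gives P_WORG^C = [-5, -2, -1]
```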

Q3: Single-View Geometry and Reconstruction

Question 3.1

Given a camera matrix P, detail how you can obtain the camera center C and the rotation matrix R without knowing the intrinsic parameter matrix K.

Solution 3.1

The camera matrix P is a 3×4 matrix that can be decomposed as $P = K[R|t]$, where K is the 3×3 intrinsic matrix, R is the 3×3 rotation matrix, and t is the 3×1 translation vector. This can be written as $P = [KR \,|\, Kt]$.

  1. Finding the Camera Center: The camera center C, written in homogeneous coordinates, spans the right null-space of P; that is, $PC = 0$. Since P is a 3×4 matrix of rank 3, its null-space is one-dimensional, so C is uniquely determined up to a scale factor, which is exactly the ambiguity of homogeneous coordinates. In practice, C is computed as the right singular vector of P associated with its smallest singular value.

  2. Finding the Rotation Matrix: Without knowing K, we cannot directly read off R. However, we can use the fact that the first 3×3 submatrix of P, let's call it $M = KR$, is the product of an upper-triangular matrix (K) and an orthogonal matrix (R). We can apply a QR decomposition to $M^{-1}$. If $M^{-1} = R_{qr} K_{qr}$, where $R_{qr}$ is orthogonal and $K_{qr}$ is upper-triangular, then $M = K_{qr}^{-1} R_{qr}^{-1}$. From this, we can identify $R = R_{qr}^{-1}$ and $K = K_{qr}^{-1}$.

Alternatively, we can use the property that $RR^T = I$:

$MM^T = (KR)(KR)^T = KRR^T K^T = KK^T$

This equation relates the known matrix $MM^T$ to the unknown intrinsic matrix K. Since $KK^T$ is symmetric positive-definite, we can solve for the upper-triangular K via a Cholesky-style factorization (up to the sign ambiguities inherent in that decomposition). Once K is found, R can be recovered simply as $R = K^{-1}M$.
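
The following is a minimal numpy/scipy sketch of this decomposition; the synthetic K, R, t used to build P are assumed demo values, and scipy's RQ factorization of M is used as the practical equivalent of the QR-on-$M^{-1}$ route described above:

```python
import numpy as np
from scipy.linalg import rq

# Synthetic ground truth (assumed for the demo)
K_true = np.array([[800.0, 0.0, 320.0],
                   [0.0, 820.0, 240.0],
                   [0.0, 0.0, 1.0]])
R_true, _ = np.linalg.qr(np.random.randn(3, 3))
if np.linalg.det(R_true) < 0:
    R_true = -R_true
t_true = np.array([0.1, -0.2, 1.5])
P = K_true @ np.hstack([R_true, t_true[:, None]])

# Camera center: right null-space of P (PC = 0), via SVD
_, _, Vt = np.linalg.svd(P)
C = Vt[-1]
C = C[:3] / C[3]

# Split off M = KR and factor it: RQ gives upper-triangular * orthogonal
M = P[:, :3]
K, R = rq(M)
# Fix signs so that K has a positive diagonal
S = np.diag(np.sign(np.diag(K)))
K, R = K @ S, S @ R
K = K / K[2, 2]  # normalize the intrinsic matrix

assert np.allclose(K, K_true) and np.allclose(R, R_true)
assert np.allclose(C, -R_true.T @ t_true)  # center = -R^T t
```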

Question 3.2

State and justify the cases when the 3D reconstruction obtained from two views is (a) Unambiguous (b) Up to an unknown scaling factor (c) Up to an unknown projective transformation.

Solution 3.2

(a) Unambiguous (Euclidean) Reconstruction:
Reconstruction is unambiguous (up to a global rigid motion) when the intrinsic matrices of both cameras ($K_1$, $K_2$) and the relative extrinsic parameters (rotation R, translation t) between them are all known. With this information, the camera matrices $P_1 = K_1[I|0]$ and $P_2 = K_2[R|t]$ are fully defined. Corresponding image points can be back-projected into 3D rays, and their intersection gives the unique 3D point.

(b) Up to an Unknown Scaling Factor (Metric Reconstruction):
Reconstruction is determined up to a scale factor when the intrinsic matrices ($K_1$, $K_2$) are known, but the relative motion (R, t) is unknown. In this case, we can estimate the Essential Matrix (E) from 2D point correspondences. The Essential Matrix can be decomposed to find R and t, but the translation vector t can only be recovered up to scale. Since the magnitude of the translation (the baseline) is unknown, the absolute scale of the reconstructed 3D scene is also unknown.

(c) Up to an Unknown Projective Transformation:
Reconstruction is only determined up to a projective transformation when the intrinsic ($K_1$, $K_2$) and extrinsic (R, t) parameters are all unknown. With only 2D point correspondences, we can only estimate the Fundamental Matrix (F). The Fundamental Matrix does not encode the full Euclidean geometry of the scene. Since F is determined only by image point correspondences, there is a projective ambiguity in the reconstruction: any non-singular 4×4 matrix H applied to the scene points and camera matrices yields the same image points, $x = PX = (PH^{-1})(HX)$.


Q4: Essential Matrix

Question

Two cameras fixate on a point P in 3D space such that their optical axes intersect at this point. Show that the $E_{33}$ element of their associated Essential matrix E is zero.

Solution

When a camera fixates on a point, its optical axis passes through that point. The optical axis intersects the image plane at the principal point. If both cameras fixate on the same 3D point P, then P is imaged onto the principal point of each camera.

In normalized image coordinates, the principal point is at the origin (0,0). In homogeneous coordinates, this corresponds to $x_1 = [0, 0, 1]^T$ for the first camera and $x_2 = [0, 0, 1]^T$ for the second camera.

The epipolar constraint is defined by the Essential matrix E as:

$x_2^T E x_1 = 0$

Substituting the coordinates of the principal points into this equation:

$\begin{bmatrix} 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} E_{11} & E_{12} & E_{13} \\ E_{21} & E_{22} & E_{23} \\ E_{31} & E_{32} & E_{33} \end{bmatrix} \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} = 0$

Performing the matrix multiplication:

$\begin{bmatrix} 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} E_{13} \\ E_{23} \\ E_{33} \end{bmatrix} = (0 \cdot E_{13}) + (0 \cdot E_{23}) + (1 \cdot E_{33}) = 0$

This simplifies directly to:

$E_{33} = 0$

Thus, the (3,3) element of the Essential matrix must be zero.
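
A quick numerical check confirms this. The sketch below builds two cameras that fixate on a common point (the camera centers, fixation point, and look-at construction are assumed demo choices) and evaluates $E_{33}$:

```python
import numpy as np

def look_at_rotation(cam_center, target):
    """World-to-camera rotation whose optical (z) axis points at `target`."""
    z = target - cam_center
    z = z / np.linalg.norm(z)
    up = np.array([0.0, 1.0, 0.0])
    x = np.cross(up, z); x /= np.linalg.norm(x)
    y = np.cross(z, x)
    return np.stack([x, y, z])  # rows are the camera axes in world coords

# Both cameras fixate on the same 3D point
fixation = np.array([0.0, 0.0, 5.0])
c1 = np.array([-1.0, 0.2, 0.0])
c2 = np.array([1.5, -0.3, 0.5])

R1, R2 = look_at_rotation(c1, fixation), look_at_rotation(c2, fixation)
t1, t2 = -R1 @ c1, -R2 @ c2  # world-to-camera translations

# Relative pose of camera 2 w.r.t. camera 1, then E = [t]_x R
R = R2 @ R1.T
t = t2 - R @ t1
t_x = np.array([[0, -t[2], t[1]],
                [t[2], 0, -t[0]],
                [-t[1], t[0], 0]])
E = t_x @ R

print(E[2, 2])  # ~0 up to floating-point error
```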


Q5: Homography

Question

Suppose a camera with intrinsic matrix K rotates about its optical centre by a rotation matrix R. (a) Show that its two views are related by a homography H such that $x_2 = Hx_1$. (b) Show that if θ is the angle of rotation between two views, then the angle 2θ corresponds to the homography $H^2$.

Solution

(a) Derivation of the Homography

Let a point in 3D space be X (a homogeneous 4-vector). Its projection $x_1$ in the first camera view is given by:

$\lambda_1 x_1 = K[I|0]X$

Since the camera only rotates about its centre, the translation is zero. Writing $\tilde{X}$ for the inhomogeneous 3D part of X, the point can be expressed in terms of its image point $x_1$ by inverting this relationship:

$\tilde{X} = \lambda_1 K^{-1} x_1$

The projection of the same 3D point in the second camera view, after a rotation R, is given by:

$\lambda_2 x_2 = K[R|0]X = KR\tilde{X}$

Now, we substitute the expression for $\tilde{X}$ from the first view into the equation for the second view:

$\lambda_2 x_2 = KR(\lambda_1 K^{-1} x_1) = \lambda_1 KRK^{-1} x_1$

Dropping the scalar depth constants $\lambda_1$ and $\lambda_2$ (as we are in homogeneous coordinates), we get the relationship:

$x_2 = (KRK^{-1}) x_1$

We define the homography matrix H as:

$H = KRK^{-1}$

Thus, the two image points are related by the homography $x_2 = Hx_1$.

(b) Homography for Rotation 2θ

If the camera is rotated again by the same rotation R (corresponding to an angle θ about the same axis), the new point $x_3$ is related to $x_2$ by the same homography H:

$x_3 = Hx_2$

Substituting the expression for $x_2$:

$x_3 = H(Hx_1) = H^2 x_1$

The total rotation from the first view to the third view is $R \cdot R = R^2$. A rotation by an angle θ followed by another rotation by θ around the same axis results in a total rotation by 2θ. Moreover, $H^2 = (KRK^{-1})(KRK^{-1}) = KR^2K^{-1}$, which is exactly the homography induced by the rotation $R^2$. Therefore, the homography corresponding to a rotation of 2θ is indeed $H^2$.
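
A small numpy sketch illustrates both the conjugation $H = KRK^{-1}$ and the doubling property (the intrinsics and rotation angle are assumed demo values):

```python
import numpy as np

def rot_y(theta):
    """Rotation about the Y axis by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
theta = np.deg2rad(10.0)

H = K @ rot_y(theta) @ np.linalg.inv(K)

# Applying H twice equals the homography of the doubled rotation
H_double = K @ rot_y(2 * theta) @ np.linalg.inv(K)
assert np.allclose(H @ H, H_double)
```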


Q6: Dense Visual Odometry

Question

The photometric error for Dense-VO is given as $\sum_{x \in I_1} \| I_1(x) - I_2(w(x, (R|t))) \|^2$. Assuming the depth d of a point x in image $I_1$ is known, (a) describe the steps to map this point to the second image and provide an expression for the warping function $w(x,(R|t))$. (b) What is the nature of this error and how can it be solved?

Solution

(a) The Warping Function $w(x,(R|t))$

The function w maps a pixel coordinate x from the first image ($I_1$) to its corresponding pixel coordinate in the second image ($I_2$), given the relative camera motion {R, t} and the pixel's depth d. The steps are:

  1. Back-projection to 3D: First, we lift the 2D pixel x (in homogeneous coordinates) back into a 3D point X in the first camera's coordinate frame. This is done by inverting the pinhole camera projection equation, using the known intrinsic matrix K and depth d.

    $X = d\,K^{-1}x$
  2. Transform to Second Camera Frame: Next, we transform this 3D point X into the coordinate system of the second camera using the rigid body motion {R, t}.

    $X' = RX + t$
  3. Re-projection to 2D: Finally, we project the transformed 3D point $X'$ back onto the image plane of the second camera using the intrinsic matrix K, followed by dehomogenization (division by the third coordinate).

    $x' = KX'$

Combining these steps gives the full mathematical expression for the warping function (up to the final dehomogenization):

$w(x,(R|t)) = K(R\,dK^{-1}x + t)$
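
A minimal numpy sketch of this warp for a single pixel (the intrinsics, depth, and motion are assumed demo values; a real Dense-VO system would warp every pixel and interpolate intensities):

```python
import numpy as np

def warp(x_pix, depth, K, R, t):
    """Map pixel x_pix in image 1 to its location in image 2 (pinhole model)."""
    x_h = np.array([x_pix[0], x_pix[1], 1.0])   # homogeneous pixel
    X = depth * np.linalg.inv(K) @ x_h          # 1. back-project to 3D
    X2 = R @ X + t                              # 2. move into camera-2 frame
    x2_h = K @ X2                               # 3. re-project
    return x2_h[:2] / x2_h[2]                   # dehomogenize

# Assumed demo values
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                  # no rotation for the demo
t = np.array([0.1, 0.0, 0.0])  # pure sideways translation
print(warp((320.0, 240.0), depth=2.0, K=K, R=R, t=t))
```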

(b) Nature of the Photometric Error and its Solution

The photometric error is the difference in intensity values between a pixel in the first image and its corresponding warped location in the second image. This error function is non-linear and non-convex with respect to the motion parameters R and t.

Because it's a non-linear optimization problem, we cannot solve for the best camera motion {R,t} in closed form. Instead, we must use an iterative estimation method like Gauss-Newton or Levenberg-Marquardt. These methods start with an initial guess for the motion and iteratively refine it by solving a sequence of linear approximations of the problem, with the goal of finding the motion parameters that minimize the total photometric error.



Q7: Bundle Adjustment

Question

What is Bundle Adjustment? Describe its objective function mathematically. Explain the structure of the Jacobian matrix used in its optimization.

Solution

Bundle Adjustment (BA) is a large-scale non-linear optimization problem that is fundamental to 3D reconstruction tasks like Structure from Motion (SfM) and SLAM. Its primary purpose is to simultaneously refine the 3D coordinates of all map points and the parameters of all camera poses (position and orientation) to achieve a globally consistent and optimal reconstruction. The name comes from the "bundles" of light rays that project from each 3D point to each camera center, which are adjusted to minimize error.

The Objective Function

The goal of Bundle Adjustment is to minimize the total reprojection error. This error is the sum of squared distances between the observed 2D feature locations in the images and the predicted 2D projections of the 3D map points based on the current camera pose estimates.

Mathematically, if we have m cameras and n 3D points, the objective function is:

$\min_{C_i, X_j} \sum_{i=1}^{m} \sum_{j=1}^{n} v_{ij}\,\| x_{ij} - \pi(C_i, X_j) \|^2$

Where:

  • $C_i$ denotes the parameters of the i-th camera (pose, and possibly intrinsics),
  • $X_j$ is the position of the j-th 3D point,
  • $x_{ij}$ is the observed 2D location of point j in image i,
  • $\pi(C_i, X_j)$ is the predicted projection of point $X_j$ into camera i, and
  • $v_{ij}$ is a visibility indicator equal to 1 if point j is observed in image i and 0 otherwise.

The Jacobian Structure

This non-linear least squares problem is solved iteratively using methods like Levenberg-Marquardt. This requires computing the Jacobian matrix of the residuals. The state vector is partitioned into two sets of variables: the camera parameters and the 3D point coordinates.

$\text{State Vector} = [c_1, \dots, c_m, x_1, \dots, x_n]^T$

The Jacobian matrix has a very specific sparse block structure. For a single residual term $r_{ij} = x_{ij} - \pi(C_i, X_j)$, the derivatives are non-zero only with respect to the parameters of the i-th camera and the j-th point. All other derivatives are zero.

This results in a Jacobian matrix with the block structure:

$J = [A \,|\, B] = \left[\dfrac{\partial r}{\partial C} \,\Big|\, \dfrac{\partial r}{\partial X}\right]$

where A collects the derivatives with respect to all camera parameters and B those with respect to all point parameters; each residual contributes one non-zero block to A (for camera i) and one to B (for point j).

This sparse structure is critical because it can be exploited by specialized solvers (like the Schur complement trick) to solve the large linear system required at each iteration much more efficiently than a dense solver could.
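
The sketch below visualizes this block sparsity for a toy problem (the camera/point counts, parameter block sizes, and the all-points-visible observation pattern are assumptions for the demo):

```python
import numpy as np

n_cams, n_pts = 3, 4
cam_dim, pt_dim = 6, 3   # 6 pose parameters per camera, 3 coords per point
obs = [(i, j) for i in range(n_cams) for j in range(n_pts)]  # all points seen

# Each observation contributes a 2-row block (u and v residuals)
J_mask = np.zeros((2 * len(obs), n_cams * cam_dim + n_pts * pt_dim), dtype=int)
for row, (i, j) in enumerate(obs):
    r = slice(2 * row, 2 * row + 2)
    J_mask[r, i * cam_dim:(i + 1) * cam_dim] = 1            # A block: camera i
    off = n_cams * cam_dim
    J_mask[r, off + j * pt_dim:off + (j + 1) * pt_dim] = 1  # B block: point j

# Print the pattern: '#' marks potentially non-zero entries
for row in J_mask:
    print(''.join('#' if v else '.' for v in row))
```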


Q8: EKF-SLAM

Question

Describe the key steps of the EKF-SLAM algorithm. What are the main challenges or limitations of this approach?

Solution

Extended Kalman Filter (EKF) SLAM is a classic probabilistic approach to solving the SLAM problem. It maintains a state estimate as a single multivariate Gaussian distribution, represented by a mean vector μ and a covariance matrix Σ. The state vector includes both the robot's pose and the positions of all landmarks in the map.

$x_t = \begin{bmatrix} x_{robot} \\ x_{landmark_1} \\ \vdots \\ x_{landmark_N} \end{bmatrix}, \qquad \Sigma_t = \begin{bmatrix} \Sigma_{rr} & \Sigma_{rL} \\ \Sigma_{Lr} & \Sigma_{LL} \end{bmatrix}$

The algorithm operates in a two-step prediction-correction cycle:

  1. Prediction Step: When the robot moves, a motion model g is used to predict its new state. This step propagates the robot's uncertainty, causing the covariance to grow.

    • Predict Mean: $\bar{\mu}_t = g(\mu_{t-1}, u_t)$
    • Predict Covariance: $\bar{\Sigma}_t = G_t \Sigma_{t-1} G_t^T + R_t$
      Here, $u_t$ is the control input, $G_t$ is the Jacobian of the motion model with respect to the state, and $R_t$ is the motion noise covariance.
  2. Correction (Update) Step: When the robot observes a landmark, the measurement $z_t$ is used to correct the state estimate. This step reduces uncertainty.

    • Compute Kalman Gain: $K_t = \bar{\Sigma}_t H_t^T (H_t \bar{\Sigma}_t H_t^T + Q_t)^{-1}$
    • Update Mean: $\mu_t = \bar{\mu}_t + K_t (z_t - h(\bar{\mu}_t))$
    • Update Covariance: $\Sigma_t = (I - K_t H_t) \bar{\Sigma}_t$
      Here, h is the observation model, $H_t$ is its Jacobian with respect to the state, and $Q_t$ is the measurement noise covariance. The term $z_t - h(\bar{\mu}_t)$ is the innovation or measurement residual.
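
A minimal numpy sketch of one prediction-correction cycle (the linear toy models g and h and all noise values are assumptions for the demo):

```python
import numpy as np

def ekf_step(mu, Sigma, u, z, g, G, h, H, R, Q):
    """One EKF prediction-correction cycle (G, H are Jacobians at the estimate)."""
    # Prediction
    mu_bar = g(mu, u)
    Sigma_bar = G @ Sigma @ G.T + R
    # Correction
    K = Sigma_bar @ H.T @ np.linalg.inv(H @ Sigma_bar @ H.T + Q)
    mu_new = mu_bar + K @ (z - h(mu_bar))
    Sigma_new = (np.eye(len(mu)) - K @ H) @ Sigma_bar
    return mu_new, Sigma_new

# Toy 1D robot + 1 landmark: state = [robot_x, landmark_x]
g = lambda mu, u: mu + np.array([u, 0.0])   # robot moves by u, landmark static
G = np.eye(2)
h = lambda mu: np.array([mu[1] - mu[0]])    # measure range to the landmark
H = np.array([[-1.0, 1.0]])
R = np.diag([0.01, 0.0])                    # motion noise (robot only)
Q = np.array([[0.04]])                      # measurement noise

mu, Sigma = np.zeros(2), np.diag([0.0, 1.0])
mu, Sigma = ekf_step(mu, Sigma, u=1.0, z=np.array([4.1]), g=g, G=G,
                     h=h, H=H, R=R, Q=Q)
print(mu, Sigma)
```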

Challenges and Limitations of EKF-SLAM

  • Quadratic complexity: the covariance matrix grows with the number of landmarks N, so each update costs O(N²) in time and memory, limiting the practical map size.
  • Linearization errors: the motion and observation models are linearized through their Jacobians; with strongly non-linear models or large uncertainty, the Gaussian approximation degrades and the filter can become inconsistent or diverge.
  • Unimodal Gaussian assumption: the posterior is represented by a single Gaussian, which cannot capture multi-modal ambiguity.
  • Fragile data association: a single incorrect landmark correspondence corrupts the state estimate irrecoverably, since the filter cannot revise past decisions.


Q9: Image Formation

Question

Derive the pinhole camera model. Show the full relationship between a 3D world coordinate and a 2D pixel coordinate, explaining the role of the intrinsic and extrinsic parameter matrices.

Solution

The pinhole camera model is the simplest mathematical model for describing how a 3D scene is projected onto a 2D image plane.

Ideal Pinhole Model and Similar Triangles

Imagine a camera as an ideal box with a tiny aperture (a pinhole) on the front and an image plane at the back. A light ray from a 3D point $X = [X, Y, Z]^T$ in the camera's coordinate system passes through the pinhole (the origin) and strikes the image plane, which is located at a distance f (the focal length) from the pinhole.

Using similar triangles, we can derive the coordinates (x,y) of the projected point on the image plane:

$\frac{x}{f} = \frac{X}{Z} \;\Rightarrow\; x = f\frac{X}{Z}, \qquad \frac{y}{f} = \frac{Y}{Z} \;\Rightarrow\; y = f\frac{Y}{Z}$

From Ideal to Pixel Coordinates: The Intrinsic Matrix

The ideal coordinates (x,y) are in metric units on the image plane. We need to convert them to pixel coordinates (u,v). This involves two steps:

  1. Scaling by pixel size: The focal lengths $f_x$ and $f_y$ are expressed in pixels; they are the metric focal length f multiplied by the pixel densities (in pixels per meter) along each image axis, and they scale the metric coordinates.
  2. Offsetting by the principal point: The origin of the pixel coordinate system is usually at the top-left corner, while the optical axis intersects the image plane at the principal point $(c_x, c_y)$.

This relationship can be expressed in matrix form using homogeneous coordinates:

$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} X/Z \\ Y/Z \\ 1 \end{bmatrix}$

The 3×3 matrix is the intrinsic matrix K, which contains the internal parameters of the camera.

From World to Camera Coordinates: The Extrinsic Matrix

A 3D point is usually defined in a world coordinate system, not the camera's. We need to transform the world point $X_{world}$ into the camera's coordinate system, $X_{cam}$. This is a standard rigid body transformation defined by a rotation matrix R and a translation vector t:

$X_{cam} = R X_{world} + t$

This can be written in homogeneous coordinates using a 3×4 matrix [R|t], known as the extrinsic matrix.

The Full Camera Projection Matrix

Combining all steps, we can write a single equation that maps a 3D point in the world frame directly to a 2D point in the image frame.
Let $x_{img} = [u, v, 1]^T$ and $X_{world} = [X_w, Y_w, Z_w, 1]^T$.

$\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_x \\ r_{21} & r_{22} & r_{23} & t_y \\ r_{31} & r_{32} & r_{33} & t_z \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$

Here, λ is the depth of the point in the camera frame. This is often written more compactly using the 3×4 camera projection matrix P:

$\lambda x_{img} = P X_{world} \qquad \text{where} \quad P = K[R|t]$
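
A compact numpy sketch of the full projection pipeline (all parameter values are assumed demo choices):

```python
import numpy as np

def project(X_world, K, R, t):
    """Project a 3D world point to pixel coordinates with a pinhole camera."""
    X_cam = R @ X_world + t            # extrinsics: world -> camera frame
    x_h = K @ X_cam                    # intrinsics: camera frame -> pixels
    return x_h[:2] / x_h[2]            # divide by depth (lambda = x_h[2])

K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                          # camera aligned with the world frame
t = np.array([0.0, 0.0, 0.0])

print(project(np.array([0.5, -0.2, 4.0]), K, R, t))  # -> [395., 210.]
```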

Q10: Triangulation

Question

What is triangulation? Given two camera matrices $P_1$ and $P_2$ and a pair of corresponding image points $x_1$ and $x_2$, explain the mathematical procedure for finding the 3D point X.

Solution

Triangulation is the geometric process of determining the 3D position of a point by observing it from two or more different viewpoints with known positions and orientations. Given a pair of corresponding 2D points in two images, we can back-project them as rays into 3D space. The intersection of these rays reveals the location of the 3D point.

Mathematical Procedure

Let the two camera projection matrices be $P_1$ and $P_2$, and the corresponding homogeneous image points be $x_1$ and $x_2$. The projection equations are:

$\lambda_1 x_1 = P_1 X, \qquad \lambda_2 x_2 = P_2 X$

where $\lambda_1$ and $\lambda_2$ are the unknown depths. We can eliminate these unknown scalars by using the vector cross product property $v \times v = 0$. Since $P_1 X$ must be parallel to $x_1$, their cross product is zero:

$x_1 \times (P_1 X) = 0, \qquad x_2 \times (P_2 X) = 0$

Let $x_1 = [u_1, v_1, 1]^T$ and let the rows of $P_1$ be $p_1^{(1)T}$, $p_2^{(1)T}$, and $p_3^{(1)T}$. The cross product $x_1 \times (P_1 X)$ yields three linear equations in the components of X:

$\begin{cases} v_1 \, (p_3^{(1)T} X) - (p_2^{(1)T} X) = 0 \\ (p_1^{(1)T} X) - u_1 \, (p_3^{(1)T} X) = 0 \\ u_1 \, (p_2^{(1)T} X) - v_1 \, (p_1^{(1)T} X) = 0 \end{cases}$

Only two of these equations are linearly independent. We take the two independent equations from each camera to form a system of four linear equations in the four components of the homogeneous 3D point X. This system can be written in the form:

$AX = 0$

where A is a 4×4 matrix constructed from the elements of $x_1$, $x_2$, $P_1$, and $P_2$.

In practice, due to noise in the image point locations, the back-projected rays will not intersect exactly, so the system $AX = 0$ has no exact non-trivial solution. The problem becomes finding the X that is "closest" to satisfying all equations, i.e., the X that minimizes $\|AX\|$ subject to $\|X\| = 1$ (the constraint excludes the trivial solution X = 0). This is a linear least-squares problem: the solution for X is the eigenvector of the matrix $A^T A$ corresponding to its smallest eigenvalue, or equivalently the right singular vector of A associated with its smallest singular value.
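
A minimal numpy implementation of this linear (DLT) triangulation; the two cameras and the test point are assumed demo values:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two views."""
    u1, v1 = x1
    u2, v2 = x2
    A = np.stack([
        v1 * P1[2] - P1[1],   # two independent equations from camera 1
        P1[0] - u1 * P1[2],
        v2 * P2[2] - P2[1],   # two independent equations from camera 2
        P2[0] - u2 * P2[2],
    ])
    # Least-squares null vector of A: right singular vector for the smallest
    # singular value (= eigenvector of A^T A with the smallest eigenvalue)
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def project(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Demo: a canonical camera and a translated camera observe a known point
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.3, -0.1, 4.0])
print(triangulate(P1, P2, project(P1, X_true), project(P2, X_true)))  # ~X_true
```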


Q11: Recovering Rotation from Homography

Question

If a homography H is known to be induced by a pure camera rotation R, and the camera's intrinsic matrix K is also known, how can the rotation R be recovered?

Solution

When a camera undergoes a pure rotation R about its optical center, with no translation, the relationship between a point $x_1$ in the first view and its corresponding point $x_2$ in the second view is described by the homography matrix H:

$x_2 = Hx_1$

This homography matrix H is related to the camera's intrinsic matrix K and the rotation matrix R by the following equation:

$H = KRK^{-1}$

Our goal is to solve this equation for R. This can be achieved through straightforward algebraic manipulation.

Derivation:

  1. Start with the known relationship: $H = KRK^{-1}$
  2. Pre-multiply both sides by the inverse of the intrinsic matrix, $K^{-1}$: $K^{-1}H = K^{-1}(KRK^{-1})$
  3. Since matrix multiplication is associative, we can regroup the terms on the right side: $K^{-1}H = (K^{-1}K)RK^{-1}$
  4. The product of a matrix and its inverse is the identity matrix, I: $K^{-1}H = IRK^{-1} = RK^{-1}$
  5. Now, to isolate R, post-multiply both sides by the intrinsic matrix, K: $(K^{-1}H)K = (RK^{-1})K$
  6. Again, using associativity and the property of the inverse: $K^{-1}HK = R(K^{-1}K) = RI = R$

This gives us the final expression for recovering the rotation matrix:

$R = K^{-1}HK$

Intuition: This result shows that the homography matrix H is a similarity transform of the pure rotation matrix R. The rotation is "conjugated" or "twisted" by the camera's internal geometry, represented by K. To recover the pure rotation, we must "un-twist" the homography by applying the inverse intrinsics from the left and the regular intrinsics from the right.
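
One practical caveat: an estimated H is generally only known up to scale and is corrupted by noise, so the raw product $K^{-1}HK$ is only approximately a rotation. The sketch below normalizes the scale via the determinant and projects onto the nearest rotation with an SVD (both standard post-processing steps; all demo values are assumed):

```python
import numpy as np

def rotation_from_homography(H, K):
    """Recover R from H = K R K^(-1), tolerating scale and noise in H."""
    M = np.linalg.inv(K) @ H @ K
    M = M / np.cbrt(np.linalg.det(M))   # remove the unknown scale (det(R) = 1)
    U, _, Vt = np.linalg.svd(M)         # project onto the nearest rotation
    R = U @ Vt
    if np.linalg.det(R) < 0:
        R = -R
    return R

# Demo with assumed values: build H from a known R, scale it, and recover R
K = np.array([[640.0, 0.0, 320.0],
              [0.0, 640.0, 240.0],
              [0.0, 0.0, 1.0]])
c, s = np.cos(0.2), np.sin(0.2)
R_true = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
H = 3.7 * K @ R_true @ np.linalg.inv(K)   # arbitrary projective scale
assert np.allclose(rotation_from_homography(H, K), R_true)
```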


Q12: Properties of the Fundamental Matrix

Question

What is the relationship between the Fundamental Matrix F and the epipoles $e_1$ and $e_2$? Prove these relationships mathematically.

Solution

The Fundamental Matrix F is a 3×3 matrix of rank 2 that encapsulates the epipolar geometry between two camera views. The epipoles, $e_1$ and $e_2$, are the projections of one camera's center onto the other camera's image plane. There are two fundamental relationships that define the epipoles as the null-spaces of the Fundamental Matrix.

Relationship 1: The Left Null-Space

The epipole in the second image, $e_2$, is the left null-space of the Fundamental Matrix.

$e_2^T F = 0$

Intuition and Derivation:
The epipolar line $l_2$ in the second image corresponding to a point $x_1$ in the first image is given by the equation $l_2 = Fx_1$.
The epipole $e_2$ is the point in the second image where all epipolar lines intersect. Therefore, for any and every epipolar line $l_2$, the epipole $e_2$ must lie on that line.

The condition for a point to lie on a line in homogeneous coordinates is that their dot product is zero. Thus, for any $x_1$:

$e_2^T l_2 = 0$

Substituting the expression for $l_2$:

$e_2^T (Fx_1) = 0$

Using associativity of matrix multiplication:

$(e_2^T F) x_1 = 0$

This equation must hold true for all possible image points $x_1$. The only way the row vector $e_2^T F$ can give zero against every arbitrary $x_1$ is if it is itself the zero vector. Therefore:

$e_2^T F = 0^T$

This proves that $e_2$ is the left null-space of F.

Relationship 2: The Right Null-Space

The epipole in the first image, $e_1$, is the right null-space (or simply, the null-space) of the Fundamental Matrix.

$Fe_1 = 0$

Intuition and Derivation:
Similarly, the epipolar line $l_1$ in the first image corresponding to a point $x_2$ in the second image is given by $l_1 = F^T x_2$.
The epipole $e_1$ is the intersection point of all epipolar lines in the first image. Therefore, $e_1$ must lie on every line $l_1$. The condition is:

$e_1^T l_1 = 0$

Substituting the expression for $l_1$:

$e_1^T (F^T x_2) = 0$

Using the property of the transpose, $(AB)^T = B^T A^T$, we can rewrite $e_1^T F^T$ as $(Fe_1)^T$:

$(Fe_1)^T x_2 = 0$

This equation must hold for all possible image points $x_2$. For this to be true, the vector $Fe_1$ must be the zero vector. Therefore:

$Fe_1 = 0$

This proves that $e_1$ is the right null-space of F.
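
Numerically, both epipoles drop out of the SVD of F. A short numpy sketch (the demo F is built from an assumed normalized-camera pair, for which F takes the essential-matrix form $[t]_\times R$):

```python
import numpy as np

def epipoles(F):
    """Return (e1, e2): the right and left null vectors of F."""
    U, _, Vt = np.linalg.svd(F)
    e1 = Vt[-1]      # right null-space:  F e1 = 0
    e2 = U[:, -1]    # left null-space:   e2^T F = 0
    return e1 / e1[2], e2 / e2[2]

# Demo: normalized cameras related by a translation t (and R = I)
t = np.array([1.0, 0.2, 0.1])
t_x = np.array([[0, -t[2], t[1]],
                [t[2], 0, -t[0]],
                [-t[1], t[0], 0]])
F = t_x @ np.eye(3)

e1, e2 = epipoles(F)
print(np.allclose(F @ e1, 0), np.allclose(e2 @ F, 0))  # True True
```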


Q13: Lie Algebra for Robotics Optimization

Question

What are the Lie Group SE(3) and the Lie Algebra se(3)? Explain the role of the exponential and logarithmic maps in connecting them, especially in the context of optimization for SLAM.

Solution

Lie theory provides the mathematical foundation for representing and optimizing rigid body motions in a way that correctly respects their geometric structure.

The Lie Group SE(3): The Space of Poses

SE(3) is the Special Euclidean Group in 3D. It is the set of all 4×4 homogeneous transformation matrices that represent rigid body motions (a rotation and a translation).

$T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix} \quad \text{where } R \in SO(3),\; t \in \mathbb{R}^3$

It is called a group because it satisfies the group axioms (closure under multiplication, identity element, inverse for every element, associativity). More importantly, it is a smooth manifold. This means it is a space that locally looks like a standard Euclidean space, but has a global curved structure. This curvature is the reason why standard vector addition is not a valid operation for poses; the sum of two transformation matrices is not, in general, a valid transformation matrix.

The Lie Algebra se(3): The Space of Velocities

The Lie Algebra se(3) can be intuitively understood as the tangent space to the SE(3) manifold at the identity element. It is the space of all possible infinitesimal motions, or velocities. It is a 6-dimensional vector space.

An element $\xi \in \mathfrak{se}(3)$ is a 6-dimensional vector called a twist, which combines a linear velocity v and an angular velocity ω:

$\xi = \begin{bmatrix} v \\ \omega \end{bmatrix} \in \mathbb{R}^6$

This vector can be mapped to a 4×4 matrix form, often written with a "hat" or "wedge" operator:

$[\xi]_\times = \begin{bmatrix} [\omega]_\times & v \\ 0 & 0 \end{bmatrix} \quad \text{where } [\omega]_\times = \begin{bmatrix} 0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0 \end{bmatrix}$

Connecting the Spaces: Exponential and Logarithmic Maps

The exponential map, $\exp: \mathfrak{se}(3) \to SE(3)$, takes a twist in its 4×4 matrix form and produces a valid transformation matrix, $T = \exp([\xi]_\times)$. Intuitively, it integrates a constant velocity for one unit of time, wrapping a straight line in the tangent space onto the curved manifold. The logarithmic map, $\log: SE(3) \to \mathfrak{se}(3)$, is its local inverse: it recovers the minimal 6-vector representation of a given pose, $[\xi]_\times = \log(T)$.

Role in SLAM Optimization

In optimization problems like SLAM or Bundle Adjustment, we need to iteratively update an estimate of a camera pose, $T_{old}$. Since we cannot simply add an update to $T_{old}$ on the curved SE(3) manifold, we use Lie theory as follows:

  1. The optimization algorithm (e.g., Gauss-Newton) computes an update step. This update step, $\delta\xi$, is calculated in the local tangent space, which is a well-behaved 6D Euclidean vector space.
  2. This small update vector $\delta\xi$ is then mapped from the Lie algebra back onto the manifold using the exponential map to become a small transformation matrix, $\exp([\delta\xi]_\times)$.
  3. This small transformation is then applied multiplicatively to the current pose estimate to get the new, refined pose: $T_{new} = T_{old} \cdot \exp([\delta\xi]_\times)$

This procedure guarantees that the updated pose $T_{new}$ is always a valid member of SE(3), correctly respecting the geometry of the problem.
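
A minimal numpy/scipy sketch of this multiplicative update (scipy's generic matrix exponential expm stands in for a closed-form se(3) exponential; the twist values are assumed for the demo):

```python
import numpy as np
from scipy.linalg import expm

def hat(xi):
    """Map a 6-vector twist xi = [v, w] to its 4x4 matrix form."""
    v, w = xi[:3], xi[3:]
    w_x = np.array([[0, -w[2], w[1]],
                    [w[2], 0, -w[0]],
                    [-w[1], w[0], 0]])
    M = np.zeros((4, 4))
    M[:3, :3] = w_x
    M[:3, 3] = v
    return M

# Current pose estimate (identity for the demo) and a small update twist
T_old = np.eye(4)
delta_xi = np.array([0.01, 0.0, -0.02, 0.0, 0.03, 0.0])  # [v | w], assumed

# Multiplicative update on the manifold: always lands back in SE(3)
T_new = T_old @ expm(hat(delta_xi))

R = T_new[:3, :3]
assert np.allclose(R @ R.T, np.eye(3))       # rotation block stays orthogonal
assert np.isclose(np.linalg.det(R), 1.0)
print(T_new)
```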