Best Foot Forward: Robust Foot Reconstruction in-the-wild

Hike Medical & University of Cambridge

ICCV 2025
Workshop on Advanced Perception for Autonomous Healthcare


Abstract

Accurate 3D foot reconstruction is crucial for personalized orthotics, digital healthcare, and virtual try-ons. However, existing methods struggle with incomplete scans and anatomical variations, particularly in self-scanning scenarios where user mobility is limited, making it difficult to capture areas like the arch and heel. We propose a novel end-to-end pipeline that enhances Structure-from-Motion (SfM) reconstruction by integrating a transformer-based geometry completion network trained on synthetically augmented point clouds. This network encodes anatomical priors to improve reconstruction accuracy, while a separate viewpoint prediction module provides SE(3) canonicalization, reducing ambiguities in scan alignment. Our approach achieves state-of-the-art performance on reconstruction metrics while preserving clinically validated anatomical fidelity. By combining synthetic training data with learned geometric priors, we enable robust foot reconstruction under unconstrained capture conditions, unlocking new opportunities for mobile-based 3D scanning in healthcare and retail.


Method Overview


This paper presents a robust two-phase approach for complete foot reconstruction from unposed images. The first phase employs Structure-from-Motion (SfM) and Multi-View Stereo (MVS) to estimate camera parameters and generate an initial point cloud of the foot. The second phase leverages a shape completion module to refine and complete the partial geometry, producing a dense point cloud representation. A key innovation is the introduction of a viewpoint prediction (VPP) module, which provides a robust mechanism for transforming the output from SfM/MVS into a canonical frame suitable for shape completion.

The method's pipeline involves two main branches that process the input images: a viewpoint prediction branch that estimates the pose of the foot relative to a predefined template mesh, and an SfM & MVS branch that applies GLOMAP for camera parameter estimation and MVSFormer++ for dense point cloud generation. These branches work together with SAM2 for image segmentation, providing accurate foot isolation in the images. After processing through these branches, the system performs point cloud canonicalization using the estimated camera parameters, transforming the partial point cloud into a known reference frame to prepare it for completion.
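To make the canonicalization step concrete, the sketch below applies a rigid SE(3) transform, as estimated by the viewpoint prediction branch, to the partial MVS point cloud. This is a minimal illustration: the function name and the way the rotation and translation are obtained are assumptions, not the paper's actual interface.

import numpy as np

def canonicalize_point_cloud(points, R, t):
    """Map a partial scan from the SfM world frame into the canonical template frame.

    points : (N, 3) array of MVS points.
    R      : (3, 3) rotation, assumed to come from the viewpoint prediction branch.
    t      : (3,) translation aligning the scan to the template mesh.
    """
    R = np.asarray(R, dtype=np.float64)
    t = np.asarray(t, dtype=np.float64)
    # Rigid transform x' = R x + t applied to every point.
    return np.asarray(points, dtype=np.float64) @ R.T + t

# Hypothetical usage: partial_pcd from the MVS branch, R_pred / t_pred from the VPP branch.
# canonical_pcd = canonicalize_point_cloud(partial_pcd, R_pred, t_pred)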

For the shape completion stage, the system employs PointAttN, an attention-based model that captures both local geometric details and global shape structures. This network predicts a global latent vector from the input point cloud to guide the reconstruction in a two-stage coarse-to-fine manner. A key feature is the skip connection that incorporates part of the input point cloud into the coarse structure prediction, ensuring that the completed foot geometry conforms closely to the input. Finally, the system applies screened Poisson surface reconstruction to produce a meshed geometry suitable for applications like orthotics.
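For the final meshing step, a minimal sketch of screened Poisson surface reconstruction using Open3D is given below; the normal-estimation radius and octree depth are illustrative defaults, not the values used in the paper.

import open3d as o3d

def mesh_completed_foot(points, depth=9):
    # points: (N, 3) numpy array produced by the shape completion network.
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    # Screened Poisson reconstruction requires consistently oriented normals.
    pcd.estimate_normals(
        search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.01, max_nn=30))
    pcd.orient_normals_consistent_tangent_plane(k=30)
    mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
        pcd, depth=depth)
    return mesh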


Results

Evaluation of Foot Completion Module

We evaluated our foot completion module against three established foot models:

  • PCA-based model: a linear shape model with vertex correspondences established via functional maps.
  • SUPR model: an anatomically precise parametric foot model.
  • FIND framework: an expansive latent space for shape and pose control.

For each model, we optimized shape, pose, and transformation parameters via gradient descent, using the Adam optimizer to minimize Chamfer distance. The results, assessed with Chamfer and Hausdorff distance metrics, are visually showcased in the gallery below.
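As a rough illustration of this fitting procedure, the sketch below minimizes a symmetric Chamfer loss with Adam over shape, pose, and rigid-transform parameters; decode_foot is a hypothetical stand-in for the PCA / SUPR / FIND forward pass, and the parameter dimensions are assumptions.

import torch

def chamfer_distance(a, b):
    # Symmetric Chamfer distance between point sets a: (N, 3) and b: (M, 3).
    d = torch.cdist(a, b)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def fit_model(decode_foot, target, n_shape=10, n_pose=6, steps=500, lr=1e-2):
    # decode_foot(shape, pose, transform) -> (N, 3) vertices; a placeholder for the
    # PCA / SUPR / FIND forward pass, not part of the released code.
    shape = torch.zeros(n_shape, requires_grad=True)
    pose = torch.zeros(n_pose, requires_grad=True)
    transform = torch.zeros(6, requires_grad=True)  # axis-angle rotation + translation
    opt = torch.optim.Adam([shape, pose, transform], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = chamfer_distance(decode_foot(shape, pose, transform), target)
        loss.backward()
        opt.step()
    return shape.detach(), pose.detach(), transform.detach()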


Gallery columns: Input Scan · PCA · SUPR · FIND · Ours + PCA · Ours + SUPR · Ours + FIND · Ground Truth



End-to-End Reconstruction Evaluation

We tested our method in an end-to-end reconstruction setup from unposed video frames, benchmarking against two leading pipelines:

  • COLMAP: Reconstructs 3D geometry via Structure-from-Motion (SfM) and Multi-View Stereo (MVS).
  • Gaussian Opacity Fields: A state-of-the-art differentiable rendering approach that optimizes 3D Gaussians from images.

The evaluation used 30 consumer-grade videos captured under varied conditions. Three clinicians with expertise in foot anatomy and orthotics reviewed randomized renders from each baseline and from our method, rating every reconstruction on a 5-point scale for Anatomical Accuracy, Completeness, and Smoothness & Realism.


Gallery columns: Reference · COLMAP · Gaussian Opacity Fields · Ours

BibTeX

@misc{fogarty2025bestfootforwardrobust,
  title={Best Foot Forward: Robust Foot Reconstruction in-the-wild},
  author={Kyle Fogarty and Jing Yang and Chayan Kumar Patodi and Aadi Bhanti and Steven Chacko and Cengiz Oztireli and Ujwal Bonde},
  year={2025},
  eprint={2502.20511},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2502.20511},
}