D4RT clone
A single feed-forward model that turns ordinary video into 4D geometry: per-pixel depth, long-range 3D point tracking, and camera pose, all recovered from one unified query interface.
A short video clip is resized and fed to the model.
Reproduced the paper's benchmarks on a single NVIDIA L40S GPU. Matched the paper's ViT-B depth loss on Sintel and stayed competitive with the 1B-parameter VGGT model on Sintel PC L1 geometric-mean loss (1.635 vs 1.582) with 7× fewer parameters.
Engineered a Pi3/LoGeR-style 5-layer pose head with 9D-rotation SVD orthogonalization to fix the pose-estimation bottleneck, cutting ATE by 53% and beating the published ViT-B baseline on camera pose by 26%.
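To illustrate the SVD orthogonalization step, here is a minimal NumPy sketch. It assumes the pose head emits an unconstrained 9-number vector per frame that is reshaped to a 3×3 matrix and projected onto the nearest rotation. The function name and shapes are hypothetical and are not taken from the project's code.

```python
import numpy as np

def svd_orthogonalize(raw9: np.ndarray) -> np.ndarray:
    """Project raw 9-D vectors, shape (..., 9), onto valid 3x3 rotations.

    Hypothetical helper: the real pose head may differ in layout and framework.
    """
    m = raw9.reshape(*raw9.shape[:-1], 3, 3)
    u, _, vt = np.linalg.svd(m)  # batched SVD over the leading dimensions
    # u @ vt is orthogonal but may be a reflection (det = -1).
    # Flipping the last singular direction forces det = +1.
    sign = np.sign(np.linalg.det(u @ vt))
    d = np.ones(raw9.shape[:-1] + (3,))
    d[..., 2] = sign
    return u @ (d[..., :, None] * vt)

# Example: four unconstrained network outputs become four rotation matrices.
rng = np.random.default_rng(0)
rotations = svd_orthogonalize(rng.normal(size=(4, 9)))
```

Regressing nine free numbers and projecting afterwards avoids the discontinuities of quaternion or Euler-angle outputs, which is the usual motivation for this parameterization.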
Vectorized query generation (256× speedup), pre-resized fp16 dataset caches (35× less disk), automatic mixed precision (AMP), and CUDA-allocator tuning enabled full ViT-B training in 46 GB on a single GPU. Inference runs at 50 clips/sec in fp16 with a 3.5 GB peak memory footprint.
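The idea behind vectorized query generation can be sketched as follows: draw every query in a few array calls rather than building them one at a time in a Python loop. The query layout used here (normalized pixel coordinates plus source, target, and camera frame indices) is an assumption made for illustration, and the sketch does not reproduce the project's code or its measured speedup.

```python
import numpy as np

def sample_queries(rng: np.random.Generator, n_queries: int, n_frames: int) -> np.ndarray:
    """Sample all queries at once; returns an array of shape (n_queries, 5).

    Assumed (hypothetical) column layout: u, v, t_src, t_tgt, t_cam.
    """
    # Normalized pixel coordinates in [0, 1), drawn in a single call.
    uv = rng.random((n_queries, 2), dtype=np.float32)
    # Source, target, and camera frame indices in [0, n_frames).
    t = rng.integers(0, n_frames, size=(n_queries, 3))
    return np.concatenate([uv, t.astype(np.float32)], axis=1)

rng = np.random.default_rng(0)
queries = sample_queries(rng, n_queries=2048, n_frames=48)
```

The same pattern carries over to GPU tensors, where removing the per-query loop also removes per-item interpreter overhead.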









