Optimizing vertex deformations from video for mesh animation is often limited by the standard rendering-based reconstruction loss. While existing approaches mitigate this limitation through novel representations, supervision schemes, and paradigms, their supervision signals remain confined to the 2D domain. Such 2D supervision is inherently problematic: it provides no signal for occluded regions and only indirect cues for visible ones. Consequently, these methods often suffer from severe shape and motion artifacts. To address this, we propose Shape Flow Guidance (SFG), a sequence of 3D shapes derived from video that serves as explicit 3D supervision for mesh animation. We derive this sequence by intervening in the sampling process of a pretrained mesh generator in a training-free manner. We further tailor a skeletal animation model that separates local deformation from global transformation. This separation allows SFG to supervise complex local motion while reserving rendering-based supervision for simple global motion. Extensive experiments confirm that our method significantly outperforms prior work qualitatively, quantitatively, and in processing speed.