With the advent of large-scale multimodal pretraining, MLLMs have started to demonstrate emergent reason- ing capabilities. However, such inferences are often shallow, relying primarily on implicit ...