Large language models have shown promise across specialized domains, but their performance limits in disaster risk reduction remain poorly understood. We conduct a version-specific evaluation of ...