Fluent human-robot interaction is essential for robots to be accepted and used in social scenarios such as education, healthcare, and entertainment. Foundation models, and Large Language Models (LLMs) in particular, facilitate the natural generation of spoken text. However, embodied reactions and expressions, such as gestures and body motions, are also key to more fluent and natural human-robot conversations. We develop a framework for human-robot conversation that leverages LLMs to generate the robot's speech, gestures, and even eye color in response to a human inquiry, effectively capturing the context of the dialogue. Using this framework, we perform two user studies to evaluate how these embodied properties affect the perceived fluency of interactions. In these studies, we demonstrate that the addition of gestures can improve perceived naturalness by almost 2 points on a 7-point Likert scale, with a stronger effect in real-world scenarios than in virtual environments.
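
To make the framework description concrete, the following is a minimal sketch of how an LLM could be prompted to return speech, a gesture label, and an eye color for a single dialogue turn. It is an assumption-laden illustration, not the paper's implementation: the JSON schema, the gesture labels, and the query_llm() helper are hypothetical placeholders for whatever LLM client and robot control interface are actually used.

```python
# Illustrative sketch only: field names, gesture labels, and query_llm()
# are hypothetical, not the framework described in the paper.
import json
from dataclasses import dataclass


@dataclass
class RobotResponse:
    speech: str      # text to be spoken aloud
    gesture: str     # symbolic gesture label, e.g. "nod" or "wave"
    eye_color: str   # LED eye color, e.g. "#00FF88"


SYSTEM_PROMPT = (
    "You control a social robot. Given the dialogue so far and the user's "
    "latest inquiry, reply with JSON containing the keys "
    "'speech', 'gesture', and 'eye_color'."
)


def query_llm(prompt: str) -> str:
    """Placeholder for a call to any chat-capable LLM API.

    Returns a canned response here so the sketch runs end to end.
    """
    return json.dumps(
        {"speech": "Hello! How can I help?", "gesture": "wave", "eye_color": "#00FF88"}
    )


def respond(dialogue_history: list[str], inquiry: str) -> RobotResponse:
    # Pack the conversation context and the new inquiry into one prompt,
    # so the generated speech, gesture, and eye color reflect the dialogue.
    prompt = SYSTEM_PROMPT + "\n" + "\n".join(dialogue_history) + f"\nUser: {inquiry}"
    raw = query_llm(prompt)
    fields = json.loads(raw)  # expects the JSON schema requested above
    return RobotResponse(**fields)


if __name__ == "__main__":
    print(respond(["User: Hi robot.", "Robot: Hi there!"], "Can you help me study?"))
```

In practice, the parsed gesture label and eye color would be mapped onto the robot's motion primitives and LED controller while the speech text is sent to a text-to-speech engine; that mapping is platform-specific and outside this sketch.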