Spoken dialogue systems that help users complete complex tasks, such as booking movie tickets, have become an emerging research topic in artificial intelligence and natural language processing. With a well-designed dialogue system serving as an intelligent personal assistant, people can accomplish certain tasks more easily through natural language interaction. Several intelligent virtual assistants are on the market today; however, most focus only on textual or vocal interaction. In this paper, we present HUMBO, a system that generates dialogue responses and simultaneously synthesizes the corresponding facial expressions for better multimodal interaction. HUMBO can (1) let users determine the appearance of the virtual assistant from a single image, and (2) generate coherent emotional utterances and facial expressions on the user-provided image. This is not only a new research direction but, more importantly, a step toward more human-like virtual assistants.