Lab 5: LLM-Based Human-Robot Dialogue


Learning Goals


To Complete Before Lab


Preparing Your Development Environment

Beyond the steps you already took to prepare your development environment in Lab 3, this lab requires the following additional setup to run the code properly:

  1. Python version: Ensure that your python version is >= 3.10. You can check your python version by running in your terminal: python3 --version.
    Note: If your python version is < 3.10, you can update your python version by downloading the most recent version of python from python.org. If you download a new python version, you will also need to update or recreate your virtual environment so the new version of python is used there as well.
  2. ffmpeg: You will need to have ffmpeg installed in order to stream Misty's audio to the Deepgram transcription service. You can check whether you have ffmpeg installed by running ffmpeg in your terminal. If ffmpeg is installed, you should see its version and configuration information, which should look something like this:
    ffmpeg version 7.1.1 Copyright (c) 2000-2025 the FFmpeg developers
    built with Apple clang version 16.0.0 (clang-1600.0.26.6)
    configuration: --prefix=/usr/local/Cellar/ffmpeg/7.1.1_1 --enable-shared --enable-pthreads --enable-version3 --cc=clang --host-cflags= --host-ldflags='-Wl,-ld_classic' --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libaribb24 --enable-libbluray --enable-libdav1d --enable-libharfbuzz --enable-libjxl --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librist --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libspeex --enable-libsoxr --enable-libzmq --enable-libzimg --disable-libjack --disable-indev=jack --enable-videotoolbox --enable-audiotoolbox
    libavutil      59. 39.100 / 59. 39.100
    libavcodec     61. 19.101 / 61. 19.101
    libavformat    61.  7.100 / 61.  7.100
    libavdevice    61.  3.100 / 61.  3.100
    libavfilter    10.  4.100 / 10.  4.100
    libswscale      8.  3.100 /  8.  3.100
    libswresample   5.  3.100 /  5.  3.100
    libpostproc    58.  3.100 / 58.  3.100
    Universal media converter
    usage: ffmpeg [options] [[infile options] -i infile]... {[outfile options] outfile}...
    
    Use -h to get full help or, even better, run 'man ffmpeg'
    If you do not have ffmpeg installed, I'd suggest following the installation instructions that match your operating system (e.g., from the download page at ffmpeg.org). Once you have installed ffmpeg, run ffmpeg in your terminal again; you should see output like the example above.
  3. Install additional dependency packages: First, activate the virtual environment that you created in Lab 3. Then, install the following additional dependencies:
    pip install deepgram-sdk
    pip install ffmpeg-python
    pip install openai
    pip install mutagen
  4. Store API keys as environment variables in a .env file: In order to use Deepgram, Gemini, and OpenAI for this lab, you'll need to specify the API keys for each service in your code. As we don't want these API keys to be publicly accessible on the web, we have shared them with you directly both via email and a Canvas announcement. To store these API keys properly in your project (see the example .env sketch after this list):
    • Create a new file named .env within your hri_course_misty_programming directory (the same directory that contains both your virtual environment and your Misty Python SDK (PythonSDK) directory).
    • Copy and paste the API keys into your .env file. You can find the API key information both via email and a Canvas announcement.
  5. Add starter code: Within your HRI course directory (hri_course_misty_programming), either:
    • Use our template repo (recommended): Create a new git repository from our template repo lab_5_LLM_based_human_robot_dialogue. To do this, click on the "Use this template" button and clone your new repository into your HRI course directory (hri_course_misty_programming).
    • Manually copy and paste starter code: To manually copy and paste the starter code, create a new directory for Lab 5 (e.g., lab_5_LLM_based_human_robot_dialogue) and copy all of the contents of our template repo lab_5_LLM_based_human_robot_dialogue into your new directory.
  6. Test your setup: To test and see if everything is working properly, run the test_dependencies.py file:
    python3 test_dependencies.py
    If everything is working properly, you should not see any errors and the code should exit without printing anything in your terminal.
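
As promised in step 4, here is a rough sketch of what a .env file looks like: it is simply a plain-text file of KEY=VALUE pairs that the code reads as environment variables. The variable names and values below are illustrative placeholders; use the exact names given in the email and Canvas announcement.

# .env (illustrative placeholder names and values)
DEEPGRAM_API_KEY=your_deepgram_key_here
GEMINI_API_KEY=your_gemini_key_here
OPENAI_API_KEY=your_openai_key_here

If you are curious how such a file is typically read in Python, the python-dotenv package loads it as shown below (this package is not in the dependency list above, and the starter code may handle this differently):

from dotenv import load_dotenv  # pip install python-dotenv
import os

load_dotenv()  # loads KEY=VALUE pairs from .env into environment variables
openai_key = os.getenv("OPENAI_API_KEY")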

Working in Groups


During this lab, you will work with the same group that you worked with for Lab 4. Similar to Lab 4, each group will turn in one piece of code / set of deliverables.

Lab 5 Deliverables & Submission


With the starter code we've provided, in Lab 5 you are expected to:

You are expected to upload the following to Canvas after you have completed the lab:

To receive credit for this lab, you will need to submit your video and code to Canvas by Friday, April 25 at 6:00pm.

Running the Code


  1. Activate your virtual environment: source .venv/bin/activate
  2. Run an HTTP server from your hri_course_misty_programming directory: python3 -m http.server
    • This is required because the Misty robot needs to access the speech files generated using OpenAI in order to play them on the robot.
  3. Run the code: python3 llm_based_human_robot_dialogue.py MISTY_IP_ADDRESS

An Overview of the Starter Code


The starter code contains several files:

Talking Back-and-Forth with Misty: Speech-to-Text, Text Generation, Text-to-Speech

While you are not required to understand in detail how llm_based_human_robot_dialogue.py in the starter code works in order to complete this lab, I want to provide a brief overview for those interested in how this code enables Misty to have a back-and-forth conversation with a person. This conversation consists of three main steps: speech-to-text, text generation, and text-to-speech.

Speech-to-text: The Misty robot uses the Deepgram API to transcribe the human participant's speech to text. Whenever the robot is ready to "listen" to a person, the starter code turns the robot's LED blue, starts streaming Misty's microphone feed in start_cam(), and initializes a DeepgramClient and connects it to the Misty microphone feed via a websocket in initialize_deepgram(). Once the person is done speaking, the transcript of their speech retrieved by Deepgram is stored in the variable self.current_deepgram_transcript.
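
For those curious, here is a minimal sketch of what a Deepgram live-transcription connection looks like with the deepgram-sdk package. This is not the starter code: the starter code additionally pipes Misty's microphone stream into the connection via ffmpeg, and exact class and method names may differ slightly across SDK versions.

# Minimal Deepgram live-transcription sketch (deepgram-sdk v3 style).
# In the starter code, Misty's microphone stream (via ffmpeg) is what gets sent to the connection.
import os
from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions

deepgram = DeepgramClient(os.getenv("DEEPGRAM_API_KEY"))
connection = deepgram.listen.live.v("1")  # websocket connection for streaming audio

def on_transcript(client, result, **kwargs):
    # Called whenever Deepgram returns a piece of transcribed speech
    text = result.channel.alternatives[0].transcript
    if text:
        print("Heard:", text)

connection.on(LiveTranscriptionEvents.Transcript, on_transcript)
connection.start(LiveOptions(model="nova-2", language="en-US", smart_format=True))

# ... send raw audio bytes as they arrive, e.g. connection.send(audio_chunk) ...

connection.finish()  # close the websocket once the person is done speaking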

Text generation: The code in this lab uses Gemini's text generation chat model, allowing for multi-turn conversations. The model is initialized in lines 82-86 and 93 of the starter code, and the text generation occurs on line 104.

Text-to-Speech: The text generated by the Gemini model is then converted to speech using OpenAI's text-to-speech API. This conversion occurs in lines 119-125 of the starter code, and the resulting audio is played on the robot on line 128.
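
To make these two steps concrete, here is a rough, self-contained sketch of a single conversational turn: a Gemini chat model generates a reply, which is then converted to speech with OpenAI's text-to-speech API. This is an illustration rather than the starter code; the google-generativeai package, the model name, and the file paths are assumptions and may differ from what the starter code actually uses.

# One conversational turn: Gemini text generation -> OpenAI text-to-speech.
# Illustrative sketch only; package, model name, and paths are assumptions.
import os
import google.generativeai as genai
from openai import OpenAI

# Text generation with a Gemini chat model (multi-turn conversation)
genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
with open("three_good_things_system_instruction.txt") as f:
    system_instruction = f.read()
model = genai.GenerativeModel("gemini-1.5-flash", system_instruction=system_instruction)
chat = model.start_chat()

user_text = "Hi Misty, I'm ready to start."  # in the starter code, this comes from Deepgram
reply = chat.send_message(user_text)
print("Misty will say:", reply.text)

# Text-to-speech with OpenAI, saved to a local file for the robot to play
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
with openai_client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input=reply.text,
    instructions="Speak with a calm and encouraging tone.",
) as response:
    response.stream_to_file("misty_reply.mp3")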

Prompt Engineering


The primary focus of this lab will be on prompt engineering. In the three_good_things_system_instruction.txt file, you will find a system instruction that is used to prompt the Gemini model to generate text for Misty. Right now, the system instruction guides the behavior of a robot receptionist in the CS department at UChicago. You will need to modify this system instruction to enable Misty to guide a human participant through the "Three Good Things" exercise.

If you want to test your system prompt independently of the Misty robot, you can do so by running gen_ai_test.py from the starter code in your terminal. This lets you interact with the model using text only, which makes it much faster to iterate on your prompt.

As a reminder, here is the desired interaction flow for the "Three Good Things" positive psychology exercise:

Robot Expressions


For this lab, you are asked to develop 5 additional custom actions for the robot. To develop these custom actions, we recommend you check out the following resources:

Your new robot expressions should be added to the custom_actions dictionary in llm_based_human_robot_dialogue.py and in the <your_expression> tag within three_good_things_system_instruction.txt. The rest of this section delves into how the robot expressions are executed within the starter code.

How the Robot Expressions Work in the Starter Code

In llm_based_human_robot_dialogue.py in the starter code, we have defined four robot expressions, called actions in the Misty SDK, in lines 16-21:

custom_actions = {
    "reset": "IMAGE:e_DefaultContent.jpg; ARMS:40,40,1000; HEAD:-5,0,0,1000;",
    "head-up-down-nod": "IMAGE:e_DefaultContent.jpg; HEAD:-15,0,0,500; PAUSE:500; HEAD:5,0,0,500; PAUSE:500; HEAD:-15,0,0,500; PAUSE:500; HEAD:5,0,0,500; PAUSE:500; HEAD:-5,0,0,500; PAUSE:500;",
    "hi": "IMAGE:e_Admiration.jpg; ARMS:-80,40,100;",
    "listen": "IMAGE:e_Surprise.jpg; HEAD:-6,30,0,1000; PAUSE:2500; HEAD:-5,0,0,500; IMAGE:e_DefaultContent.jpg;"
}

While the actions are defined in string format in lines 16-21 of the starter code, they are added to the Misty robot as possible actions to execute in lines 35-41. When the Gemini model (self.chat) generates a text response for Misty to speak, it also generates an action expression for the robot that corresponds to that text (e.g., "hi", "listen"); see lines 102-111 in the starter code.

These expressions can be generated by the Gemini model because the list of expressions the robot can execute is provided in the system instruction (three_good_things_system_instruction.txt):

<your_expression>
Your expression should be one of the ones from this list. 
These expressions can represent how you are feeling or be a reaction to what the student has said.
Please refrain from choosing an expression multiple times in a row: [
'head-up-down-nod',
'hi',
'listen'
]
</your_expression>

After the expression is generated by the Gemini chat model, it is executed on the robot on lines 130-135 in the starter code.
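
For example, to add a hypothetical "celebrate" expression, you would add a new entry to the custom_actions dictionary using the same IMAGE/ARMS/HEAD/PAUSE command syntax shown above, and then add 'celebrate' to the list inside the <your_expression> tag in three_good_things_system_instruction.txt so the Gemini model knows it may choose it. The image name and movement values below are illustrative placeholders, not values from the starter code.

custom_actions = {
    # ... existing expressions ("reset", "head-up-down-nod", "hi", "listen") ...
    # Hypothetical new expression; image name and arm angles are placeholders
    "celebrate": "IMAGE:e_Joy.jpg; ARMS:-80,-80,500; PAUSE:500; ARMS:40,40,500; IMAGE:e_DefaultContent.jpg;",
}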

OpenAI Voices


The final component for your assignment is exploring the voice options from OpenAI. Right now, lines 118-125 in llm_based_human_robot_dialogue.py in the starter code look like this:

# OpenAI text-to-speech: generating speech and saving to a file
with self.openai_client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input=response_text,  # the text generated by Gemini (the variable name in the starter code may differ)
    instructions="Speak with a calm and encouraging tone.",
) as response:
    response.stream_to_file(self.speech_file_path_local)

You will need to replace the voice and instructions parameters with your own selections. You can explore the available voices and instruction styles at https://www.openai.fm/.
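
If you'd like to preview voices on your laptop before trying them on the robot, a small standalone script along the following lines can help. The voice, instructions, sample text, and output filename are just examples to swap out; it assumes your OPENAI_API_KEY is available as an environment variable (e.g., loaded from your .env file).

# Quick local preview of an OpenAI text-to-speech voice.
# Voice, instructions, and sample text are examples; swap in your own choices.
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="nova",  # try other voices too, e.g. "alloy", "echo", "shimmer"
    input="Hi, I'm Misty! Tell me about three good things that happened today.",
    instructions="Speak with a warm, calm, and encouraging tone.",
) as response:
    response.stream_to_file("voice_preview.mp3")

print("Saved voice_preview.mp3 -- open it in any audio player to listen.")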