Beyond the steps you've already taken to prepare your development environment in Lab 3, this lab will require the following steps to run the code properly:
Make sure you have Python 3 installed; you can check your version by running python3 --version.
You will need ffmpeg installed in order to stream Misty's audio to the Deepgram transcription service. You can check whether you have ffmpeg installed by running ffmpeg in your terminal. If ffmpeg is installed, you should see output that looks like:
ffmpeg version 7.1.1 Copyright (c) 2000-2025 the FFmpeg developers
built with Apple clang version 16.0.0 (clang-1600.0.26.6)
configuration: --prefix=/usr/local/Cellar/ffmpeg/7.1.1_1 --enable-shared --enable-pthreads --enable-version3 --cc=clang --host-cflags= --host-ldflags='-Wl,-ld_classic' --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libaribb24 --enable-libbluray --enable-libdav1d --enable-libharfbuzz --enable-libjxl --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librist --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libspeex --enable-libsoxr --enable-libzmq --enable-libzimg --disable-libjack --disable-indev=jack --enable-videotoolbox --enable-audiotoolbox
libavutil 59. 39.100 / 59. 39.100
libavcodec 61. 19.101 / 61. 19.101
libavformat 61. 7.100 / 61. 7.100
libavdevice 61. 3.100 / 61. 3.100
libavfilter 10. 4.100 / 10. 4.100
libswscale 8. 3.100 / 8. 3.100
libswresample 5. 3.100 / 5. 3.100
libpostproc 58. 3.100 / 58. 3.100
Universal media converter
usage: ffmpeg [options] [[infile options] -i infile]... {[outfile options] outfile}...
Use -h to get full help or, even better, run 'man ffmpeg'
If you do not have ffmpeg installed, I'd suggest following the installation instructions that match your operating system. Once ffmpeg is installed, you can check that it works by running ffmpeg in your terminal again; you should see output like the example above.
With your virtual environment activated, install the following Python packages:
pip install deepgram-sdk
pip install ffmpeg-python
pip install openai
pip install mutagen
Create a .env file: In order to use Deepgram, Gemini, and OpenAI for this lab, you'll need to specify the API keys for each service in your code. As we don't want these API keys to be publicly accessible on the web, we have shared them with you directly both via email and a Canvas announcement. To store these API keys properly in your project:
Create a file named .env within your hri_course_misty_programming directory (the same directory that contains both your virtual environment and your PythonSDK Misty Python SDK directory).
Add the API keys to the .env file. You can find the API key information both via email and a Canvas announcement.
Get the Lab 5 starter code. Within your top-level directory (hri_course_misty_programming), either:
add the starter code directly to your existing directory (hri_course_misty_programming), or
create a new directory (lab_5_LLM_based_human_robot_dialogue) and copy and paste all of the contents of our template repo lab_5_LLM_based_human_robot_dialogue into your new directory.
Test that the dependency packages and API keys are set up correctly by running the test_dependencies.py file:
python3 test_dependencies.py
If everything is working properly, you should not see any errors and the code should exit without printing anything in your terminal.
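For reference, here is a rough, hypothetical sketch of the kind of checks test_dependencies.py performs. This is not the starter code, and the environment variable names (DEEPGRAM_API_KEY, GEMINI_API_KEY, OPENAI_API_KEY) are assumptions that may differ from the names in the shared API key instructions.

# Hypothetical sketch: verify that the lab's packages import and that the .env keys are present.
# The key names below are assumptions; use whatever names the course's .env instructions specify.
import importlib
from pathlib import Path

# Parse the .env file into a dict (a minimal stand-in for a dotenv loader).
env = {}
for line in Path(".env").read_text().splitlines():
    if "=" in line and not line.lstrip().startswith("#"):
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()

for key in ("DEEPGRAM_API_KEY", "GEMINI_API_KEY", "OPENAI_API_KEY"):
    assert env.get(key), f"Missing {key} in .env"

for package in ("deepgram", "ffmpeg", "openai", "mutagen"):
    importlib.import_module(package)  # raises ImportError if a required package is missing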
During this lab, you will work with the same group that you worked with for Lab 4. Similar to Lab 4, each group will turn in one piece of code / set of deliverables.
With the starter code we've provided, in Lab 5 you are expected to:
Modify the system instruction in three_good_things_system_instruction.txt to enable the Misty robot to guide a human participant through the "Three Good Things" exercise.
Develop 5 additional custom actions for the robot and add them to llm_based_human_robot_dialogue.py.
Select a new voice and voice instructions for the robot in llm_based_human_robot_dialogue.py.
You are expected to upload the following to Canvas after you have completed the lab:
your video
llm_based_human_robot_dialogue.py
three_good_things_system_instruction.txt
To receive credit for this lab, you will need to submit your video and code to Canvas by Friday, April 25 at 6:00pm.
As a reminder, you will need to run the following command from within your hri_course_misty_programming directory: python3 -m http.server
The starter code contains several files:
llm_based_human_robot_dialogue.py - this is the main file you'll run for this lab to chat back-and-forth with Misty for the "Three Good Things" exercise
three_good_things_system_instruction.txt - the system instruction for the Gemini generative text model
test_dependencies.py - used to test the dependency packages and API keys required for this lab
test_custom_actions.py - used to test the custom actions you will develop for Misty
gen_ai_test.py - used to test the Gemini generative text model based on the system instruction (three_good_things_system_instruction.txt) without needing to be connected to or run anything on the robot
While it is not required to know how llm_based_human_robot_dialogue.py
in the starter code works in detail for the purposes of completing this lab, I want to provide a brief overview for those interested in how this code enables Misty to have a back-and-forth conversation with a person. This conversation consists of three main steps: speech-to-text, text generation, and text-to-speech.
Speech-to-text: The Misty robot uses the Deepgram API to transcribe the human participant's speech to text. Whenever the robot is ready to "listen" to a person, the starter code turns the robot's LED blue, starts streaming Misty's microphone feed in start_cam()
, and initializes a DeepgramClient
and connects it to the Misty microphone feed via a websocket in initialize_depgram()
. Once the person is done speaking, the transcript of their speech retrieved by Deepgram is stored in the variable self.current_deepgram_transcript.
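For those interested, here is a minimal sketch of how a live transcription connection can be opened with the deepgram-sdk Python package (v3-style API). This is not the starter code; the model/option values and the environment variable name are assumptions.

# Minimal sketch (not the starter code) of streaming audio to Deepgram for live transcription.
import os
from deepgram import DeepgramClient, LiveOptions, LiveTranscriptionEvents

deepgram = DeepgramClient(os.getenv("DEEPGRAM_API_KEY", ""))
connection = deepgram.listen.live.v("1")

def on_transcript(client, result, **kwargs):
    # Each Transcript event carries the text Deepgram has recognized so far.
    text = result.channel.alternatives[0].transcript
    if text:
        print("Heard:", text)

connection.on(LiveTranscriptionEvents.Transcript, on_transcript)
connection.start(LiveOptions(model="nova-2", language="en-US"))
# Audio chunks from Misty's microphone stream (e.g., decoded via ffmpeg) would then be
# forwarded with connection.send(chunk), and the session closed with connection.finish().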
Text generation: The code in this lab uses Gemini's text generation chat model, allowing for multi-turn conversations. The model is initialized in lines 82-86 and 93 of the starter code, and the text generation itself occurs in line 104.
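As a rough illustration of a multi-turn Gemini chat with a system instruction, here is a sketch using the google-generativeai package; the starter code may use a different Gemini client, model name, or way of loading the API key.

# Hedged sketch of a multi-turn Gemini chat with a system instruction (not the starter code).
# The model name and environment variable name are assumptions.
import os
import google.generativeai as genai

genai.configure(api_key=os.getenv("GEMINI_API_KEY", ""))

with open("three_good_things_system_instruction.txt") as f:
    system_instruction = f.read()

model = genai.GenerativeModel("gemini-1.5-flash", system_instruction=system_instruction)
chat = model.start_chat()  # the chat object keeps the conversation history across turns

reply = chat.send_message("Hi Misty, I'm ready to start.")
print(reply.text)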
Text-to-speech: The text generated by the Gemini model is then converted to speech using OpenAI's text-to-speech API. This conversion occurs in lines 119-125 of the starter code, and the resulting audio is played on the robot on line 128.
The primary focus of this lab will be on prompt engineering. In the three_good_things_system_instruction.txt
file, you will find a system instruction that is used to prompt the Gemini model to generate text for Misty. Right now, the system instruction guides the behavior of a robot receptionist in the CS department at UChicago. You will need to modify this system instruction to enable Misty to guide a human participant through the "Three Good Things" exercise.
If you want to test your system prompt independently from the Misty robot, you can do so by running gen_ai_test.py
from the starter code in your terminal. This will allow you to communicate with the model only with text, enabling you to develop more quickly.
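For example, assuming it is invoked the same way as the other starter scripts:
python3 gen_ai_test.py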
As a reminder, here is the desired interaction flow for the "Three Good Things" positive psychology exercise:
For this lab, you are asked to develop 5 additional custom actions for the robot. To develop these custom actions, we recommend you check out the resources provided with this lab, as well as the test_custom_actions.py file in the starter code. This file will allow you to test just your custom actions without needing to run the whole robot "Three Good Things" exercise. You will need to add your custom actions both to the custom_actions dictionary in llm_based_human_robot_dialogue.py and to the list inside the <your_expression> tag within three_good_things_system_instruction.txt. The rest of this section delves into how the robot expressions are executed within the starter code.
In llm_based_human_robot_dialogue.py
in the starter code, we have defined four robot expressions, called actions in the Misty SDK, in lines 16-21:
custom_actions = {
"reset": "IMAGE:e_DefaultContent.jpg; ARMS:40,40,1000; HEAD:-5,0,0,1000;",
"head-up-down-nod": "IMAGE:e_DefaultContent.jpg; HEAD:-15,0,0,500; PAUSE:500; HEAD:5,0,0,500; PAUSE:500; HEAD:-15,0,0,500; PAUSE:500; HEAD:5,0,0,500; PAUSE:500; HEAD:-5,0,0,500; PAUSE:500;",
"hi": "IMAGE:e_Admiration.jpg; ARMS:-80,40,100;",
"listen": "IMAGE:e_Surprise.jpg; HEAD:-6,30,0,1000; PAUSE:2500; HEAD:-5,0,0,500; IMAGE:e_DefaultContent.jpg;"
}
While the actions are defined in string format in lines 16-21 of the starter code, they are added to the Misty robot as possible actions to execute in lines 35-41. When the Gemini model (self.chat) generates a text response for Misty to speak, it will also generate an action expression for the robot that corresponds with that text (e.g., "hi", "listen"); see lines 102-111 in the starter code.
These expressions can be generated by the Gemini model because the list of expressions the robot can execute is provided in the system instruction (three_good_things_system_instruction.txt):
<your_expression>
Your expression should be one of the ones from this list.
These expressions can represent how you are feeling or be a reaction to what the student has said.
Please refrain from choosing an expression multiple times in a row: [
'head-up-down-nod',
'hi',
'listen'
]
</your_expression>
After the expression is generated by the Gemini chat model, it is executed on the robot on lines 130-135 in the starter code.
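As an illustration only (this action is not part of the starter code, and the image, timing, and position values are guesses you should tune on your robot), a new custom action could be added to the custom_actions dictionary using the same string format:

# Hypothetical example of one additional custom action (not in the starter code).
# It reuses image files from the existing actions; the timing and position values are guesses.
custom_actions["excited-arm-raise"] = (
    "IMAGE:e_Admiration.jpg; "
    "ARMS:-80,-80,500; PAUSE:1000; "  # raise both arms briefly
    "ARMS:40,40,500; "                # lower the arms back down
    "IMAGE:e_DefaultContent.jpg;"
)

If you add an action like this, remember to also add 'excited-arm-raise' to the list inside the <your_expression> tag so the Gemini model knows it can choose it.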
The final component for your assignment is exploring the voice options from OpenAI. Right now, lines 118-125 in llm_based_human_robot_dialogue.py
in the starter code look like this:
# OpenAI text-to-speech: generating speech and saving to a file
with self.openai_client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input=text_to_speak,  # the text generated by the Gemini model (variable name illustrative)
    instructions="Speak with a calm and encouraging tone.",
) as response:
    response.stream_to_file(self.speech_file_path_local)
You will need to replace the voice
and instructions
parameters with your own selection. You can play around with the available voices and instructions for the voices at https://www.openai.fm/.
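For example (purely illustrative; pick whatever best fits your group's design), the same call with a different voice and instruction string might look like:

# Illustrative only: the same text-to-speech call with a different voice and tone.
with self.openai_client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # one of the other voices available at openai.fm
    input=text_to_speak,  # variable name illustrative, as above
    instructions="Speak warmly and at a relaxed pace, like a supportive coach.",
) as response:
    response.stream_to_file(self.speech_file_path_local)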