Lab 5: LLM-Based Human-Robot Dialogue


Learning Goals


To Complete Before Lab


Preparing Your Development Environment

Beyond the steps you already took to prepare your development environment in Lab 3, this lab requires the following additional setup to run the code properly:

  1. Python version: Ensure that your python version is >= 3.10. You can check your python version by running in your terminal: python3 --version.
    Note: If your python version is < 3.10, you can update your python version by downloading the most recent version of python from python.org. If you download a new python version, you will also need to update or recreate your virtual environment so the new version of python is used there as well.
  2. ffmpeg: You will need to have ffmpeg installed in order to stream Misty's audio to the Deepgram transcription service. You can check whether you have ffmpeg installed by running ffmpeg in your terminal. If ffmpeg is installed, you should see its version and configuration information, which should look something like this:
    ffmpeg version 7.1.1 Copyright (c) 2000-2025 the FFmpeg developers
    built with Apple clang version 16.0.0 (clang-1600.0.26.6)
    configuration: --prefix=/usr/local/Cellar/ffmpeg/7.1.1_1 --enable-shared --enable-pthreads --enable-version3 --cc=clang --host-cflags= --host-ldflags='-Wl,-ld_classic' --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libaribb24 --enable-libbluray --enable-libdav1d --enable-libharfbuzz --enable-libjxl --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librist --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libspeex --enable-libsoxr --enable-libzmq --enable-libzimg --disable-libjack --disable-indev=jack --enable-videotoolbox --enable-audiotoolbox
    libavutil      59. 39.100 / 59. 39.100
    libavcodec     61. 19.101 / 61. 19.101
    libavformat    61.  7.100 / 61.  7.100
    libavdevice    61.  3.100 / 61.  3.100
    libavfilter    10.  4.100 / 10.  4.100
    libswscale      8.  3.100 /  8.  3.100
    libswresample   5.  3.100 /  5.  3.100
    libpostproc    58.  3.100 / 58.  3.100
    Universal media converter
    usage: ffmpeg [options] [[infile options] -i infile]... {[outfile options] outfile}...
    
    Use -h to get full help or, even better, run 'man ffmpeg'
    If you do not have ffmpeg installed, I'd suggest following the installation instructions that match your operating system (e.g., from the download page at ffmpeg.org). Once you have installed ffmpeg, run ffmpeg in your terminal again; you should see output like the example above.
  3. Install additional dependency packages: First, activate the virtual environment that you created in Lab 3. Then, install the following additional dependencies:
    pip install deepgram-sdk
    pip install ffmpeg-python
    pip install openai
    pip install mutagen
  4. Store API keys as environment variables in a .env file: In order to use Deepgram, Gemini, and OpenAI for this lab, you'll need to specify the API keys for each service in your code. As we don't want these API keys to be publicly accessible on the web, we have shared them with you directly both via email and a Canvas announcement. To store these API keys properly in your project (see the example .env sketch after this list):
    • Create a new file named .env within your hri_course_misty_programming directory (the same directory that contains both your virtual environment and your Misty Python SDK (PythonSDK) directory).
    • Copy and paste the API keys into your .env file. You can find the API key information both via email and a Canvas announcement.
  5. Add starter code: Within your HRI course directory (hri_course_misty_programming), either:
    • Use our template repo (recommended): Create a new git repository from our template repo lab_5_LLM_based_human_robot_dialogue. To do this, click on the "Use this template" button and clone your new repository into your HRI course directory (hri_course_misty_programming).
    • Manually copy and paste starter code: To manually copy and paste the starter code, create a new directory for Lab 5 (e.g., lab_5_LLM_based_human_robot_dialogue) and copy all of the contents of our template repo lab_5_LLM_based_human_robot_dialogue into your new directory.
  6. Test your setup: To test and see if everything is working properly, run the test_dependencies.py file:
    python3 test_dependencies.py
    If everything is working properly, you should not see any errors and the code should exit without printing anything in your terminal.
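
As promised in step 4, here is a rough sketch of what a .env file looks like: it is simply a plain-text file of KEY=VALUE pairs that the code reads as environment variables. The variable names and values below are illustrative placeholders; use the exact names given in the email and Canvas announcement.

# .env (illustrative placeholder names and values)
DEEPGRAM_API_KEY=your_deepgram_key_here
GEMINI_API_KEY=your_gemini_key_here
OPENAI_API_KEY=your_openai_key_here

If you are curious how such a file is typically read in Python, the python-dotenv package loads it as shown below (this package is not in the dependency list above, and the starter code may handle this differently):

from dotenv import load_dotenv  # pip install python-dotenv
import os

load_dotenv()  # loads KEY=VALUE pairs from .env into environment variables
openai_key = os.getenv("OPENAI_API_KEY")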

Working in Groups


During this lab, you will work with the same group that you worked with for Lab 4. Similar to Lab 4, each group will turn in one piece of code / set of deliverables.

Lab 5 Deliverables & Submission


With the starter code we've provided, in Lab 5 you are expected to:

You are expected to upload the following to Canvas after you have completed the lab:

To receive credit for this lab, you will need to submit your video and code to Canvas by Friday, April 25 at 6:00pm.

Running the Code


  1. Activate your virtual environment: source .venv/bin/activate
  2. Run an HTTP server from your hri_course_misty_programming directory: python3 -m http.server
    • This is required because the Misty robot needs to access the speech files generated using OpenAI in order to play them on the robot.
  3. Run the code: python3 llm_based_human_robot_dialogue.py MISTY_IP_ADDRESS

An Overview of the Starter Code


The starter code contains several files:

Talking Back-and-Forth with Misty: Speech-to-Text, Text Generation, Text-to-Speech

While you are not required to understand in detail how llm_based_human_robot_dialogue.py in the starter code works in order to complete this lab, I want to provide a brief overview for those interested in how this code enables Misty to have a back-and-forth conversation with a person. This conversation consists of three main steps: speech-to-text, text generation, and text-to-speech.

Speech-to-text: The Misty robot uses the Deepgram API to transcribe the human participant's speech to text. Whenever the robot is ready to "listen" to a person, the starter code turns the robot's LED blue, starts streaming Misty's microphone feed in start_cam(), and initializes a DeepgramClient and connects it to the Misty microphone feed via a websocket in initialize_deepgram(). Once the person is done speaking, the transcript of their speech retrieved by Deepgram is stored in the variable self.current_deepgram_transcript.
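
For those curious, here is a minimal sketch of what a Deepgram live-transcription connection looks like with the deepgram-sdk package. This is not the starter code: the starter code additionally pipes Misty's microphone stream into the connection via ffmpeg, and exact class and method names may differ slightly across SDK versions.

# Minimal Deepgram live-transcription sketch (deepgram-sdk v3 style).
# In the starter code, Misty's microphone stream (via ffmpeg) is what gets sent to the connection.
import os
from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions

deepgram = DeepgramClient(os.getenv("DEEPGRAM_API_KEY"))
connection = deepgram.listen.live.v("1")  # websocket connection for streaming audio

def on_transcript(client, result, **kwargs):
    # Called whenever Deepgram returns a piece of transcribed speech
    text = result.channel.alternatives[0].transcript
    if text:
        print("Heard:", text)

connection.on(LiveTranscriptionEvents.Transcript, on_transcript)
connection.start(LiveOptions(model="nova-2", language="en-US", smart_format=True))

# ... send raw audio bytes as they arrive, e.g. connection.send(audio_chunk) ...

connection.finish()  # close the websocket once the person is done speaking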

Text generation: The code in this lab uses Gemini's text generation chat model, allowing for multi-turn conversations. The model is initialized in lines 82-86 and 93 of the starter code, and the text generation occurs on line 104.

Text-to-Speech: The text generated by the Gemini model is then converted to speech using OpenAI's text-to-speech API. This conversion occurs in lines 119-125 of the starter code, and the resulting audio is played on the robot on line 128.
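
To make these two steps concrete, here is a rough, self-contained sketch of a single conversational turn: a Gemini chat model generates a reply, which is then converted to speech with OpenAI's text-to-speech API. This is an illustration rather than the starter code; the google-generativeai package, the model name, and the file paths are assumptions and may differ from what the starter code actually uses.

# One conversational turn: Gemini text generation -> OpenAI text-to-speech.
# Illustrative sketch only; package, model name, and paths are assumptions.
import os
import google.generativeai as genai
from openai import OpenAI

# Text generation with a Gemini chat model (multi-turn conversation)
genai.configure(api_key=os.getenv("GEMINI_API_KEY"))
with open("three_good_things_system_instruction.txt") as f:
    system_instruction = f.read()
model = genai.GenerativeModel("gemini-1.5-flash", system_instruction=system_instruction)
chat = model.start_chat()

user_text = "Hi Misty, I'm ready to start."  # in the starter code, this comes from Deepgram
reply = chat.send_message(user_text)
print("Misty will say:", reply.text)

# Text-to-speech with OpenAI, saved to a local file for the robot to play
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
with openai_client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input=reply.text,
    instructions="Speak with a calm and encouraging tone.",
) as response:
    response.stream_to_file("misty_reply.mp3")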

Prompt Engineering


The primary focus of this lab will be on prompt engineering. In the three_good_things_system_instruction.txt file, you will find a system instruction that is used to prompt the Gemini model to generate text for Misty. Right now, the system instruction guides the behavior of a robot receptionist in the CS department at UChicago. You will need to modify this system instruction to enable Misty to guide a human participant through the "Three Good Things" exercise.

If you want to test your system prompt independently of the Misty robot, you can do so by running gen_ai_test.py from the starter code in your terminal. This lets you interact with the model using text only, which makes it much faster to iterate on your prompt.

As a reminder, here is the desired interaction flow for the "Three Good Things" positive psychology exercise:

Robot Expressions


For this lab, you are asked to develop 5 additional custom actions for the robot. To develop these custom actions, we recommend you check out the following resources:

Your new robot expressions should be added to the custom_actions dictionary in llm_based_human_robot_dialogue.py and in the <your_expression> tag within three_good_things_system_instruction.txt. The rest of this section delves into how the robot expressions are executed within the starter code.

How the Robot Expressions Work in the Starter Code

In llm_based_human_robot_dialogue.py in the starter code, we have defined four robot expressions, called actions in the Misty SDK, in lines 16-21:

custom_actions = {
    "reset": "IMAGE:e_DefaultContent.jpg; ARMS:40,40,1000; HEAD:-5,0,0,1000;",
    "head-up-down-nod": "IMAGE:e_DefaultContent.jpg; HEAD:-15,0,0,500; PAUSE:500; HEAD:5,0,0,500; PAUSE:500; HEAD:-15,0,0,500; PAUSE:500; HEAD:5,0,0,500; PAUSE:500; HEAD:-5,0,0,500; PAUSE:500;",
    "hi": "IMAGE:e_Admiration.jpg; ARMS:-80,40,100;",
    "listen": "IMAGE:e_Surprise.jpg; HEAD:-6,30,0,1000; PAUSE:2500; HEAD:-5,0,0,500; IMAGE:e_DefaultContent.jpg;"
}

While the actions are defined in string format in lines 16-21 of the starter code, they are added to the Misty robot as possible actions to execute in lines 35-41. When the Gemini model (self.chat) generates a text response for Misty to speak, it also generates an action expression for the robot that corresponds to that text (e.g., "hi", "listen"); see lines 102-111 in the starter code.

These expressions can be generated by the Gemini model because the list of expressions the robot can execute is provided in the system instruction (three_good_things_system_instruction.txt):

<your_expression>
Your expression should be one of the ones from this list. 
These expressions can represent how you are feeling or be a reaction to what the student has said.
Please refrain from choosing an expression multiple times in a row: [
'head-up-down-nod',
'hi',
'listen'
]
</your_expression>

After the expression is generated by the Gemini chat model, it is executed on the robot on lines 130-135 in the starter code.
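
For example, to add a hypothetical "celebrate" expression, you would add a new entry to the custom_actions dictionary using the same IMAGE/ARMS/HEAD/PAUSE command syntax shown above, and then add 'celebrate' to the list inside the <your_expression> tag in three_good_things_system_instruction.txt so the Gemini model knows it may choose it. The image name and movement values below are illustrative placeholders, not values from the starter code.

custom_actions = {
    # ... existing expressions ("reset", "head-up-down-nod", "hi", "listen") ...
    # Hypothetical new expression; image name and arm angles are placeholders
    "celebrate": "IMAGE:e_Joy.jpg; ARMS:-80,-80,500; PAUSE:500; ARMS:40,40,500; IMAGE:e_DefaultContent.jpg;",
}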

OpenAI Voices


The final component for your assignment is exploring the voice options from OpenAI. Right now, lines 118-125 in llm_based_human_robot_dialogue.py in the starter code look like this:

# OpenAI text-to-speech: generating speech and saving to a file
with self.openai_client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input=response_text,  # the text generated by Gemini (the variable name in the starter code may differ)
    instructions="Speak with a calm and encouraging tone.",
) as response:
    response.stream_to_file(self.speech_file_path_local)

You will need to replace the voice and instructions parameters with your own selections. You can explore the available voices and instruction styles at https://www.openai.fm/.
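
If you'd like to preview voices on your laptop before trying them on the robot, a small standalone script along the following lines can help. The voice, instructions, sample text, and output filename are just examples to swap out; it assumes your OPENAI_API_KEY is available as an environment variable (e.g., loaded from your .env file).

# Quick local preview of an OpenAI text-to-speech voice.
# Voice, instructions, and sample text are examples; swap in your own choices.
import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="nova",  # try other voices too, e.g. "alloy", "echo", "shimmer"
    input="Hi, I'm Misty! Tell me about three good things that happened today.",
    instructions="Speak with a warm, calm, and encouraging tone.",
) as response:
    response.stream_to_file("voice_preview.mp3")

print("Saved voice_preview.mp3 -- open it in any audio player to listen.")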