AI Chat Simulation for Prerecorded Videos

Author

Nathan DeVore

Published

August 6, 2025

Back to Portfolio: https://devoreni.github.io/portfolio/

Abstract

Introduction

Live-streamed content thrives on audience interaction; however, managing a live chat presents significant moderation challenges and reputational risks. This project introduces a solution: a post-production AI Chat Simulation application. The system leverages multi-modal AI (generative AI that can take in a variety of media inputs such as text, images, and video) to generate a realistic, dynamic chat overlay for pre-recorded videos, giving creators complete control over the final content. The application features a multi-threaded architecture to ensure a responsive user experience, integrated audio transcription via OpenAI’s Whisper, multi-modal chat generation with local large language models (LLMs), and a custom video rendering pipeline built with OpenCV and Pillow. The result is a robust tool that eliminates the risks associated with a live chat while still offering the perception of engagement.

Problem Statement

Live streaming platforms foster a unique connection between the streamer and the audience through real-time chat interaction. Displaying the chat on-screen is standard practice to enhance engagement for both live viewers and video on demand (VOD) viewers. This practice introduces the significant challenge of moderating chat from potentially thousands of participants. A single inappropriate or malicious message appearing on stream can permanently tarnish the reputation of the presenter, their community, and their brand, especially if the message makes it into the VOD, where thousands more people could see it.

Proposed Solution

To mitigate these risks while preserving the aesthetic of live interaction, this project takes a post-production approach. By extracting features from a pre-recorded video, a simulated chat is generated and overlaid onto the footage. Users can, at any point, alter any aspect of the chat, including chatters’ usernames, username colors, badges, message content, and the time a message is sent, and they can delete or create messages at their discretion. Realistic chat messages react dynamically to the video content and to previous chat messages. The entire workflow is streamlined and easy to use.

Project Requirements

The following requirements are met by the AI Chat project:

  • Automated Context Extraction: The system will automatically transcribe the video’s audio and analyze video frames to understand on-screen events.
  • Enhanced Contextual Awareness: Users will be able to provide additional text-based context, such as a video description or streamer persona, to refine AI-generated responses.
  • Unique Chatter Personas: Each simulated chatter will possess a unique, procedurally generated personality, speaking style, and visual identity (username color, badges).
  • Full Editorial Control: Users will be able to add, modify, or delete any generated chat message in any way.
  • Chatter Curation: Users will have the ability to permanently remove (“ban”) specific AI chatters from the system.

UX Pipeline

While the back-end is a complex integration of multiple systems, the user workflow is designed to be linear and intuitive.

  1. Video Selection: The user selects a source video file (.mp4, .mkv) via a built-in file browser.
  2. Audio Transcription: The application automatically transcribes the video’s audio using a local Whisper model, producing a .vtt file. The user can manually edit this file to correct any inaccuracies, add additional lines, remove them, or alter timestamps.
  3. Context Refinement (Optional): The user can provide supplementary context, such as the streamer’s name and a brief video description, via text fields in the GUI.
  4. Chat Generation: The user initiates the chat generation process. The system analyzes the transcript, video frames, and user-provided context to produce a complete chat log as a .csv file.
  5. Chat Curation: After generation, the user can edit the .csv file to modify message content, timing, chatter appearance, or other metadata. Users can also “ban” chatters, removing them from the database entirely.
  6. Video Rendering: The user defines an overlay position and initiates the final rendering. The system composites the chat overlay onto the original video, producing a final .mp4 file.

Screenshot of Application

Implementation

This section is an in-depth description of the implementation.

Project Structure

The project has three main components:

  • Creation of Chatters
  • The PyQt5 GUI
  • The main worker threads

Chatter Creation

Usernames

One of the most important parts of making the chatters feel like real people is giving them varied and unique usernames. To accomplish this, an LLM is given a list of attributes describing a person. After a username is generated, it is checked against the database for uniqueness and either added or regenerated.
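
A minimal sketch of that generate-and-check loop, assuming a hypothetical generate_username callable that wraps the local LLM and a connection to the chatter database described below (the attribute pools are illustrative, not the project’s actual lists):

    import random
    import sqlite3

    # Hypothetical attribute pools; the real lists are larger and hand-curated.
    HOBBIES = ["speedrunning", "gardening", "retro games", "astronomy"]
    MOODS = ["sarcastic", "wholesome", "chaotic", "quiet lurker"]

    def username_exists(conn: sqlite3.Connection, name: str) -> bool:
        """Check the CHATTERS table for an existing username (its primary key)."""
        return conn.execute(
            "SELECT 1 FROM CHATTERS WHERE username = ?", (name,)
        ).fetchone() is not None

    def create_unique_username(conn: sqlite3.Connection, generate_username) -> str:
        """Ask the LLM for a username and regenerate until it is unique."""
        attributes = {"hobby": random.choice(HOBBIES), "mood": random.choice(MOODS)}
        name = generate_username(attributes)      # call into the local LLM
        while username_exists(conn, name):
            name = generate_username(attributes)  # collision: try again
        return name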

Message Style

Each chatter is given a personality, a message style, and a few ‘example messages’ that chatter might say. These are used later when it comes time to generate the chat messages.

Database Structure

Chatters and their messages are stored in an SQLite3 database with the following structure:

---
title: Chatter Database
---
erDiagram
    CHATTERS ||--o{ CHAT_LOGS : "have messages in"
    CHATTERS {
        TEXT username PK
        TEXT gender
        TEXT subscription_tier
        INT age
        TEXT chatter_description
        TEXT username_color
        INT message_frequency
        INT badge_one
        INT badge_two
    }
    CHAT_LOGS {
        INT message_id PK
        TEXT username
        TEXT message
    }
    RECENT_MESSAGES {
        INT message_id PK
        TEXT username
        TEXT message
    }

CHATTERS table: Stores information about a chatter for use in generating chat messages and for video rendering. The chatter’s gender, subscription tier, and age influence the kinds of messages the chatter will leave. When displaying the chatter and their message, the username is drawn in the color specified by username_color alongside any badges that chatter may have. Message frequency influences how often that chatter leaves messages.

CHAT_LOGS table: Stores the example messages of each chatter. It is updated when a new chatter is created. It does not store new messages chatters leave to avoid flanderization.

RECENT_MESSAGES table: Keeps track of the 10 most recent messages in chat, which helps provide context when generating new chat messages. It also allows for an interactive chat where chatters can react to each other as well as to the stream.
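
For reference, the tables in the diagram above could be created with Python’s sqlite3 module roughly as follows (column names follow the diagram; the database filename and the exact types and constraints in the real database are assumptions):

    import sqlite3

    conn = sqlite3.connect("chatters.db")  # database filename is an assumption
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS CHATTERS (
        username            TEXT PRIMARY KEY,
        gender              TEXT,
        subscription_tier   TEXT,
        age                 INTEGER,
        chatter_description TEXT,
        username_color      TEXT,
        message_frequency   INTEGER,
        badge_one           INTEGER,
        badge_two           INTEGER
    );
    CREATE TABLE IF NOT EXISTS CHAT_LOGS (
        message_id INTEGER PRIMARY KEY,
        username   TEXT REFERENCES CHATTERS(username),
        message    TEXT
    );
    CREATE TABLE IF NOT EXISTS RECENT_MESSAGES (
        message_id INTEGER PRIMARY KEY,
        username   TEXT,
        message    TEXT
    );
    """)
    conn.commit()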

Normal Forms

This database is in the Fifth Normal Form as it satisfies all requirements.

1NF

  • All columns contain atomic values and cannot be divided further.
  • Each row in each table is unique.
  • Each column in each table has a unique name.
  • The order in which data is stored does not matter.
    (The higher primary key value in the Recent Messages table indicates a more recent message; however, this does not break normal form because it does not matter in what order it is stored. The value of the primary key is an important semantic property as it carries temporal information.)

2NF

  • 1NF is satisfied.
  • Each column in each table is fully dependent on the primary key of the table.

3NF

  • 2NF is satisfied.
  • Each column in each table is only dependent on the primary key of the table.

Boyce-Codd Normal Form (BCNF)

  • 3NF is satisfied.
  • No column depends on anything except the primary key of the table.

4NF

  • BCNF is satisfied.
  • Each column is independent of every other column, except for the primary key.

5NF

  • 4NF is satisfied.
  • No decomposition or joins are needed to fully reconstruct a table.

Main GUI

There are many steps to the chat generation process, so having an intuitive and easy-to-use GUI is important for user satisfaction.

Layout

The layout is designed to guide users as simply as possible from start to finish while allowing them to redo or return to any step. The transcription process is not perfect, and errors can make their way into the video transcript. By opening the .vtt file generated by OpenAI’s Whisper, users can correct any mistakes. They can also edit timestamps or make any other changes they wish.

Similarly, generated chat messages can sometimes need tweaking. After users generate the chat, they have the option to make any edits they wish to the .csv file. This includes changing the color of a username, the displayed badges, message bodies, emojis used, subscription tier, the timing of a message, and more.

Text Fields

To improve the accuracy and personability of the chatters, two text fields are included for customization: the streamer’s name and a video description. If these fields are left blank, they are omitted from the LLM prompt during message generation. The stream description text field is limited to 150 characters to mitigate prompt injection.
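
A small sketch of how those fields might be wired up in PyQt5, with the character cap enforced directly on the widget; the widget names and prompt wording here are illustrative:

    import sys
    from PyQt5.QtWidgets import QApplication, QLineEdit

    app = QApplication(sys.argv)

    streamer_name_field = QLineEdit()
    description_field = QLineEdit()
    description_field.setMaxLength(150)  # hard cap on the stream description

    def optional_context() -> str:
        """Build the optional prompt section, skipping any blank fields."""
        parts = []
        if streamer_name_field.text().strip():
            parts.append("Streamer name: " + streamer_name_field.text().strip())
        if description_field.text().strip():
            parts.append("Stream description: " + description_field.text().strip())
        return "\n".join(parts)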

Sometimes a chatter consistently leaves less-than-ideal chat messages, in which case users have the option to “ban” that chatter by typing in their username and committing the ban.

Built in Log

Some of the processes can take a long time, especially transcription and chat generation. To ensure users know what is happening instead of seeing the application freeze, a log is built into the GUI. It updates the user any time an action is performed. For example, during video transcription, the log will first output a message acknowledging the start of the transcription process. It then prints lines of the transcript and their timestamps in real time so users can monitor progress.

Worker Threads

To keep the log active and the GUI from freezing during lengthy processes, backend logic is handled in separate threads. Threads set a flag to disable buttons in the main GUI so nothing breaks, but the log is still updated and the application window can still be interacted with.
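
A reduced sketch of that pattern using PyQt5’s QThread and signals; the class, signal, and helper names are illustrative rather than the project’s actual identifiers:

    from PyQt5.QtCore import QThread, pyqtSignal

    class TranscriptionWorker(QThread):
        log_line = pyqtSignal(str)   # text lines for the built-in log
        done = pyqtSignal()          # tells the main window to re-enable buttons

        def __init__(self, video_path: str):
            super().__init__()
            self.video_path = video_path

        def run(self):
            self.log_line.emit(f"Starting transcription of {self.video_path}...")
            # ... long-running Whisper call goes here ...
            self.log_line.emit("Transcription complete.")
            self.done.emit()

    # In the main window (sketch):
    # set_buttons_enabled(False)
    # worker = TranscriptionWorker(path)
    # worker.log_line.connect(log_widget.appendPlainText)
    # worker.done.connect(lambda: set_buttons_enabled(True))
    # worker.start()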

Whisper

Whisper transcription is handled within a thread. Whisper is an open-source speech-to-text AI model developed by OpenAI. When called, the thread extracts the audio from the selected video file, cleans it with ffmpeg, and then begins the transcription process. Inferred context and previously generated tokens help determine the next token in the sequence. The output is stored in .vtt format, which pairs each timestamp with the tokens generated for that span. Users can edit this file to correct mistakes or fix words or phrases that were misspoken. After each line is transcribed, it and the current timestamp are sent to the log so the user can track Whisper’s progress.
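
A minimal sketch of this step with the open-source whisper package; the model size and file paths are assumptions, the ffmpeg cleanup is omitted, and unlike the real application this version writes the log lines after transcription finishes rather than as they are produced:

    import whisper

    def transcribe_to_vtt(video_path: str, vtt_path: str, on_line=print):
        """Transcribe a video's audio track and write a simple .vtt file."""
        model = whisper.load_model("base")      # model size is an assumption
        result = model.transcribe(video_path)   # Whisper pulls the audio via ffmpeg

        def stamp(seconds: float) -> str:
            h, rem = divmod(seconds, 3600)
            m, s = divmod(rem, 60)
            return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

        with open(vtt_path, "w", encoding="utf-8") as f:
            f.write("WEBVTT\n\n")
            for seg in result["segments"]:
                cue = f"{stamp(seg['start'])} --> {stamp(seg['end'])}\n{seg['text'].strip()}\n\n"
                f.write(cue)
                on_line(cue.strip())            # echo progress to the GUI log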

Chat Generation

After the audio is processed, chat generation can begin. For each line in the .vtt file, a random number of chat messages is generated. A prompt is built for each chosen chatter, which includes the past few lines from the .vtt file, the past several chat messages, a screenshot of the video at the corresponding timestamp, information about the chatter (including the “seed messages” created when the chatter was first generated), and the streamer’s name and stream description if provided by the user. The prompts are given to a built-in LLM, and a message is created for each chatter. The messages are spaced out and stored in a .csv file along with metadata including the chatter’s username, display color, owned badges, and the time to be displayed.
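
A simplified sketch of how one prompt might be assembled and the resulting message written to the .csv; the prompt wording, helper names, and exact CSV columns are illustrative, and passing the video frame to the multi-modal model is omitted here:

    import csv

    def build_prompt(chatter, transcript_lines, recent_messages, extra_context):
        """Combine transcript, recent chat, and persona into one prompt string."""
        return (
            f"You are '{chatter['username']}', a viewer in a live chat.\n"
            f"Persona: {chatter['chatter_description']}\n"
            "Recent transcript:\n" + "\n".join(transcript_lines[-3:]) + "\n"
            "Recent chat:\n" + "\n".join(recent_messages[-10:]) + "\n"
            + (extra_context + "\n" if extra_context else "")
            + "Write one short chat message reacting to the stream."
        )

    def append_message(csv_path, timestamp, chatter, message):
        """Store the generated message plus the metadata the renderer needs."""
        with open(csv_path, "a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerow([
                timestamp,
                chatter["username"],
                chatter["username_color"],
                chatter["badge_one"],
                chatter["badge_two"],
                message,
            ])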

Rendering Pipeline

The project uses a custom rendering pipeline built on OpenCV. Each frame is extracted from the provided video file, then each chat element is layered on top. First a semi-transparent rectangle is drawn, then the username of the next chatter in the proper color, then any badges, and finally the message itself in the proper place. New chat messages push older ones up the screen, and after a message has been displayed for a certain amount of time it disappears from the screen entirely. The result is a dynamic and fluid chat that looks and behaves like real people reacting live while the video was filmed. As each frame is processed, a bar at the bottom of the log fills up and displays the rendering completion percentage. The video is then stored in an output folder.
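
A condensed sketch of the per-frame compositing step with OpenCV and Pillow; fonts, badge rendering, and the scrolling/expiry logic are omitted, and the layout numbers are placeholders:

    import cv2
    import numpy as np
    from PIL import Image, ImageDraw, ImageFont

    def draw_chat_overlay(frame_bgr, visible_messages, x=20, y=20, width=420):
        """Composite a semi-transparent chat panel with colored usernames onto one frame."""
        frame = Image.fromarray(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)).convert("RGBA")
        overlay = Image.new("RGBA", frame.size, (0, 0, 0, 0))
        draw = ImageDraw.Draw(overlay)
        font = ImageFont.load_default()

        line_height = 22
        panel_height = line_height * len(visible_messages) + 20
        # Semi-transparent background rectangle behind the chat.
        draw.rectangle([x, y, x + width, y + panel_height], fill=(0, 0, 0, 140))

        for i, (username, color, message) in enumerate(visible_messages):
            ty = y + 10 + i * line_height
            draw.text((x + 10, ty), username + ":", fill=color, font=font)
            draw.text((x + 130, ty), message, fill=(255, 255, 255, 255), font=font)

        merged = Image.alpha_composite(frame, overlay).convert("RGB")
        return cv2.cvtColor(np.array(merged), cv2.COLOR_RGB2BGR)

In the full pipeline, frames would be read with cv2.VideoCapture and written back out with cv2.VideoWriter, with the list of visible messages updated from the .csv timestamps as the frame index advances.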

Chatter Removal

Chatters can be removed from the database by entering their username in the “Ban Chatter” section of the GUI and clicking “Ban Chatter.” The user is informed of a successful removal in the log if a chatter with a matching username is found; otherwise, they are informed that there was an error with the removal process.
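
A sketch of the removal query against the schema shown earlier; whether the chatter’s example messages are also purged is an assumption here:

    import sqlite3

    def ban_chatter(conn: sqlite3.Connection, username: str) -> bool:
        """Delete a chatter; return True if a matching username was found."""
        cur = conn.execute("DELETE FROM CHATTERS WHERE username = ?", (username,))
        conn.execute("DELETE FROM CHAT_LOGS WHERE username = ?", (username,))
        conn.commit()
        return cur.rowcount > 0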

Results

The final product is an application that adds a dynamic and lively chat to any pre-recorded video. It allows for user customization and complete control over every aspect of the workflow while maintaining ease of use and a logical flow, making the product intuitive even to someone who has never used it before. Because the LLMs run locally, the performance of the application depends on the specs of the computer it runs on. The product is not currently available for download due to the complex dependencies required for the program to run. Docker and a setup wizard will be made available in the future so that anyone can quickly and easily download and run the application.

//Demo will be included here//

Conclusion

This was the most complex and extensive solo project I have undertaken. With three separate Python files, an SQLite3 database, 2,500 lines of code, hand-drawn assets, two locally run LLMs, and nearly a dozen dependencies, it proved to be an extreme undertaking. While there is room for improvement, the current iteration of the Live Chat Simulation delivers exactly what it is supposed to in a user-friendly way while still offering complete control over every step of the process. There is still future work to do, such as allowing more ways for the chat to appear on the screen and containerization with Docker for distribution.