I built Voice2Sub: a local AI subtitle generator for video and audio

I built Voice2Sub because many subtitle and transcription workflows still start with uploading a media file to a browser tool.

That works for short public videos, but it becomes awkward when the file is long, private, already stored locally, or part of a repeated editing workflow.

Voice2Sub focuses on a desktop workflow:

Import a local video or audio file
Generate subtitles or transcript text with Whisper AI recognition
Review the result
Export SRT, VTT, TXT, LRC or CSV

Why I built it as a desktop app

A lot of creators, educators, podcasters and journalists work with media that they do not always want to upload to a browser tool.

Examples:

private interviews
long lectures
course recordings
podcasts
internal meetings
YouTube or TikTok editing workflows
archived audio/video files

A local-first desktop app gives users more control over the file, the model, the output format and the processing workflow.

What Voice2Sub does

Voice2Sub is an AI subtitle generator and speech-to-text desktop app for video/audio files.

It currently focuses on:

generating subtitles from local video/audio
creating transcript text from speech
exporting SRT, VTT, TXT, LRC and CSV
running on Windows, macOS Apple Silicon and Linux
supporting CUDA acceleration on compatible Windows/Linux systems
supporting Metal acceleration on Apple Silicon Macs
giving users more control over model selection and transcription settings

Why not just use an online subtitle generator?

Online tools are convenient, but a desktop workflow is useful when:

the media file is large
the content is private
the user wants to process files repeatedly
the user wants local model control
the user wants common subtitle export formats
the user works across Windows, macOS or Linux

Voice2Sub is not trying to replace every online video editor. It is focused on a local subtitle and transcript workflow.

What I learned while building it

The AI part is only one piece of the product.

A desktop AI tool also needs:

reliable model downloads
offline and interrupted download handling
safe retry/resume behavior
cross-platform packaging
clear error messages
GPU acceleration setup
update reliability
localization
clean export formats
a first-run experience that does not confuse users

One thing I underestimated was how important the model download experience is. If the user cannot download or select an AI model, the whole product feels broken even if the transcription engine itself works.
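To make the retry/resume idea concrete, here is a minimal sketch of a resumable model download in Python. It is illustrative only and is not the code Voice2Sub ships: the function names, the `.part` suffix and the SHA-256 check are all assumptions. It relies on the server honoring HTTP `Range` requests and does not handle every edge case (for example, a 416 response when the partial file is already complete).

```python
import hashlib
import os
import urllib.request


def resume_request(url: str, part_path: str) -> urllib.request.Request:
    """Build a GET request that asks only for the bytes not yet on disk."""
    request = urllib.request.Request(url)
    already = os.path.getsize(part_path) if os.path.exists(part_path) else 0
    if already > 0:
        # Hypothetical resume policy: continue from the end of the partial file.
        request.add_header("Range", f"bytes={already}-")
    return request


def sha256_matches(path: str, expected_hex: str) -> bool:
    """Hash the file in 1 MiB chunks and compare with the expected digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()


def download_model(url: str, dest: str, expected_sha256: str) -> None:
    """Download to '<dest>.part', verify it, then move it into place."""
    part_path = dest + ".part"
    request = resume_request(url, part_path)
    with urllib.request.urlopen(request, timeout=30) as response:
        # 206 means the server honored the Range header, so append.
        # Any other success status means a full body, so start over.
        mode = "ab" if response.status == 206 else "wb"
        with open(part_path, mode) as out:
            while chunk := response.read(1 << 20):
                out.write(chunk)
    if not sha256_matches(part_path, expected_sha256):
        os.remove(part_path)
        raise ValueError("checksum mismatch; the partial file was discarded")
    # Rename only after verification so a half-written model is never loaded.
    os.replace(part_path, dest)
```

Writing to a temporary `.part` file and renaming it only after the checksum passes means an interrupted download never leaves a file that looks like a usable model.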

Current platforms

Voice2Sub currently supports:

Windows x64
macOS Apple Silicon
Linux x64

The app also supports hardware acceleration when available:

CUDA on compatible NVIDIA systems
Metal on Apple Silicon Macs
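
As a rough illustration of "when available", the sketch below picks a backend from basic platform facts and falls back to the CPU. It is an assumption about how such a check could look, not Voice2Sub's actual detection logic: the function names are hypothetical, and finding `nvidia-smi` on the PATH is only a hint that an NVIDIA driver is installed, not proof that CUDA will work.

```python
import platform
import shutil


def pick_backend(system: str, machine: str, has_nvidia_driver: bool) -> str:
    """Return 'metal', 'cuda' or 'cpu' for the given platform facts."""
    if system == "Darwin" and machine == "arm64":
        return "metal"  # Apple Silicon Mac
    if system in ("Windows", "Linux") and has_nvidia_driver:
        return "cuda"  # NVIDIA driver appears to be present
    return "cpu"  # safe fallback everywhere else


def detect_backend() -> str:
    """Probe the current machine using only the standard library."""
    has_nvidia = shutil.which("nvidia-smi") is not None
    return pick_backend(platform.system(), platform.machine(), has_nvidia)


if __name__ == "__main__":
    print(detect_backend())
```

Keeping the decision in a pure function (`pick_backend`) makes it easy to test every platform combination without owning the hardware.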

Current export formats

Voice2Sub can export SRT, VTT, TXT, LRC and CSV.

These formats cover common subtitle, transcript, lyric and editing workflows.
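
For readers unfamiliar with the subtitle formats, the sketch below shows how timed segments map onto SRT and WebVTT. It is a simplified illustration, not Voice2Sub's exporter: the `Segment` type and function names are hypothetical. The main differences are that SRT numbers each cue and uses a comma before the milliseconds, while WebVTT starts with a `WEBVTT` header and uses a period.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One recognized chunk of speech; times are in seconds."""
    start: float
    end: float
    text: str


def format_timestamp(seconds: float, decimal_marker: str) -> str:
    """Return HH:MM:SS<marker>mmm for a time given in seconds."""
    total_ms = max(0, round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{millis:03d}"


def to_srt(segments: list[Segment]) -> str:
    """SRT: numbered cues, comma before milliseconds, blank line between cues."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg.start, ",")
        end = format_timestamp(seg.end, ",")
        cues.append(f"{index}\n{start} --> {end}\n{seg.text.strip()}\n")
    return "\n".join(cues)


def to_vtt(segments: list[Segment]) -> str:
    """WebVTT: 'WEBVTT' header, period before milliseconds, unnumbered cues."""
    cues = []
    for seg in segments:
        start = format_timestamp(seg.start, ".")
        end = format_timestamp(seg.end, ".")
        cues.append(f"{start} --> {end}\n{seg.text.strip()}\n")
    return "WEBVTT\n\n" + "\n".join(cues)


if __name__ == "__main__":
    demo = [Segment(0.0, 2.5, "Hello there."), Segment(2.5, 4.0, "Welcome back.")]
    print(to_srt(demo))
    print(to_vtt(demo))
```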

What I want to improve next

I am considering:

batch subtitle generation
better subtitle preview/editing
translation workflow
speaker detection
better presets for YouTube, courses, podcasts and interviews
more polish around the first-run onboarding experience

Links

If you work with subtitles, transcripts, video editing, podcasts or course content, I would love feedback on the workflow.
