Use full text search on youtube channels with yt-dlp

Update:

I recently made a better python script using the ideas in this blog post: https://github.com/NotJoeMartinez/yt-fts

yt-dlp is a tool that can be used to download youtube videos.

yt-dlp -f mp4 "https://youtu.be/jXow8M_8LJE"

It also lets you download the auto generated captions from youtube with the --write-auto-subs flag, and you can skip downloading the video with --skip-download.

yt-dlp --write-auto-sub --skip-download "https://youtu.be/jXow8M_8LJE"

This will spit out a .vtt (Web Video Text Tracks) file formatted like this:

WEBVTT
Kind: captions
Language: en

00:00:00.640 --> 00:00:03.830 align:start position:0%
 
and<00:00:01.120><c> now</c><00:00:02.320><c> tim</c><00:00:02.800><c> dillon</c><00:00:03.439><c> is</c>

00:00:03.830 --> 00:00:03.840 align:start position:0%
and now tim dillon is

00:00:00.640 --> 00:00:03.830 represents the time span in which a statement is said. The following line is the transcription of what was said in that span, with some markup tags. The next cue repeats the text without markup tags and with a less accurate time span. I haven’t found a way to download the captions without markup tags. This means youtube can technically be used as a free substitute for Google’s speech to text api, although if a creator uploads their own captions, the quality of the auto generated captions falls significantly. If you know a creator uploaded captions with their video, you can download them instead of the auto generated captions with --write-subs.
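If the inline tags get in the way, they can be removed after downloading. This is a minimal sketch, assuming every tag is a simple <...> span like the ones in the sample above; strip_vtt_tags is a hypothetical helper name, not something yt-dlp provides.

```python
import re

# Matches inline cue tags such as <00:00:01.120>, <c> and </c>.
TAG_PATTERN = re.compile(r"<[^>]+>")

def strip_vtt_tags(line):
    """Return a caption line with its inline markup tags removed."""
    return TAG_PATTERN.sub("", line).strip()

marked_up = (
    "and<00:00:01.120><c> now</c><00:00:02.320><c> tim</c>"
    "<00:00:02.800><c> dillon</c><00:00:03.439><c> is</c>"
)
print(strip_vtt_tags(marked_up))
```

Run on the marked up line from the sample, this should print the same plain text that the following cue carries.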

You can make a semicolon-delimited csv file with the video id and the title of each video like this:

yt-dlp --print "%(id)s;%(title)s" "https://www.youtube.com/channel/UC4woSp8ITBoYDmjkukhEhxg/videos" >> tim_dillon.csv

9Gly30LrUaE;#220 - Hail Mary | The Tim Dillon Show
G_LpoF9awAE;JFK Tour Guide Tells All
CT5mNKVpja8;#219 - The Gates Of Hell | The Tim Dillon Show
b1I42xTIOCQ;#218 - Fake Business | The Tim Dillon Show

Youtube video ids are unique strings at the end of the base url.

When you share a video you get a shortened base url with the same video id.

https://youtu.be/[videoid]

You can also share a video that starts at a specific time stamp by appending ?t= followed by the starting time in seconds:

https://youtu.be/[videoid]?t=[time in seconds]
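As a small illustration, the link can be assembled from its two parts. This is only a sketch; timestamped_url is a hypothetical helper name, and the id and offset are taken from the sample output at the end of this post.

```python
def timestamped_url(vid_id, seconds):
    """Build a short share link that starts playback at `seconds`."""
    return f"https://youtu.be/{vid_id}?t={int(seconds)}"

print(timestamped_url("dqGyCTbzYmc", 1634))
```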

Download entire channel

Now that we have a csv file with the ID and title of every video, we can use the python subprocess module to run our yt-dlp command on every video in the channel. yt-dlp is written in python and can be integrated into your program without subprocess, but this was quicker for me. This script will give us a directory of vtt files named with this pattern: [videoid].vtt.en.vtt

import subprocess

with open("tim_dillon.csv", "r") as f:
    for line in f:
        # Each row is "[videoid];[title]"; split once so a ";" in a title is kept
        vid_id, vid_title = line.strip().split(";", 1)
        url = "https://www.youtube.com/watch?v=" + vid_id
        # Pass the arguments as a list so no shell quoting is needed
        subprocess.run(
            ["yt-dlp", "--write-auto-sub", "--skip-download",
             "-o", f"subs/{vid_id}.vtt", url]
        )

Make a target database

Saving all the vtt files means we have a searchable dataset, and it’s up to us to figure out what tools we will use to search the data. We could just open up the subtitle directory in vs code, then manually figure out what our timestamped url should be, using the filename for the id and converting the time into seconds.

I chose to make a sqlite database to search the dataset with the following schema, saved as make_table.sql:

CREATE TABLE timdillon (
    vid_id TEXT,
    vid_title TEXT,
    start_time TEXT,
    end_time TEXT,
    sub_titles TEXT
);

sqlite3 yt_fts.db < make_table.sql
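The same table can also be created from python with the standard sqlite3 module instead of the sqlite3 command line tool. This is a sketch rather than what I actually ran; create_table is a hypothetical helper name, and the IF NOT EXISTS clause is an addition so the script can be run twice without an error.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS timdillon (
    vid_id TEXT,
    vid_title TEXT,
    start_time TEXT,
    end_time TEXT,
    sub_titles TEXT
);
"""

def create_table(db_path="yt_fts.db"):
    """Open (or create) the database file and make sure the table exists."""
    con = sqlite3.connect(db_path)
    con.executescript(SCHEMA)
    con.commit()
    return con

# ":memory:" keeps this demo from touching a real file.
con = create_table(":memory:")
columns = [row[1] for row in con.execute("PRAGMA table_info(timdillon)")]
print(columns)
con.close()
```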

Populating the database

There are some python libraries to parse vtt files but regex does the job just fine so that’s what I did here. We use os.walk() to find every vtt file, then call parse_files() on each path, which adds the captions found in that file to the database.

populate_db.py

import os
import re, sqlite3
from pathlib import Path

def main():

    dir_dict = os.walk("subs")
    for root, dirs, files in dir_dict:
        for f in files:
            full_path = os.path.join(root,f)
            parse_files(full_path)

We need this later to get the title for the current video ID:

def get_title_from_id(vid_id):
    # Scan the csv for the row whose first field is this video ID
    with open("tim_dillon.csv", "r") as f:
        lines = f.readlines()
        for line in lines:
            line = line.split(";")
            current_id = line[0]
            vid_title = line[1].strip()
            if vid_id == current_id:
                return vid_title
    # Falls through and returns None when the ID is not in the csv

Every line with a start and end time ends with the string: align:start position:0%

Ex:

01:22:26.460 --> 01:22:26.470 align:start position:0%
I have and they're dead okay they go

If we land on a line with that string we know we can grab the start and end time using regex groups to isolate the time slots in this format: [start] --> [end].

start = re.search("^(.*) -->",time_match.group(1))
end = re.search("--> (.*)",time_match.group(1))

We also know that the next line will be the transcribed audio within that time frame so if we enumerate the lines we can just add one to the current index and it should be the text we want.

sub_titles = lines[count + 1]
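Put together, the two steps look roughly like this on the example lines above. This is a self-contained sketch, not the script itself: parse_cues and CUE_PATTERN are hypothetical names, it uses a single pattern with two groups instead of two searches, and the bounds check on count + 1 is an addition.

```python
import re

# One pattern captures both ends of the time span from a cue header line.
CUE_PATTERN = re.compile(r"^(\S+) --> (\S+) align:start position:0%")

def parse_cues(lines):
    """Return (start, end, text) tuples for every cue header in `lines`."""
    cues = []
    for count, line in enumerate(lines):
        match = CUE_PATTERN.match(line)
        # Guard against a header that is the very last line of the file.
        if match and count + 1 < len(lines):
            text = lines[count + 1].strip()
            cues.append((match.group(1), match.group(2), text))
    return cues

sample = [
    "01:22:26.460 --> 01:22:26.470 align:start position:0%\n",
    "I have and they're dead okay they go\n",
]
print(parse_cues(sample))
```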

Then with all these variables isolated we can send them to the database

cur.execute("INSERT INTO timdillon VALUES (?,?,?,?,?)", 
(vid_id, vid_title, start_time, end_time, sub_titles))

full function:

def parse_files(full_path):
    fp = Path(full_path)
    vid_id = fp.stem[:11]
    vid_title = get_title_from_id(vid_id)

    con = sqlite3.connect("yt_fts.db")
    cur = con.cursor()

    time_pattern = "^(.*) align:start position:0%"

    with open(full_path, "r") as f:
        lines = f.readlines()
        for count, line in enumerate(lines):
            time_match = re.match(time_pattern, line)

            if time_match:
                start = re.search("^(.*) -->",time_match.group(1))
                end = re.search("--> (.*)",time_match.group(1))

                start_time = start.group(1)
                end_time = end.group(1)

                sub_titles = lines[count + 1]

                cur.execute("INSERT INTO timdillon VALUES (?,?,?,?,?)", 
                (vid_id, vid_title, start_time, end_time, sub_titles))

    con.commit()
    con.close()

if __name__ == '__main__':
    main()

CLI script

Now that we have a database with a table of structured data, we need an interface for retrieving data. Ideally we don’t want to have to match the exact quote, so we need to use wildcards in our sql queries. Wildcards return quotes which contain our search text as a substring, as well as exact matches. To implement wildcards we use the sqlite LIKE keyword:

SELECT -- return these columns
	vid_id, start_time, sub_titles 
FROM -- from this table 
	timdillon 
WHERE -- where this column
	sub_titles 
LIKE -- is kinda like 
	'%in the big city%'
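To see the wildcard behaviour in isolation, here is a small sketch against a throwaway in-memory database. The helper names build_demo_db and search are hypothetical, and the end times and the second row are made up for the demo; only the first quote comes from the real data.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory table with two rows (partly invented demo data)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE timdillon (vid_id TEXT, vid_title TEXT, "
        "start_time TEXT, end_time TEXT, sub_titles TEXT)"
    )
    rows = [
        ("dqGyCTbzYmc", "164 - Life In The Big City", "00:27:17.549",
         "00:27:17.559", "life in the big city it was one of my"),
        ("dqGyCTbzYmc", "164 - Life In The Big City", "00:30:00.000",
         "00:30:00.010", "a made up line that should not match"),
    ]
    con.executemany("INSERT INTO timdillon VALUES (?,?,?,?,?)", rows)
    return con

def search(con, quote):
    """Return rows whose subtitle text contains `quote` anywhere."""
    cur = con.execute(
        "SELECT vid_id, start_time, sub_titles FROM timdillon "
        "WHERE sub_titles LIKE ?",
        ("%" + quote + "%",),  # % on both sides matches any surrounding text
    )
    return cur.fetchall()

con = build_demo_db()
print(search(con, "in the big city"))
con.close()
```

Passing the pattern as a bound parameter, as the full script below also does, keeps quotes in the search text from breaking the query.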

If we do get a match from the database we need to convert the start time string into seconds so we can build our time stamped url. This means converting 01:22:26.460 into 4943. The time_to_secs() function does this by isolating the hours, minutes and seconds with regex, converting them to integers, multiplying hours by 3600 and minutes by 60, and returning the sum minus three (3600 + 1320 + 26 - 3 = 4943). I subtracted three seconds because it gives the viewer time to process what they are going to listen to.

time_rex = re.search(r"^(\d\d):(\d\d):(\d\d)", time_str)
hours = int(time_rex.group(1)) * 3600 
mins = int(time_rex.group(2)) * 60
secs = int(time_rex.group(3))
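One edge case worth noting: a quote in the first three seconds of a video would give a negative offset. The variant below is a sketch of one way to handle that, not what the script in this post does; time_to_secs_clamped and its lead_in parameter are hypothetical names.

```python
import re

def time_to_secs_clamped(time_str, lead_in=3):
    """Convert "hh:mm:ss.mmm" to whole seconds, minus a short lead-in."""
    match = re.search(r"^(\d\d):(\d\d):(\d\d)", time_str)
    hours, mins, secs = (int(group) for group in match.groups())
    total_secs = hours * 3600 + mins * 60 + secs
    # Clamp at zero so very early quotes never produce a negative ?t= value.
    return max(0, total_secs - lead_in)

print(time_to_secs_clamped("00:27:17.549"))  # 27 * 60 + 17 - 3
print(time_to_secs_clamped("00:00:01.000"))  # would be -2 without the clamp
```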

For some reason the vtt files all repeat each line of dialog under a slightly different time frame, which is really annoying, but I did not fix it when entering the data into the database so we’re dealing with it now. The id_stamp variable prevents us from repeating lines of dialog within the same video: we append a formatted string [videoid]hh:mm:ss to an array that we check on every iteration before printing, so we don’t repeat ourselves.

id_stamp =  vid_id + start[:-4]
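The same idea can be sketched on its own with a set instead of a list, which keeps the membership check cheap as the number of matches grows. unique_matches is a hypothetical helper name, and the second row below is invented to stand in for a repeated cue.

```python
def unique_matches(rows):
    """Yield rows whose (video id, hh:mm:ss) stamp has not been seen yet."""
    seen = set()
    for row in rows:
        vid_id, start = row[0], row[2]
        id_stamp = vid_id + start[:-4]  # drop the ".mmm" milliseconds
        if id_stamp not in seen:
            seen.add(id_stamp)
            yield row

rows = [
    ("dqGyCTbzYmc", "164 - Life In The Big City", "00:27:17.549",
     "00:27:17.559", "life in the big city it was one of my"),
    # Invented duplicate: same second, slightly different time frame.
    ("dqGyCTbzYmc", "164 - Life In The Big City", "00:27:17.559",
     "00:27:20.000", "life in the big city it was one of my"),
]
for row in unique_matches(rows):
    print(row[2], row[4])
```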

full function:

import sys, sqlite3, re

def main():

    if len(sys.argv) < 2:
        print("Need quote as argument")
        sys.exit(1)
    else:
        get_quotes(sys.argv[1])

def get_quotes(quote):

    con = sqlite3.connect("yt_fts.db")
    cur = con.cursor()
    cur.execute("SELECT * FROM timdillon WHERE sub_titles LIKE ?", 
    ('%'+quote+'%',))
    res = cur.fetchall()
    con.close()

    if len(res) == 0:
        print("No matches found")
    else:

        shown_titles = []
        shown_stamps = []

        for quote in res: 
            vid_id = quote[0]
            vid_title = quote[1]
            start = quote[2]
            end = quote[3]
            subs = quote[4]

            #  should look like: 6C7vx4Ot2qk01:28:00
            id_stamp =  vid_id + start[:-4]  

            time = time_to_secs(start) 

            if vid_title not in shown_titles:
                print(f"\nMatches found in: \"{vid_title}\"")
                shown_titles.append(vid_title)

            if id_stamp not in shown_stamps:
                print(f"\n") 
                print(f"    Quote: \"{subs.strip()}\"")
                print(f"    Time Stamp: {start}")
                print(f"    Link: https://youtu.be/{vid_id}?t={time}")
                shown_stamps.append(id_stamp)

def time_to_secs(time_str):

    time_rex = re.search(r"^(\d\d):(\d\d):(\d\d)", time_str)
    hours = int(time_rex.group(1)) * 3600 
    mins = int(time_rex.group(2)) * 60
    secs = int(time_rex.group(3)) 

    total_secs =  hours + mins + secs
    return total_secs - 3

if __name__ == '__main__':
    main()

After all this we should have a cli that takes a string as an argument and returns the time stamped urls with full quotes for everything it finds matching that string.

shortened sample

[$] python3 yt_fts.py "in the big city"

Matches found in: "164 - Life In The Big City"


    Quote: "life in the big city it was one of my"
    Time Stamp: 00:27:17.549
    Link: https://youtu.be/dqGyCTbzYmc?t=1634


    Quote: "life in the big city my mother would go"
    Time Stamp: 00:27:33.210
    Link: https://youtu.be/dqGyCTbzYmc?t=1650


    Quote: "saying life in the big city he had a few"
    Time Stamp: 00:28:00.990
    Link: https://youtu.be/dqGyCTbzYmc?t=1677

It would be better if this interface was something like react emoji search, but the last time I dove into javascript the night terrors didn’t stop for several months.