How to extract the transcript from a YouTube video

Hi, I will show you how to extract the transcript of a YouTube video using the Code Piece.
Please keep in mind that this is a pretty rough experiment.

1st step: Obtain the YouTube video URL (the full URL, not just the video ID)
2nd step: Create your Code Piece and configure it like this (the code below reads the URL from an input named videourl):

3rd step: Press the Full Screen button:

4th step: Set your package.json to:

{
  "dependencies": {
    "srt": "0.0.3",
    "yt-dlp-wrap": "2.3.12",
    "node-html-parser": "6.1.12"
  }
}

5th step: Set your index.ts file to:

import YTDlpWrap from 'yt-dlp-wrap';
import { promises as fs } from 'fs';
import srt from "srt";
import { parse } from 'node-html-parser';

export const code = async (inputs) => {
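    // Fetch the yt-dlp binary from GitHub into the working directory,
    // then point the wrapper at it.
    await YTDlpWrap.downloadFromGithub();
    const ytDlpWrap = new YTDlpWrap('./yt-dlp');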
    await ytDlpWrap.execPromise([
        inputs.videourl,
        '--skip-download',
        '--write-auto-subs',
        '--sub-langs',
        'en',
        '--sub-format',
        'ttml',
        '--convert-subs',
        'srt',
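        '--exec',
        // In a template literal, \{ collapses to {, so the backslashes are
        // doubled to reach sed intact. Note that -i '' is the BSD/macOS
        // spelling of in-place editing; GNU sed expects plain -i.
        `
        before_dl:"sed -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e '/^[[:digit:]]\\{1,3\\}$/d' -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d' -i '' %(requested_subtitles.:.filepath)#q"
        `,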
    ]);
    const path = ".";
    const files = await fs.readdir(path);
    let toReturn = "";

    const readFileAndProcess = async (file) => {
        if (file.endsWith(".srt")) {
            try {
                const data = await fs.readFile(file, 'utf-8');
                const jsonObj = await srt.fromString(data);
                for (const objeto in jsonObj) {
                    const aux_text = jsonObj[objeto]['text'];
                    const tag = parse(aux_text);
                    toReturn += tag.text + " ";
                    console.log(tag.text);
                }
            } catch (error) {
                console.error("Error processing file:", file, error);
            }
        }
    };

    for (const file of files) {
        await readFileAndProcess(file);
    }

    return toReturn;
};
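
For anyone who wants to see what the sed expressions in the --exec argument are meant to do, here is a rough sketch of the same cleanup in plain TypeScript. The function name srtToPlainText and the sample subtitle text are made up for illustration and are not part of the original code; it also drops any caption line that consists only of digits, which may or may not be what you want.

```typescript
// Hypothetical helper (not from the original post): strips cue numbers,
// timestamp lines, markup tags and blank lines from SRT text.
function srtToPlainText(srtContent: string): string {
    // Matches lines such as "00:00:00,000 --> 00:00:02,000".
    const timestampLine = /^\d{2}:\d{2}:\d{2}[.,]\d{3} --> \d{2}:\d{2}:\d{2}[.,]\d{3}/;
    // Matches cue numbers such as "1" or "42".
    const cueIndexLine = /^\d+$/;

    return srtContent
        .split(/\r?\n/)
        .map((line) => line.replace(/<[^>]*>/g, '').trim()) // remove tags like <c>
        .filter((line) => line !== '' && !timestampLine.test(line) && !cueIndexLine.test(line))
        .join(' ');
}

// Made-up sample input, only to show the shape of the data.
const sampleSrt = [
    '1',
    '00:00:00,000 --> 00:00:02,000',
    '<c>hello</c> world',
    '',
    '2',
    '00:00:02,000 --> 00:00:04,000',
    'second line',
    '',
].join('\n');

const plainText = srtToPlainText(sampleSrt);
console.log(plainText); // prints "hello world second line"
```

Doing the cleanup in code like this would avoid depending on which sed variant is installed, at the cost of not using the srt and node-html-parser packages.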

Disclaimer: Use this at your own risk.
PS: I may be missing a “delete file” step, so if you download multiple transcripts, it probably won’t work as expected.
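
Regarding the missing “delete file” part, one possible approach is to remove each .srt file right after reading it, so a later run does not pick up subtitles left over from an earlier video. This is only a sketch: the helper name readAndRemoveSrtFiles, the temporary directory and the file names are assumptions for the demo, and it uses the synchronous fs API for brevity instead of the promises API used in the code above.

```typescript
import * as fs from 'fs';
import * as os from 'os';
import * as path from 'path';

// Hypothetical helper (not from the original post): returns the contents of
// every .srt file in `dir` and deletes each file once it has been read.
function readAndRemoveSrtFiles(dir: string): string[] {
    const contents: string[] = [];
    for (const name of fs.readdirSync(dir)) {
        if (!name.endsWith('.srt')) continue; // leave other files alone
        const fullPath = path.join(dir, name);
        contents.push(fs.readFileSync(fullPath, 'utf-8'));
        fs.unlinkSync(fullPath); // the "delete file" step
    }
    return contents;
}

// Demo in a throwaway directory with made-up file names.
const workDir = fs.mkdtempSync(path.join(os.tmpdir(), 'transcript-'));
fs.writeFileSync(path.join(workDir, 'video.en.srt'), 'hello');
fs.writeFileSync(path.join(workDir, 'notes.txt'), 'keep me');

const srtContents = readAndRemoveSrtFiles(workDir);
console.log(srtContents);             // prints [ 'hello' ]
console.log(fs.readdirSync(workDir)); // prints [ 'notes.txt' ]
```

In the Code Piece itself, the same idea would mean calling fs.unlink (from the promises API) on each .srt file after it has been processed.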

First of all, thanks a bunch for the effort, I really appreciate your input.

I tried it but had no success. I am getting the following error:

"missing ) after argument list"

Do you have a line number for that? It's weird, because I tested it in my environment before publishing it. Could you share the video you are trying it with, so I can test it on my side as well?

Not really, and that's the worst part: I can't figure out why. ChatGPT says the problem could be:

'--exec',
`before_dl:"sed -e '/^[0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9] --> [0-9][0-9]:[0-9][0-9]:[0-9][0-9].[0-9][0-9][0-9]$/d' -e '/^[[:digit:]]\{1,3\}$/d' -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d' -i '' %(requested_subtitles.:.filepath)#q"`,
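
A guess at what is going on here: a "missing ) after argument list" error usually means the parser lost track of where a string ends, which can happen if the backticks or straight quotes around this argument get changed when copying from the forum. A separate, quieter issue is that a normal template literal drops the backslashes in \{1,3\}. The sketch below shows one way to sidestep both, using String.raw and a small builder; buildExecArg is a hypothetical helper name, not something from yt-dlp or yt-dlp-wrap, and only two of the sed expressions are included to keep it short.

```typescript
// A normal template literal treats "\{" as an escape and drops the backslash.
const plainTemplate = `/^[[:digit:]]\{1,3\}$/d`;
// String.raw keeps the text exactly as typed, backslashes included.
const rawTemplate = String.raw`/^[[:digit:]]\{1,3\}$/d`;

console.log(plainTemplate); // prints /^[[:digit:]]{1,3}$/d
console.log(rawTemplate);   // prints /^[[:digit:]]\{1,3\}$/d

// Hypothetical helper: assembles the --exec value from separate sed
// expressions, so each quote is written in exactly one place.
function buildExecArg(sedExpressions: string[]): string {
    const flags = sedExpressions.map((expr) => `-e '${expr}'`).join(' ');
    // -i '' is the BSD/macOS form used in the post above.
    return `before_dl:"sed ${flags} -i '' %(requested_subtitles.:.filepath)#q"`;
}

const execArg = buildExecArg([rawTemplate, 's/<[^>]*>//g']);
console.log(execArg);
```

The resulting string can then be passed as the element after '--exec' in the argument array.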

Thanks a lot for your big hint. With it, I was able to complete this pending task.

You may see the detailed reference below.
