Extract Data From Youtube with `yt-dlp` And `jq`
Abstract
In this article, I summarise the steps you need to take to extract data from YouTube using `yt-dlp` and `jq`.
The first can generate a JSON document containing all the data of a given YouTube link; in particular, it works with YouTube playlists and channels too.
The second can query a JSON document and return a new one containing only the data we are interested in.
Content
Motivation
It all started with the need to automatically fill the sBots database with any new content coming out of YouTube. For details, check this issue.
Therefore, we need a way to extract data from the site easily and get a clean result, so we don't actually need to parse extremely complicated JSON. Fortunately, yt-dlp and jq come to the rescue.
Get Data from YouTube
After a brief investigation, we can just run the following command:
```sh
yt-dlp -J <<youtubeLink>>
```
This works with YouTube videos, playlists, and channels, each yielding a different JSON. However, each JSON is wrapped inside the next: the YouTube video JSON structure can be found inside the playlist JSON, and the same happens between the playlist and the channel. This makes the next phase easier, as some extraction logic can be reused.
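For example, saving each JSON to a file keeps it around for the jq steps later; a quick sketch, where the URLs below are placeholders:

```sh
# Placeholders: substitute real values for VIDEO_ID, LIST_ID, CHANNEL_HANDLE.
yt-dlp -J 'https://www.youtube.com/watch?v=VIDEO_ID'      > video.json
yt-dlp -J 'https://www.youtube.com/playlist?list=LIST_ID' > playlist.json
yt-dlp -J 'https://www.youtube.com/@CHANNEL_HANDLE'       > channel.json
```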
The size of such a JSON can be quite big depending on how big the target playlist/channel is. In my use cases, I saw a size of:
- 38MB for a playlist of 71 videos
- 80MB for a channel of 182 videos
This puts the size of a single video at ~0.5MB. yt-dlp takes quite some time to put together this information. In our case, that's fine, since we plan to run it periodically and not that often. An optimization could be to save the JSONs locally and re-download them only when they become obsolete, e.g. 1 month old.
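A minimal caching sketch along those lines, assuming a 30-day freshness threshold (the URL and file name are placeholders):

```sh
# Re-download the playlist JSON only if the cached copy is missing
# or older than ~30 days.
URL='https://www.youtube.com/playlist?list=LIST_ID'
CACHE=playlist.json
if [ -z "$(find "$CACHE" -mtime -30 2>/dev/null)" ]; then
  yt-dlp -J "$URL" > "$CACHE"
fi
```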
Get Automatic Caption Data
Each YouTube video JSON also contains a field related to the automatic captions, where several URLs can be found. Among the available formats, JSON is available for download. This new JSON contains all the captions, allowing us to extract the transcript of the video itself. Here is an example of such a JSON. Fun fact: its extension is TXT 🤷
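As a quick check, the available caption languages and formats can be listed directly from the video JSON; a sketch (VIDEO_ID is a placeholder):

```sh
# List each automatic-caption language with the formats it offers.
yt-dlp -J 'https://www.youtube.com/watch?v=VIDEO_ID' \
  | jq '.automatic_captions | to_entries[] | {lang: .key, formats: [.value[].ext]}'
```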
Extract Data from the JSON
In this section, I'm not going to dive into the details of how to use the jq command and turn this into yet another tutorial. Rather, I'll explain what my goal was and the final command that reaches it, pointing out some specific bits where they are particularly interesting.
YouTube Playlist
In this case, we have to expand the previous command: this time we generate a JSON array by traversing the playlist. This can be done with the following command:
```sh
cat barbero.json | jq 'del(..|nulls) | [.entries[] | {show_url: .webpage_url, show_title: .title, show_upload_date: .upload_date, show_duration: .duration, show_description: .description, show_is_live: .is_live, show_origin_automatic_caption: .automatic_captions | with_entries(if (.key|test(".*orig")) then ( {key: .key, value: .value } ) else empty end)[][] | select(.ext|contains("json")) | .url }]'
```
The highlights here are:
- `del(..|nulls)`: eliminates all the `null` values in the JSON. This is necessary because we would get an error if we tried to extract data from a `null` value.
- The outer square brackets: if omitted, we would get a series of JSON objects one after the other, but not a valid JSON array.
- The traversal of an array value (e.g. `.entries[]`): the `[]` at the end means "compute all the entries from now on".
- `with_entries(if (.key|test(".*orig")) then ( {key: .key, value: .value } ) else empty end)`: quite complicated, but basically a way to filter the fields by matching a regexp. Only the fields ending with `orig` are kept, since we are interested in the automatic captions of the original language only.
- `select(.ext|contains("json"))`: filters the entries whose `.ext` field value contains the word `json`. We are not interested in other formats, only in JSON.
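Since the filter is getting long, it can also be kept in its own file and run with `jq -f`; a sketch of the same filter, reformatted for readability (`extract.jq` is just an example name):

```sh
# extract.jq -- the same filter as above, split over several lines.
cat > extract.jq <<'EOF'
del(..|nulls)
| [ .entries[]
    | { show_url: .webpage_url,
        show_title: .title,
        show_upload_date: .upload_date,
        show_duration: .duration,
        show_description: .description,
        show_is_live: .is_live,
        show_origin_automatic_caption:
          .automatic_captions
          | with_entries(if (.key|test(".*orig"))
                         then {key: .key, value: .value}
                         else empty end)[][]
          | select(.ext|contains("json"))
          | .url } ]
EOF

jq -f extract.jq barbero.json
```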
YouTube Channel
A YouTube Channel's JSON adds one more layer to the previous case because we might have multiple "playlists", one for:
- Videos
- Shorts
- Live
Therefore, we need to inspect the file and get to the right playlist before applying the previous transformation. The resulting command is:
```sh
cat youtubo.json | jq 'del(..|nulls) | .entries[] | select(.title|contains("Videos")) | [.entries[] | {show_url: .webpage_url, show_title: .title, show_upload_date: .upload_date, show_duration: .duration, show_description: .description, show_is_live: .is_live, show_origin_automatic_caption: .automatic_captions | with_entries(if (.key|test(".*orig")) then ( {key: .key, value: .value } ) else empty end)[][] | select(.ext|contains("json")) | .url }]'
```
At the beginning of the jq command you can see an additional `.entries[]` step, followed by a filter on the value of `.title`. That brings up the Videos playlist. The rest of the command is the same as above. If we want to focus on other playlists, we just have to filter by a different category.
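For instance, to check which sub-playlists the channel JSON exposes and then target the Shorts tab instead, a quick sketch (same file as above):

```sh
# List the sub-playlist titles available in the channel JSON.
cat youtubo.json | jq '.entries[].title'

# Swap the title filter to target another tab, e.g. Shorts.
cat youtubo.json | jq 'del(..|nulls) | .entries[] | select(.title|contains("Shorts")) | [.entries[].title]'
```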
Automatic Captions
Once we get the URL for the automatic captions and retrieve the JSON, we need to extract the text from it. Here is an example of the automatic caption's JSON.
The command to do that is the following:
```sh
cat f.txt | jq '[.events[] | select(.segs != null) | .segs[] | .utf8]'
```
We still want a resulting JSON array, but the events should contain some text (that is what the `select` clause filters for). Then we are interested only in the `.utf8` field, ignoring the rest.
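If a single transcript string is preferable to an array of fragments, the segments can be joined; a sketch based on the command above:

```sh
# -r emits raw text instead of a JSON-quoted string.
cat f.txt | jq -r '[.events[] | select(.segs != null) | .segs[].utf8] | join("")'
```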
Conclusions
I hope you learned how to extract data from YouTube and that it will be useful for your projects, or just for fun. The takeaway here is to invest some time in learning the jq command, as it's really useful for manipulating JSONs and getting what you want out of them. There are also multiple online playgrounds you can use to test your expressions.