This blog post will guide you through a Python-based method for automating YouTube transcription retrieval using Selenium. This script extracts the video title and its entire transcription by providing a YouTube video URL, then saves them to a text file. This can be useful for archiving, analyzing content, or creating summaries.
Prerequisites
You’ll need:
- Python: Version 3.6 or later
- Selenium: Python package for automating web browsers
- ChromeDriver: Required to run Chrome browser instances with Selenium
Install Selenium using:
pip install selenium
Downloading ChromeDriver
To run this script, you need ChromeDriver to control the Chrome browser via Selenium. Here’s how to download it:
- Go to the official ChromeDriver for Testing page.
- Find the version that matches your installed version of Chrome.
- Download the appropriate version for your operating system.
-
Extract the
chromedriver
executable and place it in a known directory, like aselenium
folder within your project directory.
Code Walkthrough
Step 1: Initialize the WebDriver
The
initialize_driver
function sets up a Chrome WebDriver in headless mode, keeping the browser
invisible during the script’s execution.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
def initialize_driver(CHROMEDRIVER_PATH, headless=True):
chrome_options = Options()
if headless:
chrome_options.add_argument('--headless')
service = Service(CHROMEDRIVER_PATH)
driver = webdriver.Chrome(service=service, options=chrome_options)
return driver
Step 2: Extract Title and Transcript
The
get_transcriptions
the
function opens a YouTube video, clicks the necessary buttons to display
the transcript, and extracts each line.
import time
def get_transcriptions(driver, video_url, button1_xpath, button2_xpath, title_xpath):
driver.get(video_url)
time.sleep(5)
video_title = driver.find_element(By.XPATH, title_xpath).text
time.sleep(2)
driver.find_element(By.XPATH, button1_xpath).click()
time.sleep(2)
driver.find_element(By.XPATH, button2_xpath).click()
time.sleep(2)
transcript, index = '', 1
while True:
transcript_xpath = f'//*[@id="segments-container"]/ytd-transcript-segment-renderer[{index}]/div/yt-formatted-string'
try:
sentence = driver.find_element(By.XPATH, transcript_xpath).text
except:
break
transcript = transcript + sentence + '\n'
index += 1
driver.quit()
return transcript, video_title
Step 3: Define XPaths and Run the Script
Define the XPaths for buttons to show the transcript and extract the title. YouTube updates may require adjusting these XPaths.
title_xpath = '//*[@id="title"]/h1/yt-formatted-string'
more_xpath = '//*[@id="expand"]'
show_transcript_xpath = '//*[@id="primary-button"]/ytd-button-renderer/yt-button-shape/button/yt-touch-feedback-shape/div/div[2]'
CHROMEDRIVER_PATH = r'selenium\chromedriver.exe'
# Initialize driver
driver = initialize_driver(CHROMEDRIVER_PATH, headless=True)
# URL of the YouTube video
video_url = "https://www.youtube.com/watch?v=x6TsR3y5Qfg"
# Fetch transcription
transcription, video_title = get_transcriptions(driver, video_url, more_xpath, show_transcript_xpath, title_xpath)
# Save transcription to file
with open("transcription.txt", "w") as file:
file.write(f"Video Title: {video_title}\n\n")
file.write("Transcription:\n")
file.write(transcription)
Running the Script
Ensure that
chromedriver.exe
is
available in the specified directory or update
CHROMEDRIVER_PATH
to the
full path.
To run the script:
python youtube_transcription.py
The script will create a
transcription.txt
file
containing the video title and transcription.
Code Enhancements
- Error Handling: Adding try-except blocks for specific elements helps in case YouTube’s structure changes.
-
Adjustable Wait Times: Instead of
time.sleep()
, use Selenium’sWebDriverWait
for more reliable element loading.
Conclusion
This script automates YouTube transcript extraction, making it a valuable tool for content creation, educational purposes, and accessibility. Feel free to modify and enhance it to support other websites or batch processing multiple videos.
Comments
Post a Comment