From cd3dd0157fe122f5eba317f0e8d15ad2e590dc6b Mon Sep 17 00:00:00 2001
From: Jonas Depoix
Date: Tue, 20 Oct 2020 10:53:50 +0200
Subject: [PATCH 1/6] migrated to travis-ci.com

---
 README.md | 54 +++++++++++++++++++++++++++---------------------------
 1 file changed, 27 insertions(+), 27 deletions(-)

diff --git a/README.md b/README.md
index a7c22a7..d2f3e60 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
 # YouTube Transcript/Subtitle API (including automatically generated subtitles and subtitle translations)
-
-[![Donate](https://img.shields.io/badge/Donate-PayPal-green.svg)](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BAENLEW8VUJ6G&source=url) [![Build Status](https://travis-ci.org/jdepoix/youtube-transcript-api.svg)](https://travis-ci.org/jdepoix/youtube-transcript-api) [![Coverage Status](https://coveralls.io/repos/github/jdepoix/youtube-transcript-api/badge.svg?branch=master)](https://coveralls.io/github/jdepoix/youtube-transcript-api?branch=master) [![MIT license](http://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](http://opensource.org/licenses/MIT) [![image](https://img.shields.io/pypi/v/youtube-transcript-api.svg)](https://pypi.org/project/youtube-transcript-api/) [![image](https://img.shields.io/pypi/pyversions/youtube-transcript-api.svg)](https://pypi.org/project/youtube-transcript-api/)
+
+[![Donate](https://img.shields.io/badge/Donate-PayPal-green.svg)](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BAENLEW8VUJ6G&source=url) [![Build Status](https://travis-ci.com/jdepoix/youtube-transcript-api.svg)](https://travis-ci.com/jdepoix/youtube-transcript-api) [![Coverage Status](https://coveralls.io/repos/github/jdepoix/youtube-transcript-api/badge.svg?branch=master)](https://coveralls.io/github/jdepoix/youtube-transcript-api?branch=master) [![MIT license](http://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](http://opensource.org/licenses/MIT) [![image](https://img.shields.io/pypi/v/youtube-transcript-api.svg)](https://pypi.org/project/youtube-transcript-api/) [![image](https://img.shields.io/pypi/pyversions/youtube-transcript-api.svg)](https://pypi.org/project/youtube-transcript-api/)
 
 This is an python API which allows you to get the transcripts/subtitles for a given YouTube video. It also works for automatically generated subtitles, supports translating subtitles and it does not require a headless browser, like other selenium based solutions do!
@@ -147,27 +147,27 @@ for transcript in transcript_list:
     # translating the transcript will return another transcript object
     print(transcript.translate('en').fetch())
-
+
 # you can also directly filter for the language you are looking for, using the transcript list
 transcript = transcript_list.find_transcript(['de', 'en'])
-
+
 # or just filter for manually created transcripts
 transcript = transcript_list.find_manually_created_transcript(['de', 'en'])
-
+
 # or automatically generated ones
 transcript = transcript_list.find_generated_transcript(['de', 'en'])
 ```
-
+
 ## CLI
-
+
 Execute the CLI script using the video ids as parameters and the results will be printed out to the command line:
-
+
 ```
 youtube_transcript_api ...
 ```
-
+
 The CLI also gives you the option to provide a list of preferred languages:
-
+
 ```
 youtube_transcript_api ... --languages de en
 ```
@@ -178,9 +178,9 @@ You can also specify if you want to exclude automatically generated or manually
 youtube_transcript_api ... --languages de en --exclude-generated
 youtube_transcript_api ... --languages de en --exclude-manually-created
 ```
-
+
 If you would prefer to write it into a file or pipe it into another application, you can also output the results as json using the following line:
-
+
 ```
 youtube_transcript_api ... --languages de en --json > transcripts.json
 ```
@@ -196,21 +196,21 @@ If you are not sure which languages are available for a given video you can call
 ```
 youtube_transcript_api --list-transcripts
 ```
-
+
 ## Proxy
-
+
 You can specify a https/http proxy, which will be used during the requests to YouTube:
-
+
 ```python
 from youtube_transcript_api import YouTubeTranscriptApi
-
+
 YouTubeTranscriptApi.get_transcript(video_id, proxies={"http": "http://user:pass@domain:port", "https": "https://user:pass@domain:port"})
 ```
-
+
 As the `proxies` dict is passed on to the `requests.get(...)` call, it follows the [format used by the requests library](http://docs.python-requests.org/en/master/user/advanced/#proxies).
-
+
 Using the CLI:
-
+
 ```
 youtube_transcript_api --http-proxy http://user:pass@domain:port --https-proxy https://user:pass@domain:port
 ```
@@ -219,13 +219,13 @@ youtube_transcript_api --http-proxy http://us
 Some videos are age restricted, so this module won't be able to access those videos without some sort of authentication. To do this, you will need to have access to the desired video in a browser. Then, you will need to download that pages cookies into a text file. You can use the Chrome extension [cookies.txt](https://chrome.google.com/webstore/detail/cookiestxt/njabckikapfpffapmjgojcnbfjonfjfg?hl=en) or the Firefox extension [cookies.txt](https://addons.mozilla.org/en-US/firefox/addon/cookies-txt/).
 
-Once you have that, you can use it with the module to access age-restricted videos' captions like so.
+Once you have that, you can use it with the module to access age-restricted videos' captions like so.
 
 ```python
 from youtube_transcript_api import YouTubeTranscriptApi
-
+
 YouTubeTranscriptApi.get_transcript(video_id, cookies='/path/to/your/cookies.txt')
-
+
 YouTubeTranscriptApi.get_transcripts([video_id], cookies='/path/to/your/cookies.txt')
 ```
@@ -235,13 +235,13 @@ Using the CLI:
 youtube_transcript_api --cookies /path/to/your/cookies.txt
 ```
-
+
 ## Warning
-
+
 This code uses an undocumented part of the YouTube API, which is called by the YouTube web-client. So there is no guarantee that it won't stop working tomorrow, if they change how things work. I will however do my best to make things working again as soon as possible if that happens. So if it stops working, let me know!
-
+
 ## Donation
-
+
 If this project makes you happy by reducing your development time, you can make me happy by treating me to a cup of coffee :)
-
+
 [![Donate](https://www.paypalobjects.com/en_US/i/btn/btn_donateCC_LG.gif)](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BAENLEW8VUJ6G&source=url)

From 14c70359ba6a39cdc0e130e05925942780905e55 Mon Sep 17 00:00:00 2001
From: Your Name
Date: Thu, 21 Jan 2021 19:43:29 +0100
Subject: [PATCH 2/6] Fix "video not available" being shown to the user when
 YouTube starts asking for captcha resolution due to receiving too many
 requests from the same IP. Show an appropriate message instead. To be able to
 keep making requests, the captcha must be solved in a browser and the browser
 cookie must be passed to youtube-transcript-api.

---
 youtube_transcript_api/__init__.py            |   1 +
 youtube_transcript_api/_errors.py             |   5 +-
 youtube_transcript_api/_transcripts.py        |   3 +
 .../youtube_too_many_requests.html.static     | 239 ++++++++++++++++++
 youtube_transcript_api/test/test_api.py       |  11 +
 5 files changed, 258 insertions(+), 1 deletion(-)
 create mode 100644 youtube_transcript_api/test/assets/youtube_too_many_requests.html.static

diff --git a/youtube_transcript_api/__init__.py b/youtube_transcript_api/__init__.py
index 1fe0f73..baefd02 100644
--- a/youtube_transcript_api/__init__.py
+++ b/youtube_transcript_api/__init__.py
@@ -5,6 +5,7 @@ from ._errors import (
     NoTranscriptFound,
     CouldNotRetrieveTranscript,
     VideoUnavailable,
+    TooManyRequests,
     NotTranslatable,
     TranslationLanguageNotAvailable,
     NoTranscriptAvailable,
diff --git a/youtube_transcript_api/_errors.py b/youtube_transcript_api/_errors.py
index 2f83a16..f7a5658 100644
--- a/youtube_transcript_api/_errors.py
+++ b/youtube_transcript_api/_errors.py
@@ -37,7 +37,10 @@ class CouldNotRetrieveTranscript(Exception):
 
 class VideoUnavailable(CouldNotRetrieveTranscript):
     CAUSE_MESSAGE = 'The video is no longer available'
-
+class TooManyRequests(CouldNotRetrieveTranscript):
+    CAUSE_MESSAGE = ('YouTube is receiving too many requests from this IP,'
+        ' and now requires that a captcha must be solved in order to continue.')
 
 class TranscriptsDisabled(CouldNotRetrieveTranscript):
     CAUSE_MESSAGE = 'Subtitles are disabled for this video'
diff --git a/youtube_transcript_api/_transcripts.py b/youtube_transcript_api/_transcripts.py
index 6b767ff..9400a1d 100644
--- a/youtube_transcript_api/_transcripts.py
+++ b/youtube_transcript_api/_transcripts.py
@@ -14,6 +14,7 @@ import re
 from ._html_unescaping import unescape
 from ._errors import (
     VideoUnavailable,
+    TooManyRequests,
     NoTranscriptFound,
     TranscriptsDisabled,
     NotTranslatable,
@@ -38,6 +39,8 @@ class TranscriptListFetcher():
         splitted_html = html.split('"captions":')
         if len(splitted_html) <= 1:
+            if 'class="g-recaptcha"' in html:
+                raise TooManyRequests(video_id)
             if '"playabilityStatus":' not in html:
                 raise VideoUnavailable(video_id)
diff --git a/youtube_transcript_api/test/assets/youtube_too_many_requests.html.static b/youtube_transcript_api/test/assets/youtube_too_many_requests.html.static
new file mode 100644
index 0000000..c63003f
--- /dev/null
+++ b/youtube_transcript_api/test/assets/youtube_too_many_requests.html.static
@@ -0,0 +1,239 @@
+YouTube
+Perdón por la interrupción. Hemos recibido un gran número de
+solicitudes de tu red.
+Para seguir disfrutando de YouTube, rellena el siguiente formulario.
+ES
diff --git a/youtube_transcript_api/test/test_api.py b/youtube_transcript_api/test/test_api.py
index 5f95451..daf98f8 100644
--- a/youtube_transcript_api/test/test_api.py
+++ b/youtube_transcript_api/test/test_api.py
@@ -12,6 +12,7 @@ from youtube_transcript_api import (
     TranscriptsDisabled,
     NoTranscriptFound,
     VideoUnavailable,
+    TooManyRequests,
     NoTranscriptAvailable,
     NotTranslatable,
     TranslationLanguageNotAvailable,
@@ -134,6 +135,16 @@ class TestYouTubeTranscriptApi(TestCase):
         with self.assertRaises(VideoUnavailable):
             YouTubeTranscriptApi.get_transcript('abc')
 
+    def test_get_transcript__exception_if_video_unavailable(self):
+        httpretty.register_uri(
+            httpretty.GET,
+            'https://www.youtube.com/watch',
+            body=load_asset('youtube_too_many_requests.html.static')
+        )
+
+        with self.assertRaises(TooManyRequests):
+            YouTubeTranscriptApi.get_transcript('abc')
+
     def test_get_transcript__exception_if_transcripts_disabled(self):
         httpretty.register_uri(
             httpretty.GET,

From fb819c06e4d9c54b7e372de2a8040951357e0fcd Mon Sep 17 00:00:00 2001
From: Your Name
Date: Thu, 21 Jan 2021 19:53:06 +0100
Subject: [PATCH 3/6] Fix test case name

---
 youtube_transcript_api/test/test_api.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/youtube_transcript_api/test/test_api.py b/youtube_transcript_api/test/test_api.py
index daf98f8..7650cf4 100644
--- a/youtube_transcript_api/test/test_api.py
+++ b/youtube_transcript_api/test/test_api.py
@@ -135,7 +135,7 @@ class TestYouTubeTranscriptApi(TestCase):
         with self.assertRaises(VideoUnavailable):
             YouTubeTranscriptApi.get_transcript('abc')
 
-    def test_get_transcript__exception_if_video_unavailable(self):
+    def test_get_transcript__exception_if_youtube_request_limit_reached(self):
         httpretty.register_uri(
             httpretty.GET,
             'https://www.youtube.com/watch',

From dbf5eeafe69f7b5e0c0eb437d0debe1dbcf75d6a Mon Sep 17 00:00:00 2001
From: Your Name
Date: Fri, 22 Jan 2021 14:18:56 +0100
Subject: [PATCH 4/6] Error message more descriptive

---
 youtube_transcript_api/_errors.py | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/youtube_transcript_api/_errors.py b/youtube_transcript_api/_errors.py
index f7a5658..c19a820 100644
--- a/youtube_transcript_api/_errors.py
+++ b/youtube_transcript_api/_errors.py
@@ -39,8 +39,11 @@ class VideoUnavailable(CouldNotRetrieveTranscript):
     CAUSE_MESSAGE = 'The video is no longer available'
 
 class TooManyRequests(CouldNotRetrieveTranscript):
-    CAUSE_MESSAGE = ('YouTube is receiving too many requests from this IP,'
-        ' and now requires that a captcha must be solved in order to continue.')
+    CAUSE_MESSAGE = ('YouTube is receiving too many requests from this IP, '
+        'and now requires that a captcha must be solved in order to continue. '
+        'You can solve the captcha in a browser and pass the generated cookie file to youtube-transcript-api, '
+        'or you can use a different IP, or maybe wait for the ban to be lifted.'
+        )
 
 class TranscriptsDisabled(CouldNotRetrieveTranscript):
     CAUSE_MESSAGE = 'Subtitles are disabled for this video'

From 23798f205de55a4a5b3b1c787495524d34e6aea2 Mon Sep 17 00:00:00 2001
From: Your Name
Date: Mon, 25 Jan 2021 17:36:27 +0100
Subject: [PATCH 5/6] improve message as per jdepoix suggestion

---
 youtube_transcript_api/_errors.py | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/youtube_transcript_api/_errors.py b/youtube_transcript_api/_errors.py
index c19a820..1b8360a 100644
--- a/youtube_transcript_api/_errors.py
+++ b/youtube_transcript_api/_errors.py
@@ -39,11 +39,10 @@ class VideoUnavailable(CouldNotRetrieveTranscript):
     CAUSE_MESSAGE = 'The video is no longer available'
 
 class TooManyRequests(CouldNotRetrieveTranscript):
-    CAUSE_MESSAGE = ('YouTube is receiving too many requests from this IP, '
-        'and now requires that a captcha must be solved in order to continue. '
-        'You can solve the captcha in a browser and pass the generated cookie file to youtube-transcript-api, '
-        'or you can use a different IP, or maybe wait for the ban to be lifted.'
-        )
+    CAUSE_MESSAGE = ("YouTube is receiving too many requests from this IP and now requires solving a captcha to continue. One of the following things can be done to work around this:\n\
+        - Manually solve the captcha in a browser and export the cookie. Read here how to use that cookie with youtube-transcript-api: https://github.com/jdepoix/youtube-transcript-api#cookies\n\
+        - Use a different IP address\n\
+        - Wait until the ban on your IP has been lifted")
 
 class TranscriptsDisabled(CouldNotRetrieveTranscript):
     CAUSE_MESSAGE = 'Subtitles are disabled for this video'

From cf0647f91f719e48be61ab56cc887ebe45387e7a Mon Sep 17 00:00:00 2001
From: jdepoix
Date: Sat, 30 Jan 2021 10:08:52 +0100
Subject: [PATCH 6/6] Update README.md

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 6c2735a..7328e4e 100644
--- a/README.md
+++ b/README.md
@@ -3,7 +3,7 @@
 [![Donate](https://img.shields.io/badge/Donate-PayPal-green.svg)](https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=BAENLEW8VUJ6G&source=url) [![Build Status](https://travis-ci.com/jdepoix/youtube-transcript-api.svg)](https://travis-ci.com/jdepoix/youtube-transcript-api) [![Coverage Status](https://coveralls.io/repos/github/jdepoix/youtube-transcript-api/badge.svg?branch=master)](https://coveralls.io/github/jdepoix/youtube-transcript-api?branch=master) [![MIT license](http://img.shields.io/badge/license-MIT-brightgreen.svg?style=flat)](http://opensource.org/licenses/MIT) [![image](https://img.shields.io/pypi/v/youtube-transcript-api.svg)](https://pypi.org/project/youtube-transcript-api/) [![image](https://img.shields.io/pypi/pyversions/youtube-transcript-api.svg)](https://pypi.org/project/youtube-transcript-api/)
 
-This is an python API which allows you to get the transcripts/subtitles for a given YouTube video. It also works for automatically generated subtitles, supports translating subtitles and it does not require a headless browser, like other selenium based solutions do!
+This is a python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles, supports translating subtitles and it does not require a headless browser, like other selenium based solutions do!
 
 ## Install