Crack the Code: Bypassing Google reCAPTCHA in automated testing

Learn how to easily bypass Google reCAPTCHA using Selenium and Google Speech Recognition.


reCAPTCHA is a widely used CAPTCHA system designed to distinguish between human users and automated bots. It is commonly employed to prevent spam and abuse on websites, as well as to protect sensitive data. Google's reCAPTCHA service offers various types of challenges, including image recognition and audio challenges, to verify user authenticity.

The primary purpose of reCAPTCHA is to enhance the security of online platforms by preventing automated scripts and bots from accessing certain functionalities or submitting forms. By presenting challenges that are easy for humans to solve but difficult for bots to bypass, reCAPTCHA helps ensure the integrity of online interactions and data.

For developers and testers involved in automated testing of web applications, reCAPTCHA presents a unique challenge. Traditional automated testing frameworks may struggle to bypass reCAPTCHA challenges, leading to limitations in testing scenarios and increased manual intervention.

In this tutorial, we'll explore a solution for bypassing Google reCAPTCHA challenges in automated testing scenarios. We'll use Selenium, a popular web automation tool, along with the Google Speech Recognition library to handle audio challenges presented by reCAPTCHA.

Before you start, download the ChromeDriver build that matches the Chrome version installed on your machine. You can find the list of available versions here. I recommend using the stable version.

Once downloaded, place it in the folder from which you will run the code.

You can add these dependencies in your requirements.txt:

requirements.txt
pydub==0.24.1
requests
selenium==3.141.0
SpeechRecognition==3.8.1
urllib3==1.26.2

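(Note: The other packages should also work with their latest versions, but keep Selenium pinned unless you adapt the code: Selenium 4 deprecated and later removed the executable_path argument used below.)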

and install them with:

bash
pip3 install -r requirements.txt
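
To confirm which versions actually got installed, a small check like the one below can help. It is only a sketch: installed_versions is a hypothetical helper name, and it relies on importlib.metadata from the standard library (Python 3.8 or newer).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, List, Optional


def installed_versions(packages: List[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in the current environment
            found[name] = None
    return found


if __name__ == "__main__":
    wanted = ["pydub", "selenium", "SpeechRecognition", "urllib3"]
    for name, ver in installed_versions(wanted).items():
        print(f"{name}: {ver or 'not installed'}")
```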

GoogleReCAPTCHABypass.py
#!/usr/bin/env python3
 
import speech_recognition
from typing import Optional
from time import sleep
from os import remove
from pydub import AudioSegment
from selenium import webdriver
from requests import get
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.common.exceptions import NoSuchElementException
 
 
class GoogleReCAPTCHABypass:
    # Browser driver
    browser = None
    # Link to audio file that needs to be solved
    audio_link = ""
    # Challenge text that will be generated from audio
    challenge_text = ""
    # Language to detect (BCP-47 tag, e.g. "en-US")
    lang = "en-US"
 
    def __init__(
        self,
        host: Optional[str] = "",
        language: Optional[str] = "en-US",
    ) -> None:
        """Initializes an instance of the GoogleReCAPTCHABypass class with the specified options"""
        self.set_host(host)
        self.lang = language
        self.start_browser()

Explanation:

Imports: The necessary libraries are imported, including speech_recognition, webdriver from Selenium, get from Requests, and others.

Class Definition: The GoogleReCAPTCHABypass class is defined, which will encapsulate the functionality for bypassing Google reCAPTCHA.

Class Attributes: It defines several class attributes (browser, audio_link, challenge_text, and lang) for storing the browser instance, the link to the audio challenge, the text recognized from the audio, and the recognition language.

Constructor: The __init__ method initializes an instance of the GoogleReCAPTCHABypass class with the specified options such as host URL and language settings.

GoogleReCAPTCHABypass.py
    # Set host
    def set_host(self, host: Optional[str]) -> None:
        """Defines the web-host to bypass the captcha"""
        if host != "":
            self.host = host
        else:
            # Use captcha as demo
            self.host = "https://www.google.com/recaptcha/api2/demo"

Explanation:

By default, the browser navigates to the reCAPTCHA demo website. You can provide your own page URL when creating the class instance.

GoogleReCAPTCHABypass.py
    # Create browser instance
    def start_browser(self) -> None:
        """Start a browser instance"""
        # Run browsers binary
        self.browser = webdriver.Chrome(
            executable_path="./chromedriver",
        )
        # Access host
        self.browser.get(self.host)

Explanation:

As mentioned at the beginning, executable_path must point to wherever you stored the downloaded chromedriver. If it sits in the folder from which you run the code, you can leave the value as it is.
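
Newer Selenium 4 releases no longer accept executable_path; the driver path is passed through a Service object instead. The following is a minimal sketch of that variant, assuming Selenium 4 is installed. build_chrome is a hypothetical helper and is not part of the class in this tutorial.

```python
def build_chrome(driver_path: str = "./chromedriver"):
    """Create a Chrome driver the Selenium 4 way (Service instead of executable_path)."""
    # Imported lazily so this snippet can be loaded even where Selenium is absent
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    # Assumption: driver_path points at a chromedriver matching your Chrome version
    return webdriver.Chrome(service=Service(executable_path=driver_path))
```

Inside start_browser(), the result of such a helper would be assigned to self.browser in place of the webdriver.Chrome(executable_path=...) call.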

Let's define some helper methods which we can use for fetching the page elements.

GoogleReCAPTCHABypass.py
    # Get all iframes
    def get_iframes(self, src: Optional[str] = "recaptcha/enterprise/") -> list:
        """Gets all iframes on the page whose src is an https URL

        Note: the src argument is currently unused; frames are not filtered by it.
        """
        # Find all iframes
        iframes = self.browser.find_elements(By.TAG_NAME, "iframe")
        frames = []
 
        # Loop through iframes and keep the ones with an https src
        for iframe in iframes:
            frame_src = iframe.get_attribute("src")
            # src can be missing (None) or empty, so guard before the substring test
            if frame_src and "https" in frame_src:
                frames.append(iframe)
        return frames

Explanation:

reCAPTCHA is usually embedded as an iframe, so we specifically target iframes whose src (URL) contains https.
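
A stricter filter could match on part of the frame URL instead of just https. The sketch below works on plain src strings so it runs without a browser; filter_captcha_srcs and the "recaptcha/" needle are assumptions made for illustration, not something the class above does.

```python
from typing import Iterable, List, Optional


def filter_captcha_srcs(
    srcs: Iterable[Optional[str]], needle: str = "recaptcha/"
) -> List[str]:
    """Keep only the src values that contain the given needle."""
    # A src value may be None or an empty string, so skip those first
    return [src for src in srcs if src and needle in src]


if __name__ == "__main__":
    sample = [
        "https://www.google.com/recaptcha/api2/anchor?k=demo",
        None,
        "https://example.com/ads/frame.html",
    ]
    print(filter_captcha_srcs(sample))
```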

GoogleReCAPTCHABypass.py
    # Wait for iframes to be loaded
    def wait_for_iframes(self, timeout: int = 30) -> None:
        """Waits until iframes are loaded on page"""
        wait = WebDriverWait(self.browser, timeout=timeout)
        wait.until(
            expected_conditions.frame_to_be_available_and_switch_to_it(
                (By.TAG_NAME, "iframe")
            )
        )

Explanation:

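Waits until an iframe is loaded and attached to the page. Note that this expected condition also switches the driver into the first matching iframe, which is why update_frame() switches back to the default content afterwards.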
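
Explicit waits boil down to polling a condition until it returns something truthy or a timeout expires. The following is a simplified, browser-free analog of that idea, not Selenium's actual implementation; wait_until and its parameters are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def wait_until(condition: Callable[[], T], timeout: float = 5.0, poll: float = 0.1) -> T:
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # Pause briefly before checking again
        time.sleep(poll)
```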

GoogleReCAPTCHABypass.py
    # Update iframe
    def update_frame(self, frame_index: int = 0) -> None:
        """Changes context to iframe specified by index"""
        self.browser.switch_to.default_content()
        if frame_index == -1:
            # Switch to default context
            return
        frames = self.get_iframes()
        frame = frames[frame_index]
        # Scroll to frame
        self.browser.execute_script("arguments[0].scrollIntoView();", frame)
        sleep(1)
        # Wait for iframe to be loaded
        self.wait_for_iframes()
        self.browser.switch_to.default_content()
        self.browser.switch_to.frame(frame)
        sleep(1)

Explanation:

This method will be used to focus on the opened reCAPTCHA modal. sleep() functions are used to slow down the process, so you can visually see what happens.

GoogleReCAPTCHABypass.py
    # Wait for buttons to be visible
    def wait_for_button(self, name: str, timeout: int = 30) -> None:
        """Waits until the element with the given ID is visible"""
        wait = WebDriverWait(self.browser, timeout=timeout)
        wait.until(expected_conditions.visibility_of_element_located((By.ID, name)))

Explanation:

General method that waits for the element with the given ID to become visible. It is used for the different buttons within the reCAPTCHA iframes.

GoogleReCAPTCHABypass.py
    # Getting the link of the audio data
    def get_audio_link(self) -> None:
        """Access captcha and tries to extract audio_link"""
        # Switch to first frame
        self.update_frame(0)
 
        # Open captcha
        self.wait_for_button(name="recaptcha-anchor")
        self.browser.find_element(By.ID, "recaptcha-anchor").click()
        sleep(1)
 
        # Switch to captcha frame
        self.update_frame(1)
        # Click audio challenge button
        self.wait_for_button(name="recaptcha-audio-button")
        self.browser.find_element(By.ID, "recaptcha-audio-button").click()
        # Reload frame
        self.update_frame(1)
 
        err = 0
        while self.audio_link == "":
            try:
                if err > 3:
                    return
                self.audio_link = self.browser.find_element(
                    By.ID, "audio-source"
                ).get_attribute("src")
                print("[+] Found audio link: ", self.audio_link)
            except NoSuchElementException:
                sleep(1)
                err += 1  # Count the failure; the loop gives up once err exceeds 3
 
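    def get_second_audio_link(self) -> None:
        """Extracts the audio_link of the new challenge after a rejected answer"""
        self.update_frame(1)
 
        # Forget the previous link; otherwise the loop below is skipped
        # and the old audio would be downloaded again
        self.audio_link = ""
        err = 0
        while self.audio_link == "":
            try:
                if err > 3:
                    return
                self.audio_link = self.browser.find_element(
                    By.ID, "audio-source"
                ).get_attribute("src")
                print("[+] Found audio link: ", self.audio_link)
            except NoSuchElementException:
                sleep(1)
                err += 1  # Count the failure; the loop gives up once err exceeds 3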

Explanation:

get_audio_link(): Method that will click on the reCAPTCHA, then on the audio button so we can fetch the audio link.

get_second_audio_link(): Runs when the first submitted answer is rejected, in order to fetch the audio link of the next challenge.
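
Both methods repeat the same try/sleep/count loop. One way to express that pattern once is a small bounded-retry helper. This is a sketch under assumptions: retry and its parameters are hypothetical names, and the class above does not use it.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def retry(
    action: Callable[[], T],
    attempts: int = 3,
    delay: float = 1.0,
    exceptions: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Run action() up to `attempts` times, sleeping `delay` seconds between failures."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[BaseException] = None
    for attempt in range(attempts):
        try:
            return action()
        except exceptions as error:
            last_error = error
            # Only sleep if another attempt is still coming
            if attempt < attempts - 1:
                time.sleep(delay)
    # All attempts failed: surface the last error to the caller
    assert last_error is not None
    raise last_error
```

In the class, the action would be the lookup of the audio-source element's src attribute, and exceptions would be (NoSuchElementException,).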

GoogleReCAPTCHABypass.py
    # Download and convert audio
    def download_convert(self) -> None:
        """Downloads and converts audio.mp3 file to wav"""
        # Download audio
        with open("audio.mp3", "wb") as f:
            f.write(get(self.audio_link).content)
 
        # Convert audio
        AudioSegment.from_mp3("audio.mp3").export("audio.wav", format="wav")
        remove("audio.mp3")  # Cleanup
 
    # Use speech-recognition to get text
    def get_text(self) -> None:
        """Uses speech-recognition to get audio-challenge response from audio file"""
        # Download and convert audio
        self.download_convert()
 
        recognizer = speech_recognition.Recognizer()
        # Get audio
        audio_file = speech_recognition.AudioFile("audio.wav")
        with audio_file as source:
            # Get audio-data
            audio = recognizer.record(source)
 
        # Recognize text
        try:
            self.challenge_text = recognizer.recognize_google(audio, language=self.lang)
        except speech_recognition.UnknownValueError:
            self.challenge_text = ""
        except Exception as ex:
            print(f"[!] Error occurred on text-recognition: {ex}")
 
        # Cleanup audio file
        remove("audio.wav")

Explanation:

download_convert(): Method that downloads the audio and converts it to WAV format. Note that pydub relies on ffmpeg being installed to decode MP3 files.

get_text(): Method that converts downloaded audio to text using the Google Speech Recognition library.
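
The methods above write audio.mp3 and audio.wav into the current folder, which can collide when several tests run in parallel and leaves files behind if an exception interrupts the cleanup. A possible alternative, sketched here with the standard library only, is to work inside a temporary directory. audio_workspace is a hypothetical helper name.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator, Tuple


@contextmanager
def audio_workspace() -> Iterator[Tuple[str, str]]:
    """Yield (mp3_path, wav_path) inside a temp directory that is removed on exit."""
    with tempfile.TemporaryDirectory(prefix="captcha_audio_") as folder:
        yield os.path.join(folder, "audio.mp3"), os.path.join(folder, "audio.wav")


if __name__ == "__main__":
    with audio_workspace() as (mp3_path, wav_path):
        # Placeholder bytes stand in for the downloaded audio in this demo
        with open(mp3_path, "wb") as f:
            f.write(b"placeholder bytes, not real audio")
        print("inside:", os.path.exists(mp3_path))
    print("after:", os.path.exists(mp3_path))
```

The directory and everything in it are deleted when the with block exits, even if an exception is raised inside it, so the explicit remove() calls would no longer be needed.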

GoogleReCAPTCHABypass.py
    # Submit found text
    def submit_text(self) -> None:
        """Submits audio-challenge response"""
        # Submit input
        inputField = self.browser.find_element(By.ID, "audio-response")
        inputField.send_keys(self.challenge_text.lower())
        inputField.send_keys(Keys.ENTER)
        # After submitting the text, check if the error message element appears
        try:
            sleep(3)
            error_message = self.browser.find_element(
                By.CLASS_NAME, "rc-audiochallenge-error-message"
            )
            if error_message.is_displayed():
                # Perform the actions to download the audio file again
                self.get_second_audio_link()
                self.get_text()
                self.submit_text()
        except NoSuchElementException:
            # Continue with the rest of your script if the error message is not found
            pass

Explanation:

Submits the recognized text into the form. If an error message is displayed, it fetches the next available audio link and tries again.

GoogleReCAPTCHABypass.py
    # Check if reCAPTCHA was passed
    def check_reCAPTCHA(self) -> bool:
        """Checks if reCAPTCHA was passed"""
        try:
            # Switch back to input view
            self.update_frame(0)
 
            # Check if checkMark is set
            checkMark = self.browser.find_element(By.ID, "recaptcha-anchor")
            if "recaptcha-checkbox-checked" in checkMark.get_attribute("class"):
                return True
        except Exception:
            return False
        return False
 
    # Solve the reCAPTCHA
    def solve_reCAPTCHA(self) -> bool:
        """Searches for reCAPTCHA on web-page and tries solve it"""
        # Wait till all iframes are loaded
        self.wait_for_iframes()
        sleep(1)
 
        while self.audio_link == "":
            # Refresh the page
            self.browser.refresh()
            # Retry
            sleep(3)
            self.get_audio_link()
 
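        # After getting link, get captcha text
        self.get_text()
        while self.challenge_text == "":
            # Refresh the page and fetch a new audio link,
            # because the previous audio could not be transcribed
            self.browser.refresh()
            sleep(3)
            self.audio_link = ""
            self.get_audio_link()
            # Retry only if a new link was found
            if self.audio_link != "":
                self.get_text()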
 
        # After getting captcha audio-challenge, submit the text
        self.submit_text()
 
        # Wait for captcha to be checked
        sleep(1)
 
        # quit() closes the window and also ends the chromedriver process
        if self.check_reCAPTCHA():
            print("[+] Captcha bypassed!")
            self.browser.quit()
            return True
        else:
            print("[-] Could not bypass!")
            self.browser.quit()
            return False

Explanation:

check_reCAPTCHA(): Checks whether the reCAPTCHA checkbox is marked as checked, which means the challenge was passed.

solve_reCAPTCHA(): Tries to solve the reCAPTCHA using all of the logic that we implemented.
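
check_reCAPTCHA() uses a substring test on the class attribute, which would also match a longer class name that merely contains the same text. A slightly stricter variant compares whole class tokens. The helper below is a sketch; has_css_class is a hypothetical name and the sample strings in the usage are illustrative only.

```python
from typing import Optional


def has_css_class(class_attribute: Optional[str], name: str) -> bool:
    """Return True if `name` is one of the whitespace-separated classes."""
    # A missing class attribute is treated as an empty class list
    return name in (class_attribute or "").split()


if __name__ == "__main__":
    classes = "recaptcha-checkbox recaptcha-checkbox-checked"
    print(has_css_class(classes, "recaptcha-checkbox-checked"))
```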

Now, let's create a new Python script which we will use to initialize the object so we can run the test:

google_recaptcha_test.py
#!/usr/bin/env python3
 
from GoogleReCAPTCHABypass import GoogleReCAPTCHABypass
 
 
if __name__ == "__main__":
    # Initialize the bypass object with default options (demo host, default language)
    bypass = GoogleReCAPTCHABypass()
 
    if not bypass.solve_reCAPTCHA():
        # If unsuccessful, exit with a non-zero status so the test run fails
        raise SystemExit(1)

Run the script:

bash
python3 google_recaptcha_test.py

To make the solution more robust for automated testing, consider running the tests through a VPN. Changing the worker/runner IP address before each run reduces the chance that reCAPTCHA blocks your test for sending repeated requests from the same IP address.

By combining Selenium for web automation and Google Speech Recognition for audio challenge handling, we can effectively bypass Google reCAPTCHA challenges in automated testing scenarios. This solution enables developers and testers to streamline their testing processes and ensure comprehensive test coverage even in the presence of reCAPTCHA challenges.
