testsAndMisc/python_pkg/scrape_website/scrape_comics.py

"""Download comic images from a website using Selenium."""

import argparse
import logging
from pathlib import Path
from urllib.parse import urlparse

import requests
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

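_logger = logging.getLogger(__name__)
# The root logger defaults to WARNING, so INFO messages need explicit configuration
logging.basicConfig(level=logging.INFO)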

REQUEST_TIMEOUT = 30  # seconds

# Initialize argument parser to accept the website URL as an argument
parser = argparse.ArgumentParser(description="Download images from a comic website.")
parser.add_argument(
    "url", type=str, help="The URL of the website to start downloading images from"
)
args = parser.parse_args()

# Initialize WebDriver (Use the appropriate driver for your browser)
driver = webdriver.Chrome()

# Open the website from the passed argument
url = args.url
_logger.info("Opening the website: %s", url)
driver.get(url)


# A function to download images by URL
def download_image(url: str) -> bool:
    """Download an image from a URL and save it locally."""
    # Extract image name from URL
    image_name = Path(urlparse(url).path).name
    image_path = Path(image_name)

    # Check if the image already exists
    if image_path.exists():
        _logger.info("Image %s already exists, skipping download.", image_name)
        return False
    _logger.info("Downloading image from URL: %s", url)
    response = requests.get(url, timeout=REQUEST_TIMEOUT)
    # Fail on HTTP errors instead of saving an error page as an image
    response.raise_for_status()
    with open(image_path, "wb") as handler:
        handler.write(response.content)
    _logger.info("Image %s downloaded successfully", image_name)
    return True


# Counter used in log messages; the loop below runs until no 'Next' link is found
count = 1

while True:
    _logger.info("Processing image %s...", count)

    # Find the image element by its ID
    image_element = driver.find_element(By.ID, "cc-comic")

    # Get the image URL from the 'src' attribute
    image_url = image_element.get_attribute("src")
    _logger.info("Found image URL: %s", image_url)

    # Download the image if it doesn't already exist
    if download_image(image_url):
        count += 1  # Increment count only if the image was downloaded

    # Look for the 'Next' link by its class
    try:
        next_button = driver.find_element(By.CSS_SELECTOR, "a.cc-next")

        # Navigate to the URL in the 'href' of the 'Next' link
        next_button_url = next_button.get_attribute("href")
        _logger.info("Loading the next page: %s", next_button_url)
        driver.get(next_button_url)

    except NoSuchElementException:
        # If the 'Next' button is not found, it means we've reached the last image
        _logger.info("No 'Next' button found. Reached the end of images.")
        break

# Close the browser
_logger.info("All images processed, closing the browser.")
driver.quit()