Python Web Crawler: List All URLs Under Domain – Efficient Code

Build an efficient Python web crawler that gathers and lists all the URLs under a domain, useful for website indexing and SEO.

To get all the URLs under a domain name, you can modify the previous example to perform a breadth-first search of the website, starting with the root URL and following all links within the domain. Here’s an example implementation:

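from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def get_domain(url):
    # Return the scheme and host of a URL, e.g. 'https://example.com/'
    parsed_uri = urlparse(url)
    domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    return domain

def get_links(url, domain):
    # Fetch a page and return the set of links on it that start with domain
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return set()  # skip pages that fail to load
    soup = BeautifulSoup(response.text, 'html.parser')
    links = set()
    for link in soup.find_all('a'):
        link_url = link.get('href')
        if link_url:
            # Resolve relative links and drop any '#fragment' so the
            # same page is not listed under several URLs
            absolute_link = urldefrag(urljoin(url, link_url)).url
            if absolute_link.startswith(domain):
                links.add(absolute_link)
    return links

if __name__ == '__main__':
    # Trailing slash so the root page is not listed a second time as '/'
    url = 'https://example.com/'
    domain = get_domain(url)
    queue = deque([url])  # deque gives O(1) pops from the front
    visited = set()

    while queue:
        url = queue.popleft()
        if url in visited:
            continue  # already processed via another link
        visited.add(url)
        print(url)
        for link in get_links(url, domain):
            if link not in visited:
                queue.append(link)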

In this example, the get_domain function returns the scheme and host of a given URL (for example, https://example.com/). The get_links function fetches a page, resolves each link against the page URL, and keeps only the links that fall within that domain.

The main program starts with a root URL and a queue of URLs to visit, initialized with the root URL. It then performs a breadth-first search: it takes the next URL from the front of the queue, extracts all links on that page, and adds any new URLs to the end of the queue for later processing. The visited set records URLs that have already been processed so that no page is crawled twice. Each URL is printed as it is processed.
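To see the traversal order without making any network requests, here is a minimal sketch that runs the same breadth-first search over a small in-memory site. The FAKE_SITE dictionary, the crawl function, and the same_site helper are hypothetical names introduced for illustration. Comparing hosts with urlparse rather than a string prefix, and stripping fragments with urldefrag, are variations assumed here, not necessarily part of the example above.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

# Hypothetical site: maps each page URL to the raw href values found on it.
FAKE_SITE = {
    'https://example.com/': ['/about', '/blog/', 'https://other.example.org/'],
    'https://example.com/about': ['/', '/blog/#latest'],
    'https://example.com/blog/': ['post-1', '/about'],
    'https://example.com/blog/post-1': ['../'],
}

def same_site(url, root):
    # Compare hosts rather than string prefixes (an assumed variation).
    return urlparse(url).netloc == urlparse(root).netloc

def crawl(root, fetch_hrefs):
    # Breadth-first search from root.
    # fetch_hrefs(url) must return the href values found on that page.
    queue = deque([root])
    visited = set()
    order = []  # pages in the order they were processed
    while queue:
        url = queue.popleft()
        if url in visited:
            continue  # reached earlier through another link
        visited.add(url)
        order.append(url)
        for href in fetch_hrefs(url):
            # Resolve relative hrefs and drop any '#fragment'.
            link = urldefrag(urljoin(url, href)).url
            if same_site(link, root) and link not in visited:
                queue.append(link)
    return order

if __name__ == '__main__':
    for page in crawl('https://example.com/', lambda u: FAKE_SITE.get(u, [])):
        print(page)
```

Because crawl takes the link-fetching step as an argument, the same loop could be pointed at a real fetcher built on requests and BeautifulSoup. With the fake site above, the root is processed first, then the pages one link away, then the blog post two links away, and the external link is never followed.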

