Python Web Crawler: List All URLs Under Domain

Build a Python web crawler that gathers and lists all the URLs under a domain, for website indexing and SEO.

To get all the URLs under a domain name, you can modify the previous example to perform a breadth-first search of the website, starting with the root URL and following all links within the domain. Here’s an example implementation:

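from collections import deque

import requests
from urllib.parse import urlparse, urljoin, urldefrag
from bs4 import BeautifulSoup

def get_domain(url):
    # Return the scheme and host with a trailing slash, e.g. 'https://example.com/'
    parsed_uri = urlparse(url)
    domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
    return domain

def get_links(url, domain):
    # Fetch the page; treat network errors and HTTP error statuses as "no links"
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return set()
    soup = BeautifulSoup(response.text, 'html.parser')
    links = set()
    for link in soup.find_all('a'):
        link_url = link.get('href')
        if link_url:
            # Resolve relative links and drop the #fragment so each page is listed once
            absolute_link = urldefrag(urljoin(url, link_url)).url
            if absolute_link.startswith(domain):
                links.add(absolute_link)
    return links

if __name__ == '__main__':
    # Trailing slash so the root URL matches links that point back to it
    url = 'https://example.com/'
    domain = get_domain(url)
    queue = deque([url])
    visited = set()

    while queue:
        # popleft() takes the oldest URL first, which gives breadth-first order
        url = queue.popleft()
        visited.add(url)
        print(url)
        links = get_links(url, domain)
        for link in links:
            if link not in visited and link not in queue:
                queue.append(link)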

In this example, the get_domain function builds the site's base URL (scheme and host, such as https://example.com/) from a given URL. The get_links function resolves every link on a page to an absolute URL and keeps only those that start with that base URL, so links to other sites are ignored. The main program starts with a queue that contains only the root URL. It then performs a breadth-first search: it takes the URL at the front of the queue, prints it, extracts the links on that page, and appends any new ones to the end of the queue for later processing. The visited set records URLs that have already been processed so that no page is fetched twice. The crawl ends when the queue is empty, which means every page reachable by links from the root URL has been listed.
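The breadth-first logic described above can also be separated from the network code, which makes it easier to test. The following is a minimal sketch, not a definitive implementation: the names crawl, same_host, fetch_links, max_pages, and the in-memory SITE dictionary are all assumptions introduced for illustration. It keeps a set of every URL ever queued, so duplicate checks stay fast on large sites, and it stops after max_pages pages.

```python
from collections import deque
from urllib.parse import urldefrag, urlparse


def same_host(url, root):
    """Return True when url has the same scheme and host as root."""
    a, b = urlparse(url), urlparse(root)
    return (a.scheme, a.netloc) == (b.scheme, b.netloc)


def crawl(start_url, fetch_links, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were visited.

    fetch_links is any callable that maps a URL to an iterable of
    absolute URLs found on that page (hypothetical interface).
    """
    queue = deque([start_url])
    seen = {start_url}  # every URL ever queued, for O(1) duplicate checks
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        # Sorting makes the visiting order deterministic
        for link in sorted(fetch_links(url)):
            link = urldefrag(link).url  # drop any #fragment
            if same_host(link, start_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical in-memory site standing in for real HTTP requests
SITE = {
    'https://example.com/': ['https://example.com/a',
                             'https://example.com/b',
                             'https://other.org/'],
    'https://example.com/a': ['https://example.com/b',
                              'https://example.com/c#top'],
    'https://example.com/b': ['https://example.com/'],
    'https://example.com/c': [],
}

pages = crawl('https://example.com/', lambda u: SITE.get(u, []))
print(pages)
```

For a real site, a function that downloads a page and returns its links, such as get_links from the example above, could be passed as fetch_links in place of the dictionary lookup.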
