How to Handle Pagination in Web Scraping

A practical guide to handling page links, next buttons, infinite scroll, and cursor-based pagination in web scraping.

Pagination is where many otherwise solid scrapers become fragile. A site may look simple, but the path from page one to page two can involve hidden API calls, cursor tokens, JavaScript events, or an endless feed that never exposes a traditional next link. This guide gives you a practical framework for handling pagination in web scraping across the patterns you are most likely to meet: numbered pages, next buttons, infinite scroll, load-more interfaces, and cursor-based APIs. The goal is not just to get one scraper working once, but to build pagination logic that remains understandable, debuggable, and easy to update when the target changes.

Overview

If you want to know how to scrape paginated websites reliably, start by treating pagination as a discovery problem before it becomes a coding problem. The visible page controls are only one layer. What matters is the mechanism the site actually uses to fetch the next batch of records.

In practice, pagination usually falls into five broad patterns:

Page-number URLs, such as ?page=2 or /page/3/
Next-button navigation, where a link or button leads to the next result set
Load more buttons, where the same page fetches additional items asynchronously
Infinite scroll, where scrolling triggers network requests for more data
Cursor or token pagination, common in APIs and modern web apps, where the next request depends on a cursor value returned by the previous one

The mistake many developers make is choosing tooling before identifying the pagination pattern. If the site exposes clean request parameters, a lightweight HTTP client may be enough. If the site relies on browser events and dynamic rendering, you may need Playwright or another browser automation tool. For a broader setup checklist, it helps to review Web Scraping Tech Stack Checklist for New Projects.

A reliable approach has three goals:

Discover the true pagination mechanism
Extract records without duplication or silent gaps
Stop safely when no more results exist

Everything else in this article builds on those three points.

Core framework

Here is a repeatable framework you can apply whether you are scraping a simple blog archive or a JavaScript-heavy catalog.

1. Inspect before you automate

Open browser devtools and answer these questions first:

Does the URL change when moving to the next page?
Is there a standard anchor tag with an href for the next page?
Does clicking next trigger an XHR or fetch request?
Does the response contain HTML, JSON, or some cursor token?
Is there a visible end condition, such as a disabled next button or an empty response?

This step often reveals that a page that looks dynamic is backed by a simple JSON endpoint. If so, scraping the endpoint directly is usually cleaner than driving the interface.
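
Once devtools shows a JSON response, a quick way to answer the questions above is to summarize what the body exposes. The following is a minimal sketch, not a general detector: the key names in CURSOR_KEYS, NEXT_URL_KEYS, and ITEM_KEYS are illustrative guesses, and a real API will use its own naming.

```python
from typing import Any

# Key names below are assumptions for illustration; adjust them per target.
CURSOR_KEYS = ("next_cursor", "cursor", "next_token", "after")
NEXT_URL_KEYS = ("next", "next_url", "next_page")
ITEM_KEYS = ("items", "results", "data")


def describe_json_page(body: dict[str, Any]) -> dict[str, Any]:
    """Summarize which pagination hints a decoded JSON response exposes."""
    hints: dict[str, Any] = {
        "item_key": None,
        "item_count": 0,
        "cursor_key": None,
        "next_url_key": None,
    }
    # First list-valued key that looks like the record container.
    for key in ITEM_KEYS:
        if isinstance(body.get(key), list):
            hints["item_key"] = key
            hints["item_count"] = len(body[key])
            break
    # First non-empty key that looks like a cursor token.
    for key in CURSOR_KEYS:
        if body.get(key):
            hints["cursor_key"] = key
            break
    # First non-empty string key that looks like a next-page URL.
    for key in NEXT_URL_KEYS:
        if isinstance(body.get(key), str) and body[key]:
            hints["next_url_key"] = key
            break
    return hints
```

Running this on one saved response from the network panel is usually enough to decide whether the target is cursor-based, next-URL-based, or neither.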

2. Prefer stable request patterns over UI simulation

If page two is available at a predictable URL or API endpoint, use that. Browser automation is valuable, but it introduces more moving parts: rendering delays, selector drift, and state synchronization issues. A direct request flow is usually faster and easier to debug.
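
When the page is controlled by a query parameter, the next URL can be built with the standard library instead of string formatting. This sketch assumes the parameter is called page, which is only a default; the helper name with_page is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url: str, page: int, param: str = "page") -> str:
    """Return url with the page parameter set, keeping other query params."""
    parts = urlsplit(url)
    # Drop any existing value of the page parameter, keep everything else.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, with_page("https://example.com/products?sort=price&page=1", 2) keeps the sort parameter and replaces only the page value.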

This is especially relevant when deciding between parser-first and browser-first tools. If you are working in Python and want a broader comparison, Scrapy vs Beautiful Soup: Which Python Scraper Should You Use? is a useful companion read. If the site truly needs a browser, see Playwright vs Puppeteer for Web Scraping: Features, Tradeoffs, and Use Cases.

3. Define a clear pagination state

Your scraper should explicitly track the state needed to continue. Depending on the target, that may be:

A page number
A next URL
An offset and limit pair
A cursor token
A timestamp or last-seen item ID

Do not bury this state implicitly inside a loop without logging it. When a scraper fails on page 47 or after a cursor refresh, visible state makes the problem diagnosable.
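
One way to keep that state visible is a small dataclass that is updated once per request and can describe itself in a log line. The class name and fields below are illustrative assumptions; most targets need only one or two of them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PaginationState:
    """Everything needed to resume or debug a crawl; fields are illustrative."""

    page: int = 1
    cursor: Optional[str] = None
    next_url: Optional[str] = None
    items_seen: int = 0

    def advance(
        self,
        item_count: int,
        cursor: Optional[str] = None,
        next_url: Optional[str] = None,
    ) -> None:
        """Record one fetched page and the values that locate the next one."""
        self.page += 1
        self.items_seen += item_count
        self.cursor = cursor
        self.next_url = next_url

    def describe(self) -> str:
        """Return a compact, log-friendly summary of the current state."""
        return (
            f"page={self.page} cursor={self.cursor!r} "
            f"next_url={self.next_url!r} items_seen={self.items_seen}"
        )
```

Logging state.describe() before each request means a failure report already contains the position at which the crawl broke.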

4. Separate item extraction from pagination logic

Keep two concerns distinct:

Item parsing: how you extract product cards, listings, or article links from one response
Pagination control: how you decide what request to make next

This separation makes your code easier to update when the site changes one part but not the other.
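
The separation can be sketched as a generic loop that receives parsing and navigation as separate callables. The function name crawl and its parameters are hypothetical; fetch could wrap requests.get, a Playwright call, or a test stub.

```python
from typing import Any, Callable, Iterator, Optional


def crawl(
    start: Any,
    fetch: Callable[[Any], Any],
    parse_items: Callable[[Any], list],
    next_request: Callable[[Any, Any], Optional[Any]],
    max_requests: int = 100,
) -> Iterator[Any]:
    """Yield items page by page; parsing and navigation are supplied separately."""
    request = start
    for _ in range(max_requests):  # defensive cap against endless loops
        response = fetch(request)
        yield from parse_items(response)
        # Navigation decides the next request, or None when the crawl is done.
        request = next_request(request, response)
        if request is None:
            break
```

If the site later moves from page numbers to cursors, only the next_request callable has to change; parse_items stays as it is.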

5. Add stop conditions early

Many scraping loops are written as open-ended while loops and only later patched with stop logic. It is better to define stop conditions from the start. Common examples include:

No next URL found
Returned items list is empty
Cursor token is missing or unchanged
Next button is disabled or absent
Known maximum page count reached

Also add a defensive page or request cap during development so you do not accidentally create an endless loop.
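
The stop conditions above can be collected into one function that returns a reason rather than a bare boolean, which makes the final log line self-explanatory. This is a sketch under assumptions: the function name, parameter names, and the default cap of 200 are all illustrative.

```python
from typing import Optional


def stop_reason(
    item_count: int,
    next_token: Optional[str],
    previous_token: Optional[str],
    requests_made: int,
    max_requests: int = 200,
) -> Optional[str]:
    """Return why the crawl should stop, or None to continue."""
    if item_count == 0:
        return "empty page"
    if not next_token:
        return "no next token or URL"
    if next_token == previous_token:
        return "next token unchanged"
    if requests_made >= max_requests:
        return "request cap reached"
    return None
```

Here next_token stands for whatever locates the next page: a cursor, a next URL, or a page number converted to a string.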

6. Deduplicate across pages

Pagination bugs often hide as duplicates rather than obvious failures. A site may repeat the last item of page one at the start of page two, or an infinite scroll feed may re-render already loaded items. Track a stable unique key such as item URL, ID, or hash of meaningful fields.
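
A minimal sketch of such a key follows. It prefers an explicit id or url field and falls back to a hash of selected fields; the field names (id, url, title, price) are assumptions and should be replaced with whatever is stable on the target.

```python
import hashlib
import json
from typing import Any, Iterable, Iterator


def item_key(item: dict[str, Any], fields: Iterable[str] = ("title", "price")) -> str:
    """Prefer an explicit ID or URL; fall back to a hash of chosen fields."""
    for name in ("id", "url"):
        if item.get(name) not in (None, ""):
            return f"{name}:{item[name]}"
    # sort_keys makes the hash independent of field order in the item.
    payload = json.dumps({f: item.get(f) for f in fields}, sort_keys=True, default=str)
    return "hash:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()


def unique(items: Iterable[dict[str, Any]], seen: set[str]) -> Iterator[dict[str, Any]]:
    """Yield only items whose key has not been seen; updates seen in place."""
    for item in items:
        key = item_key(item)
        if key not in seen:
            seen.add(key)
            yield item
```

Sharing one seen set across all pages is what catches an item repeated at a page boundary.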

7. Log enough context to debug failures

Useful fields to log for each pagination request include:

Current page number or cursor
Request URL
Response status code
Number of items extracted
Whether a next token or next URL was found

Without these basics, many pagination failures become guesswork.
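
As a sketch, the fields listed above fit in one log line per request. The logger name and the log_page helper are hypothetical, and the choice to log errors and empty pages at WARNING is an assumption about what deserves attention.

```python
import logging

logger = logging.getLogger("pagination")


def log_page(position: str, url: str, status: int, item_count: int, has_next: bool) -> str:
    """Log one line per request with the pagination context and return it."""
    line = (
        f"position={position} url={url} status={status} "
        f"items={item_count} has_next={has_next}"
    )
    # Error statuses and empty pages are logged at WARNING so they stand out.
    level = logging.WARNING if status >= 400 or item_count == 0 else logging.INFO
    logger.log(level, line)
    return line
```

The position argument is free-form on purpose, so the same helper works for page numbers ("page=3") and cursors ("cursor=abc").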

Practical examples

This section shows common pagination patterns and code snippets you can adapt. The examples are intentionally simple and focus on the control flow rather than site-specific selectors.

Page-number URLs

This is the most straightforward pattern. The next page is derived from a numeric parameter.

import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/products?page={}"
seen = set()

for page in range(1, 101):
    url = base_url.format(page)
    r = requests.get(url, timeout=30)
    r.raise_for_status()

    soup = BeautifulSoup(r.text, "html.parser")
    cards = soup.select(".product-card")

    if not cards:
        break

    new_count = 0
    for card in cards:
        link = card.select_one("a")
        if not link:
            continue
        href = link.get("href")
        if href in seen:
            continue
        seen.add(href)
        new_count += 1
        # extract fields here

    if new_count == 0:
        break

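Why this works: the loop has two data-driven stop conditions, an empty page and a page with no new items, plus a hard cap of 100 pages from the range. Together they protect you from edge cases such as a site that keeps serving its final page for every higher page number.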

Next-page link scraping

Sometimes page numbers are not predictable, but the markup includes a next link.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example.com/blog"
seen_pages = set()

while url and url not in seen_pages:
    seen_pages.add(url)
    r = requests.get(url, timeout=30)
    r.raise_for_status()

    soup = BeautifulSoup(r.text, "html.parser")
    posts = soup.select("article")

    for post in posts:
        # parse post summary
        pass

    next_link = soup.select_one("a.next, a[rel='next']")
    url = urljoin(url, next_link.get("href")) if next_link else None

This pattern is often more robust than manually constructing page URLs because it follows the site’s own navigation logic.

Load more button with Playwright

For dynamic interfaces, you may need to click until the feed is exhausted.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/catalog", wait_until="networkidle")

    seen = set()

    while True:
        items = page.locator(".item-card")
        count = items.count()

        for i in range(count):
            # .first avoids a strict-mode error when a card contains several links
            href = items.nth(i).locator("a").first.get_attribute("href")
            if href:
                seen.add(href)

        load_more = page.locator("button:has-text('Load more')")
        if load_more.count() == 0 or not load_more.is_enabled():
            break

        before = count
        load_more.click()
        page.wait_for_timeout(1500)
        after = page.locator(".item-card").count()

        if after <= before:
            break

    browser.close()

In real projects, replace fixed timeouts with a wait for a specific network response or item count increase when possible.
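
A wait on item count can be sketched with Playwright's page.wait_for_function, which polls a JavaScript expression until it returns a truthy value. The helper name click_and_wait_for_growth is hypothetical, page is assumed to be a Playwright sync Page, and item_selector must be a plain CSS selector because it is passed to document.querySelectorAll.

```python
def click_and_wait_for_growth(
    page,
    button_selector: str,
    item_selector: str,
    timeout_ms: int = 10_000,
) -> int:
    """Click a load-more button and wait until more items exist in the DOM.

    Returns the number of new items. If the count never grows within
    timeout_ms, Playwright raises its TimeoutError from wait_for_function.
    """
    before = page.locator(item_selector).count()
    page.locator(button_selector).click()
    # The expression runs in the browser; arg is passed in as a JS array.
    page.wait_for_function(
        "([selector, count]) => document.querySelectorAll(selector).length > count",
        arg=[item_selector, before],
        timeout=timeout_ms,
    )
    return page.locator(item_selector).count() - before
```

Catching the timeout in the calling loop gives a natural stop condition: if a click produces no new items within the limit, the feed is probably exhausted.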

Infinite scroll scraping

Infinite scroll scraping is usually a loop of scroll, wait, measure, and stop when the page stops growing.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/feed")

    previous_height = 0
    stable_rounds = 0

    while stable_rounds < 3:
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)
        current_height = page.evaluate("document.body.scrollHeight")

        if current_height == previous_height:
            stable_rounds += 1
        else:
            stable_rounds = 0
            previous_height = current_height

    items = page.locator(".feed-item")
    for i in range(items.count()):
        # extract item fields
        pass

    browser.close()

This method is simple, but not always ideal. If scrolling triggers an API request, intercepting that request or calling the endpoint directly is often more reliable than extracting from the rendered DOM.
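
Interception can be sketched as a small collector registered with page.on("response", ...). The class name FeedCollector, the URL fragment, and the items key are assumptions for illustration; the handler relies only on the url, status, and json() members of Playwright's Response object.

```python
class FeedCollector:
    """Collect items from JSON responses whose URL contains a given fragment."""

    def __init__(self, url_fragment: str, items_key: str = "items"):
        self.url_fragment = url_fragment
        self.items_key = items_key
        self.items: list = []

    def handle(self, response) -> None:
        """Callback intended for page.on("response", collector.handle)."""
        if self.url_fragment not in response.url or response.status != 200:
            return
        try:
            body = response.json()
        except Exception:
            return  # not JSON, or the body is unavailable; ignore this response
        if isinstance(body, dict):
            self.items.extend(body.get(self.items_key, []))
```

A typical use would be to create collector = FeedCollector("/api/feed"), register it before the scroll loop starts, and read collector.items afterwards instead of parsing the rendered cards.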

Cursor pagination scraping from an API

Modern apps often return a cursor token rather than a page number.

import requests

endpoint = "https://example.com/api/search"
cursor = None
seen_ids = set()

while True:
    payload = {"limit": 50}
    if cursor:
        payload["cursor"] = cursor

    r = requests.get(endpoint, params=payload, timeout=30)
    r.raise_for_status()
    data = r.json()

    items = data.get("items", [])
    if not items:
        break

    for item in items:
        item_id = item.get("id")
        if item_id in seen_ids:
            continue
        seen_ids.add(item_id)
        # process item

    next_cursor = data.get("next_cursor")
    if not next_cursor or next_cursor == cursor:
        break

    cursor = next_cursor

Cursor pagination is common because it performs well on large datasets, but it requires careful state handling. Treat a missing or repeated cursor as a stop signal rather than ignoring it.

Scrapy pagination pattern

Scrapy is well suited for pagination because request scheduling and callback flow are built in.

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        for card in response.css(".product-card"):
            href = card.css("a::attr(href)").get()
            if not href:
                continue  # skip cards without a link instead of joining None
            yield {
                "name": card.css(".name::text").get(),
                "url": response.urljoin(href),
            }

        next_href = response.css("a.next::attr(href), a[rel='next']::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

If your project is growing beyond a one-off script, this structure can be easier to maintain than a manually managed requests loop.

Common mistakes

Most pagination failures come from a short list of recurring mistakes. If your scraper works for a few pages and then starts missing data, stalling, or duplicating records, check these first.

Assuming the UI tells the full story

A visible next button may only be a wrapper around an API call with hidden parameters. Scraping the button click without understanding the request can make the scraper unnecessarily brittle.

Using selectors that are too tied to presentation

If your next button selector depends on fragile classes generated by a frontend framework, it may break on a minor redesign. Prefer stable attributes, semantic relationships, or direct network patterns where available.

Stopping too early

Some feeds return temporary empty states while loading, or they require a second scroll before the next batch appears. If you stop after one weak signal, you may silently lose data. Use confirmation logic where appropriate, such as requiring multiple stable scroll rounds.

Stopping too late

The opposite problem is an endless loop. This often happens when the last page repeats, the cursor stops changing, or the page keeps firing background requests unrelated to content. Add loop guards and detect repeated state explicitly.
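
One way to detect a repeating last page is to fingerprint the set of item keys on each page and stop when a fingerprint recurs. This is a sketch with hypothetical names (page_fingerprint, RepeatGuard); it assumes each item already has a stable string key such as its URL.

```python
import hashlib
from typing import Iterable


def page_fingerprint(keys: Iterable[str]) -> str:
    """Hash the sorted item keys of one page so repeated pages compare equal."""
    joined = "\n".join(sorted(keys))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


class RepeatGuard:
    """Flag a page whose set of item keys was already seen in this crawl."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def is_repeat(self, keys: Iterable[str]) -> bool:
        fingerprint = page_fingerprint(keys)
        if fingerprint in self._seen:
            return True
        self._seen.add(fingerprint)
        return False
```

Sorting the keys before hashing means a page that returns the same items in a different order still counts as a repeat.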

Ignoring deduplication

If the same product appears in multiple category pages or across pagination boundaries, your final dataset can be inflated without obvious errors. Always maintain a unique key set or deduplicate downstream.

Mixing extraction and control logic

When parsing code and pagination decisions are interwoven line by line, small changes become risky. Keep the code readable enough that you can swap out only the pagination mechanism when the site changes from page numbers to cursors.

Not checking network responses

A browser may render partial content even when one background request fails. If you only inspect the DOM, you may miss rate limiting, expired tokens, or intermittent 403 and 429 responses that affect later pages.
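
For direct HTTP requests, a retry policy can make those responses explicit. The sketch below assumes that 429 and common 5xx statuses are worth retrying and that 403 is not; both are judgment calls that depend on the target. It honors a numeric Retry-After header and otherwise uses exponential backoff with jitter. Header lookup is case-sensitive for a plain dict; the headers object from requests is case-insensitive.

```python
import random
from typing import Mapping, Optional

# Statuses treated as temporary; this set is an assumption, not a standard.
RETRYABLE = {429, 500, 502, 503, 504}


def retry_delay(
    status: int,
    headers: Mapping[str, str],
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
) -> Optional[float]:
    """Return seconds to wait before retrying, or None if no retry is warranted."""
    if status not in RETRYABLE:
        return None
    retry_after = headers.get("Retry-After", "")
    if retry_after.isdigit():  # the HTTP-date form of Retry-After is not handled
        return min(float(retry_after), cap)
    # Exponential backoff with jitter: base * 2^attempt, scaled by 0.5 to 1.0.
    return min(cap, base * (2 ** attempt)) * random.uniform(0.5, 1.0)
```

A return value of None is the signal to stop and log the failure rather than to move on to the next page as if nothing happened.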

Overusing browser automation

Playwright is excellent for complex targets, but not every site needs it. If a simple request can fetch page 2 through page 200 directly, using a browser for every page adds cost and complexity without much benefit.

When to revisit

The best pagination strategy is not permanent. Revisit your scraper when the target’s underlying behavior changes or when your own requirements become stricter.

Review and update your approach if you notice any of the following:

The site changes from page numbers to infinite scroll
A new API endpoint appears in the network panel
Cursor tokens replace simple offsets
The scraper starts producing duplicate or suspiciously low record counts
Rate limiting increases and you need a lighter request pattern
You need better restartability, checkpointing, or audit logs
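
For restartability, a minimal checkpoint can be a JSON file holding the pagination state, written after each successful page. The function names and the file layout are assumptions; the write goes to a temporary file first so an interrupted run does not leave a half-written checkpoint.

```python
import json
import os
from typing import Any


def save_checkpoint(path: str, state: dict[str, Any]) -> None:
    """Write state atomically: write a temp file, then rename over the target."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> dict[str, Any]:
    """Return the saved state, or an empty dict when no checkpoint exists."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}
```

On startup, the scraper can call load_checkpoint and resume from the stored page number or cursor if one is present. Note that cursor tokens may expire, so a resumed crawl should still handle a rejected cursor.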

A practical maintenance routine looks like this:

Pick one representative target page and re-inspect its network activity.
Confirm the current pagination mechanism: URL, next link, scroll-triggered API, or cursor.
Run a small test crawl with logging enabled.
Compare item counts against a known slice or previous baseline.
Verify that stop conditions still work and that duplicates remain low or zero.
Refactor toward the simplest stable method if a cleaner endpoint is now available.

If you are evaluating tools for a rebuild, Best Web Scraping Frameworks Compared in 2026 can help frame the tradeoffs.

The main takeaway is simple: pagination is not a single technique but a family of patterns. Reliable scrapers identify the real fetch mechanism, track pagination state explicitly, separate parsing from navigation, and stop only when the signals are clear. If you build around those principles, you will spend less time patching broken loops and more time collecting complete, usable data.