Pagination is where many otherwise solid scrapers become fragile. A site may look simple, but the path from page one to page two can involve hidden API calls, cursor tokens, JavaScript events, or an endless feed that never exposes a traditional next link. This guide gives you a practical framework for handling pagination in web scraping across the patterns you are most likely to meet: numbered pages, next buttons, infinite scroll, load-more interfaces, and cursor-based APIs. The goal is not just to get one scraper working once, but to build a next page scraper that remains understandable, debuggable, and easier to update when the target changes.
Overview
If you want to know how to scrape paginated websites reliably, start by treating pagination as a discovery problem before it becomes a coding problem. The visible page controls are only one layer. What matters is the mechanism the site actually uses to fetch the next batch of records.
In practice, pagination in web scraping usually falls into five broad patterns:
- Page-number URLs, such as ?page=2 or /page/3/
- Next-button navigation, where a link or button leads to the next result set
- Load more buttons, where the same page fetches additional items asynchronously
- Infinite scroll, where scrolling triggers network requests for more data
- Cursor or token pagination, common in APIs and modern web apps, where the next request depends on a cursor value returned by the previous one
The mistake many developers make is choosing tooling before identifying the pagination pattern. If the site exposes clean request parameters, a lightweight HTTP client may be enough. If the site relies on browser events and dynamic rendering, you may need Playwright or another browser automation tool. For a broader setup checklist, it helps to review Web Scraping Tech Stack Checklist for New Projects.
A reliable approach has three goals:
- Discover the true pagination mechanism
- Extract records without duplication or silent gaps
- Stop safely when no more results exist
Everything else in this article builds on those three points.
Core framework
Here is a repeatable framework you can apply whether you are scraping a simple blog archive or a JavaScript-heavy catalog.
1. Inspect before you automate
Open browser devtools and answer these questions first:
- Does the URL change when moving to the next page?
- Is there a standard anchor tag with an href for the next page?
- Does clicking next trigger an XHR or fetch request?
- Does the response contain HTML, JSON, or some cursor token?
- Is there a visible end condition, such as a disabled next button or an empty response?
This step often reveals that a page that looks dynamic is backed by a simple JSON endpoint. If so, scraping the endpoint directly is usually cleaner than driving the interface.
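As a rough aid for this inspection step, the sketch below summarises a response copied out of the network panel: whether it looks like JSON or HTML, and whether a cursor-like key is present. The function name `describe_response` and the key names in `CURSOR_KEYS` are assumptions for illustration; real endpoints name their cursors in many different ways.

```python
import json

# Hypothetical cursor key names; adjust to whatever the target actually returns.
CURSOR_KEYS = ("next_cursor", "cursor", "next", "next_page_token")


def describe_response(content_type: str, body: str) -> dict:
    """Classify a response by its Content-Type and look for a cursor-like key."""
    kind = "other"
    cursor_key = None
    lowered = content_type.lower()
    if "json" in lowered:
        kind = "json"
        try:
            data = json.loads(body)
        except ValueError:
            data = None  # the header said JSON, but the body did not parse
        if isinstance(data, dict):
            # First known key that is present with a non-empty value, if any.
            cursor_key = next((k for k in CURSOR_KEYS if data.get(k)), None)
    elif "html" in lowered:
        kind = "html"
    return {"kind": kind, "cursor_key": cursor_key}
```

A result of `{"kind": "json", "cursor_key": "next_cursor"}` would suggest the endpoint can probably be called directly instead of driving the interface.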
2. Prefer stable request patterns over UI simulation
If page two is available at a predictable URL or API endpoint, use that. Browser automation is valuable, but it introduces more moving parts: rendering delays, selector drift, and state synchronization issues. A direct request flow is usually faster and easier to debug.
This is especially relevant when deciding between parser-first and browser-first tools. If you are working in Python and want a broader comparison, Scrapy vs Beautiful Soup: Which Python Scraper Should You Use? is a useful companion read. If the site truly needs a browser, see Playwright vs Puppeteer for Web Scraping: Features, Tradeoffs, and Use Cases.
3. Define a clear pagination state
Your scraper should explicitly track the state needed to continue. Depending on the target, that may be:
- A page number
- A next URL
- An offset and limit pair
- A cursor token
- A timestamp or last-seen item ID
Do not bury this state implicitly inside a loop without logging it. When a scraper fails on page 47 or after a cursor refresh, visible state makes the problem diagnosable.
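One way to keep that state visible is to hold it in a small object and log it every time it changes. This is a minimal sketch; the class name `PaginationState`, its fields, and the logger name are assumptions, and most scrapers only need one or two of the fields.

```python
import logging
from dataclasses import asdict, dataclass
from typing import Optional

logger = logging.getLogger("pagination")


@dataclass
class PaginationState:
    """Everything needed to resume the crawl from the current position."""
    page: int = 1
    next_url: Optional[str] = None
    cursor: Optional[str] = None
    requests_made: int = 0

    def advance(self, *, next_url: Optional[str] = None, cursor: Optional[str] = None) -> None:
        """Move to the next page and log the full state so failures are traceable."""
        self.page += 1
        self.next_url = next_url
        self.cursor = cursor
        self.requests_made += 1
        logger.info("pagination state: %s", asdict(self))
```

Because the state is a plain dataclass, it is also straightforward to serialise when a crawl needs to be resumed later.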
4. Separate item extraction from pagination logic
Keep two concerns distinct:
- Item parsing: how you extract product cards, listings, or article links from one response
- Pagination control: how you decide what request to make next
This separation makes your code easier to update when the site changes one part but not the other.
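The split can be sketched as a generic loop that knows nothing about the site and receives the site-specific pieces as functions. The names `crawl`, `fetch`, `parse_items`, and `next_request` are hypothetical, and the request cap of 50 is an arbitrary development default.

```python
from typing import Any, Callable, Iterable, Iterator, Optional


def crawl(
    first_request: Any,
    fetch: Callable[[Any], Any],
    parse_items: Callable[[Any], Iterable[Any]],
    next_request: Callable[[Any, Any], Optional[Any]],
    max_requests: int = 50,
) -> Iterator[Any]:
    """Yield items page by page; pagination decisions live only in next_request."""
    request = first_request
    for _ in range(max_requests):  # defensive cap against endless loops
        if request is None:
            return
        response = fetch(request)
        yield from parse_items(response)            # item parsing
        request = next_request(request, response)   # pagination control


# Usage with an in-memory stand-in for a site with two pages.
fake_pages = {
    1: {"items": ["a", "b"], "next": 2},
    2: {"items": ["c"], "next": None},
}
collected = list(
    crawl(
        first_request=1,
        fetch=fake_pages.__getitem__,
        parse_items=lambda response: response["items"],
        next_request=lambda request, response: response["next"],
    )
)
```

If the site later moves from page numbers to cursors, only the `next_request` function should need to change.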
5. Add stop conditions early
Many scraping loops are written as open-ended while loops and only later patched with stop logic. It is better to define stop conditions from the start. Common examples include:
- No next URL found
- Returned items list is empty
- Cursor token is missing or unchanged
- Next button is disabled or absent
- Known maximum page count reached
Also add a defensive page or request cap during development so you do not accidentally create an endless loop.
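The stop conditions listed above can be collected into one function that returns a reason rather than a bare boolean, which makes the final log line more informative. This is a sketch; `stop_reason`, its parameters, and the default cap of 200 requests are assumptions.

```python
from typing import Optional, Sequence


def stop_reason(
    items: Sequence,
    next_token: Optional[str],
    previous_token: Optional[str],
    requests_made: int,
    max_requests: int = 200,
) -> Optional[str]:
    """Return why the crawl should stop, or None if it should continue."""
    if requests_made >= max_requests:
        return "request cap reached"
    if not items:
        return "empty item list"
    if not next_token:
        return "no next token or URL"
    if next_token == previous_token:
        return "next token unchanged"
    return None
```

Here `next_token` stands for whatever carries the crawl forward on the target: a next URL, a cursor, or a page number converted to a string.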
6. Deduplicate across pages
Pagination bugs often hide as duplicates rather than obvious failures. A site may repeat the last item of page one at the start of page two, or an infinite scroll feed may re-render already loaded items. Track a stable unique key such as item URL, ID, or hash of meaningful fields.
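A minimal sketch of such a key, assuming items are dictionaries that may carry an `id`, a `url`, or neither. The fallback fields `title` and `price` are placeholders for whatever fields are meaningful on the target.

```python
import hashlib
from typing import Iterable, List, Set


def item_key(item: dict) -> str:
    """Prefer an explicit ID, then the URL, then a hash of meaningful fields."""
    if item.get("id") is not None:
        return f"id:{item['id']}"
    if item.get("url"):
        return f"url:{item['url']}"
    # Hypothetical fallback fields; normalised so cosmetic differences do not matter.
    basis = "|".join(str(item.get(field, "")).strip().lower() for field in ("title", "price"))
    return "hash:" + hashlib.sha256(basis.encode("utf-8")).hexdigest()


def dedupe(items: Iterable[dict], seen: Set[str]) -> List[dict]:
    """Return only items not seen before, updating the shared seen set."""
    fresh = []
    for item in items:
        key = item_key(item)
        if key not in seen:
            seen.add(key)
            fresh.append(item)
    return fresh
```

Passing the same `seen` set to every page means an item repeated across a page boundary is kept only once.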
7. Log enough context to debug failures
Useful log fields for a paginated crawl include:
- Current page number or cursor
- Request URL
- Response status code
- Number of items extracted
- Whether a next token or next URL was found
Without these basics, many pagination failures become guesswork.
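The fields above fit on one line per request. The sketch below uses the standard `logging` module; the helper names, the logger name, and the choice to log at warning level when the status is not 200 or no items were extracted are assumptions.

```python
import logging

logger = logging.getLogger("scraper.pagination")


def format_page_log(position, url: str, status: int, item_count: int, next_found: bool) -> str:
    """Build one key=value line describing a single pagination request."""
    return (
        f"position={position!r} url={url} status={status} "
        f"items={item_count} next_found={next_found}"
    )


def log_page(position, url: str, status: int, item_count: int, next_found: bool) -> None:
    """Log at WARNING when the page looks suspicious, otherwise at INFO."""
    suspicious = status != 200 or item_count == 0
    level = logging.WARNING if suspicious else logging.INFO
    logger.log(level, format_page_log(position, url, status, item_count, next_found))
```

A key=value layout keeps the lines easy to filter later, for example to find the first request where the item count dropped to zero.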
Practical examples
This section shows common pagination patterns and code snippets you can adapt. The examples are intentionally simple and focus on the control flow rather than site-specific selectors.
Page-number URLs
This is the most straightforward pattern. The next page is derived from a numeric parameter.
import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/products?page={}"
seen = set()

for page in range(1, 101):
    url = base_url.format(page)
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    cards = soup.select(".product-card")
    if not cards:
        break
    new_count = 0
    for card in cards:
        link = card.select_one("a")
        if not link:
            continue
        href = link.get("href")
        if href in seen:
            continue
        seen.add(href)
        new_count += 1
        # extract fields here
    if new_count == 0:
        break

Why this works: the loop has two stop conditions, empty pages and no new items. That protects you from odd edge cases such as repeated final pages.
Next-page link scraping
Sometimes page numbers are not predictable, but the markup includes a next link.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://example.com/blog"
seen_pages = set()

while url and url not in seen_pages:
    seen_pages.add(url)
    r = requests.get(url, timeout=30)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    posts = soup.select("article")
    for post in posts:
        # parse post summary
        pass
    next_link = soup.select_one("a.next, a[rel='next']")
    url = urljoin(url, next_link.get("href")) if next_link else None

This pattern is often more robust than manually constructing page URLs because it follows the site’s own navigation logic.
Load more button with Playwright
For dynamic interfaces, you may need to click until the feed is exhausted.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/catalog", wait_until="networkidle")
    seen = set()
    while True:
        items = page.locator(".item-card")
        count = items.count()
        for i in range(count):
            href = items.nth(i).locator("a").get_attribute("href")
            if href:
                seen.add(href)
        load_more = page.locator("button:has-text('Load more')")
        if load_more.count() == 0 or not load_more.is_enabled():
            break
        before = count
        load_more.click()
        page.wait_for_timeout(1500)
        after = page.locator(".item-card").count()
        if after <= before:
            break
    browser.close()

In real projects, replace fixed timeouts with a wait for a specific network response or item count increase when possible.
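A wait for an item count increase can be sketched without tying it to any one browser library: poll a counting function until the count exceeds the previous value or a deadline passes. The helper `wait_for_growth` and its parameters are hypothetical. With Playwright, `get_count` could be something like `lambda: page.locator(".item-card").count()`, and `sleep` could be swapped for a wrapper around `page.wait_for_timeout`.

```python
import time
from typing import Callable


def wait_for_growth(
    get_count: Callable[[], int],
    before: int,
    timeout_s: float = 10.0,
    interval_s: float = 0.25,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until get_count() exceeds `before`; return False if the deadline passes first."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_count() > before:
            return True
        sleep(interval_s)  # short pause between polls instead of one long fixed wait
    return False
```

The loop returns as soon as new items appear, so fast pages are not slowed down by a worst-case delay, and a `False` result is an explicit signal that the feed is probably exhausted.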
Infinite scroll scraping
Infinite scroll scraping is usually a loop of scroll, wait, measure, and stop when the page stops growing.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/feed")
    previous_height = 0
    stable_rounds = 0
    while stable_rounds < 3:
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)
        current_height = page.evaluate("document.body.scrollHeight")
        if current_height == previous_height:
            stable_rounds += 1
        else:
            stable_rounds = 0
        previous_height = current_height
    items = page.locator(".feed-item")
    for i in range(items.count()):
        # extract item fields
        pass
    browser.close()

This method is simple, but not always ideal. If scrolling triggers an API request, intercepting that request or calling the endpoint directly is often more reliable than extracting from the rendered DOM.
Cursor pagination scraping from an API
Modern apps often return a cursor token rather than a page number.
import requests

endpoint = "https://example.com/api/search"
cursor = None
seen_ids = set()

while True:
    payload = {"limit": 50}
    if cursor:
        payload["cursor"] = cursor
    r = requests.get(endpoint, params=payload, timeout=30)
    r.raise_for_status()
    data = r.json()
    items = data.get("items", [])
    if not items:
        break
    for item in items:
        item_id = item.get("id")
        if item_id in seen_ids:
            continue
        seen_ids.add(item_id)
        # process item
    next_cursor = data.get("next_cursor")
    if not next_cursor or next_cursor == cursor:
        break
    cursor = next_cursor

Cursor pagination is common because it performs well for large datasets, but scraping it requires careful state handling. A missing or repeated cursor should be treated as a stopping signal, not ignored.
Scrapy pagination pattern
Scrapy is well suited for pagination because request scheduling and callback flow are built in.
import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    def parse(self, response):
        for card in response.css(".product-card"):
            yield {
                "name": card.css(".name::text").get(),
                "url": response.urljoin(card.css("a::attr(href)").get()),
            }
        next_href = response.css("a.next::attr(href), a[rel='next']::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

If your project is growing beyond a one-off script, this structure can be easier to maintain than a manually managed requests loop.
Common mistakes
Most pagination failures come from a short list of recurring mistakes. If your scraper works for a few pages and then starts missing data, stalling, or duplicating records, check these first.
Assuming the UI tells the full story
A visible next button may only be a wrapper around an API call with hidden parameters. Scraping the button click without understanding the request can make the scraper unnecessarily brittle.
Using selectors that are too tied to presentation
If your next button selector depends on fragile classes generated by a frontend framework, it may break on a minor redesign. Prefer stable attributes, semantic relationships, or direct network patterns where available.
Stopping too early
Some feeds return temporary empty states while loading, or they require a second scroll before the next batch appears. If you stop after one weak signal, you may silently lose data. Use confirmation logic where appropriate, such as requiring multiple stable scroll rounds.
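Confirmation logic of this kind can be isolated in a small helper that only reports "finished" after the same measurement has repeated several times in a row. The class name `StabilityTracker` and the default of three rounds are assumptions; the measurement could be page height, item count, or anything else that should grow while content is still loading.

```python
class StabilityTracker:
    """Report stability only after a value has repeated for several consecutive rounds."""

    def __init__(self, required_rounds: int = 3) -> None:
        self.required_rounds = required_rounds
        self.last_value = None
        self.stable_rounds = 0

    def update(self, value) -> bool:
        """Record a new measurement; return True once it has been stable long enough."""
        if value == self.last_value:
            self.stable_rounds += 1
        else:
            # Any change resets the counter, so one slow batch does not end the crawl.
            self.stable_rounds = 0
            self.last_value = value
        return self.stable_rounds >= self.required_rounds
```

A scroll loop would call `update` once per round with the latest measurement and stop only when it returns `True`.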
Stopping too late
The opposite problem is an endless loop. This often happens when the last page repeats, the cursor stops changing, or the page keeps firing background requests unrelated to content. Add loop guards and detect repeated state explicitly.
Ignoring deduplication
If the same product appears in multiple category pages or across pagination boundaries, your final dataset can be inflated without obvious errors. Always maintain a unique key set or deduplicate downstream.
Mixing extraction and control logic
When parsing code and pagination decisions are interwoven line by line, small changes become risky. Keep the code readable enough that you can swap out only the pagination mechanism when the site changes from page numbers to cursors.
Not checking network responses
A browser may render partial content even when one background request fails. If you only inspect the DOM, you may miss rate limiting, expired tokens, or intermittent 403 and 429 responses that affect later pages.
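When requests are made directly, status codes can be checked on every page instead of being discovered later as missing data. The sketch below assumes a hypothetical `fetch` callable that returns a `(status, body)` tuple, for example a thin wrapper around an HTTP client. The set of retryable codes, the treatment of 200 as the only success, and the backoff schedule are illustrative choices, not rules.

```python
import time
from typing import Callable, Tuple

# Assumed policy: retry rate limits and transient server errors, fail fast on the rest.
RETRYABLE_STATUSES = {429, 500, 502, 503, 504}


def fetch_with_retry(
    fetch: Callable[[str], Tuple[int, str]],
    url: str,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return the body on HTTP 200, retrying retryable statuses with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if status in RETRYABLE_STATUSES and attempt < max_attempts:
            sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
            continue
        # 403 and other non-retryable codes land here on the first attempt.
        raise RuntimeError(f"{url} returned HTTP {status} after {attempt} attempt(s)")
    raise RuntimeError(f"{url} was not fetched: max_attempts must be at least 1")
```

Raising on a 403 instead of continuing keeps a blocked crawl from quietly producing a short dataset.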
Overusing browser automation
Playwright is excellent for complex targets, but not every site needs it. If a simple request can fetch page 2 through page 200 directly, using a browser for every page adds cost and complexity without much benefit.
When to revisit
The best pagination strategy is not permanent. Revisit your scraper when the target’s underlying behavior changes or when your own requirements become stricter.
Review and update your approach if you notice any of the following:
- The site changes from page numbers to infinite scroll
- A new API endpoint appears in the network panel
- Cursor tokens replace simple offsets
- The scraper starts producing duplicate or suspiciously low record counts
- Rate limiting increases and you need a lighter request pattern
- You need better restartability, checkpointing, or audit logs
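For the last item, restartability can start as small as a JSON file holding the pagination state, written after each successful page. This is a sketch with hypothetical function names and an assumed state shape; the write goes through a temporary file so an interrupted run does not leave a half-written checkpoint.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write the state to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path: str, default: dict) -> dict:
    """Return the saved state, or a copy of the default if no checkpoint exists yet."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return dict(default)
```

On startup, the scraper would load the checkpoint and resume from the stored page or cursor rather than from the first page.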
A practical maintenance routine looks like this:
- Pick one representative target page and re-inspect its network activity.
- Confirm the current pagination mechanism: URL, next link, scroll-triggered API, or cursor.
- Run a small test crawl with logging enabled.
- Compare item counts against a known slice or previous baseline.
- Verify that stop conditions still work and that duplicates remain low or zero.
- Refactor toward the simplest stable method if a cleaner endpoint is now available.
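The comparison and duplicate checks in this routine can be automated with a small function that returns a list of problems, empty when the test crawl looks healthy. The function name and the 10% tolerance are assumptions; the right threshold depends on how much the target's inventory normally fluctuates.

```python
from typing import List


def check_against_baseline(
    current_count: int,
    baseline_count: int,
    duplicate_count: int,
    tolerance: float = 0.10,
) -> List[str]:
    """Flag a suspicious drop in item count or any duplicates found in the test crawl."""
    problems = []
    if baseline_count > 0:
        drop = (baseline_count - current_count) / baseline_count
        if drop > tolerance:
            problems.append(
                f"item count dropped {drop:.0%} ({baseline_count} -> {current_count})"
            )
    if duplicate_count > 0:
        problems.append(f"{duplicate_count} duplicate item(s) found")
    return problems
```

Running this after each small test crawl turns "suspiciously low record counts" into a concrete signal to go back and re-inspect the pagination mechanism.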
If you are evaluating tools for a rebuild, Best Web Scraping Frameworks Compared in 2026 can help frame the tradeoffs.
The main takeaway is simple: pagination is not a single technique but a family of patterns. Reliable scrapers identify the real fetch mechanism, track pagination state explicitly, separate parsing from navigation, and stop only when the signals are clear. If you build around those principles, you will spend less time patching broken loops and more time collecting complete, usable data.