How to Find All Existing and Archived URLs on a Website


There are several reasons you might want to find most of the URLs on a website, but your exact goal will determine what you're looking for. For example, you might want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors

In each case, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and hard to extract data from.

In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site's size.

Old sitemaps and crawl exports
If you're looking for URLs that recently disappeared from the live site, there's a chance someone on your team saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often give you what you need. But if you're reading this, you probably didn't get that lucky.
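If you do turn up a saved sitemap, a short script can pull the URLs out of it. Here's a minimal sketch using Python's standard library; the filename is a placeholder, and it assumes a standard sitemap.xml rather than a sitemap index.

# Minimal sketch: extract every <loc> entry from a saved sitemap.xml.
# "old-sitemap.xml" is a placeholder filename.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(path):
    tree = ET.parse(path)
    return [loc.text.strip() for loc in tree.getroot().findall(".//sm:loc", NS) if loc.text]

for url in urls_from_sitemap("old-sitemap.xml"):
    print(url)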

Archive.org
Archive.org (the Wayback Machine) is a useful tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few constraints:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.

To work around the lack of an export button, use a browser scraping plugin like Dataminer.io. Still, these constraints mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't show whether Google indexed a URL, but if Archive.org discovered it, there's a good chance Google did, too.
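If the web UI's cap or the missing export button gets in the way, the Wayback Machine also exposes a CDX API you can query from a script. Here's a minimal sketch; the domain, row limit, and output filename are placeholders you'd adjust for your own site.

# Minimal sketch: pull archived URLs for a domain from the Wayback Machine CDX API.
# example.com, the limit, and the output filename are placeholders.
import requests

params = {
    "url": "example.com/*",
    "output": "json",
    "fl": "original",
    "collapse": "urlkey",  # one row per unique URL
    "limit": 50000,
}
resp = requests.get("https://web.archive.org/cdx/search/cdx", params=params, timeout=120)
resp.raise_for_status()
rows = resp.json()  # first row is the column header when output=json
urls = sorted({row[0] for row in rows[1:]})

with open("archive_org_urls.txt", "w") as f:
    f.write("\n".join(urls))
print(f"Saved {len(urls)} URLs")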

Moz Pro
While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're managing a large website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this approach generally works well as a proxy for Googlebot's discoverability.

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. However, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you may need to rely on browser scraping tools, limited to 500 filtered URLs at a time. Not ideal.

Performance → Search results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
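As a rough illustration of the API route, the sketch below pages through the Search Analytics query endpoint to collect every page with impressions. It assumes you've already set up OAuth credentials with Search Console access; the site URL and dates are placeholders.

# Minimal sketch: collect pages with impressions via the Search Console API.
# Assumes OAuth credentials with Search Console scope already exist;
# site_url and the dates are placeholders.
from googleapiclient.discovery import build

def fetch_gsc_pages(creds, site_url, start_date, end_date):
    service = build("searchconsole", "v1", credentials=creds)
    pages, start_row = set(), 0
    while True:
        body = {
            "startDate": start_date,
            "endDate": end_date,
            "dimensions": ["page"],
            "rowLimit": 25000,  # API maximum per request
            "startRow": start_row,
        }
        response = service.searchanalytics().query(siteUrl=site_url, body=body).execute()
        rows = response.get("rows", [])
        if not rows:
            break
        pages.update(row["keys"][0] for row in rows)
        start_row += len(rows)
    return sorted(pages)

# Example: fetch_gsc_pages(creds, "https://example.com/", "2024-01-01", "2024-12-31")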

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create specific URL lists, effectively working around the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click "Create a new segment."


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they provide valuable insights.
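If you'd rather script this than click through segments, the GA4 Data API can apply the same pagePath filter. Here's a minimal sketch; the property ID, date range, and /blog/ pattern are placeholders, and it assumes application default credentials are configured.

# Minimal sketch: pull /blog/ page paths via the GA4 Data API.
# The property ID, dates, and /blog/ filter are placeholders.
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

client = BetaAnalyticsDataClient()  # assumes application default credentials
request = RunReportRequest(
    property="properties/123456789",
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="today")],
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                match_type=Filter.StringFilter.MatchType.CONTAINS,
                value="/blog/",
            ),
        )
    ),
    limit=100000,
)
response = client.run_report(request)
paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(paths)} blog paths")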

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be huge, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but several tools are available to simplify the process.
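As one simple option, a short script can reduce a raw access log to its unique request paths before you merge it with the other sources. Here's a minimal sketch assuming common/combined log format; the filenames are placeholders.

# Minimal sketch: extract unique request paths from an access log in
# common/combined log format. "access.log" is a placeholder filename.
import re

REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[^"]*"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        match = REQUEST_RE.search(line)
        if match:
            paths.add(match.group(1).split("?")[0])  # drop query strings

with open("log_paths.txt", "w") as out:
    out.write("\n".join(sorted(paths)))
print(f"{len(paths)} unique paths")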
Combine, and good luck
Once you've gathered URLs from all these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Make sure all URLs are consistently formatted, then deduplicate the list.
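For the Jupyter Notebook route, something along these lines is usually enough. The filenames are placeholders for the exports gathered above, and path-only sources (GA4, logs) would need the domain prepended before merging with full URLs.

# Minimal sketch: combine the exports, normalize formatting, and deduplicate.
# Filenames are placeholders; one URL or path per line is assumed.
import pandas as pd

sources = ["archive_org_urls.txt", "gsc_pages.csv", "ga4_paths.csv", "log_paths.txt"]
frames = [pd.read_csv(path, header=None, names=["url"]) for path in sources]
urls = pd.concat(frames, ignore_index=True)

# Normalize before deduplicating: strip whitespace and trailing slashes.
urls["url"] = urls["url"].astype(str).str.strip().str.rstrip("/")
urls = urls.drop_duplicates().sort_values("url")
urls.to_csv("all_urls.csv", index=False)
print(f"{len(urls)} unique URLs")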

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
