Wayback Machine

What is Wayback Machine?

The Internet Archive's Wayback Machine is a digital archive of information on the internet. The Internet Archive, a nonprofit organization based in San Francisco, made it public in 2001.

Users can access archived versions of webpages with Wayback Machine. Wayback Machine holds more than 832 billion archived webpages, dating back to 1996. In addition to webpages, the Internet Archive stores books, movies, television, music and other content. The Internet Archive takes up more than 40 petabytes of data storage, and Wayback Machine takes up a significant portion of that.

Why is Wayback Machine important?

The Internet Archive was one of the first organizations to archive the internet. Wayback Machine, therefore, serves as a unique record of the internet's early days, before most other organizations were recording it.

The internet is continually growing and changing, and webpages can be deleted or edited at any time without leaving behind any artifact. Wayback Machine preserves the history of the internet even after those pages have been edited or deleted.

How does Wayback Machine work?

Wayback Machine automatically crawls and captures snapshots of webpages at various points in time. These snapshots are then stored, attached to timestamps and made accessible to users.

Wayback Machine uses several different crawlers -- some from third-party sources and some from the Internet Archive. Users can also submit a page for manual archival.

Webpages are typically built from a combination of files, such as Hypertext Markup Language (HTML), JavaScript, Cascading Style Sheets (CSS) and image files. Each file has its own URL, and Wayback Machine captures each of them so it can display the full page as it looked to the user. For example, an image on a webpage has a separate URL from the main page. Because of this, the files may be captured at different times from the page itself: An image might be crawled and recorded days after the main HTML of the page.
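The relationship between a capture, its timestamp and its URL can be sketched in a few lines of Python. This is a minimal illustration, not the Internet Archive's own code. It assumes the commonly seen replay address form https://web.archive.org/web/<14-digit timestamp>/<original URL> and treats timestamps as UTC. The helper names, the example.com addresses and the two capture times are made up for illustration.

```python
from datetime import datetime, timezone

# Assumed replay prefix for archived captures.
WAYBACK_PREFIX = "https://web.archive.org/web"


def to_wayback_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as a 14-digit YYYYMMDDhhmmss stamp."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


def from_wayback_timestamp(stamp: str) -> datetime:
    """Parse a 14-digit stamp back into a UTC datetime."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def capture_url(original_url: str, moment: datetime) -> str:
    """Build the address of a capture of original_url at the given moment."""
    return f"{WAYBACK_PREFIX}/{to_wayback_timestamp(moment)}/{original_url}"


# Hypothetical capture times: the page's HTML and one of its images
# are recorded by separate crawls a few days apart.
page_time = datetime(2010, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
image_time = from_wayback_timestamp("20100304081530")

page_capture = capture_url("http://example.com/", page_time)
image_capture = capture_url("http://example.com/logo.png", image_time)

# How far apart the two files were captured.
drift = image_time - page_time
print(page_capture)
print(image_capture)
print(f"Image captured {drift.days} days after the page")
```

When an archived page is replayed, each embedded file is served from its own capture, which is why an old page can show an image recorded at a slightly different time than the HTML around it.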

To search from the Wayback Machine homepage, users enter a site's URL into the search bar, along with a date range for the content they want to access.

The Wayback Machine search results page shows a graph of the number of times a webpage has been crawled since 1996 and a calendar that lists crawls per day. Users can hover over each crawl to see its date, time and reason.

Wayback Machine has several different features to display webpage data, including the following:

  • Collections page. This lets users see why a page was crawled.
  • Changes page. This shows how much a page has changed over time.
  • Compare feature. This lets users compare two different captures from two different times side by side.
  • Summary feature. This shows information about the entire domain.
  • Sitemap feature. This shows information about the linking structure of the site over time.

Users can click on a particular capture and view the provenance of a page. Users can also save pages to a personal web archive in their account.

In addition to searching by URL, users can search by keyword. Keyword search on Wayback Machine is different from keyword search on Google or similar search engines. Wayback Machine's keyword search returns entire sites related to a keyword, not individual pages.

The Save Page Now feature saves the one URL entered in the search bar. There are also Wayback Machine Chrome extensions, web browser add-ons, a WordPress plugin and an iOS app.

How is Wayback Machine used?

Here are some basic ways to use Wayback Machine:

  • View and compare changes between two iterations of a webpage.
  • See why or when a page was crawled.
  • See who is crawling what webpages.
  • View old versions of webpages.
  • View webpages that no longer exist.
  • Troubleshoot problems with a webpage.
  • Save pages manually to Wayback Machine.
  • Link to old webpages.
  • Conduct large-scale crawls.

These basic functions have many applied uses, including search engine optimization (SEO), web development, journalism, open source intelligence (OSINT) gathering and legal research. For example, SEO-motivated users can find old versions of websites that were never redirected to live versions and fix broken links. They can also revisit old versions of pages that performed better to see if there are any elements worth re-including in new content.

Users can also look at Wayback Machine to see how frequently their competitors update content. Legal researchers could use the tool to gather evidence for a legal case. Web developers could use it to troubleshoot or debug websites by comparing past versions of a site to find when a particular bug was introduced. Journalists could use the service to access historical documents or perform fact checks. Cybersecurity researchers could look for OSINT hidden in older iterations of a webpage or in deleted information. And archivists at Wikipedia can use Wayback Machine to help alleviate link rot.

Wayback Machine's application programming interfaces (APIs) let users automate data retrieval at scale. The Internet Archive's broader APIs can also read and write metadata, media and other files to and from items in the archive. Wayback Machine has several APIs, including the following:

  • Wayback Availability JSON. This tests if a URL is archived and accessible in Wayback Machine.
  • Memento. This provides additional interfaces for querying snapshots in Wayback Machine.
  • Wayback CDX Server. This enables complex filtering, querying and analysis of Wayback Machine capture data.
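As a rough sketch of how two of these APIs might be called from Python, the example below builds query URLs for the Availability and CDX Server endpoints and parses responses of the shape those endpoints return. The endpoint addresses and field names reflect the publicly documented interfaces as best understood here and should be checked against the current documentation. The function names are hypothetical, and the sample payloads are hand-written placeholders, not real archive data.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint addresses; verify against current Internet Archive docs.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def availability_query(url, timestamp=None):
    """Build an Availability query, optionally near a YYYYMMDD[hhmmss] stamp."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


def closest_snapshot(payload):
    """Return the URL of the closest available capture, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def cdx_query(url, limit=5):
    """Build a CDX Server query that asks for JSON output."""
    params = {"url": url, "output": "json", "limit": limit}
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def cdx_rows(table):
    """Turn CDX JSON (a header row followed by data rows) into dicts."""
    if not table:
        return []
    header, *rows = table
    return [dict(zip(header, row)) for row in rows]


def fetch_json(query_url, timeout=30):
    """Perform the HTTP request. Needs network access; not called below."""
    with urlopen(query_url, timeout=timeout) as response:
        return json.load(response)


# Hand-written placeholder responses, used so the parsing can be shown offline.
sample_availability = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20060101000000/http://example.com/",
            "timestamp": "20060101000000",
            "status": "200",
        }
    }
}
sample_cdx = [
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20060101000000", "http://example.com/", "text/html", "200", "PLACEHOLDERDIGEST", "1234"],
]

print(availability_query("example.com", "20060101"))
print(closest_snapshot(sample_availability))
print(cdx_query("example.com"))
print(cdx_rows(sample_cdx)[0]["timestamp"])
```

To run this against the live service, pass the output of availability_query or cdx_query to fetch_json and hand the result to the matching parser. Keeping URL building, fetching and parsing separate makes the parsing easy to test without network access.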

The Internet Archive's subscription service -- Archive-It -- lets organizations archive websites and create custom collections of content.

History of Wayback Machine

The Internet Archive, a nonprofit, was founded in 1996 by Brewster Kahle and Bruce Gilliat to archive the internet in its nascent stages and to pursue the goal of providing universal access to all knowledge. Wayback Machine began archiving webpages in 1996 and was formally released to the public in 2001, by which time it contained over 10 billion archived pages. Kahle also founded the for-profit web crawling company Alexa Internet, which for many years was one of the Internet Archive's most prominent sources of crawl data.

The Internet Archive now hosts several other projects, including the National Aeronautics and Space Administration images archive and the book information site Open Library. The Internet Archive also collaborates with many institutions to maintain these libraries, including the Library of Congress and Smithsonian Institution.

The name Wayback Machine is a reference to the animated cartoon The Adventures of Rocky and Bullwinkle and Friends. In it, the characters used the WABAC -- pronounced wayback -- machine to travel through time and participate in various historical events.

Limitations of Wayback Machine

Not all webpages are archived in Wayback Machine. Some websites block Wayback Machine's crawlers. Others might not be archived for various reasons, such as site owners requesting exclusion or pages requiring a password to access. Sometimes, a site's robots.txt file keeps the site from being crawled. Robots.txt files give instructions to web crawlers and indicate which parts of a site they can and can't visit. Pages without inbound links from other websites are more difficult to archive, too. In some cases, JavaScript can be hard to archive as well. HTML is the easiest type of content for Wayback Machine to archive.
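To illustrate how a robots.txt file can keep a crawler away from part of a site, the sketch below uses Python's standard urllib.robotparser module to evaluate a small, made-up robots.txt. The user agent name ia_archiver is an assumption used for illustration. Which user agents Wayback Machine's crawlers present, and whether they honor a given rule, should be confirmed with the Internet Archive.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt: one crawler is kept out of /private/,
# and every other crawler is allowed everywhere.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# "ia_archiver" is an assumed crawler name; "SomeOtherBot" is hypothetical.
print(is_allowed(ROBOTS_TXT, "ia_archiver", "https://example.com/private/page.html"))
print(is_allowed(ROBOTS_TXT, "ia_archiver", "https://example.com/index.html"))
print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/private/page.html"))
```

Robots.txt rules are advisory. A crawler that follows them skips the disallowed paths, so those pages may never appear in an archive, but the file does not by itself secure the content.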

Additionally, the frequency of snapshots can vary, so not every change to a website is captured. It can sometimes take months for a webpage to appear in Wayback Machine after being collected.

In general, Wayback Machine doesn't collect or archive personal emails or chats from private sources. It also doesn't capture dynamic content well. For example, a user could not load an archived copy of Google's search page from 2010 and use it to search for other websites.

This was last updated in August 2023
