Website archiving

Hello and welcome to the topic of web archiving. This article is a resource for anyone who wants to understand what website archiving is, how it is used, and why it matters today. It covers the following questions:

What is web archiving?
How does website archiving work?
Why is archiving web pages important?
What is the difference between archiving and creating a backup?

This article is aimed at a wide range of professionals in marketing, digital technology, compliance, research, archiving, and records management: basically anyone responsible for an organization’s online presence, archival compliance, or long-term preservation.

What is web archiving?

When we talk about archiving a website, we mean collecting and storing web pages and the information they contain. The process is similar to the traditional archiving of paper documents: you first search for data and information, then select the content, save it, and archive it, for example on a hard disk.

From this point on, the information on the stored pages can be made available to the public in the archive. The following groups in particular are interested in such archives: researchers, historians, journalists and universities, but also companies, authorities and other organizations. In addition, some industries are obliged to archive their websites.

How does website archiving work?

Because the Internet contains a vast number of web pages, archiving usually relies on automated procedures to collect and store them. For this purpose, archiving services use so-called crawler software. Crawlers move across the web and follow the links within a site from URL to URL, extracting and storing information as they go. How accurately pages are collected depends largely on these bots. Because modern sites are complex, crawling has become a challenge for all archiving vendors.
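
To make the crawl loop described above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation of any particular archiving product. It assumes a hypothetical `fetch` function that returns a page's HTML (or `None` if the page is unavailable), and it uses a small in-memory dictionary, `SITE`, in place of real HTTP requests. The names `crawl`, `LinkExtractor` and `SITE`, as well as the `example.org` URLs, are made up for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl of one host; returns {url: html} for stored pages.

    `fetch` is a hypothetical callable that takes a URL and returns the
    page's HTML text, or None if the page cannot be retrieved.
    """
    host = urlparse(seed_url).netloc
    queue = deque([seed_url])
    seen = {seed_url}
    stored = {}

    while queue and len(stored) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        stored[url] = html  # "archive" the page content

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            link, _fragment = urldefrag(urljoin(url, href))
            # Stay on the same host and never queue a URL twice.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return stored


# A tiny in-memory "website" standing in for real HTTP requests (assumption).
SITE = {
    "https://example.org/": '<a href="/about">About</a> <a href="/news#top">News</a>',
    "https://example.org/about": '<a href="/">Home</a> <a href="https://other.example/">Ext</a>',
    "https://example.org/news": "<p>No links here.</p>",
}

archive = crawl("https://example.org/", SITE.get)
print(sorted(archive))
```

In this sketch the crawler stores the three pages on `example.org` and ignores the link to the external host. A real crawler typically has to do much more, such as honoring robots.txt, limiting its request rate and capturing content generated by scripts, which is part of what makes modern sites hard to archive.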

Once crawling is complete, the archived pages and the information they contain are available as part of the web archive collection. They can be replayed and navigated as on the “live web”. However, they retain only the content published at a particular point in time.
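
The point-in-time nature of an archive can be illustrated with a small sketch. It assumes a hypothetical `SnapshotStore` class that keeps every capture of a URL together with its capture time and, on request, returns the capture that was current on a given date. The class, its method names and the sample content are assumptions for illustration only and do not describe the API of any real web archive.

```python
from bisect import bisect_right
from datetime import datetime


class SnapshotStore:
    """Keeps every capture of a URL together with its capture time."""

    def __init__(self):
        # url -> list of (captured_at, content), kept sorted by time
        self._captures = {}

    def add(self, url, captured_at, content):
        captures = self._captures.setdefault(url, [])
        captures.append((captured_at, content))
        captures.sort(key=lambda item: item[0])

    def replay(self, url, as_of):
        """Return the latest capture taken at or before `as_of`, or None."""
        captures = self._captures.get(url, [])
        times = [captured_at for captured_at, _content in captures]
        index = bisect_right(times, as_of)
        return captures[index - 1][1] if index else None


store = SnapshotStore()
store.add("https://example.org/", datetime(2020, 1, 1), "<h1>Old homepage</h1>")
store.add("https://example.org/", datetime(2022, 6, 1), "<h1>New homepage</h1>")

# Ask how the page looked in March 2021: the 2020 capture is returned.
print(store.replay("https://example.org/", datetime(2021, 3, 15)))
```

Each stored capture is frozen: later changes to the live page only show up in the archive once a new capture is added.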

What is a crawler?

A crawler, also called a spider or spider bot, is an Internet bot that systematically browses the World Wide Web. Its usual purpose is to index web pages, that is, to record the information on a page in an index (register). Search engines such as Google use these bots to build the index from which they rank pages in the search results.
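
One simple form of such an index is an inverted index, which maps each word to the pages that contain it. The following Python sketch is a deliberately simplified illustration; the function name `build_index` and the sample pages are assumptions, and real search engines use far more elaborate structures.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


# Sample page texts, as a crawler might have collected them (assumption).
pages = {
    "https://example.org/": "Welcome to the web archive",
    "https://example.org/about": "About the archive team",
}

index = build_index(pages)
print(sorted(index["archive"]))  # both pages mention "archive"
print(sorted(index["team"]))     # only the "about" page mentions "team"
```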

Why is archiving web pages important?

Businesses, government agencies, and organizations create websites as part of their communication with the public because they are powerful tools for marketing and information sharing. Websites represent a company’s brand, values, and personality, and document the public nature of an organization and its interactions with audiences and customers. In addition, the Internet has become the primary place where we seek and receive information. For this reason, a website is considered an important public document.

Importance of website archiving for different industries

Financial services industry

Some industries, such as the legal industry and the financial industry, are required by law to retain their web pages:

After the 2008 financial crisis, the financial services industry was overhauled to protect consumers and increase transparency. As a result, regulated organizations must comply with a strict and constantly evolving set of regulations, some of which relate directly to securing content in a web archive. Regulatory bodies around the world require companies to maintain accurate web records, including files, URLs and domains. Such records are also useful to the organization itself, for example as important evidence in the event of a legal dispute.

Marketing industry

Today, in addition to their traditional brand assets, such as print advertising, the world’s leading brands also create and distribute extensive content on the Internet. This has made brand archiving increasingly important, not only for preserving brand heritage but also for keeping an accurate record of which products customers liked at any given time and which strategies worked.

Brands often use preservation software in other ways as well. For example, a searchable archive of digital copies is readily used to inspire the next generation of marketers, giving them access to the Internet archive and enabling them to rediscover the brand’s digital heritage.

General public

Many national archives, libraries, and government and university archives store large amounts of data, URLs and domains for cultural and historical reasons. These Internet archives serve as a basis for research by later generations. Because the public sector is investing increasingly in digital channels, organizations are finding ways to expand their repositories by taking advantage of cloud servers, which allow large amounts of data to be stored more efficiently and flexibly. The goal is to make the data accessible to researchers, officials, students and the general public in the future.

What is the difference between archiving and creating a backup?

To begin with, both backups and Internet archives are important for preserving the web infrastructure. Backups serve more as a daily safeguard in case data is lost unexpectedly, such as in a fire. Archiving, on the other hand, serves more to document how a website develops over time:

Backups are data-based. The goal is to preserve the data of a site so that the website can be restored if the worst happens. This can prevent files from being lost.
Archives contain the context of the data. If you browse the archives of your favorite sites, you will find that the functionality is often incomplete. However, the design, such as images, and the static content of the page are mostly intact.

It’s worth noting that archiving web pages doesn’t completely eliminate the need for data retention through backups. Even so, one of the advantages of archiving is that users can browse archived pages as if they were “live”.