Backup a website to S3

Clone a running website to S3 and CloudFront.

The purpose of this Python script is to back up or migrate all pages from a running website to an S3 bucket, to be served by a CloudFront distribution.

My use case is simple: I want an inexpensive backup of my website in case the web server ever has issues. One of the least expensive ways to run a website nowadays is a web-enabled S3 bucket. Not only is it dirt cheap, it is also extremely scalable.

When I say "web-enabled S3 bucket", I don't mean exposing the bucket directly to the web (people don't do that anymore, and please don't); put a CloudFront distribution in front of it (also super scalable and inexpensive).

To get that part started, you can use my CloudFormation template; see CloudFront and S3 Bucket CloudFormation Stack.
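
If you prefer to script that part as well, a stack based on such a template could be created with boto3's CloudFormation client along these lines (a rough sketch, not from the article; the template file name, stack name, and required capabilities are placeholders that depend on the actual template):

import boto3

cfn = boto3.client('cloudformation')

# Hypothetical local copy of the CloudFront + S3 template from the linked article
with open('cloudfront-s3-stack.yaml') as f:
    template_body = f.read()

cfn.create_stack(
    StackName='website-backup-stack',   # placeholder stack name
    TemplateBody=template_body,
    Capabilities=['CAPABILITY_IAM'],    # only needed if the template creates IAM resources
)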

The Code

The Libraries

For this, we'll use the following Python libraries:

pip3 install --upgrade django-dotenv beautifulsoup4 lxml requests boto3

The Variables

We'll read the AWS-specific variables from a ".env" file (make sure to keep it out of version control by adding it to your .gitignore file).
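
For reference, the ".env" file is just a set of key=value pairs matching the variable names read in the code below (the values here are placeholders):

backup_region_name=us-east-1
backup_aws_access_key_id=AKIA...
backup_aws_secret_access_key=...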

Since I want the S3 website to be a failover for my real website (so that I only need to re-point the DNS record to make it "live"), the bucket_name variable doubles as both the hostname of the original website I pull the files from and the name of the S3 bucket. Your use case may require a different name.

The extra_files variable is a list of extra files that I want to back up to S3 but that may not be included in the sitemap.xml file.

# Imports used throughout the script
import os
import mimetypes
import urllib.parse

import boto3
import dotenv  # django-dotenv
import requests
from bs4 import BeautifulSoup

dotenv.read_dotenv()
backup_region_name = os.environ.get("backup_region_name", "")
backup_aws_access_key_id = os.environ.get("backup_aws_access_key_id", "")
backup_aws_secret_access_key = os.environ.get("backup_aws_secret_access_key", "")

html_mime_type = 'text/html; charset=utf-8'
bucket_name = 'www.example.com'
extra_files = ['/', '/robots.txt', '/sitemap.xml', 'favicon.ico']

s3 = boto3.resource(
    's3',
    region_name=backup_region_name,
    aws_access_key_id=backup_aws_access_key_id,
    aws_secret_access_key=backup_aws_secret_access_key,
)

Getting a list of web pages

To get the list of web pages to pull from the original website, I decided to use the site's sitemap.xml file, which is the simplest option. While it may not be 100% complete, it should be one of the most up-to-date sources available.

The purpose of the get_sitemap function is to read the sitemap.xml file from the website and enumerate all of the <loc> URIs. It fetches each page and, if the HTTP status code is 200, calls the SaveFile function to save the content to S3.

Since we are already parsing each page with Beautiful Soup, the function also adds any .css files it finds to the extra_files list, to be retrieved later.

def get_sitemap(url):
    global extra_files
    full_url = f"https://{url}/sitemap.xml"
    with requests.Session() as req:
        r = req.get(full_url)
        soup = BeautifulSoup(r.content, 'lxml')
        links = [item.text for item in soup.select("loc")]
        for link in links:
            r = req.get(link)
            if r.status_code != 200:
                print(f'\033[1;31;1m{link} {r.status_code}')
                continue
            soup = BeautifulSoup(r.content, 'html.parser')
            SaveFile(link, r.content, html_mime_type, soup.html.get("lang"))

            # Collect all CSS links so the stylesheets can be backed up as well
            for css in soup.find_all("link", rel="stylesheet"):
                if css['href'] not in extra_files:
                    print('\033[1;37;1m', "Found the URL:", css['href'])
                    extra_files.append(css['href'])
    return

Saving web pages to S3

The purpose of the SaveFile function is to save the web pages, images, .css, or any other files to S3. I chose the "REDUCED_REDUNDANCY" storage class to reduce cost; adjust it to your needs.

Since I have a multi-lingual site, I use the 'text/html; charset=utf-8' MIME type, and I also try to read the language of the HTML file so that I can set it on the S3 object. The script also decodes the URL into a proper UTF-8 key name for the S3 object.

My website doesn't use the ".html" extension but instead appends a "/" to the end of each web page name. In S3, this translates into a "folder" with the name of the web page, containing an object whose key ends in "/" and whose body is the HTML content.

I did name the root page of my website 'index.html' because that is the default root object configured in my CloudFront distribution.

def SaveFile(file_name, file_content, mime_type, lang):
    global bucket_name
    # Decode percent-encoded characters and keep only the URL path as the S3 key
    my_url = urllib.parse.unquote(file_name, encoding='utf-8', errors='replace')
    my_path = urllib.parse.urlparse(my_url).path
    if my_path == '/':
        my_path = 'index.html'
    if my_path.startswith('/'):
        my_path = my_path[1:]
    print(f'\033[1;32;1m{file_name} -> {my_path} {lang}')
    bucket = s3.Bucket(bucket_name)
    if lang is not None:
        bucket.put_object(Key=my_path, Body=file_content, ContentType=mime_type,
                          StorageClass='REDUCED_REDUNDANCY', CacheControl='max-age=0',
                          ContentLanguage=lang)
    else:
        bucket.put_object(Key=my_path, Body=file_content, ContentType=mime_type,
                          StorageClass='REDUCED_REDUNDANCY', CacheControl='max-age=0')
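
To illustrate the URL-to-key conversion described above, here is what SaveFile's path handling does for a hypothetical percent-encoded, non-ASCII page URL (example values, not from the article):

import urllib.parse

url = 'https://www.example.com/blog/caf%C3%A9/'  # hypothetical page URL
decoded = urllib.parse.unquote(url, encoding='utf-8', errors='replace')
path = urllib.parse.urlparse(decoded).path       # '/blog/café/'
key = path[1:] if path.startswith('/') else path
print(key)  # 'blog/café/' -> a "folder" key whose trailing '/' object holds the HTML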

Saving additional files

The purpose of the get_others function is to retrieve the "extra" files, the ones that are not included in the sitemap file. These may include '/robots.txt', '/sitemap.xml', 'favicon.ico', and more. We use the mimetypes library to guess each file's MIME type so that the proper ContentType can be set on the S3 object, and we call the same SaveFile function to save it to S3.

def get_others(url):
    global extra_files
    with requests.Session() as req:
        for file in extra_files:
            my_url = requests.compat.urljoin(f"https://{url}", file)

            # Guess the MIME type from the file extension; default to HTML
            mime_type, _ = mimetypes.guess_type(my_url)
            if mime_type is None:
                mime_type = html_mime_type
            print("\033[1;37;1mMIME Type:", mime_type)

            r = req.get(my_url)
            if r.status_code != 200:
                print(f'\033[1;31;1m{my_url} {r.status_code}')
                continue

            # Only HTML files carry a language attribute
            if mime_type == html_mime_type:
                soup = BeautifulSoup(r.content, 'html.parser')
                lang = soup.html.get("lang")
            else:
                lang = None
            SaveFile(my_url, r.content, mime_type, lang)
    return
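
For reference, a minimal driver tying the two functions together might look like this (a sketch; the actual entry point in the repository linked below may differ):

if __name__ == '__main__':
    # Run get_sitemap first so any stylesheets it discovers are added
    # to extra_files before get_others runs
    get_sitemap(bucket_name)
    get_others(bucket_name)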

The Finished Code

You can find the complete source code for this article at https://github.com/Christophe-Gauge/python/blob/main/backup_website.py.

This script may not be able to handle all of your use cases, but it is hopefully a good start. Please comment below and/or submit pull requests if you improve it!
