Back up a website to S3
Clone a running website to S3 and CloudFront.
The purpose of this Python script is to back up or migrate all pages from a running website to an S3 bucket, to be served by a CloudFront distribution.
My use case is simple: I want an inexpensive backup of my website in case the web server has any issues. One of the least expensive ways to run a website nowadays is a web-enabled S3 bucket. Not only is it dirt cheap, it is also extremely scalable.
And when I say "web-enabled S3 bucket": nobody exposes the bucket directly anymore (please don't); put a CloudFront distribution in front of it instead (also super scalable and inexpensive).
To get that part started, you can use my CloudFormation template, see CloudFront and S3 Bucket CloudFormation Stack.
The Code
The Libraries
For this, we'll use the following Python libraries:
pip3 install --upgrade django-dotenv beautifulsoup4 lxml boto3 requests
The Variables
We'll read the AWS-specific variables from a ".env" file (make sure to add it to your .gitignore file so it is never committed).
Since I want the S3 website to be a failover for my real website (requiring only a DNS change to make it "live"), the bucket_name variable serves both as the hostname of the original website to pull the files from and as the name of the S3 bucket. Your use case may require different names.
The extra_files variable is a list of additional files to back up to S3 that may not be included in the sitemap.xml file.
# Imports used by the snippets in this article
import os
import mimetypes
import urllib.parse
import boto3
import dotenv  # provided by the django-dotenv package
import requests
from bs4 import BeautifulSoup

dotenv.read_dotenv()
backup_region_name = os.environ.get("backup_region_name", "")
backup_aws_access_key_id = os.environ.get("backup_aws_access_key_id", "")
backup_aws_secret_access_key = os.environ.get("backup_aws_secret_access_key", "")

html_mime_type = 'text/html; charset=utf-8'
bucket_name = 'www.example.com'
extra_files = ['/', '/robots.txt', '/sitemap.xml', 'favicon.ico']

s3 = boto3.resource(
    's3',
    region_name=backup_region_name,
    aws_access_key_id=backup_aws_access_key_id,
    aws_secret_access_key=backup_aws_secret_access_key,
)
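For reference, the ".env" file is simply a set of key=value pairs matching the variable names read above. The values shown here are placeholders; substitute your own region and credentials:
backup_region_name=us-east-1
backup_aws_access_key_id=YOUR_ACCESS_KEY_ID
backup_aws_secret_access_key=YOUR_SECRET_ACCESS_KEY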
Getting a list of web pages
To get a list of all the web pages to pull from the original website, I decided to use the site's sitemap.xml file, the simplest option. While it may not be 100% complete, it should be one of the most up-to-date sources available.
The purpose of the get_sitemap function is to read the sitemap.xml file from the website and enumerate all of the <loc> URIs. It reads each page and, if the return code is 200, calls the SaveFile function to save the content to S3.
Since we are already parsing each page with Beautiful Soup, the function also adds any .css file it finds to the extra_files list, to be retrieved later.
def get_sitemap(url):
    global extra_files
    full_url = f"https://{url}/sitemap.xml"
    with requests.Session() as req:
        r = req.get(full_url)
        soup = BeautifulSoup(r.content, 'lxml')
        links = [item.text for item in soup.select("loc")]
        for link in links:
            r = req.get(link)
            if r.status_code == 200:
                html_content = r.content
            else:
                print(f'\033[1;31;1m{link} {r.status_code}')
                continue
            soup = BeautifulSoup(r.content, 'html.parser')
            SaveFile(link, r.content, html_mime_type, soup.html["lang"])
            # Get all CSS links
            for css in soup.findAll("link", rel="stylesheet"):
                if css['href'] not in extra_files:
                    print('\033[1;37;1m', "Found the URL:", css['href'])
                    extra_files.append(css['href'])
    return
Saving web pages to S3
The purpose of the SaveFile function is to save the web pages, images, .css files, or any other files to S3. I chose the "REDUCED_REDUNDANCY" storage class to reduce cost; adjust it to your needs.
Since I have a multi-lingual site, I use the 'text/html; charset=utf-8' MIME type, and I also try to read the language of the HTML file so that I can set it on the S3 object. The script also URL-decodes the page names into proper 'utf-8' strings for the S3 object keys.
My website doesn't use the ".html" extension but instead appends a "/" to the end of the web page name. In S3, this translates into a "folder" with the name of the web page, containing an object named "/" that holds the HTML content.
The root page of the website is saved as 'index.html' because that is the default root object configured in my CloudFront distribution.
def SaveFile(file_name, file_content, mime_type, lang):
    global bucket_name
    my_url = urllib.parse.unquote(file_name, encoding='utf-8', errors='replace')
    my_path = urllib.parse.urlparse(my_url).path
    if my_path == '/':
        my_path = 'index.html'
    if my_path.startswith('/'):
        my_path = my_path[1:]
    print(f'\033[1;32;1m{file_name} -> {my_path} {lang}')
    bucket = s3.Bucket(bucket_name)
    if lang is not None:
        bucket.put_object(Key=my_path, Body=file_content, ContentType=mime_type, StorageClass='REDUCED_REDUNDANCY', CacheControl='max-age=0', ContentLanguage=lang)
    else:
        bucket.put_object(Key=my_path, Body=file_content, ContentType=mime_type, StorageClass='REDUCED_REDUNDANCY', CacheControl='max-age=0')
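For example, here is how a few hypothetical URLs would be mapped to S3 object keys by the logic above:
https://www.example.com/                  ->  index.html
https://www.example.com/articles/my-page/ ->  articles/my-page/
https://www.example.com/favicon.ico       ->  favicon.ico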
Saving additional files
The purpose of the get_others function is to retrieve the "extra" files, the ones that are not included in the sitemap file, such as '/robots.txt', '/sitemap.xml', 'favicon.ico', and any stylesheets discovered earlier. We use the mimetypes library to guess each file's MIME type and set the proper Content-Type on the S3 object, and we call the same SaveFile function to save the files to S3.
def get_others(url):
    global extra_files
    with requests.Session() as req:
        for file in extra_files:
            my_url = requests.compat.urljoin(f"https://{url}", file)
            # Get MIME type using guess_type
            mime_type, encoding = mimetypes.guess_type(my_url)
            if mime_type is None:
                mime_type = html_mime_type
            print("\033[1;37;1mMIME Type:", mime_type)
            r = req.get(my_url)
            if r.status_code == 200:
                if mime_type == html_mime_type:
                    soup = BeautifulSoup(r.content, 'html.parser')
                    lang = soup.html["lang"]
                else:
                    lang = None
                SaveFile(my_url, r.content, mime_type, lang)
            else:
                print(f'\033[1;31;1m{my_url} {r.status_code}')
                continue
    return
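Finally, the two functions need to be called. Here is a minimal sketch of how they could be wired together (the complete script linked below may differ). Note that get_sitemap must run first so that the stylesheets it discovers are added to extra_files before get_others uploads them; bucket_name doubles as the site's hostname, as explained in "The Variables" above.
if __name__ == '__main__':
    get_sitemap(bucket_name)  # crawl every page listed in sitemap.xml
    get_others(bucket_name)   # then upload robots.txt, favicon.ico, CSS, and the other extras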
The Finished Code
You can find the complete source code for this article at https://github.com/Christophe-Gauge/python/blob/main/backup_website.py.
This script may not be able to handle all of your use cases, but it is hopefully a good start. Please comment below and/or submit pull requests if you improve it!