ABD338: MirrorWeb: Powering Large-scale, Full-text Search for the UK Government Web Archives Using Amazon Elasticsearch Service - a podcast by AWS

from 2021-01-31T22:10:42.023393


MirrorWeb offers automated website and social media archiving services with full-text search across all archived content. The UK government hired MirrorWeb to provide search across 20 years of archived data from over 4,800 websites. In this session, MirrorWeb discusses the technology stack they built on Amazon Elasticsearch Service (Amazon ES) to search 333 million unique documents (over 120 TB), which they indexed within a 10-hour period. They explain how they moved the data from on-premises storage to Amazon S3 using AWS Snowball, then processed it on Amazon EC2 Spot Instances, reducing costs by over 90%. They also describe how they used AWS Lambda to ingest data into Amazon ES. Finally, they share best practices for building a large-scale document search architecture.
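
For a sense of scale, the figures quoted above imply a sustained indexing rate that can be estimated with simple arithmetic. The sketch below is only a back-of-envelope calculation, not anything taken from the session: the function names are hypothetical, it assumes the full 10 hours were used, and it treats "120 TB" as decimal terabytes covering exactly the 333 million indexed documents, so both results are rough bounds.

```python
# Back-of-envelope estimate from the figures in the session description.
# Assumptions (not from the source): the full 10 hours were used,
# 1 TB = 10**12 bytes, and the 120 TB figure covers the indexed documents.

DOCUMENTS = 333_000_000      # unique documents indexed
WINDOW_HOURS = 10            # indexing window ("within a 10-hour period")
CORPUS_BYTES = 120 * 10**12  # "over 120 TB", so this is a lower bound


def docs_per_second(documents: int, hours: float) -> float:
    """Average number of documents indexed per second over the window."""
    return documents / (hours * 3600)


def average_document_bytes(total_bytes: int, documents: int) -> float:
    """Average size of one document in bytes."""
    return total_bytes / documents


if __name__ == "__main__":
    rate = docs_per_second(DOCUMENTS, WINDOW_HOURS)
    size = average_document_bytes(CORPUS_BYTES, DOCUMENTS)
    print(f"Average indexing rate: {rate:,.0f} documents/second")
    print(f"Average document size: {size / 1000:,.0f} kB (lower bound)")
```

Under these assumptions the average rate works out to 9,250 documents per second. That sustained rate is consistent with the description's use of parallel processing on Spot Instances and Lambda-based ingestion.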

Further episodes of AWS re:Invent 2017
