Archiving the Internet Before it All Rots Away
Slides and video for my talk about Internet Archiving.
Nick Sweeting (Co-Founder @ https://monadical.com) @theSquashSH / @pirate
About
Could you imagine an internet where all links stopped working after 4 years? All the old blogs from the 90s… gone, all your hot takes on Twitter… gone, all the news and reporting… gone.
Some of that decay is good; no one wants the entire internet to be preserved for eternity. But most of it leads to great content disappearing forever, and to future generations being deprived of access to the most important medium for knowledge of the last half century. If no one worked on preserving that information, the human race would be facing a loss of knowledge many times greater than the burning of the Library of Alexandria.
Luckily, organizations like Archive.org and the International Internet Preservation Consortium work tirelessly every day to save what they can. But archiving doesn’t have to be exclusive to big organizations; we can all play a part by archiving the stuff that matters to us locally. Learn about the internet archiving community, the tools of the trade, and how to save content you care about in this talk!
Slides
Video
- PyCon Colombia 2020 @ Medellin: Video (I talk more about the ethics in this version)
- PyGotham 2019 @ NYC: Video (I talk more about the tech in this version)
- Our Networks 2019 @ Toronto: Video
- RC Never Graduate Week @ NYC: No video available.
Rough Outline
- 2 min: Self intro
  - name, company
  - founded in Colombia
  - poker -> consulting, fully remote in MTL and NYC now
- 5 min: What got me into internet archiving
  - grew up with unreliable internet
  - censored internet
  - hostile environment for journalism and content
  - discovered wget
  - created pocket-archive-stream
- 5 min: Equifax story
  - Equifax breach announced, site launched
  - cloned with pocket-archive-stream
  - rehosted and forgot about it
  - notified of Equifax's misposts
  - goes viral, 2 million hits
  - only 2nd mention of wget in NYTimes history
- 5 min: Intro to internet archiving tooling
  - wget is powerful
  - wget has many options and tunables
  - here are the ones I chose for ArchiveBox
  - demo
- 5 min: Intro to internet archiving ecosystem
  - Why is preserving information important? Why does humanity create libraries and museums?
  - How has it been done so far?
  - What types of archives end up surviving?
  - What are the benefits of decentralized vs. centralized archives?
- 5 min: Why is internet archiving hard
  - Dynamic and interactive content
  - Private and paywalled content
  - Content ID and discovery, Base32 is hard
  - Dealing with the huge amount of data directly vs. curating a smaller amount
  - Archive format longevity tradeoffs (WARC vs. HTML / PDF)
- 5 min: Setting up a Wikipedia clone
  - Set up Kiwix server
  - Download your collections
  - Create an index and rehost it
- 1 min: What can you do today to help save the internet?
  - Joining the ArchiveTeam task force & archive.org community
  - Running a local internet archive
Old outline: https://docs.sweeting.me/s/internet-archiving-talk
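To make the wget portion of the outline a bit more concrete, here is a minimal Python sketch of assembling a single-page archiving command. The flags are all documented GNU wget options, but the particular selection is an assumption for illustration and not necessarily the set shown in the talk or used by ArchiveBox. The names `WGET_ARCHIVE_FLAGS` and `build_wget_command` are hypothetical helpers.

```python
import shlex
from urllib.parse import urlparse

# Assumed flag set: every entry is a documented GNU wget option, but this
# selection is illustrative, not necessarily the one chosen in the talk.
WGET_ARCHIVE_FLAGS = [
    "--no-verbose",        # keep log output short
    "--page-requisites",   # also fetch the images, CSS, and JS the page needs
    "--convert-links",     # rewrite links so the saved copy works offline
    "--adjust-extension",  # add .html/.css extensions where they are missing
    "--span-hosts",        # allow requisites served from other hosts (CDNs)
    "--no-parent",         # never ascend above the starting directory
    "-e", "robots=off",    # ignore robots.txt; weigh the ethics before using
]


def build_wget_command(url, warc_name=None):
    """Return the argument list for archiving one page with wget."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"expected an http(s) URL, got {url!r}")

    cmd = ["wget", *WGET_ARCHIVE_FLAGS]
    if warc_name:
        # Additionally record the raw HTTP traffic in a WARC file.
        cmd.append(f"--warc-file={warc_name}")
    cmd.append(url)
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_wget_command("https://example.com/", warc_name="example")))
```

The function returns a list rather than a string so it can be passed directly to `subprocess.run(cmd, check=True)` without shell quoting problems, assuming wget is installed and on the `PATH`.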
Resources
- Big index of all the software and web archiving communities (ArchiveBox Wiki)
- https://archivebox.io
- https://github.com/pirate/wikipedia-mirror
- https://kiwix.org
- https://archive.org
- https://netpreserve.org
- https://reddit.com/r/ArchiveTeam
- https://sanctum.geek.nz/presentations/web-archiving-with-archivebox.pdf
- https://webrecorder.io
- https://docs.sweeting.me/s/equifax-security-incident