Unfortunately, there is not much we can do about transfer-encoding, but the data is otherwise exactly as is returned from the browser. Browsertrix uses the browser to create web archives, so users get an accurate representation of what they see in their browser, which is generally what people want from archives.
We do the best we can with a limited standard that is difficult to modify. Archiving is always lossy; we try to reduce that as much as possible, but there are limits. People create web archives because they care about not losing their stuff online, not because they need an accurate record of the transfer-encoding property of an HTTP connection. If storing the transfer-encoding is the most important thing, then yes, there are better tools for that.
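To make the transfer-encoding point concrete, here is a minimal sketch (not Webrecorder code) of what decoding a chunked transfer-encoded body involves: the chunk-size lines and CRLF framing exist only on the wire, so once a tool decodes them before storage, that framing is gone by design.

```python
# Illustrative sketch: decode an HTTP/1.1 chunked transfer-encoded body.
# The framing (hex chunk sizes, CRLFs) is wire-level detail that archiving
# tools typically strip before storing the payload.

def decode_chunked(raw: bytes) -> bytes:
    """Return the payload of a chunked transfer-encoded body."""
    payload = b""
    pos = 0
    while True:
        end = raw.index(b"\r\n", pos)
        size = int(raw[pos:end].split(b";")[0], 16)  # chunk size is hex
        if size == 0:  # terminal zero-length chunk
            break
        start = end + 2
        payload += raw[start:start + size]
        pos = start + size + 2  # skip the CRLF that trails each chunk
    return payload

wire = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
print(decode_chunked(wire))  # b'Wikipedia'
```

The stored record keeps `Wikipedia`; the `4\r\n...` framing is exactly the kind of wire detail that does not survive, and usually should not.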
Every archiving tool out there makes trade-offs about what is archived and how. No one preserves the raw TLS-encrypted H3 traffic, because that's not useful. When you browse through an archiving MITM proxy, there are different trade-offs: there's an extra HTTP connection involved (that's not stored), a fake MITM cert, and a downgrade of the H2/H3 connection to HTTP/1 (some sites serve different content over H2 than HTTP/1.1, or can detect the downgrade, etc.).
The web is best-effort, and so is archiving the web.
This isn't really true; our tools do not just modify response data for no reason!
Our tools do the best that we can with an old format that is in use by many institutions. The WARC format does not account for H2/H3 traffic, which most sites use nowadays.
The goal of our (Webrecorder) tools is to preserve interactive web content with as much fidelity as possible and make it accessible/viewable in the browser. That means stripping TLS, H2/H3, sometimes forcing a certain video resolution, etc., while preserving the authenticity and interactivity of the site. It can be a tricky balance.
If the goal is to preserve 'raw bytes sent over the network' you can use Wireshark / packet capture, but your archive won't necessarily be useful to a human.
imo the Webrecorder stuff is truly state of the art, if they're pushing the limits of WARC standards it's for good reason, and I trust their judgement. They pioneered the newer WACZ standard and are really pushing the whole field forward.
If you have other extensions, disable all of them before trying to record.
Content from other extensions can't be accessed due to recent security changes in Chromium, and can cause that error. We'll add it to the Common Errors Page and fix the links. Thanks!
A bit late to this thread, but I think WARC is a reasonable format for raw HTTP traffic. We should definitely have better tools to ensure WARC files produced are valid, and that's one of the things we build at Webrecorder.
Unless you're crawling really text-heavy content, most of the WARC data is binary content that doesn't really need to be in a db. However, sqlite or another database as a replacement for CDX is an appealing option, where WARC files can remain static data at rest and derived data (offsets, full-text indexes, etc.) can be put into a db.
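A minimal sketch of that split, keeping WARCs static and putting only the derived index in sqlite. The table and column names below are illustrative, not an actual Webrecorder schema; the index plays the same role CDX does, mapping a URL to a file/offset/length where the record lives.

```python
# Illustrative sketch: a sqlite index over static WARC files, standing in
# for CDX. The WARCs stay untouched on disk; only derived lookup data
# (URL -> warc file, offset, length) lives in the database.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE cdx (
    url TEXT, timestamp TEXT, warc_file TEXT, offset INTEGER, length INTEGER)""")
con.execute("CREATE INDEX idx_url ON cdx (url, timestamp)")

# Index two hypothetical captures.
con.executemany("INSERT INTO cdx VALUES (?, ?, ?, ?, ?)", [
    ("example.com/", "20210101000000", "data.warc.gz", 0, 1200),
    ("example.com/page", "20210101000100", "data.warc.gz", 1200, 800),
])

# Lookup: where in which WARC does a capture of this URL live?
row = con.execute(
    "SELECT warc_file, offset, length FROM cdx WHERE url = ?",
    ("example.com/page",)).fetchone()
print(row)  # ('data.warc.gz', 1200, 800)
```

The payoff is that the db can be rebuilt from the WARCs at any time, so the archival data at rest never depends on the database surviving.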
We are experimenting with a new format, WACZ, which bundles WARC files into a ZIP, while adding CDXJ and exploring sqlite as an option for full-text search.
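Because WACZ is at heart a ZIP bundling WARCs plus index files, standard zip tooling can inspect one. The toy bundle below is illustrative (the `archive/` and `indexes/` paths follow the published layout, but the contents here are made up):

```python
# Illustrative sketch: build and inspect a tiny WACZ-like ZIP bundle.
# Real WACZ files contain complete WARCs and CDXJ indexes; these members
# are placeholders to show the structure.
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("archive/data.warc.gz", b"...warc bytes...")
    z.writestr("indexes/index.cdx", b'com,example)/ 20210101 {"offset": 0}\n')

with zipfile.ZipFile(buf) as z:
    names = z.namelist()
    cdx = z.read("indexes/index.cdx").decode()

print(names)  # ['archive/data.warc.gz', 'indexes/index.cdx']
```

Since ZIP keeps its central directory at the end of the file, a reader can also fetch members on demand with HTTP range requests instead of downloading the whole bundle, which is what makes the format friendly to large archives.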
I agree that it's better to build on solid, existing formats that can be validated, especially when large amounts of data are concerned!
It should probably be possible to achieve this functionality in Firefox using that API, but it would unfortunately require a new implementation that uses WebRequest instead of CDP. Probably worth looking into, though!
The archive replay using ReplayWeb.page should work in Firefox and Safari.
Edit: Another limitation on Firefox is lack of service worker support from extensions origins (https://bugzilla.mozilla.org/show_bug.cgi?id=1344561).
This is needed to view what you've archived in the extension. Would need to work around this issue somehow until that is supported, so probably a bit of work, unfortunately.
ArchiveWeb.page is the latest tool from the Webrecorder project. It works as a Chrome (or Brave, or Edge) browser extension or as a standalone Electron app.
There's also an experimental IPFS-sharing feature, which will share whatever was archived via IPFS, which works best in Brave (due to native IPFS support) or the Electron app.
Web archives can also be exported in a random-access friendly format (WACZ, https://github.com/webrecorder/wacz-format) which contains standard WARCs, allowing large archives to be loaded on-demand.
The extension or app is needed to create the web archives, but once created, the archives are accessible/viewable in any modern browser supporting service workers.
I'm excited to share the latest update from the Webrecorder project: the release of a new OldWeb.today (https://oldweb.today), now running emulated browsers using JS/WASM entirely in your browser!
The IE and Netscape browsers support older versions of Flash and Java. For good measure, there is also an option to run just the Ruffle emulator (discussed at: https://news.ycombinator.com/item?id=25242115) for Flash-only emulation in your current browser.
The system can browse live web pages as well as archived pages from Internet Archive's Wayback Machine.
OldWeb.today can be deployed as a static site and connected to other archives as well. (Only a server-side CORS proxy is needed to connect to external websites or archives.)
In this mode, the browsers run remotely and stream the video to your browser. We are still working on audio support but hope to have it soon. It should be possible to archive Flash in this way, though we could use more help/research in this area.
Once the last version of Flash is released, we'll include the latest browsers that can run it.
If the Mozilla Shumway project or similar picks up again (we hope), Webrecorder can integrate that as well to offer native JS-based Flash recording and replay.
If anyone is interested in helping out, let us know!
It might be possible to also run Flash in a mini VM. Who knows, maybe someone will figure out playing Flash in WebAssembly in a "legacy mode" which disables a vast swath of file and other security issues. Introducing sandboxing could be one angle, but I'm sure someone far more experienced has already thought of this.
Unfortunately, this approach alone will only work for sites that are mostly static, e.g., do not use JS to load dynamic content. That is a small (and shrinking) percent of the web. Once JS is involved, all bets are off -- JS will attempt to load content via ajax, or generate new html, load iframes, etc., and you will have 'live leaks' where the content seems to be coming from the archive but is actually coming from the live web.
Here is an example from archiving the nytimes.com home page:
If you look at the network traffic (the domain column in devtools), you'll see that only a small percentage is coming from archive.tesoro.io -- the rest of the content is loaded from the live web. This can be misleading and possibly a security risk as well.
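The check described above can be sketched in a few lines: given the URLs a replayed page actually requested (as seen in devtools or a HAR export), classify each as served from the archive host or leaking to the live web. The hostnames and function name here are illustrative.

```python
# Illustrative sketch of live-leak detection: any request whose host is not
# the archive's host is content sneaking in from the live web.
from urllib.parse import urlparse

ARCHIVE_HOSTS = {"archive.tesoro.io"}  # hypothetical archive host

def find_live_leaks(requested_urls):
    """Return the URLs that were NOT served from the archive."""
    return [u for u in requested_urls
            if urlparse(u).hostname not in ARCHIVE_HOSTS]

urls = [
    "https://archive.tesoro.io/record/https://www.nytimes.com/",
    "https://www.nytimes.com/vi-assets/static-assets/app.js",  # a leak
]
print(find_live_leaks(urls))
```

In practice tools like pywb go the other way around: they rewrite URLs at capture/replay time so that requests never reach the live web in the first place, rather than just detecting them after the fact.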
Not to discourage you, but this is a hard problem, and I've been working on it for years now. This area is a moving target, but we think live leaks are mostly eliminated in Webrecorder and pywb, although there are lots of areas to work on to maintain high-fidelity preservation.
If you want chat about possible solutions or want to collaborate (we're always looking for contributors!), feel free to reach out to us at support [at] webrecorder.io or find my contact on GH.