Unfortunately, there is not much we can do about transfer-encoding, but the data is otherwise exactly as is returned from the browser. Browsertrix uses the browser to create web archives, so users get an accurate representation of what they see in their browser, which is generally what people want from archives.
We do the best we can with a limited standard that is difficult to modify. Archiving is always lossy; we try to reduce that as much as possible, but there are limits. People create web archives because they care about not losing their stuff online, not because they need an accurate record of the transfer-encoding property of an HTTP connection. If storing the transfer-encoding is the most important thing, then yes, there are better tools for that.
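To make the transfer-encoding point concrete, here is a minimal sketch (not Webrecorder code) of what decoding a chunked transfer-encoded body involves: the chunk-size lines and CRLF framing exist only on the wire, so once a tool decodes them before storage, that framing is gone by design.

```python
# Illustrative sketch: decode an HTTP/1.1 chunked transfer-encoded body.
# The framing (hex chunk sizes, CRLFs) is wire-level detail that archiving
# tools typically strip before storing the payload.

def decode_chunked(raw: bytes) -> bytes:
    """Return the payload of a chunked transfer-encoded body."""
    payload = b""
    pos = 0
    while True:
        end = raw.index(b"\r\n", pos)
        size = int(raw[pos:end].split(b";")[0], 16)  # chunk size is hex
        if size == 0:  # terminal zero-length chunk
            break
        start = end + 2
        payload += raw[start:start + size]
        pos = start + size + 2  # skip the CRLF that trails each chunk
    return payload

wire = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
print(decode_chunked(wire))  # b'Wikipedia'
```

The stored record keeps `Wikipedia`; the `4\r\n...` framing is exactly the kind of wire detail that does not survive, and usually should not.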
Every archiving tool out there makes trade-offs about what is archived and how. No one preserves the raw TLS-encrypted H3 traffic, because that's not useful. When you browse through an archiving MITM proxy, there are different trade-offs: there's an extra HTTP connection involved (that's not stored), a fake MITM cert, and a downgrade of the H2/H3 connection to HTTP/1 (some sites serve different content over H2 than HTTP/1.1, or can detect the downgrade, etc.).
The web is best-effort, and so is archiving the web.
This isn't really true; our tools do not just modify response data for no reason!
Our tools do the best that we can with an old format that is in use by many institutions. The WARC format does not account for H2/H3 traffic, which most sites use nowadays.
The goal of our (Webrecorder) tools is to preserve interactive web content with as much fidelity as possible and make it accessible/viewable in the browser. That means stripping TLS, H2/H3, sometimes forcing a certain video resolution, etc., while preserving the authenticity and interactivity of the site. It can be a tricky balance.
If the goal is to preserve 'raw bytes sent over the network' you can use Wireshark / packet capture, but your archive won't necessarily be useful to a human.
imo the Webrecorder stuff is truly state of the art, if they're pushing the limits of WARC standards it's for good reason, and I trust their judgement. They pioneered the newer WACZ standard and are really pushing the whole field forward.
If you have other extensions, disable all of them before trying to record.
Content from other extensions can't be accessed due to recent security changes in Chromium, and can cause that error. We'll add it to the Common Errors Page and fix the links. Thanks!
A bit late to this thread, but I think WARC is a reasonable format for raw HTTP traffic. We should definitely have better tools to ensure WARC files produced are valid, and that's one of the things we build at Webrecorder.
Unless you're crawling really text-heavy content, most of the WARC data is binary content that doesn't really need to be in a db. However, sqlite or another database as a replacement for CDX is an appealing option, where WARC files can remain static data at rest and derived data (offsets, full-text indexes, etc.) can be put into a db.
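A minimal sketch of that split, keeping WARCs static and putting only the derived index in sqlite. The table and column names below are illustrative, not an actual Webrecorder schema; the index plays the same role CDX does, mapping a URL to a file/offset/length where the record lives.

```python
# Illustrative sketch: a sqlite index over static WARC files, standing in
# for CDX. The WARCs stay untouched on disk; only derived lookup data
# (URL -> warc file, offset, length) lives in the database.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE cdx (
    url TEXT, timestamp TEXT, warc_file TEXT, offset INTEGER, length INTEGER)""")
con.execute("CREATE INDEX idx_url ON cdx (url, timestamp)")

# Index two hypothetical captures.
con.executemany("INSERT INTO cdx VALUES (?, ?, ?, ?, ?)", [
    ("example.com/", "20210101000000", "data.warc.gz", 0, 1200),
    ("example.com/page", "20210101000100", "data.warc.gz", 1200, 800),
])

# Lookup: where in which WARC does a capture of this URL live?
row = con.execute(
    "SELECT warc_file, offset, length FROM cdx WHERE url = ?",
    ("example.com/page",)).fetchone()
print(row)  # ('data.warc.gz', 1200, 800)
```

The payoff is that the db can be rebuilt from the WARCs at any time, so the archival data at rest never depends on the database surviving.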
We are experimenting with a new format, WACZ, which bundles WARC files into a ZIP, while adding CDXJ and exploring sqlite as an option for full-text search.
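Because WACZ is at heart a ZIP bundling WARCs plus index files, standard zip tooling can inspect one. The toy bundle below is illustrative (the `archive/` and `indexes/` paths follow the published layout, but the contents here are made up):

```python
# Illustrative sketch: build and inspect a tiny WACZ-like ZIP bundle.
# Real WACZ files contain complete WARCs and CDXJ indexes; these members
# are placeholders to show the structure.
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("archive/data.warc.gz", b"...warc bytes...")
    z.writestr("indexes/index.cdx", b'com,example)/ 20210101 {"offset": 0}\n')

with zipfile.ZipFile(buf) as z:
    names = z.namelist()
    cdx = z.read("indexes/index.cdx").decode()

print(names)  # ['archive/data.warc.gz', 'indexes/index.cdx']
```

Since ZIP keeps its central directory at the end of the file, a reader can also fetch members on demand with HTTP range requests instead of downloading the whole bundle, which is what makes the format friendly to large archives.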
I agree that it's better to build on solid, existing formats that can be validated, especially when large amounts of data are concerned!
It should probably be possible to achieve this functionality in Firefox using that API, but it would unfortunately require a new implementation that uses WebRequest instead of CDP. Probably worth looking into, though!
The archive replay using ReplayWeb.page should work in Firefox and Safari.
Edit: Another limitation on Firefox is lack of service worker support from extensions origins (https://bugzilla.mozilla.org/show_bug.cgi?id=1344561).
This is needed to view what you've archived in the extension. Would need to work around this issue somehow until that is supported, so probably a bit of work, unfortunately.
ArchiveWeb.page is the latest tool from the Webrecorder project. It works as a Chrome (or Brave, or Edge) browser extension or as a standalone Electron app.
There's also an experimental IPFS-sharing feature, which will share whatever was archived via IPFS, which works best in Brave (due to native IPFS support) or the Electron app.
Web archives can also be exported in a random-access friendly format (WACZ, https://github.com/webrecorder/wacz-format) which contains standard WARCs, allowing large archives to be loaded on-demand.
The extension or app is needed to create the web archives, but once created, the archives are accessible/viewable in any modern browser supporting service workers.
I'm excited to share the latest update from the Webrecorder project: the release of a new OldWeb.today (https://oldweb.today), now running emulated browsers using JS/WASM entirely in your browser!
The IE and Netscape browsers support older versions of Flash and Java. For good measure, there is also an option to run just the Ruffle emulator (discussed at: https://news.ycombinator.com/item?id=25242115) for Flash-only emulation in your current browser.
The system can browse live web pages as well as archived pages from Internet Archive's Wayback Machine.
OldWeb.today can be deployed as a static site and connected to other archives as well. (Only a server-side CORS proxy is needed to connect to external websites or archives.)
In this mode, the browsers run remotely and stream the video to your browser. We are still working on audio support but hope to have it soon. It should be possible to archive Flash in this way, though we could use more help/research in this area.
Once the last version of Flash is released, we'll include the latest browsers that can run it.
If the Mozilla Shumway project or similar picks up again (we hope), Webrecorder can integrate that as well to offer native JS-based Flash recording and replay.
If anyone is interested in helping out, let us know!
It might be possible to also run Flash in a mini VM. Who knows, maybe someone will figure out playing Flash in WebAssembly in a "legacy mode" which disables a vast swath of file and other security issues. Introducing sandboxing could be one angle, but I'm sure someone far more experienced has already thought of this.
Unfortunately, this approach alone will only work for sites that are mostly static, e.g., do not use JS to load dynamic content. That is a small (and shrinking) percent of the web. Once JS is involved, all bets are off -- JS will attempt to load content via ajax, or generate new html, load iframes, etc., and you will have 'live leaks' where the content seems to be coming from the archive but is actually coming from the live web.
Here is an example from archiving the nytimes.com home page:
If you look at the network traffic (the domain column in devtools), you'll see that only a small percentage is coming from archive.tesoro.io -- the rest of the content is loaded from the live web. This can be misleading and possibly a security risk as well.
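The check described above can be sketched in a few lines: given the URLs a replayed page actually requested (as seen in devtools or a HAR export), classify each as served from the archive host or leaking to the live web. The hostnames and function name here are illustrative.

```python
# Illustrative sketch of live-leak detection: any request whose host is not
# the archive's host is content sneaking in from the live web.
from urllib.parse import urlparse

ARCHIVE_HOSTS = {"archive.tesoro.io"}  # hypothetical archive host

def find_live_leaks(requested_urls):
    """Return the URLs that were NOT served from the archive."""
    return [u for u in requested_urls
            if urlparse(u).hostname not in ARCHIVE_HOSTS]

urls = [
    "https://archive.tesoro.io/record/https://www.nytimes.com/",
    "https://www.nytimes.com/vi-assets/static-assets/app.js",  # a leak
]
print(find_live_leaks(urls))
```

In practice tools like pywb go the other way around: they rewrite URLs at capture/replay time so that requests never reach the live web in the first place, rather than just detecting them after the fact.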
Not to discourage you, but this is a hard problem, and I've been working on it for years now. This area is a moving target, but we think live leaks are mostly eliminated in Webrecorder and pywb, although there are lots of areas to work on to maintain high-fidelity preservation.
If you want chat about possible solutions or want to collaborate (we're always looking for contributors!), feel free to reach out to us at support [at] webrecorder.io or find my contact on GH.