Hacker News

EDIT: I see some comments that are already talking about huge data volumes.

To those guys: simply take a look at Google, or Amazon, or the credit-card processors. Lots of places deal with huge datasets in a peer-reviewed, transparent manner. Of course, use common sense here.



In what meaningful sense do Google or Amazon have peer-reviewed and transparent data sets?

Please be specific.


I'll pick on Google, and use memory.

1) Google is notorious for peer-reviewing code.

2) IIRC, Google also has some functional/provable code.

In general, large, critical datasets are routinely processed by businesses that use separate testing groups, code reviews, and tiger teams to validate each other's work. None of this is new information (at least to me).

Of course, transparency doesn't extend beyond the organization's walls in many instances. If you thought I was saying that it did then I mis-communicated.


However, in many instances, transparency does extend well beyond an organization's boundaries, into the public domain.

Many scientific and business code bases are open source. Since the advent of portable languages and VM-based back ends, the practice of open-sourcing software has increased, not decreased. Many algorithms themselves are published and peer-reviewed: see Cormen's Introduction to Algorithms, Knuth's books, Sedgewick's books, or the Stony Brook Algorithm Repository. Scientific software has been especially drawn to open source. From the beginning, scientific software efforts focused on increasing accuracy and minimizing error; start from Anthony Ralston's books and Richard Hamming's book, for instance.
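As a small, concrete illustration of that accuracy-first tradition in numerical computing (the kind of technique the Ralston/Hamming-era texts teach), here is a sketch of Kahan's compensated summation, which tracks and re-injects the rounding error that naive floating-point summation silently discards. The algorithm is standard; the specific values below are just an illustrative example, not from the thread:

```python
def kahan_sum(values):
    """Compensated (Kahan) summation: keeps a running correction term
    for the low-order bits lost to rounding at each addition."""
    total = 0.0
    c = 0.0  # running compensation for lost low-order bits
    for x in values:
        y = x - c            # apply the correction from the last step
        t = total + y        # big + small: low-order bits of y are lost here
        c = (t - total) - y  # recover (algebraically) what was lost
        total = t
    return total

# Adding a million tiny terms to 1.0 one at a time: each 1e-16 is below
# double precision's resolution near 1.0, so a naive sum never moves.
vals = [1.0] + [1e-16] * 10**6
naive = sum(vals)            # stays at 1.0
compensated = kahan_sum(vals)  # recovers approximately 1.0 + 1e-10
```

The point of the example is that getting numerically trustworthy answers out of large computations is a discipline with its own published, reviewable techniques, exactly the body of work those books document.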

Peer review is available and performed at numerous levels. The original criticism that "no one" does peer review is overdrawn, misleading, and, I would say, unfounded.


Yes, you did miscommunicate; and even allowing for the miscommunication, you're making an extremely poorly considered analogy.

The kind of peer review that matters for purposes of scientific integrity is review by outsiders; e.g., a paper is "peer-reviewed" when experts in the field who were not involved in writing it or conducting its research give it a going-over and see whether it appears to hold up.

The kind of transparency that matters for purposes of scientific integrity is making data available as-is to outsiders, so that they can meaningfully replicate your results ab initio (or, perhaps, not!).

Neither Google nor Amazon conducts meaningful amounts of peer review in the scientific sense, nor is either transparent in the scientific sense (nor should they be; the last thing I want is any old anyone seeing the raw data backing someone else's Gmail account or search history).

So you're making a useless assertion in the context of the issue at hand: neither Google nor Amazon does much "peer review" (to my knowledge, Microsoft in fact does to some limited extent, with shared source and by hiring third-party auditors for some important code chunks), and neither is "transparent".

In this thread you'll find multiple people speaking from positions of actual experience working on large-scale endeavors of scientific computation who've commented at length upon their internal practices.

I'm not going to repeat what they've typed, but if you read those comments you will see that at least those commenting here were in fact engaging in what you're calling "peer review" and "transparency" in their development practices. Their reports match my experience with scientific computation on large annual budgets, though I won't claim any authority for my anecdotal experiences.

I will close by posting a protip here; you might benefit from it, but it's not specifically aimed at you.

If a blog with a name like "chicagoboyz" makes a bold, sweeping, and somewhat shocking assertion about an entire area of human endeavor -- one it has no plausible claim to expertise in (it's an econ blog, not a large-scale-computation blog, and no claim of direct experience was made that I saw) -- and you find yourself nodding your head and thinking "yeah, that sounds plausible", proceed to do the following:

- slow down, step away from the computer, and count backwards from 20 in Greek

- ask yourself: do I have any concrete knowledge, at all, about the area in which this claim is being made? Any involvement in a project in that area, or a business involving that area, etc.? (In this case: do you have any direct experience with large-scale computational efforts in science? Do you know anyone who's been involved in such a project? Anything beyond the flamebait du jour?)

- if you do have concrete knowledge: great, you have at least some nonzero evidence base against which your initial "yeah, that's right" feeling may or may not be substantiated. Think carefully about what you already know and see whether the feeling holds up.

- if you don't have any concrete knowledge: you've given yourself an awesome opportunity for self-discovery and personal growth. Clearly something makes you want to uncritically believe this specific sweeping claim about an area you know literally nothing concrete about. We generally consider people who believe sweeping claims without evidence to be suckers, and we've found an area where your preexisting biases leave you a sucker, and therefore at the mercy of others. You might still have the right intuition about the sweeping claim, but at least take the opportunity to de-suckerify yourself on this front before drawing your conclusion.



