Google has a history of scraping content that they want, their business is built...

anonytrary · on Aug 12, 2020

Probably a silly question, but why not just use robots.txt? That was designed for preventing exactly this.

asutekku · on Aug 12, 2020

I’d say most of Genius’ visitors come from “song x lyrics” searches, so hiding those pages with robots.txt would ultimately make them lose almost all of their traffic.

dewey · on Aug 12, 2020

Not due to robots.txt, but you can see what happens to Genius (formerly Rap Genius) when they get removed from the index:

https://techcrunch.com/2013/12/25/google-rap-genius/

encom · on Aug 12, 2020

Wear a condom before clicking techcrunch.com links:

https://archive.ph/9eUkv

jackdh · on Aug 12, 2020

Outline works well for TC and provides a nicer reading experience. https://outline.com/https://techcrunch.com/2013/12/25/google...

icebraining · on Aug 12, 2020

To be fair to TC, if you disable JavaScript you get a pretty good experience - just the full article, legible. Not like those sites that require JS to load the text and/or images.

wizzwizz4 · on Aug 12, 2020

But you need JavaScript to get past the "we value your privacy, so give us permission to sell it" banner.

Polylactic_acid · on Aug 12, 2020

robots.txt is designed to keep garbage out of search results. It has absolutely no power to prevent a bot from doing anything. Also, if the site added a robots.txt they might as well shut down, because their entire userbase comes from people searching for lyrics on Google.

mikemotherwell · on Aug 12, 2020

Other way around. It was invented to stop crawling. Indexing is still technically allowed even when crawling is blocked by robots.txt.

AgloeDreams · on Aug 12, 2020

The problem is that Google is stealing content and placing it in search results so the user never goes to the source. By blocking it with robots.txt they block themselves from Google results AND Google may keep scraping the content anyway.

jtxx · on Aug 12, 2020

robots.txt isn’t enforced by anything
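
To illustrate, compliance is something the crawler itself chooses to do. A minimal sketch using Python's standard `urllib.robotparser`; the robots.txt contents, the `example.com` URLs, and the bot name `ExampleBot` are all made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /lyrics/
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# A polite crawler asks first and skips disallowed paths.
print(is_allowed("ExampleBot", "https://example.com/lyrics/some-song"))
print(is_allowed("ExampleBot", "https://example.com/about"))
```

Nothing in this check happens on the server side. A bot that never calls `can_fetch` can still request the disallowed URL, which is the sense in which the file is not enforced.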

Avamander · on Aug 12, 2020

They also scrape MusicBrainz, but even if they don't index MusicBrainz at least they donate to it

niknetniko · on Aug 12, 2020

They have a contract with MusicBrainz. They are listed on https://metabrainz.org/supporters/tiers/4.

> The Unicorn tier is for large companies or companies that would like to have a reciprocal relationship with our foundation. If you need special guarantees, indemnities or require us to sign your contract for a data license, please select this tier. If you have another creative idea you would like to propose, please also select the unicorn tier.

> For any of these cases, please detail your request in the company information field and we will work with you to fit your company's mythical situation. We will also find an appropriate monthly support amount to our non-profit foundation of $1500 or more per month. Please always consider enabling the growth of our non-profit foundation and the continuous growth of our metadata!

smabie · on Aug 12, 2020

That's like saying it's ironic that a soldier fights for his life when he tries to kill other people.

It's just the war that is being fought, not some sort of hypocrisy or irony.

harry8 · on Aug 12, 2020

Garbage.

We live in a society of laws. Even soldiers. Google have shown they have no respect for the law nor equality before it, and will cheat while using the law as a cudgel. Recall that law exists so that the strongest might not always get their way. "Ironic" is the polite way of pointing this out.

Without law, Google cease to exist immediately. They are incapable of enforcing property rights without it.

Pardons aside, soldiers go to jail for taking an attitude like Google's.

TAForObvReasons · on Aug 12, 2020

Just like Genius, Google licensed the lyrics. If they didn't, the publishers definitely would have sued.

Ironically, it is Genius that seems to have no respect for copyright law. Genius ended up having to settle a case years ago because they were using lyrics without the appropriate licensing [1].

[1] https://www.nytimes.com/2014/05/07/business/media/rap-genius...

bawolff · on Aug 12, 2020

Which law did Google break? Scraping in and of itself isn't illegal last time I checked, and the USA doesn't have database copyrights, unlike some jurisdictions.

a1369209993 · on Aug 12, 2020

It's blocking scrapers that is (somewhat, per things like the Americans with Disabilities Act) and/or should be (in general) illegal.

(And to head off the obvious: rate-limiting is orthogonal to whether the high-request-rate querient is scraping.)

nl · on Aug 12, 2020

> should be (in general) illegal

But.. it's not illegal?

> somewhat, per things like the Americans with Disabilities Act

This is just not right at all. There is nothing in the Americans with Disabilities Act that makes blocking scrapers illegal.

I think you mean you don't like the power imbalance of the large company taking away from smaller companies while using technological means to stop the same thing happening to them.

I don't like it either, but that doesn't magically make it illegal. I'm not even sure it should be.

a1369209993 · on Aug 12, 2020

> There is nothing in the Americans with Disabilities Act that makes blocking scrapers illegal.

Retrieving, processing, and displaying information in a manner contrary to the wishes of the provider of that information is necessary for accessibility to disabled users. As a specific example, any attempt to block use of wget for scraping also blocks use of wget as part of a `wget | filter | text-to-speech` pipeline[0], and is thus a discrimination against blind or otherwise visually impaired users. The ADA is, as mentioned, only somewhat effective in prohibiting such things, though.
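
As a rough illustration of the `filter` stage in such a pipeline, here is a minimal sketch that assumes the filter only needs to strip markup and keep readable text. The names `TextExtractor` and `html_to_text` are hypothetical, and the sample HTML is invented:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Reduce an HTML document to plain text suitable for a speech engine."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = "<html><body><h1>Title</h1><p>Hello <b>world</b></p><script>x()</script></body></html>"
print(html_to_text(sample))
```

In a real pipeline the function would read the fetched page from standard input and write the text to standard output, so it could sit between the download step and whatever text-to-speech program the user prefers.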

> it's not illegal

> that doesn't magically make it illegal.

I don't think anyone is claiming that scraping itself actually is legally protected - I interpreted DigitalSea and harry8 as implying that it should be.

0: in either the shell sense or the workflow sense

nl · on Aug 12, 2020

> Retrieving, processing, and displaying information in a manner contrary to the wishes of the provider of that information is necessary for accessibility to disabled users. As a specific example, any attempt to block use of wget for scraping also blocks use of wget as part of a `wget | filter | text-to-speech` pipeline[0], and is thus a discrimination against blind or otherwise visually impaired users. The ADA is, as mentioned, only somewhat effective in prohibiting such things, though.

This is not the case. Unfortunately (?) the ADA doesn't allow the disabled person to specify their own technology. If Google can reasonably say that text-to-speech works via a standard screen reader (which it does) then they are ok.

> The ADA is, as mentioned, only somewhat effective in prohibiting such things, though

Well that's not the intent of the ADA, so not really surprising.