
Google has a history of scraping content that they want; their business is built on the back of scraping other people's content. The story I read just recently of what happened to Celebrity Net Worth was an interesting one: Google asked for an API, CNW refused, and Google just scraped the content anyway. There was no lawsuit, but CNW put up fake content and, sure enough, it made its way to Google.

It is all ironic given how aggressively Google blocks any attempt to scrape its own content.



Probably a silly question, but why not just use robots.txt? That was designed for preventing exactly this.


I’d say most of Genius’ visitors come from “song x lyrics” searches, so hiding those pages with robots.txt would ultimately make them lose almost all of their traffic.


Not due to robots.txt, but you can see what happened to Genius (formerly Rap Genius) when they got removed from the index:

https://techcrunch.com/2013/12/25/google-rap-genius/


Wear a condom before clicking techcrunch.com links:

https://archive.ph/9eUkv


Outline works well for TC and provides a nicer reading experience. https://outline.com/https://techcrunch.com/2013/12/25/google...


To be fair to TC, if you disable JavaScript you get a pretty good experience - just the full article, legible. Not like those sites that require JS to load the text and/or images.


But you need JavaScript to get past the "we value your privacy, so give us permission to sell it" banner.


robots.txt is designed to keep garbage out of search results. It has absolutely no power to prevent a bot from doing anything. Also, if the site added robots.txt they might as well shut down, because their entire userbase comes from people searching for lyrics on Google.


Other way around. It was invented to stop crawling. Indexing is still technically allowed even when blocked by robots.txt from crawling.


The problem is that Google is stealing content and placing it on search so the user never goes to the source. By blocking it with robots.txt they block themselves from Google results AND Google may keep scraping the content anyway.


robots.txt isn’t enforced by anything
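
To make that concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleBot` user agent, the `example.com` URLs and the `polite_can_fetch` helper are all made up for illustration. The point is that the check runs inside the crawler: a polite bot asks the question and honors the answer, and nothing on the server side stops an impolite one from skipping it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the path and site are assumptions for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /lyrics/
"""


def polite_can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return what a well-behaved crawler would conclude from robots.txt.

    This is purely advisory: the crawler has to choose to call it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A polite crawler would skip the first URL and fetch the second.
    print(polite_can_fetch(ROBOTS_TXT, "ExampleBot", "https://example.com/lyrics/some-song"))  # False
    print(polite_can_fetch(ROBOTS_TXT, "ExampleBot", "https://example.com/about"))  # True
```

A scraper that never calls anything like `polite_can_fetch` can still request `/lyrics/some-song` and will get the page unless the server blocks it by other means.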


They also scrape MusicBrainz, but even if they don't index MusicBrainz at least they donate to it


They have a contract with MusicBrainz. They are listed on https://metabrainz.org/supporters/tiers/4.

> The Unicorn tier is for large companies or companies that would like to have a reciprocal relationship with our foundation. If you need special guarantees, indemnities or require us to sign your contract for a data license, please select this tier. If you have another creative idea you would like to propose, please also select the unicorn tier.

> For any of these cases, please detail your request in the company information field and we will work with you to fit your company's mythical situation. We will also find an appropriate monthly support amount to our non-profit foundation of $1500 or more per month. Please always consider enabling the growth of our non-profit foundation and the continuous growth of our metadata!


That's like saying it's ironic that a soldier fights for his life when he tries to kill other people.

It's just the war that is being fought, not some sort of hypocrisy or irony.


Garbage.

We live in a society of laws. Even soldiers. Google have shown they have no respect for the law nor equality before it, and will cheat while using the law as a cudgel. Recall that law exists so that the strongest might not always get their way. "Ironic" is the polite way of pointing this out.

Without law, Google cease to exist immediately. They are incapable of enforcing property rights without it.

Pardons aside, soldiers go to jail for taking an attitude like Google's.


Just like Genius, Google licensed the lyrics. If they didn't, the publishers definitely would have sued.

Ironically, it is Genius that seems to have no respect for copyright law. Genius ended up having to settle a case years ago because they were using lyrics without the appropriate licensing [1].

[1] https://www.nytimes.com/2014/05/07/business/media/rap-genius...


Which law did Google break? Scraping in and of itself isn't illegal last time I checked, and the USA doesn't have database copyrights, unlike some jurisdictions.


It's blocking scrapers that is (somewhat, per things like the Americans with Disabilities Act) and/or should be (in general) illegal.

(And to head off the obvious: rate-limiting is orthogonal to whether the high-request-rate querient is scraping.)


> should be (in general) illegal

But.. it's not illegal?

> somewhat, per things like the Americans with Disabilities Act

This is just not right at all. There is nothing in the Americans with Disabilities Act that makes blocking scrapers illegal.

I think you mean you don't like the power imbalance of the large company taking away from smaller companies while using technological means to stop the same thing happening to them.

I don't like it either, but that doesn't magically make it illegal. I'm not even sure it should be.


> There is nothing in the Americans with Disabilities Act that make blocking scrapers illegal.

Retrieving, processing, and displaying information in a manner contrary to the wishes of the provider of that information is necessary for accessibility to disabled users. As a specific example, any attempt to block use of wget for scraping also blocks use of wget as part of a `wget | filter | text-to-speech` pipeline[0], and is thus a discrimination against blind or otherwise visually impaired users. The ADA is, as mentioned, only somewhat effective in prohibiting such things, though.

> it's not illegal

> that doesn't magically make it illegal.

I don't think anyone is claiming that scraping itself actually is legally protected - I interpreted DigitalSea and harry8 as implying that it should be.

0: in either the shell sense or the workflow sense
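
For the `filter` stage, a rough sketch of what I have in mind, using only Python's standard `html.parser`. The `TextExtractor` class and `html_to_text` function are hypothetical names of mine, and a real filter would need much more care (entities in odd places, hidden elements, navigation cruft). It only shows the shape of the step: HTML in, speakable plain text out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Return the page's visible text, one chunk per line."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return "\n".join(extractor.chunks)
```

Wired into a pipeline, this would read the fetched page from stdin and write `html_to_text(...)` to stdout for the text-to-speech program to consume. Block the fetch and everything downstream of it has nothing to work with.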


> Retrieving, processing, and displaying information in a manner contrary to the wishes of the provider of that information is necessary for accessibility to disabled users. As a specific example, any attempt to block use of wget for scraping also blocks use of wget as part of a `wget | filter | text-to-speech` pipeline[0], and is thus a discrimination against blind or otherwise visually impaired users. The ADA is, as mentioned, only somewhat effective in prohibiting such things, though.

This is not the case. Unfortunately (?) the ADA doesn't allow the disabled person to specify their own technology. If Google can reasonably say that text-to-speech works via a standard screen reader (which it does) then they are ok.

> The ADA is, as mentioned, only somewhat effective in prohibiting such things, though

Well that's not the intent of the ADA, so not really surprising.



