You can do it as a service, but that is highly competitive and basically trading time for money. The best way is to productize it:
- build an on-demand data API for a specific type of data and charge a premium for it. A good example is https://serpapi.com/, which serves Google data and charges a ~10x markup on proxy costs.
- proxy solutions make good money. To scrape at scale you need proxies, and lots of users pay $1-5k per month. Lots of proxy solutions are doing $100k+ per month.
- build a tool that uses web-scraped data, analyses/filters it and displays it to users. Lots of the biggest web scrapers are doing this, e.g. product monitoring tools for e-commerce companies. There's lots of competition there, but you can do it in new markets, like NFTs.
- hedge funds will pay huge money for web data, if you have 5 years of continuous data so they can backtest it.
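On the "continuous data" point, a rough sketch of how you might check a scraped history for gaps before pitching it for backtesting. It assumes one snapshot per day and treats a year as 365 days; `find_gaps` and `covers_years` are hypothetical helper names, not part of any library.

```python
from datetime import date
from typing import Iterable, List, Tuple


def find_gaps(days: Iterable[date], max_gap_days: int = 1) -> List[Tuple[date, date]]:
    """Return (last_seen, next_seen) pairs that are more than max_gap_days apart."""
    ordered = sorted(set(days))
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if (cur - prev).days > max_gap_days:
            gaps.append((prev, cur))
    return gaps


def covers_years(days: Iterable[date], years: int = 5) -> bool:
    """True if first and last snapshots span at least `years` (approximated as 365-day years)."""
    ordered = sorted(set(days))
    if not ordered:
        return False
    return (ordered[-1] - ordered[0]).days >= years * 365
```

If `find_gaps` returns anything, that is a hole a buyer would probably ask about.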
> build a tool that uses web-scraped data, analyses/filters it and displays it to users. Lots of the biggest web scrapers are doing this, e.g. product monitoring tools for e-commerce companies. There's lots of competition there, but you can do it in new markets, like NFTs.
Do you have any examples of such sites?
> hedge funds will pay huge money for web data, if you have 5 years of continuous data so they can backtest it.
What kind of web data would they be interested in?
100% agree, scraping should always be done respectfully.
- If they provide an API, then use it.
- Don't slam a website. Ideally, spread requests out over the hours of the day when their target audience is least active (night time).
- If you can get cached data from somewhere that works, then use that.
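As a minimal sketch of the points above (not a definitive implementation): a wrapper that serves repeat URLs from a cache and enforces a minimum delay between real requests, plus a check for a quiet window. The names `PoliteFetcher` and `is_off_peak`, the 5-second delay, and the 1am-6am window are all assumptions for illustration; the actual fetch function is passed in by the caller.

```python
import time
from typing import Callable, Dict, Optional


def is_off_peak(hour: int, start: int = 1, end: int = 6) -> bool:
    """True if `hour` (0-23, in the target site's local time) is in the quiet window."""
    return start <= hour < end


class PoliteFetcher:
    """Wraps a fetch function with a cache and a minimum delay between requests."""

    def __init__(self, fetch: Callable[[str], str], min_delay: float = 5.0,
                 sleep: Callable[[float], None] = time.sleep,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._min_delay = min_delay
        self._sleep = sleep
        self._clock = clock
        self._cache: Dict[str, str] = {}
        self._last_request: Optional[float] = None

    def get(self, url: str) -> str:
        # Serve from the cache when possible so the site is not hit twice.
        if url in self._cache:
            return self._cache[url]
        # Wait until at least min_delay seconds have passed since the last request.
        if self._last_request is not None:
            wait = self._min_delay - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)
        body = self._fetch(url)
        self._last_request = self._clock()
        self._cache[url] = body
        return body
```

You could pass in something like `lambda u: urllib.request.urlopen(u).read().decode()` as `fetch`, and only start a run when `is_off_peak` is true for the site's timezone.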
Most developers are respectful and only scrape what they really need, not only from an ethical point of view but also a cost and resources point of view. Scraping data is resource intensive and proxy costs can quickly rise to $1,000-$10,000 per month. So most only scrape the minimum they need.
The other thing here is that a lot of the most popular sites being scraped are also massive scrapers themselves. The big ecommerce sites are being scraped, but they are also scraping their competitors.
Don't get my home address, name, family members' names, salary, cell phone number, aggregate and sell them and claim "it's all publicly available anyway"
If you post that data on a public site, it is publicly available. It's like writing that info on a piece of cardboard, putting it in the town square and then saying 'why you people steal my data!'
I disagree because there is a difference between posting something publicly for humans and posting something publicly for bots/large scale analysis. I'm ok with my employer possibly being able to see whether I am looking for a new job or not on LinkedIn if that means they would need to have a human looking at my LinkedIn page. I am not ok with them training some ML algorithm to monitor my LinkedIn page to determine how likely I am to leave the company at all times.
Another danger is when public but not easily accessible data is able to deanonymize datasets which is probably the norm rather than the exception for anonymized datasets. Sure there are technical measures to make it better, but at the end of the day I think a lot of privacy is about respecting social boundaries and not breaking these protection measures even if technically possible. Most of the time, these measures are really about keeping honest people honest and not about stopping dedicated attackers.
Great point. Personally, I see so many people wasting massive amounts of time and money on content that goes nowhere. Content producers should approach content like investors and go after the opportunities with the best ROI.
Sharing is caring really. I've analysed numerous content niches with this technique (lots that I haven't produced content for) and there is so much opportunity there that I feel people should take advantage of it.
Since you've generously offered, I'll take you up on it
I've wanted to learn more about this whole area for a while now but was never sure. I'd like to do more reading. Is it SEO or content marketing or something else again maybe?
I'm really into woodworking and keep thinking that there's room for good blog-style content in the sea of other information that's out there. But I've never really known how/where to jump into that.
I haven't checked all the tools, but even among the paid ones SEMRush is the only one I've found that lets you export all the keywords in a way that makes this technique effective.