Comments
-
vortex 4782 8y
I started working on my first scraper today 😀 I'm scraping something similar but without an API: I just fetch the whole page, and if there's a Next button I loop to the next page.
*disclaimer: this answer is based on knowledge I gained in the last 3 hours
-
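The "follow the Next button" loop described above could look roughly like the sketch below. It uses only the Python standard library and rests on assumptions that are not from the thread: headlines are taken to sit in `<h2>` tags, the pagination link is an `<a>` whose text is exactly "Next", and the names `NextLinkParser` and `scrape_all_pages` are made up for illustration. Real sites will need different selectors.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class NextLinkParser(HTMLParser):
    """Collects <h2> text and the href of a link whose text is 'Next'.

    Both choices are assumptions about the page layout.
    """

    def __init__(self):
        super().__init__()
        self.headlines = []
        self.next_href = None
        self._in_h2 = False
        self._link_href = None  # href of the <a> currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
        elif tag == "a":
            self._link_href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False
        elif tag == "a":
            self._link_href = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_h2:
            self.headlines.append(text)
        elif self._link_href and text.lower() == "next":
            self.next_href = self._link_href


def fetch(url):
    """Download a page and decode it as UTF-8 (an assumption about encoding)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def scrape_all_pages(start_url, fetch_page=fetch, max_pages=50):
    """Follow 'Next' links from start_url and return all headlines found."""
    url, seen, headlines = start_url, set(), []
    # Stop on a missing Next link, a repeated URL, or the page cap.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        parser = NextLinkParser()
        parser.feed(fetch_page(url))
        headlines.extend(parser.headlines)
        url = urljoin(url, parser.next_href) if parser.next_href else None
    return headlines
```

The `seen` set and `max_pages` cap are there because some sites link the last page back to the first, which would otherwise loop forever. Passing `fetch_page` in makes it easy to try the loop against saved HTML before hitting a live site.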
ergo 358 8y
I did news scraping the year before last using Java.. it was in fact a crawler: I used jsoup to fetch and parse HTML and Redis to store the URL queue..
One thing I can tell you is that running a crawler continuously takes a lot of resources, and managing the queues of links already crawled and links still to crawl is a daunting task.
You can use RSS feeds too. Most publications provide them, and you can easily find an RSS parser or write one.. just run it periodically
*Edit- typo correction
-
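For the RSS suggestion above, a minimal Python sketch using only the standard library might look like this. It assumes plain RSS 2.0 feeds (`<item>` elements with `<title>` and `<link>` children); Atom feeds use different element names and are not handled. The function names `parse_rss`, `poll_feed`, and `run_periodically`, and the 15-minute interval, are illustrative choices and not from the thread.

```python
import time
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def parse_rss(xml_text):
    """Return (title, link) pairs for each <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        link = (item.findtext("link") or "").strip()
        items.append((title, link))
    return items


def download(url):
    """Fetch the raw feed bytes; ElementTree reads the XML encoding itself."""
    with urlopen(url, timeout=10) as resp:
        return resp.read()


def poll_feed(feed_url, seen_links, fetch=download):
    """Fetch the feed once and return only the items not seen before."""
    fresh = []
    for title, link in parse_rss(fetch(feed_url)):
        if link not in seen_links:
            seen_links.add(link)
            fresh.append((title, link))
    return fresh


def run_periodically(feed_url, interval_seconds=900):
    """Poll forever, printing new headlines; the interval is an arbitrary choice."""
    seen_links = set()
    while True:
        for title, link in poll_feed(feed_url, seen_links):
            print(title, link)
        time.sleep(interval_seconds)
```

Keeping a set of links already seen is a much smaller job than the crawl queue described above, since the feed itself lists what is new. For a long-running job the set would need to be persisted somewhere (a file or Redis, for example) so that restarts do not re-emit old headlines.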
@vortex Good for you! 👍 I recommend dataquest.io for learning more about scraping. They have great interactive tutorials.
-
@ergo thank you very much for the suggestion. Some sources have poor APIs which are slow and the JSON objects cannot be customized. I will definitely consider the RSS feed.
-
RobertMackey3228dI'm delighted to find your post. I've been wondering for a long time if anyone was working on news scraping. News scraping, the process of extracting information from various online news sources, is becoming increasingly popular for real-time data collection and analytics. However, one of the challenges of news scraping is dealing with the anti-bots used by websites to prevent automatic data extraction. And a tool like https://www.zenrows.com/ offers a comprehensive solution to this problem by handling all anti-bot bypass mechanisms for newsgathering users. Whether it's CAPTCHAs, IP blocking, or other anti-bot measures, ZenRows provides users with the tools and technology they need to seamlessly overcome these obstacles. I think this should be of interest to you.
Has anyone here worked on news scraping?
I am currently doing an academic project where I need to scrape news headlines. I have built scrapers for some news sources using their native APIs. I also tried using newsapi.org, but it returns only 10 results.
If anyone has worked on similar projects or knows of any, some advice would be highly appreciated.
api
news
python
analysis
headline
sentiment
source
mining
scraping
opinion