retoor 8346 13d
I have this:
sqlite> select count (0) from rants;
53068
sqlite> select count (0) from comments;
504678
sqlite>
This is literally everything there is to crawl on devRant. Well, maybe searching for single characters and the like could still turn up something more. To get this far I had to make 300,000+ requests. I did it with aiohttp and let it run over a weekend in slow mode, because I don't consider it very social to hit the site with a concurrency of 20.
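A minimal sketch of what such a "slow mode" crawl could look like, not the actual crawler. The names `crawl`, `fetch`, `concurrency`, and `delay` are hypothetical. The fetch function is passed in so the throttling logic stays independent of the HTTP client; with aiohttp it would typically wrap `session.get(url)` and `await response.text()`.

```python
import asyncio


async def crawl(urls, fetch, concurrency=2, delay=0.0):
    """Fetch every URL with at most `concurrency` requests in flight.

    `fetch` is any coroutine function that takes a URL and returns the
    page body. All names here are illustrative assumptions.
    """
    semaphore = asyncio.Semaphore(concurrency)
    results = {}

    async def worker(url):
        async with semaphore:
            results[url] = await fetch(url)
            # Sleep while still holding the slot, so the delay
            # actually throttles the overall request rate.
            await asyncio.sleep(delay)

    await asyncio.gather(*(worker(url) for url in urls))
    return results
```

A low `concurrency` and a non-zero `delay` trade crawl time for politeness, which is the point of letting it run over a weekend instead of hammering the site.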
The rants and comments come to around 158 MB, well parsed, in SQLite. I extracted the rant data from all the pages using C and inserted it into SQLite. Calculating how big the HTML is takes some time. It's around 13 GB, I think.
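Here is a rough sketch of the storage side, with Python's built-in `sqlite3` module swapped in for the C that was actually used. Only the table names `rants` and `comments` come from the queries above. The columns and the helpers `open_archive`, `store_rant`, and `count_rows` are assumptions made up for illustration.

```python
import sqlite3


def open_archive(path=":memory:"):
    """Open the database and create the two tables if they are missing."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS rants (
            id INTEGER PRIMARY KEY, username TEXT, body TEXT);
        CREATE TABLE IF NOT EXISTS comments (
            id INTEGER PRIMARY KEY,
            rant_id INTEGER REFERENCES rants(id),
            username TEXT, body TEXT);
        """
    )
    return conn


def store_rant(conn, rant, comments):
    """Insert one parsed rant and its comments in a single transaction."""
    with conn:  # commits on success, rolls back on error
        # INSERT OR REPLACE keeps a re-crawled page from duplicating rows.
        conn.execute(
            "INSERT OR REPLACE INTO rants (id, username, body) VALUES (?, ?, ?)",
            (rant["id"], rant["username"], rant["body"]),
        )
        conn.executemany(
            "INSERT OR REPLACE INTO comments (id, rant_id, username, body) "
            "VALUES (?, ?, ?, ?)",
            [(c["id"], rant["id"], c["username"], c["body"]) for c in comments],
        )


def count_rows(conn, table):
    """Same check as the sqlite shell session, for a known table name."""
    if table not in ("rants", "comments"):
        raise ValueError("unknown table: " + table)
    return conn.execute("SELECT count(*) FROM " + table).fetchone()[0]
```

Keying both tables on the site's own ids means a page that gets fetched twice just overwrites its rows, so the counts stay honest.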
The crawl reached every profile, rant, and comment that is connected in any way, but it still does not cover the whole site. That means there is a lot of unseen stuff that can only be reached through the search functionality. Sadly, the search functionality sucks. I could make it really awesome with current tech.
WHOOPS: devrant is 37G in total. Half of that is cache.
we should archive these posts, I'm gonna miss the choice bullshit when this forum finally gives out
devrant
archive