Comments
-
plusgut 6037 7y
* Make a request to the website
* Parse the HTML
* Optional: execute the JS
* Extract the information you want
* Recursion: collect the links and start again from the beginning for each one
-
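The steps listed above can be sketched roughly like this. It is a minimal sketch, not a finished crawler: it uses only the Python standard library, skips the optional JS step, and swaps the recursion for a queue so deep sites don't blow the call stack. The names `LinkParser`, `fetch`, `extract_links`, `crawl` and the `max_pages` limit are made up for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Step 1: make the request and return the body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(base_url, html):
    """Steps 2 and 4: parse the HTML and return absolute http(s) links."""
    parser = LinkParser()
    parser.feed(html)
    links = []
    for href in parser.links:
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")):
            links.append(absolute)
    return links


def crawl(start_url, max_pages=10, fetch=fetch):
    """Step 5: follow links, using a queue instead of real recursion."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except (OSError, ValueError):
            continue  # skip pages that fail to load
        visited.append(url)
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Calling `crawl("https://example.com/")` would return the list of pages it managed to load, in the order it visited them. A real crawler should also respect robots.txt and rate limits, which this sketch does not.
-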
plusgut 6037 7y
@AntaresStar Sure, but there are libraries for it. I wouldn't recommend writing one on your own...
-
coolq 4826 7y
@AntaresStar There are different kinds of web crawling and scraping. What exactly are you trying to do?
-
Anaeijon 514 7y
@AntaresStar In my opinion you don't even need to parse everything.
It would be far more efficient and easier to build a regular expression that searches for strings starting with http or https.
Add every string you find to a list.
Make sure duplicates can't be added. A sorted list might be useful for that, or even a database if the whole thing might get bigger.
Mark the URLs you have analyzed and keep iterating over the unmarked elements in the list.
You could do this recursively, but that can get really memory consuming.
It's better to also store the depth with each element in the list, in case you want to stop somewhere. Just set the depth of a new item to the depth of the current item + 1.
There is a lot of room for optimization, for example multiprocessing (even across multiple clients, if you use a database for storage).
Extend the regex if you want.
Good start:
["']https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.[a-z]{2,6}\b([-a-zA-Z0-9@:%_\+.~#?&//=]*)["']
-
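Here is a rough sketch of the regex approach described above, assuming Python. The pattern is used without a leading `^` so it can match anywhere in a page instead of only at the very start, and the capture group is moved so that the URL comes back without its quotes. `find_urls`, `crawl`, `fetch` and `max_depth` are hypothetical names; `fetch` stands for any function that takes a URL and returns the page text. Keep in mind that this only finds quoted absolute URLs, so relative links like `/about` are missed.

```python
import re
from collections import deque

# Quoted absolute http(s) URLs; group 1 is the URL without the quotes.
URL_PATTERN = re.compile(
    r"""["'](https?://(?:www\.)?[-a-zA-Z0-9@:%._+~#=]{2,256}\.[a-z]{2,6}\b"""
    r"""[-a-zA-Z0-9@:%_+.~#?&/=]*)["']"""
)


def find_urls(text):
    """Return the quoted URLs found in text, in order, without duplicates."""
    return list(dict.fromkeys(URL_PATTERN.findall(text)))


def crawl(start_url, fetch, max_depth=2):
    """Visit URLs iteratively and return {url: depth} for everything found.

    fetch is a hypothetical callable: URL in, page text out.
    """
    depth_of = {start_url: 0}   # doubles as the "no duplicates" list
    queue = deque([start_url])  # the unmarked (not yet analyzed) URLs
    while queue:
        url = queue.popleft()
        depth = depth_of[url]
        if depth >= max_depth:
            continue            # keep the URL, but don't expand it further
        for found in find_urls(fetch(url)):
            if found not in depth_of:
                depth_of[found] = depth + 1  # depth of current item + 1
                queue.append(found)
    return depth_of
```

The dictionary plays the role of the duplicate-free list, and the queue replaces recursion, so memory grows with the number of distinct URLs rather than with the call depth.
-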
coolq 4826 7y
@jAsE It's all good mate, no need to talk about it.
If you ever feel up to it, maybe a future rant? Let it all out, but that would be hard. I don't know what it's like, so I can only hope you're all right?
-
plusgut 6037 7y
@-Neo PhantomJS is dead. The maintainer said that everyone should use headless Chrome.
-
Guys, so it's easier to write a post on devRant than to just Google the keyword and read any of the thousands of articles on web crawlers with examples, pointers and links to resources? o.O
-
@kargaroth it's what I am doing.
But now that I am part of this community, I think I can also learn a lot from your experience.
So thank you all for your comments :)
Can you help me understand how to start building a Web crawler?
I need to understand if my idea is impossible.
question
web crawler