cats

Ranter

DataSecs

927

Comments

0

BinaryByter

8548

9y

Self-made?
0

BinaryByter

8548

9y

@linuxer4fun crawler*
2

Jappe

2903

9y

Ours has been running for over a month and it finds those too! It crawls the oddest sites you will ever see 😂
1

DataSecs

927

9y

@linuxer4fun yes, I started off regexing everything myself, just using a buffered reader and then running regexes over everything. I then moved on to JSoup because, well, it offered everything I needed 😂
I added some features and am now working with a cluster-like engine. That means there is a master server, which is actually a bot that adds links to a queue. Every 10 links, it sends a packet with those links to a slave, which processes them. You can have several slave instances connected to the master. The slaves are multi-threaded, with one thread per link.
The communication is done with Netty.
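
A minimal sketch of the batching step described above, assuming the master simply buffers discovered links and hands off a batch once 10 have accumulated. `LinkBatcher` and its `dispatcher` callback are hypothetical names, not taken from the crawler itself, and the callback only stands in for the network send, which is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: collects discovered links and hands them off in fixed-size batches.
class LinkBatcher {
    private final int batchSize;
    private final Consumer<List<String>> dispatcher; // stand-in for "send a packet to a worker"
    private final List<String> pending = new ArrayList<>();

    LinkBatcher(int batchSize, Consumer<List<String>> dispatcher) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.dispatcher = dispatcher;
    }

    // Called by the master for each link it discovers.
    synchronized void add(String link) {
        pending.add(link);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Sends whatever is pending, even if the batch is not full yet.
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        dispatcher.accept(new ArrayList<>(pending)); // copy so the batch is not cleared afterwards
        pending.clear();
    }

    public static void main(String[] args) {
        LinkBatcher batcher = new LinkBatcher(10,
                batch -> System.out.println("dispatching " + batch.size() + " links"));
        for (int i = 0; i < 25; i++) {
            batcher.add("http://example.com/page" + i);
        }
        batcher.flush(); // send the links left over after the last full batch
    }
}
```

Passing the send step in as a callback keeps the batching logic testable without a network; a real master would also need to decide which connected worker gets each batch.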
0

DataSecs

927

9y

@Jappe woooow, that's insane 😂😂😂
Which URL did you start off with?
1

Jappe

2903

9y

@DataSec Dmoz.org
1

DataSecs

927

9y

@Jappe That's an awesome site to start off with. Did you write it in Java?
0

Jappe

2903

9y

@DataSec Nope. Python. That was the simplest language for us to build our crawler in.
0

DataSecs

927

9y

@Jappe Have you calculated how many links you can retrieve per minute, for example? I'm quite curious about that because I'd like to know what's actually more efficient. To be honest, I could only guess.
0

Jappe

2903

9y

We know that it crawls around 100,000 links per hour.

It depends on how many crawlers are running, though. Getting 100,000 links per hour takes about 20 crawlers.
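
To put those figures on a per-crawler and per-minute basis, here is a small calculation sketch. It assumes throughput scales evenly across crawlers, which the thread does not confirm; `CrawlRate` and its methods are hypothetical names used only for illustration.

```java
// Hypothetical helper for converting the quoted crawl rates between units.
class CrawlRate {
    // Links handled by one crawler per hour, assuming the load is spread evenly.
    static double perCrawlerPerHour(long totalPerHour, int crawlers) {
        if (crawlers <= 0) {
            throw new IllegalArgumentException("crawlers must be positive");
        }
        return (double) totalPerHour / crawlers;
    }

    // Converts an hourly rate to a per-minute rate.
    static double perMinute(double perHour) {
        return perHour / 60.0;
    }

    public static void main(String[] args) {
        double each = perCrawlerPerHour(100_000, 20);
        System.out.println("per crawler, per hour:   " + each);
        System.out.println("per crawler, per minute: " + perMinute(each));
        System.out.println("all crawlers, per minute: " + perMinute(100_000));
    }
}
```

Under that assumption each crawler handles 5,000 links per hour, roughly 83 per minute, and all 20 together handle roughly 1,667 per minute.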
0

DataSecs

927

9y

@Jappe Oh, I expected more. What download rate have you got?
I have a semi-fixed thread count: it uses a fixed thread pool that calculates its thread count from the number of available processors. This applies to every slave connected to the master.
With a 100 Mbit download rate it fetches 100,000 links per minute and probably fully crawls and indexes 70,000 per minute, if not more.
That means with 1 Gbit you could fetch almost 1 million links per minute 😂
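
A minimal sketch of a fixed thread pool sized from the processor count, as described above. It assumes the thread count equals `Runtime.getRuntime().availableProcessors()` exactly; the comment does not give the actual formula. `WorkerPool` and `process` are hypothetical names, and `process` is only a placeholder for fetching and parsing a page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical worker-side pool: one fixed pool per worker process.
class WorkerPool implements AutoCloseable {
    final int threads;
    private final ExecutorService pool;

    WorkerPool() {
        // Assumption: one thread per available processor.
        this.threads = Runtime.getRuntime().availableProcessors();
        this.pool = Executors.newFixedThreadPool(threads);
    }

    // Placeholder for the real work (download, parse, index).
    static String process(String link) {
        return "done:" + link;
    }

    // Submits one task per link and waits for all of them, keeping input order.
    List<String> processBatch(List<String> links) {
        List<Future<String>> futures = new ArrayList<>();
        for (String link : links) {
            Callable<String> task = () -> process(link);
            futures.add(pool.submit(task));
        }
        List<String> results = new ArrayList<>();
        try {
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that task has finished
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a task failed", e.getCause());
        }
        return results;
    }

    @Override
    public void close() {
        pool.shutdown(); // stop accepting new tasks; already submitted ones still finish
    }

    public static void main(String[] args) {
        try (WorkerPool workers = new WorkerPool()) {
            System.out.println("threads: " + workers.threads);
            System.out.println(workers.processBatch(
                    List.of("http://example.com/a", "http://example.com/b")));
        }
    }
}
```

Since crawling mostly waits on the network rather than the CPU, a pool larger than the processor count may perform better; that would need measuring on the actual setup.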
2

Desinika

1209

9y

Cats are indeed weird crawlers.
0

Jappe

2903

9y

All right, that's pretty awesome, but what are the specs of your server/computer? We only have two regular PCs, each with only 2 GB of RAM.. 😕

Oh, and we run it on a crappy school network. So optimisation is all we can do to make it faster...😂
0

DataSecs

927

9y

@Jappe Yeah, I was pretty impressed 😂
I totally forgot to mention that. I run a computer with Windows, and so far it has only run in IntelliJ, not on a server. I have 8 GB of DDR4 and an i7-6700HQ @ 2.6 GHz, which is a quad-core. So it's neat hardware. On a server I would probably use a VPS with 2-4 GB of RAM and a decent CPU.
But my internet download rate is actually the biggest determining factor 😂
I tested it on the school network and it threw many exceptions 😂😂
0

Jappe

2903

9y

@DataSec That's awesome!! We are going to upgrade both PCs from 2 GB to 4 GB each, so it's gonna be a little faster than it is right now..😎
1

DataSecs

927

9y

@Jappe That sounds very promising! 😏
I am sure it will speed the crawling up :D
1

Jappe

2903

9y

Look at our daily result😇 with @hahaha1234 and @papierbouwer

When your crawler starts to find very weird pages on the internet...

weird

crawler

search engine

java