My evolved news crawler :) v1.8

Well, I needed to kill some time during this strange intermission period between jobs. My original 1-hour hack (less than 100 lines of code) evolved into something more flexible and, I hope, more useful. In the end my father is very happy: instead of 1 newspaper summary he now receives

He was also kind enough to email me (from the Pacific) about some early bugs, such as duplicate entries and formatting issues, which I tried to resolve. It is always fun to have someone use your code, isn't it?

Of course, in order to honor my Java development heritage, I had to create my own mini framework / crawling logic for this small tool. All Java devs do it! It's not that complex actually, and now I can easily add more crawlers for similar sites.

So currently I support the following sites (Greek only for the time being), but I will keep adding more:

I have also added 2 optional command-line arguments.

  • A flag to control the maximum number of articles to be crawled and included in the final report.
  • A flag to control the creation of zip files, one for each HTML report. That way I manage to reduce the size even more, so when I email them the payload is much smaller :).
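
As a rough illustration of what these two options imply (a sketch, not the tool's actual code), capping the article list and zipping a single HTML report could look like the following. The class and method names (`ReportOptions`, `limit`, `zip`, `unzipFirst`) are hypothetical, and treating a non-positive limit as "no limit" is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper; none of these names come from the real utility.
class ReportOptions {

    // Keep at most maxArticles items. Assumption: a value <= 0 means "no limit".
    static <T> List<T> limit(List<T> articles, int maxArticles) {
        if (maxArticles <= 0 || articles.size() <= maxArticles) {
            return articles;
        }
        return articles.subList(0, maxArticles);
    }

    // Wrap one HTML report in an in-memory zip archive holding a single entry.
    static byte[] zip(String entryName, String html) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write(html.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed at this point, so the archive is complete.
        return bytes.toByteArray();
    }

    // Read the first entry back, to check that the archive round-trips.
    static String unzipFirst(byte[] archive) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archive))) {
            if (zip.getNextEntry() == null) {
                throw new IllegalArgumentException("empty archive");
            }
            return new String(zip.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

HTML is repetitive text, so it usually compresses well, which is why zipping each report helps with email size.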

You can find more on the official GitHub page. By the way, I try to keep my documentation up to date.

You will find everything needed to run or compile this small utility, including any requirements.

I will soon add a small section for those (if anyone is interested) who would like to plug in extra crawling implementations for other RSS-based sites.
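
To give an idea of what such a plug-in could look like, here is a minimal sketch that assumes a plain RSS 2.0 feed. The `FeedCrawler` interface and `RssTitleCrawler` class are hypothetical names, not the tool's real extension API. It parses the feed with the JDK's built-in DOM parser and drops duplicate titles while keeping the feed order.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical plug-in contract: one implementation per news site.
interface FeedCrawler {
    String siteName();
    List<String> extractTitles(String rssXml);
}

// Generic crawler for plain RSS 2.0 feeds: collects each <item>'s <title>.
class RssTitleCrawler implements FeedCrawler {
    private final String siteName;

    RssTitleCrawler(String siteName) {
        this.siteName = siteName;
    }

    @Override
    public String siteName() {
        return siteName;
    }

    @Override
    public List<String> extractTitles(String rssXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));
            NodeList items = doc.getElementsByTagName("item");
            // LinkedHashSet removes duplicate titles and preserves feed order.
            LinkedHashSet<String> titles = new LinkedHashSet<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                NodeList title = item.getElementsByTagName("title");
                if (title.getLength() > 0) {
                    titles.add(title.item(0).getTextContent().trim());
                }
            }
            return new ArrayList<>(titles);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse feed for " + siteName, e);
        }
    }
}
```

A site with a non-standard feed would get its own `FeedCrawler` implementation, while the report-building code would only depend on the interface.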

Of course there is a lot of stuff that I could do in order to improve the utility, and most probably I will continue to add crawlers for more sites and make the design more modular.

Happy crawling.