My evolved news crawler :) v1.8
Well, I needed to kill some time during this strange intermission between jobs. My original 1-hour hack (less than 100 lines of code) evolved into something more flexible and useful (I hope so). In the end, my father is very happy: instead of 1 newspaper summary he now receives
He was also kind enough to email me (from the Pacific) about some early bugs, like duplicate entries and formatting issues, which I tried to resolve. It is always fun to have someone use your code, isn't it?
Of course, in order to honor my Java development heritage, I had to create my own mini framework / crawling logic in this small tool. All Java devs do it! It's not that complex actually, and now I can easily add more crawlers for similar sites.
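To give an idea of what such a mini framework could look like, here is a rough sketch in Java. It is not the tool's actual code: every name in it (`SiteCrawler`, `RssCrawler`, `ExampleNewsCrawler`, `CrawlerRegistry`) is made up for illustration, and the feed download is replaced by a hard-coded list so the sketch stays self-contained. The idea is simply that the shared logic (trimming, skipping duplicates, respecting a limit) lives in one base class, and each site only supplies its own titles.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical contract that every site crawler would fulfil.
interface SiteCrawler {
    String siteName();
    List<String> fetchHeadlines(int maxArticles);
}

// Shared logic for feed-style sites: clean titles, drop duplicates, apply the limit.
abstract class RssCrawler implements SiteCrawler {
    // A real crawler would download and parse the feed here (assumption).
    protected abstract List<String> rawTitles();

    @Override
    public List<String> fetchHeadlines(int maxArticles) {
        List<String> result = new ArrayList<>();
        for (String title : rawTitles()) {
            String cleaned = title.trim();
            if (cleaned.isEmpty() || result.contains(cleaned)) {
                continue; // skip blanks and duplicate entries
            }
            if (result.size() >= maxArticles) {
                break; // limit reached
            }
            result.add(cleaned);
        }
        return result;
    }
}

// Adding a new site means writing one small class like this.
class ExampleNewsCrawler extends RssCrawler {
    @Override
    public String siteName() {
        return "example-news";
    }

    @Override
    protected List<String> rawTitles() {
        // Stand-in data instead of a network call.
        return List.of("Story A", "Story B ", "Story A", "Story C");
    }
}

// Keeps the registered crawlers and runs them all in registration order.
class CrawlerRegistry {
    private final Map<String, SiteCrawler> crawlers = new LinkedHashMap<>();

    void register(SiteCrawler crawler) {
        crawlers.put(crawler.siteName(), crawler);
    }

    Map<String, List<String>> crawlAll(int maxArticles) {
        Map<String, List<String>> report = new LinkedHashMap<>();
        for (SiteCrawler crawler : crawlers.values()) {
            report.put(crawler.siteName(), crawler.fetchHeadlines(maxArticles));
        }
        return report;
    }
}
```

With this shape, supporting another site would be a matter of registering one more `SiteCrawler` implementation, while the duplicate filtering stays in a single place.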
So currently I support the following sites (Greek for the time being), but I will keep adding more:
I have also added 2 optional command line arguments.
- A flag to control the maximum number of articles to be crawled and included in the final report.
- A flag to control the creation of zip files that contain each HTML report. That way I manage to reduce the size even more, so when I email them the payload is far smaller :).
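For the curious, here is a minimal sketch of how two such options could be handled in Java. The flag names `--max-articles` and `--zip`, the default of 20 articles, and the class names `ReportOptions` and `ReportZipper` are all assumptions of mine for the example, not the tool's real interface. The zipping part uses the standard `java.util.zip` classes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Holds the two optional settings; flag names and default are hypothetical.
final class ReportOptions {
    final int maxArticles;
    final boolean zipReports;

    ReportOptions(int maxArticles, boolean zipReports) {
        this.maxArticles = maxArticles;
        this.zipReports = zipReports;
    }

    static ReportOptions parse(String[] args) {
        int max = 20;        // assumed default when the flag is absent
        boolean zip = false; // zipping is opt-in in this sketch
        for (int i = 0; i < args.length; i++) {
            if ("--max-articles".equals(args[i]) && i + 1 < args.length) {
                max = Integer.parseInt(args[++i]); // consume the value after the flag
            } else if ("--zip".equals(args[i])) {
                zip = true;
            }
        }
        return new ReportOptions(max, zip);
    }
}

// Packs a single HTML report into an in-memory zip archive.
final class ReportZipper {
    static byte[] zip(String entryName, String html) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.write(html.getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

HTML reports are full of repeated tags, so they tend to compress well, which is why zipping them before emailing pays off.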
You can find more on the official GitHub page. By the way, I try to keep my documentation up to date.
You will find all the required material to run or compile this small utility, plus any requirements.
I will soon add a small section for those (if anyone is interested) who would like to plug in extra crawling implementations for other RSS-based sites.
Of course there is a lot of stuff that I could do to improve the utility, and most probably I will continue to add crawlers for sites and make the design more 'modular'.
Happy crawling.