Robots Run Amok
Posted on 2006-09-06 (Updated on 2019-01-21)
Here at the County, we're getting hammered by a web crawler named "Pita+". The only information it provides about itself (and the only information I could find on it) is an e-mail address, which is 'webmaster' at 'pita.stanford college'. It didn't actually say 'pita.stanford college', but I'd rather keep that e-mail address spam-free, even if its owner is spamming our site.
Here at the County, we sit on a lowly T1 line, which means getting hammered is pretty effective. We also have around two gigs of downloadable schtuff, so getting hammered by an over-zealous web crawler is really kicking the internet line's ass.
For now, I've added a line to our robots.txt file disallowing the files being assaulted, and e-mailed the webmaster to see if Pita+ will obey robot guidelines such as "Crawl-delay". However, our upstream bandwidth is still hitting the roof, and all I can do at the moment is "ask politely" for them to stop. Perhaps, if this continues, I'll look into more forceful ways of controlling our visitors.
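For the curious, the sort of rules I mean look roughly like the sketch below. The /downloads/ path, the 10-second delay, and the example.org host are made-up placeholders, not our actual settings, and Crawl-delay is a non-standard directive that a crawler is free to ignore. The Python around it uses the standard library's urllib.robotparser to show how a polite crawler might read those rules; I have no idea whether Pita+ works anything like this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: the path and the delay are placeholders, not our real file.
ROBOTS_TXT = """\
User-agent: Pita+
Crawl-delay: 10
Disallow: /downloads/
"""


def load_rules(text):
    """Parse robots.txt text the way a well-behaved crawler might."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# example.org is a stand-in host name, not our site.
print(rules.can_fetch("Pita+", "http://example.org/downloads/big.zip"))
print(rules.can_fetch("Pita+", "http://example.org/index.html"))
print(rules.crawl_delay("Pita+"))
```

A crawler that honors the file would skip anything under the disallowed path and wait the requested number of seconds between requests. One that doesn't honor it is exactly why this is only "asking politely".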
Oh, if you're interested, here's the full user-agent string for the robot: