The Spam Club

The Google dilemma

Last 20 Posts

Posted at 11:36 on January 6th, 2008
Admin
Reborn Gumby
Posts: 11126
On each request, the bot sends the If-Modified-Since header. I guess the solution is to match this against the modification timestamp of the respective page and, if the latter is earlier, send the 304 (Not Modified) status code instead of the page itself.
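
Roughly what I have in mind, as a sketch in Python (for illustration only, not the actual framework code; the page body is just a placeholder):

    from datetime import datetime, timezone
    from email.utils import format_datetime, parsedate_to_datetime

    def respond(page_mtime, if_modified_since):
        # page_mtime: UTC datetime of the page's last content change
        # if_modified_since: raw If-Modified-Since value sent by the bot, or None
        headers = {"Last-Modified": format_datetime(page_mtime, usegmt=True)}
        if if_modified_since:
            try:
                since = parsedate_to_datetime(if_modified_since)
            except (TypeError, ValueError):
                since = None
            if since is not None and page_mtime <= since:
                # Page unchanged since the bot's last visit: send headers only.
                return 304, headers, b""
        return 200, headers, b"<html>full page here</html>"

    status, headers, body = respond(
        datetime(2008, 1, 6, 11, 36, tzinfo=timezone.utc),  # page last changed at 11:36
        "Sun, 06 Jan 2008 12:00:00 GMT")                    # bot last saw it at 12:00
    # status == 304: the page hasn't changed since the bot's last visit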

As for telling Google how often to index the pages, there seems to be something called a 'Sitemap' which webmasters can use to suggest such intervals, but the description over at Google is rather vague. It seems like they'll disregard it whenever they feel like it.
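
For what it's worth, a minimal sitemap with those 'suggested' intervals would look something like what this throwaway Python snippet writes out (the URLs are made up, and Google apparently treats changefreq as a hint at best):

    from datetime import date

    # (loc, last modification date, suggested re-crawl interval) -- fictional URLs
    pages = [
        ("http://www.example.com/", date(2008, 1, 6), "daily"),
        ("http://www.example.com/reviews/sample-game", date(2007, 11, 2), "monthly"),
    ]

    entries = "\n".join(
        "  <url>\n"
        f"    <loc>{loc}</loc>\n"
        f"    <lastmod>{lastmod.isoformat()}</lastmod>\n"
        f"    <changefreq>{freq}</changefreq>\n"
        "  </url>"
        for loc, lastmod, freq in pages
    )

    with open("sitemap.xml", "w") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        f.write(entries + "\n</urlset>\n")
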
-----
Now you see the violence inherent in the system!
Posted at 03:46 on January 6th, 2008
Member
Retired Gumby
Posts: 740
I wonder if there's a way to tell Google to only check the site once per month or some such? Most of the pages change rarely, if at all, and when most of the pages do change, it's a very minor change (this especially applies to the review pages). Ideally, Google should just use its previous cache of most pages on the site, and only re-check the index pages for new reviews & the pages for any -new- reviews.
-----
At the end of the day, you're left with a bent fork & a pissed off rhino.
Posted at 23:39 on January 5th, 2008
Admin
Reborn Gumby
Posts: 11126
Oh, technically, it's very easy. I already have an option in the framework with which I can easily define whether a page should be indexed by Google or not. Furthermore, there's an option to tell it whether it should follow further links on the page or stop looking there. At the moment, they're both set to true by default, just stopping at download pages.
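
In other words, those two options presumably boil down to something like a robots meta tag per page (Python sketch for illustration, not the real framework code):

    def robots_meta(index=True, follow=True):
        # Maps the two framework flags onto the standard robots meta tag.
        directives = ("index" if index else "noindex",
                      "follow" if follow else "nofollow")
        return '<meta name="robots" content="%s">' % ", ".join(directives)

    # e.g. a download page: still indexed, but the bot stops following links there
    print(robots_meta(index=True, follow=False))
    # -> <meta name="robots" content="index, nofollow">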

The pages which are found most by Google are in fact the review pages, so not letting it index those wouldn't be such a good idea. However, it's enough to have just one version of each indexed (not every combination of submenus open or closed and so on). Indexing all the endless games listings is probably unnecessary as well.

Perhaps most importantly: the bot doesn't really need to follow the site-internal links if I manage to feed it a 'definitive' list of pages to index.

In the end, this will still decrease visibility, of course, but I guess if the right pages are left accessible, it won't have that much of an impact while still reducing the bandwidth significantly.

Off the top of my head, I'd say the following should be 'indexable' (only one version of each page, of course):
- news
- system introductions
- game reviews
- application reviews
- comics pages

Secondary:
- FAQs
- editorials
- crew profiles
- links

In the forum, at least the game comments, but I'd like to keep it all 'open'. I need to do some research into what the bot does if the 'Last-Modified' header of a page is set to an old date. At the moment, the forum always claims to be completely current on every thread. It might be a good idea to set this header to the timestamp of the last post; maybe then the bot won't download the whole page anymore.
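
Roughly this, in sketch form (Python, just for illustration):

    from datetime import datetime, timezone
    from email.utils import format_datetime

    def forum_thread_headers(post_timestamps):
        # Advertise the newest post's timestamp instead of "now",
        # so unchanged threads can later be answered with a 304.
        newest = max(post_timestamps)  # UTC datetimes of all posts in the thread
        return {"Last-Modified": format_datetime(newest, usegmt=True)}

    print(forum_thread_headers([datetime(2008, 1, 5, 23, 39, tzinfo=timezone.utc)]))
    # -> {'Last-Modified': 'Sat, 05 Jan 2008 23:39:00 GMT'}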

Any crucial pages I forgot?
-----
Now you see the violence inherent in the system!
Posted at 23:24 on January 5th, 2008
Member
Retired Gumby
Posts: 1092
Google will be one of the megacorporations the rebels will fight in the grim cyberpunk future.

More seriously, I fear blocking Google will make it harder for people who want to find out about old games to discover the page. Even though most of them will just be looking for free games, some may be really interested.

I think Cypherswipe's idea is good. I don't know how Google works or how it can be blocked, but leaving only the pages for the main zones accessible to it could help.
Posted at 04:07 on January 5th, 2008
Member
Retired Gumby
Posts: 740
What about using robots.txt files, or something similar, to prevent the Google bot from indexing every single page on the site? Let it index the main page, and maybe the index pages for each system, but not every single review page.

Also, blocking the bot from the forum would probably reduce the problem a great deal. Individual users will only load a few threads each visit, while the bot will reload every single thread on each & every visit.
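
Something along these lines, for example (the paths are made up for illustration; the site's real URL layout is obviously different):

    # purely illustrative robots.txt -- adjust the paths to the real layout
    User-agent: Googlebot
    Disallow: /forum/        # keep the bot out of the forum entirely
    Disallow: /reviews/      # skip individual review pages...
    Allow: /reviews/index    # ...but still let it see the index pages

    User-agent: *
    Disallow:
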
-----
At the end of the day, you're left with a bent fork & a pissed off rhino.
Posted at 17:48 on January 4th, 2008
Admin
Reborn Gumby
Posts: 11126
Stats from December 2007:
Traffic caused by actual visitors: ~30GB
Traffic caused by Google Bot: ~115GB

Adding these two numbers up, this isn't critical yet (bandwidth isn't that expensive anymore these days), but you'll probably agree it's still totally blown out of proportion, especially since the Google Bot certainly doesn't download any binary files. I've even had Google's image bot blocked for years already, so all that remains is the normal bot indexing the pages. For reference: the next 'biggest' bot (by bandwidth consumption) is Yahoo's, with slightly more than 500MB.

Now, on the other hand, by far most of the visitors come here through Google (in spite of them being evil and all, it seems to be the most popular search engine). Despite that, I'm contemplating banning Google completely again, as I did a few years ago. That bot just goes crazy! It's hitting the site almost constantly, apparently re-indexing every page again and again.

Any ideas what could be done? How do other people cope with this?
-----
Now you see the violence inherent in the system!