View Full Version : Archive moved offline

The Management
19-07-2003, 17:35:26
In order to improve site performance, old Counterpoint threads have been archived offline. In future, older Counterpoint threads will be archived every 3 months to maintain performance levels.

Whether these threads will be available in an online archive in the future depends on interest from the community and a viable technical solution.

Sir Penguin
19-07-2003, 17:41:06
I'd definitely be interested in having an online archive, because I'm working on a tech article for which I need access to the whole archives (well, I don't need access, but it would make it way better). I've been putting off downloading them. :bash:

How much space do the archives take up?


19-07-2003, 18:11:53
I'm not sure how posts by Venom will help in your tech article.....

SP, the whole db for the site prior to the 'prune' was 350MB.

There are a couple of potential solutions for an archive as I see it.

1. A read-only, db-driven archive on another server. It would have to be re-coded to avoid paying for another vBulletin license. As it would be used irregularly, performance with a very large db wouldn't be too much of an issue. It would of course need another server (a PC permanently connected by DSL would be ideal).

2. The archive exported to flat HTML and stored on this server. Space would be the only issue (and it isn't one), but it would be much harder to search. Again, code would have to be written to do the export.


Sir Penguin
19-07-2003, 19:00:54
The search for #2 wouldn't be that hard. You'd just need a binary search tree or something as an interface to a keyword in context system. The only long part would be populating the tree.


20-07-2003, 09:43:13
So this means our post count is finally inaccurate? COOL! :D

Was it the archive that was the issue? If so, why does CG still slow down to almost not answering? Is there a hit being sustained due to search engines crawling it, or what?

20-07-2003, 14:05:40
The archive was just another forum; the posts, threads etc. were still in the database, which was causing the slowdown...

postcount = 0 ;)

21-07-2003, 12:11:30
No longer Trippin
So what exactly are you looking for, Nav... coding, offline storage? Can't help, as I don't really know what the question is, other than whether we want it available to us and how to do it.

Could always drop it into HTML and use the header for the title, and build a search engine to scan just the header of the HTML. Though you'd be searching only by post... you could go back and add the starter's name, though it would be quicker to code something to do that. You're pretty much limited to either offline storage, taking it off the board but leaving the file structure intact, or going to HTML. vBulletin probably wouldn't be happy if you cheated your way into a second copy by reverse engineering the code; it isn't like you're doing it to the copy you bought, you're basically building another one, so you'd need a new license.

21-07-2003, 12:16:03
Well I'm looking for ideas for solutions and possibly volunteers to carry out some of the work and/or donate storage space.

The way I see it, we own the data, so if we could extract that data with our own code (written from scratch), that would be okay?

No longer Trippin
21-07-2003, 12:38:26
I don't have the space on a RAID 5 array to store it (I have 2 72-gig and 2 32-gig drives, and about 80 gigs are taken). Also, no ISP here will let you host a site unless you pay for a commercial hookup (and they enforce it fiercely here, as Cox and Bellsouth suck balls). I'd have to go to Intel's 2.4C since it clocks to 3.2 at stock vcore most of the time, and I'd also have to get the MSI 865PE motherboard, as that's the best-performing 32-bit board on the market; Ultra 160 should be more than enough for the forums. So looking at cost: about 350 for board and CPU, plus a commercial hookup, which would be expensive. It would be cheaper in the short term to just pull the threads off the forums for the time being and, when your current contract is up, find a different ISP.

I do have a friend who is going into the army soon who has a dual 2800 MP rig with a RAID 5 setup, though he doesn't have nearly the space required either. I may be able to talk him into loaning it to me (it's a Tyan server board), since he won't be using it while he's gone for four years and he has a PC anyway... it's the server he used when he had a computer shop, but that closed. If I can talk him into it and hook my RAID drives into the array, IIRC I should have around 500 gigs total space. Then the commercial connection would be the only cost of hosting the entire site. That is iffy as he is a bitch, but who knows.

Might just be best to take the old threads offline for a while until your current ISP contract is up, and then move to a different ISP if I can get the 2800 SMP rig from my friend and add the other drives in. Hopefully he says yes, as that would be a dedicated server, so any hiccups would be easy to troubleshoot... though just paying for a commercial hookup would probably cost more a month than server fees.

Would it be worthwhile for me to try, or would I just be increasing overhead cost due to the expense of the commercial hookup?

You can PM me or leave me a message on ICQ if you want to keep it private.

21-07-2003, 16:05:19
At the moment I'm only investigating the possibilities, and I'd rather keep costs to a minimum (hopefully nothing).

The reason for hosting elsewhere is only to take database load off this server*. So if somebody already had a PC (ideally a PII or greater) permanently connected to the internet via cable or DSL, that would be ideal. (I might be able to provide this eventually, but not for the foreseeable future.)

If the old Counterpoint threads are stored as flat HTML, they could quite easily be hosted on this server (search/indexing issues notwithstanding).

One of the reasons I'm putting this out to discussion is to reduce the site's reliance on me for everything. :)

* This site is run on a 'dedicated' Cobalt RaQ4 with 512MB of RAM (only 1 other small low-traffic forum is hosted on the server), with lots of disk space: up to 10 gigs at a push.

21-07-2003, 17:38:01
Originally posted by Sir Penguin
The search for #2 wouldn't be that hard. You'd just need a binary search tree or something as an interface to a keyword in context system. The only long part would be populating the tree.

There are more efficient data structures than BSTs for huge datasets like this.

Plus I don't think Nav is interested in coding that kind of stuff with PHP. :D

21-07-2003, 17:54:16
I'm not coding anything. :D

Sir Penguin
21-07-2003, 18:05:32
I know there must be more efficient ones, but I have no idea what they are. :)

I could write one in Python; it would be really easy. The only problem would be keeping the data intact between search requests (although since this is just an archive, I guess you could just keep it in a text file and make the tree every time somebody asked for a search).


21-07-2003, 18:17:15
It would be very important to have some sort of index intact, one that can be regenerated (or added to) at each archiving. After all, the archive could get very big over the years (and yes, it should be long-term).

I'm sorry but I don't have very much knowledge of search engine technology!


Sir Penguin
21-07-2003, 19:26:53
Python's the only language in which I'm particularly proficient.

Adding to the index would be easy. You'd just need to go through each keyword and add the links, and add any new keywords and their links.

populating the search tree would look something like this:

import string

index = open("./kwicindex.dat", 'rU') # open the index file
tree = [] # the list that will contain the tree
keywds = {} # the KWIC dictionary

## The KWIC dictionary has keywords as its keys, and a list of URLs where
## the keyword can be found as the corresponding values.

for line in index.readlines(): # loop through each line in the index
    line = string.split(line) # split the line into a list of words
    keyword = line.pop(0) # get the keyword
    tree.append(keyword) # put the keyword into the keyword tree
    keywds[keyword] = [] # make a new dict. entry for the keyword
    keywds[keyword].extend(line) # put the list of URLs into the entry

index.close() # done with the index file
tree.sort() # sort the tree for easy searching

## An input file would look like this:
## keyword1 http://url1 http://url2 http://url3
## keyword2 http://url2 http://url4 http://url6
## etc.

It would probably be a little more complex if you wanted to be able to search by username and date. I honestly have no idea how efficient that is. It might be better to have an object-based tree.
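Adding new keywords and links to an existing index, as described above, might look something like this sketch (modern Python; the merge function and its behaviour are my assumptions, matching the input format shown above):

```python
# Sketch: merging a batch of new index lines into an existing KWIC
# dictionary. Each line is assumed to be "keyword url1 url2 ...".

def merge_index(keywds, new_lines):
    """Add new keywords and URLs to the existing KWIC dictionary."""
    for line in new_lines:
        parts = line.split()
        if not parts:
            continue
        keyword, urls = parts[0], parts[1:]
        bucket = keywds.setdefault(keyword, [])
        # Only add URLs we haven't already recorded for this keyword.
        for url in urls:
            if url not in bucket:
                bucket.append(url)
    return keywds
```

Re-sorting the keyword tree after a merge would then just be a matter of rebuilding it from the dictionary's keys.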


No longer Trippin
21-07-2003, 21:34:23
My "friend" said no - so now he has to find another best man (I didn't like him much anyhow; he was just a neighbor with no friends, so I took pity and said yes - well, now it's no). If he can't do that for me when he isn't even using it - and it wasn't even his money that paid for it - he's a bastard.

I gave away my duron platform which would have worked, though IDE.

As for options to avoid server bills:

1 - I'm thinking SP's way would be best at first, just to test and see how much load a search will draw - if it isn't much, then we have a free solution (well, unless SP makes us pay him :) ). Drop it into HTML and use Python. It's iffy from the sounds of things (mainly him questioning it, as I don't know Python), but if it works it would cost nothing, and all Nav would have to do is drop it all to HTML and throw on SP's program.

2 - A new system, as cheap as possible, that isn't going to crash on day one... say at least ten of us throw down an average of 40 dollars (if 20 people, then about 20) and someone volunteers to build and host it.

Prices taken from Newegg as they are reputable

Unless we go with IDE (which may be a very bad idea), SCSI will cost a whole lot: while I can find a drive for cheap, I can't find a card for cheap - looking at 200 or so minimum, maybe more - so I'll just say IDE RAID for now.

Athlon XP 1700 (266 bus) - 44 dollars, 0 shipping
ECS K7VTA3-RAID VIA KT333 Chipset ATX Motherboard - RETAIL - 62 dollars (RAID, 10/100, and USB 2.0)
WD WESTERN DIGITAL EIDE HARD DRIVE 20GB 7200RPM MODEL # WD200BB - Caviar OEM, DRIVE ONLY - 56 dollars (x 2 if RAIDed for redundancy - 112 dollars)
CRUCIAL MICRON 512MB 64x64 PC 2700 DDR RAM - OEM - 81 dollars.
Antec case w/PSU - 49.00 +15.00 shipping
LG Electronics Black 52x24x52 Internal CDRW Drive, Model GCE-8523BK (B) - OEM - 43 + 5 shipping

Total cost: w/out RAID and second drive: 354, w/raid: 410.

Just need someone with cable or DSL to host it; if USB, a cheap splitter could be used. If 10/100, a couple of cards, one in each system running as a LAN with permissions denied to the PC, is also a cheap solution.

We could probably raise the funds for that easily if we don't expect it overnight... and if we go SCSI, we'd just have to raise a bit more - though the controller card is the issue; I can't find a SCSI one that's cheap. A 10-gig Atlas costs the same as the SCSI cards I've seen: an Adaptec one is 160 dollars, plus another 160 for a 10-gig Atlas. I also imagine we were on SCSI before, though Nav can correct me if I'm wrong. So that would add about 200 more to the price :(

The only downsides from a technical aspect are that it's IDE (even RAIDed it isn't SCSI, but you're adding another 100 or so to the cost if you want SCSI - IDE RAID at least makes up for the quality loss, though not speed and CPU utilization) and that we need someone with a DSL or cable connection... a 4-way USB splitter is cheap, so if it's connected by USB 2.0, that eliminates the need for a router. We'd also need someone with an extra copy of an OS, or an unused copy that is no longer installed - unless Nav knows Linux, that will eat up a good bit more. Also, the OS would be on the same drive or array as the BBS, which isn't ideal either.

Having said that, given a month's time (and Nav takes the 350 megs of threads and moves them offline), 350-450 dollars shouldn't be that hard to raise if we go the above route with no SCSI, find an OS, and can get someone to host it whose ISP won't mind. If that person has a spare case and PSU, we can drop that cost - though if they have a spare, they probably have something which would be suitable as a server, but they might not be able to host it depending on their ISP, so we'd have to ship it to someone who can. So if someone has a spare, and we settle for the system being IDE, then we'd only pay for shipping - we just might have to change the OS out.

3 - just leave the stuff to be archived offline.

Sir Penguin
21-07-2003, 22:04:55
Since this is an archive that won't be accessed very often, we don't need anything fancy. Not even IDE RAID. Not even a dedicated machine. Heck, I'd offer to host it on my machine, but my uptime isn't too good (that is, I turn it off at night), and I'll be leaving the dorm in less than a month.

I'm pretty sure I can write a complete search function, after a fashion. The only problem I can think of right now is efficiency. I have zero experience with messing around with large sets of data.


21-07-2003, 22:12:02
Because of the costs of machines and the cost of potential hosting, it's just not worth it. CG is on a tight enough budget as it is.

If we can find a regular poster who has a spare PC with enough space and a cable/DSL connection (we could limit users to 1 or 2 at a time, for instance), we can look into that route. If we can't, we can't.

Otherwise we store the flat html on this server, but just generate it offline.

Sir Penguin
21-07-2003, 22:32:55
You could just do a web-based frontend to 'grep'. :)


21-07-2003, 22:34:56
sorry over my head, please explain?

Sir Penguin
21-07-2003, 22:42:55
grep is a UNIX utility that searches through files for strings that match a regular expression. I'm pretty sure it's not appropriate for searching through 350 MB of posts, though. :)


21-07-2003, 23:07:59
thought it was something like that.

btw if a search proves unviable, we could just index it by date (ie month)

Qaj the Fuzzy Love Worm
21-07-2003, 23:38:31
If you're going to use Unix utilities to do this, wouldn't awk be preferable?

I'm trying to build a server for my home network, and I'd be only too happy to host your archive, provided you don't mind it being 56k dialup :) :)

SP's script should be dead easy to do. The only problem should come with the difficult decisions - do you code the page so it's mostly flat HTML, with hooks for signatures and avatars etc.? Should it reference the user color scheme (i.e. it'd have to be some kind of weird hybrid dynamic CSS or something)? Or would you just say "bugger it, it's black and white, no frills, take it or leave it"? Myself, I'd opt for the latter. Then it's just a matter of a script that does something along the lines of the following:

rem Qaj's BASIC Archive Pseudo-code!
10 for i = 1 to (number of threads to archive)
20 open new HTML file
30 print generic header HTML code
40 for j = 1 to (number of posts in thread)
50 read user, post
60 print user, post in formatted table
70 next j
80 print generic footer HTML code
90 close HTML file
100 next i

Of course it's going to be a little more complex than that, because you have to interpret vB tags and such, and put pictures in and links and whatnot, but you get the idea.

I'm sure that SP is just itching to get stuck into the new Python release with a major new project, this looks right up his alley :)

Sir Penguin
22-07-2003, 00:39:06

You don't need to interpret anything if you just dump the HTML. Also, if I'm not mistaken, the CSS is local to the HTML file, so you'd just need to download a file and you'd have the default colour scheme. If one were to hate the default colour scheme, it would just take a quick Perl script to change everything.

It could be that avatars can be accessed remotely. The HTML contains "avatar.php?userid=nnn", so it would be a simple exercise to replace the '<img src="' with '<img src="http://blahblahblah' and have access to current avatars (and all the other links, too: bio, PM...). Signatures would be static, since they're hardcoded once PHP has done its thing.
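The rewrite SP describes might be sketched like this (a Python sketch; the base URL comes from the post above, but the function name and the assumption of double-quoted attributes are mine):

```python
import re

# Point relative src/href attributes at the live forum so avatars,
# links, etc. still resolve from the static archive pages.
BASE = "http://www.counterglow.com/forum/"

def absolutize(html):
    # The (?!http://) lookahead leaves already-absolute URLs untouched.
    html = re.sub(r'<img src="(?!http://)', '<img src="' + BASE, html)
    html = re.sub(r'<a href="(?!http://)', '<a href="' + BASE, html)
    return html
```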


Qaj the Fuzzy Love Worm
22-07-2003, 02:05:59
I don't understand what you mean by "dump the HTML". If you mean let the current forum PHP stuff generate the page and then somehow save it, I'm not sure that'd be any easier (or even possible!) than writing an archive script to rebuild the pages.

The CSS is generated from the user preferences. I've got Gangrene loaded myself, but I'm sure other people are going to be using Default or something. You'd need to generate CSS for each of the archive pages, which defeats the purpose of the archive if you still have to do processing.

You still do have to interpret vB code, or else how are you going to display tags like [code] or [url], or even :gasmaske:? Sure, you could put the plain text in, but if I'm not mistaken, the posts are written raw to the DB and interpreted every time the thread is served to the client - otherwise, when you edit a post, you'd have to de-interpret HTML back into vB code, which is stupid.

If archive material is going to be transformed into flat HTML, you're not going to get the opportunity to recompile each page - the whole point of archiving would seem to be to free up space on the DB by converting everything to HTML, then deleting the DB stuff (or at the very least archiving it to much more permanent storage like CD, which you hopefully wouldn't touch again except in gravest emergency). So unless you're suggesting you go through the flat HTML and replace <img src="avatar_1.jpg"> with <img src="avatar_2.jpg"> every time someone changes an avatar, I don't see how that works either. You could have a hard link to a filename which is set for each user (not too hard, though it would entail some code changes to the user preferences in vB code - you'd have to store a regular avatar file with an obvious file name as well as, I suspect, the file in the DB).

Last, signatures aren't hardcoded currently.

Like I said before, archives should IMHO ideally be black, grey and white (to distinguish them from the dynamically colored PHP-served pages, so you _know_ you're reading an archive), without all the usual trappings of reply/quote/etc., have no avatars (since archived material referring to avatars could be way out of date even days after a post is made) or signatures (same reason), and be as high-speed, low-code as possible.

Other stuff - the navigation (forum jump, the title index hierarchy thing, next/last thread etc.), the control function links, profile link and so forth - should be easy to just hard-code, since it'd be pretty much unchanging.

The only real pain would be interpreting the time. You'd have to pick a time zone (prob. GMT) to have all the pages conform to, since you don't want to do any server side processing, unless you had a cookie on the client with the time zone and did some client-side JavaScript.

Actually, this project sounds like quite a lot of fun to do. Wish I had the tools to do it, or I'd volunteer :) :)

Sir Penguin
22-07-2003, 03:01:56
By "dump the HTML", I mean do the same thing as opening up the source and saving it. Only automated.

The CSS is generated from the user preferences, but the actual HTML source has a local <style></style> block. I don't know how to log onto the server with Python, so this would return HTML pages styled with the default colour scheme, which can then be replaced easily with the archive colours.

You don't need to process the vB tags, because they're changed into proper HTML server-side. In the source for a thread page, code (for example) looks like this:<blockquote><pre><font face="verdana,arial,helvetica" size="1" >code:</font>
<hr>...code stuff...<hr></pre></blockquote>As for avatars, you don't need to update the static HTML every time an avatar's updated on the server. In HTML, <img src="http://www.counterglow.com/forum/avatar.php?userid=19"> displays my current avatar. But I agree that avatars aren't necessary (or particularly desirable) for the archive.

Signatures are static in a page source, too. Not hardcoded hardcoded, but hardcoded hardcoded. Er... They look like this:<p><p><font face="verdana,arial,helvetica" size="1" >__________________
<br>signature</font></p> I also agree that signatures shouldn't be in the archive, because they can't be changed dynamically when users change them (well, not easily).

I don't think the Forum Jump feature is necessary in the archive. Last/Next thread can be converted easily from the current HTML output, although it would probably have to be changed to last/next thread numerically rather than by last post. Everything below the last/next thread links could probably be deleted, or maybe replaced by some text about being archived.

Here's an example of what I'm talking about:


That file would be modified to remove avatar code, a few time-related things (like "All times are GMT -8 hours..."), and all the unnecessary links and stuff. Smiley links could be changed to point either to the CG copies or some local copies.


22-07-2003, 04:23:12
I don't see why you don't dump it down into a db on your server, and just let that do the searching for you. It's not like we do many searches against the archive, is it?

Plenty of free DBs to pick from. It's already in a DB, isn't it?

(I skipped to the end of this as it's just too long to bother with at the moment. :D)

Sir Penguin
22-07-2003, 05:50:46
I think it's because we're working under the assumption that we're going to be working with text files.

I'm thinking an archive page would look like this:


I got that by passing the HTML source dump through the following Perl script. It works for this thread; I don't know if it will work with other threads. :) (It definitely won't work right for threads with more than one page.)

#!/usr/bin/perl -w

use strict;

# Input and output filenames from the command line (the original post
# didn't show the filehandle setup, so this part is reconstructed).
open(INHTML, '<', $ARGV[0]) or die "can't read $ARGV[0]: $!";
open(OUTHTML, '>', $ARGV[1]) or die "can't write $ARGV[1]: $!";

my $print = 1;

while (<INHTML>) {

    if (/<!-- time zone and post buttons -->$/) {
        print OUTHTML "</td></tr></table></td></tr></table>";
    }

    s/<font face="verdana,arial,helvetica" size="1" >.+?<\/font><br>//g;
    s/<img src="avatar\.php\?userid=\d{1,4}&dateline=\d+" [^>]+>//g;
    s/<img src="(?!http:\/\/)/<img src="http:\/\/www.counterglow.com\/forum\//g;
    s/<a href="(?!http:\/\/)/<a href="http:\/\/www.counterglow.com\/forum\//g;
    s/<a href=".*?newthread\.php\?[^>]*><img src=[^>]*><\/a>$//g;
    s/<a href=".*?newreply\.php\?[^>]*><img src=[^>]*><\/a>$//g;
    s/<font [^>]*> <a href="[^>]*>.*?<\/a> <a href="[^>]*><img src=".*?images\/firstnew.gif"[^>]*><\/a>&nbsp;<\/font>//g;
    s/<a href=.*?showthread\.php\?[^>]*>(?:(?:Last)|(?:Next)) Thread<\/a>$//g;
    s/<img src=".*?(?:(?:prev)|(?:next))\.gif[^>]*>$//g;
    s/<a href=".*?(?:(?:editpost)|(?:newreply))\.php\?s=[^>]+>\[(?:(?:edit)|(?:quote))\]<\/a>$//g;
    s/<font face="verdana,arial,helvetica" size="1" >Location: .*?<\/font><\/td>$//g;
    s/<p><font face="verdana,arial,helvetica" size="1" ><i>Last edited by .*?<\/i><\/font><\/p>//g;
    s/<img src=".*?images\/posticon(?:new)?\.gif" border="0" alt="(?:(?:Old)|(?:New)) Post">//g;
    s/<img src=".*?images\/(?:(?:on)|(?:off)).gif" border="0" alt="[^"]*" align="absmiddle">//g;
    s/<td class="size8" (?:[^|]*\|){5}.*?<\/td>$//g;
    s/<img src=".*?images\/vb_bullet.gif"[^>]*>//g;
    s/<a href=".*?index\.php\?s=[^>]*>[\w\s]+<\/a>(?:.*?&gt;){3} (.*?)<\/b><\/font><\/td>$/<a href=".\/archive.html">Archive<\/a> &gt; $1 <\/b><\/font><\/td>/g;

    # Skip everything between the start of a signature block and its
    # closing </font></p> line.
    if (/<p><p><font face="verdana,arial,helvetica" size="1" >__________________<br>/) {
        $print = 0;
    } elsif (/.*?<\/font><\/p>$/ and $print == 0) {
        $print = 1;
        next; # don't print the signature's closing line either
    }

    if ($print) {
        print OUTHTML $_; # $_ already ends in a newline
    }
}

close(INHTML);
close(OUTHTML);


22-07-2003, 06:19:45
So? Data is data, whether it's text or numeric.

I'm just trying to understand why we need to go through and dump it out into text, when we can leave it in a DB and take advantage of the DB's strengths. If we have clear rights to the data, and the data is text (with some references to user ID, date posted/edited, whatever), it would be easy to whip up some database interfacing code.

Otherwise, grepping a bunch of text files is not going to be very easy on the host, is it?

Important software rule: don't reinvent the wheel (no matter how bored you are, or how neat you think it would be) if the wheel does its job.

Sir Penguin
22-07-2003, 06:30:29
Because we're working under the assumption that we're going to be working with text files. You don't bother discussing using a DB when you're working under the assumption that you're going to be working with text files.

Using text files is a lot easier on the person who takes care of the server. They're a lot easier to secure, too. And I'm pretty sure we've decided not to use grep. :)

And note: Originally posted by Nav
If the old counterpoint threads are stored as flat html, they could quite easily be hosted on this server. (search/indexing issues not withstanding).


22-07-2003, 06:44:13
That's what I'm asking... why bother with flat files? The archives are not going to get used much are they? There's only, what, 50 of us that use CG daily, and maybe 150-200 use it weekly?

If we want to present the look with sigs and location and whatnot, then you aren't working with static text. At which point you're doing processing anyway, so why stop there? Go back to a DB. Let the DB do all the searching by the various criteria for you, and then you don't have to worry about it yourself.

Sir Penguin
22-07-2003, 06:57:33
But we don't want to do sigs and location and whatnot (especially since that's further load on the old database). Read the damned thread already. ;)

I'm not sure, but I think another issue is licensing. I understand that in order to use the forum software to search an archive database, we'd need another license. In other words, anything that doesn't use the main DB needs to be written from scratch.


22-07-2003, 07:18:52
Ah. But I and several other people here can whip up an easy to use web interface that will display threads, posts, etc... and do searches. It really isn't a big deal. So no vBulletin code involved for that. Just a quick clickky click, badda bing, badda boom, and HTML output...

Seriously. A data schema is about all some of us need. I'd run the archive site off my own permanent connection and tools from my home machine, except my ISP rotates my IP every 6 hours to prevent people from running businesses from home.

Sir Penguin
22-07-2003, 07:23:01
Go for it. You can write a script that changes, every six hours, the IP address to which CG refers. :)


22-07-2003, 07:26:05
You want the host CG to spot that the Archive has changed IP and update itself? Not good...

Having the Archive spot that its IP has changed, and pushing that over to the CG host, is also not good, security-wise.

Of course, for a lot more money, I can get my ISP to upgrade me to commercial. For about 10x the cost, I get the same service level with a permanent IP address. Last time I looked...

Sir Penguin
22-07-2003, 07:35:08
It's not really much of a security risk. You'd need an account with permissions to modify one file, and the CG software would put that into the scripts. After that, the only security risk would be on your side, if somebody got your password and userid from the script. Of course, Nav would have to modify the code, and I think he's said that he doesn't want to mess around in the vB code. :)


22-07-2003, 07:38:34
You wouldn't do it that way. You'd simply have a link over to the Archive host. Completely separate. We'd just do a dump/addition when new threads got archived. And when we do that, you also dump down the users, their signatures, avatars, and profiles. That gets uploaded (inserted/updated) to the Archive DB, and everything is fine. Two web sites.

Sir Penguin
22-07-2003, 07:40:12
That doesn't help with your IP address problem.


22-07-2003, 07:48:35
I'm not worried about *my* problem, because I'm not going to be doing the hosting. We can find other hosts with permanent IP, I bet.

Sir Penguin
22-07-2003, 08:11:42
I think the issue is whether or not we can do that. If we can find a free host who can deal with databases and cgi/php/whatever, then that's great. Otherwise, it has to be either offline, or on CG's server. If it's on CG's server, then it's text files (I know, it's flawed logic, but from what Nav said, I understood that another database on CG isn't preferable to plaintext).

Now that I think about it, you wouldn't need to populate the tree every time the archive is searched. You'd just need to pickle the tree and the KWIC dictionary, and unpickle it when the search is accessed. Duh. You wouldn't even need to keep an index file except as backup. Actually, you wouldn't need one of the tree or the dictionary, whichever one is slower to access.
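Pickling and unpickling the tree and KWIC dictionary as described might look like this (a sketch in modern Python; the function names and filename are my assumptions):

```python
import pickle

def save_index(tree, keywds, path="kwicindex.pickle"):
    # Persist both structures so a search request can unpickle them
    # instead of rebuilding the tree from the raw index file each time.
    with open(path, "wb") as f:
        pickle.dump((tree, keywds), f)

def load_index(path="kwicindex.pickle"):
    # Restore the (tree, keywds) pair saved by save_index.
    with open(path, "rb") as f:
        return pickle.load(f)
```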


No longer Trippin
22-07-2003, 08:53:36
Will you two speak in English, please? :)

22-07-2003, 10:04:26
They never have before...

22-07-2003, 11:35:29
While another database with the data is possible, I'd rather not go down that route as after another couple of years that could become unworkable as well.

In order to avoid any copyright issues, we should create a new database and import the pertinent data into it (this would also allow us to manage that data better). We would then be able to export this into html. We should interpret vbulletin codes at this stage as well.

Of course, having a master db would mean we could switch to another method at any time in the future, if something became feasible.

We should only use information that is stored in the thread or post table. So Titles, avatars, Location, www link, sig would not be used. This would make it quicker and considerably reduce the size of the archive.

In terms of colours, the differences between the light and dark styles are probably too great to make this easy to accomplish (i.e. different header graphic and background graphic). Also, I don't think the cookie holds the style choice, so you would need PHP coding to accomplish this properly, which of course would have to access the main CG db remotely for that info. Er, possibly not. ;)
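Interpreting vBulletin codes at the export stage, as suggested above, could be sketched like this (the tag subset and the exact replacements are my assumptions; a real vB parser handles far more):

```python
import re

def vb_to_html(text):
    # Convert a handful of common vB codes to HTML at export time.
    # Only a small, assumed subset of tags is handled here.
    rules = [
        (r'\[b\](.*?)\[/b\]', r'<b>\1</b>'),
        (r'\[i\](.*?)\[/i\]', r'<i>\1</i>'),
        (r'\[url\](.*?)\[/url\]', r'<a href="\1">\1</a>'),
        (r'\[quote\](.*?)\[/quote\]', r'<blockquote>\1</blockquote>'),
    ]
    for pattern, repl in rules:
        text = re.sub(pattern, repl, text, flags=re.IGNORECASE | re.DOTALL)
    return text
```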

22-07-2003, 12:08:35
You can change graphics using a cookie and CSS by giving a text element display:none and a background image.

Er, anyway. There are search solutions out there, so why build a new one? Export to some kind of format from the master database (as Nav suggests), then run something like Swish-E (http://swish-e.org/).

22-07-2003, 12:20:10
Worth looking into those kinds of solutions.

The indices can get large. In our example index of HTML files, the index occupies about 11MB, about one-fourth the size of the original files indexed.

Ouch. http://www.linuxjournal.com/article.php?sid=6652

22-07-2003, 12:49:13
You never know, all the repeated words on here might make that a non-issue. :gasmaske: I AM THE HARDMAN U R ALL GAW TWATS HARR HARR HARR fucking a man Hasselbaink.

Qaj the Fuzzy Love Worm
22-07-2003, 15:58:21
Originally posted by Nav
While another database with the data is possible, I'd rather not go down that route as after another couple of years that could become unworkable as well.

In order to avoid any copyright issues, we should create a new database and import the pertinent data into it (this would also allow us to manage that data better). We would then be able to export this into html. We should interpret vbulletin codes at this stage as well.

Of course, having a master db would mean we could switch to another method at any time in the future, if something became feasible.

We should only use information that is stored in the thread or post table. So Titles, avatars, Location, www link, sig would not be used. This would make it quicker and considerably reduce the size of the archive.

In terms of colours, the differences between the light and dark styles are probably too great to make it easy to accomplish (ie different header graphic and background graphic). Also I don't think the cookie holds the style choice, so you would need php coding to properly accomplish this, which of course would have to access the main cg db remotely for that info. er possibly not. ;)

I think having the vB stuff converted to a different database (MySQL or whatever is flavor of the month these days) that's kept offline would be the best solution - the data can still be manipulated in the future if it floats your boat.

Having it online is, as you say, only going to create problems down the road, since you'd have all that overhead for something that only a few people would use, very infrequently. Not worth slowing down regular operations for, in my book.

SP, your method is a little labor intensive, which is exactly the kind of half-assed approach I'd expect from a student :) Given the number of threads in CP, do you really want it grabbing a (presumably) hand-saved HTML file for each page of each thread? I wouldn't. Much better, IMHO, to have a script that drags everything out of the DB (of whatever flavor) and builds archive pages for a certain thread/date range.

It could even be written to read from the original DB, convert to the new one, build the page(s), and remove the old information from the master DB. You'd really want to make sure you got the automation right with that one though :) Or, write separate processes to (a) convert to the new DB (b) convert to HTML (c) remove from the original DB. Whatever is preferable, I guess. But do the job right the first time, i.e. write a decent script that works with the DB info, and it'll take care of itself, provided Nav doesn't monkey with the vB innards too much between archive runs :)
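A minimal sketch of that DB-driven approach, in Python with sqlite3 standing in for the real database; the `post` table and its column names here are invented stand-ins, not necessarily vBulletin's actual schema, and no HTML escaping is done:

```python
import sqlite3

def build_archive_page(conn, thread_id):
    """Pull a thread's posts straight from the database and emit one
    static HTML page. Table/column names are assumed, and post text
    is emitted unescaped - fine for a sketch, not for production."""
    rows = conn.execute(
        "SELECT username, dateline, pagetext FROM post "
        "WHERE threadid = ? ORDER BY dateline", (thread_id,)).fetchall()
    parts = ["<html><body>"]
    for user, when, text in rows:
        parts.append(f"<p><b>{user}</b> ({when})<br>{text}</p>")
    parts.append("</body></html>")
    return "\n".join(parts)
```

The same query could be run per thread/date range and the result written out to flat files in one pass.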

Sir Penguin
22-07-2003, 18:58:11
Labour intensive? I would write a short script that did the following:

0. Input either a thread range, or a modification date range and a specific forum.
1. Download the first page of each thread and extract the number of pages in that thread; either save the thread to a file (threadNum.html) or store it in a database
2. Download the remaining pages of each thread, and either store them in files (threadNum-n.html) or in a DB
3. Run all the threads through a conversion script like the one I posted, and save the results.
4. Generate the Archive index page.

It would be, like, 50 lines. Including comments. A bit more if you wanted to archive based on modification date rather than thread number, and another bit more if you want it to do database stuff.
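Those steps might sketch out like this in (modern) Python; the base URL, query parameters, and the "Page 1 of N" pager pattern are all assumptions, not the real vBulletin details:

```python
import re
import urllib.request

BASE = "http://example.com/forum"  # hypothetical forum base URL

def page_count(html):
    """Step 1: extract the number of pages from a thread's first page.
    The 'Page 1 of N' pattern is an assumption about the pager markup."""
    m = re.search(r"Page 1 of (\d+)", html)
    return int(m.group(1)) if m else 1

def fetch(url):
    """Download one page and return its HTML as text."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("latin-1", errors="replace")

def archive_thread(thread_id):
    """Steps 1-2: save every page of a thread to numbered .html files."""
    first = fetch(f"{BASE}/showthread.php?t={thread_id}")
    with open(f"{thread_id}.html", "w") as f:
        f.write(first)
    for n in range(2, page_count(first) + 1):
        with open(f"{thread_id}-{n}.html", "w") as f:
            f.write(fetch(f"{BASE}/showthread.php?t={thread_id}&page={n}"))

# Step 0 would loop archive_thread() over a thread range; steps 3-4
# (conversion and index generation) would then run over the saved files.
```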


Qaj the Fuzzy Love Worm
22-07-2003, 19:28:08
So the script would do the saving of the HTML pages? What, through some kind of pipe or whatsit (I know enough unixy stuff to get myself into trouble here).

Still seems rather lazy to have the page generate, then save it, then generate a page from it. Maybe I'm just more of a database purist than you, SP. You program messy :)

22-07-2003, 20:22:45
Why don't you just zip it and email a copy to all the forum members?

22-07-2003, 20:57:19
Why don't you just zip it?

Edit. Damn, that really didn't work over two pages. :(

Sir Penguin
22-07-2003, 21:11:42
You're right, you wouldn't have to save it the first time. That was stupid. But you would have to pass it through a conversion routine, and for that I would still use a separate script. Python to do the database/web/main stuff (since I know how to do all that in Python), and Perl for the text manipulation (since that's what Perl's good at). You'd just have to change the Perl script to input from stdin.
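A stdin-fed conversion routine of that kind could equally be done in Python; the vB-code substitution list below is illustrative only, not the full set of codes:

```python
import re
import sys

# Illustrative vBulletin-code substitutions; the real list is much longer.
RULES = [
    (re.compile(r"\[b\](.*?)\[/b\]", re.S), r"<b>\1</b>"),
    (re.compile(r"\[i\](.*?)\[/i\]", re.S), r"<i>\1</i>"),
    (re.compile(r"\[url\](.*?)\[/url\]", re.S), r'<a href="\1">\1</a>'),
]

def convert(text):
    """Apply each vB-code to HTML substitution in turn."""
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text

if __name__ == "__main__":
    # Read raw post text on stdin, write converted HTML on stdout.
    sys.stdout.write(convert(sys.stdin.read()))
```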

I program the messy way, and then clean up as I go along. (at least, if I expect anybody to see my source code) :) I follow the notion that it's best to get the ideas out and then find ways to do things better, rather than overthink the problem.

Qaj the Fuzzy Love Worm
22-07-2003, 21:47:41
Then maybe that's where we're different. I make web pages direct from databases, so that's how I'm approaching this problem. Just seems natural.

Of course, the project I'm working on right now is anything but natural. I'm having to transfer an interface to a 600+ table DB built in Access into a web page served by ASP, as a means of getting it out to our staff with the minimum of fuss. The biggest problem is trying to interpret what the moron who wrote it was doing (or trying to do - the thing doesn't work right and never has).

So when a problem like Nav's comes along I think "Aha! Just suck the info out of the database and create a page from scratch. Easy peasy!" But then you come and obfuscate it all with your Unix hackery mentality. That's just like a Canadian :)

22-07-2003, 22:04:42
Sed is nice too. Create a pipe for each and every edit you want to make, so you can really tie up the server. :)

But seriously, it sounds like we want to use another SQL capable database, grab stuff out of it with Python, and clean it up with Perl. Or maybe I should read some more of this, but.... nyah.

Sir Penguin
22-07-2003, 23:02:05
Qaj: If you have access to the database, then there's no problem. I don't have access to the database.

Besides, I don't like databases. There's little you can't do with text files, and you don't have to mess around with connections and stuff. :)


Qaj the Fuzzy Love Worm
22-07-2003, 23:50:37
You don't like databases? You suck! :)

I have a friend called Sed, he's far from nice.

23-07-2003, 02:14:50
I'm with Qaj. Direct DB. Let the DB do all the work for us. It's much better than having to go digging through text files. Bleah. Too easy for those to get corrupted by whatever else eats up your host. With a DB, you just need to back up one item, and all your data is safe.

And for that matter, I can do it all in one language, have it work on every HTML 1.1 compliant browser, and we can migrate to other backend platforms with no trouble.

Maybe I should just buy a new big box and see what kind of deal I can cut with my ISP for a perm IP address and low mileage web site hosted off it? I wonder what kind of deal they are getting at LoudMedicine.com, and what kind of hosting they have?

You can never over-think a problem in the real world. Eventually, you will have to start working on the problem before you really have it defined, and they'll quickly keep redefining the problem and tagging on new problems for it to also solve. If you don't think it through, and do so very well, you end up with a serious piece of shit product that you hope no one will ever be able to trace to you, and that doesn't really do what it should.

23-07-2003, 02:18:27
Remember, in the real world, you never get time to do it right. They will always tell you to just start it, and get going. Knock it out as fast as possible, and you'll have time to come back and do it right afterward. That's one of the biggest lies you'll ever hear in software design. The second biggest is whenever you are told "never", as in "we will never want to do that" or "we will never have this or that happen", etc etc etc.

We've got time to think about what we have available, what we can make available, and what we can get for free (which coder, for how long, etc). And, in an almost ideal world, what the best options are (low manual operation and maintenance, high portability of data and server-side code, etc etc etc).

23-07-2003, 15:15:50
I'm also with Qaj and Darkstar; we don't want to add unnecessary complication. For example, PHP could handle the whole process, from grabbing the right data to writing the HTML files. It's very important that somebody else will be able to take over the management of this, if need be.

And of course whoever would like to have a bash at this will have access to the archive database.

I'll create and manage the master archive db. I will import only the necessary data, making it a bit more compact. Initially it will be fairly large, but if I just send updates in the future it shouldn't be too big.

DS, Loudmedicine is only on a standard hosting thingy, nothing special (I think).

Qaj the Fuzzy Love Worm
23-07-2003, 21:58:02
I'd "like to" have a bash at this, though I'm not sure I'd get much time to do it.

The initial work shouldn't be too difficult or time-consuming - after all, it should be only one or two different scripts (one for the thread pages, one for the index page(s)).

23-07-2003, 22:25:06
yep. It should probably be broken up into months, with multiple pages like the live site (I guess the same should be done with threads as well).

We'd just have to look into whether an indexing system is feasible. Would be great to search it.
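As a feasibility toy, a word-to-page inverted index is only a few lines of Python; a real deployment would more likely use an off-the-shelf indexer like the Swish-E mentioned earlier in the thread:

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercased word to the set of page names containing it.
    `pages` is a {page_name: text} dict - a toy stand-in for indexing
    the exported archive HTML files."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index

def search(index, *words):
    """Return the pages containing all of the given words."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()
```

The index itself can be pickled to disk once and reloaded per search, so the archive pages stay flat HTML.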

There's def. no rush on it

Sir Penguin
11-08-2003, 07:01:48
Well, Qaj? :)


13-08-2003, 09:55:16
I need to produce the master db first anyway. (when I have time).