Image harvesters...


Darkstar
15-06-2003, 08:53:11
Ok... Just to try and stir up some new conversations...

What kind of image harvesters do you use to cull out new images from Usenet and web sites?

Lately, I've just been using FotoVac (an ACDSee product) to cull out images from Usenet. I used to use Picture Agent, but they key your license off your HD, so when you change HDs, your license no longer works. FotoVac is nice, since it has handy options like 'no dupes', but I'm curious what others are using.

Sir Penguin
15-06-2003, 08:58:45
I use Opera. That way, I get the text as well as the pictures.

SP

Darkstar
15-06-2003, 09:16:28
When I want to preserve a web page with images in place, I use Save As... whatever.mht (web archive) in IE. If I just want to preserve one image and some text, I tend to import the data into my "Literary Machine Pro" application/database.

Deacon
15-06-2003, 18:30:24
For usenet harvesting, I use Pan. It's a Linux app. Back when I was in school, my dorm room had a wonderful connection. I used Pan to get lots of MP3s, videos, and stills of various subject matter. Even Outlook Express is decent, but I kept running into problems when downloading more than 120MB at a time. But a few hops to download a CD isn't too annoying.

For websites, I use Go!Zilla 3.5's leech button.

There's probably a Perl script that does Usenet and the web, but I'd bet money that finding it would be a pain, and getting it to work in my environment would be painful too.

Sir Penguin
15-06-2003, 20:09:58
I never understood how to get binaries with Pan (or any news reader, for that matter). How does that work?

SP

Darkstar
16-06-2003, 05:02:08
With Outlook Express, you have to highlight all the binary parts, and tell it to decode. If you actually selected all the parts, it will eventually spit out the decoded file from the UUEncoded text.
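The "did you actually select all the parts" check can be sketched in a few lines of Python. This is only an illustration: it assumes subjects end in a "(part/total)" counter, which is a common convention but not something every poster follows, and the function name is made up for the example.

```python
import re

# Assumed subject convention: "name.ext (part/total)". Real posts vary.
PART_RE = re.compile(r"^(?P<name>.+?)\s*\((?P<part>\d+)/(?P<total>\d+)\)\s*$")


def missing_parts(subjects):
    """Return {file name: sorted missing part numbers} for incomplete files."""
    seen = {}    # file name -> set of part numbers found
    totals = {}  # file name -> total number of parts announced
    for subject in subjects:
        match = PART_RE.match(subject)
        if not match:
            continue  # not a multipart subject under the assumed convention
        name = match.group("name")
        seen.setdefault(name, set()).add(int(match.group("part")))
        totals[name] = int(match.group("total"))

    result = {}
    for name, parts in seen.items():
        missing = sorted(set(range(1, totals[name] + 1)) - parts)
        if missing:
            result[name] = missing
    return result
```

If the result is empty, every file has all of its parts and decoding has a chance of working.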

Deacon
16-06-2003, 05:21:34
How does the software work, or how does the usenet work?

The usenet is a giant collection of newsgroups. I'm foggy on how groups are created and how old posts are flushed out, but I've never run an NNTP server, so there. :p

NNTP is the protocol. NNTP is one of many protocols used on the internet. These days, IP seems to be the protocol of protocols. So typical NNTP servers and clients exchange NNTP commands over a TCP connection, which in turn travels inside IP packets. I think. :)

It's like the web, but with virtually no effort put into presentation, and no hyperlinks. As with e-mail, everything, including binaries, must be transmitted as text. Binary data is typically encoded into text by representing every 6 bits of the binary with a one-byte ASCII character. This makes a post containing an encoded binary roughly 4/3 the size of the unencoded binary (a bit more once line breaks and headers are counted).
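Here is a minimal Python sketch of the 6-bits-per-character idea. It uses the base64 alphabet because the standard library can check the result; UUEncode uses a different set of characters but the same 6-bit grouping. The function name is made up for the example.

```python
import base64

# The 64-character base64 alphabet: one character per 6-bit value.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"


def encode_6bit(data: bytes) -> str:
    """Encode bytes as text, 3 input bytes (24 bits) -> 4 output characters."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        pad = 3 - len(chunk)  # 0, 1 or 2 missing bytes in the last group
        n = int.from_bytes(chunk + b"\x00" * pad, "big")
        # Slice the 24-bit number into four 6-bit values.
        chars = [ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        if pad:
            chars[4 - pad:] = "=" * pad  # mark the padded positions
        out.append("".join(chars))
    return "".join(out)


payload = bytes(range(256)) * 3                 # 768 bytes of "binary"
text = encode_6bit(payload)                     # 1024 characters of text
matches_stdlib = text == base64.b64encode(payload).decode("ascii")
ratio = len(text) / len(payload)                # 4/3
```

Every 3 bytes in become 4 characters out, which is where the 4/3 comes from.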

Maybe bases larger or smaller than 64 were tried, and the world almost ended. But 64 is the magic number. Probably because if you use 7 bits, I believe that chaos is possible because some of the ASCII characters are unprintable or might instruct a particularly dumb machine to behave badly. Using a smaller base yields an even more wasteful final transmission. Why not use extended ASCII characters that are printable? Maybe extended ASCII is too new.
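One way to sanity-check that guess is to count the printable characters in 7-bit ASCII and see how many whole bits fit. This is just arithmetic in Python, not a history of why 64 was chosen, and the helper names are made up for the example.

```python
def printable_ascii_count() -> int:
    """Count 7-bit ASCII codes that are printable (space through tilde)."""
    return sum(1 for code in range(128) if 32 <= code <= 126)


def max_bits_per_char(alphabet_size: int) -> int:
    """Largest whole number of bits one character can carry: floor(log2(size))."""
    return alphabet_size.bit_length() - 1


def expansion_factor(bits_per_char: int) -> float:
    """How much bigger the text is than the binary (8 bits sent per character)."""
    return 8 / bits_per_char


count = printable_ascii_count()        # 95 printable characters
bits = max_bits_per_char(count)        # 6, because 2**6 = 64 <= 95 < 128 = 2**7
base64_cost = expansion_factor(6)      # about 1.33
hex_cost = expansion_factor(4)         # 2.0, the "smaller base" case
```

With only 95 safe characters there is no room for a full 7 bits per character, and dropping to 4 bits (plain hex) doubles the size instead of adding a third.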

To use usenet software, you need a usenet (NNTP) server account. If your ISP doesn't provide you with a usenet account, you can buy one from somebody else and access their server over the internet. In the past, I've used Randori. There are probably others. For me, it was as simple as going to the website, choosing a plan, punching in the credit card number and e-mail address, and receiving a login and password.

Next, you need the software. All usenet software is the same if it's any good. You enter your server settings, and then download whatever groups are being carried by the server. Some servers are just about unrestricted in terms of what they carry. My ISP seems to be restricting MP3 content.

I probably danced around the heart of the matter...

Sir Penguin
16-06-2003, 05:27:23
Actually, I just couldn't find the download binary function in Pan. I may have been using an NNTP server that didn't like binaries, though.

SP

Deacon
16-06-2003, 05:52:54
I forgot to mention another technicality. Just like an e-mail message, a usenet post is a huge blob. Again, since I don't know much about NNTP servers, I don't know why or how maximum post sizes are used, but they exist. So you'll often see ISOs or other big files split into multiple files with various popular archiver/compression combos like RAR or ZIP.
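The splitting part is simple enough to sketch in Python. The size limit below is a made-up number for illustration, not any real server's limit, and the function names are hypothetical.

```python
def split_bytes(data: bytes, max_size: int) -> list[bytes]:
    """Cut data into consecutive pieces of at most max_size bytes each."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    return [data[i:i + max_size] for i in range(0, len(data), max_size)]


def join_parts(parts: list[bytes]) -> bytes:
    """Glue the pieces back together; they must be in the original order."""
    return b"".join(parts)


big_file = bytes(range(256)) * 40       # 10240 bytes standing in for an ISO
parts = split_bytes(big_file, 4000)     # hypothetical 4000-byte post limit
restored = join_parts(parts)
```

Lose one part, or join them out of order, and the restored file is garbage, which is why incomplete posts are such a headache.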

I suppose you could do things the UNIX way and use tar to create the archive (if it's more than one file), a splitter program to break it into chunks, and a compression program to compress the chunks. But most folks use ZIP or RAR. They are one-stop shopping, while if things were done the UNIX way, well, you'd need one of every program ever used for each task. And even then, you're at the mercy of the person who posted. In what order were all the utilities applied? Did the poster compress first or split first? Did he make a mistake with the filename extensions that are supposed to tip you off?
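To show why the order matters, here is a rough Python sketch of one possible order: archive, then compress, then split. That order is an assumption for the example (a poster could just as well split first and compress each chunk), and the function names are made up. Whatever order was used to pack, unpacking has to undo it in exactly the reverse order.

```python
import gzip
import io
import tarfile


def pack(files: dict[str, bytes], chunk_size: int) -> list[bytes]:
    """Assumed order: tar the files, gzip the tar, split the result."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    compressed = gzip.compress(buf.getvalue())
    return [compressed[i:i + chunk_size]
            for i in range(0, len(compressed), chunk_size)]


def unpack(chunks: list[bytes]) -> dict[str, bytes]:
    """Reverse order: join the chunks, gunzip, then read the tar."""
    raw = gzip.decompress(b"".join(chunks))
    files = {}
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r") as tar:
        for member in tar.getmembers():
            if member.isfile():
                files[member.name] = tar.extractfile(member).read()
    return files


original = {"readme.txt": b"hello", "data.bin": bytes(range(256)) * 10}
chunks = pack(original, chunk_size=100)
round_trip = unpack(chunks)
```

If the downloader guesses the order wrong (say, tries to gunzip each chunk separately), the first step fails, and nothing in the data itself says what the right order was.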

Here's some more paranoia. Things like Tar and Gzip SHOULD not care what they are being combined with. But maybe the person who posted swears by GodAwfulZIP. Because of some bug, GAZIP might require you to install more software to handle each and every combination.

Multiply the number of compression programs by the number of archivers by the number of splitters to see where this is going.

Finally, a message can contain multiple attachments as long as the sum of everything stays under the limit. I think. :)

Deacon
16-06-2003, 05:59:50
I'm using Pan 0.13.4. I just highlight a group of posts with the green icon next to them, each of which should represent one file. I then click the "Save As" icon, which is the one with the diskette and pencil.