Post by Aaron Goldblatt
Post by Kevin Horn
2) Should this be done? (I think it should...)
The problem that occurs with logging an active channel is simply one of
volume. That is, how do you determine what is "relevant" discussion and
what is the idle chatter that can rapidly develop on IRC?
I don't think volume will really become an issue, given how cheap storage
is these days. I'd set a very high estimate of 3MB a day (total across all
channels). That only comes out to about a gigabyte a year *uncompressed*,
and gzip achieves a 70-80% compression ratio on the XML logs.
I just checked the current log sizes where the bot is running now. In 26
hours, less than 650KB was logged. Of course the Zynot channels are slow
now compared to what they will become in a year, but that's where I got my
3MB a day *estimate*.
***@cerebellum siffy $ du -h log/
646K log
***@cerebellum siffy $ du -h logs.tgz
120K logs.tgz
That's just to give a better idea of projected compression. I have no idea
whether it will get better or worse as things progress.
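To sanity-check the arithmetic, here is a quick back-of-the-envelope sketch. All figures come straight from the numbers above (the 3MB/day estimate and the du output); nothing else is measured.

```python
# Back-of-the-envelope storage check using the figures from the post:
# 646K of raw XML compressed to 120K, and a 3MB/day high-end estimate.

raw_kb, gz_kb = 646, 120
saved = 1 - gz_kb / raw_kb            # fraction gzip saves
per_day_mb = 3                        # high-end estimate
per_year_gb = per_day_mb * 365 / 1024

print(f"gzip saves {saved:.0%}")                         # ~81%, in line with 70-80%
print(f"uncompressed: {per_year_gb:.2f} GB/year")        # ~1.07 GB
print(f"compressed:   {per_year_gb * (1 - saved):.2f} GB/year")
```

So even at the high-end estimate, a year of compressed logs is a couple hundred megabytes.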
Post by Aaron Goldblatt
If you want to archive everything, you should be prepared to store it, and
if archived search is a goal, you should be prepared to store it
appropriately from the beginning.
Hopefully the official Zynot servers will be able to accommodate this. I
haven't talked specifics with anyone about CPU, disk space, or even which
server they want to put any logging bots on. For "archiving" I'd like to
see Apache doing on-the-fly gzip compression of the raw XML for everything
except the current log. The HTML output will likely be generated
dynamically when someone runs a search.
Post by Aaron Goldblatt
Even if you drop control messages (joins, parts, ops, etc), you're still
left with a nice chunk of raw data.
Just FYI, I haven't been able to get the bot to ignore these things
automatically yet. If anyone reading this knows what pattern I need to give
mozbot to block the logging of such things, feel free to let me know. My
main concern with logging parts, joins, and mode changes is netsplits: on
freenode, 500 idling people can turn into 5000 lines of XML during heavy
rehubbing, with zero lines of it being anything actually said.
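Until the bot itself can be told to skip control traffic, one stopgap is to strip it from the finished log in post-processing. Here is a minimal sketch; note that the element names `<join>`, `<part>`, `<quit>`, and `<mode>` are assumptions on my part (the sample below only shows `<msg>` and `<emote>`), so check what the XMLLogger module actually emits and adjust the pattern.

```python
import re

# Assumed control-event element names -- verify against the real log output.
CONTROL = re.compile(r'^\s*<(join|part|quit|mode)\b')

def strip_control_events(lines):
    """Keep only lines that are not join/part/quit/mode events."""
    return [line for line in lines if not CONTROL.match(line)]

log = [
    '<join channel="#zynot" nick="idler" time="2003-07-15T23:00:00Z"/>',
    '<msg channel="#zynot" nick="Galik" time="2003-07-15T23:37:19Z">Hmmmm good point</msg>',
]
print(strip_control_events(log))   # only the <msg> line survives
```

The same filter run over a netsplit-heavy log would drop the thousands of join/part lines while keeping everything that was actually said.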
Post by Aaron Goldblatt
Searching and sorting of data: Do you store nick in a separate field, or
simply use an endless database of raw text storage and do a text search?
Do you store datestamps? These things provide a greater amount of power
and flexibility in sorting, but require thought and possibly hacking first.
I plan to store everything as XML for flexibility in converting it to other
forms.
Sample log lines:
***@cerebellum log $ tail \#zynot.xml.part
<emote channel="#zynot" nick="Galik" time="2003-07-15T23:37:02Z">stamps on dialup modem</emote>
<msg channel="#zynot" nick="decaf" time="2003-07-15T23:37:05Z">may be, you'll need some day</msg>
<msg channel="#zynot" nick="Galik" time="2003-07-15T23:37:19Z">Hmmmm good point</msg>
<emote channel="#zynot" nick="Galik" time="2003-07-15T23:37:29Z">picks up the pieces...</emote>
<emote channel="#zynot" nick="Galik" time="2003-07-15T23:37:39Z">puts them in a drawer</emote>
<msg channel="#zynot" nick="Galik" time="2003-07-15T23:38:23Z">Thnx for saving me :)</msg>
<emote channel="#zynot" nick="Emmett" time="2003-07-15T23:39:37Z">glues the pieces of Galik's modem back together again</emote>
<msg channel="#zynot" nick="Emmett" time="2003-07-15T23:39:44Z">Wow, an acoustic coupler. Cool!</msg>
<msg channel="#zynot" nick="Galik" time="2003-07-15T23:39:55Z">lol :)</msg>
<emote channel="#zynot" nick="Emmett" time="2003-07-15T23:39:58Z">gets out the solder and duct tape</emote>
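This per-field format is exactly what makes searching easier than grepping raw text: each element can be parsed into (channel, nick, time, text) and filtered directly, the way a database index would later. A quick sketch using the sample lines above (the log file itself has no root element, so we wrap it in one before parsing):

```python
import xml.etree.ElementTree as ET

# A few lines from the sample log above.
raw = '''\
<emote channel="#zynot" nick="Galik" time="2003-07-15T23:37:02Z">stamps on dialup modem</emote>
<msg channel="#zynot" nick="decaf" time="2003-07-15T23:37:05Z">may be, you'll need some day</msg>
<msg channel="#zynot" nick="Galik" time="2003-07-15T23:37:19Z">Hmmmm good point</msg>
'''

# The .part file is a sequence of elements with no root, so add one.
root = ET.fromstring(f"<log>{raw}</log>")

# Search by nick -- the kind of query a "nick in a separate field" schema enables.
galik = [(e.get("time"), e.tag, e.text) for e in root if e.get("nick") == "Galik"]
for time, kind, text in galik:
    print(time, kind, text)
```

The same attributes would map directly onto database columns (channel, nick, timestamp, body) if the logs later move into MySQL.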
Post by Aaron Goldblatt
Think about speed of archival: That is, how often will the data move from
XML output from the bot to searchable db, and how long will this conversion
take for how much data on what kind of box? Should we first track
discussion in the channel over a period of a couple weeks in terms of
volume and processing time, and then factor upwards by, say, 250% to 300%,
in looking at target capabilities?
Answering this will require more time/testing.
Post by Aaron Goldblatt
Give consideration to either how long the archive will be held, or how
fast it will grow and what the limitations of the db engine are that we'll
need to design around. MySQL may be perfectly sufficient ... or does it
have some upward limit we may run into? If so, how quickly? And if it's
"a long time away," it's helpful to think -now- about future upgrades and
data conversion (or even at abandoning a limited system before we start),
as we're all well aware (because lack of planning on your part does not
constitute an emergency on my part).
If you think you can "edit" toward only "relevant" discussion, forget it.
Nobody is going to want to volunteer the time to read all that nonsense,
I'd bet, and be unwilling to put up with the flack they'll catch once
something "important" is zapped. (Been there, done that.)
Right, after reading this I thought "save everything, <cliche>hard drive
space is cheap</cliche>".
Post by Aaron Goldblatt
Having said all of this, I'll agree: If IRC is a discussion forum that
leads to decision-making within the community, it's important that we
archive, if for no other reason than "governance in the sunshine."
Decision-making and archival of discussion surrounding the decision are
linked and should not be separated, because the record of discussion, how
decisions are made, and why, is important to transparency (just ask the
FCC). That has been a complaint about Gentoo's centralized decision-making
-- that it's not transparent and seems to be going on behind closed doors
or behind participants' backs.
If we wish to avoid the same mistake, we should avoid it from the beginning
and request (require?) that archives be kept as a matter of community record.
ag
Kevin and I had a fairly in-depth conversation about this thread. (If you
want to see an XML log of that conversation, let me know. ;)) I thought I'd
wait for some more replies and thoughts before making my own, but it's too
late now. As always, if I was unclear about anything, let me know and I'll
do my best to better express my thoughts.
Now for a list of things (specs?) I think are needed for this small project.
1. Obviously, a bot or two on independent servers for redundancy.
2. Some Perl coders to make any changes to the mozbot code if we keep it
for XML logging.
3. Some help making a module that can act as a cron job for the bot, so I
can automate it by having it send itself commands (e.g. "rotatelogs");
right now this has to be done by hand, AFAIK. Alternatively, some editing
of the XMLLogger.bm module to have the bot rotate logs on a schedule.
4. Someone knowledgeable about databases, if that (MySQL, etc.) is what we
need for speed once the logs get large.
5. XSLT and HTML writers for style sheets and an interface to view the logs.
6. More ideas.
7. There's something else -- there always is. So this one is here for YOU,
the reader. :)
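For item 3, until the bot can rotate its own logs, the job could be sketched as an external cron script: datestamp and gzip the finished log, then truncate it. Everything here is a hypothetical illustration (the `#zynot.xml.part` name follows the sample earlier in this mail; the rest is assumed), and note the caveat that the bot must be told to reopen its log handle for this to be safe -- which is exactly why an in-bot module would be better.

```python
import gzip
import shutil
import time
from pathlib import Path

def rotate(log_path: str) -> Path:
    """Compress the finished log to name.YYYY-MM-DD.gz, then truncate it.

    Caveat: truncating a file the bot still holds open is racy; in
    practice the bot needs a command (or an in-bot cron module) to
    close and reopen its log around this step.
    """
    src = Path(log_path)
    stamp = time.strftime("%Y-%m-%d")
    dest = src.with_name(f"{src.stem}.{stamp}.gz")     # e.g. #zynot.xml.2003-07-15.gz
    with src.open("rb") as f_in, gzip.open(dest, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)                # archive a gzipped copy
    src.write_bytes(b"")                               # start the next log empty
    return dest
```

This also gives the "on-the-fly gzip" archive format for free: everything except the current log is already compressed on disk.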
Thanks,
Will Reid