Discussion:
IRC Logging
Josiah Ritchie
2003-07-09 17:50:59 UTC
Any idea when ZyBot will be able to log IRC? Or is this already possible
and I just don't know it? If so, where are the logs?

Thanks,
flickerfly
Kevin Horn
2003-07-14 10:19:10 UTC
according to:
http://www.mozilla.org/projects/mozbot/

MozBot 2.4 has a logging module...it logs a simple XML format.

here's an example:
http://bugzilla.mozilla.org/attachment.cgi?id=76579&action=view

Is there some problem with using this functionality? It seems like it would
only take a couple of minutes to set up.

If we can capture these logs, the Docs team can use XSLT to transform them
into HTML for public viewing.
We might also work out a way to make them searchable.

so:
1) Can this be done? (I think it can...)
2) Should this be done? (I think it should...)
3) If no to either of the above, why not?

Kevin Horn
Andrew Aylett
2003-07-14 11:09:47 UTC
Post by Kevin Horn
http://www.mozilla.org/projects/mozbot/
MozBot 2.4 has a logging module...it logs a simple XML format.
http://bugzilla.mozilla.org/attachment.cgi?id=76579&action=view
Is there some problem with using this functionality? It seems like
it would only take a couple of minutes to set up.
If we're going to be using MozBot anyway, this looks good. It seems
overkill if we only want it for logging though...

I run Dave Beckett's RDF logger (it also generates text and HTML) on
my internal IRC server.

<http://cvs.ilrt.org/cvsweb/redland/logger/>

It's a perl script, logs to RDF by default (which is easily XSLT'd to
something nice) or to HTML or Text (or any two, or all three...).

As I said, if we want the MozBot anyway, good. If not, have a look at
this.

OK,
--
Andrew Aylett | www.aylett.co.uk | 9.72 x 10^-21 parsecs per picosecond
***@aylett.co.uk | answer==42 | -- it's not just a good idea, it's the law!
Aaron Goldblatt
2003-07-15 04:24:36 UTC
Post by Kevin Horn
2) Should this be done? (I think it should...)
The problem that occurs with logging an active channel is simply one of
volume. That is, how do you determine what is "relevant" discussion and what
is the idle chatter that can rapidly develop on IRC?

If you want to archive everything, you should be prepared to store it, and if
archived search is a goal, you should be prepared to store it appropriately
from the beginning.

Even if you drop control messages (joins, parts, ops, etc), you're still left
with a nice chunk of raw data.

Searching and sorting of data: Do you store nick in a separate field, or
simply dump everything into one endless raw-text store and do a plain text
search? Do you store datestamps? Separate fields provide a greater
amount of power and flexibility in sorting, but require thought and possibly
hacking first.
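
A minimal sketch of the "separate fields" option, using Python's built-in
sqlite3 purely as a stand-in for whichever db engine ends up being chosen.
The table name, column names, and index are assumptions made for
illustration, and the rows are just a few illustrative channel lines.

```python
import sqlite3

# In-memory database for illustration only; a real archive would live on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE lines (
        id      INTEGER PRIMARY KEY,
        channel TEXT NOT NULL,
        nick    TEXT NOT NULL,   -- stored separately so it can be indexed
        said_at TEXT NOT NULL,   -- ISO 8601 UTC timestamps sort as plain text
        kind    TEXT NOT NULL,   -- e.g. 'msg' or 'emote'
        body    TEXT NOT NULL
    )
    """
)
conn.execute("CREATE INDEX lines_by_nick ON lines (nick, said_at)")

rows = [
    ("#zynot", "Galik", "2003-07-15T23:37:19Z", "msg", "Hmmmm good point"),
    ("#zynot", "Galik", "2003-07-15T23:37:02Z", "emote", "stamps on dialup modem"),
    ("#zynot", "Emmett", "2003-07-15T23:39:44Z", "msg", "Wow, an acoustic coupler. Cool!"),
]
conn.executemany(
    "INSERT INTO lines (channel, nick, said_at, kind, body) VALUES (?, ?, ?, ?, ?)",
    rows,
)

# Structured query: everything one nick said, in time order.
by_nick = conn.execute(
    "SELECT said_at, body FROM lines WHERE nick = ? ORDER BY said_at",
    ("Galik",),
).fetchall()

# Raw text search still works on the same table.
by_text = conn.execute(
    "SELECT nick, body FROM lines WHERE body LIKE ?",
    ("%modem%",),
).fetchall()

print(by_nick)
print(by_text)
```

Keeping nick and timestamp in their own columns costs a little up-front
design, but it allows both kinds of query without re-parsing the raw text.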

Think about speed of archival: That is, how often will the data move from XML
output from the bot to searchable db, and how long will this conversion take
for how much data on what kind of box? Should we first track discussion in
the channel over a period of a couple weeks in terms of volume and processing
time, and then factor upwards by, say, 250% to 300%, in looking at target
capabilities?

Give consideration to how long the archive will be held, how fast it will
grow, and what limitations of the db engine we'll need to design around.
MySQL may be perfectly sufficient ... or does it have some upper limit we
may run into? If so, how quickly? And even if it's "a long time away," it's
helpful to think -now- about future upgrades and data conversion (or even
about abandoning a limited system before we start), because, as we're all
well aware, lack of planning on your part does not constitute an emergency
on my part.

If you think you can "edit" toward only "relevant" discussion, forget it.
Nobody is going to want to volunteer the time to read all that nonsense, I'd
bet, or be willing to put up with the flak they'll catch once something
"important" is zapped. (Been there, done that.)

Having said all of this, I'll agree: If IRC is a discussion forum that leads
to decision-making within the community, it's important that we archive, if
for no other reason than "governance in the sunshine." Decision-making
and archival of discussion surrounding the decision are linked and should not
be separated, because the record of discussion, how decisions are made, and
why, is important to transparency (just ask the FCC). That has been a
complaint about Gentoo's centralized decision-making -- that it's not
transparent and seems to be going on behind closed doors or behind
participants' backs.

If we wish to avoid the same mistake, we should avoid it from the beginning
and request (require?) that archives be kept as a matter of community record.

ag
Josiah Ritchie
2003-07-15 19:01:44 UTC
Thanks, that puts the struggles right up front. I hadn't considered most
of them.
_______________________________________________
Zynaut mailing list
http://lists.zynot.org/mailman/listinfo/zynaut
Will Reid
2003-07-15 23:58:53 UTC
Post by Aaron Goldblatt
Post by Kevin Horn
2) Should this be done? (I think it should...)
The problem that occurs with logging an active channel is simply one of
volume. That is, how do you determine what is "relevant" discussion and
what is the idle chatter that can rapidly develop on IRC?
I don't think volume will really become an issue, given how cheap storage
is these days. I'd set a very high estimate of 3MB a day (total per day,
all channels). That comes out to only about a gigabyte a year
*uncompressed*, and gzip shrinks the XML logs by 70-80%.
I just checked the current log sizes where the bot is running now. In 26
hours less than 650KB was logged. Of course the Zynot channels are slow now
compared to what they will become in a year, which is why I set my *estimate*
at 3MB a day.

***@cerebellum siffy $ du -h log/
646K log
***@cerebellum siffy $ du -h logs.tgz
120K logs.tgz

That's just to give a better idea of projected compression. I have no idea
whether it will get better or worse as things progress.
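
The arithmetic behind those figures can be sketched as follows. The numbers
are the ones quoted above (3MB a day, 646K raw, 120K compressed); the
function names are made up for illustration, and the result is only as good
as the estimate that goes in.

```python
# Back-of-the-envelope storage estimate using the figures from this message.
DAILY_MB = 3        # deliberately high estimate, all channels combined
RAW_KB = 646        # size reported by `du -h log/`
GZIPPED_KB = 120    # size reported by `du -h logs.tgz`


def yearly_gb(daily_mb, days=365):
    """Uncompressed growth per year, in gigabytes (1 GB = 1024 MB)."""
    return daily_mb * days / 1024


def space_saved(raw_kb, compressed_kb):
    """Fraction of the raw size that compression removes."""
    return 1 - compressed_kb / raw_kb


raw_per_year = yearly_gb(DAILY_MB)
saved = space_saved(RAW_KB, GZIPPED_KB)

print(f"uncompressed: {raw_per_year:.2f} GB/year")
print(f"compression saves: {saved:.0%}")
print(f"compressed: {raw_per_year * (1 - saved):.2f} GB/year")
```

On this one sample the saving lands slightly above the 70-80% range, so a
year of logs at the high estimate would compress to a couple of hundred
megabytes.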
Post by Aaron Goldblatt
If you want to archive everything, you should be prepared to store it, and
if archived search is a goal, you should be prepared to store it
appropriately from the beginning.
Hopefully the official zynot servers will be able to accommodate this. I
haven't talked specifics with anyone about CPU, disk, or even which server
they want to put any logging bots on. For "archiving", I'd like to see apache
doing on-the-fly gzip of the raw XML for everything except the current log.
The HTML output will likely be generated dynamically when someone runs a
search.
Post by Aaron Goldblatt
Even if you drop control messages (joins, parts, ops, etc), you're still
left with a nice chunk of raw data.
Just FYI, I haven't been able to get the bot to automatically ignore these
things yet. If anyone reading this knows what pattern I need to give the
mozbot to make it "block" the logging of such things, feel free to let me
know. My main concern with logging parts, joins, and mode changes is
netsplits. On freenode, 500 idling people can turn into 5000 lines of XML
during heavy rehubbing, with 0 of those lines being anything actually said.
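
Until the bot itself can be told to skip these events, one workaround is to
filter the log after the fact. The sketch below is not a mozbot setting; it
is a separate Python post-processing step that keeps only elements that
record something said. It assumes the msg and emote element names are the
only ones worth keeping, and the join element in the sample is hypothetical,
since this thread does not show how the logger actually writes joins and
parts.

```python
import xml.etree.ElementTree as ET

# Assumption: these two tags cover everything that was actually "said".
KEEP = {"msg", "emote"}


def keep_spoken(fragment):
    """Return only the elements whose tag is in KEEP.

    The log fragment has no single root element, so it is wrapped in a
    temporary <log> element before parsing.
    """
    root = ET.fromstring(f"<log>{fragment}</log>")
    return [el for el in root if el.tag in KEEP]


# The <join/> line is a made-up placeholder for a control message.
sample = """
<join channel="#zynot" nick="Galik" time="2003-07-15T23:36:00Z"/>
<msg channel="#zynot" nick="Galik" time="2003-07-15T23:37:19Z">Hmmmm good point</msg>
<emote channel="#zynot" nick="Galik" time="2003-07-15T23:37:39Z">puts them in a drawer</emote>
"""

spoken = keep_spoken(sample)
for el in spoken:
    print(el.tag, el.get("nick"), el.text)
```

Whitelisting the tags to keep, rather than listing the ones to drop, means
any control-message element that has not been seen yet is dropped as well.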
Post by Aaron Goldblatt
Searching and sorting of data: Do you store nick in a separate field, or
simply use an endless database of raw text storage and simply do a text
search? Do you store datestamps? These things provide a greater
amount of power and flexibility in sorting, but require thought and
possibly hacking first.
I plan to store everything as XML for flexibility in converting it to other
forms.
Sample log lines

***@cerebellum log $ tail \#zynot.xml.part
<emote channel="#zynot" nick="Galik" time="2003-07-15T23:37:02Z">stamps on
dialup modem</emote>
<msg channel="#zynot" nick="decaf" time="2003-07-15T23:37:05Z">may be,
you&apos;ll need some day</msg>
<msg channel="#zynot" nick="Galik" time="2003-07-15T23:37:19Z">Hmmmm good
point</msg>
<emote channel="#zynot" nick="Galik" time="2003-07-15T23:37:29Z">picks up the
pieces...</emote>
<emote channel="#zynot" nick="Galik" time="2003-07-15T23:37:39Z">puts them in
a drawer</emote>
<msg channel="#zynot" nick="Galik" time="2003-07-15T23:38:23Z">Thnx for saving
me :)</msg>
<emote channel="#zynot" nick="Emmett" time="2003-07-15T23:39:37Z">glues the
pieces of Galik&apos;s modem back together again</emote>
<msg channel="#zynot" nick="Emmett" time="2003-07-15T23:39:44Z">Wow, an
acoustic coupler. Cool!</msg>
<msg channel="#zynot" nick="Galik" time="2003-07-15T23:39:55Z">lol :)</msg>
<emote channel="#zynot" nick="Emmett" time="2003-07-15T23:39:58Z">gets out the
solder and duct tape</emote>
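
As a rough illustration of that flexibility, lines in this format can be
turned into records with separate nick and timestamp fields using only the
Python standard library. The function name and the dictionary keys are
assumptions chosen for this sketch; the element and attribute names come
from the sample above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def parse_log(fragment):
    """Parse a rootless log fragment into a list of plain dictionaries."""
    # Wrap the fragment so the parser sees a single root element.
    root = ET.fromstring(f"<log>{fragment}</log>")
    records = []
    for el in root:
        when = datetime.strptime(el.get("time"), "%Y-%m-%dT%H:%M:%SZ")
        records.append(
            {
                "kind": el.tag,
                "channel": el.get("channel"),
                "nick": el.get("nick"),
                "time": when.replace(tzinfo=timezone.utc),
                # Collapse the hard line wraps inside the message text.
                "text": " ".join((el.text or "").split()),
            }
        )
    return records


sample = """<emote channel="#zynot" nick="Galik" time="2003-07-15T23:37:02Z">stamps on
dialup modem</emote>
<msg channel="#zynot" nick="decaf" time="2003-07-15T23:37:05Z">may be,
you&apos;ll need some day</msg>"""

records = parse_log(sample)
for rec in records:
    print(rec["time"].isoformat(), rec["kind"], rec["nick"], rec["text"])
```

The parser decodes entities such as &apos; on its own, so the stored text
is the plain message rather than the escaped form.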
Post by Aaron Goldblatt
Think about speed of archival: That is, how often will the data move from
XML output from the bot to searchable db, and how long will this conversion
take for how much data on what kind of box? Should we first track
discussion in the channel over a period of a couple weeks in terms of
volume and processing time, and then factor upwards by, say, 250% to 300%,
in looking at target capabilities?
Answering this will require more time/testing.
Post by Aaron Goldblatt
Give consideration to either how long the archive will be held, or how
fast it will grow and what the limitations of the db engine are that we'll
need to design around. MySQL may be perfectly sufficient ... or does it
have some upward limit we may run into? If so, how quickly? And if it's
"a long time away," it's helpful to think -now- about future upgrades and
data conversion (or even at abandoning a limited system before we start),
as we're all well aware (because lack of planning on your part does not
constitute an emergency on my part).
If you think you can "edit" toward only "relevant" discussion, forget it.
Nobody is going to want to volunteer the time to read all that nonsense,
I'd bet, and be unwilling to put up with the flak they'll catch once
something "important" is zapped. (Been there, done that.)
Right, after reading this I thought "save everything, <cliche>hard drive
space is cheap</cliche>".
Post by Aaron Goldblatt
Having said all of this, I'll agree: If IRC is a discussion forum that
leads to decision-making within the community, it's important that we
archive, if for no other reason than "governance in the sunshine."
Decision-making and archival of discussion surrounding the decision are
linked and should not be separated, because the record of discussion, how
decisions are made, and why, is important to transparency (just ask the
FCC). That has been a complaint about Gentoo's centralized decision-making
-- that it's not transparent and seems to be going on behind closed doors
or behind participants' backs.
If we wish to avoid the same mistake, we should avoid it from the beginning
and request (require?) that archives be kept as a matter of community record.
ag
Kevin and I had a fairly in-depth conversation about this thread. (If you
want to see an XML log of that conversation, let me know ;)) I had meant to
wait for some more replies and thoughts before posting my own, but too late
now. As always, if I was unclear about anything, let me know and I'll do my
best to express my thoughts better.

Now for a list of things (specs?) I think are needed for this small project.

1. Obviously, a bot or 2 on independent servers for redundancy.

2. Some perl coders to make any changes to the mozbot code if we keep it for
XML logging.

3. I'd like some help making a module that can act as a cronjob for the bot,
so I can automate it by having it send itself commands, e.g. "rotatelogs".
Right now this has to be done by hand, afaik. Alternatively, some editing of
the XMLLogger.bm module to have the bot rotate logs on a schedule.

4. Someone knowledgeable about databases (MySQL, etc.), if that's what we
need for speed once the logs get large.

5. XSLT and HTML writers for style sheets and an interface to view the logs.

6. More ideas.

7. There's something else, always is. So this is here for YOU, the reader.
:)
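
To give a feel for item 5, here is a small sketch of turning log lines into
an HTML list. It is written in Python rather than XSLT, purely as a stand-in
that runs without extra tools; a real style sheet would do the same mapping
declaratively. The markup it produces and the function name are assumptions,
not an agreed design.

```python
import html
import xml.etree.ElementTree as ET


def log_to_html(fragment):
    """Render a rootless log fragment as an HTML unordered list."""
    root = ET.fromstring(f"<log>{fragment}</log>")
    items = []
    for el in root:
        nick = html.escape(el.get("nick", ""))
        when = html.escape(el.get("time", ""))
        # Collapse hard line wraps, then escape for safe display.
        text = html.escape(" ".join((el.text or "").split()))
        if el.tag == "emote":
            line = f"<em>* {nick} {text}</em>"
        else:
            line = f"&lt;{nick}&gt; {text}"
        items.append(f'<li><span class="time">{when}</span> {line}</li>')
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


sample = """<msg channel="#zynot" nick="Galik" time="2003-07-15T23:39:55Z">lol :)</msg>
<emote channel="#zynot" nick="Emmett" time="2003-07-15T23:39:58Z">gets out the
solder and duct tape</emote>"""

page = log_to_html(sample)
print(page)
```

Escaping every value before it goes into the page matters here, because
anything typed into the channel ends up in the public HTML.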


Thanks,
Will Reid
