buildfarm logging versus embedded nulls

by Tom Laneon 2010-03-11T22:57:10+00:00
I was looking at this recent nonrepeatable buildfarm failure:
http://buildfarm.postgresql.org/cgi-bin/show-log.pl?nm=polecat&dt=2010-03-11%2021:45:10
which has several instances of the known "pgstat wait timeout" problem
during the parallel regression tests.
I was disappointed to see that the postmaster log part of the report
is truncated near the start of the run, so there's no way to see if
anything interesting got logged near the point of the failure.
When I run "make check" on my own OS X machine, I notice that the
postmaster.log file usually has some runs of a few dozen null bytes in
it. I suspect this is an artifact of Apple's stdio buffering
implementation when several backends are writing to the same log file.
I suppose that what happened in the above case is that some nulls got
embedded in postmaster.log, and then the file got truncated at the first
null, perhaps during the upload to the buildfarm server, or maybe it's
intact on the server but the web page is failing to display anything
past that point.
There's probably not much we can do about Apple's stdio (and I would bet
that they inherited this behavior from the BSDen anyway). What we
*could* do is
(1) encourage buildfarm owners to run the tests with logging-collector
turned on, and/or
(2) fix the buildfarm reporting mechanisms to not be fazed by nulls in
the log files.
I have no clear idea how hard either of these things is.
regards, tom lane

Re: buildfarm logging versus embedded nulls

by Andrew Dunstanon 2010-03-12T00:00:34+00:00.

Well, the good news is that we actually have the data on the server, in
a temp file that will be cleaned up, but hasn't been yet. I have placed
a copy at is entirely on the server. That's more good news.
Now, the log-text field in our build-status-log table is text, so it's
on insertion to the database that it gets truncated. I'm thinking that I
should just escape nulls with a regex ( 's/x00/\0/g' or similar)
before inserting the data.
Anyone got any better ideas?
(BTW, your idea of using logging-collector won't work without
significant changes in the buildfarm client. Nice idea, though.)
cheers
andrew

Re: buildfarm logging versus embedded nulls

by Tom Laneon 2010-03-12T14:18:53+00:00.
Andrew Dunstan Thanks for that. I found the relevant part of the postmaster log:
[4b99671f.ddb4:36] LOG: statement: DELETE FROM vactst WHERE i != 0;
[4b99671f.ddb4:37] LOG: statement: VACUUM (FULL) vactst;
[4b99671f.ddb4:38] WARNING: pgstat wait timeout
[4b99671f.ddb4:39] LOG: statement: DELETE FROM vactst;
[4b99671f.ddb4:40] LOG: statement: SELECT * FROM vactst;
[4b99671f.ddb4:41] LOG: statement: VACUUM (FULL, FREEZE) vactst;
[4b99671f.ddb4:42] WARNING: pgstat wait timeout
[4b99671f.ddb4:43] LOG: statement: VACUUM (ANALYZE, FULL) vactst;
[4b99671f.ddb4:44] WARNING: pgstat wait timeout
[4b99671f.ddb4:45] LOG: statement: CREATE TABLE vaccluster (i INT PRIMARY KEY);
[4b99671f.ddb4:46] LOG: statement: ALTER TABLE vaccluster CLUSTER ON vaccluster-pkey;
[4b99671f.ddb4:47] LOG: statement: INSERT INTO vaccluster SELECT * FROM vactst;
[4b99671f.ddb4:48] LOG: statement: CLUSTER vaccluster;
[4b99671f.ddb4:49] LOG: statement: VACUUM FULL pg-am;
[4b99671f.ddb4:50] WARNING: pgstat wait timeout
[4b99671f.ddb4:51] LOG: statement: VACUUM FULL pg-class;
[4b99671f.ddb4:52] WARNING: pgstat wait timeout
[4b99671f.ddb4:53] LOG: statement: VACUUM FULL pg-database;
[4b99671f.ddb4:54] WARNING: pgstat wait timeout
[4b99671f.ddb4:55] LOG: statement: VACUUM FULL vaccluster;
[4b996707.dcc0:2] WARNING: pgstat wait timeout
[4b99671f.ddb4:56] WARNING: pgstat wait timeout
[4b99671f.ddb4:57] LOG: statement: VACUUM FULL vactst;
[4b996707.dcc0:3] WARNING: pgstat wait timeout
[4b996747.ddcc:1] WARNING: pgstat wait timeout
[4b99671f.ddba:4] LOG: statement: INSERT INTO DEFAULT-TBL VALUES (1, 'thomas', 57.0613);
Most of the "pgstat wait timeout" gripes are coming from the backend
running the "vacuum" regression test, but there are two that came from
process dcc0, which is shown by other log entries to be the autovacuum
launcher. So now I'm wondering if there could be some issue that occurs
when the autovac launcher and a backend both want stats concurrently.
I have not got further than that in my uncaffeinated state
Works for me.
regards, tom lane

Re: buildfarm logging versus embedded nulls

by Alvaro Herreraon 2010-03-12T14:40:01+00:00.
Tom Lane wrote:
Seems like this issue would exist whenever two backends want stats
concurrently, no? Is this well-tested?

Re: buildfarm logging versus embedded nulls

by Andrew Dunstanon 2010-03-12T14:45:52+00:00.

Done.
cheers
andrew

Re: buildfarm logging versus embedded nulls

by Tom Laneon 2010-03-12T14:56:41+00:00.
Alvaro Herrera Well, you tell me how much it's been tested ;-). But the autovac
launcher could possibly interact differently than another backend,
since it has a different value for maximum stats file age.
Anyway it's only a guess. It could well be that that machine was simply
so heavily loaded that the stats collector couldn't respond fast enough.
I'm just wondering whether there's an unrecognized bug lurking here.
regards, tom lane

Re: buildfarm logging versus embedded nulls

by Robert Creageron 2010-03-12T15:20:06+00:00.

You've got me trained well now. I'm now looking at my build machine failures. Wasn't sure what to do about that one, since no relevant files changed.
Is there any value in setting "keep-error-builds =
Cheers,
Rob

Re: buildfarm logging versus embedded nulls

by Tom Laneon 2010-03-12T20:19:15+00:00.
I wrote:
Still meditating on this ... and it strikes me that the pgstat.c code
is really uncommunicative about problems. In particular,
pgstat-read-statsfile-timestamp and pgstat-read-statsfile don't complain
at all about being unable to read a stats file. It seems to me that the
only "expected" case is ENOENT (and even that isn't really expected, in
normal operation). Surely we should at least elog(LOG) any other
failure condition?
Another place that could probably do with elog(LOG) is where
pgstat-write-statsfile resets last-statrequest in case it's in the
future. That shouldn't ever happen. While the reset is probably
a good thing for robustness, wouldn't logging it be a good idea?
Lastly, backend-read-statsfile is designed to send an inquiry message
every time through the loop, ie, every 10 msec. This is said to be in
case the stats collector drops one. But is this enough to flood the
collector and make things worse? I wonder if there should be some
backoff there.
regards, tom lane

Re: buildfarm logging versus embedded nulls

by Alvaro Herreraon 2010-03-12T20:45:28+00:00.
Tom Lane wrote:
Yeah, I had the same thought.
I also think the autovacuum worker minimum timestamp may be playing
games with the retry logic too. Maybe a worker is requesting a new file
continuously because pgstat is not able to provide one before the
deadline is past, and thus overloading it. I still think that 500ms is
too much for a worker, but backing off all the way to 10ms seems too
much. Maybe it should just be, say, 100ms.

Re: buildfarm logging versus embedded nulls

by Tom Laneon 2010-03-12T21:11:48+00:00.
Alvaro Herrera OK, I'll add some logging.
But we don't advance the deadline within the wait loop, so (in theory)
a single requestor shouldn't be able to trigger more than one stats file
update. I wonder though if an autovac worker could make many such
requests over its lifespan ...
regards, tom lane

Re: buildfarm logging versus embedded nulls

by Alvaro Herreraon 2010-03-12T21:46:25+00:00.
Tom Lane wrote:
Hmm, yeah.
Well, yes, but it will request fresh stats only for the recheck logic
before each table, so there will be one intervening vacuum (or none,
actually, if the table was vacuumed by some other autovac worker.
Though given the default naptime of 1 min I find it unlikely that the
regression database will ever see more than one worker).
Since the warning comes from the launcher and not the worker, I wonder
if this is a red herring.

Re: buildfarm logging versus embedded nulls

by Tom Laneon 2010-03-12T22:19:59+00:00.
Alvaro Herrera It's all speculation at the moment. So far there's not really enough
evidence to refute the idea that the system was just under heavy load
at that point I wonder if there is any practical way for the buildfarm client script
to report about the system's load average, or some other gauge of how
much is going on in the buildfarm machine besides the regression tests.
One thought is just to log how long it takes to run the regression
tests. A longer-than-usual run for a particular animal would be evidence
of a load spike; if we could correlate that with failures of this sort
it would be easier to write them off as heavy load.
regards, tom lane