- Previous thread: HeapTupleData.t_self garbage values
- Next thread: Reposnse from backend when wrong user/database request send
- Threads sorted by date: postgresql-hackers 201003
http://buildfarm.postgresql.org/cgi-bin/show-log.pl?nm=polecat&dt=2010-03-11%2021:45:10
which has several instances of the known "pgstat wait timeout" problem
during the parallel regression tests.
I was disappointed to see that the postmaster log part of the report
is truncated near the start of the run, so there's no way to see if
anything interesting got logged near the point of the failure.
When I run "make check" on my own OS X machine, I notice that the
postmaster.log file usually has some runs of a few dozen null bytes in
it. I suspect this is an artifact of Apple's stdio buffering
implementation when several backends are writing to the same log file.
I suppose that what happened in the above case is that some nulls got
embedded in postmaster.log, and then the file got truncated at the first
null, perhaps during the upload to the buildfarm server, or maybe it's
intact on the server but the web page is failing to display anything
past that point.
There's probably not much we can do about Apple's stdio (and I would bet
that they inherited this behavior from the BSDen anyway). What we
*could* do is
(1) encourage buildfarm owners to run the tests with logging-collector
turned on, and/or
(2) fix the buildfarm reporting mechanisms to not be fazed by nulls in
the log files.
I have no clear idea how hard either of these things is.
regards, tom lane
Well, the good news is that we actually have the data on the server, in
a temp file that will be cleaned up, but hasn't been yet. I have placed
a copy at
Now, the log-text field in our build-status-log table is text, so it's
on insertion to the database that it gets truncated. I'm thinking that I
should just escape nulls with a regex ( 's/x00/\0/g' or similar)
before inserting the data.
Anyone got any better ideas?
(BTW, your idea of using logging-collector won't work without
significant changes in the buildfarm client. Nice idea, though.)
cheers
andrew
Andrew Dunstan
Thanks for that. I found the relevant part of the postmaster log:
[4b99671f.ddb4:36] LOG: statement: DELETE FROM vactst WHERE i != 0;
[4b99671f.ddb4:37] LOG: statement: VACUUM (FULL) vactst;
[4b99671f.ddb4:38] WARNING: pgstat wait timeout
[4b99671f.ddb4:39] LOG: statement: DELETE FROM vactst;
[4b99671f.ddb4:40] LOG: statement: SELECT * FROM vactst;
[4b99671f.ddb4:41] LOG: statement: VACUUM (FULL, FREEZE) vactst;
[4b99671f.ddb4:42] WARNING: pgstat wait timeout
[4b99671f.ddb4:43] LOG: statement: VACUUM (ANALYZE, FULL) vactst;
[4b99671f.ddb4:44] WARNING: pgstat wait timeout
[4b99671f.ddb4:45] LOG: statement: CREATE TABLE vaccluster (i INT PRIMARY KEY);
[4b99671f.ddb4:46] LOG: statement: ALTER TABLE vaccluster CLUSTER ON vaccluster-pkey;
[4b99671f.ddb4:47] LOG: statement: INSERT INTO vaccluster SELECT * FROM vactst;
[4b99671f.ddb4:48] LOG: statement: CLUSTER vaccluster;
[4b99671f.ddb4:49] LOG: statement: VACUUM FULL pg-am;
[4b99671f.ddb4:50] WARNING: pgstat wait timeout
[4b99671f.ddb4:51] LOG: statement: VACUUM FULL pg-class;
[4b99671f.ddb4:52] WARNING: pgstat wait timeout
[4b99671f.ddb4:53] LOG: statement: VACUUM FULL pg-database;
[4b99671f.ddb4:54] WARNING: pgstat wait timeout
[4b99671f.ddb4:55] LOG: statement: VACUUM FULL vaccluster;
[4b996707.dcc0:2] WARNING: pgstat wait timeout
[4b99671f.ddb4:56] WARNING: pgstat wait timeout
[4b99671f.ddb4:57] LOG: statement: VACUUM FULL vactst;
[4b996707.dcc0:3] WARNING: pgstat wait timeout
[4b996747.ddcc:1] WARNING: pgstat wait timeout
[4b99671f.ddba:4] LOG: statement: INSERT INTO DEFAULT-TBL VALUES (1, 'thomas', 57.0613);
Most of the "pgstat wait timeout" gripes are coming from the backend
running the "vacuum" regression test, but there are two that came from
process dcc0, which is shown by other log entries to be the autovacuum
launcher. So now I'm wondering if there could be some issue that occurs
when the autovac launcher and a backend both want stats concurrently.
I have not got further than that in my uncaffeinated state
Works for me.
regards, tom lane
[4b99671f.ddb4:36] LOG: statement: DELETE FROM vactst WHERE i != 0;
[4b99671f.ddb4:37] LOG: statement: VACUUM (FULL) vactst;
[4b99671f.ddb4:38] WARNING: pgstat wait timeout
[4b99671f.ddb4:39] LOG: statement: DELETE FROM vactst;
[4b99671f.ddb4:40] LOG: statement: SELECT * FROM vactst;
[4b99671f.ddb4:41] LOG: statement: VACUUM (FULL, FREEZE) vactst;
[4b99671f.ddb4:42] WARNING: pgstat wait timeout
[4b99671f.ddb4:43] LOG: statement: VACUUM (ANALYZE, FULL) vactst;
[4b99671f.ddb4:44] WARNING: pgstat wait timeout
[4b99671f.ddb4:45] LOG: statement: CREATE TABLE vaccluster (i INT PRIMARY KEY);
[4b99671f.ddb4:46] LOG: statement: ALTER TABLE vaccluster CLUSTER ON vaccluster-pkey;
[4b99671f.ddb4:47] LOG: statement: INSERT INTO vaccluster SELECT * FROM vactst;
[4b99671f.ddb4:48] LOG: statement: CLUSTER vaccluster;
[4b99671f.ddb4:49] LOG: statement: VACUUM FULL pg-am;
[4b99671f.ddb4:50] WARNING: pgstat wait timeout
[4b99671f.ddb4:51] LOG: statement: VACUUM FULL pg-class;
[4b99671f.ddb4:52] WARNING: pgstat wait timeout
[4b99671f.ddb4:53] LOG: statement: VACUUM FULL pg-database;
[4b99671f.ddb4:54] WARNING: pgstat wait timeout
[4b99671f.ddb4:55] LOG: statement: VACUUM FULL vaccluster;
[4b996707.dcc0:2] WARNING: pgstat wait timeout
[4b99671f.ddb4:56] WARNING: pgstat wait timeout
[4b99671f.ddb4:57] LOG: statement: VACUUM FULL vactst;
[4b996707.dcc0:3] WARNING: pgstat wait timeout
[4b996747.ddcc:1] WARNING: pgstat wait timeout
[4b99671f.ddba:4] LOG: statement: INSERT INTO DEFAULT-TBL VALUES (1, 'thomas', 57.0613);
Most of the "pgstat wait timeout" gripes are coming from the backend
running the "vacuum" regression test, but there are two that came from
process dcc0, which is shown by other log entries to be the autovacuum
launcher. So now I'm wondering if there could be some issue that occurs
when the autovac launcher and a backend both want stats concurrently.
I have not got further than that in my uncaffeinated state
Works for me.
regards, tom lane
Tom Lane wrote:
Seems like this issue would exist whenever two backends want stats
concurrently, no? Is this well-tested?
Seems like this issue would exist whenever two backends want stats
concurrently, no? Is this well-tested?
Done.
cheers
andrew
Alvaro Herrera
Well, you tell me how much it's been tested ;-). But the autovac
launcher could possibly interact differently than another backend,
since it has a different value for maximum stats file age.
Anyway it's only a guess. It could well be that that machine was simply
so heavily loaded that the stats collector couldn't respond fast enough.
I'm just wondering whether there's an unrecognized bug lurking here.
regards, tom lane
launcher could possibly interact differently than another backend,
since it has a different value for maximum stats file age.
Anyway it's only a guess. It could well be that that machine was simply
so heavily loaded that the stats collector couldn't respond fast enough.
I'm just wondering whether there's an unrecognized bug lurking here.
regards, tom lane
You've got me trained well now. I'm now looking at my build machine failures. Wasn't sure what to do about that one, since no relevant files changed.
Is there any value in setting "keep-error-builds =
Cheers,
Rob
I wrote:
Still meditating on this ... and it strikes me that the pgstat.c code
is really uncommunicative about problems. In particular,
pgstat-read-statsfile-timestamp and pgstat-read-statsfile don't complain
at all about being unable to read a stats file. It seems to me that the
only "expected" case is ENOENT (and even that isn't really expected, in
normal operation). Surely we should at least elog(LOG) any other
failure condition?
Another place that could probably do with elog(LOG) is where
pgstat-write-statsfile resets last-statrequest in case it's in the
future. That shouldn't ever happen. While the reset is probably
a good thing for robustness, wouldn't logging it be a good idea?
Lastly, backend-read-statsfile is designed to send an inquiry message
every time through the loop, ie, every 10 msec. This is said to be in
case the stats collector drops one. But is this enough to flood the
collector and make things worse? I wonder if there should be some
backoff there.
regards, tom lane
Still meditating on this ... and it strikes me that the pgstat.c code
is really uncommunicative about problems. In particular,
pgstat-read-statsfile-timestamp and pgstat-read-statsfile don't complain
at all about being unable to read a stats file. It seems to me that the
only "expected" case is ENOENT (and even that isn't really expected, in
normal operation). Surely we should at least elog(LOG) any other
failure condition?
Another place that could probably do with elog(LOG) is where
pgstat-write-statsfile resets last-statrequest in case it's in the
future. That shouldn't ever happen. While the reset is probably
a good thing for robustness, wouldn't logging it be a good idea?
Lastly, backend-read-statsfile is designed to send an inquiry message
every time through the loop, ie, every 10 msec. This is said to be in
case the stats collector drops one. But is this enough to flood the
collector and make things worse? I wonder if there should be some
backoff there.
regards, tom lane
Tom Lane wrote:
Yeah, I had the same thought.
I also think the autovacuum worker minimum timestamp may be playing
games with the retry logic too. Maybe a worker is requesting a new file
continuously because pgstat is not able to provide one before the
deadline is past, and thus overloading it. I still think that 500ms is
too much for a worker, but backing off all the way to 10ms seems too
much. Maybe it should just be, say, 100ms.
Yeah, I had the same thought.
I also think the autovacuum worker minimum timestamp may be playing
games with the retry logic too. Maybe a worker is requesting a new file
continuously because pgstat is not able to provide one before the
deadline is past, and thus overloading it. I still think that 500ms is
too much for a worker, but backing off all the way to 10ms seems too
much. Maybe it should just be, say, 100ms.
Alvaro Herrera
OK, I'll add some logging.
But we don't advance the deadline within the wait loop, so (in theory)
a single requestor shouldn't be able to trigger more than one stats file
update. I wonder though if an autovac worker could make many such
requests over its lifespan ...
regards, tom lane
But we don't advance the deadline within the wait loop, so (in theory)
a single requestor shouldn't be able to trigger more than one stats file
update. I wonder though if an autovac worker could make many such
requests over its lifespan ...
regards, tom lane
Tom Lane wrote:
Hmm, yeah.
Well, yes, but it will request fresh stats only for the recheck logic
before each table, so there will be one intervening vacuum (or none,
actually, if the table was vacuumed by some other autovac worker.
Though given the default naptime of 1 min I find it unlikely that the
regression database will ever see more than one worker).
Since the warning comes from the launcher and not the worker, I wonder
if this is a red herring.
Hmm, yeah.
Well, yes, but it will request fresh stats only for the recheck logic
before each table, so there will be one intervening vacuum (or none,
actually, if the table was vacuumed by some other autovac worker.
Though given the default naptime of 1 min I find it unlikely that the
regression database will ever see more than one worker).
Since the warning comes from the launcher and not the worker, I wonder
if this is a red herring.
Alvaro Herrera
It's all speculation at the moment. So far there's not really enough
evidence to refute the idea that the system was just under heavy load
at that point I wonder if there is any practical way for the buildfarm client script
to report about the system's load average, or some other gauge of how
much is going on in the buildfarm machine besides the regression tests.
One thought is just to log how long it takes to run the regression
tests. A longer-than-usual run for a particular animal would be evidence
of a load spike; if we could correlate that with failures of this sort
it would be easier to write them off as heavy load.
regards, tom lane
evidence to refute the idea that the system was just under heavy load
at that point I wonder if there is any practical way for the buildfarm client script
to report about the system's load average, or some other gauge of how
much is going on in the buildfarm machine besides the regression tests.
One thought is just to log how long it takes to run the regression
tests. A longer-than-usual run for a particular animal would be evidence
of a load spike; if we could correlate that with failures of this sort
it would be easier to write them off as heavy load.
regards, tom lane
Related Threads
- opensuse-buildservice - _link not required anymore for new packages - opensuse-buildservice
- PyQt Ram Usage - python-list
- OpenLayers-Users - RootContainer styleMap problem- IE - openlayers-users
- widgets - WARP default policy - w3c-public-webapps
- PATCH - ext4: check for a good block group before loading buddy pages - linux-ext4
- UBT Wiki FG Meeting - ubuntu-beginners
- max time wait for a function - python-list
- Qt-creator - Instability in qt-creator... anyone interested in back traces? - qt-creator
- Vala - casting out parameters - vala-list
- FFmpeg-devel - PATCH 7/7 - Add support for the newly added Nut codec tags (added in Nut r669): Y100 - 16 - , 16 - 00 - 1Y, Y31 - mplayer-ffmpeg-devel
Related Lists
- cassandra-dev
- cassandra-user
- couchdb-user
- exist-open
- firebird-devel
- jbase
- mongodb-user
- mysql-cluster
- mysql-general
- mysql-internals
- neo4j-user
- nhibernate-development
- oracle
- pgadmin-hackers
- pgadmin-support
- postgresql-admin
- postgresql-general
- postgresql-hackers
- postgresql-jdbc
- postgresql-performance
- postgresql-sql
- voldemort