thr3ads.net - Gluster users - [Gluster-users] Complete machine lockup, v3.4.2 [Feb 2014]


Laurent Chouinard

2014-Feb-18 13:12 UTC

[Gluster-users] Complete machine lockup, v3.4.2

Hi everyone,

We've been using 3.3.2 for a while and recently started migrating to
3.4.2. The 3.4.2 nodes run on CentOS 6.5, while 3.3.2 was installed on
CentOS 6.4.

Recently, we hit a very scary condition, and we do not know exactly what
caused it.

We have a 3-node cluster with a replication factor of 3. Each node has one
brick, backed by a single RAID0 volume made up of multiple SSDs.

Following some read/write errors, nodes 2 and 3 locked up completely. Nothing
could be done at the machines themselves (nothing on the screen, no response
over SSH), so a physical power cycle was required. Node 1 was still accessible,
but its FUSE client rejected most if not all reads and writes.

Has anyone experienced something similar?

Before the system freeze, the last thing the kernel seemed to be doing was
killing HTTPD threads (INFO: task httpd:7910 blocked for more than 120 seconds.)
End-users talk to Apache in order to read/write from the Gluster volume, so it
seems to be a simple case of "something wrong" with Gluster that blocks
reads/writes, after which the kernel eventually kills them.

At this point, we're unsure where to look. Nothing very specific can be
found in the logs, but if someone has pointers on what to look for, that
could give us a new lead.

Thanks

Laurent Chouinard

James

2014-Feb-18 16:17 UTC

[Gluster-users] Complete machine lockup, v3.4.2

On Tue, Feb 18, 2014 at 8:12 AM, Laurent Chouinard
<laurent.chouinard at ubisoft.com> wrote:
> Before the system freeze, the last thing the kernel seemed to be doing is
> killing HTTPD threads (INFO: task httpd:7910 blocked for more than 120
> seconds.)  End-users talk to Apache in order to read/write from the Gluster
> volume, so it seems a simple case of "something wrong" with gluster which
> locks read/writes, and eventually the kernel kills them.

If the kernel was killing things, check that it wasn't the OOM killer.
If so, make sure you have swap and enough memory, and check whether
anything is leaking. Finally, if you have memory management issues
between services, cgroups might be the thing to use to control this.

HTH,
James
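
One way to follow up on the suggestion above is to scan the kernel log and separate hung-task warnings from OOM-killer activity. The "blocked for more than 120 seconds" message quoted in the original post comes from the kernel's hung-task detector, which by default only reports a stuck task and does not kill it, so the two kinds of entries are worth telling apart. The following is a minimal sketch, not a definitive tool: the function name `classify` and the sample log lines (timestamps, host name, OOM entries) are made up for illustration, and the exact wording of OOM messages varies between kernel versions, so the patterns may need adjusting.

```python
import re

# Assumed message formats; OOM wording differs between kernel versions.
HUNG_TASK = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
OOM_INVOKED = re.compile(r"(?P<name>\S+) invoked oom-killer")
OOM_KILL = re.compile(r"Out of memory: Kill(?:ed)? process (?P<pid>\d+) \((?P<name>[^)]+)\)")


def classify(lines):
    """Count hung-task warnings and OOM-killer events in kernel log lines."""
    counts = {"hung_task": 0, "oom_invoked": 0, "oom_kill": 0}
    for line in lines:
        if HUNG_TASK.search(line):
            counts["hung_task"] += 1
        elif OOM_KILL.search(line):
            counts["oom_kill"] += 1
        elif OOM_INVOKED.search(line):
            counts["oom_invoked"] += 1
    return counts


# Hypothetical sample: only the hung-task text is taken from the thread;
# timestamps, host name, and the OOM lines are invented for this example.
SAMPLE = [
    "Feb 18 07:41:12 node2 kernel: INFO: task httpd:7910 blocked for more than 120 seconds.",
    "Feb 18 07:41:30 node2 kernel: httpd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0",
    "Feb 18 07:41:30 node2 kernel: Out of memory: Kill process 7910 (httpd) score 512 or sacrifice child",
    "Feb 18 07:41:31 node2 kernel: EXT4-fs (sda1): mounted filesystem with ordered data mode",
]

result = classify(SAMPLE)
print(result)
```

To run it against a real log, pass an open file instead of `SAMPLE`, for example `classify(open("/var/log/messages"))`, the default location for kernel messages on CentOS 6. A log with hung-task warnings but no OOM entries would point away from memory exhaustion and toward blocked I/O.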
