Hi everyone,
We've been using 3.3.2 for a while and recently started migrating to
3.4.2. The 3.4.2 nodes run CentOS 6.5 (3.3.2 was installed on
CentOS 6.4).
Recently, we had a very scary condition occur, and we do not know
exactly what caused it.
We have a 3-node cluster with a replication factor of 3. Each node has one
brick, which is made of one RAID0 volume composed of multiple SSDs.
Following some read/write errors, nodes 2 and 3 locked up completely. Nothing
could be done locally or remotely (nothing on the screen, no SSH access), so a
physical power cycle was required. Node 1 was still accessible, but its FUSE
client rejected most if not all reads and writes.
Has anyone experienced something similar?
Before the system froze, the last thing the kernel logged was hung-task
warnings for httpd threads (INFO: task httpd:7910 blocked for more than 120 seconds.)
End users talk to Apache in order to read/write from the Gluster volume, so it
looks like a simple case of "something wrong" with Gluster that blocks
reads/writes until the kernel flags the stuck httpd threads as hung.
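For anyone who wants to check their own logs for the same symptom, here is a
minimal Python sketch that pulls hung-task warnings of the form quoted above
out of a saved kernel log and counts them per command. The function names
(find_hung_tasks, count_by_command) are made up for illustration, and the
default path /var/log/messages is an assumption about where syslog writes
kernel messages on our CentOS nodes; adjust as needed.

```python
import re
import sys
from collections import Counter

# Matches the kernel hung-task warning quoted in this message, e.g.
# "INFO: task httpd:7910 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>.+?):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return a list of (command, pid, seconds) for each hung-task warning."""
    hits = []
    for line in lines:
        m = HUNG_TASK.search(line)
        if m:
            hits.append((m.group("comm"), int(m.group("pid")), int(m.group("secs"))))
    return hits


def count_by_command(hits):
    """Count hung-task warnings per command name."""
    return Counter(comm for comm, _pid, _secs in hits)


if __name__ == "__main__":
    # Assumed default log location; pass another path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    with open(path, errors="replace") as fh:
        hits = find_hung_tasks(fh)
    for comm, n in count_by_command(hits).most_common():
        print(f"{comm}: {n} hung-task warning(s)")
```

This only shows which processes were reported as stuck and how often; it says
nothing about why they were blocked.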
At this point, we're unsure where to look. Nothing very specific can be
found in the logs, but if someone has pointers on what to look for, that
could give us a new lead.
Thanks
Laurent Chouinard