thr3ads.net - Gluster users - [Gluster-users] disperse volume file to subvolume mapping [Apr 2016]

Serkan Çoban

2016-Apr-21 13:19 UTC

[Gluster-users] disperse volume file to subvolume mapping

Same result. Also checked the rebalance.log file, it has also no
reference to part files...

On Thu, Apr 21, 2016 at 3:34 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:
> Can you try a 'gluster volume rebalance v0 start force' ?
>
>
> On 21/04/16 14:23, Serkan Çoban wrote:
>>>
>>> Has the rebalance operation finished successfully ? has it skipped any
>>> files ?
>>
>> Yes according to gluster v rebalance status it is completed without any
>> errors.
>> rebalance status report is like:
>> Node        Rebalanced files   size     Scanned   failures   skipped
>> 1.1.1.185   158                29GB     1720      0          314
>> 1.1.1.205   93                 46.5GB   761       0          95
>> 1.1.1.225   74                 37GB     779       0          94
>>
>>
>> All other hosts have 0 values.
>>
>> I double checked that files with '---------T' attributes are there,
>> maybe some of them were deleted but I still see them in bricks...
>> I am also concerned why part files are not distributed to all 60 nodes?
>> Rebalance should do that?
>>
>> On Thu, Apr 21, 2016 at 1:55 PM, Xavier Hernandez <xhernandez at datalab.es>
>> wrote:
>>>
>>> Hi Serkan,
>>>
>>> On 21/04/16 12:39, Serkan Çoban wrote:
>>>>
>>>>
>>>> I started a gluster v rebalance v0 start command hoping that it will
>>>> equally redistribute files across 60 nodes but it did not do that...
>>>> why it did not redistribute files? any thoughts?
>>>
>>>
>>>
>>> Has the rebalance operation finished successfully ? has it skipped any
>>> files ?
>>>
>>> After a successful rebalance all files with attributes '---------T'
>>> should have disappeared.
>>>
>>>
>>>>
>>>> On Thu, Apr 21, 2016 at 11:24 AM, Xavier Hernandez
>>>> <xhernandez at datalab.es> wrote:
>>>>>
>>>>>
>>>>> Hi Serkan,
>>>>>
>>>>> On 21/04/16 10:07, Serkan Çoban wrote:
>>>>>>>
>>>>>>>
>>>>>>> I think the problem is in the temporary name that distcp gives to the
>>>>>>> file while it's being copied before renaming it to the real name. Do
>>>>>>> you know what is the structure of this name ?
>>>>>>
>>>>>>
>>>>>> Distcp temporary file name format is:
>>>>>> ".distcp.tmp.attempt_1460381790773_0248_m_000001_0" and the same
>>>>>> temporary file name is used by one map process. For example I see in the
>>>>>> logs that one map copies files part-m-00031,part-m-00047,part-m-00063
>>>>>> sequentially and they all use the same temporary file name above. So no
>>>>>> original file name appears in the temporary file name.
>>>>>
>>>>>
>>>>> This explains the problem. With the default options, DHT sends all
>>>>> files to the subvolume that should store a file named 'distcp.tmp'.
>>>>>
>>>>> With this temporary name format, little can be done.
>>>>>
>>>>>>
>>>>>> I will check if we can modify distcp behaviour, or we have to write
>>>>>> our mapreduce procedures instead of using distcp.
>>>>>>
>>>>>>> 2. define the option 'extra-hash-regex' to an expression that matches
>>>>>>> your temporary file names and returns the same name that will finally
>>>>>>> have.
>>>>>>> Depending on the differences between original and temporary file
>>>>>>> names, this option could be useless.
>>>>>>> 3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
>>>>>>> name conversion, so the files will be evenly distributed. However
>>>>>>> this will cause a lot of files placed in incorrect subvolumes,
>>>>>>> creating a lot of link files until a rebalance is executed.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> How can I set these options?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> You can set gluster options using:
>>>>>
>>>>> gluster volume set <volname> <option> <value>
>>>>>
>>>>> for example:
>>>>>
>>>>> gluster volume set v0 rsync-hash-regex none
>>>>>
>>>>> Xavi
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 21, 2016 at 10:00 AM, Xavier Hernandez
>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi Serkan,
>>>>>>>
>>>>>>> I think the problem is in the temporary name that distcp gives to the
>>>>>>> file while it's being copied before renaming it to the real name. Do you
>>>>>>> know what is the structure of this name ?
>>>>>>>
>>>>>>> DHT selects the subvolume (in this case the ec set) on which the file
>>>>>>> will be stored based on the name of the file. This has a problem when a
>>>>>>> file is being renamed, because this could change the subvolume where the
>>>>>>> file should be found.
>>>>>>>
>>>>>>> DHT has a feature to avoid incorrect file placements when executing
>>>>>>> renames for the rsync case. What it does is to check if the file matches
>>>>>>> the following regular expression:
>>>>>>>
>>>>>>>        ^\.(.+)\.[^.]+$
>>>>>>>
>>>>>>> If a match is found, it only considers the part between parentheses
>>>>>>> to calculate the destination subvolume.
>>>>>>>
>>>>>>> This is useful for rsync because temporary file names are constructed
>>>>>>> in the following way: suppose the original filename is 'test'. The
>>>>>>> temporary filename while rsync is being executed is made by prepending
>>>>>>> a dot and appending '.<random chars>': .test.712hd
>>>>>>>
>>>>>>> As you can see, the original name and the part of the name between
>>>>>>> parentheses that matches the regular expression are the same. As a
>>>>>>> result, after renaming the temporary file to its original filename,
>>>>>>> both files will be considered to belong to the same subvolume by DHT.
>>>>>>>
>>>>>>> In your case it's very probable that distcp uses a temporary name
>>>>>>> like '.part.<number>'. In this case the portion of the name used to
>>>>>>> select the subvolume is always 'part'. This would explain why all files
>>>>>>> go to the same subvolume. Once the file is renamed to another name, DHT
>>>>>>> realizes that it should go to another subvolume. At this point it
>>>>>>> creates a link file (those files with access rights = '---------T') in
>>>>>>> the correct subvolume but it doesn't move it. As you can see, this kind
>>>>>>> of files are better balanced.
>>>>>>>
>>>>>>> To solve this problem you have three options:
>>>>>>>
>>>>>>> 1. change the temporary filename used by distcp to correctly match
>>>>>>> the regular expression. I'm not sure if this can be configured, but if
>>>>>>> this is possible, this is the best option.
>>>>>>>
>>>>>>> 2. define the option 'extra-hash-regex' to an expression that matches
>>>>>>> your temporary file names and returns the same name that will finally
>>>>>>> have. Depending on the differences between original and temporary file
>>>>>>> names, this option could be useless.
>>>>>>>
>>>>>>> 3. set the option 'rsync-hash-regex' to 'none'. This will prevent the
>>>>>>> name conversion, so the files will be evenly distributed. However this
>>>>>>> will cause a lot of files placed in incorrect subvolumes, creating a lot
>>>>>>> of link files until a rebalance is executed.
>>>>>>>
>>>>>>> Xavi
>>>>>>>
>>>>>>>
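To see how the pattern quoted above treats the file names mentioned in this thread, here is a minimal sketch. It uses Python's re module purely as an illustration and is not Gluster code. The helper name name_used_for_hashing is hypothetical, and the sketch assumes that a name which does not match the pattern is hashed unchanged.

```python
import re

# Pattern quoted in the thread for the rsync case.
RSYNC_HASH_REGEX = re.compile(r"^\.(.+)\.[^.]+$")


def name_used_for_hashing(filename: str) -> str:
    """Return the captured part of the name, or the full name if there is no match.

    Hypothetical helper for illustration only; it is not a Gluster API.
    """
    match = RSYNC_HASH_REGEX.match(filename)
    return match.group(1) if match else filename


if __name__ == "__main__":
    for name in (
        ".test.712hd",  # rsync-style temporary name from the explanation
        ".distcp.tmp.attempt_1460381790773_0248_m_000001_0",  # distcp temporary name
        "part-m-00031",  # final name, does not match the pattern
    ):
        print(f"{name!r} -> {name_used_for_hashing(name)!r}")
```

For the distcp temporary name reported in this thread the captured part is 'distcp.tmp', so every file being copied would select the same subvolume, while the final part-m-xxxxx names do not match and are hashed on their full name.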
>>>>>>> On 20/04/16 14:13, Serkan Çoban wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> Here are the steps that I do in detail and relevant output from
>>>>>>>> bricks:
>>>>>>>>
>>>>>>>> I am using below command for volume creation:
>>>>>>>> gluster volume create v0 disperse 20 redundancy 4 \
>>>>>>>> 1.1.1.{185..204}:/bricks/02 \
>>>>>>>> 1.1.1.{205..224}:/bricks/02 \
>>>>>>>> 1.1.1.{225..244}:/bricks/02 \
>>>>>>>> 1.1.1.{185..204}:/bricks/03 \
>>>>>>>> 1.1.1.{205..224}:/bricks/03 \
>>>>>>>> 1.1.1.{225..244}:/bricks/03 \
>>>>>>>> 1.1.1.{185..204}:/bricks/04 \
>>>>>>>> 1.1.1.{205..224}:/bricks/04 \
>>>>>>>> 1.1.1.{225..244}:/bricks/04 \
>>>>>>>> 1.1.1.{185..204}:/bricks/05 \
>>>>>>>> 1.1.1.{205..224}:/bricks/05 \
>>>>>>>> 1.1.1.{225..244}:/bricks/05 \
>>>>>>>> 1.1.1.{185..204}:/bricks/06 \
>>>>>>>> 1.1.1.{205..224}:/bricks/06 \
>>>>>>>> 1.1.1.{225..244}:/bricks/06 \
>>>>>>>> 1.1.1.{185..204}:/bricks/07 \
>>>>>>>> 1.1.1.{205..224}:/bricks/07 \
>>>>>>>> 1.1.1.{225..244}:/bricks/07 \
>>>>>>>> 1.1.1.{185..204}:/bricks/08 \
>>>>>>>> 1.1.1.{205..224}:/bricks/08 \
>>>>>>>> 1.1.1.{225..244}:/bricks/08 \
>>>>>>>> 1.1.1.{185..204}:/bricks/09 \
>>>>>>>> 1.1.1.{205..224}:/bricks/09 \
>>>>>>>> 1.1.1.{225..244}:/bricks/09 \
>>>>>>>> 1.1.1.{185..204}:/bricks/10 \
>>>>>>>> 1.1.1.{205..224}:/bricks/10 \
>>>>>>>> 1.1.1.{225..244}:/bricks/10 \
>>>>>>>> 1.1.1.{185..204}:/bricks/11 \
>>>>>>>> 1.1.1.{205..224}:/bricks/11 \
>>>>>>>> 1.1.1.{225..244}:/bricks/11 \
>>>>>>>> 1.1.1.{185..204}:/bricks/12 \
>>>>>>>> 1.1.1.{205..224}:/bricks/12 \
>>>>>>>> 1.1.1.{225..244}:/bricks/12 \
>>>>>>>> 1.1.1.{185..204}:/bricks/13 \
>>>>>>>> 1.1.1.{205..224}:/bricks/13 \
>>>>>>>> 1.1.1.{225..244}:/bricks/13 \
>>>>>>>> 1.1.1.{185..204}:/bricks/14 \
>>>>>>>> 1.1.1.{205..224}:/bricks/14 \
>>>>>>>> 1.1.1.{225..244}:/bricks/14 \
>>>>>>>> 1.1.1.{185..204}:/bricks/15 \
>>>>>>>> 1.1.1.{205..224}:/bricks/15 \
>>>>>>>> 1.1.1.{225..244}:/bricks/15 \
>>>>>>>> 1.1.1.{185..204}:/bricks/16 \
>>>>>>>> 1.1.1.{205..224}:/bricks/16 \
>>>>>>>> 1.1.1.{225..244}:/bricks/16 \
>>>>>>>> 1.1.1.{185..204}:/bricks/17 \
>>>>>>>> 1.1.1.{205..224}:/bricks/17 \
>>>>>>>> 1.1.1.{225..244}:/bricks/17 \
>>>>>>>> 1.1.1.{185..204}:/bricks/18 \
>>>>>>>> 1.1.1.{205..224}:/bricks/18 \
>>>>>>>> 1.1.1.{225..244}:/bricks/18 \
>>>>>>>> 1.1.1.{185..204}:/bricks/19 \
>>>>>>>> 1.1.1.{205..224}:/bricks/19 \
>>>>>>>> 1.1.1.{225..244}:/bricks/19 \
>>>>>>>> 1.1.1.{185..204}:/bricks/20 \
>>>>>>>> 1.1.1.{205..224}:/bricks/20 \
>>>>>>>> 1.1.1.{225..244}:/bricks/20 \
>>>>>>>> 1.1.1.{185..204}:/bricks/21 \
>>>>>>>> 1.1.1.{205..224}:/bricks/21 \
>>>>>>>> 1.1.1.{225..244}:/bricks/21 \
>>>>>>>> 1.1.1.{185..204}:/bricks/22 \
>>>>>>>> 1.1.1.{205..224}:/bricks/22 \
>>>>>>>> 1.1.1.{225..244}:/bricks/22 \
>>>>>>>> 1.1.1.{185..204}:/bricks/23 \
>>>>>>>> 1.1.1.{205..224}:/bricks/23 \
>>>>>>>> 1.1.1.{225..244}:/bricks/23 \
>>>>>>>> 1.1.1.{185..204}:/bricks/24 \
>>>>>>>> 1.1.1.{205..224}:/bricks/24 \
>>>>>>>> 1.1.1.{225..244}:/bricks/24 \
>>>>>>>> 1.1.1.{185..204}:/bricks/25 \
>>>>>>>> 1.1.1.{205..224}:/bricks/25 \
>>>>>>>> 1.1.1.{225..244}:/bricks/25 \
>>>>>>>> 1.1.1.{185..204}:/bricks/26 \
>>>>>>>> 1.1.1.{205..224}:/bricks/26 \
>>>>>>>> 1.1.1.{225..244}:/bricks/26 \
>>>>>>>> 1.1.1.{185..204}:/bricks/27 \
>>>>>>>> 1.1.1.{205..224}:/bricks/27 \
>>>>>>>> 1.1.1.{225..244}:/bricks/27 force
>>>>>>>>
>>>>>>>> then I mount volume on 50 clients:
>>>>>>>> mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster
>>>>>>>>
>>>>>>>> then I make a directory from one of the clients and chmod it.
>>>>>>>> mkdir /mnt/gluster/s1 && chmod 777 /mnt/gluster/s1
>>>>>>>>
>>>>>>>> then I start distcp on clients, there are 1059X8.8GB files in one
>>>>>>>> folder and they will be copied to /mnt/gluster/s1 with 100 parallel
>>>>>>>> which means 2 copy jobs per client at same time.
>>>>>>>> hadoop distcp -m 100 http://nn1:8020/path/to/teragen-10tb file:///mnt/gluster/s1
>>>>>>>>
>>>>>>>> After job finished here is the status of s1 directory from bricks:
>>>>>>>> s1 directory is present in all 1560 brick.
>>>>>>>> s1/teragen-10tb folder is present in all 1560 brick.
>>>>>>>>
>>>>>>>> full listing of files in bricks:
>>>>>>>> https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0
>>>>>>>>
>>>>>>>> You can ignore the .crc files in the brick output above, they are
>>>>>>>> checksum files...
>>>>>>>>
>>>>>>>> As you can see part-m-xxxx files written only some bricks in nodes
>>>>>>>> 0205..0224
>>>>>>>> All bricks have some files but they have zero size.
>>>>>>>>
>>>>>>>> I increase file descriptors to 65k so it is not the issue...
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Apr 20, 2016 at 9:34 AM, Xavier Hernandez
>>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Serkan,
>>>>>>>>>
>>>>>>>>> On 19/04/16 15:16, Serkan Çoban wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I assume that gluster is used to store the intermediate files
>>>>>>>>>>>>> before the reduce phase
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Nope, gluster is the destination for distcp command. hadoop distcp
>>>>>>>>>> -m 50 http://nn1:8020/path/to/folder file:///mnt/gluster
>>>>>>>>>> This run maps on datanodes which have /mnt/gluster mounted on all
>>>>>>>>>> of them.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I don't know hadoop, so I'm of little help here. However it seems
>>>>>>>>> that -m 50 means to execute 50 copies in parallel. This means that
>>>>>>>>> even if the distribution worked fine, at most 50 (most probably fewer)
>>>>>>>>> of the 78 ec sets would be used in parallel.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>> This means that this is caused by some peculiarity of the
>>>>>>>>>>>>> mapreduce.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Yes, but how can a client write 500 files to the gluster mount and
>>>>>>>>>> have those files written only to a subset of subvolumes? I cannot use
>>>>>>>>>> gluster as a backup cluster if I cannot write with distcp.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> All 500 files were created only on one of the 78 ec sets and the
>>>>>>>>> remaining 77 got empty ?
>>>>>>>>>
>>>>>>>>>>>>> You should look which files are created in each brick and how
>>>>>>>>>>>>> many while the process is running.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Files only created on nodes 185..204 or 205..224 or 225..244. Only
>>>>>>>>>> on 20 nodes in each test.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> How many files there were in each brick ?
>>>>>>>>>
>>>>>>>>> Not sure if this can be related, but standard linux distributions
>>>>>>>>> have a default limit of 1024 open file descriptors. With such a big
>>>>>>>>> volume and a massive copy, maybe this limit is affecting something ?
>>>>>>>>>
>>>>>>>>> Are there any error or warning messages in the mount or bricks logs ?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Xavi
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 19, 2016 at 1:05 PM, Xavier Hernandez
>>>>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Serkan,
>>>>>>>>>>>
>>>>>>>>>>> moved to gluster-users since this doesn't belong to devel list.
>>>>>>>>>>>
>>>>>>>>>>> On 19/04/16 11:24, Serkan Çoban wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I am copying 10.000 files to gluster volume using mapreduce on
>>>>>>>>>>>> clients. Each map process took one file at a time and copy it to
>>>>>>>>>>>> gluster volume.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I assume that gluster is used to store the intermediate files
>>>>>>>>>>> before the reduce phase.
>>>>>>>>>>>
>>>>>>>>>>>> My disperse volume consist of 78 subvolumes of 16+4 disk each.
>>>>>>>>>>>> So If I copy >78 files parallel I expect each file goes to
>>>>>>>>>>>> different subvolume right?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> If you only copy 78 files, most probably you will get some
>>>>>>>>>>> subvolume empty and some other with more than one or two files.
>>>>>>>>>>> It's not an exact distribution, it's a statistically balanced
>>>>>>>>>>> distribution: over time and with enough files, each brick will
>>>>>>>>>>> contain an amount of files in the same order of magnitude, but they
>>>>>>>>>>> won't have the *same* number of files.
>>>>>>>>>>>
>>>>>>>>>>>> In my tests during tests with fio I can see every file goes to
>>>>>>>>>>>> different subvolume, but when I start mapreduce process from
>>>>>>>>>>>> clients only 78/3=26 subvolumes used for writing files.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This means that this is caused by some peculiarity of the
>>>>>>>>>>> mapreduce.
>>>>>>>>>>>
>>>>>>>>>>>> I see that clearly from network traffic. Mapreduce on client
>>>>>>>>>>>> side can be run multi thread. I tested with 1-5-10 threads on each
>>>>>>>>>>>> client but every time only 26 subvolumes used.
>>>>>>>>>>>> How can I debug the issue further?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> You should look which files are created in each brick and how
>>>>>>>>>>> many while the process is running.
>>>>>>>>>>>
>>>>>>>>>>> Xavi
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Apr 19, 2016 at 11:22 AM, Xavier Hernandez
>>>>>>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Serkan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 19/04/16 09:18, Serkan Çoban wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi, I just reinstalled fresh 3.7.11 and I am seeing the same
>>>>>>>>>>>>>> behavior.
>>>>>>>>>>>>>> 50 clients copying part-0-xxxx named files using mapreduce to
>>>>>>>>>>>>>> gluster using one thread per server and they are using only 20
>>>>>>>>>>>>>> servers out of 60. On the other hand fio tests use all the
>>>>>>>>>>>>>> servers. Anything I can do to solve the issue?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Distribution of files to ec sets is done by dht. In theory if
>>>>>>>>>>>>> you create many files each ec set will receive the same amount of
>>>>>>>>>>>>> files. However when the number of files is small enough,
>>>>>>>>>>>>> statistics can fail.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Not sure what you are doing exactly, but a mapreduce procedure
>>>>>>>>>>>>> generally only creates a single output. In that case it makes
>>>>>>>>>>>>> sense that only one ec set is used. If you want to use all ec sets
>>>>>>>>>>>>> for a single file, you should enable sharding (I haven't tested
>>>>>>>>>>>>> that) or split the result in multiple files.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Xavi
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Serkan
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>>>>>>> From: Serkan Çoban <cobanserkan at gmail.com>
>>>>>>>>>>>>>> Date: Mon, Apr 18, 2016 at 2:39 PM
>>>>>>>>>>>>>> Subject: disperse volume file to subvolume mapping
>>>>>>>>>>>>>> To: Gluster Users <gluster-users at gluster.org>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi, I have a problem where clients are using only 1/3 of nodes
>>>>>>>>>>>>>> in disperse volume for writing.
>>>>>>>>>>>>>> I am testing from 50 clients using 1 to 10 threads with file
>>>>>>>>>>>>>> names part-0-xxxx.
>>>>>>>>>>>>>> What I see is clients only use 20 nodes for writing. How is
>>>>>>>>>>>>>> the file name to sub volume hashing done? Is this related to
>>>>>>>>>>>>>> the file names being similar?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My cluster is 3.7.10 with 60 nodes, each with 26 disks. Disperse
>>>>>>>>>>>>>> volume is 78 x (16+4). Only 26 out of 78 sub volumes used during
>>>>>>>>>>>>>> writes..
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>
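The remark earlier in the thread, that copying only 78 files to 78 subvolumes will most probably leave some subvolumes empty, can be checked with a short calculation. This is a rough sketch under a simplifying assumption that is not taken from the thread: each file name is placed uniformly and independently on one of the subvolumes. The function name expected_empty_subvolumes is hypothetical.

```python
def expected_empty_subvolumes(subvolumes: int, files: int) -> float:
    """Expected number of subvolumes that receive no file at all.

    Assumes uniform, independent placement (a simplification of real DHT
    hashing): a given subvolume is missed by one file with probability
    (1 - 1/subvolumes), and by all files with that value to the power `files`.
    """
    return subvolumes * (1 - 1 / subvolumes) ** files


if __name__ == "__main__":
    # 78 is the number of ec sets in this thread; the file counts are
    # the ones mentioned in the discussion (78, 500 and 10000 files).
    for files in (78, 500, 10000):
        empty = expected_empty_subvolumes(78, files)
        print(f"{files} files -> about {empty:.2f} empty subvolumes expected")
```

Under that assumption, 78 files leave somewhere between 28 and 29 of the 78 subvolumes empty on average, while with thousands of files an empty subvolume becomes very unlikely. A copy that always uses exactly 26 subvolumes is therefore not explained by statistics alone.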

Xavier Hernandez

2016-Apr-22 06:15 UTC

[Gluster-users] disperse volume file to subvolume mapping

When you execute a rebalance 'force' the skipped column should be 0 for 
all nodes and all '---------T' files must have disappeared. Otherwise 
something failed. Is this true in your case ?
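
One way to check whether any '---------T' files remain on a brick is to walk the brick directory and look at the permission bits. The following is a minimal sketch, not a Gluster tool. It assumes that link files are zero-length regular files whose only permission bit is the sticky bit, which matches the zero-size files reported earlier in this thread. The function names are hypothetical, and /bricks/02 is only the example brick path used in this thread.

```python
import os
import stat
import sys


def looks_like_dht_link_file(st: os.stat_result) -> bool:
    """True for a zero-length regular file with mode ---------T (sticky bit only)."""
    return (
        stat.S_ISREG(st.st_mode)
        and stat.S_IMODE(st.st_mode) == stat.S_ISVTX
        and st.st_size == 0
    )


def find_link_files(brick_path: str):
    """Yield paths under brick_path that look like link files."""
    for root, _dirs, files in os.walk(brick_path):
        for name in files:
            path = os.path.join(root, name)
            # lstat so that symbolic links are not followed.
            if looks_like_dht_link_file(os.lstat(path)):
                yield path


if __name__ == "__main__":
    # Example brick path from this thread; pass another path as the first argument.
    brick = sys.argv[1] if len(sys.argv) > 1 else "/bricks/02"
    for path in find_link_files(brick):
        print(path)
```

Run on each brick (as a user that can read the brick directory), it prints nothing when no such files are left. Note that the walk also visits the brick's internal .glusterfs directory, so the same file may be listed more than once.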

On 21/04/16 15:19, Serkan Çoban wrote:
> Same result. Also checked the rebalance.log file, it has also no
> reference to part files...
>
> On Thu, Apr 21, 2016 at 3:34 PM, Xavier Hernandez <xhernandez at
datalab.es> wrote:
>> Can you try a 'gluster volume rebalance v0 start force' ?
>>
>>
>> On 21/04/16 14:23, Serkan ?oban wrote:
>>>>
>>>> Has the rebalance operation finished successfully ? has it
skipped any
>>>> files ?
>>>
>>> Yes according to gluster v rebalance status it is completed without
any
>>> errors.
>>> rebalance status report is like:
>>> Node         Rebalanced files   size               Scanned
>>> failures  skipped
>>> 1.1.1.185   158                      29GB             1720
>>> 0           314
>>> 1.1.1.205    93                       46.5GB           761
>>> 0           95
>>> 1.1.1.225    74                       37GB              779
>>>    0           94
>>>
>>>
>>> All other hosts has 0 values.
>>>
>>> I double check that files with '---------T' attributes are
there,
>>> maybe some of them deleted but I still see them in bricks...
>>> I am also concerned why part files not distributed to all 60 nodes?
>>> Rebalance should do that?
>>>
>>> On Thu, Apr 21, 2016 at 1:55 PM, Xavier Hernandez <xhernandez at
datalab.es>
>>> wrote:
>>>>
>>>> Hi Serkan,
>>>>
>>>> On 21/04/16 12:39, Serkan ?oban wrote:
>>>>>
>>>>>
>>>>> I started a gluster v rebalance v0 start command hoping
that it will
>>>>> equally redistribute files across 60 nodes but it did not
do that...
>>>>> why it did not redistribute files? any thoughts?
>>>>
>>>>
>>>>
>>>> Has the rebalance operation finished successfully ? has it
skipped any
>>>> files
>>>> ?
>>>>
>>>> After a successful rebalance all files with attributes
'---------T'
>>>> should
>>>> have disappeared.
>>>>
>>>>
>>>>>
>>>>> On Thu, Apr 21, 2016 at 11:24 AM, Xavier Hernandez
>>>>> <xhernandez at datalab.es> wrote:
>>>>>>
>>>>>>
>>>>>> Hi Serkan,
>>>>>>
>>>>>> On 21/04/16 10:07, Serkan ?oban wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I think the problem is in the temporary name
that distcp gives to the
>>>>>>>> file while it's being copied before
renaming it to the real name. Do
>>>>>>>> you
>>>>>>>> know what is the structure of this name ?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Distcp temporary file name format is:
>>>>>>>
".distcp.tmp.attempt_1460381790773_0248_m_000001_0" and the same
>>>>>>> temporary file name used by one map process. For
example I see in the
>>>>>>> logs that one map copies files
part-m-00031,part-m-00047,part-m-00063
>>>>>>> sequentially and they all use same temporary file
name above. So no
>>>>>>> original file name appears in temporary file name.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> This explains the problem. With the default options,
DHT sends all
>>>>>> files
>>>>>> to
>>>>>> the subvolume that should store a file named
'distcp.tmp'.
>>>>>>
>>>>>> With this temporary name format, little can be done.
>>>>>>
>>>>>>>
>>>>>>> I will check if we can modify distcp behaviour, or
we have to write
>>>>>>> our mapreduce procedures instead of using distcp.
>>>>>>>
>>>>>>>> 2. define the option 'extra-hash-regex'
to an expression that matches
>>>>>>>> your temporary file names and returns the same
name that will finally
>>>>>>>> have.
>>>>>>>> Depending on the differences between original
and temporary file
>>>>>>>> names,
>>>>>>>> this
>>>>>>>> option could be useless.
>>>>>>>> 3. set the option 'rsync-hash-regex' to
'none'. This will prevent the
>>>>>>>> name conversion, so the files will be evenly
distributed. However
>>>>>>>> this
>>>>>>>> will
>>>>>>>> cause a lot of files placed in incorrect
subvolumes, creating a lot
>>>>>>>> of
>>>>>>>> link
>>>>>>>> files until a rebalance is executed.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> How can I set these options?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> You can set gluster options using:
>>>>>>
>>>>>> gluster volume set <volname> <option>
<value>
>>>>>>
>>>>>> for example:
>>>>>>
>>>>>> gluster volume set v0 rsync-hash-regex none
>>>>>>
>>>>>> Xavi
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Apr 21, 2016 at 10:00 AM, Xavier Hernandez
>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Serkan,
>>>>>>>>
>>>>>>>> I think the problem is in the temporary name
that distcp gives to the
>>>>>>>> file
>>>>>>>> while it's being copied before renaming it
to the real name. Do you
>>>>>>>> know
>>>>>>>> what is the structure of this name ?
>>>>>>>>
>>>>>>>> DHT selects the subvolume (in this case the ec
set) on which the file
>>>>>>>> will
>>>>>>>> be stored based on the name of the file. This
has a problem when a
>>>>>>>> file
>>>>>>>> is
>>>>>>>> being renamed, because this could change the
subvolume where the file
>>>>>>>> should
>>>>>>>> be found.
>>>>>>>>
>>>>>>>> DHT has a feature to avoid incorrect file
placements when executing
>>>>>>>> renames
>>>>>>>> for the rsync case. What it does is to check if
the file matches the
>>>>>>>> following regular expression:
>>>>>>>>
>>>>>>>>         ^\.(.+)\.[^.]+$
>>>>>>>>
>>>>>>>> If a match is found, it only considers the part
between parenthesis
>>>>>>>> to
>>>>>>>> calculate the destination subvolume.
>>>>>>>>
>>>>>>>> This is useful for rsync because temporary file
names are constructed
>>>>>>>> in
>>>>>>>> the
>>>>>>>> following way: suppose the original filename is
'test'. The temporary
>>>>>>>> filename while rsync is being executed is made
by prepending a dot
>>>>>>>> and
>>>>>>>> appending '.<random chars>':
.test.712hd
>>>>>>>>
>>>>>>>> As you can see, the original name and the part
of the name between
>>>>>>>> parenthesis that matches the regular expression
are the same. This
>>>>>>>> causes
>>>>>>>> that, after renaming the temporary file to its
original filename,
>>>>>>>> both
>>>>>>>> files
>>>>>>>> will be considered to belong to the same
subvolume by DHT.
>>>>>>>>
>>>>>>>> In your case it's very probable that distcp
uses a temporary name
>>>>>>>> like
>>>>>>>> '.part.<number>'. In this case
the portion of the name used to select
>>>>>>>> the
>>>>>>>> subvolume is always 'part'. This would
explain why all files go to
>>>>>>>> the
>>>>>>>> same
>>>>>>>> subvolume. Once the file is renamed to another
name, DHT realizes
>>>>>>>> that
>>>>>>>> it
>>>>>>>> should go to another subvolume. At this point
it creates a link file
>>>>>>>> (those
>>>>>>>> files with access rights =
'---------T') in the correct subvolume but
>>>>>>>> it
>>>>>>>> doesn't move it. As you can see, this kind
of files are better
>>>>>>>> balanced.
>>>>>>>>
>>>>>>>> To solve this problem you have three options:
>>>>>>>>
>>>>>>>> 1. change the temporary filename used by distcp
to correctly match
>>>>>>>> the
>>>>>>>> regular expression. I'm not sure if this
can be configured, but if
>>>>>>>> this
>>>>>>>> is
>>>>>>>> possible, this is the best option.
>>>>>>>>
>>>>>>>> 2. define the option 'extra-hash-regex'
to an expression that matches
>>>>>>>> your
>>>>>>>> temporary file names and returns the same name
that will finally
>>>>>>>> have.
>>>>>>>> Depending on the differences between original
and temporary file
>>>>>>>> names,
>>>>>>>> this
>>>>>>>> option could be useless.
>>>>>>>>
>>>>>>>> 3. set the option 'rsync-hash-regex' to
'none'. This will prevent the
>>>>>>>> name
>>>>>>>> conversion, so the files will be evenly
distributed. However this
>>>>>>>> will
>>>>>>>> cause
>>>>>>>> a lot of files placed in incorrect subvolumes,
creating a lot of link
>>>>>>>> files
>>>>>>>> until a rebalance is executed.
>>>>>>>>
>>>>>>>> Xavi
>>>>>>>>
>>>>>>>>
>>>>>>>> On 20/04/16 14:13, Serkan ?oban wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Here is the steps that I do in detail and
relevant output from
>>>>>>>>> bricks:
>>>>>>>>>
>>>>>>>>> I am using below command for volume
creation:
>>>>>>>>> gluster volume create v0 disperse 20
redundancy 4 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/02 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/02 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/02 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/03 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/03 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/03 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/04 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/04 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/04 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/05 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/05 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/05 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/06 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/06 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/06 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/07 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/07 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/07 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/08 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/08 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/08 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/09 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/09 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/09 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/10 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/10 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/10 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/11 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/11 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/11 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/12 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/12 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/12 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/13 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/13 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/13 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/14 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/14 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/14 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/15 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/15 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/15 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/16 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/16 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/16 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/17 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/17 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/17 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/18 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/18 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/18 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/19 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/19 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/19 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/20 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/20 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/20 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/21 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/21 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/21 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/22 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/22 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/22 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/23 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/23 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/23 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/24 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/24 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/24 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/25 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/25 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/25 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/26 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/26 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/26 \
>>>>>>>>> 1.1.1.{185..204}:/bricks/27 \
>>>>>>>>> 1.1.1.{205..224}:/bricks/27 \
>>>>>>>>> 1.1.1.{225..244}:/bricks/27 force
>>>>>>>>>
>>>>>>>>> Then I mount the volume on 50 clients:
>>>>>>>>> mount -t glusterfs 1.1.1.185:/v0 /mnt/gluster
>>>>>>>>>
>>>>>>>>> Then I make a directory from one of the clients and chmod it:
>>>>>>>>> mkdir /mnt/gluster/s1 && chmod 777 /mnt/gluster/s1
>>>>>>>>>
>>>>>>>>> Then I start distcp on the clients. There are 1059 x 8.8GB files in
>>>>>>>>> one folder, and they are copied to /mnt/gluster/s1 with 100 parallel
>>>>>>>>> copies, which means 2 copy jobs per client at the same time:
>>>>>>>>> hadoop distcp -m 100 http://nn1:8020/path/to/teragen-10tb file:///mnt/gluster/s1
>>>>>>>>>
>>>>>>>>> After the job finished, here is the status of the s1 directory on
>>>>>>>>> the bricks:
>>>>>>>>> The s1 directory is present in all 1560 bricks.
>>>>>>>>> The s1/teragen-10tb folder is present in all 1560 bricks.
>>>>>>>>>
>>>>>>>>> Full listing of files in the bricks:
>>>>>>>>> https://www.dropbox.com/s/rbgdxmrtwz8oya8/teragen_list.zip?dl=0
>>>>>>>>>
>>>>>>>>> You can ignore the .crc files in the brick output above; they are
>>>>>>>>> checksum files.
>>>>>>>>>
>>>>>>>>> As you can see, the part-m-xxxx files are written only to some bricks
>>>>>>>>> on nodes 0205..0224. All bricks have some files, but they have zero
>>>>>>>>> size.
>>>>>>>>>
>>>>>>>>> I increased the file descriptor limit to 65k, so that is not the
>>>>>>>>> issue.
>>>>>>>>>
>>>>>>>>> On Wed, Apr 20, 2016 at 9:34 AM, Xavier Hernandez
>>>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Serkan,
>>>>>>>>>>
>>>>>>>>>> On 19/04/16 15:16, Serkan Çoban wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I assume that gluster is used to store the intermediate files
>>>>>>>>>>>>>> before the reduce phase
>>>>>>>>>>>
>>>>>>>>>>> Nope, gluster is the destination for the distcp command:
>>>>>>>>>>> hadoop distcp -m 50 http://nn1:8020/path/to/folder file:///mnt/gluster
>>>>>>>>>>> This runs maps on the datanodes, all of which have /mnt/gluster
>>>>>>>>>>> mounted.
>>>>>>>>>>
>>>>>>>>>> I don't know hadoop, so I'm of little help here. However, it seems
>>>>>>>>>> that -m 50 means to execute 50 copies in parallel. This means that
>>>>>>>>>> even if the distribution worked fine, at most 50 (most probably
>>>>>>>>>> fewer) of the 78 ec sets would be used in parallel.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>> This means that this is caused by some peculiarity of the
>>>>>>>>>>>>>> mapreduce.
>>>>>>>>>>>
>>>>>>>>>>> Yes, but how can a client write 500 files to the gluster mount and
>>>>>>>>>>> have those files written to only a subset of subvolumes? I cannot
>>>>>>>>>>> use gluster as a backup cluster if I cannot write with distcp.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> All 500 files were created on only one of the 78 ec sets and the
>>>>>>>>>> remaining 77 stayed empty?
>>>>>>>>>>
>>>>>>>>>>>>>> You should look which files are created in each brick and how
>>>>>>>>>>>>>> many while the process is running.
>>>>>>>>>>>
>>>>>>>>>>> Files are only created on nodes 185..204 or 205..224 or 225..244.
>>>>>>>>>>> Only on 20 nodes in each test.
>>>>>>>>>>
>>>>>>>>>> How many files were there in each brick?
>>>>>>>>>>
>>>>>>>>>> Not sure if this can be related, but standard linux distributions
>>>>>>>>>> have a default limit of 1024 open file descriptors. With such a big
>>>>>>>>>> volume and a massive copy, maybe this limit is affecting something?
>>>>>>>>>>
>>>>>>>>>> Are there any error or warning messages in the mount or brick logs?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Xavi
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 19, 2016 at 1:05 PM, Xavier Hernandez
>>>>>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Serkan,
>>>>>>>>>>>>
>>>>>>>>>>>> moved to gluster-users since this doesn't belong to devel list.
>>>>>>>>>>>>
>>>>>>>>>>>> On 19/04/16 11:24, Serkan Çoban wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am copying 10,000 files to the gluster volume using mapreduce
>>>>>>>>>>>>> on clients. Each map process takes one file at a time and copies
>>>>>>>>>>>>> it to the gluster volume.
>>>>>>>>>>>>
>>>>>>>>>>>> I assume that gluster is used to store the intermediate files
>>>>>>>>>>>> before the reduce phase.
>>>>>>>>>>>>
>>>>>>>>>>>>> My disperse volume consists of 78 subvolumes of 16+4 disks each.
>>>>>>>>>>>>> So if I copy >78 files in parallel, I expect each file to go to
>>>>>>>>>>>>> a different subvolume, right?
>>>>>>>>>>>>
>>>>>>>>>>>> If you only copy 78 files, most probably you will get some
>>>>>>>>>>>> subvolumes empty and some others with more than one or two files.
>>>>>>>>>>>> It's not an exact distribution, it's a statistically balanced
>>>>>>>>>>>> distribution: over time and with enough files, each brick will
>>>>>>>>>>>> contain an amount of files in the same order of magnitude, but
>>>>>>>>>>>> they won't have the *same* number of files.
>>>>>>>>>>>>
>>>>>>>>>>>>> In my tests with fio I can see every file goes to a different
>>>>>>>>>>>>> subvolume, but when I start the mapreduce process from clients
>>>>>>>>>>>>> only 78/3=26 subvolumes are used for writing files.
>>>>>>>>>>>>
>>>>>>>>>>>> This means that this is caused by some peculiarity of the
>>>>>>>>>>>> mapreduce.
>>>>>>>>>>>>
>>>>>>>>>>>>> I see that clearly from network traffic. Mapreduce on the client
>>>>>>>>>>>>> side can be run multi-threaded. I tested with 1-5-10 threads on
>>>>>>>>>>>>> each client, but every time only 26 subvolumes are used.
>>>>>>>>>>>>> How can I debug the issue further?
>>>>>>>>>>>>
>>>>>>>>>>>> You should look which files are created in each brick and how
>>>>>>>>>>>> many while the process is running.
>>>>>>>>>>>>
>>>>>>>>>>>> Xavi
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Apr 19, 2016 at 11:22 AM, Xavier Hernandez
>>>>>>>>>>>>> <xhernandez at datalab.es> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Serkan,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 19/04/16 09:18, Serkan Çoban wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi, I just reinstalled fresh 3.7.11 and I am seeing the same
>>>>>>>>>>>>>>> behavior. 50 clients are copying part-0-xxxx named files to
>>>>>>>>>>>>>>> gluster using mapreduce, with one thread per server, and they
>>>>>>>>>>>>>>> are using only 20 servers out of 60. On the other hand, fio
>>>>>>>>>>>>>>> tests use all the servers. Anything I can do to solve the
>>>>>>>>>>>>>>> issue?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Distribution of files to ec sets is done by dht. In theory if
>>>>>>>>>>>>>> you create many files each ec set will receive the same amount
>>>>>>>>>>>>>> of files. However when the number of files is small enough,
>>>>>>>>>>>>>> statistics can fail.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Not sure what you are doing exactly, but a mapreduce procedure
>>>>>>>>>>>>>> generally only creates a single output. In that case it makes
>>>>>>>>>>>>>> sense that only one ec set is used. If you want to use all ec
>>>>>>>>>>>>>> sets for a single file, you should enable sharding (I haven't
>>>>>>>>>>>>>> tested that) or split the result in multiple files.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Xavi
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Serkan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ---------- Forwarded message ----------
>>>>>>>>>>>>>>> From: Serkan Çoban <cobanserkan at gmail.com>
>>>>>>>>>>>>>>> Date: Mon, Apr 18, 2016 at 2:39 PM
>>>>>>>>>>>>>>> Subject: disperse volume file to subvolume mapping
>>>>>>>>>>>>>>> To: Gluster Users <gluster-users at gluster.org>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi, I have a problem where clients are using only 1/3 of the
>>>>>>>>>>>>>>> nodes in a disperse volume for writing.
>>>>>>>>>>>>>>> I am testing from 50 clients using 1 to 10 threads with file
>>>>>>>>>>>>>>> names part-0-xxxx.
>>>>>>>>>>>>>>> What I see is that clients only use 20 nodes for writing. How
>>>>>>>>>>>>>>> is the file name to subvolume hashing done? Is this related to
>>>>>>>>>>>>>>> the file names being similar?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> My cluster is 3.7.10 with 60 nodes, each with 26 disks. The
>>>>>>>>>>>>>>> disperse volume is 78 x (16+4). Only 26 out of 78 subvolumes
>>>>>>>>>>>>>>> are used during writes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
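
Xavier's remark that 78 files will not land one per subvolume can be checked with a rough model. The sketch below assumes every file is placed on one of the 78 subvolumes uniformly and independently. That is a simplification for illustration, not the actual DHT hash, and the function names are made up for this example. It computes the expected number of subvolumes left empty and cross-checks the formula with a small simulation.

```python
import random


def expected_empty(subvolumes: int, files: int) -> float:
    """Expected number of empty subvolumes under uniform, independent placement.

    A given subvolume is missed by one file with probability (1 - 1/subvolumes),
    so it is missed by all files with that probability raised to `files`.
    """
    return subvolumes * (1 - 1 / subvolumes) ** files


def simulated_empty(subvolumes: int, files: int, trials: int = 500, seed: int = 0) -> float:
    """Average number of empty subvolumes over `trials` random placements."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same figure
    total_empty = 0
    for _ in range(trials):
        used = {rng.randrange(subvolumes) for _ in range(files)}
        total_empty += subvolumes - len(used)
    return total_empty / trials


# File counts mentioned in the thread: 78 (hypothetical), 500 and 1059.
for n_files in (78, 500, 1059):
    print(
        f"{n_files:5d} files: expected empty = {expected_empty(78, n_files):.4f}, "
        f"simulated = {simulated_empty(78, n_files):.4f}"
    )
```

Under this model, 78 files leave more than a third of the subvolumes empty, while with 1059 files an empty subvolume is already very unlikely. Seeing 52 of 78 subvolumes stay empty after 1059 files therefore does not look like plain statistical noise.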

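
The figures quoted across the thread (60 nodes, 26 disks each, 16+4 disperse sets, three ranges of 20 nodes) can be tied together with a little arithmetic. The sketch below only uses numbers stated in the messages; the `layout` helper and its field names are hypothetical, and the idea that each ec set sits inside a single 20-node range is an assumption read off the brick list, not something the thread confirms.

```python
def layout(nodes: int, bricks_per_node: int, data: int, redundancy: int, node_groups: int) -> dict:
    """Derive brick and ec-set counts from the cluster figures given in the thread."""
    bricks_per_set = data + redundancy          # 16 + 4 bricks per disperse set
    total_bricks = nodes * bricks_per_node      # every disk is one brick
    ec_sets = total_bricks // bricks_per_set    # number of disperse subvolumes
    return {
        "bricks_per_set": bricks_per_set,
        "total_bricks": total_bricks,
        "ec_sets": ec_sets,
        "nodes_per_group": nodes // node_groups,
        # Assumption: sets are spread evenly over the node groups.
        "sets_per_group": ec_sets // node_groups,
    }


# Node groups are the ranges 185..204, 205..224 and 225..244 from the brick list.
info = layout(nodes=60, bricks_per_node=26, data=16, redundancy=4, node_groups=3)
for key, value in info.items():
    print(f"{key}: {value}")
```

If each 20-brick line of the create command forms one ec set, then every set lives entirely inside one 20-node range and each range hosts 26 sets. Under that reading, "only 26 of 78 subvolumes used" and "only 20 of 60 nodes used" describe the same observation.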