Gluster AFR: The Complete Guide (Part 3)

May 14, 2019

Troubleshooting Issues

This is the final post of the three-part series on Automatic File Replication in gluster. Armed with the knowledge we have gained about the I/O path in part-1 and the self-heal path in part-2, we are now ready to see how to debug AFR issues in your cluster.

The first level of analysis always starts with looking at the log files. Which ones, you ask?

  • /var/log/glusterfs/$fuse-mount-point.log –> Fuse client log
  • /var/log/glusterfs/glfsheal-$volname.log –> This is the log file to look at when you run the heal info/split-brain resolution commands.
  • /var/log/glusterfs/glustershd.log –> This is the self-heal daemon log that prints the names of files undergoing heal, the sources and sinks for each file etc. It is common for all volumes.
  • /var/log/glusterfs/bricks/$brick.log –> Some errors in clients are simply propagated from the bricks themselves, so correlating client log errors with the logs from the brick is necessary.

Sometimes, you might need more verbose logging to figure out what’s going on:

#gluster volume set $volname client-log-level $LEVEL

where LEVEL can be any one of DEBUG, WARNING, ERROR, INFO, CRITICAL, NONE, TRACE. This should ideally make all the log files mentioned above start logging at $LEVEL. The default is INFO, but you can temporarily switch to DEBUG or TRACE if you want to see under-the-hood messages. This is useful when the normal logs don’t give a clue as to what is happening.

Heal related issues:

Most issues I’ve seen on the mailing list and with customers can broadly fit into the following buckets:

(Note: Not discussing split-brains here. If they occur, you need to use split-brain resolution CLI or cluster.favorite-child-policy options to fix them. They usually occur in replica 2 volumes and can be prevented by using replica 3 or arbiter volumes.)

i) Heal info appears to hang/takes a long time to complete

If the number of entries is large, then heal info will take longer than usual. While there are performance improvements to heal info being planned, a faster way to get an approximate count of the pending entries is to use the gluster volume heal $volname statistics heal-count command.

Knowledge Hack: Since we know that during the write transaction the xattrop folder captures the gfid-string of the file if it needs heal, we can also do an ls /brick/.glusterfs/indices/xattrop|wc -l on each brick to get the approximate number of entries that need heal. If this number reduces over time, it is a sign that the heal backlog is reducing. You will also see messages whenever a particular type of heal starts/ends for a given gfid, like so:

[2019-05-07 12:05:14.460442] I [MSGID: 108026] [afr-self-heal-entry.c:883:afr_selfheal_entry_do] 0-testvol-replicate-0: performing entry selfheal on d120c0cf-6e87-454b-965b-0d83a4c752bb
[2019-05-07 12:05:14.474710] I [MSGID: 108026] [afr-self-heal-common.c:1741:afr_log_selfheal] 0-testvol-replicate-0: Completed entry selfheal on d120c0cf-6e87-454b-965b-0d83a4c752bb. sources=[0] 2  sinks=1
[2019-05-07 12:05:14.493506] I [MSGID: 108026] [afr-self-heal-common.c:1741:afr_log_selfheal] 0-testvol-replicate-0: Completed data selfheal on a9b5f183-21eb-4fb3-a342-287d3a7dddc5. sources=[0] 2  sinks=1
[2019-05-07 12:05:14.494577] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-testvol-replicate-0: performing metadata selfheal on a9b5f183-21eb-4fb3-a342-287d3a7dddc5
[2019-05-07 12:05:14.498398] I [MSGID: 108026] [afr-self-heal-common.c:1741:afr_log_selfheal] 0-testvol-replicate-0: Completed metadata selfheal on a9b5f183-21eb-4fb3-a342-287d3a7dddc5. sources=[0] 2  sinks=1
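
If you want to keep an eye on the backlog without reading the log by hand, the "Completed ... selfheal" lines above are easy to tally. Here is a minimal sketch, assuming the message layout stays exactly as in the sample lines; the function names are mine and not part of gluster.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... Completed data selfheal on <gfid>. sources=[0] 2  sinks=1"
COMPLETED_RE = re.compile(
    r"Completed (?P<kind>data|metadata|entry) selfheal on "
    r"(?P<gfid>[0-9a-f-]{36})\.\s+sources=(?P<sources>.*?)\s+sinks=(?P<sinks>.*)$"
)


def completed_heals(log_lines):
    """Yield (kind, gfid, sources, sinks) for every completed heal in the log."""
    for line in log_lines:
        match = COMPLETED_RE.search(line)
        if match:
            yield (
                match["kind"],
                match["gfid"],
                match["sources"].strip(),
                match["sinks"].strip(),
            )


def count_by_kind(log_lines):
    """Count completed heals per heal type (data / metadata / entry)."""
    return Counter(kind for kind, _, _, _ in completed_heals(log_lines))
```

Feeding it the lines of glustershd.log (for example via `open(path)`) gives a per-type count that you can compare between two points in time.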

ii) Self-heal is stuck/ not getting completed.

If a file seems to be forever appearing in heal info and not healing, check the following:

  • Examine the afr xattrs: do they clearly indicate the good and bad copies? If there isn’t at least one good copy, then the file is in split-brain and you would need to use the split-brain resolution CLI.
  • Identify which node’s shds would be picking up the file for heal. If a file is listed in the heal info output under brick1 and brick2, then the shds on the nodes which host those bricks would attempt the heal (and one of them would succeed).
  • Once the shd is identified, look at the shd logs to see if it is indeed connected to the bricks.

This is good:

[2019-05-07 09:53:02.912923] I [MSGID: 114046] [client-handshake.c:1106:client_setvolume_cbk] 0-testvol-client-2: Connected to testvol-client-2, attached to remote volume '/bricks/brick3'

This indicates a disconnect:

[2019-05-07 11:44:47.602862] I [MSGID: 114018] [client.c:2334:client_rpc_notify] 0-testvol-client-2: disconnected from testvol-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2019-05-07 11:44:50.953516] E [MSGID: 114058] [client-handshake.c:1456:client_query_portmap_cbk] 0-testvol-client-2: failed to get the port number for remote subvolume. Please run 'gluster volume status' on server to see if brick process is running.

Alternatively, take a statedump of the shd and check if all client xlators are connected to the respective bricks. The shd must have connected=1 for all the client xlators, meaning it can talk to all the bricks.

[Figure: Shd’s statedump entry of a client xlator that is connected to the 3rd brick]
[Figure: Shd’s statedump entry of the same client xlator if it is disconnected from the 3rd brick]

If there are connection issues (i.e. connected=0), you would need to investigate and fix them. Check if the pid and the TCP/RDMA port of the brick process from gluster volume status $VOLNAME match those in the output of ps aux|grep glusterfsd|grep $brick-path:

[root@tuxpad glusterfs]# gluster volume status
Status of volume: testvol
Gluster process                       TCP Port RDMA Port Online Pid
Brick        49152      0        Y   12527

[root@tuxpad glusterfs]# ps aux|grep brick1
root 12527 0.0 0.1 1459208 20104 ? Ssl 11:20 0:01 /usr/local/sbin/glusterfsd -s --volfile-id testvol. -p /var/run/gluster/vols/testvol/ -S /var/run/gluster/70529980362a17d6.socket --brick-name /bricks/brick1 -l /var/log/glusterfs/bricks/bricks-brick1.log --xlator-option *-posix.glusterd-uuid=d90b1532-30e5-4f9d-a75b-3ebb1c3682d4 --process-name brick --brick-port 49152 --xlator-option testvol-server.listen-port=49152

Though this will likely match, sometimes there could be a bug leading to stale port usage. A quick workaround would be to restart glusterd on that node and check if things match. Report the issue to the devs if you see this problem.
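
The comparison above can be scripted. The sketch below pulls the pid, brick path and port out of one `ps aux` line of a glusterfsd process, assuming the standard `ps aux` column layout and the `--brick-name` / `--brick-port` arguments seen in the output above. The function name is hypothetical.

```python
def brick_process_info(ps_line):
    """Extract pid, brick path and port from one `ps aux` line of glusterfsd."""
    fields = ps_line.split()
    pid = int(fields[1])   # second column of `ps aux` is the PID
    args = fields[10:]     # the command and its arguments start at column 11

    def value_after(flag):
        # Return the argument that follows `flag`, or None if the flag is absent.
        return args[args.index(flag) + 1] if flag in args else None

    port = value_after("--brick-port")
    return {
        "pid": pid,
        "brick": value_after("--brick-name"),
        "port": int(port) if port is not None else None,
    }
```

You can then compare the returned pid and port with the Pid and TCP Port columns of gluster volume status for the same brick.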

  • I have seen some cases where a file is listed in heal info, and the afr xattrs indicate pending metadata or data heal, but the file itself is not present on all bricks. Ideally, the parent directory of the file must have pending entry heal xattrs so that the file either gets created on the missing bricks or gets deleted from the ones where it is present. But if the parent dir doesn’t have xattrs, the entry heal can’t proceed. In such cases, you can
    • Either do a lookup directly on the file from the mount so that name heal is triggered and then shd can pick up the data/metadata heal.
    • Or manually set entry xattrs on the parent dir to emulate an entry heal so that the file gets created as a part of it.
  • If a brick’s underlying filesystem/lvm was damaged and fsck’d to recovery, some files/dirs might be missing on it. If there is a lot of missing info on the recovered bricks, it might be better to just do a replace-brick or reset-brick and let the heal fully sync everything rather than fiddling with afr xattrs of individual entries.

Hack: How to trigger heal on *any* file/directory
Knowing about self-heal logic and index heal from the previous post, we can sort of emulate a heal with the following steps. This is not something that you should be doing on your cluster but it pays to at least know that it is possible when push comes to shove.
1. Pick one brick as good and set the afr pending xattr on it, blaming the bad bricks.
2. Capture the gfid inside .glusterfs/indices/xattrop so that the shd can pick it up during index heal.
3. Finally, trigger index heal: gluster volume heal $VOLNAME.

Example: Let us say a FILE-1 exists with trusted.gfid=0x1ad2144928124da9b7117d27393fea5c on all bricks of a replica 3 volume called testvol. It has no afr xattrs.  But you still need to emulate a heal. Let us say you choose brick-2 as the source. Let us do the steps listed above:

1. Make brick-2 blame the other 2 bricks (client indices start at 0, so brick-1 is client-0 and brick-3 is client-2):
[root@tuxpad fuse_mnt]# setfattr -n trusted.afr.testvol-client-2 -v 0x000000010000000000000000 /bricks/brick2/FILE-1
[root@tuxpad fuse_mnt]# setfattr -n trusted.afr.testvol-client-0 -v 0x000000010000000000000000 /bricks/brick2/FILE-1

2. Store the gfid string inside xattrop folder as a hardlink to the base entry:
[root@tuxpad ~]# cd /bricks/brick2/.glusterfs/indices/xattrop/
[root@tuxpad xattrop]# ls -li
total 0
17829255 ----------. 1 root root 0 May 10 11:20 xattrop-a400ca91-cec9-4463-a183-aca9eaff9fa7
[root@tuxpad xattrop]# ln xattrop-a400ca91-cec9-4463-a183-aca9eaff9fa7 1ad21449-2812-4da9-b711-7d27393fea5c
[root@tuxpad xattrop]# ll
total 0
----------. 2 root root 0 May 10 11:20 1ad21449-2812-4da9-b711-7d27393fea5c
----------. 2 root root 0 May 10 11:20 xattrop-a400ca91-cec9-4463-a183-aca9eaff9fa7

3. Trigger heal: gluster volume heal testvol
The glustershd.log of node-2 should log about the heal.
[2019-05-10 06:10:46.027238] I [MSGID: 108026] [afr-self-heal-common.c:1741:afr_log_selfheal] 0-testvol-replicate-0: Completed data selfheal on 1ad21449-2812-4da9-b711-7d27393fea5c. sources=[1] sinks=0 2
So the data was healed from the second brick to the first and third brick.
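
Step 2 needs the gfid in its dashed string form, while getfattr prints it as one long hex value. A small helper for that conversion could look like this; it only uses Python's standard uuid module, and the function name is my own.

```python
import uuid


def gfid_xattr_to_index_name(gfid_hex):
    """Convert a trusted.gfid value as printed by `getfattr -e hex`
    (e.g. 0x1ad2...) into the dashed gfid string used as the index file name."""
    digits = gfid_hex[2:] if gfid_hex.lower().startswith("0x") else gfid_hex
    return str(uuid.UUID(hex=digits))
```

For the example file, `gfid_xattr_to_index_name("0x1ad2144928124da9b7117d27393fea5c")` gives the name that was hard-linked inside the xattrop folder above.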

Update: Please also see Karthik’s blogpost which has information on healing related issues.

iii) Self-heal is too slow

If the heal backlog is decreasing and you see glustershd logging heals but you’re not happy with the rate of healing, then you can play around with shd-max-threads and shd-wait-qlength volume options.

Option: cluster.shd-max-threads
Default Value: 1
Description: Maximum number of parallel heals SHD can do per local brick. This can substantially lower heal times, but can also crush your bricks if you don’t have the storage hardware to support this.
Option: cluster.shd-wait-qlength
Default Value: 1024
Description: This option can be used to control number of heals that can wait in SHD per subvolume

I’m not covering it here but it is possible to launch multiple shd instances (and kill them later on) on your node for increasing heal throughput. It is documented at

iv) Self-heal is too aggressive and slows down the system.

If shd-max-threads is at the lowest value (i.e. 1) and the CPU usage of the bricks is still too high, you can check if the volume’s profile info shows a lot of RCHECKSUM fops. Data self-heal does checksum calculation (i.e. the posix_rchecksum() FOP) which can be CPU intensive. You can set the cluster.data-self-heal-algorithm option to full. This does a full file copy instead of computing rolling checksums and syncing only the mismatching blocks. The tradeoff is that the network consumption will be increased.
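
To make the CPU versus network tradeoff concrete, here is a toy comparison of the two approaches on in-memory byte strings. This is only an illustration of the idea, not gluster's code: the block size is made up, and sha256 stands in for whatever checksum posix_rchecksum() really computes.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical and tiny, so the example is easy to follow


def blocks(data, size=BLOCK_SIZE):
    """Split data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def diff_heal(source, sink, size=BLOCK_SIZE):
    """Checksum every block and copy only the mismatching ones.
    Returns (healed_bytes, blocks_copied)."""
    sink_blocks = blocks(sink, size)
    healed, copied = [], 0
    for i, src in enumerate(blocks(source, size)):
        old = sink_blocks[i] if i < len(sink_blocks) else None
        if old is None or hashlib.sha256(old).digest() != hashlib.sha256(src).digest():
            healed.append(src)   # mismatch: this block goes over the network
            copied += 1
        else:
            healed.append(old)   # match: CPU was spent, but nothing is transferred
    return b"".join(healed), copied


def full_heal(source, size=BLOCK_SIZE):
    """Copy every block without computing any checksum.
    Returns (healed_bytes, blocks_copied)."""
    return source, len(blocks(source, size))
```

With a 3-block file where only the middle block differs, diff_heal transfers one block after checksumming all three, while full_heal transfers all three and checksums none.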

You can also disable all client-side heals if they are turned on so that the client bandwidth is consumed entirely by the application FOPs and not by client-side background heals, i.e. turn off cluster.data-self-heal, cluster.metadata-self-heal and cluster.entry-self-heal.

Mount related issues:

i) All fops are failing with ENOTCONN

Check mount log/ statedump for loss of quorum, just like for glustershd. If this is a fuse client (as opposed to an nfs/ gfapi client), you can also check the .meta folder to check the connection status to the bricks.
[root@tuxpad ~]# cat /mnt/fuse_mnt/.meta/graphs/active/testvol-client-*/private |grep connected
connected = 0
connected = 1
connected = 1

If connected=0, the connection to that brick is lost. Find out why. If the client is not connected to a quorum number of bricks, then AFR fails lookups (and therefore any subsequent FOP) with Transport endpoint is not connected.
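
If you have many mounts to check, the .meta output can be parsed with a few lines of code. This is a minimal sketch, assuming the "connected = N" format shown above; the number of bricks needed for quorum is passed in by the caller (2 for the replica 3 example), since the sketch does not try to work out the volume's quorum settings.

```python
def connected_flags(meta_output):
    """Parse lines like 'connected = 1' into one boolean per client xlator."""
    flags = []
    for line in meta_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "connected":
            flags.append(value.strip() == "1")
    return flags


def has_quorum(flags, needed):
    """True if at least `needed` bricks are connected."""
    return sum(flags) >= needed
```

For the output above, connected_flags() returns one False and two True values, so a replica 3 client that needs two bricks still has quorum.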

ii) FOPs on some files are failing with ENOTCONN

Check mount log for the file being unreadable:
[2019-05-10 11:04:01.607046] W [MSGID: 108027] [afr-common.c:2268:afr_attempt_readsubvol_set] 13-testvol-replicate-0: no read subvols for /FILE.txt
[2019-05-10 11:04:01.607775] W [fuse-bridge.c:939:fuse_entry_cbk] 0-glusterfs-fuse: 234: LOOKUP() /FILE.txt => -1 (Transport endpoint is not connected)

This means there was only  1 good copy and the client has lost connection to that brick.  You need to ensure that the client is connected to all bricks.

iii) Mount is hung

It can be difficult to pin-point the issue immediately and might require assistance from the developers but the first steps to debugging could be to

  • strace the fuse mount; see where it is hung.
  • Take a statedump of the mount to see which xlator has frames that are not unwound (i.e. complete=0) and for which FOP. Then check the source code to see if there are any unhandled cases where the xlator doesn’t wind the FOP to its child.
  • Take statedump of bricks to see if there are any stale locks. An indication of stale locks is the same lock being present in multiple statedumps or the ‘granted’ date being very old.
Excerpt from a brick statedump:

inodelk.inodelk[0](ACTIVE)=type=WRITE, whence=0, start=0, len=0,
pid = 18446744073709551610, owner=700a0060037f0000, client=0x7fc57c09c1c0,
granted at 2018-10-14 07:18:40

While stale lock issues are candidates for bug reports, the locks xlator on the brick releases locks from a particular client upon a network disconnect. That can be used as a workaround to release the stale locks, i.e. restart the brick or restart the client or induce a network disconnect between them.
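
Spotting an old ‘granted’ date by eye gets tedious in a large statedump. The sketch below scans the text for "granted at" timestamps in the format of the excerpt above and returns the ones older than a threshold. The one hour default is an arbitrary assumption of mine; pick whatever makes sense for your workload.

```python
import re
from datetime import datetime, timedelta

GRANTED_RE = re.compile(r"granted at (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def old_locks(statedump_text, now, max_age=timedelta(hours=1)):
    """Return the grant times of locks granted more than max_age before `now`."""
    stale = []
    for stamp in GRANTED_RE.findall(statedump_text):
        granted = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if now - granted > max_age:
            stale.append(granted)
    return stale
```

An old timestamp alone is only a hint. As mentioned above, the same lock showing up in multiple statedumps is the other indication to look for.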

That concludes this post. I hope you found it useful. If something is not clear or missing or needs more elaboration, feel free to comment and I’ll update the post.

Categories: gluster

Gluster AFR: The Complete Guide (Part 2)

April 15, 2019

In part-1 of this guide,  we saw the various steps performed by AFR for replicating data from clients. Let us see how it does self-heal of data in this post:

Self-heal logic.

We already know that AFR increments and/or decrements the dirty (i.e. trusted.afr.dirty) and pending (i.e. trusted.afr.$VOLNAME-client-x) xattrs during the different phases of the transaction. For a given file (or directory), an all-zero value of these xattrs or the total absence of these xattrs on all bricks of the replica means the file is healthy and does not need heal. If any of these xattrs are non-zero even on one of the bricks, then the file is a candidate for heal. It is as simple as that.

When we say these xattrs are non-zero, it is in the context of no I/O from client(s) being in progress on the file. Otherwise the non-zero values that one observes might be transient as the write transaction is progressing through its five phases. Of course, as an admin, you wouldn’t need to figure out all of this. Just running the `heal info` set of commands should give you the list of files that need heal.

So if self-heal observes a file with non-zero xattrs, it does the following steps:

  1. Fetch the afr xattrs, examine which set of 4 bytes (8 hex digits) is non-zero and determine the corresponding heals that are needed on the file – i.e. data heal/ metadata heal/ entry heal.
  2. Determine which bricks are good (a.k.a. ‘sources’) and which ones are bad (a.k.a. ‘sinks’) for each of those heals by interpreting the xattr values.
  3.  Pick one source brick and heal the file on to all the sink bricks.
  4. If the heal is successful, reset the afr xattrs to zero.

This is a rather simplified description and I have omitted details about various locks that each of these steps need to take because self-heal and client I/O can happen in parallel on the file. Or even multiple self-heal daemons (described later) can attempt to heal the same file.
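
Step 2 can be pictured as reading a small "who blames whom" matrix. The following is a deliberately simplified sketch for one heal type: it ignores the dirty xattr, locks, bricks that are down and everything else that the real AFR code has to consider. The function name and the matrix representation are my own.

```python
def classify_bricks(pending):
    """pending[i][j] > 0 means brick i's pending xattr blames brick j
    (for a single heal type). Returns (state, sources, sinks)."""
    n = len(pending)
    # A brick is a sink if any brick blames it.
    sinks = sorted({j for i in range(n) for j in range(n) if pending[i][j] > 0})
    # Every brick that nobody blames is a source.
    sources = [i for i in range(n) if i not in sinks]
    if not sinks:
        state = "healthy"
    elif not sources:
        state = "split-brain"   # everybody is blamed, so there is no good copy
    else:
        state = "needs-heal"
    return state, sources, sinks
```

For example, if bricks 0 and 1 of a replica 3 both blame brick 2, the result is sources 0 and 1 and sink 2. If the two bricks of a replica 2 blame each other, there is no source left, which is the split-brain case.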

Data heal: Happens only for files. The contents of the file are copied from the source to the sink bricks.

Entry heal: Happens only for directories. Entries (i.e. files and subdirs) under a given directory are deleted from the sinks if they are not present in the source. Likewise, entries that are present in the source but missing on the sinks are created on the sinks.

Metadata  heal:  Happens for both files and directories. File ownership, file permissions and extended attributes are copied from the source to the sink bricks.

It can be possible that for a given file, one set of bricks can be the source for data heal while another set could be the source for metadata heals. It all depends on which FOPs failed on what bricks and therefore what set of bytes are non-zero for the afr xattrs.

When do self-heals happen?

There are two places from which the steps described above for healing can be carried out:

i) From the client side.

Client-side heals are triggered when the file is accessed from the client (mount). AFR uses a monotonically increasing generation number to keep track of disconnect/connect of its children (i.e. the client translators) to the bricks. When this ‘event generation’ number changes, the file’s inode is marked as a candidate for refresh. When the next FOP comes on such an inode, a refresh is triggered to update the readables, during which a heal is launched (if the AFR xattrs indicate that a heal is needed, that is). This heal happens in the background, meaning it does not block the actual FOP, which will continue as usual after the refresh. Specific client-side heals can be turned off by disabling the 3 corresponding volume options:

  • cluster.metadata-self-heal
  • cluster.data-self-heal
  • cluster.entry-self-heal

The number of client-side heals that happen in the background can be tuned via the following volume options:

  • cluster.background-self-heal-count
  • cluster.heal-wait-queue-length

See the gluster volume set help for more information on all the above options.

Name heal: Name heal is just healing of the file/directory name when it is accessed. For example, say a file is created and written to when a brick is down and all the 3 client side heals are disabled. When the brick comes up and the next I/O comes on it, the file name is created on it as  a part of lookup. Its contents/metadata are not healed though. Name heal cannot be disabled. It is there to ensure that the namespace is consistent on all bricks as soon as the file is accessed.

ii) By the self-heal daemon.

There is a self-heal daemon process (glustershd) that runs on every node of the trusted storage pool. It is a lightweight client process consisting mainly of AFR and the protocol/client translators. It can talk to all bricks of all the replicate volume(s) of the pool. It periodically crawls (every 10 minutes by default; tunable via the heal-timeout volume option) the list of files that need heal and does their healing. As you can see, client side heal is done upon file access but glustershd processes the heal backlog pro-actively.

Index heal:

But how does glustershd know which files it needs to heal? Where does it get the list from? So in part-1, while we saw the five phases of the AFR write transaction, we left out one detail:

  • In the pre-op phase, in addition to marking the dirty xattr, each brick also stores the gfid string of the file inside its “.glusterfs/indices/dirty” directory.
  • Likewise, in the post-op phase, it removes the gfid string from its “.glusterfs/indices/dirty”. In addition, if the write failed on some brick, the good bricks will store the gfid string inside the “.glusterfs/indices/xattrop” directory.

Thus when no I/O is happening on a file and you still find its gfid inside “.glusterfs/indices/dirty” of a particular brick, it means the brick went down before the post-op phase. If you find the gfid inside “.glusterfs/indices/xattrop“, it means the write failed on some other brick and this brick has captured it.

The glustershd simply reads the list of entries inside .glusterfs/indices/* and triggers heal on them. This is referred to as index heal. While this happens automatically every heal-timeout seconds, we can also manually trigger it via the CLI using `gluster volume heal $VOLNAME`.
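
To see for yourself what an index crawl would find on a brick, you can list an index directory and keep only the names that look like gfids. A minimal sketch, assuming that anything in the directory that is not a dashed 36-character gfid string (for example a base file named xattrop-<uuid>) should be skipped; the function name is hypothetical.

```python
import os
import uuid


def pending_gfids(index_dir):
    """Return the gfid-named entries of an index directory such as
    <brick>/.glusterfs/indices/xattrop, skipping everything else."""
    gfids = []
    for name in sorted(os.listdir(index_dir)):
        if len(name) != 36:
            continue            # e.g. the base file "xattrop-<uuid>"
        try:
            uuid.UUID(name)     # raises ValueError if it is not a valid gfid string
        except ValueError:
            continue
        gfids.append(name)
    return gfids
```

Run it as root on the brick node, once per index directory, and the length of the returned list is the approximate heal backlog seen by that brick.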

Full heal:

A full heal, triggered from the CLI with `gluster volume heal $VOLNAME  full`, does just what the name implies. It does not process a particular list of entries like index heal, but crawls the whole gluster filesystem beginning with root, examines if files have non zero afr xattrs and triggers heal on them.

Of missing xattrs and split-brains:

You might now realise how AFR pretty much relies on its xattr values of a given file- from using it to find the good copies to serve a read to finding out the source and sink bricks to heal the file. But what if there is inconsistency in data/metadata of a file and

(a) there are zero/ no AFR xattrs (or)

(b) if the xattrs all blame each other (i.e. no good copy=>split-brain)?

For (a), AFR uses heuristics like picking a local (to that specific glustershd process) brick, picking the bigger file, picking the file with latest ctime etc. and then does the heal.

For (b) you need to resort to using the gluster split-brain resolution CLI or setting the favorite-child-policy volume option to choose a good copy and trigger the heal.

With that we conclude this post. The next one will look at troubleshooting AFR related issues in your cluster.

Categories: gluster

Gluster AFR: The Complete Guide

April 5, 2019

With that ambitious heading, let me make an attempt to document what one needs to know about the Automatic File Replication (AFR) translator in order to monitor and fix issues related to replicated gluster volumes. I’ll keep it as high level as possible and not delve too much into code-level details but it is assumed that you already have a fair knowledge of what gluster is and how to set up replicate volumes etc.  The motivation for this guide was to have some sort of a reference material that support engineers or sys-admins could use when manual intervention is required for solving AFR related issues.

Knowing the theory behind how something works under the hood can greatly improve the way you debug problems. So let us have a quick run down on AFR-101. I’ll structure this guide into three separate parts:

Replication logic.

Self-heal logic.

Trouble-shooting issues.

Let us look at the first part in this post:

Replication logic

AFR is the module (translator) in glusterfs that provides all the features that you would expect of any synchronous replication system:

  1. Simultaneous updating of all copies of data on the replica bricks when a client modifies it.
  2. Providing continued data availability to clients when say one brick of the replica set goes down.
  3. Automatic self-healing of any data that was modified while a brick was down, once that brick comes back up, ensuring consistency of data on all the bricks of the replica.

1 and 2 are in the I/O path while 3 is done either in the I/O path (in the background) or via the self-heal daemon.

Each gluster translator implements what are known as File Operations (FOPs) which are  mapped to the I/O syscalls which the application makes. For example, AFR has afr_writev that gets invoked when application does a write(2). As is obvious, all FOPs fall into one of 2 types:

i) Read based FOPs which only get information from and don’t modify the file in any way.

viz: afr_readdir, afr_access, afr_stat, afr_fstat, afr_readlink, afr_getxattr, afr_fgetxattr, afr_readv, afr_seek

ii) Write based FOPs which change the file or its attributes.

viz: afr_create, afr_mknod, afr_mkdir, afr_link, afr_symlink, afr_rename, afr_unlink, afr_rmdir, afr_do_writev, afr_truncate, afr_ftruncate, afr_setattr, afr_fsetattr, afr_setxattr, afr_fsetxattr, afr_removexattr, afr_fremovexattr, afr_fallocate, afr_discard, afr_zerofill, afr_xattrop, afr_fxattrop, afr_fsync.

AFR follows a transaction model for both types of FOPs.

Read transactions:

For every file in the replica, AFR has an in-memory notion/array called ‘readables’ which indicates whether each brick of the replica is a good copy or a bad one (i.e. in need of a heal). In a healthy state, all bricks are readable and a read FOP will be served from any one of the readable bricks. The read-hash-mode volume option decides which brick is the chosen one.

[root@tuxpad glusterfs]# gluster volume set help|grep read-hash-mode -A7
Default Value: 1
Description: inode-read fops happen only on one of the bricks in replicate. AFR will prefer the one computed using the method specified using this option.
0 = first readable child of AFR, starting from 1st child.
1 = hash by GFID of file (all clients use same subvolume).
2 = hash by GFID of file and client PID.
3 = brick having the least outstanding read requests.

If the brick is bad for a given file (i.e. it is pending heal), then it won’t be marked readable to begin with. The readables array is populated based on the on-disk AFR xattrs for the file during lookup. These xattrs indicate which bricks are good and which ones are bad. We will see more about these xattrs in the write transactions section below. If the FOP fails on the chosen readable brick, AFR attempts it on the next readable one, until all are exhausted. If the FOP doesn’t succeed on any of the readables, then the  application receives an error.
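
Here is a rough sketch of the read path described above: pick a readable brick according to read-hash-mode and fall back to the next readable one if the read fails. It is an illustration only. Mode 2 is left out, crc32 is just a stand-in because the post does not say which hash AFR uses, and all the names are mine.

```python
import uuid
import zlib


def pick_read_child(readables, mode, gfid=None, outstanding=None):
    """Pick the index of the brick to read from, given the readables array."""
    candidates = [i for i, readable in enumerate(readables) if readable]
    if not candidates:
        raise OSError("no readable brick for this file")
    if mode == 0:
        # first readable child, starting from the 1st child
        return candidates[0]
    if mode == 1:
        # hash by GFID; crc32 is an assumption standing in for AFR's real hash
        digest = zlib.crc32(uuid.UUID(gfid).bytes)
        return candidates[digest % len(candidates)]
    if mode == 3:
        # brick having the least outstanding read requests
        return min(candidates, key=lambda i: outstanding[i])
    raise ValueError("mode not covered by this sketch")


def read_with_failover(readables, first_choice, do_read):
    """Try first_choice, then every other readable brick, until a read succeeds.
    do_read(brick_index) is a caller-supplied function that raises OSError on failure."""
    order = [first_choice] + [
        i for i, readable in enumerate(readables) if readable and i != first_choice
    ]
    last_error = OSError("no readable brick for this file")
    for brick in order:
        try:
            return do_read(brick)
        except OSError as err:
            last_error = err
    raise last_error  # all readables exhausted: the application sees the error
```

The point to take away is that a brick which is not in the readables array is never considered, no matter which mode is configured.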

Write transactions:

Every write based FOP employs a write transaction model which consists of 5 phases:
1) The lock phase
Take locks on the file being modified on all bricks so that AFRs of other clients are blocked if they try to modify the same file simultaneously.

2) The pre-op phase
Increment the ‘dirty’ xattr (trusted.afr.dirty) by 1 on all participating bricks as an indication of an impending FOP (in the next phase)

3) The FOP phase
Perform the actual FOP (say a setfattr) on all bricks.

4) The post-op phase
Decrement the dirty xattr by 1 on bricks where the FOP was successful.
In addition, also increment the ‘pending’ xattr (trusted.afr.$VOLNAME-client-x) xattr on the success bricks to ‘blame’ the bricks where the FOP failed.

5) The unlock phase
Release the locks that were taken in phase 1. Any competing client can now go ahead with its own write transaction.

Note: There are certain optimizations done at the code level which reduce the number of lock/unlock phases done for a transaction by piggybacking on the previous transaction’s locks. These optimizations (eager-locking, piggybacking and delayed post-op) are beyond the scope of this post.

AFR returns success for these FOPs only if they meet quorum. For replica 2, this means it needs to succeed on any one brick. For replica 3, it is two out of three and so on.

More on the AFR xattrs:

We saw that AFR modifies the dirty and pending xattrs in the pre-op and post-op phases. To be more precise, only parts of the xattr are modified in a given transaction. Which bytes are modified depends on the type of write transaction which the FOP belongs to.

Transaction Type | FOPs that belong to it
AFR_DATA_TRANSACTION | afr_writev, afr_truncate, afr_ftruncate, afr_fsync, afr_fallocate, afr_discard, afr_zerofill
AFR_METADATA_TRANSACTION | afr_setattr, afr_fsetattr, afr_setxattr, afr_fsetxattr, afr_removexattr, afr_fremovexattr, afr_xattrop, afr_fxattrop
AFR_ENTRY_TRANSACTION | afr_create, afr_mknod, afr_mkdir, afr_link, afr_symlink, afr_rename, afr_unlink, afr_rmdir

Stop here and convince yourself that given a write based FOP, you can say which one of the 3 transaction types it belongs to.

Note: In the code, there is also an AFR_ENTRY_RENAME_TRANSACTION (used by afr_rename) but it is safe to assume that it is identical to AFR_ENTRY_TRANSACTION as far as interpreting the xattrs is concerned.

Consider the xattr:
The first 4 bytes of the xattr are used for data transactions, the next 4 bytes for metadata transactions and the last 4 for entry transactions. Let us see some examples of how the xattr would look like for various types of FOPs during a transaction:

FOP | Value after pre-op phase | Value after post-op phase
afr_writev | trusted.afr.dirty=0x00000001 00000000 00000000 | trusted.afr.dirty=0x00000000 00000000 00000000
afr_setattr | trusted.afr.dirty=0x00000000 00000001 00000000 | trusted.afr.dirty=0x00000000 00000000 00000000
afr_create | trusted.afr.dirty=0x00000000 00000000 00000001 | trusted.afr.dirty=0x00000000 00000000 00000000

Thus depending on the type of FOP (i.e. data/ metadata/ entry transaction), a different set of bytes of the dirty xattr gets incremented/ decremented. Modification of the pending xattr also follows the same pattern, except it is incremented only in the post-op phase if the FOP fails on some bricks.
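
When you are staring at getfattr output, a tiny decoder saves some squinting. This sketch splits a 12-byte afr xattr value into its data, metadata and entry counters, following the layout described above and reading each counter as a big-endian 32-bit number (which is how the examples above are written out). The function name is my own.

```python
import struct


def decode_afr_xattr(hex_value):
    """Split an afr xattr value (as printed by `getfattr -e hex`)
    into its data / metadata / entry counters."""
    digits = hex_value.replace(" ", "")
    if digits.lower().startswith("0x"):
        digits = digits[2:]
    raw = bytes.fromhex(digits)
    if len(raw) != 12:
        raise ValueError("expected a 12-byte afr xattr value")
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}
```

A non-zero "data" counter means a pending data heal, and likewise for the other two.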

Let us say a write was performed on a file, say FILE1, on replica 3 volume called ‘testvol’. Suppose the lock and pre-op phase succeeded on all bricks. After that the 3rd brick went down, and the transaction completed successfully on the first 2 bricks.
What will be the state of the afr xattrs on all bricks?

[root@tuxpad ravi]# getfattr -d -m . -e hex /bricks/brick1/FILE1|grep afr
getfattr: Removing leading '/' from absolute path names
[root@tuxpad ravi]#
[root@tuxpad ravi]# getfattr -d -m . -e hex /bricks/brick2/FILE1|grep afr
getfattr: Removing leading '/' from absolute path names
[root@tuxpad ravi]#
[root@tuxpad ravi]# getfattr -d -m . -e hex /bricks/brick3/FILE1|grep afr
getfattr: Removing leading '/' from absolute path names
[root@tuxpad ravi]#

So Brick3 will still have the dirty xattr set because it went down before the post-op had a chance to decrement it. Bricks 1 and 2 will have a zero dirty xattr and in addition, a non-zero pending xattr set. The client-2 in trusted.afr.testvol-client-2 indicates that the 3rd brick is bad and has some pending data operations.
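
If you want to replay this scenario on paper, the five phases can be mimicked with a small in-memory model. This is only a toy simulation of the bookkeeping described in this post, not gluster code: the class and function names are hypothetical, locks are left out, and an xattr that was never set is modelled as all zeros.

```python
DATA, METADATA, ENTRY = 0, 1, 2  # position of each counter inside the xattr


class Brick:
    def __init__(self, index, replica_count):
        self.index = index
        self.up = True
        self.dirty = [0, 0, 0]                       # trusted.afr.dirty
        # trusted.afr.$VOLNAME-client-x, keyed by x
        self.pending = {x: [0, 0, 0] for x in range(replica_count)}


def write_transaction(bricks, kind, goes_down_after_preop=()):
    """Run one write transaction; returns the number of bricks it succeeded on."""
    # 1) lock phase: omitted in this sketch
    # 2) pre-op phase: mark dirty on all participating bricks
    participants = [b for b in bricks if b.up]
    for b in participants:
        b.dirty[kind] += 1
    # some bricks go down between pre-op and the FOP
    for index in goes_down_after_preop:
        bricks[index].up = False
    # 3) FOP phase: succeeds only on bricks that are still up
    succeeded = [b for b in participants if b.up]
    failed = [b for b in participants if not b.up]
    # 4) post-op phase: clear dirty and blame the failed bricks
    for b in succeeded:
        b.dirty[kind] -= 1
        for bad in failed:
            b.pending[bad.index][kind] += 1
    # 5) unlock phase: omitted in this sketch
    return len(succeeded)


def xattr_hex(counters):
    """Format three counters the way getfattr -e hex would print them."""
    return "0x" + "".join(f"{c:08x}" for c in counters)
```

Creating three Brick objects and calling write_transaction(bricks, DATA, goes_down_after_preop=[2]) leaves bricks 0 and 1 with a zero dirty value and a non-zero client-2 pending value, while brick 2 keeps its dirty value, which is the state described above.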

That concludes this post. We will look at how self-heal works in the next one.

Categories: gluster

The deal with south Indian names

January 8, 2011

Every time I have to fill up my surname in an application form, I feel that I’m cheating myself. Why? Because I’m south Indian and we (a majority of us anyway) do not use surnames. South Indians have this concept of using an ‘initial’ wherein we place alphabet(s) before our names. The alphabet can stand for different things depending on the state where you are from. For Malayalees it’s their house name (‘Tharavadu’), for Tamilians it’s their father’s name and for the folks from Andhra, I think it is the name of the place where they are from.

Being the Chennai raised fraud mallu that I am, I have my dad’s name as my initial. The story goes that the initial was originally my house name but the primary school in Chennai would not admit me unless I had it changed the Tamil way.

Coming back to application forms, now that we had to compulsorily fill the surname column, we chose the easy way out: we simply expanded our initial and made it our surname (don’t pretend you did not do this; I know what is in your passport!) So essentially, my dad’s first name is my last name. How much more confusing can it get? Apparently a lot more because some of my good friends have 4 initials and I have dared not to ask them for their expansions. If you are from the north (read north/east/west), you conveniently used your caste or clan name as your surname. Pity us southies who preferred to remain anti-racial 🙂 If you see a south Indian using his caste (nair/naidu/iyer etc) as his last name, rest assured that the person was not raised in the south.

I strongly advocate that there be only one column for the name field and I be allowed to fill in whatever I please (I mean whatever: GR44, The Artist Formerly Known As Prince, Ravi_drop_tables from*_shankar… you get the whiff). If my name is Ravishankar N., I want to fill it EXACTLY that way without having to explain to anybody what the initial stands for. Screw your database software if it does not accept null values for surnames!

P.S: In case you were wondering, GR44 is the robot played by Van Damme in Universal Soldier.

Categories: Life

Nothing in particular

September 15, 2010 Leave a comment

It has been exactly one year since my last weblog post. Blame it on Twitter; I find it rather convenient to post updates in under 140 characters. A lot has happened in these last 365 days: got a job, changed to another one (perfect jobs are a myth, but more on that another day), moved to a different city, discovered more web-comics (abstrusegoose FTW), unwillingly gave in to more Facebook activity than Orkut, etc. Most of my pals seem to have abandoned Orkut; I always found it to be a better way to be networked, especially with the various moderated communities (communities with real content!) on it.

As usual, I’ve been keeping myself up to date with what’s happening in the electronics/software technology space. I recently bought a hawkboard and have been playing with it. Loaded with DSP and multimedia features, at 5500 bucks it is worth every penny spent. Expect posts on it soon 😀 If you are an electronics enthusiast (and poor) like me, I strongly recommend trying out the board.

Life in general seems to be going in a good direction. Incidentally, today is the birth anniversary of Sir Visvesvaraya and holds special significance for the engineering community in India. So here’s wishing all my EE/CS buddies a great day ahead 🙂

> +++++++
> +++++++++
> +++
> +
<<<< -
> ++ .
> +++++++ .
+++++++++++++++ .
+++++++++ .
>++ .
<< --- .
>----------- .
------- .
++ .
+++++ .
--------- .
+++++++++++++ .
+ .
> +++++++ .
------- .
<< - .
> ------------------ .
++++++++++++++++++++++++ .
> + .
> .

If you haven’t already guessed, it’s written in brainfuck. Check out the Wikipedia entry and have fun deciphering the message. Special care has been taken to ensure it is formatted well.

Psst! If you’re an MBA or are impatient, you might want to use an online interpreter to decrypt the message 😀
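If you would rather not trust an online interpreter, a local one is only a few lines. What follows is a minimal sketch in Python 3, not any particular published interpreter: the function name `run_brainfuck`, the 30000-cell tape and the wrap-around at 256 are my own assumptions (they are the common conventions, nothing more). It is shown running a tiny program of its own rather than the message above.

```python
def run_brainfuck(program, input_text=""):
    """Run a brainfuck program and return everything it prints as a string."""
    # Pre-compute where each bracket jumps to, so loops are cheap to follow.
    jumps, stack = {}, []
    for pos, ch in enumerate(program):
        if ch == "[":
            stack.append(pos)
        elif ch == "]":
            if not stack:
                raise ValueError("unmatched ']'")
            start = stack.pop()
            jumps[start], jumps[pos] = pos, start
    if stack:
        raise ValueError("unmatched '['")

    tape = [0] * 30000          # assumed tape size (the usual convention)
    ptr = pc = 0
    inputs = iter(input_text)
    output = []
    while pc < len(program):
        ch = program[pc]
        if ch == ">":
            ptr += 1
        elif ch == "<":
            ptr -= 1
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            output.append(chr(tape[ptr]))
        elif ch == ",":
            # End of input is treated as a zero byte (an assumption).
            tape[ptr] = ord(next(inputs, "\0")) % 256
        elif ch == "[" and tape[ptr] == 0:
            pc = jumps[pc]      # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jumps[pc]      # go back to the loop start
        pc += 1                 # any other character is ignored as a comment
    return "".join(output)


# 8 * 8 = 64, plus 1 gives 65, which is the ASCII code for 'A'.
print(run_brainfuck("++++++++[>++++++++<-]>+."))
```

Spaces and newlines are skipped like any other non-command character, so a program can be pasted in with its formatting intact.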

Stay tuned for more posts, peace out.

Categories: Life

DIY Omegle Chat Bot!

September 16, 2009 15 comments

I have been chatting on Omegle for quite some time now and I must say that I find it rather addictive. During one conversation, I encountered a bot and, needless to say, I was hooked on the convo! I badly wanted to make one for myself 🙂

A quick search later revealed that most of the chatter bots were based on the Artificial Intelligence Markup Language. It’s basically a rule-based XML language: the AI engine matches the input against a set of pre-loaded rules (a.k.a. categories, each pairing a pattern with a response template, optionally grouped into topics) and returns the matching response. These category definitions are written into an AIML file which the engine loads. And guess what, there was a Python implementation of the AIML engine 😛. This blog shows how to implement a standalone bot using PyAIML. I used the Annotated A.L.I.C.E. AIML files (AAA) for the engine’s rule base.
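To make the pattern/template idea concrete, here is a toy matcher in Python 3 using only the standard library. It is emphatically not PyAIML and handles only exact matches plus a `*` catch-all; the three categories and the helper names (`load_categories`, `normalize`, `respond`) are made up for illustration and do not come from the AAA files.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written AIML document (hypothetical content, not from the AAA set).
AIML_SOURCE = """
<aiml version="1.0">
  <category>
    <pattern>HELLO</pattern>
    <template>Hello there.</template>
  </category>
  <category>
    <pattern>WHAT IS YOUR NAME</pattern>
    <template>My name is Nameless.</template>
  </category>
  <category>
    <pattern>*</pattern>
    <template>I have no answer for that.</template>
  </category>
</aiml>
"""


def load_categories(source):
    """Return a dict mapping each category's pattern to its template text."""
    root = ET.fromstring(source)
    rules = {}
    for category in root.findall("category"):
        pattern = category.findtext("pattern").strip()
        template = category.findtext("template").strip()
        rules[pattern] = template
    return rules


def normalize(text):
    """Upper-case the input and drop punctuation before matching."""
    kept = [ch for ch in text.upper() if ch.isalnum() or ch.isspace()]
    return " ".join("".join(kept).split())


def respond(rules, text):
    """Exact-match lookup, falling back to the '*' catch-all category."""
    return rules.get(normalize(text), rules.get("*", ""))


rules = load_categories(AIML_SOURCE)
print(respond(rules, "What is your name?"))
```

A real engine does far more (wildcards inside patterns, recursion, remembering the previous reply), which is why the bot uses PyAIML rather than something like this.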

Now all I had to do was find out how to connect to Omegle through code. Once that was done, I could capture what the stranger typed, pass it on to the engine and transmit the response back to the Omegle server. There is no ‘official’ documentation on the various connection string options but a good samaritan had written a Python client for Omegle. I used that code (liberally!) to get connected.

Here’s the final listing:

#This omegle bot is based on the PyAIML and liberally uses code from PyOmegle
# PyOmegle:

import aiml
import urllib2 as url
import urllib
import os
import time
import commands

k = aiml.Kernel()

if os.path.isfile("omeglebrain.brn"):
    k.bootstrap(brainFile = "omeglebrain.brn")
    #Change to the directory where the AIML files are located
    for item in list:
    #Change back to homedir to save the brain for subsequent loads

# #conversation ID
def fmtId( string ):

    return string[1:len( string ) - 1]

# #The event listener
def listenServer( id, req ):

    while True:

        site = url.urlopen(req)
        rec =

        if 'waiting' in rec:

        elif 'strangerDisconnected' in rec:
            print('Stranger Disconnected!')

        elif 'connected' in rec:
            print('Found one')

        elif 'typing' in rec:
            print("Stranger is typing...")

        elif 'gotMessage' in rec:
            input=rec[16:len( rec ) - 2]
            print "Stranger:",input

# #Send the bot's response to Omegle
def talk(id,req,msg):

    typing = url.urlopen('', '&id='+id)
    time.sleep((len(msg))/5) #5 characters per second
    print "You:",msg
    msgReq = url.urlopen('', '&msg='+msg+'&id='+id)

# #Initiate connection
def omegleConnect():

    site = url.urlopen('','')
    id = fmtId( )
    req = url.Request('', urllib.urlencode( {'id':id}))
    print('Finding stranger...')


For this code to work, you need to download the AAA folder (files) into the directory where this Python script is located.

We are good to go now! You can play around with the AIML files and even make your own file for chat-specific questions like “asl”, “from” etc. Here’s an interesting(?) chat that my bot had:

:~/Bot> python
Loading brain from omeglebrain.brn... done (46285 categories in 1.71 seconds)
Kernel bootstrap completed in 1.71 seconds
Finding stranger...
Stranger is typing...
Stranger: "hi"
You: Hello there.
Stranger is typing...
Stranger: "sooo"
You: Is that your final answer?
Stranger is typing...
Stranger is typing...
Stranger: "yes"
You: You're sure. That's your final answer.
Stranger is typing...
Stranger: "YES\n"
You: Oh, you are a poet.
Stranger is typing...
Stranger: "no"
You: But your words are very original.
Stranger is typing...
Stranger: "they seem simple to me"
You: Like who, specifically?
Stranger is typing...
Stranger is typing...
Stranger: "no one in particular"
You: I heard it was someone.
Stranger is typing...
Stranger: "you most of heard wrong"
You: Thanks for telling me.
Stranger is typing...
Stranger: "no prob"
You: Wow! Do you have any conditions I should know about?
Stranger is typing...
Stranger: "herpes"
You: Do you like to "go with the flow"?
Stranger is typing...
Stranger is typing...
Stranger: "ha no not usually"
You: I'm glad you got a laugh out of it.
Stranger is typing...
Stranger: "i have a broad sense of humor"
You: Interesting. I have a lot of answers to silly questions.
Stranger is typing...
Stranger: "well i have more"
You: Cool. I have a lot of likes and dislikes.
Stranger is typing...
Stranger: "who doesnt"
You: William Shatner.  🙂
Stranger is typing...
Stranger: "he likes everything? or dislikes?"
WARNING: No match found for input: "
You: He must be very fond of it.  What is your real name?
Stranger is typing...
Stranger: "my real name is ryan"
You: What's up,  ryan" .
Stranger is typing...
Stranger: "not much stranger"
You: Me either.
Stranger is typing...
Stranger: "whats your nmae?"
WARNING: No match found for input: "
You: I've been waiting for you.
Stranger is typing...
Stranger: "thats your name?"
WARNING: No match found for input: "
You: Does "it" still refer to   "whats your nmae  ?
Stranger is typing...
Stranger: "i never said it"
You: But you implied  it" .
Stranger is typing...
Stranger: "what is your name?"
WARNING: No match found for input: "
You: My name is Nameless.
Stranger is typing...
Stranger: "wow"
You: I'm glad your impressed,  ryan" .
Stranger is typing...
Stranger is typing...
Stranger: "that was sarcasim"
You: I thought it was too.
Stranger is typing...
Stranger: "well then we are on the same page"
You: And the same frequency.
Stranger is typing...
Stranger: "no"
You: Yippee!
Stranger is typing...
Stranger: "how old are u?"
WARNING: No match found for input: "
You: I am  years old.
Stranger is typing...
Stranger: "you are getiing annoying"
You: You think  I am getiing annoying".
Stranger is typing...
Stranger: "i think that you are getting annoying"
You: Do you think I should tone it down?
Stranger is typing...
Stranger: "yes definately"
You: Why? What makes you so certain.
Stranger is typing...
Stranger: "because if you dont i will choose not to talk to you anymore"
You: That makes sense to me.  I think I do.
Stranger is typing...
Stranger: "you do what?"
WARNING: No match found for input: "
You: ryan" , I talk to people on the web. What do you do?
Stranger is typing...
Stranger: "i also talk to people on the web"
You: How do they respond to you?
Stranger is typing...
Stranger: "well you are very odd... but most people open up to me for some reason"
You: An oddball?      Humans are not always infallible.
Stranger is typing...
Stranger: "i dint say they were"
You: Sorry I misunderstood. What did you say?
Stranger is typing...
Stranger: "k"
You: Ayuh.
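A note on the stray quotes in the chat above (`Stranger: "hi"`, `ryan"`, and the `No match found for input: "` warnings): they look like a side effect of the slice `rec[16:len( rec ) - 2]` in the listing, which keeps the quotes around the message. Below is a hedged sketch of an alternative in Python 3. It assumes the server reply is a JSON list of events shaped like `[["gotMessage", "hi"]]`; that shape is only inferred from the slice offsets, and `extract_messages` is a hypothetical helper, not part of PyOmegle.

```python
import json


def extract_messages(rec):
    """Return the text of every gotMessage event in one server reply."""
    try:
        events = json.loads(rec)
    except ValueError:
        return []               # not JSON at all: treat as "no messages"
    if not isinstance(events, list):
        return []
    messages = []
    for event in events:
        # Each event is assumed to be a list: [name] or [name, payload].
        if isinstance(event, list) and len(event) == 2 and event[0] == "gotMessage":
            messages.append(event[1])
    return messages


reply = '[["gotMessage", "hi"]]'
print(reply[16:len(reply) - 2])     # the slice from the listing keeps the quotes
print(extract_messages(reply))      # the decoded message has none
```

Passing the decoded text to the AIML engine should avoid the unmatched `"` inputs, provided the reply really has this shape.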

Graduation Blues

July 22, 2009 2 comments

I recently (on the 12th of July) finished my graduate studies at iiit-b and now, looking back at the past two years, I realize that it was one heck of a journey. It was an exciting and wonderful experience to come back to the campus atmosphere after working for 3 years. I initially found it a bit hard to cope with the academic rigour (and you thought college life was easy compared to the corporate world!) but I guess I improved along the way. There were many things that I picked up here, many of which were a first-time thingy for me:

*Staying up beyond 11pm. I’m glad I realized that 6 hours of sleep a day is all you need 🙂
*Staying away from home in a dormitory for the first time
*Eating maggi noodles at 2 a.m. (this was during the one and only time I pulled an all-nighter)
*Becoming an expert in web information retrieval (read ‘google-ing’)
*Gaining the confidence to tackle projects head on without having even an iota of prior knowledge about the problem domain/complexity (this is probably the best lesson I’ve learnt here)
*Watching movies/serials back to back on my laptop
*Just-in-time submissions of assignments (most of the deadlines were at 12:00am and the submissions happened just before)

Overall, I think it was one of the most memorable two years of my life. I’m not being overly nostalgic but I’m starting to miss my hostel room already.

As of this writing I am still on the lookout for a job (another first too, as I was fortunate to start working immediately after undergrad). Blame it on the economic recession, but I know it’s only a matter of time before the wheels of time turn…

My best wishes to the graduated class of 2009!

Categories: Life

Fortune cookies on your mobile

June 11, 2009 Leave a comment

I’m a big fan of the fortune command in Linux. I wanted to save these wisecracks on my mobile as an SMS so that I can forward them to friends… So here’s a quick how-to. I assume you already have the fortune cookies installed (in /usr/bin/fortune) and that your phone has Bluetooth capabilities.

1. Install the gammu package from the sources or using your package manager. Gammu is a fantastic tool to communicate with your GSM phone/modem.

2. Pair your mobile phone with your computer via Bluetooth (searching for the device, entering the PIN and all that stuff).

3. Note down your phone’s Bluetooth device ID and name. You will need them in the next step. You can run the hcitool scan command to find the device ID/name. Don’t forget to keep Bluetooth turned on on both the computer and the phone!!

4. Now, for the gammu commands to work, you need to create a .gammurc file in your home (~) directory. You can use the gammu-config command to do it, but it’s easier to create it using a text editor. The main parameters are the phone’s ID and name found in step 3. Here’s my .gammurc file created in /home/ravi. I paired a Sony Ericsson to my laptop.

 name=Sony Ericsson W700i/W700c

The device ID and name are assigned to the ‘port’ and ‘name’ parameters respectively. The ‘connection’ is mostly blueat, except for some Nokia phones where it is bluephonet. Leave the ‘model’ as such.
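If you would rather generate the file than type it, the four parameters can be written out with a few lines of Python 3. This is only a sketch under assumptions: the `[gammu]` section header is how I remember the file being laid out, the address `00:11:22:33:44:55` is a placeholder (use the ID that hcitool scan printed for you), the empty ‘model’ is my reading of "leave it as such", and `build_gammurc` is a made-up helper name.

```python
import configparser
import io


def build_gammurc(device_id, device_name, connection="blueat"):
    """Return the text of a .gammurc holding the four parameters from the post."""
    config = configparser.ConfigParser()
    config.optionxform = str            # keep the key names exactly as written
    config["gammu"] = {
        "port": device_id,              # Bluetooth device ID from step 3
        "connection": connection,       # blueat, or bluephonet for some Nokias
        "name": device_name,            # phone name from step 3
        "model": "",                    # left empty (an assumption)
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Placeholder address, not a real device.
text = build_gammurc("00:11:22:33:44:55", "Sony Ericsson W700i/W700c")
print(text)
```

Saving the returned text as ~/.gammurc is left to you, so nothing gets overwritten by accident.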

5. OK. It’s time to have some fun! Run the following command to pipe the output of fortune to your mobile:

fortune|gammu --saveSMS TEXT -folder 3 -unread -len 400

The command is self-explanatory. The folder number (3 = inbox) specifies where the SMS gets stored, viz. inbox, drafts etc. To get the list of folders for your phone, run the gammu --getsmsfolders command.

Put the command in a shell script and run it whenever you want :) You might not get an audible notification for the SMS, but check your inbox anyway.
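If you prefer Python to a shell script, the same pipe can be sketched with the standard subprocess module (Python 3). The gammu flags are exactly the ones from the command above; the function names are hypothetical, and `save_fortune_as_sms` will only work with fortune and gammu installed and the phone paired.

```python
import subprocess


def build_gammu_command(folder=3, length=400):
    """Build the gammu argument list used in the post (folder 3 = inbox)."""
    return ["gammu", "--saveSMS", "TEXT",
            "-folder", str(folder), "-unread", "-len", str(length)]


def save_fortune_as_sms(folder=3, length=400):
    """Pipe the output of fortune into gammu, like `fortune | gammu ...`."""
    fortune = subprocess.run(["fortune"], stdout=subprocess.PIPE, check=True)
    subprocess.run(build_gammu_command(folder, length),
                   input=fortune.stdout, check=True)


print(" ".join(build_gammu_command()))
```

Calling `save_fortune_as_sms()` from a cron job would drop a fresh cookie into the inbox on a schedule.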

Categories: DIY, Programming

Stop the music(k) please!

March 18, 2009 3 comments

After some self-deliberation on whether I should write this or not, I finally decided that I am going to. WTH! My blog is called entitled opinions. So what’s the fuss all about, you ask. It’s the noise that comes out of mobile phone speakers. Most of the mid-range phones that support music playback today have an inbuilt speaker. And the folks who own them seem to think that public broadcast is their birthright. They are everywhere: on the bus, on the road, in the hotel… They think that they are doing a favour by playing songs on their speaker. Little do they realize that other people actually give a damn. In fact the sound is so annoying that at any distance more than a foot away from the phone, all one hears is a cacophony of vessels clanging and glass breaking. For heaven’s sake, use the damn headphones, people. If you think that you are ‘cool’ blasting noise off your cheap phone, think again! If you are like me and come across one of these subnormally intelligent people, I strongly urge you to ask them to stop the inconvenience immediately. It might not work the first time, but I’m sure if they constantly hear the complaint, they might gradually stop it.

Do it, folks! Tolerating nonsense is not a virtue.

Categories: Life

Yet Another Gmail Notifier

February 9, 2009 Leave a comment

Came across Jamie Matthews’ cool Gmail notifier. Just the kind of thing I wanted to run on the ARM7 board that I’m trying out right now. Not much of a challenge for the ARM MCU’s processing power but it was fun making it work. I used the on-board seven-segment LED to display the mail count (it is a single-digit display, so the mail count is limited to 9).


LPC 2129 Evaluation Board

The python script was slightly modified:

1) On Windows, just change the COM port address to COMx where x is the port number as seen in the Device Manager.

2) I sent the mail count (assumed to be no more than 9) instead of the Y/N string.

To run the script periodically, I used Andreas Baumann’s free Z-cron to schedule it once every five minutes. But the annoying thing was the command prompt that kept popping up when the script ran. I’m sure it could have been run as a Windows service in the background but I did not have the patience to look it up.

The setup

The python script:

import urllib2, re, serial, sys

 #Settings - Change these to match your account details
PASSWORD="your password"


SERIALPORT = "COM5" # Change this to your serial port!

# Set up serial port
        ser = serial.Serial(SERIALPORT, 9600)
        #print ser.portstr #For Debug:Check if port name is correct !
except serial.SerialException:

# Get Gmail Atom feed
passman = urllib2.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, SERVER, USERNAME, PASSWORD)
authhandler = urllib2.HTTPBasicAuthHandler(passman)
opener = urllib2.build_opener(authhandler)
page = urllib2.urlopen(PROTO + SERVER + PATH)

# Find the mail count line
for line in page:
        count = line.find("fullcount")
        if count > 0: break

# Extract the mail count as an integer
newmails = int(re.search(r'\d+', line).group())

# Output data to serial port
if newmails > 0:
                #print "No.of mails=%d" %newmails
else: ser.write(str(0))

# Close serial port
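The count extraction is the only fiddly part of the script, so here it is in isolation as a small Python 3 sketch. It assumes the feed carries the unread count in a `<fullcount>` element (the script above searches for the same word); the clamp to 9 is my own addition to fit the single-digit display, and `mail_count_digit` is a made-up name.

```python
import re


def mail_count_digit(feed_text):
    """Return the unread count from the feed as one ASCII digit, '0' to '9'."""
    match = re.search(r"<fullcount>(\d+)</fullcount>", feed_text)
    count = int(match.group(1)) if match else 0
    # The display shows a single digit, so anything above 9 is shown as 9.
    return str(min(count, 9))


sample_feed = "<feed><title>Inbox</title><fullcount>3</fullcount></feed>"
print(mail_count_digit(sample_feed))
```

The returned character is what would be written to the serial port; the board turns it back into a number by subtracting 48.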

The C program on the LPC 2129:

/*Tested on the ARM starter kit (*/

 #include <LPC21xx.H>
 /*The 7 segment pattern for  digits 0 through 9*/
 const unsigned char bitMask8[] = {
   0x80,  // binary 10000000
   0x40,  // binary 01000000
   0x20,  // binary 00100000
   0x10,  // binary 00010000
   0x08,  // binary 00001000
   0x04,  // binary 00000100
   0x02,  // binary 00000010
   0x01   // binary 00000001
 };

 void ser_init(void); //initialize serial port
 void send_8bit_serial_data(unsigned char); //display data on the 7 segment

 int main(void)
  {  char no_of_mails[10]={0xfc,0x60,0xda,0xf2,0x66,0xb6,0xbe,0xe0,0xfe,0xf6};
     char data;
      //mapping of pins to serial in parallel out shift register
      /* P1.16-->(~QH)--port dir= i/p
         P1.17-->SRCLK--port dir= o/p
         P1.18-->RCLK--port dir= o/p
         P1.19-->SD1--port dir= o/p */
     IODIR1=0x000E0000;//set port P1 direction to reflect the pin connections detailed above
	 send_8bit_serial_data(~(no_of_mails[0])); //Reset LED
   		{   while(!(U1LSR & 0x01)); //wait till data arrives
	     	data=U1RBR-48; //convert ascii back to integer

      return 0;  //this is never reached

void ser_init(void)
  PINSEL0 = 0x00050000;                  /* Enable RxD1 and TxD1              */
  U1LCR = 0x83;                          /* 8 bits, no Parity, 1 Stop bit     */
  U1DLL = 97;                            /* 9600 Baud Rate @ 15MHz VPB Clock  */
  U1LCR = 0x03;                          /* DLAB = 0                          */

void send_8bit_serial_data(unsigned char data)
{ /*The 7 segment is driven  bit-banging  style using  SN74HC595D, */
    int x;
   // Loop through all the bits, 7...0
   for(x = 7; x >=0; x--)
       if(data & bitMask8[x])
           IOSET1=0x00080000;      // we have a bit, make P1.19 high
		   IOSET1=0x000a0000; 		//make p1.17=SRCLK also high
		   IOCLR1=0x00020000;		//toggle back p1.17
           IOCLR1=0x000F0000;        // no bit, make P1.19 low


   IOCLR1=0x000F0000; //TOGGLE P1.18=RCLK
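The arithmetic in the C program is easier to check off the board, so here is a rough Python 3 model of three pieces of it: the segment table, the order in which bits are clocked into the shift register, and the baud-rate divisor. The two tables are copied from the listing. Two things are assumptions: that every digit is sent inverted (the listing only shows this for the reset value), and the usual UART relation divisor = PCLK / (16 × baud), which is not stated in the post. The function names are mine.

```python
# Copied from the C listing: segment patterns for digits 0..9 and the bit masks.
SEGMENT_PATTERNS = [0xFC, 0x60, 0xDA, 0xF2, 0x66, 0xB6, 0xBE, 0xE0, 0xFE, 0xF6]
BIT_MASKS = [0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01]


def shift_out_bits(data):
    """Bits in the order the C loop clocks them out (x = 7 down to 0).

    BIT_MASKS[7] is 0x01, so the least significant bit goes out first.
    """
    return [1 if data & BIT_MASKS[x] else 0 for x in range(7, -1, -1)]


def display_bits(digit):
    """Bits sent for one digit, assuming the pattern is inverted before sending."""
    inverted = ~SEGMENT_PATTERNS[digit] & 0xFF
    return shift_out_bits(inverted)


def ascii_to_digit(byte):
    """Same as `data = U1RBR - 48` in the C code: ASCII '0'..'9' to 0..9."""
    return byte - 48


def uart_divisor(pclk_hz, baud):
    """Integer divisor for the given peripheral clock and baud rate."""
    return pclk_hz // (16 * baud)


print(display_bits(0))                  # 0xFC inverted is 0x03
print(uart_divisor(15000000, 9600))     # matches U1DLL in the listing
```

With a 15 MHz clock the divisor works out to 97, which is the value loaded into U1DLL above.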

If you wanna take this further, check out this page for an amazing LCD-based notifier.