Effective Use of Arbtt


Now that the syntax has been described and the toolbox laid out, how do you practically go about using and configuring arbtt?

Enabling data collection

After installing arbtt, you need to configure it to run. There are many ways you can run the arbtt-capture daemon. One standard way is to include the command

arbtt-capture &

in your desktop environment's startup script, e.g. ~/.xinitrc or similar.

Another trick is to add it as a cron job. To do so, edit your crontab file (crontab -e) and add lines like these:

DISPLAY=:0
@reboot arbtt-capture --logfile=/home/username/doc/arbtt/capture.log

At boot, arbtt-capture will be run in the background and will capture a snapshot of the X metadata for active windows every 60 seconds (the default). If you want more fine-grained time data at the expense of doubling storage use, you could sample more often with an option like --sample-rate=30 (the value is the interval between samples, in seconds). To be resilient to any errors or segfaults, you could also wrap it in an infinite loop that restarts the daemon should it ever crash, with a command like

DISPLAY=:0
@reboot while true; do arbtt-capture --sample-rate=30; sleep 1m; done

Checking data availability

arbtt tracks X properties like window title, class, and running program, and you write rules to classify those strings as you wish; but this assumes that the necessary data is present in those properties.

For some programs, this is the case. For example, web browsers like Firefox typically set the X title to the HTML <title> element of the web page in the currently-focused tab, which is enough for classification.

Some programs have title-setting available as plugins. The IRC client irssi in a GNU screen or X terminal usually sets the title to just "irssi", which blocks more accurate time-classification based on IRC channel (one channel may be for recreation, another for programming, and yet another for work), but can be easily configured to set the title using the extension title.pl.

Some programs do not set titles or class, and all arbtt sees is empty strings like ""; or they may set the title/class to a constant like "Liferea", which may be acceptable if that program is used for only one purpose, but if it is used for many purposes, then you cannot write a rule matching it without producing highly-misleading time analyses. (For example, a web browser may be used for countless purposes, ranging from work to research to music to writing to programming; but if the web browser's title/class were always just "Web browser", how would you classify 5 hours spent using the web browser? If the 5 hours are classified as any or all of those purposes, then the results will be misleading garbage - you probably did not spend 5 hours just listening to music, but a mixture of those purposes, which changes from day to day.)

You should check for such problematic programs when you start using arbtt. It would be unfortunate if you were to log for a few months, go back for a detailed report for some reason, and discover that the necessary data was never available for arbtt to log!

Such programs can sometimes be customized internally; failing that, you can file a bug report with the maintainers, or set their titles externally with wmctrl or xprop.

xprop

You can check the X properties of a running window by running the command xprop and clicking on the window; xprop will print out all the relevant X information. For example, the output for Emacs might look like this

$ xprop | tail -5
WM_CLASS(STRING) = "emacs", "Emacs"
WM_ICON_NAME(STRING) = "emacs@elan"
_NET_WM_ICON_NAME(UTF8_STRING) = "emacs@elan"
WM_NAME(STRING) = "emacs@elan"
_NET_WM_NAME(UTF8_STRING) = "emacs@elan"

This is not very helpful: it does not tell us the filename being edited, the mode being used, or anything. You could classify time spent in Emacs as "programming" or "writing", but this would be imperfect, especially if you do both activities regularly. However, Emacs can be customized by editing ~/.emacs, and after some searching with queries like "setting Emacs window title", the Emacs wiki and manual advise us to put something like this Elisp in our .emacs file:

(setq frame-title-format "%f")

Now the output looks different:

$ xprop | tail -5
WM_CLASS(STRING) = "emacs", "Emacs"
WM_ICON_NAME(STRING) = "/home/gwern/arbtt.page"
_NET_WM_ICON_NAME(UTF8_STRING) = "/home/gwern/arbtt.page"
WM_NAME(STRING) = "/home/gwern/arbtt.page"
_NET_WM_NAME(UTF8_STRING) = "/home/gwern/arbtt.page"

With this, we can usefully classify all such time samples as being “writing”:

current window $title == "/home/gwern/arbtt.page" ==> tag Writing,

Another common gap is terminals/shells: they often do not include information in the title like the current working directory or last shell command. For example, urxvt/Bash:

WM_COMMAND(STRING) = { "urxvt" }
_NET_WM_ICON_NAME(UTF8_STRING) = "urxvt"
WM_ICON_NAME(STRING) = "urxvt"
_NET_WM_NAME(UTF8_STRING) = "urxvt"
WM_NAME(STRING) = "urxvt"

Programmers may spend many hours in the shell doing a variety of things (as with Emacs), so this is a problem. Fortunately, this is also solvable by customizing one's .bashrc to emit an escape code, interpreted by the terminal, before each command runs (baroque, but it works). The following will put the working directory and the last command in the title; the timestamp seen in the output further down comes from the history entry and appears only if HISTTIMEFORMAT is set:

trap 'echo -ne "\033]2;$(pwd); $(history 1 | sed "s/^[ ]*[0-9]*[ ]*//g")\007"' DEBUG

Now the urxvt samples are useful:

_NET_WM_NAME(UTF8_STRING) = "/home/gwern/wiki; 2014-09-03 13:39:32 arbtt-stats --help"

Some distributions (e.g. Debian) already provide the relevant configuration for this to happen. If it does not work for you, you can try to add

. /etc/profile.d/vte.sh

to your ~/.bashrc.

A rule could classify based on the directory you are working in, the command you ran, or both. Other shells like zsh can be fixed this way too, but the exact command may differ; you will need to research and experiment.
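The pieces of the trap command can be pulled apart into small functions, which may help when adapting it to another shell. This is only a sketch: the function names set_title and strip_history_number are hypothetical, and it assumes an xterm-compatible terminal that honours the title escape sequence.

```shell
# Emit the xterm-style "set window title" sequence: ESC ] 2 ; TITLE BEL.
set_title() {
    printf '\033]2;%s\007' "$1"
}

# Remove the leading entry number that bash's `history 1` prints,
# leaving the timestamp (if HISTTIMEFORMAT is set) and the command.
strip_history_number() {
    sed 's/^[ ]*[0-9]*[ ]*//'
}

# Example: build the same title the trap would produce.
# Only meaningful in an interactive bash session, so it is left commented out.
# set_title "$(pwd); $(history 1 | strip_history_number)"
```

Keeping the title-setting and the history clean-up separate means only the second function needs replacing for a shell whose history output looks different.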

Some programs can be tricky to set. The X image viewer feh has a --title option but it cannot be set in the configuration file, .config/feh/themes, because it needs to be specified dynamically; so you need to set up a shell alias or script to wrap the command like feh --title "$(pwd) / %f / %n".

Raw samples

xprop can be tedious to use on every running window, and you may forget to check seldom-used programs. A better approach is to use arbtt-stats’s --dump-samples option: this option will print out the collected data for specified time periods, allowing you to examine the X properties en masse. This option can be combined with the --exclude= option to print only the samples not matched by existing rules, which is indispensable for improving coverage and suggesting ideas for new rules. A good way to figure out what customizations to make is to run arbtt as a daemon for a day or so, and then begin examining the raw samples for problems.

Example 2. An initial configuration session

An example: suppose I create a simple category file named foo with just the line

$idle > 30 ==> tag inactive

I can then dump all my arbtt samples for the past day with a command like this:

arbtt-stats --categorizefile=foo --m=0 --filter='$sampleage <24:00' --dump-samples

Because there are so many open windows, this produces a large amount (26586 lines) of hard-to-read output:

...
( ) Navigator:      /r/Touhou's Favorite Arranges! Part 71: Retribution for the Eternal Night ~ Imperishable Night : touhou - Iceweasel
( ) Navigator:      Configuring the arbtt categorizer (arbtt-stats) - Iceweasel
( ) evince:         ATTACHMENT02
( ) evince:         2009-geisler.pdf — Heart rate variability predicts self-control in goal pursuit
( ) urxvt:          /home/gwern; arbtt-stats --categorizefile=foo --m=0 --filter='$sampleage <24:00' --dump-samples
( ) mnemosyne:      Mnemosyne
( ) urxvt:          /home/gwern; 2014-09-03 13:11:45 xprop
( ) urxvt:          /home/gwern; 2014-09-03 13:42:17 history 1 | cut --delimiter=' ' --fields=5-
( ) urxvt:          /home/gwern; 2014-09-03 13:12:21 git log -p .emacs
(*) emacs:          emacs@elan
( ) urxvt:          /home/gwern/blackmarket-mirrors/silkroad2-forums; 2014-08-31 23:20:10 mv /home/gwern/cookies.txt ./; http_proxy="localhost:8118" wget...
( ) urxvt:          /home/gwern/blackmarket-mirrors/agora; 2014-08-31 23:15:50 mv /home/gwern/cookies.txt ./; http_proxy="localhost:8118" wget --mirror ...
( ) urxvt:          /home/gwern/blackmarket-mirrors/evolution-forums; 2014-08-31 23:04:10 mv ~/cookies.txt ./; http_proxy="localhost:8118" wget --mirror ...
( ) puddletag:      puddletag: /home/gwern/music

Active windows are denoted by an asterisk, so I can focus & simplify by adding a pipe like | fgrep '(*)', producing more manageable output like

(*) urxvt:          irssi
(*) urxvt:          irssi
(*) urxvt:          irssi
(*) Navigator:      Pyramid of Technology - NextNature.net - Iceweasel
(*) Navigator:      Search results - gwern0@gmail.com - Gmail - Iceweasel
(*) Navigator:      [New comment] The Wrong Path - gwern0@gmail.com - Gmail - Iceweasel
(*) Navigator:      Iceweasel
(*) Navigator:      Litecoin Exchange Rate - $4.83 USD - litecoinexchangerate.org - Iceweasel
(*) Navigator:      PredictionBook: LiteCoin will trade at >=10 USD per ltc in 2 years, - Iceweasel
(*) urxvt:          irssi
(*) Navigator:      Bug#691547 closed by Mikhail Gusarov <dottedmag@dottedmag.net> (Re: s3cmd: Man page: --default-mime-type documentation incomplete...)
(*) Navigator:      Bug#691547 closed by Mikhail Gusarov <dottedmag@dottedmag.net> (Re: s3cmd: Man page: --default-mime-type documentation incomplete...)
(*) Navigator:      Bug#691547 closed by Mikhail Gusarov <dottedmag@dottedmag.net> (Re: s3cmd: Man page: --default-mime-type documentation incomplete...)
(*) urxvt:          /home/gwern; 2014-09-02 14:25:17 man s3cmd
(*) evince:         bayesiancausality.pdf
(*) evince:         bayesiancausality.pdf
(*) puddletag:      puddletag: /home/gwern/music
(*) puddletag:      puddletag: /home/gwern/music
(*) evince:         bayesiancausality.pdf
(*) Navigator:      ▶ Umineko no Naku Koro ni Music Box 4 - オルガン小曲 第2億番 ハ短調 - YouTube - Iceweasel
...

This is better. We can see a few things: the windows all now produce enough information to be usefully classified (Gmail can be classified under email, irssi can be classified as IRC, the urxvt usage can clearly be classified as programming, the PDF being read is statistics, etc) in part because of customizations to bash/urxvt. The duplication still impedes focus, and we don't know what's most common. We can use another pipeline to sort, count duplicates, and sort by number of duplicates (| sort | uniq --count | sort --general-numeric-sort), yielding:

 ...
 14     (*) Navigator:      A Bluer Shade of White Chapter 4, a frozen fanfic | FanFiction - Iceweasel
 14     (*) Navigator:      Iceweasel
 15     (*) evince:         2009-geisler.pdf — Heart rate variability predicts self-control in goal pursuit
 15     (*) Navigator:      Tool use by animals - Wikipedia, the free encyclopedia - Iceweasel
 16     (*) Navigator:      Hacker News | Add Comment - Iceweasel
 17     (*) evince:         bayesiancausality.pdf
 17     (*) Navigator:      Comments - Less Wrong Discussion - Iceweasel
 17     (*) Navigator:      Keith Gessen · Why not kill them all?: In Donetsk · LRB 11 September 2014 - Iceweasel
 17     (*) Navigator:      Notes on the Celebrity Data Theft | Hacker News - Iceweasel
 18     (*) Navigator:      A Bluer Shade of White Chapter 1, a frozen fanfic | FanFiction - Iceweasel
 19     (*) gl:             mplayer2
 19     (*) Navigator:      Neural networks and deep learning - Iceweasel
 20     (*) Navigator:      Harry Potter and the Philosopher's Zombie, a harry potter fanfic | FanFiction - Iceweasel
 20     (*) Navigator:      [OBNYC] Time tracking app - gwern0@gmail.com - Gmail - Iceweasel
 25     (*) evince:         ps2007.pdf — untitled
 35     (*) emacs:          /home/gwern/arbtt.page
 43     (*) Navigator:      CCC comments on The Octopus, the Dolphin and Us: a Great Filter tale - Less Wrong - Iceweasel
 62     (*) evince:         The physics of information processing superobjects - Anders Sandberg - 1999.pdf — Brains2
 69     (*) liferea:        Liferea
 82     (*) evince:         BMS_raftery.pdf — untitled
 84     (*) emacs:          emacs@elan
 87     (*) Navigator:      overview for gwern - Iceweasel
109     (*) puddletag:      puddletag: /home/gwern/music
150     (*) urxvt:          irssi

Put this way, we can see what rules we should write to categorize: we could sort the activities here into a few categories of "recreational", "statistics", "music", "email", "IRC", "research", and "writing"; and add to categorize.cfg some rules like these:

$idle > 30 ==> tag inactive,

current window $title =~ [/.*Hacker News.*/, /.*Less Wrong.*/, /.*overview for gwern.*/, /.*[fF]an[fF]ic.*/, /.* LRB .*/]
  || current window $program == "liferea" ==> tag Recreation,
current window $title =~ [/.*puddletag.*/, /.*mplayer2.*/] ==> tag Music,
current window $title =~ [/.*[bB]ayesian.*/, /.*[nN]eural [nN]etworks.*/, /.*ps2007.pdf.*/, /.*[Rr]aftery.*/] ==> tag Statistics,
current window $title =~ [/.*Wikipedia.*/, /.*Heart rate variability.*/, /.*Anders Sandberg.*/] ==> tag Research,
current window $title =~ [/.*Gmail.*/] ==> tag Email,
current window $title =~ [/.*arbtt.*/] ==> tag Writing,
current window $title == "irssi" ==> tag IRC,

If we reran the command, we'd see the same output, so we need to leverage our new rules and exclude any samples matching our current tags, so now we run a command like:

arbtt-stats --categorizefile=foo --filter='$sampleage <24:00' --dump-samples --exclude=Recreation --exclude=Music --exclude=Statistics \
             --exclude=Research --exclude=Email --exclude=Writing --exclude=IRC |
             fgrep '(*)' | sort | uniq --count | sort --general-numeric-sort

Now the previous samples disappear, leaving us with a fresh batch of unclassified samples to work with:

  9     (*) Navigator:      New Web Order > Nik Cubrilovic - - Notes on the Celebrity Data Theft - Iceweasel
  9     ( ) urxvt:          /home/gwern; arbtt-stats --categorizefile=foo --filter='$sampleage <24:00' --dump-samples | fgrep '(*)' | less
 10     (*) evince:         ATTACHMENT02
 10     (*) Navigator:      These Giant Copper Orbs Show Just How Much Metal Comes From a Mine | Design | WIRED - Iceweasel
 12     (*) evince:         [Jon_Elster]_Alchemies_of_the_Mind_Rationality_an(BookFi.org).pdf — Alchemies of the mind
 12     (*) Navigator:      Morality Quiz/Test your Morals, Values & Ethics - YourMorals.Org - Iceweasel
 33     ( ) urxvt:          /home/gwern; arbtt-stats --categorizefile=foo --filter='$sampleage <24:00' --dump-samples | fgrep '(*)'...

We can add rules categorizing these as 'Recreation', 'Writing', 'Research', 'Recreation', 'Research', 'Writing', and 'Writing' respectively; and we might decide at this point that 'Writing' is starting to become overloaded, so we'll split it into two tags, 'Writing' and 'Programming'. And then after tossing another --exclude=Programming into our command, we can repeat the process.

As we refine our rules, we will quickly spot instances where the title/class/program are insufficient to allow accurate classification, and we will figure out the best collection of tags for our particular purposes. A few iterations is enough for most purposes.
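The loop described in this example (dump, keep the active windows, count, add rules, exclude, repeat) can be wrapped in two small helpers. This is a sketch under assumptions: the names exclude_args and summarize_active are invented here, and the short options -c and -n stand in for uniq --count and numeric sorting.

```shell
# Print one --exclude=TAG argument per tag name, one per line.
# Assumes tag names contain no whitespace.
exclude_args() {
    for tag in "$@"; do
        printf '%s%s\n' '--exclude=' "$tag"
    done
}

# Read dumped samples on stdin, keep active-window lines (marked "(*)"),
# and count identical lines, most frequent last.
summarize_active() {
    grep -F '(*)' | sort | uniq -c | sort -n
}

# Intended use (needs arbtt-stats and a capture log, so left commented out):
# arbtt-stats --categorizefile=foo --filter='$sampleage <24:00' --dump-samples \
#     $(exclude_args Recreation Music Statistics) | summarize_active
```

Each time a new tag is added to the category file, its name is appended to the exclude_args call and the command is rerun.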

Categorizing advice

When building up rules, a few rules of thumb should be kept in mind:

Categorize by purpose, not by program

Categorizing by program leads to misleading time reports. Avoid, for example, lumping all web browser time into a single category named 'Internet'; this is more misleading than helpful. Good categories describe an activity or goal, such as 'Work' or 'Recreation', not a tool, like 'Emacs' or 'Vim'.

When in doubt, write narrow rules and generalize later

Regexps are tricky and it can be easy to write rules far broader than one intended. The --exclude filters mean that one will never see samples which are matched accidentally. If one is in doubt, it can be helpful to take a specific sample one wants to match and several similar strings and look at how well one's regexp rule works in Emacs's regexp-builder or online regexp-testers like regexpal.
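A candidate regexp can also be tried from the command line against a handful of real titles using grep. A minimal sketch: try_regexp is a made-up helper name, the titles are taken from the samples shown earlier, and grep -E uses POSIX extended syntax, which may differ in details from the regexp dialect arbtt itself accepts.

```shell
# Print each candidate title prefixed with MATCH or miss for the given regexp.
try_regexp() {
    pattern=$1
    shift
    for title in "$@"; do
        if printf '%s\n' "$title" | grep -Eq -- "$pattern"; then
            printf 'MATCH  %s\n' "$title"
        else
            printf 'miss   %s\n' "$title"
        fi
    done
}

# A broad pattern: matches the Emacs buffer and also the terminal sample.
try_regexp 'arbtt' \
    '/home/gwern/arbtt.page' \
    '/home/gwern/wiki; 2014-09-03 13:39:32 arbtt-stats --help' \
    'emacs@elan'

# A narrower, anchored pattern: matches only the Emacs buffer.
try_regexp '^/home/gwern/.*\.page$' \
    '/home/gwern/arbtt.page' \
    '/home/gwern/wiki; 2014-09-03 13:39:32 arbtt-stats --help' \
    'emacs@elan'
```

Here the broad pattern also catches a terminal sample that is arguably programming rather than writing, which the anchored pattern avoids.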

Don't try to classify everything

You will never classify 100% of samples: some programs do not include useful X properties and cannot be fixed, some samples date from before you fixed them, and some windows are too transient (like popups and dialogues) to be worth fixing. Nor is it necessary to classify 100% of your time: as long as the most common programs and, say, 80% of your time are classified, you have most of the value. It is easy to waste more time tweaking arbtt than one gains from increased accuracy or more finely-grained tags.

Avoid large and microscopic tags

If a tag takes up more than a third or so of your time, it is probably too large, masks variation, and can be broken down into more meaningful tags. Conversely, a tag too narrow to show up regularly in reports (because it is below the default 1% filter) may not be helpful because it is usually tiny, and can be combined with the most similar tag to yield more compact and easily interpreted reports.

Long-term storage

Each halving of the sampling interval doubles the number of samples taken and hence the storage requirement; intervals below 20s are probably wasteful. But even the default 60s can accumulate into a nontrivial amount of data over a year. A constantly-changing binary file can interact poorly with backup systems and may make arbtt analyses slower; and if one's system occasionally crashes or experiences other problems, the log may be corrupted, with the nuisance of having to run arbtt-recover.
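The arithmetic behind this can be sketched in shell. The sketch assumes arbtt-capture runs around the clock, and the function names samples_per_day and samples_per_year are made up for illustration; it counts samples rather than bytes, since the size of each sample depends on how many windows are open.

```shell
# Samples recorded per day for a given sampling interval in seconds.
samples_per_day() {
    echo $(( 86400 / $1 ))
}

# Samples recorded per 365-day year for the same interval.
samples_per_year() {
    echo $(( 365 * 86400 / $1 ))
}

for interval in 60 30 20; do
    printf '%3ss interval: %5s samples/day, %8s samples/year\n' \
        "$interval" "$(samples_per_day "$interval")" "$(samples_per_year "$interval")"
done
```

At the default 60s this comes to 1440 samples a day and 525600 a year; at 30s both figures double.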

Thus it may be a good idea to archive one's capture.log on an annual basis. If one needs to query the historical data, the particular log file can be specified as an option like --logfile=/home/gwern/doc/arbtt/2013-2014.log
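A yearly rotation could be scripted along these lines. This is a sketch, not a tested backup procedure: archive_log is a hypothetical helper, and it assumes arbtt-capture has been stopped before the file is moved and is restarted afterwards so that it begins a fresh log.

```shell
# Move a capture log into an archive directory under a chosen label.
# Usage: archive_log LOGFILE ARCHIVE_DIR LABEL
archive_log() {
    log=$1
    dest_dir=$2
    label=$3
    if [ ! -f "$log" ]; then
        echo "archive_log: no log file at $log" >&2
        return 1
    fi
    # Refuse to overwrite an archive that already exists.
    if [ -e "$dest_dir/$label.log" ]; then
        echo "archive_log: $dest_dir/$label.log already exists" >&2
        return 1
    fi
    mkdir -p "$dest_dir" && mv -- "$log" "$dest_dir/$label.log"
}

# Example (paths are placeholders; adjust to wherever your log lives):
# archive_log /home/username/doc/arbtt/capture.log /home/username/doc/arbtt 2013-2014
```

The archived file can then be queried by passing its path to --logfile, as described above.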

External processing of arbtt statistics

arbtt supports CSV export of time by category at various levels of granularity in a 'long' format (multiple rows for each day, each of the n rows specifying one category's value for that day). These CSV exports can be imported into statistical programs like R or Excel and manipulated as desired.

R users may prefer to have their time data in a 'wide' format (each row is 1 day, with n columns, one for each possible category); this can be done with the reshape function from R's default stats package. After reading in the CSV, the time-intervals can be converted to counts of seconds and the data to a wide data-frame with R code like the following:

arbtt <- read.csv("arbtt.csv")
# Convert a duration such as "45 s" or "1:02:03" into a count of seconds.
interval <- function(x) {
    if (is.na(x)) return(NA)
    if (grepl(" s", x)) return(as.integer(sub(" s", "", x)))
    y <- unlist(strsplit(x, ":"))
    as.integer(y[[1]])*3600 + as.integer(y[[2]])*60 + as.integer(y[[3]])
}
arbtt$Time <- sapply(as.character(arbtt$Time), interval)
# reshape() belongs to the stats package, which R loads by default.
arbtt <- reshape(arbtt, v.names="Time", timevar="Tag", idvar="Day", direction="wide")

Selecting the current day

This can be done easily with a bit of help from the command line:

arbtt-stats  --filter='$date>='`date +"%Y-%m-%d"`
