How I teach myself new bioinformatics tools

I’m not sure if there’s a name for people who thought they would be doing lab work for the rest of their lives and then found themselves thrust into the deep end of bioinformatics, but I am one of them. This seems to be a common occurrence in research labs, and will probably continue to be until undergraduate programs catch up with the bioinformatics skills required in many fields of research. Fortunately I quite enjoy the stuff, but I am continually learning new things, and because so much of it is self-taught, learning and then performing each analysis can be a long process.

When embarking on a new kind of analysis, it is common to hear about several tools/scripts/software that do the same thing in different ways with different results. Which one is the best? Which should I use? Which one will tell me the answer to (research question)? Unfortunately, the answer is probably that they are all justifiable choices, that some are more accurate than others, and that it also depends on your data. You may also encounter promising tools (or instructions) that are not yet well developed, or that are no longer supported or updated. Finally, there is the risk that someone, somewhere will criticise your choice regardless of which one you go with.

Being somewhat experienced in “just working it out” and still reasonably new to bioinformatics analyses, I offer my approach to identifying the tool you need and learning how to use it when you have no experience to tell you where to start. Bear in mind that most bioinformatics tools run on UNIX systems, or sometimes as R packages; a strong introductory knowledge of the command line and/or R will make the process of learning new tools much easier.

  1. Understand what you want to do. You may have heard about that tool that people use to “analyse microbiome data” or “do transcriptome stuff”. But what does it actually do? Will it be useful to you? First, understand what your research question is, what kind of data you have, and what you need to do to the data to get your answers. Then you can start investigating which tools can provide you with what you want to know.

  2. Look for available tools. Search Google for the kind of analysis you need to do. Search the literature for papers that may have done what you want to do, perhaps in a different context or organism. What did they use? Follow the example of good, highly cited papers. You don’t need to (and probably shouldn’t) spend a great deal of time here, but it’s a good idea to have a grasp of the commonly used tools, or perhaps whether there’s several you need to string together in a pipeline to achieve what you need.

  3. Check out the tools to see if they do the job. Check the abstracts of their papers, or the introduction to the software manual. Does the tool sound like it will do what you want it to do? You may rule some out at this point because you discover that they are designed for a different kind of dataset, or won’t produce results that will answer your questions. Better to find this out now and save yourself the time!

  4. Choosing a tool. You may have found from the literature that everyone uses one common tool, or perhaps that there’s several to choose from that all seem equally good. In this case, benchmarking papers can be useful - someone else has done the hard work and tried them all, and reported on how well they perform. This is also useful because they can highlight the downsides that may not have been obvious (e.g. requiring far more RAM than you have access to). If there’s still a lot of choice, pick a good one and try it out.

  5. Learning that tool. There is probably a spectrum of approaches from “just run it and see what happens” to “I have to read the methods section on the algorithm before I can start”. I find that the former is easier when you have more experience, and can tell whether the output looks anything like it’s supposed to. I don’t think anyone would bother with the latter unless they are a programmer or statistician and can understand it quickly. Below is the approach I usually take:

  • Skim the paper, if it is a published tool. This is to get a decent idea of what the tool does and how it works. This is also an opportunity to find out if, for some reason, the tool isn’t in fact suitable for what you need to do.

  • Follow the installation instructions. Get it installed, and go from there. If this causes too many difficulties from the start you may want to consider an alternative tool if you had a few to choose from.

  • Follow the tutorial or the manual. For each command, I like to write it out in R Markdown first, so that I am recording everything I’m doing. I check the available options for the command to see if there is anything I should change, but my rule of thumb is: if an option doesn’t make sense, or you’re not sure you need to change it, leave it at the default. The programmers have likely chosen default options that suit most cases. I copy the command from my R Markdown document to the terminal, and write about what it does and why I’m running it while it runs.

  • Deal with error messages. I have found that when running a new command from a new tool I will commonly see an error message on the first go. Hopefully, the error message will be informative so you can work out if it’s just a typo you made, an installation problem, or something else. Google the error message to see if others have seen it before, or failing that contact the author of the tool. Interpreting and solving error messages is a skill in itself that is beneficial to develop, and can improve your understanding of how the tool works.

  • Try and understand what each command is doing. This will help you deal with error messages, notice if anything has gone wrong, and choose options sensibly, as well as understand what is happening to the data so you can interpret the output. Knowing what you’re doing also means you can explain to others how the analysis works.

  • Check out your results. Try and work out if anything has gone wrong - does the output file look like it should? Does it make sense? Just because a command ran without throwing an error message doesn’t mean that everything worked correctly. If the results look unusual or don’t make sense, try the next best tool for comparison. Be cautious of choosing the results that “look” best or fit your hypothesis best, but if you are unable to interpret one tool’s results, or they don’t make sense, it is okay to go with the tool that you can use.
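The record-run-check loop in the bullets above can be sketched in a few lines of shell. Everything here is a stand-in: the input file, the `grep` command (playing the role of your actual tool), and the sanity check are hypothetical placeholders for whatever tool and data you are working with.

```shell
#!/usr/bin/env bash
# Sketch of the record / run / check loop. The file names and the counting
# command are hypothetical placeholders for whatever you actually run.

# Tiny fake input to stand in for real data.
printf '>seq1\nACGT\n>seq2\nGGCC\n' > input.fasta
mkdir -p logs results

# 1. Record the exact command before running it, for reproducibility
#    (the equivalent of writing it out in R Markdown first).
cmd="grep -c '^>' input.fasta"   # stand-in for: mytool --input input.fasta ...
echo "$cmd" >> logs/commands.txt

# 2. Run it, capturing stderr so any error message can be searched for later.
grep -c '^>' input.fasta > results/record_count.txt 2> logs/error.log
status=$?

# 3. If it failed, logs/error.log holds the message to paste into a search
#    engine (or to send to the tool's author).
if [ "$status" -ne 0 ]; then
    echo "Run failed (exit $status); see logs/error.log"
fi

# 4. Sanity-check the output even when no error was thrown: here, simply that
#    the record count is a plausible number for the input.
echo "records: $(cat results/record_count.txt)"
```

The point is not the specific commands but the habit: every run leaves behind the command that produced it, its error output, and a quick check that the result is plausible.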

Remember that if you are still learning, particularly if you are self-teaching, you don’t have to know everything. If the tool is well written and well documented, you can expect to be able to run the analysis correctly using the information provided - including valuable answers from the authors on the help forum, if it has one (use it if you need to!). This means you shouldn’t need a workshop or somebody to show you how - though if these opportunities arise, they will certainly save you time! Regardless, the experience will leave you better equipped to pick up other tools and analyses, and able to help others starting out where you did.

I hope that my approach is useful to you - let me know! If you have any questions or comments, find me on Twitter or email me!

Written on November 8, 2017