Extracting information from GenBank files

Sometimes you want to get a quick overview of the distribution of a group of uncultured microorganisms. In my case I had extracted all the 16S sequences classified as a certain clade, from the Silva database, and I wanted to know which environment they came from. This information can easily be found in the GenBank entry by looking at the /isolation_source and the title of the publication. It’s however a bit tedious to do this manually with a few hundred sequences, so some kind of bash-hacking or scripting had to be done. I Googled a bit but did not find an answer on how to do exactly what I wanted.

To start with I had to make a list with all the accession numbers from the fasta file that I had extracted from Silva, so that I could use Batch Entrez to download them in GenBank format. Being a newbie on unix, I knew that there should be an easy way to do this with regular expressions. However, since I have yet to learn how to use regular expressions, I used grep, awk, and sed instead.

For some reason this worked on all accession numbers except for the first one on the list, which came out in the form of AA123456.1.123 instead of AA123456 (Does anyone have a clue why this happened?). Anders showed me a bit more elegant command using the magical regular expression “^>\w+”, which will look for all the word-characters (ie letters and numbers), following the “>”, and stop when it encounters the first punctuation (since it’s a non-word character).

We quickly realized that the extraction of information could best be handled by BioPerl, so Anders helped me put together a small script. The purpose of the script is to loop through a file with GenBank entries, extract the accession number of each entry followed by the publication reference and isolation source, and output that information in a tabular format. This was not straightforward for the reference, because for some reason there are two fields named “TITLE” in GenBank files. Since the second field only says “Direct Submission”, we told the script to only output the value if it didn’t say “Direct Submission”.

It took us less than an hour to write the script. Slightly shorter time than if I would have done it manually!