Adding a Sorghum version 2.0 genome assembly to IGBQuickLoad

Recently, a researcher working with the sorghum genome posted this question on BioStars – Question: IGB- genome sequence in sequence viewer just shows dashes

Updating to a more recent version of IGB solved the problem, but the question reminded me that Phytozome had released a new sorghum assembly and that we needed to update the IGBQuickLoad site.

In this post, I’ll describe how I updated IGB to the latest sorghum genome assembly using data from Phytozome and a few easy-to-use command line tools.

Step One. Get the data

To start, Nowlan downloaded gene annotations, sequence, and a gene information file from Phytozome.net. The files were:

  • Sbicolor_255_v2.0.fa.gz – sequence data
  • Sbicolor_255_v2.1.gene_exons.gff3.gz – gene models
  • Sbicolor_255_v2.1.defline.txt – gene information file, summarize function

He put them in our shared IGB Dropbox, saving me the trouble of logging in and figuring out where to look. Thank you Nowlan!

Step Two. Find out when the genome was released and what it’s called.

Using Google searches, I found a page describing the new assembly and associated gene annotations. The genome assembly is called “version 2.0” and the annotations are called “version 2.1.”

A search for “Sorghum 2.0 released” found this news page which contained text:

[07 Jun 2013]   v2.1 of Sorghum bicolor now available as early release.

This new item refers to the genome annotations, not the genome assembly, but probably the two were released about the same time, so I decided to use this. Why does it matter? It matters because including the month and year in the name of a genome release tells us which release is most recent. This is why IGB gives every  assembly a name that tells it (and us) how old the genome is.

Step Three: Check out a copy of IGBQuickLoad.

We are using subversion (a version control system) to track and distribute files that are part of the IGBQuickLoad site.

To add the new data to IGBQuickLoad, I first have to check out a copy using svn. (For more information about svn, see: http://svnbook.red-bean.com/.)

svn co https://svn.transvar.org/repos/genomes/trunk/pub/quickload

I’ve configured my computer to provide the correct user name and password when I execute svn commands. Otherwise, I would need to supply my user name and password uisng the –username and –password options. For more information, run

svn co -h

Step Four: Add new directory for this new genome

Following IGB conventions, I added a new directory to the IGBQuickLoad data repository using svn:

svn mkdir S_bicolor_Jun_2013

By convention, we keep genome assembly data files and meta-data files in a directory named after the species and release date.

I also added new lines to the .htaccess and contents.txt files:

htaccess:

AddDescription "Sorghum bicolor v. 2.0" S_bicolor_Jun_2013

This ensures the directory listing on the IGBQuickLoad.org Web site will report the version name and species.

contents.txt:

S_bicolor_Jun_2013      Sorghum bicolor v2.0 (June 2013)

This ensures IGB will list the genome in the Current Genome tab. When IGB first accesses an IGBQuickLoad site, it fetches this “contents.txt” file and uses it to determine the names and locations of genome assemblies available on the site.

This contents.txt file is a tab-delimited file with two columns – the first column lists genome directories and the second column lists the human-friendly name of the genome, which is what IGB displays in the IGB window title bar when users selects that genome. Note that sometimes two sites could contain different text in the second genome, but this won’t break IGB. IGB will just show the data from whichever file it read first. So far, this has not posed any problems.

Step Five. Make 2bit sequence file.

I used Jim Kent’s faToTwoBit program to convert the fasta sequence file to 2bit. This format is more compact than fasta and is also random access, which means IGB can use the file to retrieve small parts of the genome quickly when users click the Load Sequence button. By convention, we name the 2bit file after the genome version – this ensures IGB will be able to find it on the site.

faToTwoBit Sbicolor_255_v2.0.fa S_bicolor_Jun_2013.2bit

The fasta file is 702 Mb (almost 1 Gb!) and the 2bit file was only 173M. Much better!

This file must reside in the genome directory S_bicolor_Jun_2013. Because it is not very big, I used svn to add it to the repository.

Step Six. Make genome.txt file.

To populate the Current Sequence tab, IGB needs a “genome.txt” file that lists chromosomes and their sizes – this file, like the 2bit file, resides the in the genome directory. To make this, I use twoBitInfo, another Jim Kent program:

twoBitInfo S_bicolor_Jun_2013.2bit genome.txt

IGB’s Current Sequence table will list the chromosomes in the same order they appear in this file, so I view the first few lines of the file to make sure the most useful (fully assembled) chromosomes appear first in the list:

Chr01   73727935
Chr02   77694824
Chr03   74408397
Chr04   67966759
Chr05   62243505
Chr06   62192017
Chr07   64263908
Chr08   55354556
Chr09   59454246
Chr10   61085274
super_10        8818317

This looks fine. But if it didn’t I might re-order the listing by sequence size:

sort -k2,2nr genome.txt > tmp
mv tmp genome.txt

Step Seven. Make annotations meta-data file (annots.xml)

Here, I used a program named phytozomeGff3ToBed.py, available from a Loraine Lab git repository. This program converts GFF to BED12 format:

phytozomeGff3ToBed.py -g Sbicolor_255_v2.1.gene_exons.gff3.gz -b tmp.bed

I used a Unix one-liner convert the BED12 file to BED-detail format, sort the file by sequence name and gene model start position, compress it using bgzip, and then save the result to a new file named for the genome version:

bedToBedDetail.py -g Sbicolor_255_v2.1.defline.txt -b tmp.bed | sort -k1,1 -k2,2n | bgzip > S_bicolor_Jun_2013.bed.gz

To share reference gene model annotations, we use BED-detail format.  It’s compact and more convenient for data analysis than GFF3. It contains all the usual field in a BED12 file, plus two extra fields. In IGB, our BED-detail files

  • report locus names in field 13, useful for grouping alternative transcripts arising from the same transcriptional unit
  • don’t use the itemRGB field (we don’t color-code gene models, but we might do that someday)
  • insert the start position in the thickStart and thickStop fields when a gene model doesn’t encode a protein

When we distribute annotations file, we also index them using tabix, which requires that the indexed file be sorted and block compressed using bgzip. (Google “tabix” to find a copy – it’s open source and associated with the samtools package.)

To index the file using tabix, I did this:

tabix -s 1 -b 2 -e 3 S_bicolor_Jun_2013.bed.gz

This created a file named S_bicolor_Jun_2013.bed.gz.tbi (“tbi” for tabix index) which must reside in the same directory as its corresponding data file for IGB (or other programs) to use it. The index file enables IGB to quickly load the file or retrieve just parts of it via byte-range queries.

Next, I viewed the first few lines of the file:

Chr01    24317    38116    Sobic.001G000400.2    0    -    24631    36898    19    590,255,172,228,134,84,291,333,107,126,166,104,161,901,91,291,85,83,3870,683,1377,1661,2230,8581,8772,9179,9608,9929,10147,10436,10651,10922,11953,12118,12497,13040,13412,    Sobic.001G000400    similar to PDR-like ABC transporter
Chr01    24317    42528    Sobic.001G000400.1    0    -    24631    42234    20    590,255,172,228,134,84,291,333,107,126,166,104,161,901,91,291,85,83,121,500,    0,683,1377,1661,2230,8581,8772,9179,9608,9929,10147,10436,10651,10922,11953,12118,12497,13040,17387,17711,    Sobic.001G000400    similar to PDR-like ABC transporter
Chr01    50023    50488    Sobic.001G000500.1    0    -    50037    50488    54,263,    0,202,    Sobic.001G000500    Not Available
Chr01    53429    60786    Sobic.001G000600.1    0    -    60555    60738    204,419,    0,6938,    Sobic.001G000600    Not Available

This looked good. Field 4 contained gene model name, field 13 contained locus name, and field 14 contained a description taken from the gene information file (Sbicolor_255_v2.1.defline.txt), but reported “Not Available” when nothing about that gene was listed.

How many gene models have no information?

gunzip -c S_bicolor_Jun_2013.bed.gz | grep -c "Not Available" 
   14622
gunzip -c S_bicolor_Jun_2013.bed.gz | wc -l
   39441

Out of 39,441 gene models, around 14,000 had no functional information. This isn’t great, but it’s not unusual for a newly annotated genome.

Step Eight. Make a HEADER.html file – documentation

Next, I created a HEADER.html file that documents the files and where they came from. When user visit the new genome directory on the web, they’ll see the HEADER file contents displayed along with a listing of the files.

Step Nine. Make an annots.xml file

To ensure that IGB will show the genome, I next needed to add an “annots.xml” meta-data file that reports the data sets that are available for the genome. Because I’m a little lazy, I use the “svn cp” (svn copy) command to make an annots.xml file using the file from a previous genome version.

svn cp ../S_bicolor_Jan_2009/annots.xml .

I used a text editor to update the data for this new genome:

<files>
  <file name="S_bicolor_Jun_2013.bed.gz" 
        title="Genes"
        description="Version 2.1 gene models from Phytozome, Nov. 2014."
        url="S_bicolor_Jun_2013"
        foreground="000000"
        name_size="14"
        direction_type="arrow"
        background="FFFFFF"
        load_hint="Whole Sequence"
        show2tracks="true"
  />  
</files>

What all this means:

  • name – name of the file in the file system; IGB assumes it resides in the same directory as the annots.xml file, inside the genome directory S_bicolor_Jun_2013
  • title – name IGB gives this data set in the Available Data sets section of the Data Access tab. This is also the name IGB displays in the track label once the data are loaded into a track.
  • description – displayed as a tooltip when users position the cursor over the “i” icon next to the data set under Available Data Sets.
  • url – if users click on the “i” icon, a Web browser will open to this location relative to the IGBQuickLoad root directory. (http://www.igbquickload.org/quickload/S_bicolor_Jun_2013 in this case.)
  • foreground – annotation color
  • name_size – font size for the track label
  • background – background color for the track
  • load_hint – how and when to load the data. “Whole Sequence” means: load all the data at once.
  • show2tracks – whether or not to divide the annotations into two tracks – one for minus strand genes and another for plus strand genes

Step Ten. See how it looks in IGB.

For testing and development, I’ve been adding and updating the files in a local copy of the IGBQuickLoad repository that I’ve checked out onto my computer. To see how the new files will look in IGB, I use the Data Sources tab in the Preferences window to add the local QuickLoad site to IGB.

Then, I used the Current Sequence tab to select species Sorghum bicolor and genome version S_bicolor_Jun_2013. I also ran a quick keyword search (Advanced Search tab) to make sure the “details” part of the BED-detail file is read correctly into IGB.

Everything looks good!

S_bicolor

Step Eleven. Check it into the repository and update the public IGB QuickLoad sites.

I then used “svn add” and “svn commit” to add the new files to the repository. Then I logged into the main IGBQuickLoad site and the IGBquickLoad backup site and ran “svn up” to add new files and update old ones.

This entire process, including writing this post and coffee breaks, took about two hours of my time.

 

 

 

 

 

 

 

 

 

Farewell Tarun

Yesterday we had a lab dinner to celebrate Tarun Kanaparthi’s December 2014 graduation – he is earning his Masters degree in Computer Science from UNC Charlotte.

Following a vacation, he plans to start work at his new job, to be determined. Like many recent or soon-to-be graduates in computer science and bioinformatics, he is trying to decide between multiple opportunities. It’s nice to know that computer programming and data analysis skills are still in demand. Congratulations Tarun!

Here is a photo from the going-away party. From left to right, we are:

April Estrada, Ivory Blakley, Mason Meyer, David Norris, Tarun Kanaparthi, Tarun Mall, Nowlan Freese

IMG_1541

Using Table Browser at UCSC to get a data set for IGB

The UCSC Genome Bioinformatics site offers a wealth of data from many genomes, ranging from tiny (but deadly) genomes like ebola virus to much larger genomes like our own human genome. Indeed, the reference genome sequence and gene model data sets for many of the genomes the IGB teams hosts on the IGBQuickLoad.org data repository site are originally from UCSC. Other projects – like Galaxy – also rely on UCSC for core data sets. If you have used Galaxy, you may have noticed that Galaxy has built-in genomes for many animal species; these data were imported from the UCSC ecosystem.

When you visit a UCSC-supported genome in IGB, you’ll see a folder named “UCSC (DAS)” in the Available Data Sets  section of the Data Access Panel. If you open the folder, you’ll see many data sets with seemingly cryptic titles – like “nestedRepeats” and “altLocations.” These names correspond to tables in the UCSC database. If you select these data sets and click Load Data in IGB, these data will flow from the UCSC system’s Distributed Annotation Server into IGB.

DAS4HumanHowever, for technical reasons, the UCSC DAS site can only support some of their data sets. To view UCSC data that are not supported in their DAS site, you can use the UCSC Table Browser to download them onto your computer and open them in IGB.

In this post, I’ll explain how you can use the UCSC Table Browser to obtain and then open data sets in Integrated Genome Browser. I’ll also explain how you can use tabix and bgzip to compress and index files that are too large to load into IGB all at once.

Part I: Getting data from the UCSC Web site

In this example, I’ll show you how to get human variation data (SNPs) from the UCSC Web site.

  1. Go  to http://genome.ucsc.edu
  2. Click Tables (top of page)

Configure the browser to access the latest (as of this writing) human genome assembly:

  1. clade Mammal
  2. genome Human
  3. assembly Dec 2013 (GRCh38/hg38) (the latest)
  4. group Variation
  5. track Common SNPs(141)
  6. table snp141Common
  7. region genome
  8. output format BED – browser extensible data
  9. file type returned gzip compressed

Enter a name for the output file – e.g., snp141Common.hg38.bed.gz. It should end with “.gz” to ensure your computer will recognize it as a gzip-compressed file.

Tip: Click the button Describe Table Schema to see what data are in the table.

SNPTableBrowser

to be continued

Protocol: Growing rice in a growth room

Our building’s basement contains several temperature-controlled rooms where we grow plants under controlled conditions. We set up one of these rooms for growing rice. Here, we explain how we did it in hopes this will help other groups grow rice for research.

The protocol described here is based on this protocol from Purdue University. [link: “this protocol” to the Web page at Purdue. Don’t mention it again after this.]

Materials

  • seeds
  • container
  • soil
  • pea gravel
  • fertilizer
  • lighting

Rice seeds. We got our received seeds (wild type Nipponbare) from the Kieber lab at UNC Chapel Hill.

Soil. We use 50/50 mix of profile greens and Promix soil. Profile Greens: profile is the brand, greens is the grade. Its intended for landscaping idn golf courses and we bought our supply through a golf course supply store in south Charlotte.  These fine clay granules look like sand but wick up water (much like the clay granules used in cat liter do) and create an evenly hydrated medium.  The Promix soil is a common garden mix, we ordered it online.

Container. Plastic tub – 2ft x 3ft x 8in deep.

Pea gravel. [Explain where this comes from] We use pea gravel to help with controlling gnats.

Fertilizer. Scotts 15-15-15 fertilizer.  This comes as a solid mixture to be dissolved in water.  I mix 133g of it in 1L of water to make a concentration that is 100x.  I dilute this in a 2L pitcher to 1x and apply 6L up to three times per week.

Lighting.  We use the lumigrow XXXX, which uses blue, red, white LEDs. The red and blue LEDs emit only the wavelengths of light that are most efficiently utilized by plants.  LEDs are more efficient than other types of bulbs in terms of the percentage of energy that is lost to heat, and the amount of light generated per unit of energy.  These fixtures take it one step further by being more efficient in terms of light used by the plant per light emitted by the fixture.  The white LEDs are primarily for my benefit; the red-blue is easier on the eyes with a little white mixed in, and there is a little switch so I can have just white light if I need to do anything in the growth chamber.  The lights are plugged into a timer so they are only on for 14 hours each day.

Methods

  1. Prepare soil. Explain what to do.
  2. Set up shelving. Explain what to do.
  3. Set up lighting. Explain what to do.
  4. and etc.

Heat and humidity.  Our tub of rice and LED lights are set up in a growth chamber in the basement. The chamber is temperature controlled (we have it set to 28C) and humidity controlled (we have set to about 50% RH).

Diary

19 Sept. 2014 – Planted Rice seeds
Used the seeds from an envelope labelled: Nipponbare wild type, collected 07/15/13, envelope 6 of 6, see page 58 notebook 36.I tried to get seeds from several different panicles and I tried to discard seeds that seemed skinny. I set up one fo the ikea tubs, ~50%/50% mix of profile greens grade clay grandules and Pro mix.  I wetted the soil mixture and mixed it.  The seeds are in a grid of 5 rows x 6 columns with 2 seeds per spot about an inch or two apart. I marked each seed spot with a little green wire.

I put the 32 seeds that I didn’t use into a plastic dis with water and left on top of the soil.  I covered teh top of the tub with clear-cling plastic wrap so the seeds would not get dry over the weekend.

22 Sept. 2014
Took off the plastic wrap.  29 of the 32 seeds in teh dish have germinated. They are all at the bottom of the dish. The three seeds that did not germinate are skinny and floating. Not many (hardly any) plants poking out of the soil yet.

30 Sept 2014
100% of the seeds have germinated. Some have the first leaf blade.

2 October 2014
I bought pea gravel at Lowes hardware, and added it to the surface of the the soil around the rice plants.

3 October 2014
Watered + fertilized the rice.  I still have the 100x concentrate of liquide 15-15-15 Scotts fertilizer from the last time we grew rice.  I gave the tub 4L of 1x and then hose water, and left it with an inch of water over the pea gravel.

6 October 2014
The pea gravel is still wet, and a bit green.  Next time I guess I should just water it up to the stones, not above them.

10 October 2014
Added some fertilizer to the soil (3L of 1x) and hose water.
I removed 9 plants to reduce crowding (when I seeded them I didn’t actually expect all of them to germinate). Took samples on liquid nitrogen (see page 123, notebook 40 for more info).
I have a small digital thermometer in the chamber. It reads 86F if it is shaded by the rice and 90F if not shaded. It records the min/max over 24 hours, and I think the temperature difference is in sync with the light cycle (7am to 9pm).

14 October 2014
Mason and I moved the tub of rice from the shelf to the floor.  We have two lumigrow light fixtures on one shelving unit.  One fixture is suspended from the top and illuminates the shelf below. The other is suspended from the middle shelf and illuminates the floor where the tub of rice is.
Something a little odd, the plants at the edge of the tub seem to be the healthiest–tallest greenest leaves.  Are the ones in the middle drying or maybe burning because they are too close to the light? –if that’s the case, the move should help, they have more space now.

I added 4L of 1x fertilizer to the tub of rice.  Then I added some hose water–partially just to rinse any fertilizer of of the leaves.

whoops, this should have been a picture of rice.

25 days after the initial seeding. The tub was moved to the floor to have more space between the tub and the light fixture, and to make space for the next batch of seedlings (top shelf).

15 October 2014 – pictures

whoops, there was supposed to be a pciture here!

26 days after initial seeding.

how embarrassing...

26 days have initial seeding, plants are tillering.

17 October 2014
Watered with 6L of 1x, then hose water.

20 October 2014
Used the hose to add more water to the tub of rice. The plants look good, the rocks were dry, and the soil was not saturated even at teh bottom. I filled it until the water was around the rocks. I think things dry out much faster in here when I have two lights on, and the rice must pull more water from the soil now that they are larger.

22 October 2014
Added 8L of 1x fertilizer to the tub of rice. Then added water from the hose until there was water around the bottoms of the rocks. This is the same aproximate water level that I reached on Monday (2 days ago) which  means that the tub looses more than 8L in 2 days (or >4L/day).

24 October 2014
Added 6L of 1x, then added hose water to fill to the top of the tub.

27 October 2014
Added 6L of 1x, then added hose water to fill to the top of the tub.
I showed Mason how to do this. : )

29 October 2014
Mason watered the rice: Added 6L of 1x, then added hose water to fill to the top of the tub.
****skulls and daggers! I turned on the overhead light when I came in on Monday and I forgot to turn it off.  The overhead lights are fine except that they are not on a timer. During the rice day (when the LEDs are on) its hard to tell if the overhead lights are on, but during the rice night, the overhead lights will stay on.  I’ve heard that Nipponbare is very sensitive to light cycle, and it will take forever for it it to go to seed if the days are too long.  I hope this does mess things up. : (  These lights are not nearly as bright as the LEDs, hopefully it’ll have only as much affect as moonlight or starlight…two nights of artificial full moon to go with the artificial sun light.

31 October 2014
I wanted to use 100% fertilizer water to fill the tub today, but I got up to 12L and that didn’t even bring it up to the level of the rocks…The soil was really dry, but the plants look fine, so it must not have been dry for long. I used hose water to fill it up the rest of the way to the top of the tub.

4 November 2014
Made a new bottle of 100x fertilizer from the solid stuff. See details in Mason’s notebook: page 78 notebook 58.  For full original protocol and rationale, see my notebook: page 78 notebook 36.

5 November 2014
Added 6L of 1x, then added hose water to fill to the top of the tub.

7 November 2014
Added 6L of 1.5x, then added hose water to fill to the top of the tub. (I thought the leaves looked like an overly light green, which made me think that they weren’t getting enough N).

The rice grew up into the fan of the LED light.  So April and Nowlan helped me move the shelf up.  The rice is on the floor, so the this moves the light further away. (Moved it 10 inches).

10 November 2014
Added 6L of 1x, then added hose water to fill to the top of the tub.

12 November 2014
Added 6L of 1x, then added hose water to fill to the top of the tub.

14 November 2014
Added 6L of 1x, then added hose water to fill to the top of the tub.

17 November 2014
Added 6L of 1x, then added hose water to fill to the top of the tub.

Nowlan and I cut up one of the plants and from the tub to see if it had entered the “booting” stage, when the panicle begins to develop.

whoopsies

One plant from the tub.

We cut off individual tillers….

woops. shhhh.

One tiller.

The largest tiller on the plant we cut has at least 5 distinct “crowns”.

whoops....how embarassing!

59 days after initial seeding. Cross section of the largest tiller on the plant we cut up.

Compare this to these reference images:

ReferenceImages-GreenRing

Reference images showing the green ring that becomes visible with the initiation of panicle development.

Figure image from “Chapter 2: Rice Growth and Development” by Karen Moldenhauer, Charles E. Wilson, Jr., Paul Counce and Jarrod Hardke.  Photo from “Understanding of Growth Stages is Critical in Rice Production” from the LSU college of agriculture.

We looked at several tillers from this plant, the others had fewer crowns: 3, 5, 2 and 4.

Smaller tiller from the same plant from the tub.

Smaller tiller from the same plant from the tub.

This range of stages may be very convenient when we collect samples for our time course.  Cutting up one plant might give us several sample options.

We also looked at one of the plants that was removed from the tub on October 10th. I set a few of them aside to grown in a window sill in one of the offices.  They are much less developed than their counterparts downstairs (due to getting only window light in November, lower temperature, lower humidity, much less fertilizer, general neglect–its a harsh life).

oopsies.

Window sill plant.

Cross section of a tiller from the window sill plant.

Cross section of a tiller from the window sill plant.

18 November 2014
The leaves are starting to get into the fan again. >: [
That means they grew nearly 10 inches in 11 days.

I used a pair of scissors to essentially mow the leaves that were too close to the light (cut about 1 to 6inches per leaf).
The plants in the middle of the tub (directly under the light) are a bit taller than the ones on the edge, but only by a few inches. At least they were, before the mowing.

Adding new fruit fly annotations to QuickLoad (by Ann)

Several IGB users asked us to add the latest fruit fly genome and annotations to IGB. Here I describe how I did this, using data files downloaded from FlyBase, the chief annotation authority for fruit fly genomes.

In July 2014, FlyBase and Berkeley Drosophila Genome Project (BDGP) released a new genome assembly for the fruit fly genome Drosophila melanogaster. They also released a new set of gene model annotations in various formats – including GFF, GTF, and XML.

To make the data available to IGB users, I needed to download the sequence data and reference gene model annotations, convert them to formats IGB can use for visualization, and then deploy the converted files on the IGBQuickLoad.org Web site. This is the Web site that IGB uses to retrieve data for annotated, sequenced genomes. When IGB users select a genome, the reference gene model track data that automatically loads into IGB comes from the IGBQuickLoad site or one of its mirror sites.

Adding a new genome directory to the IGBQuickLoad data repository

To start, I added a new genome version directory (folder) to the IGBQuickLoad.org data repository. I did this using “svn”, the program we use to manage the data sets. I used the “svn mkdir” command to create a new directory named D_melanogaster_Jul_2014, following our naming convention for genome versions. In IGB, we use the genus, species, release month and year to specify a genome version. (Sometimes we also include an extra name to indicate variety or something else, such as  Z_mays_B73.) However, in this case, we don’t need a variety suffix, so we only use the genus and species to indicate the genome.

The IGBQuickLoad.org data repository is version-controlled as part of a large subversion (svn) repository – the master copy is hosted at https://svn.transvar.org/repos/genomes/trunk/pub/quickload/ and is publicly accessible. Which means: anyone who wants a copy can use “svn co” to get it.

The IGB team has been using subversion for many years to track important data files, like gene annotations or sequence, that people use for data analysis and visualization in IGB. To save effort, we also use the same repository to serve data files to IGB, using the IGB QuickLoad mechanism. When we need to deploy a new quickload site – on our local computers or via a Web server – we just use “svn co” to check out a fresh copy and then copy over some bigger data files (e.g., BAM files) that are too large to add to the repo. However, genome sequence files for smaller genomes (like fruit fly) are not very large and so I usually version-control these in the repository along with the gene model files. (More on this later.)

Adding genome sequence to the repository

Next, I looked around the FlyBase.org Web site to find a file containing the new genome assembly sequences. I found it here: ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.03_FB2014_06/fasta/. The directory contained a file named dmel-all-chromosome-r6.03.fasta.gz. Based on the name, I guessed this file contained the genome sequence, so I downloaded that.

Next, I used Jim Kent’s faToTwoBit program to convert the file from fasta to 2bit format, a compressed, indexed format that enables IGB (and other programs) to retrieve data from the file without having to read the entire thing – called “random access.” This is what enables IGB to load just part of a genomic sequence, which is faster and more memory-efficient than loading the entire thing.

After converting the fasta file to 2bit, I next used Kent’s twoBitInfo program to make a genome.txt file, which is a tab-delimited file that lists the sequences and their sizes. When a user selects a genome version in IGB, IGB then sends a request to a QuickLoad site that supports that genome, asking for a copy of that genomes genome.txt file, which is stored in the genome directory for that genome version. IGB reads this file and uses the information to populate the Current Sequence tab on the right side of the main IGB  window. The Current Sequence tab displays a sortable, clickable table that you can use to change chromosomes in the main display.

This is different from (and we think better than) other genome browser designs, where users have to operate a menu or click on a to-scale genome graphic to change which sequence is on display. Menus don’t work well if there are more than a few items in the menu, and a to-scale genome graphic is even worse because smaller contigs are too hard to select. That’s why in IGB, we use a sortable table instead. And, of course, users can also search for contigs and genes by name.

A quick review of the genome.txt file revealed that this new assembly contains the expected fruit fly chromosome arm sequences (e.g., 2L), but it also contains a lot of other sequences with identifiers like “211000022280328” and “2Cen_mapped_Scaffold_10_D1684.” Altogether, the genome sequence file contains 1,870 sequences.

Interesting! I was not expecting to find so many sequences in this release because the previous release (dm3) contained the autosomal chromosome arms and the sex chromosomes only. But including these additional sequences makes sense. I imagine that the assembly creators at BDGP did not want to omit some-one’s favorite gene from the release just because it didn’t assemble into the main chromosome arms.

Getting gene model annotations from FlyBase

Looking around a bit more, I found these two folders: ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.03_FB2014_06/gtf/ and ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.03_FB2014_06/gff/.

I reviewed the files in each folder, starting with the GFF files. I’ve never had good experiences with GFF, mainly because different people use it in difference ways, despite many attempts to standardize the format. At first glance, the FlyBase GFF looked too complex to handle easily.

So next, I turned to the files in the GTF directory. I don’t know why FlyBase included these files in their ftp site, but I’m guessing it had something to do with supporting users who want to use CuffLinks to analyze RNA-Seq data. CuffLinks uses GTF to represent gene models, andI already have code that converts CuffLinks GTF to BED-detail, our preferred format for distributing gene model data files in IGB. A quick review of the file showed me that the format was indeed simpler, and so I decided to work with that file instead of the GFF files.

Unfortunately, my Cufflinks GTF conversion code failed on the FlyBase GTF file because it lacked transcript features. Also, FlyBase included many in-line comments.

Nonetheless, it looked simpler, and so I wrote a new script to handle FlyBase GTF as distinct from other forms of GTF. The script is named flybaseGtfToBed.py and resides in bitbucket repo https://bitbucket.org/lorainelab/genomes_src. I used this new code to create a BED-detail file named D_melanogaster_Jul_2014.bed, which contains gene models converted from the GTF file.

About BED-detail and how we use it in IGB

BED-detail, also known as BED14, has all the same fields as BED12, plus two more. In IGB, we use these extra fields to provide useful meta-data about a gene model. In IGB, field 13 contains the gene name for a gene model. And field 14 contains human-friendly text describing the gene’s function. When users open these files in IGB, IGB formats them internally in a way that allows users to do keyword searches on field 14. This is very useful, but of course it depends on having human-friendly text available to use in this field. For Arabidopsis, we use the “curator summary” text in field 14. I’m not sure what we can use for fruit fly as I have not been able to find an equivalent to the Arabidopsis “curator summary.”

BED-detail benefits in other areas – high-throughput data analysis

We use BED-detail mainly because it’s compact and easy to index using tabix. (More on this later.) Also, the structure of the BED-detail files makes them extremely useful in high-throughput data analysis using R/Bioconductor tools. My favorite RNA-Seq data analysis trick to find the smallest start and largest end position for all transcripts belonging to the same gene, and I identify those by grouping on gene symbol. Then I use the coordinates to count reads overlapping the gene. Once I have those counts, I can feed them to edgeR or DESeq or other similar tools.

Deploying the new gene model BED-detail file on IGBQuickLoad

For simplicity, I named the file after the genome release: D_melanogaster_Jul_2014.bed. Next, I sorted the file by sequence name and start position, compressed it using bgzip, indexed it using tabix, and added the files to our subversion repository:

sort -k1,1 -k2,2n D_melanogaster_Jul_2014.bed | bgzip > D_melanogaster_Jul_2014.bed.gz
tabix -s 1 -b 2 -e 3 D_melanogaster_Jul_2014.bed.gz
svn add D_melanogaster_Jul_2014.bed.gz.*
svn ci -m "Converted from FlyBase GTF using flybaseGtfToBed.py"

Adding meta-data file: annots.xml and HEADER.html

to-do

Viewing the data in IGB

Here’s an image from IGB showing an alternatively spliced gene (para) from chromosome X. The depth graph made from the gene models (FlyBase) track highlights differentially spliced regions. Places where the depth graph changes height indicate alternative splicing.

DmelX

Moving genomes source code to bitbucket (by Ann)

Today I’m moving source code (mainly python) from the genomes subversion repository over to a new repository on BitBucket.

The genomes subversion repository contains version-control data files for many different genomes that we’ve collected from many sources, such as the UCSC Genome  downloads page, the UCSC Table Browser, Phytozome (for plants), and model organism databases like DictyBase. We launched the genomes repository back in 2008 as part of an Arabidospis 2010 project grant that support IGB from 2008 to 2012. Our original idea was to use version control systems (cvs or svn) to track big data files from Arabidopsis and other species. The repository would not only serve as a useful data archive, it would also provide a way to feed data into IGB.

However, the genomes repo also contains source code we wrote for converting data files, setting up QuickLoad sites, and other tasks. The source code was stored in a directory named “src” (for “source”). You can browse it here: https://svn.transvar.org/repos/genomes/trunk/pub/src/.

Today, I’m moving all that source code over to a new repo hosted on BitBucket: https://bitbucket.org/lorainelab/genomes_src.

I’m doing this because the BitBucket UI for browsing code, reviewing changes, and managing the project is about a thousand times better than what we are currently using on our subversion server. Since BitBucket offers free source code hosting for reasonably sized repositories, there’s no point in hosting any of our source code ourselves, and during the last year, I’ve been happily migrating all our lab code onto BitBucket. Also, I’d like to start using git for version control, at least for all our source code. Setting up a git repository on our server would probably not be difficult, but then we would have yet another service to maintain.  Rather than try to host everything we do on our own machines (which consume electricity and sysadmin time) I’ve realized we should off-load as much as we can onto public, free resources like BitBucket. And at least for now, BitBucket seems like the best option because it’s run by Atlassian, which seems like it has staying power, a mature management structure, and excellent support.

More about the genomes repository:

The genomes repo contains a lot of data, not just code. In fact, most of is data – more than 11 Gb worth of sequence files, annotation files, and some meta-data files describing what’s there.

I’d love to migrate the data onto BitBucket and start using git and not svn for managing the data. I’d love to make it possible for people to fork the repo, improve the annotations, and then issue pull requests back to the main site. The problem is: I don’t know if BitBucket can handle a repository as big as ours – it’s 11 Gb. Also, most of our files are in binary formats (2bit, .tbi, .gz, etc) and I’m not sure how git will handle those files when I or other people make a change. Will it store multiple copies of an enormous file if I modify a small part of it? How would diffs work? It would be terrific if git could somehow handle our binary formats in a smart way.

So for now, I’ll continue to use our subversion server to host and version control the data sets, but I’ll use bitbucket for source code, which is probably what the team at Atlassian would prefer.

One last comment: For now, I’m leaving the “src” directory untouched in the original genomes repo, but I’ll add a notice letting users know that this part of the repository is moving to bitbucket.

Welcome to our blog

Thanks to funding from public sources (NSF, NIH, North Carolina) we get to do interesting research that benefits many people. In this blog, we plan to share some what we’ve learned.

Check back here to read posts on research and related topics from Loraine Lab members.