Students Explore Personal Genomics

Students from the local Kannapolis middle and high schools had the unique opportunity to explore the human genome and learn about bioinformatics – the application of computer technology to biological information.


Nowlan working with a student to find out what genetic risks he has inherited.

In 2012, three generations of my family and I had our genetic markers commercially sequenced. The students used this DNA data to identify who was related to whom, determine which diseases I was most at risk for, and make new discoveries about my genetic inheritance.


Ivory helping a student answer questions about genetics and inheritance.

The purpose of the workshop was to drive home how new genetic technologies are increasingly being used, as well as to give the students experience using genomics software – Integrated Genome Browser. Thanks to Tanner Deal and Ivory Blakley for helping design and lead the workshop, and to Doug Vernon for organizing the “Scientist for a Day” program. For more information about the “Scientist for a Day” program, check out the story in the Independent Tribune.

Plant & Animal Genome Conference

The Loraine lab had a strong showing at the 2016 Plant and Animal Genome Conference. Dr. Loraine gave a talk on the draft blueberry genome and on using the Integrated Genome Browser. I gave a talk on using ProtAnnot – an app for IGB to visualize protein function. Check out the workshop page for more information.

With over 150 talks and more than 3,000 attendees, this year’s PAG had a lot to offer, and everyone who attended from the Loraine Lab had a great time. And of course we also enjoyed the local San Diego attractions.

Panda in the early morning at the San Diego Zoo


2015 Plant Biology Meeting

The Loraine lab traveled to the Minneapolis Convention Center for the 2015 American Society of Plant Biologists meeting. There were many great talks given on recent advances in plant biology and crop sciences. Our own April Estrada gave a talk on the role of the gene SR45a in stress response in plants. I gave a talk on using IGB as a resource for teaching, as well as a workshop introducing visual analysis of RNA-seq data.

April giving her talk on SR45a


Everyone had a great time at the various talks, workshops, and exhibits. It was also a great chance to network with other researchers. Of course, we also made sure to take some time to visit the Twin Cities.

Enjoying Minneapolis


2015 Society for Developmental Biology

I want to thank the Society for Developmental Biology for inviting me to their annual meeting in Snowbird, Utah. I had the opportunity to give a talk on the work that April Estrada and I have done on the role of SR45a in alternative splicing in stress response. I also led a workshop on using Integrated Genome Browser to visually analyze high-throughput sequence data. We had a great turnout, as many of the attendees were very interested in using IGB in their work.


SDB attendees finding out more about IGB.

Snowbird is a ski resort located in the mountains near Salt Lake City. I was able to take the tram to the top of the mountain and take some photos. It was a great location for a conference.


View from top of Snowbird, looking out over Salt Lake City.


IGB workshop at 2015 SESDB

IGB made an appearance at the 2015 Southeast Regional Society for Developmental Biology (SESDB) conference at Clemson University. I led a workshop on visual analysis of sequencing data using IGB. There was a good turnout of conference attendees, as well as several students from Clemson University.

Following the workshop there were many exciting talks on current research in developmental biology. Some of my favorite talks were on regenerating hearts and spinal cords in fish, how hair cells develop, and how to use sharks to better understand brain development.

As a graduate of Clemson University, it was a lot of fun to return to Clemson, lead a workshop, and enjoy the company of many brilliant scientists.


Everyone enjoying the final night’s food and festivities, including the band FNKY music.

4th Annual Catalyst Symposium

The Loraine lab had a strong showing at the 4th annual Catalyst Symposium, with Ivory, April, and me presenting posters on our current research.

The title of the symposium was “Progress in NCRC”. The theme was to highlight the highly diverse and interdisciplinary research being conducted across the North Carolina Research Campus. There were nine talks, eighteen posters, and over a hundred attendees. The talks and posters were very good, covering topics such as the role of obesity in promoting cancer and finding what genes control the taste of fruits and vegetables.

Ivory and April presented posters on their work in rice and Arabidopsis, respectively, while I presented the latest features in IGB. It was a lot of work to prepare for the symposium, but everyone had a great time and learned a lot.


IGB at Lenoir-Rhyne University

In spring of 2015, Mason and I visited Dr. Scott Schaefer’s genetics class at Lenoir-Rhyne University in Hickory, North Carolina.

The goal of our visit was to teach students about genomics and genetics using Integrated Genome Browser. We also hoped to gain fresh insight into how new users respond to the IGB interface.


Lecture on genes and development.

To start, I gave a talk on developmental genetics, describing how a single mutation in a gene can have huge consequences for a developing embryo. I explained how advances in technology have made sequencing genomes more affordable than ever, allowing researchers to quickly identify disease-causing mutations.


Hands-on training with IGB.

Students then worked hands-on with genomic data using IGB. Their first task was to find genetic mutations by exploring whole genome sequencing data. They then used the data to build coverage graphs, allowing them to find deletions. The final task was to explore my own personal 23andMe data using IGB to see if I was at risk for any genetic diseases.

After a quick demonstration, the students had no trouble visualizing the data in IGB. We were also happy to see that no one had any trouble installing IGB on their personal laptops.


Mason and Dr. Schaefer answering questions.

This was the second time the IGB team has visited Lenoir-Rhyne. In 2013, Alyssa Gulledge visited another of Dr. Schaefer’s classes, which was using IGB to annotate the newly assembled blueberry genome. At the time, Mason was still attending school there, and that visit was his first exposure to genome visualization tools. IGB really stuck in Mason’s mind, and after graduating he joined the Loraine Lab as an intern, eventually taking over the role of lead tester on the IGB project.

We hope that this year’s visit will inspire other students to pursue careers in science and technology.


We hope everyone had fun learning about developmental genetics and IGB.

Adding a Sorghum version 2.0 genome assembly to IGBQuickLoad

Recently, a researcher working with the sorghum genome posted this question on BioStars: “IGB- genome sequence in sequence viewer just shows dashes.”

Updating to a more recent version of IGB solved the problem, but the question reminded me that Phytozome had released a new sorghum assembly and that we needed to update the IGBQuickLoad site.

In this post, I’ll describe how I updated IGB to the latest sorghum genome assembly using data from Phytozome and a few easy-to-use command line tools.

Step One. Get the data

To start, Nowlan downloaded gene annotations, sequence, and a gene information file from Phytozome. The files were:

  • Sbicolor_255_v2.0.fa.gz – sequence data
  • Sbicolor_255_v2.1.gene_exons.gff3.gz – gene models
  • Sbicolor_255_v2.1.defline.txt – gene information file summarizing gene function

He put them in our shared IGB Dropbox, saving me the trouble of logging in and figuring out where to look. Thank you Nowlan!

Step Two. Find out when the genome was released and what it’s called.

Using Google searches, I found a page describing the new assembly and associated gene annotations. The genome assembly is called “version 2.0” and the annotations are called “version 2.1.”

A search for “Sorghum 2.0 released” found this news page which contained text:

[07 Jun 2013]   v2.1 of Sorghum bicolor now available as early release.

This news item refers to the genome annotations, not the genome assembly, but the two were probably released at about the same time, so I decided to use this date. Why does it matter? It matters because including the month and year in the name of a genome release tells us which release is most recent. This is why IGB gives every assembly a name that tells it (and us) how old the genome is.

Step Three: Check out a copy of IGBQuickLoad.

We are using subversion (a version control system) to track and distribute files that are part of the IGBQuickLoad site.

To add the new data to IGBQuickLoad, I first have to check out a copy using svn:

svn co

I’ve configured my computer to provide the correct user name and password when I execute svn commands. Otherwise, I would need to supply my user name and password using the --username and --password options. For more information, run

svn co -h

Step Four: Add new directory for this new genome

Following IGB conventions, I added a new directory to the IGBQuickLoad data repository using svn:

svn mkdir S_bicolor_Jun_2013

By convention, we keep genome assembly data files and meta-data files in a directory named after the species and release date.

I also added new lines to the .htaccess and contents.txt files:


AddDescription "Sorghum bicolor v. 2.0" S_bicolor_Jun_2013

This ensures the directory listing on the Web site will report the version name and species.


S_bicolor_Jun_2013      Sorghum bicolor v2.0 (June 2013)

This ensures IGB will list the genome in the Current Genome tab. When IGB first accesses an IGBQuickLoad site, it fetches this “contents.txt” file and uses it to determine the names and locations of genome assemblies available on the site.

This contents.txt file is a tab-delimited file with two columns – the first column lists genome directories and the second column lists the human-friendly name of the genome, which is what IGB displays in the IGB window title bar when users select that genome. Note that two sites could sometimes contain different text in the second column for the same genome, but this won’t break IGB. IGB will just show the data from whichever file it read first. So far, this has not posed any problems.
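As a rough illustration of the two-column layout, here is a minimal sketch that looks up the friendly name for a genome directory. The function name friendly_name and the throwaway demo file are hypothetical, and this is not how IGB itself parses the file; it only shows that the tab separates the directory from the display name.

```shell
# Sketch: look up the human-friendly name for a genome directory in a
# QuickLoad-style contents.txt (tab-delimited: directory, then name).
# friendly_name is a hypothetical helper, not part of IGB.
friendly_name() {
    # $1 = path to contents.txt, $2 = genome directory name
    awk -F '\t' -v dir="$2" '$1 == dir { print $2; exit }' "$1"
}

# Demo with a throwaway file containing the line added above.
contents_demo=$(mktemp -d)
printf 'S_bicolor_Jun_2013\tSorghum bicolor v2.0 (June 2013)\n' > "$contents_demo/contents.txt"
friendly_name "$contents_demo/contents.txt" S_bicolor_Jun_2013
```

If the name contains spaces but the file was saved with spaces instead of a tab between the columns, the lookup prints nothing, which is a quick way to catch that mistake.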

Step Five. Make 2bit sequence file.

I used Jim Kent’s faToTwoBit program to convert the fasta sequence file to 2bit. This format is more compact than fasta and is also random access, which means IGB can use the file to retrieve small parts of the genome quickly when users click the Load Sequence button. By convention, we name the 2bit file after the genome version – this ensures IGB will be able to find it on the site.

faToTwoBit Sbicolor_255_v2.0.fa S_bicolor_Jun_2013.2bit

The fasta file was 702 MB (almost 1 GB!) and the 2bit file was only 173 MB. Much better!
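The size difference is about what the format predicts. A 2bit file packs each base into two bits, so four bases fit in one byte, while fasta spends a whole byte per base. The sketch below does that back-of-the-envelope estimate; the function name is hypothetical, and it ignores the 2bit header and the extra records for N and masked regions, as well as the newlines and sequence names in the fasta file.

```shell
# Sketch: estimate 2bit size from FASTA size, assuming four bases per byte.
# expected_2bit_mb is a hypothetical helper; it uses integer division.
expected_2bit_mb() {
    # $1 = FASTA size in MB
    echo $(( $1 / 4 ))
}

expected_2bit_mb 702   # prints 175
```

The estimate of 175 MB is close to the 173 MB observed; line breaks in the fasta file, which carry no sequence, likely account for part of the gap.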

This file must reside in the genome directory S_bicolor_Jun_2013. Because it is not very big, I used svn to add it to the repository.

Step Six. Make genome.txt file.

To populate the Current Sequence tab, IGB needs a “genome.txt” file that lists chromosomes and their sizes – this file, like the 2bit file, resides in the genome directory. To make this, I use twoBitInfo, another Jim Kent program:

twoBitInfo S_bicolor_Jun_2013.2bit genome.txt

IGB’s Current Sequence table will list the chromosomes in the same order they appear in this file, so I view the first few lines of the file to make sure the most useful (fully assembled) chromosomes appear first in the list:

Chr01   73727935
Chr02   77694824
Chr03   74408397
Chr04   67966759
Chr05   62243505
Chr06   62192017
Chr07   64263908
Chr08   55354556
Chr09   59454246
Chr10   61085274
super_10        8818317

This looks fine. But if it didn’t I might re-order the listing by sequence size:

sort -k2,2nr genome.txt > tmp
mv tmp genome.txt
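Before committing genome.txt, a quick sanity check can catch formatting problems. The following is a minimal sketch, assuming each line should have exactly two tab-separated fields with a whole-number size in the second; the function name genome_txt_summary is hypothetical.

```shell
# Sketch: check a genome.txt file and report sequence count and total bases.
# genome_txt_summary is a hypothetical helper, not an IGB or Kent tool.
genome_txt_summary() {
    # $1 = path to genome.txt; prints "<sequence count> <total bases>"
    awk -F '\t' '
        NF != 2 || $2 !~ /^[0-9]+$/ { bad++ }
        { total += $2 }
        END {
            if (bad > 0) { print "malformed lines: " bad; exit 1 }
            printf "%d %.0f\n", NR, total
        }' "$1"
}

# Demo using the first two chromosomes listed above.
genome_demo=$(mktemp -d)
printf 'Chr01\t73727935\nChr02\t77694824\n' > "$genome_demo/genome.txt"
genome_txt_summary "$genome_demo/genome.txt"   # prints 2 151422759
```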

Step Seven. Make the gene annotations file (BED-detail)

Here, I used a program available from a Loraine Lab git repository that converts GFF to BED12 format. I ran it with these options:

-g Sbicolor_255_v2.1.gene_exons.gff3.gz -b tmp.bed

I used a Unix one-liner to convert the BED12 file to BED-detail format, sort the file by sequence name and gene model start position, compress it using bgzip, and then save the result to a new file named for the genome version. The conversion program’s options and the rest of the pipeline were:

-g Sbicolor_255_v2.1.defline.txt -b tmp.bed | sort -k1,1 -k2,2n | bgzip > S_bicolor_Jun_2013.bed.gz

To share reference gene model annotations, we use BED-detail format. It’s compact and more convenient for data analysis than GFF3. It contains all the usual fields in a BED12 file, plus two extra fields. In IGB, our BED-detail files

  • report locus names in field 13, useful for grouping alternative transcripts arising from the same transcriptional unit
  • don’t use the itemRGB field (we don’t color-code gene models, but we might do that someday)
  • insert the start position in the thickStart and thickEnd fields when a gene model doesn’t encode a protein
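These conventions make simple summaries easy with standard Unix tools. The sketch below assumes 14 tab-separated fields per record, with the locus name in field 13 and the start position repeated in fields 7 and 8 (thickStart and thickEnd) for non-coding models. The function names and the three demo records are hypothetical, not real sorghum data.

```shell
# Sketch: summarize a BED-detail stream read from standard input.
# Both helpers are hypothetical and assume 14 tab-separated fields.
transcripts_per_locus() {
    # Count gene models sharing each locus name (field 13).
    awk -F '\t' 'NF == 14 { n[$13]++ }
                 END { for (locus in n) print locus "\t" n[locus] }' | sort
}

noncoding_models() {
    # List models whose thickStart and thickEnd both equal the start.
    awk -F '\t' 'NF == 14 && $7 == $2 && $8 == $2 { print $4 }'
}

# Made-up records: two transcripts of geneA, one non-coding geneB.
demo_bed_detail() {
    printf 'Chr01\t100\t200\tgeneA.1\t0\t+\t120\t180\t0\t1\t100,\t0,\tgeneA\tdemo\n'
    printf 'Chr01\t100\t250\tgeneA.2\t0\t+\t120\t230\t0\t1\t150,\t0,\tgeneA\tdemo\n'
    printf 'Chr01\t300\t400\tgeneB.1\t0\t-\t300\t300\t0\t1\t100,\t0,\tgeneB\tdemo\n'
}

demo_bed_detail | transcripts_per_locus
demo_bed_detail | noncoding_models
```

On a real file, the same functions can read the compressed data with gunzip -c S_bicolor_Jun_2013.bed.gz piped into either one.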

When we distribute annotation files, we also index them using tabix, which requires that the indexed file be sorted and block compressed using bgzip. (Google “tabix” to find a copy – it’s open source and associated with the samtools package.)

To index the file using tabix, I did this:

tabix -p bed S_bicolor_Jun_2013.bed.gz

This created a file named S_bicolor_Jun_2013.bed.gz.tbi (“tbi” for tabix index) which must reside in the same directory as its corresponding data file for IGB (or other programs) to use it. The index file enables IGB to quickly load the file or retrieve just parts of it via byte-range queries.
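Because indexing depends on the sort order, it can be worth confirming that order before running tabix. Here is a minimal sketch using sort’s check mode with the same keys as the sorting step; the function name is hypothetical and the demo records are made up.

```shell
# Sketch: succeed only if BED records on standard input are ordered by
# sequence name, then numerically by start position.
# is_position_sorted is a hypothetical helper.
is_position_sorted() {
    sort -c -k1,1 -k2,2n 2>/dev/null
}

printf 'Chr01\t50\t60\nChr01\t100\t200\nChr02\t10\t20\n' | is_position_sorted && echo "sorted"
printf 'Chr01\t100\t200\nChr01\t50\t60\n' | is_position_sorted || echo "not sorted"
```

For the real file, the check would read gunzip -c S_bicolor_Jun_2013.bed.gz on standard input, since bgzip output can be decompressed like an ordinary gzip file.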

Next, I viewed the first few lines of the file:

Chr01    24317    38116    Sobic.001G000400.2    0    -    24631    36898    19    590,255,172,228,134,84,291,333,107,126,166,104,161,901,91,291,85,83,387,    0,683,1377,1661,2230,8581,8772,9179,9608,9929,10147,10436,10651,10922,11953,12118,12497,13040,13412,    Sobic.001G000400    similar to PDR-like ABC transporter
Chr01    24317    42528    Sobic.001G000400.1    0    -    24631    42234    20    590,255,172,228,134,84,291,333,107,126,166,104,161,901,91,291,85,83,121,500,    0,683,1377,1661,2230,8581,8772,9179,9608,9929,10147,10436,10651,10922,11953,12118,12497,13040,17387,17711,    Sobic.001G000400    similar to PDR-like ABC transporter
Chr01    50023    50488    Sobic.001G000500.1    0    -    50037    50488    54,263,    0,202,    Sobic.001G000500    Not Available
Chr01    53429    60786    Sobic.001G000600.1    0    -    60555    60738    204,419,    0,6938,    Sobic.001G000600    Not Available

This looked good. Field 4 contained gene model name, field 13 contained locus name, and field 14 contained a description taken from the gene information file (Sbicolor_255_v2.1.defline.txt), but reported “Not Available” when nothing about that gene was listed.

How many gene models have no information?

gunzip -c S_bicolor_Jun_2013.bed.gz | grep -c "Not Available" 
gunzip -c S_bicolor_Jun_2013.bed.gz | wc -l

Out of 39,441 gene models, around 14,000 had no functional information. This isn’t great, but it’s not unusual for a newly annotated genome.
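That works out to roughly 35% of gene models (14,000 of 39,441). Note that grep counts any line containing the phrase, wherever it appears. A slightly stricter variant is sketched below, assuming the description sits in field 14 of a tab-separated record; the function name and demo records are hypothetical.

```shell
# Sketch: count records whose description (field 14) is exactly
# "Not Available" and report the percentage. Reads standard input.
# missing_description_pct is a hypothetical helper.
missing_description_pct() {
    awk -F '\t' '
        $14 == "Not Available" { missing++ }
        END {
            if (NR > 0)
                printf "%d of %d (%.1f%%)\n", missing, NR, 100 * missing / NR
        }'
}

# Made-up records: one of three lacks a description.
demo_descriptions() {
    printf 'Chr01\t0\t10\tg1.1\t0\t+\t0\t10\t0\t1\t10,\t0,\tg1\tNot Available\n'
    printf 'Chr01\t20\t30\tg2.1\t0\t+\t20\t30\t0\t1\t10,\t0,\tg2\tsome function\n'
    printf 'Chr01\t40\t50\tg3.1\t0\t+\t40\t50\t0\t1\t10,\t0,\tg3\tother function\n'
}

demo_descriptions | missing_description_pct   # prints 1 of 3 (33.3%)
```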

Step Eight. Make a HEADER.html file – documentation

Next, I created a HEADER.html file that documents the files and where they came from. When users visit the new genome directory on the web, they’ll see the HEADER file contents displayed along with a listing of the files.

Step Nine. Make an annots.xml file

To ensure that IGB will show the genome, I next needed to add an “annots.xml” meta-data file that reports the data sets that are available for the genome. Because I’m a little lazy, I use the “svn cp” (svn copy) command to make an annots.xml file using the file from a previous genome version.

svn cp ../S_bicolor_Jan_2009/annots.xml .

I used a text editor to update the data for this new genome:

  <file name="S_bicolor_Jun_2013.bed.gz" 
        description="Version 2.1 gene models from Phytozome, Nov. 2014."
        load_hint="Whole Sequence"

What all this means:

  • name – name of the file in the file system; IGB assumes it resides in the same directory as the annots.xml file, inside the genome directory S_bicolor_Jun_2013
  • title – name IGB gives this data set in the Available Data Sets section of the Data Access tab. This is also the name IGB displays in the track label once the data are loaded into a track.
  • description – displayed as a tooltip when users position the cursor over the “i” icon next to the data set under Available Data Sets.
  • url – if users click on the “i” icon, a Web browser will open to this location relative to the IGBQuickLoad root directory.
  • foreground – annotation color
  • name_size – font size for the track label
  • background – background color for the track
  • load_hint – how and when to load the data. “Whole Sequence” means: load all the data at once.
  • show2tracks – whether or not to divide the annotations into two tracks – one for minus strand genes and another for plus strand genes
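Since IGB expects each named file to sit beside annots.xml, a copied annots.xml can easily point at files that don’t exist yet. The following is a minimal sketch that lists such files. It assumes each file tag starts a line’s content as `<file name="...">` with name as the first attribute; the function name, the demo directory, and the demo XML (including its titles and the second file name) are hypothetical.

```shell
# Sketch: list files named in annots.xml that are absent from the
# genome directory. missing_annots_files is a hypothetical helper.
missing_annots_files() {
    # $1 = genome directory containing annots.xml
    sed -n 's/.*<file name="\([^"]*\)".*/\1/p' "$1/annots.xml" |
    while IFS= read -r f; do
        [ -e "$1/$f" ] || echo "$f"
    done
}

# Demo: a made-up annots.xml naming two files, only one of which exists.
annots_demo=$(mktemp -d)
cat > "$annots_demo/annots.xml" <<'EOF'
<files>
  <file name="S_bicolor_Jun_2013.bed.gz" title="hypothetical title" />
  <file name="not_yet_added.bed.gz" title="hypothetical title" />
</files>
EOF
: > "$annots_demo/S_bicolor_Jun_2013.bed.gz"
missing_annots_files "$annots_demo"   # prints not_yet_added.bed.gz
```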

Step Ten. See how it looks in IGB.

For testing and development, I’ve been adding and updating the files in a local copy of the IGBQuickLoad repository that I’ve checked out onto my computer. To see how the new files will look in IGB, I use the Data Sources tab in the Preferences window to add the local QuickLoad site to IGB.

Then, I used the Current Sequence tab to select species Sorghum bicolor and genome version S_bicolor_Jun_2013. I also ran a quick keyword search (Advanced Search tab) to make sure the “details” part of the BED-detail file is read correctly into IGB.

Everything looks good!


Step Eleven. Check it into the repository and update the public IGB QuickLoad sites.

I then used “svn add” and “svn commit” to add the new files to the repository. Then I logged into the main IGBQuickLoad site and the IGBQuickLoad backup site and ran “svn up” to add new files and update old ones.

This entire process, including writing this post and coffee breaks, took about two hours of my time.










Using Table Browser at UCSC to get a data set for IGB

The UCSC Genome Bioinformatics site offers a wealth of data from many genomes, ranging from tiny (but deadly) genomes like Ebola virus to much larger genomes like our own human genome. Indeed, the reference genome sequence and gene model data sets for many of the genomes the IGB team hosts on the data repository site are originally from UCSC. Other projects – like Galaxy – also rely on UCSC for core data sets. If you have used Galaxy, you may have noticed that Galaxy has built-in genomes for many animal species; these data were imported from the UCSC ecosystem.

When you visit a UCSC-supported genome in IGB, you’ll see a folder named “UCSC (DAS)” in the Available Data Sets section of the Data Access Panel. If you open the folder, you’ll see many data sets with seemingly cryptic titles – like “nestedRepeats” and “altLocations.” These names correspond to tables in the UCSC database. If you select these data sets and click Load Data in IGB, these data will flow from the UCSC system’s Distributed Annotation Server into IGB.

However, for technical reasons, the UCSC DAS site can only support some of their data sets. To view UCSC data that are not supported in their DAS site, you can use the UCSC Table Browser to download them onto your computer and open them in IGB.

In this post, I’ll explain how you can use the UCSC Table Browser to obtain and then open data sets in Integrated Genome Browser. I’ll also explain how you can use tabix and bgzip to compress and index files that are too large to load into IGB all at once.

Part I: Getting data from the UCSC Web site

In this example, I’ll show you how to get human variation data (SNPs) from the UCSC Web site.

  1. Go to the UCSC Genome Bioinformatics site
  2. Click Tables (top of page)

Configure the browser to access the latest (as of this writing) human genome assembly:

  1. clade: Mammal
  2. genome: Human
  3. assembly: Dec 2013 (GRCh38/hg38) (the latest)
  4. group: Variation
  5. track: Common SNPs(141)
  6. table: snp141Common
  7. region: genome
  8. output format: BED – browser extensible data
  9. file type returned: gzip compressed

Enter a name for the output file – e.g., snp141Common.hg38.bed.gz. It should end with “.gz” to ensure your computer will recognize it as a gzip-compressed file.

Tip: Click the button Describe Table Schema to see what data are in the table.
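Once the download finishes, it can help to get a feel for the file before opening it in IGB. Here is a minimal sketch that counts records per sequence in BED data read from standard input. It assumes tab-separated fields and skips any track, browser, or comment lines that may precede the data; the function name and the demo records (including the rs IDs) are hypothetical.

```shell
# Sketch: count BED records per chromosome from standard input.
# records_per_chrom is a hypothetical helper.
records_per_chrom() {
    awk -F '\t' '
        /^track/ || /^browser/ || /^#/ { next }
        { n[$1]++ }
        END { for (c in n) print c "\t" n[c] }' | sort
}

# Made-up records standing in for the downloaded data.
printf 'chr1\t10\t11\trs1\nchr1\t20\t21\trs2\nchr2\t5\t6\trs3\n' | records_per_chrom
```

With the real download, the input would come from gunzip -c snp141Common.hg38.bed.gz.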


to be continued