[Jiaxue Gene Testing] Starting-Data Standard for RNA Sequencing Analysis

Guide to RNA sequencing data analysis:

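To promote gene information technology, every data analysis and operating standard that Jiaxue Gene presents here can be carried out with shared or open-source programs. Most open-source programs for RNA sequencing analysis are available from Bioconductor, and together they support an end-to-end, gene-level differential expression analysis of RNA sequencing data. Jiaxue Gene starts from FASTQ files, shows how these files are aligned to the human reference genome, and generates a count matrix. This matrix records, for each sample, the number of sequenced reads or fragments assigned to each gene. Jiaxue Gene then guides readers through exploratory data analysis (EDA) to assess data quality and explore the relationships between samples, performs differential gene expression analysis, and produces figures suitable for publication in high-impact journals.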
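To make the idea of a count matrix concrete, here is a minimal Python sketch. It assumes that alignment and gene assignment have already been done and that each sample is represented simply as a list of gene IDs, one per counted fragment. The function name `build_count_matrix`, the sample names, and the gene IDs are all hypothetical; this is not the Bioconductor workflow itself, only an illustration of the resulting data structure.

```python
from collections import Counter


def build_count_matrix(assignments):
    """Build a genes x samples count matrix.

    assignments: dict mapping sample name -> iterable of gene IDs,
    one entry per fragment already assigned to a gene (toy input format).
    Returns (genes, samples, matrix) where matrix[i][j] is the count of
    genes[i] in samples[j].
    """
    per_sample = {sample: Counter(genes) for sample, genes in assignments.items()}
    genes = sorted({g for counts in per_sample.values() for g in counts})
    samples = sorted(per_sample)
    # Counter returns 0 for genes that were never observed in a sample.
    matrix = [[per_sample[s][g] for s in samples] for g in genes]
    return genes, samples, matrix


# Toy data, invented purely for illustration.
toy = {
    "sample_A": ["GENE1", "GENE1", "GENE2"],
    "sample_B": ["GENE2", "GENE3", "GENE3", "GENE3"],
}
genes, samples, matrix = build_count_matrix(toy)
for gene, row in zip(genes, matrix):
    print(gene, *row, sep="\t")
```

Rows are genes and columns are samples, which is the orientation commonly expected by downstream differential expression tools.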

Introduction to open-source software for RNA sequencing data analysis

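Jiaxue Gene is a member of the international open-source software alliance. Bioconductor, one of its member software repositories, offers many packages that support the analysis of high-throughput sequence data, including RNA sequencing (RNA-seq). The packages Jiaxue Gene uses in its demonstrations include core packages maintained by the Bioconductor core team, which import and process raw sequencing data and annotate RNA sequencing data with gene information. Some of these packages also perform statistical analysis and generate plots of sequence data. Bioconductor is released on a schedule, every 6 months, which ensures that all packages in the project work together consistently. The packages used in this workflow are loaded with the library function and can be installed by following the Bioconductor package installation instructions.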

Starting data for RNA sequencing data analysis

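The data used in this workflow come from a real experiment. In the experiment, airway smooth muscle (ASM) cells were treated with dexamethasone, a synthetic glucocorticoid steroid with anti-inflammatory effects. In clinical practice, patients with asthma use glucocorticoids to reduce airway inflammation. Four primary human ASM cell lines were treated with 1 µM dexamethasone for 18 hours. For each of the four cell lines there is one treated sample and one untreated control sample. The primary ASM cells were isolated from four aborted lung transplant donors with no chronic illness. ASM cells from passages 4 to 7, maintained in Ham's F12 medium supplemented with 10% FBS, were used in all experiments. For the RNA-Seq and qRT-PCR validation experiments, cells from each donor were treated with 1 µM DEX or control vehicle for 18 hours.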
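The design described above (four cell lines, each with one treated and one control sample) can be written down as a small sample table. The Python sketch below is only an illustration: the cell-line numbering 1 to 4 and the `Treatment.N` naming pattern are assumptions, not the identifiers used in the original data set.

```python
from itertools import product

CELL_LINES = [1, 2, 3, 4]        # four donor-derived cell lines (numbering is illustrative)
TREATMENTS = ["Control", "Dex"]  # control vehicle vs. 1 µM dexamethasone for 18 hours


def build_sample_table():
    """Return one row per sample: every cell line crossed with every treatment."""
    return [
        {"sample": f"{trt}.{line}", "cell_line": line, "treatment": trt}
        for line, trt in product(CELL_LINES, TREATMENTS)
    ]


sample_table = build_sample_table()
for row in sample_table:
    print(row["sample"], row["cell_line"], row["treatment"], sep="\t")
```

Because every cell line contributes both a treated and a control sample, the cell line can later be included as a blocking factor so that the treatment effect is estimated within each donor.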

Preliminary processing of raw reads was performed using Casava 1.8 (Illumina, Inc., San Diego, CA). Subsequently, Taffeta scripts (https://github.com/blancahimes/taffeta) were used to analyze RNA-Seq data, which included use of FastQC [54] (v.0.10.0) to obtain overall QC metrics. Because of sequence bias in the initial 12 bases on the 5′ end of reads, the first 12 bases of all reads were trimmed with the FASTX Toolkit (v.0.0.13) [55]. FastQC reports for each sample revealed that each was successfully sequenced. Trimmed reads for each sample were aligned to the reference hg19 genome and known ERCC transcripts using TopHat [56] (v.2.0.4), while constraining mapped reads to be within reference hg19 or ERCC transcripts. Additional QC parameters were obtained to assess whether reads were appropriately mapped. Bamtools [57] was used to count the number of mapped reads, including junction spanning reads. The Picard Tools (http://picard.sourceforge.net) RnaSeqMetrics function was used to compute the number of bases assigned to various classes of RNA, according to the hg19 refFlat file available as a UCSC Genome Table. For each sample, Cufflinks [21] (v.2.0.2) was used to quantify ERCC Spike-In and hg19 transcripts based on reads that mapped to the provided hg19 and ERCC reference files. For three samples that contained ERCC Spike-Ins, we created dose response curves (i.e. plots of ERCC transcript FPKM vs. ERCC transcript molecules) following the manufacturer's protocol [58]. Ideally, the slope and R² would equal 1.0. For our samples (Dex.2, Control.4, Dex.4), the slope (R²) values were 0.90 (0.90), 0.92 (0.84), 0.82 (0.86), respectively. Raw read plots were created by displaying bigwig files for each sample in the UCSC Genome Browser.
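The 5′ trimming step mentioned above can be illustrated with a short Python sketch. It is a plain-Python stand-in for the idea, not the FASTX Toolkit itself; the function name `trim_fastq`, the choice to drop reads that would become empty, and the toy read are assumptions made for the example. It expects uncompressed FASTQ with four lines per record.

```python
import io


def trim_fastq(in_handle, out_handle, n_trim=12):
    """Remove the first n_trim bases and quality scores from each FASTQ record.

    Reads that are not longer than n_trim are dropped (an assumption of this sketch).
    Returns the number of records written.
    """
    kept = 0
    while True:
        header = in_handle.readline()
        if not header:
            break  # end of file
        seq = in_handle.readline().rstrip("\n")
        plus = in_handle.readline().rstrip("\n")
        qual = in_handle.readline().rstrip("\n")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ: " + header.strip())
        if len(seq) <= n_trim:
            continue  # nothing would remain after trimming
        header = header.rstrip("\n")
        out_handle.write(f"{header}\n{seq[n_trim:]}\n{plus}\n{qual[n_trim:]}\n")
        kept += 1
    return kept


# Toy record invented for illustration: 12 biased bases followed by 6 kept bases.
toy_read = "@read1\n" + "ACGT" * 3 + "TTGGCC" + "\n+\n" + "I" * 12 + "F" * 6 + "\n"
trimmed = io.StringIO()
n_kept = trim_fastq(io.StringIO(toy_read), trimmed)
print(trimmed.getvalue(), end="")
```

The sequence and quality strings are trimmed together so that each remaining base keeps its own quality score.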

Differential expression of genes and transcripts in samples treated with DEX vs. untreated samples was obtained using Cuffdiff [21] (v.2.0.2) with the quantified transcripts computed by Cufflinks (v.2.0.2), while applying bias correction. The CummeRbund [59] R package (v.0.1.3) was used to measure significance of differentially expressed genes and create plots of the results. As a positive control of gene expression, the FPKM values for four housekeeping genes (i.e., B2M, GABARAP, GAPDH, RPL19) were obtained. Each had high FPKM values that did not differ significantly by treatment status [Figure S11]. The NIH Database for Annotation, Visualization and Integrated Discovery (DAVID) was used to perform gene functional annotation clustering using Homo sapiens as background, and default options and annotation categories (Disease: OMIM_DISEASE; Functional Categories: COG_ONTOLOGY, SP_PIR_KEYWORDS, UP_SEQ_FEATURE; Gene_Ontology: GOTERM_BP_FAT, GOTERM_CC_FAT, GOTERM_MF_FAT; Pathway: BBID, BIOCARTA, KEGG_PATHWAY; Protein_Domains: INTERPRO, PIR_SUPERFAMILY, SMART) [28]. The RNA-Seq data are available at the Gene Expression Omnibus Web site (http://www.ncbi.nlm.nih.gov/geo/) under accession GSE52778.
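For readers unfamiliar with the FPKM unit used above, the following Python sketch shows the basic definition: fragments per kilobase of transcript per million mapped fragments. It is only the textbook formula; Cufflinks estimates abundances with its own statistical model, so its values will generally not match this simple calculation. The function name and the numbers in the example are invented for illustration.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Basic FPKM: fragments per kilobase of transcript per million mapped fragments.

    fragments: fragments assigned to the gene or transcript
    length_bp: gene or transcript length in base pairs
    total_mapped_fragments: all mapped fragments in the sample
    """
    if length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    # Divide by length in kb (length_bp / 1e3) and by depth in millions
    # (total / 1e6), which together give the factor of 1e9.
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


# Hypothetical numbers: 500 fragments on a 2,000 bp transcript
# in a sample with 10 million mapped fragments.
example = fpkm(500, 2000, 10_000_000)
print(example)  # 25.0
```

Normalizing by both length and sequencing depth is what makes FPKM values of housekeeping genes comparable across samples with different numbers of reads.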

(Editor in charge: Jiaxue Gene)