- NCBI gene IDs for 100 nifH genes were gathered in nifh_accessions.csv.
- gene_IDs.R isolated the gene IDs in integer form.
- Gene IDs were copied and pasted into the loop object of gene_id_parser.sh
- gene_id_parser.sh retrieved the gene sequence for each gene ID and wrote them to a FASTA file called nifh_sequences1.fasta. For the pipeline, see sequence_generator.sh.
- msa.R took nifh_sequences1.fasta as input and produced a multiple sequence alignment, out.fasta.
- The MView program was used to produce the visualizations in figure1.html and figure2.html.
- Take out.fasta and filter out sequences with >60% identity to create DMinput.fasta.
- Use DMinput.fasta to calculate the distance matrix and construct a phylogenetic tree using distance_matrix_and_tree.r.
- Running distance_matrix_and_tree.r outputs a folder with four PDFs: DistMat(indel).pdf, DistMat(JC69&K80&K81&TN93).pdf, and DistMat(TS&TV).pdf, which show the distance matrix, and Phylogeny.pdf, which shows the phylogenetic tree.
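
The identity filter that turns out.fasta into DMinput.fasta is not tied to a script in the list above. Below is a minimal sketch of one way it could be done in shell with awk. It assumes the input is an aligned FASTA file, that identity is the fraction of matching alignment columns (columns where both sequences have a gap are skipped, and a gap against a residue counts as a mismatch), and that a sequence is dropped when its identity to any already-kept sequence exceeds the threshold. The function name `filter_by_identity` and these rules are illustrative assumptions, not necessarily the method used for this project.

```shell
#!/bin/sh
# Sketch only: greedy identity filter over an aligned FASTA file.
# Assumption: a sequence is removed if it is more than THRESHOLD identical
# to any sequence already kept; the first sequence is always kept.

filter_by_identity() {
    # $1 = aligned FASTA input, $2 = identity threshold as a fraction (e.g. 0.60)
    awk -v thr="$2" '
        # Fraction of identical columns; columns with a gap in both
        # sequences are ignored, a gap versus a residue is a mismatch.
        function identity(a, b,    i, n, same, x, y) {
            n = 0; same = 0
            for (i = 1; i <= length(a); i++) {
                x = substr(a, i, 1); y = substr(b, i, 1)
                if (x == "-" && y == "-") continue
                n++
                if (x == y) same++
            }
            return n ? same / n : 0
        }
        # Header line: start a new record.
        /^>/ { name[++count] = $0; next }
        # Sequence line: append to the current record (handles wrapped FASTA).
        { seq[count] = seq[count] toupper($0) }
        END {
            kept = 0
            for (i = 1; i <= count; i++) {
                drop = 0
                for (k = 1; k <= kept; k++)
                    if (identity(seq[i], seq[keep[k]]) > thr) { drop = 1; break }
                if (!drop) { keep[++kept] = i; print name[i]; print seq[i] }
            }
        }
    ' "$1"
}

# Example call (file names taken from the steps above):
# filter_by_identity out.fasta 0.60 > DMinput.fasta
```

Because the filter is greedy, the result depends on the order of sequences in the input: an earlier sequence is always preferred over a later one that is too similar to it.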